First Four Validations: DONE

By now we validate 4 things about the predicted genes: the length (by clusterization and ranking - the rank of the prediction among all the hits), the reading frame, whether there is a duplication in the prediction and whether the prediction is in fact a merge between multiple genes.

We adjusted our previous approach for merge detection in order to have fewer false positives. I'll briefly explain how we do the validation now:
- we plot a 2D graph by using the start/end offsets of the matched regions in the prediction on the two axis
- we draw a line obtained by linear regression 

- a gene merge is present if the slope of the line is between 0.4 and 1.2
- these thresholds were chosen empirically and will be adjusted after analyzing a larger amount of data.

Here it is a simplified drawing explaining this:


The output can be provided in html format (if you use the proper command line arguments). Here you can find sume examples, grouped by validation test:
Length validation:
Gene merge validation:
Duplication check: 

If I convinced you, try the application yourself:
Update: please clone from the developmenet branch
git clone
ruby prediction.rb -t protein --skip_blast --outfmt html gene_merge_validation/duplications/duplications.xml

Problems with predicted genes - gene merges

Apart from deciding whether the length of the prediction is similar to the majority of the lengths of the BLAST hits, there are many other validation tests that have to be applied on the predicted data. This week we have been focusing on reading frame inconsistency, detecting gene merges and duplications.
  • Gene duplications - can occur due to an error in the replication mechanism of the cell, or simply  due to a sequencing error, which we have to report. In these situations BLAST finds more high-scoring segment pairs (hsp) from a single hit. Here's an example for PB16966 gene (of Yannick's knowledge), where a sequence of 6 exons is repeated. BLAST found the following:    <Hit_num>1</Hit_num>


    Notice that in such cases hsps from the same hit do overlap and the matched regions in the predictions do not.                                                                                                          
  • Reading frame inconsistency - this case can be retrieved from the BLAST output: for a sequence of nucleotides, we use blastx to find the most similar proteins. The query sequence is translated in six reading frames, and all the resulting six-protein sequences are compared (there's no evidence where the codon starts from, so all the 3 possibilities of starting the gene translation are considered in BLAST; also another 3 possibilities when reading the sequence in reverse). There can be a problem in the predicted sequence when the hits have various reading frames, of the same sign (sequences are read in the same direction). The sequencing errors leading to this type of errors are usually reading frame shifts. For example a nucleotide was accidentally introduced at a certain point of the mRNA extraction, causing a reading frame inconsistency between the left and the right part pf the predicted gene, but still read in the same direction. 
  •  When reading frame is consistent for all the hits, we consider 3 cases:  

    • One direction gene merge (----> ----------->)   - same sign reading frames for all hits
    • Opposite direction gene merge (----><----------)   - two opposite sign reading frames for the hits
    • otherwise we have no merge

    Our approach  for identifying gene merges is to analyze the distribution of the middles of each region in the predicted gene that matches a hit gene from BLAST. You can observe in the plots below that in case of gene merge the data have a multimodal distribution, otherwise the distribution is unimodal. We calculate a score that quantifies whether the assumption of unimodality is more appropriate than the one for multimodality using the approach described in [3] for model based clustering. More plots here [2]. 

    1) Example for merged genes (bimodal distribution)

        2) No merge (unimodal distribution)

Databases for predicted genes

Just performing the length validation it becomes obvious whether data from a certain database have been curated or not.

The databases and species on which we perform our tests are:

Most of the currated data from Uniprot database (yellow starred) passes the length validation test. All the others remain our challenge.
Other databases I did't touch yet: for all kind of species [1] and plants [2].