Length validation

So far, our application looks like this:

In other words, given a FASTA file containing a number of sequences of the same type (mRNA or protein), the program calls BLAST to retrieve, for each predicted gene, a set of reference sequences (those most similar to the predicted gene). For the moment, of all the information BLAST provides, we are interested only in the lengths of the reference and predicted sequences, which is what we need to start validating the length of the predicted sequence.
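
As a rough illustration of the "lengths only" idea, here is a minimal sketch that reads FASTA-formatted text and keeps just the length of each sequence. The method name `fasta_lengths` is hypothetical and is not taken from the project's code; it also assumes the identifier is the first word of each header line.

```ruby
# Hypothetical helper (not from the project's code base).
# Reads FASTA-formatted text and returns a hash that maps each
# sequence identifier to the length of its sequence.
def fasta_lengths(text)
  lengths = {}
  current_id = nil

  text.each_line do |line|
    line = line.strip
    next if line.empty?

    if line.start_with?('>')
      # Header line: assume the identifier is the first word after '>'.
      current_id = line[1..-1].split.first.to_s
      lengths[current_id] = 0
    elsif current_id
      # Sequence line: a sequence may span several lines, so accumulate.
      lengths[current_id] += line.length
    end
  end

  lengths
end

example = ">seq1 some description\nMKV\nLA\n>seq2\nMK\n"
puts fasta_lengths(example).inspect
```

The same helper could be applied both to the predicted sequences and to the reference sequences returned for each of them.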

As we observed, the length distribution does not fit a bell curve. To find the majority lengths among the reference lengths, we use a typical hierarchical clustering. Initially, each length forms its own cluster. At each step we merge the two closest clusters, and we stop as soon as one cluster contains more than 50% of the reference sequences.
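
The merging procedure can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions, not the project's actual implementation: the method names are hypothetical, and the distance between two clusters is assumed to be the absolute difference of their mean lengths, since the linkage rule is not specified here.

```ruby
# Hypothetical sketch of the clustering of reference lengths.
# Assumption: the distance between two clusters is the absolute
# difference of their mean lengths.

def mean(values)
  values.sum.to_f / values.size
end

# Returns the first cluster (an array of lengths) that contains more
# than `threshold` of all lengths. `threshold` must be below 1.
def majority_cluster(lengths, threshold = 0.5)
  raise ArgumentError, 'lengths must not be empty' if lengths.empty?

  # Start with one cluster per length. Sorting keeps every cluster a
  # contiguous run, so the two closest clusters are always neighbours.
  clusters = lengths.sort.map { |len| [len] }

  loop do
    biggest = clusters.max_by(&:size)
    return biggest if biggest.size > threshold * lengths.size

    # Index of the neighbouring pair whose means are closest.
    i = (0...(clusters.size - 1)).min_by do |k|
      (mean(clusters[k + 1]) - mean(clusters[k])).abs
    end

    # Replace the two clusters with their union.
    clusters[i, 2] = [clusters[i] + clusters[i + 1]]
  end
end

reference_lengths = [300, 100, 900, 105, 110, 310, 102]
puts majority_cluster(reference_lengths).inspect
# prints [100, 102, 105, 110]
```

With seven reference lengths, the loop stops once a cluster holds at least four of them, which here happens after three merges.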

The result of the clustering can be seen in the histogram (the densest cluster is shown in red). The length of the predicted sequence is marked by a black vertical line.

Here are some outputs computed for predicted protein sequences from the ant Solenopsis invicta [1]:

1) ACCEPTED predicted sequence

2) ACCEPTED predicted sequence

3) UNACCEPTED predicted sequence

4) UNACCEPTED predicted sequence

You can try the application yourself by cloning the code from GitHub [2] and meeting the requirements (some Ruby gems must be installed and some paths must be added to CLASSPATH; see the README). More histograms can be found here [3].

The next step is to add a confidence percentage to each length validation test.

Any feedback from you is welcome and valuable!


[1] http://www.antgenomes.org/downloads/
[2] https://github.com/monicadragan/gene_prediction
[3] https://github.com/monicadragan/gene_prediction/tree/master/test_dataset/plots




Welcome to the Gene Prediction Project's Blog!

People involved in this project:
GSoC Student: Monica Dragan
Mentors: Anurag Priyam and Yannick Wurm


Genome sequencing is now possible at almost no cost. However, obtaining accurate gene predictions remains hard to achieve with the existing biotechnology. The goal of this project is to create a tool that identifies potential problems with predicted genes, in order to provide evidence about how gene curation should be carried out or whether a certain predicted gene should be left out of other analyses. The prediction validation could also be used to improve the results of existing gene prediction tools.
The application takes as input a collection of mRNA or protein predictions (called predicted sequences) and identifies potential problems with each sequence by matching and comparing it with sequences available in trusted databases (called reference sequences). The tool will check each predicted sequence for the following errors:
  • the predicted sequence does not have an acceptable length, compared with the reference sequence set.
  • gaps or extra sections occur in the predicted sequence, compared with the reference sequence set.
  • some of the regions conserved among the reference sequences are absent from the predicted sequence.
The main target users of this tool are biologists who want to validate the data obtained in their own laboratories. In the end, the application will be easily installable as a RubyGem.