Genome sequencing is now possible at very low cost. However, obtaining accurate gene predictions remains hard to achieve with the existing biotechnology.

We designed a tool that helps identify problems with gene predictions, based on similarities with data from public databases (such as Swiss-Prot and UniProt). We apply a set of validation tests that provide useful information about the problems that appear in the predictions. This information gives evidence for how a gene should be curated, or for whether a predicted gene should be excluded from further analyses.

Our main target users are biologists who have large amounts of genetic data that must be manually curated. Our tool prioritizes the genes that need to be checked and suggests possible error causes.

B. Validations

To highlight the problems that appear in the predictions and suggest possible error causes, we developed 7 validation tests, with an eighth under development:

1) Length validation via clusterization

2) Length validation via ranking

3) Duplication

4) Gene merge

5) Multiple alignment

6) BLAST reading frame - applicable only for nucleotide sequences

7) Open reading frame - applicable only for nucleotide sequences

8) Codon usage -- under development

Data needed for each validation is retrieved from the BLAST output and used as follows (aliases and tags are those used in BLAST):

BLAST parameters (table columns): sseqid, sacc, slen, qstart, qend, sstart, send, length, qseq, sseq, qframe, evalue, query raw_seq, hit raw_seq

Validations (table rows): lenc, lenr, frame, merge, dup, orf, align, codon

Legend: x = mandatory parameter; ~~ = may be needed; * = nice to have
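As an illustration of how these parameters could be read, here is a minimal sketch. It assumes BLAST was run with tabular output and custom columns (for example -outfmt "6 sseqid sacc slen qstart qend sstart send length qseq sseq qframe evalue"). The method name parse_blast_line and the constants are hypothetical and are not part of the tool.

```ruby
# Assumed column order; it must match the -outfmt string passed to BLAST.
BLAST_COLUMNS = %w[sseqid sacc slen qstart qend sstart send length
                   qseq sseq qframe evalue].freeze
# Columns that hold whole numbers.
INTEGER_COLUMNS = %w[slen qstart qend sstart send length qframe].freeze

# Turn one tab-separated BLAST line into a hash keyed by column name.
def parse_blast_line(line, columns = BLAST_COLUMNS)
  values = line.chomp.split("\t")
  unless values.size == columns.size
    raise ArgumentError, "expected #{columns.size} fields, got #{values.size}"
  end

  pairs = columns.zip(values).map do |name, value|
    converted = if INTEGER_COLUMNS.include?(name)
                  Integer(value, 10)
                elsif name == 'evalue'
                  Float(value)
                else
                  value
                end
    [name.to_sym, converted]
  end
  pairs.to_h
end

# Example with a made-up hit line (values are illustrative only).
sample = "sp|P12345|ABC_HUMAN\tP12345\t350\t1\t300\t5\t304\t300\tMKV\tMKV\t1\t1e-50"
hit = parse_blast_line(sample)
puts hit[:slen] # => 350
```

Keeping the column list in one constant means the parser and the BLAST command line can be checked against each other in a single place.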

Each validation is described further in this section.

1) Length Validation via clusterization

Error causes

sequencing error: some parts of the gene were lost or added during sequencing

gene prediction error: after “reading” a genome sequence with next-generation sequencing techniques, the start or end of the gene was not estimated correctly
the gene had a low expression level
the sequenced mRNA incorrectly contains some introns

Input data

lengths of the hits: retrieved by parsing the BLAST output file
length of the prediction: the length of the current query from the FASTA file. For nucleotide sequences, we are interested in the length of the corresponding protein translation of the query (the length of the query divided by 3, plus or minus 1 or 2 residues depending on the reading frame; this detail is not taken into consideration here)
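The length conversion described above can be sketched as follows. The method name protein_length and the :protein / :nucleotide symbols are hypothetical. As in the text, the reading-frame offset is ignored.

```ruby
# Approximate protein length of a query.
# Nucleotide lengths are divided by 3 (integer division);
# the reading-frame offset is deliberately ignored.
def protein_length(sequence_length, type)
  case type
  when :protein    then sequence_length
  when :nucleotide then sequence_length / 3
  else raise ArgumentError, "unknown sequence type: #{type.inspect}"
  end
end

puts protein_length(900, :nucleotide) # => 300
```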

Class information:

short_header: LengthCluster
header: Length Cluster
short description: Check whether the prediction length fits most of the BLAST hit lengths, by 1D hierarchical clusterization. Meaning of the output displayed: Prediction_len [Main Cluster Length Interval]
cli_name: lenc

Workflow

Aim: we want to find out whether the length of the predicted sequence belongs to the distribution of the hit lengths (in other words, how close the length of the prediction is to the majority of the hit lengths).

Plotting the histogram of the hit length distribution shows that the distribution does not fit a bell curve, so we cannot apply the classical t-test.

(Figure 1)

Our approach to finding the majority of lengths among the reference lengths is a typical hierarchical clusterization:

First, we assume that each length belongs to a separate cluster. At each step we merge the two closest clusters, until we obtain a cluster that contains more than 50% of the reference sequences. Each cluster is shown in a different color in Figure 2. The one colored in red is called the “main cluster”.

(Figure 2)
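The merging procedure can be sketched as below. This is a minimal illustration, not the tool's HierarchicalClusterization class. It assumes that the distance between two clusters is the difference between their mean lengths, which the text does not specify. The method name main_length_cluster is hypothetical.

```ruby
# Minimal 1D agglomerative clustering sketch.
# Assumption: cluster distance = difference between cluster means.
def main_length_cluster(lengths)
  raise ArgumentError, 'no lengths given' if lengths.empty?

  # Start with one cluster per length, sorted so that the closest
  # clusters are always neighbours in the array.
  clusters = lengths.sort.map { |len| [len] }

  # Merge the two closest neighbouring clusters until one cluster
  # holds more than 50% of the lengths.
  until clusters.any? { |c| c.size > lengths.size / 2.0 }
    means = clusters.map { |c| c.sum.to_f / c.size }
    gaps = means.each_cons(2).map { |a, b| b - a }
    i = gaps.index(gaps.min)
    clusters[i, 2] = [clusters[i] + clusters[i + 1]]
  end

  clusters.max_by(&:size)
end

# Illustrative hit lengths (made up for this example).
main = main_length_cluster([300, 100, 500, 105, 110, 102, 310])
p main        # => [100, 102, 105, 110]
p main.minmax # => [100, 110]
```

Because the lengths are one-dimensional and kept sorted, only neighbouring clusters need to be compared at each step.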

Finally, we check whether the length of the prediction belongs to the main cluster. The validation test passes if the length of the prediction belongs to the main cluster of lengths (see Figure 3) and fails otherwise (Figure 4).

(Figures 3 and 4)
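The pass/fail decision can be sketched as a simple interval check. This assumes the main cluster limits are available as a two-element [min, max] array and that both bounds are inclusive. Neither detail is stated in the text, and the method name length_cluster_pass? is hypothetical.

```ruby
# True when the prediction length lies inside the main cluster interval.
# Assumption: limits is [min, max] and both bounds are inclusive.
def length_cluster_pass?(prediction_len, limits)
  low, high = limits
  prediction_len.between?(low, high)
end

puts length_cluster_pass?(104, [100, 110]) # => true
puts length_cluster_pass?(250, [100, 110]) # => false
```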

Code

# clusterization by length
contents = lst.map { |x| x.length_protein.to_i }.sort

hc = HierarchicalClusterization.new(contents)
clusters = hc.hierarchical_clusterization

# find the index of the cluster with the highest density
max_density = 0
max_density_cluster_idx = 0
clusters.each_with_index do |item, i|
  if item.density > max_density
    max_density = item.density
    max_density_cluster_idx = i
  end
end

clusterization = [clusters, max_density_cluster_idx]
@clusters = clusterization[0]
@max_density_cluster = clusterization[1]

# length interval of the densest cluster, compared with the prediction length
limits = @clusters[@max_density_cluster].get_limits
prediction_len = @prediction.length_protein

@validation_report = LengthClusterValidationOutput.new(prediction_len, limits)

# attach both plots to the validation report
plot1 = plot_histo_clusters
@validation_report.plot_files.push(plot1)
plot2 = plot_len_clusters
@validation_report.plot_files.push(plot2)

Plots

length distribution histogram (prediction passes the validation test in Figure 3 and fails in Figure 4)

(Figures 3 and 4)

line plot representing the regions of the hits that are matched by the prediction query