[GSoC 2013] Identifying problems with gene predictions: Half of GSoC

It's been a while since my last post on the blog: a one-week holiday, followed by a period of code refactoring and bug fixing, and finally the GSoC midterm evaluations. Meanwhile, the application took shape and became more robust, and half of the validations have been implemented. Currently, our application analyzes the predicted genes and provides useful information based on their similarities to genes in public databases (e.g. length validation, gene merge and sequence duplication checking, reading frame and main ORF validation). An output example is available at [1].
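
To give a feel for what a length validation can look like, here is a minimal Ruby sketch. It is not the tool's actual implementation: the method name `length_plausible?`, the quantile cut-offs and the sample hit lengths are all assumptions made for illustration. The idea is simply to flag a prediction whose length falls outside the central range of the lengths of its database hits.

```ruby
# Hypothetical helper, for illustration only (not the gem's real API).
# Returns true when the predicted length lies within the chosen quantile
# range of the hit lengths, false when it lies outside, and nil when
# there are no hits to compare against.
def length_plausible?(predicted_length, hit_lengths, lower_q: 0.05, upper_q: 0.95)
  return nil if hit_lengths.empty? # no evidence, no verdict

  sorted = hit_lengths.sort
  # Nearest-rank style quantiles on the sorted lengths.
  lower = sorted[(lower_q * (sorted.size - 1)).round]
  upper = sorted[(upper_q * (sorted.size - 1)).round]
  predicted_length.between?(lower, upper)
end

# Made-up hit lengths for a single predicted gene.
hits = [310, 295, 305, 320, 300, 298, 315]

puts length_plausible?(302, hits) # close to the hits
puts length_plausible?(150, hits) # much shorter than every hit
```

With these made-up numbers the first prediction is accepted and the second one is flagged. A real validation would probably use a statistical test rather than fixed cut-offs.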

The results of the validations can be viewed in the console, or in YAML or HTML format. Choosing the HTML output format enables plot generation: the distribution of the data specific to each kind of validation can be visualized graphically, and possible errors in the genes may be highlighted.
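
As a rough idea of what a YAML report could look like, the sketch below serializes a small result hash with Ruby's standard `yaml` library and reads it back. The keys (`prediction`, `validations`, `name`, `passed`) and the values are hypothetical and do not reflect the tool's real output schema.

```ruby
require 'yaml'

# Hypothetical report layout; keys and values are illustrative only.
report = {
  'prediction' => 'gene_001',
  'validations' => [
    { 'name' => 'length', 'passed' => true },
    { 'name' => 'reading_frame', 'passed' => false }
  ]
}

# Serialize the report to YAML text and print it.
yaml_text = report.to_yaml
puts yaml_text

# Read the YAML back and list the validations that did not pass.
failed = YAML.load(yaml_text)['validations']
             .reject { |v| v['passed'] }
             .map { |v| v['name'] }
puts "Failed validations: #{failed.join(', ')}"
```

String keys are used on purpose, so that the text can be loaded back without any special options.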

At the moment our deliverable is a Ruby gem, accompanied by unit and statistical tests and automatically generated documentation. It can be cloned from the 'rubygem' branch of the git repo. I'll come back soon with another post on the application's code structure and how new validations can be added.

Validations were tested and labelled by hand. What I noticed is that even genes that have passed a first round of curation (e.g. those in NCBI databases) may not be accurately predicted. Next, we are looking for a list of recently curated (not yet released) genes in the Hymenoptera Genome Database [2], to check whether our tool shows evidence of improvement between predictions from two different releases.

What's next? We are starting a new validation test based on multiple alignment, in order to highlight the extra regions/gaps in the prediction and check whether the conserved regions appear in the predicted sequence.
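
The validation is not written yet, so the following is only a sketch of the idea, under simplifying assumptions: the multiple alignment is already computed, all rows have the same length, and '-' marks a gap. The method name `alignment_discrepancies` and the 0.8 threshold are made up for the example. Columns where the prediction has a residue but most hits have a gap are reported as extra regions, and columns where the prediction has a gap but most hits have a residue are reported as missing regions.

```ruby
# Hypothetical sketch, not the tool's implementation.
# prediction: aligned predicted sequence (String, '-' for gaps)
# hits:       aligned hit sequences of the same length (Array of String)
# threshold:  fraction of hits that must agree before a column is reported
def alignment_discrepancies(prediction, hits, threshold: 0.8)
  result = { extra: [], missing: [] }
  return result if hits.empty?

  prediction.each_char.with_index do |residue, col|
    gap_fraction = hits.count { |hit| hit[col] == '-' }.to_f / hits.size
    if residue != '-' && gap_fraction >= threshold
      result[:extra] << col   # prediction has sequence where the hits have none
    elsif residue == '-' && (1.0 - gap_fraction) >= threshold
      result[:missing] << col # prediction lacks a region present in the hits
    end
  end
  result
end

# Toy alignment with made-up sequences.
prediction = 'MKT--LLAAGGW'
hits = ['MKTAVLL--GGW', 'MKTAVLL--GGW', 'MRTAVLL--GGW']

p alignment_discrepancies(prediction, hits)
```

In this toy alignment, columns 3 and 4 are missing from the prediction and columns 7 and 8 are extra (column indices start at 0).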

If you are a biocurator, you may want to use our tool to save time and ease your work by keeping track of the genes that are likely to have problems, so that they can be curated first or discarded. If you are a bioinformatics enthusiast, you may want to see whether the genes from the databases you use have problems and evaluate how strong your gene analysis is, given that it depends on the data you use being verified and validated.

As half of GSoC has already passed, I just want to say that I have really enjoyed the time spent on this project and the people I have met along the way. What is cool about GSoC is that you work on a project you are keen on and manage your time as you wish. Working remotely also brings additional challenges. As for Ruby, I started programming in this language several weeks ago, and what I can say so far is that I get along very well with it. It's an awesome language and very intuitive for someone who has already worked with Python or Haskell.

My next post will be about my first experience with YAML data representation.

[1] http://swarm.cs.pub.ro/~mdragan/gsoc2013/one_direction_gene_merge/html

[2] http://hymenopteragenome.org/
