A compression-based approach for coding sequences identification in prokaryotic genomes

Identifying coding regions in genomic sequences is the first
step toward further analysis of the biological functions carried out by
the different functional elements in a genome. This paper
presents a novel method for the classification of coding and
non-coding regions in prokaryotic genomes, based on a suitably
defined compression index of a DNA sequence.
The proposed approach has been applied to some complete prokaryotic
genomes, obtaining excellent scores of correctly recognized coding and
non-coding regions. Several false-positive and false-negative cases
have been investigated in detail, revealing that the approach can
fail in the presence of highly structured coding regions (e.g., genes
coding for modular proteins) or quasi-random non-coding regions
(e.g., regions hosting non-functional fragments of copies of functional
genes, or regions hosting promoters or other protein-binding
sequences).
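
The abstract does not define the compression index, so the following is only a rough illustration of the general idea of scoring a DNA sequence by how well it compresses. It is a minimal sketch that assumes a generic compressor (zlib) and a simple size ratio; the name `compression_index` and the ratio itself are hypothetical stand-ins, not the index defined in the paper.

```python
import random
import zlib


def compression_index(seq: str, level: int = 9) -> float:
    """Hypothetical index: compressed size divided by raw size, in bytes.

    Lower values mean the sequence is more compressible (more structured).
    This is an illustrative stand-in, not the paper's definition.
    """
    data = seq.upper().encode("ascii")
    if not data:
        raise ValueError("sequence must not be empty")
    return len(zlib.compress(data, level)) / len(data)


# A highly structured (repetitive) sequence of 1000 bases.
repetitive = "ACGT" * 250

# A quasi-random sequence of the same length (fixed seed for repeatability).
rng = random.Random(0)
quasi_random = "".join(rng.choice("ACGT") for _ in range(1000))

print(compression_index(repetitive))
print(compression_index(quasi_random))
```

Under these assumptions the repetitive sequence receives a much lower index than the quasi-random one. This mirrors the distinction between highly structured and quasi-random regions that the abstract names as the source of misclassifications.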