BpMatch: an efficient algorithm for segmenting sequences, calculating genomic distance and counting repeats.
There are several important reasons (biological, evolutionary, clinical, etc.) to give a segment-based description of genomic sequences and, in particular, to detect repeated segments, whether they occur as direct copies or as complemented inverted copies. In some applications, particularly in medical genomics, it is also necessary to count the number of occurrences of a segment. Moreover, by detecting the segments shared by two different sequences, it is possible to define a kind of genomic distance between them. Here we propose BpMatch: an algorithm that, working on a suitably modified suffix-tree data structure, achieves all three of these goals (identifying repeated segments, including their complemented inverted copies; counting the number of repeats; and calculating genomic distance) in a fast and efficient way.
BpMatch identifies exact copies (and complemented inverted copies) of a segment. The operator defines a priori the minimum length l that a string must have in order to be considered a segment, and the minimum number of occurrences minRep, so that only segments occurring more than minRep times are considered significant. BpMatch is very efficient: we determined the time complexity of calculating the self-covering of a string S, given l, the alphabet size d and n=|S|. In the worst case, assuming the alphabet size is constant, the time required to calculate the coverage is O(l^2n).
On average, using l <= 2\log_d(n), the time required to calculate the coverage is only O(n). It is important to note that this estimate includes the time required to complete all of the tasks involved: identifying copied segments, localizing them, counting their occurrences and evaluating the sequence coverage.
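
To make the roles of l and minRep concrete, the following is a minimal Python sketch of the task described above: finding exact copies and complemented inverted copies of length-l segments, locating them and counting them. It is not the BpMatch algorithm. It replaces the modified suffix tree with a plain hash table over fixed-length windows, and the function names, the DNA alphabet ACGT and the "+"/"-" strand labels are assumptions introduced only for illustration.

```python
from collections import defaultdict

# Assumption: the sequence uses the DNA alphabet A, C, G, T (so d = 4).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(segment):
    """Return the complemented inverted copy of a DNA segment."""
    return segment.translate(_COMPLEMENT)[::-1]


def repeated_segments(s, l, min_rep):
    """Map each significant length-l segment to its list of occurrences.

    Each occurrence is a pair (start, strand). The strand is "+" when the
    window at that position equals the key, and "-" when the window is the
    complemented inverted copy of the key.
    Hypothetical helper: a hash-based stand-in, not the suffix-tree method.
    """
    occurrences = defaultdict(list)
    for start in range(len(s) - l + 1):
        window = s[start:start + l]
        inverted = reverse_complement(window)
        # Group a segment with its complemented inverted copy by using the
        # lexicographically smaller of the two strings as the shared key.
        key = min(window, inverted)
        strand = "+" if key == window else "-"
        occurrences[key].append((start, strand))
    # Keep only segments occurring more than min_rep times.
    return {seg: occ for seg, occ in occurrences.items() if len(occ) > min_rep}


hits = repeated_segments("ACCATGGTACC", l=3, min_rep=2)
```

Here `hits` contains a single entry, for the segment ACC, which occurs directly at positions 0 and 8 and as the complemented inverted copy GGT at position 5. This sketch only handles segments of exactly length l and takes O(ln) time for the window extraction alone, so it says nothing about how BpMatch reaches the bounds stated above.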