Lossless Filter for Long Multiple Repetitions with Edit Distance

Identifying local similarity between two or more sequences, or
identifying repetitions occurring at least twice in a sequence, is an
essential part in the analysis of biological sequences and of their
phylogenetic relationship. Finding fragments that are conserved among
several given sequences, or inside a unique sequence, while allowing
for a certain number of insertions, deletions, and substitutions, is
however known to be a computationally expensive task, and consequently
exact methods can usually not be applied in practice. The filter we
introduce in this paper, called Ed'Nimbus, provides a possible
solution to this problem. It can be used as a preprocessing step to
any multiple alignment method, eliminating an important fraction of
the input that is guaranteed not to contain any approximate
repetition. It consists in the verification of a strong necessary
condition. This condition concerns the number and order of exactly
repeated words shared by the approximate repetitions. The efficiency
of the filter is due to this condition, that we show how to check in a
fast way. The speed of the method is achieved thanks also to the use
of a simple and efficient data structure, that we describe in this
paper, as well as its linear time and space construction. Our results
show that using Ed'Nimbus allows us to sensibly reduce the time
spent in a local multiple alignment.
The Ed'Nimbus software is freely available on the web:
http://igm.univ-mlv.fr/$\sim$peterlon/officiel/ednimbus/