A Comparative Study of Bases for Motif Inference

Motif inference is at the heart of several time-demanding computational tasks, such as in molecular biology, data mining and identification of structured motifs in sequences, and in data compression, to name a few. In this scenario, a motif is a pattern that appears repeated at least a certain number of times (the quorum), to be of interest. The pattern can be approximated in that some of its characters can be left unspecified (the don't cares). Motif inference is not aimed at searching a given pattern but, rather, at discovering all the possible patterns that appear as motifs in the given input string. The combinatorial explosion of these patterns makes their discover an exponential-time computation. For this, the notion of basis has been recently introduced to succinctly represent all of them within reasonable time and space bounds. The goal of the paper is to shed light on the state of the art for this emerging field and to add further properties to what is currently known.