Masking Patterns in Sequences: a New Class of Motif Discovery with Don't Cares
In this paper, we introduce a new notion of motifs, called
\emph{masks}, that succinctly represent the repeated patterns for an
input sequence $T$ of $n$ symbols drawn from an alphabet $\Sigma$.
We show how to build the set of all maximal masks of length~$L$ and
quorum~$q$, in $O(2^L n)$ time and space in the worst case. We
analytically show that our algorithms perform better than
constant-time enumerating and checking all the potential
$(|\Sigma|+1)^L$ candidate patterns in~$T$ after a polynomial-time
preprocessing of $T$. Our algorithms are also cache-friendly,
attaining $O(2^L\, \mathit{sort}(n))$ block transfers, where
$\mathit{sort}(n)$ is the cache oblivious complexity of sorting $n$
items.