Masking Patterns in Sequences: a New Class of Motif Discovery with Don't Cares

In this paper, we introduce a new notion of motifs, called \emph{masks}, that succinctly represent the repeated patterns for an input sequence $T$ of $n$ symbols drawn from an alphabet $\Sigma$. We show how to build the set of all maximal masks of length~$L$ and quorum~$q$, in $O(2^L n)$ time and space in the worst case. We analytically show that our algorithms perform better than constant-time enumerating and checking all the potential $(|\Sigma|+1)^L$ candidate patterns in~$T$ after a polynomial-time preprocessing of $T$. Our algorithms are also cache-friendly, attaining $O(2^L\, \mathit{sort}(n))$ block transfers, where $\mathit{sort}(n)$ is the cache oblivious complexity of sorting $n$ items.