On compressing and indexing data
In this paper we design two compressed data structures for the
full-text indexing problem. These data structures support efficient
substring searches within the indexed text $T$ using roughly the same
space required to store $T$ in compressed form.
Our first compressed data structure retrieves the $occ$ occurrences of
a pattern $P[1,p]$ in $T[1,n]$ in $O(p + occ\log^{1+\epsilon} n)$ time
and uses at most $5n H_k(T) + o(n)$ bits of storage, where
$H_k(T)$ is the $k$-th order empirical entropy of $T$. This space
occupancy is $Theta(n)$ bits in the worst case and $o(n)$ bits for
compressible texts. Our data structure exploits the relationship
between the suffix array and the Burrows-Wheeler
compression algorithm.
Our second compressed data structure achieves $O(p+occ)$ query time
and uses $O(n H_k(T)\log^\epsilon n) + o(n)$ bits of storage. In
the worst case the space occupancy is $o(n\log n)$ bits which is
asymptotically smaller than the space occupancy of suffix trees and
suffix arrays. This second data structure exploits the interplay
between two compressors: the Burrows-Wheeler algorithm and the LZ78 algorithm.