Opportunistic data structures with applications
There is an upsurging interest in designing succinct data structures
for basic searching problems (see [Munro99] and references
therein). The motivation has to be found in the exponential increase
of electronic data nowadays available which is even surpassing the
significant increase in memory and disk storage capacities of current
computers. Space reduction is an attractive issue because it is also
intimately related to performance improvements as noted by several
authors (e.g. Knuth [knuth3], Bentley [bentley]). In
designing these implicit data structures the goal is to reduce
as much as possible the auxiliary information kept together with the
input data without introducing a significant slowdown in the final
query performance. Yet input data are represented in their entirety
thus taking no advantage of possible repetitiveness into them. The
importance of those issues is well known to programmers who typically
use various tricks to squeeze data as much as possible and still
achieve good query performance. Their approaches, thought, boil down
to heuristics whose effectiveness is witnessed only by
experimentation.
In this paper, we address the issue of compressing and indexing data
by studying it in a theoretical framework. We devise a novel data
structure for indexing and searching whose space occupancy is a
function of the entropy of the underlying data set. The novelty
resides in the careful combination of a compression algorithm,
proposed by Burrows-Wheeler [bw], with the structural properties
of a well known indexing tool, the Suffix Array [MM93]. We call
the data structure opportunistic since its space occupancy is
decreased when the input is compressible at no significant slowdown in
the query performance and without any assumption on a particular fixed
distribution. More precisely, its space occupancy is optimal in a
information-content sense because a text $T[1,u]$ is stored using
$O(k(T)) + o(1)$ bits per input symbol, where $k(T)$ is
the $k$th order entropy of $T$ (the bound holds for any fixed
$k$). Given an arbitrary string $P[1,p]$, the opportunistic data
structure allows to search for the $occ$ occurrences of $P$ in $T$
requiring $O(p + occ \log^\epsilon u)$ time complexity (for any
fixed $\epsilon >0$). If data are non compressible, then we achieve
the best space bound currently known [GV00]; otherwise our
solution improves the succinct suffix array in [GV00] and the
classical suffix tree and suffix array data structures either in space
or in query time complexity or both.
It was a belief [Witten:1999:MGC] that some space overhead should
be paid to use full-text indices (i.e. suffix trees or suffix arrays)
with respect to the word-based indices (i.e. inverted lists). The
results in this paper show that no space overhead is needed at
all, and as an application we improve space and query time complexity
of the well-known Glimpse tool [glimpse].
We finally investigate the modifiability of our opportunistic data
structure by studying how to choreograph its basic ideas with a dynamic
setting thus achieving effective searching and updating time bounds.