Opportunistic data structures with applications

There is an upsurging interest in designing succinct data structures for basic searching problems (see [Munro99] and references therein). The motivation has to be found in the exponential increase of electronic data nowadays available which is even surpassing the significant increase in memory and disk storage capacities of current computers. Space reduction is an attractive issue because it is also intimately related to performance improvements as noted by several authors (e.g. Knuth [knuth3], Bentley [bentley]). In designing these implicit data structures the goal is to reduce as much as possible the auxiliary information kept together with the input data without introducing a significant slowdown in the final query performance. Yet input data are represented in their entirety thus taking no advantage of possible repetitiveness into them. The importance of those issues is well known to programmers who typically use various tricks to squeeze data as much as possible and still achieve good query performance. Their approaches, thought, boil down to heuristics whose effectiveness is witnessed only by experimentation. In this paper, we address the issue of compressing and indexing data by studying it in a theoretical framework. We devise a novel data structure for indexing and searching whose space occupancy is a function of the entropy of the underlying data set. The novelty resides in the careful combination of a compression algorithm, proposed by Burrows-Wheeler [bw], with the structural properties of a well known indexing tool, the Suffix Array [MM93]. We call the data structure opportunistic since its space occupancy is decreased when the input is compressible at no significant slowdown in the query performance and without any assumption on a particular fixed distribution. More precisely, its space occupancy is optimal in a information-content sense because a text $T[1,u]$ is stored using $O(k(T)) + o(1)$ bits per input symbol, where $k(T)$ is the $k$th order entropy of $T$ (the bound holds for any fixed $k$). Given an arbitrary string $P[1,p]$, the opportunistic data structure allows to search for the $occ$ occurrences of $P$ in $T$ requiring $O(p + occ \log^\epsilon u)$ time complexity (for any fixed $\epsilon >0$). If data are non compressible, then we achieve the best space bound currently known [GV00]; otherwise our solution improves the succinct suffix array in [GV00] and the classical suffix tree and suffix array data structures either in space or in query time complexity or both. It was a belief [Witten:1999:MGC] that some space overhead should be paid to use full-text indices (i.e. suffix trees or suffix arrays) with respect to the word-based indices (i.e. inverted lists). The results in this paper show that no space overhead is needed at all, and as an application we improve space and query time complexity of the well-known Glimpse tool [glimpse]. We finally investigate the modifiability of our opportunistic data structure by studying how to choreograph its basic ideas with a dynamic setting thus achieving effective searching and updating time bounds.