Self-adjusting Data Structures for External Memory String Access

Data warehouses are increasingly storing and managing large scale string data, and dealing with large volume of transactions that update and search string databases. Motivated by this context, we initiate the study of {\em self-adjusting} data structures for string dictionary operations, that is, data structures that are designed to be efficient on an entire sequence of operations rather than individual string operations. Furthermore, we study this problem in the external memory model where string data is too massive to be stored in main memory and has to reside in disks; each access to a disk page fetches $B$ items, and the cost of the operations is the number of pages accessed. We show that given $n$ strings $S_{1},\ldots,S_{n}$ of total length $\sum_i |S_i|=N$, a sequence of $m$ string searches $S_{i_{1}},S_{i_{2}},\ldots,S_{i_{m}}$ takes $ O(\sum_{j=1}^{m}(\frac{|S_{i_{j}}|}{B})+\sum _{i=1}^{n}(n_{i}\log_{B}\frac{m}{n_{i}}))$ amortized expected I/Os, where $n_{i}$ is the number of times $S_{i}$ is queried. Inserting or deleting a string $S$ takes $O(\frac{|S|}{B}+\log_{B}n)$ amortized expected I/Os. This result is the analog of what is known as the Static Optimality Theorem~\cite{Sleator-Tarjan} proved by Sleator and Tarjan in their classic splay trees paper for a dictionary of numerical values; here, it has been generalized, for the first time to string data, to string operations, and to the external memory model. This performance is achieved not by traditional ``splay'' operations on search trees as in~\cite{Sleator-Tarjan}, but by designing a novel self-adjusting data structure based on the well-known skip lists. In addition, we introduce the paradigm of using the main memory (or a part thereof) persistently across operations, in the manner of a cache, to further improve the performance of our self-adjusting skip list. This is quite reasonable in applications where string operations are large in volume.