Self-adjusting Data Structures for External Memory String Access
Data warehouses are increasingly storing and managing large scale
string data, and dealing with large volume of transactions that update
and search string databases. Motivated by this context, we
initiate the study of {\em self-adjusting} data structures for
string dictionary operations, that is, data structures that are
designed to be efficient on an entire sequence of operations
rather than individual string operations. Furthermore, we study
this problem in the external memory model where string data is too
massive to be stored in main memory and has to reside in disks;
each access to a disk page fetches $B$ items, and the cost of the
operations is the number of pages accessed.
We show that given $n$ strings $S_{1},\ldots,S_{n}$ of total
length $\sum_i |S_i|=N$, a sequence of $m$ string searches
$S_{i_{1}},S_{i_{2}},\ldots,S_{i_{m}}$ takes $
O(\sum_{j=1}^{m}(\frac{|S_{i_{j}}|}{B})+\sum
_{i=1}^{n}(n_{i}\log_{B}\frac{m}{n_{i}}))$ amortized expected
I/Os, where $n_{i}$ is the number of times $S_{i}$ is queried.
Inserting or deleting a string $S$ takes
$O(\frac{|S|}{B}+\log_{B}n)$ amortized expected I/Os. This result
is the analog of what is known as the Static Optimality
Theorem~\cite{Sleator-Tarjan} proved by Sleator and Tarjan in
their classic splay trees paper for a dictionary of numerical
values; here, it has been generalized, for the first time to
string data, to string operations, and to the external memory model. This
performance is achieved not by traditional ``splay'' operations
on search trees as in~\cite{Sleator-Tarjan}, but by designing a
novel self-adjusting data structure based on the well-known skip
lists.
In addition, we introduce the paradigm of using the main memory (or a
part thereof) persistently across operations, in the manner of a
cache, to further improve the performance of our self-adjusting
skip list. This is quite reasonable in applications where string
operations are large in volume.