Compressed static functions with applications to other dictionary problems
Given a set of integer keys from a bounded universe along with associated data, the dictionary problem asks to answer two queries: membership and retrieval. Membership asks whether a given element is in the dictionary; retrieval returns the data associated with the queried key.
This paper studies three well-established relaxations of this basic problem: (compressed) static functions, approximate membership, and relative membership.
(Compressed) Static functions. In this relaxation, also known as the retrieval-only dictionary problem, we are given a set $S \subseteq U$ of $n$ integers, each with associated data drawn from an alphabet of size $\sigma$. The problem asks to build a dictionary that, given a key $x\in S$, returns its associated data; whenever $x \not\in S$, arbitrary data may be returned.
This problem has been widely studied. Solutions must carefully organize the associated data so that it can be retrieved in constant time without storing the keys of $S$.
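To make the interface concrete, the following sketch implements a classic (uncompressed) static function in the 3-hypergraph/XOR style. It is only an illustration of the query contract: it uses roughly $1.23\,n$ cells of word size, far from the compressed bounds discussed below, and it is not the scheme proposed in this paper; the function name \texttt{build\_static\_function} and its parameters are ours.
\begin{verbatim}
import random

def build_static_function(pairs, gamma=1.23, max_tries=100, seed=0):
    # Illustrative retrieval-only structure (XOR / 3-hypergraph style):
    # for every key x with value v we enforce
    #     table[h0(x)] ^ table[h1(x)] ^ table[h2(x)] == v,
    # so a query XORs three cells; keys outside the input set get arbitrary answers.
    n = len(pairs)
    seg = int(gamma * n / 3) + 1        # three disjoint segments -> distinct cells per key
    m = 3 * seg
    for attempt in range(max_tries):     # retry with fresh hash salts if peeling fails
        rng = random.Random(hash((seed, attempt)))
        salts = [rng.getrandbits(64) for _ in range(3)]

        def cells(x):
            return [i * seg + hash((x, salts[i])) % seg for i in range(3)]

        # Build the cell -> incident-keys map of the 3-uniform hypergraph.
        incident = [set() for _ in range(m)]
        key_cells = [cells(x) for x, _ in pairs]
        for k, cs in enumerate(key_cells):
            for c in cs:
                incident[c].add(k)

        # Peel: repeatedly detach a key that is the only one touching some cell.
        stack = []
        queue = [c for c in range(m) if len(incident[c]) == 1]
        while queue:
            c = queue.pop()
            if len(incident[c]) != 1:
                continue                 # stale queue entry
            (k,) = incident[c]
            stack.append((k, c))         # c is k's "free" cell
            for c2 in key_cells[k]:
                incident[c2].discard(k)
                if len(incident[c2]) == 1:
                    queue.append(c2)
        if len(stack) < n:
            continue                     # hypergraph not peelable, try new salts

        # Assign cells in reverse peeling order so every key's equation holds.
        table = [0] * m
        for k, c in reversed(stack):
            acc = pairs[k][1]
            for other in key_cells[k]:
                if other != c:
                    acc ^= table[other]
            table[c] = acc

        def query(x):
            a, b, c = cells(x)
            return table[a] ^ table[b] ^ table[c]
        return query
    raise RuntimeError("construction failed; increase gamma or max_tries")

# Keys in S recover their value exactly; other keys get an arbitrary value.
f = build_static_function([(10, 3), (42, 7), (99, 1)])
assert f(42) == 7
print(f(123456))   # arbitrary answer, no error: the keys are not stored
\end{verbatim}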
Compressed static functions go a step further: they not only avoid storing the keys, but also achieve space bounds expressed in terms of the entropy $H_0$ of the associated data.
Such solutions are appealing for two main reasons. First, since $nH_0$ is at most $n\log \sigma$, these results are never worse than the uncompressed ones. Second, since the associated data often follow a skewed distribution, $nH_0$ can even be sublinear in $n$.
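To see how skewness makes $nH_0$ sublinear, recall that $H_0 = \sum_{s} p_s \log\frac{1}{p_s}$, where $p_s$ is the empirical frequency of symbol $s$ among the associated data. As an illustrative (hypothetical) example, take a binary alphabet with $p_0 = 1 - 1/\sqrt{n}$:
\[
H_0 \;=\; p_0\log\frac{1}{p_0} \;+\; \frac{1}{\sqrt{n}}\log\sqrt{n} \;=\; O\!\left(\frac{\log n}{\sqrt{n}}\right),
\]
hence $nH_0 = O(\sqrt{n}\log n) = o(n)$, whereas the uncompressed bound is $n\log\sigma = n$ bits.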
The current best solutions are by Porat and by Hreinsson et al. The former requires $n\log \sigma + o(n)$ bits of space, while the latter uses $(1+\delta)nH_0 + n \cdot \min(p_0+0.086,\,1.82(1-p_0))$ bits of space, where $p_0$ is the probability of the most frequent symbol and $\delta$ is a constant greater than $0$. Thus, the space complexities of these two solutions are incomparable:
the former has a sublinear overhead but is not compressed, while the latter is suboptimal both because of the multiplicative factor $(1+\delta)$ on $H_0$ and because of an additive overhead that may be $\Theta(n)$, depending on $p_0$.
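For instance, for any constant $p_0 < 1$ the additive term of the latter bound is $\Theta(n)$: with, say, $p_0 = 0.9$ it equals
\[
n\cdot\min(0.9+0.086,\;1.82\cdot 0.1) \;=\; 0.182\,n \text{ bits},
\]
and it becomes $o(n)$ only when $p_0 = 1 - o(1)$.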
Our optimal scheme achieves the best of both worlds, being the first known solution that simultaneously obtains constant query time, compressed space ($nH_0$ bits), and sublinear overhead ($o(n)$ bits).
We strongly believe that these characteristics make static functions significantly more appealing for many applications.
Approximate membership. The approximate membership problem has been studied for decades, and the Bloom filter is probably the most popular and widely used data structure solving it.
A Bloom filter represents a set of $n$ integers within $n\log\frac{1}{\epsilon} \log e$ bits of space with false positive probability ({\sf fpp}) $\epsilon$.
Both its space and time complexities are non-optimal: the space is a factor $\log e \approx 1.44$ away from the optimal $n\log\frac{1}{\epsilon}$ bits, and the query time is logarithmic in $\frac{1}{\epsilon}$.
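These figures follow from the standard Bloom filter analysis: with $m$ bits and $k$ hash functions over $n$ keys,
\[
{\sf fpp} \approx \left(1 - e^{-kn/m}\right)^{k},
\]
which is minimized at $k = \frac{m}{n}\ln 2$, where it equals $2^{-k}$. Hence achieving ${\sf fpp}=\epsilon$ takes $k = \log\frac{1}{\epsilon}$ hash evaluations per query (whence the logarithmic query time) and $m = nk\log e = n\log\frac{1}{\epsilon}\log e$ bits.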
Constant-time approximate membership data structures are able to achieve the optimal $n \log \frac{1}{\epsilon} + o(n)$ bits of space only if $\frac{1}{\epsilon}$ is a power of two.
For an arbitrary {\sf fpp}, the current best solution requires an overhead of $O(n)$ bits in addition to the optimal space.
Our optimal scheme is the first known solution with an $o(n)$-bit overhead for any value of $\epsilon$ such that $\log \frac{1}{\epsilon} = o(\log n / \log\log n)$, namely, for any reasonable choice of $\epsilon$ in practice.
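For intuition on how retrieval structures yield approximate membership, here is the folklore fingerprint reduction as a small sketch (the dictionary below merely stands in for a compressed static function, and this is not claimed to be the exact construction used in this paper): each key $x\in S$ is mapped to an $r$-bit fingerprint with $r=\lceil\log\frac{1}{\epsilon}\rceil$, and a query answers ``yes'' iff the retrieved value matches the fingerprint of the queried key, giving ${\sf fpp}\le 2^{-r}\le\epsilon$.
\begin{verbatim}
import math

def _fp(x, r, salt):
    # r-bit hash of x; an illustrative stand-in for a proper random hash.
    return hash((x, salt)) & ((1 << r) - 1)

def build_approx_membership(S, eps):
    # Folklore reduction: approximate membership from a retrieval-only structure.
    # A plain dict stands in for a (compressed) static function f : S -> {0,1}^r.
    r = max(1, math.ceil(math.log2(1 / eps)))
    stored = {x: _fp(x, r, "fingerprint") for x in S}
    def query(y):
        # A real static function returns an arbitrary r-bit value outside S;
        # the differently salted hash below simulates that arbitrary answer.
        retrieved = stored.get(y, _fp(y, r, "arbitrary"))
        return retrieved == _fp(y, r, "fingerprint")
    return query

contains = build_approx_membership({3, 14, 15, 92, 65}, eps=1/64)
assert contains(14)                                   # no false negatives
print(sum(contains(y) for y in range(100000)) - 5)    # about eps * 100000 false positives
\end{verbatim}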
Relative membership. This problem is a further relaxation of the membership query. Given two sets $S$ and $R$ with $S \subset R$, we must be able to tell whether a key belongs to $S$ or to $R\setminus S$. Our compressed static functions yield a constant-time solution for this problem with optimal space complexity (up to lower-order terms).
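One natural route (a sketch of the standard reduction; whether it matches the exact construction used here is an assumption) is to store, with a compressed static function over $R$, the indicator bit $b(x)=1$ iff $x\in S$. A query then retrieves a single bit in constant time, and since the bit sequence contains $|S|$ ones out of $|R|$, its zero-order entropy satisfies
\[
|R|\,H_0 \;=\; |R|\,\mathcal{H}\!\left(\frac{|S|}{|R|}\right) \;=\; \log\binom{|R|}{|S|} + O(\log |R|),
\]
where $\mathcal{H}$ denotes the binary entropy function, i.e., the information-theoretic minimum for the problem, so compression brings the space down to optimal up to lower-order terms.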