Compress, Hash and Displace: CHD Algorithm


Introduction

The important performance parameters of a PHF are representation size, evaluation time and construction time. The representation size plays an important role when the whole function fits in a faster memory and the actual data is stored in a slower memory. For instace, compact PHFs can be entirely fit in a CPU cache and this makes their computation really fast by avoiding cache misses. The CHD algorithm plays an important role in this context. It was designed by Djamal Belazzougui, Fabiano C. Botelho, and Martin Dietzfelbinger in [2].

The CHD algorithm permits to obtain PHFs with representation size very close to optimal while retaining O(n) construction time and O(1) evaluation time. For example, in the case m=2n we obtain a PHF that uses space 0.67 bits per key, and for m=1.23n we obtain space 1.4 bits per key, which was not achievable with previously known methods. The CHD algorithm is inspired by several known algorithms; the main new feature is that it combines a modification of Pagh's ``hash-and-displace'' approach with data compression on a sequence of hash function indices. That combination makes it possible to significantly reduce space usage while retaining linear construction time and constant query time. The CHD algorithm can also be used for k-perfect hashing, where at most k keys may be mapped to the same value. For the analysis we assume that fully random hash functions are given for free; such assumptions can be justified and were made in previous papers.

The compact PHFs generated by the CHD algorithm can be used in many applications in which we want to assign a unique identifier to each key without storing any information on the key. One of the most obvious applications of those functions (or k-perfect hash functions) is when we have a small fast memory in which we can store the perfect hash function while the keys and associated satellite data are stored in slower but larger memory. The size of a block or a transfer unit may be chosen so that k data items can be retrieved in one read access. In this case we can ensure that data associated with a key can be retrieved in a single probe to slower memory. This has been used for example in hardware routers [4].

The CHD algorithm generates the most compact PHFs and MPHFs we know of in O(n) time. The time required to evaluate the generated functions is constant (in practice less than 1.4 microseconds). The storage space of the resulting PHFs and MPHFs are distant from the information theoretic lower bound by a factor of 1.43. The closest competitor is the algorithm by Martin and Pagh [3] but their algorithm do not work in linear time. Furthermore, the CHD algorithm can be tuned to run faster than the BPZ algorithm [1] (the fastest algorithm available in the literature so far) and to obtain more compact functions. The most impressive characteristic is that it has the ability, in principle, to approximate the information theoretic lower bound while being practical. A detailed description of the CHD algorithm can be found in [2].


Experimental Results

Experimental results comparing the CHD algorithm with the BDZ algorithm and others available in the CMPH library are presented in [2].


Papers

  1. F. C. Botelho, R. Pagh, N. Ziviani. Simple and space-efficient minimal perfect hash functions. In Proceedings of the 10th International Workshop on Algorithms and Data Structures (WADs'07), Springer-Verlag Lecture Notes in Computer Science, vol. 4619, Halifax, Canada, August 2007, 139-150.

  2. F. C. Botelho, D. Belazzougui and M. Dietzfelbinger. Compress, hash and displace. In Proceedings of the 17th European Symposium on Algorithms (ESA’09). Springer LNCS, 2009.

  3. M. Dietzfelbinger and R. Pagh. Succinct data structures for retrieval and approximate membership. In Proceedings of the 35th international colloquium on Automata, Languages and Programming (ICALP’08), pages 385–396, Berlin, Heidelberg, 2008. Springer-Verlag.

  4. B. Prabhakar and F. Bonomi. Perfect hashing for network applications. In Proceedings of the IEEE International Symposium on Information Theory. IEEE Press, 2006.

Home CHD BDZ BMZ CHM BRZ FCH

Enjoy!

Davi de Castro Reis

Djamel Belazzougui

Fabiano Cupertino Botelho

Nivio Ziviani

VigLink badge