Comparison Between BMZ And CHM Algorithms

Characteristics

Table 1 presents the main characteristics of the two algorithms. The number of edges in the graph $G$ is $n$, the number of keys in the input set. The number of vertices of $G$ is $1.15n$ for the BMZ algorithm and $2.09n$ for the CHM algorithm. This number determines the space needed to store the array $g$, so a function generated by the BMZ algorithm takes about 55% of the space required by the CHM algorithm. The number of critical edges is $0.5\vert E(G)\vert$ for the BMZ algorithm and 0 for the CHM algorithm. The BMZ algorithm generates random graphs that necessarily contain cycles, whereas the CHM algorithm generates acyclic random graphs. Finally, the CHM algorithm generates order-preserving functions, while the BMZ algorithm does not preserve order.

Characteristics	Algorithms
	BMZ	CHM
$c$	1.15	2.09
$\vert E(G)\vert$	$n$	$n$
$\vert V(G)\vert=\vert g\vert$	$1.15n$	$2.09n$
$\vert E(G_{\rm crit})\vert$	$0.5\vert E(G)\vert$	0
$G$	cyclic	acyclic
Order preserving	no	yes

Table 1: Main characteristics of the algorithms.

Memory Consumption

Memory consumption to generate the minimal perfect hash function (MPHF):

Algorithm	c	Memory consumption to generate a MPHF
BMZ	0.93	24.80n + O(1)
BMZ	1.15	26.42n + O(1)
CHM	2.09	33.00n + O(1)

Table 2: Memory consumption to generate a MPHF using the algorithms BMZ and CHM.

Memory consumption to store the resulting minimal perfect hash function (MPHF):

Algorithm	c	Memory consumption to store a MPHF
BMZ	0.93	3.72n
BMZ	1.15	4.60n
CHM	2.09	8.36n

Table 3: Memory consumption to store a MPHF generated by the algorithms BMZ and CHM.
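
The entries of Table 3 are consistent with storing the array $g$ as $cn$ entries of 4 bytes each (for example, 1.15 × 4 = 4.60). The 4-byte entry size and the byte unit are inferred from the table rather than stated in the text, and the helper name below is hypothetical. A minimal sketch under those assumptions, which also reproduces the ratio between the BMZ and CHM storage costs:

```python
# Assumption: each entry of g occupies one 4-byte word (inferred from Table 3).
WORD_BYTES = 4


def storage_bytes_per_key(c):
    """Bytes per key needed to store g when |V(G)| = c * n (hypothetical helper)."""
    return WORD_BYTES * c


# (algorithm, c) pairs taken from Table 3.
configs = [("BMZ", 0.93), ("BMZ", 1.15), ("CHM", 2.09)]
for name, c in configs:
    print(f"{name} c={c:.2f}: {storage_bytes_per_key(c):.2f}n")

# Space of a BMZ function (c = 1.15) relative to a CHM function (c = 2.09).
ratio = storage_bytes_per_key(1.15) / storage_bytes_per_key(2.09)
print(f"BMZ/CHM storage ratio: {ratio:.0%}")
```

Under this reading, the ratio depends only on the two values of $c$, so it is independent of the entry size chosen.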

Run times

We now present some experimental results to compare the BMZ and CHM algorithms. The data consists of a collection of 100 million uniform resource locators (URLs) collected from the Web. The average length of a URL in the collection is 63 bytes. All experiments were carried out on a computer running the Linux operating system, version 2.6.7, with a 2.4 gigahertz processor and 4 gigabytes of main memory.

Table 4 presents time measurements. All times are in seconds, and the table entries are averages over 50 trials. The first column gives the number of keys $n$. For each algorithm, the first column gives the number of iterations needed to generate the random graph in the mapping step; the next columns give the run times for the mapping and ordering steps together, for the searching step, and in total. The last column gives the percent gain of our algorithm over the CHM algorithm.

$n$	BMZ				CHM				Gain
	Iterations	Map+Ord	Search	Total	Iterations	Map+Ord	Search	Total	(%)
1,562,500	2.28	8.54	2.37	10.91	2.70	14.56	1.57	16.13	48
3,125,000	2.16	15.92	4.88	20.80	2.85	30.36	3.20	33.56	61
6,250,000	2.20	33.09	10.48	43.57	2.90	62.26	6.76	69.02	58
12,500,000	2.00	63.26	23.04	86.30	2.60	117.99	14.94	132.92	54
25,000,000	2.00	130.79	51.55	182.34	2.80	262.05	33.68	295.73	62
50,000,000	2.07	273.75	114.12	387.87	2.90	577.59	73.97	651.56	68
100,000,000	2.07	567.47	243.13	810.60	2.80	1,131.06	157.23	1,288.29	59

Table 4: Time measurements for BMZ and the CHM algorithm.

The mapping step of the BMZ algorithm is faster for two reasons. First, the expected number of iterations in the mapping step needed to generate $G$ is 2.13 for the BMZ algorithm and 2.92 for the CHM algorithm (see [2] for details). Second, the graph $G$ generated by the BMZ algorithm has $1.15n$ vertices, against $2.09n$ for the CHM algorithm. The time for the ordering step of the BMZ algorithm is approximately equal to the time the CHM algorithm takes to check whether $G$ is acyclic. The searching step of the CHM algorithm is faster, but in total time the BMZ algorithm is, on average, approximately 59% faster than the CHM algorithm. Note that for both algorithms the searching step does not dominate the total time, and the experimental results clearly show linear behavior for the searching step.
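
The Gain column of Table 4 can be checked from the two Total columns. The text does not define the gain formula, so the sketch below assumes it is the extra total time of CHM relative to BMZ, in percent; the function name is hypothetical. Under that assumption the computed values match the published column after rounding.

```python
# (n, BMZ total, CHM total), times in seconds, copied from Table 4.
totals = [
    (1_562_500, 10.91, 16.13),
    (3_125_000, 20.80, 33.56),
    (6_250_000, 43.57, 69.02),
    (12_500_000, 86.30, 132.92),
    (25_000_000, 182.34, 295.73),
    (50_000_000, 387.87, 651.56),
    (100_000_000, 810.60, 1288.29),
]


def gain_percent(bmz_total, chm_total):
    """Assumed definition: extra time of CHM relative to BMZ, in percent."""
    return (chm_total / bmz_total - 1.0) * 100.0


# Per-row gain, rounded to whole percent as in the table.
gains = [round(gain_percent(bmz, chm)) for _, bmz, chm in totals]
for (n, _, _), g in zip(totals, gains):
    print(f"n={n:>11,}: gain {g}%")

# Average over the seven input sizes.
average_gain = sum(gains) / len(gains)
print(f"average gain: {average_gain:.0f}%")
```

The per-row values come out as 48, 61, 58, 54, 62, 68 and 59, and their mean rounds to the 59% quoted above.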

We now present run times for the BMZ algorithm using a heuristic that reduces the space requirement to any given value between $1.15n$ words and $0.93n$ words. For example, for and , the analytical expected number of iterations are and , respectively (for , the number of iterations are 2.78 for and 3.04 for ). Table 5 presents the total times to construct a function for , with an increase from seconds for (see Table 4) to seconds for and to seconds for .