File Coverage

blib/lib/Tree/BPTree.pm

Criterion	Covered	Total	%
statement	295	301	98.0
branch	82	96	85.4
condition	33	42	78.5
subroutine	46	47	97.8
pod	17	17	100.0
total	473	503	94.0

line	stmt	bran	cond	sub	pod	time	code
1							package Tree::BPTree;
2
3							# $Id: BPTree.pm,v 1.4 2003/09/15 19:50:39 sterling Exp $
4
5	12			12		472882	use 5.008;
	12					48
	12					522
6	12			12		93	use strict;
	12					29
	12					457
7	12			12		67	use warnings;
	12					30
	12					366
8
9							# all the math is for indexing
10	12			12		14112	use integer;
	12					233
	12					88
11
12	12			12		526	use Carp;
	12					21
	12					1814
13
14							our $VERSION = '1.08';
15
16							=head1 NAME
17
18							Tree::BPTree - Perl implementation of B+ trees
19
20							=head1 SYNOPSIS
21
22							use Tree::BPTree;
23
24							# These arguments are actually the defaults
25							my $tree = new Tree::BPTree(
26							-n => 3,
27							-unique => 0,
28							-keycmp => sub { $_[0] cmp $_[1] },
29							-valuecmp => sub { $_[0] <=> $_[1] },
30							);
31
32							# index the entries in this string:
33							my $string = "THERE'S MORE THAN ONE WAY TO DO IT"; # TMTOWTDI
34							my $i = 0;
35							$tree->insert($_, $i++) foreach (split //, $string);
36
37							# find the index of the first 'T'
38							my $t = $tree->find('T');
39
40							# find the indexes of every 'T'
41							my @t = $tree->find('T');
42
43							# We don't like the word 'WAY ', so let's remove it
44							my $i = index $string, 'W';
45							$tree->delete($_, $i++) foreach (split //, substr($string, $i, 4));
46
47							# Reverse the sort order
48							$tree->reverse;
49
50							# Iterate through each key/value pair just like built-in each operator
51							while (my ($key, $value) = $tree->each) {
52							print "$key => $value\n";
53							}
54
55							# Reset the iterator when we quit from an "each-loop" early
56							$tree->reset;
57
58							# You might also be interested in using multiple each loops at once, which is
59							# possible through the cursor syntax. You can even delete individual pairs
60							# from the list during iteration.
61							my $cursor = $tree->new_cursor;
62							while (my ($key, $value) = $cursor->each) {
63							my $nested = $tree->new_cursor;
64							while (my ($nkey, $nvalue) = $nested->each) {
65							if ($key->shouldnt_be_in_this_tree_with($nkey)) {
66							$nested->delete;
67							}
68							}
69							}
70
71							# Iterate using an iterator subroutine
72							$tree->iterate(sub { print "$_[0] => $_[1]\n" });
73
74							# Iterate using an iterator subroutine that returns the list of return values
75							# returned by the iterator
76							print join(', ', $tree->map(sub { "$_[0] => $_[1]" })),"\n";
77
78							# Grep-like operations
79							my @pairs = $tree->grep (sub { $_[0] =~ /\S/ });
80							my @keys = $tree->grep_keys (sub { $_[0] =~ /\S/ });
81							my @values = $tree->grep_values (sub { $_[0] =~ /\S/ });
82
83							# Get all keys, values
84							my @all_keys = $tree->keys;
85							my @all_values = $tree->values;
86
87							# Clear it out and start over
88							$tree->clear;
89
90							=head1 DESCRIPTION
91
92							B+ trees are balanced trees which provide an ordered map from keys to values.
93							They are useful for indexing large bodies of data. They are similar to 2-3-4
94							Trees and Red-Black Trees. This implementation supports B+ trees using an
95							arbitrary I value.
96
97							=head2 STRUCTURE
98
99							Each node in a B+ tree contains I pointers and I keys. The pointers
100							in the node are placed between the ordered keys so that there is one pointer on
101							either end and one pointer in between each value. Searching for a key involves
102							checking to see which keys in the node the key falls between and then following
103							the corresponding pointers down the tree.
104
105							The pointers in the branches of thre tree always point to nodes deeper in the
106							tree. The leaves use all pointers but the last to point to buckets containing
107							values. The last pointer in each leaf forms a singly-linked list called the
108							linked leaf list. Iterating through this list gives us an ordered traversal of
109							all keys and/or values in the tree.
110
111							Finally, all non-root branch nodes must contain at least I pointers. If it
112							becomes necessary to add values to a node which already contains I pointers,
113							then the node will be split in half first (possibly requiring the split of
114							parents). If deletion of a node leaves a branch with fewer than I pointers,
115							the node will either be coalesced (joined to) a neigboring node or it will take
116							on a pointer from a neighbor node. Coalescing can also result in the further
117							rebalancing of the tree in parents using more coalesce or redistribute
118							operations.
119
120							Here's a diagram of a valid B+ tree when n = 3 that stores my last name,
121							"HANENKAMP":
122
123							-------
124							/ \
125							/ \
126							----
127							/ \ / \
128							/ \ / \|
129							> > > >
130							/ \ \| / \ / \
131							/ \ \| \| \| \| \
132							[1,6] [3] [0] [5][7] [2,4] [8]
133
134							Anyway, you don't need to know any of that to use this implementation. The
135							abstraction layer set on top makes it look something like a typical hash.
136							Insertion and deletion both require a specific key and value since multiple
137							values can be mapped to each key--unless the "-unique" flag has been set.
138
139							By default, the tree assumes that it is being used to map strings to indexes. I
140							chose to set this default because this is the most common use I will put it to.
141							That is, I have lists of strings that I want to index, so the keys will be the
142							strings to index and the values will be indexes into the list.
143
144							If you need to store something different, all you need to do is store a
145							reference to the objects (keys or values) and set the "-keycmp" and "-valuecmp"
146							options to appropriate values during initialization.
147
148							=head2 PERFORMANCE
149
150							At some point, I want to post the best, average, and worst-case operation speed
151							for this implementation of B+ trees, but for now we'll just have to live without
152							those stats. For raw benchmarks, you should see the L section as the
153							actual performance of this module is pretty slow.
154
155							=head2 IMPLEMENTATION
156
157							As a quick note on implementation, if you want to know how specific operations
158							work, please browse the source. I have included extensive comments within the
159							definitions of the methods themselves explaining most of the important steps. I
160							did this for my own sanity because B+ trees can be quite complicated.
161
162							This code has been optimized a bit, but I haven't nearly made as many
163							optimizations as are likely possible. I'm open to any suggestions. If you have
164							some, send me email at the address given below.
165
166							=head2 METHOD REFERENCE
167
168							=over
169
170							=cut
171
172							package Tree::BPTree::Node;
173
174	12			12		67	use integer;
	12					21
	12					51
175
176							sub new {
177	8329			8329		37146	my ($class, @data) = @_;
178	8329	100				23544	@data = ( undef ) unless @data;
179	8329		33			86496	return bless \@data, ref $class \|\| $class;
180							}
181
182							# sub key {
183							# my ($self, $k, $new) = @_;
184							# $$self[$k * 2 + 1] = $new if defined $new;
185							# return $$self[$k * 2 + 1];
186							# }
187							#
188							# sub value {
189							# my ($self, $v, $new) = @_;
190							# $$self[$v * 2] = $new if defined $new;
191							# return $$self[$v * 2];
192							# }
193							#
194							# sub last_key {
195							# my ($self, $new) = @_;
196							# $$self[-2] = $new if defined $new;
197							# return $$self[-2];
198							# }
199							#
200							# sub last_value {
201							# my ($self, $new) = @_;
202							# $$self[-1] = $new if defined $new;
203							# return $$self[-1];
204							# }
205							#
206							sub first_leaf {
207	21459			21459		34421	my ($self) = @_;
208	21459					39434	my $current = $self;
209	21459					170710	until ($current->isa('Tree::BPTree::Leaf')) {
210	20124					137171	$current = $$current[0];
211							}
212	21459					110563	return $current;
213							}
214
215							sub last_leaf {
216	536			536		849	my ($self) = @_;
217	536					555	my $current = $self;
218	536					1762	until ($current->isa('Tree::BPTree::Leaf')) {
219	110					397	$current = $$current[-1];
220							}
221	536					954	return $current;
222							}
223							#
224							# sub nkeys {
225							# my ($self) = @_;
226							# return (scalar(@$self) - 1) / 2;
227							# }
228							#
229							# sub nvalues {
230							# my ($self) = @_;
231							# return (scalar(@$self) + 1) / 2;
232							# }
233
234							# The find operation differs slightly between branch and leaf. See the comment
235							# near Tree::BPTree::Leaf::find for details.
236							sub find {
237	147499			147499		291744	my ($self, $cmp, $key) = @_;
238	147499					260714	my $nkeys = (@$self - 1) / 2;
239	147499					411784	for (my $k = 0; $k < $nkeys; $k++) {
240	263731	100				640939	if (&$cmp($key, $self->[($k) * 2 + 1]) < 0) {
241	93749					279206	return $k;
242							}
243							}
244	53750					161956	return (@$self + 1) / 2 - 1;
245							}
246
247							sub insert {
248	37980			37980		70190	my ($self, $v, $key, $value) = @_;
249	37980					4657606	splice @$self, $v * 2, 0, $value, $key;
250							}
251
252							sub split {
253	26988			26988		50552	my ($self, $n, $cmp, $key) = @_;
254
255							# find the node we're going to insert to; split that node; if it splits
256							# either incorporate the split in ourselves or split ourselves if we are
257							# full
258	26988					68110	my $v = $self->find($cmp, $key);
259	26988					77335	my $result = $self->[($v) * 2]->split($n, $cmp, $key);
260	26988	100	100			123107	if ((@$self + 1) / 2 == $n && defined $result) {
		100
261							# We're full and they split, we must split too. The way the split must
262							# be handled will depend upon whether this is a Left, Center, or Right
263							# split. That is, is the sub-split node pointer on the left side, the
264							# middle, or the right. But first, let's go ahead and split the node in
265							# half.
266							#
267							# The way a node can be split depends on the oddness of n. If n is odd
268							# (normal looking node split), then we split at index n-1 and give the
269							# new node n elements. If n is even, we split at index n and give the
270							# new node n-1 elements. The combinatorics of this solution are kind of
271							# interesting. In any case, we create the new node complete while
272							# leaving the current node with a missing end-pointer.
273	660					2525	my $new_node = Tree::BPTree::Node->new(
274							splice @$self,
275							$n - ($n % 2), # n - 1 for odd or n - 0 for even
276							$n - (($n + 1) % 2), # n - 0 for odd or n - 1 for even
277							);
278
279	660					2761	my $root_key;
280	660	100				1667	if ($v < $n / 2) {
		100
281							# This is a left split. We need to clip off the last key, insert the
282							# child's new root key and set the pointers on either side to the
283							# new root nodes. Finally, return a new root with clipped key
284							# pointing to us and the new node.
285	324					581	$root_key = pop @$self;
286	324					1746	my $i = $self->find($cmp, $result->[1]);
287	324					858	$self->insert($i, $result->[1], $result->[0]);
288	324					684	$self->[($i+1) * 2] = $result->[2];
289
290							} elsif ($v > $n / 2) {
291							# This is a right split. Same as left in reverse, basically. We do
292							# need to first shear of the first pointer to the new node and
293							# append it back onto as the last pointer of the first node first.
294	216					391	push @$self, shift @$new_node;
295	216					350	$root_key = shift @$new_node;
296	216					552	my $i = $new_node->find($cmp, $result->[1]);
297	216					1343	$new_node->[($i) * 2] = $result->[2];
298	216					536	$new_node->insert($i, $result->[1], $result->[0]);
299							} else {
300							# This is a center split. Here, we append to ourself a new pointer
301							# pointing to the new left node. We set the new node's first pointer
302							# to the new right node. And we set the new root key to the child's
303							# new root key.
304	120					225	push @$self, $result->[0];
305	120					205	$new_node->[0] = $result->[2];
306	120					216	$root_key = $result->[1];
307							}
308
309	660					1652	return Tree::BPTree::Node->new($self, $root_key, $new_node);
310							} elsif (defined $result) {
311							# We have room to accomodate their split, add the new nodes here.
312							# Regular insert will do this in the wrong order.
313							# $self->insert($v, $$result[-1]->first_leaf->[1], $$result[-1]);
314
315							# The new node will always be the last node, so we need to insert the
316							# key/pointer in reverse order from normal such that the key happens at
317							# $i and the value is at $i + 1
318	2496					6079	my $i = $self->find($cmp, $key);
319	2496					8081	splice @$self, $i * 2 + 1, 0, $$result[-1]->first_leaf->[1], $$result[-1];
320	2496					8374	return undef;
321							} else {
322							# They didn't split, so we don't have to either
323	23832					45930	return undef;
324							}
325							}
326
327							sub delete {
328	6234			6234		17642	my ($self, $n, $cmp, $key) = @_;
329
330							# Go to the bottom and drop the key from the leaf node
331	6234					22650	my $v = $self->find($cmp, $key);
332	6234					27025	my $result = $self->[($v) * 2]->delete($n, $cmp, $key);
333
334							# On our way back up, make the tree consistent; i.e., no empty leaves and no
335							# non-root nodes with less than n/2 values. If a key is deleted, but doesn't
336							# cause a coalesce or redistribute, we may keep that key in a branch node as
337							# a sort key, this shouldn't hurt us.
338	6234	100				39185	if ($self->[($v) * 2]->isa('Tree::BPTree::Leaf')) {
339							# Since this is a leaf, we only care if the leaf becomes empty. If it
340							# does, we remove the pointer to it from the current node and pass
341							# control upwards.
342	5226	100				22275	if ($result == 1) {
343							# The leaf is too small, so we need to delete it from our list. This
344							# may result in rebalancing further up the tree.
345							#
346							# NOTE: This operation will leave orphaned nodes in the linked leaf
347							# list. It is too hard to remove the orphans here. Instead, orphans
348							# should be removed by the iterators.
349	536	100				1925	if ($v == 0) {
350							# This node is the first index, so we delete it and the next key
351	316					875	splice @$self, 0, 2;
352							} else {
353							# This node is not first, so we delete it and the preceding key
354	220					1226	splice @$self, $v * 2 - 1, 2;
355							}
356							} # else no rebalancing will take place here on up
357							} else {
358							# As a branch, the child node must not have fewer than n/2 children. If
359							# it does, we need to try to coalesce it with a neighbor or redistribute
360							# the children from a neighbor to the small node.
361	1008	100				2883	if ($result <= $n / 2) {
362							# The branch is too small, we'll try to coalesce first
363	192	100	100			888	if ($v > 0 && ((@{$self->[($v - 1) * 2]} + 1) / 2) + ((@{$self->[($v ) * 2]} + 1) / 2) <= $n) {
	67	100	100			175
	67					436
	140					393
364							# We can coalesce the small node with it's left neighbor
365	28					107	$self->[($v-1) * 2]->coalesce($self->[($v) * 2]);
366
367							# The removed node (the small node) is not first, so we delete
368							# it and the preceding key
369	28					66	splice @$self, $v * 2 - 1, 2;
370	140					635	} elsif ($v < (((@$self + 1) / 2) - 1) && ((@{$self->[($v ) * 2]} + 1) / 2) + ((@{$self->[($v + 1) * 2]} + 1) / 2) <= $n) {
371							# We can coalesce the small node with it's right neighbor
372	82					328	$self->[($v) * 2]->coalesce($self->[($v+1) * 2]);
373
374							# The removed node (the right neighbor) is not first, so we
375							# delete it and the preceding key
376	82					234	splice @$self, ($v + 1) * 2 - 1, 2;
377							} else {
378							# We must redistribute, we pull the node from the left neighbor,
379							# if there is a left neighbor; otherwise, we'll pull the node
380							# from the right.
381	82	100				179	if ($v > 0) {
382	34					147	$self->[($v-1) * 2]->redistribute($self->[($v) * 2]);
383							} else {
384	48					249	$self->[($v) * 2]->redistribute($self->[($v+1) * 2]);
385							}
386
387							# Furthermore, we need to reset the key affected in this node to
388							# make sure that we don't lose sort order in the branches. (That
389							# is, we might have just moved a lower key right making this key
390							# too high or a higher key left making this key too low.
391							#
392							# We always use the latter pointer which is normally $v+1 or $v
393							# if it is already the last pointer.
394	82	100				214	if ($v > 0) {
395	34					88	$self->[($v - 1) * 2 + 1] = $self->[$v * 2]->first_leaf->[1];
396							} else {
397	48					152	$self->[($v) * 2 + 1] = $self->[($v + 1) * 2]->first_leaf->[1];
398							}
399							}
400							}
401							}
402
403							# Return the number of values remaining
404	6234					20727	return (@$self + 1) / 2;
405							}
406
407							sub coalesce {
408	110			110		181	my ($self, $that) = @_;
409	110					354	push @$self, $$that[0]->first_leaf->[1], @$that;
410	110					247	return $self;
411							}
412
413							sub redistribute {
414	82			82		126	my ($self, $that) = @_;
415
416							# Who's stealing nodes from whom? When deciding on the new index key to
417							# insert, we choose to use the first key of that, in either case, as it will
418							# always be higher than the last key of self. (The first key in that is
419							# always the key associated with the value being redistributed.)
420	82	100				282	if ((@$that + 1) / 2 < (@$self + 1) / 2) {
421							# Redistribute values from left to right
422	34					119	my @middle = splice @$self, -2, 2;
423	34					126	unshift @$that, $middle[-1], $$that[0]->first_leaf->[1];
424							} else {
425							# Redistribute values from right to left
426	48					180	my @middle = splice @$that, 0, 2;
427	48					194	push @$self, $middle[0]->first_leaf->[1], $middle[0];
428							}
429							}
430
431							sub reverse {
432	230			230		316	my ($self) = @_;
433
434							# Reverses the children, reverses the internal list, and then connects the
435							# linked-list pointer of the last_leaf of each subnode to the
436							# first_leaf of the following subnode. Finally, we need to change the
437							# index key.
438	230					1131	@$self = reverse @$self;
439	230					466	my $nvalues = (@$self + 1) / 2;
440	230					665	for (my $v = 0; $v < $nvalues; ++$v) {
441	766					1814	$self->[($v) * 2]->reverse;
442							}
443
444	230					377	my $nkeys = (@$self - 1) / 2;
445	230					584	for (my $k = 0; $k < $nkeys; ++$k) {
446							# Set the last pointer in the first node's last leaf to the first leaf
447	536					1322	$self->[($k) * 2 ]->last_leaf->[-1] = $self->[($k + 1) * 2]->first_leaf;
448
449							# Set the current key to the second node's first leaf's key
450	536					1188	$self->[($k) * 2 + 1] = $self->[($k + 1) * 2]->first_leaf->[1];
451							}
452							}
453
454							package Tree::BPTree::Leaf;
455
456	12			12		34725	use integer;
	12					35
	12					69
457
458							our @ISA = qw(Tree::BPTree::Node);
459
460							# Ordering in leaves is slightly different because we want to store the buckets
461							# for the node in the same pointer as the node when keys are equal. In branches,
462							# we want to find the value by the pointer after the node if the keys are
463							# equal.
464							sub find {
465	132624			132624		247852	my ($self, $cmp, $key) = @_;
466	132624					241721	my $nkeys = (@$self - 1) / 2;
467	132624					377277	for (my $k = 0; $k < $nkeys; $k++) {
468	1127612	100				2544324	if (&$cmp($key, $self->[($k) * 2 + 1]) <= 0) {
469	120180					1378162	return $k;
470							}
471							}
472	12444					43253	return (@$self + 1) / 2 - 1;
473							}
474
475							sub split {
476	37440			37440		67408	my ($self, $n) = @_;
477
478	37440	100				124230	if ((@$self + 1) / 2 == $n) {
479							# We're big enough, we must split in anticipation of an insert. See the
480							# comments in Tree::BPTree::split if you want to know more about why
481							# choosing where and how many nodes to splice looks so weird.
482	3216					22267	my $new_node = Tree::BPTree::Leaf->new(
483							splice @$self,
484							$n - ($n % 2), # n - 1 for odd or n - 0 for even
485							$n - (($n + 1) % 2), # n - 0 for odd or n - 1 for even
486							);
487	3216					21712	push @$self, $new_node;
488
489							# return new root, which is used or tossed depending on the needs of the
490							# caller
491	3216					9701	return Tree::BPTree::Node->new($self, $$new_node[1], $new_node);
492							} else {
493							# We're not too big, so we can take at least one more value
494	34224					84464	return undef;
495							}
496							}
497
498							sub delete {
499	6240			6240		12629	my ($self, $n, $cmp, $key) = @_;
500
501							# Find the node and delete it (we assume this node exists if we've been
502							# called!)
503	6240					26467	my $i = $self->find($cmp, $key);
504	6240					22649	splice @$self, $i * 2, 2;
505
506							# Return the number of values remaining
507	6240					22234	return (@$self + 1) / 2;
508							}
509
510							sub reverse {
511	632			632		797	my ($self) = @_;
512
513							# For leaves, we must before the reverse, then copy the value pointers
514							# backwards one position. We even reverse the buckets to create a completely
515							# symmetric reversal.
516	632					6067	@$self = reverse @$self;
517	632					1595	my $nvalues = (@$self + 1) / 2 - 1;
518	632					1457	for (my $v = 0; $v < $nvalues; ++$v) {
519	6240					6064	$self->[($v) * 2] = [ reverse @{ $self->[($v+1)*2] } ];
	6240					29646
520							}
521	632					2032	$$self[-1] = undef;
522							}
523
524							package Tree::BPTree;
525
526							=item $tree = Tree::BPTree->new(%args)
527
528							The constructor builds a new tree using the given arguments. All arguments are
529							optional and have defaults that should suit many applications. The arguments
530							include:
531
532							=over
533
534							=item -n
535
536							This sets the maximum number of pointers permitted in each node. Setting this
537							number very high will cause search operations to slow down as it will spend a
538							lot of time searching arrays incrementally--something like a binary search could
539							be used to speed these times a bit, but no such method is used at this time.
540							Setting this number very low will cause insert and delete operations to slow
541							down as they are required to split and coalesce more often. The default is the
542							minimum value of 3.
543
544							=item -unique
545
546							This determines whether keys are unique or not. If this is set, then an
547							exception will be raised whenever an insert is attempted for a key that already
548							exists in the tree.
549
550							=item -keycmp
551
552							This is a comparator function that takes two arguments and returns -1, 0, or 1
553							to indicate the result of the comparison. If the first argument is less than the
554							second, then -1 is returned. If the first argument is greater than the second,
555							then 1 is returned. If the arguments are equal, then 0 is returned. This
556							comparator should be appropriate for comparing keys. By default, the built-in
557							string comparator C is used. See L for details on C.
558
559							=item -valuecmp
560
561							This is a comparator function that takes two arguments and returns -1, 0, or 1
562							to indicate the result of the comparison--just like the "-keycmp" argument. This
563							comparator should be appropriate for comparing values. By default, the built-in
564							numeric comparator C=E> is used. See L for details on
565							C=E>.
566
567							=back
568
569							The tree created by this constructor is always initially empty.
570
571							=cut
572
573							sub new {
574	577			577	1	1890060	my ($class, %args) = @_;
575
576	577	100				3123	$args{-n} = 3 unless defined $args{-n};
577	577	50		1474347		4547	$args{-keycmp} = sub { $_[0] cmp $_[1] } unless defined $args{-keycmp};
	1474347					6330297
578	577	50		11232		3962	$args{-valuecmp} = sub { $_[0] <=> $_[1] } unless defined $args{-valuecmp};
	11232					50733
579	577	50				3144	$args{-unique} = 0 unless defined $args{-unique};
580	577					3036	$args{-root} = Tree::BPTree::Leaf->new;
581
582							# This cursor is special as it doesn't have a link back to self. It will not
583							# be released to the user to call methods on directly anyway. Having the
584							# link back to self would cause a memory leak.
585	577					2702	$args{-cursor} = bless {}, 'Tree::BPTree::Cursor';
586
587	577	50				2173	croak "Illegal value for n $args{-n}. It must be greater than or equal to 3."
588							if $args{-n} < 3;
589
590	577		33			4043	return bless \%args, ref $class \|\| $class;
591							}
592
593							sub _find_leaf {
594	126384			126384		253771	my ($self, $key) = @_;
595
596	126384					264634	my $cmp = $$self{-keycmp};
597	126384					240593	my $current = $$self{-root};
598	126384		66			1222656	while (defined $current and not $current->isa('Tree::BPTree::Leaf')) {
599	111241					325593	my $v = $current->find($cmp, $key);
600	111241					1019244	$current = $current->[$v * 2];
601							}
602
603	126384					361829	return $current;
604							}
605
606							=item $value = $tree->find($key)
607
608							=item @values = $tree->find($key)
609
610							This method attempts to find the value or values in the bucket matching C<$key>.
611							If no such C<$key> has been stored in the tree, then C is returned. If
612							the C<$key> is found, then either the first value stored in the bucket is
613							returned (in scalar context) or all values stored are returned (in list
614							context). Using scalar context is useful when the tree stores unique keys where
615							there will never be more than one value per key.
616
617							=cut
618
619							sub find {
620	20928			20928	1	67489195	my ($self, $key) = @_;
621
622	20928					53614	my $cmp = $$self{-keycmp};
623	20928					50320	my $leaf = $self->_find_leaf($key);
624	20928					60268	my $v = $leaf->find($cmp, $key);
625	20928	50				56747	if (&$cmp($leaf->[($v) * 2 + 1], $key) == 0) {
626	20928	100				47131	return wantarray ? @{ $leaf->[($v) * 2] } : ${ $leaf->[($v) * 2] }[0];
	10464					59614
	10464					46982
627							} else {
628	0					0	return undef;
629							}
630							}
631
632							=item $tree->insert($key, $value)
633
634							This method inserts the key/value pair given into the tree. If the tree requires
635							unique keys, an exception will be thrown if C<$key> is already stored.
636
637							=cut
638
639							sub insert {
640	62784			62784	1	298189	my ($self, $key, $value) = @_;
641	62784					189353	my $n = $$self{-n};
642	62784					125441	my $cmp = $$self{-keycmp};
643
644							# In the case of insert, we have three steps:
645							# 1. See if the key already exists. If so, add the value to the bucket
646							# there (or die if keys are unique). Otherwise, go to step 2.
647							# 2. Tell the tree to split if it is full along the path to where the new
648							# key will be placed.
649							# 3. Find the leaf and insert the key/value pair there.
650
651							# First, see if the value is already there
652	62784					133242	my $leaf = $self->_find_leaf($key);
653	62784					190283	my $k = $leaf->find($cmp, $key);
654	62784	100	100			276880	if (defined $leaf->[($k) * 2 + 1] && &$cmp($leaf->[($k) * 2 + 1], $key) == 0) {
655	25344	50				75979	croak "Unique key violation." if $$self{-unique};
656	25344					30057	push @{ $leaf->[($k) * 2] }, $value;
	25344					4476742
657	25344					74755	return;
658							}
659
660							# Then, tell the tree to split straight down if it will need to
661	37440					123622	my $new_root = $$self{-root}->split($n, $cmp, $key);
662	37440	100				113036	$$self{-root} = $new_root if defined $new_root;
663
664							# Next, insert the new value (we need a new leaf in case a split occurred)
665	37440					113776	$leaf = $self->_find_leaf($key);
666	37440					95364	$leaf->insert($leaf->find($cmp, $key), $key, [ $value ]);
667							}
668
669							=item $tree->delete($key, $value)
670
671							This method removes the key/value pair given from the tree. If the pair cannot
672							be found, then the tree is not changed. If C<$value> is stored multiple times at
673							C<$key>, then all values matching C<$value> will be removed.
674							=cut
675
676							sub delete {
677	5232			5232	1	9814629	my ($self, $key, $value) = @_;
678	5232					23119	my $cmp = $$self{-keycmp};
679	5232					15321	my $valcmp = $$self{-valuecmp};
680
681							# In the case of delete, we have two steps:
682							# 1. Find the leaf containing the key.
683							# a. If no matching key is found in the leaf where it should be, quit.
684							# b. If the bucket for the key found contains multiple values, remove
685							# one and quit.
686							# c. Otherwise, continue to step 2.
687							# 2. Starting at the top, tell the tree to delete the node.
688							# a. The tree will then prune off any leaves that become empty.
689							# b. The tree will prune of branches that aren't needed. This may
690							# result in branches with less than n/2 nodes, so we will need to
691							# rebalance the tree.
692							# c. The tree will perform rebalancing on it's way back up from the
693							# leaf. It will attempt to coalesce where needed and possible and
694							# redistribute if needed and coalesce won't work.
695
696							# First, find the leaf containing the key
697	5232					20114	my $leaf = $self->_find_leaf($key);
698	5232					28728	my $i = $leaf->find($cmp, $key);
699	5232	50	33			64563	if (defined $leaf->[($i) * 2 + 1] && &$cmp($leaf->[($i) * 2 + 1], $key) == 0) {
700	5232	100				7309	if (scalar(@{ $leaf->[($i) * 2] }) > 1) {
	5232	50				61201
	3120					8483
701	2112					9674	my $bucket = $leaf->[($i) * 2];
702	2112					13000	@$bucket = grep { &$valcmp($value, $_) != 0 } @$bucket;
	8112					20026
703
704							# If the bucket has more elements, we quit here. Otherwise, we need
705							# to remove the node.
706	2112	50				12791	return if @$bucket > 0;
707	3120					8326	} elsif (!grep { &$valcmp($value, $_) == 0 } @{ $leaf->[($i) * 2] }) {
708							# no match for value, let's quit
709	0					0	return;
710							}
711							} else {
712							# no match for key, let's quit
713	0					0	return;
714							}
715
716							# Then, since we're still here, we know there is a key/value match that
717							# we intend to remove. Since this removal will empty a bucket, we need to
718							# bring out the big guns. Tell the tree to take care of it and it will take
719							# care of coalescing and redistributing nodes.
720	3120					21737	my $values = $$self{-root}->delete($$self{-n}, $cmp, $key);
721
722							# if the tree contains only a single value and is a branch, then the tree is
723							# one level shallower than before the delete
724	3120	100	100			47090	$$self{-root} = $$self{-root}->[0]
725							if not $$self{-root}->isa('Tree::BPTree::Leaf') and $values == 1;
726							}
727
728							=item $tree->reverse
729
730							Reverse the sort order. This is done by reversing every key in the tree,
731							adjusting the linked leaf list, and replacing the "-keycmp" method with a new
732							one that simply negates the old one. If this method is called again, the same
733							node reversal will happen, but the original "-keycmp" will be reinstated rather
734							than doing a double negation.
735
736							=cut
737
738							sub reverse {
739	96			96	1	1311	my ($self) = @_;
740	96					452	$$self{-root}->reverse;
741	96	50				444	if (defined $$self{-reverse_keycmp}) {
742	0					0	$$self{-keycmp} = delete $$self{-reverse_keycmp};
743							} else {
744	96					461	$$self{-reverse_keycmp} = $$self{-keycmp};
745	96					206	my $cmp = $$self{-keycmp};
746	96			147458		874	$$self{-keycmp} = sub { -( &$cmp(@_) ) };
	147458					280781
747							}
748							}
749
750							=item $cursor = $tree->new_cursor
751
752							This method allows you to have multiple, simultaneous iterators through the
753							same index. If you pass the C<$cursor> value returned from C to
754							C, it will be used instead of the default internal cursor. That is,
755
756							my $c1 = $tree->new_cursor;
757							my $c2 = $tree->new_cursor;
758							while (my ($key, $values) = $tree->each($c1)) {
759							# let's go through $c1 twice as fast
760							my ($nextkey, $nextvalue) = $tree->each($c1);
761
762							# next is an alias for each
763							my ($otherkey, $othervalue) = $tree->next($c2);
764							}
765
766							# and we can reset $c2 after we're done too
767							$tree->reset($c2);
768
769							Cursors also have their own methods, so this same snippet could have been
770							written like this instead:
771
772							my $c1 = $tree->new_cursor;
773							my $c2 = $tree->new_cursor;
774							while (my ($key, $value) = $c1->each) {
775							# let's go through $c1 twice as fast
776							my ($nextkey, $nextvalue) = $c1->each;
777
778							# next is an alias for each
779							my ($otherkey, $othervalue) = $c2->each;
780							}
781
782							# and we can reset $c2 after we're done too
783							$c2->reset;
784
785							There are additional features provided with cursors that are not provided when
786							using the internal cursor. You may delete the last key/values pair returned by a
787							call to C/C by calling C on the cursor. Or, you may specify
788							a specific value in the bucket to be deleted. For example:
789
790							my $cursor = $tree->new_cursor;
791							while (my ($key, $value) = $cursor->next) {
792							# In this example, the keys are objects with a is_bad method. If "bad" is
793							# set, we want to remove the corresponding values.
794							if ($key->is_bad) {
795							$cursor->delete;
796							}
797							}
798
799							This form of delete is completely safe and will not cause the iterator to slip
800							off track as a similar operation might mess up array iteration if one isn't
801							careful.
802
803							Another feature of cursors, is that you may retrieve the previously returned
804							value by calling the C method. This will return the same result as the
805							last call to C or C. That is, unless C has been called or
806							C removed the previously returned key, then this will return an empty
807							list.
808
809							For example:
810
811							# This assumes you use the typical string keys with numeric values
812							$cursor = $tree->new_cursor;
813							while (my ($key, $value) = $cursor->next) {
814							my ($currkey, $currval) = $cursor->current;
815							die unless $key eq $currkey and $value == $currval
816							}
817
818							This example shouldn't die.
819
820							=cut
821
822							package Tree::BPTree::Cursor;
823
824							# These keep the real work in Tree::BPTree
825							sub each {
826	15744			15744		26702	my ($self) = @_;
827	15744					66141	$$self{-tree}->each($self);
828							}
829
830							sub next {
831	5280			5280		64237867	my ($self) = @_;
832	5280					39185	$$self{-tree}->each($self);
833							}
834
835							sub current {
836	5232			5232		26127	my ($self) = @_;
837	5232	50				17221	return () unless defined $$self{-last};
838							return (
839	5232					57940	$$self{-last}{-node}->[($$self{-last}{-index}) + 1],
840							$$self{-last}{-node}->[($$self{-last}{-index})][($$self{-last}{-value})],
841							);
842
843							}
844
845							sub reset {
846	5232			5232		16136840	my ($self) = @_;
847	5232					22072	$$self{-tree}->reset($self);
848							}
849
850							sub delete {
851	5232			5232		32354	my ($self) = @_;
852
853	5232	50				26994	Carp::croak "No node to delete. This has occurred because a delete was attempted before iteration started or delete was attempted twice on the same node."
854							unless defined $$self{-last};
855
856							# We must be careful as removing the node might throw off $$self{-index} if
857							# $$self{-node} == $$self{-last}{-node}. In the case that we remove the node
858							# altogether and $$self{-node} == $$self{-last}{-node}, we must decrement
859							# $$self{-index} by 2 to keep it from skipping a node or falling off the end
860							# of the node.
861	5232					68064	my $cmp = $$self{-tree}{-keycmp};
862	5232					27948	my $valcmp = $$self{-tree}{-valuecmp};
863
864	5232					13339	my $leaf = $$self{-last}{-node};
865	5232					13196	my $i = $$self{-last}{-index};
866	5232					12276	my $value = $$self{-last}{-value};
867	5232	100				6428	if (@{ $leaf->[$i] } > 1) {
	5232					19873
868							# The bucket contains more than one value. Drop the current index, keep
869							# us from calling delete again and quit.
870	2112					3769	my $bucket = $leaf->[$i];
871	2112					5551	splice @$bucket, $value, 1;
872
873							# If this node and the last node are equivalent, we need to decrement
874							# the current value to keep us from skipping nodes are falling of the
875							# end of the bucket
876	2112	50	33			18901	--$$self{-value} if defined $$self{-node} and $$self{-last}{-node} == $$self{-node};
877
878	2112					9230	delete $$self{-last};
879	2112					6979	return;
880							} # Otherwise, this value is the last in the node and we drop it entirely
881
882							# We're still here, so the $value is the only remaining value
883	3120					23717	my $values = $$self{-tree}{-root}->delete($$self{-tree}{-n}, $cmp, $leaf->[$i + 1]);
884
885							# if the tree contains only a single value and is a branch, then the tree is
886							# one level shallower than before the delete
887	3120	100	100			49250	$$self{-tree}{-root} = $$self{-tree}{-root}->[0]
888							if not $$self{-tree}{-root}->isa('Tree::BPTree::Leaf') and $values == 1;
889
890							# If this node and the last node are equivalent, we need to decrement the
891							# current index to keep the cursor going in the correct place.
892	3120	100	100			30897	$$self{-index} -= 2 if defined $$self{-node} and $$self{-last}{-node} == $$self{-node};
893
894							# We can't delete again since we've just annihilated the key
895	3120					15452	delete $$self{-last};
896							}
897
898							package Tree::BPTree;
899
900							sub new_cursor {
901	144			144	1	3029	my ($self) = @_;
902	144					870	return bless { -tree => $self }, 'Tree::BPTree::Cursor';
903							}
904
905							=item ($key, $value) = $tree->each [ ($cursor) ]
906
907							This method provides a similar facility as that of the C operator. Each
908							call will iterate through each key/value pair in sort order. After the last
909							key/value pair has been returned, C will be returned once before starting
910							again. This is useful for using within C loops:
911
912							while (my ($key, $value) = $tree->each) {
913							# do stuff
914							}
915
916							=cut
917
918							sub each {
919	797185			797185	1	14165106	my ($self, $cursor) = @_;
920	797185	100				2679932	$cursor = $$self{-cursor} unless defined $cursor;
921
922							# This method operates on a cursor in three states:
923							# 1. Fresh. $$cursor{-index} is undefined to show that we are in a fresh
924							# state and should return the very first index.
925							# 2. Iterating. $$cursor{-index} and $$cursor{-node} are defined to show
926							# that we are somewhere in the middle of the list.
927							# 3. Dead. $$cursor{-node} is undefined to show that we have reached the
928							# last node. At this point () should be returned and then
929							# $$cursor{-index} deleted to return us to Fresh state.
930							#
931							# It is possible to move directly from Fresh to Dead in one call by checking
932							# the size of $$cursor{-node}. If $$cursor{-node}->nvalues == 1, then the
933							# very first node is empty, so we immediately return that we are Dead and
934							# return to a Fresh state.
935
936							# If the cursor is empty, then they haven't ran each yet (or the last run
937							# has concluded). Set a new iteration run up.
938	797185	100				2422408	unless (defined $$cursor{-index}) {
939	17617					75172	$$cursor{-node} = $$self{-root}->first_leaf;
940	17617					50637	$$cursor{-index} = 0;
941	17617					81169	$$cursor{-value} = 0;
942							}
943
944	797185	100	100			2786347	if (defined $$cursor{-node} and @{$$cursor{-node}} > 1) {
	784897					3448033
945							# The last run didn't detect the end of the list, so give them the next
946							# value
947	784800					4520682	my @next = (
948							$$cursor{-node}->[($$cursor{-index}) + 1],
949							$$cursor{-node}->[($$cursor{-index})][($$cursor{-value})],
950							);
951
952							# Remember this position, in case we want to delete it
953	784800					2245492	$$cursor{-last}{-node} = $$cursor{-node};
954	784800					1965016	$$cursor{-last}{-index} = $$cursor{-index};
955	784800					1938440	$$cursor{-last}{-value} = $$cursor{-value};
956
957							# Increment the value point first
958	784800	100				1401404	if ($$cursor{-value} == $#{$$cursor{-node}[$$cursor{-index}]}) {
	784800					2846227
959							# In this case, we're at the end, so we need to increment in the
960							# index and return this to the first value of the next bucket
961	492576					919154	$$cursor{-value} = 0;
962
963	492576	100				956184	if ($$cursor{-index} + 2 == $#{$$cursor{-node}}) {
	492576					9653863
964							# We've reached the end of a node, move to the next
965	61545					189137	my $next_node = $$cursor{-node}->[$$cursor{-index} + 2];
966
967							# Check for orphaned nodes and remove them
968	61545		100			356736	while (defined $next_node and @$next_node == 1) {
969	257					927	$next_node = $next_node->[0];
970							}
971	61545					176170	$$cursor{-node}->[$$cursor{-index} + 2] = $next_node;
972
973							# Move to the next node
974	61545					111339	$$cursor{-node} = $next_node;
975	61545					148333	$$cursor{-index} = 0;
976							} else {
977							# We've still got more key/value pairs to read in this node
978	431031					803663	$$cursor{-index} += 2;
979							}
980
981	492576					2650250	return @next;
982							} else {
983							# We've still got more values, so we need to get ready for the next
984	292224					525109	++$$cursor{-value};
985	292224					9876549	return @next;
986							}
987							} else {
988							# The last run reached the end of the list, so delete the -index element
989							# so we can start anew and return undef once, just like the each
990							# operator.
991	12385					36970	delete $$cursor{-index};
992
993							# Also clear the last pointers so we can't call delete on the cursor
994							# until we've called each at least once.
995	12385					39831	delete $$cursor{-last};
996
997	12385					54177	return ();
998							}
999							}
1000
1001							=item $tree->reset [ ($cursor) ]
1002
1003							Reset the given cursor to a fresh state--that is, ready to return the first
1004							value on the next call to C. If no C<$cursor> is given, then the default
1005							internal cursor is reset.
1006
1007							=cut
1008
1009							sub reset {
1010	5232			5232	1	8571	my ($self, $cursor) = @_;
1011	5232	50				19828	$cursor = $$self{-cursor} unless defined $cursor;
1012	5232					47645	delete $$cursor{-index};
1013							}
1014
1015							=item $tree->iterate(\&iter)
1016
1017							For each key/value pair in the database, the function C<&iter> will be called
1018							with the key as the first argument and value as the second. Iteration will occur
1019							in sort order.
1020
1021							=cut
1022
1023							sub iterate {
1024	48			48	1	1109	my ($self, $iter) = @_;
1025
1026	48					205	while (my ($k, $v) = $self->each) {
1027	5232					13472	&$iter($k, $v);
1028							}
1029							}
1030
1031							=item @results = $tree->map(\&mapper)
1032
1033							Nearly identical to C, this method captures the return values of each
1034							call and then returns all the results as a list. The C<&mapper> function takes
1035							the same arguments as in C.
1036
1037							=cut
1038
1039							sub map {
1040	48			48	1	999	my ($self, $mapper) = @_;
1041
1042	48					104	my @result;
1043	48					212	while (my ($k, $v) = $self->each) {
1044	5232					11571	push @result, &$mapper($k, $v);
1045							}
1046
1047	48					2388	return @result;
1048							}
1049
1050							=item @pairs = $tree->grep(\&pred)
1051
1052							=item @keys = $tree->grep_keys(\&pred)
1053
1054							=item @values = $tree->grep_values(\&pred)
1055
1056							Iterates through all key/value pairs in sort order. For each key/value pair, the
1057							function C<&pred> will be called by passing the key as the first argument and
1058							the value as the second. If C<&pred> returns a true value, then the matched
1059							value will be added to the returned list.
1060
1061							C returns a list of pairs such that each element is a two-element array
1062							reference where the first element is they key and the second is the value.
1063
1064							C returns a list of keys.
1065
1066							C returns a list of values.
1067
1068							=cut
1069
1070							sub grep {
1071	480			480	1	5212386	my ($self, $pred) = @_;
1072
1073	480					1537	my @result;
1074	480					1685	while (my ($k, $v) = $self->each) {
1075	52320	100				137516	push @result, [ $k, $v ] if &$pred($k, $v);
1076							}
1077
1078	480					9276	return @result;
1079							}
1080
1081							sub grep_keys {
1082	480			480	1	1311263	my ($self, $pred) = @_;
1083
1084	480					769	my @result;
1085	480					1554	while (my ($k, $v) = $self->each) {
1086	52320	100				151283	push @result, $k if &$pred($k, $v);
1087							}
1088
1089	480					4677	return @result;
1090							}
1091
1092							sub grep_values {
1093	480			480	1	806689	my ($self, $pred) = @_;
1094
1095	480					858	my @result;
1096	480					1533	while (my ($k, $v) = $self->each) {
1097	52320	100				132491	push @result, $v if &$pred($k, $v);
1098							}
1099
1100	480					3933	return @result;
1101							}
1102
1103							=item @pairs = $tree->pairs
1104
1105							=item @keys = $tree->keys
1106
1107							=item @values = $tree->values
1108
1109							Returns all elements of the given type.
1110
1111							C returns all key/value pairs stored in the tree. Each pair is returned
1112							as an array reference contain two elements. The first element is the key. The
1113							second element is a bucket, which is an array-reference of stored values.
1114
1115							C returns all keys stored in the tree.
1116
1117							C returns all values stored in the tree.
1118
1119							=cut
1120
1121							sub pairs {
1122	5280			5280	1	35434	my ($self) = @_;
1123
1124	5280					15984	my @pairs;
1125	5280					21207	while (my ($k, $v) = $self->each) {
1126	287760					1201051	push @pairs, [ $k, $v ];
1127							}
1128
1129	5280					2308474	return @pairs;
1130							}
1131
1132							sub keys {
1133	48			48	1	118	my ($self) = @_;
1134
1135	48					86	my @keys;
1136	48					229	while (my ($k, $v) = $self->each) {
1137	5232					16601	push @keys, $k;
1138							}
1139
1140	48					2750	return @keys;
1141							}
1142
1143							sub values {
1144	5376			5376	1	24636	my ($self) = @_;
1145
1146	5376					39777	my @values;
1147	5376					17160	while (my ($k, $v) = $self->each) {
1148	298224					1142670	push @values, $v;
1149							}
1150
1151	5376					169301	return @values;
1152							}
1153
1154							=item $tree->clear
1155
1156							This method empties the tree of all values. This basically creates a new tree
1157							and allows the old tree to be garbage collected at the interpreter's leisure.
1158
1159							=cut
1160
1161							sub clear {
1162	0			0	1		my ($self) = @_;
1163	0						$$self{-root} = Tree::BPTree::Leaf->new;
1164							}
1165
1166							=back
1167
1168							=head1 CREDITS
1169
1170							The basis for B+ trees implemented here can be found in I
1171							Concepts>, 4th ed. by Silbershatz et al. published by McGraw-Hill. I have
1172							somewhat modified the structure specified there to make the code easier to read
1173							and to adapt the code to Perl.
1174
1175							In addition, while preparing to write this module I also consulted an old book
1176							of mine, I by Robert Sedgewick (Addison Wesley), for more
1177							general information on trees. I also used some ideas on how and when to perform
1178							split, coalesce, and redistribute as the Silbershatz pseudo-code is a little
1179							obfuscated--or at least, the different operations are presented monolithically
1180							so that it's difficult to digest. The sections in Sedgewick on 2-3-4 and
1181							Red-Black trees were especially helpful.
1182
1183							=head1 BUGS
1184
1185							This module is pretty slow. Better performance is possible, especially for small
1186							bodies of data, if you use a hash to do most of these operations. See
1187							F for a sample of the performance issues. There you can also find
1188							code for performing essentially the same thing using different data structures.
1189
1190							On my machine, a small benchmark showed the following:
1191
1192							Insert into B+ Trees (this implementation) is:
1193							61 times slower than hash insert and
1194							3.9 times slower than ordered list insert.
1195
1196							Ordered iteration of B+ Trees is:
1197							1.6 times slower than ordering a hash and then iterating the pairs and
1198							14 times slower than iterating through an ordered list.
1199
1200							Finding a key in B+ Trees is:
1201							34 times slower than hash fetch but
1202							1.2 times faster than searching an ordered list (with grep, which probably
1203							isn't the fastest solution, a manual binary search should be better).
1204
1205							I'm still putting together more benchmarks and looking into places where
1206							improvement is possible. Iteration of this structure should scale better than
1207							taking a hash and ordering the keys to iterate through.
1208
1209							I have made some recent headway by removing some simple functions and replacing
1210							them with raw computation. If I did this the way I'd really like to, I need to
1211							find or build a L module to perform something similar to a C
1212							C<#define> or C++ C function. However, instead I just did a search and
1213							replace with Vim.
1214
1215							I should probably port this to XS to make it really compete with built-in
1216							hashes.
1217
1218							=head1 AUTHOR
1219
1220							Andrew Sterling Hanenkamp, Ehanenkamp@users.sourceforge.netE
1221
1222							=head1 COPYRIGHT AND LICENSE
1223
1224							Copyright 2003 by Andrew Sterling Hanenkamp
1225
1226							This library is free software; you can redistribute it and/or modify
1227							it under the same terms as Perl itself.
1228
1229							=cut
1230
1231							1