File Coverage

lib/CPANPLUS/YACSmoke/ReAssemble.pm

Criterion	Covered	Total	%
statement	130	1152	11.2
branch	51	796	6.4
condition	15	177	8.4
subroutine	21	97	21.6
pod	49	49	100.0
total	266	2271	11.7

line	stmt	bran	cond	sub	pod	time	code
1							# Regexp::Assemple.pm
2							#
3							# Copyright (c) 2004-2008 David Landgren
4							# All rights reserved
5
6							package CPANPLUS::YACSmoke::ReAssemble;
7							$CPANPLUS::YACSmoke::ReAssemble::VERSION = '1.04';
8	10			10		65	use vars qw/$have_Storable $Current_Lexer $Default_Lexer $Single_Char $Always_Fail/;
	10					20
	10					742
9
10							=head1 NAME
11
12							Regexp::Assemble - Assemble multiple Regular Expressions into a single RE
13
14							=head1 SYNOPSIS
15
16							use Regexp::Assemble;
17
18							my $ra = Regexp::Assemble->new;
19							$ra->add( 'ab+c' );
20							$ra->add( 'ab+-' );
21							$ra->add( 'a\w\d+' );
22							$ra->add( 'a\d+' );
23							print $ra->re; # prints a(?:\w?\d+\|b+[-c])
24
25							=head1 DESCRIPTION
26
27							Regexp::Assemble takes an arbitrary number of regular expressions
28							and assembles them into a single regular expression (or RE) that
29							matches all that the individual REs match.
30
31							As a result, instead of having a large list of expressions to loop
32							over, a target string only needs to be tested against one expression.
33							This is interesting when you have several thousand patterns to deal
34							with. Serious effort is made to produce the smallest pattern possible.
35
36							It is also possible to track the original patterns, so that you can
37							determine which, among the source patterns that form the assembled
38							pattern, was the one that caused the match to occur.
39
40							You should realise that large numbers of alternations are processed
41							in perl's regular expression engine in O(n) time, not O(1). If you
42							are still having performance problems, you should look at using a
43							trie. Note that Perl's own regular expression engine will implement
44							trie optimisations in perl 5.10 (they are already available in
45							perl 5.9.3 if you want to try them out). C will
46							do the right thing when it knows it's running on a a trie'd perl.
47							(At least in some version after this one).
48
49							Some more examples of usage appear in the accompanying README. If
50							that file isn't easy to access locally, you can find it on a web
51							repository such as
52							L or
53							L.
54
55							=cut
56
57	10			10		73	use strict;
	10					19
	10					291
58
59	10			10		53	use constant DEBUG_ADD => 1;
	10					17
	10					790
60	10			10		59	use constant DEBUG_TAIL => 2;
	10					26
	10					531
61	10			10		59	use constant DEBUG_LEX => 4;
	10					20
	10					488
62	10			10		75	use constant DEBUG_TIME => 8;
	10					20
	10					43383
63
64							# The following patterns were generated with eg/naive
65							$Default_Lexer = qr/(?![[(\\]).(?:[+?]\??\|\{\d+(?:,\d)?\}\??)?\|\\(?:[bABCEGLQUXZ]\|[lu].\|(?:[^\w]\|[aefnrtdDwWsS]\|c.\|0\d{2}\|x(?:[\da-fA-F]{2}\|{[\da-fA-F]{4}})\|N\{\w+\}\|[Pp](?:\{\w+\}\|.))(?:[+?]\??\|\{\d+(?:,\d)?\}\??)?)\|\[.*?(?
66
67							$Single_Char = qr/^(?:\\(?:[aefnrtdDwWsS]\|c.\|[^\w\/{\|}-]\|0\d{2}\|x(?:[\da-fA-F]{2}\|{[\da-fA-F]{4}}))\|[^\$^])$/;
68
69							# the pattern to return when nothing has been added (and thus not match anything)
70							$Always_Fail = "^\\b\0";
71
72							=head1 METHODS
73
74							=over 8
75
76							=item new
77
78							Creates a new C object. The following optional
79							key/value parameters may be employed. All keys have a corresponding
80							method that can be used to change the behaviour later on. As a
81							general rule, especially if you're just starting out, you don't
82							have to bother with any of these.
83
84							B, a family of optional attributes that allow anchors
85							(C<^>, C<\b>, C<\Z>...) to be added to the resulting pattern.
86
87							B, sets the C flags to add to the assembled regular
88							expression. Warning: no error checking is done, you should ensure
89							that the flags you pass are understood by the version of Perl you
90							are using. B exists as an alias, for users familiar
91							with L.
92
93							B, controls whether the pattern should be chomped before being
94							lexed. Handy if you are reading patterns from a file. By default,
95							Cing is performed (this behaviour changed as of version 0.24,
96							prior versions did not chomp automatically).
97							See also the C attribute and the C method.
98
99							B, slurp the contents of the specified file and add them
100							to the assembly. Multiple files may be processed by using a list.
101
102							my $r = Regexp::Assemble->new(file => 're.list');
103
104							my $r = Regexp::Assemble->new(file => ['re.1', 're.2']);
105
106							If you really don't want chomping to occur, you will have to set
107							the C attribute to 0 (zero). You may also want to look at
108							the C attribute, as well.
109
110							B, controls what constitutes a record
111							separator when using the C attribute or the C
112							method. May be abbreviated to B. See the C<$/> variable in
113							L.
114
115							B, controls whether the pattern should contain zero-width
116							lookahead assertions (For instance: (?=[abc])(?:bob\|alice\|charles).
117							This is not activated by default, because in many circumstances the
118							cost of processing the assertion itself outweighs the benefit of
119							its faculty for short-circuiting a match that will fail. This is
120							sensitive to the probability of a match succeeding, so if you're
121							worried about performance you'll have to benchmark a sample population
122							of targets to see which way the benefits lie.
123
124							B, controls whether you want know which of the initial
125							patterns was the one that matched. See the C method for
126							more details. Note for version 5.8 of Perl and below, in this mode
127							of operation YOU SHOULD BE AWARE OF THE SECURITY IMPLICATIONS that
128							this entails. Perl 5.10 does not suffer from any such restriction.
129
130							B, the number of spaces used to indent nested grouping of
131							a pattern. Use this to produce a pretty-printed pattern. See the
132							C method for a more detailed explanation.
133
134							B, allows you to add a callback to enable sanity checks
135							on the pattern being loaded. This callback is triggered before the
136							pattern is split apart by the lexer. In other words, it operates
137							on the entire pattern. If you are loading patterns from a file,
138							this would be an appropriate place to remove comments.
139
140							B, allows you to add a callback to enable sanity checks on
141							the pattern being loaded. This callback is triggered after the
142							pattern has been split apart by the lexer.
143
144							B, controls whether to unroll, for example, C into
145							C, C, which may allow additional reductions in the
146							resulting assembled pattern.
147
148							B, controls whether tail reduction occurs or not. If set,
149							patterns like C will be reduced to C.
150							That is, the end of the pattern in each part of the b... and d...
151							alternations is identical, and hence is hoisted out of the alternation
152							and placed after it. On by default. Turn it off if you're really
153							pressed for short assembly times.
154
155							B, specifies the pattern used to lex the input lines into
156							tokens. You could replace the default pattern by a more sophisticated
157							version that matches arbitrarily nested parentheses, for example.
158
159							B, controls whether copious amounts of output is produced
160							during the loading stage or the reducing stage of assembly.
161
162							my $ra = Regexp::Assemble->new;
163							my $rb = Regexp::Assemble->new( chomp => 1, debug => 3 );
164
165							B, controls whether new patterns can be added to the object
166							after the assembled pattern is generated. DEPRECATED.
167
168							This method/attribute will be removed in a future release. It doesn't
169							really serve any purpose, and may be more effectively replaced by
170							cloning an existing C object and spinning out a
171							pattern from that instead.
172
173							A more detailed explanation of these attributes follows.
174
175							=cut
176
177							sub new {
178	4			4	1	10	my $class = shift;
179	4					13	my %args = @_;
180
181	4					8	my $anc;
182	4					19	for $anc (qw(word line string)) {
183	12	50				36	if (exists $args{"anchor_$anc"}) {
184	0					0	my $val = delete $args{"anchor_$anc"};
185	0					0	for my $anchor ("anchor_${anc}_begin", "anchor_${anc}_end") {
186	0	0				0	$args{$anchor} = $val unless exists $args{$anchor};
187							}
188							}
189							}
190
191							# anchor_string_absolute sets anchor_string_begin and anchor_string_end_absolute
192	4	50				14	if (exists $args{anchor_string_absolute}) {
193	0					0	my $val = delete $args{anchor_string_absolute};
194	0					0	for my $anchor (qw(anchor_string_begin anchor_string_end_absolute)) {
195	0	0				0	$args{$anchor} = $val unless exists $args{$anchor};
196							}
197							}
198
199	4		50			156	exists $args{$_} or $args{$_} = 0 for qw(
200							anchor_word_begin
201							anchor_word_end
202							anchor_line_begin
203							anchor_line_end
204							anchor_string_begin
205							anchor_string_end
206							anchor_string_end_absolute
207							debug
208							dup_warn
209							indent
210							lookahead
211							mutable
212							track
213							unroll_plus
214							);
215
216	4		50			64	exists $args{$_} or $args{$_} = 1 for qw(
217							fold_meta_pairs
218							reduce
219							chomp
220							);
221
222	4					40	@args{qw(re str path)} = (undef, undef, []);
223
224	4		50			82	$args{flags} \|\|= delete $args{modifiers} \|\| '';
			33
225	4	50				14	$args{lex} = $Current_Lexer if defined $Current_Lexer;
226
227	4					19	my $self = bless \%args, $class;
228
229	4	50				41	if ($self->_debug(DEBUG_TIME)) {
230	0					0	$self->_init_time_func();
231	0					0	$self->{_begin_time} = $self->{_time_func}->();
232							}
233							$self->{input_record_separator} = delete $self->{rs}
234	4	50				13	if exists $self->{rs};
235	4	50				15	exists $self->{file} and $self->add_file($self->{file});
236
237	4					14	return $self;
238							}
239
240							sub _init_time_func {
241	0			0		0	my $self = shift;
242	0	0				0	return if exists $self->{_time_func};
243
244							# attempt to improve accuracy
245	0	0				0	if (!defined($self->{_use_time_hires})) {
246	0					0	eval {require Time::HiRes};
	0					0
247	0					0	$self->{_use_time_hires} = $@;
248							}
249							$self->{_time_func} = length($self->{_use_time_hires}) > 0
250	0			0		0	? sub { time }
251	0	0				0	: \&Time::HiRes::time
252							;
253							}
254
255							=item clone
256
257							Clones the contents of a Regexp::Assemble object and creates a new
258							object (in other words it performs a deep copy).
259
260							If the Storable module is installed, its dclone method will be used,
261							otherwise the cloning will be performed using a pure perl approach.
262
263							You can use this method to take a snapshot of the patterns that have
264							been added so far to an object, and generate an assembly from the
265							clone. Additional patterns may to be added to the original object
266							afterwards.
267
268							my $re = $main->clone->re();
269							$main->add( 'another-pattern-\\d+' );
270
271							=cut
272
273							sub clone {
274	0			0	1	0	my $self = shift;
275	0					0	my $clone;
276	0					0	my @attr = grep {$_ ne 'path'} keys %$self;
	0					0
277	0					0	@{$clone}{@attr} = @{$self}{@attr};
	0					0
	0					0
278	0					0	$clone->{path} = _path_clone($self->_path);
279	0					0	bless $clone, ref($self);
280							}
281
282							=item add(LIST)
283
284							Takes a string, breaks it apart into a set of tokens (respecting
285							meta characters) and inserts the resulting list into the C
286							object. It uses a naive regular expression to lex the string
287							that may be fooled complex expressions (specifically, it will
288							fail to lex nested parenthetical expressions such as
289							C correctly). If this is the case, the end of
290							the string will not be tokenised correctly and returned as one
291							long string.
292
293							On the one hand, this may indicate that the patterns you are
294							trying to feed the C object are too complex. Simpler
295							patterns might allow the algorithm to work more effectively and
296							perform more reductions in the resulting pattern.
297
298							On the other hand, you can supply your own pattern to perform the
299							lexing if you need. The test suite contains an example of a lexer
300							pattern that will match one level of nested parentheses.
301
302							Note that there is an internal optimisation that will bypass a
303							much of the lexing process. If a string contains no C<\>
304							(backslash), C<[> (open square bracket), C<(> (open paren),
305							C (question mark), C<+> (plus), C<*> (star) or C<{> (open
306							curly), a character split will be performed directly.
307
308							A list of strings may be supplied, thus you can pass it a file
309							handle of a file opened for reading:
310
311							$re->add( '\d+-\d+-\d+-\d+\.example\.com' );
312							$re->add( );
313
314							If the file is very large, it may be more efficient to use a
315							C loop, to read the file line-by-line:
316
317							$re->add($_) while ;
318
319							The C method will chomp the lines automatically. If you
320							do not want this to occur (you want to keep the record
321							separator), then disable Cing.
322
323							$re->chomp(0);
324							$re->add($_) while ;
325
326							This method is chainable.
327
328							=cut
329
330							sub _fastlex {
331	0			0		0	my $self = shift;
332	0					0	my $record = shift;
333	0					0	my $len = 0;
334	0					0	my @path = ();
335	0					0	my $case = '';
336	0					0	my $qm = '';
337
338	0					0	my $debug = $self->{debug} & DEBUG_LEX;
339	0					0	my $unroll_plus = $self->{unroll_plus};
340
341	0					0	my $token;
342							my $qualifier;
343	0	0				0	$debug and print "# _lex <$record>\n";
344	0					0	my $modifier = q{(?:[+?]\\??\|\\{(?:\\d+(?:,\d)?\|,\d+)\\}\\??)?};
345	0					0	my $class_matcher = qr/\[(?:\[:[a-z]+:\]\|\\?.)*?\]/;
346	0					0	my $paren_matcher = qr/\(.*?(?
347	0					0	my $misc_matcher = qr/(?:(c)(.)\|(0)(\d{2}))($modifier)/;
348	0					0	my $regular_matcher = qr/([^\\[(])($modifier)/;
349	0					0	my $qm_matcher = qr/(\\?.)/;
350
351	0					0	my $matcher = $regular_matcher;
352							{
353	0	0				0	if ($record =~ /\G$matcher/gc) {
	0	0				0
		0
		0
354							# neither a \\ nor [ nor ( followed by a modifer
355	0	0	0			0	if ($1 eq '\\E') {
		0	0
356	0	0				0	$debug and print "# E\n";
357	0					0	$case = $qm = '';
358	0					0	$matcher = $regular_matcher;
359	0					0	redo;
360							}
361							elsif ($qm and ($1 eq '\\L' or $1 eq '\\U')) {
362	0	0				0	$debug and print "# ignore \\L, \\U\n";
363	0					0	redo;
364							}
365	0					0	$token = $1;
366	0	0				0	$qualifier = defined $2 ? $2 : '';
367	0	0				0	$debug and print "# token <$token> <$qualifier>\n";
368	0	0				0	if ($qm) {
369	0					0	$token = quotemeta($token);
370	0					0	$token =~ s/^\\([^\w$()*+.?@\[\\\]^\|{}\/])$/$1/;
371							}
372							else {
373	0					0	$token =~ s{\A([][{}*+?@\\/])\Z}{\\$1};
374							}
375	0	0	0			0	if ($unroll_plus and $qualifier =~ s/\A\+(\?)?\Z/*/) {
376	0	0				0	$1 and $qualifier .= $1;
377	0	0				0	$debug and print " unroll <$token><$token><$qualifier>\n";
378	0	0				0	$case and $token = $case eq 'L' ? lc($token) : uc($token);
		0
379	0					0	push @path, $token, "$token$qualifier";
380							}
381							else {
382	0	0				0	$debug and print " clean <$token>\n";
383	0	0				0	push @path,
		0
384							$case eq 'L' ? lc($token).$qualifier
385							: $case eq 'U' ? uc($token).$qualifier
386							: $token.$qualifier
387							;
388							}
389	0					0	redo;
390							}
391
392							elsif ($record =~ /\G\\/gc) {
393	0	0				0	$debug and print "# backslash\n";
394							# backslash
395	0	0				0	if ($record =~ /\G([sdwSDW])($modifier)/gc) {
		0
		0
		0
		0
		0
		0
		0
396	0					0	($token, $qualifier) = ($1, $2);
397	0	0				0	$debug and print "# meta <$token> <$qualifier>\n";
398	0	0	0			0	push @path, ($unroll_plus and $qualifier =~ s/\A\+(\?)?\Z/*/)
		0
399							? ("\\$token", "\\$token$qualifier" . (defined $1 ? $1 : ''))
400							: "\\$token$qualifier";
401							}
402							elsif ($record =~ /\Gx([\da-fA-F]{2})($modifier)/gc) {
403	0	0				0	$debug and print "# x $1\n";
404	0					0	$token = quotemeta(chr(hex($1)));
405	0					0	$qualifier = $2;
406	0	0				0	$debug and print "# cooked <$token>\n";
407	0					0	$token =~ s/^\\([^\w$()*+.?\[\\\]^\|{\/])$/$1/; # } balance
408	0	0				0	$debug and print "# giving <$token>\n";
409	0	0	0			0	push @path, ($unroll_plus and $qualifier =~ s/\A\+(\?)?\Z/*/)
		0
410							? ($token, "$token$qualifier" . (defined $1 ? $1 : ''))
411							: "$token$qualifier";
412							}
413							elsif ($record =~ /\GQ/gc) {
414	0	0				0	$debug and print "# Q\n";
415	0					0	$qm = 1;
416	0					0	$matcher = $qm_matcher;
417							}
418							elsif ($record =~ /\G([LU])/gc) {
419	0	0				0	$debug and print "# case $1\n";
420	0					0	$case = $1;
421							}
422							elsif ($record =~ /\GE/gc) {
423	0	0				0	$debug and print "# E\n";
424	0					0	$case = $qm = '';
425	0					0	$matcher = $regular_matcher;
426							}
427							elsif ($record =~ /\G([lu])(.)/gc) {
428	0	0				0	$debug and print "# case $1 to <$2>\n";
429	0	0				0	push @path, $1 eq 'l' ? lc($2) : uc($2);
430							}
431	0					0	elsif (my @arg = grep {defined} $record =~ /\G$misc_matcher/gc) {
432	0	0				0	if ($] < 5.007) {
433	0					0	my $len = 0;
434	0					0	$len += length($_) for @arg;
435	0	0				0	$debug and print "# pos ", pos($record), " fixup add $len\n";
436	0					0	pos($record) = pos($record) + $len;
437							}
438	0					0	my $directive = shift @arg;
439	0	0				0	if ($directive eq 'c') {
440	0	0				0	$debug and print "# ctrl <@arg>\n";
441	0					0	push @path, "\\c" . uc(shift @arg);
442							}
443							else { # elsif ($directive eq '0') {
444	0	0				0	$debug and print "# octal <@arg>\n";
445	0					0	my $ascii = oct(shift @arg);
446	0	0				0	push @path, ($ascii < 32)
447							? "\\c" . chr($ascii+64)
448							: chr($ascii)
449							;
450							}
451	0					0	$path[-1] .= join( '', @arg ); # if @arg;
452	0					0	redo;
453							}
454							elsif ($record =~ /\G(.)/gc) {
455	0					0	$token = $1;
456	0					0	$token =~ s{[AZabefnrtz\[\]{}()\\\$*+.?@\|/^]}{\\$token};
457	0	0				0	$debug and print "# meta <$token>\n";
458	0					0	push @path, $token;
459							}
460							else {
461	0	0				0	$debug and print "# ignore char at ", pos($record), " of <$record>\n";
462							}
463	0					0	redo;
464							}
465
466							elsif ($record =~ /\G($class_matcher)($modifier)/gc) {
467							# [class] followed by a modifer
468	0					0	my $class = $1;
469	0	0				0	my $qualifier = defined $2 ? $2 : '';
470	0	0				0	$debug and print "# class begin <$class> <$qualifier>\n";
471	0	0				0	if ($class =~ /\A\[\\?(.)]\Z/) {
472	0					0	$class = quotemeta $1;
473	0					0	$class =~ s{\A\\([!@%])\Z}{$1};
474	0	0				0	$debug and print "# class unwrap $class\n";
475							}
476	0	0				0	$debug and print "# class end <$class> <$qualifier>\n";
477	0	0	0			0	push @path, ($unroll_plus and $qualifier =~ s/\A\+(\?)?\Z/*/)
		0
478							? ($class, "$class$qualifier" . (defined $1 ? $1 : ''))
479							: "$class$qualifier";
480	0					0	redo;
481							}
482
483							elsif ($record =~ /\G($paren_matcher)/gc) {
484	0	0				0	$debug and print "# paren <$1>\n";
485							# (paren) followed by a modifer
486	0					0	push @path, $1;
487	0					0	redo;
488							}
489
490							}
491	0					0	return \@path;
492							}
493
494							sub _lex {
495	0			0		0	my $self = shift;
496	0					0	my $record = shift;
497	0					0	my $len = 0;
498	0					0	my @path = ();
499	0					0	my $case = '';
500	0					0	my $qm = '';
501							my $re = defined $self->{lex} ? $self->{lex}
502	0	0				0	: defined $Current_Lexer ? $Current_Lexer
		0
503							: $Default_Lexer;
504	0					0	my $debug = $self->{debug} & DEBUG_LEX;
505	0	0				0	$debug and print "# _lex <$record>\n";
506	0					0	my ($token, $next_token, $diff, $token_len);
507	0					0	while( $record =~ /($re)/g ) {
508	0					0	$token = $1;
509	0					0	$token_len = length($token);
510	0	0				0	$debug and print "# lexed <$token> len=$token_len\n";
511	0	0				0	if( pos($record) - $len > $token_len ) {
512	0					0	$next_token = $token;
513	0					0	$token = substr( $record, $len, $diff = pos($record) - $len - $token_len );
514	0	0				0	$debug and print "# recover <", substr( $record, $len, $diff ), "> as <$token>, save <$next_token>\n";
515	0					0	$len += $diff;
516							}
517	0					0	$len += $token_len;
518							TOKEN: {
519	0	0				0	if( substr( $token, 0, 1 ) eq '\\' ) {
	0					0
520	0	0				0	if( $token =~ /^\\([ELQU])$/ ) {
		0
		0
521	0	0				0	if( $1 eq 'E' ) {
		0
522							$qm and $re = defined $self->{lex} ? $self->{lex}
523	0	0				0	: defined $Current_Lexer ? $Current_Lexer
		0
		0
524							: $Default_Lexer;
525	0					0	$case = $qm = '';
526							}
527							elsif( $1 eq 'Q' ) {
528	0					0	$qm = $1;
529							# switch to a more precise lexer to quotemeta individual characters
530	0					0	$re = qr/\\?./;
531							}
532							else {
533	0					0	$case = $1;
534							}
535	0	0				0	$debug and print "# state change qm=<$qm> case=<$case>\n";
536	0					0	goto NEXT_TOKEN;
537							}
538							elsif( $token =~ /^\\([lu])(.)$/ ) {
539	0	0				0	$debug and print "# apply case=<$1> to <$2>\n";
540	0	0				0	push @path, $1 eq 'l' ? lc($2) : uc($2);
541	0					0	goto NEXT_TOKEN;
542							}
543							elsif( $token =~ /^\\x([\da-fA-F]{2})$/ ) {
544	0					0	$token = quotemeta(chr(hex($1)));
545	0	0				0	$debug and print "# cooked <$token>\n";
546	0					0	$token =~ s/^\\([^\w$()*+.?@\[\\\]^\|{\/])$/$1/; # } balance
547	0	0				0	$debug and print "# giving <$token>\n";
548							}
549							else {
550	0					0	$token =~ s/^\\([^\w$()*+.?@\[\\\]^\|{\/])$/$1/; # } balance
551	0	0				0	$debug and print "# backslashed <$token>\n";
552							}
553							}
554							else {
555	0	0				0	$case and $token = $case eq 'U' ? uc($token) : lc($token);
		0
556	0	0				0	$qm and $token = quotemeta($token);
557	0	0				0	$token = '\\/' if $token eq '/';
558							}
559							# undo quotemeta's brute-force escapades
560	0	0				0	$qm and $token =~ s/^\\([^\w$()*+.?@\[\\\]^\|{}\/])$/$1/;
561	0	0				0	$debug and print "# <$token> case=<$case> qm=<$qm>\n";
562	0					0	push @path, $token;
563
564							NEXT_TOKEN:
565	0	0				0	if( defined $next_token ) {
566	0	0				0	$debug and print "# redo <$next_token>\n";
567	0					0	$token = $next_token;
568	0					0	$next_token = undef;
569	0					0	redo TOKEN;
570							}
571							}
572							}
573	0	0				0	if( $len < length($record) ) {
574							# NB: the remainder only arises in the case of degenerate lexer,
575							# and if \Q is operative, the lexer will have been switched to
576							# /\\?./, which means there can never be a remainder, so we
577							# don't have to bother about quotemeta. In other words:
578							# $qm will never be true in this block.
579	0					0	my $remain = substr($record,$len);
580	0	0				0	$case and $remain = $case eq 'U' ? uc($remain) : lc($remain);
		0
581	0	0				0	$debug and print "# add remaining <$remain> case=<$case> qm=<$qm>\n";
582	0					0	push @path, $remain;
583							}
584	0	0				0	$debug and print "# _lex out <@path>\n";
585	0					0	return \@path;
586							}
587
588							sub add {
589	4			4	1	9	my $self = shift;
590	4					11	my $record;
591	4					11	my $debug = $self->{debug} & DEBUG_LEX;
592	4					17	while( defined( $record = shift @_ )) {
593	4	50				13	CORE::chomp($record) if $self->{chomp};
594	4	50	33			25	next if $self->{pre_filter} and not $self->{pre_filter}->($record);
595	4	50				11	$debug and print "# add <$record>\n";
596	4					10	$self->{stats_raw} += length $record;
597							my $list = $record =~ /[+*?(\\\[{]/ # }]) restore equilibrium
598	4	0				76	? $self->{lex} ? $self->_lex($record) : $self->_fastlex($record)
		50
599							: [split //, $record]
600							;
601	4	50	33			18	next if $self->{filter} and not $self->{filter}->(@$list);
602	4					27	$self->_insertr( $list );
603							}
604	4					20	return $self;
605							}
606
607							=item add_file(FILENAME [...])
608
609							Takes a list of file names. Each file is opened and read
610							line by line. Each line is added to the assembly.
611
612							$r->add_file( 'file.1', 'file.2' );
613
614							If a file cannot be opened, the method will croak. If you cannot
615							afford to let this happen then you should wrap the call in a C
616							block.
617
618							Chomping happens automatically unless you the C method
619							to disable it. By default, input lines are read according to the
620							value of the C attribute (if defined), and
621							will otherwise fall back to the current setting of the system C<$/>
622							variable. The record separator may also be specified on each
623							call to C. Internally, the routine Cises the
624							value of C<$/> to whatever is required, for the duration of the
625							call.
626
627							An alternate calling mechanism using a hash reference is
628							available. The recognised keys are:
629
630							=over 4
631
632							=item file
633
634							Reference to a list of file names, or the name of a single
635							file.
636
637							$r->add_file({file => ['file.1', 'file.2', 'file.3']});
638							$r->add_file({file => 'file.n'});
639
640							=item input_record_separator
641
642							If present, indicates what constitutes a line
643
644							$r->add_file({file => 'data.txt', input_record_separator => ':' });
645
646							=item rs
647
648							An alias for input_record_separator (mnemonic: same as the
649							English variable names).
650
651							=back
652
653							$r->add_file( {
654							file => [ 'pattern.txt', 'more.txt' ],
655							input_record_separator => "\r\n",
656							});
657
658							=cut
659
660							sub add_file {
661	0			0	1	0	my $self = shift;
662	0					0	my $rs;
663							my @file;
664	0	0				0	if (ref($_[0]) eq 'HASH') {
665	0					0	my $arg = shift;
666							$rs = $arg->{rs}
667							\|\| $arg->{input_record_separator}
668							\|\| $self->{input_record_separator}
669	0		0			0	\|\| $/;
670							@file = ref($arg->{file}) eq 'ARRAY'
671	0					0	? @{$arg->{file}}
672	0	0				0	: $arg->{file};
673							}
674							else {
675	0		0			0	$rs = $self->{input_record_separator} \|\| $/;
676	0					0	@file = @_;
677							}
678	0					0	local $/ = $rs;
679	0					0	my $file;
680	0					0	for $file (@file) {
681	0	0				0	open my $fh, '<', $file or do {
682	0					0	require Carp;
683	0					0	Carp::croak("cannot open $file for input: $!");
684							};
685	0					0	while (defined (my $rec = <$fh>)) {
686	0					0	$self->add($rec);
687							}
688	0					0	close $fh;
689							}
690	0					0	return $self;
691							}
692
693							=item insert(LIST)
694
695							Takes a list of tokens representing a regular expression and
696							stores them in the object. Note: you should not pass it a bare
697							regular expression, such as C. You must pass it as
698							a list of tokens, I C<('a', 'b+', 'c?', 'd*', 'e')>.
699
700							This method is chainable, I:
701
702							my $ra = Regexp::Assemble->new
703							->insert( qw[ a b+ c? d* e ] )
704							->insert( qw[ a c+ d+ e* f ] );
705
706							Lexing complex patterns with metacharacters and so on can consume
707							a significant proportion of the overall time to build an assembly.
708							If you have the information available in a tokenised form, calling
709							C directly can be a big win.
710
711							=cut
712
713							sub insert {
714	0			0	1	0	my $self = shift;
715	0	0	0			0	return if $self->{filter} and not $self->{filter}->(@_);
716	0					0	$self->_insertr( [@_] );
717	0					0	return $self;
718							}
719
720							sub _insertr {
721	4			4		34	my $self = shift;
722	4		50			31	my $dup = $self->{stats_dup} \|\| 0;
723	4					22	$self->{path} = $self->_insert_path( $self->_path, $self->_debug(DEBUG_ADD), $_[0] );
724	4	50	33			18	if( not defined $self->{stats_dup} or $dup == $self->{stats_dup} ) {
		0
725	4					10	++$self->{stats_add};
726	4	50				7	$self->{stats_cooked} += defined($_) ? length($_) : 0 for @{$_[0]};
	4					40
727							}
728							elsif( $self->{dup_warn} ) {
729	0	0				0	if( ref $self->{dup_warn} eq 'CODE' ) {
730	0					0	$self->{dup_warn}->($self, $_[0]);
731							}
732							else {
733	0					0	my $pattern = join( '', @{$_[0]} );
	0					0
734	0					0	require Carp;
735	0					0	Carp::carp("duplicate pattern added: /$pattern/");
736							}
737							}
738	4					20	$self->{str} = $self->{re} = undef;
739							}
740
741							=item lexstr
742
743							Use the C method if you are curious to see how a pattern
744							gets tokenised. It takes a scalar on input, representing a pattern,
745							and returns a reference to an array, containing the tokenised
746							pattern. You can recover the original pattern by performing a
747							C:
748
749							my @token = $re->lexstr($pattern);
750							my $new_pattern = join( '', @token );
751
752							If the original pattern contains unnecessary backslashes, or C<\x4b>
753							escapes, or quotemeta escapes (C<\Q>...C<\E>) the resulting pattern
754							may not be identical.
755
756							Call C does not add the pattern to the object, it is merely
757							for exploratory purposes. It will, however, update various statistical
758							counters.
759
760							=cut
761
762							sub lexstr {
763	0			0	1	0	return shift->_lex(shift);
764							}
765
766							=item pre_filter(CODE)
767
768							Allows you to install a callback to check that the pattern being
769							loaded contains valid input. It receives the pattern as a whole to
770							be added, before it been tokenised by the lexer. It may to return
771							0 or C to indicate that the pattern should not be added, any
772							true value indicates that the contents are fine.
773
774							A filter to strip out trailing comments (marked by #):
775
776							$re->pre_filter( sub { $_[0] =~ s/\s#.$//; 1 } );
777
778							A filter to ignore blank lines:
779
780							$re->pre_filter( sub { length(shift) } );
781
782							If you want to remove the filter, pass C as a parameter.
783
784							$ra->pre_filter(undef);
785
786							This method is chainable.
787
788							=cut
789
790							sub pre_filter {
791	0			0	1	0	my $self = shift;
792	0					0	my $pre_filter = shift;
793	0	0	0			0	if( defined $pre_filter and ref($pre_filter) ne 'CODE' ) {
794	0					0	require Carp;
795	0					0	Carp::croak("pre_filter method not passed a coderef");
796							}
797	0					0	$self->{pre_filter} = $pre_filter;
798	0					0	return $self;
799							}
800
801
802							=item filter(CODE)
803
804							Allows you to install a callback to check that the pattern being
805							loaded contains valid input. It receives a list on input, after it
806							has been tokenised by the lexer. It may to return 0 or undef to
807							indicate that the pattern should not be added, any true value
808							indicates that the contents are fine.
809
810							If you know that all patterns you expect to assemble contain
811							a restricted set of of tokens (e.g. no spaces), you could do
812							the following:
813
814							$ra->filter(sub { not grep { / / } @_ });
815
816							or
817
818							sub only_spaces_and_digits {
819							not grep { ![\d ] } @_
820							}
821							$ra->filter( \&only_spaces_and_digits );
822
823							These two examples will silently ignore faulty patterns, If you
824							want the user to be made aware of the problem you should raise an
825							error (via C or C), log an error message, whatever is
826							best. If you want to remove a filter, pass C as a parameter.
827
828							$ra->filter(undef);
829
830							This method is chainable.
831
832							=cut
833
834							sub filter {
835	0			0	1	0	my $self = shift;
836	0					0	my $filter = shift;
837	0	0	0			0	if( defined $filter and ref($filter) ne 'CODE' ) {
838	0					0	require Carp;
839	0					0	Carp::croak("filter method not passed a coderef");
840							}
841	0					0	$self->{filter} = $filter;
842	0					0	return $self;
843							}
844
845							=item as_string
846
847							Assemble the expression and return it as a string. You may want to do
848							this if you are writing the pattern to a file. The following arguments
849							can be passed to control the aspect of the resulting pattern:
850
851							B, the number of spaces used to indent nested grouping of
852							a pattern. Use this to produce a pretty-printed pattern (for some
853							definition of "pretty"). The resulting output is rather verbose. The
854							reason is to ensure that the metacharacters C<(?:> and C<)> always
855							occur on otherwise empty lines. This allows you grep the result for an
856							even more synthetic view of the pattern:
857
858							egrep -v '^ *[()]'
859
860							The result of the above is quite readable. Remember to backslash the
861							spaces appearing in your own patterns if you wish to use an indented
862							pattern in an C construct. Indenting is ignored if tracking
863							is enabled.
864
865							The B argument takes precedence over the C
866							method/attribute of the object.
867
868							Calling this
869							method will drain the internal data structure. Large numbers of patterns
870							can eat a significant amount of memory, and this lets perl recover the
871							memory used for other purposes.
872
873							If you want to reduce the pattern I continue to add new patterns,
874							clone the object and reduce the clone, leaving the original object intact.
875
876							=cut
877
878							sub as_string {
879	3			3	1	13	my $self = shift;
880	3	50				12	if( not defined $self->{str} ) {
881	3	50				26	if( $self->{track} ) {
882	0					0	$self->{m} = undef;
883	0					0	$self->{mcount} = 0;
884	0					0	$self->{mlist} = [];
885	0					0	$self->{str} = _re_path_track($self, $self->_path, '', '');
886							}
887							else {
888	3	50	33			73	$self->_reduce unless ($self->{mutable} or not $self->{reduce});
889	3					6	my $arg = {@_};
890							$arg->{indent} = $self->{indent}
891	3	50	33			27	if not exists $arg->{indent} and $self->{indent} > 0;
892	3	50	33			32	if( exists $arg->{indent} and $arg->{indent} > 0 ) {
		50
893	0					0	$arg->{depth} = 0;
894	0					0	$self->{str} = _re_path_pretty($self, $self->_path, $arg);
895							}
896							elsif( $self->{lookahead} ) {
897	0					0	$self->{str} = _re_path_lookahead($self, $self->_path);
898							}
899							else {
900	3					19	$self->{str} = _re_path($self, $self->_path);
901							}
902							}
903	3	50				10	if (not length $self->{str}) {
904							# explicitly fail to match anything if no pattern was generated
905	0					0	$self->{str} = $Always_Fail;
906							}
907							else {
908							my $begin =
909							$self->{anchor_word_begin} ? '\\b'
910							: $self->{anchor_line_begin} ? '^'
911	3	50				25	: $self->{anchor_string_begin} ? '\A'
		50
		50
912							: ''
913							;
914							my $end =
915							$self->{anchor_word_end} ? '\\b'
916							: $self->{anchor_line_end} ? '$'
917							: $self->{anchor_string_end} ? '\Z'
918	3	50				29	: $self->{anchor_string_end_absolute} ? '\z'
		50
		50
		50
919							: ''
920							;
921	3					11	$self->{str} = "$begin$self->{str}$end";
922							}
923	3	50				15	$self->{path} = [] unless $self->{mutable};
924							}
925	3					14	return $self->{str};
926							}
927
928							=item re
929
930							Assembles the pattern and return it as a compiled RE, using the
931							C operator.
932
933							As with C, calling this method will reset the internal data
934							structures to free the memory used in assembling the RE.
935
936							The B attribute, documented in the C method, can be
937							used here (it will be ignored if tracking is enabled).
938
939							With method chaining, it is possible to produce a RE without having
940							a temporary C object lying around, I:
941
942							my $re = Regexp::Assemble->new
943							->add( q[ab+cd+e] )
944							->add( q[ac\\d+e] )
945							->add( q[c\\d+e] )
946							->re;
947
948							The C<$re> variable now contains a Regexp object that can be used
949							directly:
950
951							while( <> ) {
952							/$re/ and print "Something in [$_] matched\n";
953							)
954
955							The C method is called when the object is used in string context
956							(hence, within an C operator), so by and large you do not even
957							need to save the RE in a separate variable. The following will work
958							as expected:
959
960							my $re = Regexp::Assemble->new->add( qw[ fee fie foe fum ] );
961							while( ) {
962							if( /($re)/ ) {
963							print "Here be giants: $1\n";
964							}
965							}
966
967							This approach does not work with tracked patterns. The
968							C and C methods must be used instead, see below.
969
970							=cut
971
972							sub re {
973	4			4	1	7	my $self = shift;
974	4	50				12	$self->_build_re($self->as_string(@_)) unless defined $self->{re};
975	4					60	return $self->{re};
976							}
977
978							use overload '""' => sub {
979	8			8		38	my $self = shift;
980	8	100				89	return $self->{re} if $self->{re};
981	3					31	$self->_build_re($self->as_string());
982	3					19	return $self->{re};
983	10			10		112	};
	10					33
	10					132
984
985							sub _build_re {
986	3			3		5	my $self = shift;
987	3					7	my $str = shift;
988	3	50				12	if( $self->{track} ) {
989	10			10		1036	use re 'eval';
	10					43
	10					103552
990							$self->{re} = length $self->{flags}
991	0	0				0	? qr/(?$self->{flags}:$str)/
992							: qr/$str/
993							;
994							}
995							else {
996							# how could I not repeat myself?
997							$self->{re} = length $self->{flags}
998	3	50				74	? qr/(?$self->{flags}:$str)/
999							: qr/$str/
1000							;
1001							}
1002							}
1003
1004							=item match(SCALAR)
1005
1006							The following information applies to Perl 5.8 and below. See
1007							the section that follows for information on Perl 5.10.
1008
1009							If pattern tracking is in use, you must C in order
1010							to make things work correctly. At a minimum, this will make your
1011							code look like this:
1012
1013							my $did_match = do { use re 'eval'; $target =~ /$ra/ }
1014							if( $did_match ) {
1015							print "matched ", $ra->matched, "\n";
1016							}
1017
1018							(The main reason is that the C<$^R> variable is currently broken
1019							and an ugly workaround that runs some Perl code during the match
1020							is required, in order to simulate what C<$^R> should be doing. See
1021							Perl bug #32840 for more information if you are curious. The README
1022							also contains more information). This bug has been fixed in 5.10.
1023
1024							The important thing to note is that with C, THERE
1025							ARE SECURITY IMPLICATIONS WHICH YOU IGNORE AT YOUR PERIL. The problem
1026							is this: if you do not have strict control over the patterns being
1027							fed to C when tracking is enabled, and someone
1028							slips you a pattern such as C and you
1029							attempt to match a string against the resulting pattern, you will
1030							know Fear and Loathing.
1031
1032							What is more, the C<$^R> workaround means that that tracking does
1033							not work if you perform a bare C pattern match as shown
1034							above. You have to instead call the C method, in order to
1035							supply the necessary context to take care of the tracking housekeeping
1036							details.
1037
1038							if( defined( my $match = $ra->match($_)) ) {
1039							print " $_ matched by $match\n";
1040							}
1041
1042							In the case of a successful match, the original matched pattern
1043							is returned directly. The matched pattern will also be available
1044							through the C method.
1045
1046							(Except that the above is not true for 5.6.0: the C method
1047							returns true or undef, and the C method always returns
1048							undef).
1049
1050							If you are capturing parts of the pattern I C
1051							you will want to get at the captures. See the C, C,
1052							C and C methods. If you are not using captures
1053							then you may safely ignore this section.
1054
1055							In 5.10, since the bug concerning C<$^R> has been resolved, there
1056							is no need to use C and the assembled pattern does
1057							not require any Perl code to be executed during the match.
1058
1059							=cut
1060
1061							sub match {
1062	0			0	1	0	my $self = shift;
1063	0					0	my $target = shift;
1064	0	0				0	$self->_build_re($self->as_string(@_)) unless defined $self->{re};
1065	0					0	$self->{m} = undef;
1066	0					0	$self->{mvar} = [];
1067	0	0				0	if( not $target =~ /$self->{re}/ ) {
1068	0					0	$self->{mbegin} = [];
1069	0					0	$self->{mend} = [];
1070	0					0	return undef;
1071							}
1072	0	0				0	$self->{m} = $^R if $] >= 5.009005;
1073	0					0	$self->{mbegin} = _path_copy([@-]);
1074	0					0	$self->{mend} = _path_copy([@+]);
1075	0					0	my $n = 0;
1076	0					0	for( my $n = 0; $n < @-; ++$n ) {
1077	0	0	0			0	push @{$self->{mvar}}, substr($target, $-[$n], $+[$n] - $-[$n])
	0					0
1078							if defined $-[$n] and defined $+[$n];
1079							}
1080	0	0				0	if( $self->{track} ) {
1081	0	0				0	return defined $self->{m} ? $self->{mlist}[$self->{m}] : 1;
1082							}
1083							else {
1084	0					0	return 1;
1085							}
1086							}
1087
1088							=item source
1089
1090							When using tracked mode, after a successful match is made, returns
1091							the original source pattern that caused the match. In Perl 5.10,
1092							the C<$^R> variable can be used to as an index to fetch the correct
1093							pattern from the object.
1094
1095							If no successful match has been performed, or the object is not in
1096							tracked mode, this method returns C.
1097
1098							my $r = Regexp::Assemble->new->track(1)->add(qw(foo? bar{2} [Rr]at));
1099
1100							for my $w (qw(this food is rather barren)) {
1101							if ($w =~ /$r/) {
1102							print "$w matched by ", $r->source($^R), $/;
1103							}
1104							else {
1105							print "$w no match\n";
1106							}
1107							}
1108
1109							=cut
1110
1111							sub source {
1112	0			0	1	0	my $self = shift;
1113	0	0				0	return unless $self->{track};
1114	0	0				0	defined($_[0]) and return $self->{mlist}[$_[0]];
1115	0	0				0	return unless defined $self->{m};
1116	0					0	return $self->{mlist}[$self->{m}];
1117							}
1118
1119							=item mbegin
1120
1121							This method returns a copy of C<@-> at the moment of the
1122							last match. You should ordinarily not need to bother with
1123							this, C should be able to supply all your needs.
1124
1125							=cut
1126
1127							sub mbegin {
1128	0			0	1	0	my $self = shift;
1129	0	0				0	return exists $self->{mbegin} ? $self->{mbegin} : [];
1130							}
1131
1132							=item mend
1133
1134							This method returns a copy of C<@+> at the moment of the
1135							last match.
1136
1137							=cut
1138
1139							sub mend {
1140	0			0	1	0	my $self = shift;
1141	0	0				0	return exists $self->{mend} ? $self->{mend} : [];
1142							}
1143
1144							=item mvar(NUMBER)
1145
1146							The C method returns the captures of the last match.
1147							C corresponds to $1, C to $2, and so on.
1148							C happens to return the target string matched,
1149							as a byproduct of walking down the C<@-> and C<@+> arrays
1150							after the match.
1151
1152							If called without a parameter, C will return a
1153							reference to an array containing all captures.
1154
1155							=cut
1156
1157							sub mvar {
1158	0			0	1	0	my $self = shift;
1159	0	0				0	return undef unless exists $self->{mvar};
1160	0	0				0	return defined($_[0]) ? $self->{mvar}[$_[0]] : $self->{mvar};
1161							}
1162
1163							=item capture
1164
1165							The C method returns the the captures of the last
1166							match as an array. Unlink C, this method does not
1167							include the matched string. It is equivalent to getting an
1168							array back that contains C<$1, $2, $3, ...>.
1169
1170							If no captures were found in the match, an empty array is
1171							returned, rather than C. You are therefore guaranteed
1172							to be able to use C<< for my $c ($re->capture) { ... >>
1173							without have to check whether anything was captured.
1174
1175							=cut
1176
1177							sub capture {
1178	0			0	1	0	my $self = shift;
1179	0	0				0	if( $self->{mvar} ) {
1180	0					0	my @capture = @{$self->{mvar}};
	0					0
1181	0					0	shift @capture;
1182	0					0	return @capture;
1183							}
1184	0					0	return ();
1185							}
1186
1187							=item matched
1188
1189							If pattern tracking has been set, via the C attribute,
1190							or through the C method, this method will return the
1191							original pattern of the last successful match. Returns undef
1192							match has yet been performed, or tracking has not been enabled.
1193
1194							See below in the NOTES section for additional subtleties of
1195							which you should be aware of when tracking patterns.
1196
1197							Note that this method is not available in 5.6.0, due to
1198							limitations in the implementation of C<(?{...})> at the time.
1199
1200							=cut
1201
1202							sub matched {
1203	0			0	1	0	my $self = shift;
1204	0	0				0	return defined $self->{m} ? $self->{mlist}[$self->{m}] : undef;
1205							}
1206
1207							=back
1208
1209							=head2 Statistics/Reporting routines
1210
1211							=over 8
1212
1213							=item stats_add
1214
1215							Returns the number of patterns added to the assembly (whether
1216							by C or C). Duplicate patterns are not included
1217							in this total.
1218
1219							=cut
1220
1221							sub stats_add {
1222	0			0	1	0	my $self = shift;
1223	0		0			0	return $self->{stats_add} \|\| 0;
1224							}
1225
1226							=item stats_dup
1227
1228							Returns the number of duplicate patterns added to the assembly.
1229							If non-zero, this may be a sign that something is wrong with
1230							your data (or at the least, some needless redundancy). This may
1231							occur when you have two patterns (for instance, C and
1232							C) which map to the same result.
1233
1234							=cut
1235
1236							sub stats_dup {
1237	0			0	1	0	my $self = shift;
1238	0		0			0	return $self->{stats_dup} \|\| 0;
1239							}
1240
1241							=item stats_raw
1242
1243							Returns the raw number of bytes in the patterns added to the
1244							assembly. This includes both original and duplicate patterns.
1245							For instance, adding the two patterns C and C will
1246							count as 4 bytes.
1247
1248							=cut
1249
1250							sub stats_raw {
1251	0			0	1	0	my $self = shift;
1252	0		0			0	return $self->{stats_raw} \|\| 0;
1253							}
1254
1255							=item stats_cooked
1256
1257							Return the true number of bytes added to the assembly. This
1258							will not include duplicate patterns. Furthermore, it may differ
1259							from the raw bytes due to quotemeta treatment. For instance,
1260							C will count as 7 (not 8) bytes, because C<\,> will
1261							be stored as C<,>. Also, C<\Qa.b\E> is 7 bytes long, however,
1262							after the quotemeta directives are processed, C will be
1263							stored, for a total of 4 bytes.
1264
1265							=cut
1266
1267							sub stats_cooked {
1268	0			0	1	0	my $self = shift;
1269	0		0			0	return $self->{stats_cooked} \|\| 0;
1270							}
1271
1272							=item stats_length
1273
1274							Returns the length of the resulting assembled expression.
1275							Until C or C have been called, the length
1276							will be 0 (since the assembly will have not yet been
1277							performed). The length includes only the pattern, not the
1278							additional (C<(?-xism...>) fluff added by the compilation.
1279
1280							=cut
1281
1282							sub stats_length {
1283	0			0	1	0	my $self = shift;
1284	0	0	0			0	return (defined $self->{str} and $self->{str} ne $Always_Fail) ? length $self->{str} : 0;
1285							}
1286
1287							=item dup_warn(NUMBER\|CODEREF)
1288
1289							Turns warnings about duplicate patterns on or off. By
1290							default, no warnings are emitted. If the method is
1291							called with no parameters, or a true parameter,
1292							the object will carp about patterns it has
1293							already seen. To turn off the warnings, use 0 as a
1294							parameter.
1295
1296							$r->dup_warn();
1297
1298							The method may also be passed a code block. In this case
1299							the code will be executed and it will receive a reference
1300							to the object in question, and the lexed pattern.
1301
1302							$r->dup_warn(
1303							sub {
1304							my $self = shift;
1305							print $self->stats_add, " patterns added at line $.\n",
1306							join( '', @_ ), " added previously\n";
1307							}
1308							)
1309
1310							=cut
1311
1312							sub dup_warn {
1313	0			0	1	0	my $self = shift;
1314	0	0				0	$self->{dup_warn} = defined($_[0]) ? $_[0] : 1;
1315	0					0	return $self;
1316							}
1317
1318							=back
1319
1320							=head2 Anchor routines
1321
1322							Suppose you wish to assemble a series of patterns that all begin
1323							with C<^> and end with C<$> (anchor pattern to the beginning and
1324							end of line). Rather than add the anchors to each and every pattern
1325							(and possibly forget to do so when a new entry is added), you may
1326							specify the anchors in the object, and they will appear in the
1327							resulting pattern, and you no longer need to (or should) put them
1328							in your source patterns. For example, the two following snippets
1329							will produce identical patterns:
1330
1331							$r->add(qw(^this ^that ^them))->as_string;
1332
1333							$r->add(qw(this that them))->anchor_line_begin->as_string;
1334
1335							# both techniques will produce ^th(?:at\|em\|is)
1336
1337							All anchors are possible word (C<\b>) boundaries, line
1338							boundaries (C<^> and C<$>) and string boundaries (C<\A>
1339							and C<\Z> (or C<\z> if you absolutely need it)).
1340
1341							The shortcut C> implies both
1342							C_begin> C_end>
1343							is also available. If different anchors are specified
1344							the most specific anchor wins. For instance, if both
1345							C and C are
1346							specified, C takes precedence.
1347
1348							All the anchor methods are chainable.
1349
1350							=over 8
1351
1352							=item anchor_word_begin
1353
1354							The resulting pattern will be prefixed with a C<\b>
1355							word boundary assertion when the value is true. Set
1356							to 0 to disable.
1357
1358							$r->add('pre')->anchor_word_begin->as_string;
1359							# produces '\bpre'
1360
1361							=cut
1362
1363							sub anchor_word_begin {
1364	0			0	1	0	my $self = shift;
1365	0	0				0	$self->{anchor_word_begin} = defined($_[0]) ? $_[0] : 1;
1366	0					0	return $self;
1367							}
1368
1369							=item anchor_word_end
1370
1371							The resulting pattern will be suffixed with a C<\b>
1372							word boundary assertion when the value is true. Set
1373							to 0 to disable.
1374
1375							$r->add(qw(ing tion))
1376							->anchor_word_end
1377							->as_string; # produces '(?:tion\|ing)\b'
1378
1379							=cut
1380
1381							sub anchor_word_end {
1382	0			0	1	0	my $self = shift;
1383	0	0				0	$self->{anchor_word_end} = defined($_[0]) ? $_[0] : 1;
1384	0					0	return $self;
1385							}
1386
1387							=item anchor_word
1388
1389							The resulting pattern will be have C<\b>
1390							word boundary assertions at the beginning and end
1391							of the pattern when the value is true. Set
1392							to 0 to disable.
1393
1394							$r->add(qw(cat carrot)
1395							->anchor_word(1)
1396							->as_string; # produces '\bca(?:rro)t\b'
1397
1398							=cut
1399
1400							sub anchor_word {
1401	0			0	1	0	my $self = shift;
1402	0					0	my $state = shift;
1403	0					0	$self->anchor_word_begin($state)->anchor_word_end($state);
1404	0					0	return $self;
1405							}
1406
1407							=item anchor_line_begin
1408
1409							The resulting pattern will be prefixed with a C<^>
1410							line boundary assertion when the value is true. Set
1411							to 0 to disable.
1412
1413							$r->anchor_line_begin;
1414							# or
1415							$r->anchor_line_begin(1);
1416
1417							=cut
1418
1419							sub anchor_line_begin {
1420	0			0	1	0	my $self = shift;
1421	0	0				0	$self->{anchor_line_begin} = defined($_[0]) ? $_[0] : 1;
1422	0					0	return $self;
1423							}
1424
1425							=item anchor_line_end
1426
1427							The resulting pattern will be suffixed with a C<$>
1428							line boundary assertion when the value is true. Set
1429							to 0 to disable.
1430
1431							# turn it off
1432							$r->anchor_line_end(0);
1433
1434							=cut
1435
1436							sub anchor_line_end {
1437	0			0	1	0	my $self = shift;
1438	0	0				0	$self->{anchor_line_end} = defined($_[0]) ? $_[0] : 1;
1439	0					0	return $self;
1440							}
1441
1442							=item anchor_line
1443
1444							The resulting pattern will be have the C<^> and C<$>
1445							line boundary assertions at the beginning and end
1446							of the pattern, respectively, when the value is true. Set
1447							to 0 to disable.
1448
1449							$r->add(qw(cat carrot)
1450							->anchor_line
1451							->as_string; # produces '^ca(?:rro)t$'
1452
1453							=cut
1454
1455							sub anchor_line {
1456	0			0	1	0	my $self = shift;
1457	0					0	my $state = shift;
1458	0					0	$self->anchor_line_begin($state)->anchor_line_end($state);
1459	0					0	return $self;
1460							}
1461
1462							=item anchor_string_begin
1463
1464							The resulting pattern will be prefixed with a C<\A>
1465							string boundary assertion when the value is true. Set
1466							to 0 to disable.
1467
1468							$r->anchor_string_begin(1);
1469
1470							=cut
1471
1472							sub anchor_string_begin {
1473	0			0	1	0	my $self = shift;
1474	0	0				0	$self->{anchor_string_begin} = defined($_[0]) ? $_[0] : 1;
1475	0					0	return $self;
1476							}
1477
1478							=item anchor_string_end
1479
1480							The resulting pattern will be suffixed with a C<\Z>
1481							string boundary assertion when the value is true. Set
1482							to 0 to disable.
1483
1484							# disable the string boundary end anchor
1485							$r->anchor_string_end(0);
1486
1487							=cut
1488
1489							sub anchor_string_end {
1490	0			0	1	0	my $self = shift;
1491	0	0				0	$self->{anchor_string_end} = defined($_[0]) ? $_[0] : 1;
1492	0					0	return $self;
1493							}
1494
1495							=item anchor_string_end_absolute
1496
1497							The resulting pattern will be suffixed with a C<\z>
1498							string boundary assertion when the value is true. Set
1499							to 0 to disable.
1500
1501							# disable the string boundary absolute end anchor
1502							$r->anchor_string_end_absolute(0);
1503
1504							If you don't understand the difference between
1505							C<\Z> and C<\z>, the former will probably do what
1506							you want.
1507
1508							=cut
1509
1510							sub anchor_string_end_absolute {
1511	0			0	1	0	my $self = shift;
1512	0	0				0	$self->{anchor_string_end_absolute} = defined($_[0]) ? $_[0] : 1;
1513	0					0	return $self;
1514							}
1515
1516							=item anchor_string
1517
1518							The resulting pattern will be have the C<\A> and C<\Z>
1519							string boundary assertions at the beginning and end
1520							of the pattern, respectively, when the value is true. Set
1521							to 0 to disable.
1522
1523							$r->add(qw(cat carrot)
1524							->anchor_string
1525							->as_string; # produces '\Aca(?:rro)t\Z'
1526
1527							=cut
1528
1529							sub anchor_string {
1530	0			0	1	0	my $self = shift;
1531	0	0				0	my $state = defined($_[0]) ? $_[0] : 1;
1532	0					0	$self->anchor_string_begin($state)->anchor_string_end($state);
1533	0					0	return $self;
1534							}
1535
1536							=item anchor_string_absolute
1537
1538							The resulting pattern will be have the C<\A> and C<\z>
1539							string boundary assertions at the beginning and end
1540							of the pattern, respectively, when the value is true. Set
1541							to 0 to disable.
1542
1543							$r->add(qw(cat carrot)
1544							->anchor_string_absolute
1545							->as_string; # produces '\Aca(?:rro)t\z'
1546
1547							=cut
1548
1549							sub anchor_string_absolute {
1550	0			0	1	0	my $self = shift;
1551	0	0				0	my $state = defined($_[0]) ? $_[0] : 1;
1552	0					0	$self->anchor_string_begin($state)->anchor_string_end_absolute($state);
1553	0					0	return $self;
1554							}
1555
1556							=back
1557
1558							=over 8
1559
1560							=item debug(NUMBER)
1561
1562							Turns debugging on or off. Statements are printed
1563							to the currently selected file handle (STDOUT by default).
1564							If you are already using this handle, you will have to
1565							arrange to select an output handle to a file of your own
1566							choosing, before call the C, C or C)
1567							functions, otherwise it will scribble all over your
1568							carefully formatted output.
1569
1570							=over 8
1571
1572							=item C<0>
1573
1574							Off. Turns off all debugging output.
1575
1576							=item C<1>
1577
1578							Add. Trace the addition of patterns.
1579
1580							=item C<2>
1581
1582							Reduce. Trace the process of reduction and assembly.
1583
1584							=item C<4>
1585
1586							Lex. Trace the lexing of the input patterns into its constituent
1587							tokens.
1588
1589							=item C<8>
1590
1591							Time. Print to STDOUT the time taken to load all the patterns. This is
1592							nothing more than the difference between the time the object was
1593							instantiated and the time reduction was initiated.
1594
1595							# load=
1596
1597							Any lengthy computation performed in the client code will be reflected
1598							in this value. Another line will be printed after reduction is
1599							complete.
1600
1601							# reduce=
1602
1603							The above output lines will be changed to C and
1604							C if the internal state of the object is corrupted
1605							and the initial timestamp is lost.
1606
1607							The code attempts to load L in order to report fractional
1608							seconds. If this is not successful, the elapsed time is displayed
1609							in whole seconds.
1610
1611							=back
1612
1613							Values can be added (or or'ed together) to trace everything
1614
1615							$r->debug(7)->add( '\\d+abc' );
1616
1617							Calling C with no arguments turns debugging off.
1618
1619							=cut
1620
1621							sub debug {
1622	0			0	1	0	my $self = shift;
1623	0	0				0	$self->{debug} = defined($_[0]) ? $_[0] : 0;
1624	0	0				0	if ($self->_debug(DEBUG_TIME)) {
1625							# hmm, debugging time was switched on after instantiation
1626	0					0	$self->_init_time_func;
1627	0					0	$self->{_begin_time} = $self->{_time_func}->();
1628							}
1629	0					0	return $self;
1630							}
1631
1632							=item dump
1633
1634							Produces a synthetic view of the internal data structure. How
1635							to interpret the results is left as an exercise to the reader.
1636
1637							print $r->dump;
1638
1639							=cut
1640
1641							sub dump {
1642	0			0	1	0	return _dump($_[0]->_path);
1643							}
1644
1645							=item chomp(0\|1)
1646
1647							Turns chomping on or off.
1648
1649							IMPORTANT: As of version 0.24, chomping is now on by default as it
1650							makes C Just Work. The only time you may run into trouble
1651							is with C. So don't do that, or else explicitly turn
1652							off chomping.
1653
1654							To avoid incorporating (spurious)
1655							record separators (such as "\n" on Unix) when reading from a file,
1656							C Cs its input. If you don't want this to happen,
1657							call C with a false value.
1658
1659							$re->chomp(0); # really want the record separators
1660							$re->add();
1661
1662							=cut
1663
1664							sub chomp {
1665	0			0	1	0	my $self = shift;
1666	0	0				0	$self->{chomp} = defined($_[0]) ? $_[0] : 1;
1667	0					0	return $self;
1668							}
1669
1670							=item fold_meta_pairs(NUMBER)
1671
1672							Determines whether C<\s>, C<\S> and C<\w>, C<\W> and C<\d>, C<\D>
1673							are folded into a C<.> (dot). Folding happens by default (for
1674							reasons of backwards compatibility, even though it is wrong when
1675							the C expression modifier is active).
1676
1677							Call this method with a false value to prevent this behaviour (which
1678							is only a problem when dealing with C<\n> if the C expression
1679							modifier is also set).
1680
1681							$re->add( '\\w', '\\W' );
1682							my $clone = $re->clone;
1683
1684							$clone->fold_meta_pairs(0);
1685							print $clone->as_string; # prints '.'
1686							print $re->as_string; # print '[\W\w]'
1687
1688							=cut
1689
1690							sub fold_meta_pairs {
1691	0			0	1	0	my $self = shift;
1692	0	0				0	$self->{fold_meta_pairs} = defined($_[0]) ? $_[0] : 1;
1693	0					0	return $self;
1694							}
1695
1696							=item indent(NUMBER)
1697
1698							Sets the level of indent for pretty-printing nested groups
1699							within a pattern. See the C method for more details.
1700							When called without a parameter, no indenting is performed.
1701
1702							$re->indent( 4 );
1703							print $re->as_string;
1704
1705							=cut
1706
1707							sub indent {
1708	0			0	1	0	my $self = shift;
1709	0	0				0	$self->{indent} = defined($_[0]) ? $_[0] : 0;
1710	0					0	return $self;
1711							}
1712
1713							=item lookahead(0\|1)
1714
1715							Turns on zero-width lookahead assertions. This is usually
1716							beneficial when you expect that the pattern will usually fail.
1717							If you expect that the pattern will usually match you will
1718							probably be worse off.
1719
1720							=cut
1721
1722							sub lookahead {
1723	0			0	1	0	my $self = shift;
1724	0	0				0	$self->{lookahead} = defined($_[0]) ? $_[0] : 1;
1725	0					0	return $self;
1726							}
1727
1728							=item flags(STRING)
1729
1730							Sets the flags that govern how the pattern behaves (for
1731							versions of Perl up to 5.9 or so, these are C). By
1732							default no flags are enabled.
1733
1734
1735							=item modifiers(STRING)
1736
1737							An alias of the C method, for users familiar with
1738							C.
1739
1740							=cut
1741
1742							sub flags {
1743	0			0	1	0	my $self = shift;
1744	0	0				0	$self->{flags} = defined($_[0]) ? $_[0] : '';
1745	0					0	return $self;
1746							}
1747
1748							sub modifiers {
1749	0			0	1	0	my $self = shift;
1750	0					0	return $self->flags(@_);
1751							}
1752
1753							=item track(0\|1)
1754
1755							Turns tracking on or off. When this attribute is enabled,
1756							additional housekeeping information is inserted into the
1757							assembled expression using C<({...}> embedded code
1758							constructs. This provides the necessary information to
1759							determine which, of the original patterns added, was the
1760							one that caused the match.
1761
1762							$re->track( 1 );
1763							if( $target =~ /$re/ ) {
1764							print "$target matched by ", $re->matched, "\n";
1765							}
1766
1767							Note that when this functionality is enabled, no
1768							reduction is performed and no character classes are
1769							generated. In other words, C is not
1770							reduced down to C<(?:br\|t)ag> and C is not
1771							reduced to C.
1772
1773							=cut
1774
1775							sub track {
1776	0			0	1	0	my $self = shift;
1777	0	0				0	$self->{track} = defined($_[0]) ? $_[0] : 1;
1778	0					0	return $self;
1779							}
1780
1781							=item unroll_plus(0\|1)
1782
1783							Turns the unrolling of plus metacharacters on or off. When
1784							a pattern is broken up, C becomes C, C (and
1785							C becomes C, C. This may allow the freed C
1786							to assemble with other patterns. Not enabled by default.
1787
1788							=cut
1789
1790							sub unroll_plus {
1791	0			0	1	0	my $self = shift;
1792	0	0				0	$self->{unroll_plus} = defined($_[0]) ? $_[0] : 1;
1793	0					0	return $self;
1794							}
1795
1796							=item lex(SCALAR)
1797
1798							Change the pattern used to break a string apart into tokens.
1799							You can examine the C script as a starting point.
1800
1801							=cut
1802
1803							sub lex {
1804	0			0	1	0	my $self = shift;
1805	0					0	$self->{lex} = qr($_[0]);
1806	0					0	return $self;
1807							}
1808
1809							=item reduce(0\|1)
1810
1811							Turns pattern reduction on or off. A reduced pattern may
1812							be considerably shorter than an unreduced pattern. Consider
1813							C I C. An unreduced
1814							pattern will be very similar to those produced by
1815							C. Reduction is on by default. Turning
1816							it off speeds assembly (but assembly is pretty fast -- it's
1817							the breaking up of the initial patterns in the lexing stage
1818							that can consume a non-negligible amount of time).
1819
1820							=cut
1821
1822							sub reduce {
1823	0			0	1	0	my $self = shift;
1824	0	0				0	$self->{reduce} = defined($_[0]) ? $_[0] : 1;
1825	0					0	return $self;
1826							}
1827
1828							=item mutable(0\|1)
1829
1830							This method has been marked as DEPRECATED. It will be removed
1831							in a future release. See the C method for a technique
1832							to replace its functionality.
1833
1834							=cut
1835
1836							sub mutable {
1837	0			0	1	0	my $self = shift;
1838	0	0				0	$self->{mutable} = defined($_[0]) ? $_[0] : 1;
1839	0					0	return $self;
1840							}
1841
1842							=item reset
1843
1844							Empties out the patterns that have been Ced or C-ed
1845							into the object. Does not modify the state of controller attributes
1846							such as C, C, C and the like.
1847
1848							=cut
1849
1850							sub reset {
1851							# reinitialise the internal state of the object
1852	0			0	1	0	my $self = shift;
1853	0					0	$self->{path} = [];
1854	0					0	$self->{re} = undef;
1855	0					0	$self->{str} = undef;
1856	0					0	return $self;
1857							}
1858
1859							=item Default_Lexer
1860
1861							B the C function is a class method, not
1862							an object method. It is a fatal error to call it as an object
1863							method.
1864
1865							The C method lets you replace the default pattern
1866							used for all subsequently created C objects. It
1867							will not have any effect on existing objects. (It is also possible
1868							to override the lexer pattern used on a per-object basis).
1869
1870							The parameter should be an ordinary scalar, not a compiled
1871							pattern. If the pattern fails to match all parts of the string,
1872							the missing parts will be returned as single chunks. Therefore
1873							the following pattern is legal (albeit rather cork-brained):
1874
1875							Regexp::Assemble::Default_Lexer( '\\d' );
1876
1877							The above pattern will split up input strings digit by digit, and
1878							all non-digit characters as single chunks.
1879
1880							=cut
1881
1882							sub Default_Lexer {
1883	0	0		0	1	0	if( $_[0] ) {
1884	0	0				0	if( my $refname = ref($_[0]) ) {
1885	0					0	require Carp;
1886	0					0	Carp::croak("Cannot pass a $refname to Default_Lexer");
1887							}
1888	0					0	$Current_Lexer = $_[0];
1889							}
1890	0	0				0	return defined $Current_Lexer ? $Current_Lexer : $Default_Lexer;
1891							}
1892
1893							# --- no user serviceable parts below ---
1894
1895							# -- debug helpers
1896
1897							sub _debug {
1898	17			17		38	my $self = shift;
1899	17	50				178	return $self->{debug} & shift() ? 1 : 0;
1900							}
1901
1902							# -- helpers
1903
1904							sub _path {
1905							# access the path
1906	10			10		99	return $_[0]->{path};
1907							}
1908
1909							# -- the heart of the matter
1910
1911							$have_Storable = do {
1912							eval {
1913							require Storable;
1914							import Storable 'dclone';
1915							};
1916							$@ ? 0 : 1;
1917							};
1918
1919							sub _path_clone {
1920	0	0		0		0	$have_Storable ? dclone($_[0]) : _path_copy($_[0]);
1921							}
1922
1923							sub _path_copy {
1924	0			0		0	my $path = shift;
1925	0					0	my $new = [];
1926	0					0	for( my $p = 0; $p < @$path; ++$p ) {
1927	0	0				0	if( ref($path->[$p]) eq 'HASH' ) {
		0
1928	0					0	push @$new, _node_copy($path->[$p]);
1929							}
1930							elsif( ref($path->[$p]) eq 'ARRAY' ) {
1931	0					0	push @$new, _path_copy($path->[$p]);
1932							}
1933							else {
1934	0					0	push @$new, $path->[$p];
1935							}
1936							}
1937	0					0	return $new;
1938							}
1939
1940							sub _node_copy {
1941	0			0		0	my $node = shift;
1942	0					0	my $new = {};
1943	0					0	while( my( $k, $v ) = each %$node ) {
1944	0	0				0	$new->{$k} = defined($v)
1945							? _path_copy($v)
1946							: undef
1947							;
1948							}
1949	0					0	return $new;
1950							}
1951
1952							sub _insert_path {
1953	4			4		10	my $self = shift;
1954	4					10	my $list = shift;
1955	4					30	my $debug = shift;
1956	4					8	my @in = @{shift()}; # create a new copy
	4					15
1957	4	50				14	if( @$list == 0 ) { # special case the first time
1958	4	50	0			36	if( @in == 0 or (@in == 1 and (not defined $in[0] or $in[0] eq ''))) {
			33
			33
1959	0					0	return [{'' => undef}];
1960							}
1961							else {
1962	4					27	return \@in;
1963							}
1964							}
1965	0	0				0	$debug and print "# _insert_path @{[_dump(\@in)]} into @{[_dump($list)]}\n";
	0					0
	0					0
1966	0					0	my $path = $list;
1967	0					0	my $offset = 0;
1968	0					0	my $token;
1969	0	0				0	if( not @in ) {
1970	0	0				0	if( ref($list->[0]) ne 'HASH' ) {
1971	0					0	return [ { '' => undef, $list->[0] => $list } ];
1972							}
1973							else {
1974	0					0	$list->[0]{''} = undef;
1975	0					0	return $list;
1976							}
1977							}
1978	0					0	while( defined( $token = shift @in )) {
1979	0	0				0	if( ref($token) eq 'HASH' ) {
1980	0	0				0	$debug and print "# p0=", _dump($path), "\n";
1981	0					0	$path = $self->_insert_node( $path, $offset, $token, $debug, @in );
1982	0	0				0	$debug and print "# p1=", _dump($path), "\n";
1983	0					0	last;
1984							}
1985	0	0				0	if( ref($path->[$offset]) eq 'HASH' ) {
1986	0	0				0	$debug and print "# at (off=$offset len=@{[scalar @$path]}) ", _dump($path->[$offset]), "\n";
	0					0
1987	0					0	my $node = $path->[$offset];
1988	0	0				0	if( exists( $node->{$token} )) {
1989	0	0				0	if ($offset < $#$path) {
1990							my $new = {
1991							$token => [$token, @in],
1992	0					0	_re_path($self, [$node]) => [@{$path}[$offset..$#$path]],
	0					0
1993							};
1994	0					0	splice @$path, $offset, @$path-$offset, $new;
1995	0					0	last;
1996							}
1997							else {
1998	0	0				0	$debug and print "# descend key=$token @{[_dump($node->{$token})]}\n";
	0					0
1999	0					0	$path = $node->{$token};
2000	0					0	$offset = 0;
2001	0					0	redo;
2002							}
2003							}
2004							else {
2005	0	0				0	$debug and print "# add path ($token:@{[_dump(\@in)]}) into @{[_dump($path)]} at off=$offset to end=@{[scalar $#$path]}\n";
	0					0
	0					0
	0					0
2006	0	0				0	if( $offset == $#$path ) {
2007	0					0	$node->{$token} = [ $token, @in ];
2008							}
2009							else {
2010							my $new = {
2011							_node_key($token) => [ $token, @in ],
2012	0					0	_node_key($node) => [@{$path}[$offset..$#{$path}]],
	0					0
	0					0
2013							};
2014	0					0	splice( @$path, $offset, @$path - $offset, $new );
2015	0	0				0	$debug and print "# fused node=@{[_dump($new)]} path=@{[_dump($path)]}\n";
	0					0
	0					0
2016							}
2017	0					0	last;
2018							}
2019							}
2020
2021	0	0				0	if( $debug ) {
2022	0					0	my $msg = '';
2023	0					0	my $n;
2024	0					0	for( $n = 0; $n < @$path; ++$n ) {
2025	0	0				0	$msg .= ' ' if $n;
2026							my $atom = ref($path->[$n]) eq 'HASH'
2027	0	0				0	? '{'.join( ' ', keys(%{$path->[$n]})).'}'
	0					0
2028							: $path->[$n]
2029							;
2030	0	0				0	$msg .= $n == $offset ? "<$atom>" : $atom;
2031							}
2032	0					0	print "# at path ($msg)\n";
2033							}
2034
2035	0	0				0	if( $offset >= @$path ) {
		0
		0
2036	0					0	push @$path, { $token => [ $token, @in ], '' => undef };
2037	0	0				0	$debug and print "# added remaining @{[_dump($path)]}\n";
	0					0
2038	0					0	last;
2039							}
2040							elsif( $token ne $path->[$offset] ) {
2041	0	0				0	$debug and print "# token $token not present\n";
2042							splice @$path, $offset, @$path-$offset, {
2043							length $token
2044							? ( _node_key($token) => [$token, @in])
2045							: ( '' => undef )
2046							,
2047	0	0				0	$path->[$offset] => [@{$path}[$offset..$#{$path}]],
	0					0
	0					0
2048							};
2049	0	0				0	$debug and print "# path=@{[_dump($path)]}\n";
	0					0
2050	0					0	last;
2051							}
2052							elsif( not @in ) {
2053	0	0				0	$debug and print "# last token to add\n";
2054	0	0				0	if( defined( $path->[$offset+1] )) {
2055	0					0	++$offset;
2056	0	0				0	if( ref($path->[$offset]) eq 'HASH' ) {
2057	0	0				0	$debug and print "# add sentinel to node\n";
2058	0					0	$path->[$offset]{''} = undef;
2059							}
2060							else {
2061	0	0				0	$debug and print "# convert <$path->[$offset]> to node for sentinel\n";
2062							splice @$path, $offset, @$path-$offset, {
2063							'' => undef,
2064	0					0	$path->[$offset] => [ @{$path}[$offset..$#{$path}] ],
	0					0
	0					0
2065							};
2066							}
2067							}
2068							else {
2069							# already seen this pattern
2070	0					0	++$self->{stats_dup};
2071							}
2072	0					0	last;
2073							}
2074							# if we get here then @_ still contains a token
2075	0					0	++$offset;
2076							}
2077	0					0	$list;
2078							}
2079
2080							sub _insert_node {
2081	0			0		0	my $self = shift;
2082	0					0	my $path = shift;
2083	0					0	my $offset = shift;
2084	0					0	my $token = shift;
2085	0					0	my $debug = shift;
2086	0					0	my $path_end = [@{$path}[$offset..$#{$path}]];
	0					0
	0					0
2087							# NB: $path->[$offset] and $[path_end->[0] are equivalent
2088	0					0	my $token_key = _re_path($self, [$token]);
2089	0	0				0	$debug and print "# insert node(@{[_dump($token)]}:@{[_dump(\@_)]}) (key=$token_key)",
	0					0
	0					0
2090	0					0	" at path=@{[_dump($path_end)]}\n";
2091	0	0				0	if( ref($path_end->[0]) eq 'HASH' ) {
2092	0	0				0	if( exists($path_end->[0]{$token_key}) ) {
		0
2093	0	0				0	if( @$path_end > 1 ) {
2094	0					0	my $path_key = _re_path($self, [$path_end->[0]]);
2095	0					0	my $new = {
2096							$path_key => [ @$path_end ],
2097							$token_key => [ $token, @_ ],
2098							};
2099	0	0				0	$debug and print "# +bifurcate new=@{[_dump($new)]}\n";
	0					0
2100	0					0	splice( @$path, $offset, @$path_end, $new );
2101							}
2102							else {
2103	0					0	my $old_path = $path_end->[0]{$token_key};
2104	0					0	my $new_path = [];
2105	0		0			0	while( @$old_path and _node_eq( $old_path->[0], $token )) {
2106	0	0				0	$debug and print "# identical nodes in sub_path ",
		0
2107							ref($token) ? _dump($token) : $token, "\n";
2108	0					0	push @$new_path, shift(@$old_path);
2109	0					0	$token = shift @_;
2110							}
2111	0	0				0	if( @$new_path ) {
2112	0					0	my $new;
2113	0					0	my $token_key = $token;
2114	0	0				0	if( @_ ) {
2115	0					0	$new = {
2116							_re_path($self, $old_path) => $old_path,
2117							$token_key => [$token, @_],
2118							};
2119	0	0				0	$debug and print "# insert_node(bifurc) n=@{[_dump([$new])]}\n";
	0					0
2120							}
2121							else {
2122	0	0				0	$debug and print "# insert $token into old path @{[_dump($old_path)]}\n";
	0					0
2123	0	0				0	if( @$old_path ) {
2124	0					0	$new = ($self->_insert_path( $old_path, $debug, [$token] ))->[0];
2125							}
2126							else {
2127	0					0	$new = { '' => undef, $token => [$token] };
2128							}
2129							}
2130	0					0	push @$new_path, $new;
2131							}
2132	0					0	$path_end->[0]{$token_key} = $new_path;
2133	0	0				0	$debug and print "# +_insert_node result=@{[_dump($path_end)]}\n";
	0					0
2134	0					0	splice( @$path, $offset, @$path_end, @$path_end );
2135							}
2136							}
2137							elsif( not _node_eq( $path_end->[0], $token )) {
2138	0	0				0	if( @$path_end > 1 ) {
2139	0					0	my $path_key = _re_path($self, [$path_end->[0]]);
2140	0					0	my $new = {
2141							$path_key => [ @$path_end ],
2142							$token_key => [ $token, @_ ],
2143							};
2144	0	0				0	$debug and print "# path->node1 at $path_key/$token_key @{[_dump($new)]}\n";
	0					0
2145	0					0	splice( @$path, $offset, @$path_end, $new );
2146							}
2147							else {
2148	0	0				0	$debug and print "# next in path is node, trivial insert at $token_key\n";
2149	0					0	$path_end->[0]{$token_key} = [$token, @_];
2150	0					0	splice( @$path, $offset, @$path_end, @$path_end );
2151							}
2152							}
2153							else {
2154	0		0			0	while( @$path_end and _node_eq( $path_end->[0], $token )) {
2155	0	0				0	$debug and print "# identical nodes @{[_dump([$token])]}\n";
	0					0
2156	0					0	shift @$path_end;
2157	0					0	$token = shift @_;
2158	0					0	++$offset;
2159							}
2160	0	0				0	if( @$path_end ) {
2161	0	0				0	$debug and print "# insert at $offset $token:@{[_dump(\@_)]} into @{[_dump($path_end)]}\n";
	0					0
	0					0
2162	0					0	$path_end = $self->_insert_path( $path_end, $debug, [$token, @_] );
2163	0	0				0	$debug and print "# got off=$offset s=@{[scalar @_]} path_add=@{[_dump($path_end)]}\n";
	0					0
	0					0
2164	0					0	splice( @$path, $offset, @$path - $offset, @$path_end );
2165	0	0				0	$debug and print "# got final=@{[_dump($path)]}\n";
	0					0
2166							}
2167							else {
2168	0					0	$token_key = _node_key($token);
2169	0					0	my $new = {
2170							'' => undef,
2171							$token_key => [ $token, @_ ],
2172							};
2173	0	0				0	$debug and print "# convert opt @{[_dump($new)]}\n";
	0					0
2174	0					0	push @$path, $new;
2175							}
2176							}
2177							}
2178							else {
2179	0	0				0	if( @$path_end ) {
2180	0					0	my $new = {
2181							$path_end->[0] => [ @$path_end ],
2182							$token_key => [ $token, @_ ],
2183							};
2184	0	0				0	$debug and print "# atom->node @{[_dump($new)]}\n";
	0					0
2185	0					0	splice( @$path, $offset, @$path_end, $new );
2186	0	0				0	$debug and print "# out=@{[_dump($path)]}\n";
	0					0
2187							}
2188							else {
2189	0	0				0	$debug and print "# add opt @{[_dump([$token,@_])]} via $token_key\n";
	0					0
2190	0					0	push @$path, {
2191							'' => undef,
2192							$token_key => [ $token, @_ ],
2193							};
2194							}
2195							}
2196	0					0	$path;
2197							}
2198
2199							sub _reduce {
2200	3			3		19	my $self = shift;
2201	3					47	my $context = { debug => $self->_debug(DEBUG_TAIL), depth => 0 };
2202
2203	3	50				21	if ($self->_debug(DEBUG_TIME)) {
2204	0					0	$self->_init_time_func;
2205	0					0	my $now = $self->{_time_func}->();
2206	0	0				0	if (exists $self->{_begin_time}) {
2207	0					0	printf "# load=%0.6f\n", $now - $self->{_begin_time};
2208							}
2209							else {
2210	0					0	printf "# load-epoch=%0.6f\n", $now;
2211							}
2212	0					0	$self->{_begin_time} = $self->{_time_func}->();
2213							}
2214
2215	3					18	my ($head, $tail) = _reduce_path( $self->_path, $context );
2216	3	50				14	$context->{debug} and print "# final head=", _dump($head), ' tail=', _dump($tail), "\n";
2217	3	50				15	if( !@$head ) {
2218	3					14	$self->{path} = $tail;
2219							}
2220							else {
2221							$self->{path} = [
2222	0					0	@{_unrev_path( $tail, $context )},
2223	0					0	@{_unrev_path( $head, $context )},
	0					0
2224							];
2225							}
2226
2227	3	50				20	if ($self->_debug(DEBUG_TIME)) {
2228	0					0	my $now = $self->{_time_func}->();
2229	0	0				0	if (exists $self->{_begin_time}) {
2230	0					0	printf "# reduce=%0.6f\n", $now - $self->{_begin_time};
2231							}
2232							else {
2233	0					0	printf "# reduce-epoch=%0.6f\n", $now;
2234							}
2235	0					0	$self->{_begin_time} = $self->{_time_func}->();
2236							}
2237
2238	3	50				10	$context->{debug} and print "# final path=", _dump($self->{path}), "\n";
2239	3					9	return $self;
2240							}
2241
2242							sub _remove_optional {
2243	0	0		0		0	if( exists $_[0]->{''} ) {
2244	0					0	delete $_[0]->{''};
2245	0					0	return 1;
2246							}
2247	0					0	return 0;
2248							}
2249
2250							sub _reduce_path {
2251	3			3		18	my ($path, $ctx) = @_;
2252	3					24	my $indent = ' ' x $ctx->{depth};
2253	3					8	my $debug = $ctx->{debug};
2254	3	50				14	$debug and print "#$indent _reduce_path $ctx->{depth} ", _dump($path), "\n";
2255	3					9	my $new;
2256	3					12	my $head = [];
2257	3					12	my $tail = [];
2258	3					25	while( defined( my $p = pop @$path )) {
2259	22	50				44	if( ref($p) eq 'HASH' ) {
2260	0					0	my ($node_head, $node_tail) = _reduce_node($p, _descend($ctx) );
2261	0	0				0	$debug and print "#$indent\| head=", _dump($node_head), " tail=", _dump($node_tail), "\n";
2262	0	0				0	push @$head, @$node_head if scalar @$node_head;
2263	0	0				0	push @$tail, ref($node_tail) eq 'HASH' ? $node_tail : @$node_tail;
2264							}
2265							else {
2266	22	50				41	if( @$head ) {
2267	0	0				0	$debug and print "#$indent\| push $p leaves @{[_dump($path)]}\n";
	0					0
2268	0					0	push @$tail, $p;
2269							}
2270							else {
2271	22	50				47	$debug and print "#$indent\| unshift $p\n";
2272	22					97	unshift @$tail, $p;
2273							}
2274							}
2275							}
2276	0					0	$debug and print "#$indent\| tail nr=@{[scalar @$tail]} t0=", ref($tail->[0]),
2277	3	0				13	(ref($tail->[0]) eq 'HASH' ? " n=" . scalar(keys %{$tail->[0]}) : '' ),
	0	50				0
2278							"\n";
2279	3	50	33			47	if( @$tail > 1
			33
2280							and ref($tail->[0]) eq 'HASH'
2281	0					0	and keys %{$tail->[0]} == 2
2282							) {
2283	0					0	my $opt;
2284							my $fixed;
2285	0					0	while( my ($key, $path) = each %{$tail->[0]} ) {
	0					0
2286	0	0				0	$debug and print "#$indent\| scan k=$key p=@{[_dump($path)]}\n";
	0					0
2287	0	0				0	next unless $path;
2288	0	0	0			0	if (@$path == 1 and ref($path->[0]) eq 'HASH') {
2289	0					0	$opt = $path->[0];
2290							}
2291							else {
2292	0					0	$fixed = $path;
2293							}
2294							}
2295	0	0				0	if( exists $tail->[0]{''} ) {
2296	0					0	my $path = [@{$tail}[1..$#{$tail}]];
	0					0
	0					0
2297	0					0	$tail = $tail->[0];
2298	0					0	($head, $tail, $path) = _slide_tail( $head, $tail, $path, _descend($ctx) );
2299	0					0	$tail = [$tail, @$path];
2300							}
2301							}
2302	3	50				10	$debug and print "#$indent _reduce_path $ctx->{depth} out head=", _dump($head), ' tail=', _dump($tail), "\n";
2303	3					21	return ($head, $tail);
2304							}
2305
2306							sub _reduce_node {
2307	0			0		0	my ($node, $ctx) = @_;
2308	0					0	my $indent = ' ' x $ctx->{depth};
2309	0					0	my $debug = $ctx->{debug};
2310	0					0	my $optional = _remove_optional($node);
2311	0	0				0	$debug and print "#$indent _reduce_node $ctx->{depth} in @{[_dump($node)]} opt=$optional\n";
	0					0
2312	0	0	0			0	if( $optional and scalar keys %$node == 1 ) {
2313	0					0	my $path = (values %$node)[0];
2314	0	0				0	if( not grep { ref($_) eq 'HASH' } @$path ) {
	0					0
2315							# if we have removed an optional, and there is only one path
2316							# left then there is nothing left to compare. Because of the
2317							# optional it cannot participate in any further reductions.
2318							# (unless we test for equality among sub-trees).
2319	0					0	my $result = {
2320							'' => undef,
2321							$path->[0] => $path
2322							};
2323	0	0				0	$debug and print "#$indent\| fast fail @{[_dump($result)]}\n";
	0					0
2324	0					0	return [], $result;
2325							}
2326							}
2327
2328	0					0	my( $fail, $reduce ) = _scan_node( $node, _descend($ctx) );
2329
2330	0	0				0	$debug and print "#$indent\|_scan_node done opt=$optional reduce=@{[_dump($reduce)]} fail=@{[_dump($fail)]}\n";
	0					0
	0					0
2331
2332							# We now perform tail reduction on each of the nodes in the reduce
2333							# hash. If we have only one key, we know we will have a successful
2334							# reduction (since everything that was inserted into the node based
2335							# on the value of the last token of each path all mapped to the same
2336							# value).
2337
2338	0	0	0			0	if( @$fail == 0 and keys %$reduce == 1 and not $optional) {
			0
2339							# every path shares a common path
2340	0					0	my $path = (values %$reduce)[0];
2341	0					0	my ($common, $tail) = _do_reduce( $path, _descend($ctx) );
2342	0	0				0	$debug and print "#$indent\|_reduce_node $ctx->{depth} common=@{[_dump($common)]} tail=", _dump($tail), "\n";
	0					0
2343	0					0	return( $common, $tail );
2344							}
2345
2346							# this node resulted in a list of paths, game over
2347	0					0	$ctx->{indent} = $indent;
2348	0					0	return _reduce_fail( $reduce, $fail, $optional, _descend($ctx) );
2349							}
2350
2351							sub _reduce_fail {
2352	0			0		0	my( $reduce, $fail, $optional, $ctx ) = @_;
2353	0					0	my( $debug, $depth, $indent ) = @{$ctx}{qw(debug depth indent)};
	0					0
2354	0					0	my %result;
2355	0	0				0	$result{''} = undef if $optional;
2356	0					0	my $p;
2357	0					0	for $p (keys %$reduce) {
2358	0					0	my $path = $reduce->{$p};
2359	0	0				0	if( scalar @$path == 1 ) {
2360	0					0	$path = $path->[0];
2361	0	0				0	$debug and print "#$indent\| -simple opt=$optional unrev @{[_dump($path)]}\n";
	0					0
2362	0					0	$path = _unrev_path($path, _descend($ctx) );
2363	0					0	$result{_node_key($path->[0])} = $path;
2364							}
2365							else {
2366	0	0				0	$debug and print "#$indent\| _do_reduce(@{[_dump($path)]})\n";
	0					0
2367	0					0	my ($common, $tail) = _do_reduce( $path, _descend($ctx) );
2368							$path = [
2369							(
2370							ref($tail) eq 'HASH'
2371							? _unrev_node($tail, _descend($ctx) )
2372							: _unrev_path($tail, _descend($ctx) )
2373							),
2374	0	0				0	@{_unrev_path($common, _descend($ctx) )}
	0					0
2375							];
2376	0	0				0	$debug and print "#$indent\| +reduced @{[_dump($path)]}\n";
	0					0
2377	0					0	$result{_node_key($path->[0])} = $path;
2378							}
2379							}
2380	0					0	my $f;
2381	0					0	for $f( @$fail ) {
2382	0	0				0	$debug and print "#$indent\| +fail @{[_dump($f)]}\n";
	0					0
2383	0					0	$result{$f->[0]} = $f;
2384							}
2385	0	0				0	$debug and print "#$indent _reduce_fail $depth fail=@{[_dump(\%result)]}\n";
	0					0
2386	0					0	return ( [], \%result );
2387							}
2388
2389							sub _scan_node {
2390	0			0		0	my( $node, $ctx ) = @_;
2391	0					0	my $indent = ' ' x $ctx->{depth};
2392	0					0	my $debug = $ctx->{debug};
2393
2394							# For all the paths in the node, reverse them. If the first token
2395							# of the path is a scalar, push it onto an array in a hash keyed by
2396							# the value of the scalar.
2397							#
2398							# If it is a node, call _reduce_node on this node beforehand. If we
2399							# get back a common head, all of the paths in the subnode shared a
2400							# common tail. We then store the common part and the remaining node
2401							# of paths (which is where the paths diverged from the end and install
2402							# this into the same hash. At this point both the common and the tail
2403							# are in reverse order, just as simple scalar paths are.
2404							#
2405							# On the other hand, if there were no common path returned then all
2406							# the paths of the sub-node diverge at the end character. In this
2407							# case the tail cannot participate in any further reductions and will
2408							# appear in forward order.
2409							#
2410							# certainly the hurgliest function in the whole file :(
2411
2412							# $debug = 1 if $depth >= 8;
2413	0					0	my @fail;
2414							my %reduce;
2415
2416	0					0	my $n;
2417	0					0	for $n(
2418	0					0	map { substr($_, index($_, '#')+1) }
2419							sort
2420							map {
2421							join( '\|' =>
2422	0					0	scalar(grep {ref($_) eq 'HASH'} @{$node->{$_}}),
	0					0
2423							_node_offset($node->{$_}),
2424	0					0	scalar @{$node->{$_}},
	0					0
2425							)
2426							. "#$_"
2427							}
2428							keys %$node ) {
2429	0					0	my( $end, @path ) = reverse @{$node->{$n}};
	0					0
2430	0	0				0	if( ref($end) ne 'HASH' ) {
2431	0	0				0	$debug and print "# $indent\|_scan_node push reduce ($end:@{[_dump(\@path)]})\n";
	0					0
2432	0					0	push @{$reduce{$end}}, [ $end, @path ];
	0					0
2433							}
2434							else {
2435	0	0				0	$debug and print "# $indent\|_scan_node head=", _dump(\@path), ' tail=', _dump($end), "\n";
2436	0					0	my $new_path;
2437							# deal with sing, singing => s(?:ing)?ing
2438	0	0	0			0	if( keys %$end == 2 and exists $end->{''} ) {
2439	0					0	my ($key, $opt_path) = each %$end;
2440	0	0				0	($key, $opt_path) = each %$end if $key eq '';
2441	0					0	$opt_path = [reverse @{$opt_path}];
	0					0
2442	0	0				0	$debug and print "# $indent\| check=", _dump($opt_path), "\n";
2443	0					0	my $end = { '' => undef, $opt_path->[0] => [@$opt_path] };
2444	0					0	my $head = [];
2445	0					0	my $path = [@path];
2446	0					0	($head, my $slide, $path) = _slide_tail( $head, $end, $path, $ctx );
2447	0	0				0	if( @$head ) {
2448	0					0	$new_path = [ @$head, $slide, @$path ];
2449							}
2450							}
2451	0	0				0	if( $new_path ) {
2452	0	0				0	$debug and print "# $indent\|_scan_node slid=", _dump($new_path), "\n";
2453	0					0	push @{$reduce{$new_path->[0]}}, $new_path;
	0					0
2454							}
2455							else {
2456	0					0	my( $common, $tail ) = _reduce_node( $end, _descend($ctx) );
2457	0	0				0	if( not @$common ) {
2458	0	0				0	$debug and print "# $indent\| +failed $n\n";
2459	0					0	push @fail, [reverse(@path), $tail];
2460							}
2461							else {
2462	0					0	my $path = [@path];
2463	0	0				0	$debug and print "# $indent\|_scan_node ++recovered common=@{[_dump($common)]} tail=",
	0					0
2464	0					0	_dump($tail), " path=@{[_dump($path)]}\n";
2465	0	0	0			0	if( ref($tail) eq 'HASH'
2466							and keys %$tail == 2
2467							) {
2468	0	0				0	if( exists $tail->{''} ) {
2469	0					0	($common, $tail, $path) = _slide_tail( $common, $tail, $path, $ctx );
2470							}
2471							}
2472	0	0				0	push @{$reduce{$common->[0]}}, [
	0					0
2473							@$common,
2474							(ref($tail) eq 'HASH' ? $tail : @$tail ),
2475							@$path
2476							];
2477							}
2478							}
2479							}
2480							}
2481	0	0				0	$debug and print
2482	0					0	"# $indent\|_scan_node counts: reduce=@{[scalar keys %reduce]} fail=@{[scalar @fail]}\n";
	0					0
2483	0					0	return( \@fail, \%reduce );
2484							}
2485
2486							sub _do_reduce {
2487	0			0		0	my ($path, $ctx) = @_;
2488	0					0	my $indent = ' ' x $ctx->{depth};
2489	0					0	my $debug = $ctx->{debug};
2490	0					0	my $ra = CPANPLUS::YACSmoke::ReAssemble->new(chomp=>0);
2491	0					0	$ra->debug($debug);
2492	0	0				0	$debug and print "# $indent\| do @{[_dump($path)]}\n";
	0					0
2493	0					0	$ra->_insertr( $_ ) for
2494							# When nodes come into the picture, we have to be careful
2495							# about how we insert the paths into the assembly.
2496							# Paths with nodes first, then closest node to front
2497							# then shortest path. Merely because if we can control
2498							# order in which paths containing nodes get inserted,
2499							# then we can make a couple of assumptions that simplify
2500							# the code in _insert_node.
2501							sort {
2502	0					0	scalar(grep {ref($_) eq 'HASH'} @$a)
2503	0	0	0			0	<=> scalar(grep {ref($_) eq 'HASH'} @$b)
	0					0
2504							\|\|
2505							_node_offset($b) <=> _node_offset($a)
2506							\|\|
2507							scalar @$a <=> scalar @$b
2508							}
2509							@$path
2510							;
2511	0					0	$path = $ra->_path;
2512	0					0	my $common = [];
2513	0					0	push @$common, shift @$path while( ref($path->[0]) ne 'HASH' );
2514	0	0				0	my $tail = scalar( @$path ) > 1 ? [@$path] : $path->[0];
2515	0	0				0	$debug and print "# $indent\| _do_reduce common=@{[_dump($common)]} tail=@{[_dump($tail)]}\n";
	0					0
	0					0
2516	0					0	return ($common, $tail);
2517							}
2518
2519							sub _node_offset {
2520							# return the offset that the first node is found, or -ve
2521							# optimised for speed
2522	0			0		0	my $nr = @{$_[0]};
	0					0
2523	0					0	my $atom = -1;
2524	0		0			0	ref($_[0]->[$atom]) eq 'HASH' and return $atom while ++$atom < $nr;
2525	0					0	return -1;
2526							}
2527
2528							sub _slide_tail {
2529	0			0		0	my $head = shift;
2530	0					0	my $tail = shift;
2531	0					0	my $path = shift;
2532	0					0	my $ctx = shift;
2533	0					0	my $indent = ' ' x $ctx->{depth};
2534	0					0	my $debug = $ctx->{debug};
2535	0	0				0	$debug and print "# $indent\| slide in h=", _dump($head),
2536							' t=', _dump($tail), ' p=', _dump($path), "\n";
2537	0					0	my $slide_path = (each %$tail)[-1];
2538	0	0				0	$slide_path = (each %$tail)[-1] unless defined $slide_path;
2539	0	0				0	$debug and print "# $indent\| slide potential ", _dump($slide_path), " over ", _dump($path), "\n";
2540	0		0			0	while( defined $path->[0] and $path->[0] eq $slide_path->[0] ) {
2541	0	0				0	$debug and print "# $indent\| slide=tail=$slide_path->[0]\n";
2542	0					0	my $slide = shift @$path;
2543	0					0	shift @$slide_path;
2544	0					0	push @$slide_path, $slide;
2545	0					0	push @$head, $slide;
2546							}
2547	0	0				0	$debug and print "# $indent\| slide path ", _dump($slide_path), "\n";
2548	0					0	my $slide_node = {
2549							'' => undef,
2550							_node_key($slide_path->[0]) => $slide_path,
2551							};
2552	0	0				0	$debug and print "# $indent\| slide out h=", _dump($head),
2553							' s=', _dump($slide_node), ' p=', _dump($path), "\n";
2554	0					0	return ($head, $slide_node, $path);
2555							}
2556
2557							sub _unrev_path {
2558	0			0		0	my ($path, $ctx) = @_;
2559	0					0	my $indent = ' ' x $ctx->{depth};
2560	0					0	my $debug = $ctx->{debug};
2561	0					0	my $new;
2562	0	0				0	if( not grep { ref($_) } @$path ) {
	0					0
2563	0	0				0	$debug and print "# ${indent}_unrev path fast ", _dump($path);
2564	0					0	$new = [reverse @$path];
2565	0	0				0	$debug and print "# -> ", _dump($new), "\n";
2566	0					0	return $new;
2567							}
2568	0	0				0	$debug and print "# ${indent}unrev path in ", _dump($path), "\n";
2569	0					0	while( defined( my $p = pop @$path )) {
2570	0	0				0	push @$new,
		0
2571							ref($p) eq 'HASH' ? _unrev_node($p, _descend($ctx) )
2572							: ref($p) eq 'ARRAY' ? _unrev_path($p, _descend($ctx) )
2573							: $p
2574							;
2575							}
2576	0	0				0	$debug and print "# ${indent}unrev path out ", _dump($new), "\n";
2577	0					0	return $new;
2578							}
2579
2580							sub _unrev_node {
2581	0			0		0	my ($node, $ctx ) = @_;
2582	0					0	my $indent = ' ' x $ctx->{depth};
2583	0					0	my $debug = $ctx->{debug};
2584	0					0	my $optional = _remove_optional($node);
2585	0	0				0	$debug and print "# ${indent}unrev node in ", _dump($node), " opt=$optional\n";
2586	0					0	my $new;
2587	0	0				0	$new->{''} = undef if $optional;
2588	0					0	my $n;
2589	0					0	for $n( keys %$node ) {
2590	0					0	my $path = _unrev_path($node->{$n}, _descend($ctx) );
2591	0					0	$new->{_node_key($path->[0])} = $path;
2592							}
2593	0	0				0	$debug and print "# ${indent}unrev node out ", _dump($new), "\n";
2594	0					0	return $new;
2595							}
2596
2597							sub _node_key {
2598	0			0		0	my $node = shift;
2599	0	0				0	return _node_key($node->[0]) if ref($node) eq 'ARRAY';
2600	0	0				0	return $node unless ref($node) eq 'HASH';
2601	0					0	my $key = '';
2602	0					0	my $k;
2603	0					0	for $k( keys %$node ) {
2604	0	0				0	next if $k eq '';
2605	0	0	0			0	$key = $k if $key eq '' or $key gt $k;
2606							}
2607	0					0	return $key;
2608							}
2609
2610							sub _descend {
2611							# Take a context object, and increase the depth by one.
2612							# By creating a fresh hash each time, we don't have to
2613							# bother adding make-work code to decrease the depth
2614							# when we return from what we called.
2615	0			0		0	my $ctx = shift;
2616	0					0	return {%$ctx, depth => $ctx->{depth}+1};
2617							}
2618
2619							#####################################################################
2620
2621							sub _make_class {
2622	0			0		0	my $self = shift;
2623	0					0	my %set = map { ($_,1) } @_;
	0					0
2624	0	0				0	delete $set{'\\d'} if exists $set{'\\w'};
2625	0	0				0	delete $set{'\\D'} if exists $set{'\\W'};
2626							return '.' if exists $set{'.'}
2627							or ($self->{fold_meta_pairs} and (
2628							(exists $set{'\\d'} and exists $set{'\\D'})
2629							or (exists $set{'\\s'} and exists $set{'\\S'})
2630	0	0	0			0	or (exists $set{'\\w'} and exists $set{'\\W'})
			0
			0
2631							))
2632							;
2633	0					0	for my $meta( q/\\d/, q/\\D/, q/\\s/, q/\\S/, q/\\w/, q/\\W/ ) {
2634	0	0				0	if( exists $set{$meta} ) {
2635	0					0	my $re = qr/$meta/;
2636	0					0	my @delete;
2637	0		0			0	$_ =~ /^$re$/ and push @delete, $_ for keys %set;
2638	0	0				0	delete @set{@delete} if @delete;
2639							}
2640							}
2641	0	0				0	return (keys %set)[0] if keys %set == 1;
2642	0					0	for my $meta( '.', '+', '*', '?', '(', ')', '^', '@', '$', '[', '/', ) {
2643	0	0				0	exists $set{"\\$meta"} and $set{$meta} = delete $set{"\\$meta"};
2644							}
2645	0	0				0	my $dash = exists $set{'-'} ? do { delete($set{'-'}), '-' } : '';
	0					0
2646	0	0				0	my $caret = exists $set{'^'} ? do { delete($set{'^'}), '^' } : '';
	0					0
2647	0					0	my $class = join( '' => sort keys %set );
2648	0	0	0			0	$class =~ s/0123456789/\\d/ and $class eq '\\d' and return $class;
2649	0					0	return "[$dash$class$caret]";
2650							}
2651
2652							sub _re_sort {
2653	0		0	0		0	return length $b <=> length $a \|\| $a cmp $b
2654							}
2655
2656							sub _combine {
2657	0			0		0	my $self = shift;
2658	0					0	my $type = shift;
2659							# print "c in = @{[_dump(\@_)]}\n";
2660							# my $combine =
2661							return '('
2662							. $type
2663	0					0	. do {
2664	0					0	my( @short, @long );
2665	0	0				0	push @{ /^$Single_Char$/ ? \@short : \@long}, $_ for @_;
	0					0
2666	0	0				0	if( @short == 1 ) {
		0
2667	0					0	@long = sort _re_sort @long, @short;
2668							}
2669							elsif( @short > 1 ) {
2670							# yucky but true
2671	0					0	my @combine = (_make_class($self, @short), sort _re_sort @long);
2672	0					0	@long = @combine;
2673							}
2674							else {
2675	0					0	@long = sort _re_sort @long;
2676							}
2677	0					0	join( '\|', @long );
2678							}
2679							. ')';
2680							# print "combine <$combine>\n";
2681							# $combine;
2682							}
2683
2684							sub _combine_new {
2685	0			0		0	my $self = shift;
2686	0					0	my( @short, @long );
2687	0	0				0	push @{ /^$Single_Char$/ ? \@short : \@long}, $_ for @_;
	0					0
2688	0	0	0			0	if( @short == 1 and @long == 0 ) {
		0	0
2689	0					0	return $short[0];
2690							}
2691							elsif( @short > 1 and @short == @_ ) {
2692	0					0	return _make_class($self, @short);
2693							}
2694							else {
2695	0	0				0	return '(?:'
2696							. join( '\|' =>
2697							@short > 1
2698							? ( _make_class($self, @short), sort _re_sort @long)
2699							: ( (sort _re_sort( @long )), @short )
2700							)
2701							. ')';
2702							}
2703							}
2704
2705							sub _re_path {
2706	3			3		11	my $self = shift;
2707							# in shorter assemblies, _re_path() is the second hottest
2708							# routine. after insert(), so make it fast.
2709
2710	3	50				12	if ($self->{unroll_plus}) {
2711							# but we can't easily make this blockless
2712	0					0	my @arr = @{$_[0]};
	0					0
2713	0					0	my $str = '';
2714	0					0	my $skip = 0;
2715	0					0	for my $i (0..$#arr) {
2716	0	0	0			0	if (ref($arr[$i]) eq 'ARRAY') {
		0
		0
		0
2717	0					0	$str .= _re_path($self, $arr[$i]);
2718							}
2719							elsif (ref($arr[$i]) eq 'HASH') {
2720							$str .= exists $arr[$i]->{''}
2721							? _combine_new( $self,
2722	0					0	map { _re_path( $self, $arr[$i]->{$_} ) } grep { $_ ne '' } keys %{$arr[$i]}
	0					0
	0					0
2723							) . '?'
2724	0	0				0	: _combine_new($self, map { _re_path( $self, $arr[$i]->{$_} ) } keys %{$arr[$i]})
	0					0
	0					0
2725							;
2726							}
2727							elsif ($i < $#arr and $arr[$i+1] =~ /\A$arr[$i]\*(\??)\Z/) {
2728	0	0				0	$str .= "$arr[$i]+" . (defined $1 ? $1 : '');
2729	0					0	++$skip;
2730							}
2731							elsif ($skip) {
2732	0					0	$skip = 0;
2733							}
2734							else {
2735	0					0	$str .= $arr[$i];
2736							}
2737							}
2738	0					0	return $str;
2739							}
2740
2741	3	50				8	return join( '', @_ ) unless grep { length ref $_ } @_;
	3					21
2742	3					6	my $p;
2743							return join '', map {
2744							ref($_) eq '' ? $_
2745	22	0				75	: ref($_) eq 'HASH' ? do {
		50
2746							# In the case of a node, see whether there's a '' which
2747							# indicates that the whole thing is optional and thus
2748							# requires a trailing ?
2749							# Unroll the two different paths to avoid the needless
2750							# grep when it isn't necessary.
2751	0					0	$p = $_;
2752							exists $_->{''}
2753							? _combine_new( $self,
2754	0					0	map { _re_path( $self, $p->{$_} ) } grep { $_ ne '' } keys %$_
	0					0
2755							) . '?'
2756	0	0				0	: _combine_new($self, map { _re_path( $self, $p->{$_} ) } keys %$_ )
	0					0
2757							}
2758							: _re_path($self, $_) # ref($_) eq 'ARRAY'
2759	3					6	} @{$_[0]}
	3					7
2760							}
2761
2762							sub _lookahead {
2763	0			0			my $in = shift;
2764	0						my %head;
2765							my $path;
2766	0						for $path( keys %$in ) {
2767	0	0					next unless defined $in->{$path};
2768							# print "look $path: ", ref($in->{$path}[0]), ".\n";
2769	0	0					if( ref($in->{$path}[0]) eq 'HASH' ) {
		0
2770	0						my $next = 0;
2771	0		0				while( ref($in->{$path}[$next]) eq 'HASH' and @{$in->{$path}} > $next + 1 ) {
	0
2772	0	0					if( exists $in->{$path}[$next]{''} ) {
2773	0						++$head{$in->{$path}[$next+1]};
2774							}
2775	0						++$next;
2776							}
2777	0						my $inner = _lookahead( $in->{$path}[0] );
2778	0						@head{ keys %$inner } = (values %$inner);
2779							}
2780							elsif( ref($in->{$path}[0]) eq 'ARRAY' ) {
2781	0						my $subpath = $in->{$path}[0];
2782	0						for( my $sp = 0; $sp < @$subpath; ++$sp ) {
2783	0	0					if( ref($subpath->[$sp]) eq 'HASH' ) {
2784	0						my $follow = _lookahead( $subpath->[$sp] );
2785	0						@head{ keys %$follow } = (values %$follow);
2786	0	0					last unless exists $subpath->[$sp]{''};
2787							}
2788							else {
2789	0						++$head{$subpath->[$sp]};
2790	0						last;
2791							}
2792							}
2793							}
2794							else {
2795	0						++$head{ $in->{$path}[0] };
2796							}
2797							}
2798							# print "_lookahead ", _dump($in), '==>', _dump([keys %head]), "\n";
2799	0						return \%head;
2800							}
2801
2802							sub _re_path_lookahead {
2803	0			0			my $self = shift;
2804	0						my $in = shift;
2805							# print "_re_path_la in ", _dump($in), "\n";
2806	0						my $out = '';
2807	0						for( my $p = 0; $p < @$in; ++$p ) {
2808	0	0					if( ref($in->[$p]) eq '' ) {
		0
2809	0						$out .= $in->[$p];
2810	0						next;
2811							}
2812							elsif( ref($in->[$p]) eq 'ARRAY' ) {
2813	0						$out .= _re_path_lookahead($self, $in->[$p]);
2814	0						next;
2815							}
2816							# print "$p ", _dump($in->[$p]), "\n";
2817							my $path = [
2818	0						map { _re_path_lookahead($self, $in->[$p]{$_} ) }
2819	0						grep { $_ ne '' }
2820	0						keys %{$in->[$p]}
	0
2821							];
2822	0						my $ahead = _lookahead($in->[$p]);
2823	0						my $more = 0;
2824	0	0	0				if( exists $in->[$p]{''} and $p + 1 < @$in ) {
2825	0						my $next = 1;
2826	0						while( $p + $next < @$in ) {
2827	0	0					if( ref( $in->[$p+$next] ) eq 'HASH' ) {
2828	0						my $follow = _lookahead( $in->[$p+$next] );
2829	0						@{$ahead}{ keys %$follow } = (values %$follow);
	0
2830							}
2831							else {
2832	0						++$ahead->{$in->[$p+$next]};
2833	0						last;
2834							}
2835	0						++$next;
2836							}
2837	0						$more = 1;
2838							}
2839	0						my $nr_one = grep { /^$Single_Char$/ } @$path;
	0
2840	0						my $nr = @$path;
2841	0	0	0				if( $nr_one > 1 and $nr_one == $nr ) {
2842	0						$out .= _make_class($self, @$path);
2843	0	0					$out .= '?' if exists $in->[$p]{''};
2844							}
2845							else {
2846							my $zwla = keys(%$ahead) > 1
2847	0	0					? _combine($self, '?=', grep { s/\+$//; $_ } keys %$ahead )
	0
	0
2848							: '';
2849	0	0					my $patt = $nr > 1 ? _combine($self, '?:', @$path ) : $path->[0];
2850							# print "have nr=$nr n1=$nr_one n=", _dump($in->[$p]), ' a=', _dump([keys %$ahead]), " zwla=$zwla patt=$patt @{[_dump($path)]}\n";
2851	0	0					if( exists $in->[$p]{''} ) {
2852	0	0					$out .= $more ? "$zwla(?:$patt)?" : "(?:$zwla$patt)?";
2853							}
2854							else {
2855	0						$out .= "$zwla$patt";
2856							}
2857							}
2858							}
2859	0						return $out;
2860							}
2861
2862							sub _re_path_track {
2863	0			0			my $self = shift;
2864	0						my $in = shift;
2865	0						my $normal = shift;
2866	0						my $augmented = shift;
2867	0						my $o;
2868	0						my $simple = '';
2869	0						my $augment = '';
2870	0						for( my $n = 0; $n < @$in; ++$n ) {
2871	0	0					if( ref($in->[$n]) eq '' ) {
2872	0						$o = $in->[$n];
2873	0						$simple .= $o;
2874	0						$augment .= $o;
2875	0	0	0				if( (
			0
			0
2876							$n < @$in - 1
2877							and ref($in->[$n+1]) eq 'HASH' and exists $in->[$n+1]{''}
2878							)
2879							or $n == @$in - 1
2880							) {
2881	0						push @{$self->{mlist}}, $normal . $simple ;
	0
2882	0	0					$augment .= $] < 5.009005
2883							? "(?{\$self->{m}=$self->{mcount}})"
2884							: "(?{$self->{mcount}})"
2885							;
2886	0						++$self->{mcount};
2887							}
2888							}
2889							else {
2890							my $path = [
2891	0						map { $self->_re_path_track( $in->[$n]{$_}, $normal.$simple , $augmented.$augment ) }
2892	0						grep { $_ ne '' }
2893	0						keys %{$in->[$n]}
	0
2894							];
2895	0						$o = '(?:' . join( '\|' => sort _re_sort @$path ) . ')';
2896	0	0					$o .= '?' if exists $in->[$n]{''};
2897	0						$simple .= $o;
2898	0						$augment .= $o;
2899							}
2900							}
2901	0						return $augment;
2902							}
2903
2904							sub _re_path_pretty {
2905	0			0			my $self = shift;
2906	0						my $in = shift;
2907	0						my $arg = shift;
2908	0						my $pre = ' ' x (($arg->{depth}+0) * $arg->{indent});
2909	0						my $indent = ' ' x (($arg->{depth}+1) * $arg->{indent});
2910	0						my $out = '';
2911	0						$arg->{depth}++;
2912	0						my $prev_was_paren = 0;
2913	0						for( my $p = 0; $p < @$in; ++$p ) {
2914	0	0					if( ref($in->[$p]) eq '' ) {
		0
2915	0	0					$out .= "\n$pre" if $prev_was_paren;
2916	0						$out .= $in->[$p];
2917	0						$prev_was_paren = 0;
2918							}
2919							elsif( ref($in->[$p]) eq 'ARRAY' ) {
2920	0						$out .= _re_path($self, $in->[$p]);
2921							}
2922							else {
2923							my $path = [
2924	0						map { _re_path_pretty($self, $in->[$p]{$_}, $arg ) }
2925	0						grep { $_ ne '' }
2926	0						keys %{$in->[$p]}
	0
2927							];
2928	0						my $nr = @$path;
2929	0						my( @short, @long );
2930	0	0					push @{/^$Single_Char$/ ? \@short : \@long}, $_ for @$path;
	0
2931	0	0					if( @short == $nr ) {
2932	0	0					$out .= $nr == 1 ? $path->[0] : _make_class($self, @short);
2933	0	0					$out .= '?' if exists $in->[$p]{''};
2934							}
2935							else {
2936	0	0					$out .= "\n" if length $out;
2937	0	0					$out .= $pre if $p;
2938	0						$out .= "(?:\n$indent";
2939	0	0					if( @short < 2 ) {
2940	0						my $r = 0;
2941							$out .= join( "\n$indent\|" => map {
2942	0	0					$r++ and $_ =~ s/^\(\?:/\n$indent(?:/;
	0
2943	0						$_
2944							}
2945							sort _re_sort @$path
2946							);
2947							}
2948							else {
2949	0						$out .= join( "\n$indent\|" => ( (sort _re_sort @long), _make_class($self, @short) ));
2950							}
2951	0						$out .= "\n$pre)";
2952	0	0					if( exists $in->[$p]{''} ) {
2953	0						$out .= "\n$pre?";
2954	0						$prev_was_paren = 0;
2955							}
2956							else {
2957	0						$prev_was_paren = 1;
2958							}
2959							}
2960							}
2961							}
2962	0						$arg->{depth}--;
2963	0						return $out;
2964							}
2965
2966							sub _node_eq {
2967	0	0	0	0			return 0 if not defined $_[0] or not defined $_[1];
2968	0	0					return 0 if ref $_[0] ne ref $_[1];
2969							# Now that we have determined that the reference of each
2970							# argument are the same, we only have to test the first
2971							# one, which gives us a nice micro-optimisation.
2972	0	0					if( ref($_[0]) eq 'HASH' ) {
		0
2973	0						keys %{$_[0]} == keys %{$_[1]}
	0
2974							and
2975							# does this short-circuit to avoid _re_path() cost more than it saves?
2976	0	0	0				join( '\|' => sort keys %{$_[0]}) eq join( '\|' => sort keys %{$_[1]})
	0
	0
2977							and
2978							_re_path(undef, [$_[0]] ) eq _re_path(undef, [$_[1]] );
2979							}
2980							elsif( ref($_[0]) eq 'ARRAY' ) {
2981	0	0					scalar @{$_[0]} == scalar @{$_[1]}
	0
	0
2982							and
2983							_re_path(undef, $_[0]) eq _re_path(undef, $_[1]);
2984							}
2985							else {
2986	0						$_[0] eq $_[1];
2987							}
2988							}
2989
2990							sub _pretty_dump {
2991	0			0			return sprintf "\\x%02x", ord(shift);
2992							}
2993
2994							sub _dump {
2995	0			0			my $path = shift;
2996	0	0					return _dump_node($path) if ref($path) eq 'HASH';
2997	0						my $dump = '[';
2998	0						my $d;
2999	0						my $nr = 0;
3000	0						for $d( @$path ) {
3001	0	0					$dump .= ' ' if $nr++;
3002	0	0					if( ref($d) eq 'HASH' ) {
		0
		0
3003	0						$dump .= _dump_node($d);
3004							}
3005							elsif( ref($d) eq 'ARRAY' ) {
3006	0						$dump .= _dump($d);
3007							}
3008							elsif( defined $d ) {
3009							# D::C indicates the second test is redundant
3010							# $dump .= ( $d =~ /\s/ or not length $d )
3011	0	0					$dump .= (
		0
3012							$d =~ /\s/ ? qq{'$d'} :
3013							$d =~ /^[\x00-\x1f]$/ ? _pretty_dump($d) :
3014							$d
3015							);
3016							}
3017							else {
3018	0						$dump .= '*';
3019							}
3020							}
3021	0						return $dump . ']';
3022							}
3023
3024							sub _dump_node {
3025	0			0			my $node = shift;
3026	0						my $dump = '{';
3027	0						my $nr = 0;
3028	0						my $n;
3029	0						for $n (sort keys %$node) {
3030	0	0					$dump .= ' ' if $nr++;
3031							# Devel::Cover shows this to test to be redundant
3032							# $dump .= ( $n eq '' and not defined $node->{$n} )
3033							$dump .= $n eq ''
3034							? '*'
3035							: ($n =~ /^[\x00-\x1f]$/ ? _pretty_dump($n) : $n)
3036	0	0					. "=>" . _dump($node->{$n})
		0
3037							;
3038							}
3039	0						return $dump . '}';
3040							}
3041
3042							=back
3043
3044							=head1 DIAGNOSTICS
3045
3046							"Cannot pass a C to Default_Lexer"
3047
3048							You tried to replace the default lexer pattern with an object
3049							instead of a scalar. Solution: You probably tried to call
3050							C<< $obj->Default_Lexer >>. Call the qualified class method instead
3051							C.
3052
3053							"filter method not passed a coderef"
3054
3055							"pre_filter method not passed a coderef"
3056
3057							A reference to a subroutine (anonymous or otherwise) was expected.
3058							Solution: read the documentation for the C method.
3059
3060							"duplicate pattern added: /.../"
3061
3062							The C attribute is active, and a duplicate pattern was
3063							added (well duh!). Solution: clean your data.
3064
3065							"cannot open [file] for input: [reason]"
3066
3067							The C method was unable to open the specified file for
3068							whatever reason. Solution: make sure the file exists and the script
3069							has the required privileges to read it.
3070
3071							=head1 NOTES
3072
3073							This module has been tested successfully with a range of versions
3074							of perl, from 5.005_03 to 5.9.3. Use of 5.6.0 is not recommended.
3075
3076							The expressions produced by this module can be used with the PCRE
3077							library.
3078
3079							Remember to "double up" your backslashes if the patterns are
3080							hard-coded as constants in your program. That is, you should
3081							literally C rather than C. It
3082							usually will work either way, but it's good practice to do so.
3083
3084							Where possible, supply the simplest tokens possible. Don't add
3085							C when C will do. The reason is that
3086							if you also add C the resulting assembly changes
3087							dramatically: C I
3088							C. Since R::A doesn't perform enough analysis,
3089							it won't "unroll" the C<{2}> quantifier, and will fail to notice
3090							the divergence after the first C<-d\d+>.
3091
3092							Furthermore, when the string 'X-123000P' is matched against the
3093							first assembly, the regexp engine will have to backtrack over each
3094							alternation (the one that ends in Y B the one that ends in Z)
3095							before determining that there is no match. No such backtracking
3096							occurs in the second pattern: as soon as the engine encounters the
3097							'P' in the target string, neither of the alternations at that point
3098							(C<-\d+Y> or C) could succeed and so the match fails.
3099
3100							C does, however, know how to build character
3101							classes. Given C, C and C, it will assemble these
3102							into C. When C<-> (dash) appears as a candidate for a
3103							character class it will be the first character in the class. When
3104							C<^> (circumflex) appears as a candidate for a character class it
3105							will be the last character in the class.
3106
3107							It also knows about meta-characters than can "absorb" regular
3108							characters. For instance, given C and C, it knows that
3109							C<5> can be represented by C<\d> and so the assembly is just C.
3110							The "absorbent" meta-characters it deals with are C<.>, C<\d>, C<\s>
3111							and C<\W> and their complements. It will replace C<\d>/C<\D>,
3112							C<\s>/C<\S> and C<\w>/C<\W> by C<.> (dot), and it will drop C<\d>
3113							if C<\w> is also present (as will C<\D> in the presence of C<\W>).
3114
3115							C deals correctly with C's propensity
3116							to backslash many characters that have no need to be. Backslashes on
3117							non-metacharacters will be removed. Similarly, in character classes,
3118							a number of characters lose their magic and so no longer need to be
3119							backslashed within a character class. Two common examples are C<.>
3120							(dot) and C<$>. Such characters will lose their backslash.
3121
3122							At the same time, it will also process C<\Q...\E> sequences. When
3123							such a sequence is encountered, the inner section is extracted and
3124							C is applied to the section. The resulting quoted text
3125							is then used in place of the original unquoted text, and the C<\Q>
3126							and C<\E> metacharacters are thrown away. Similar processing occurs
3127							with the C<\U...\E> and C<\L...\E> sequences. This may have surprising
3128							effects when using a dispatch table. In this case, you will need
3129							to know exactly what the module makes of your input. Use the C
3130							method to find out what's going on:
3131
3132							$pattern = join( '', @{$re->lexstr($pattern)} );
3133
3134							If all the digits 0..9 appear in a character class, C
3135							will replace them by C<\d>. I'd do it for letters as well, but
3136							thinking about accented characters and other glyphs hurts my head.
3137
3138							In an alternation, the longest paths are chosen first (for example,
3139							C). When two paths have the same length, the path
3140							with the most subpaths will appear first. This aims to put the
3141							"busiest" paths to the front of the alternation. For example, the
3142							list C, C, C, C and C will produce the
3143							pattern C<(?:f(?:ew\|ig\|un)\|b(?:ad\|it))>. See F for a
3144							real-world example of how alternations are sorted. Once you have
3145							looked at that, everything should be crystal clear.
3146
3147							When tracking is in use, no reduction is performed. nor are
3148							character classes formed. The reason is that it is
3149							too difficult to determine the original pattern afterwards. Consider the
3150							two patterns C and C. These should be reduced to
3151							C. The final character matches one of two possibilities.
3152							To resolve whether it matched an C<'e'> or C<'m'> would require
3153							keeping track of the fact that the pattern finished up in a character
3154							class, which would the require a whole lot more work to figure out
3155							which character of the class matched. Without character classes
3156							it becomes much easier. Instead, C is produced, which
3157							lets us find out more simply where we ended up.
3158
3159							Similarly, C and C should form C<(?:dog\|sea)food>.
3160							When the pattern is being assembled, the tracking decision needs
3161							to be made at the end of the grouping, but the tail of the pattern
3162							has not yet been visited. Deferring things to make this work correctly
3163							is a vast hassle. In this case, the pattern becomes merely
3164							C<(?:dogfood\|seafood>. Tracked patterns will therefore be bulkier than
3165							simple patterns.
3166
3167							There is an open bug on this issue:
3168
3169							L
3170
3171							If this bug is ever resolved, tracking would become much easier to
3172							deal with (none of the C hassle would be required - you could
3173							just match like a regular RE and it would Just Work).
3174
3175							=head1 SEE ALSO
3176
3177							=over 8
3178
3179							=item L
3180
3181							General information about Perl's regular expressions.
3182
3183							=item L
3184
3185							Specific information about C.
3186
3187							=item Regex::PreSuf
3188
3189							C takes a string and chops it itself into tokens of
3190							length 1. Since it can't deal with tokens of more than one character,
3191							it can't deal with meta-characters and thus no regular expressions.
3192							Which is the main reason why I wrote this module.
3193
3194							=item Regexp::Optimizer
3195
3196							C produces regular expressions that are similar to
3197							those produced by R::A with reductions switched off. It's biggest
3198							drawback is that it is exponentially slower than Regexp::Assemble on
3199							very large sets of patterns.
3200
3201							=item Regexp::Parser
3202
3203							Fine grained analysis of regular expressions.
3204
3205							=item Regexp::Trie
3206
3207							Funnily enough, this was my working name for C
3208							during its developement. I changed the name because I thought it
3209							was too obscure. Anyway, C does much the same as
3210							C and C except that it runs
3211							much faster (according to the author). It does not recognise
3212							meta characters (that is, 'a+b' is interpreted as 'a\+b').
3213
3214							=item Text::Trie
3215
3216							C is well worth investigating. Tries can outperform very
3217							bushy (read: many alternations) patterns.
3218
3219							=item Tree::Trie
3220
3221							C is another module that builds tries. The algorithm that
3222							C uses appears to be quite similar to the
3223							algorithm described therein, except that C solves its
3224							end-marker problem without having to rewrite the leaves.
3225
3226							=back
3227
3228							=head1 LIMITATIONS
3229
3230							C does not attempt to find common substrings. For
3231							instance, it will not collapse C down to C.
3232							If there's a module out there that performs this sort of string
3233							analysis I'd like to know about it. But keep in mind that the
3234							algorithms that do this are very expensive: quadratic or worse.
3235
3236							C does not interpret meta-character modifiers.
3237							For instance, if the following two patterns are
3238							given: C and C, it will not determine that C<\d> can be
3239							matched by C<\d+>. Instead, it will produce C. Along
3240							a similar line of reasoning, it will not determine that C and
3241							C is equivalent to C (It will produce C
3242							instead).
3243
3244							You cannot remove a pattern that has been added to an object. You'll
3245							just have to start over again. Adding a pattern is difficult enough,
3246							I'd need a solid argument to convince me to add a C method.
3247							If you need to do this you should read the documentation for the
3248							C method.
3249
3250							C does not (yet)? employ the C<(?E...)>
3251							construct.
3252
3253							The module does not produce POSIX-style regular expressions. This
3254							would be quite easy to add, if there was a demand for it.
3255
3256							=head1 BUGS
3257
3258							Patterns that generate look-ahead assertions sometimes produce
3259							incorrect patterns in certain obscure corner cases. If you
3260							suspect that this is occurring in your pattern, disable
3261							lookaheads.
3262
3263							Tracking doesn't really work at all with 5.6.0. It works better
3264							in subsequent 5.6 releases. For maximum reliability, the use of
3265							a 5.8 release is strongly recommended. Tracking barely works with
3266							5.005_04. Of note, using C<\d>-style meta-characters invariably
3267							causes panics. Tracking really comes into its own in Perl 5.10.
3268
3269							If you feed C patterns with nested parentheses,
3270							there is a chance that the resulting pattern will be uncompilable
3271							due to mismatched parentheses (not enough closing parentheses). This
3272							is normal, so long as the default lexer pattern is used. If you want
3273							to find out which pattern among a list of 3000 patterns are to blame
3274							(speaking from experience here), the F script offers
3275							a strategy for pinpointing the pattern at fault. While you may not
3276							be able to use the script directly, the general approach is easy to
3277							implement.
3278
3279							The algorithm used to assemble the regular expressions makes extensive
3280							use of mutually-recursive functions (that is, A calls B, B calls
3281							A, ...) For deeply similar expressions, it may be possible to provoke
3282							"Deep recursion" warnings.
3283
3284							The module has been tested extensively, and has an extensive test
3285							suite (that achieves close to 100% statement coverage), but you
3286							never know... A bug may manifest itself in two ways: creating a
3287							pattern that cannot be compiled, such as C, or a pattern
3288							that compiles correctly but that either matches things it shouldn't,
3289							or doesn't match things it should. It is assumed that Such problems
3290							will occur when the reduction algorithm encounters some sort of
3291							edge case. A temporary work-around is to disable reductions:
3292
3293							my $pattern = $assembler->reduce(0)->re;
3294
3295							A discussion about implementation details and where bugs might lurk
3296							appears in the README file. If this file is not available locally,
3297							you should be able to find a copy on the Web at your nearest CPAN
3298							mirror.
3299
3300							Seriously, though, a number of people have been using this module to
3301							create expressions anywhere from 140Kb to 600Kb in size, and it seems to
3302							be working according to spec. Thus, I don't think there are any serious
3303							bugs remaining.
3304
3305							If you are feeling brave, extensive debugging traces are available to
3306							figure out where assembly goes wrong.
3307
3308							Please report all bugs at
3309							L
3310
3311							Make sure you include the output from the following two commands:
3312
3313							perl -MRegexp::Assemble -le 'print $Regexp::Assemble::VERSION'
3314							perl -V
3315
3316							There is a mailing list for the discussion of C.
3317							Subscription details are available at
3318							L.
3319
3320							=head1 ACKNOWLEDGEMENTS
3321
3322							This module grew out of work I did building access maps for Postfix,
3323							a modern SMTP mail transfer agent. See L
3324							for more information. I used Perl to build large regular expressions
3325							for blocking dynamic/residential IP addresses to cut down on spam
3326							and viruses. Once I had the code running for this, it was easy to
3327							start adding stuff to block really blatant spam subject lines, bogus
3328							HELO strings, spammer mailer-ids and more...
3329
3330							I presented the work at the French Perl Workshop in 2004, and the
3331							thing most people asked was whether the underlying mechanism for
3332							assembling the REs was available as a module. At that time it was
3333							nothing more that a twisty maze of scripts, all different. The
3334							interest shown indicated that a module was called for. I'd like to
3335							thank the people who showed interest. Hey, it's going to make I
3336							messy scripts smaller, in any case.
3337
3338							Thomas Drugeon was a valuable sounding board for trying out
3339							early ideas. Jean Forget and Philippe Blayo looked over an early
3340							version. H.Merijn Brandt stopped over in Paris one evening, and
3341							discussed things over a few beers.
3342
3343							Nicholas Clark pointed out that while what this module does
3344							(?:c\|sh)ould be done in perl's core, as per the 2004 TODO, he
3345							encouraged me to continue with the development of this module. In
3346							any event, this module allows one to gauge the difficulty of
3347							undertaking the endeavour in C. I'd rather gouge my eyes out with
3348							a blunt pencil.
3349
3350							Paul Johnson settled the question as to whether this module should
3351							live in the Regex:: namespace, or Regexp:: namespace. If you're
3352							not convinced, try running the following one-liner:
3353
3354							perl -le 'print ref qr//'
3355
3356							Philippe Bruhat found a couple of corner cases where this module
3357							could produce incorrect results. Such feedback is invaluable,
3358							and only improves the module's quality.
3359
3360							=head1 AUTHOR
3361
3362							David Landgren
3363
3364							Copyright (C) 2004-2008. All rights reserved.
3365
3366							http://www.landgren.net/perl/
3367
3368							If you use this module, I'd love to hear about what you're using
3369							it for. If you want to be informed of updates, send me a note.
3370
3371							You can look at the latest working copy in the following
3372							Subversion repository:
3373
3374							http://svnweb.mongueurs.net/Regexp-Assemble
3375
3376							=head1 LICENSE
3377
3378							This library is free software; you can redistribute it and/or modify
3379							it under the same terms as Perl itself.
3380
3381							=cut
3382
3383							'The Lusty Decadent Delights of Imperial Pompeii';
3384							__END__