File Coverage

lib/CPANPLUS/YACSmoke/ReAssemble.pm
Criterion Covered Total %
statement 130 1152 11.2
branch 51 796 6.4
condition 15 177 8.4
subroutine 21 97 21.6
pod 49 49 100.0
total 266 2271 11.7


line stmt bran cond sub pod time code
1             # Regexp::Assemple.pm
2             #
3             # Copyright (c) 2004-2008 David Landgren
4             # All rights reserved
5              
6             package CPANPLUS::YACSmoke::ReAssemble;
7             $CPANPLUS::YACSmoke::ReAssemble::VERSION = '1.08';
8 12     12   88 use vars qw/$have_Storable $Current_Lexer $Default_Lexer $Single_Char $Always_Fail/;
  12         32  
  12         924  
9              
10             =head1 NAME
11              
12             Regexp::Assemble - Assemble multiple Regular Expressions into a single RE
13              
14             =head1 SYNOPSIS
15              
16             use Regexp::Assemble;
17            
18             my $ra = Regexp::Assemble->new;
19             $ra->add( 'ab+c' );
20             $ra->add( 'ab+-' );
21             $ra->add( 'a\w\d+' );
22             $ra->add( 'a\d+' );
23             print $ra->re; # prints a(?:\w?\d+|b+[-c])
24              
25             =head1 DESCRIPTION
26              
27             Regexp::Assemble takes an arbitrary number of regular expressions
28             and assembles them into a single regular expression (or RE) that
29             matches all that the individual REs match.
30              
31             As a result, instead of having a large list of expressions to loop
32             over, a target string only needs to be tested against one expression.
33             This is interesting when you have several thousand patterns to deal
34             with. Serious effort is made to produce the smallest pattern possible.
35              
36             It is also possible to track the original patterns, so that you can
37             determine which, among the source patterns that form the assembled
38             pattern, was the one that caused the match to occur.
39              
40             You should realise that large numbers of alternations are processed
41             in perl's regular expression engine in O(n) time, not O(1). If you
42             are still having performance problems, you should look at using a
43             trie. Note that Perl's own regular expression engine will implement
44             trie optimisations in perl 5.10 (they are already available in
45             perl 5.9.3 if you want to try them out). C will
46             do the right thing when it knows it's running on a a trie'd perl.
47             (At least in some version after this one).
48              
49             Some more examples of usage appear in the accompanying README. If
50             that file isn't easy to access locally, you can find it on a web
51             repository such as
52             L or
53             L.
54              
55             =cut
56              
57 12     12   75 use strict;
  12         24  
  12         380  
58              
59 12     12   79 use constant DEBUG_ADD => 1;
  12         23  
  12         971  
60 12     12   80 use constant DEBUG_TAIL => 2;
  12         33  
  12         657  
61 12     12   119 use constant DEBUG_LEX => 4;
  12         28  
  12         647  
62 12     12   96 use constant DEBUG_TIME => 8;
  12         24  
  12         55341  
63              
64             # The following patterns were generated with eg/naive
65             $Default_Lexer = qr/(?![[(\\]).(?:[*+?]\??|\{\d+(?:,\d*)?\}\??)?|\\(?:[bABCEGLQUXZ]|[lu].|(?:[^\w]|[aefnrtdDwWsS]|c.|0\d{2}|x(?:[\da-fA-F]{2}|{[\da-fA-F]{4}})|N\{\w+\}|[Pp](?:\{\w+\}|.))(?:[*+?]\??|\{\d+(?:,\d*)?\}\??)?)|\[.*?(?
66              
67             $Single_Char = qr/^(?:\\(?:[aefnrtdDwWsS]|c.|[^\w\/{|}-]|0\d{2}|x(?:[\da-fA-F]{2}|{[\da-fA-F]{4}}))|[^\$^])$/;
68              
69             # the pattern to return when nothing has been added (and thus not match anything)
70             $Always_Fail = "^\\b\0";
71              
72             =head1 METHODS
73              
74             =over 8
75              
76             =item new
77              
78             Creates a new C object. The following optional
79             key/value parameters may be employed. All keys have a corresponding
80             method that can be used to change the behaviour later on. As a
81             general rule, especially if you're just starting out, you don't
82             have to bother with any of these.
83              
84             B, a family of optional attributes that allow anchors
85             (C<^>, C<\b>, C<\Z>...) to be added to the resulting pattern.
86              
87             B, sets the C flags to add to the assembled regular
88             expression. Warning: no error checking is done, you should ensure
89             that the flags you pass are understood by the version of Perl you
90             are using. B exists as an alias, for users familiar
91             with L.
92              
93             B, controls whether the pattern should be chomped before being
94             lexed. Handy if you are reading patterns from a file. By default,
95             Cing is performed (this behaviour changed as of version 0.24,
96             prior versions did not chomp automatically).
97             See also the C attribute and the C method.
98              
99             B, slurp the contents of the specified file and add them
100             to the assembly. Multiple files may be processed by using a list.
101              
102             my $r = Regexp::Assemble->new(file => 're.list');
103              
104             my $r = Regexp::Assemble->new(file => ['re.1', 're.2']);
105              
106             If you really don't want chomping to occur, you will have to set
107             the C attribute to 0 (zero). You may also want to look at
108             the C attribute, as well.
109              
110             B, controls what constitutes a record
111             separator when using the C attribute or the C
112             method. May be abbreviated to B. See the C<$/> variable in
113             L.
114              
115             B, controls whether the pattern should contain zero-width
116             lookahead assertions (For instance: (?=[abc])(?:bob|alice|charles).
117             This is not activated by default, because in many circumstances the
118             cost of processing the assertion itself outweighs the benefit of
119             its faculty for short-circuiting a match that will fail. This is
120             sensitive to the probability of a match succeeding, so if you're
121             worried about performance you'll have to benchmark a sample population
122             of targets to see which way the benefits lie.
123              
124             B, controls whether you want know which of the initial
125             patterns was the one that matched. See the C method for
126             more details. Note for version 5.8 of Perl and below, in this mode
127             of operation YOU SHOULD BE AWARE OF THE SECURITY IMPLICATIONS that
128             this entails. Perl 5.10 does not suffer from any such restriction.
129              
130             B, the number of spaces used to indent nested grouping of
131             a pattern. Use this to produce a pretty-printed pattern. See the
132             C method for a more detailed explanation.
133              
134             B, allows you to add a callback to enable sanity checks
135             on the pattern being loaded. This callback is triggered before the
136             pattern is split apart by the lexer. In other words, it operates
137             on the entire pattern. If you are loading patterns from a file,
138             this would be an appropriate place to remove comments.
139              
140             B, allows you to add a callback to enable sanity checks on
141             the pattern being loaded. This callback is triggered after the
142             pattern has been split apart by the lexer.
143              
144             B, controls whether to unroll, for example, C into
145             C, C, which may allow additional reductions in the
146             resulting assembled pattern.
147              
148             B, controls whether tail reduction occurs or not. If set,
149             patterns like C will be reduced to C.
150             That is, the end of the pattern in each part of the b... and d...
151             alternations is identical, and hence is hoisted out of the alternation
152             and placed after it. On by default. Turn it off if you're really
153             pressed for short assembly times.
154              
155             B, specifies the pattern used to lex the input lines into
156             tokens. You could replace the default pattern by a more sophisticated
157             version that matches arbitrarily nested parentheses, for example.
158              
159             B, controls whether copious amounts of output is produced
160             during the loading stage or the reducing stage of assembly.
161              
162             my $ra = Regexp::Assemble->new;
163             my $rb = Regexp::Assemble->new( chomp => 1, debug => 3 );
164              
165             B, controls whether new patterns can be added to the object
166             after the assembled pattern is generated. DEPRECATED.
167              
168             This method/attribute will be removed in a future release. It doesn't
169             really serve any purpose, and may be more effectively replaced by
170             cloning an existing C object and spinning out a
171             pattern from that instead.
172              
173             A more detailed explanation of these attributes follows.
174              
175             =cut
176              
177             sub new {
178 4     4 1 13 my $class = shift;
179 4         10 my %args = @_;
180              
181 4         9 my $anc;
182 4         21 for $anc (qw(word line string)) {
183 12 50       41 if (exists $args{"anchor_$anc"}) {
184 0         0 my $val = delete $args{"anchor_$anc"};
185 0         0 for my $anchor ("anchor_${anc}_begin", "anchor_${anc}_end") {
186 0 0       0 $args{$anchor} = $val unless exists $args{$anchor};
187             }
188             }
189             }
190              
191             # anchor_string_absolute sets anchor_string_begin and anchor_string_end_absolute
192 4 50       32 if (exists $args{anchor_string_absolute}) {
193 0         0 my $val = delete $args{anchor_string_absolute};
194 0         0 for my $anchor (qw(anchor_string_begin anchor_string_end_absolute)) {
195 0 0       0 $args{$anchor} = $val unless exists $args{$anchor};
196             }
197             }
198              
199 4   50     216 exists $args{$_} or $args{$_} = 0 for qw(
200             anchor_word_begin
201             anchor_word_end
202             anchor_line_begin
203             anchor_line_end
204             anchor_string_begin
205             anchor_string_end
206             anchor_string_end_absolute
207             debug
208             dup_warn
209             indent
210             lookahead
211             mutable
212             track
213             unroll_plus
214             );
215              
216 4   50     73 exists $args{$_} or $args{$_} = 1 for qw(
217             fold_meta_pairs
218             reduce
219             chomp
220             );
221              
222 4         40 @args{qw(re str path)} = (undef, undef, []);
223              
224 4   50     89 $args{flags} ||= delete $args{modifiers} || '';
      33        
225 4 50       12 $args{lex} = $Current_Lexer if defined $Current_Lexer;
226              
227 4         21 my $self = bless \%args, $class;
228              
229 4 50       52 if ($self->_debug(DEBUG_TIME)) {
230 0         0 $self->_init_time_func();
231 0         0 $self->{_begin_time} = $self->{_time_func}->();
232             }
233             $self->{input_record_separator} = delete $self->{rs}
234 4 50       15 if exists $self->{rs};
235 4 50       12 exists $self->{file} and $self->add_file($self->{file});
236              
237 4         17 return $self;
238             }
239              
240             sub _init_time_func {
241 0     0   0 my $self = shift;
242 0 0       0 return if exists $self->{_time_func};
243              
244             # attempt to improve accuracy
245 0 0       0 if (!defined($self->{_use_time_hires})) {
246 0         0 eval {require Time::HiRes};
  0         0  
247 0         0 $self->{_use_time_hires} = $@;
248             }
249             $self->{_time_func} = length($self->{_use_time_hires}) > 0
250 0     0   0 ? sub { time }
251 0 0       0 : \&Time::HiRes::time
252             ;
253             }
254              
255             =item clone
256              
257             Clones the contents of a Regexp::Assemble object and creates a new
258             object (in other words it performs a deep copy).
259              
260             If the Storable module is installed, its dclone method will be used,
261             otherwise the cloning will be performed using a pure perl approach.
262              
263             You can use this method to take a snapshot of the patterns that have
264             been added so far to an object, and generate an assembly from the
265             clone. Additional patterns may to be added to the original object
266             afterwards.
267              
268             my $re = $main->clone->re();
269             $main->add( 'another-pattern-\\d+' );
270              
271             =cut
272              
273             sub clone {
274 0     0 1 0 my $self = shift;
275 0         0 my $clone;
276 0         0 my @attr = grep {$_ ne 'path'} keys %$self;
  0         0  
277 0         0 @{$clone}{@attr} = @{$self}{@attr};
  0         0  
  0         0  
278 0         0 $clone->{path} = _path_clone($self->_path);
279 0         0 bless $clone, ref($self);
280             }
281              
282             =item add(LIST)
283              
284             Takes a string, breaks it apart into a set of tokens (respecting
285             meta characters) and inserts the resulting list into the C
286             object. It uses a naive regular expression to lex the string
287             that may be fooled complex expressions (specifically, it will
288             fail to lex nested parenthetical expressions such as
289             C correctly). If this is the case, the end of
290             the string will not be tokenised correctly and returned as one
291             long string.
292              
293             On the one hand, this may indicate that the patterns you are
294             trying to feed the C object are too complex. Simpler
295             patterns might allow the algorithm to work more effectively and
296             perform more reductions in the resulting pattern.
297              
298             On the other hand, you can supply your own pattern to perform the
299             lexing if you need. The test suite contains an example of a lexer
300             pattern that will match one level of nested parentheses.
301              
302             Note that there is an internal optimisation that will bypass a
303             much of the lexing process. If a string contains no C<\>
304             (backslash), C<[> (open square bracket), C<(> (open paren),
305             C (question mark), C<+> (plus), C<*> (star) or C<{> (open
306             curly), a character split will be performed directly.
307              
308             A list of strings may be supplied, thus you can pass it a file
309             handle of a file opened for reading:
310              
311             $re->add( '\d+-\d+-\d+-\d+\.example\.com' );
312             $re->add( );
313              
314             If the file is very large, it may be more efficient to use a
315             C loop, to read the file line-by-line:
316              
317             $re->add($_) while ;
318              
319             The C method will chomp the lines automatically. If you
320             do not want this to occur (you want to keep the record
321             separator), then disable Cing.
322              
323             $re->chomp(0);
324             $re->add($_) while ;
325              
326             This method is chainable.
327              
328             =cut
329              
330             sub _fastlex {
331 0     0   0 my $self = shift;
332 0         0 my $record = shift;
333 0         0 my $len = 0;
334 0         0 my @path = ();
335 0         0 my $case = '';
336 0         0 my $qm = '';
337              
338 0         0 my $debug = $self->{debug} & DEBUG_LEX;
339 0         0 my $unroll_plus = $self->{unroll_plus};
340              
341 0         0 my $token;
342             my $qualifier;
343 0 0       0 $debug and print "# _lex <$record>\n";
344 0         0 my $modifier = q{(?:[*+?]\\??|\\{(?:\\d+(?:,\d*)?|,\d+)\\}\\??)?};
345 0         0 my $class_matcher = qr/\[(?:\[:[a-z]+:\]|\\?.)*?\]/;
346 0         0 my $paren_matcher = qr/\(.*?(?
347 0         0 my $misc_matcher = qr/(?:(c)(.)|(0)(\d{2}))($modifier)/;
348 0         0 my $regular_matcher = qr/([^\\[(])($modifier)/;
349 0         0 my $qm_matcher = qr/(\\?.)/;
350              
351 0         0 my $matcher = $regular_matcher;
352             {
353 0 0       0 if ($record =~ /\G$matcher/gc) {
  0 0       0  
    0          
    0          
354             # neither a \\ nor [ nor ( followed by a modifer
355 0 0 0     0 if ($1 eq '\\E') {
    0 0        
356 0 0       0 $debug and print "# E\n";
357 0         0 $case = $qm = '';
358 0         0 $matcher = $regular_matcher;
359 0         0 redo;
360             }
361             elsif ($qm and ($1 eq '\\L' or $1 eq '\\U')) {
362 0 0       0 $debug and print "# ignore \\L, \\U\n";
363 0         0 redo;
364             }
365 0         0 $token = $1;
366 0 0       0 $qualifier = defined $2 ? $2 : '';
367 0 0       0 $debug and print "# token <$token> <$qualifier>\n";
368 0 0       0 if ($qm) {
369 0         0 $token = quotemeta($token);
370 0         0 $token =~ s/^\\([^\w$()*+.?@\[\\\]^|{}\/])$/$1/;
371             }
372             else {
373 0         0 $token =~ s{\A([][{}*+?@\\/])\Z}{\\$1};
374             }
375 0 0 0     0 if ($unroll_plus and $qualifier =~ s/\A\+(\?)?\Z/*/) {
376 0 0       0 $1 and $qualifier .= $1;
377 0 0       0 $debug and print " unroll <$token><$token><$qualifier>\n";
378 0 0       0 $case and $token = $case eq 'L' ? lc($token) : uc($token);
    0          
379 0         0 push @path, $token, "$token$qualifier";
380             }
381             else {
382 0 0       0 $debug and print " clean <$token>\n";
383 0 0       0 push @path,
    0          
384             $case eq 'L' ? lc($token).$qualifier
385             : $case eq 'U' ? uc($token).$qualifier
386             : $token.$qualifier
387             ;
388             }
389 0         0 redo;
390             }
391              
392             elsif ($record =~ /\G\\/gc) {
393 0 0       0 $debug and print "# backslash\n";
394             # backslash
395 0 0       0 if ($record =~ /\G([sdwSDW])($modifier)/gc) {
    0          
    0          
    0          
    0          
    0          
    0          
    0          
396 0         0 ($token, $qualifier) = ($1, $2);
397 0 0       0 $debug and print "# meta <$token> <$qualifier>\n";
398 0 0 0     0 push @path, ($unroll_plus and $qualifier =~ s/\A\+(\?)?\Z/*/)
    0          
399             ? ("\\$token", "\\$token$qualifier" . (defined $1 ? $1 : ''))
400             : "\\$token$qualifier";
401             }
402             elsif ($record =~ /\Gx([\da-fA-F]{2})($modifier)/gc) {
403 0 0       0 $debug and print "# x $1\n";
404 0         0 $token = quotemeta(chr(hex($1)));
405 0         0 $qualifier = $2;
406 0 0       0 $debug and print "# cooked <$token>\n";
407 0         0 $token =~ s/^\\([^\w$()*+.?\[\\\]^|{\/])$/$1/; # } balance
408 0 0       0 $debug and print "# giving <$token>\n";
409 0 0 0     0 push @path, ($unroll_plus and $qualifier =~ s/\A\+(\?)?\Z/*/)
    0          
410             ? ($token, "$token$qualifier" . (defined $1 ? $1 : ''))
411             : "$token$qualifier";
412             }
413             elsif ($record =~ /\GQ/gc) {
414 0 0       0 $debug and print "# Q\n";
415 0         0 $qm = 1;
416 0         0 $matcher = $qm_matcher;
417             }
418             elsif ($record =~ /\G([LU])/gc) {
419 0 0       0 $debug and print "# case $1\n";
420 0         0 $case = $1;
421             }
422             elsif ($record =~ /\GE/gc) {
423 0 0       0 $debug and print "# E\n";
424 0         0 $case = $qm = '';
425 0         0 $matcher = $regular_matcher;
426             }
427             elsif ($record =~ /\G([lu])(.)/gc) {
428 0 0       0 $debug and print "# case $1 to <$2>\n";
429 0 0       0 push @path, $1 eq 'l' ? lc($2) : uc($2);
430             }
431 0         0 elsif (my @arg = grep {defined} $record =~ /\G$misc_matcher/gc) {
432 0 0       0 if ($] < 5.007) {
433 0         0 my $len = 0;
434 0         0 $len += length($_) for @arg;
435 0 0       0 $debug and print "# pos ", pos($record), " fixup add $len\n";
436 0         0 pos($record) = pos($record) + $len;
437             }
438 0         0 my $directive = shift @arg;
439 0 0       0 if ($directive eq 'c') {
440 0 0       0 $debug and print "# ctrl <@arg>\n";
441 0         0 push @path, "\\c" . uc(shift @arg);
442             }
443             else { # elsif ($directive eq '0') {
444 0 0       0 $debug and print "# octal <@arg>\n";
445 0         0 my $ascii = oct(shift @arg);
446 0 0       0 push @path, ($ascii < 32)
447             ? "\\c" . chr($ascii+64)
448             : chr($ascii)
449             ;
450             }
451 0         0 $path[-1] .= join( '', @arg ); # if @arg;
452 0         0 redo;
453             }
454             elsif ($record =~ /\G(.)/gc) {
455 0         0 $token = $1;
456 0         0 $token =~ s{[AZabefnrtz\[\]{}()\\\$*+.?@|/^]}{\\$token};
457 0 0       0 $debug and print "# meta <$token>\n";
458 0         0 push @path, $token;
459             }
460             else {
461 0 0       0 $debug and print "# ignore char at ", pos($record), " of <$record>\n";
462             }
463 0         0 redo;
464             }
465              
466             elsif ($record =~ /\G($class_matcher)($modifier)/gc) {
467             # [class] followed by a modifer
468 0         0 my $class = $1;
469 0 0       0 my $qualifier = defined $2 ? $2 : '';
470 0 0       0 $debug and print "# class begin <$class> <$qualifier>\n";
471 0 0       0 if ($class =~ /\A\[\\?(.)]\Z/) {
472 0         0 $class = quotemeta $1;
473 0         0 $class =~ s{\A\\([!@%])\Z}{$1};
474 0 0       0 $debug and print "# class unwrap $class\n";
475             }
476 0 0       0 $debug and print "# class end <$class> <$qualifier>\n";
477 0 0 0     0 push @path, ($unroll_plus and $qualifier =~ s/\A\+(\?)?\Z/*/)
    0          
478             ? ($class, "$class$qualifier" . (defined $1 ? $1 : ''))
479             : "$class$qualifier";
480 0         0 redo;
481             }
482              
483             elsif ($record =~ /\G($paren_matcher)/gc) {
484 0 0       0 $debug and print "# paren <$1>\n";
485             # (paren) followed by a modifer
486 0         0 push @path, $1;
487 0         0 redo;
488             }
489              
490             }
491 0         0 return \@path;
492             }
493              
494             sub _lex {
495 0     0   0 my $self = shift;
496 0         0 my $record = shift;
497 0         0 my $len = 0;
498 0         0 my @path = ();
499 0         0 my $case = '';
500 0         0 my $qm = '';
501             my $re = defined $self->{lex} ? $self->{lex}
502 0 0       0 : defined $Current_Lexer ? $Current_Lexer
    0          
503             : $Default_Lexer;
504 0         0 my $debug = $self->{debug} & DEBUG_LEX;
505 0 0       0 $debug and print "# _lex <$record>\n";
506 0         0 my ($token, $next_token, $diff, $token_len);
507 0         0 while( $record =~ /($re)/g ) {
508 0         0 $token = $1;
509 0         0 $token_len = length($token);
510 0 0       0 $debug and print "# lexed <$token> len=$token_len\n";
511 0 0       0 if( pos($record) - $len > $token_len ) {
512 0         0 $next_token = $token;
513 0         0 $token = substr( $record, $len, $diff = pos($record) - $len - $token_len );
514 0 0       0 $debug and print "# recover <", substr( $record, $len, $diff ), "> as <$token>, save <$next_token>\n";
515 0         0 $len += $diff;
516             }
517 0         0 $len += $token_len;
518             TOKEN: {
519 0 0       0 if( substr( $token, 0, 1 ) eq '\\' ) {
  0         0  
520 0 0       0 if( $token =~ /^\\([ELQU])$/ ) {
    0          
    0          
521 0 0       0 if( $1 eq 'E' ) {
    0          
522             $qm and $re = defined $self->{lex} ? $self->{lex}
523 0 0       0 : defined $Current_Lexer ? $Current_Lexer
    0          
    0          
524             : $Default_Lexer;
525 0         0 $case = $qm = '';
526             }
527             elsif( $1 eq 'Q' ) {
528 0         0 $qm = $1;
529             # switch to a more precise lexer to quotemeta individual characters
530 0         0 $re = qr/\\?./;
531             }
532             else {
533 0         0 $case = $1;
534             }
535 0 0       0 $debug and print "# state change qm=<$qm> case=<$case>\n";
536 0         0 goto NEXT_TOKEN;
537             }
538             elsif( $token =~ /^\\([lu])(.)$/ ) {
539 0 0       0 $debug and print "# apply case=<$1> to <$2>\n";
540 0 0       0 push @path, $1 eq 'l' ? lc($2) : uc($2);
541 0         0 goto NEXT_TOKEN;
542             }
543             elsif( $token =~ /^\\x([\da-fA-F]{2})$/ ) {
544 0         0 $token = quotemeta(chr(hex($1)));
545 0 0       0 $debug and print "# cooked <$token>\n";
546 0         0 $token =~ s/^\\([^\w$()*+.?@\[\\\]^|{\/])$/$1/; # } balance
547 0 0       0 $debug and print "# giving <$token>\n";
548             }
549             else {
550 0         0 $token =~ s/^\\([^\w$()*+.?@\[\\\]^|{\/])$/$1/; # } balance
551 0 0       0 $debug and print "# backslashed <$token>\n";
552             }
553             }
554             else {
555 0 0       0 $case and $token = $case eq 'U' ? uc($token) : lc($token);
    0          
556 0 0       0 $qm and $token = quotemeta($token);
557 0 0       0 $token = '\\/' if $token eq '/';
558             }
559             # undo quotemeta's brute-force escapades
560 0 0       0 $qm and $token =~ s/^\\([^\w$()*+.?@\[\\\]^|{}\/])$/$1/;
561 0 0       0 $debug and print "# <$token> case=<$case> qm=<$qm>\n";
562 0         0 push @path, $token;
563              
564             NEXT_TOKEN:
565 0 0       0 if( defined $next_token ) {
566 0 0       0 $debug and print "# redo <$next_token>\n";
567 0         0 $token = $next_token;
568 0         0 $next_token = undef;
569 0         0 redo TOKEN;
570             }
571             }
572             }
573 0 0       0 if( $len < length($record) ) {
574             # NB: the remainder only arises in the case of degenerate lexer,
575             # and if \Q is operative, the lexer will have been switched to
576             # /\\?./, which means there can never be a remainder, so we
577             # don't have to bother about quotemeta. In other words:
578             # $qm will never be true in this block.
579 0         0 my $remain = substr($record,$len);
580 0 0       0 $case and $remain = $case eq 'U' ? uc($remain) : lc($remain);
    0          
581 0 0       0 $debug and print "# add remaining <$remain> case=<$case> qm=<$qm>\n";
582 0         0 push @path, $remain;
583             }
584 0 0       0 $debug and print "# _lex out <@path>\n";
585 0         0 return \@path;
586             }
587              
588             sub add {
589 4     4 1 12 my $self = shift;
590 4         7 my $record;
591 4         13 my $debug = $self->{debug} & DEBUG_LEX;
592 4         20 while( defined( $record = shift @_ )) {
593 4 50       18 CORE::chomp($record) if $self->{chomp};
594 4 50 33     15 next if $self->{pre_filter} and not $self->{pre_filter}->($record);
595 4 50       11 $debug and print "# add <$record>\n";
596 4         19 $self->{stats_raw} += length $record;
597             my $list = $record =~ /[+*?(\\\[{]/ # }]) restore equilibrium
598 4 0       85 ? $self->{lex} ? $self->_lex($record) : $self->_fastlex($record)
    50          
599             : [split //, $record]
600             ;
601 4 50 33     22 next if $self->{filter} and not $self->{filter}->(@$list);
602 4         29 $self->_insertr( $list );
603             }
604 4         24 return $self;
605             }
606              
607             =item add_file(FILENAME [...])
608              
609             Takes a list of file names. Each file is opened and read
610             line by line. Each line is added to the assembly.
611              
612             $r->add_file( 'file.1', 'file.2' );
613              
614             If a file cannot be opened, the method will croak. If you cannot
615             afford to let this happen then you should wrap the call in a C
616             block.
617              
618             Chomping happens automatically unless you the C method
619             to disable it. By default, input lines are read according to the
620             value of the C attribute (if defined), and
621             will otherwise fall back to the current setting of the system C<$/>
622             variable. The record separator may also be specified on each
623             call to C. Internally, the routine Cises the
624             value of C<$/> to whatever is required, for the duration of the
625             call.
626              
627             An alternate calling mechanism using a hash reference is
628             available. The recognised keys are:
629              
630             =over 4
631              
632             =item file
633              
634             Reference to a list of file names, or the name of a single
635             file.
636              
637             $r->add_file({file => ['file.1', 'file.2', 'file.3']});
638             $r->add_file({file => 'file.n'});
639              
640             =item input_record_separator
641              
642             If present, indicates what constitutes a line
643              
644             $r->add_file({file => 'data.txt', input_record_separator => ':' });
645              
646             =item rs
647              
648             An alias for input_record_separator (mnemonic: same as the
649             English variable names).
650              
651             =back
652              
653             $r->add_file( {
654             file => [ 'pattern.txt', 'more.txt' ],
655             input_record_separator => "\r\n",
656             });
657              
658             =cut
659              
660             sub add_file {
661 0     0 1 0 my $self = shift;
662 0         0 my $rs;
663             my @file;
664 0 0       0 if (ref($_[0]) eq 'HASH') {
665 0         0 my $arg = shift;
666             $rs = $arg->{rs}
667             || $arg->{input_record_separator}
668             || $self->{input_record_separator}
669 0   0     0 || $/;
670             @file = ref($arg->{file}) eq 'ARRAY'
671 0         0 ? @{$arg->{file}}
672 0 0       0 : $arg->{file};
673             }
674             else {
675 0   0     0 $rs = $self->{input_record_separator} || $/;
676 0         0 @file = @_;
677             }
678 0         0 local $/ = $rs;
679 0         0 my $file;
680 0         0 for $file (@file) {
681 0 0       0 open my $fh, '<', $file or do {
682 0         0 require Carp;
683 0         0 Carp::croak("cannot open $file for input: $!");
684             };
685 0         0 while (defined (my $rec = <$fh>)) {
686 0         0 $self->add($rec);
687             }
688 0         0 close $fh;
689             }
690 0         0 return $self;
691             }
692              
693             =item insert(LIST)
694              
695             Takes a list of tokens representing a regular expression and
696             stores them in the object. Note: you should not pass it a bare
697             regular expression, such as C. You must pass it as
698             a list of tokens, I C<('a', 'b+', 'c?', 'd*', 'e')>.
699              
700             This method is chainable, I:
701              
702             my $ra = Regexp::Assemble->new
703             ->insert( qw[ a b+ c? d* e ] )
704             ->insert( qw[ a c+ d+ e* f ] );
705              
706             Lexing complex patterns with metacharacters and so on can consume
707             a significant proportion of the overall time to build an assembly.
708             If you have the information available in a tokenised form, calling
709             C directly can be a big win.
710              
711             =cut
712              
713             sub insert {
714 0     0 1 0 my $self = shift;
715 0 0 0     0 return if $self->{filter} and not $self->{filter}->(@_);
716 0         0 $self->_insertr( [@_] );
717 0         0 return $self;
718             }
719              
720             sub _insertr {
721 4     4   44 my $self = shift;
722 4   50     35 my $dup = $self->{stats_dup} || 0;
723 4         28 $self->{path} = $self->_insert_path( $self->_path, $self->_debug(DEBUG_ADD), $_[0] );
724 4 50 33     22 if( not defined $self->{stats_dup} or $dup == $self->{stats_dup} ) {
    0          
725 4         12 ++$self->{stats_add};
726 4 50       8 $self->{stats_cooked} += defined($_) ? length($_) : 0 for @{$_[0]};
  4         36  
727             }
728             elsif( $self->{dup_warn} ) {
729 0 0       0 if( ref $self->{dup_warn} eq 'CODE' ) {
730 0         0 $self->{dup_warn}->($self, $_[0]);
731             }
732             else {
733 0         0 my $pattern = join( '', @{$_[0]} );
  0         0  
734 0         0 require Carp;
735 0         0 Carp::carp("duplicate pattern added: /$pattern/");
736             }
737             }
738 4         19 $self->{str} = $self->{re} = undef;
739             }
740              
741             =item lexstr
742              
743             Use the C method if you are curious to see how a pattern
744             gets tokenised. It takes a scalar on input, representing a pattern,
745             and returns a reference to an array, containing the tokenised
746             pattern. You can recover the original pattern by performing a
747             C:
748              
749             my @token = $re->lexstr($pattern);
750             my $new_pattern = join( '', @token );
751              
752             If the original pattern contains unnecessary backslashes, or C<\x4b>
753             escapes, or quotemeta escapes (C<\Q>...C<\E>) the resulting pattern
754             may not be identical.
755              
756             Call C does not add the pattern to the object, it is merely
757             for exploratory purposes. It will, however, update various statistical
758             counters.
759              
760             =cut
761              
762             sub lexstr {
763 0     0 1 0 return shift->_lex(shift);
764             }
765              
766             =item pre_filter(CODE)
767              
768             Allows you to install a callback to check that the pattern being
769             loaded contains valid input. It receives the pattern as a whole to
770             be added, before it been tokenised by the lexer. It may to return
771             0 or C to indicate that the pattern should not be added, any
772             true value indicates that the contents are fine.
773              
774             A filter to strip out trailing comments (marked by #):
775              
776             $re->pre_filter( sub { $_[0] =~ s/\s*#.*$//; 1 } );
777              
778             A filter to ignore blank lines:
779              
780             $re->pre_filter( sub { length(shift) } );
781              
782             If you want to remove the filter, pass C as a parameter.
783              
784             $ra->pre_filter(undef);
785              
786             This method is chainable.
787              
788             =cut
789              
790             sub pre_filter {
791 0     0 1 0 my $self = shift;
792 0         0 my $pre_filter = shift;
793 0 0 0     0 if( defined $pre_filter and ref($pre_filter) ne 'CODE' ) {
794 0         0 require Carp;
795 0         0 Carp::croak("pre_filter method not passed a coderef");
796             }
797 0         0 $self->{pre_filter} = $pre_filter;
798 0         0 return $self;
799             }
800              
801              
802             =item filter(CODE)
803              
804             Allows you to install a callback to check that the pattern being
805             loaded contains valid input. It receives a list on input, after it
806             has been tokenised by the lexer. It may to return 0 or undef to
807             indicate that the pattern should not be added, any true value
808             indicates that the contents are fine.
809              
810             If you know that all patterns you expect to assemble contain
811             a restricted set of of tokens (e.g. no spaces), you could do
812             the following:
813              
814             $ra->filter(sub { not grep { / / } @_ });
815              
816             or
817              
818             sub only_spaces_and_digits {
819             not grep { ![\d ] } @_
820             }
821             $ra->filter( \&only_spaces_and_digits );
822              
823             These two examples will silently ignore faulty patterns, If you
824             want the user to be made aware of the problem you should raise an
825             error (via C or C), log an error message, whatever is
826             best. If you want to remove a filter, pass C as a parameter.
827              
828             $ra->filter(undef);
829              
830             This method is chainable.
831              
832             =cut
833              
834             sub filter {
835 0     0 1 0 my $self = shift;
836 0         0 my $filter = shift;
837 0 0 0     0 if( defined $filter and ref($filter) ne 'CODE' ) {
838 0         0 require Carp;
839 0         0 Carp::croak("filter method not passed a coderef");
840             }
841 0         0 $self->{filter} = $filter;
842 0         0 return $self;
843             }
844              
845             =item as_string
846              
847             Assemble the expression and return it as a string. You may want to do
848             this if you are writing the pattern to a file. The following arguments
849             can be passed to control the aspect of the resulting pattern:
850              
851             B, the number of spaces used to indent nested grouping of
852             a pattern. Use this to produce a pretty-printed pattern (for some
853             definition of "pretty"). The resulting output is rather verbose. The
854             reason is to ensure that the metacharacters C<(?:> and C<)> always
855             occur on otherwise empty lines. This allows you grep the result for an
856             even more synthetic view of the pattern:
857              
858             egrep -v '^ *[()]'
859              
860             The result of the above is quite readable. Remember to backslash the
861             spaces appearing in your own patterns if you wish to use an indented
862             pattern in an C construct. Indenting is ignored if tracking
863             is enabled.
864              
865             The B argument takes precedence over the C
866             method/attribute of the object.
867              
868             Calling this
869             method will drain the internal data structure. Large numbers of patterns
870             can eat a significant amount of memory, and this lets perl recover the
871             memory used for other purposes.
872              
873             If you want to reduce the pattern I continue to add new patterns,
874             clone the object and reduce the clone, leaving the original object intact.
875              
876             =cut
877              
878             sub as_string {
879 3     3 1 12 my $self = shift;
880 3 50       18 if( not defined $self->{str} ) {
881 3 50       22 if( $self->{track} ) {
882 0         0 $self->{m} = undef;
883 0         0 $self->{mcount} = 0;
884 0         0 $self->{mlist} = [];
885 0         0 $self->{str} = _re_path_track($self, $self->_path, '', '');
886             }
887             else {
888 3 50 33     85 $self->_reduce unless ($self->{mutable} or not $self->{reduce});
889 3         16 my $arg = {@_};
890             $arg->{indent} = $self->{indent}
891 3 50 33     40 if not exists $arg->{indent} and $self->{indent} > 0;
892 3 50 33     44 if( exists $arg->{indent} and $arg->{indent} > 0 ) {
    50          
893 0         0 $arg->{depth} = 0;
894 0         0 $self->{str} = _re_path_pretty($self, $self->_path, $arg);
895             }
896             elsif( $self->{lookahead} ) {
897 0         0 $self->{str} = _re_path_lookahead($self, $self->_path);
898             }
899             else {
900 3         15 $self->{str} = _re_path($self, $self->_path);
901             }
902             }
903 3 50       18 if (not length $self->{str}) {
904             # explicitly fail to match anything if no pattern was generated
905 0         0 $self->{str} = $Always_Fail;
906             }
907             else {
908             my $begin =
909             $self->{anchor_word_begin} ? '\\b'
910             : $self->{anchor_line_begin} ? '^'
911 3 50       35 : $self->{anchor_string_begin} ? '\A'
    50          
    50          
912             : ''
913             ;
914             my $end =
915             $self->{anchor_word_end} ? '\\b'
916             : $self->{anchor_line_end} ? '$'
917             : $self->{anchor_string_end} ? '\Z'
918 3 50       48 : $self->{anchor_string_end_absolute} ? '\z'
    50          
    50          
    50          
919             : ''
920             ;
921 3         15 $self->{str} = "$begin$self->{str}$end";
922             }
923 3 50       18 $self->{path} = [] unless $self->{mutable};
924             }
925 3         17 return $self->{str};
926             }
927              
928             =item re
929              
930             Assembles the pattern and return it as a compiled RE, using the
931             C operator.
932              
933             As with C, calling this method will reset the internal data
934             structures to free the memory used in assembling the RE.
935              
936             The B attribute, documented in the C method, can be
937             used here (it will be ignored if tracking is enabled).
938              
939             With method chaining, it is possible to produce a RE without having
940             a temporary C object lying around, I:
941              
942             my $re = Regexp::Assemble->new
943             ->add( q[ab+cd+e] )
944             ->add( q[ac\\d+e] )
945             ->add( q[c\\d+e] )
946             ->re;
947              
948             The C<$re> variable now contains a Regexp object that can be used
949             directly:
950              
951             while( <> ) {
952             /$re/ and print "Something in [$_] matched\n";
953             )
954              
955             The C method is called when the object is used in string context
956             (hence, within an C operator), so by and large you do not even
957             need to save the RE in a separate variable. The following will work
958             as expected:
959              
960             my $re = Regexp::Assemble->new->add( qw[ fee fie foe fum ] );
961             while( ) {
962             if( /($re)/ ) {
963             print "Here be giants: $1\n";
964             }
965             }
966              
967             This approach does not work with tracked patterns. The
968             C and C methods must be used instead, see below.
969              
970             =cut
971              
972             sub re {
973 4     4 1 10 my $self = shift;
974 4 50       15 $self->_build_re($self->as_string(@_)) unless defined $self->{re};
975 4         92 return $self->{re};
976             }
977              
978             use overload '""' => sub {
979 8     8   35 my $self = shift;
980 8 100       97 return $self->{re} if $self->{re};
981 3         37 $self->_build_re($self->as_string());
982 3         21 return $self->{re};
983 12     12   142 };
  12         44  
  12         173  
984              
985             sub _build_re {
986 3     3   16 my $self = shift;
987 3         8 my $str = shift;
988 3 50       17 if( $self->{track} ) {
989 12     12   1348 use re 'eval';
  12         49  
  12         131556  
990             $self->{re} = length $self->{flags}
991 0 0       0 ? qr/(?$self->{flags}:$str)/
992             : qr/$str/
993             ;
994             }
995             else {
996             # how could I not repeat myself?
997             $self->{re} = length $self->{flags}
998 3 50       94 ? qr/(?$self->{flags}:$str)/
999             : qr/$str/
1000             ;
1001             }
1002             }
1003              
1004             =item match(SCALAR)
1005              
1006             The following information applies to Perl 5.8 and below. See
1007             the section that follows for information on Perl 5.10.
1008              
1009             If pattern tracking is in use, you must C in order
1010             to make things work correctly. At a minimum, this will make your
1011             code look like this:
1012              
1013             my $did_match = do { use re 'eval'; $target =~ /$ra/ }
1014             if( $did_match ) {
1015             print "matched ", $ra->matched, "\n";
1016             }
1017              
1018             (The main reason is that the C<$^R> variable is currently broken
1019             and an ugly workaround that runs some Perl code during the match
1020             is required, in order to simulate what C<$^R> should be doing. See
1021             Perl bug #32840 for more information if you are curious. The README
1022             also contains more information). This bug has been fixed in 5.10.
1023              
1024             The important thing to note is that with C, THERE
1025             ARE SECURITY IMPLICATIONS WHICH YOU IGNORE AT YOUR PERIL. The problem
1026             is this: if you do not have strict control over the patterns being
1027             fed to C when tracking is enabled, and someone
1028             slips you a pattern such as C and you
1029             attempt to match a string against the resulting pattern, you will
1030             know Fear and Loathing.
1031              
1032             What is more, the C<$^R> workaround means that that tracking does
1033             not work if you perform a bare C pattern match as shown
1034             above. You have to instead call the C method, in order to
1035             supply the necessary context to take care of the tracking housekeeping
1036             details.
1037              
1038             if( defined( my $match = $ra->match($_)) ) {
1039             print " $_ matched by $match\n";
1040             }
1041              
1042             In the case of a successful match, the original matched pattern
1043             is returned directly. The matched pattern will also be available
1044             through the C method.
1045              
1046             (Except that the above is not true for 5.6.0: the C method
1047             returns true or undef, and the C method always returns
1048             undef).
1049              
1050             If you are capturing parts of the pattern I C
1051             you will want to get at the captures. See the C, C,
1052             C and C methods. If you are not using captures
1053             then you may safely ignore this section.
1054              
1055             In 5.10, since the bug concerning C<$^R> has been resolved, there
1056             is no need to use C and the assembled pattern does
1057             not require any Perl code to be executed during the match.
1058              
1059             =cut
1060              
1061             sub match {
1062 0     0 1 0 my $self = shift;
1063 0         0 my $target = shift;
1064 0 0       0 $self->_build_re($self->as_string(@_)) unless defined $self->{re};
1065 0         0 $self->{m} = undef;
1066 0         0 $self->{mvar} = [];
1067 0 0       0 if( not $target =~ /$self->{re}/ ) {
1068 0         0 $self->{mbegin} = [];
1069 0         0 $self->{mend} = [];
1070 0         0 return undef;
1071             }
1072 0 0       0 $self->{m} = $^R if $] >= 5.009005;
1073 0         0 $self->{mbegin} = _path_copy([@-]);
1074 0         0 $self->{mend} = _path_copy([@+]);
1075 0         0 my $n = 0;
1076 0         0 for( my $n = 0; $n < @-; ++$n ) {
1077 0 0 0     0 push @{$self->{mvar}}, substr($target, $-[$n], $+[$n] - $-[$n])
  0         0  
1078             if defined $-[$n] and defined $+[$n];
1079             }
1080 0 0       0 if( $self->{track} ) {
1081 0 0       0 return defined $self->{m} ? $self->{mlist}[$self->{m}] : 1;
1082             }
1083             else {
1084 0         0 return 1;
1085             }
1086             }
1087              
1088             =item source
1089              
1090             When using tracked mode, after a successful match is made, returns
1091             the original source pattern that caused the match. In Perl 5.10,
1092             the C<$^R> variable can be used to as an index to fetch the correct
1093             pattern from the object.
1094              
1095             If no successful match has been performed, or the object is not in
1096             tracked mode, this method returns C.
1097              
1098             my $r = Regexp::Assemble->new->track(1)->add(qw(foo? bar{2} [Rr]at));
1099              
1100             for my $w (qw(this food is rather barren)) {
1101             if ($w =~ /$r/) {
1102             print "$w matched by ", $r->source($^R), $/;
1103             }
1104             else {
1105             print "$w no match\n";
1106             }
1107             }
1108              
1109             =cut
1110              
1111             sub source {
1112 0     0 1 0 my $self = shift;
1113 0 0       0 return unless $self->{track};
1114 0 0       0 defined($_[0]) and return $self->{mlist}[$_[0]];
1115 0 0       0 return unless defined $self->{m};
1116 0         0 return $self->{mlist}[$self->{m}];
1117             }
1118              
1119             =item mbegin
1120              
1121             This method returns a copy of C<@-> at the moment of the
1122             last match. You should ordinarily not need to bother with
1123             this, C should be able to supply all your needs.
1124              
1125             =cut
1126              
1127             sub mbegin {
1128 0     0 1 0 my $self = shift;
1129 0 0       0 return exists $self->{mbegin} ? $self->{mbegin} : [];
1130             }
1131              
1132             =item mend
1133              
1134             This method returns a copy of C<@+> at the moment of the
1135             last match.
1136              
1137             =cut
1138              
1139             sub mend {
1140 0     0 1 0 my $self = shift;
1141 0 0       0 return exists $self->{mend} ? $self->{mend} : [];
1142             }
1143              
1144             =item mvar(NUMBER)
1145              
1146             The C method returns the captures of the last match.
1147             C corresponds to $1, C to $2, and so on.
1148             C happens to return the target string matched,
1149             as a byproduct of walking down the C<@-> and C<@+> arrays
1150             after the match.
1151              
1152             If called without a parameter, C will return a
1153             reference to an array containing all captures.
1154              
1155             =cut
1156              
1157             sub mvar {
1158 0     0 1 0 my $self = shift;
1159 0 0       0 return undef unless exists $self->{mvar};
1160 0 0       0 return defined($_[0]) ? $self->{mvar}[$_[0]] : $self->{mvar};
1161             }
1162              
1163             =item capture
1164              
1165             The C method returns the the captures of the last
1166             match as an array. Unlink C, this method does not
1167             include the matched string. It is equivalent to getting an
1168             array back that contains C<$1, $2, $3, ...>.
1169              
1170             If no captures were found in the match, an empty array is
1171             returned, rather than C. You are therefore guaranteed
1172             to be able to use C<< for my $c ($re->capture) { ... >>
1173             without have to check whether anything was captured.
1174              
1175             =cut
1176              
1177             sub capture {
1178 0     0 1 0 my $self = shift;
1179 0 0       0 if( $self->{mvar} ) {
1180 0         0 my @capture = @{$self->{mvar}};
  0         0  
1181 0         0 shift @capture;
1182 0         0 return @capture;
1183             }
1184 0         0 return ();
1185             }
1186              
1187             =item matched
1188              
1189             If pattern tracking has been set, via the C attribute,
1190             or through the C method, this method will return the
1191             original pattern of the last successful match. Returns undef
1192             match has yet been performed, or tracking has not been enabled.
1193              
1194             See below in the NOTES section for additional subtleties of
1195             which you should be aware of when tracking patterns.
1196              
1197             Note that this method is not available in 5.6.0, due to
1198             limitations in the implementation of C<(?{...})> at the time.
1199              
1200             =cut
1201              
1202             sub matched {
1203 0     0 1 0 my $self = shift;
1204 0 0       0 return defined $self->{m} ? $self->{mlist}[$self->{m}] : undef;
1205             }
1206              
1207             =back
1208              
1209             =head2 Statistics/Reporting routines
1210              
1211             =over 8
1212              
1213             =item stats_add
1214              
1215             Returns the number of patterns added to the assembly (whether
1216             by C or C). Duplicate patterns are not included
1217             in this total.
1218              
1219             =cut
1220              
1221             sub stats_add {
1222 0     0 1 0 my $self = shift;
1223 0   0     0 return $self->{stats_add} || 0;
1224             }
1225              
1226             =item stats_dup
1227              
1228             Returns the number of duplicate patterns added to the assembly.
1229             If non-zero, this may be a sign that something is wrong with
1230             your data (or at the least, some needless redundancy). This may
1231             occur when you have two patterns (for instance, C and
1232             C) which map to the same result.
1233              
1234             =cut
1235              
1236             sub stats_dup {
1237 0     0 1 0 my $self = shift;
1238 0   0     0 return $self->{stats_dup} || 0;
1239             }
1240              
1241             =item stats_raw
1242              
1243             Returns the raw number of bytes in the patterns added to the
1244             assembly. This includes both original and duplicate patterns.
1245             For instance, adding the two patterns C and C will
1246             count as 4 bytes.
1247              
1248             =cut
1249              
1250             sub stats_raw {
1251 0     0 1 0 my $self = shift;
1252 0   0     0 return $self->{stats_raw} || 0;
1253             }
1254              
1255             =item stats_cooked
1256              
1257             Return the true number of bytes added to the assembly. This
1258             will not include duplicate patterns. Furthermore, it may differ
1259             from the raw bytes due to quotemeta treatment. For instance,
1260             C will count as 7 (not 8) bytes, because C<\,> will
1261             be stored as C<,>. Also, C<\Qa.b\E> is 7 bytes long, however,
1262             after the quotemeta directives are processed, C will be
1263             stored, for a total of 4 bytes.
1264              
1265             =cut
1266              
1267             sub stats_cooked {
1268 0     0 1 0 my $self = shift;
1269 0   0     0 return $self->{stats_cooked} || 0;
1270             }
1271              
1272             =item stats_length
1273              
1274             Returns the length of the resulting assembled expression.
1275             Until C or C have been called, the length
1276             will be 0 (since the assembly will have not yet been
1277             performed). The length includes only the pattern, not the
1278             additional (C<(?-xism...>) fluff added by the compilation.
1279              
1280             =cut
1281              
1282             sub stats_length {
1283 0     0 1 0 my $self = shift;
1284 0 0 0     0 return (defined $self->{str} and $self->{str} ne $Always_Fail) ? length $self->{str} : 0;
1285             }
1286              
1287             =item dup_warn(NUMBER|CODEREF)
1288              
1289             Turns warnings about duplicate patterns on or off. By
1290             default, no warnings are emitted. If the method is
1291             called with no parameters, or a true parameter,
1292             the object will carp about patterns it has
1293             already seen. To turn off the warnings, use 0 as a
1294             parameter.
1295              
1296             $r->dup_warn();
1297              
1298             The method may also be passed a code block. In this case
1299             the code will be executed and it will receive a reference
1300             to the object in question, and the lexed pattern.
1301              
1302             $r->dup_warn(
1303             sub {
1304             my $self = shift;
1305             print $self->stats_add, " patterns added at line $.\n",
1306             join( '', @_ ), " added previously\n";
1307             }
1308             )
1309              
1310             =cut
1311              
1312             sub dup_warn {
1313 0     0 1 0 my $self = shift;
1314 0 0       0 $self->{dup_warn} = defined($_[0]) ? $_[0] : 1;
1315 0         0 return $self;
1316             }
1317              
1318             =back
1319              
1320             =head2 Anchor routines
1321              
1322             Suppose you wish to assemble a series of patterns that all begin
1323             with C<^> and end with C<$> (anchor pattern to the beginning and
1324             end of line). Rather than add the anchors to each and every pattern
1325             (and possibly forget to do so when a new entry is added), you may
1326             specify the anchors in the object, and they will appear in the
1327             resulting pattern, and you no longer need to (or should) put them
1328             in your source patterns. For example, the two following snippets
1329             will produce identical patterns:
1330              
1331             $r->add(qw(^this ^that ^them))->as_string;
1332              
1333             $r->add(qw(this that them))->anchor_line_begin->as_string;
1334              
1335             # both techniques will produce ^th(?:at|em|is)
1336              
1337             All anchors are possible word (C<\b>) boundaries, line
1338             boundaries (C<^> and C<$>) and string boundaries (C<\A>
1339             and C<\Z> (or C<\z> if you absolutely need it)).
1340              
1341             The shortcut C> implies both
1342             C_begin> C_end>
1343             is also available. If different anchors are specified
1344             the most specific anchor wins. For instance, if both
1345             C and C are
1346             specified, C takes precedence.
1347              
1348             All the anchor methods are chainable.
1349              
1350             =over 8
1351              
1352             =item anchor_word_begin
1353              
1354             The resulting pattern will be prefixed with a C<\b>
1355             word boundary assertion when the value is true. Set
1356             to 0 to disable.
1357              
1358             $r->add('pre')->anchor_word_begin->as_string;
1359             # produces '\bpre'
1360              
1361             =cut
1362              
1363             sub anchor_word_begin {
1364 0     0 1 0 my $self = shift;
1365 0 0       0 $self->{anchor_word_begin} = defined($_[0]) ? $_[0] : 1;
1366 0         0 return $self;
1367             }
1368              
1369             =item anchor_word_end
1370              
1371             The resulting pattern will be suffixed with a C<\b>
1372             word boundary assertion when the value is true. Set
1373             to 0 to disable.
1374              
1375             $r->add(qw(ing tion))
1376             ->anchor_word_end
1377             ->as_string; # produces '(?:tion|ing)\b'
1378              
1379             =cut
1380              
1381             sub anchor_word_end {
1382 0     0 1 0 my $self = shift;
1383 0 0       0 $self->{anchor_word_end} = defined($_[0]) ? $_[0] : 1;
1384 0         0 return $self;
1385             }
1386              
1387             =item anchor_word
1388              
1389             The resulting pattern will be have C<\b>
1390             word boundary assertions at the beginning and end
1391             of the pattern when the value is true. Set
1392             to 0 to disable.
1393              
1394             $r->add(qw(cat carrot)
1395             ->anchor_word(1)
1396             ->as_string; # produces '\bca(?:rro)t\b'
1397              
1398             =cut
1399              
1400             sub anchor_word {
1401 0     0 1 0 my $self = shift;
1402 0         0 my $state = shift;
1403 0         0 $self->anchor_word_begin($state)->anchor_word_end($state);
1404 0         0 return $self;
1405             }
1406              
1407             =item anchor_line_begin
1408              
1409             The resulting pattern will be prefixed with a C<^>
1410             line boundary assertion when the value is true. Set
1411             to 0 to disable.
1412              
1413             $r->anchor_line_begin;
1414             # or
1415             $r->anchor_line_begin(1);
1416              
1417             =cut
1418              
1419             sub anchor_line_begin {
1420 0     0 1 0 my $self = shift;
1421 0 0       0 $self->{anchor_line_begin} = defined($_[0]) ? $_[0] : 1;
1422 0         0 return $self;
1423             }
1424              
1425             =item anchor_line_end
1426              
1427             The resulting pattern will be suffixed with a C<$>
1428             line boundary assertion when the value is true. Set
1429             to 0 to disable.
1430              
1431             # turn it off
1432             $r->anchor_line_end(0);
1433              
1434             =cut
1435              
1436             sub anchor_line_end {
1437 0     0 1 0 my $self = shift;
1438 0 0       0 $self->{anchor_line_end} = defined($_[0]) ? $_[0] : 1;
1439 0         0 return $self;
1440             }
1441              
1442             =item anchor_line
1443              
1444             The resulting pattern will be have the C<^> and C<$>
1445             line boundary assertions at the beginning and end
1446             of the pattern, respectively, when the value is true. Set
1447             to 0 to disable.
1448              
1449             $r->add(qw(cat carrot)
1450             ->anchor_line
1451             ->as_string; # produces '^ca(?:rro)t$'
1452              
1453             =cut
1454              
1455             sub anchor_line {
1456 0     0 1 0 my $self = shift;
1457 0         0 my $state = shift;
1458 0         0 $self->anchor_line_begin($state)->anchor_line_end($state);
1459 0         0 return $self;
1460             }
1461              
1462             =item anchor_string_begin
1463              
1464             The resulting pattern will be prefixed with a C<\A>
1465             string boundary assertion when the value is true. Set
1466             to 0 to disable.
1467              
1468             $r->anchor_string_begin(1);
1469              
1470             =cut
1471              
1472             sub anchor_string_begin {
1473 0     0 1 0 my $self = shift;
1474 0 0       0 $self->{anchor_string_begin} = defined($_[0]) ? $_[0] : 1;
1475 0         0 return $self;
1476             }
1477              
1478             =item anchor_string_end
1479              
1480             The resulting pattern will be suffixed with a C<\Z>
1481             string boundary assertion when the value is true. Set
1482             to 0 to disable.
1483              
1484             # disable the string boundary end anchor
1485             $r->anchor_string_end(0);
1486              
1487             =cut
1488              
1489             sub anchor_string_end {
1490 0     0 1 0 my $self = shift;
1491 0 0       0 $self->{anchor_string_end} = defined($_[0]) ? $_[0] : 1;
1492 0         0 return $self;
1493             }
1494              
1495             =item anchor_string_end_absolute
1496              
1497             The resulting pattern will be suffixed with a C<\z>
1498             string boundary assertion when the value is true. Set
1499             to 0 to disable.
1500              
1501             # disable the string boundary absolute end anchor
1502             $r->anchor_string_end_absolute(0);
1503              
1504             If you don't understand the difference between
1505             C<\Z> and C<\z>, the former will probably do what
1506             you want.
1507              
1508             =cut
1509              
1510             sub anchor_string_end_absolute {
1511 0     0 1 0 my $self = shift;
1512 0 0       0 $self->{anchor_string_end_absolute} = defined($_[0]) ? $_[0] : 1;
1513 0         0 return $self;
1514             }
1515              
1516             =item anchor_string
1517              
1518             The resulting pattern will be have the C<\A> and C<\Z>
1519             string boundary assertions at the beginning and end
1520             of the pattern, respectively, when the value is true. Set
1521             to 0 to disable.
1522              
1523             $r->add(qw(cat carrot)
1524             ->anchor_string
1525             ->as_string; # produces '\Aca(?:rro)t\Z'
1526              
1527             =cut
1528              
1529             sub anchor_string {
1530 0     0 1 0 my $self = shift;
1531 0 0       0 my $state = defined($_[0]) ? $_[0] : 1;
1532 0         0 $self->anchor_string_begin($state)->anchor_string_end($state);
1533 0         0 return $self;
1534             }
1535              
1536             =item anchor_string_absolute
1537              
1538             The resulting pattern will be have the C<\A> and C<\z>
1539             string boundary assertions at the beginning and end
1540             of the pattern, respectively, when the value is true. Set
1541             to 0 to disable.
1542              
1543             $r->add(qw(cat carrot)
1544             ->anchor_string_absolute
1545             ->as_string; # produces '\Aca(?:rro)t\z'
1546              
1547             =cut
1548              
1549             sub anchor_string_absolute {
1550 0     0 1 0 my $self = shift;
1551 0 0       0 my $state = defined($_[0]) ? $_[0] : 1;
1552 0         0 $self->anchor_string_begin($state)->anchor_string_end_absolute($state);
1553 0         0 return $self;
1554             }
1555              
1556             =back
1557              
1558             =over 8
1559              
1560             =item debug(NUMBER)
1561              
1562             Turns debugging on or off. Statements are printed
1563             to the currently selected file handle (STDOUT by default).
1564             If you are already using this handle, you will have to
1565             arrange to select an output handle to a file of your own
1566             choosing, before call the C, C or C)
1567             functions, otherwise it will scribble all over your
1568             carefully formatted output.
1569              
1570             =over 8
1571              
1572             =item C<0>
1573              
1574             Off. Turns off all debugging output.
1575              
1576             =item C<1>
1577              
1578             Add. Trace the addition of patterns.
1579              
1580             =item C<2>
1581              
1582             Reduce. Trace the process of reduction and assembly.
1583              
1584             =item C<4>
1585              
1586             Lex. Trace the lexing of the input patterns into its constituent
1587             tokens.
1588              
1589             =item C<8>
1590              
1591             Time. Print to STDOUT the time taken to load all the patterns. This is
1592             nothing more than the difference between the time the object was
1593             instantiated and the time reduction was initiated.
1594              
1595             # load=
1596              
1597             Any lengthy computation performed in the client code will be reflected
1598             in this value. Another line will be printed after reduction is
1599             complete.
1600              
1601             # reduce=
1602              
1603             The above output lines will be changed to C and
1604             C if the internal state of the object is corrupted
1605             and the initial timestamp is lost.
1606              
1607             The code attempts to load L in order to report fractional
1608             seconds. If this is not successful, the elapsed time is displayed
1609             in whole seconds.
1610              
1611             =back
1612              
1613             Values can be added (or or'ed together) to trace everything
1614              
1615             $r->debug(7)->add( '\\d+abc' );
1616              
1617             Calling C with no arguments turns debugging off.
1618              
1619             =cut
1620              
1621             sub debug {
1622 0     0 1 0 my $self = shift;
1623 0 0       0 $self->{debug} = defined($_[0]) ? $_[0] : 0;
1624 0 0       0 if ($self->_debug(DEBUG_TIME)) {
1625             # hmm, debugging time was switched on after instantiation
1626 0         0 $self->_init_time_func;
1627 0         0 $self->{_begin_time} = $self->{_time_func}->();
1628             }
1629 0         0 return $self;
1630             }
1631              
1632             =item dump
1633              
1634             Produces a synthetic view of the internal data structure. How
1635             to interpret the results is left as an exercise to the reader.
1636              
1637             print $r->dump;
1638              
1639             =cut
1640              
1641             sub dump {
1642 0     0 1 0 return _dump($_[0]->_path);
1643             }
1644              
1645             =item chomp(0|1)
1646              
1647             Turns chomping on or off.
1648              
1649             IMPORTANT: As of version 0.24, chomping is now on by default as it
1650             makes C Just Work. The only time you may run into trouble
1651             is with C. So don't do that, or else explicitly turn
1652             off chomping.
1653              
1654             To avoid incorporating (spurious)
1655             record separators (such as "\n" on Unix) when reading from a file,
1656             C Cs its input. If you don't want this to happen,
1657             call C with a false value.
1658              
1659             $re->chomp(0); # really want the record separators
1660             $re->add();
1661              
1662             =cut
1663              
1664             sub chomp {
1665 0     0 1 0 my $self = shift;
1666 0 0       0 $self->{chomp} = defined($_[0]) ? $_[0] : 1;
1667 0         0 return $self;
1668             }
1669              
1670             =item fold_meta_pairs(NUMBER)
1671              
1672             Determines whether C<\s>, C<\S> and C<\w>, C<\W> and C<\d>, C<\D>
1673             are folded into a C<.> (dot). Folding happens by default (for
1674             reasons of backwards compatibility, even though it is wrong when
1675             the C expression modifier is active).
1676              
1677             Call this method with a false value to prevent this behaviour (which
1678             is only a problem when dealing with C<\n> if the C expression
1679             modifier is also set).
1680              
1681             $re->add( '\\w', '\\W' );
1682             my $clone = $re->clone;
1683              
1684             $clone->fold_meta_pairs(0);
1685             print $clone->as_string; # prints '.'
1686             print $re->as_string; # print '[\W\w]'
1687              
1688             =cut
1689              
1690             sub fold_meta_pairs {
1691 0     0 1 0 my $self = shift;
1692 0 0       0 $self->{fold_meta_pairs} = defined($_[0]) ? $_[0] : 1;
1693 0         0 return $self;
1694             }
1695              
1696             =item indent(NUMBER)
1697              
1698             Sets the level of indent for pretty-printing nested groups
1699             within a pattern. See the C method for more details.
1700             When called without a parameter, no indenting is performed.
1701              
1702             $re->indent( 4 );
1703             print $re->as_string;
1704              
1705             =cut
1706              
1707             sub indent {
1708 0     0 1 0 my $self = shift;
1709 0 0       0 $self->{indent} = defined($_[0]) ? $_[0] : 0;
1710 0         0 return $self;
1711             }
1712              
1713             =item lookahead(0|1)
1714              
1715             Turns on zero-width lookahead assertions. This is usually
1716             beneficial when you expect that the pattern will usually fail.
1717             If you expect that the pattern will usually match you will
1718             probably be worse off.
1719              
1720             =cut
1721              
1722             sub lookahead {
1723 0     0 1 0 my $self = shift;
1724 0 0       0 $self->{lookahead} = defined($_[0]) ? $_[0] : 1;
1725 0         0 return $self;
1726             }
1727              
1728             =item flags(STRING)
1729              
1730             Sets the flags that govern how the pattern behaves (for
1731             versions of Perl up to 5.9 or so, these are C). By
1732             default no flags are enabled.
1733              
1734              
1735             =item modifiers(STRING)
1736              
1737             An alias of the C method, for users familiar with
1738             C.
1739              
1740             =cut
1741              
1742             sub flags {
1743 0     0 1 0 my $self = shift;
1744 0 0       0 $self->{flags} = defined($_[0]) ? $_[0] : '';
1745 0         0 return $self;
1746             }
1747              
1748             sub modifiers {
1749 0     0 1 0 my $self = shift;
1750 0         0 return $self->flags(@_);
1751             }
1752              
1753             =item track(0|1)
1754              
1755             Turns tracking on or off. When this attribute is enabled,
1756             additional housekeeping information is inserted into the
1757             assembled expression using C<({...}> embedded code
1758             constructs. This provides the necessary information to
1759             determine which, of the original patterns added, was the
1760             one that caused the match.
1761              
1762             $re->track( 1 );
1763             if( $target =~ /$re/ ) {
1764             print "$target matched by ", $re->matched, "\n";
1765             }
1766              
1767             Note that when this functionality is enabled, no
1768             reduction is performed and no character classes are
1769             generated. In other words, C is not
1770             reduced down to C<(?:br|t)ag> and C is not
1771             reduced to C.
1772              
1773             =cut
1774              
1775             sub track {
1776 0     0 1 0 my $self = shift;
1777 0 0       0 $self->{track} = defined($_[0]) ? $_[0] : 1;
1778 0         0 return $self;
1779             }
1780              
1781             =item unroll_plus(0|1)
1782              
1783             Turns the unrolling of plus metacharacters on or off. When
1784             a pattern is broken up, C becomes C, C (and
1785             C becomes C, C. This may allow the freed C
1786             to assemble with other patterns. Not enabled by default.
1787              
1788             =cut
1789              
1790             sub unroll_plus {
1791 0     0 1 0 my $self = shift;
1792 0 0       0 $self->{unroll_plus} = defined($_[0]) ? $_[0] : 1;
1793 0         0 return $self;
1794             }
1795              
1796             =item lex(SCALAR)
1797              
1798             Change the pattern used to break a string apart into tokens.
1799             You can examine the C script as a starting point.
1800              
1801             =cut
1802              
1803             sub lex {
1804 0     0 1 0 my $self = shift;
1805 0         0 $self->{lex} = qr($_[0]);
1806 0         0 return $self;
1807             }
1808              
1809             =item reduce(0|1)
1810              
1811             Turns pattern reduction on or off. A reduced pattern may
1812             be considerably shorter than an unreduced pattern. Consider
1813             C I C. An unreduced
1814             pattern will be very similar to those produced by
1815             C. Reduction is on by default. Turning
1816             it off speeds assembly (but assembly is pretty fast -- it's
1817             the breaking up of the initial patterns in the lexing stage
1818             that can consume a non-negligible amount of time).
1819              
1820             =cut
1821              
1822             sub reduce {
1823 0     0 1 0 my $self = shift;
1824 0 0       0 $self->{reduce} = defined($_[0]) ? $_[0] : 1;
1825 0         0 return $self;
1826             }
1827              
1828             =item mutable(0|1)
1829              
1830             This method has been marked as DEPRECATED. It will be removed
1831             in a future release. See the C method for a technique
1832             to replace its functionality.
1833              
1834             =cut
1835              
1836             sub mutable {
1837 0     0 1 0 my $self = shift;
1838 0 0       0 $self->{mutable} = defined($_[0]) ? $_[0] : 1;
1839 0         0 return $self;
1840             }
1841              
1842             =item reset
1843              
1844             Empties out the patterns that have been Ced or C-ed
1845             into the object. Does not modify the state of controller attributes
1846             such as C, C, C and the like.
1847              
1848             =cut
1849              
1850             sub reset {
1851             # reinitialise the internal state of the object
1852 0     0 1 0 my $self = shift;
1853 0         0 $self->{path} = [];
1854 0         0 $self->{re} = undef;
1855 0         0 $self->{str} = undef;
1856 0         0 return $self;
1857             }
1858              
1859             =item Default_Lexer
1860              
1861             B the C function is a class method, not
1862             an object method. It is a fatal error to call it as an object
1863             method.
1864              
1865             The C method lets you replace the default pattern
1866             used for all subsequently created C objects. It
1867             will not have any effect on existing objects. (It is also possible
1868             to override the lexer pattern used on a per-object basis).
1869              
1870             The parameter should be an ordinary scalar, not a compiled
1871             pattern. If the pattern fails to match all parts of the string,
1872             the missing parts will be returned as single chunks. Therefore
1873             the following pattern is legal (albeit rather cork-brained):
1874              
1875             Regexp::Assemble::Default_Lexer( '\\d' );
1876              
1877             The above pattern will split up input strings digit by digit, and
1878             all non-digit characters as single chunks.
1879              
1880             =cut
1881              
1882             sub Default_Lexer {
1883 0 0   0 1 0 if( $_[0] ) {
1884 0 0       0 if( my $refname = ref($_[0]) ) {
1885 0         0 require Carp;
1886 0         0 Carp::croak("Cannot pass a $refname to Default_Lexer");
1887             }
1888 0         0 $Current_Lexer = $_[0];
1889             }
1890 0 0       0 return defined $Current_Lexer ? $Current_Lexer : $Default_Lexer;
1891             }
1892              
1893             # --- no user serviceable parts below ---
1894              
1895             # -- debug helpers
1896              
1897             sub _debug {
1898 17     17   44 my $self = shift;
1899 17 50       221 return $self->{debug} & shift() ? 1 : 0;
1900             }
1901              
1902             # -- helpers
1903              
1904             sub _path {
1905             # access the path
1906 10     10   117 return $_[0]->{path};
1907             }
1908              
1909             # -- the heart of the matter
1910              
1911             $have_Storable = do {
1912             eval {
1913             require Storable;
1914             import Storable 'dclone';
1915             };
1916             $@ ? 0 : 1;
1917             };
1918              
1919             sub _path_clone {
1920 0 0   0   0 $have_Storable ? dclone($_[0]) : _path_copy($_[0]);
1921             }
1922              
1923             sub _path_copy {
1924 0     0   0 my $path = shift;
1925 0         0 my $new = [];
1926 0         0 for( my $p = 0; $p < @$path; ++$p ) {
1927 0 0       0 if( ref($path->[$p]) eq 'HASH' ) {
    0          
1928 0         0 push @$new, _node_copy($path->[$p]);
1929             }
1930             elsif( ref($path->[$p]) eq 'ARRAY' ) {
1931 0         0 push @$new, _path_copy($path->[$p]);
1932             }
1933             else {
1934 0         0 push @$new, $path->[$p];
1935             }
1936             }
1937 0         0 return $new;
1938             }
1939              
1940             sub _node_copy {
1941 0     0   0 my $node = shift;
1942 0         0 my $new = {};
1943 0         0 while( my( $k, $v ) = each %$node ) {
1944 0 0       0 $new->{$k} = defined($v)
1945             ? _path_copy($v)
1946             : undef
1947             ;
1948             }
1949 0         0 return $new;
1950             }
1951              
1952             sub _insert_path {
1953 4     4   11 my $self = shift;
1954 4         8 my $list = shift;
1955 4         37 my $debug = shift;
1956 4         10 my @in = @{shift()}; # create a new copy
  4         17  
1957 4 50       16 if( @$list == 0 ) { # special case the first time
1958 4 50 0     38 if( @in == 0 or (@in == 1 and (not defined $in[0] or $in[0] eq ''))) {
      33        
      33        
1959 0         0 return [{'' => undef}];
1960             }
1961             else {
1962 4         30 return \@in;
1963             }
1964             }
1965 0 0       0 $debug and print "# _insert_path @{[_dump(\@in)]} into @{[_dump($list)]}\n";
  0         0  
  0         0  
1966 0         0 my $path = $list;
1967 0         0 my $offset = 0;
1968 0         0 my $token;
1969 0 0       0 if( not @in ) {
1970 0 0       0 if( ref($list->[0]) ne 'HASH' ) {
1971 0         0 return [ { '' => undef, $list->[0] => $list } ];
1972             }
1973             else {
1974 0         0 $list->[0]{''} = undef;
1975 0         0 return $list;
1976             }
1977             }
1978 0         0 while( defined( $token = shift @in )) {
1979 0 0       0 if( ref($token) eq 'HASH' ) {
1980 0 0       0 $debug and print "# p0=", _dump($path), "\n";
1981 0         0 $path = $self->_insert_node( $path, $offset, $token, $debug, @in );
1982 0 0       0 $debug and print "# p1=", _dump($path), "\n";
1983 0         0 last;
1984             }
1985 0 0       0 if( ref($path->[$offset]) eq 'HASH' ) {
1986 0 0       0 $debug and print "# at (off=$offset len=@{[scalar @$path]}) ", _dump($path->[$offset]), "\n";
  0         0  
1987 0         0 my $node = $path->[$offset];
1988 0 0       0 if( exists( $node->{$token} )) {
1989 0 0       0 if ($offset < $#$path) {
1990             my $new = {
1991             $token => [$token, @in],
1992 0         0 _re_path($self, [$node]) => [@{$path}[$offset..$#$path]],
  0         0  
1993             };
1994 0         0 splice @$path, $offset, @$path-$offset, $new;
1995 0         0 last;
1996             }
1997             else {
1998 0 0       0 $debug and print "# descend key=$token @{[_dump($node->{$token})]}\n";
  0         0  
1999 0         0 $path = $node->{$token};
2000 0         0 $offset = 0;
2001 0         0 redo;
2002             }
2003             }
2004             else {
2005 0 0       0 $debug and print "# add path ($token:@{[_dump(\@in)]}) into @{[_dump($path)]} at off=$offset to end=@{[scalar $#$path]}\n";
  0         0  
  0         0  
  0         0  
2006 0 0       0 if( $offset == $#$path ) {
2007 0         0 $node->{$token} = [ $token, @in ];
2008             }
2009             else {
2010             my $new = {
2011             _node_key($token) => [ $token, @in ],
2012 0         0 _node_key($node) => [@{$path}[$offset..$#{$path}]],
  0         0  
  0         0  
2013             };
2014 0         0 splice( @$path, $offset, @$path - $offset, $new );
2015 0 0       0 $debug and print "# fused node=@{[_dump($new)]} path=@{[_dump($path)]}\n";
  0         0  
  0         0  
2016             }
2017 0         0 last;
2018             }
2019             }
2020              
2021 0 0       0 if( $debug ) {
2022 0         0 my $msg = '';
2023 0         0 my $n;
2024 0         0 for( $n = 0; $n < @$path; ++$n ) {
2025 0 0       0 $msg .= ' ' if $n;
2026             my $atom = ref($path->[$n]) eq 'HASH'
2027 0 0       0 ? '{'.join( ' ', keys(%{$path->[$n]})).'}'
  0         0  
2028             : $path->[$n]
2029             ;
2030 0 0       0 $msg .= $n == $offset ? "<$atom>" : $atom;
2031             }
2032 0         0 print "# at path ($msg)\n";
2033             }
2034              
2035 0 0       0 if( $offset >= @$path ) {
    0          
    0          
2036 0         0 push @$path, { $token => [ $token, @in ], '' => undef };
2037 0 0       0 $debug and print "# added remaining @{[_dump($path)]}\n";
  0         0  
2038 0         0 last;
2039             }
2040             elsif( $token ne $path->[$offset] ) {
2041 0 0       0 $debug and print "# token $token not present\n";
2042             splice @$path, $offset, @$path-$offset, {
2043             length $token
2044             ? ( _node_key($token) => [$token, @in])
2045             : ( '' => undef )
2046             ,
2047 0 0       0 $path->[$offset] => [@{$path}[$offset..$#{$path}]],
  0         0  
  0         0  
2048             };
2049 0 0       0 $debug and print "# path=@{[_dump($path)]}\n";
  0         0  
2050 0         0 last;
2051             }
2052             elsif( not @in ) {
2053 0 0       0 $debug and print "# last token to add\n";
2054 0 0       0 if( defined( $path->[$offset+1] )) {
2055 0         0 ++$offset;
2056 0 0       0 if( ref($path->[$offset]) eq 'HASH' ) {
2057 0 0       0 $debug and print "# add sentinel to node\n";
2058 0         0 $path->[$offset]{''} = undef;
2059             }
2060             else {
2061 0 0       0 $debug and print "# convert <$path->[$offset]> to node for sentinel\n";
2062             splice @$path, $offset, @$path-$offset, {
2063             '' => undef,
2064 0         0 $path->[$offset] => [ @{$path}[$offset..$#{$path}] ],
  0         0  
  0         0  
2065             };
2066             }
2067             }
2068             else {
2069             # already seen this pattern
2070 0         0 ++$self->{stats_dup};
2071             }
2072 0         0 last;
2073             }
2074             # if we get here then @_ still contains a token
2075 0         0 ++$offset;
2076             }
2077 0         0 $list;
2078             }
2079              
2080             sub _insert_node {
2081 0     0   0 my $self = shift;
2082 0         0 my $path = shift;
2083 0         0 my $offset = shift;
2084 0         0 my $token = shift;
2085 0         0 my $debug = shift;
2086 0         0 my $path_end = [@{$path}[$offset..$#{$path}]];
  0         0  
  0         0  
2087             # NB: $path->[$offset] and $[path_end->[0] are equivalent
2088 0         0 my $token_key = _re_path($self, [$token]);
2089 0 0       0 $debug and print "# insert node(@{[_dump($token)]}:@{[_dump(\@_)]}) (key=$token_key)",
  0         0  
  0         0  
2090 0         0 " at path=@{[_dump($path_end)]}\n";
2091 0 0       0 if( ref($path_end->[0]) eq 'HASH' ) {
2092 0 0       0 if( exists($path_end->[0]{$token_key}) ) {
    0          
2093 0 0       0 if( @$path_end > 1 ) {
2094 0         0 my $path_key = _re_path($self, [$path_end->[0]]);
2095 0         0 my $new = {
2096             $path_key => [ @$path_end ],
2097             $token_key => [ $token, @_ ],
2098             };
2099 0 0       0 $debug and print "# +bifurcate new=@{[_dump($new)]}\n";
  0         0  
2100 0         0 splice( @$path, $offset, @$path_end, $new );
2101             }
2102             else {
2103 0         0 my $old_path = $path_end->[0]{$token_key};
2104 0         0 my $new_path = [];
2105 0   0     0 while( @$old_path and _node_eq( $old_path->[0], $token )) {
2106 0 0       0 $debug and print "# identical nodes in sub_path ",
    0          
2107             ref($token) ? _dump($token) : $token, "\n";
2108 0         0 push @$new_path, shift(@$old_path);
2109 0         0 $token = shift @_;
2110             }
2111 0 0       0 if( @$new_path ) {
2112 0         0 my $new;
2113 0         0 my $token_key = $token;
2114 0 0       0 if( @_ ) {
2115 0         0 $new = {
2116             _re_path($self, $old_path) => $old_path,
2117             $token_key => [$token, @_],
2118             };
2119 0 0       0 $debug and print "# insert_node(bifurc) n=@{[_dump([$new])]}\n";
  0         0  
2120             }
2121             else {
2122 0 0       0 $debug and print "# insert $token into old path @{[_dump($old_path)]}\n";
  0         0  
2123 0 0       0 if( @$old_path ) {
2124 0         0 $new = ($self->_insert_path( $old_path, $debug, [$token] ))->[0];
2125             }
2126             else {
2127 0         0 $new = { '' => undef, $token => [$token] };
2128             }
2129             }
2130 0         0 push @$new_path, $new;
2131             }
2132 0         0 $path_end->[0]{$token_key} = $new_path;
2133 0 0       0 $debug and print "# +_insert_node result=@{[_dump($path_end)]}\n";
  0         0  
2134 0         0 splice( @$path, $offset, @$path_end, @$path_end );
2135             }
2136             }
2137             elsif( not _node_eq( $path_end->[0], $token )) {
2138 0 0       0 if( @$path_end > 1 ) {
2139 0         0 my $path_key = _re_path($self, [$path_end->[0]]);
2140 0         0 my $new = {
2141             $path_key => [ @$path_end ],
2142             $token_key => [ $token, @_ ],
2143             };
2144 0 0       0 $debug and print "# path->node1 at $path_key/$token_key @{[_dump($new)]}\n";
  0         0  
2145 0         0 splice( @$path, $offset, @$path_end, $new );
2146             }
2147             else {
2148 0 0       0 $debug and print "# next in path is node, trivial insert at $token_key\n";
2149 0         0 $path_end->[0]{$token_key} = [$token, @_];
2150 0         0 splice( @$path, $offset, @$path_end, @$path_end );
2151             }
2152             }
2153             else {
2154 0   0     0 while( @$path_end and _node_eq( $path_end->[0], $token )) {
2155 0 0       0 $debug and print "# identical nodes @{[_dump([$token])]}\n";
  0         0  
2156 0         0 shift @$path_end;
2157 0         0 $token = shift @_;
2158 0         0 ++$offset;
2159             }
2160 0 0       0 if( @$path_end ) {
2161 0 0       0 $debug and print "# insert at $offset $token:@{[_dump(\@_)]} into @{[_dump($path_end)]}\n";
  0         0  
  0         0  
2162 0         0 $path_end = $self->_insert_path( $path_end, $debug, [$token, @_] );
2163 0 0       0 $debug and print "# got off=$offset s=@{[scalar @_]} path_add=@{[_dump($path_end)]}\n";
  0         0  
  0         0  
2164 0         0 splice( @$path, $offset, @$path - $offset, @$path_end );
2165 0 0       0 $debug and print "# got final=@{[_dump($path)]}\n";
  0         0  
2166             }
2167             else {
2168 0         0 $token_key = _node_key($token);
2169 0         0 my $new = {
2170             '' => undef,
2171             $token_key => [ $token, @_ ],
2172             };
2173 0 0       0 $debug and print "# convert opt @{[_dump($new)]}\n";
  0         0  
2174 0         0 push @$path, $new;
2175             }
2176             }
2177             }
2178             else {
2179 0 0       0 if( @$path_end ) {
2180 0         0 my $new = {
2181             $path_end->[0] => [ @$path_end ],
2182             $token_key => [ $token, @_ ],
2183             };
2184 0 0       0 $debug and print "# atom->node @{[_dump($new)]}\n";
  0         0  
2185 0         0 splice( @$path, $offset, @$path_end, $new );
2186 0 0       0 $debug and print "# out=@{[_dump($path)]}\n";
  0         0  
2187             }
2188             else {
2189 0 0       0 $debug and print "# add opt @{[_dump([$token,@_])]} via $token_key\n";
  0         0  
2190 0         0 push @$path, {
2191             '' => undef,
2192             $token_key => [ $token, @_ ],
2193             };
2194             }
2195             }
2196 0         0 $path;
2197             }
2198              
2199             sub _reduce {
2200 3     3   21 my $self = shift;
2201 3         51 my $context = { debug => $self->_debug(DEBUG_TAIL), depth => 0 };
2202              
2203 3 50       24 if ($self->_debug(DEBUG_TIME)) {
2204 0         0 $self->_init_time_func;
2205 0         0 my $now = $self->{_time_func}->();
2206 0 0       0 if (exists $self->{_begin_time}) {
2207 0         0 printf "# load=%0.6f\n", $now - $self->{_begin_time};
2208             }
2209             else {
2210 0         0 printf "# load-epoch=%0.6f\n", $now;
2211             }
2212 0         0 $self->{_begin_time} = $self->{_time_func}->();
2213             }
2214              
2215 3         31 my ($head, $tail) = _reduce_path( $self->_path, $context );
2216 3 50       22 $context->{debug} and print "# final head=", _dump($head), ' tail=', _dump($tail), "\n";
2217 3 50       14 if( !@$head ) {
2218 3         16 $self->{path} = $tail;
2219             }
2220             else {
2221             $self->{path} = [
2222 0         0 @{_unrev_path( $tail, $context )},
2223 0         0 @{_unrev_path( $head, $context )},
  0         0  
2224             ];
2225             }
2226              
2227 3 50       17 if ($self->_debug(DEBUG_TIME)) {
2228 0         0 my $now = $self->{_time_func}->();
2229 0 0       0 if (exists $self->{_begin_time}) {
2230 0         0 printf "# reduce=%0.6f\n", $now - $self->{_begin_time};
2231             }
2232             else {
2233 0         0 printf "# reduce-epoch=%0.6f\n", $now;
2234             }
2235 0         0 $self->{_begin_time} = $self->{_time_func}->();
2236             }
2237              
2238 3 50       23 $context->{debug} and print "# final path=", _dump($self->{path}), "\n";
2239 3         9 return $self;
2240             }
2241              
2242             sub _remove_optional {
2243 0 0   0   0 if( exists $_[0]->{''} ) {
2244 0         0 delete $_[0]->{''};
2245 0         0 return 1;
2246             }
2247 0         0 return 0;
2248             }
2249              
2250             sub _reduce_path {
2251 3     3   11 my ($path, $ctx) = @_;
2252 3         25 my $indent = ' ' x $ctx->{depth};
2253 3         10 my $debug = $ctx->{debug};
2254 3 50       20 $debug and print "#$indent _reduce_path $ctx->{depth} ", _dump($path), "\n";
2255 3         16 my $new;
2256 3         9 my $head = [];
2257 3         13 my $tail = [];
2258 3         33 while( defined( my $p = pop @$path )) {
2259 22 50       53 if( ref($p) eq 'HASH' ) {
2260 0         0 my ($node_head, $node_tail) = _reduce_node($p, _descend($ctx) );
2261 0 0       0 $debug and print "#$indent| head=", _dump($node_head), " tail=", _dump($node_tail), "\n";
2262 0 0       0 push @$head, @$node_head if scalar @$node_head;
2263 0 0       0 push @$tail, ref($node_tail) eq 'HASH' ? $node_tail : @$node_tail;
2264             }
2265             else {
2266 22 50       50 if( @$head ) {
2267 0 0       0 $debug and print "#$indent| push $p leaves @{[_dump($path)]}\n";
  0         0  
2268 0         0 push @$tail, $p;
2269             }
2270             else {
2271 22 50       59 $debug and print "#$indent| unshift $p\n";
2272 22         110 unshift @$tail, $p;
2273             }
2274             }
2275             }
2276 0         0 $debug and print "#$indent| tail nr=@{[scalar @$tail]} t0=", ref($tail->[0]),
2277 3 0       13 (ref($tail->[0]) eq 'HASH' ? " n=" . scalar(keys %{$tail->[0]}) : '' ),
  0 50       0  
2278             "\n";
2279 3 50 33     47 if( @$tail > 1
      33        
2280             and ref($tail->[0]) eq 'HASH'
2281 0         0 and keys %{$tail->[0]} == 2
2282             ) {
2283 0         0 my $opt;
2284             my $fixed;
2285 0         0 while( my ($key, $path) = each %{$tail->[0]} ) {
  0         0  
2286 0 0       0 $debug and print "#$indent| scan k=$key p=@{[_dump($path)]}\n";
  0         0  
2287 0 0       0 next unless $path;
2288 0 0 0     0 if (@$path == 1 and ref($path->[0]) eq 'HASH') {
2289 0         0 $opt = $path->[0];
2290             }
2291             else {
2292 0         0 $fixed = $path;
2293             }
2294             }
2295 0 0       0 if( exists $tail->[0]{''} ) {
2296 0         0 my $path = [@{$tail}[1..$#{$tail}]];
  0         0  
  0         0  
2297 0         0 $tail = $tail->[0];
2298 0         0 ($head, $tail, $path) = _slide_tail( $head, $tail, $path, _descend($ctx) );
2299 0         0 $tail = [$tail, @$path];
2300             }
2301             }
2302 3 50       22 $debug and print "#$indent _reduce_path $ctx->{depth} out head=", _dump($head), ' tail=', _dump($tail), "\n";
2303 3         17 return ($head, $tail);
2304             }
2305              
2306             sub _reduce_node {
2307 0     0   0 my ($node, $ctx) = @_;
2308 0         0 my $indent = ' ' x $ctx->{depth};
2309 0         0 my $debug = $ctx->{debug};
2310 0         0 my $optional = _remove_optional($node);
2311 0 0       0 $debug and print "#$indent _reduce_node $ctx->{depth} in @{[_dump($node)]} opt=$optional\n";
  0         0  
2312 0 0 0     0 if( $optional and scalar keys %$node == 1 ) {
2313 0         0 my $path = (values %$node)[0];
2314 0 0       0 if( not grep { ref($_) eq 'HASH' } @$path ) {
  0         0  
2315             # if we have removed an optional, and there is only one path
2316             # left then there is nothing left to compare. Because of the
2317             # optional it cannot participate in any further reductions.
2318             # (unless we test for equality among sub-trees).
2319 0         0 my $result = {
2320             '' => undef,
2321             $path->[0] => $path
2322             };
2323 0 0       0 $debug and print "#$indent| fast fail @{[_dump($result)]}\n";
  0         0  
2324 0         0 return [], $result;
2325             }
2326             }
2327              
2328 0         0 my( $fail, $reduce ) = _scan_node( $node, _descend($ctx) );
2329              
2330 0 0       0 $debug and print "#$indent|_scan_node done opt=$optional reduce=@{[_dump($reduce)]} fail=@{[_dump($fail)]}\n";
  0         0  
  0         0  
2331              
2332             # We now perform tail reduction on each of the nodes in the reduce
2333             # hash. If we have only one key, we know we will have a successful
2334             # reduction (since everything that was inserted into the node based
2335             # on the value of the last token of each path all mapped to the same
2336             # value).
2337              
2338 0 0 0     0 if( @$fail == 0 and keys %$reduce == 1 and not $optional) {
      0        
2339             # every path shares a common path
2340 0         0 my $path = (values %$reduce)[0];
2341 0         0 my ($common, $tail) = _do_reduce( $path, _descend($ctx) );
2342 0 0       0 $debug and print "#$indent|_reduce_node $ctx->{depth} common=@{[_dump($common)]} tail=", _dump($tail), "\n";
  0         0  
2343 0         0 return( $common, $tail );
2344             }
2345              
2346             # this node resulted in a list of paths, game over
2347 0         0 $ctx->{indent} = $indent;
2348 0         0 return _reduce_fail( $reduce, $fail, $optional, _descend($ctx) );
2349             }
2350              
2351             sub _reduce_fail {
2352 0     0   0 my( $reduce, $fail, $optional, $ctx ) = @_;
2353 0         0 my( $debug, $depth, $indent ) = @{$ctx}{qw(debug depth indent)};
  0         0  
2354 0         0 my %result;
2355 0 0       0 $result{''} = undef if $optional;
2356 0         0 my $p;
2357 0         0 for $p (keys %$reduce) {
2358 0         0 my $path = $reduce->{$p};
2359 0 0       0 if( scalar @$path == 1 ) {
2360 0         0 $path = $path->[0];
2361 0 0       0 $debug and print "#$indent| -simple opt=$optional unrev @{[_dump($path)]}\n";
  0         0  
2362 0         0 $path = _unrev_path($path, _descend($ctx) );
2363 0         0 $result{_node_key($path->[0])} = $path;
2364             }
2365             else {
2366 0 0       0 $debug and print "#$indent| _do_reduce(@{[_dump($path)]})\n";
  0         0  
2367 0         0 my ($common, $tail) = _do_reduce( $path, _descend($ctx) );
2368             $path = [
2369             (
2370             ref($tail) eq 'HASH'
2371             ? _unrev_node($tail, _descend($ctx) )
2372             : _unrev_path($tail, _descend($ctx) )
2373             ),
2374 0 0       0 @{_unrev_path($common, _descend($ctx) )}
  0         0  
2375             ];
2376 0 0       0 $debug and print "#$indent| +reduced @{[_dump($path)]}\n";
  0         0  
2377 0         0 $result{_node_key($path->[0])} = $path;
2378             }
2379             }
2380 0         0 my $f;
2381 0         0 for $f( @$fail ) {
2382 0 0       0 $debug and print "#$indent| +fail @{[_dump($f)]}\n";
  0         0  
2383 0         0 $result{$f->[0]} = $f;
2384             }
2385 0 0       0 $debug and print "#$indent _reduce_fail $depth fail=@{[_dump(\%result)]}\n";
  0         0  
2386 0         0 return ( [], \%result );
2387             }
2388              
2389             sub _scan_node {
2390 0     0   0 my( $node, $ctx ) = @_;
2391 0         0 my $indent = ' ' x $ctx->{depth};
2392 0         0 my $debug = $ctx->{debug};
2393              
2394             # For all the paths in the node, reverse them. If the first token
2395             # of the path is a scalar, push it onto an array in a hash keyed by
2396             # the value of the scalar.
2397             #
2398             # If it is a node, call _reduce_node on this node beforehand. If we
2399             # get back a common head, all of the paths in the subnode shared a
2400             # common tail. We then store the common part and the remaining node
2401             # of paths (which is where the paths diverged from the end and install
2402             # this into the same hash. At this point both the common and the tail
2403             # are in reverse order, just as simple scalar paths are.
2404             #
2405             # On the other hand, if there were no common path returned then all
2406             # the paths of the sub-node diverge at the end character. In this
2407             # case the tail cannot participate in any further reductions and will
2408             # appear in forward order.
2409             #
2410             # certainly the hurgliest function in the whole file :(
2411              
2412             # $debug = 1 if $depth >= 8;
2413 0         0 my @fail;
2414             my %reduce;
2415              
2416 0         0 my $n;
2417 0         0 for $n(
2418 0         0 map { substr($_, index($_, '#')+1) }
2419             sort
2420             map {
2421             join( '|' =>
2422 0         0 scalar(grep {ref($_) eq 'HASH'} @{$node->{$_}}),
  0         0  
2423             _node_offset($node->{$_}),
2424 0         0 scalar @{$node->{$_}},
  0         0  
2425             )
2426             . "#$_"
2427             }
2428             keys %$node ) {
2429 0         0 my( $end, @path ) = reverse @{$node->{$n}};
  0         0  
2430 0 0       0 if( ref($end) ne 'HASH' ) {
2431 0 0       0 $debug and print "# $indent|_scan_node push reduce ($end:@{[_dump(\@path)]})\n";
  0         0  
2432 0         0 push @{$reduce{$end}}, [ $end, @path ];
  0         0  
2433             }
2434             else {
2435 0 0       0 $debug and print "# $indent|_scan_node head=", _dump(\@path), ' tail=', _dump($end), "\n";
2436 0         0 my $new_path;
2437             # deal with sing, singing => s(?:ing)?ing
2438 0 0 0     0 if( keys %$end == 2 and exists $end->{''} ) {
2439 0         0 my ($key, $opt_path) = each %$end;
2440 0 0       0 ($key, $opt_path) = each %$end if $key eq '';
2441 0         0 $opt_path = [reverse @{$opt_path}];
  0         0  
2442 0 0       0 $debug and print "# $indent| check=", _dump($opt_path), "\n";
2443 0         0 my $end = { '' => undef, $opt_path->[0] => [@$opt_path] };
2444 0         0 my $head = [];
2445 0         0 my $path = [@path];
2446 0         0 ($head, my $slide, $path) = _slide_tail( $head, $end, $path, $ctx );
2447 0 0       0 if( @$head ) {
2448 0         0 $new_path = [ @$head, $slide, @$path ];
2449             }
2450             }
2451 0 0       0 if( $new_path ) {
2452 0 0       0 $debug and print "# $indent|_scan_node slid=", _dump($new_path), "\n";
2453 0         0 push @{$reduce{$new_path->[0]}}, $new_path;
  0         0  
2454             }
2455             else {
2456 0         0 my( $common, $tail ) = _reduce_node( $end, _descend($ctx) );
2457 0 0       0 if( not @$common ) {
2458 0 0       0 $debug and print "# $indent| +failed $n\n";
2459 0         0 push @fail, [reverse(@path), $tail];
2460             }
2461             else {
2462 0         0 my $path = [@path];
2463 0 0       0 $debug and print "# $indent|_scan_node ++recovered common=@{[_dump($common)]} tail=",
  0         0  
2464 0         0 _dump($tail), " path=@{[_dump($path)]}\n";
2465 0 0 0     0 if( ref($tail) eq 'HASH'
2466             and keys %$tail == 2
2467             ) {
2468 0 0       0 if( exists $tail->{''} ) {
2469 0         0 ($common, $tail, $path) = _slide_tail( $common, $tail, $path, $ctx );
2470             }
2471             }
2472 0 0       0 push @{$reduce{$common->[0]}}, [
  0         0  
2473             @$common,
2474             (ref($tail) eq 'HASH' ? $tail : @$tail ),
2475             @$path
2476             ];
2477             }
2478             }
2479             }
2480             }
2481 0 0       0 $debug and print
2482 0         0 "# $indent|_scan_node counts: reduce=@{[scalar keys %reduce]} fail=@{[scalar @fail]}\n";
  0         0  
2483 0         0 return( \@fail, \%reduce );
2484             }
2485              
2486             sub _do_reduce {
2487 0     0   0 my ($path, $ctx) = @_;
2488 0         0 my $indent = ' ' x $ctx->{depth};
2489 0         0 my $debug = $ctx->{debug};
2490 0         0 my $ra = CPANPLUS::YACSmoke::ReAssemble->new(chomp=>0);
2491 0         0 $ra->debug($debug);
2492 0 0       0 $debug and print "# $indent| do @{[_dump($path)]}\n";
  0         0  
2493 0         0 $ra->_insertr( $_ ) for
2494             # When nodes come into the picture, we have to be careful
2495             # about how we insert the paths into the assembly.
2496             # Paths with nodes first, then closest node to front
2497             # then shortest path. Merely because if we can control
2498             # order in which paths containing nodes get inserted,
2499             # then we can make a couple of assumptions that simplify
2500             # the code in _insert_node.
2501             sort {
2502 0         0 scalar(grep {ref($_) eq 'HASH'} @$a)
2503 0 0 0     0 <=> scalar(grep {ref($_) eq 'HASH'} @$b)
  0         0  
2504             ||
2505             _node_offset($b) <=> _node_offset($a)
2506             ||
2507             scalar @$a <=> scalar @$b
2508             }
2509             @$path
2510             ;
2511 0         0 $path = $ra->_path;
2512 0         0 my $common = [];
2513 0         0 push @$common, shift @$path while( ref($path->[0]) ne 'HASH' );
2514 0 0       0 my $tail = scalar( @$path ) > 1 ? [@$path] : $path->[0];
2515 0 0       0 $debug and print "# $indent| _do_reduce common=@{[_dump($common)]} tail=@{[_dump($tail)]}\n";
  0         0  
  0         0  
2516 0         0 return ($common, $tail);
2517             }
2518              
2519             sub _node_offset {
2520             # return the offset that the first node is found, or -ve
2521             # optimised for speed
2522 0     0   0 my $nr = @{$_[0]};
  0         0  
2523 0         0 my $atom = -1;
2524 0   0     0 ref($_[0]->[$atom]) eq 'HASH' and return $atom while ++$atom < $nr;
2525 0         0 return -1;
2526             }
2527              
2528             sub _slide_tail {
2529 0     0   0 my $head = shift;
2530 0         0 my $tail = shift;
2531 0         0 my $path = shift;
2532 0         0 my $ctx = shift;
2533 0         0 my $indent = ' ' x $ctx->{depth};
2534 0         0 my $debug = $ctx->{debug};
2535 0 0       0 $debug and print "# $indent| slide in h=", _dump($head),
2536             ' t=', _dump($tail), ' p=', _dump($path), "\n";
2537 0         0 my $slide_path = (each %$tail)[-1];
2538 0 0       0 $slide_path = (each %$tail)[-1] unless defined $slide_path;
2539 0 0       0 $debug and print "# $indent| slide potential ", _dump($slide_path), " over ", _dump($path), "\n";
2540 0   0     0 while( defined $path->[0] and $path->[0] eq $slide_path->[0] ) {
2541 0 0       0 $debug and print "# $indent| slide=tail=$slide_path->[0]\n";
2542 0         0 my $slide = shift @$path;
2543 0         0 shift @$slide_path;
2544 0         0 push @$slide_path, $slide;
2545 0         0 push @$head, $slide;
2546             }
2547 0 0       0 $debug and print "# $indent| slide path ", _dump($slide_path), "\n";
2548 0         0 my $slide_node = {
2549             '' => undef,
2550             _node_key($slide_path->[0]) => $slide_path,
2551             };
2552 0 0       0 $debug and print "# $indent| slide out h=", _dump($head),
2553             ' s=', _dump($slide_node), ' p=', _dump($path), "\n";
2554 0         0 return ($head, $slide_node, $path);
2555             }
2556              
2557             sub _unrev_path {
2558 0     0   0 my ($path, $ctx) = @_;
2559 0         0 my $indent = ' ' x $ctx->{depth};
2560 0         0 my $debug = $ctx->{debug};
2561 0         0 my $new;
2562 0 0       0 if( not grep { ref($_) } @$path ) {
  0         0  
2563 0 0       0 $debug and print "# ${indent}_unrev path fast ", _dump($path);
2564 0         0 $new = [reverse @$path];
2565 0 0       0 $debug and print "# -> ", _dump($new), "\n";
2566 0         0 return $new;
2567             }
2568 0 0       0 $debug and print "# ${indent}unrev path in ", _dump($path), "\n";
2569 0         0 while( defined( my $p = pop @$path )) {
2570 0 0       0 push @$new,
    0          
2571             ref($p) eq 'HASH' ? _unrev_node($p, _descend($ctx) )
2572             : ref($p) eq 'ARRAY' ? _unrev_path($p, _descend($ctx) )
2573             : $p
2574             ;
2575             }
2576 0 0       0 $debug and print "# ${indent}unrev path out ", _dump($new), "\n";
2577 0         0 return $new;
2578             }
2579              
2580             sub _unrev_node {
2581 0     0   0 my ($node, $ctx ) = @_;
2582 0         0 my $indent = ' ' x $ctx->{depth};
2583 0         0 my $debug = $ctx->{debug};
2584 0         0 my $optional = _remove_optional($node);
2585 0 0       0 $debug and print "# ${indent}unrev node in ", _dump($node), " opt=$optional\n";
2586 0         0 my $new;
2587 0 0       0 $new->{''} = undef if $optional;
2588 0         0 my $n;
2589 0         0 for $n( keys %$node ) {
2590 0         0 my $path = _unrev_path($node->{$n}, _descend($ctx) );
2591 0         0 $new->{_node_key($path->[0])} = $path;
2592             }
2593 0 0       0 $debug and print "# ${indent}unrev node out ", _dump($new), "\n";
2594 0         0 return $new;
2595             }
2596              
2597             sub _node_key {
2598 0     0   0 my $node = shift;
2599 0 0       0 return _node_key($node->[0]) if ref($node) eq 'ARRAY';
2600 0 0       0 return $node unless ref($node) eq 'HASH';
2601 0         0 my $key = '';
2602 0         0 my $k;
2603 0         0 for $k( keys %$node ) {
2604 0 0       0 next if $k eq '';
2605 0 0 0     0 $key = $k if $key eq '' or $key gt $k;
2606             }
2607 0         0 return $key;
2608             }
2609              
2610             sub _descend {
2611             # Take a context object, and increase the depth by one.
2612             # By creating a fresh hash each time, we don't have to
2613             # bother adding make-work code to decrease the depth
2614             # when we return from what we called.
2615 0     0   0 my $ctx = shift;
2616 0         0 return {%$ctx, depth => $ctx->{depth}+1};
2617             }
2618              
2619             #####################################################################
2620              
2621             sub _make_class {
2622 0     0   0 my $self = shift;
2623 0         0 my %set = map { ($_,1) } @_;
  0         0  
2624 0 0       0 delete $set{'\\d'} if exists $set{'\\w'};
2625 0 0       0 delete $set{'\\D'} if exists $set{'\\W'};
2626             return '.' if exists $set{'.'}
2627             or ($self->{fold_meta_pairs} and (
2628             (exists $set{'\\d'} and exists $set{'\\D'})
2629             or (exists $set{'\\s'} and exists $set{'\\S'})
2630 0 0 0     0 or (exists $set{'\\w'} and exists $set{'\\W'})
      0        
      0        
2631             ))
2632             ;
2633 0         0 for my $meta( q/\\d/, q/\\D/, q/\\s/, q/\\S/, q/\\w/, q/\\W/ ) {
2634 0 0       0 if( exists $set{$meta} ) {
2635 0         0 my $re = qr/$meta/;
2636 0         0 my @delete;
2637 0   0     0 $_ =~ /^$re$/ and push @delete, $_ for keys %set;
2638 0 0       0 delete @set{@delete} if @delete;
2639             }
2640             }
2641 0 0       0 return (keys %set)[0] if keys %set == 1;
2642 0         0 for my $meta( '.', '+', '*', '?', '(', ')', '^', '@', '$', '[', '/', ) {
2643 0 0       0 exists $set{"\\$meta"} and $set{$meta} = delete $set{"\\$meta"};
2644             }
2645 0 0       0 my $dash = exists $set{'-'} ? do { delete($set{'-'}), '-' } : '';
  0         0  
2646 0 0       0 my $caret = exists $set{'^'} ? do { delete($set{'^'}), '^' } : '';
  0         0  
2647 0         0 my $class = join( '' => sort keys %set );
2648 0 0 0     0 $class =~ s/0123456789/\\d/ and $class eq '\\d' and return $class;
2649 0         0 return "[$dash$class$caret]";
2650             }
2651              
2652             sub _re_sort {
2653 0   0 0   0 return length $b <=> length $a || $a cmp $b
2654             }
2655              
2656             sub _combine {
2657 0     0   0 my $self = shift;
2658 0         0 my $type = shift;
2659             # print "c in = @{[_dump(\@_)]}\n";
2660             # my $combine =
2661             return '('
2662             . $type
2663 0         0 . do {
2664 0         0 my( @short, @long );
2665 0 0       0 push @{ /^$Single_Char$/ ? \@short : \@long}, $_ for @_;
  0         0  
2666 0 0       0 if( @short == 1 ) {
    0          
2667 0         0 @long = sort _re_sort @long, @short;
2668             }
2669             elsif( @short > 1 ) {
2670             # yucky but true
2671 0         0 my @combine = (_make_class($self, @short), sort _re_sort @long);
2672 0         0 @long = @combine;
2673             }
2674             else {
2675 0         0 @long = sort _re_sort @long;
2676             }
2677 0         0 join( '|', @long );
2678             }
2679             . ')';
2680             # print "combine <$combine>\n";
2681             # $combine;
2682             }
2683              
2684             sub _combine_new {
2685 0     0   0 my $self = shift;
2686 0         0 my( @short, @long );
2687 0 0       0 push @{ /^$Single_Char$/ ? \@short : \@long}, $_ for @_;
  0         0  
2688 0 0 0     0 if( @short == 1 and @long == 0 ) {
    0 0        
2689 0         0 return $short[0];
2690             }
2691             elsif( @short > 1 and @short == @_ ) {
2692 0         0 return _make_class($self, @short);
2693             }
2694             else {
2695 0 0       0 return '(?:'
2696             . join( '|' =>
2697             @short > 1
2698             ? ( _make_class($self, @short), sort _re_sort @long)
2699             : ( (sort _re_sort( @long )), @short )
2700             )
2701             . ')';
2702             }
2703             }
2704              
2705             sub _re_path {
2706 3     3   9 my $self = shift;
2707             # in shorter assemblies, _re_path() is the second hottest
2708             # routine. after insert(), so make it fast.
2709              
2710 3 50       12 if ($self->{unroll_plus}) {
2711             # but we can't easily make this blockless
2712 0         0 my @arr = @{$_[0]};
  0         0  
2713 0         0 my $str = '';
2714 0         0 my $skip = 0;
2715 0         0 for my $i (0..$#arr) {
2716 0 0 0     0 if (ref($arr[$i]) eq 'ARRAY') {
    0          
    0          
    0          
2717 0         0 $str .= _re_path($self, $arr[$i]);
2718             }
2719             elsif (ref($arr[$i]) eq 'HASH') {
2720             $str .= exists $arr[$i]->{''}
2721             ? _combine_new( $self,
2722 0         0 map { _re_path( $self, $arr[$i]->{$_} ) } grep { $_ ne '' } keys %{$arr[$i]}
  0         0  
  0         0  
2723             ) . '?'
2724 0 0       0 : _combine_new($self, map { _re_path( $self, $arr[$i]->{$_} ) } keys %{$arr[$i]})
  0         0  
  0         0  
2725             ;
2726             }
2727             elsif ($i < $#arr and $arr[$i+1] =~ /\A$arr[$i]\*(\??)\Z/) {
2728 0 0       0 $str .= "$arr[$i]+" . (defined $1 ? $1 : '');
2729 0         0 ++$skip;
2730             }
2731             elsif ($skip) {
2732 0         0 $skip = 0;
2733             }
2734             else {
2735 0         0 $str .= $arr[$i];
2736             }
2737             }
2738 0         0 return $str;
2739             }
2740              
2741 3 50       14 return join( '', @_ ) unless grep { length ref $_ } @_;
  3         20  
2742 3         9 my $p;
2743             return join '', map {
2744             ref($_) eq '' ? $_
2745 22 0       79 : ref($_) eq 'HASH' ? do {
    50          
2746             # In the case of a node, see whether there's a '' which
2747             # indicates that the whole thing is optional and thus
2748             # requires a trailing ?
2749             # Unroll the two different paths to avoid the needless
2750             # grep when it isn't necessary.
2751 0         0 $p = $_;
2752             exists $_->{''}
2753             ? _combine_new( $self,
2754 0         0 map { _re_path( $self, $p->{$_} ) } grep { $_ ne '' } keys %$_
  0         0  
2755             ) . '?'
2756 0 0       0 : _combine_new($self, map { _re_path( $self, $p->{$_} ) } keys %$_ )
  0         0  
2757             }
2758             : _re_path($self, $_) # ref($_) eq 'ARRAY'
2759 3         10 } @{$_[0]}
  3         16  
2760             }
2761              
2762             sub _lookahead {
2763 0     0     my $in = shift;
2764 0           my %head;
2765             my $path;
2766 0           for $path( keys %$in ) {
2767 0 0         next unless defined $in->{$path};
2768             # print "look $path: ", ref($in->{$path}[0]), ".\n";
2769 0 0         if( ref($in->{$path}[0]) eq 'HASH' ) {
    0          
2770 0           my $next = 0;
2771 0   0       while( ref($in->{$path}[$next]) eq 'HASH' and @{$in->{$path}} > $next + 1 ) {
  0            
2772 0 0         if( exists $in->{$path}[$next]{''} ) {
2773 0           ++$head{$in->{$path}[$next+1]};
2774             }
2775 0           ++$next;
2776             }
2777 0           my $inner = _lookahead( $in->{$path}[0] );
2778 0           @head{ keys %$inner } = (values %$inner);
2779             }
2780             elsif( ref($in->{$path}[0]) eq 'ARRAY' ) {
2781 0           my $subpath = $in->{$path}[0];
2782 0           for( my $sp = 0; $sp < @$subpath; ++$sp ) {
2783 0 0         if( ref($subpath->[$sp]) eq 'HASH' ) {
2784 0           my $follow = _lookahead( $subpath->[$sp] );
2785 0           @head{ keys %$follow } = (values %$follow);
2786 0 0         last unless exists $subpath->[$sp]{''};
2787             }
2788             else {
2789 0           ++$head{$subpath->[$sp]};
2790 0           last;
2791             }
2792             }
2793             }
2794             else {
2795 0           ++$head{ $in->{$path}[0] };
2796             }
2797             }
2798             # print "_lookahead ", _dump($in), '==>', _dump([keys %head]), "\n";
2799 0           return \%head;
2800             }
2801              
2802             sub _re_path_lookahead {
2803 0     0     my $self = shift;
2804 0           my $in = shift;
2805             # print "_re_path_la in ", _dump($in), "\n";
2806 0           my $out = '';
2807 0           for( my $p = 0; $p < @$in; ++$p ) {
2808 0 0         if( ref($in->[$p]) eq '' ) {
    0          
2809 0           $out .= $in->[$p];
2810 0           next;
2811             }
2812             elsif( ref($in->[$p]) eq 'ARRAY' ) {
2813 0           $out .= _re_path_lookahead($self, $in->[$p]);
2814 0           next;
2815             }
2816             # print "$p ", _dump($in->[$p]), "\n";
2817             my $path = [
2818 0           map { _re_path_lookahead($self, $in->[$p]{$_} ) }
2819 0           grep { $_ ne '' }
2820 0           keys %{$in->[$p]}
  0            
2821             ];
2822 0           my $ahead = _lookahead($in->[$p]);
2823 0           my $more = 0;
2824 0 0 0       if( exists $in->[$p]{''} and $p + 1 < @$in ) {
2825 0           my $next = 1;
2826 0           while( $p + $next < @$in ) {
2827 0 0         if( ref( $in->[$p+$next] ) eq 'HASH' ) {
2828 0           my $follow = _lookahead( $in->[$p+$next] );
2829 0           @{$ahead}{ keys %$follow } = (values %$follow);
  0            
2830             }
2831             else {
2832 0           ++$ahead->{$in->[$p+$next]};
2833 0           last;
2834             }
2835 0           ++$next;
2836             }
2837 0           $more = 1;
2838             }
2839 0           my $nr_one = grep { /^$Single_Char$/ } @$path;
  0            
2840 0           my $nr = @$path;
2841 0 0 0       if( $nr_one > 1 and $nr_one == $nr ) {
2842 0           $out .= _make_class($self, @$path);
2843 0 0         $out .= '?' if exists $in->[$p]{''};
2844             }
2845             else {
2846             my $zwla = keys(%$ahead) > 1
2847 0 0         ? _combine($self, '?=', grep { s/\+$//; $_ } keys %$ahead )
  0            
  0            
2848             : '';
2849 0 0         my $patt = $nr > 1 ? _combine($self, '?:', @$path ) : $path->[0];
2850             # print "have nr=$nr n1=$nr_one n=", _dump($in->[$p]), ' a=', _dump([keys %$ahead]), " zwla=$zwla patt=$patt @{[_dump($path)]}\n";
2851 0 0         if( exists $in->[$p]{''} ) {
2852 0 0         $out .= $more ? "$zwla(?:$patt)?" : "(?:$zwla$patt)?";
2853             }
2854             else {
2855 0           $out .= "$zwla$patt";
2856             }
2857             }
2858             }
2859 0           return $out;
2860             }
2861              
2862             sub _re_path_track {
2863 0     0     my $self = shift;
2864 0           my $in = shift;
2865 0           my $normal = shift;
2866 0           my $augmented = shift;
2867 0           my $o;
2868 0           my $simple = '';
2869 0           my $augment = '';
2870 0           for( my $n = 0; $n < @$in; ++$n ) {
2871 0 0         if( ref($in->[$n]) eq '' ) {
2872 0           $o = $in->[$n];
2873 0           $simple .= $o;
2874 0           $augment .= $o;
2875 0 0 0       if( (
      0        
      0        
2876             $n < @$in - 1
2877             and ref($in->[$n+1]) eq 'HASH' and exists $in->[$n+1]{''}
2878             )
2879             or $n == @$in - 1
2880             ) {
2881 0           push @{$self->{mlist}}, $normal . $simple ;
  0            
2882 0 0         $augment .= $] < 5.009005
2883             ? "(?{\$self->{m}=$self->{mcount}})"
2884             : "(?{$self->{mcount}})"
2885             ;
2886 0           ++$self->{mcount};
2887             }
2888             }
2889             else {
2890             my $path = [
2891 0           map { $self->_re_path_track( $in->[$n]{$_}, $normal.$simple , $augmented.$augment ) }
2892 0           grep { $_ ne '' }
2893 0           keys %{$in->[$n]}
  0            
2894             ];
2895 0           $o = '(?:' . join( '|' => sort _re_sort @$path ) . ')';
2896 0 0         $o .= '?' if exists $in->[$n]{''};
2897 0           $simple .= $o;
2898 0           $augment .= $o;
2899             }
2900             }
2901 0           return $augment;
2902             }
2903              
2904             sub _re_path_pretty {
2905 0     0     my $self = shift;
2906 0           my $in = shift;
2907 0           my $arg = shift;
2908 0           my $pre = ' ' x (($arg->{depth}+0) * $arg->{indent});
2909 0           my $indent = ' ' x (($arg->{depth}+1) * $arg->{indent});
2910 0           my $out = '';
2911 0           $arg->{depth}++;
2912 0           my $prev_was_paren = 0;
2913 0           for( my $p = 0; $p < @$in; ++$p ) {
2914 0 0         if( ref($in->[$p]) eq '' ) {
    0          
2915 0 0         $out .= "\n$pre" if $prev_was_paren;
2916 0           $out .= $in->[$p];
2917 0           $prev_was_paren = 0;
2918             }
2919             elsif( ref($in->[$p]) eq 'ARRAY' ) {
2920 0           $out .= _re_path($self, $in->[$p]);
2921             }
2922             else {
2923             my $path = [
2924 0           map { _re_path_pretty($self, $in->[$p]{$_}, $arg ) }
2925 0           grep { $_ ne '' }
2926 0           keys %{$in->[$p]}
  0            
2927             ];
2928 0           my $nr = @$path;
2929 0           my( @short, @long );
2930 0 0         push @{/^$Single_Char$/ ? \@short : \@long}, $_ for @$path;
  0            
2931 0 0         if( @short == $nr ) {
2932 0 0         $out .= $nr == 1 ? $path->[0] : _make_class($self, @short);
2933 0 0         $out .= '?' if exists $in->[$p]{''};
2934             }
2935             else {
2936 0 0         $out .= "\n" if length $out;
2937 0 0         $out .= $pre if $p;
2938 0           $out .= "(?:\n$indent";
2939 0 0         if( @short < 2 ) {
2940 0           my $r = 0;
2941             $out .= join( "\n$indent|" => map {
2942 0 0         $r++ and $_ =~ s/^\(\?:/\n$indent(?:/;
  0            
2943 0           $_
2944             }
2945             sort _re_sort @$path
2946             );
2947             }
2948             else {
2949 0           $out .= join( "\n$indent|" => ( (sort _re_sort @long), _make_class($self, @short) ));
2950             }
2951 0           $out .= "\n$pre)";
2952 0 0         if( exists $in->[$p]{''} ) {
2953 0           $out .= "\n$pre?";
2954 0           $prev_was_paren = 0;
2955             }
2956             else {
2957 0           $prev_was_paren = 1;
2958             }
2959             }
2960             }
2961             }
2962 0           $arg->{depth}--;
2963 0           return $out;
2964             }
2965              
2966             sub _node_eq {
2967 0 0 0 0     return 0 if not defined $_[0] or not defined $_[1];
2968 0 0         return 0 if ref $_[0] ne ref $_[1];
2969             # Now that we have determined that the reference of each
2970             # argument are the same, we only have to test the first
2971             # one, which gives us a nice micro-optimisation.
2972 0 0         if( ref($_[0]) eq 'HASH' ) {
    0          
2973 0           keys %{$_[0]} == keys %{$_[1]}
  0            
2974             and
2975             # does this short-circuit to avoid _re_path() cost more than it saves?
2976 0 0 0       join( '|' => sort keys %{$_[0]}) eq join( '|' => sort keys %{$_[1]})
  0            
  0            
2977             and
2978             _re_path(undef, [$_[0]] ) eq _re_path(undef, [$_[1]] );
2979             }
2980             elsif( ref($_[0]) eq 'ARRAY' ) {
2981 0 0         scalar @{$_[0]} == scalar @{$_[1]}
  0            
  0            
2982             and
2983             _re_path(undef, $_[0]) eq _re_path(undef, $_[1]);
2984             }
2985             else {
2986 0           $_[0] eq $_[1];
2987             }
2988             }
2989              
2990             sub _pretty_dump {
2991 0     0     return sprintf "\\x%02x", ord(shift);
2992             }
2993              
2994             sub _dump {
2995 0     0     my $path = shift;
2996 0 0         return _dump_node($path) if ref($path) eq 'HASH';
2997 0           my $dump = '[';
2998 0           my $d;
2999 0           my $nr = 0;
3000 0           for $d( @$path ) {
3001 0 0         $dump .= ' ' if $nr++;
3002 0 0         if( ref($d) eq 'HASH' ) {
    0          
    0          
3003 0           $dump .= _dump_node($d);
3004             }
3005             elsif( ref($d) eq 'ARRAY' ) {
3006 0           $dump .= _dump($d);
3007             }
3008             elsif( defined $d ) {
3009             # D::C indicates the second test is redundant
3010             # $dump .= ( $d =~ /\s/ or not length $d )
3011 0 0         $dump .= (
    0          
3012             $d =~ /\s/ ? qq{'$d'} :
3013             $d =~ /^[\x00-\x1f]$/ ? _pretty_dump($d) :
3014             $d
3015             );
3016             }
3017             else {
3018 0           $dump .= '*';
3019             }
3020             }
3021 0           return $dump . ']';
3022             }
3023              
3024             sub _dump_node {
3025 0     0     my $node = shift;
3026 0           my $dump = '{';
3027 0           my $nr = 0;
3028 0           my $n;
3029 0           for $n (sort keys %$node) {
3030 0 0         $dump .= ' ' if $nr++;
3031             # Devel::Cover shows this to test to be redundant
3032             # $dump .= ( $n eq '' and not defined $node->{$n} )
3033             $dump .= $n eq ''
3034             ? '*'
3035             : ($n =~ /^[\x00-\x1f]$/ ? _pretty_dump($n) : $n)
3036 0 0         . "=>" . _dump($node->{$n})
    0          
3037             ;
3038             }
3039 0           return $dump . '}';
3040             }
3041              
3042             =back
3043              
3044             =head1 DIAGNOSTICS
3045              
3046             "Cannot pass a C to Default_Lexer"
3047              
3048             You tried to replace the default lexer pattern with an object
3049             instead of a scalar. Solution: You probably tried to call
3050             C<< $obj->Default_Lexer >>. Call the qualified class method instead
3051             C.
3052              
3053             "filter method not passed a coderef"
3054              
3055             "pre_filter method not passed a coderef"
3056              
3057             A reference to a subroutine (anonymous or otherwise) was expected.
3058             Solution: read the documentation for the C method.
3059              
3060             "duplicate pattern added: /.../"
3061              
3062             The C attribute is active, and a duplicate pattern was
3063             added (well duh!). Solution: clean your data.
3064              
3065             "cannot open [file] for input: [reason]"
3066              
3067             The C method was unable to open the specified file for
3068             whatever reason. Solution: make sure the file exists and the script
3069             has the required privileges to read it.
3070              
3071             =head1 NOTES
3072              
3073             This module has been tested successfully with a range of versions
3074             of perl, from 5.005_03 to 5.9.3. Use of 5.6.0 is not recommended.
3075              
3076             The expressions produced by this module can be used with the PCRE
3077             library.
3078              
3079             Remember to "double up" your backslashes if the patterns are
3080             hard-coded as constants in your program. That is, you should
3081             literally C rather than C. It
3082             usually will work either way, but it's good practice to do so.
3083              
3084             Where possible, supply the simplest tokens possible. Don't add
3085             C when C will do. The reason is that
3086             if you also add C the resulting assembly changes
3087             dramatically: C I
3088             C. Since R::A doesn't perform enough analysis,
3089             it won't "unroll" the C<{2}> quantifier, and will fail to notice
3090             the divergence after the first C<-d\d+>.
3091              
3092             Furthermore, when the string 'X-123000P' is matched against the
3093             first assembly, the regexp engine will have to backtrack over each
3094             alternation (the one that ends in Y B the one that ends in Z)
3095             before determining that there is no match. No such backtracking
3096             occurs in the second pattern: as soon as the engine encounters the
3097             'P' in the target string, neither of the alternations at that point
3098             (C<-\d+Y> or C) could succeed and so the match fails.
3099              
3100             C does, however, know how to build character
3101             classes. Given C, C and C, it will assemble these
3102             into C. When C<-> (dash) appears as a candidate for a
3103             character class it will be the first character in the class. When
3104             C<^> (circumflex) appears as a candidate for a character class it
3105             will be the last character in the class.
3106              
3107             It also knows about meta-characters than can "absorb" regular
3108             characters. For instance, given C and C, it knows that
3109             C<5> can be represented by C<\d> and so the assembly is just C.
3110             The "absorbent" meta-characters it deals with are C<.>, C<\d>, C<\s>
3111             and C<\W> and their complements. It will replace C<\d>/C<\D>,
3112             C<\s>/C<\S> and C<\w>/C<\W> by C<.> (dot), and it will drop C<\d>
3113             if C<\w> is also present (as will C<\D> in the presence of C<\W>).
3114              
3115             C deals correctly with C's propensity
3116             to backslash many characters that have no need to be. Backslashes on
3117             non-metacharacters will be removed. Similarly, in character classes,
3118             a number of characters lose their magic and so no longer need to be
3119             backslashed within a character class. Two common examples are C<.>
3120             (dot) and C<$>. Such characters will lose their backslash.
3121              
3122             At the same time, it will also process C<\Q...\E> sequences. When
3123             such a sequence is encountered, the inner section is extracted and
3124             C is applied to the section. The resulting quoted text
3125             is then used in place of the original unquoted text, and the C<\Q>
3126             and C<\E> metacharacters are thrown away. Similar processing occurs
3127             with the C<\U...\E> and C<\L...\E> sequences. This may have surprising
3128             effects when using a dispatch table. In this case, you will need
3129             to know exactly what the module makes of your input. Use the C
3130             method to find out what's going on:
3131              
3132             $pattern = join( '', @{$re->lexstr($pattern)} );
3133              
3134             If all the digits 0..9 appear in a character class, C
3135             will replace them by C<\d>. I'd do it for letters as well, but
3136             thinking about accented characters and other glyphs hurts my head.
3137              
3138             In an alternation, the longest paths are chosen first (for example,
3139             C). When two paths have the same length, the path
3140             with the most subpaths will appear first. This aims to put the
3141             "busiest" paths to the front of the alternation. For example, the
3142             list C, C, C, C and C will produce the
3143             pattern C<(?:f(?:ew|ig|un)|b(?:ad|it))>. See F for a
3144             real-world example of how alternations are sorted. Once you have
3145             looked at that, everything should be crystal clear.
3146              
3147             When tracking is in use, no reduction is performed. nor are
3148             character classes formed. The reason is that it is
3149             too difficult to determine the original pattern afterwards. Consider the
3150             two patterns C and C. These should be reduced to
3151             C. The final character matches one of two possibilities.
3152             To resolve whether it matched an C<'e'> or C<'m'> would require
3153             keeping track of the fact that the pattern finished up in a character
3154             class, which would the require a whole lot more work to figure out
3155             which character of the class matched. Without character classes
3156             it becomes much easier. Instead, C is produced, which
3157             lets us find out more simply where we ended up.
3158              
3159             Similarly, C and C should form C<(?:dog|sea)food>.
3160             When the pattern is being assembled, the tracking decision needs
3161             to be made at the end of the grouping, but the tail of the pattern
3162             has not yet been visited. Deferring things to make this work correctly
3163             is a vast hassle. In this case, the pattern becomes merely
3164             C<(?:dogfood|seafood>. Tracked patterns will therefore be bulkier than
3165             simple patterns.
3166              
3167             There is an open bug on this issue:
3168              
3169             L
3170              
3171             If this bug is ever resolved, tracking would become much easier to
3172             deal with (none of the C hassle would be required - you could
3173             just match like a regular RE and it would Just Work).
3174              
3175             =head1 SEE ALSO
3176              
3177             =over 8
3178              
3179             =item L
3180              
3181             General information about Perl's regular expressions.
3182              
3183             =item L
3184              
3185             Specific information about C.
3186              
3187             =item Regex::PreSuf
3188              
3189             C takes a string and chops it itself into tokens of
3190             length 1. Since it can't deal with tokens of more than one character,
3191             it can't deal with meta-characters and thus no regular expressions.
3192             Which is the main reason why I wrote this module.
3193              
3194             =item Regexp::Optimizer
3195              
3196             C produces regular expressions that are similar to
3197             those produced by R::A with reductions switched off. It's biggest
3198             drawback is that it is exponentially slower than Regexp::Assemble on
3199             very large sets of patterns.
3200              
3201             =item Regexp::Parser
3202              
3203             Fine grained analysis of regular expressions.
3204              
3205             =item Regexp::Trie
3206              
3207             Funnily enough, this was my working name for C
3208             during its developement. I changed the name because I thought it
3209             was too obscure. Anyway, C does much the same as
3210             C and C except that it runs
3211             much faster (according to the author). It does not recognise
3212             meta characters (that is, 'a+b' is interpreted as 'a\+b').
3213              
3214             =item Text::Trie
3215              
3216             C is well worth investigating. Tries can outperform very
3217             bushy (read: many alternations) patterns.
3218              
3219             =item Tree::Trie
3220              
3221             C is another module that builds tries. The algorithm that
3222             C uses appears to be quite similar to the
3223             algorithm described therein, except that C solves its
3224             end-marker problem without having to rewrite the leaves.
3225              
3226             =back
3227              
3228             =head1 LIMITATIONS
3229              
3230             C does not attempt to find common substrings. For
3231             instance, it will not collapse C down to C.
3232             If there's a module out there that performs this sort of string
3233             analysis I'd like to know about it. But keep in mind that the
3234             algorithms that do this are very expensive: quadratic or worse.
3235              
3236             C does not interpret meta-character modifiers.
3237             For instance, if the following two patterns are
3238             given: C and C, it will not determine that C<\d> can be
3239             matched by C<\d+>. Instead, it will produce C. Along
3240             a similar line of reasoning, it will not determine that C and
3241             C is equivalent to C (It will produce C
3242             instead).
3243              
3244             You cannot remove a pattern that has been added to an object. You'll
3245             just have to start over again. Adding a pattern is difficult enough,
3246             I'd need a solid argument to convince me to add a C method.
3247             If you need to do this you should read the documentation for the
3248             C method.
3249              
3250             C does not (yet)? employ the C<(?E...)>
3251             construct.
3252              
3253             The module does not produce POSIX-style regular expressions. This
3254             would be quite easy to add, if there was a demand for it.
3255              
3256             =head1 BUGS
3257              
3258             Patterns that generate look-ahead assertions sometimes produce
3259             incorrect patterns in certain obscure corner cases. If you
3260             suspect that this is occurring in your pattern, disable
3261             lookaheads.
3262              
3263             Tracking doesn't really work at all with 5.6.0. It works better
3264             in subsequent 5.6 releases. For maximum reliability, the use of
3265             a 5.8 release is strongly recommended. Tracking barely works with
3266             5.005_04. Of note, using C<\d>-style meta-characters invariably
3267             causes panics. Tracking really comes into its own in Perl 5.10.
3268              
3269             If you feed C patterns with nested parentheses,
3270             there is a chance that the resulting pattern will be uncompilable
3271             due to mismatched parentheses (not enough closing parentheses). This
3272             is normal, so long as the default lexer pattern is used. If you want
3273             to find out which pattern among a list of 3000 patterns are to blame
3274             (speaking from experience here), the F script offers
3275             a strategy for pinpointing the pattern at fault. While you may not
3276             be able to use the script directly, the general approach is easy to
3277             implement.
3278              
3279             The algorithm used to assemble the regular expressions makes extensive
3280             use of mutually-recursive functions (that is, A calls B, B calls
3281             A, ...) For deeply similar expressions, it may be possible to provoke
3282             "Deep recursion" warnings.
3283              
3284             The module has been tested extensively, and has an extensive test
3285             suite (that achieves close to 100% statement coverage), but you
3286             never know... A bug may manifest itself in two ways: creating a
3287             pattern that cannot be compiled, such as C, or a pattern
3288             that compiles correctly but that either matches things it shouldn't,
3289             or doesn't match things it should. It is assumed that Such problems
3290             will occur when the reduction algorithm encounters some sort of
3291             edge case. A temporary work-around is to disable reductions:
3292              
3293             my $pattern = $assembler->reduce(0)->re;
3294              
3295             A discussion about implementation details and where bugs might lurk
3296             appears in the README file. If this file is not available locally,
3297             you should be able to find a copy on the Web at your nearest CPAN
3298             mirror.
3299              
3300             Seriously, though, a number of people have been using this module to
3301             create expressions anywhere from 140Kb to 600Kb in size, and it seems to
3302             be working according to spec. Thus, I don't think there are any serious
3303             bugs remaining.
3304              
3305             If you are feeling brave, extensive debugging traces are available to
3306             figure out where assembly goes wrong.
3307              
3308             Please report all bugs at
3309             L
3310              
3311             Make sure you include the output from the following two commands:
3312              
3313             perl -MRegexp::Assemble -le 'print $Regexp::Assemble::VERSION'
3314             perl -V
3315              
3316             There is a mailing list for the discussion of C.
3317             Subscription details are available at
3318             L.
3319              
3320             =head1 ACKNOWLEDGEMENTS
3321              
3322             This module grew out of work I did building access maps for Postfix,
3323             a modern SMTP mail transfer agent. See L
3324             for more information. I used Perl to build large regular expressions
3325             for blocking dynamic/residential IP addresses to cut down on spam
3326             and viruses. Once I had the code running for this, it was easy to
3327             start adding stuff to block really blatant spam subject lines, bogus
3328             HELO strings, spammer mailer-ids and more...
3329              
3330             I presented the work at the French Perl Workshop in 2004, and the
3331             thing most people asked was whether the underlying mechanism for
3332             assembling the REs was available as a module. At that time it was
3333             nothing more that a twisty maze of scripts, all different. The
3334             interest shown indicated that a module was called for. I'd like to
3335             thank the people who showed interest. Hey, it's going to make I
3336             messy scripts smaller, in any case.
3337              
3338             Thomas Drugeon was a valuable sounding board for trying out
3339             early ideas. Jean Forget and Philippe Blayo looked over an early
3340             version. H.Merijn Brandt stopped over in Paris one evening, and
3341             discussed things over a few beers.
3342              
3343             Nicholas Clark pointed out that while what this module does
3344             (?:c|sh)ould be done in perl's core, as per the 2004 TODO, he
3345             encouraged me to continue with the development of this module. In
3346             any event, this module allows one to gauge the difficulty of
3347             undertaking the endeavour in C. I'd rather gouge my eyes out with
3348             a blunt pencil.
3349              
3350             Paul Johnson settled the question as to whether this module should
3351             live in the Regex:: namespace, or Regexp:: namespace. If you're
3352             not convinced, try running the following one-liner:
3353              
3354             perl -le 'print ref qr//'
3355              
3356             Philippe Bruhat found a couple of corner cases where this module
3357             could produce incorrect results. Such feedback is invaluable,
3358             and only improves the module's quality.
3359              
3360             =head1 AUTHOR
3361              
3362             David Landgren
3363              
3364             Copyright (C) 2004-2008. All rights reserved.
3365              
3366             http://www.landgren.net/perl/
3367              
3368             If you use this module, I'd love to hear about what you're using
3369             it for. If you want to be informed of updates, send me a note.
3370              
3371             You can look at the latest working copy in the following
3372             Subversion repository:
3373              
3374             http://svnweb.mongueurs.net/Regexp-Assemble
3375              
3376             =head1 LICENSE
3377              
3378             This library is free software; you can redistribute it and/or modify
3379             it under the same terms as Perl itself.
3380              
3381             =cut
3382              
3383             'The Lusty Decadent Delights of Imperial Pompeii';
3384             __END__