File Coverage

blib/lib/PPI/Tokenizer.pm
Criterion Covered Total %
statement 209 240 87.0
branch 112 140 80.0
condition 27 35 77.1
subroutine 24 28 85.7
pod 5 5 100.0
total 377 448 84.1


line stmt bran cond sub pod time code
1             package PPI::Tokenizer;
2              
3             =pod
4              
5             =head1 NAME
6              
7             PPI::Tokenizer - The Perl Document Tokenizer
8              
9             =head1 SYNOPSIS
10              
11             # Create a tokenizer for a file, array or string
12             $Tokenizer = PPI::Tokenizer->new( 'filename.pl' );
13             $Tokenizer = PPI::Tokenizer->new( \@lines );
14             $Tokenizer = PPI::Tokenizer->new( \$source );
15            
16             # Return all the tokens for the document
17             my $tokens = $Tokenizer->all_tokens;
18            
19             # Or we can use it as an iterator
20             while ( my $Token = $Tokenizer->get_token ) {
21             print "Found token '$Token'\n";
22             }
23            
24             # If we REALLY need to manually nudge the cursor, you
25             # can do that too (the lexer needs this ability to do rollbacks)
26             $is_incremented = $Tokenizer->increment_cursor;
27             $is_decremented = $Tokenizer->decrement_cursor;
28              
29             =head1 DESCRIPTION
30              
31             PPI::Tokenizer is the class that provides Tokenizer objects for use in
32             breaking strings of Perl source code into Tokens.
33              
34             By the time you are reading this, you probably need to know a little
35             about the difference between how perl parses Perl "code" and how PPI
36             parses Perl "documents".
37              
38             "perl" itself (the interpreter) uses a heavily modified lex specification
39             to specify its parsing logic, maintains several types of state as it
40             goes, and incrementally tokenizes, lexes AND EXECUTES at the same time.
41              
42             In fact, it is provably impossible to use perl's parsing method without
43             simultaneously executing code. A formal mathematical proof has been
44             published demonstrating the method.
45              
46             This is where the truism "Only perl can parse Perl" comes from.
47              
48             PPI uses a completely different approach by abandoning the (impossible)
49             ability to parse Perl the same way that the interpreter does, and instead
50             parsing the source as a document, using a document structure independently
51             derived from the Perl documentation and approximating the perl interpreter
52             interpretation as closely as possible.
53              
54             It was touch and go for a long time whether we could get it close enough,
55             but in the end it turned out that it could be done.
56              
57             In this approach, the tokenizer C<PPI::Tokenizer> is implemented separately
58             from the lexer L<PPI::Lexer>.
59              
60             The job of C<PPI::Tokenizer> is to take pure source as a string and break it
61             up into a stream/set of tokens, and contains most of the "black magic" used
62             in PPI. By comparison, the lexer implements a relatively straightforward
63             tree structure, and has an implementation that is uncomplicated (compared
64             to the insanity in the tokenizer at least).
65              
66             The Tokenizer uses an immense amount of heuristics, guessing and cruft,
67             supported by a very B<flexible> internal API, but fortunately it was
68             possible to largely encapsulate the black magic, so there is not a lot that
69             gets exposed to people using the C<PPI::Tokenizer> itself.
70              
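As a concrete illustration of this first stage, the sketch below (assuming PPI is installed) feeds a one-line snippet to the Tokenizer and prints the class and content of each token it produces:

```perl
use strict;
use warnings;
use PPI::Tokenizer;

# Tokenize a small snippet; each token knows its class and content.
my $source    = 'my $x = 42;';
my $tokenizer = PPI::Tokenizer->new( \$source );

while ( my $token = $tokenizer->get_token ) {
    printf "%-28s %s\n", ref($token), $token->content;
}

# Classes seen here include PPI::Token::Word ("my"),
# PPI::Token::Symbol ("$x"), PPI::Token::Operator ("="),
# PPI::Token::Number ("42"), and PPI::Token::Whitespace.
```

Note that, unlike the perl interpreter, the Tokenizer keeps the insignificant tokens (whitespace, comments) in the stream, which is what makes round-tripping a document possible.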
71             =head1 METHODS
72              
73             Despite the incredible complexity, the Tokenizer itself only exposes a
74             relatively small number of methods, with most of the complexity implemented
75             in private methods.
76              
77             =cut
78              
79             # Make sure everything we need is loaded so
80             # we don't have to go and load all of PPI.
81 63     63   371 use strict;
  63         111  
  63         1726  
82 63     63   281 use Params::Util qw{_INSTANCE _SCALAR0 _ARRAY0};
  63         99  
  63         2780  
83 63     63   303 use List::Util 1.33 ();
  63         942  
  63         1380  
84 63     63   269 use PPI::Util ();
  63         105  
  63         974  
85 63     63   310 use PPI::Element ();
  63         119  
  63         943  
86 63     63   283 use PPI::Token ();
  63         128  
  63         1132  
87 63     63   303 use PPI::Exception ();
  63         163  
  63         1039  
88 63     63   21750 use PPI::Exception::ParserRejection ();
  63         170  
  63         149435  
89              
90             our $VERSION = '1.275';
91              
92             # The x operator cannot follow most Perl operators, implying that
93             # anything beginning with x following an operator is a word.
94             # These are the exceptions.
95             my %X_CAN_FOLLOW_OPERATOR = map { $_ => 1 } qw( -- ++ );
96              
97             # The x operator cannot follow most structure elements, implying that
98             # anything beginning with x following a structure element is a word.
99             # These are the exceptions.
100             my %X_CAN_FOLLOW_STRUCTURE = map { $_ => 1 } qw( } ] \) );
101              
102             # Something that looks like the x operator but follows a word
103             # is usually that word's argument.
104             # These are the exceptions.
105             # chop, chomp, dump are ambiguous because they can have either params
106             # or no params.
107             my %X_CAN_FOLLOW_WORD = map { $_ => 1 } qw(
108             endgrent
109             endhostent
110             endnetent
111             endprotoent
112             endpwent
113             endservent
114             fork
115             getgrent
116             gethostent
117             getlogin
118             getnetent
119             getppid
120             getprotoent
121             getpwent
122             getservent
123             setgrent
124             setpwent
125             time
126             times
127             wait
128             wantarray
129             __SUB__
130             );
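The ambiguity these tables resolve can be seen in plain Perl: whether C<x> begins a word or is the repetition operator depends entirely on what precedes it. A minimal demonstration (the name C<xyz> is illustrative only):

```perl
use strict;
use warnings;

sub xyz { 40 }

my $n = 2 + xyz();   # after an operator, "x..." starts a word, so $n is 42
my $s = "ab" x 3;    # after a term, "x" is the repetition operator: "ababab"
my @t = (0) x 2;     # after a closing ")", "x" is also the operator

# A word like "time" takes no arguments, so an "x" following it must be
# the operator rather than the start of an argument, which is exactly
# why it appears in %X_CAN_FOLLOW_WORD above.

print "$n $s ", scalar(@t), "\n";
```

The three hashes encode precisely these exceptions, so the tokenizer can classify an C<x> using only the preceding significant token.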
131              
132              
133              
134             #####################################################################
135             # Creation and Initialization
136              
137             =pod
138              
139             =head2 new $file | \@lines | \$source
140              
141             The main C<new> constructor creates a new Tokenizer object. These
142             objects have no configuration parameters, and can only be used once,
143             to tokenize a single perl source file.
144              
145             It takes as argument either a normal scalar containing source code,
146             a reference to a scalar containing source code, or a reference to an
147             ARRAY containing newline-terminated lines of source code.
148              
149             Returns a new C<PPI::Tokenizer> object on success, or throws a
150             L<PPI::Exception> exception on error.
151              
152             =cut
153              
154             sub new {
155 16792   33 16792 1 44997 my $class = ref($_[0]) || $_[0];
156              
157             # Create the empty tokenizer struct
158 16792         100403 my $self = bless {
159             # Source code
160             source => undef,
161             source_bytes => undef,
162              
163             # Line buffer
164             line => undef,
165             line_length => undef,
166             line_cursor => undef,
167             line_count => 0,
168              
169             # Parse state
170             token => undef,
171             class => 'PPI::Token::BOM',
172             zone => 'PPI::Token::Whitespace',
173              
174             # Output token buffer
175             tokens => [],
176             token_cursor => 0,
177             token_eof => 0,
178              
179             # Perl 6 blocks
180             perl6 => [],
181             }, $class;
182              
183 16792 50       57053 if ( ! defined $_[1] ) {
    100          
    50          
    0          
184             # We weren't given anything
185 0         0 PPI::Exception->throw("No source provided to Tokenizer");
186              
187             } elsif ( ! ref $_[1] ) {
188 495         1869 my $source = PPI::Util::_slurp($_[1]);
189 495 50       1580 if ( ref $source ) {
190             # Content returned by reference
191 495         1494 $self->{source} = $$source;
192             } else {
193             # Errors returned as a string
194 0         0 return( $source );
195             }
196              
197             } elsif ( _SCALAR0($_[1]) ) {
198 16297         18108 $self->{source} = ${$_[1]};
  16297         27542  
199              
200             } elsif ( _ARRAY0($_[1]) ) {
201 0         0 $self->{source} = join '', map { "$_\n" } @{$_[1]};
  0         0  
  0         0  
202              
203             } else {
204             # We don't support whatever this is
205 0         0 PPI::Exception->throw(ref($_[1]) . " is not supported as a source provider");
206             }
207              
208             # We can't handle a null string
209 16792         25544 $self->{source_bytes} = length $self->{source};
210 16792 100       24850 if ( $self->{source_bytes} ) {
211             # Split on local newlines
212 16788         263151 $self->{source} =~ s/(?:\015{1,2}\012|\015|\012)/\n/g;
213 16788         181727 $self->{source} = [ split /(?<=\n)/, $self->{source} ];
214              
215             } else {
216 4         9 $self->{source} = [ ];
217             }
218              
219             ### EVIL
220             # I'm explaining this earlier than I should so you can understand
221             # why I'm about to do something that looks very strange. There's
222             # a problem with the Tokenizer, in that tokens tend to change
223             # classes as each letter is added, but they don't get allocated
224             # their definite final class until the "end" of the token, the
225             # detection of which occurs in about a hundred different places,
226             # all through various crufty code (that triples the speed).
227             #
228             # However, in general, this does not apply to tokens in which a
229             # whitespace character is valid, such as comments, whitespace and
230             # big strings.
231             #
232             # So what we do is add a space to the end of the source. This
233             # triggers normal "end of token" functionality for all cases. Then,
234             # once the tokenizer hits end of file, it examines the last token to
235             # manually either remove the ' ' token, or chop it off the end of
236             # a longer one in which the space would be valid.
237 16792 100   70837   53929 if ( List::Util::any { /^__(?:DATA|END)__\s*$/ } @{$self->{source}} ) {
  70837 100       124000  
  16792 100       46465  
238 10         29 $self->{source_eof_chop} = '';
239             } elsif ( ! defined $self->{source}->[0] ) {
240 4         8 $self->{source_eof_chop} = '';
241             } elsif ( $self->{source}->[-1] =~ /\s$/ ) {
242 1061         2502 $self->{source_eof_chop} = '';
243             } else {
244 15717         23440 $self->{source_eof_chop} = 1;
245 15717         28010 $self->{source}->[-1] .= ' ';
246             }
247              
248 16792         56727 $self;
249             }
250              
251              
252              
253              
254              
255             #####################################################################
256             # Main Public Methods
257              
258             =pod
259              
260             =head2 get_token
261              
262             When using the PPI::Tokenizer object as an iterator, the C<get_token>
263             method is the primary method that is used. It increments the cursor
264             and returns the next Token in the output array.
265              
266             The actual parsing of the file is done only as-needed, and a line at
267             a time. When C<get_token> hits the end of the token array, it will
268             cause the parser to pull in the next line and parse it, continuing
269             as needed until there are more tokens on the output array that
270             C<get_token> can then return.
271              
272             This means that a number of Tokenizer objects can be created, and
273             they won't consume significant CPU until you actually begin to pull
274             tokens from them.
275              
276             Returns a L<PPI::Token> object on success, C<0> if the Tokenizer has
277             reached the end of the file, or C<undef> on error.
278              
279             =cut
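Because C<get_token> distinguishes exhaustion (C<0>) from failure (C<undef>), iteration loops should test definedness before truth. A minimal sketch of the idiom (assuming C<$tokenizer> is a freshly constructed PPI::Tokenizer):

```perl
# Iterate tokens, distinguishing the three documented return values.
while ( 1 ) {
    my $token = $tokenizer->get_token;
    if ( ! defined $token ) {
        die "Tokenizer error";   # undef signals an error
    }
    last if ! $token;            # a defined-but-false 0 signals end of file
    print "Found token '$token'\n";
}
```

The string interpolation works because PPI elements overload stringification to return their content, as the SYNOPSIS above relies on.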
280              
281             sub get_token {
282 379456     379456 1 442458 my $self = shift;
283              
284             # Shortcut for EOF
285 379456 50 66     619257 if ( $self->{token_eof}
286 13292         28061 and $self->{token_cursor} > scalar @{$self->{tokens}}
287             ) {
288 0         0 return 0;
289             }
290              
291             # Return the next token if we can
292 379456 100       822333 if ( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) {
293 303147         326654 $self->{token_cursor}++;
294 303147         916966 return $token;
295             }
296              
297 76309         78049 my $line_rv;
298              
299             # Catch exceptions and return undef, so that we
300             # can start to convert code to exception-based code.
301 76309         89791 my $rv = eval {
302             # No token, we need to get some more
303 76309         118014 while ( $line_rv = $self->_process_next_line ) {
304             # If there is something in the buffer, return it
305             # The defined() prevents a ton of calls to PPI::Util::TRUE
306 67046 100       136361 if ( defined( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) ) {
307 46314         54764 $self->{token_cursor}++;
308 46314         71903 return $token;
309             }
310             }
311 29994         38214 return undef;
312             };
313 76309 100       176879 if ( $@ ) {
    100          
314 1 50       8 if ( _INSTANCE($@, 'PPI::Exception') ) {
315 1         3 $@->throw;
316             } else {
317 0         0 my $errstr = $@;
318 0         0 $errstr =~ s/^(.*) at line .+$/$1/;
319 0         0 PPI::Exception->throw( $errstr );
320             }
321             } elsif ( $rv ) {
322 46314         177927 return $rv;
323             }
324              
325 29994 50       44220 if ( defined $line_rv ) {
326             # End of file, but we can still return things from the buffer
327 29994 50       48522 if ( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) {
328 0         0 $self->{token_cursor}++;
329 0         0 return $token;
330             }
331              
332             # Set our token end of file flag
333 29994         33364 $self->{token_eof} = 1;
334 29994         85948 return 0;
335             }
336              
337             # Error, pass it up to our caller
338 0         0 undef;
339             }
340              
341             =pod
342              
343             =head2 all_tokens
344              
345             When not being used as an iterator, the C<all_tokens> method tells
346             the Tokenizer to parse the entire file and return all of the tokens
347             in a single ARRAY reference.
348              
349             It should be noted that C<all_tokens> does B<not> interfere with the
350             use of the Tokenizer object as an iterator (it does not modify the token
351             cursor) and use of the two different mechanisms can be mixed safely.
352              
353             Returns a reference to an ARRAY of L<PPI::Token> objects on success,
354             or throws an exception on error.
355              
356             =cut
357              
358             sub all_tokens {
359 4     4 1 17 my $self = shift;
360              
361             # Catch exceptions and return undef, so that we
362             # can start to convert code to exception-based code.
363 4         6 my $ok = eval {
364             # Process lines until we get EOF
365 4 50       9 unless ( $self->{token_eof} ) {
366 4         4 my $rv;
367 4         8 while ( $rv = $self->_process_next_line ) {}
368 4 50       8 unless ( defined $rv ) {
369 0         0 PPI::Exception->throw("Error while processing source");
370             }
371              
372             # Clean up the end of the tokenizer
373 4         5 $self->_clean_eof;
374             }
375 4         7 1;
376             };
377 4 50       6 if ( !$ok ) {
378 0         0 my $errstr = $@;
379 0         0 $errstr =~ s/^(.*) at line .+$/$1/;
380 0         0 PPI::Exception->throw( $errstr );
381             }
382              
383             # End of file, return a copy of the token array.
384 4         4 return [ @{$self->{tokens}} ];
  4         13  
385             }
386              
387             =pod
388              
389             =head2 increment_cursor
390              
391             Although exposed as a public method, C<increment_cursor> is implemented
392             for expert use only, when writing lexers or other components that work
393             directly on token streams.
394              
395             It manually increments the token cursor forward through the file, in effect
396             "skipping" the next token.
397              
398             Returns true if the cursor is incremented, C<0> if already at the end of
399             the file, or C<undef> on error.
400              
401             =cut
402              
403             sub increment_cursor {
404             # Do this via the get_token method, which makes sure there
405             # is actually a token there to move to.
406 0 0   0 1 0 $_[0]->get_token and 1;
407             }
408              
409             =pod
410              
411             =head2 decrement_cursor
412              
413             Although exposed as a public method, C<decrement_cursor> is implemented
414             for expert use only, when writing lexers or other components that work
415             directly on token streams.
416              
417             It manually decrements the token cursor backwards through the file, in
418             effect "rolling back" the token stream. And indeed that is what it is
419             primarily intended for, when the component that is consuming the token
420             stream needs to implement some sort of "roll back" feature in its use
421             of the token stream.
422              
423             Returns true if the cursor is decremented, C<0> if already at the
424             beginning of the file, or C<undef> on error.
425              
426             =cut
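Together, the two cursor methods support the one-token "peek and roll back" pattern that the lexer relies on. A minimal sketch (assuming C<$tokenizer> is a PPI::Tokenizer with tokens remaining):

```perl
# Look at the next token without permanently consuming it.
my $peek = $tokenizer->get_token;       # consume the next token...
$tokenizer->decrement_cursor if $peek;  # ...then roll the cursor back

# The same token will be returned by the next get_token call.
```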
427              
428             sub decrement_cursor {
429 0     0 1 0 my $self = shift;
430              
431             # Check for the beginning of the file
432 0 0       0 return 0 unless $self->{token_cursor};
433              
434             # Decrement the token cursor
435 0         0 $self->{token_eof} = 0;
436 0         0 --$self->{token_cursor};
437             }
438              
439              
440              
441              
442              
443             #####################################################################
444             # Working With Source
445              
446             # Fetches the next line from the input line buffer
447             # Returns undef at EOF.
448             sub _get_line {
449 104920     104920   109322 my $self = shift;
450 104920 100       173068 return undef unless $self->{source}; # EOF hit previously
451              
452             # Pull off the next line
453 89130         92046 my $line = shift @{$self->{source}};
  89130         151606  
454              
455             # Flag EOF if we hit it
456 89130 100       147348 $self->{source} = undef unless defined $line;
457              
458             # Return the line (or EOF flag)
459 89130         130293 return $line; # string or undef
460             }
461              
462             # Fetches the next line, ready to process
463             # Returns 1 on success
464             # Returns 0 on EOF
465             sub _fill_line {
466 102614     102614   110856 my $self = shift;
467 102614         106162 my $inscan = shift;
468              
469             # Get the next line
470 102614         140273 my $line = $self->_get_line;
471 102614 100       155783 unless ( defined $line ) {
472             # End of file
473 32023 100       47407 unless ( $inscan ) {
474 29998         43508 delete $self->{line};
475 29998         33033 delete $self->{line_cursor};
476 29998         31485 delete $self->{line_length};
477 29998         54631 return 0;
478             }
479              
480             # In the scan version, just set the cursor to the end
481             # of the line, and the rest should just cascade out.
482 2025         2547 $self->{line_cursor} = $self->{line_length};
483 2025         3857 return 0;
484             }
485              
486             # Populate the appropriate variables
487 70591         99095 $self->{line} = $line;
488 70591         79455 $self->{line_cursor} = -1;
489 70591         83545 $self->{line_length} = length $line;
490 70591         76668 $self->{line_count}++;
491              
492 70591         114575 1;
493             }
494              
495             # Get the current character
496             sub _char {
497 0     0   0 my $self = shift;
498 0         0 substr( $self->{line}, $self->{line_cursor}, 1 );
499             }
500              
501              
502              
503              
504              
505             ####################################################################
506             # Per line processing methods
507              
508             # Processes the next line
509             # Returns 1 on successful completion
510             # Returns 0 if EOF
511             # Returns undef on error
512             sub _process_next_line {
513 97055     97055   108142 my $self = shift;
514              
515             # Fill the line buffer
516 97055         96531 my $rv;
517 97055 100       137490 unless ( $rv = $self->_fill_line ) {
518 29998 50       45585 return undef unless defined $rv;
519              
520             # End of file, finalize last token
521 29998         52353 $self->_finalize_token;
522 29998         55166 return 0;
523             }
524              
525             # Run the __TOKENIZER__on_line_start
526 67057         153368 $rv = $self->{class}->__TOKENIZER__on_line_start( $self );
527 67057 100       107582 unless ( $rv ) {
528             # If there are no more source lines, then clean up
529 27737 100 66     47778 if ( ref $self->{source} eq 'ARRAY' and ! @{$self->{source}} ) {
  27737         59023  
530 307         797 $self->_clean_eof;
531             }
532              
533             # Defined but false means next line
534 27737 50       57566 return 1 if defined $rv;
535 0         0 PPI::Exception->throw("Error at line $self->{line_count}");
536             }
537              
538             # If we can't deal with the entire line, process char by char
539 39320         61131 while ( $rv = $self->_process_next_char ) {}
540 39319 50       64854 unless ( defined $rv ) {
541 0         0 PPI::Exception->throw("Error at line $self->{line_count}, character $self->{line_cursor}");
542             }
543              
544             # Trigger any action that needs to happen at the end of a line
545 39319         89447 $self->{class}->__TOKENIZER__on_line_end( $self );
546              
547             # If there are no more source lines, then clean up
548 39319 100 100     84109 unless ( ref($self->{source}) eq 'ARRAY' and @{$self->{source}} ) {
  37052         91711  
549 16480         27059 return $self->_clean_eof;
550             }
551              
552 22839         45294 return 1;
553             }
554              
555              
556              
557              
558              
559             #####################################################################
560             # Per-character processing methods
561              
562             # Process on a per-character basis.
563             # Note that due to the high number of times this gets
564             # called, it has been fairly heavily in-lined, so the code
565             # might look a bit ugly and duplicated.
566             sub _process_next_char {
567 432915     432915   493891 my $self = shift;
568              
569             ### FIXME - This checks for a screwed up condition that triggers
570             ### several warnings, amongst other things.
571 432915 50 33     1073971 if ( ! defined $self->{line_cursor} or ! defined $self->{line_length} ) {
572             # $DB::single = 1;
573 0         0 return undef;
574             }
575              
576             # Increment the counter and check for end of line
577 432915 100       684749 return 0 if ++$self->{line_cursor} >= $self->{line_length};
578              
579             # Pass control to the token class
580 393596         386874 my $result;
581 393596 100       718745 unless ( $result = $self->{class}->__TOKENIZER__on_char( $self ) ) {
582             # undef is error. 0 is "Did stuff ourself, you don't have to do anything"
583 76004 50       205522 return defined $result ? 1 : undef;
584             }
585              
586             # We will need the value of the current character
587 317591         466078 my $char = substr( $self->{line}, $self->{line_cursor}, 1 );
588 317591 100       469619 if ( $result eq '1' ) {
589             # If __TOKENIZER__on_char returns 1, it is signaling that it thinks that
590             # the character is part of it.
591              
592             # Add the character
593 57466 50       82430 if ( defined $self->{token} ) {
594 57466         80186 $self->{token}->{content} .= $char;
595             } else {
596 0 0       0 defined($self->{token} = $self->{class}->new($char)) or return undef;
597             }
598              
599 57466         123426 return 1;
600             }
601              
602             # We have been provided with the name of a class
603 260125 100       502077 if ( $self->{class} ne "PPI::Token::$result" ) {
    100          
604             # New class
605 101228         156357 $self->_new_token( $result, $char );
606             } elsif ( defined $self->{token} ) {
607             # Same class as current
608 29445         36617 $self->{token}->{content} .= $char;
609             } else {
610             # Same class, but no current
611 129452 50       245308 defined($self->{token} = $self->{class}->new($char)) or return undef;
612             }
613              
614 260125         546914 1;
615             }
616              
617              
618              
619              
620              
621             #####################################################################
622             # Altering Tokens in Tokenizer
623              
624             # Finish the end of a token.
625             # Returns the resulting parse class as a convenience.
626             sub _finalize_token {
627 392681     392681   430891 my $self = shift;
628 392681 100       583246 return $self->{class} unless defined $self->{token};
629              
630             # Add the token to the token buffer
631 362681         360073 push @{ $self->{tokens} }, $self->{token};
  362681         551316  
632 362681         413172 $self->{token} = undef;
633              
634             # Return the parse class to that of the zone we are in
635 362681         667751 $self->{class} = $self->{zone};
636             }
637              
638             # Creates a new token and sets it in the tokenizer
639             # The defined() in here prevent a ton of calls to PPI::Util::TRUE
640             sub _new_token {
641 233227     233227   253834 my $self = shift;
642             # throw PPI::Exception() unless @_;
643 233227 100       444651 my $class = substr( $_[0], 0, 12 ) eq 'PPI::Token::'
644             ? shift : 'PPI::Token::' . shift;
645              
646             # Finalize any existing token
647 233227 100       439758 $self->_finalize_token if defined $self->{token};
648              
649             # Create the new token and update the parse class
650 233227 50       507428 defined($self->{token} = $class->new($_[0])) or PPI::Exception->throw;
651 233227         330341 $self->{class} = $class;
652              
653 233227         290840 1;
654             }
655              
656             # At the end of the file, we need to clean up the results of the erroneous
657             # space that we inserted at the beginning of the process.
658             sub _clean_eof {
659 16791     16791   19285 my $self = shift;
660              
661             # Finish any partially completed token
662 16791 100       28100 $self->_finalize_token if $self->{token};
663              
664             # Find the last token, and if it has no content, kill it.
665             # There appears to be some evidence that such "null tokens" are
666             # somehow getting created accidentally.
667 16791         20557 my $last_token = $self->{tokens}->[ -1 ];
668 16791 50       29742 unless ( length $last_token->{content} ) {
669 0         0 pop @{$self->{tokens}};
  0         0  
670             }
671              
672             # Now, if the last character of the last token is a space we added,
673             # chop it off, deleting the token if there's nothing else left.
674 16791 100       27772 if ( $self->{source_eof_chop} ) {
675 15444         18555 $last_token = $self->{tokens}->[ -1 ];
676 15444         57687 $last_token->{content} =~ s/ $//;
677 15444 100       31199 unless ( length $last_token->{content} ) {
678             # Popping token
679 13192         13623 pop @{$self->{tokens}};
  13192         17874  
680             }
681              
682             # The hack involving adding an extra space is now reversed, and
683             # now nobody will ever know. The perfect crime!
684 15444         22390 $self->{source_eof_chop} = '';
685             }
686              
687 16791         43051 1;
688             }
689              
690              
691              
692              
693              
694             #####################################################################
695             # Utility Methods
696              
697             # Context
698             sub _last_token {
699 0     0   0 $_[0]->{tokens}->[-1];
700             }
701              
702             sub _last_significant_token {
703 3115     3115   3948 my $self = shift;
704 3115         3554 my $cursor = $#{ $self->{tokens} };
  3115         4725  
705 3115         5883 while ( $cursor >= 0 ) {
706 4128         5643 my $token = $self->{tokens}->[$cursor--];
707 4128 100       10787 return $token if $token->significant;
708             }
709 406         738 return;
710             }
711              
712             # Get an array of previous significant tokens.
713             # Like _last_significant_token except it gets more than just one token
714             # Returns array with 0 to x entries
715             sub _previous_significant_tokens {
716 149023     149023   162848 my $self = shift;
717 149023   50     211193 my $count = shift || 1;
718 149023         149441 my $cursor = $#{ $self->{tokens} };
  149023         199901  
719              
720 149023         168368 my @tokens;
721 149023         225242 while ( $cursor >= 0 ) {
722 238330         274567 my $token = $self->{tokens}->[$cursor--];
723 238330 100       444017 next if not $token->significant;
724 154344         180185 push @tokens, $token;
725 154344 100       267142 last if @tokens >= $count;
726             }
727              
728 149023         271159 return @tokens;
729             }
730              
731             my %OBVIOUS_CLASS = (
732             'PPI::Token::Symbol' => 'operator',
733             'PPI::Token::Magic' => 'operator',
734             'PPI::Token::Number' => 'operator',
735             'PPI::Token::ArrayIndex' => 'operator',
736             'PPI::Token::Quote::Double' => 'operator',
737             'PPI::Token::Quote::Interpolate' => 'operator',
738             'PPI::Token::Quote::Literal' => 'operator',
739             'PPI::Token::Quote::Single' => 'operator',
740             'PPI::Token::QuoteLike::Backtick' => 'operator',
741             'PPI::Token::QuoteLike::Command' => 'operator',
742             'PPI::Token::QuoteLike::Readline' => 'operator',
743             'PPI::Token::QuoteLike::Regexp' => 'operator',
744             'PPI::Token::QuoteLike::Words' => 'operator',
745             );
746              
747             my %OBVIOUS_CONTENT = (
748             '(' => 'operand',
749             '{' => 'operand',
750             '[' => 'operand',
751             ';' => 'operand',
752             '}' => 'operator',
753             );
754              
755              
756             my %USUALLY_FORCES = map { $_ => 1 } qw( sub package use no );
757              
758             # Try to determine operator/operand context, if possible.
759             # Returns "operator", "operand", or "" if unknown.
760             sub _opcontext {
761 6914     6914   8073 my $self = shift;
762 6914         12111 my @tokens = $self->_previous_significant_tokens(1);
763 6914         8152 my $p0 = $tokens[0];
764 6914 100       17305 return '' if not $p0;
765 6793         9474 my $c0 = ref $p0;
766              
767             # Map the obvious cases
768 6793 100       17557 return $OBVIOUS_CLASS{$c0} if defined $OBVIOUS_CLASS{$c0};
769 2215 100       4731 return $OBVIOUS_CONTENT{$p0} if defined $OBVIOUS_CONTENT{$p0};
770              
771             # Most of the time after an operator, we are an operand
772 1654 100       6462 return 'operand' if $p0->isa('PPI::Token::Operator');
773              
774             # If there's NOTHING, it's operand
775 1452 50       2468 return 'operand' if $p0->content eq '';
776              
777             # Otherwise, we don't know
778 1452         3123 return ''
779             }
780              
781             # Assuming we are currently parsing the word 'x', return true
782             # if previous tokens imply the x is an operator, false otherwise.
783             sub _current_x_is_operator {
784 1134     1134   1703 my ( $self ) = @_;
785 1134 100       1164 return if !@{$self->{tokens}};
  1134         2461  
786              
787 928         1635 my ($prev, $prevprev) = $self->_previous_significant_tokens(2);
788 928 100       2855 return if !$prev;
789              
790 926 100       3061 return !$self->__current_token_is_forced_word if $prev->isa('PPI::Token::Word');
791              
792             return (!$prev->isa('PPI::Token::Operator') || $X_CAN_FOLLOW_OPERATOR{$prev})
793 762   100     4054 && (!$prev->isa('PPI::Token::Structure') || $X_CAN_FOLLOW_STRUCTURE{$prev})
794             && !$prev->isa('PPI::Token::Label')
795             ;
796             }
797              
798              
799             # Assuming we are at the end of parsing the current token that could be a word,
800             # a wordlike operator, or a version string, try to determine whether context
801             # before or after it forces it to be a bareword. This method is only useful
802             # during tokenization.
803             sub __current_token_is_forced_word {
804 32544     32544   51470 my ( $t, $word ) = @_;
805              
806             # Check if forced by preceding tokens.
807              
808 32544         46993 my ( $prev, $prevprev ) = $t->_previous_significant_tokens(2);
809 32544 100       67191 if ( !$prev ) {
810 8908         17764 pos $t->{line} = $t->{line_cursor};
811             }
812             else {
813 23636         34535 my $content = $prev->{content};
814              
815             # We are forced if we are a method name.
816             # '->' will always be an operator, so we don't check its type.
817 23636 100       37358 return 1 if $content eq '->';
818              
819             # If we are contained in a pair of curly braces, we are probably a
820             # forced bareword hash key. '{' is never a word or operator, so we
821             # don't check its type.
822 23509         44002 pos $t->{line} = $t->{line_cursor};
823 23509 100 100     56306 return 1 if $content eq '{' and $t->{line} =~ /\G\s*\}/gc;
824              
825             # sub, package, use, and no all indicate that what immediately follows
826             # is a word not an operator or (in the case of sub and package) a
827             # version string. However, we don't want to be fooled by 'package
828             # package v10' or 'use no v10'. We're a forced package unless we're
829             # preceded by 'package sub', in which case we're a version string.
830             # We also have to make sure that the sub/package/etc doing the forcing
831             # is not a method call.
832 23280 100       42998 if( $USUALLY_FORCES{$content}) {
833 5631 100 66     12173 return if defined $word and $word =~ /^v[0-9]+$/ and ( $content eq "use" or $content eq "no" );
      100        
      100        
834 5621 100       20188 return 1 if not $prevprev;
835 236 100 100     553 return 1 if not $USUALLY_FORCES{$prevprev->content} and $prevprev->content ne '->';
836 6         26 return;
837             }
838             }
839             # pos on $t->{line} is guaranteed to be set at this point.
840              
841             # Check if forced by following tokens.
842              
843             # If the word is followed by => it is probably a word, not a regex.
844 26557 100       59473 return 1 if $t->{line} =~ /\G\s*=>/gc;
845              
846             # Otherwise we probably aren't forced
847 25766         116252 return '';
848             }
849              
850             1;
851              
852             =pod
853              
854             =head1 NOTES
855              
856             =head2 How the Tokenizer Works
857              
858             Understanding the Tokenizer is not for the faint-hearted. It is by far
859             the most complex and twisty piece of perl I've ever written that is actually
860             still built properly and isn't a terrible spaghetti-like mess. In fact, you
861             probably want to skip this section.
862              
863             But if you really want to understand, well then here goes.
864              
865             =head2 Source Input and Clean Up
866              
867             The Tokenizer starts by taking source in a variety of forms, sucking it
868             all in and merging it into one big string, and doing its own internal line
869             split, using a "universal line separator" which allows the Tokenizer to
870             take source for any platform (and even supports a few known types of
871             broken newlines caused by mixed mac/pc/*nix editor screw ups).
872              
873             The resulting array of lines is used to feed the tokenizer, and is also
874             accessed directly by the heredoc-logic to do the line-oriented part of
875             here-doc support.
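
In other words, the same code should tokenize identically no matter which
platform's newlines it arrives with. A minimal sketch of this, using only
the constructor and C<all_tokens> methods shown in the SYNOPSIS (the
variable names here are illustrative):

  use PPI::Tokenizer;

  # Unix, DOS and classic Mac line endings for the same two statements
  for my $nl ( "\x0a", "\x0d\x0a", "\x0d" ) {
      my $source    = join $nl, 'my $x = 1;', 'print $x;', '';
      my $Tokenizer = PPI::Tokenizer->new( \$source );
      my $tokens    = $Tokenizer->all_tokens;
      # Token counts can be compared across the newline styles
      print scalar(@$tokens), " tokens\n";
  }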
876              
877             =head2 Doing Things the Old Fashioned Way
878              
879             Due to the complexity of perl, and after 2 previously aborted parser
880             attempts, in the end the tokenizer was fashioned around a line-buffered
881             character-by-character method.
882              
883             That is, the Tokenizer pulls and holds a line at a time into a line buffer,
884             and then iterates a cursor along it. At each cursor position, a method is
885             called in whatever token class we are currently in, which will examine the
886             character at the current position, and handle it.
887              
888             As the handler methods in the various token classes are called, they
889             build up an output token array for the source code.
890              
891             Various parts of the Tokenizer use look-ahead, arbitrary-distance
892             look-behind (although currently the maximum is three significant tokens),
893             or both, and various other heuristic guesses.
894              
895             I've been told it is officially termed a I<"backtracking parser
896             with infinite lookaheads">.
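
From the outside, none of this per-character machinery is visible; it is
all driven through the simple iterator interface from the SYNOPSIS. A
short sketch (variable names are illustrative):

  use PPI::Tokenizer;

  my $source    = "my \$x = 1;\n";
  my $Tokenizer = PPI::Tokenizer->new( \$source );
  while ( my $Token = $Tokenizer->get_token ) {
      # Print each token's class and its raw content
      printf "%-28s '%s'\n", ref($Token), $Token->content;
  }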
897              
898             =head2 State Variables
899              
900             Aside from the current line and the character cursor, the Tokenizer
901             maintains a number of different state variables.
902              
903             =over
904              
905             =item Current Class
906              
907             The Tokenizer maintains the current token class at all times. Much of the
908             time this is just going to be the "Whitespace" class, which is what the
909             base of a document is. As the tokenizer executes the various character
910             handlers, the class changes a lot as it moves along. In fact, in some instances,
911             the character handler may not handle the character directly itself, but
912             rather change the "current class" and then hand off to the character
913             handler for the new class.
914              
915             Because of this, and some other things I'll deal with later, the number of
916             times the character handlers are called does not in fact have a direct
917             relationship to the number of actual characters in the document.
918              
919             =item Current Zone
920              
921             Rather than create a class stack to allow for infinitely nested layers of
922             classes, the Tokenizer recognises just a single layer.
923              
924             To put it a different way, in various parts of the file, the Tokenizer will
925             recognise different "base" or "substrate" classes. When a Token such as a
926             comment or a number is finalised by the tokenizer, it "falls back" to the
927             base state.
928              
929             This allows proper tokenization of special areas such as __DATA__
930             and __END__ blocks, which also contain things like comments and POD,
931             without allowing the creation of any significant Tokens inside these areas.
932              
933             For the main part of a document we use L<PPI::Token::Whitespace> for this,
934             with the idea being that code is "floating in a sea of whitespace".
935              
936             =item Current Token
937              
938             The final main state variable is the "current token". This is the Token
939             that is currently being built by the Tokenizer. For certain types, it
940             can be manipulated and morphed and change class quite a bit while being
941             assembled, as the Tokenizer's understanding of the token content changes.
942              
943             When the Tokenizer is confident that it has seen the end of the Token, it
944             will be "finalized", which adds it to the output token array and resets
945             the current class to that of the zone that we are currently in.
946              
947             I should also note at this point that the "current token" variable is
948             optional. The Tokenizer is capable of knowing what class it is currently
949             set to, without actually having accumulated any characters in the Token.
950              
951             =back
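
The zone behaviour described above can be observed directly: everything
after an C<__END__> tag is still tokenized, but only into insignificant
token classes. A sketch using the public API (the exact classes reported
may vary between PPI versions):

  use PPI::Tokenizer;

  my $source    = "print 1;\n__END__\nNot Perl code at all.\n";
  my $Tokenizer = PPI::Tokenizer->new( \$source );
  foreach my $Token ( @{ $Tokenizer->all_tokens } ) {
      # Tokens after __END__ should report significant() as false
      printf "%-26s significant=%d\n",
          ref($Token), $Token->significant ? 1 : 0;
  }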
952              
953             =head2 Making It Faster
954              
955             As I'm sure you can imagine, calling several different methods for each
956             character and running regexes and other complex heuristics made the first
957             fully working version of the tokenizer extremely slow.
958              
959             During testing, I created a metric to measure parsing speed called
960             LPGC, or "lines per gigacycle". A gigacycle is simply a billion CPU
961             cycles on a typical single-core CPU, and so a Tokenizer running at
962             "1000 lines per gigacycle" should generate around 1200 lines of tokenized
963             code per second when running on a 1200 MHz processor.
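
To make the arithmetic explicit:

  1200 MHz             = 1.2 gigacycles per second
  1000 lines/gigacycle x 1.2 gigacycles/second = 1200 lines/second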
964              
965             The first working version of the tokenizer ran at only 350 LPGC, so to
966             tokenize a typical large module such as L<ExtUtils::MakeMaker> took
967             10-15 seconds. This sluggishness made it impractical for many uses.
968              
969             So in the current parser, there are multiple layers of optimisation
970             very carefully built into the basic design. This has brought the tokenizer
971             up to a more reasonable 1000 LPGC, at the expense of making the code
972             quite a bit twistier.
973              
974             =head2 Making It Faster - Whole Line Classification
975              
976             The first step in the optimisation process was to add a new handler to
977             enable several of the more basic classes (whitespace, comments) to be
978             able to be parsed a line at a time. At the start of each line, a
979             special optional handler (only supported by a few classes) is called to
980             check and see if the entire line can be parsed in one go.
981              
982             This is used mainly to handle things like POD, comments, empty lines,
983             and a few other minor special cases.
984              
985             =head2 Making It Faster - Inlining
986              
987             The second stage of the optimisation involved inlining a small
988             number of critical methods that were repeated an extremely high number
989             of times. Profiling suggested that there were about 1,000,000 individual
990             method calls per gigacycle, and by cutting these by two thirds a significant
991             speed improvement was gained, in the order of about 50%.
992              
993             You may notice that many methods in the C<PPI::Tokenizer> code look
994             very nested and longhand. This is primarily due to this inlining.
995              
996             At around this time, some statistics code that existed in the early
997             versions of the parser was also removed, as it was determined that
998             it was consuming around 15% of the CPU for the entire parser, while
999             making the core more complicated.
1000              
1001             A judgment call was made that with the difficulties likely to be
1002             encountered with future planned enhancements, and given the relatively
1003             high cost involved, the statistics features would be removed from the
1004             Tokenizer.
1005              
1006             =head2 Making It Faster - Quote Engine
1007              
1008             Once inlining had reached diminishing returns, it became obvious from
1009             the profiling results that a huge amount of time was being spent
1010             stepping a char at a time through long, simple and "syntactically boring"
1011             code such as comments and strings.
1012              
1013             The existing regex engine was expanded to also encompass quotes and
1014             other quote-like things, and a special abstract base class was added
1015             that provided a number of specialised parsing methods that would "scan
1016             ahead", looking out ahead to find the end of a string, and updating
1017             the cursor to leave it in a valid position for the next call.
1018              
1019             This is also the point at which the number of character handler calls began
1020             to greatly differ from the number of characters. But it has been done
1021             in a way that allows the parser to retain the power of the original
1022             version at the critical points, while skipping through the "boring bits"
1023             as needed for additional speed.
1024              
1025             The addition of this feature allowed the tokenizer to exceed 1000 LPGC
1026             for the first time.
1027              
1028             =head2 Making It Faster - The "Complete" Mechanism
1029              
1030             As it became evident that great speed increases were available by using
1031             this "skipping ahead" mechanism, a new handler method was added that
1032             explicitly handles the parsing of an entire token, where the structure
1033             of the token is relatively simple. Tokens such as symbols fit this case,
1034             as once we are past the initial sigil and word char, we know that we
1035             can skip ahead and "complete" the rest of the token much more easily.
1036              
1037             A number of these have been added for most or possibly all of the common
1038             cases, with most of these "complete" handlers implemented using regular
1039             expressions.
1040              
1041             In fact, so many have been added that at this point, you could arguably
1042             reclassify the tokenizer as a "hybrid regex, char-by-char heuristic
1043             tokenizer". More tokens are now consumed in "complete" methods in a
1044             typical program than are handled by the normal char-by-char methods.
1045              
1046             Many of these complete handlers were implemented during the writing
1047             of the Lexer, and this has allowed the full parser to maintain around
1048             1000 LPGC despite the increasing weight of the Lexer.
1049              
1050             =head2 Making It Faster - Porting To C (In Progress)
1051              
1052             While it would be extraordinarily difficult to port all of the Tokenizer
1053             to C, work has started on a L<PPI::XS> "accelerator" package which acts as
1054             a separate and automatically-detected add-on to the main PPI package.
1055              
1056             L<PPI::XS> implements faster versions of a variety of functions scattered
1057             over the entire PPI codebase, from the Tokenizer Core, Quote Engine, and
1058             various other places, and implements them identically in XS/C.
1059              
1060             In particular, the skip-ahead methods from the Quote Engine would appear
1061             to be extremely amenable to being done in C, and a number of other
1062             functions could be cherry-picked one at a time and implemented in C.
1063              
1064             Each method is heavily tested to ensure that the functionality is
1065             identical, and a versioning mechanism is included to ensure that if a
1066             function gets out of sync, L<PPI::XS> will degrade gracefully and just
1067             not replace that single method.
1068              
1069             =head1 TO DO
1070              
1071             - Add an option to reset or seek the token stream...
1072              
1073             - Implement more Tokenizer functions in L<PPI::XS>
1074              
1075             =head1 SUPPORT
1076              
1077             See the L<support section|PPI/SUPPORT> in the main module.
1078              
1079             =head1 AUTHOR
1080              
1081             Adam Kennedy E<lt>adamk@cpan.orgE<gt>
1082              
1083             =head1 COPYRIGHT
1084              
1085             Copyright 2001 - 2011 Adam Kennedy.
1086              
1087             This program is free software; you can redistribute
1088             it and/or modify it under the same terms as Perl itself.
1089              
1090             The full text of the license can be found in the
1091             LICENSE file included with this module.
1092              
1093             =cut