File Coverage

blib/lib/PPI/Tokenizer.pm
Criterion Covered Total %
statement 223 249 89.5
branch 118 144 81.9
condition 27 37 72.9
subroutine 29 33 87.8
pod 5 5 100.0
total 402 468 85.9


line stmt bran cond sub pod time code
1             package PPI::Tokenizer;
2              
3             =pod
4              
5             =head1 NAME
6              
7             PPI::Tokenizer - The Perl Document Tokenizer
8              
9             =head1 SYNOPSIS
10              
11             # Create a tokenizer for a file, array or string
12             $Tokenizer = PPI::Tokenizer->new( 'filename.pl' );
13             $Tokenizer = PPI::Tokenizer->new( \@lines );
14             $Tokenizer = PPI::Tokenizer->new( \$source );
15            
16             # Return all the tokens for the document
17             my $tokens = $Tokenizer->all_tokens;
18            
19             # Or we can use it as an iterator
20             while ( my $Token = $Tokenizer->get_token ) {
21             print "Found token '$Token'\n";
22             }
23            
24             # If we REALLY need to manually nudge the cursor, you
25             # can do that too (the lexer needs this ability to do rollbacks)
26             $is_incremented = $Tokenizer->increment_cursor;
27             $is_decremented = $Tokenizer->decrement_cursor;
28              
29             =head1 DESCRIPTION
30              
31             PPI::Tokenizer is the class that provides Tokenizer objects for use in
32             breaking strings of Perl source code into Tokens.
33              
34             By the time you are reading this, you probably need to know a little
35             about the difference between how perl parses Perl "code" and how PPI
36             parses Perl "documents".
37              
38             "perl" itself (the interpreter) uses a heavily modified lex specification
39             to specify its parsing logic, maintains several types of state as it
40             goes, and incrementally tokenizes, lexes AND EXECUTES at the same time.
41              
42             In fact, it is provably impossible to use perl's parsing method without
43             simultaneously executing code. A formal mathematical proof has been
44             published demonstrating this impossibility.
45              
46             This is where the truism "Only perl can parse Perl" comes from.
47              
48             PPI uses a completely different approach by abandoning the (impossible)
49             ability to parse Perl the same way that the interpreter does, and instead
50             parsing the source as a document, using a document structure independently
51             derived from the Perl documentation and approximating the perl
52             interpreter's interpretation as closely as possible.
53              
54             It was touch and go for a long time whether we could get it close enough,
55             but in the end it turned out that it could be done.
56              
57             In this approach, the tokenizer C<PPI::Tokenizer> is implemented separately
58             from the lexer L<PPI::Lexer>.
59              
60             The job of C<PPI::Tokenizer> is to take pure source as a string and break it
61             up into a stream/set of tokens; it contains most of the "black magic" used
62             in PPI. By comparison, the lexer implements a relatively straightforward
63             tree structure, and has an implementation that is uncomplicated (compared
64             to the insanity in the tokenizer, at least).
65              
66             The Tokenizer uses an immense amount of heuristics, guessing and cruft,
67             supported by a very B<flexible> internal API, but fortunately it was
68             possible to largely encapsulate the black magic, so there is not a lot that
69             gets exposed to people using the C<PPI::Tokenizer> itself.
70              
71             =head1 METHODS
72              
73             Despite the incredible complexity, the Tokenizer itself only exposes a
74             relatively small number of methods, with most of the complexity implemented
75             in private methods.
76              
77             =cut
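The token-stream behaviour described above can be sketched with a few lines of client code. This is an illustrative example, not part of the module: it assumes the PPI distribution is installed and simply prints the class and content of each token produced.

```perl
use strict;
use warnings;
use PPI::Tokenizer ();

# Tokenize a small snippet and show each token's class and content.
my $source    = 'my $x = 1 + 2;';
my $Tokenizer = PPI::Tokenizer->new( \$source );
while ( my $Token = $Tokenizer->get_token ) {
    printf "%-30s %s\n", ref($Token), "'$Token'";
}
```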
78              
79             # Make sure everything we need is loaded so
80             # we don't have to go and load all of PPI.
81 68     68   383 use strict;
  68         145  
  68         2544  
82 68     68   267 use Params::Util qw{_INSTANCE _SCALAR0 _ARRAY0};
  68         153  
  68         3895  
83 68     68   306 use List::Util 1.33 ();
  68         1156  
  68         1178  
84 68     68   283 use PPI::Util ();
  68         99  
  68         1231  
85 68     68   504 use PPI::Element ();
  68         110  
  68         932  
86 68     68   258 use PPI::Token ();
  68         112  
  68         837  
87 68     68   235 use PPI::Exception ();
  68         122  
  68         822  
88 68     68   25127 use PPI::Exception::ParserRejection ();
  68         171  
  68         1173  
89 68     68   260 use PPI::Document ();
  68         96  
  68         185856  
90              
91             our $VERSION = '1.291';
92              
93             # The x operator cannot follow most Perl operators, implying that
94             # anything beginning with x following an operator is a word.
95             # These are the exceptions.
96             my %X_CAN_FOLLOW_OPERATOR = map { $_ => 1 } qw( -- ++ );
97              
98             # The x operator cannot follow most structure elements, implying that
99             # anything beginning with x following a structure element is a word.
100             # These are the exceptions.
101             my %X_CAN_FOLLOW_STRUCTURE = map { $_ => 1 } qw( } ] \) );
102              
103             # Something that looks like the x operator but follows a word
104             # is usually that word's argument.
105             # These are the exceptions.
106             # chop, chomp, dump are ambiguous because they can have either parms
107             # or no parms.
108             my %X_CAN_FOLLOW_WORD = map { $_ => 1 } qw(
109             endgrent
110             endhostent
111             endnetent
112             endprotoent
113             endpwent
114             endservent
115             fork
116             getgrent
117             gethostent
118             getlogin
119             getnetent
120             getppid
121             getprotoent
122             getpwent
123             getservent
124             setgrent
125             setpwent
126             time
127             times
128             wait
129             wantarray
130             __SUB__
131             );
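The three hashes above are plain existence lookups: the tokenizer consults them to decide whether a bare `x` is the repetition operator or the start of a word. A minimal sketch of that kind of check (the `$prev` variable and the reduced word list are illustrative only):

```perl
use strict;
use warnings;

# Reduced copy of the lookup table, for illustration only.
my %X_CAN_FOLLOW_WORD = map { $_ => 1 } qw( fork time wait );

my $prev = 'time';    # the significant word token preceding the 'x'
if ( $X_CAN_FOLLOW_WORD{$prev} ) {
    # 'time x 3' - this word takes no arguments, so x is the operator
} else {
    # 'foo x 3'  - the x is probably foo's argument
}
```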
132              
133              
134              
135             #####################################################################
136             # Creation and Initialization
137              
138             =pod
139              
140             =head2 new $file | \@lines | \$source
141              
142             The main C<new> constructor creates a new Tokenizer object. These
143             objects have no configuration parameters, and can only be used once,
144             to tokenize a single perl source file.
145              
146             It takes as argument either a normal scalar containing source code,
147             a reference to a scalar containing source code, or a reference to an
148             ARRAY containing newline-terminated lines of source code.
149              
150             Returns a new C<PPI::Tokenizer> object on success, or throws a
151             L<PPI::Exception> exception on error.
152              
153             =cut
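A sketch of the three constructor forms and basic error handling (assumes PPI is installed; `script.pl` is a hypothetical filename):

```perl
use strict;
use warnings;
use PPI::Tokenizer ();

# From a filename (hypothetical path); new() throws a PPI::Exception
# on failure, so wrap it in eval if you want to recover.
my $Tokenizer = eval { PPI::Tokenizer->new('script.pl') };
warn "Could not tokenize: $@" unless $Tokenizer;

# From an ARRAY reference of newline-terminated lines.
my @lines = ( "my \$x = 1;\n", "print \$x;\n" );
$Tokenizer = PPI::Tokenizer->new( \@lines );

# From a SCALAR reference to the source string.
my $source = join '', @lines;
$Tokenizer = PPI::Tokenizer->new( \$source );
```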
154              
155             sub new {
156 16939   33 16939 1 61836 my $class = ref($_[0]) || $_[0];
157              
158             # Create the empty tokenizer struct
159 16939         146690 my $self = bless {
160             # Source code
161             source => undef,
162             source_bytes => undef,
163             document => undef,
164              
165             # Line buffer
166             line => undef,
167             line_length => undef,
168             line_cursor => undef,
169             line_count => 0,
170              
171             # Parse state
172             token => undef,
173             class => 'PPI::Token::BOM',
174             zone => 'PPI::Token::Whitespace',
175             feature_set => undef,
176              
177             # Output token buffer
178             tokens => [],
179             token_cursor => 0,
180             token_eof => 0,
181              
182             # Perl 6 blocks
183             perl6 => [],
184             }, $class;
185              
186 16939 100       58065 if ( ! ref $_[1] ) {
    100          
    50          
187 518         2499 my $source = PPI::Util::_slurp($_[1]);
188 518 100       1822 PPI::Exception->throw("Tokenizer failed to open file: $source")
189             if not ref $source;
190 516         1926 $self->{source} = $$source;
191              
192             } elsif ( _SCALAR0($_[1]) ) {
193 1         8 PPI::Exception->throw("Did not pass a string: ${$_[1]}")
194 16418 100       19467 if _SCALAR0( $self->{source} = ${$_[1]} );
  16418         50431  
195              
196             } elsif ( _ARRAY0($_[1]) ) {
197 3         17 $self->{source} = join '', map "$_\n", @{$_[1]};
  3         15  
198              
199             } else {
200             # We don't support whatever this is
201 0         0 PPI::Exception->throw(ref($_[1]) . " is not supported as a source provider");
202             }
203              
204             # We can't handle a null string
205 16936         32172 $self->{source_bytes} = length $self->{source};
206 16936 100       29689 if ( $self->{source_bytes} ) {
207             # Split on local newlines
208 16932         526030 $self->{source} =~ s/(?:\015{1,2}\012|\015|\012)/\n/g;
209 16932         257321 $self->{source} = [ split /(?<=\n)/, $self->{source} ];
210              
211             } else {
212 4         6 $self->{source} = [ ];
213             }
214              
215             ### EVIL
216             # I'm explaining this earlier than I should so you can understand
217             # why I'm about to do something that looks very strange. There's
218             # a problem with the Tokenizer, in that tokens tend to change
219             # classes as each letter is added, but they don't get allocated
220             # their definite final class until the "end" of the token, the
221             # detection of which occurs in about a hundred different places,
222             # all through various crufty code (that triples the speed).
223             #
224             # However, in general, this does not apply to tokens in which a
225             # whitespace character is valid, such as comments, whitespace and
226             # big strings.
227             #
228             # So what we do is add a space to the end of the source. This
229             # triggers normal "end of token" functionality for all cases. Then,
230             # once the tokenizer hits end of file, it examines the last token to
231             # manually either remove the ' ' token, or chop it off the end of
232             # a longer one in which the space would be valid.
233 16936 100   77168   67769 if ( List::Util::any { /^__(?:DATA|END)__\s*$/ } @{$self->{source}} ) {
  77168 100       164112  
  16936 100       65148  
234 10         30 $self->{source_eof_chop} = '';
235             } elsif ( ! defined $self->{source}->[0] ) {
236 4         7 $self->{source_eof_chop} = '';
237             } elsif ( $self->{source}->[-1] =~ /\s$/ ) {
238 1128         3368 $self->{source_eof_chop} = '';
239             } else {
240 15794         30982 $self->{source_eof_chop} = 1;
241 15794         29040 $self->{source}->[-1] .= ' ';
242             }
243              
244 16936         72279 $self;
245             }
246              
247             sub _document {
248 16845     16845   20626 my $self = shift;
249 16845 50       40540 return @_ ? $self->{document} = shift : $self->{document};
250             }
251              
252              
253              
254              
255              
256             #####################################################################
257             # Main Public Methods
258              
259             =pod
260              
261             =head2 get_token
262              
263             When using the PPI::Tokenizer object as an iterator, the C<get_token>
264             method is the primary method that is used. It increments the cursor
265             and returns the next Token in the output array.
266              
267             The actual parsing of the file is done only as-needed, and a line at
268             a time. When C<get_token> hits the end of the token array, it will
269             cause the parser to pull in the next line and parse it, continuing
270             as needed until there are more tokens on the output array that
271             C<get_token> can then return.
272              
273             This means that a number of Tokenizer objects can be created, and
274             they won't consume significant CPU until you actually begin to pull
275             tokens from them.
276              
277             Returns a L<PPI::Token> object on success, C<0> if the Tokenizer had
278             reached the end of the file, or C<undef> on error.
279              
280             =cut
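Because C<0> (end of file) and C<undef> (error) are both false, an iterator loop that wants to tell them apart has to test definedness before truth. An illustrative sketch, assuming PPI is installed:

```perl
use strict;
use warnings;
use PPI::Tokenizer ();

my $source    = 'print "hello";';
my $Tokenizer = PPI::Tokenizer->new( \$source );
while ( 1 ) {
    my $Token = $Tokenizer->get_token;
    die "Tokenizer error" unless defined $Token;   # undef => error
    last                  unless $Token;           # 0     => end of file
    print "Found token '$Token'\n";
}
```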
281              
282             sub get_token {
283 597242     597242 1 597115 my $self = shift;
284              
285             # Shortcut for EOF
286 597242 50 66     897028 if ( $self->{token_eof}
287 13618         25490 and $self->{token_cursor} > scalar @{$self->{tokens}}
288             ) {
289 0         0 return 0;
290             }
291              
292             # Return the next token if we can
293 597242 100       1211490 if ( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) {
294 514738         514292 $self->{token_cursor}++;
295 514738         1229847 return $token;
296             }
297              
298 82504         82321 my $line_rv;
299              
300             # Catch exceptions and return undef, so that we
301             # can start to convert code to exception-based code.
302 82504         92491 my $rv = eval {
303             # No token, we need to get some more
304 82504         138376 while ( $line_rv = $self->_process_next_line ) {
305             # If there is something in the buffer, return it
306             # The defined() prevents a ton of calls to PPI::Util::TRUE
307 73403 100       148176 if ( defined( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) ) {
308 52042         54681 $self->{token_cursor}++;
309 52042         90118 return $token;
310             }
311             }
312 30461         39025 return undef;
313             };
314 82504 100       178833 if ( $@ ) {
    100          
315 1 50       8 if ( _INSTANCE($@, 'PPI::Exception') ) {
316 1         2 $@->throw;
317             } else {
318 0         0 my $errstr = $@;
319 0         0 $errstr =~ s/^(.*) at line .+$/$1/;
320 0         0 PPI::Exception->throw( $errstr );
321             }
322             } elsif ( $rv ) {
323 52042         211512 return $rv;
324             }
325              
326 30461 50       41178 if ( defined $line_rv ) {
327             # End of file, but we can still return things from the buffer
328 30461 50       50737 if ( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) {
329 0         0 $self->{token_cursor}++;
330 0         0 return $token;
331             }
332              
333             # Set our token end of file flag
334 30461         33958 $self->{token_eof} = 1;
335 30461         83090 return 0;
336             }
337              
338             # Error, pass it up to our caller
339 0         0 undef;
340             }
341              
342             =pod
343              
344             =head2 all_tokens
345              
346             When not being used as an iterator, the C<all_tokens> method tells
347             the Tokenizer to parse the entire file and return all of the tokens
348             in a single ARRAY reference.
349              
350             It should be noted that C<all_tokens> does B<not> interfere with the
351             use of the Tokenizer object as an iterator (does not modify the token
352             cursor) and use of the two different mechanisms can be mixed safely.
353              
354             Returns a reference to an ARRAY of L<PPI::Token> objects on success
355             or throws an exception on error.
356              
357             =cut
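A short sketch of batch use (assumes PPI is installed). Note from the implementation below that the returned ARRAY reference is a shallow copy, so pushing to or popping from it does not disturb the Tokenizer's internal buffer:

```perl
use strict;
use warnings;
use PPI::Tokenizer ();

my $source    = 'my @list = ( 1, 2, 3 );';
my $Tokenizer = PPI::Tokenizer->new( \$source );
my $tokens    = $Tokenizer->all_tokens;   # ARRAY ref of PPI::Token objects

printf "%d tokens in total\n", scalar @$tokens;

# Filter down to significant tokens (skip whitespace and comments).
my @significant = grep { $_->significant } @$tokens;
```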
358              
359             sub all_tokens {
360 5     5 1 21 my $self = shift;
361              
362             # Catch exceptions and return undef, so that we
363             # can start to convert code to exception-based code.
364 5         8 my $ok = eval {
365             # Process lines until we get EOF
366 5 50       14 unless ( $self->{token_eof} ) {
367 5         6 my $rv;
368 5         18 while ( $rv = $self->_process_next_line ) {}
369 5 50       8 unless ( defined $rv ) {
370 0         0 PPI::Exception->throw("Error while processing source");
371             }
372              
373             # Clean up the end of the tokenizer
374 5         12 $self->_clean_eof;
375             }
376 5         8 1;
377             };
378 5 50       10 if ( !$ok ) {
379 0         0 my $errstr = $@;
380 0         0 $errstr =~ s/^(.*) at line .+$/$1/;
381 0         0 PPI::Exception->throw( $errstr );
382             }
383              
384             # End of file, return a copy of the token array.
385 5         7 return [ @{$self->{tokens}} ];
  5         22  
386             }
387              
388             =pod
389              
390             =head2 increment_cursor
391              
392             Although exposed as a public method, C<increment_cursor> is implemented
393             for expert use only, when writing lexers or other components that work
394             directly on token streams.
395              
396             It manually increments the token cursor forward through the file, in effect
397             "skipping" the next token.
398              
399             Returns true if the cursor is incremented, C<0> if already at the end of
400             the file, or C<undef> on error.
401              
402             =cut
403              
404             sub increment_cursor {
405             # Do this via the get_token method, which makes sure there
406             # is actually a token there to move to.
407 0 0   0 1 0 $_[0]->get_token and 1;
408             }
409              
410             =pod
411              
412             =head2 decrement_cursor
413              
414             Although exposed as a public method, C<decrement_cursor> is implemented
415             for expert use only, when writing lexers or other components that work
416             directly on token streams.
417              
418             It manually decrements the token cursor backwards through the file, in
419             effect "rolling back" the token stream. Indeed, that is its primary
420             purpose: it lets the component that is consuming the token stream
421             implement some sort of "roll back" feature in its use
422             of the token stream.
423              
424             Returns true if the cursor is decremented, C<0> if already at the
425             beginning of the file, or C<undef> on error.
426              
427             =cut
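The two cursor methods together support simple peek-and-rollback logic in a consumer. A sketch, where `$Tokenizer` is an existing PPI::Tokenizer and the `want_token()` predicate is hypothetical:

```perl
# Peek at the next token; if it is not what we want, roll it back so
# the next get_token call returns it again.
my $Token = $Tokenizer->get_token;
if ( $Token and !want_token($Token) ) {    # want_token() is hypothetical
    $Tokenizer->decrement_cursor;
}
```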
428              
429             sub decrement_cursor {
430 0     0 1 0 my $self = shift;
431              
432             # Check for the beginning of the file
433 0 0       0 return 0 unless $self->{token_cursor};
434              
435             # Decrement the token cursor
436 0         0 $self->{token_eof} = 0;
437 0         0 --$self->{token_cursor};
438             }
439              
440              
441              
442              
443              
444             #####################################################################
445             # Working With Source
446              
447             # Fetches the next line from the input line buffer
448             # Returns undef at EOF.
449             sub _get_line {
450 111750     111750   104141 my $self = shift;
451 111750 100       173560 return undef unless $self->{source}; # EOF hit previously
452              
453             # Pull off the next line
454 95605         88574 my $line = shift @{$self->{source}};
  95605         151303  
455              
456             # Flag EOF if we hit it
457 95605 100       159863 $self->{source} = undef unless defined $line;
458              
459             # Return the line (or EOF flag)
460 95605         140333 return $line; # string or undef
461             }
462              
463             # Fetches the next line, ready to process
464             # Returns 1 on success
465             # Returns 0 on EOF
466             sub _fill_line {
467 109357     109357   104355 my $self = shift;
468 109357         115035 my $inscan = shift;
469              
470             # Get the next line
471 109357         149874 my $line = $self->_get_line;
472 109357 100       159733 unless ( defined $line ) {
473             # End of file
474 32520 100       45913 unless ( $inscan ) {
475 30466         47347 delete $self->{line};
476 30466         34335 delete $self->{line_cursor};
477 30466         31171 delete $self->{line_length};
478 30466         54562 return 0;
479             }
480              
481             # In the scan version, just set the cursor to the end
482             # of the line, and the rest should just cascade out.
483 2054         2608 $self->{line_cursor} = $self->{line_length};
484 2054         3840 return 0;
485             }
486              
487             # Populate the appropriate variables
488 76837         114403 $self->{line} = $line;
489 76837         87653 $self->{line_cursor} = -1;
490 76837         94446 $self->{line_length} = length $line;
491 76837         80131 $self->{line_count}++;
492              
493 76837         132537 1;
494             }
495              
496             # Get the current character
497             sub _char {
498 0     0   0 my $self = shift;
499 0         0 substr( $self->{line}, $self->{line_cursor}, 1 );
500             }
501              
502              
503              
504              
505              
506             ####################################################################
507             # Per line processing methods
508              
509             # Processes the next line
510             # Returns 1 on successful completion
511             # Returns 0 if EOF
512             # Returns undef on error
513             sub _process_next_line {
514 103881     103881   105247 my $self = shift;
515              
516             # Fill the line buffer
517 103881         99693 my $rv;
518 103881 100       148858 unless ( $rv = $self->_fill_line ) {
519 30466 50       47340 return undef unless defined $rv;
520              
521             # End of file, finalize last token
522 30466         51210 $self->_finalize_token;
523 30466         55260 return 0;
524             }
525              
526             # Run the __TOKENIZER__on_line_start
527 73415         192547 $rv = $self->{class}->__TOKENIZER__on_line_start( $self );
528 73415 100       109732 unless ( $rv ) {
529             # If there are no more source lines, then clean up
530 28445 100 66     45453 if ( ref $self->{source} eq 'ARRAY' and ! @{$self->{source}} ) {
  28445         54325  
531 308         736 $self->_clean_eof;
532             }
533              
534             # Defined but false means next line
535 28445 50       54881 return 1 if defined $rv;
536 0         0 PPI::Exception->throw("Error at line $self->{line_count}");
537             }
538              
539             # If we can't deal with the entire line, process char by char
540 44970         80521 while ( $rv = $self->_process_next_char ) {}
541 44969 50       69136 unless ( defined $rv ) {
542 0         0 PPI::Exception->throw("Error at line $self->{line_count}, character $self->{line_cursor}");
543             }
544              
545             # Trigger any action that needs to happen at the end of a line
546 44969         112050 $self->{class}->__TOKENIZER__on_line_end( $self );
547              
548             # If there are no more source lines, then clean up
549 44969 100 100     100220 unless ( ref($self->{source}) eq 'ARRAY' and @{$self->{source}} ) {
  42674         94296  
550 16623         29101 return $self->_clean_eof;
551             }
552              
553 28346         52350 return 1;
554             }
555              
556              
557              
558              
559              
560             #####################################################################
561             # Per-character processing methods
562              
563             # Process on a per-character basis.
564             # Note that due to the high number of times this gets
565             # called, it has been fairly heavily in-lined, so the code
566             # might look a bit ugly and duplicated.
567             sub _process_next_char {
568 705254     705254   679230 my $self = shift;
569              
570             ### FIXME - This checks for a screwed up condition that triggers
571             ### several warnings, amongst other things.
572 705254 50 33     1551890 if ( ! defined $self->{line_cursor} or ! defined $self->{line_length} ) {
573             # $DB::single = 1;
574 0         0 return undef;
575             }
576              
577 705254         688510 $self->{line_cursor}++;
578 705254 100       827682 return 0 if $self->_at_line_end;
579              
580             # Pass control to the token class
581 660285         634383 my $result;
582 660285 100       1184626 unless ( $result = $self->{class}->__TOKENIZER__on_char( $self ) ) {
583             # undef is error. 0 is "Did stuff ourself, you don't have to do anything"
584 94068 50       239091 return defined $result ? 1 : undef;
585             }
586              
587             # We will need the value of the current character
588 566216         755002 my $char = substr( $self->{line}, $self->{line_cursor}, 1 );
589 566216 100       768544 if ( $result eq '1' ) {
590             # If __TOKENIZER__on_char returns 1, it is signaling that it thinks that
591             # the character is part of it.
592              
593             # Add the character
594 96222 50       129446 if ( defined $self->{token} ) {
595 96222         132975 $self->{token}->{content} .= $char;
596             } else {
597 0 0       0 defined($self->{token} = $self->{class}->new($char)) or return undef;
598             }
599              
600 96222         184958 return 1;
601             }
602              
603             # We have been provided with the name of a class
604 469994 100       746203 if ( $self->{class} ne "PPI::Token::$result" ) {
    100          
605             # New class
606 228473         305009 $self->_new_token( $result, $char );
607             } elsif ( defined $self->{token} ) {
608             # Same class as current
609 44589         58722 $self->{token}->{content} .= $char;
610             } else {
611             # Same class, but no current
612 196932 50       340282 defined($self->{token} = $self->{class}->new($char)) or return undef;
613             }
614              
615 469994         881797 1;
616             }
617              
618             sub _at_line_end {
619 705254     705254   785038 my ($self) = @_;
620 705254         1197660 return $self->{line_cursor} >= $self->{line_length};
621             }
622              
623              
624              
625              
626              
627             #####################################################################
628             # Altering Tokens in Tokenizer
629              
630             # Finish the end of a token.
631             # Returns the resulting parse class as a convenience.
632             sub _finalize_token {
633 610515     610515   614217 my $self = shift;
634 610515 100       841677 return $self->{class} unless defined $self->{token};
635              
636             # Add the token to the token buffer
637 580047         532926 push @{ $self->{tokens} }, $self->{token};
  580047         840484  
638 580047         636109 $self->{token} = undef;
639              
640             # Return the parse class to that of the zone we are in
641 580047         1042892 $self->{class} = $self->{zone};
642             }
643              
644             # Creates a new token and sets it in the tokenizer
645             # The defined() in here prevent a ton of calls to PPI::Util::TRUE
646             sub _new_token {
647 383113     383113   404817 my $self = shift;
648             # throw PPI::Exception() unless @_;
649 383113 100       628877 my $class = substr( $_[0], 0, 12 ) eq 'PPI::Token::'
650             ? shift : 'PPI::Token::' . shift;
651              
652             # Finalize any existing token
653 383113 100       648810 $self->_finalize_token if defined $self->{token};
654              
655             # Create the new token and update the parse class
656 383113 50       785566 defined($self->{token} = $class->new($_[0])) or PPI::Exception->throw;
657 383113         463336 $self->{class} = $class;
658              
659 383113         429272 1;
660             }
661              
662             # At the end of the file, we need to clean up the results of the erroneous
663             # space that we inserted at the beginning of the process.
664             sub _clean_eof {
665 16936     16936   18684 my $self = shift;
666              
667             # Finish any partially completed token
668 16936 100       29479 $self->_finalize_token if $self->{token};
669              
670             # Find the last token, and if it has no content, kill it.
671             # There appears to be some evidence that such "null tokens" are
672             # somehow getting created accidentally.
673 16936         22537 my $last_token = $self->{tokens}->[ -1 ];
674 16936 50       29115 unless ( length $last_token->{content} ) {
675 0         0 pop @{$self->{tokens}};
  0         0  
676             }
677              
678             # Now, if the last character of the last token is a space we added,
679             # chop it off, deleting the token if there's nothing else left.
680 16936 100       28896 if ( $self->{source_eof_chop} ) {
681 15521         19645 $last_token = $self->{tokens}->[ -1 ];
682 15521         64946 $last_token->{content} =~ s/ $//;
683 15521 100       30839 unless ( length $last_token->{content} ) {
684             # Popping token
685 13236         13686 pop @{$self->{tokens}};
  13236         18521  
686             }
687              
688             # The hack involving adding an extra space is now reversed, and
689             # now nobody will ever know. The perfect crime!
690 15521         24993 $self->{source_eof_chop} = '';
691             }
692              
693 16936         48720 1;
694             }
695              
696              
697              
698              
699              
700             #####################################################################
701             # Utility Methods
702              
703             # Context
704             sub _last_token {
705 0     0   0 $_[0]->{tokens}->[-1];
706             }
707              
708             sub _last_significant_token {
709 3127     3127   3524 my $self = shift;
710 3127         3613 my $cursor = $#{ $self->{tokens} };
  3127         4310  
711 3127         6254 while ( $cursor >= 0 ) {
712 4195         5431 my $token = $self->{tokens}->[$cursor--];
713 4195 100       12184 return $token if $token->significant;
714             }
715 402         776 return;
716             }
717              
718             # Get an array ref of previous significant tokens.
719             # Like _last_significant_token except it gets more than just one token
720             # Returns array with 0 to x entries
721             sub _previous_significant_tokens {
722 165598     165598   163569 my $self = shift;
723 165598   50     231169 my $count = shift || 1;
724 165598         146617 my $cursor = $#{ $self->{tokens} };
  165598         204779  
725              
726 165598         176653 my @tokens;
727 165598         228368 while ( $cursor >= 0 ) {
728 267863         290835 my $token = $self->{tokens}->[$cursor--];
729 267863 100       468781 next if not $token->significant;
730 172714         179463 push @tokens, $token;
731 172714 100       272766 last if @tokens >= $count;
732             }
733              
734 165598         278044 return @tokens;
735             }
736              
737             my %OBVIOUS_CLASS = (
738             'PPI::Token::Symbol' => 'operator',
739             'PPI::Token::Magic' => 'operator',
740             'PPI::Token::Number' => 'operator',
741             'PPI::Token::ArrayIndex' => 'operator',
742             'PPI::Token::Quote::Double' => 'operator',
743             'PPI::Token::Quote::Interpolate' => 'operator',
744             'PPI::Token::Quote::Literal' => 'operator',
745             'PPI::Token::Quote::Single' => 'operator',
746             'PPI::Token::QuoteLike::Backtick' => 'operator',
747             'PPI::Token::QuoteLike::Command' => 'operator',
748             'PPI::Token::QuoteLike::Readline' => 'operator',
749             'PPI::Token::QuoteLike::Regexp' => 'operator',
750             'PPI::Token::QuoteLike::Words' => 'operator',
751             );
752              
753             my %OBVIOUS_CONTENT = (
754             '(' => 'operand',
755             '{' => 'operand',
756             '[' => 'operand',
757             ';' => 'operand',
758             '}' => 'operator',
759             );
760              
761              
762             my %USUALLY_FORCES = map { $_ => 1 } qw( sub package use no );
763              
764             # Try to determine operator/operand context, if possible.
765             # Returns "operator", "operand", or "" if unknown.
766             sub _opcontext {
767 11030     11030   12603 my $self = shift;
768 11030         18529 my @tokens = $self->_previous_significant_tokens(1);
769 11030         13731 my $p0 = $tokens[0];
770 11030 100       26595 return '' if not $p0;
771 10912         14973 my $c0 = ref $p0;
772              
773             # Map the obvious cases
774 10912 100       30795 return $OBVIOUS_CLASS{$c0} if defined $OBVIOUS_CLASS{$c0};
775 2321 100       4746 return $OBVIOUS_CONTENT{$p0} if defined $OBVIOUS_CONTENT{$p0};
776              
777             # Most of the time after an operator, we are an operand
778 1733 100       6837 return 'operand' if $p0->isa('PPI::Token::Operator');
779              
780             # If there's NOTHING, it's operand
781 1509 50       2655 return 'operand' if $p0->content eq '';
782              
783             # Otherwise, we don't know
784 1509         2969 return ''
785             }
786              
787             # Assuming we are currently parsing the word 'x', return true
788             # if previous tokens imply the x is an operator, false otherwise.
789             sub _current_x_is_operator {
790 1113     1113   1485 my ( $self ) = @_;
791 1113 100       1032 return if !@{$self->{tokens}};
  1113         2667  
792              
793 915         1816 my ($prev, $prevprev) = $self->_previous_significant_tokens(2);
794 915 100       2753 return if !$prev;
795              
796 914 100       3426 return !$self->__current_token_is_forced_word if $prev->isa('PPI::Token::Word');
797              
798             return (!$prev->isa('PPI::Token::Operator') || $X_CAN_FOLLOW_OPERATOR{$prev})
799 748   100     4424 && (!$prev->isa('PPI::Token::Structure') || $X_CAN_FOLLOW_STRUCTURE{$prev})
800             && !$prev->isa('PPI::Token::Label')
801             ;
802             }
803              
804              
805             # Assuming we are at the end of parsing the current token that could be a word,
806             # a wordlike operator, or a version string, try to determine whether context
807             # before or after it forces it to be a bareword. This method is only useful
808             # during tokenization.
809             sub __current_token_is_forced_word {
810 33401     33401   49836 my ( $t, $word ) = @_;
811              
812             # Check if forced by preceding tokens.
813              
814 33401         46648 my ( $prev, $prevprev ) = $t->_previous_significant_tokens(2);
815 33401 100       69526 if ( !$prev ) {
816 8860         19761 pos $t->{line} = $t->{line_cursor};
817             }
818             else {
819 24541         39425 my $content = $prev->{content};
820              
821             # We are forced if we are a method name.
822             # '->' will always be an operator, so we don't check its type.
823 24541 100       39667 return 1 if $content eq '->';
824              
825             # If we are contained in a pair of curly braces, we are probably a
826             # forced bareword hash key. '{' is never a word or operator, so we
827             # don't check its type.
828 24411         47370 pos $t->{line} = $t->{line_cursor};
829 24411 100 100     59069 return 1 if $content eq '{' and $t->{line} =~ /\G\s*\}/gc;
830              
831             # sub, package, use, and no all indicate that what immediately follows
832             # is a word not an operator or (in the case of sub and package) a
833             # version string. However, we don't want to be fooled by 'package
834             # package v10' or 'use no v10'. We're a forced package unless we're
835             # preceded by 'package sub', in which case we're a version string.
836             # We also have to make sure that the sub/package/etc doing the forcing
837             # is not a method call.
838 24181 100       49339 if( $USUALLY_FORCES{$content}) {
839 5631 100 66     18776 return if defined $word and $word =~ /^v[0-9]+$/ and ( $content eq "use" or $content eq "no" );
      100        
      100        
840 5621 100       25010 return 1 if not $prevprev;
841 236 100 100     465 return 1 if not $USUALLY_FORCES{$prevprev->content} and $prevprev->content ne '->';
842 6         32 return;
843             }
844             }
845             # pos on $t->{line} is guaranteed to be set at this point.
846              
847             # Check if forced by following tokens.
848              
849             # If the word is followed by => it is probably a word, not a regex.
850 27410 100       65221 return 1 if $t->{line} =~ /\G\s*=>/gc;
851              
852             # Otherwise we probably aren't forced
853 26619         121515 return '';
854             }
855              
856             sub _features {
857 37     37   78 my ( $self, $arg ) = @_;
858 37 50 0     138 return $arg ? $self->{feature_set} = $arg : $self->{feature_set} || {};
859             }
860              
861 12314     12314   38002 sub _current_token_has_signatures_active { shift->{feature_set}{signatures} }
862              
863             1;
864              
865             =pod
866              
867             =head1 NOTES
868              
869             =head2 How the Tokenizer Works
870              
871             Understanding the Tokenizer is not for the faint-hearted. It is by far
872             the most complex and twisty piece of perl I've ever written that is actually
873             still built properly and isn't a terrible spaghetti-like mess. In fact, you
874             probably want to skip this section.
875              
876             But if you really want to understand, well then here goes.
877              
878             =head2 Source Input and Clean Up
879              
880             The Tokenizer starts by taking source in a variety of forms, sucking it
881             all in and merging into one big string, and doing our own internal line
882             split, using a "universal line separator" which allows the Tokenizer to
883             take source for any platform (and even supports a few known types of
884             broken newlines caused by mixed mac/pc/*nix editor screw ups).
885              
886             The resulting array of lines is used to feed the tokenizer, and is also
887             accessed directly by the heredoc-logic to do the line-oriented part of
888             here-doc support.
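As a sketch of the idea (an illustration only, not PPI's actual implementation, which differs in detail), a "universal line separator" split can be done by normalising all known newline flavours before splitting:

```perl
# A minimal sketch of a "universal line separator" split.
sub split_universal_lines {
    my ($source) = @_;
    # Normalise DOS (\r\n), old Mac (\r) and Unix (\n) endings -- and
    # the broken \r\r\n produced by mixed-editor mishaps -- to \n.
    $source =~ s/(?:\015{1,2}\012|\015|\012)/\n/g;
    # Split while keeping the newline attached to each line.
    return split /(?<=\n)/, $source;
}
```

Keeping the newline attached to each line matters, because the tokenizer treats the newline as part of the line it terminates.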
889              
890             =head2 Doing Things the Old Fashioned Way
891              
892             Due to the complexity of perl, and after 2 previously aborted parser
893             attempts, in the end the tokenizer was fashioned around a line-buffered
894             character-by-character method.
895              
896             That is, the Tokenizer pulls and holds a line at a time into a line buffer,
897             and then iterates a cursor along it. At each cursor position, a method is
898             called in whatever token class we are currently in, which will examine the
899             character at the current position, and handle it.
900              
901             As the handler methods in the various token classes are called, they
902             build up an output token array for the source code.
903              
904             Various parts of the Tokenizer use look-ahead, arbitrary-distance
905             look-behind (although currently the maximum is three significant tokens),
906             or both, and various other heuristic guesses.
907              
908             I've been told it is officially termed a I<"backtracking parser
909             with infinite lookaheads">.
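The loop described above can be sketched as follows. This is a toy version with hypothetical names; PPI's real handlers live in the token classes themselves and are considerably more involved:

```perl
# A minimal, self-contained sketch of the line-buffered cursor loop.
# All names here are illustrative only, not PPI's API.
my %handler = (
    whitespace => sub {
        my ($state, $char) = @_;
        if ($char =~ /\w/) {
            # A word begins: switch class, open a new token, and
            # consume nothing so the new handler sees this char.
            $state->{class} = 'word';
            push @{ $state->{tokens} }, '';
            return 0;
        }
        return 1;    # treat everything else as filler in this sketch
    },
    word => sub {
        my ($state, $char) = @_;
        if ($char =~ /\w/) {
            $state->{tokens}[-1] .= $char;    # grow the current token
            return 1;
        }
        $state->{class} = 'whitespace';       # token done: fall back
        return 0;
    },
);

sub tokenize_line {
    my ($line) = @_;
    my $state  = { class => 'whitespace', tokens => [] };
    my $cursor = 0;
    while ( $cursor < length $line ) {
        my $char = substr $line, $cursor, 1;
        # Dispatch on the current class; the handler may consume the
        # character, change the class, or both.
        $cursor += $handler{ $state->{class} }->($state, $char);
    }
    return @{ $state->{tokens} };
}
```

Note that a handler returning zero hands the same character to a different class's handler, which is exactly why handler calls need not match character counts.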
910              
911             =head2 State Variables
912              
913             Aside from the current line and the character cursor, the Tokenizer
914             maintains a number of different state variables.
915              
916             =over
917              
918             =item Current Class
919              
920             The Tokenizer maintains the current token class at all times. Much of the
921             time is just going to be the "Whitespace" class, which is what the base of
922             a document is. As the tokenizer executes the various character handlers,
923             the class changes a lot as it moves along. In fact, in some instances,
924             the character handler may not handle the character directly itself, but
925             rather change the "current class" and then hand off to the character
926             handler for the new class.
927              
928             Because of this, and some other things I'll deal with later, the number of
929             times the character handlers are called does not in fact have a direct
930             relationship to the number of actual characters in the document.
931              
932             =item Current Zone
933              
934             Rather than create a class stack to allow for infinitely nested layers of
935             classes, the Tokenizer recognises just a single layer.
936              
937             To put it a different way, in various parts of the file, the Tokenizer will
938             recognise different "base" or "substrate" classes. When a Token such as a
939             comment or a number is finalised by the tokenizer, it "falls back" to the
940             base state.
941              
942             This allows proper tokenization of special areas such as __DATA__
943             and __END__ blocks, which also contain things like comments and POD,
944             without allowing the creation of any significant Tokens inside these areas.
945              
946             For the main part of a document we use L<PPI::Token::Whitespace> for this,
947             with the idea being that code is "floating in a sea of whitespace".
948              
949             =item Current Token
950              
951             The final main state variable is the "current token". This is the Token
952             that is currently being built by the Tokenizer. For certain types, it
953             can be manipulated and morphed and change class quite a bit while being
954             assembled, as the Tokenizer's understanding of the token content changes.
955              
956             When the Tokenizer is confident that it has seen the end of the Token, it
957             will be "finalized", which adds it to the output token array and resets
958             the current class to that of the zone that we are currently in.
959              
960             I should also note at this point that the "current token" variable is
961             optional. The Tokenizer is capable of knowing what class it is currently
962             set to, without actually having accumulated any characters in the Token.
963              
964             =back
965              
966             =head2 Making It Faster
967              
968             As I'm sure you can imagine, calling several different methods for each
969             character and running regexes and other complex heuristics made the first
970             fully working version of the tokenizer extremely slow.
971              
972             During testing, I created a metric to measure parsing speed called
973             LPGC, or "lines per gigacycle". A gigacycle is simply a billion CPU
974             cycles on a typical single-core CPU, and so a Tokenizer running at
975             "1000 lines per gigacycle" should tokenize around 1200 lines of
976             code per second when running on a 1200 MHz processor.
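Since gigahertz is just gigacycles per second, converting LPGC to real-world throughput is a single multiplication:

```perl
# lines/second = LPGC x clock speed in GHz (gigacycles per second)
my $lpgc = 1000;                        # lines per gigacycle
my $ghz  = 1.2;                         # a 1200 MHz processor
my $lines_per_second = $lpgc * $ghz;    # 1200 lines per second
```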
977              
978             The first working version of the tokenizer ran at only 350 LPGC, so to
979             tokenize a typical large module such as L<ExtUtils::MakeMaker> took
980             10-15 seconds. This sluggishness made it impractical for many uses.
981              
982             So in the current parser, there are multiple layers of optimisation
983             very carefully built into the basic design. This has brought the tokenizer
984             up to a more reasonable 1000 LPGC, at the expense of making the code
985             quite a bit twistier.
986              
987             =head2 Making It Faster - Whole Line Classification
988              
989             The first step in the optimisation process was to add a new handler to
990             enable several of the more basic classes (whitespace, comments) to be
991             able to be parsed a line at a time. At the start of each line, a
992             special optional handler (only supported by a few classes) is called to
993             check and see if the entire line can be parsed in one go.
994              
995             This is used mainly to handle things like POD, comments, empty lines,
996             and a few other minor special cases.
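A sketch of that fast path follows; the name and patterns are hypothetical (PPI's real line-start handlers are methods on the token classes):

```perl
# Hypothetical whole-line fast path: classify an entire line in one
# go where possible, otherwise signal a fall back to char-by-char.
sub classify_whole_line {
    my ($line) = @_;
    return 'whitespace' if $line =~ /^\s*$/;
    return 'comment'    if $line =~ /^\s*#/;
    return 'pod'        if $line =~ /^=\w+/;
    return undef;    # not a simple line; use the slow path
}
```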
997              
998             =head2 Making It Faster - Inlining
999              
1000             The second stage of the optimisation involved inlining a small
1001             number of critical methods that were repeated an extremely high number
1002             of times. Profiling suggested that there were about 1,000,000 individual
1003             method calls per gigacycle, and by cutting these by two thirds a significant
1004             speed improvement was gained, in the order of about 50%.
1005              
1006             You may notice that many methods in the C<PPI::Tokenizer> code look
1007             very nested and longhand. This is primarily due to this inlining.
1008              
1009             At around this time, some statistics code that existed in the early
1010             versions of the parser was also removed, as it was determined that
1011             it was consuming around 15% of the CPU for the entire parser, while
1012             making the core more complicated.
1013              
1014             A judgment call was made that with the difficulties likely to be
1015             encountered with future planned enhancements, and given the relatively
1016             high cost involved, the statistics features would be removed from the
1017             Tokenizer.
1018              
1019             =head2 Making It Faster - Quote Engine
1020              
1021             Once inlining had reached diminishing returns, it became obvious from
1022             the profiling results that a huge amount of time was being spent
1023             stepping a char at a time through long, simple and "syntactically boring"
1024             code such as comments and strings.
1025              
1026             The existing regex engine was expanded to also encompass quotes and
1027             other quote-like things, and a special abstract base class was added
1028             that provided a number of specialised parsing methods that would "scan
1029             ahead", looking ahead to find the end of a string, and updating
1030             the cursor to leave it in a valid position for the next call.
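For instance, the body of a single-quoted string can be consumed with one regex "scan ahead" instead of one handler call per character. This is a simplified sketch; the real Quote Engine also handles multi-line strings and many more quote-like constructs:

```perl
# Simplified "scan ahead": given the position just after an opening
# single quote, jump the cursor straight past the closing quote.
sub scan_single_quoted {
    my ($line, $pos) = @_;
    pos($line) = $pos;
    # Consume escaped characters or anything that is not a quote or
    # backslash, then require the closing quote itself.
    if ( $line =~ /\G((?:\\.|[^'\\])*)'/gc ) {
        return ( $1, pos($line) );    # (string body, new cursor)
    }
    return;    # no close on this line; the string continues
}
```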
1031              
1032             This is also the point at which the number of character handler calls began
1033             to greatly differ from the number of characters. But it has been done
1034             in a way that allows the parser to retain the power of the original
1035             version at the critical points, while skipping through the "boring bits"
1036             as needed for additional speed.
1037              
1038             The addition of this feature allowed the tokenizer to exceed 1000 LPGC
1039             for the first time.
1040              
1041             =head2 Making It Faster - The "Complete" Mechanism
1042              
1043             As it became evident that great speed increases were available by using
1044             this "skipping ahead" mechanism, a new handler method was added that
1045             explicitly handles the parsing of an entire token, where the structure
1046             of the token is relatively simple. Tokens such as symbols fit this case,
1047             as once we are past the initial sigil and word char, we know that we
1048             can skip ahead and "complete" the rest of the token much more easily.
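For example, a simple symbol can be "completed" with a single regex once its sigil has been seen. This is a hypothetical simplification of the real complete handlers:

```perl
# Hypothetical "complete" handler for a simple symbol token: one
# regex consumes the whole token instead of a call per character.
sub complete_symbol {
    my ($line, $pos) = @_;    # $pos is the index of the sigil
    pos($line) = $pos;
    return $1 if $line =~ /\G([\$\@%]\w+(?:(?:::|')\w+)*)/gc;
    return undef;             # not a simple symbol after all
}
```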
1049              
1050             A number of these have been added for most or possibly all of the common
1051             cases, with most of these "complete" handlers implemented using regular
1052             expressions.
1053              
1054             In fact, so many have been added that at this point, you could arguably
1055             reclassify the tokenizer as a "hybrid regex, char-by-char heuristic
1056             tokenizer". More tokens are now consumed in "complete" methods in a
1057             typical program than are handled by the normal char-by-char methods.
1058              
1059             Many of these complete-handlers were implemented during the writing
1060             of the Lexer, and this has allowed the full parser to maintain around
1061             1000 LPGC despite the increasing weight of the Lexer.
1062              
1063             =head2 Making It Faster - Porting To C (In Progress)
1064              
1065             While it would be extraordinarily difficult to port all of the Tokenizer
1066             to C, work has started on a L<PPI::XS> "accelerator" package which acts as
1067             a separate and automatically-detected add-on to the main PPI package.
1068              
1069             L<PPI::XS> implements faster versions of a variety of functions scattered
1070             over the entire PPI codebase, from the Tokenizer Core, Quote Engine, and
1071             various other places, and implements them identically in XS/C.
1072              
1073             In particular, the skip-ahead methods from the Quote Engine would appear
1074             to be extremely amenable to being done in C, and a number of other
1075             functions could be cherry-picked one at a time and implemented in C.
1076              
1077             Each method is heavily tested to ensure that the functionality is
1078             identical, and a versioning mechanism is included to ensure that if a
1079             function gets out of sync, L<PPI::XS> will degrade gracefully and just
1080             not replace that single method.
1081              
1082             =head1 TO DO
1083              
1084             - Add an option to reset or seek the token stream...
1085              
1086             - Implement more Tokenizer functions in L<PPI::XS>
1087              
1088             =head1 SUPPORT
1089              
1090             See the L<support section|PPI/SUPPORT> in the main module.
1091              
1092             =head1 AUTHOR
1093              
1094             Adam Kennedy E<lt>adamk@cpan.orgE<gt>
1095              
1096             =head1 COPYRIGHT
1097              
1098             Copyright 2001 - 2011 Adam Kennedy.
1099              
1100             This program is free software; you can redistribute
1101             it and/or modify it under the same terms as Perl itself.
1102              
1103             The full text of the license can be found in the
1104             LICENSE file included with this module.
1105              
1106             =cut