File Coverage

blib/lib/PPI/Tokenizer.pm
Criterion Covered Total %
statement 221 252 87.7
branch 113 144 78.4
condition 27 37 72.9
subroutine 29 33 87.8
pod 5 5 100.0
total 395 471 83.8


line stmt bran cond sub pod time code
1             package PPI::Tokenizer;
2              
3             =pod
4              
5             =head1 NAME
6              
7             PPI::Tokenizer - The Perl Document Tokenizer
8              
9             =head1 SYNOPSIS
10              
11             # Create a tokenizer for a file, array or string
12             $Tokenizer = PPI::Tokenizer->new( 'filename.pl' );
13             $Tokenizer = PPI::Tokenizer->new( \@lines );
14             $Tokenizer = PPI::Tokenizer->new( \$source );
15            
16             # Return all the tokens for the document
17             my $tokens = $Tokenizer->all_tokens;
18            
19             # Or we can use it as an iterator
20             while ( my $Token = $Tokenizer->get_token ) {
21             print "Found token '$Token'\n";
22             }
23            
24             # If we REALLY need to manually nudge the cursor, you
25             # can do that too (the lexer needs this ability to do rollbacks)
26             $is_incremented = $Tokenizer->increment_cursor;
27             $is_decremented = $Tokenizer->decrement_cursor;
28              
29             =head1 DESCRIPTION
30              
31             PPI::Tokenizer is the class that provides Tokenizer objects for use in
32             breaking strings of Perl source code into Tokens.
33              
34             By the time you are reading this, you probably need to know a little
35             about the difference between how perl parses Perl "code" and how PPI
36             parses Perl "documents".
37              
38             "perl" itself (the interpreter) uses a heavily modified lex specification
39             to specify its parsing logic, maintains several types of state as it
40             goes, and incrementally tokenizes, lexes AND EXECUTES at the same time.
41              
42             In fact, it is provably impossible to use perl's parsing method without
43             simultaneously executing code. A formal mathematical proof has been
44             published demonstrating the method.
45              
46             This is where the truism "Only perl can parse Perl" comes from.
47              
48             PPI uses a completely different approach by abandoning the (impossible)
49             ability to parse Perl the same way that the interpreter does, and instead
50             parsing the source as a document, using a document structure independently
51             derived from the Perl documentation and approximating the perl interpreter
52             interpretation as closely as possible.
53              
54             It was touch and go for a long time whether we could get it close enough,
55             but in the end it turned out that it could be done.
56              
57             In this approach, the tokenizer C<PPI::Tokenizer> is implemented
58             separately from the lexer L<PPI::Lexer>.
59              
60             The job of C<PPI::Tokenizer> is to take pure source as a string and break it
61             up into a stream/set of tokens, and contains most of the "black magic" used
62             in PPI. By comparison, the lexer implements a relatively straightforward
63             tree structure, and has an implementation that is uncomplicated (compared
64             to the insanity in the tokenizer at least).
65              
66             The Tokenizer uses an immense amount of heuristics, guessing and cruft,
67             supported by a very B<flexible> internal API, but fortunately it was
68             possible to largely encapsulate the black magic, so there is not a lot that
69             gets exposed to people using the C<PPI::Tokenizer> itself.
70              
71             =head1 METHODS
72              
73             Despite the incredible complexity, the Tokenizer itself only exposes a
74             relatively small number of methods, with most of the complexity implemented
75             in private methods.
76              
77             =cut
78              
79             # Make sure everything we need is loaded so
80             # we don't have to go and load all of PPI.
81 66     66   478 use strict;
  66         159  
  66         3249  
82 66     66   400 use Params::Util qw{_INSTANCE _SCALAR0 _ARRAY0};
  66         180  
  66         5875  
83 66     66   515 use List::Util 1.33 ();
  66         1923  
  66         1663  
84 66     66   359 use PPI::Util ();
  66         147  
  66         1715  
85 66     66   550 use PPI::Element ();
  66         147  
  66         1398  
86 66     66   389 use PPI::Token ();
  66         159  
  66         1410  
87 66     66   368 use PPI::Exception ();
  66         138  
  66         1476  
88 66     66   33814 use PPI::Exception::ParserRejection ();
  66         227  
  66         1611  
89 66     66   379 use PPI::Document ();
  66         153  
  66         262928  
90              
91             our $VERSION = '1.28401'; # TRIAL
92              
93             # The x operator cannot follow most Perl operators, implying that
94             # anything beginning with x following an operator is a word.
95             # These are the exceptions.
96             my %X_CAN_FOLLOW_OPERATOR = map { $_ => 1 } qw( -- ++ );
97              
98             # The x operator cannot follow most structure elements, implying that
99             # anything beginning with x following a structure element is a word.
100             # These are the exceptions.
101             my %X_CAN_FOLLOW_STRUCTURE = map { $_ => 1 } qw( } ] \) );
102              
103             # Something that looks like the x operator but follows a word
104             # is usually that word's argument.
105             # These are the exceptions.
106             # chop, chomp, dump are ambiguous because they can have either parms
107             # or no parms.
108             my %X_CAN_FOLLOW_WORD = map { $_ => 1 } qw(
109             endgrent
110             endhostent
111             endnetent
112             endprotoent
113             endpwent
114             endservent
115             fork
116             getgrent
117             gethostent
118             getlogin
119             getnetent
120             getppid
121             getprotoent
122             getpwent
123             getservent
124             setgrent
125             setpwent
126             time
127             times
128             wait
129             wantarray
130             __SUB__
131             );
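The exception tables above drive a simple lookup: after most words, a following C<x> is usually that word's argument, but after the argument-less builtins listed in C<%X_CAN_FOLLOW_WORD> it can be the repetition operator. A minimal Python sketch of that lookup (illustrative only, with a shortened table; not PPI's actual code):

```python
# Abbreviated stand-in for %X_CAN_FOLLOW_WORD above: builtins that take
# no arguments, so an "x" after them can be the repetition operator.
X_CAN_FOLLOW_WORD = {"fork", "time", "times", "wait", "wantarray", "__SUB__"}

def classify_x_after_word(previous_word):
    """How should an 'x' immediately following a word be read?"""
    if previous_word in X_CAN_FOLLOW_WORD:
        return "operator"  # e.g. "time x 3": x is the repetition operator
    return "word"          # otherwise x begins an argument word
```

The real tokenizer makes this decision character by character while building tokens, but the table-lookup core is the same.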
132              
133              
134              
135             #####################################################################
136             # Creation and Initialization
137              
138             =pod
139              
140             =head2 new $file | \@lines | \$source
141              
142             The main C<new> constructor creates a new Tokenizer object. These
143             objects have no configuration parameters, and can only be used once,
144             to tokenize a single perl source file.
145              
146             It takes as argument either a normal scalar containing source code,
147             a reference to a scalar containing source code, or a reference to an
148             ARRAY containing newline-terminated lines of source code.
149              
150             Returns a new C<PPI::Tokenizer> object on success, or throws a
151             L<PPI::Exception> exception on error.
152              
153             =cut
154              
155             sub new {
156 16891   33 16891 1 84199 my $class = ref($_[0]) || $_[0];
157              
158             # Create the empty tokenizer struct
159 16891         248035 my $self = bless {
160             # Source code
161             source => undef,
162             source_bytes => undef,
163             document => undef,
164              
165             # Line buffer
166             line => undef,
167             line_length => undef,
168             line_cursor => undef,
169             line_count => 0,
170              
171             # Parse state
172             token => undef,
173             class => 'PPI::Token::BOM',
174             zone => 'PPI::Token::Whitespace',
175             features => undef,
176              
177             # Output token buffer
178             tokens => [],
179             token_cursor => 0,
180             token_eof => 0,
181              
182             # Perl 6 blocks
183             perl6 => [],
184             }, $class;
185              
186 16891 50       120766 if ( ! defined $_[1] ) {
    100          
    50          
    0          
187             # We weren't given anything
188 0         0 PPI::Exception->throw("No source provided to Tokenizer");
189              
190             } elsif ( ! ref $_[1] ) {
191 506         3125 my $source = PPI::Util::_slurp($_[1]);
192 506 50       2002 if ( ref $source ) {
193             # Content returned by reference
194 506         2295 $self->{source} = $$source;
195             } else {
196             # Errors returned as a string
197 0         0 return( $source );
198             }
199              
200             } elsif ( _SCALAR0($_[1]) ) {
201 16385         31669 $self->{source} = ${$_[1]};
  16385         58660  
202              
203             } elsif ( _ARRAY0($_[1]) ) {
204 0         0 $self->{source} = join '', map { "$_\n" } @{$_[1]};
  0         0  
  0         0  
205              
206             } else {
207             # We don't support whatever this is
208 0         0 PPI::Exception->throw(ref($_[1]) . " is not supported as a source provider");
209             }
210              
211             # We can't handle a null string
212 16891         60945 $self->{source_bytes} = length $self->{source};
213 16891 100       54311 if ( $self->{source_bytes} ) {
214             # Split on local newlines
215 16887         787563 $self->{source} =~ s/(?:\015{1,2}\012|\015|\012)/\n/g;
216 16887         391689 $self->{source} = [ split /(?<=\n)/, $self->{source} ];
217              
218             } else {
219 4         13 $self->{source} = [ ];
220             }
221              
222             ### EVIL
223             # I'm explaining this earlier than I should so you can understand
224             # why I'm about to do something that looks very strange. There's
225             # a problem with the Tokenizer, in that tokens tend to change
226             # classes as each letter is added, but they don't get allocated
227             # their definite final class until the "end" of the token, the
228             # detection of which occurs in about a hundred different places,
229             # all through various crufty code (that triples the speed).
230             #
231             # However, in general, this does not apply to tokens in which a
232             # whitespace character is valid, such as comments, whitespace and
233             # big strings.
234             #
235             # So what we do is add a space to the end of the source. This
236             # triggers normal "end of token" functionality for all cases. Then,
237             # once the tokenizer hits end of file, it examines the last token to
238             # manually either remove the ' ' token, or chop it off the end of
239             # a longer one in which the space would be valid.
240 16891 100   72975   106565 if ( List::Util::any { /^__(?:DATA|END)__\s*$/ } @{$self->{source}} ) {
  72975 100       240233  
  16891 100       126295  
241 10         41 $self->{source_eof_chop} = '';
242             } elsif ( ! defined $self->{source}->[0] ) {
243 4         12 $self->{source_eof_chop} = '';
244             } elsif ( $self->{source}->[-1] =~ /\s$/ ) {
245 1127         4479 $self->{source_eof_chop} = '';
246             } else {
247 15750         48124 $self->{source_eof_chop} = 1;
248 15750         46873 $self->{source}->[-1] .= ' ';
249             }
250              
251 16891         102010 $self;
252             }
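The constructor above does two notable things to the source before tokenizing: it normalizes CRLF/CR/LF line endings to C<\n> and splits while keeping each newline attached to its line, and it appends a sentinel space to the final line so every token gets a clean "end of token" trigger, remembering to chop that space off later. A small Python sketch of those two tricks (illustrative only; it omits the C<__DATA__>/C<__END__> special case):

```python
import re

def prepare_source(source):
    """Normalize newlines, split into newline-terminated lines, and
    append a sentinel space to the last line if needed."""
    # Mirror of the Perl regex s/(?:\015{1,2}\012|\015|\012)/\n/g
    source = re.sub(r"\015{1,2}\012|\015|\012", "\n", source)
    # Split *after* each newline so the "\n" stays on its line
    lines = [l for l in re.split(r"(?<=\n)", source) if l != ""]
    eof_chop = False
    if lines and not re.search(r"\s$", lines[-1]):
        lines[-1] += " "   # sentinel space, removed again at EOF
        eof_chop = True
    return lines, eof_chop
```

The `eof_chop` flag plays the role of `source_eof_chop`: at end of file the tokenizer strips the sentinel space back off the last token.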
253              
254             sub _document {
255 16800     16800   45277 my $self = shift;
256 16800 50       66372 return @_ ? $self->{document} = shift : $self->{document};
257             }
258              
259              
260              
261              
262              
263             #####################################################################
264             # Main Public Methods
265              
266             =pod
267              
268             =head2 get_token
269              
270             When using the PPI::Tokenizer object as an iterator, the C<get_token>
271             method is the primary method that is used. It increments the cursor
272             and returns the next Token in the output array.
273              
274             The actual parsing of the file is done only as-needed, and a line at
275             a time. When C<get_token> hits the end of the token array, it will
276             cause the parser to pull in the next line and parse it, continuing
277             as needed until there are more tokens on the output array that
278             get_token can then return.
279              
280             This means that a number of Tokenizer objects can be created, and
281             they won't consume significant CPU until you actually begin to pull
282             tokens from them.
283              
284             Returns a L<PPI::Token> object on success, C<0> if the Tokenizer has
285             reached the end of the file, or C<undef> on error.
286              
287             =cut
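The lazy-parsing contract described above can be sketched as a cursor over a token buffer that only does more work when the buffer runs dry. This Python stand-in is illustrative (names are hypothetical, and it models only the token/EOF cases, not the C<undef> error case):

```python
class TokenCursor:
    """Toy model of get_token: parse more input only on demand."""

    def __init__(self, pending_tokens):
        self._pending = list(pending_tokens)  # stands in for unparsed lines
        self._tokens = []                     # output token buffer
        self._cursor = 0

    def _process_next_line(self):
        """Pretend to tokenize one more line; False means end of input."""
        if not self._pending:
            return False
        self._tokens.append(self._pending.pop(0))
        return True

    def get_token(self):
        # Pull in more input only when the buffer is exhausted
        while self._cursor >= len(self._tokens):
            if not self._process_next_line():
                return 0  # like PPI: false-but-defined means end of file
        token = self._tokens[self._cursor]
        self._cursor += 1
        return token
```

As in the real API, callers loop `while token := cursor.get_token():` and stop cleanly at the C<0> end-of-file marker.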
288              
289             sub get_token {
290 394502     394502 1 632170 my $self = shift;
291              
292             # Shortcut for EOF
293 394502 50 66     1033047 if ( $self->{token_eof}
294 13486         47849 and $self->{token_cursor} > scalar @{$self->{tokens}}
295             ) {
296 0         0 return 0;
297             }
298              
299             # Return the next token if we can
300 394502 100       1467622 if ( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) {
301 316269         517557 $self->{token_cursor}++;
302 316269         1351530 return $token;
303             }
304              
305 78233         122912 my $line_rv;
306              
307             # Catch exceptions and return undef, so that we
308             # can start to convert code to exception-based code.
309 78233         136576 my $rv = eval {
310             # No token, we need to get some more
311 78233         226066 while ( $line_rv = $self->_process_next_line ) {
312             # If there is something in the buffer, return it
313             # The defined() prevents a ton of calls to PPI::Util::TRUE
314 69135 100       229279 if ( defined( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) ) {
315 47948         88345 $self->{token_cursor}++;
316 47948         134256 return $token;
317             }
318             }
319 30284         61552 return undef;
320             };
321 78233 100       290102 if ( $@ ) {
    100          
322 1 50       13 if ( _INSTANCE($@, 'PPI::Exception') ) {
323 1         4 $@->throw;
324             } else {
325 0         0 my $errstr = $@;
326 0         0 $errstr =~ s/^(.*) at line .+$/$1/;
327 0         0 PPI::Exception->throw( $errstr );
328             }
329             } elsif ( $rv ) {
330 47948         314547 return $rv;
331             }
332              
333 30284 50       66636 if ( defined $line_rv ) {
334             # End of file, but we can still return things from the buffer
335 30284 50       91918 if ( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) {
336 0         0 $self->{token_cursor}++;
337 0         0 return $token;
338             }
339              
340             # Set our token end of file flag
341 30284         55402 $self->{token_eof} = 1;
342 30284         156354 return 0;
343             }
344              
345             # Error, pass it up to our caller
346 0         0 undef;
347             }
348              
349             =pod
350              
351             =head2 all_tokens
352              
353             When not being used as an iterator, the C<all_tokens> method tells
354             the Tokenizer to parse the entire file and return all of the tokens
355             in a single ARRAY reference.
356              
357             It should be noted that C<all_tokens> does B<not> interfere with the
358             use of the Tokenizer object as an iterator (does not modify the token
359             cursor) and use of the two different mechanisms can be mixed safely.
360              
361             Returns a reference to an ARRAY of L<PPI::Token> objects on success
362             or throws an exception on error.
363              
364             =cut
365              
366             sub all_tokens {
367 5     5 1 34 my $self = shift;
368              
369             # Catch exceptions and return undef, so that we
370             # can start to convert code to exception-based code.
371 5         15 my $ok = eval {
372             # Process lines until we get EOF
373 5 50       21 unless ( $self->{token_eof} ) {
374 5         11 my $rv;
375 5         22 while ( $rv = $self->_process_next_line ) {}
376 5 50       14 unless ( defined $rv ) {
377 0         0 PPI::Exception->throw("Error while processing source");
378             }
379              
380             # Clean up the end of the tokenizer
381 5         19 $self->_clean_eof;
382             }
383 5         12 1;
384             };
385 5 50       18 if ( !$ok ) {
386 0         0 my $errstr = $@;
387 0         0 $errstr =~ s/^(.*) at line .+$/$1/;
388 0         0 PPI::Exception->throw( $errstr );
389             }
390              
391             # End of file, return a copy of the token array.
392 5         13 return [ @{$self->{tokens}} ];
  5         35  
393             }
394              
395             =pod
396              
397             =head2 increment_cursor
398              
399             Although exposed as a public method, C<increment_cursor> is implemented
400             for expert use only, when writing lexers or other components that work
401             directly on token streams.
402              
403             It manually increments the token cursor forward through the file, in effect
404             "skipping" the next token.
405              
406             Returns true if the cursor is incremented, C<0> if already at the end of
407             the file, or C<undef> on error.
408              
409             =cut
410              
411             sub increment_cursor {
412             # Do this via the get_token method, which makes sure there
413             # is actually a token there to move to.
414 0 0   0 1 0 $_[0]->get_token and 1;
415             }
416              
417             =pod
418              
419             =head2 decrement_cursor
420              
421             Although exposed as a public method, C<decrement_cursor> is implemented
422             for expert use only, when writing lexers or other components that work
423             directly on token streams.
424              
425             It manually decrements the token cursor backwards through the file, in
426             effect "rolling back" the token stream. And indeed that is what it is
427             primarily intended for, when the component that is consuming the token
428             stream needs to implement some sort of "roll back" feature in its use
429             of the token stream.
430              
431             Returns true if the cursor is decremented, C<0> if already at the
432             beginning of the file, or C<undef> on error.
433              
434             =cut
435              
436             sub decrement_cursor {
437 0     0 1 0 my $self = shift;
438              
439             # Check for the beginning of the file
440 0 0       0 return 0 unless $self->{token_cursor};
441              
442             # Decrement the token cursor
443 0         0 $self->{token_eof} = 0;
444 0         0 --$self->{token_cursor};
445             }
446              
447              
448              
449              
450              
451             #####################################################################
452             # Working With Source
453              
454             # Fetches the next line from the input line buffer
455             # Returns undef at EOF.
456             sub _get_line {
457 107380     107380   167078 my $self = shift;
458 107380 100       311279 return undef unless $self->{source}; # EOF hit previously
459              
460             # Pull off the next line
461 91367         143483 my $line = shift @{$self->{source}};
  91367         255242  
462              
463             # Flag EOF if we hit it
464 91367 100       245204 $self->{source} = undef unless defined $line;
465              
466             # Return the line (or EOF flag)
467 91367         231537 return $line; # string or undef
468             }
469              
470             # Fetches the next line, ready to process
471             # Returns 1 on success
472             # Returns 0 on EOF
473             sub _fill_line {
474 105000     105000   165830 my $self = shift;
475 105000         191301 my $inscan = shift;
476              
477             # Get the next line
478 105000         252049 my $line = $self->_get_line;
479 105000 100       246119 unless ( defined $line ) {
480             # End of file
481 32343 100       72993 unless ( $inscan ) {
482 30289         74640 delete $self->{line};
483 30289         54955 delete $self->{line_cursor};
484 30289         47345 delete $self->{line_length};
485 30289         92640 return 0;
486             }
487              
488             # In the scan version, just set the cursor to the end
489             # of the line, and the rest should just cascade out.
490 2054         4537 $self->{line_cursor} = $self->{line_length};
491 2054         5938 return 0;
492             }
493              
494             # Populate the appropriate variables
495 72657         185204 $self->{line} = $line;
496 72657         136566 $self->{line_cursor} = -1;
497 72657         143715 $self->{line_length} = length $line;
498 72657         129431 $self->{line_count}++;
499              
500 72657         205339 1;
501             }
502              
503             # Get the current character
504             sub _char {
505 0     0   0 my $self = shift;
506 0         0 substr( $self->{line}, $self->{line_cursor}, 1 );
507             }
508              
509              
510              
511              
512              
513             ####################################################################
514             # Per line processing methods
515              
516             # Processes the next line
517             # Returns 1 on successful completion
518             # Returns 0 if EOF
519             # Returns undef on error
520             sub _process_next_line {
521 99436     99436   168452 my $self = shift;
522              
523             # Fill the line buffer
524 99436         163550 my $rv;
525 99436 100       230198 unless ( $rv = $self->_fill_line ) {
526 30289 50       74055 return undef unless defined $rv;
527              
528             # End of file, finalize last token
529 30289         86188 $self->_finalize_token;
530 30289         88106 return 0;
531             }
532              
533             # Run the __TOKENIZER__on_line_start
534 69147         341768 $rv = $self->{class}->__TOKENIZER__on_line_start( $self );
535 69147 100       175106 unless ( $rv ) {
536             # If there are no more source lines, then clean up
537 28255 100 66     82873 if ( ref $self->{source} eq 'ARRAY' and ! @{$self->{source}} ) {
  28255         88190  
538 310         1083 $self->_clean_eof;
539             }
540              
541             # Defined but false means next line
542 28255 50       94991 return 1 if defined $rv;
543 0         0 PPI::Exception->throw("Error at line $self->{line_count}");
544             }
545              
546             # If we can't deal with the entire line, process char by char
547 40892         119768 while ( $rv = $self->_process_next_char ) {}
548 40891 50       110533 unless ( defined $rv ) {
549 0         0 PPI::Exception->throw("Error at line $self->{line_count}, character $self->{line_cursor}");
550             }
551              
552             # Trigger any action that needs to happen at the end of a line
553 40891         170517 $self->{class}->__TOKENIZER__on_line_end( $self );
554              
555             # If there are no more source lines, then clean up
556 40891 100 100     159199 unless ( ref($self->{source}) eq 'ARRAY' and @{$self->{source}} ) {
  38594         153727  
557 16576         55073 return $self->_clean_eof;
558             }
559              
560 24315         72155 return 1;
561             }
562              
563              
564              
565              
566              
567             #####################################################################
568             # Per-character processing methods
569              
570             # Process on a per-character basis.
571             # Note that due to the high number of times this gets
572             # called, it has been fairly heavily in-lined, so the code
573             # might look a bit ugly and duplicated.
574             sub _process_next_char {
575 454535     454535   728658 my $self = shift;
576              
577             ### FIXME - This checks for a screwed up condition that triggers
578             ### several warnings, amongst other things.
579 454535 50 33     1805215 if ( ! defined $self->{line_cursor} or ! defined $self->{line_length} ) {
580             # $DB::single = 1;
581 0         0 return undef;
582             }
583              
584 454535         710093 $self->{line_cursor}++;
585 454535 100       866484 return 0 if $self->_at_line_end;
586              
587             # Pass control to the token class
588 413644         636155 my $result;
589 413644 100       1332407 unless ( $result = $self->{class}->__TOKENIZER__on_char( $self ) ) {
590             # undef is error. 0 is "Did stuff ourself, you don't have to do anything"
591 78882 50       342537 return defined $result ? 1 : undef;
592             }
593              
594             # We will need the value of the current character
595 334761         806516 my $char = substr( $self->{line}, $self->{line_cursor}, 1 );
596 334761 100       758044 if ( $result eq '1' ) {
597             # If __TOKENIZER__on_char returns 1, it is signaling that it thinks that
598             # the character is part of it.
599              
600             # Add the character
601 59061 50       128112 if ( defined $self->{token} ) {
602 59061         149954 $self->{token}->{content} .= $char;
603             } else {
604 0 0       0 defined($self->{token} = $self->{class}->new($char)) or return undef;
605             }
606              
607 59061         199358 return 1;
608             }
609              
610             # We have been provided with the name of a class
611 275700 100       828179 if ( $self->{class} ne "PPI::Token::$result" ) {
    100          
612             # New class
613 106783         285265 $self->_new_token( $result, $char );
614             } elsif ( defined $self->{token} ) {
615             # Same class as current
616 33516         77442 $self->{token}->{content} .= $char;
617             } else {
618             # Same class, but no current
619 135401 50       424601 defined($self->{token} = $self->{class}->new($char)) or return undef;
620             }
621              
622 275700         922816 1;
623             }
624              
625             sub _at_line_end {
626 454535     454535   796325 my ($self) = @_;
627 454535         1333309 return $self->{line_cursor} >= $self->{line_length};
628             }
629              
630              
631              
632              
633              
634             #####################################################################
635             # Altering Tokens in Tokenizer
636              
637             # Finish the end of a token.
638             # Returns the resulting parse class as a convenience.
639             sub _finalize_token {
640 407744     407744   654981 my $self = shift;
641 407744 100       905021 return $self->{class} unless defined $self->{token};
642              
643             # Add the token to the token buffer
644 377453         542530 push @{ $self->{tokens} }, $self->{token};
  377453         907990  
645 377453         647310 $self->{token} = undef;
646              
647             # Return the parse class to that of the zone we are in
648 377453         1195313 $self->{class} = $self->{zone};
649             }
650              
651             # Creates a new token and sets it in the tokenizer
652             # The defined() in here prevent a ton of calls to PPI::Util::TRUE
653             sub _new_token {
654 242050     242050   370398 my $self = shift;
655             # throw PPI::Exception() unless @_;
656 242050 100       695468 my $class = substr( $_[0], 0, 12 ) eq 'PPI::Token::'
657             ? shift : 'PPI::Token::' . shift;
658              
659             # Finalize any existing token
660 242050 100       727180 $self->_finalize_token if defined $self->{token};
661              
662             # Create the new token and update the parse class
663 242050 50       954957 defined($self->{token} = $class->new($_[0])) or PPI::Exception->throw;
664 242050         472507 $self->{class} = $class;
665              
666 242050         435881 1;
667             }
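The class-name handling at the top of C<_new_token> accepts either a short suffix or a fully-qualified class name, using the fixed 12-character length of C<PPI::Token::> to test the prefix cheaply. The equivalent normalization in Python, as a sketch:

```python
PREFIX = "PPI::Token::"  # 12 characters, matching substr($_[0], 0, 12)

def normalize_token_class(name):
    """Accept 'Word' or 'PPI::Token::Word'; return the full class name."""
    return name if name.startswith(PREFIX) else PREFIX + name
```

This lets token classes hand back short names like `'Whitespace'` from their character handlers without every call site repeating the prefix.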
668              
669             # At the end of the file, we need to clean up the results of the erroneous
670             # space that we inserted at the beginning of the process.
671             sub _clean_eof {
672 16891     16891   32118 my $self = shift;
673              
674             # Finish any partially completed token
675 16891 100       53175 $self->_finalize_token if $self->{token};
676              
677             # Find the last token, and if it has no content, kill it.
678             # There appears to be some evidence that such "null tokens" are
679             # somehow getting created accidentally.
680 16891         37007 my $last_token = $self->{tokens}->[ -1 ];
681 16891 50       61524 unless ( length $last_token->{content} ) {
682 0         0 pop @{$self->{tokens}};
  0         0  
683             }
684              
685             # Now, if the last character of the last token is a space we added,
686             # chop it off, deleting the token if there's nothing else left.
687 16891 100       50629 if ( $self->{source_eof_chop} ) {
688 15478         37064 $last_token = $self->{tokens}->[ -1 ];
689 15478         107799 $last_token->{content} =~ s/ $//;
690 15478 100       58279 unless ( length $last_token->{content} ) {
691             # Popping token
692 13205         26475 pop @{$self->{tokens}};
  13205         30672  
693             }
694              
695             # The hack involving adding an extra space is now reversed, and
696             # now nobody will ever know. The perfect crime!
697 15478         42627 $self->{source_eof_chop} = '';
698             }
699              
700 16891         80661 1;
701             }
702              
703              
704              
705              
706              
707             #####################################################################
708             # Utility Methods
709              
710             # Context
711             sub _last_token {
712 0     0   0 $_[0]->{tokens}->[-1];
713             }
714              
715             sub _last_significant_token {
716 3090     3090   6509 my $self = shift;
717 3090         5514 my $cursor = $#{ $self->{tokens} };
  3090         8060  
718 3090         13122 while ( $cursor >= 0 ) {
719 4176         8680 my $token = $self->{tokens}->[$cursor--];
720 4176 100       19791 return $token if $token->significant;
721             }
722 401         1302 return;
723             }
724              
725             # Get an array ref of previous significant tokens.
726             # Like _last_significant_token except it gets more than just one token
727             # Returns array with 0 to x entries
728             sub _previous_significant_tokens {
729 153268     153268   241521 my $self = shift;
730 153268   50     343011 my $count = shift || 1;
731 153268         223479 my $cursor = $#{ $self->{tokens} };
  153268         327830  
732              
733 153268         260441 my @tokens;
734 153268         374339 while ( $cursor >= 0 ) {
735 248241         442262 my $token = $self->{tokens}->[$cursor--];
736 248241 100       745760 next if not $token->significant;
737 160199         294957 push @tokens, $token;
738 160199 100       416038 last if @tokens >= $count;
739             }
740              
741 153268         423548 return @tokens;
742             }
743              
744             my %OBVIOUS_CLASS = (
745             'PPI::Token::Symbol' => 'operator',
746             'PPI::Token::Magic' => 'operator',
747             'PPI::Token::Number' => 'operator',
748             'PPI::Token::ArrayIndex' => 'operator',
749             'PPI::Token::Quote::Double' => 'operator',
750             'PPI::Token::Quote::Interpolate' => 'operator',
751             'PPI::Token::Quote::Literal' => 'operator',
752             'PPI::Token::Quote::Single' => 'operator',
753             'PPI::Token::QuoteLike::Backtick' => 'operator',
754             'PPI::Token::QuoteLike::Command' => 'operator',
755             'PPI::Token::QuoteLike::Readline' => 'operator',
756             'PPI::Token::QuoteLike::Regexp' => 'operator',
757             'PPI::Token::QuoteLike::Words' => 'operator',
758             );
759              
760             my %OBVIOUS_CONTENT = (
761             '(' => 'operand',
762             '{' => 'operand',
763             '[' => 'operand',
764             ';' => 'operand',
765             '}' => 'operator',
766             );
767              
768              
769             my %USUALLY_FORCES = map { $_ => 1 } qw( sub package use no );
770              
771             # Try to determine operator/operand context, if possible.
772             # Returns "operator", "operand", or "" if unknown.
773             sub _opcontext {
774 7305     7305   12208 my $self = shift;
775 7305         19060 my @tokens = $self->_previous_significant_tokens(1);
776 7305         12406 my $p0 = $tokens[0];
777 7305 100       33830 return '' if not $p0;
778 7173         16108 my $c0 = ref $p0;
779              
780             # Map the obvious cases
781 7173 100       31992 return $OBVIOUS_CLASS{$c0} if defined $OBVIOUS_CLASS{$c0};
782 2257 100       9686 return $OBVIOUS_CONTENT{$p0} if defined $OBVIOUS_CONTENT{$p0};
783              
784             # Most of the time after an operator, we are an operand
785 1682 100       15548 return 'operand' if $p0->isa('PPI::Token::Operator');
786              
787             # If there's NOTHING, it's operand
788 1476 50       5027 return 'operand' if $p0->content eq '';
789              
790             # Otherwise, we don't know
791 1476         6074 return ''
792             }
793              
794             # Assuming we are currently parsing the word 'x', return true
795             # if previous tokens imply the x is an operator, false otherwise.
796             sub _current_x_is_operator {
797 1108     1108   2494 my ( $self ) = @_;
798 1108 100       1799 return if !@{$self->{tokens}};
  1108         4197  
799              
800 905         2979 my ($prev, $prevprev) = $self->_previous_significant_tokens(2);
801 905 50       4217 return if !$prev;
802              
803 905 100       7556 return !$self->__current_token_is_forced_word if $prev->isa('PPI::Token::Word');
804              
805             return (!$prev->isa('PPI::Token::Operator') || $X_CAN_FOLLOW_OPERATOR{$prev})
806 746   100     7417 && (!$prev->isa('PPI::Token::Structure') || $X_CAN_FOLLOW_STRUCTURE{$prev})
807             && !$prev->isa('PPI::Token::Label')
808             ;
809             }
810              
811              
812             # Assuming we are at the end of parsing the current token that could be a word,
813             # a wordlike operator, or a version string, try to determine whether context
814             # before or after it forces it to be a bareword. This method is only useful
815             # during tokenization.
816             sub __current_token_is_forced_word {
817 33255     33255   81083 my ( $t, $word ) = @_;
818              
819             # Check if forced by preceding tokens.
820              
821 33255         83848 my ( $prev, $prevprev ) = $t->_previous_significant_tokens(2);
822 33255 100       116539 if ( !$prev ) {
823 8867         36149 pos $t->{line} = $t->{line_cursor};
824             }
825             else {
826 24388         61750 my $content = $prev->{content};
827              
828             # We are forced if we are a method name.
829             # '->' will always be an operator, so we don't check its type.
830 24388 100       65742 return 1 if $content eq '->';
831              
832             # If we are contained in a pair of curly braces, we are probably a
833             # forced bareword hash key. '{' is never a word or operator, so we
834             # don't check its type.
835 24259         96072 pos $t->{line} = $t->{line_cursor};
836 24259 100 100     104532 return 1 if $content eq '{' and $t->{line} =~ /\G\s*\}/gc;
837              
838             # sub, package, use, and no all indicate that what immediately follows
839             # is a word not an operator or (in the case of sub and package) a
840             # version string. However, we don't want to be fooled by 'package
841             # package v10' or 'use no v10'. We're a forced package unless we're
842             # preceded by 'package sub', in which case we're a version string.
843             # We also have to make sure that the sub/package/etc doing the forcing
844             # is not a method call.
845 24031 100       80691 if( $USUALLY_FORCES{$content}) {
846 5631 100 66     34067 return if defined $word and $word =~ /^v[0-9]+$/ and ( $content eq "use" or $content eq "no" );
      100        
      100        
847 5621 100       36584 return 1 if not $prevprev;
848 236 100 100     989 return 1 if not $USUALLY_FORCES{$prevprev->content} and $prevprev->content ne '->';
849 6         46 return;
850             }
851             }
852             # pos on $t->{line} is guaranteed to be set at this point.
853              
854             # Check if forced by following tokens.
855              
856             # If the word is followed by => it is probably a word, not a regex.
857 27267 100       113639 return 1 if $t->{line} =~ /\G\s*=>/gc;
858              
859             # Otherwise we probably aren't forced
860 26477         230783 return '';
861             }
862              
863             sub _features {
864 35     35   60 my $self = shift;
865 35 50 0     132 return @_ ? $self->{features} = shift : $self->{features} || {};
866             }
867              
868             sub _current_token_has_signatures_active {
869 8494     8494   14675 my $self = shift;
870 8494         31700 return $self->{features}{signatures};
871             }
872              
873             1;
874              
875             =pod
876              
877             =head1 NOTES
878              
879             =head2 How the Tokenizer Works
880              
881             Understanding the Tokenizer is not for the faint-hearted. It is by far
882             the most complex and twisty piece of perl I've ever written that is actually
883             still built properly and isn't a terrible spaghetti-like mess. In fact, you
884             probably want to skip this section.
885              
886             But if you really want to understand, well then here goes.
887              
888             =head2 Source Input and Clean Up
889              
890             The Tokenizer starts by taking source in a variety of forms, sucking it
891             all in and merging into one big string, and doing our own internal line
892             split, using a "universal line separator" which allows the Tokenizer to
893             take source for any platform (and even supports a few known types of
894             broken newlines caused by mixed mac/pc/*nix editor screw ups).
895              
896             The resulting array of lines is used to feed the tokenizer, and is also
897             accessed directly by the heredoc-logic to do the line-oriented part of
898             here-doc support.
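
The idea can be sketched as follows. This is illustrative only, not PPI's
actual implementation: normalise all known newline styles to "\n", then
split while keeping the newline attached to each line.

```perl
# Sketch of a "universal line separator" split over an already-slurped
# source string. Handles Unix (\012), DOS (\015\012), classic Mac
# (\015) and a known doubled-CR breakage (\015\015\012).
my $source = "unix\012" . "dos\015\012" . "mac\015" . "mixed\015\015\012";
$source =~ s/(?:\015{1,2}\012|\015|\012)/\n/g;    # universal newline
my @lines = split /(?<=\n)/, $source;             # keep the "\n"
# @lines is now ("unix\n", "dos\n", "mac\n", "mixed\n")
```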
899              
900             =head2 Doing Things the Old Fashioned Way
901              
902             Due to the complexity of perl, and after 2 previously aborted parser
903             attempts, in the end the tokenizer was fashioned around a line-buffered
904             character-by-character method.
905              
906             That is, the Tokenizer pulls and holds a line at a time into a line buffer,
907             and then iterates a cursor along it. At each cursor position, a method is
908             called in whatever token class we are currently in, which will examine the
909             character at the current position, and handle it.
910              
911             As the handler methods in the various token classes are called, they
912             build up an output token array for the source code.
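
A highly simplified sketch of that loop follows. The boundary logic here
is invented for illustration; the real Tokenizer dispatches each character
to a handler method on whatever token class is current.

```perl
# Toy version of the line-buffered, character-by-character loop: walk
# a cursor along the line, accumulate characters into the current
# token, and finalize the token at a boundary.
my $line    = 'my $x;';
my $cursor  = 0;
my $current = '';
my @tokens;
while ( $cursor < length $line ) {
    my $char = substr $line, $cursor, 1;
    if ( $char =~ /[\s;]/ ) {                  # token boundary
        push @tokens, $current if length $current;
        push @tokens, $char if $char eq ';';
        $current = '';
    }
    else {
        $current .= $char;                     # grow the current token
    }
    $cursor++;
}
push @tokens, $current if length $current;
# @tokens is now ('my', '$x', ';')
```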
913              
914             Various parts of the Tokenizer use look-ahead, arbitrary-distance
915             look-behind (although currently the maximum is three significant tokens),
916             or both, and various other heuristic guesses.
917              
918             I've been told it is officially termed a I<"backtracking parser
919             with infinite lookaheads">.
920              
921             =head2 State Variables
922              
923             Aside from the current line and the character cursor, the Tokenizer
924             maintains a number of different state variables.
925              
926             =over
927              
928             =item Current Class
929              
930             The Tokenizer maintains the current token class at all times. Much of the
931             time is just going to be the "Whitespace" class, which is what the base of
932             a document is. As the tokenizer executes the various character handlers,
933             the class changes a lot as it moves along. In fact, in some instances,
934             the character handler may not handle the character directly itself, but
935             rather change the "current class" and then hand off to the character
936             handler for the new class.
937              
938             Because of this, and some other things I'll deal with later, the number of
939             times the character handlers are called does not in fact have a direct
940             relationship to the number of actual characters in the document.
941              
942             =item Current Zone
943              
944             Rather than create a class stack to allow for infinitely nested layers of
945             classes, the Tokenizer recognises just a single layer.
946              
947             To put it a different way, in various parts of the file, the Tokenizer will
948             recognise different "base" or "substrate" classes. When a Token such as a
949             comment or a number is finalised by the tokenizer, it "falls back" to the
950             base state.
951              
952             This allows proper tokenization of special areas such as __DATA__
953             and __END__ blocks, which also contain things like comments and POD,
954             without allowing the creation of any significant Tokens inside these areas.
955              
956             For the main part of a document we use L<PPI::Token::Whitespace> for this,
957             with the idea being that code is "floating in a sea of whitespace".
958              
959             =item Current Token
960              
961             The final main state variable is the "current token". This is the Token
962             that is currently being built by the Tokenizer. For certain types, it
963             can be manipulated and morphed and change class quite a bit while being
964             assembled, as the Tokenizer's understanding of the token content changes.
965              
966             When the Tokenizer is confident that it has seen the end of the Token, it
967             will be "finalized", which adds it to the output token array and resets
968             the current class to that of the zone that we are currently in.
969              
970             I should also note at this point that the "current token" variable is
971             optional. The Tokenizer is capable of knowing what class it is currently
972             set to, without actually having accumulated any characters in the Token.
973              
974             =back
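
Taken together, the state described above can be pictured as a hash. The
field names below mirror the discussion and are illustrative, not a
guaranteed description of the object's internals.

```perl
# Sketch of the Tokenizer's main state, for orientation only.
my %tokenizer_state = (
    class       => 'PPI::Token::Whitespace',  # Current Class
    zone        => 'PPI::Token::Whitespace',  # Current Zone ("base" class)
    token       => undef,                     # Current Token (optional)
    line        => '',                        # the line buffer
    line_cursor => 0,                         # the character cursor
    tokens      => [],                        # finalized output tokens
);
```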
975              
976             =head2 Making It Faster
977              
978             As I'm sure you can imagine, calling several different methods for each
979             character and running regexes and other complex heuristics made the first
980             fully working version of the tokenizer extremely slow.
981              
982             During testing, I created a metric to measure parsing speed called
983             LPGC, or "lines per gigacycle". A gigacycle is simply a billion CPU
984             cycles on a typical single-core CPU, and so a Tokenizer running at
985             "1000 lines per gigacycle" should tokenize around 1200 lines of code
986             per second when running on a 1200 MHz processor.
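
As a quick worked example of the metric:

```perl
# LPGC is lines tokenized divided by CPU gigacycles consumed.
my $lines      = 12_000;
my $cycles     = 12_000_000_000;               # 12 gigacycles
my $gigacycles = $cycles / 1_000_000_000;
my $lpgc       = $lines / $gigacycles;         # 1000 LPGC
# A 1200 MHz CPU burns 1.2 gigacycles per second, so at 1000 LPGC it
# tokenizes roughly 1000 * 1.2 = 1200 lines per second.
```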
987              
988             The first working version of the tokenizer ran at only 350 LPGC, so
989             tokenizing a typical large module such as L<ExtUtils::MakeMaker> took
990             10-15 seconds. This sluggishness made it impractical for many uses.
991              
992             So in the current parser, there are multiple layers of optimisation
993             very carefully built into the basic design. This has brought the tokenizer
994             up to a more reasonable 1000 LPGC, at the expense of making the code
995             quite a bit twistier.
996              
997             =head2 Making It Faster - Whole Line Classification
998              
999             The first step in the optimisation process was to add a new handler
1000             that enables several of the more basic classes (whitespace, comments)
1001             to be parsed a line at a time. At the start of each line, a
1002             special optional handler (only supported by a few classes) is called to
1003             check and see if the entire line can be parsed in one go.
1004              
1005             This is used mainly to handle things like POD, comments, empty lines,
1006             and a few other minor special cases.
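
The shape of such a check might look like the following. This is a hedged
sketch with an invented name and patterns; the real mechanism is an
optional line-start handler supported by only a few token classes.

```perl
# Invented whole-line classifier: if a cheap pattern matches the
# entire line's nature, the char-by-char machinery can be skipped.
sub classify_whole_line {
    my ($line) = @_;
    return 'comment'    if $line =~ /^\s*#/;
    return 'whitespace' if $line =~ /^\s*$/;
    return 'pod'        if $line =~ /^=\w+/;
    return undef;    # no shortcut; fall back to char-by-char parsing
}
```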
1007              
1008             =head2 Making It Faster - Inlining
1009              
1010             The second stage of the optimisation involved inlining a small
1011             number of critical methods that were repeated an extremely high number
1012             of times. Profiling suggested that there were about 1,000,000 individual
1013             method calls per gigacycle, and by cutting these by two thirds a significant
1014             speed improvement was gained, in the order of about 50%.
1015              
1016             You may notice that many methods in the C<PPI::Tokenizer> code look
1017             very nested and longhand. This is primarily due to this inlining.
1018              
1019             At around this time, some statistics code that existed in the early
1020             versions of the parser was also removed, as it was determined that
1021             it was consuming around 15% of the CPU for the entire parser, while
1022             making the core more complicated.
1023              
1024             A judgment call was made that with the difficulties likely to be
1025             encountered with future planned enhancements, and given the relatively
1026             high cost involved, the statistics features would be removed from the
1027             Tokenizer.
1028              
1029             =head2 Making It Faster - Quote Engine
1030              
1031             Once inlining had reached diminishing returns, it became obvious from
1032             the profiling results that a huge amount of time was being spent
1033             stepping a char at a time through long, simple and "syntactically boring"
1034             code such as comments and strings.
1035              
1036             The existing regex engine was expanded to also encompass quotes and
1037             other quote-like things, and a special abstract base class was added
1038             that provided a number of specialised parsing methods that would "scan
1039             ahead", looking out ahead to find the end of a string, and updating
1040             the cursor to leave it in a valid position for the next call.
1041              
1042             This is also the point at which the number of character handler calls began
1043             to greatly differ from the number of characters. But it has been done
1044             in a way that allows the parser to retain the power of the original
1045             version at the critical points, while skipping through the "boring bits"
1046             as needed for additional speed.
1047              
1048             The addition of this feature allowed the tokenizer to exceed 1000 LPGC
1049             for the first time.
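
In the spirit of that mechanism, a skip-ahead scan for a single-quoted
string might look like this. It is an illustrative sketch: the name and
signature are invented, not PPI's actual Quote Engine API.

```perl
# Given a line and a cursor sitting just past an opening "'", jump
# straight to the first unescaped closing quote in one regex step,
# returning the string body and the new cursor position.
sub scan_single_quote {
    my ( $line, $cursor ) = @_;
    pos($line) = $cursor;
    if ( $line =~ /\G((?:\\.|[^'\\])*)'/gc ) {
        return ( $1, pos($line) );    # content, cursor past close quote
    }
    return;    # string does not terminate on this line
}
```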
1050              
1051             =head2 Making It Faster - The "Complete" Mechanism
1052              
1053             As it became evident that great speed increases were available by using
1054             this "skipping ahead" mechanism, a new handler method was added that
1055             explicitly handles the parsing of an entire token, where the structure
1056             of the token is relatively simple. Tokens such as symbols fit this case,
1057             as once we are past the initial sigil and word char, we know that we
1058             can skip ahead and "complete" the rest of the token much more easily.
1059              
1060             A number of these have been added for most or possibly all of the common
1061             cases, with most of these "complete" handlers implemented using regular
1062             expressions.
1063              
1064             In fact, so many have been added that at this point, you could arguably
1065             reclassify the tokenizer as a "hybrid regex, char-by-char heuristic
1066             tokenizer". More tokens are now consumed in "complete" methods in a
1067             typical program than are handled by the normal char-by-char methods.
1068              
1069             Many of these complete-handlers were implemented during the writing
1070             of the Lexer, and this has allowed the full parser to maintain around
1071             1000 LPGC despite the increasing weight of the Lexer.
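
A sketch of what a "complete" handler for symbols does (again
illustrative; the real handlers live in the individual token classes):

```perl
# Once the sigil and first word character are seen, one regex can
# consume the entire symbol token, including package separators.
sub complete_symbol {
    my ( $line, $cursor ) = @_;    # cursor sits on the sigil
    pos($line) = $cursor;
    return unless $line =~ /\G([\$\@\%][A-Za-z_]\w*(?:::\w+)*)/gc;
    return ( $1, pos($line) );     # whole token, cursor past its end
}
```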
1072              
1073             =head2 Making It Faster - Porting To C (In Progress)
1074              
1075             While it would be extraordinarily difficult to port all of the Tokenizer
1076             to C, work has started on a L<PPI::XS> "accelerator" package which acts as
1077             a separate and automatically-detected add-on to the main PPI package.
1078              
1079             L<PPI::XS> implements faster versions of a variety of functions scattered
1080             over the entire PPI codebase, from the Tokenizer Core, Quote Engine, and
1081             various other places, and implements them identically in XS/C.
1082              
1083             In particular, the skip-ahead methods from the Quote Engine would appear
1084             to be extremely amenable to being done in C, and a number of other
1085             functions could be cherry-picked one at a time and implemented in C.
1086              
1087             Each method is heavily tested to ensure that the functionality is
1088             identical, and a versioning mechanism is included to ensure that if a
1089             function gets out of sync, L<PPI::XS> will degrade gracefully and just
1090             not replace that single method.
1091              
1092             =head1 TO DO
1093              
1094             - Add an option to reset or seek the token stream...
1095              
1096             - Implement more Tokenizer functions in L<PPI::XS>
1097              
1098             =head1 SUPPORT
1099              
1100             See the L<support section|PPI/SUPPORT> in the main module.
1101              
1102             =head1 AUTHOR
1103              
1104             Adam Kennedy E<lt>adamk@cpan.orgE<gt>
1105              
1106             =head1 COPYRIGHT
1107              
1108             Copyright 2001 - 2011 Adam Kennedy.
1109              
1110             This program is free software; you can redistribute
1111             it and/or modify it under the same terms as Perl itself.
1112              
1113             The full text of the license can be found in the
1114             LICENSE file included with this module.
1115              
1116             =cut