File Coverage

blib/lib/PPI/Tokenizer.pm
Criterion Covered Total %
statement 226 257 87.9
branch 115 144 79.8
condition 36 47 76.6
subroutine 28 32 87.5
pod 5 5 100.0
total 410 485 84.5


line stmt bran cond sub pod time code
1             package PPI::Tokenizer;
2              
3             =pod
4              
5             =head1 NAME
6              
7             PPI::Tokenizer - The Perl Document Tokenizer
8              
9             =head1 SYNOPSIS
10              
11             # Create a tokenizer for a file, array or string
12             $Tokenizer = PPI::Tokenizer->new( 'filename.pl' );
13             $Tokenizer = PPI::Tokenizer->new( \@lines );
14             $Tokenizer = PPI::Tokenizer->new( \$source );
15            
16             # Return all the tokens for the document
17             my $tokens = $Tokenizer->all_tokens;
18            
19             # Or we can use it as an iterator
20             while ( my $Token = $Tokenizer->get_token ) {
21             print "Found token '$Token'\n";
22             }
23            
24             # If we REALLY need to manually nudge the cursor, you
25             # can do that too (the lexer needs this ability to do rollbacks)
26             $is_incremented = $Tokenizer->increment_cursor;
27             $is_decremented = $Tokenizer->decrement_cursor;
28              
29             =head1 DESCRIPTION
30              
31             PPI::Tokenizer is the class that provides Tokenizer objects for use in
32             breaking strings of Perl source code into Tokens.
33              
34             By the time you are reading this, you probably need to know a little
35             about the difference between how perl parses Perl "code" and how PPI
36             parses Perl "documents".
37              
38             "perl" itself (the interpreter) uses a heavily modified lex specification
39             to specify its parsing logic, maintains several types of state as it
40             goes, and incrementally tokenizes, lexes AND EXECUTES at the same time.
41              
42             In fact, it is provably impossible to use perl's parsing method without
43             simultaneously executing code. A formal mathematical proof has been
44             published demonstrating this impossibility.
45              
46             This is where the truism "Only perl can parse Perl" comes from.
47              
48             PPI uses a completely different approach by abandoning the (impossible)
49             ability to parse Perl the same way that the interpreter does, and instead
50             parsing the source as a document, using a document structure independently
51             derived from the Perl documentation and approximating the perl interpreter
52             interpretation as closely as possible.
53              
54             It was touch and go for a long time whether we could get it close enough,
55             but in the end it turned out that it could be done.
56              
57             In this approach, the tokenizer C<PPI::Tokenizer> is implemented separately
58             from the lexer L<PPI::Lexer>.
59              
60             The job of C<PPI::Tokenizer> is to take pure source as a string and break
61             it up into a stream/set of tokens. The tokenizer contains most of the "black magic"
62             used in PPI. By comparison, the lexer implements a relatively straightforward
63             tree structure, and has an implementation that is uncomplicated (compared
64             to the insanity in the tokenizer at least).
65              
66             The Tokenizer uses an immense amount of heuristics, guessing and cruft,
67             supported by a very B<VERY> flexible internal API, but fortunately it was
68             possible to largely encapsulate the black magic, so there is not a lot that
69             gets exposed to people using the C<PPI::Tokenizer> itself.
70              
71             =head1 METHODS
72              
73             Despite the incredible complexity, the Tokenizer itself only exposes a
74             relatively small number of methods, with most of the complexity implemented
75             in private methods.
76              
77             =cut
78              
79             # Make sure everything we need is loaded so
80             # we don't have to go and load all of PPI.
81 66     66   375 use strict;
  66         107  
  66         2392  
82 66     66   263 use Params::Util qw{_INSTANCE _SCALAR0 _ARRAY0};
  66         103  
  66         3600  
83 66     66   314 use List::Util 1.33 ();
  66         1200  
  66         1169  
84 66     66   220 use PPI::Util ();
  66         134  
  66         1132  
85 66     66   463 use PPI::Element ();
  66         120  
  66         831  
86 66     66   258 use PPI::Token ();
  66         121  
  66         917  
87 66     66   222 use PPI::Exception ();
  66         119  
  66         814  
88 66     66   23143 use PPI::Exception::ParserRejection ();
  66         161  
  66         1070  
89 66     66   261 use PPI::Document ();
  66         114  
  66         178881  
90              
91             our $VERSION = '1.284';
92              
93             # The x operator cannot follow most Perl operators, implying that
94             # anything beginning with x following an operator is a word.
95             # These are the exceptions.
96             my %X_CAN_FOLLOW_OPERATOR = map { $_ => 1 } qw( -- ++ );
97              
98             # The x operator cannot follow most structure elements, implying that
99             # anything beginning with x following a structure element is a word.
100             # These are the exceptions.
101             my %X_CAN_FOLLOW_STRUCTURE = map { $_ => 1 } qw( } ] \) );
102              
103             # Something that looks like the x operator but follows a word
104             # is usually that word's argument.
105             # These are the exceptions.
106             # chop, chomp, dump are ambiguous because they can be called either
107             # with or without parameters.
108             my %X_CAN_FOLLOW_WORD = map { $_ => 1 } qw(
109             endgrent
110             endhostent
111             endnetent
112             endprotoent
113             endpwent
114             endservent
115             fork
116             getgrent
117             gethostent
118             getlogin
119             getnetent
120             getppid
121             getprotoent
122             getpwent
123             getservent
124             setgrent
125             setpwent
126             time
127             times
128             wait
129             wantarray
130             __SUB__
131             );
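The three exception tables above combine into a single decision: given the kind and text of the previous significant token, is a following C<x> the repetition operator or the start of a word? A minimal sketch (Python used as neutral pseudocode; the function name, the simplified token model, and the abridged word list are illustrative assumptions, not PPI's API):

```python
# Sketch of the x-operator disambiguation encoded by the three
# exception tables above. Each previous token is modeled as a
# (kind, text) pair; the word list is abridged for brevity.

X_CAN_FOLLOW_OPERATOR = {"--", "++"}
X_CAN_FOLLOW_STRUCTURE = {"}", "]", ")"}
X_CAN_FOLLOW_WORD = {"fork", "time", "times", "wait", "wantarray", "__SUB__"}

def x_is_operator(prev_kind, prev_text):
    """Return True if a following 'x' should be parsed as the
    repetition operator rather than the start of a word."""
    if prev_kind == "operator":
        return prev_text in X_CAN_FOLLOW_OPERATOR
    if prev_kind == "structure":
        return prev_text in X_CAN_FOLLOW_STRUCTURE
    if prev_kind == "word":
        return prev_text in X_CAN_FOLLOW_WORD
    # After operands (numbers, strings, variables), x is the operator.
    return True
```

For example, `time x 3` repeats the timestamp (`time` is in the word exceptions), whereas after `print` an `x` begins a new word.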
132              
133              
134              
135             #####################################################################
136             # Creation and Initialization
137              
138             =pod
139              
140             =head2 new $file | \@lines | \$source
141              
142             The main C<new> constructor creates a new Tokenizer object. These
143             objects have no configuration parameters, and can only be used once,
144             to tokenize a single Perl source file.
145              
146             It takes as argument either a normal scalar containing source code,
147             a reference to a scalar containing source code, or a reference to an
148             ARRAY containing newline-terminated lines of source code.
149              
150             Returns a new C<PPI::Tokenizer> object on success, or throws a
151             L<PPI::Exception> exception on error.
152              
153             =cut
154              
155             sub new {
156 16891   33 16891 1 51130 my $class = ref($_[0]) || $_[0];
157              
158             # Create the empty tokenizer struct
159 16891         144421 my $self = bless {
160             # Source code
161             source => undef,
162             source_bytes => undef,
163             document => undef,
164              
165             # Line buffer
166             line => undef,
167             line_length => undef,
168             line_cursor => undef,
169             line_count => 0,
170              
171             # Parse state
172             token => undef,
173             class => 'PPI::Token::BOM',
174             zone => 'PPI::Token::Whitespace',
175              
176             # Output token buffer
177             tokens => [],
178             token_cursor => 0,
179             token_eof => 0,
180              
181             # Perl 6 blocks
182             perl6 => [],
183             }, $class;
184              
185 16891 50       78439 if ( ! defined $_[1] ) {
    100          
    50          
    0          
186             # We weren't given anything
187 0         0 PPI::Exception->throw("No source provided to Tokenizer");
188              
189             } elsif ( ! ref $_[1] ) {
190 506         2772 my $source = PPI::Util::_slurp($_[1]);
191 506 50       1448 if ( ref $source ) {
192             # Content returned by reference
193 506         1829 $self->{source} = $$source;
194             } else {
195             # Errors returned as a string
196 0         0 return( $source );
197             }
198              
199             } elsif ( _SCALAR0($_[1]) ) {
200 16385         18710 $self->{source} = ${$_[1]};
  16385         32940  
201              
202             } elsif ( _ARRAY0($_[1]) ) {
203 0         0 $self->{source} = join '', map { /\n$/ ? $_ : "$_\n" } @{$_[1]};
  0         0  
  0         0  
204              
205             } else {
206             # We don't support whatever this is
207 0         0 PPI::Exception->throw(ref($_[1]) . " is not supported as a source provider");
208             }
209              
210             # We can't handle a null string
211 16891         32020 $self->{source_bytes} = length $self->{source};
212 16891 100       30595 if ( $self->{source_bytes} ) {
213             # Split on local newlines
214 16887         461918 $self->{source} =~ s/(?:\015{1,2}\012|\015|\012)/\n/g;
215 16887         235157 $self->{source} = [ split /(?<=\n)/, $self->{source} ];
216              
217             } else {
218 4         6 $self->{source} = [ ];
219             }
220              
221             ### EVIL
222             # I'm explaining this earlier than I should so you can understand
223             # why I'm about to do something that looks very strange. There's
224             # a problem with the Tokenizer, in that tokens tend to change
225             # classes as each letter is added, but they don't get allocated
226             # their definite final class until the "end" of the token, the
227             # detection of which occurs in about a hundred different places,
228             # all through various crufty code (that triples the speed).
229             #
230             # However, in general, this does not apply to tokens in which a
231             # whitespace character is valid, such as comments, whitespace and
232             # big strings.
233             #
234             # So what we do is add a space to the end of the source. This
235             # triggers normal "end of token" functionality for all cases. Then,
236             # once the tokenizer hits end of file, it examines the last token to
237             # manually either remove the ' ' token, or chop it off the end of
238             # a longer one in which the space would be valid.
239 16891 100   72876   60950 if ( List::Util::any { /^__(?:DATA|END)__\s*$/ } @{$self->{source}} ) {
  72876 100       143707  
  16891 100       67715  
240 10         27 $self->{source_eof_chop} = '';
241             } elsif ( ! defined $self->{source}->[0] ) {
242 4         8 $self->{source_eof_chop} = '';
243             } elsif ( $self->{source}->[-1] =~ /\s$/ ) {
244 1128         3317 $self->{source_eof_chop} = '';
245             } else {
246 15749         29939 $self->{source_eof_chop} = 1;
247 15749         29758 $self->{source}->[-1] .= ' ';
248             }
249              
250 16891         71751 $self;
251             }
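The source preparation performed by C<new> -- normalizing local newline conventions, splitting into lines while keeping each newline, and the trailing-space "end of token" trick described in the comment above -- can be sketched language-neutrally. This Python version mirrors the two regexes and the chop flag (the function name is an illustrative assumption, not PPI's API):

```python
import re

def prepare_source(source):
    """Mirror the source preparation in PPI::Tokenizer->new: normalize
    CR/CRLF to "\n", split into newline-terminated lines, and append a
    padding space to the last line unless it is unnecessary."""
    # Normalize local newline conventions (CRLF, CR, lone LF) to "\n".
    source = re.sub(r"\r{1,2}\n|\r|\n", "\n", source)
    # Split after each newline, keeping the newline on each line.
    lines = [l for l in re.split(r"(?<=\n)", source) if l != ""]
    eof_chop = False
    if any(re.match(r"__(?:DATA|END)__\s*$", l) for l in lines):
        pass  # a __DATA__/__END__ section makes the trick unnecessary
    elif not lines:
        pass  # empty source
    elif re.search(r"\s$", lines[-1]):
        pass  # already ends in whitespace, tokens terminate naturally
    else:
        eof_chop = True       # remember to undo this at EOF
        lines[-1] += " "      # force "end of token" for the last token
    return lines, eof_chop
```

The returned flag corresponds to C<source_eof_chop>, which C<_clean_eof> later consults to reverse the padding.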
252              
253             sub _document {
254 18091     18091   22050 my $self = shift;
255 18091 100       46574 return @_ ? $self->{document} = shift : $self->{document};
256             }
257              
258              
259              
260              
261              
262             #####################################################################
263             # Main Public Methods
264              
265             =pod
266              
267             =head2 get_token
268              
269             When using the PPI::Tokenizer object as an iterator, the C<get_token>
270             method is the primary method that is used. It increments the cursor
271             and returns the next Token in the output array.
272              
273             The actual parsing of the file is done only as-needed, and a line at
274             a time. When C<get_token> hits the end of the token array, it will
275             cause the parser to pull in the next line and parse it, continuing
276             as needed until there are more tokens on the output array that
277             get_token can then return.
278              
279             This means that a number of Tokenizer objects can be created, and
280             won't consume significant CPU until you actually begin to pull tokens
281             from them.
282              
283             Returns a L<PPI::Token> object on success, C<0> if the Tokenizer has
284             reached the end of the file, or C<undef> on error.
285              
286             =cut
287              
288             sub get_token {
289 393053     393053 1 412323 my $self = shift;
290              
291             # Shortcut for EOF
292 393053 50 66     641467 if ( $self->{token_eof}
293 13486         26801 and $self->{token_cursor} > scalar @{$self->{tokens}}
294             ) {
295 0         0 return 0;
296             }
297              
298             # Return the next token if we can
299 393053 100       887576 if ( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) {
300 314867         340537 $self->{token_cursor}++;
301 314867         846527 return $token;
302             }
303              
304 78186         84646 my $line_rv;
305              
306             # Catch exceptions and return undef, so that we
307             # can start to convert code to exception-based code.
308 78186         92642 my $rv = eval {
309             # No token, we need to get some more
310 78186         142600 while ( $line_rv = $self->_process_next_line ) {
311             # If there is something in the buffer, return it
312             # The defined() prevents a ton of calls to PPI::Util::TRUE
313 69073 100       152651 if ( defined( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) ) {
314 47901         55166 $self->{token_cursor}++;
315 47901         87177 return $token;
316             }
317             }
318 30284         40642 return undef;
319             };
320 78186 100       186650 if ( $@ ) {
    100          
321 1 50       7 if ( _INSTANCE($@, 'PPI::Exception') ) {
322 1         3 $@->throw;
323             } else {
324 0         0 my $errstr = $@;
325 0         0 $errstr =~ s/^(.*) at line .+$/$1/;
326 0         0 PPI::Exception->throw( $errstr );
327             }
328             } elsif ( $rv ) {
329 47901         215797 return $rv;
330             }
331              
332 30284 50       46285 if ( defined $line_rv ) {
333             # End of file, but we can still return things from the buffer
334 30284 50       52724 if ( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) {
335 0         0 $self->{token_cursor}++;
336 0         0 return $token;
337             }
338              
339             # Set our token end of file flag
340 30284         35734 $self->{token_eof} = 1;
341 30284         83472 return 0;
342             }
343              
344             # Error, pass it up to our caller
345 0         0 undef;
346             }
347              
348             =pod
349              
350             =head2 all_tokens
351              
352             When not being used as an iterator, the C<all_tokens> method tells
353             the Tokenizer to parse the entire file and return all of the tokens
354             in a single ARRAY reference.
355              
356             It should be noted that C<all_tokens> does B<NOT> interfere with the
357             use of the Tokenizer object as an iterator (does not modify the token
358             cursor) and use of the two different mechanisms can be mixed safely.
359              
360             Returns a reference to an ARRAY of L<PPI::Token> objects on success
361             or throws an exception on error.
362              
363             =cut
364              
365             sub all_tokens {
366 5     5 1 32 my $self = shift;
367              
368             # Catch exceptions and return undef, so that we
369             # can start to convert code to exception-based code.
370 5         10 my $ok = eval {
371             # Process lines until we get EOF
372 5 50       12 unless ( $self->{token_eof} ) {
373 5         6 my $rv;
374 5         15 while ( $rv = $self->_process_next_line ) {}
375 5 50       9 unless ( defined $rv ) {
376 0         0 PPI::Exception->throw("Error while processing source");
377             }
378              
379             # Clean up the end of the tokenizer
380 5         8 $self->_clean_eof;
381             }
382 5         6 1;
383             };
384 5 50       11 if ( !$ok ) {
385 0         0 my $errstr = $@;
386 0         0 $errstr =~ s/^(.*) at line .+$/$1/;
387 0         0 PPI::Exception->throw( $errstr );
388             }
389              
390             # End of file, return a copy of the token array.
391 5         7 return [ @{$self->{tokens}} ];
  5         19  
392             }
393              
394             =pod
395              
396             =head2 increment_cursor
397              
398             Although exposed as a public method, C<increment_cursor> is implemented
399             for expert use only, when writing lexers or other components that work
400             directly on token streams.
401              
402             It manually increments the token cursor forward through the file, in effect
403             "skipping" the next token.
404              
405             Returns true if the cursor is incremented, C<0> if already at the end of
406             the file, or C<undef> on error.
407              
408             =cut
409              
410             sub increment_cursor {
411             # Do this via the get_token method, which makes sure there
412             # is actually a token there to move to.
413 0 0   0 1 0 $_[0]->get_token and 1;
414             }
415              
416             =pod
417              
418             =head2 decrement_cursor
419              
420             Although exposed as a public method, C<decrement_cursor> is implemented
421             for expert use only, when writing lexers or other components that work
422             directly on token streams.
423              
424             It manually decrements the token cursor backwards through the file, in
425             effect "rolling back" the token stream. That is indeed its primary
426             purpose: it exists for cases where the component consuming the token
427             stream needs to implement some sort of "roll back" feature in its use
428             of the stream.
429              
430             Returns true if the cursor is decremented, C<0> if already at the
431             beginning of the file, or C<undef> on error.
432              
433             =cut
434              
435             sub decrement_cursor {
436 0     0 1 0 my $self = shift;
437              
438             # Check for the beginning of the file
439 0 0       0 return 0 unless $self->{token_cursor};
440              
441             # Decrement the token cursor
442 0         0 $self->{token_eof} = 0;
443 0         0 --$self->{token_cursor};
444             }
445              
446              
447              
448              
449              
450             #####################################################################
451             # Working With Source
452              
453             # Fetches the next line from the input line buffer
454             # Returns undef at EOF.
455             sub _get_line {
456 107296     107296   105942 my $self = shift;
457 107296 100       184744 return undef unless $self->{source}; # EOF hit previously
458              
459             # Pull off the next line
460 91268         86324 my $line = shift @{$self->{source}};
  91268         159205  
461              
462             # Flag EOF if we hit it
463 91268 100       159248 $self->{source} = undef unless defined $line;
464              
465             # Return the line (or EOF flag)
466 91268         144027 return $line; # string or undef
467             }
468              
469             # Fetches the next line, ready to process
470             # Returns 1 on success
471             # Returns 0 on EOF
472             sub _fill_line {
473 104920     104920   110056 my $self = shift;
474 104920         120543 my $inscan = shift;
475              
476             # Get the next line
477 104920         157160 my $line = $self->_get_line;
478 104920 100       165799 unless ( defined $line ) {
479             # End of file
480 32358 100       48043 unless ( $inscan ) {
481 30289         49595 delete $self->{line};
482 30289         35579 delete $self->{line_cursor};
483 30289         31949 delete $self->{line_length};
484 30289         57073 return 0;
485             }
486              
487             # In the scan version, just set the cursor to the end
488             # of the line, and the rest should just cascade out.
489 2069         2820 $self->{line_cursor} = $self->{line_length};
490 2069         3711 return 0;
491             }
492              
493             # Populate the appropriate variables
494 72562         125440 $self->{line} = $line;
495 72562         89217 $self->{line_cursor} = -1;
496 72562         89236 $self->{line_length} = length $line;
497 72562         81973 $self->{line_count}++;
498              
499 72562         133184 1;
500             }
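The handshake between C<_get_line> and C<_fill_line> amounts to: pull the next line off the source buffer, flag EOF once the buffer is exhausted, and otherwise reset the cursor to just before the first character so the first increment lands on index 0. A small illustrative model (the class name and layout are assumptions for this sketch, not PPI's internals verbatim):

```python
class LineBuffer:
    """Illustrative model of the tokenizer's line state."""

    def __init__(self, lines):
        self.source = list(lines)   # becomes None once EOF has been hit
        self.line = None
        self.line_cursor = None
        self.line_length = None
        self.line_count = 0

    def _get_line(self):
        if self.source is None:
            return None             # EOF hit previously
        if not self.source:
            self.source = None      # flag EOF
            return None
        return self.source.pop(0)   # pull off the next line

    def fill_line(self):
        line = self._get_line()
        if line is None:
            # End of file: clear the per-line state.
            self.line = self.line_cursor = self.line_length = None
            return False
        self.line = line
        self.line_cursor = -1       # first increment lands on index 0
        self.line_length = len(line)
        self.line_count += 1
        return True
```

Starting the cursor at -1 matches the increment-then-test pattern in C<_process_next_char>.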
501              
502             # Get the current character
503             sub _char {
504 0     0   0 my $self = shift;
505 0         0 substr( $self->{line}, $self->{line_cursor}, 1 );
506             }
507              
508              
509              
510              
511              
512             ####################################################################
513             # Per line processing methods
514              
515             # Processes the next line
516             # Returns 1 on successful completion
517             # Returns 0 if EOF
518             # Returns undef on error
519             sub _process_next_line {
520 99374     99374   108624 my $self = shift;
521              
522             # Fill the line buffer
523 99374         100927 my $rv;
524 99374 100       153342 unless ( $rv = $self->_fill_line ) {
525 30289 50       46150 return undef unless defined $rv;
526              
527             # End of file, finalize last token
528 30289         52126 $self->_finalize_token;
529 30289         56452 return 0;
530             }
531              
532             # Run the __TOKENIZER__on_line_start
533 69085         216415 $rv = $self->{class}->__TOKENIZER__on_line_start( $self );
534 69085 100       113828 unless ( $rv ) {
535             # If there are no more source lines, then clean up
536 28249 100 66     50353 if ( ref $self->{source} eq 'ARRAY' and ! @{$self->{source}} ) {
  28249         58676  
537 311         921 $self->_clean_eof;
538             }
539              
540             # Defined but false means next line
541 28249 50       60307 return 1 if defined $rv;
542 0         0 PPI::Exception->throw("Error at line $self->{line_count}");
543             }
544              
545             # If we can't deal with the entire line, process char by char
546 40836         79757 while ( $rv = $self->_process_next_char ) {}
547 40835 50       72085 unless ( defined $rv ) {
548 0         0 PPI::Exception->throw("Error at line $self->{line_count}, character $self->{line_cursor}");
549             }
550              
551             # Trigger any action that needs to happen at the end of a line
552 40835         106177 $self->{class}->__TOKENIZER__on_line_end( $self );
553              
554             # If there are no more source lines, then clean up
555 40835 100 100     101905 unless ( ref($self->{source}) eq 'ARRAY' and @{$self->{source}} ) {
  38525         100578  
556 16575         34469 return $self->_clean_eof;
557             }
558              
559 24260         48886 return 1;
560             }
561              
562              
563              
564              
565              
566             #####################################################################
567             # Per-character processing methods
568              
569             # Process on a per-character basis.
570             # Note that due to the high number of times this gets
571             # called, it has been fairly heavily in-lined, so the code
572             # might look a bit ugly and duplicated.
573             sub _process_next_char {
574 453121     453121   480252 my $self = shift;
575              
576             ### FIXME - This checks for a screwed up condition that triggers
577             ### several warnings, amongst other things.
578 453121 50 33     1074671 if ( ! defined $self->{line_cursor} or ! defined $self->{line_length} ) {
579             # $DB::single = 1;
580 0         0 return undef;
581             }
582              
583 453121         453682 $self->{line_cursor}++;
584 453121 100       563257 return 0 if $self->_at_line_end;
585              
586             # Pass control to the token class
587 412286         426941 my $result;
588 412286 100       818547 unless ( $result = $self->{class}->__TOKENIZER__on_char( $self ) ) {
589             # undef is error. 0 is "Did stuff ourself, you don't have to do anything"
590 78337 50       212039 return defined $result ? 1 : undef;
591             }
592              
593             # We will need the value of the current character
594 333948         488605 my $char = substr( $self->{line}, $self->{line_cursor}, 1 );
595 333948 100       477829 if ( $result eq '1' ) {
596             # If __TOKENIZER__on_char returns 1, it is signaling that it thinks that
597             # the character is part of it.
598              
599             # Add the character
600 58959 50       86877 if ( defined $self->{token} ) {
601 58959         92353 $self->{token}->{content} .= $char;
602             } else {
603 0 0       0 defined($self->{token} = $self->{class}->new($char)) or return undef;
604             }
605              
606 58959         119399 return 1;
607             }
608              
609             # We have been provided with the name of a class
610 274989 100       478351 if ( $self->{class} ne "PPI::Token::$result" ) {
    100          
611             # New class
612 106352         162840 $self->_new_token( $result, $char );
613             } elsif ( defined $self->{token} ) {
614             # Same class as current
615 33425         47334 $self->{token}->{content} .= $char;
616             } else {
617             # Same class, but no current
618 135212 50       243502 defined($self->{token} = $self->{class}->new($char)) or return undef;
619             }
620              
621 274989         569931 1;
622             }
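The C<__TOKENIZER__on_char> protocol driving the dispatch above has four outcomes: C<undef> (error), C<0> (the class consumed the character itself), C<1> (append the character to the current token), or a class name (start or continue a token of that class). A simplified Python model of the class-name and append branches (the function, the bare class names, and the state dict are illustrative assumptions, not PPI's API):

```python
def process_char(state, result, char):
    """Model the dispatch in _process_next_char. `state` holds the
    current token as a (class_name, content) tuple or None, plus the
    current parse class and the finished-token list. `result` is what
    on_char returned: 1, or a class-name string."""
    cls, token = state["class"], state["token"]
    if result == 1:
        # The character is part of the current token; append it.
        token = token or (cls, "")
        state["token"] = (token[0], token[1] + char)
        return state
    # Otherwise `result` names a token class.
    if cls != result:
        # New class: finalize the old token, start a new one.
        if token is not None:
            state["tokens"].append(token)
        state["token"] = (result, char)
        state["class"] = result
    else:
        # Same class as current: extend (or start) the token.
        token = token or (cls, "")
        state["token"] = (token[0], token[1] + char)
    return state
```

This is why tokens "change classes as each letter is added": the class returned for each character decides whether the current token continues or a new one begins.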
623              
624             sub _at_line_end {
625 453121     453121   510739 my ($self) = @_;
626 453121         839077 return $self->{line_cursor} >= $self->{line_length};
627             }
628              
629              
630              
631              
632              
633             #####################################################################
634             # Altering Tokens in Tokenizer
635              
636             # Finish the end of a token.
637             # Returns the resulting parse class as a convenience.
638             sub _finalize_token {
639 406286     406286   407453 my $self = shift;
640 406286 100       582846 return $self->{class} unless defined $self->{token};
641              
642             # Add the token to the token buffer
643 375995         356560 push @{ $self->{tokens} }, $self->{token};
  375995         567903  
644 375995         428491 $self->{token} = undef;
645              
646             # Return the parse class to that of the zone we are in
647 375995         727951 $self->{class} = $self->{zone};
648             }
649              
650             # Creates a new token and sets it in the tokenizer
651             # The defined() calls in here prevent a ton of calls to PPI::Util::TRUE
652             sub _new_token {
653 240781     240781   242326 my $self = shift;
654             # throw PPI::Exception() unless @_;
655 240781 100       426631 my $class = substr( $_[0], 0, 12 ) eq 'PPI::Token::'
656             ? shift : 'PPI::Token::' . shift;
657              
658             # Finalize any existing token
659 240781 100       511821 $self->_finalize_token if defined $self->{token};
660              
661             # Create the new token and update the parse class
662 240781 50       571862 defined($self->{token} = $class->new($_[0])) or PPI::Exception->throw;
663 240781         310980 $self->{class} = $class;
664              
665 240781         283779 1;
666             }
667              
668             # At the end of the file, we need to clean up the results of the erroneous
669             # space that we inserted at the beginning of the process.
670             sub _clean_eof {
671 16891     16891   19220 my $self = shift;
672              
673             # Finish any partially completed token
674 16891 100       31433 $self->_finalize_token if $self->{token};
675              
676             # Find the last token, and if it has no content, kill it.
677             # There appears to be some evidence that such "null tokens" are
678             # somehow getting created accidentally.
679 16891         22979 my $last_token = $self->{tokens}->[ -1 ];
680 16891 50       31368 unless ( length $last_token->{content} ) {
681 0         0 pop @{$self->{tokens}};
  0         0  
682             }
683              
684             # Now, if the last character of the last token is a space we added,
685             # chop it off, deleting the token if there's nothing else left.
686 16891 100       29927 if ( $self->{source_eof_chop} ) {
687 15477         21296 $last_token = $self->{tokens}->[ -1 ];
688 15477         66420 $last_token->{content} =~ s/ $//;
689 15477 100       30593 unless ( length $last_token->{content} ) {
690             # Popping token
691 13196         14687 pop @{$self->{tokens}};
  13196         18706  
692             }
693              
694             # The hack involving adding an extra space is now reversed, and
695             # now nobody will ever know. The perfect crime!
696 15477         24802 $self->{source_eof_chop} = '';
697             }
698              
699 16891         48357 1;
700             }
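The reversal performed by C<_clean_eof> -- drop an accidental empty last token, then strip the single padding space added at construction time, deleting the token if nothing remains -- can be sketched as follows (function name and the [class, content] list model are assumptions for illustration):

```python
def clean_eof(tokens, eof_chop):
    """Sketch of _clean_eof. `tokens` is a list of [class, content]
    pairs; `eof_chop` is the flag set when a padding space was added."""
    # Kill a trailing "null token" with no content, if one exists.
    if tokens and tokens[-1][1] == "":
        tokens.pop()
    # Undo the padding space: chop one trailing space off the last
    # token, deleting the token entirely if nothing else is left.
    if eof_chop and tokens:
        cls, content = tokens[-1]
        if content.endswith(" "):
            content = content[:-1]
        if content == "":
            tokens.pop()
        else:
            tokens[-1] = [cls, content]
    return tokens
```

Only one space is removed, matching the single space appended to the last source line in C<new>.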
701              
702              
703              
704              
705              
706             #####################################################################
707             # Utility Methods
708              
709             # Context
710             sub _last_token {
711 0     0   0 $_[0]->{tokens}->[-1];
712             }
713              
714             sub _last_significant_token {
715 3132     3132   4151 my $self = shift;
716 3132         3423 my $cursor = $#{ $self->{tokens} };
  3132         5137  
717 3132         6969 while ( $cursor >= 0 ) {
718 4199         6158 my $token = $self->{tokens}->[$cursor--];
719 4199 100       12602 return $token if $token->significant;
720             }
721 406         806 return;
722             }
723              
724             # Get a list of the previous significant tokens.
725             # Like _last_significant_token except it gets more than just one token.
726             # Returns a list with 0 to $count entries.
727             sub _previous_significant_tokens {
728 168509     168509   181230 my $self = shift;
729 168509   50     236306 my $count = shift || 1;
730 168509         154373 my $cursor = $#{ $self->{tokens} };
  168509         215124  
731              
732 168509         178009 my @tokens;
733 168509         251770 while ( $cursor >= 0 ) {
734 410241         436146 my $token = $self->{tokens}->[$cursor--];
735 410241 100       696060 next if not $token->significant;
736 291122         301094 push @tokens, $token;
737 291122 100       456958 last if @tokens >= $count;
738             }
739              
740 168509         319365 return @tokens;
741             }
742              
743             my %OBVIOUS_CLASS = (
744             'PPI::Token::Symbol' => 'operator',
745             'PPI::Token::Magic' => 'operator',
746             'PPI::Token::Number' => 'operator',
747             'PPI::Token::ArrayIndex' => 'operator',
748             'PPI::Token::Quote::Double' => 'operator',
749             'PPI::Token::Quote::Interpolate' => 'operator',
750             'PPI::Token::Quote::Literal' => 'operator',
751             'PPI::Token::Quote::Single' => 'operator',
752             'PPI::Token::QuoteLike::Backtick' => 'operator',
753             'PPI::Token::QuoteLike::Command' => 'operator',
754             'PPI::Token::QuoteLike::Readline' => 'operator',
755             'PPI::Token::QuoteLike::Regexp' => 'operator',
756             'PPI::Token::QuoteLike::Words' => 'operator',
757             );
758              
759             my %OBVIOUS_CONTENT = (
760             '(' => 'operand',
761             '{' => 'operand',
762             '[' => 'operand',
763             ';' => 'operand',
764             '}' => 'operator',
765             );
766              
767              
768             my %USUALLY_FORCES = map { $_ => 1 } qw( sub package use no );
769              
770             # Try to determine operator/operand context, if possible.
771             # Returns "operator", "operand", or "" if unknown.
772             sub _opcontext {
773 7255     7255   10161 my $self = shift;
774 7255         15476 my @tokens = $self->_previous_significant_tokens(1);
775 7255         10456 my $p0 = $tokens[0];
776 7255 100       21940 return '' if not $p0;
777 7136         11851 my $c0 = ref $p0;
778              
779             # Map the obvious cases
780 7136 100       26812 return $OBVIOUS_CLASS{$c0} if defined $OBVIOUS_CLASS{$c0};
781 2256 100       5641 return $OBVIOUS_CONTENT{$p0} if defined $OBVIOUS_CONTENT{$p0};
782              
783             # Most of the time after an operator, we are an operand
784 1688 100       7511 return 'operand' if $p0->isa('PPI::Token::Operator');
785              
786             # If there's NOTHING, it's operand
787 1483 50       2825 return 'operand' if $p0->content eq '';
788              
789             # Otherwise, we don't know
790 1483         3253 return ''
791             }
792              
793             # Assuming we are currently parsing the word 'x', return true
794             # if previous tokens imply the x is an operator, false otherwise.
795             sub _current_x_is_operator {
796 1119     1119   1563 my ( $self ) = @_;
797 1119 100       1220 return if !@{$self->{tokens}};
  1119         2588  
798              
799 912         2166 my ($prev, $prevprev) = $self->_previous_significant_tokens(2);
800 912 50       3077 return if !$prev;
801              
802 912 100       3786 return !$self->__current_token_is_forced_word if $prev->isa('PPI::Token::Word');
803              
804             return (!$prev->isa('PPI::Token::Operator') || $X_CAN_FOLLOW_OPERATOR{$prev})
805 755   100     6525 && (!$prev->isa('PPI::Token::Structure') || $X_CAN_FOLLOW_STRUCTURE{$prev})
806             && !$prev->isa('PPI::Token::Label')
807             ;
808             }
809              
810              
811             # Assuming we are at the end of parsing the current token that could be a word,
812             # a wordlike operator, or a version string, try to determine whether context
813             # before or after it forces it to be a bareword. This method is only useful
814             # during tokenization.
815             sub __current_token_is_forced_word {
816 33124     33124   53689 my ( $t, $word ) = @_;
817              
818             # Check if forced by preceding tokens.
819              
820 33124         53244 my ( $prev, $prevprev ) = $t->_previous_significant_tokens(2);
821 33124 100       78494 if ( !$prev ) {
822 8864         20367 pos $t->{line} = $t->{line_cursor};
823             }
824             else {
825 24260         42847 my $content = $prev->{content};
826              
827             # We are forced if we are a method name.
828             # '->' will always be an operator, so we don't check its type.
829 24260 100       42276 return 1 if $content eq '->';
830              
831             # If we are contained in a pair of curly braces, we are probably a
832             # forced bareword hash key. '{' is never a word or operator, so we
833             # don't check its type.
834 24131         53101 pos $t->{line} = $t->{line_cursor};
835 24131 100 100     69472 return 1 if $content eq '{' and $t->{line} =~ /\G\s*\}/gc;
836              
837             # sub, package, use, and no all indicate that what immediately follows
838             # is a word not an operator or (in the case of sub and package) a
839             # version string. However, we don't want to be fooled by 'package
840             # package v10' or 'use no v10'. We're a forced package unless we're
841             # preceded by 'package sub', in which case we're a version string.
842             # We also have to make sure that the sub/package/etc doing the forcing
843             # is not a method call.
844 23904 100       57936 if( $USUALLY_FORCES{$content}) {
845 5631 100 66     22214 return if defined $word and $word =~ /^v[0-9]+$/ and ( $content eq "use" or $content eq "no" );
      100        
      100        
846 5621 100       25594 return 1 if not $prevprev;
847 236 100 100     568 return 1 if not $USUALLY_FORCES{$prevprev->content} and $prevprev->content ne '->';
848 6         26 return;
849             }
850             }
851             # pos on $t->{line} is guaranteed to be set at this point.
852              
853             # Check if forced by following tokens.
854              
855             # If the word is followed by => it is probably a word, not a regex.
856 27137 100       72755 return 1 if $t->{line} =~ /\G\s*=>/gc;
857              
858             # Otherwise we probably aren't forced
859 26347         139680 return '';
860             }
861              
862             sub _current_token_has_signatures_active {
863 8512     8512   14148 my ($t) = @_;
864              
865             # Get at least the three previous significant tokens, and extend the
866             # retrieval range to include at least one token that can walk the
867             # already generated tree. (i.e. has a parent)
868 8512         16675 my ( $tokens_to_get, @tokens ) = (3);
869 8512   66     23749 while ( !@tokens or ( $tokens[-1] and !$tokens[-1]->parent ) ) {
      100        
870 24070         36777 @tokens = $t->_previous_significant_tokens($tokens_to_get);
871 24070 100       37595 last if @tokens < $tokens_to_get;
872 22774         80727 $tokens_to_get++;
873             }
874              
875 8512         24778 my ($closest_parented_token) = grep $_->parent, @tokens;
876 8512   66     25652 $closest_parented_token ||= $t->_document || $t->_document(PPI::Document->new);
      66        
877 8512         21501 return $closest_parented_token->presumed_features->{signatures}, @tokens;
878             }
879              
880             1;
881              
882             =pod
883              
884             =head1 NOTES
885              
886             =head2 How the Tokenizer Works
887              
888             Understanding the Tokenizer is not for the faint-hearted. It is by far
889             the most complex and twisty piece of perl I've ever written that is actually
890             still built properly and isn't a terrible spaghetti-like mess. In fact, you
891             probably want to skip this section.
892              
893             But if you really want to understand, well then here goes.
894              
895             =head2 Source Input and Clean Up
896              
897             The Tokenizer starts by taking source in a variety of forms, sucking it
898             all in and merging it into one big string, and then doing its own internal
899             line split, using a "universal line separator" which allows the Tokenizer
900             to take source for any platform (and even supports a few known types of
901             broken newlines caused by mixed mac/pc/*nix editor screw-ups).
902              
903             The resulting array of lines is used to feed the tokenizer, and is also
904             accessed directly by the heredoc-logic to do the line-oriented part of
905             here-doc support.
906              
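The universal line split described above can be sketched as a single split on all the common newline styles. This is an illustrative approximation, not the exact expression PPI uses; the helper name is hypothetical:

```perl
use strict;
use warnings;

# Split source into lines, keeping each newline attached to its line.
# Handles \015\012 (DOS), \012 (Unix), \015 (classic Mac), and the
# broken \015\015\012 sequence produced by some mixed-editor files.
sub split_universal_lines {
    my ($source) = @_;
    # split with a capturing group keeps the separators in the output;
    # limit -1 preserves any trailing empty field.
    my @parts = split /(\015{1,2}\012|\015|\012)/, $source, -1;
    my @lines;
    while (@parts) {
        my $line = shift @parts;
        my $sep  = @parts ? shift @parts : '';
        push @lines, $line . $sep if length( $line . $sep );
    }
    return @lines;
}
```

The resulting list is what a tokenizer like this would iterate over, one line-buffer at a time.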
907             =head2 Doing Things the Old Fashioned Way
908              
909             Due to the complexity of perl, and after 2 previously aborted parser
910             attempts, in the end the tokenizer was fashioned around a line-buffered
911             character-by-character method.
912              
913             That is, the Tokenizer pulls and holds a line at a time into a line buffer,
914             and then iterates a cursor along it. At each cursor position, a method is
915             called in whatever token class we are currently in, which will examine the
916             character at the current position, and handle it.
917              
918             As the handler methods in the various token classes are called, they
919             build up an output token array for the source code.
920              
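The shape of that loop can be sketched with a toy tokenizer. Everything here is illustrative, not PPI's real API: the real handlers are per-class methods, and real token objects carry far more state than a plain string.

```perl
package Toy::Tokenizer;
use strict;
use warnings;

# A minimal line-buffered, character-by-character cursor loop in the
# spirit described above.
sub new { bless { tokens => [], current => '' }, shift }

sub process_line {
    my ($self, $line) = @_;
    $self->{line_cursor} = -1;
    while ( ++$self->{line_cursor} < length $line ) {
        my $char = substr $line, $self->{line_cursor}, 1;
        $self->on_char($char);        # handler for the "current class"
    }
    $self->finalize;                  # flush the partial token at EOL
}

# A toy handler: break the line into word and non-word tokens.
sub on_char {
    my ($self, $char) = @_;
    my $is_word  = $char =~ /\w/ ? 1 : 0;
    my $cur_word = $self->{current} =~ /^\w/ ? 1 : 0;
    $self->finalize
        if length $self->{current} and $cur_word != $is_word;
    $self->{current} .= $char;
}

# Push the accumulated token onto the output array and reset.
sub finalize {
    my $self = shift;
    push @{ $self->{tokens} }, $self->{current} if length $self->{current};
    $self->{current} = '';
}
```

The `finalize` step here corresponds loosely to the token finalization described under "Current Token" below.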
921             Various parts of the Tokenizer use look-ahead, arbitrary-distance
922             look-behind (although currently the maximum is three significant tokens),
923             or both, and various other heuristic guesses.
924              
925             I've been told it is officially termed a I<"backtracking parser
926             with infinite lookaheads">.
927              
928             =head2 State Variables
929              
930             Aside from the current line and the character cursor, the Tokenizer
931             maintains a number of different state variables.
932              
933             =over
934              
935             =item Current Class
936              
937             The Tokenizer maintains the current token class at all times. Much of the
938             time this is just going to be the "Whitespace" class, which is the base
939             class of a document. As the tokenizer executes the various character
940             handlers, the class changes a lot as it moves along. In fact, in some
941             instances, the character handler may not handle the character directly
942             itself, but rather change the "current class" and then hand off to the
943             character handler for the new class.
944              
945             Because of this, and some other things I'll deal with later, the number of
946             times the character handlers are called does not in fact have a direct
947             relationship to the number of actual characters in the document.
948              
949             =item Current Zone
950              
951             Rather than create a class stack to allow for infinitely nested layers of
952             classes, the Tokenizer recognises just a single layer.
953              
954             To put it a different way, in various parts of the file, the Tokenizer will
955             recognise different "base" or "substrate" classes. When a Token such as a
956             comment or a number is finalised by the tokenizer, it "falls back" to the
957             base state.
958              
959             This allows proper tokenization of special areas such as __DATA__
960             and __END__ blocks, which also contain things like comments and POD,
961             without allowing the creation of any significant Tokens inside these areas.
962              
963             For the main part of a document we use L<PPI::Token::Whitespace> for this,
964             with the idea being that code is "floating in a sea of whitespace".
965              
966             =item Current Token
967              
968             The final main state variable is the "current token". This is the Token
969             that is currently being built by the Tokenizer. For certain types, it
970             can be manipulated and morphed and change class quite a bit while being
971             assembled, as the Tokenizer's understanding of the token content changes.
972              
973             When the Tokenizer is confident that it has seen the end of the Token, it
974             will be "finalized", which adds it to the output token array and resets
975             the current class to that of the zone that we are currently in.
976              
977             I should also note at this point that the "current token" variable is
978             optional. The Tokenizer is capable of knowing what class it is currently
979             set to, without actually having accumulated any characters in the Token.
980              
981             =back
982              
983             =head2 Making It Faster
984              
985             As I'm sure you can imagine, calling several different methods for each
986             character and running regexes and other complex heuristics made the first
987             fully working version of the tokenizer extremely slow.
988              
989             During testing, I created a metric to measure parsing speed called
990             LPGC, or "lines per gigacycle". A gigacycle is simply a billion CPU
991             cycles on a typical single-core CPU, and so a Tokenizer running at
992             "1000 lines per gigacycle" should tokenize around 1200 lines of code
993             per second when running on a 1200 MHz processor.
994              
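The arithmetic behind the metric works out as follows (a worked example, not code from the module):

```perl
use strict;
use warnings;

# lines/second = (lines per gigacycle) * (clock speed in GHz),
# since a 1.2 GHz CPU executes 1.2 gigacycles per second.
my $lpgc             = 1000;   # lines per gigacycle
my $clock_ghz        = 1.2;    # 1200 MHz processor
my $lines_per_second = $lpgc * $clock_ghz;
```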
995             The first working version of the tokenizer ran at only 350 LPGC, so
996             tokenizing a typical large module such as L<ExtUtils::MakeMaker> took
997             10-15 seconds. This sluggishness made it impractical for many uses.
998              
999             So in the current parser, there are multiple layers of optimisation
1000             very carefully built into the basic code. This has brought the tokenizer
1001             up to a more reasonable 1000 LPGC, at the expense of making the code
1002             quite a bit twistier.
1003              
1004             =head2 Making It Faster - Whole Line Classification
1005              
1006             The first step in the optimisation process was to add a new handler to
1007             enable several of the more basic classes (whitespace, comments) to be
1008             able to be parsed a line at a time. At the start of each line, a
1009             special optional handler (only supported by a few classes) is called to
1010             check and see if the entire line can be parsed in one go.
1011              
1012             This is used mainly to handle things like POD, comments, empty lines,
1013             and a few other minor special cases.
1014              
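A sketch of such a whole-line check might look like the following. The function name and the single-function shape are hypothetical; in PPI the real mechanism is an optional per-class line-start handler:

```perl
use strict;
use warnings;

# Return a token type if the entire line can be consumed in one go,
# or undef to fall back to char-by-char processing.
sub classify_whole_line {
    my ($line) = @_;
    return 'whitespace' if $line =~ /^\s*$/;   # blank line
    return 'comment'    if $line =~ /^\s*#/;   # full-line comment
    return 'pod'        if $line =~ /^=\w+/;   # POD directive line
    return undef;                              # needs the slow path
}
```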
1015             =head2 Making It Faster - Inlining
1016              
1017             The second stage of the optimisation involved inlining a small
1018             number of critical methods that were repeated an extremely high number
1019             of times. Profiling suggested that there were about 1,000,000 individual
1020             method calls per gigacycle, and by cutting these by two thirds a significant
1021             speed improvement was gained, on the order of about 50%.
1022              
1023             You may notice that many methods in the C<PPI::Tokenizer> code look
1024             very nested and written longhand. This is primarily due to this inlining.
1025              
1026             At around this time, some statistics code that existed in the early
1027             versions of the parser was also removed, as it was determined that
1028             it was consuming around 15% of the CPU for the entire parser, while
1029             making the core more complicated.
1030              
1031             A judgment call was made that with the difficulties likely to be
1032             encountered with future planned enhancements, and given the relatively
1033             high cost involved, the statistics features would be removed from the
1034             Tokenizer.
1035              
1036             =head2 Making It Faster - Quote Engine
1037              
1038             Once inlining had reached diminishing returns, it became obvious from
1039             the profiling results that a huge amount of time was being spent
1040             stepping a char at a time through long, simple and "syntactically boring"
1041             code such as comments and strings.
1042              
1043             The existing regex engine was expanded to also encompass quotes and
1044             other quote-like things, and a special abstract base class was added
1045             that provided a number of specialised parsing methods that would "scan
1046             ahead", jumping forward to find the end of a string, and updating
1047             the cursor to leave it in a valid position for the next call.
1048              
1049             This is also the point at which the number of character handler calls began
1050             to greatly differ from the number of characters. But it has been done
1051             in a way that allows the parser to retain the power of the original
1052             version at the critical points, while skipping through the "boring bits"
1053             as needed for additional speed.
1054              
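The core of such a scan-ahead can be sketched as a single regex that jumps from the cursor straight to the closing delimiter, treating backslash-escaped characters as opaque. This is an illustration of the technique, not the Quote Engine's actual API:

```perl
use strict;
use warnings;

# Scan ahead from $pos for the next unescaped occurrence of $char,
# returning the cursor position just past it, or -1 if not found.
sub scan_for_unescaped {
    my ($line, $pos, $char) = @_;
    pos($line) = $pos;
    # \\. consumes any escaped character in one gulp; the character
    # class consumes everything that is neither a backslash nor the
    # delimiter; then the delimiter itself must match.
    return $line =~ /\G(?:\\.|[^\\\Q$char\E])*\Q$char\E/gs
        ? pos($line)
        : -1;
}
```

One regex match replaces what would otherwise be dozens or hundreds of character-handler calls, which is where the speed comes from.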
1055             The addition of this feature allowed the tokenizer to exceed 1000 LPGC
1056             for the first time.
1057              
1058             =head2 Making It Faster - The "Complete" Mechanism
1059              
1060             As it became evident that great speed increases were available by using
1061             this "skipping ahead" mechanism, a new handler method was added that
1062             explicitly handles the parsing of an entire token, where the structure
1063             of the token is relatively simple. Tokens such as symbols fit this case,
1064             as once we are past the initial sigil and word char, we know that we
1065             can skip ahead and "complete" the rest of the token much more easily.
1066              
1067             A number of these have been added for most or possibly all of the common
1068             cases, with most of these "complete" handlers implemented using regular
1069             expressions.
1070              
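For instance, completing a scalar symbol once the sigil has been seen can be done with a single anchored regex. This is a sketch of the idea, not PPI's actual complete-handler:

```perl
use strict;
use warnings;

# Given the line and a cursor sitting on a '$' sigil, consume the
# whole symbol in one step and return (token, new_cursor), or the
# empty list if no symbol starts here.
sub complete_symbol {
    my ($line, $cursor) = @_;
    pos($line) = $cursor;
    # \w+ segments joined by '::' (or the archaic ') separators.
    if ( $line =~ /\G(\$\w+(?:(?:::|')\w+)*)/gc ) {
        return ( $1, pos($line) );
    }
    return;
}
```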
1071             In fact, so many have been added that at this point, you could arguably
1072             reclassify the tokenizer as a "hybrid regex, char-by-char heuristic
1073             tokenizer". More tokens are now consumed in "complete" methods in a
1074             typical program than are handled by the normal char-by-char methods.
1075              
1076             Many of these complete-handlers were implemented during the writing
1077             of the Lexer, and this has allowed the full parser to maintain around
1078             1000 LPGC despite the increasing weight of the Lexer.
1079              
1080             =head2 Making It Faster - Porting To C (In Progress)
1081              
1082             While it would be extraordinarily difficult to port all of the Tokenizer
1083             to C, work has started on a L<PPI::XS> "accelerator" package which acts as
1084             a separate and automatically-detected add-on to the main PPI package.
1085              
1086             L<PPI::XS> implements faster versions of a variety of functions scattered
1087             over the entire PPI codebase, from the Tokenizer Core, Quote Engine, and
1088             various other places, and implements them identically in XS/C.
1089              
1090             In particular, the skip-ahead methods from the Quote Engine would appear
1091             to be extremely amenable to being done in C, and a number of other
1092             functions could be cherry-picked one at a time and implemented in C.
1093              
1094             Each method is heavily tested to ensure that the functionality is
1095             identical, and a versioning mechanism is included to ensure that if a
1096             function gets out of sync, L<PPI::XS> will degrade gracefully and just
1097             not replace that single method.
1098              
1099             =head1 TO DO
1100              
1101             - Add an option to reset or seek the token stream...
1102              
1103             - Implement more Tokenizer functions in L<PPI::XS>
1104              
1105             =head1 SUPPORT
1106              
1107             See the L<support section|PPI/SUPPORT> in the main module.
1108              
1109             =head1 AUTHOR
1110              
1111             Adam Kennedy E<lt>adamk@cpan.orgE<gt>
1112              
1113             =head1 COPYRIGHT
1114              
1115             Copyright 2001 - 2011 Adam Kennedy.
1116              
1117             This program is free software; you can redistribute
1118             it and/or modify it under the same terms as Perl itself.
1119              
1120             The full text of the license can be found in the
1121             LICENSE file included with this module.
1122              
1123             =cut