File Coverage

blib/lib/PPI/Tokenizer.pm
Criterion Covered Total %
statement 222 248 89.5
branch 116 142 81.6
condition 27 37 72.9
subroutine 29 33 87.8
pod 5 5 100.0
total 399 465 85.8


line stmt bran cond sub pod time code
1             package PPI::Tokenizer;
2              
3             =pod
4              
5             =head1 NAME
6              
7             PPI::Tokenizer - The Perl Document Tokenizer
8              
9             =head1 SYNOPSIS
10              
11             # Create a tokenizer for a file, array or string
12             $Tokenizer = PPI::Tokenizer->new( 'filename.pl' );
13             $Tokenizer = PPI::Tokenizer->new( \@lines );
14             $Tokenizer = PPI::Tokenizer->new( \$source );
15            
16             # Return all the tokens for the document
17             my $tokens = $Tokenizer->all_tokens;
18            
19             # Or we can use it as an iterator
20             while ( my $Token = $Tokenizer->get_token ) {
21             print "Found token '$Token'\n";
22             }
23            
24             # If we REALLY need to manually nudge the cursor, you
25             # can do that too (The lexer needs this ability to do rollbacks)
26             $is_incremented = $Tokenizer->increment_cursor;
27             $is_decremented = $Tokenizer->decrement_cursor;
28              
29             =head1 DESCRIPTION
30              
31             PPI::Tokenizer is the class that provides Tokenizer objects for use in
32             breaking strings of Perl source code into Tokens.
33              
34             By the time you are reading this, you probably need to know a little
35             about the difference between how perl parses Perl "code" and how PPI
36             parses Perl "documents".
37              
38             "perl" itself (the interpreter) uses a heavily modified lex specification
39             to specify its parsing logic, maintains several types of state as it
40             goes, and incrementally tokenizes, lexes AND EXECUTES at the same time.
41              
42             In fact, it is provably impossible to use perl's parsing method without
43             simultaneously executing code. A formal mathematical proof has been
44             published demonstrating the method.
45              
46             This is where the truism "Only perl can parse Perl" comes from.
47              
48             PPI uses a completely different approach by abandoning the (impossible)
49             ability to parse Perl the same way that the interpreter does, and instead
50             parsing the source as a document, using a document structure independently
51             derived from the Perl documentation and approximating the perl
52             interpreter's interpretation as closely as possible.
53              
54             It was touch and go for a long time whether we could get it close enough,
55             but in the end it turned out that it could be done.
56              
57             In this approach, the tokenizer C<PPI::Tokenizer> is implemented separately
58             from the lexer L<PPI::Lexer>.
59              
60             The job of C<PPI::Tokenizer> is to take pure source as a string and break it
61             up into a stream/set of tokens, and contains most of the "black magic" used
62             in PPI. By comparison, the lexer implements a relatively straightforward
63             tree structure, and has an implementation that is uncomplicated (compared
64             to the insanity in the tokenizer at least).
65              
66             The Tokenizer uses an immense amount of heuristics, guessing and cruft,
67             supported by a very B<flexible> internal API, but fortunately it was
68             possible to largely encapsulate the black magic, so there is not a lot that
69             gets exposed to people using the C<PPI::Tokenizer> itself.
70              
71             =head1 METHODS
72              
73             Despite the incredible complexity, the Tokenizer itself only exposes a
74             relatively small number of methods, with most of the complexity implemented
75             in private methods.
76              
77             =cut
78              
79             # Make sure everything we need is loaded so
80             # we don't have to go and load all of PPI.
81 67     67   348 use strict;
  67         139  
  67         2261  
82 67     67   234 use Params::Util qw{_INSTANCE _SCALAR0 _ARRAY0};
  67         122  
  67         3445  
83 67     67   316 use List::Util 1.33 ();
  67         1047  
  67         1166  
84 67     67   238 use PPI::Util ();
  67         101  
  67         1103  
85 67     67   246 use PPI::Element ();
  67         362  
  67         885  
86 67     67   237 use PPI::Token ();
  67         147  
  67         790  
87 67     67   242 use PPI::Exception ();
  67         104  
  67         871  
88 67     67   24011 use PPI::Exception::ParserRejection ();
  67         159  
  67         1167  
89 67     67   261 use PPI::Document ();
  67         93  
  67         174948  
90              
91             our $VERSION = '1.287';
92              
93             # The x operator cannot follow most Perl operators, implying that
94             # anything beginning with x following an operator is a word.
95             # These are the exceptions.
96             my %X_CAN_FOLLOW_OPERATOR = map { $_ => 1 } qw( -- ++ );
97              
98             # The x operator cannot follow most structure elements, implying that
99             # anything beginning with x following a structure element is a word.
100             # These are the exceptions.
101             my %X_CAN_FOLLOW_STRUCTURE = map { $_ => 1 } qw( } ] \) );
102              
103             # Something that looks like the x operator but follows a word
104             # is usually that word's argument.
105             # These are the exceptions.
106             # chop, chomp, dump are ambiguous because they can have either parms
107             # or no parms.
108             my %X_CAN_FOLLOW_WORD = map { $_ => 1 } qw(
109             endgrent
110             endhostent
111             endnetent
112             endprotoent
113             endpwent
114             endservent
115             fork
116             getgrent
117             gethostent
118             getlogin
119             getnetent
120             getppid
121             getprotoent
122             getpwent
123             getservent
124             setgrent
125             setpwent
126             time
127             times
128             wait
129             wantarray
130             __SUB__
131             );
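As an aside for readers tracing this heuristic: the three exception tables boil down to set-membership checks keyed on the previous token. A minimal Python sketch of that lookup (the function name `x_can_be_operator` and the three-way `prev_kind` split are illustrative simplifications, not PPI's API):

```python
# Exceptions under which a following "x" can still be the repetition
# operator (mirrors %X_CAN_FOLLOW_OPERATOR, %X_CAN_FOLLOW_STRUCTURE,
# and %X_CAN_FOLLOW_WORD from the source above).
X_CAN_FOLLOW_OPERATOR = {"--", "++"}
X_CAN_FOLLOW_STRUCTURE = {"}", "]", ")"}
X_CAN_FOLLOW_WORD = {
    "endgrent", "endhostent", "endnetent", "endprotoent", "endpwent",
    "endservent", "fork", "getgrent", "gethostent", "getlogin",
    "getnetent", "getppid", "getprotoent", "getpwent", "getservent",
    "setgrent", "setpwent", "time", "times", "wait", "wantarray",
    "__SUB__",
}

def x_can_be_operator(prev_kind, prev_content):
    """Decide whether an "x" after the previous token may be the x
    operator.  prev_kind ('operator', 'structure', or 'word') is a
    hypothetical stand-in for the previous token's PPI class."""
    if prev_kind == "operator":
        return prev_content in X_CAN_FOLLOW_OPERATOR
    if prev_kind == "structure":
        return prev_content in X_CAN_FOLLOW_STRUCTURE
    if prev_kind == "word":
        return prev_content in X_CAN_FOLLOW_WORD
    return True  # other token kinds don't block the x operator
```

So after `++` an `x` may still be the operator, while after most words (`print x...`) it is treated as that word's argument.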
132              
133              
134              
135             #####################################################################
136             # Creation and Initialization
137              
138             =pod
139              
140             =head2 new $file | \@lines | \$source
141              
142             The main C<new> constructor creates a new Tokenizer object. These
143             objects have no configuration parameters, and can only be used once,
144             to tokenize a single perl source file.
145              
146             It takes as argument either a normal scalar containing source code,
147             a reference to a scalar containing source code, or a reference to an
148             ARRAY containing newline-terminated lines of source code.
149              
150             Returns a new C<PPI::Tokenizer> object on success, or throws a
151             L<PPI::Exception> exception on error.
152              
153             =cut
154              
155             sub new {
156 16933   33 16933 1 61972 my $class = ref($_[0]) || $_[0];
157              
158             # Create the empty tokenizer struct
159 16933         142366 my $self = bless {
160             # Source code
161             source => undef,
162             source_bytes => undef,
163             document => undef,
164              
165             # Line buffer
166             line => undef,
167             line_length => undef,
168             line_cursor => undef,
169             line_count => 0,
170              
171             # Parse state
172             token => undef,
173             class => 'PPI::Token::BOM',
174             zone => 'PPI::Token::Whitespace',
175             feature_set => undef,
176              
177             # Output token buffer
178             tokens => [],
179             token_cursor => 0,
180             token_eof => 0,
181              
182             # Perl 6 blocks
183             perl6 => [],
184             }, $class;
185              
186 16933 100       58799 if ( ! ref $_[1] ) {
    100          
    50          
187 515         2662 my $source = PPI::Util::_slurp($_[1]);
188 515 100       1709 PPI::Exception->throw("Tokenizer failed to open file: $source")
189             if not ref $source;
190 513         2038 $self->{source} = $$source;
191              
192             } elsif ( _SCALAR0($_[1]) ) {
193 16415         19135 $self->{source} = ${$_[1]};
  16415         33118  
194              
195             } elsif ( _ARRAY0($_[1]) ) {
196 3         4 $self->{source} = join '', map "$_\n", @{$_[1]};
  3         16  
197              
198             } else {
199             # We don't support whatever this is
200 0         0 PPI::Exception->throw(ref($_[1]) . " is not supported as a source provider");
201             }
202              
203             # We can't handle a null string
204 16931         31312 $self->{source_bytes} = length $self->{source};
205 16931 100       30840 if ( $self->{source_bytes} ) {
206             # Split on local newlines
207 16927         571059 $self->{source} =~ s/(?:\015{1,2}\012|\015|\012)/\n/g;
208 16927         276090 $self->{source} = [ split /(?<=\n)/, $self->{source} ];
209              
210             } else {
211 4         9 $self->{source} = [ ];
212             }
213              
214             ### EVIL
215             # I'm explaining this earlier than I should so you can understand
216             # why I'm about to do something that looks very strange. There's
217             # a problem with the Tokenizer, in that tokens tend to change
218             # classes as each letter is added, but they don't get allocated
219             # their definite final class until the "end" of the token, the
220             # detection of which occurs in about a hundred different places,
221             # all through various crufty code (that triples the speed).
222             #
223             # However, in general, this does not apply to tokens in which a
224             # whitespace character is valid, such as comments, whitespace and
225             # big strings.
226             #
227             # So what we do is add a space to the end of the source. This
228             # triggers normal "end of token" functionality for all cases. Then,
229             # once the tokenizer hits end of file, it examines the last token to
230             # manually either remove the ' ' token, or chop it off the end of
231             # a longer one in which the space would be valid.
232 16931 100   76941   65833 if ( List::Util::any { /^__(?:DATA|END)__\s*$/ } @{$self->{source}} ) {
  76941 100       153552  
  16931 100       69426  
233 10         28 $self->{source_eof_chop} = '';
234             } elsif ( ! defined $self->{source}->[0] ) {
235 4         8 $self->{source_eof_chop} = '';
236             } elsif ( $self->{source}->[-1] =~ /\s$/ ) {
237 1130         2997 $self->{source_eof_chop} = '';
238             } else {
239 15787         29093 $self->{source_eof_chop} = 1;
240 15787         28479 $self->{source}->[-1] .= ' ';
241             }
242              
243 16931         72193 $self;
244             }
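Stripped of bookkeeping, the constructor's source handling performs three steps: normalize newlines, split while keeping each line's terminator, and apply the trailing-space trick described in the EVIL comment above. A Python sketch under those assumptions (`prepare_source` is an invented name; the real method also accepts filenames and array refs):

```python
import re

def prepare_source(source):
    """Model sub new's source handling: normalize newlines, split into
    lines that keep their "\n", and append a space unless the last line
    already ends in whitespace (or an __END__/__DATA__ line exists)."""
    if not source:
        return [], ""
    # Normalize CRLF / CR / LF to "\n", as in the Perl substitution
    # s/(?:\015{1,2}\012|\015|\012)/\n/g.
    source = re.sub(r"\r{1,2}\n|\r|\n", "\n", source)
    # Split but keep the newline on each line (split /(?<=\n)/).
    lines = re.findall(r"[^\n]*\n|[^\n]+", source)
    if any(re.match(r"^__(?:DATA|END)__\s*$", line) for line in lines):
        eof_chop = ""          # trailing space never reaches the tokens
    elif re.search(r"\s$", lines[-1]):
        eof_chop = ""          # already ends in whitespace, nothing to add
    else:
        eof_chop = True        # remember to chop the space off at EOF
        lines[-1] += " "
    return lines, eof_chop
```

For example, `prepare_source("my $x")` yields `(["my $x "], True)`: the space forces normal end-of-token handling, and the flag tells `_clean_eof` to remove it again later.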
245              
246             sub _document {
247 16840     16840   19777 my $self = shift;
248 16840 50       38416 return @_ ? $self->{document} = shift : $self->{document};
249             }
250              
251              
252              
253              
254              
255             #####################################################################
256             # Main Public Methods
257              
258             =pod
259              
260             =head2 get_token
261              
262             When using the PPI::Tokenizer object as an iterator, the C<get_token>
263             method is the primary method that is used. It increments the cursor
264             and returns the next Token in the output array.
265              
266             The actual parsing of the file is done only as-needed, and a line at
267             a time. When C<get_token> hits the end of the token array, it will
268             cause the parser to pull in the next line and parse it, continuing
269             as needed until there are more tokens on the output array that
270             get_token can then return.
271              
272             This means that a number of Tokenizer objects can be created, and
273             won't consume significant CPU until you actually begin to pull tokens
274             from them.
275              
276             Returns a L<PPI::Token> object on success, C<0> if the Tokenizer has
277             reached the end of the file, or C<undef> on error.
278              
279             =cut
280              
281             sub get_token {
282 595948     595948 1 590452 my $self = shift;
283              
284             # Shortcut for EOF
285 595948 50 66     910719 if ( $self->{token_eof}
286 13579         26825 and $self->{token_cursor} > scalar @{$self->{tokens}}
287             ) {
288 0         0 return 0;
289             }
290              
291             # Return the next token if we can
292 595948 100       1219153 if ( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) {
293 513705         523611 $self->{token_cursor}++;
294 513705         1228165 return $token;
295             }
296              
297 82243         84026 my $line_rv;
298              
299             # Catch exceptions and return undef, so that we
300             # can start to convert code to exception-based code.
301 82243         89945 my $rv = eval {
302             # No token, we need to get some more
303 82243         149175 while ( $line_rv = $self->_process_next_line ) {
304             # If there is something in the buffer, return it
305             # The defined() prevents a ton of calls to PPI::Util::TRUE
306 73141 100       155455 if ( defined( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) ) {
307 51825         54739 $self->{token_cursor}++;
308 51825         89810 return $token;
309             }
310             }
311 30417         38778 return undef;
312             };
313 82243 100       190390 if ( $@ ) {
    100          
314 1 50       70 if ( _INSTANCE($@, 'PPI::Exception') ) {
315 1         6 $@->throw;
316             } else {
317 0         0 my $errstr = $@;
318 0         0 $errstr =~ s/^(.*) at line .+$/$1/;
319 0         0 PPI::Exception->throw( $errstr );
320             }
321             } elsif ( $rv ) {
322 51825         227615 return $rv;
323             }
324              
325 30417 50       43641 if ( defined $line_rv ) {
326             # End of file, but we can still return things from the buffer
327 30417 50       49442 if ( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) {
328 0         0 $self->{token_cursor}++;
329 0         0 return $token;
330             }
331              
332             # Set our token end of file flag
333 30417         32990 $self->{token_eof} = 1;
334 30417         82925 return 0;
335             }
336              
337             # Error, pass it up to our caller
338 0         0 undef;
339             }
340              
341             =pod
342              
343             =head2 all_tokens
344              
345             When not being used as an iterator, the C<all_tokens> method tells
346             the Tokenizer to parse the entire file and return all of the tokens
347             in a single ARRAY reference.
348              
349             It should be noted that C<all_tokens> does B<not> interfere with the
350             use of the Tokenizer object as an iterator (does not modify the token
351             cursor) and use of the two different mechanisms can be mixed safely.
352              
353             Returns a reference to an ARRAY of L<PPI::Token> objects on success
354             or throws an exception on error.
355              
356             =cut
357              
358             sub all_tokens {
359 5     5 1 15 my $self = shift;
360              
361             # Catch exceptions and return undef, so that we
362             # can start to convert code to exception-based code.
363 5         7 my $ok = eval {
364             # Process lines until we get EOF
365 5 50       12 unless ( $self->{token_eof} ) {
366 5         5 my $rv;
367 5         11 while ( $rv = $self->_process_next_line ) {}
368 5 50       8 unless ( defined $rv ) {
369 0         0 PPI::Exception->throw("Error while processing source");
370             }
371              
372             # Clean up the end of the tokenizer
373 5         8 $self->_clean_eof;
374             }
375 5         6 1;
376             };
377 5 50       22 if ( !$ok ) {
378 0         0 my $errstr = $@;
379 0         0 $errstr =~ s/^(.*) at line .+$/$1/;
380 0         0 PPI::Exception->throw( $errstr );
381             }
382              
383             # End of file, return a copy of the token array.
384 5         7 return [ @{$self->{tokens}} ];
  5         18  
385             }
386              
387             =pod
388              
389             =head2 increment_cursor
390              
391             Although exposed as a public method, C<increment_cursor> is implemented
392             for expert use only, when writing lexers or other components that work
393             directly on token streams.
394              
395             It manually increments the token cursor forward through the file, in effect
396             "skipping" the next token.
397              
398             Returns true if the cursor is incremented, C<0> if already at the end of
399             the file, or C<undef> on error.
400              
401             =cut
402              
403             sub increment_cursor {
404             # Do this via the get_token method, which makes sure there
405             # is actually a token there to move to.
406 0 0   0 1 0 $_[0]->get_token and 1;
407             }
408              
409             =pod
410              
411             =head2 decrement_cursor
412              
413             Although exposed as a public method, C<decrement_cursor> is implemented
414             for expert use only, when writing lexers or other components that work
415             directly on token streams.
416              
417             It manually decrements the token cursor backwards through the file, in
418             effect "rolling back" the token stream. And indeed that is what it is
419             primarily intended for, when the component that is consuming the token
420             stream needs to implement some sort of "roll back" feature in its use
421             of the token stream.
422              
423             Returns true if the cursor is decremented, C<0> if already at the
424             beginning of the file, or C<undef> on error.
425              
426             =cut
427              
428             sub decrement_cursor {
429 0     0 1 0 my $self = shift;
430              
431             # Check for the beginning of the file
432 0 0       0 return 0 unless $self->{token_cursor};
433              
434             # Decrement the token cursor
435 0         0 $self->{token_eof} = 0;
436 0         0 --$self->{token_cursor};
437             }
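Taken together, `get_token`, `increment_cursor`, and `decrement_cursor` are bounded moves over a token buffer plus an EOF flag. A stripped-down Python sketch of just that bookkeeping (the class name `CursorBuffer` is invented; the real tokenizer also refills the buffer lazily from the source):

```python
class CursorBuffer:
    """Token buffer with a cursor, mirroring token_cursor/token_eof."""

    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.cursor = 0
        self.eof = False

    def get_token(self):
        # Return the next token, or 0 once the buffer is exhausted.
        if self.cursor < len(self.tokens):
            token = self.tokens[self.cursor]
            self.cursor += 1
            return token
        self.eof = True
        return 0

    def increment_cursor(self):
        # Skip a token; true only if one was actually there to skip.
        return self.get_token() != 0

    def decrement_cursor(self):
        # Roll back one token; 0 if already at the beginning.
        if not self.cursor:
            return 0
        self.eof = False          # rolling back clears the EOF state
        self.cursor -= 1
        return True
```

This is the rollback facility the lexer relies on: after hitting EOF, a `decrement_cursor` clears the flag and the same token can be fetched again.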
438              
439              
440              
441              
442              
443             #####################################################################
444             # Working With Source
445              
446             # Fetches the next line from the input line buffer
447             # Returns undef at EOF.
448             sub _get_line {
449 111478     111478   105701 my $self = shift;
450 111478 100       178540 return undef unless $self->{source}; # EOF hit previously
451              
452             # Pull off the next line
453 95373         92663 my $line = shift @{$self->{source}};
  95373         156772  
454              
455             # Flag EOF if we hit it
456 95373 100       156340 $self->{source} = undef unless defined $line;
457              
458             # Return the line (or EOF flag)
459 95373         148428 return $line; # string or undef
460             }
461              
462             # Fetches the next line, ready to process
463             # Returns 1 on success
464             # Returns 0 on EOF
465             sub _fill_line {
466 109101     109101   109663 my $self = shift;
467 109101         114200 my $inscan = shift;
468              
469             # Get the next line
470 109101         156399 my $line = $self->_get_line;
471 109101 100       166347 unless ( defined $line ) {
472             # End of file
473 32475 100       46924 unless ( $inscan ) {
474 30422         45815 delete $self->{line};
475 30422         33886 delete $self->{line_cursor};
476 30422         32629 delete $self->{line_length};
477 30422         53775 return 0;
478             }
479              
480             # In the scan version, just set the cursor to the end
481             # of the line, and the rest should just cascade out.
482 2053         2889 $self->{line_cursor} = $self->{line_length};
483 2053         3591 return 0;
484             }
485              
486             # Populate the appropriate variables
487 76626         122229 $self->{line} = $line;
488 76626         90405 $self->{line_cursor} = -1;
489 76626         94357 $self->{line_length} = length $line;
490 76626         83530 $self->{line_count}++;
491              
492 76626         131347 1;
493             }
494              
495             # Get the current character
496             sub _char {
497 0     0   0 my $self = shift;
498 0         0 substr( $self->{line}, $self->{line_cursor}, 1 );
499             }
500              
501              
502              
503              
504              
505             ####################################################################
506             # Per line processing methods
507              
508             # Processes the next line
509             # Returns 1 on success completion
510             # Returns 0 if EOF
511             # Returns undef on error
512             sub _process_next_line {
513 103575     103575   104449 my $self = shift;
514              
515             # Fill the line buffer
516 103575         108313 my $rv;
517 103575 100       153018 unless ( $rv = $self->_fill_line ) {
518 30422 50       43052 return undef unless defined $rv;
519              
520             # End of file, finalize last token
521 30422         52155 $self->_finalize_token;
522 30422         55111 return 0;
523             }
524              
525             # Run the __TOKENIZER__on_line_start
526 73153         191541 $rv = $self->{class}->__TOKENIZER__on_line_start( $self );
527 73153 100       115893 unless ( $rv ) {
528             # If there are no more source lines, then clean up
529 28373 100 66     49490 if ( ref $self->{source} eq 'ARRAY' and ! @{$self->{source}} ) {
  28373         58691  
530 308         844 $self->_clean_eof;
531             }
532              
533             # Defined but false means next line
534 28373 50       61267 return 1 if defined $rv;
535 0         0 PPI::Exception->throw("Error at line $self->{line_count}");
536             }
537              
538             # If we can't deal with the entire line, process char by char
539 44780         77870 while ( $rv = $self->_process_next_char ) {}
540 44779 50       76253 unless ( defined $rv ) {
541 0         0 PPI::Exception->throw("Error at line $self->{line_count}, character $self->{line_cursor}");
542             }
543              
544             # Trigger any action that needs to happen at the end of a line
545 44779         109358 $self->{class}->__TOKENIZER__on_line_end( $self );
546              
547             # If there are no more source lines, then clean up
548 44779 100 100     96820 unless ( ref($self->{source}) eq 'ARRAY' and @{$self->{source}} ) {
  42484         98134  
549 16618         31319 return $self->_clean_eof;
550             }
551              
552 28161         57358 return 1;
553             }
554              
555              
556              
557              
558              
559             #####################################################################
560             # Per-character processing methods
561              
562             # Process on a per-character basis.
563             # Note that due the high number of times this gets
564             # called, it has been fairly heavily in-lined, so the code
565             # might look a bit ugly and duplicated.
566             sub _process_next_char {
567 703888     703888   674284 my $self = shift;
568              
569             ### FIXME - This checks for a screwed up condition that triggers
570             ### several warnings, amongst other things.
571 703888 50 33     1578562 if ( ! defined $self->{line_cursor} or ! defined $self->{line_length} ) {
572             # $DB::single = 1;
573 0         0 return undef;
574             }
575              
576 703888         684810 $self->{line_cursor}++;
577 703888 100       836213 return 0 if $self->_at_line_end;
578              
579             # Pass control to the token class
580 659109         637099 my $result;
581 659109 100       1193995 unless ( $result = $self->{class}->__TOKENIZER__on_char( $self ) ) {
582             # undef is error. 0 is "Did stuff ourself, you don't have to do anything"
583 93825 50       242003 return defined $result ? 1 : undef;
584             }
585              
586             # We will need the value of the current character
587 565283         746661 my $char = substr( $self->{line}, $self->{line_cursor}, 1 );
588 565283 100       755103 if ( $result eq '1' ) {
589             # If __TOKENIZER__on_char returns 1, it is signaling that it thinks that
590             # the character is part of it.
591              
592             # Add the character
593 96033 50       126342 if ( defined $self->{token} ) {
594 96033         133003 $self->{token}->{content} .= $char;
595             } else {
596 0 0       0 defined($self->{token} = $self->{class}->new($char)) or return undef;
597             }
598              
599 96033         180368 return 1;
600             }
601              
602             # We have been provided with the name of a class
603 469250 100       746769 if ( $self->{class} ne "PPI::Token::$result" ) {
    100          
604             # New class
605 228343         304106 $self->_new_token( $result, $char );
606             } elsif ( defined $self->{token} ) {
607             # Same class as current
608 44497         58820 $self->{token}->{content} .= $char;
609             } else {
610             # Same class, but no current
611 196410 50       345356 defined($self->{token} = $self->{class}->new($char)) or return undef;
612             }
613              
614 469250         870107 1;
615             }
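The contract in `_process_next_char` — the current class either claims the character (result `1`) or names a new class to switch to — is easiest to see as a run-grouping state machine. A deliberately simplified Python sketch, where a `classify(char)` callback stands in for `__TOKENIZER__on_char` and returning the current class plays the role of returning 1:

```python
def tokenize(line, classify):
    """Group consecutive characters whose class is unchanged into one
    token; a class change finalizes the current token and starts a new
    one (a simplification of the on_char dispatch above)."""
    tokens = []
    cls, buf = None, ""
    for char in line:
        new_cls = classify(char)
        if new_cls == cls:
            buf += char                    # same class: extend the token
        else:
            if buf:
                tokens.append((cls, buf))  # finalize the previous token
            cls, buf = new_cls, char       # start a new token
    if buf:
        tokens.append((cls, buf))          # finalize the last token
    return tokens

def classify(char):
    # Toy stand-in for the per-class __TOKENIZER__on_char methods.
    if char.isspace():
        return "Whitespace"
    if char.isdigit():
        return "Number"
    return "Word"
```

`tokenize("my 42", classify)` returns `[("Word", "my"), ("Whitespace", " "), ("Number", "42")]`; the real tokenizer differs mainly in that each token class carries its own transition logic.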
616              
617             sub _at_line_end {
618 703888     703888   756729 my ($self) = @_;
619 703888         1207913 return $self->{line_cursor} >= $self->{line_length};
620             }
621              
622              
623              
624              
625              
626             #####################################################################
627             # Altering Tokens in Tokenizer
628              
629             # Finish the end of a token.
630             # Returns the resulting parse class as a convenience.
631             sub _finalize_token {
632 609223     609223   573060 my $self = shift;
633 609223 100       825814 return $self->{class} unless defined $self->{token};
634              
635             # Add the token to the token buffer
636 578799         535213 push @{ $self->{tokens} }, $self->{token};
  578799         812106  
637 578799         618530 $self->{token} = undef;
638              
639             # Return the parse class to that of the zone we are in
640 578799         1040773 $self->{class} = $self->{zone};
641             }
642              
643             # Creates a new token and sets it in the tokenizer
644             # The defined() in here prevent a ton of calls to PPI::Util::TRUE
645             sub _new_token {
646 382387     382387   388503 my $self = shift;
647             # throw PPI::Exception() unless @_;
648 382387 100       639382 my $class = substr( $_[0], 0, 12 ) eq 'PPI::Token::'
649             ? shift : 'PPI::Token::' . shift;
650              
651             # Finalize any existing token
652 382387 100       655410 $self->_finalize_token if defined $self->{token};
653              
654             # Create the new token and update the parse class
655 382387 50       777437 defined($self->{token} = $class->new($_[0])) or PPI::Exception->throw;
656 382387         461250 $self->{class} = $class;
657              
658 382387         432527 1;
659             }
660              
661             # At the end of the file, we need to clean up the results of the erroneous
662             # space that we inserted at the beginning of the process.
663             sub _clean_eof {
664 16931     16931   20446 my $self = shift;
665              
666             # Finish any partially completed token
667 16931 100       29286 $self->_finalize_token if $self->{token};
668              
669             # Find the last token, and if it has no content, kill it.
670             # There appears to be some evidence that such "null tokens" are
671             # somehow getting created accidentally.
672 16931         22478 my $last_token = $self->{tokens}->[ -1 ];
673 16931 50       31435 unless ( length $last_token->{content} ) {
674 0         0 pop @{$self->{tokens}};
  0         0  
675             }
676              
677             # Now, if the last character of the last token is a space we added,
678             # chop it off, deleting the token if there's nothing else left.
679 16931 100       29033 if ( $self->{source_eof_chop} ) {
680 15515         22282 $last_token = $self->{tokens}->[ -1 ];
681 15515         64383 $last_token->{content} =~ s/ $//;
682 15515 100       29584 unless ( length $last_token->{content} ) {
683             # Popping token
684 13238         13392 pop @{$self->{tokens}};
  13238         17918  
685             }
686              
687             # The hack involving adding an extra space is now reversed, and
688             # now nobody will ever know. The perfect crime!
689 15515         24959 $self->{source_eof_chop} = '';
690             }
691              
692 16931         47812 1;
693             }
694              
695              
696              
697              
698              
699             #####################################################################
700             # Utility Methods
701              
702             # Context
703             sub _last_token {
704 0     0   0 $_[0]->{tokens}->[-1];
705             }
706              
707             sub _last_significant_token {
708 3133     3133   3594 my $self = shift;
709 3133         3494 my $cursor = $#{ $self->{tokens} };
  3133         4469  
710 3133         6148 while ( $cursor >= 0 ) {
711 4202         6064 my $token = $self->{tokens}->[$cursor--];
712 4202 100       11845 return $token if $token->significant;
713             }
714 404         719 return;
715             }
716              
717             # Get an array ref of previous significant tokens.
718             # Like _last_significant_token except it gets more than just one token
719             # Returns array with 0 to x entries
720             sub _previous_significant_tokens {
721 165003     165003   163302 my $self = shift;
722 165003   50     230646 my $count = shift || 1;
723 165003         151495 my $cursor = $#{ $self->{tokens} };
  165003         210096  
724              
725 165003         174586 my @tokens;
726 165003         236998 while ( $cursor >= 0 ) {
727 266603         296814 my $token = $self->{tokens}->[$cursor--];
728 266603 100       499118 next if not $token->significant;
729 172047         194263 push @tokens, $token;
730 172047 100       277314 last if @tokens >= $count;
731             }
732              
733 165003         285878 return @tokens;
734             }
735              
736             my %OBVIOUS_CLASS = (
737             'PPI::Token::Symbol' => 'operator',
738             'PPI::Token::Magic' => 'operator',
739             'PPI::Token::Number' => 'operator',
740             'PPI::Token::ArrayIndex' => 'operator',
741             'PPI::Token::Quote::Double' => 'operator',
742             'PPI::Token::Quote::Interpolate' => 'operator',
743             'PPI::Token::Quote::Literal' => 'operator',
744             'PPI::Token::Quote::Single' => 'operator',
745             'PPI::Token::QuoteLike::Backtick' => 'operator',
746             'PPI::Token::QuoteLike::Command' => 'operator',
747             'PPI::Token::QuoteLike::Readline' => 'operator',
748             'PPI::Token::QuoteLike::Regexp' => 'operator',
749             'PPI::Token::QuoteLike::Words' => 'operator',
750             );
751              
752             my %OBVIOUS_CONTENT = (
753             '(' => 'operand',
754             '{' => 'operand',
755             '[' => 'operand',
756             ';' => 'operand',
757             '}' => 'operator',
758             );
759              
760              
761             my %USUALLY_FORCES = map { $_ => 1 } qw( sub package use no );
762              
763             # Try to determine operator/operand context, if possible.
764             # Returns "operator", "operand", or "" if unknown.
765             sub _opcontext {
766 10949     10949   12219 my $self = shift;
767 10949         19514 my @tokens = $self->_previous_significant_tokens(1);
768 10949         13152 my $p0 = $tokens[0];
769 10949 100       27496 return '' if not $p0;
770 10827         15712 my $c0 = ref $p0;
771              
772             # Map the obvious cases
773 10827 100       31716 return $OBVIOUS_CLASS{$c0} if defined $OBVIOUS_CLASS{$c0};
774 2265 100       4884 return $OBVIOUS_CONTENT{$p0} if defined $OBVIOUS_CONTENT{$p0};
775              
776             # Most of the time after an operator, we are an operand
777 1692 100       7237 return 'operand' if $p0->isa('PPI::Token::Operator');
778              
779             # If there's NOTHING, it's operand
780 1488 50       3063 return 'operand' if $p0->content eq '';
781              
782             # Otherwise, we don't know
783 1488         3321 return ''
784             }
785              
786             # Assuming we are currently parsing the word 'x', return true
787             # if previous tokens imply the x is an operator, false otherwise.
788             sub _current_x_is_operator {
789 1124     1124   1584 my ( $self ) = @_;
790 1124 100       1475 return if !@{$self->{tokens}};
  1124         2690  
791              
792 925         1999 my ($prev, $prevprev) = $self->_previous_significant_tokens(2);
793 925 100       3008 return if !$prev;
794              
795 924 100       3732 return !$self->__current_token_is_forced_word if $prev->isa('PPI::Token::Word');
796              
797             return (!$prev->isa('PPI::Token::Operator') || $X_CAN_FOLLOW_OPERATOR{$prev})
798 764   100     5150 && (!$prev->isa('PPI::Token::Structure') || $X_CAN_FOLLOW_STRUCTURE{$prev})
799             && !$prev->isa('PPI::Token::Label')
800             ;
801             }
802              
803              
804             # Assuming we are at the end of parsing the current token that could be a word,
805             # a wordlike operator, or a version string, try to determine whether context
806             # before or after it forces it to be a bareword. This method is only useful
807             # during tokenization.
808             sub __current_token_is_forced_word {
809 33298     33298   49578 my ( $t, $word ) = @_;
810              
811             # Check if forced by preceding tokens.
812              
813 33298         51487 my ( $prev, $prevprev ) = $t->_previous_significant_tokens(2);
814 33298 100       69443 if ( !$prev ) {
815 8858         19215 pos $t->{line} = $t->{line_cursor};
816             }
817             else {
818 24440         38056 my $content = $prev->{content};
819              
820             # We are forced if we are a method name.
821             # '->' will always be an operator, so we don't check its type.
822 24440 100       41442 return 1 if $content eq '->';
823              
824             # If we are contained in a pair of curly braces, we are probably a
825             # forced bareword hash key. '{' is never a word or operator, so we
826             # don't check its type.
827 24311         48514 pos $t->{line} = $t->{line_cursor};
828 24311 100 100     63066 return 1 if $content eq '{' and $t->{line} =~ /\G\s*\}/gc;
829              
830             # sub, package, use, and no all indicate that what immediately follows
831             # is a word not an operator or (in the case of sub and package) a
832             # version string. However, we don't want to be fooled by 'package
833             # package v10' or 'use no v10'. We're a forced package unless we're
834             # preceded by 'package sub', in which case we're a version string.
835             # We also have to make sure that the sub/package/etc doing the forcing
836             # is not a method call.
837 24082 100       50176 if( $USUALLY_FORCES{$content}) {
838 5631 100 66     19169 return if defined $word and $word =~ /^v[0-9]+$/ and ( $content eq "use" or $content eq "no" );
      100        
      100        
839 5621 100       22797 return 1 if not $prevprev;
840 236 100 100     492 return 1 if not $USUALLY_FORCES{$prevprev->content} and $prevprev->content ne '->';
841 6         28 return;
842             }
843             }
844             # pos on $t->{line} is guaranteed to be set at this point.
845              
846             # Check if forced by following tokens.
847              
848             # If the word is followed by => it is probably a word, not a regex.
849 27309 100       69634 return 1 if $t->{line} =~ /\G\s*=>/gc;
850              
851             # Otherwise we probably aren't forced
852 26519         133256 return '';
853             }
854              
855             sub _features {
856 37     37   60 my ( $self, $arg ) = @_;
857 37 50 0     93 return $arg ? $self->{feature_set} = $arg : $self->{feature_set} || {};
858             }
859              
860 12308     12308   38735 sub _current_token_has_signatures_active { shift->{feature_set}{signatures} }
861              
862             1;
863              
864             =pod
865              
866             =head1 NOTES
867              
868             =head2 How the Tokenizer Works
869              
870             Understanding the Tokenizer is not for the faint-hearted. It is by far
871             the most complex and twisty piece of perl I've ever written that is actually
872             still built properly and isn't a terrible spaghetti-like mess. In fact, you
873             probably want to skip this section.
874              
875             But if you really want to understand, well then here goes.
876              
877             =head2 Source Input and Clean Up
878              
879             The Tokenizer starts by taking source in a variety of forms, sucking it
880             all in and merging it into one big string, and doing our own internal line
881             split, using a "universal line separator" which allows the Tokenizer to
882             take source for any platform (and even supports a few known types of
883             broken newlines caused by mixed mac/pc/*nix editor screw ups).
884              
885             The resulting array of lines is used to feed the tokenizer, and is also
886             accessed directly by the heredoc-logic to do the line-oriented part of
887             here-doc support.
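The "universal line separator" split described above can be sketched like this; the regex is an illustrative reconstruction in the spirit of the real code, not necessarily PPI's exact implementation.

```perl
use strict;
use warnings;

# Split on \012 (Unix), \015\012 (DOS), \015 (classic Mac), and the
# broken \015\015\012 sequence produced by some mixed-editor files.
my $source = "one\012two\015\012three\015four\015\015\012five";
my @lines  = split /(?:\015{1,2}\012|\015|\012)/, $source;
# @lines now holds: "one", "two", "three", "four", "five"
```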
888              
889             =head2 Doing Things the Old Fashioned Way
890              
891             Due to the complexity of perl, and after 2 previously aborted parser
892             attempts, in the end the tokenizer was fashioned around a line-buffered
893             character-by-character method.
894              
895             That is, the Tokenizer pulls and holds a line at a time into a line buffer,
896             and then iterates a cursor along it. At each cursor position, a method is
897             called in whatever token class we are currently in, which will examine the
898             character at the current position, and handle it.
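That loop can be illustrated with a deliberately tiny, invented two-class tokenizer (plain word-vs-whitespace splitting; PPI's real class dispatch is far richer):

```perl
use strict;
use warnings;

# Toy illustration (NOT PPI's real code) of the cursor-driven loop:
# hold one line in a buffer, walk a cursor along it, and let the
# current "class" decide what happens at each position.
sub toy_tokenize {
    my ($line) = @_;
    my ( $class, $token, @tokens ) = ( 'space', '' );
    for my $cursor ( 0 .. length($line) - 1 ) {
        my $char = substr $line, $cursor, 1;
        my $next = $char =~ /\w/ ? 'word' : 'space';
        if ( $next ne $class and length $token ) {
            push @tokens, $token;    # finalise the current token
            $token = '';
        }
        ( $class, $token ) = ( $next, $token . $char );
    }
    push @tokens, $token if length $token;
    return @tokens;                  # e.g. 'my $x' => ('my', ' $', 'x')
}
```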
899              
900             As the handler methods in the various token classes are called, they
901             build up an output token array for the source code.
902              
903             Various parts of the Tokenizer use look-ahead, arbitrary-distance
904             look-behind (although currently the maximum is three significant tokens),
905             or both, and various other heuristic guesses.
906              
907             I've been told it is officially termed a I<"backtracking parser
908             with infinite lookaheads">.
909              
910             =head2 State Variables
911              
912             Aside from the current line and the character cursor, the Tokenizer
913             maintains a number of different state variables.
914              
915             =over
916              
917             =item Current Class
918              
919             The Tokenizer maintains the current token class at all times. Much of the
920             time it is just going to be the "Whitespace" class, which is what the base of
921             a document is. As the tokenizer executes the various character handlers,
922             the class changes a lot as it moves along. In fact, in some instances,
923             the character handler may not handle the character directly itself, but
924             rather change the "current class" and then hand off to the character
925             handler for the new class.
926              
927             Because of this, and some other things I'll deal with later, the number of
928             times the character handlers are called does not in fact have a direct
929             relationship to the number of actual characters in the document.
930              
931             =item Current Zone
932              
933             Rather than create a class stack to allow for infinitely nested layers of
934             classes, the Tokenizer recognises just a single layer.
935              
936             To put it a different way, in various parts of the file, the Tokenizer will
937             recognise different "base" or "substrate" classes. When a Token such as a
938             comment or a number is finalised by the tokenizer, it "falls back" to the
939             base state.
940              
941             This allows proper tokenization of special areas such as __DATA__
942             and __END__ blocks, which also contain things like comments and POD,
943             without allowing the creation of any significant Tokens inside these areas.
944              
945             For the main part of a document we use L<PPI::Token::Whitespace> for this,
946             with the idea being that code is "floating in a sea of whitespace".
947              
948             =item Current Token
949              
950             The final main state variable is the "current token". This is the Token
951             that is currently being built by the Tokenizer. For certain types, it
952             can be manipulated and morphed and change class quite a bit while being
953             assembled, as the Tokenizer's understanding of the token content changes.
954              
955             When the Tokenizer is confident that it has seen the end of the Token, it
956             will be "finalized", which adds it to the output token array and resets
957             the current class to that of the zone that we are currently in.
958              
959             I should also note at this point that the "current token" variable is
960             optional. The Tokenizer is capable of knowing what class it is currently
961             set to, without actually having accumulated any characters in the Token.
962              
963             =back
964              
965             =head2 Making It Faster
966              
967             As I'm sure you can imagine, calling several different methods for each
968             character and running regexes and other complex heuristics made the first
969             fully working version of the tokenizer extremely slow.
970              
971             During testing, I created a metric to measure parsing speed called
972             LPGC, or "lines per gigacycle". A gigacycle is simply a billion CPU
973             cycles on a typical single-core CPU, and so a Tokenizer running at
974             "1000 lines per gigacycle" should generate around 1200 lines of tokenized
975             code per second when running on a 1200 MHz processor.
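A quick sanity check of the arithmetic behind the metric:

```perl
use strict;
use warnings;

# lines/second = LPGC x clock speed in gigacycles/second
my $lpgc = 1000;    # lines per gigacycle
my $ghz  = 1.2;     # a 1200 MHz CPU runs 1.2 gigacycles per second
my $lines_per_second = $lpgc * $ghz;    # 1200
```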
976              
977             The first working version of the tokenizer ran at only 350 LPGC, so to
978             tokenize a typical large module such as L<ExtUtils::MakeMaker> took
979             10-15 seconds. This sluggishness made it impractical for many uses.
980              
981             So in the current parser, there are multiple layers of optimisation
982             very carefully built into the basic design. This has brought the tokenizer
983             up to a more reasonable 1000 LPGC, at the expense of making the code
984             quite a bit twistier.
985              
986             =head2 Making It Faster - Whole Line Classification
987              
988             The first step in the optimisation process was to add a new handler to
989             enable several of the more basic classes (whitespace, comments) to be
990             parsed a line at a time. At the start of each line, a
991             special optional handler (only supported by a few classes) is called to
992             check and see if the entire line can be parsed in one go.
993              
994             This is used mainly to handle things like POD, comments, empty lines,
995             and a few other minor special cases.
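A hedged sketch of such a line-level shortcut (the helper name and plain-hash tokens are invented for illustration; PPI's real hook and token classes differ):

```perl
use strict;
use warnings;

# If the whole line is trivially classifiable (a comment or an empty
# line), emit one token for it and report success; otherwise return
# false so the caller falls back to char-by-char parsing.
sub try_whole_line {
    my ( $t, $line ) = @_;
    if ( $line =~ /^\s*#/ ) {
        push @{ $t->{tokens} }, { class => 'Comment', content => $line };
        return 1;
    }
    if ( $line =~ /^\s*$/ ) {
        push @{ $t->{tokens} }, { class => 'Whitespace', content => $line };
        return 1;
    }
    return 0;
}
```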
996              
997             =head2 Making It Faster - Inlining
998              
999             The second stage of the optimisation involved inlining a small
1000             number of critical methods that were repeated an extremely high number
1001             of times. Profiling suggested that there were about 1,000,000 individual
1002             method calls per gigacycle, and by cutting these by two thirds a significant
1003             speed improvement was gained, in the order of about 50%.
1004              
1005             You may notice that many methods in the C<PPI::Tokenizer> code look
1006             very nested and long hand. This is primarily due to this inlining.
1007              
1008             At around this time, some statistics code that existed in the early
1009             versions of the parser was also removed, as it was determined that
1010             it was consuming around 15% of the CPU for the entire parser, while
1011             making the core more complicated.
1012              
1013             A judgment call was made that with the difficulties likely to be
1014             encountered with future planned enhancements, and given the relatively
1015             high cost involved, the statistics features would be removed from the
1016             Tokenizer.
1017              
1018             =head2 Making It Faster - Quote Engine
1019              
1020             Once inlining had reached diminishing returns, it became obvious from
1021             the profiling results that a huge amount of time was being spent
1022             stepping a char at a time though long, simple and "syntactically boring"
1023             code such as comments and strings.
1024              
1025             The existing regex engine was expanded to also encompass quotes and
1026             other quote-like things, and a special abstract base class was added
1027             that provided a number of specialised parsing methods that would "scan
1028             ahead", looking out ahead to find the end of a string, and updating
1029             the cursor to leave it in a valid position for the next call.
1030              
1031             This is also the point at which the number of character handler calls began
1032             to greatly differ from the number of characters. But it has been done
1033             in a way that allows the parser to retain the power of the original
1034             version at the critical points, while skipping through the "boring bits"
1035             as needed for additional speed.
1036              
1037             The addition of this feature allowed the tokenizer to exceed 1000 LPGC
1038             for the first time.
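The "scan ahead" idea can be sketched for single-quoted strings like so (an invented stand-alone helper, not the Quote Engine's actual interface): instead of stepping one character at a time, a single regex jumps straight to the closing delimiter while honouring backslash escapes.

```perl
use strict;
use warnings;

# Given the line buffer and a cursor sitting just after the opening
# quote, consume up to and including the closing quote in one match.
sub scan_single_quoted {
    my ( $line, $cursor ) = @_;
    pos($line) = $cursor;
    if ( $line =~ /\G((?:\\.|[^'\\])*)'/gc ) {
        return ( $1, pos($line) );    # string body, cursor after quote
    }
    return;    # no closing quote here; string continues on a later line
}
```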
1039              
1040             =head2 Making It Faster - The "Complete" Mechanism
1041              
1042             As it became evident that great speed increases were available by using
1043             this "skipping ahead" mechanism, a new handler method was added that
1044             explicitly handles the parsing of an entire token, where the structure
1045             of the token is relatively simple. Tokens such as symbols fit this case,
1046             as once we are past the initial sigil and word char, we know that we
1047             can skip ahead and "complete" the rest of the token much more easily.
1048              
1049             A number of these have been added for most or possibly all of the common
1050             cases, with most of these "complete" handlers implemented using regular
1051             expressions.
1052              
1053             In fact, so many have been added that at this point, you could arguably
1054             reclassify the tokenizer as a "hybrid regex, char-by-char heuristic
1055             tokenizer". More tokens are now consumed in "complete" methods in a
1056             typical program than are handled by the normal char-by-char methods.
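A "complete" handler for the symbol case mentioned above might be sketched like this (a simplified stand-in; the real handlers live in the individual token classes and cover more sigil forms):

```perl
use strict;
use warnings;

# Once we know the cursor sits on the sigil of a symbol like $foo,
# one regex can consume the whole token instead of stepping char
# by char through it.
sub complete_symbol {
    my ( $line, $cursor ) = @_;
    pos($line) = $cursor;
    if ( $line =~ /\G([\$\@\%][\w:]+)/gc ) {
        return ( $1, pos($line) );    # token content, new cursor
    }
    return;
}
```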
1057              
1058             Many of these complete-handlers were implemented during the writing
1059             of the Lexer, and this has allowed the full parser to maintain around
1060             1000 LPGC despite the increasing weight of the Lexer.
1061              
1062             =head2 Making It Faster - Porting To C (In Progress)
1063              
1064             While it would be extraordinarily difficult to port all of the Tokenizer
1065             to C, work has started on a L<PPI::XS> "accelerator" package which acts as
1066             a separate and automatically-detected add-on to the main PPI package.
1067              
1068             L<PPI::XS> implements faster versions of a variety of functions scattered
1069             over the entire PPI codebase, from the Tokenizer Core, Quote Engine, and
1070             various other places, and implements them identically in XS/C.
1071              
1072             In particular, the skip-ahead methods from the Quote Engine would appear
1073             to be extremely amenable to being done in C, and a number of other
1074             functions could be cherry-picked one at a time and implemented in C.
1075              
1076             Each method is heavily tested to ensure that the functionality is
1077             identical, and a versioning mechanism is included to ensure that if a
1078             function gets out of sync, L<PPI::XS> will degrade gracefully and just
1079             not replace that single method.
1080              
1081             =head1 TO DO
1082              
1083             - Add an option to reset or seek the token stream...
1084              
1085             - Implement more Tokenizer functions in L<PPI::XS>
1086              
1087             =head1 SUPPORT
1088              
1089             See the L<support section|PPI/SUPPORT> in the main module.
1090              
1091             =head1 AUTHOR
1092              
1093             Adam Kennedy E<lt>adamk@cpan.orgE<gt>
1094              
1095             =head1 COPYRIGHT
1096              
1097             Copyright 2001 - 2011 Adam Kennedy.
1098              
1099             This program is free software; you can redistribute
1100             it and/or modify it under the same terms as Perl itself.
1101              
1102             The full text of the license can be found in the
1103             LICENSE file included with this module.
1104              
1105             =cut