File Coverage

blib/lib/PPI/Tokenizer.pm
Criterion Covered Total %
statement 222 248 89.5
branch 115 142 80.9
condition 27 37 72.9
subroutine 29 33 87.8
pod 5 5 100.0
total 398 465 85.5


line stmt bran cond sub pod time code
1             package PPI::Tokenizer;
2              
3             =pod
4              
5             =head1 NAME
6              
7             PPI::Tokenizer - The Perl Document Tokenizer
8              
9             =head1 SYNOPSIS
10              
11             # Create a tokenizer for a file, array or string
12             $Tokenizer = PPI::Tokenizer->new( 'filename.pl' );
13             $Tokenizer = PPI::Tokenizer->new( \@lines );
14             $Tokenizer = PPI::Tokenizer->new( \$source );
15            
16             # Return all the tokens for the document
17             my $tokens = $Tokenizer->all_tokens;
18            
19             # Or we can use it as an iterator
20             while ( my $Token = $Tokenizer->get_token ) {
21             print "Found token '$Token'\n";
22             }
23            
24             # If we REALLY need to manually nudge the cursor, you
25             # can do that too (The lexer needs this ability to do rollbacks)
26             $is_incremented = $Tokenizer->increment_cursor;
27             $is_decremented = $Tokenizer->decrement_cursor;
28              
29             =head1 DESCRIPTION
30              
31             PPI::Tokenizer is the class that provides Tokenizer objects for use in
32             breaking strings of Perl source code into Tokens.
33              
34             By the time you are reading this, you probably need to know a little
35             about the difference between how perl parses Perl "code" and how PPI
36             parses Perl "documents".
37              
38             "perl" itself (the interpreter) uses a heavily modified lex specification
39             to specify its parsing logic, maintains several types of state as it
40             goes, and incrementally tokenizes, lexes AND EXECUTES at the same time.
41              
42             In fact, it is provably impossible to use perl's parsing method without
43             simultaneously executing code. A formal mathematical proof has been
44             published demonstrating this.
45              
46             This is where the truism "Only perl can parse Perl" comes from.
47              
48             PPI uses a completely different approach by abandoning the (impossible)
49             ability to parse Perl the same way that the interpreter does, and instead
50             parsing the source as a document, using a document structure independently
51             derived from the Perl documentation and approximating the perl interpreter
52             interpretation as closely as possible.
53              
54             It was touch and go for a long time whether we could get it close enough,
55             but in the end it turned out that it could be done.
56              
57             In this approach, the tokenizer C<PPI::Tokenizer> is implemented separately
58             from the lexer L<PPI::Lexer>.
59              
60             The job of C<PPI::Tokenizer> is to take pure source as a string and break it
61             up into a stream/set of tokens, and it contains most of the "black magic" used
62             in PPI. By comparison, the lexer implements a relatively straightforward
63             tree structure, and has an implementation that is uncomplicated (compared
64             to the insanity in the tokenizer, at least).
65              
66             The Tokenizer uses an immense amount of heuristics, guessing and cruft,
67             supported by a very flexible internal API, but fortunately it was
68             possible to largely encapsulate the black magic, so there is not a lot that
69             gets exposed to people using the C<PPI::Tokenizer> itself.
70              
71             =head1 METHODS
72              
73             Despite the incredible complexity, the Tokenizer itself only exposes a
74             relatively small number of methods, with most of the complexity implemented
75             in private methods.
76              
77             =cut
78              
79             # Make sure everything we need is loaded so
80             # we don't have to go and load all of PPI.
81 68     68   381 use strict;
  68         113  
  68         2399  
82 68     68   257 use Params::Util qw{_INSTANCE _SCALAR0 _ARRAY0};
  68         103  
  68         3500  
83 68     68   269 use List::Util 1.33 ();
  68         1026  
  68         1187  
84 68     68   252 use PPI::Util ();
  68         94  
  68         1213  
85 68     68   300 use PPI::Element ();
  68         127  
  68         1011  
86 68     68   232 use PPI::Token ();
  68         130  
  68         909  
87 68     68   255 use PPI::Exception ();
  68         107  
  68         835  
88 68     68   25402 use PPI::Exception::ParserRejection ();
  68         166  
  68         1224  
89 68     68   302 use PPI::Document ();
  68         101  
  68         185069  
90              
91             our $VERSION = '1.290';
92              
93             # The x operator cannot follow most Perl operators, implying that
94             # anything beginning with x following an operator is a word.
95             # These are the exceptions.
96             my %X_CAN_FOLLOW_OPERATOR = map { $_ => 1 } qw( -- ++ );
97              
98             # The x operator cannot follow most structure elements, implying that
99             # anything beginning with x following a structure element is a word.
100             # These are the exceptions.
101             my %X_CAN_FOLLOW_STRUCTURE = map { $_ => 1 } qw( } ] \) );
102              
103             # Something that looks like the x operator but follows a word
104             # is usually that word's argument.
105             # These are the exceptions.
106             # chop, chomp, dump are ambiguous because they can have either parms
107             # or no parms.
108             my %X_CAN_FOLLOW_WORD = map { $_ => 1 } qw(
109             endgrent
110             endhostent
111             endnetent
112             endprotoent
113             endpwent
114             endservent
115             fork
116             getgrent
117             gethostent
118             getlogin
119             getnetent
120             getppid
121             getprotoent
122             getpwent
123             getservent
124             setgrent
125             setpwent
126             time
127             times
128             wait
129             wantarray
130             __SUB__
131             );
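The three exception tables above combine into a simple membership test. A minimal sketch of that test (Python, purely illustrative; the function name and token-kind strings are hypothetical, not part of PPI):

```python
# Hypothetical sketch of the lookup the %X_CAN_FOLLOW_* tables support:
# decide whether something beginning with "x" after a given token is
# the repetition operator or an ordinary word.
X_CAN_FOLLOW_OPERATOR = {"--", "++"}
X_CAN_FOLLOW_STRUCTURE = {"}", "]", ")"}

def x_is_operator(prev_kind, prev_content):
    """Guess whether an "x" following the previous token is an operator."""
    if prev_kind == "operator":
        # "x" cannot follow most operators, so it must start a word --
        # except after the listed exceptions "++" and "--".
        return prev_content in X_CAN_FOLLOW_OPERATOR
    if prev_kind == "structure":
        # Likewise for structure elements, except after "}", "]", ")".
        return prev_content in X_CAN_FOLLOW_STRUCTURE
    # Other cases need the %X_CAN_FOLLOW_WORD list and more context.
    return False
```

The word case (`%X_CAN_FOLLOW_WORD`) is omitted here because it depends on knowing which builtins can take no arguments.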
132              
133              
134              
135             #####################################################################
136             # Creation and Initialization
137              
138             =pod
139              
140             =head2 new $file | \@lines | \$source
141              
142             The main C<new> constructor creates a new Tokenizer object. These
143             objects have no configuration parameters, and can only be used once,
144             to tokenize a single Perl source file.
145              
146             It takes as argument either a normal scalar containing the name of a file,
147             a reference to a scalar containing source code, or a reference to an
148             ARRAY containing newline-terminated lines of source code.
149              
150             Returns a new C<PPI::Tokenizer> object on success, or throws a
151             L<PPI::Exception> on error.
152              
153             =cut
154              
155             sub new {
156 16938   33 16938 1 59482 my $class = ref($_[0]) || $_[0];
157              
158             # Create the empty tokenizer struct
159 16938         139892 my $self = bless {
160             # Source code
161             source => undef,
162             source_bytes => undef,
163             document => undef,
164              
165             # Line buffer
166             line => undef,
167             line_length => undef,
168             line_cursor => undef,
169             line_count => 0,
170              
171             # Parse state
172             token => undef,
173             class => 'PPI::Token::BOM',
174             zone => 'PPI::Token::Whitespace',
175             feature_set => undef,
176              
177             # Output token buffer
178             tokens => [],
179             token_cursor => 0,
180             token_eof => 0,
181              
182             # Perl 6 blocks
183             perl6 => [],
184             }, $class;
185              
186 16938 100       57469 if ( ! ref $_[1] ) {
    100          
    50          
187 518         2850 my $source = PPI::Util::_slurp($_[1]);
188 518 100       1558 PPI::Exception->throw("Tokenizer failed to open file: $source")
189             if not ref $source;
190 516         1822 $self->{source} = $$source;
191              
192             } elsif ( _SCALAR0($_[1]) ) {
193 16417         19034 $self->{source} = ${$_[1]};
  16417         33076  
194              
195             } elsif ( _ARRAY0($_[1]) ) {
196 3         7 $self->{source} = join '', map "$_\n", @{$_[1]};
  3         18  
197              
198             } else {
199             # We don't support whatever this is
200 0         0 PPI::Exception->throw(ref($_[1]) . " is not supported as a source provider");
201             }
202              
203             # We can't handle a null string
204 16936         31843 $self->{source_bytes} = length $self->{source};
205 16936 100       28929 if ( $self->{source_bytes} ) {
206             # Split on local newlines
207 16932         509857 $self->{source} =~ s/(?:\015{1,2}\012|\015|\012)/\n/g;
208 16932         247393 $self->{source} = [ split /(?<=\n)/, $self->{source} ];
209              
210             } else {
211 4         10 $self->{source} = [ ];
212             }
213              
214             ### EVIL
215             # I'm explaining this earlier than I should so you can understand
216             # why I'm about to do something that looks very strange. There's
217             # a problem with the Tokenizer, in that tokens tend to change
218             # classes as each letter is added, but they don't get allocated
219             # their definite final class until the "end" of the token, the
220             # detection of which occurs in about a hundred different places,
221             # all through various crufty code (that triples the speed).
222             #
223             # However, in general, this does not apply to tokens in which a
224             # whitespace character is valid, such as comments, whitespace and
225             # big strings.
226             #
227             # So what we do is add a space to the end of the source. This
228             # triggers normal "end of token" functionality for all cases. Then,
229             # once the tokenizer hits end of file, it examines the last token to
230             # manually either remove the ' ' token, or chop it off the end of
231             # a longer one in which the space would be valid.
232 16936 100   77085   66895 if ( List::Util::any { /^__(?:DATA|END)__\s*$/ } @{$self->{source}} ) {
  77085 100       143813  
  16936 100       68929  
233 10         29 $self->{source_eof_chop} = '';
234             } elsif ( ! defined $self->{source}->[0] ) {
235 4         9 $self->{source_eof_chop} = '';
236             } elsif ( $self->{source}->[-1] =~ /\s$/ ) {
237 1132         3066 $self->{source_eof_chop} = '';
238             } else {
239 15790         27535 $self->{source_eof_chop} = 1;
240 15790         27605 $self->{source}->[-1] .= ' ';
241             }
242              
243 16936         72031 $self;
244             }
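The newline normalization and sentinel-space trick performed above can be sketched as follows (Python, illustrative only; the `__DATA__`/`__END__` special case is omitted, and `prepare_source` is a hypothetical name, not PPI API):

```python
import re

def prepare_source(source):
    """Sketch of new()'s source preparation: normalize platform newlines,
    split into lines while keeping the terminators, then append a
    sentinel space so the final token gets an explicit "end"."""
    # Normalize CRLF / CR / LF variants to "\n"
    source = re.sub(r"\r{1,2}\n|\r|\n", "\n", source)
    # Split while keeping the trailing "\n" on each line
    lines = [line for line in re.split(r"(?<=\n)", source) if line]
    eof_chop = False
    if lines and not re.search(r"\s$", lines[-1]):
        # No trailing whitespace on the last line: add the sentinel
        # space, and remember to chop it back off at end of file.
        eof_chop = True
        lines[-1] += " "
    return lines, eof_chop
```

The `eof_chop` flag plays the role of `source_eof_chop`: it records that the added space must be removed again once tokenizing reaches end of file.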
245              
246             sub _document {
247 16845     16845   17929 my $self = shift;
248 16845 50       37931 return @_ ? $self->{document} = shift : $self->{document};
249             }
250              
251              
252              
253              
254              
255             #####################################################################
256             # Main Public Methods
257              
258             =pod
259              
260             =head2 get_token
261              
262             When using the PPI::Tokenizer object as an iterator, the C<get_token>
263             method is the primary method that is used. It increments the cursor
264             and returns the next Token in the output array.
265              
266             The actual parsing of the file is done only as-needed, and a line at
267             a time. When C<get_token> hits the end of the token array, it will
268             cause the parser to pull in the next line and parse it, continuing
269             as needed until there are more tokens on the output array that
270             get_token can then return.
271              
272             This means that a number of Tokenizer objects can be created, and
273             won't consume significant CPU until you actually begin to pull tokens
274             from them.
275              
276             Returns a L<PPI::Token> object on success, C<0> if the Tokenizer has
277             reached the end of the file, or C<undef> on error.
278              
279             =cut
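The buffer-plus-cursor protocol described above can be sketched in miniature (Python, illustrative; `LazyTokenCursor` and its line-feeding mechanism are hypothetical stand-ins for the real per-line parsing):

```python
class LazyTokenCursor:
    """Illustrative sketch of get_token's protocol: a token buffer and a
    cursor, with the buffer refilled lazily one "line" at a time, and a
    sentinel 0 returned once the input is exhausted."""
    def __init__(self, line_tokens):
        self.pending = list(line_tokens)  # stand-in for unparsed lines
        self.tokens = []                  # output token buffer
        self.cursor = 0

    def _process_next_line(self):
        # Hypothetical: parsing one line appends its tokens to the buffer.
        if not self.pending:
            return False  # end of file
        self.tokens.extend(self.pending.pop(0))
        return True

    def get_token(self):
        # Refill the buffer only when the cursor has run past it.
        while self.cursor >= len(self.tokens):
            if not self._process_next_line():
                return 0  # end of file
        token = self.tokens[self.cursor]
        self.cursor += 1
        return token
```

Note how a line that yields no tokens (an empty list here) simply causes another refill pass, mirroring the `while ( $line_rv = $self->_process_next_line )` loop below.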
280              
281             sub get_token {
282 595835     595835 1 578989 my $self = shift;
283              
284             # Shortcut for EOF
285 595835 50 66     887931 if ( $self->{token_eof}
286 13526         26173 and $self->{token_cursor} > scalar @{$self->{tokens}}
287             ) {
288 0         0 return 0;
289             }
290              
291             # Return the next token if we can
292 595835 100       1167787 if ( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) {
293 513491         498706 $self->{token_cursor}++;
294 513491         1195896 return $token;
295             }
296              
297 82344         88026 my $line_rv;
298              
299             # Catch exceptions and return undef, so that we
300             # can start to convert code to exception-based code.
301 82344         91632 my $rv = eval {
302             # No token, we need to get some more
303 82344         133541 while ( $line_rv = $self->_process_next_line ) {
304             # If there is something in the buffer, return it
305             # The defined() prevents a ton of calls to PPI::Util::TRUE
306 73337 100       146562 if ( defined( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) ) {
307 51974         59380 $self->{token_cursor}++;
308 51974         88125 return $token;
309             }
310             }
311 30369         38714 return undef;
312             };
313 82344 100       182534 if ( $@ ) {
    100          
314 1 50       8 if ( _INSTANCE($@, 'PPI::Exception') ) {
315 1         3 $@->throw;
316             } else {
317 0         0 my $errstr = $@;
318 0         0 $errstr =~ s/^(.*) at line .+$/$1/;
319 0         0 PPI::Exception->throw( $errstr );
320             }
321             } elsif ( $rv ) {
322 51974         202913 return $rv;
323             }
324              
325 30369 50       43004 if ( defined $line_rv ) {
326             # End of file, but we can still return things from the buffer
327 30369 50       48400 if ( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) {
328 0         0 $self->{token_cursor}++;
329 0         0 return $token;
330             }
331              
332             # Set our token end of file flag
333 30369         32396 $self->{token_eof} = 1;
334 30369         80154 return 0;
335             }
336              
337             # Error, pass it up to our caller
338 0         0 undef;
339             }
340              
341             =pod
342              
343             =head2 all_tokens
344              
345             When not being used as an iterator, the C<all_tokens> method tells
346             the Tokenizer to parse the entire file and return all of the tokens
347             in a single ARRAY reference.
348              
349             It should be noted that C<all_tokens> does B<not> interfere with the
350             use of the Tokenizer object as an iterator (does not modify the token
351             cursor) and use of the two different mechanisms can be mixed safely.
352              
353             Returns a reference to an ARRAY of L<PPI::Token> objects on success
354             or throws an exception on error.
355              
356             =cut
357              
358             sub all_tokens {
359 5     5 1 17 my $self = shift;
360              
361             # Catch exceptions and return undef, so that we
362             # can start to convert code to exception-based code.
363 5         6 my $ok = eval {
364             # Process lines until we get EOF
365 5 50       12 unless ( $self->{token_eof} ) {
366 5         6 my $rv;
367 5         12 while ( $rv = $self->_process_next_line ) {}
368 5 50       9 unless ( defined $rv ) {
369 0         0 PPI::Exception->throw("Error while processing source");
370             }
371              
372             # Clean up the end of the tokenizer
373 5         8 $self->_clean_eof;
374             }
375 5         8 1;
376             };
377 5 50       8 if ( !$ok ) {
378 0         0 my $errstr = $@;
379 0         0 $errstr =~ s/^(.*) at line .+$/$1/;
380 0         0 PPI::Exception->throw( $errstr );
381             }
382              
383             # End of file, return a copy of the token array.
384 5         5 return [ @{$self->{tokens}} ];
  5         19  
385             }
386              
387             =pod
388              
389             =head2 increment_cursor
390              
391             Although exposed as a public method, C<increment_cursor> is implemented
392             for expert use only, when writing lexers or other components that work
393             directly on token streams.
394              
395             It manually increments the token cursor forward through the file, in effect
396             "skipping" the next token.
397              
398             Returns true if the cursor is incremented, C<0> if already at the end of
399             the file, or C<undef> on error.
400              
401             =cut
402              
403             sub increment_cursor {
404             # Do this via the get_token method, which makes sure there
405             # is actually a token there to move to.
406 0 0   0 1 0 $_[0]->get_token and 1;
407             }
408              
409             =pod
410              
411             =head2 decrement_cursor
412              
413             Although exposed as a public method, C<decrement_cursor> is implemented
414             for expert use only, when writing lexers or other components that work
415             directly on token streams.
416              
417             It manually decrements the token cursor backwards through the file, in
418             effect "rolling back" the token stream. That is its primary purpose:
419             supporting components that consume the token stream and need to
420             implement some sort of "roll back" feature in their use of the
421             token stream.
422              
423             Returns true if the cursor is decremented, C<0> if already at the
424             beginning of the file, or C<undef> on error.
425              
426             =cut
427              
428             sub decrement_cursor {
429 0     0 1 0 my $self = shift;
430              
431             # Check for the beginning of the file
432 0 0       0 return 0 unless $self->{token_cursor};
433              
434             # Decrement the token cursor
435 0         0 $self->{token_eof} = 0;
436 0         0 --$self->{token_cursor};
437             }
438              
439              
440              
441              
442              
443             #####################################################################
444             # Working With Source
445              
446             # Fetches the next line from the input line buffer
447             # Returns undef at EOF.
448             sub _get_line {
449 111552     111552   112007 my $self = shift;
450 111552 100       183432 return undef unless $self->{source}; # EOF hit previously
451              
452             # Pull off the next line
453 95522         88885 my $line = shift @{$self->{source}};
  95522         152662  
454              
455             # Flag EOF if we hit it
456 95522 100       156843 $self->{source} = undef unless defined $line;
457              
458             # Return the line (or EOF flag)
459 95522         137968 return $line; # string or undef
460             }
461              
462             # Fetches the next line, ready to process
463             # Returns 1 on success
464             # Returns 0 on EOF
465             sub _fill_line {
466 109163     109163   110557 my $self = shift;
467 109163         112357 my $inscan = shift;
468              
469             # Get the next line
470 109163         152717 my $line = $self->_get_line;
471 109163 100       161039 unless ( defined $line ) {
472             # End of file
473 32405 100       44667 unless ( $inscan ) {
474 30374         45981 delete $self->{line};
475 30374         33423 delete $self->{line_cursor};
476 30374         30762 delete $self->{line_length};
477 30374         54152 return 0;
478             }
479              
480             # In the scan version, just set the cursor to the end
481             # of the line, and the rest should just cascade out.
482 2031         2834 $self->{line_cursor} = $self->{line_length};
483 2031         3757 return 0;
484             }
485              
486             # Populate the appropriate variables
487 76758         113625 $self->{line} = $line;
488 76758         88467 $self->{line_cursor} = -1;
489 76758         87679 $self->{line_length} = length $line;
490 76758         84760 $self->{line_count}++;
491              
492 76758         131030 1;
493             }
494              
495             # Get the current character
496             sub _char {
497 0     0   0 my $self = shift;
498 0         0 substr( $self->{line}, $self->{line_cursor}, 1 );
499             }
500              
501              
502              
503              
504              
505             ####################################################################
506             # Per line processing methods
507              
508             # Processes the next line
509             # Returns 1 on success completion
510             # Returns 0 if EOF
511             # Returns undef on error
512             sub _process_next_line {
513 103723     103723   110851 my $self = shift;
514              
515             # Fill the line buffer
516 103723         98909 my $rv;
517 103723 100       148667 unless ( $rv = $self->_fill_line ) {
518 30374 50       42932 return undef unless defined $rv;
519              
520             # End of file, finalize last token
521 30374         53925 $self->_finalize_token;
522 30374         55049 return 0;
523             }
524              
525             # Run the __TOKENIZER__on_line_start
526 73349         192059 $rv = $self->{class}->__TOKENIZER__on_line_start( $self );
527 73349 100       110111 unless ( $rv ) {
528             # If there are no more source lines, then clean up
529 28446 100 66     44266 if ( ref $self->{source} eq 'ARRAY' and ! @{$self->{source}} ) {
  28446         54399  
530 308         710 $self->_clean_eof;
531             }
532              
533             # Defined but false means next line
534 28446 50       54494 return 1 if defined $rv;
535 0         0 PPI::Exception->throw("Error at line $self->{line_count}");
536             }
537              
538             # If we can't deal with the entire line, process char by char
539 44903         77887 while ( $rv = $self->_process_next_char ) {}
540 44902 50       73765 unless ( defined $rv ) {
541 0         0 PPI::Exception->throw("Error at line $self->{line_count}, character $self->{line_cursor}");
542             }
543              
544             # Trigger any action that needs to happen at the end of a line
545 44902         108495 $self->{class}->__TOKENIZER__on_line_end( $self );
546              
547             # If there are no more source lines, then clean up
548 44902 100 100     99842 unless ( ref($self->{source}) eq 'ARRAY' and @{$self->{source}} ) {
  42629         95048  
549 16623         29700 return $self->_clean_eof;
550             }
551              
552 28279         49375 return 1;
553             }
554              
555              
556              
557              
558              
559             #####################################################################
560             # Per-character processing methods
561              
562             # Process on a per-character basis.
563             # Note that due the high number of times this gets
564             # called, it has been fairly heavily in-lined, so the code
565             # might look a bit ugly and duplicated.
566             sub _process_next_char {
567 703952     703952   657651 my $self = shift;
568              
569             ### FIXME - This checks for a screwed up condition that triggers
570             ### several warnings, amongst other things.
571 703952 50 33     1502413 if ( ! defined $self->{line_cursor} or ! defined $self->{line_length} ) {
572             # $DB::single = 1;
573 0         0 return undef;
574             }
575              
576 703952         673094 $self->{line_cursor}++;
577 703952 100       811065 return 0 if $self->_at_line_end;
578              
579             # Pass control to the token class
580 659050         631521 my $result;
581 659050 100       1147299 unless ( $result = $self->{class}->__TOKENIZER__on_char( $self ) ) {
582             # undef is error. 0 is "Did stuff ourself, you don't have to do anything"
583 93701 50       239721 return defined $result ? 1 : undef;
584             }
585              
586             # We will need the value of the current character
587 565348         739503 my $char = substr( $self->{line}, $self->{line_cursor}, 1 );
588 565348 100       748711 if ( $result eq '1' ) {
589             # If __TOKENIZER__on_char returns 1, it is signaling that it thinks that
590             # the character is part of it.
591              
592             # Add the character
593 96035 50       122910 if ( defined $self->{token} ) {
594 96035         126398 $self->{token}->{content} .= $char;
595             } else {
596 0 0       0 defined($self->{token} = $self->{class}->new($char)) or return undef;
597             }
598              
599 96035         178460 return 1;
600             }
601              
602             # We have been provided with the name of a class
603 469313 100       709178 if ( $self->{class} ne "PPI::Token::$result" ) {
    100          
604             # New class
605 227979         299628 $self->_new_token( $result, $char );
606             } elsif ( defined $self->{token} ) {
607             # Same class as current
608 44568         56860 $self->{token}->{content} .= $char;
609             } else {
610             # Same class, but no current
611 196766 50       332295 defined($self->{token} = $self->{class}->new($char)) or return undef;
612             }
613              
614 469313         859031 1;
615             }
616              
617             sub _at_line_end {
618 703952     703952   745948 my ($self) = @_;
619 703952         1175695 return $self->{line_cursor} >= $self->{line_length};
620             }
621              
622              
623              
624              
625              
626             #####################################################################
627             # Altering Tokens in Tokenizer
628              
629             # Finish the end of a token.
630             # Returns the resulting parse class as a convenience.
631             sub _finalize_token {
632 609119     609119   575557 my $self = shift;
633 609119 100       824516 return $self->{class} unless defined $self->{token};
634              
635             # Add the token to the token buffer
636 578743         523691 push @{ $self->{tokens} }, $self->{token};
  578743         797870  
637 578743         606541 $self->{token} = undef;
638              
639             # Return the parse class to that of the zone we are in
640 578743         1020570 $self->{class} = $self->{zone};
641             }
642              
643             # Creates a new token and sets it in the tokenizer
644             # The defined() in here prevent a ton of calls to PPI::Util::TRUE
645             sub _new_token {
646 381975     381975   361972 my $self = shift;
647             # throw PPI::Exception() unless @_;
648 381975 100       622299 my $class = substr( $_[0], 0, 12 ) eq 'PPI::Token::'
649             ? shift : 'PPI::Token::' . shift;
650              
651             # Finalize any existing token
652 381975 100       638362 $self->_finalize_token if defined $self->{token};
653              
654             # Create the new token and update the parse class
655 381975 50       750259 defined($self->{token} = $class->new($_[0])) or PPI::Exception->throw;
656 381975         451827 $self->{class} = $class;
657              
658 381975         421639 1;
659             }
660              
661             # At the end of the file, we need to clean up the results of the erroneous
662             # space that we inserted at the beginning of the process.
663             sub _clean_eof {
664 16936     16936   18313 my $self = shift;
665              
666             # Finish any partially completed token
667 16936 100       29391 $self->_finalize_token if $self->{token};
668              
669             # Find the last token, and if it has no content, kill it.
670             # There appears to be some evidence that such "null tokens" are
671             # somehow getting created accidentally.
672 16936         21267 my $last_token = $self->{tokens}->[ -1 ];
673 16936 50       28853 unless ( length $last_token->{content} ) {
674 0         0 pop @{$self->{tokens}};
  0         0  
675             }
676              
677             # Now, if the last character of the last token is a space we added,
678             # chop it off, deleting the token if there's nothing else left.
679 16936 100       27323 if ( $self->{source_eof_chop} ) {
680 15519         19133 $last_token = $self->{tokens}->[ -1 ];
681 15519         64724 $last_token->{content} =~ s/ $//;
682 15519 100       29032 unless ( length $last_token->{content} ) {
683             # Popping token
684 13247         14044 pop @{$self->{tokens}};
  13247         18090  
685             }
686              
687             # The hack involving adding an extra space is now reversed, and
688             # now nobody will ever know. The perfect crime!
689 15519         23851 $self->{source_eof_chop} = '';
690             }
691              
692 16936         47971 1;
693             }
694              
695              
696              
697              
698              
699             #####################################################################
700             # Utility Methods
701              
702             # Context
703             sub _last_token {
704 0     0   0 $_[0]->{tokens}->[-1];
705             }
706              
707             sub _last_significant_token {
708 3108     3108   3583 my $self = shift;
709 3108         3670 my $cursor = $#{ $self->{tokens} };
  3108         4977  
710 3108         6774 while ( $cursor >= 0 ) {
711 4162         5520 my $token = $self->{tokens}->[$cursor--];
712 4162 100       11266 return $token if $token->significant;
713             }
714 407         745 return;
715             }
716              
717             # Get an array ref of previous significant tokens.
718             # Like _last_significant_token except it gets more than just one token
719             # Returns array with 0 to x entries
720             sub _previous_significant_tokens {
721 164856     164856   167121 my $self = shift;
722 164856   50     231821 my $count = shift || 1;
723 164856         151620 my $cursor = $#{ $self->{tokens} };
  164856         211513  
724              
725 164856         169589 my @tokens;
726 164856         221409 while ( $cursor >= 0 ) {
727 266969         285924 my $token = $self->{tokens}->[$cursor--];
728 266969 100       466872 next if not $token->significant;
729 171887         179301 push @tokens, $token;
730 171887 100       264276 last if @tokens >= $count;
731             }
732              
733 164856         280101 return @tokens;
734             }
735              
736             my %OBVIOUS_CLASS = (
737             'PPI::Token::Symbol' => 'operator',
738             'PPI::Token::Magic' => 'operator',
739             'PPI::Token::Number' => 'operator',
740             'PPI::Token::ArrayIndex' => 'operator',
741             'PPI::Token::Quote::Double' => 'operator',
742             'PPI::Token::Quote::Interpolate' => 'operator',
743             'PPI::Token::Quote::Literal' => 'operator',
744             'PPI::Token::Quote::Single' => 'operator',
745             'PPI::Token::QuoteLike::Backtick' => 'operator',
746             'PPI::Token::QuoteLike::Command' => 'operator',
747             'PPI::Token::QuoteLike::Readline' => 'operator',
748             'PPI::Token::QuoteLike::Regexp' => 'operator',
749             'PPI::Token::QuoteLike::Words' => 'operator',
750             );
751              
752             my %OBVIOUS_CONTENT = (
753             '(' => 'operand',
754             '{' => 'operand',
755             '[' => 'operand',
756             ';' => 'operand',
757             '}' => 'operator',
758             );
759              
760              
761             my %USUALLY_FORCES = map { $_ => 1 } qw( sub package use no );
762              
763             # Try to determine operator/operand context, if possible.
764             # Returns "operator", "operand", or "" if unknown.
765             sub _opcontext {
766 10943     10943   12005 my $self = shift;
767 10943         18373 my @tokens = $self->_previous_significant_tokens(1);
768 10943         12973 my $p0 = $tokens[0];
769 10943 100       27098 return '' if not $p0;
770 10823         15762 my $c0 = ref $p0;
771              
772             # Map the obvious cases
773 10823 100       29976 return $OBVIOUS_CLASS{$c0} if defined $OBVIOUS_CLASS{$c0};
774 2232 100       4789 return $OBVIOUS_CONTENT{$p0} if defined $OBVIOUS_CONTENT{$p0};
775              
776             # Most of the time after an operator, we are an operand
777 1651 100       6627 return 'operand' if $p0->isa('PPI::Token::Operator');
778              
779             # If there's NOTHING, it's operand
780 1468 50       2666 return 'operand' if $p0->content eq '';
781              
782             # Otherwise, we don't know
783 1468         3163 return ''
784             }
785              
786             # Assuming we are currently parsing the word 'x', return true
787             # if previous tokens imply the x is an operator, false otherwise.
788             sub _current_x_is_operator {
789 1158     1158   1535 my ( $self ) = @_;
790 1158 100       1096 return if !@{$self->{tokens}};
  1158         2622  
791              
792 928         1835 my ($prev, $prevprev) = $self->_previous_significant_tokens(2);
793 928 50       2974 return if !$prev;
794              
795 928 100       3625 return !$self->__current_token_is_forced_word if $prev->isa('PPI::Token::Word');
796              
797             return (!$prev->isa('PPI::Token::Operator') || $X_CAN_FOLLOW_OPERATOR{$prev})
798 766   100     5062 && (!$prev->isa('PPI::Token::Structure') || $X_CAN_FOLLOW_STRUCTURE{$prev})
799             && !$prev->isa('PPI::Token::Label')
800             ;
801             }
802              
803              
804             # Assuming we are at the end of parsing the current token that could be a word,
805             # a wordlike operator, or a version string, try to determine whether context
806             # before or after it forces it to be a bareword. This method is only useful
807             # during tokenization.
808             sub __current_token_is_forced_word {
809 33384     33384   52834 my ( $t, $word ) = @_;
810              
811             # Check if forced by preceding tokens.
812              
813 33384         46410 my ( $prev, $prevprev ) = $t->_previous_significant_tokens(2);
814 33384 100       68241 if ( !$prev ) {
815 8885         19566 pos $t->{line} = $t->{line_cursor};
816             }
817             else {
818 24499         36881 my $content = $prev->{content};
819              
820             # We are forced if we are a method name.
821             # '->' will always be an operator, so we don't check its type.
822 24499 100       37172 return 1 if $content eq '->';
823              
824             # If we are contained in a pair of curly braces, we are probably a
825             # forced bareword hash key. '{' is never a word or operator, so we
826             # don't check its type.
827 24369         46075 pos $t->{line} = $t->{line_cursor};
828 24369 100 100     64637 return 1 if $content eq '{' and $t->{line} =~ /\G\s*\}/gc;
829              
830             # sub, package, use, and no all indicate that what immediately follows
831             # is a word not an operator or (in the case of sub and package) a
832             # version string. However, we don't want to be fooled by 'package
833             # package v10' or 'use no v10'. We're a forced package unless we're
834             # preceded by 'package sub', in which case we're a version string.
835             # We also have to make sure that the sub/package/etc doing the forcing
836             # is not a method call.
837 24142 100       46576 if( $USUALLY_FORCES{$content}) {
838 5631 100 66     15816 return if defined $word and $word =~ /^v[0-9]+$/ and ( $content eq "use" or $content eq "no" );
      100        
      100        
839 5621 100       23731 return 1 if not $prevprev;
840 236 100 100     536 return 1 if not $USUALLY_FORCES{$prevprev->content} and $prevprev->content ne '->';
841 6         27 return;
842             }
843             }
844             # pos on $t->{line} is guaranteed to be set at this point.
845              
846             # Check if forced by following tokens.
847              
848             # If the word is followed by => it is probably a word, not a regex.
849 27396 100       66355 return 1 if $t->{line} =~ /\G\s*=>/gc;
850              
851             # Otherwise we probably aren't forced
852 26606         119786 return '';
853             }
854              
855             sub _features {
856 37     37   55 my ( $self, $arg ) = @_;
857 37 50 0     91 return $arg ? $self->{feature_set} = $arg : $self->{feature_set} || {};
858             }
859              
860 12282     12282   37435 sub _current_token_has_signatures_active { shift->{feature_set}{signatures} }
861              
862             1;
863              
864             =pod
865              
866             =head1 NOTES
867              
868             =head2 How the Tokenizer Works
869              
870             Understanding the Tokenizer is not for the faint-hearted. It is by far
871             the most complex and twisty piece of perl I've ever written that is actually
872             still built properly and isn't a terrible spaghetti-like mess. In fact, you
873             probably want to skip this section.
874              
875             But if you really want to understand, well then here goes.
876              
877             =head2 Source Input and Clean Up
878              
879             The Tokenizer starts by taking source in a variety of forms, sucking it
880             all in and merging into one big string, and doing our own internal line
881             split, using a "universal line separator" which allows the Tokenizer to
882             take source for any platform (and even supports a few known types of
883             broken newlines caused by mixed mac/pc/*nix editor screw ups).
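The "universal line separator" idea can be sketched as follows. This is a minimal illustration of the technique, not PPI's actual implementation:

```perl
use strict;
use warnings;

# Split source on any common newline convention: Unix (\012),
# DOS/Windows (\015\012), classic Mac (\015), and the broken
# \015\015\012 sequence some mixed-platform editors produce.
sub split_universal {
    my ($source) = @_;
    return split /(?:\015{1,2}\012|\015|\012)/, $source;
}

my @lines = split_universal("one\ntwo\r\nthree\rfour");
print scalar(@lines), "\n";   # 4
print "$lines[2]\n";          # three
```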
884              
885             The resulting array of lines is used to feed the tokenizer, and is also
886             accessed directly by the heredoc-logic to do the line-oriented part of
887             here-doc support.
888              
889             =head2 Doing Things the Old Fashioned Way
890              
891             Due to the complexity of perl, and after 2 previously aborted parser
892             attempts, in the end the tokenizer was fashioned around a line-buffered
893             character-by-character method.
894              
895             That is, the Tokenizer pulls and holds a line at a time into a line buffer,
896             and then iterates a cursor along it. At each cursor position, a method is
897             called in whatever token class we are currently in, which will examine the
898             character at the current position, and handle it.
899              
900             As the handler methods in the various token classes are called, they
901             build up an output token array for the source code.
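The loop described above can be sketched roughly like this. The handler logic here is a toy stand-in; in real PPI the dispatch goes to per-class character-handler methods rather than a single sub:

```perl
use strict;
use warnings;

# A toy line-buffered, character-by-character tokenizer loop.
# At each cursor position the current "class" (here: just an
# if/else) examines the character and either extends the current
# token or finalises it and starts a new one.
sub toy_tokenize {
    my ($line) = @_;
    my @tokens;
    my $current = '';
    for my $cursor ( 0 .. length($line) - 1 ) {
        my $char = substr $line, $cursor, 1;
        if ( $char =~ /\s/ ) {
            # Whitespace finalises the token being built, if any
            push @tokens, $current if length $current;
            $current = '';
        }
        else {
            $current .= $char;
        }
    }
    push @tokens, $current if length $current;
    return @tokens;
}

print join( '|', toy_tokenize("my \$x = 1;") ), "\n";  # my|$x|=|1;
```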
902              
903             Various parts of the Tokenizer use look-ahead, arbitrary-distance
904             look-behind (although currently the maximum is three significant tokens),
905             or both, and various other heuristic guesses.
906              
907             I've been told it is officially termed a I<"backtracking parser
908             with infinite lookaheads">.
909              
910             =head2 State Variables
911              
912             Aside from the current line and the character cursor, the Tokenizer
913             maintains a number of different state variables.
914              
915             =over
916              
917             =item Current Class
918              
919             The Tokenizer maintains the current token class at all times. Much of the
920             time it is just going to be the "Whitespace" class, which is what the base of
921             a document is. As the tokenizer executes the various character handlers,
922             the class changes a lot as it moves along. In fact, in some instances,
923             the character handler may not handle the character directly itself, but
924             rather change the "current class" and then hand off to the character
925             handler for the new class.
926              
927             Because of this, and some other things I'll deal with later, the number of
928             times the character handlers are called does not in fact have a direct
929             relationship to the number of actual characters in the document.
930              
931             =item Current Zone
932              
933             Rather than create a class stack to allow for infinitely nested layers of
934             classes, the Tokenizer recognises just a single layer.
935              
936             To put it a different way, in various parts of the file, the Tokenizer will
937             recognise different "base" or "substrate" classes. When a Token such as a
938             comment or a number is finalised by the tokenizer, it "falls back" to the
939             base state.
940              
941             This allows proper tokenization of special areas such as __DATA__
942             and __END__ blocks, which also contain things like comments and POD,
943             without allowing the creation of any significant Tokens inside these areas.
944              
945             For the main part of a document we use L<PPI::Token::Whitespace> for this,
946             with the idea being that code is "floating in a sea of whitespace".
947              
948             =item Current Token
949              
950             The final main state variable is the "current token". This is the Token
951             that is currently being built by the Tokenizer. For certain types, it
952             can be manipulated and morphed and change class quite a bit while being
953             assembled, as the Tokenizer's understanding of the token content changes.
954              
955             When the Tokenizer is confident that it has seen the end of the Token, it
956             will be "finalized", which adds it to the output token array and resets
957             the current class to that of the zone that we are currently in.
958              
959             I should also note at this point that the "current token" variable is
960             optional. The Tokenizer is capable of knowing what class it is currently
961             set to, without actually having accumulated any characters in the Token.
962              
963             =back
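The three state variables above can be modelled in a heavily simplified sketch. The field names are chosen to match the description; the real tokenizer's internals are considerably more involved:

```perl
use strict;
use warnings;

# A stripped-down model of the tokenizer state described above.
my %state = (
    class => 'PPI::Token::Whitespace',  # current token class
    zone  => 'PPI::Token::Whitespace',  # current "base" class
    token => undef,                     # token being built, if any
);

# Finalising a token: push it onto the output array and fall
# back to the class of the zone we are currently in.
sub finalize {
    my ( $state, $tokens ) = @_;
    push @$tokens, $state->{token} if defined $state->{token};
    $state->{token} = undef;
    $state->{class} = $state->{zone};
}

my @tokens;
$state{class} = 'PPI::Token::Word';
$state{token} = { class => $state{class}, content => 'print' };
finalize( \%state, \@tokens );
print scalar(@tokens), " ", $state{class}, "\n";  # 1 PPI::Token::Whitespace
```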
964              
965             =head2 Making It Faster
966              
967             As I'm sure you can imagine, calling several different methods for each
968             character and running regexes and other complex heuristics made the first
969             fully working version of the tokenizer extremely slow.
970              
971             During testing, I created a metric to measure parsing speed called
972             LPGC, or "lines per gigacycle". A gigacycle is simply a billion CPU
973             cycles on a typical single-core CPU, and so a Tokenizer running at
974             "1000 lines per gigacycle" should tokenize around 1200 lines of code
975             per second when running on a 1200 MHz processor.
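As a quick worked example of the metric:

```perl
use strict;
use warnings;

# LPGC = lines tokenized per billion CPU cycles.
my $lpgc    = 1000;  # tokenizer speed, lines per gigacycle
my $cpu_ghz = 1.2;   # a 1200 MHz CPU executes 1.2 gigacycles per second
my $lines_per_second = $lpgc * $cpu_ghz;
print "$lines_per_second\n";   # 1200
```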
976              
977             The first working version of the tokenizer ran at only 350 LPGC, so to
978             tokenize a typical large module such as L<ExtUtils::MakeMaker> took
979             10-15 seconds. This sluggishness made it impractical for many uses.
980              
981             So in the current parser, there are multiple layers of optimisation
982             very carefully built into the basic design. This has brought the tokenizer
983             up to a more reasonable 1000 LPGC, at the expense of making the code
984             quite a bit twistier.
985              
986             =head2 Making It Faster - Whole Line Classification
987              
988             The first step in the optimisation process was to add a new handler to
989             enable several of the more basic classes (whitespace, comments) to be
990             able to be parsed a line at a time. At the start of each line, a
991             special optional handler (only supported by a few classes) is called to
992             check and see if the entire line can be parsed in one go.
993              
994             This is used mainly to handle things like POD, comments, empty lines,
995             and a few other minor special cases.
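The idea can be illustrated with a handler that claims an entire line when it is trivially classifiable. The regexes here are simplified stand-ins for the real checks:

```perl
use strict;
use warnings;

# Return a token class if the whole line can be consumed in one
# go, or undef if it needs ordinary char-by-char treatment.
sub classify_whole_line {
    my ($line) = @_;
    return 'PPI::Token::Whitespace' if $line =~ /^\s*$/;
    return 'PPI::Token::Comment'    if $line =~ /^\s*#/;
    return 'PPI::Token::Pod'        if $line =~ /^=\w+/;
    return undef;
}

print classify_whole_line("# a comment\n"), "\n";  # PPI::Token::Comment
print classify_whole_line("=pod\n"), "\n";         # PPI::Token::Pod
```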
996              
997             =head2 Making It Faster - Inlining
998              
999             The second stage of the optimisation involved inlining a small
1000             number of critical methods that were repeated an extremely high number
1001             of times. Profiling suggested that there were about 1,000,000 individual
1002             method calls per gigacycle, and by cutting these by two thirds a significant
1003             speed improvement was gained, in the order of about 50%.
1004              
1005             You may notice that many methods in the C<PPI::Tokenizer> code look
1006             very nested and longhand. This is primarily due to this inlining.
1007              
1008             At around this time, some statistics code that existed in the early
1009             versions of the parser was also removed, as it was determined that
1010             it was consuming around 15% of the CPU for the entire parser, while
1011             making the core more complicated.
1012              
1013             A judgment call was made that with the difficulties likely to be
1014             encountered with future planned enhancements, and given the relatively
1015             high cost involved, the statistics features would be removed from the
1016             Tokenizer.
1017              
1018             =head2 Making It Faster - Quote Engine
1019              
1020             Once inlining had reached diminishing returns, it became obvious from
1021             the profiling results that a huge amount of time was being spent
1022             stepping a char at a time through long, simple and "syntactically boring"
1023             code such as comments and strings.
1024              
1025             The existing regex engine was expanded to also encompass quotes and
1026             other quote-like things, and a special abstract base class was added
1027             that provided a number of specialised parsing methods that would "scan
1028             ahead", looking ahead to find the end of a string, and updating
1029             the cursor to leave it in a valid position for the next call.
1030              
1031             This is also the point at which the number of character handler calls began
1032             to greatly differ from the number of characters. But it has been done
1033             in a way that allows the parser to retain the power of the original
1034             version at the critical points, while skipping through the "boring bits"
1035             as needed for additional speed.
1036              
1037             The addition of this feature allowed the tokenizer to exceed 1000 LPGC
1038             for the first time.
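A minimal version of such a scan-ahead, for a single-quoted string, might look like this. PPI's real quote engine handles many more cases, but the principle is the same:

```perl
use strict;
use warnings;

# Starting just after an opening single quote, find the unescaped
# closing quote in one regex step instead of stepping char by char.
# Returns the string content and the new cursor position (just past
# the closing quote), or nothing if the string is not terminated.
sub scan_single_quote {
    my ( $line, $cursor ) = @_;
    pos($line) = $cursor;
    # Consume plain chars and escaped pairs, then the closing quote.
    if ( $line =~ /\G((?:[^'\\]|\\.)*)'/gc ) {
        my $content = $1;
        return ( $content, $cursor + length($content) + 1 );
    }
    return;
}

# The opening quote is at offset 8, so scanning starts at 9.
my ( $content, $new_cursor ) = scan_single_quote( q{my $x = 'it\\'s';}, 9 );
print "$content\n";   # it\'s
```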
1039              
1040             =head2 Making It Faster - The "Complete" Mechanism
1041              
1042             As it became evident that great speed increases were available by using
1043             this "skipping ahead" mechanism, a new handler method was added that
1044             explicitly handles the parsing of an entire token, where the structure
1045             of the token is relatively simple. Tokens such as symbols fit this case,
1046             as once we are past the initial sigil and word char, we know that we
1047             can skip ahead and "complete" the rest of the token much more easily.
1048              
1049             A number of these have been added for most or possibly all of the common
1050             cases, with most of these "complete" handlers implemented using regular
1051             expressions.
1052              
1053             In fact, so many have been added that at this point, you could arguably
1054             reclassify the tokenizer as a "hybrid regex, char-by-char heuristic
1055             tokenizer". More tokens are now consumed in "complete" methods in a
1056             typical program than are handled by the normal char-by-char methods.
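For instance, a "complete" handler for a symbol such as C<$foo> can consume the whole token with one regex once the sigil has been seen. This is a simplified sketch of the approach, not PPI's actual pattern:

```perl
use strict;
use warnings;

# Given a line and a cursor sitting on a sigil, consume the
# entire symbol token in a single regex step.
sub complete_symbol {
    my ( $line, $cursor ) = @_;
    pos($line) = $cursor;
    if ( $line =~ /\G([\$\@\%][\w:]*\w)/gc ) {
        return $1;
    }
    return;
}

print complete_symbol( 'my $Foo::bar = 1;', 3 ), "\n";  # $Foo::bar
```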
1057              
1058             Many of these complete-handlers were implemented during the writing
1059             of the Lexer, and this has allowed the full parser to maintain around
1060             1000 LPGC despite the increasing weight of the Lexer.
1061              
1062             =head2 Making It Faster - Porting To C (In Progress)
1063              
1064             While it would be extraordinarily difficult to port all of the Tokenizer
1065             to C, work has started on a L<PPI::XS> "accelerator" package which acts as
1066             a separate and automatically-detected add-on to the main PPI package.
1067              
1068             L<PPI::XS> implements faster versions of a variety of functions scattered
1069             over the entire PPI codebase, from the Tokenizer Core, Quote Engine, and
1070             various other places, and implements them identically in XS/C.
1071              
1072             In particular, the skip-ahead methods from the Quote Engine would appear
1073             to be extremely amenable to being done in C, and a number of other
1074             functions could be cherry-picked one at a time and implemented in C.
1075              
1076             Each method is heavily tested to ensure that the functionality is
1077             identical, and a versioning mechanism is included to ensure that if a
1078             function gets out of sync, L<PPI::XS> will degrade gracefully and just
1079             not replace that single method.
1080              
1081             =head1 TO DO
1082              
1083             - Add an option to reset or seek the token stream...
1084              
1085             - Implement more Tokenizer functions in L<PPI::XS>
1086              
1087             =head1 SUPPORT
1088              
1089             See the L<support section|PPI/SUPPORT> in the main module.
1090              
1091             =head1 AUTHOR
1092              
1093             Adam Kennedy E<lt>adamk@cpan.orgE<gt>
1094              
1095             =head1 COPYRIGHT
1096              
1097             Copyright 2001 - 2011 Adam Kennedy.
1098              
1099             This program is free software; you can redistribute
1100             it and/or modify it under the same terms as Perl itself.
1101              
1102             The full text of the license can be found in the
1103             LICENSE file included with this module.
1104              
1105             =cut