File Coverage

blib/lib/PPI/Tokenizer.pm
Criterion Covered Total %
statement 226 257 87.9
branch 115 144 79.8
condition 36 47 76.6
subroutine 28 32 87.5
pod 5 5 100.0
total 410 485 84.5


line stmt bran cond sub pod time code
1             package PPI::Tokenizer;
2              
3             =pod
4              
5             =head1 NAME
6              
7             PPI::Tokenizer - The Perl Document Tokenizer
8              
9             =head1 SYNOPSIS
10              
11             # Create a tokenizer for a file, array or string
12             $Tokenizer = PPI::Tokenizer->new( 'filename.pl' );
13             $Tokenizer = PPI::Tokenizer->new( \@lines );
14             $Tokenizer = PPI::Tokenizer->new( \$source );
15            
16             # Return all the tokens for the document
17             my $tokens = $Tokenizer->all_tokens;
18            
19             # Or we can use it as an iterator
20             while ( my $Token = $Tokenizer->get_token ) {
21             print "Found token '$Token'\n";
22             }
23            
24             # If we REALLY need to manually nudge the cursor, you
25             # can do that too (the lexer needs this ability to do rollbacks)
26             $is_incremented = $Tokenizer->increment_cursor;
27             $is_decremented = $Tokenizer->decrement_cursor;
28              
29             =head1 DESCRIPTION
30              
31             PPI::Tokenizer is the class that provides Tokenizer objects for use in
32             breaking strings of Perl source code into Tokens.
33              
34             By the time you are reading this, you probably need to know a little
35             about the difference between how perl parses Perl "code" and how PPI
36             parses Perl "documents".
37              
38             "perl" itself (the interpreter) uses a heavily modified lex specification
39             to specify its parsing logic, maintains several types of state as it
40             goes, and incrementally tokenizes, lexes AND EXECUTES at the same time.
41              
42             In fact, it is provably impossible to use perl's parsing method without
43             simultaneously executing code. A formal mathematical proof has been
44             published demonstrating this impossibility.
45              
46             This is where the truism "Only perl can parse Perl" comes from.
47              
48             PPI uses a completely different approach by abandoning the (impossible)
49             ability to parse Perl the same way that the interpreter does, and instead
50             parsing the source as a document, using a document structure independently
51             derived from the Perl documentation and approximating the perl interpreter
52             interpretation as closely as possible.
53              
54             It was touch and go for a long time whether we could get it close enough,
55             but in the end it turned out that it could be done.
56              
57             In this approach, the tokenizer C<PPI::Tokenizer> is implemented separately
58             from the lexer L<PPI::Lexer>.
59              
60             The job of C<PPI::Tokenizer> is to take pure source as a string and break
61             it up into a stream/set of tokens. The tokenizer contains most of the "black magic"
62             used in PPI. By comparison, the lexer implements a relatively straightforward
63             tree structure, and has an implementation that is uncomplicated (compared
64             to the insanity in the tokenizer at least).
65              
66             The Tokenizer uses an immense amount of heuristics, guessing and cruft,
67             supported by a very B<VERY> flexible internal API, but fortunately it was
68             possible to largely encapsulate the black magic, so there is not a lot that
69             gets exposed to people using the C<PPI::Tokenizer> itself.
70              
71             =head1 METHODS
72              
73             Despite the incredible complexity, the Tokenizer itself only exposes a
74             relatively small number of methods, with most of the complexity implemented
75             in private methods.
76              
77             =cut
78              
79             # Make sure everything we need is loaded so
80             # we don't have to go and load all of PPI.
81 66     66   375 use strict;
  66         107  
  66         2392  
82 66     66   263 use Params::Util qw{_INSTANCE _SCALAR0 _ARRAY0};
  66         103  
  66         3600  
83 66     66   314 use List::Util 1.33 ();
  66         1200  
  66         1169  
84 66     66   220 use PPI::Util ();
  66         134  
  66         1132  
85 66     66   463 use PPI::Element ();
  66         120  
  66         831  
86 66     66   258 use PPI::Token ();
  66         121  
  66         917  
87 66     66   222 use PPI::Exception ();
  66         119  
  66         814  
88 66     66   23143 use PPI::Exception::ParserRejection ();
  66         161  
  66         1070  
89 66     66   261 use PPI::Document ();
  66         114  
  66         178881  
90              
91             our $VERSION = '1.284';
92              
93             # The x operator cannot follow most Perl operators, implying that
94             # anything beginning with x following an operator is a word.
95             # These are the exceptions.
96             my %X_CAN_FOLLOW_OPERATOR = map { $_ => 1 } qw( -- ++ );
97              
98             # The x operator cannot follow most structure elements, implying that
99             # anything beginning with x following a structure element is a word.
100             # These are the exceptions.
101             my %X_CAN_FOLLOW_STRUCTURE = map { $_ => 1 } qw( } ] \) );
102              
103             # Something that looks like the x operator but follows a word
104             # is usually that word's argument.
105             # These are the exceptions.
106             # chop, chomp, dump are ambiguous because they can be called either
107             # with or without parameters.
108             my %X_CAN_FOLLOW_WORD = map { $_ => 1 } qw(
109             endgrent
110             endhostent
111             endnetent
112             endprotoent
113             endpwent
114             endservent
115             fork
116             getgrent
117             gethostent
118             getlogin
119             getnetent
120             getppid
121             getprotoent
122             getpwent
123             getservent
124             setgrent
125             setpwent
126             time
127             times
128             wait
129             wantarray
130             __SUB__
131             );
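The three exception tables above combine into a single decision: given the kind and text of the previous significant token, is a following C<x> the repetition operator or the start of a word? A minimal sketch (Python used as neutral pseudocode; the function name, the simplified token model, and the abridged word list are illustrative assumptions, not PPI's API):

```python
# Sketch of the x-operator disambiguation encoded by the three
# exception tables above. Each previous token is modeled as a
# (kind, text) pair; the word list is abridged for brevity.

X_CAN_FOLLOW_OPERATOR = {"--", "++"}
X_CAN_FOLLOW_STRUCTURE = {"}", "]", ")"}
X_CAN_FOLLOW_WORD = {"fork", "time", "times", "wait", "wantarray", "__SUB__"}

def x_is_operator(prev_kind, prev_text):
    """Return True if a following 'x' should be parsed as the
    repetition operator rather than the start of a word."""
    if prev_kind == "operator":
        return prev_text in X_CAN_FOLLOW_OPERATOR
    if prev_kind == "structure":
        return prev_text in X_CAN_FOLLOW_STRUCTURE
    if prev_kind == "word":
        return prev_text in X_CAN_FOLLOW_WORD
    # After operands (numbers, strings, variables), x is the operator.
    return True
```

For example, `time x 3` repeats the timestamp (`time` is in the word exceptions), whereas after `print` an `x` begins a new word.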
132              
133              
134              
135             #####################################################################
136             # Creation and Initialization
137              
138             =pod
139              
140             =head2 new $file | \@lines | \$source
141              
142             The main C<new> constructor creates a new Tokenizer object. These
143             objects have no configuration parameters, and can only be used once,
144             to tokenize a single Perl source file.
145              
146             It takes as argument either a normal scalar containing source code,
147             a reference to a scalar containing source code, or a reference to an
148             ARRAY containing newline-terminated lines of source code.
149              
150             Returns a new C<PPI::Tokenizer> object on success, or throws a
151             L<PPI::Exception> exception on error.
152              
153             =cut
154              
155             sub new {
156 16891   33 16891 1 51130 my $class = ref($_[0]) || $_[0];
157              
158             # Create the empty tokenizer struct
159 16891         144421 my $self = bless {
160             # Source code
161             source => undef,
162             source_bytes => undef,
163             document => undef,
164              
165             # Line buffer
166             line => undef,
167             line_length => undef,
168             line_cursor => undef,
169             line_count => 0,
170              
171             # Parse state
172             token => undef,
173             class => 'PPI::Token::BOM',
174             zone => 'PPI::Token::Whitespace',
175              
176             # Output token buffer
177             tokens => [],
178             token_cursor => 0,
179             token_eof => 0,
180              
181             # Perl 6 blocks
182             perl6 => [],
183             }, $class;
184              
185 16891 50       78439 if ( ! defined $_[1] ) {
    100          
    50          
    0          
186             # We weren't given anything
187 0         0 PPI::Exception->throw("No source provided to Tokenizer");
188              
189             } elsif ( ! ref $_[1] ) {
190 506         2772 my $source = PPI::Util::_slurp($_[1]);
191 506 50       1448 if ( ref $source ) {
192             # Content returned by reference
193 506         1829 $self->{source} = $$source;
194             } else {
195             # Errors returned as a string
196 0         0 return( $source );
197             }
198              
199             } elsif ( _SCALAR0($_[1]) ) {
200 16385         18710 $self->{source} = ${$_[1]};
  16385         32940  
201              
202             } elsif ( _ARRAY0($_[1]) ) {
203 0         0 $self->{source} = join '', map { /\n$/ ? $_ : "$_\n" } @{$_[1]};
  0         0  
  0         0  
204              
205             } else {
206             # We don't support whatever this is
207 0         0 PPI::Exception->throw(ref($_[1]) . " is not supported as a source provider");
208             }
209              
210             # We can't handle a null string
211 16891         32020 $self->{source_bytes} = length $self->{source};
212 16891 100       30595 if ( $self->{source_bytes} ) {
213             # Split on local newlines
214 16887         461918 $self->{source} =~ s/(?:\015{1,2}\012|\015|\012)/\n/g;
215 16887         235157 $self->{source} = [ split /(?<=\n)/, $self->{source} ];
216              
217             } else {
218 4         6 $self->{source} = [ ];
219             }
220              
221             ### EVIL
222             # I'm explaining this earlier than I should so you can understand
223             # why I'm about to do something that looks very strange. There's
224             # a problem with the Tokenizer, in that tokens tend to change
225             # classes as each letter is added, but they don't get allocated
226             # their definite final class until the "end" of the token, the
227             # detection of which occurs in about a hundred different places,
228             # all through various crufty code (that triples the speed).
229             #
230             # However, in general, this does not apply to tokens in which a
231             # whitespace character is valid, such as comments, whitespace and
232             # big strings.
233             #
234             # So what we do is add a space to the end of the source. This
235             # triggers normal "end of token" functionality for all cases. Then,
236             # once the tokenizer hits end of file, it examines the last token to
237             # manually either remove the ' ' token, or chop it off the end of
238             # a longer one in which the space would be valid.
239 16891 100   72876   60950 if ( List::Util::any { /^__(?:DATA|END)__\s*$/ } @{$self->{source}} ) {
  72876 100       143707  
  16891 100       67715  
240 10         27 $self->{source_eof_chop} = '';
241             } elsif ( ! defined $self->{source}->[0] ) {
242 4         8 $self->{source_eof_chop} = '';
243             } elsif ( $self->{source}->[-1] =~ /\s$/ ) {
244 1128         3317 $self->{source_eof_chop} = '';
245             } else {
246 15749         29939 $self->{source_eof_chop} = 1;
247 15749         29758 $self->{source}->[-1] .= ' ';
248             }
249              
250 16891         71751 $self;
251             }
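The source preparation performed by C<new> -- normalizing local newline conventions, splitting into lines while keeping each newline, and the trailing-space "end of token" trick described in the comment above -- can be sketched language-neutrally. This Python version mirrors the two regexes and the chop flag (the function name is an illustrative assumption, not PPI's API):

```python
import re

def prepare_source(source):
    """Mirror the source preparation in PPI::Tokenizer->new: normalize
    CR/CRLF to "\n", split into newline-terminated lines, and append a
    padding space to the last line unless it is unnecessary."""
    # Normalize local newline conventions (CRLF, CR, lone LF) to "\n".
    source = re.sub(r"\r{1,2}\n|\r|\n", "\n", source)
    # Split after each newline, keeping the newline on each line.
    lines = [l for l in re.split(r"(?<=\n)", source) if l != ""]
    eof_chop = False
    if any(re.match(r"__(?:DATA|END)__\s*$", l) for l in lines):
        pass  # a __DATA__/__END__ section makes the trick unnecessary
    elif not lines:
        pass  # empty source
    elif re.search(r"\s$", lines[-1]):
        pass  # already ends in whitespace, tokens terminate naturally
    else:
        eof_chop = True       # remember to undo this at EOF
        lines[-1] += " "      # force "end of token" for the last token
    return lines, eof_chop
```

The returned flag corresponds to C<source_eof_chop>, which C<_clean_eof> later consults to reverse the padding.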
252              
253             sub _document {
254 18091     18091   22050 my $self = shift;
255 18091 100       46574 return @_ ? $self->{document} = shift : $self->{document};
256             }
257              
258              
259              
260              
261              
262             #####################################################################
263             # Main Public Methods
264              
265             =pod
266              
267             =head2 get_token
268              
269             When using the PPI::Tokenizer object as an iterator, the C<get_token>
270             method is the primary method that is used. It increments the cursor
271             and returns the next Token in the output array.
272              
273             The actual parsing of the file is done only as-needed, and a line at
274             a time. When C<get_token> hits the end of the token array, it will
275             cause the parser to pull in the next line and parse it, continuing
276             as needed until there are more tokens on the output array that
277             get_token can then return.
278              
279             This means that a number of Tokenizer objects can be created, and
280             won't consume significant CPU until you actually begin to pull tokens
281             from them.
282              
283             Returns a L<PPI::Token> object on success, C<0> if the Tokenizer has
284             reached the end of the file, or C<undef> on error.
285              
286             =cut
287              
288             sub get_token {
289 393053     393053 1 412323 my $self = shift;
290              
291             # Shortcut for EOF
292 393053 50 66     641467 if ( $self->{token_eof}
293 13486         26801 and $self->{token_cursor} > scalar @{$self->{tokens}}
294             ) {
295 0         0 return 0;
296             }
297              
298             # Return the next token if we can
299 393053 100       887576 if ( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) {
300 314867         340537 $self->{token_cursor}++;
301 314867         846527 return $token;
302             }
303              
304 78186         84646 my $line_rv;
305              
306             # Catch exceptions and return undef, so that we
307             # can start to convert code to exception-based code.
308 78186         92642 my $rv = eval {
309             # No token, we need to get some more
310 78186         142600 while ( $line_rv = $self->_process_next_line ) {
311             # If there is something in the buffer, return it
312             # The defined() prevents a ton of calls to PPI::Util::TRUE
313 69073 100       152651 if ( defined( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) ) {
314 47901         55166 $self->{token_cursor}++;
315 47901         87177 return $token;
316             }
317             }
318 30284         40642 return undef;
319             };
320 78186 100       186650 if ( $@ ) {
    100          
321 1 50       7 if ( _INSTANCE($@, 'PPI::Exception') ) {
322 1         3 $@->throw;
323             } else {
324 0         0 my $errstr = $@;
325 0         0 $errstr =~ s/^(.*) at line .+$/$1/;
326 0         0 PPI::Exception->throw( $errstr );
327             }
328             } elsif ( $rv ) {
329 47901         215797 return $rv;
330             }
331              
332 30284 50       46285 if ( defined $line_rv ) {
333             # End of file, but we can still return things from the buffer
334 30284 50       52724 if ( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) {
335 0         0 $self->{token_cursor}++;
336 0         0 return $token;
337             }
338              
339             # Set our token end of file flag
340 30284         35734 $self->{token_eof} = 1;
341 30284         83472 return 0;
342             }
343              
344             # Error, pass it up to our caller
345 0         0 undef;
346             }
347              
348             =pod
349              
350             =head2 all_tokens
351              
352             When not being used as an iterator, the C<all_tokens> method tells
353             the Tokenizer to parse the entire file and return all of the tokens
354             in a single ARRAY reference.
355              
356             It should be noted that C<all_tokens> does B<NOT> interfere with the
357             use of the Tokenizer object as an iterator (does not modify the token
358             cursor) and use of the two different mechanisms can be mixed safely.
359              
360             Returns a reference to an ARRAY of L<PPI::Token> objects on success
361             or throws an exception on error.
362              
363             =cut
364              
365             sub all_tokens {
366 5     5 1 32 my $self = shift;
367              
368             # Catch exceptions and return undef, so that we
369             # can start to convert code to exception-based code.
370 5         10 my $ok = eval {
371             # Process lines until we get EOF
372 5 50       12 unless ( $self->{token_eof} ) {
373 5         6 my $rv;
374 5         15 while ( $rv = $self->_process_next_line ) {}
375 5 50       9 unless ( defined $rv ) {
376 0         0 PPI::Exception->throw("Error while processing source");
377             }
378              
379             # Clean up the end of the tokenizer
380 5         8 $self->_clean_eof;
381             }
382 5         6 1;
383             };
384 5 50       11 if ( !$ok ) {
385 0         0 my $errstr = $@;
386 0         0 $errstr =~ s/^(.*) at line .+$/$1/;
387 0         0 PPI::Exception->throw( $errstr );
388             }
389              
390             # End of file, return a copy of the token array.
391 5         7 return [ @{$self->{tokens}} ];
  5         19  
392             }
393              
394             =pod
395              
396             =head2 increment_cursor
397              
398             Although exposed as a public method, C<increment_cursor> is implemented
399             for expert use only, when writing lexers or other components that work
400             directly on token streams.
401              
402             It manually increments the token cursor forward through the file, in effect
403             "skipping" the next token.
404              
405             Returns true if the cursor is incremented, C<0> if already at the end of
406             the file, or C<undef> on error.
407              
408             =cut
409              
410             sub increment_cursor {
411             # Do this via the get_token method, which makes sure there
412             # is actually a token there to move to.
413 0 0   0 1 0 $_[0]->get_token and 1;
414             }
415              
416             =pod
417              
418             =head2 decrement_cursor
419              
420             Although exposed as a public method, C<decrement_cursor> is implemented
421             for expert use only, when writing lexers or other components that work
422             directly on token streams.
423              
424             It manually decrements the token cursor backwards through the file, in
425             effect "rolling back" the token stream. That is indeed its primary
426             purpose: it exists for cases where the component consuming the token
427             stream needs to implement some sort of "roll back" feature in its use
428             of the stream.
429              
430             Returns true if the cursor is decremented, C<0> if already at the
431             beginning of the file, or C<undef> on error.
432              
433             =cut
434              
435             sub decrement_cursor {
436 0     0 1 0 my $self = shift;
437              
438             # Check for the beginning of the file
439 0 0       0 return 0 unless $self->{token_cursor};
440              
441             # Decrement the token cursor
442 0         0 $self->{token_eof} = 0;
443 0         0 --$self->{token_cursor};
444             }
445              
446              
447              
448              
449              
450             #####################################################################
451             # Working With Source
452              
453             # Fetches the next line from the input line buffer
454             # Returns undef at EOF.
455             sub _get_line {
456 107296     107296   105942 my $self = shift;
457 107296 100       184744 return undef unless $self->{source}; # EOF hit previously
458              
459             # Pull off the next line
460 91268         86324 my $line = shift @{$self->{source}};
  91268         159205  
461              
462             # Flag EOF if we hit it
463 91268 100       159248 $self->{source} = undef unless defined $line;
464              
465             # Return the line (or EOF flag)
466 91268         144027 return $line; # string or undef
467             }
468              
469             # Fetches the next line, ready to process
470             # Returns 1 on success
471             # Returns 0 on EOF
472             sub _fill_line {
473 104920     104920   110056 my $self = shift;
474 104920         120543 my $inscan = shift;
475              
476             # Get the next line
477 104920         157160 my $line = $self->_get_line;
478 104920 100       165799 unless ( defined $line ) {
479             # End of file
480 32358 100       48043 unless ( $inscan ) {
481 30289         49595 delete $self->{line};
482 30289         35579 delete $self->{line_cursor};
483 30289         31949 delete $self->{line_length};
484 30289         57073 return 0;
485             }
486              
487             # In the scan version, just set the cursor to the end
488             # of the line, and the rest should just cascade out.
489 2069         2820 $self->{line_cursor} = $self->{line_length};
490 2069         3711 return 0;
491             }
492              
493             # Populate the appropriate variables
494 72562         125440 $self->{line} = $line;
495 72562         89217 $self->{line_cursor} = -1;
496 72562         89236 $self->{line_length} = length $line;
497 72562         81973 $self->{line_count}++;
498              
499 72562         133184 1;
500             }
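The handshake between C<_get_line> and C<_fill_line> amounts to: pull the next line off the source buffer, flag EOF once the buffer is exhausted, and otherwise reset the cursor to just before the first character so the first increment lands on index 0. A small illustrative model (the class name and layout are assumptions for this sketch, not PPI's internals verbatim):

```python
class LineBuffer:
    """Illustrative model of the tokenizer's line state."""

    def __init__(self, lines):
        self.source = list(lines)   # becomes None once EOF has been hit
        self.line = None
        self.line_cursor = None
        self.line_length = None
        self.line_count = 0

    def _get_line(self):
        if self.source is None:
            return None             # EOF hit previously
        if not self.source:
            self.source = None      # flag EOF
            return None
        return self.source.pop(0)   # pull off the next line

    def fill_line(self):
        line = self._get_line()
        if line is None:
            # End of file: clear the per-line state.
            self.line = self.line_cursor = self.line_length = None
            return False
        self.line = line
        self.line_cursor = -1       # first increment lands on index 0
        self.line_length = len(line)
        self.line_count += 1
        return True
```

Starting the cursor at -1 matches the increment-then-test pattern in C<_process_next_char>.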
501              
502             # Get the current character
503             sub _char {
504 0     0   0 my $self = shift;
505 0         0 substr( $self->{line}, $self->{line_cursor}, 1 );
506             }
507              
508              
509              
510              
511              
512             ####################################################################
513             # Per line processing methods
514              
515             # Processes the next line
516             # Returns 1 on successful completion
517             # Returns 0 if EOF
518             # Returns undef on error
519             sub _process_next_line {
520 99374     99374   108624 my $self = shift;
521              
522             # Fill the line buffer
523 99374         100927 my $rv;
524 99374 100       153342 unless ( $rv = $self->_fill_line ) {
525 30289 50       46150 return undef unless defined $rv;
526              
527             # End of file, finalize last token
528 30289         52126 $self->_finalize_token;
529 30289         56452 return 0;
530             }
531              
532             # Run the __TOKENIZER__on_line_start
533 69085         216415 $rv = $self->{class}->__TOKENIZER__on_line_start( $self );
534 69085 100       113828 unless ( $rv ) {
535             # If there are no more source lines, then clean up
536 28249 100 66     50353 if ( ref $self->{source} eq 'ARRAY' and ! @{$self->{source}} ) {
  28249         58676  
537 311         921 $self->_clean_eof;
538             }
539              
540             # Defined but false means next line
541 28249 50       60307 return 1 if defined $rv;
542 0         0 PPI::Exception->throw("Error at line $self->{line_count}");
543             }
544              
545             # If we can't deal with the entire line, process char by char
546 40836         79757 while ( $rv = $self->_process_next_char ) {}
547 40835 50       72085 unless ( defined $rv ) {
548 0         0 PPI::Exception->throw("Error at line $self->{line_count}, character $self->{line_cursor}");
549             }
550              
551             # Trigger any action that needs to happen at the end of a line
552 40835         106177 $self->{class}->__TOKENIZER__on_line_end( $self );
553              
554             # If there are no more source lines, then clean up
555 40835 100 100     101905 unless ( ref($self->{source}) eq 'ARRAY' and @{$self->{source}} ) {
  38525         100578  
556 16575         34469 return $self->_clean_eof;
557             }
558              
559 24260         48886 return 1;
560             }
561              
562              
563              
564              
565              
566             #####################################################################
567             # Per-character processing methods
568              
569             # Process on a per-character basis.
570             # Note that due to the high number of times this gets
571             # called, it has been fairly heavily in-lined, so the code
572             # might look a bit ugly and duplicated.
573             sub _process_next_char {
574 453121     453121   480252 my $self = shift;
575              
576             ### FIXME - This checks for a screwed up condition that triggers
577             ### several warnings, amongst other things.
578 453121 50 33     1074671 if ( ! defined $self->{line_cursor} or ! defined $self->{line_length} ) {
579             # $DB::single = 1;
580 0         0 return undef;
581             }
582              
583 453121         453682 $self->{line_cursor}++;
584 453121 100       563257 return 0 if $self->_at_line_end;
585              
586             # Pass control to the token class
587 412286         426941 my $result;
588 412286 100       818547 unless ( $result = $self->{class}->__TOKENIZER__on_char( $self ) ) {
589             # undef is error. 0 is "Did stuff ourself, you don't have to do anything"
590 78337 50       212039 return defined $result ? 1 : undef;
591             }
592              
593             # We will need the value of the current character
594 333948         488605 my $char = substr( $self->{line}, $self->{line_cursor}, 1 );
595 333948 100       477829 if ( $result eq '1' ) {
596             # If __TOKENIZER__on_char returns 1, it is signaling that it thinks that
597             # the character is part of it.
598              
599             # Add the character
600 58959 50       86877 if ( defined $self->{token} ) {
601 58959         92353 $self->{token}->{content} .= $char;
602             } else {
603 0 0       0 defined($self->{token} = $self->{class}->new($char)) or return undef;
604             }
605              
606 58959         119399 return 1;
607             }
608              
609             # We have been provided with the name of a class
610 274989 100       478351 if ( $self->{class} ne "PPI::Token::$result" ) {
    100          
611             # New class
612 106352         162840 $self->_new_token( $result, $char );
613             } elsif ( defined $self->{token} ) {
614             # Same class as current
615 33425         47334 $self->{token}->{content} .= $char;
616             } else {
617             # Same class, but no current
618 135212 50       243502 defined($self->{token} = $self->{class}->new($char)) or return undef;
619             }
620              
621 274989         569931 1;
622             }
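The C<__TOKENIZER__on_char> protocol driving the dispatch above has four outcomes: C<undef> (error), C<0> (the class consumed the character itself), C<1> (append the character to the current token), or a class name (start or continue a token of that class). A simplified Python model of the class-name and append branches (the function, the bare class names, and the state dict are illustrative assumptions, not PPI's API):

```python
def process_char(state, result, char):
    """Model the dispatch in _process_next_char. `state` holds the
    current token as a (class_name, content) tuple or None, plus the
    current parse class and the finished-token list. `result` is what
    on_char returned: 1, or a class-name string."""
    cls, token = state["class"], state["token"]
    if result == 1:
        # The character is part of the current token; append it.
        token = token or (cls, "")
        state["token"] = (token[0], token[1] + char)
        return state
    # Otherwise `result` names a token class.
    if cls != result:
        # New class: finalize the old token, start a new one.
        if token is not None:
            state["tokens"].append(token)
        state["token"] = (result, char)
        state["class"] = result
    else:
        # Same class as current: extend (or start) the token.
        token = token or (cls, "")
        state["token"] = (token[0], token[1] + char)
    return state
```

This is why tokens "change classes as each letter is added": the class returned for each character decides whether the current token continues or a new one begins.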
623              
624             sub _at_line_end {
625 453121     453121   510739 my ($self) = @_;
626 453121         839077 return $self->{line_cursor} >= $self->{line_length};
627             }
628              
629              
630              
631              
632              
633             #####################################################################
634             # Altering Tokens in Tokenizer
635              
636             # Finish the end of a token.
637             # Returns the resulting parse class as a convenience.
638             sub _finalize_token {
639 406286     406286   407453 my $self = shift;
640 406286 100       582846 return $self->{class} unless defined $self->{token};
641              
642             # Add the token to the token buffer
643 375995         356560 push @{ $self->{tokens} }, $self->{token};
  375995         567903  
644 375995         428491 $self->{token} = undef;
645              
646             # Return the parse class to that of the zone we are in
647 375995         727951 $self->{class} = $self->{zone};
648             }
649              
650             # Creates a new token and sets it in the tokenizer
651             # The defined() calls in here prevent a ton of calls to PPI::Util::TRUE
652             sub _new_token {
653 240781     240781   242326 my $self = shift;
654             # throw PPI::Exception() unless @_;
655 240781 100       426631 my $class = substr( $_[0], 0, 12 ) eq 'PPI::Token::'
656             ? shift : 'PPI::Token::' . shift;
657              
658             # Finalize any existing token
659 240781 100       511821 $self->_finalize_token if defined $self->{token};
660              
661             # Create the new token and update the parse class
662 240781 50       571862 defined($self->{token} = $class->new($_[0])) or PPI::Exception->throw;
663 240781         310980 $self->{class} = $class;
664              
665 240781         283779 1;
666             }
667              
668             # At the end of the file, we need to clean up the results of the erroneous
669             # space that we inserted at the beginning of the process.
670             sub _clean_eof {
671 16891     16891   19220 my $self = shift;
672              
673             # Finish any partially completed token
674 16891 100       31433 $self->_finalize_token if $self->{token};
675              
676             # Find the last token, and if it has no content, kill it.
677             # There appears to be some evidence that such "null tokens" are
678             # somehow getting created accidentally.
679 16891         22979 my $last_token = $self->{tokens}->[ -1 ];
680 16891 50       31368 unless ( length $last_token->{content} ) {
681 0         0 pop @{$self->{tokens}};
  0         0  
682             }
683              
684             # Now, if the last character of the last token is a space we added,
685             # chop it off, deleting the token if there's nothing else left.
686 16891 100       29927 if ( $self->{source_eof_chop} ) {
687 15477         21296 $last_token = $self->{tokens}->[ -1 ];
688 15477         66420 $last_token->{content} =~ s/ $//;
689 15477 100       30593 unless ( length $last_token->{content} ) {
690             # Popping token
691 13196         14687 pop @{$self->{tokens}};
  13196         18706  
692             }
693              
694             # The hack involving adding an extra space is now reversed, and
695             # now nobody will ever know. The perfect crime!
696 15477         24802 $self->{source_eof_chop} = '';
697             }
698              
699 16891         48357 1;
700             }
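The reversal performed by C<_clean_eof> -- drop an accidental empty last token, then strip the single padding space added at construction time, deleting the token if nothing remains -- can be sketched as follows (function name and the [class, content] list model are assumptions for illustration):

```python
def clean_eof(tokens, eof_chop):
    """Sketch of _clean_eof. `tokens` is a list of [class, content]
    pairs; `eof_chop` is the flag set when a padding space was added."""
    # Kill a trailing "null token" with no content, if one exists.
    if tokens and tokens[-1][1] == "":
        tokens.pop()
    # Undo the padding space: chop one trailing space off the last
    # token, deleting the token entirely if nothing else is left.
    if eof_chop and tokens:
        cls, content = tokens[-1]
        if content.endswith(" "):
            content = content[:-1]
        if content == "":
            tokens.pop()
        else:
            tokens[-1] = [cls, content]
    return tokens
```

Only one space is removed, matching the single space appended to the last source line in C<new>.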
701              
702              
703              
704              
705              
706             #####################################################################
707             # Utility Methods
708              
709             # Context
710             sub _last_token {
711 0     0   0 $_[0]->{tokens}->[-1];
712             }
713              
714             sub _last_significant_token {
715 3132     3132   4151 my $self = shift;
716 3132         3423 my $cursor = $#{ $self->{tokens} };
  3132         5137  
717 3132         6969 while ( $cursor >= 0 ) {
718 4199         6158 my $token = $self->{tokens}->[$cursor--];
719 4199 100       12602 return $token if $token->significant;
720             }
721 406         806 return;
722             }
723              
724             # Get a list of the previous significant tokens.
725             # Like _last_significant_token except it gets more than just one token.
726             # Returns a list with 0 to $count entries.
727             sub _previous_significant_tokens {
728 168509     168509   181230 my $self = shift;
729 168509   50     236306 my $count = shift || 1;
730 168509         154373 my $cursor = $#{ $self->{tokens} };
  168509         215124  
731              
732 168509         178009 my @tokens;
733 168509         251770 while ( $cursor >= 0 ) {
734 410241         436146 my $token = $self->{tokens}->[$cursor--];
735 410241 100       696060 next if not $token->significant;
736 291122         301094 push @tokens, $token;
737 291122 100       456958 last if @tokens >= $count;
738             }
739              
740 168509         319365 return @tokens;
741             }
742              
743             my %OBVIOUS_CLASS = (
744             'PPI::Token::Symbol' => 'operator',
745             'PPI::Token::Magic' => 'operator',
746             'PPI::Token::Number' => 'operator',
747             'PPI::Token::ArrayIndex' => 'operator',
748             'PPI::Token::Quote::Double' => 'operator',
749             'PPI::Token::Quote::Interpolate' => 'operator',
750             'PPI::Token::Quote::Literal' => 'operator',
751             'PPI::Token::Quote::Single' => 'operator',
752             'PPI::Token::QuoteLike::Backtick' => 'operator',
753             'PPI::Token::QuoteLike::Command' => 'operator',
754             'PPI::Token::QuoteLike::Readline' => 'operator',
755             'PPI::Token::QuoteLike::Regexp' => 'operator',
756             'PPI::Token::QuoteLike::Words' => 'operator',
757             );
758              
759             my %OBVIOUS_CONTENT = (
760             '(' => 'operand',
761             '{' => 'operand',
762             '[' => 'operand',
763             ';' => 'operand',
764             '}' => 'operator',
765             );
766              
767              
768             my %USUALLY_FORCES = map { $_ => 1 } qw( sub package use no );
769              
770             # Try to determine operator/operand context, if possible.
771             # Returns "operator", "operand", or "" if unknown.
772             sub _opcontext {
773 7255     7255   10161 my $self = shift;
774 7255         15476 my @tokens = $self->_previous_significant_tokens(1);
775 7255         10456 my $p0 = $tokens[0];
776 7255 100       21940 return '' if not $p0;
777 7136         11851 my $c0 = ref $p0;
778              
779             # Map the obvious cases
780 7136 100       26812 return $OBVIOUS_CLASS{$c0} if defined $OBVIOUS_CLASS{$c0};
781 2256 100       5641 return $OBVIOUS_CONTENT{$p0} if defined $OBVIOUS_CONTENT{$p0};
782              
783             # Most of the time after an operator, we are an operand
784 1688 100       7511 return 'operand' if $p0->isa('PPI::Token::Operator');
785              
786             # If there's NOTHING, it's operand
787 1483 50       2825 return 'operand' if $p0->content eq '';
788              
789             # Otherwise, we don't know
790 1483         3253 return ''
791             }
792              
793             # Assuming we are currently parsing the word 'x', return true
794             # if previous tokens imply the x is an operator, false otherwise.
795             sub _current_x_is_operator {
796 1119     1119   1563 my ( $self ) = @_;
797 1119 100       1220 return if !@{$self->{tokens}};
  1119         2588  
798              
799 912         2166 my ($prev, $prevprev) = $self->_previous_significant_tokens(2);
800 912 50       3077 return if !$prev;
801              
802 912 100       3786 return !$self->__current_token_is_forced_word if $prev->isa('PPI::Token::Word');
803              
804             return (!$prev->isa('PPI::Token::Operator') || $X_CAN_FOLLOW_OPERATOR{$prev})
805 755   100     6525 && (!$prev->isa('PPI::Token::Structure') || $X_CAN_FOLLOW_STRUCTURE{$prev})
806             && !$prev->isa('PPI::Token::Label')
807             ;
808             }
809              
810              
811             # Assuming we are at the end of parsing the current token that could be a word,
812             # a wordlike operator, or a version string, try to determine whether context
813             # before or after it forces it to be a bareword. This method is only useful
814             # during tokenization.
815             sub __current_token_is_forced_word {
816 33124     33124   53689 my ( $t, $word ) = @_;
817              
818             # Check if forced by preceding tokens.
819              
820 33124         53244 my ( $prev, $prevprev ) = $t->_previous_significant_tokens(2);
821 33124 100       78494 if ( !$prev ) {
822 8864         20367 pos $t->{line} = $t->{line_cursor};
823             }
824             else {
825 24260         42847 my $content = $prev->{content};
826              
827             # We are forced if we are a method name.
828             # '->' will always be an operator, so we don't check its type.
829 24260 100       42276 return 1 if $content eq '->';
830              
831             # If we are contained in a pair of curly braces, we are probably a
832             # forced bareword hash key. '{' is never a word or operator, so we
833             # don't check its type.
834 24131         53101 pos $t->{line} = $t->{line_cursor};
835 24131 100 100     69472 return 1 if $content eq '{' and $t->{line} =~ /\G\s*\}/gc;
836              
837             # sub, package, use, and no all indicate that what immediately follows
838             # is a word not an operator or (in the case of sub and package) a
839             # version string. However, we don't want to be fooled by 'package
840             # package v10' or 'use no v10'. We're a forced package unless we're
841             # preceded by 'package sub', in which case we're a version string.
842             # We also have to make sure that the sub/package/etc doing the forcing
843             # is not a method call.
844 23904 100       57936 if( $USUALLY_FORCES{$content}) {
845 5631 100 66     22214 return if defined $word and $word =~ /^v[0-9]+$/ and ( $content eq "use" or $content eq "no" );
      100        
      100        
846 5621 100       25594 return 1 if not $prevprev;
847 236 100 100     568 return 1 if not $USUALLY_FORCES{$prevprev->content} and $prevprev->content ne '->';
848 6         26 return;
849             }
850             }
851             # pos on $t->{line} is guaranteed to be set at this point.
852              
853             # Check if forced by following tokens.
854              
855             # If the word is followed by => it is probably a word, not a regex.
856 27137 100       72755 return 1 if $t->{line} =~ /\G\s*=>/gc;
857              
858             # Otherwise we probably aren't forced
859 26347         139680 return '';
860             }
861              
862             sub _current_token_has_signatures_active {
863 8512     8512   14148 my ($t) = @_;
864              
865             # Get at least the three previous significant tokens, and extend the
866             # retrieval range to include at least one token that can walk the
867             # already generated tree. (i.e. has a parent)
868 8512         16675 my ( $tokens_to_get, @tokens ) = (3);
869 8512   66     23749 while ( !@tokens or ( $tokens[-1] and !$tokens[-1]->parent ) ) {
      100        
870 24070         36777 @tokens = $t->_previous_significant_tokens($tokens_to_get);
871 24070 100       37595 last if @tokens < $tokens_to_get;
872 22774         80727 $tokens_to_get++;
873             }
874              
875 8512         24778 my ($closest_parented_token) = grep $_->parent, @tokens;
876 8512   66     25652 $closest_parented_token ||= $t->_document || $t->_document(PPI::Document->new);
      66        
877 8512         21501 return $closest_parented_token->presumed_features->{signatures}, @tokens;
878             }
879              
880             1;
881              
882             =pod
883              
884             =head1 NOTES
885              
886             =head2 How the Tokenizer Works
887              
888             Understanding the Tokenizer is not for the faint-hearted. It is by far
889             the most complex and twisty piece of perl I've ever written that is actually
890             still built properly and isn't a terrible spaghetti-like mess. In fact, you
891             probably want to skip this section.
892              
893             But if you really want to understand, well then here goes.
894              
895             =head2 Source Input and Clean Up
896              
897             The Tokenizer starts by taking source in a variety of forms, sucking it
898             all in and merging it into one big string, and then doing its own internal
899             line split, using a "universal line separator" which allows the Tokenizer
900             to take source for any platform (and even supports a few known types of
901             broken newlines caused by mixed mac/pc/*nix editor screw-ups).
902              
903             The resulting array of lines is used to feed the tokenizer, and is also
904             accessed directly by the heredoc-logic to do the line-oriented part of
905             here-doc support.
906              
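The universal line split described above can be sketched as a single split on all the common newline styles. This is an illustrative approximation, not the exact expression PPI uses; the helper name is hypothetical:

```perl
use strict;
use warnings;

# Split source into lines, keeping each newline attached to its line.
# Handles \015\012 (DOS), \012 (Unix), \015 (classic Mac), and the
# broken \015\015\012 sequence produced by some mixed-editor files.
sub split_universal_lines {
    my ($source) = @_;
    # split with a capturing group keeps the separators in the output;
    # limit -1 preserves any trailing empty field.
    my @parts = split /(\015{1,2}\012|\015|\012)/, $source, -1;
    my @lines;
    while (@parts) {
        my $line = shift @parts;
        my $sep  = @parts ? shift @parts : '';
        push @lines, $line . $sep if length( $line . $sep );
    }
    return @lines;
}
```

The resulting list is what a tokenizer like this would iterate over, one line-buffer at a time.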
907             =head2 Doing Things the Old Fashioned Way
908              
909             Due to the complexity of perl, and after 2 previously aborted parser
910             attempts, in the end the tokenizer was fashioned around a line-buffered
911             character-by-character method.
912              
913             That is, the Tokenizer pulls and holds a line at a time into a line buffer,
914             and then iterates a cursor along it. At each cursor position, a method is
915             called in whatever token class we are currently in, which will examine the
916             character at the current position, and handle it.
917              
918             As the handler methods in the various token classes are called, they
919             build up an output token array for the source code.
920              
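The shape of that loop can be sketched with a toy tokenizer. Everything here is illustrative, not PPI's real API: the real handlers are per-class methods, and real token objects carry far more state than a plain string.

```perl
package Toy::Tokenizer;
use strict;
use warnings;

# A minimal line-buffered, character-by-character cursor loop in the
# spirit described above.
sub new { bless { tokens => [], current => '' }, shift }

sub process_line {
    my ($self, $line) = @_;
    $self->{line_cursor} = -1;
    while ( ++$self->{line_cursor} < length $line ) {
        my $char = substr $line, $self->{line_cursor}, 1;
        $self->on_char($char);        # handler for the "current class"
    }
    $self->finalize;                  # flush the partial token at EOL
}

# A toy handler: break the line into word and non-word tokens.
sub on_char {
    my ($self, $char) = @_;
    my $is_word  = $char =~ /\w/ ? 1 : 0;
    my $cur_word = $self->{current} =~ /^\w/ ? 1 : 0;
    $self->finalize
        if length $self->{current} and $cur_word != $is_word;
    $self->{current} .= $char;
}

# Push the accumulated token onto the output array and reset.
sub finalize {
    my $self = shift;
    push @{ $self->{tokens} }, $self->{current} if length $self->{current};
    $self->{current} = '';
}
```

The `finalize` step here corresponds loosely to the token finalization described under "Current Token" below.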
921             Various parts of the Tokenizer use look-ahead, arbitrary-distance
922             look-behind (although currently the maximum is three significant tokens),
923             or both, and various other heuristic guesses.
924              
925             I've been told it is officially termed a I<"backtracking parser
926             with infinite lookaheads">.
927              
928             =head2 State Variables
929              
930             Aside from the current line and the character cursor, the Tokenizer
931             maintains a number of different state variables.
932              
933             =over
934              
935             =item Current Class
936              
937             The Tokenizer maintains the current token class at all times. Much of the
938             time this is just going to be the "Whitespace" class, which is the base
939             class of a document. As the tokenizer executes the various character
940             handlers, the class changes a lot as it moves along. In fact, in some
941             instances, the character handler may not handle the character directly
942             itself, but rather change the "current class" and then hand off to the
943             character handler for the new class.
944              
945             Because of this, and some other things I'll deal with later, the number of
946             times the character handlers are called does not in fact have a direct
947             relationship to the number of actual characters in the document.
948              
949             =item Current Zone
950              
951             Rather than create a class stack to allow for infinitely nested layers of
952             classes, the Tokenizer recognises just a single layer.
953              
954             To put it a different way, in various parts of the file, the Tokenizer will
955             recognise different "base" or "substrate" classes. When a Token such as a
956             comment or a number is finalised by the tokenizer, it "falls back" to the
957             base state.
958              
959             This allows proper tokenization of special areas such as __DATA__
960             and __END__ blocks, which also contain things like comments and POD,
961             without allowing the creation of any significant Tokens inside these areas.
962              
963             For the main part of a document we use L<PPI::Token::Whitespace> for this,
964             with the idea being that code is "floating in a sea of whitespace".
965              
966             =item Current Token
967              
968             The final main state variable is the "current token". This is the Token
969             that is currently being built by the Tokenizer. For certain types, it
970             can be manipulated and morphed and change class quite a bit while being
971             assembled, as the Tokenizer's understanding of the token content changes.
972              
973             When the Tokenizer is confident that it has seen the end of the Token, it
974             will be "finalized", which adds it to the output token array and resets
975             the current class to that of the zone that we are currently in.
976              
977             I should also note at this point that the "current token" variable is
978             optional. The Tokenizer is capable of knowing what class it is currently
979             set to, without actually having accumulated any characters in the Token.
980              
981             =back
982              
983             =head2 Making It Faster
984              
985             As I'm sure you can imagine, calling several different methods for each
986             character and running regexes and other complex heuristics made the first
987             fully working version of the tokenizer extremely slow.
988              
989             During testing, I created a metric to measure parsing speed called
990             LPGC, or "lines per gigacycle". A gigacycle is simply a billion CPU
991             cycles on a typical single-core CPU, and so a Tokenizer running at
992             "1000 lines per gigacycle" should tokenize around 1200 lines of code
993             per second when running on a 1200 MHz processor.
994              
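The arithmetic behind the metric works out as follows (a worked example, not code from the module):

```perl
use strict;
use warnings;

# lines/second = (lines per gigacycle) * (clock speed in GHz),
# since a 1.2 GHz CPU executes 1.2 gigacycles per second.
my $lpgc             = 1000;   # lines per gigacycle
my $clock_ghz        = 1.2;    # 1200 MHz processor
my $lines_per_second = $lpgc * $clock_ghz;
```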
995             The first working version of the tokenizer ran at only 350 LPGC, so
996             tokenizing a typical large module such as L<ExtUtils::MakeMaker> took
997             10-15 seconds. This sluggishness made it impractical for many uses.
998              
999             So in the current parser, there are multiple layers of optimisation
1000             very carefully built into the basic code. This has brought the tokenizer
1001             up to a more reasonable 1000 LPGC, at the expense of making the code
1002             quite a bit twistier.
1003              
1004             =head2 Making It Faster - Whole Line Classification
1005              
1006             The first step in the optimisation process was to add a new handler to
1007             enable several of the more basic classes (whitespace, comments) to be
1008             able to be parsed a line at a time. At the start of each line, a
1009             special optional handler (only supported by a few classes) is called to
1010             check and see if the entire line can be parsed in one go.
1011              
1012             This is used mainly to handle things like POD, comments, empty lines,
1013             and a few other minor special cases.
1014              
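A sketch of such a whole-line check might look like the following. The function name and the single-function shape are hypothetical; in PPI the real mechanism is an optional per-class line-start handler:

```perl
use strict;
use warnings;

# Return a token type if the entire line can be consumed in one go,
# or undef to fall back to char-by-char processing.
sub classify_whole_line {
    my ($line) = @_;
    return 'whitespace' if $line =~ /^\s*$/;   # blank line
    return 'comment'    if $line =~ /^\s*#/;   # full-line comment
    return 'pod'        if $line =~ /^=\w+/;   # POD directive line
    return undef;                              # needs the slow path
}
```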
1015             =head2 Making It Faster - Inlining
1016              
1017             The second stage of the optimisation involved inlining a small
1018             number of critical methods that were repeated an extremely high number
1019             of times. Profiling suggested that there were about 1,000,000 individual
1020             method calls per gigacycle, and by cutting these by two thirds a significant
1021             speed improvement was gained, on the order of about 50%.
1022              
1023             You may notice that many methods in the C<PPI::Tokenizer> code look
1024             very nested and written longhand. This is primarily due to this inlining.
1025              
1026             At around this time, some statistics code that existed in the early
1027             versions of the parser was also removed, as it was determined that
1028             it was consuming around 15% of the CPU for the entire parser, while
1029             making the core more complicated.
1030              
1031             A judgment call was made that with the difficulties likely to be
1032             encountered with future planned enhancements, and given the relatively
1033             high cost involved, the statistics features would be removed from the
1034             Tokenizer.
1035              
1036             =head2 Making It Faster - Quote Engine
1037              
1038             Once inlining had reached diminishing returns, it became obvious from
1039             the profiling results that a huge amount of time was being spent
1040             stepping a char at a time through long, simple and "syntactically boring"
1041             code such as comments and strings.
1042              
1043             The existing regex engine was expanded to also encompass quotes and
1044             other quote-like things, and a special abstract base class was added
1045             that provided a number of specialised parsing methods that would "scan
1046             ahead", jumping forward to find the end of a string, and updating
1047             the cursor to leave it in a valid position for the next call.
1048              
1049             This is also the point at which the number of character handler calls began
1050             to greatly differ from the number of characters. But it has been done
1051             in a way that allows the parser to retain the power of the original
1052             version at the critical points, while skipping through the "boring bits"
1053             as needed for additional speed.
1054              
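The core of such a scan-ahead can be sketched as a single regex that jumps from the cursor straight to the closing delimiter, treating backslash-escaped characters as opaque. This is an illustration of the technique, not the Quote Engine's actual API:

```perl
use strict;
use warnings;

# Scan ahead from $pos for the next unescaped occurrence of $char,
# returning the cursor position just past it, or -1 if not found.
sub scan_for_unescaped {
    my ($line, $pos, $char) = @_;
    pos($line) = $pos;
    # \\. consumes any escaped character in one gulp; the character
    # class consumes everything that is neither a backslash nor the
    # delimiter; then the delimiter itself must match.
    return $line =~ /\G(?:\\.|[^\\\Q$char\E])*\Q$char\E/gs
        ? pos($line)
        : -1;
}
```

One regex match replaces what would otherwise be dozens or hundreds of character-handler calls, which is where the speed comes from.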
1055             The addition of this feature allowed the tokenizer to exceed 1000 LPGC
1056             for the first time.
1057              
1058             =head2 Making It Faster - The "Complete" Mechanism
1059              
1060             As it became evident that great speed increases were available by using
1061             this "skipping ahead" mechanism, a new handler method was added that
1062             explicitly handles the parsing of an entire token, where the structure
1063             of the token is relatively simple. Tokens such as symbols fit this case,
1064             as once we are past the initial sigil and word char, we know that we
1065             can skip ahead and "complete" the rest of the token much more easily.
1066              
1067             A number of these have been added for most or possibly all of the common
1068             cases, with most of these "complete" handlers implemented using regular
1069             expressions.
1070              
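For instance, completing a scalar symbol once the sigil has been seen can be done with a single anchored regex. This is a sketch of the idea, not PPI's actual complete-handler:

```perl
use strict;
use warnings;

# Given the line and a cursor sitting on a '$' sigil, consume the
# whole symbol in one step and return (token, new_cursor), or the
# empty list if no symbol starts here.
sub complete_symbol {
    my ($line, $cursor) = @_;
    pos($line) = $cursor;
    # \w+ segments joined by '::' (or the archaic ') separators.
    if ( $line =~ /\G(\$\w+(?:(?:::|')\w+)*)/gc ) {
        return ( $1, pos($line) );
    }
    return;
}
```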
1071             In fact, so many have been added that at this point, you could arguably
1072             reclassify the tokenizer as a "hybrid regex, char-by-char heuristic
1073             tokenizer". More tokens are now consumed in "complete" methods in a
1074             typical program than are handled by the normal char-by-char methods.
1075              
1076             Many of these complete-handlers were implemented during the writing
1077             of the Lexer, and this has allowed the full parser to maintain around
1078             1000 LPGC despite the increasing weight of the Lexer.
1079              
1080             =head2 Making It Faster - Porting To C (In Progress)
1081              
1082             While it would be extraordinarily difficult to port all of the Tokenizer
1083             to C, work has started on a L<PPI::XS> "accelerator" package which acts as
1084             a separate and automatically-detected add-on to the main PPI package.
1085              
1086             L<PPI::XS> implements faster versions of a variety of functions scattered
1087             over the entire PPI codebase, from the Tokenizer Core, Quote Engine, and
1088             various other places, and implements them identically in XS/C.
1089              
1090             In particular, the skip-ahead methods from the Quote Engine would appear
1091             to be extremely amenable to being done in C, and a number of other
1092             functions could be cherry-picked one at a time and implemented in C.
1093              
1094             Each method is heavily tested to ensure that the functionality is
1095             identical, and a versioning mechanism is included to ensure that if a
1096             function gets out of sync, L<PPI::XS> will degrade gracefully and just
1097             not replace that single method.
1098              
1099             =head1 TO DO
1100              
1101             - Add an option to reset or seek the token stream...
1102              
1103             - Implement more Tokenizer functions in L<PPI::XS>
1104              
1105             =head1 SUPPORT
1106              
1107             See the L<support section|PPI/SUPPORT> in the main module.
1108              
1109             =head1 AUTHOR
1110              
1111             Adam Kennedy E<lt>adamk@cpan.orgE<gt>
1112              
1113             =head1 COPYRIGHT
1114              
1115             Copyright 2001 - 2011 Adam Kennedy.
1116              
1117             This program is free software; you can redistribute
1118             it and/or modify it under the same terms as Perl itself.
1119              
1120             The full text of the license can be found in the
1121             LICENSE file included with this module.
1122              
1123             =cut