File Coverage

blib/lib/PPI/Tokenizer.pm
Criterion Covered Total %
statement 221 252 87.7
branch 113 144 78.4
condition 27 37 72.9
subroutine 29 33 87.8
pod 5 5 100.0
total 395 471 83.8


line stmt bran cond sub pod time code
1             package PPI::Tokenizer;
2              
3             =pod
4              
5             =head1 NAME
6              
7             PPI::Tokenizer - The Perl Document Tokenizer
8              
9             =head1 SYNOPSIS
10              
11             # Create a tokenizer for a file, array or string
12             $Tokenizer = PPI::Tokenizer->new( 'filename.pl' );
13             $Tokenizer = PPI::Tokenizer->new( \@lines );
14             $Tokenizer = PPI::Tokenizer->new( \$source );
15            
16             # Return all the tokens for the document
17             my $tokens = $Tokenizer->all_tokens;
18            
19             # Or we can use it as an iterator
20             while ( my $Token = $Tokenizer->get_token ) {
21             print "Found token '$Token'\n";
22             }
23            
24             # If we REALLY need to manually nudge the cursor, you
25             # can do that too (the lexer needs this ability to do rollbacks)
26             $is_incremented = $Tokenizer->increment_cursor;
27             $is_decremented = $Tokenizer->decrement_cursor;
28              
29             =head1 DESCRIPTION
30              
31             PPI::Tokenizer is the class that provides Tokenizer objects for use in
32             breaking strings of Perl source code into Tokens.
33              
34             By the time you are reading this, you probably need to know a little
35             about the difference between how perl parses Perl "code" and how PPI
36             parses Perl "documents".
37              
38             "perl" itself (the interpreter) uses a heavily modified lex specification
39             to specify its parsing logic, maintains several types of state as it
40             goes, and incrementally tokenizes, lexes AND EXECUTES at the same time.
41              
42             In fact, it is provably impossible to use perl's parsing method without
43             simultaneously executing code. A formal mathematical proof has been
44             published demonstrating the method.
45              
46             This is where the truism "Only perl can parse Perl" comes from.
47              
48             PPI uses a completely different approach by abandoning the (impossible)
49             ability to parse Perl the same way that the interpreter does, and instead
50             parsing the source as a document, using a document structure independently
51             derived from the Perl documentation and approximating the perl interpreter
52             interpretation as closely as possible.
53              
54             It was touch and go for a long time whether we could get it close enough,
55             but in the end it turned out that it could be done.
56              
57             In this approach, the tokenizer C<PPI::Tokenizer> is implemented
58             separately from the lexer L<PPI::Lexer>.
59              
60             The job of C<PPI::Tokenizer> is to take pure source as a string and break it
61             up into a stream/set of tokens, and contains most of the "black magic" used
62             in PPI. By comparison, the lexer implements a relatively straightforward
63             tree structure, and has an implementation that is uncomplicated (compared
64             to the insanity in the tokenizer at least).
65              
66             The Tokenizer uses an immense amount of heuristics, guessing and cruft,
67             supported by a very B<flexible> internal API, but fortunately it was
68             possible to largely encapsulate the black magic, so there is not a lot that
69             gets exposed to people using the C<PPI::Tokenizer> itself.
70              
71             =head1 METHODS
72              
73             Despite the incredible complexity, the Tokenizer itself only exposes a
74             relatively small number of methods, with most of the complexity implemented
75             in private methods.
76              
77             =cut
78              
79             # Make sure everything we need is loaded so
80             # we don't have to go and load all of PPI.
81 66     66   478 use strict;
  66         159  
  66         3249  
82 66     66   400 use Params::Util qw{_INSTANCE _SCALAR0 _ARRAY0};
  66         180  
  66         5875  
83 66     66   515 use List::Util 1.33 ();
  66         1923  
  66         1663  
84 66     66   359 use PPI::Util ();
  66         147  
  66         1715  
85 66     66   550 use PPI::Element ();
  66         147  
  66         1398  
86 66     66   389 use PPI::Token ();
  66         159  
  66         1410  
87 66     66   368 use PPI::Exception ();
  66         138  
  66         1476  
88 66     66   33814 use PPI::Exception::ParserRejection ();
  66         227  
  66         1611  
89 66     66   379 use PPI::Document ();
  66         153  
  66         262928  
90              
91             our $VERSION = '1.28401'; # TRIAL
92              
93             # The x operator cannot follow most Perl operators, implying that
94             # anything beginning with x following an operator is a word.
95             # These are the exceptions.
96             my %X_CAN_FOLLOW_OPERATOR = map { $_ => 1 } qw( -- ++ );
97              
98             # The x operator cannot follow most structure elements, implying that
99             # anything beginning with x following a structure element is a word.
100             # These are the exceptions.
101             my %X_CAN_FOLLOW_STRUCTURE = map { $_ => 1 } qw( } ] \) );
102              
103             # Something that looks like the x operator but follows a word
104             # is usually that word's argument.
105             # These are the exceptions.
106             # chop, chomp, dump are ambiguous because they can have either parms
107             # or no parms.
108             my %X_CAN_FOLLOW_WORD = map { $_ => 1 } qw(
109             endgrent
110             endhostent
111             endnetent
112             endprotoent
113             endpwent
114             endservent
115             fork
116             getgrent
117             gethostent
118             getlogin
119             getnetent
120             getppid
121             getprotoent
122             getpwent
123             getservent
124             setgrent
125             setpwent
126             time
127             times
128             wait
129             wantarray
130             __SUB__
131             );
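The exception tables above drive a simple lookup: after most words, a following C<x> is usually that word's argument, but after the argument-less builtins listed in C<%X_CAN_FOLLOW_WORD> it can be the repetition operator. A minimal Python sketch of that lookup (illustrative only, with a shortened table; not PPI's actual code):

```python
# Abbreviated stand-in for %X_CAN_FOLLOW_WORD above: builtins that take
# no arguments, so an "x" after them can be the repetition operator.
X_CAN_FOLLOW_WORD = {"fork", "time", "times", "wait", "wantarray", "__SUB__"}

def classify_x_after_word(previous_word):
    """How should an 'x' immediately following a word be read?"""
    if previous_word in X_CAN_FOLLOW_WORD:
        return "operator"  # e.g. "time x 3": x is the repetition operator
    return "word"          # otherwise x begins an argument word
```

The real tokenizer makes this decision character by character while building tokens, but the table-lookup core is the same.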
132              
133              
134              
135             #####################################################################
136             # Creation and Initialization
137              
138             =pod
139              
140             =head2 new $file | \@lines | \$source
141              
142             The main C<new> constructor creates a new Tokenizer object. These
143             objects have no configuration parameters, and can only be used once,
144             to tokenize a single perl source file.
145              
146             It takes as argument either a normal scalar containing source code,
147             a reference to a scalar containing source code, or a reference to an
148             ARRAY containing newline-terminated lines of source code.
149              
150             Returns a new C<PPI::Tokenizer> object on success, or throws a
151             L<PPI::Exception> exception on error.
152              
153             =cut
154              
155             sub new {
156 16891   33 16891 1 84199 my $class = ref($_[0]) || $_[0];
157              
158             # Create the empty tokenizer struct
159 16891         248035 my $self = bless {
160             # Source code
161             source => undef,
162             source_bytes => undef,
163             document => undef,
164              
165             # Line buffer
166             line => undef,
167             line_length => undef,
168             line_cursor => undef,
169             line_count => 0,
170              
171             # Parse state
172             token => undef,
173             class => 'PPI::Token::BOM',
174             zone => 'PPI::Token::Whitespace',
175             features => undef,
176              
177             # Output token buffer
178             tokens => [],
179             token_cursor => 0,
180             token_eof => 0,
181              
182             # Perl 6 blocks
183             perl6 => [],
184             }, $class;
185              
186 16891 50       120766 if ( ! defined $_[1] ) {
    100          
    50          
    0          
187             # We weren't given anything
188 0         0 PPI::Exception->throw("No source provided to Tokenizer");
189              
190             } elsif ( ! ref $_[1] ) {
191 506         3125 my $source = PPI::Util::_slurp($_[1]);
192 506 50       2002 if ( ref $source ) {
193             # Content returned by reference
194 506         2295 $self->{source} = $$source;
195             } else {
196             # Errors returned as a string
197 0         0 return( $source );
198             }
199              
200             } elsif ( _SCALAR0($_[1]) ) {
201 16385         31669 $self->{source} = ${$_[1]};
  16385         58660  
202              
203             } elsif ( _ARRAY0($_[1]) ) {
204 0         0 $self->{source} = join '', map { "$_\n" } @{$_[1]};
  0         0  
  0         0  
205              
206             } else {
207             # We don't support whatever this is
208 0         0 PPI::Exception->throw(ref($_[1]) . " is not supported as a source provider");
209             }
210              
211             # We can't handle a null string
212 16891         60945 $self->{source_bytes} = length $self->{source};
213 16891 100       54311 if ( $self->{source_bytes} ) {
214             # Split on local newlines
215 16887         787563 $self->{source} =~ s/(?:\015{1,2}\012|\015|\012)/\n/g;
216 16887         391689 $self->{source} = [ split /(?<=\n)/, $self->{source} ];
217              
218             } else {
219 4         13 $self->{source} = [ ];
220             }
221              
222             ### EVIL
223             # I'm explaining this earlier than I should so you can understand
224             # why I'm about to do something that looks very strange. There's
225             # a problem with the Tokenizer, in that tokens tend to change
226             # classes as each letter is added, but they don't get allocated
227             # their definite final class until the "end" of the token, the
228             # detection of which occurs in about a hundred different places,
229             # all through various crufty code (that triples the speed).
230             #
231             # However, in general, this does not apply to tokens in which a
232             # whitespace character is valid, such as comments, whitespace and
233             # big strings.
234             #
235             # So what we do is add a space to the end of the source. This
236             # triggers normal "end of token" functionality for all cases. Then,
237             # once the tokenizer hits end of file, it examines the last token to
238             # manually either remove the ' ' token, or chop it off the end of
239             # a longer one in which the space would be valid.
240 16891 100   72975   106565 if ( List::Util::any { /^__(?:DATA|END)__\s*$/ } @{$self->{source}} ) {
  72975 100       240233  
  16891 100       126295  
241 10         41 $self->{source_eof_chop} = '';
242             } elsif ( ! defined $self->{source}->[0] ) {
243 4         12 $self->{source_eof_chop} = '';
244             } elsif ( $self->{source}->[-1] =~ /\s$/ ) {
245 1127         4479 $self->{source_eof_chop} = '';
246             } else {
247 15750         48124 $self->{source_eof_chop} = 1;
248 15750         46873 $self->{source}->[-1] .= ' ';
249             }
250              
251 16891         102010 $self;
252             }
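The constructor above does two notable things to the source before tokenizing: it normalizes CRLF/CR/LF line endings to C<\n> and splits while keeping each newline attached to its line, and it appends a sentinel space to the final line so every token gets a clean "end of token" trigger, remembering to chop that space off later. A small Python sketch of those two tricks (illustrative only; it omits the C<__DATA__>/C<__END__> special case):

```python
import re

def prepare_source(source):
    """Normalize newlines, split into newline-terminated lines, and
    append a sentinel space to the last line if needed."""
    # Mirror of the Perl regex s/(?:\015{1,2}\012|\015|\012)/\n/g
    source = re.sub(r"\015{1,2}\012|\015|\012", "\n", source)
    # Split *after* each newline so the "\n" stays on its line
    lines = [l for l in re.split(r"(?<=\n)", source) if l != ""]
    eof_chop = False
    if lines and not re.search(r"\s$", lines[-1]):
        lines[-1] += " "   # sentinel space, removed again at EOF
        eof_chop = True
    return lines, eof_chop
```

The `eof_chop` flag plays the role of `source_eof_chop`: at end of file the tokenizer strips the sentinel space back off the last token.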
253              
254             sub _document {
255 16800     16800   45277 my $self = shift;
256 16800 50       66372 return @_ ? $self->{document} = shift : $self->{document};
257             }
258              
259              
260              
261              
262              
263             #####################################################################
264             # Main Public Methods
265              
266             =pod
267              
268             =head2 get_token
269              
270             When using the PPI::Tokenizer object as an iterator, the C<get_token>
271             method is the primary method that is used. It increments the cursor
272             and returns the next Token in the output array.
273              
274             The actual parsing of the file is done only as-needed, and a line at
275             a time. When C<get_token> hits the end of the token array, it will
276             cause the parser to pull in the next line and parse it, continuing
277             as needed until there are more tokens on the output array that
278             get_token can then return.
279              
280             This means that a number of Tokenizer objects can be created, and
281             they won't consume significant CPU until you actually begin to pull
282             tokens from them.
283              
284             Returns a L<PPI::Token> object on success, C<0> if the Tokenizer has
285             reached the end of the file, or C<undef> on error.
286              
287             =cut
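The lazy-parsing contract described above can be sketched as a cursor over a token buffer that only does more work when the buffer runs dry. This Python stand-in is illustrative (names are hypothetical, and it models only the token/EOF cases, not the C<undef> error case):

```python
class TokenCursor:
    """Toy model of get_token: parse more input only on demand."""

    def __init__(self, pending_tokens):
        self._pending = list(pending_tokens)  # stands in for unparsed lines
        self._tokens = []                     # output token buffer
        self._cursor = 0

    def _process_next_line(self):
        """Pretend to tokenize one more line; False means end of input."""
        if not self._pending:
            return False
        self._tokens.append(self._pending.pop(0))
        return True

    def get_token(self):
        # Pull in more input only when the buffer is exhausted
        while self._cursor >= len(self._tokens):
            if not self._process_next_line():
                return 0  # like PPI: false-but-defined means end of file
        token = self._tokens[self._cursor]
        self._cursor += 1
        return token
```

As in the real API, callers loop `while token := cursor.get_token():` and stop cleanly at the C<0> end-of-file marker.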
288              
289             sub get_token {
290 394502     394502 1 632170 my $self = shift;
291              
292             # Shortcut for EOF
293 394502 50 66     1033047 if ( $self->{token_eof}
294 13486         47849 and $self->{token_cursor} > scalar @{$self->{tokens}}
295             ) {
296 0         0 return 0;
297             }
298              
299             # Return the next token if we can
300 394502 100       1467622 if ( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) {
301 316269         517557 $self->{token_cursor}++;
302 316269         1351530 return $token;
303             }
304              
305 78233         122912 my $line_rv;
306              
307             # Catch exceptions and return undef, so that we
308             # can start to convert code to exception-based code.
309 78233         136576 my $rv = eval {
310             # No token, we need to get some more
311 78233         226066 while ( $line_rv = $self->_process_next_line ) {
312             # If there is something in the buffer, return it
313             # The defined() prevents a ton of calls to PPI::Util::TRUE
314 69135 100       229279 if ( defined( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) ) {
315 47948         88345 $self->{token_cursor}++;
316 47948         134256 return $token;
317             }
318             }
319 30284         61552 return undef;
320             };
321 78233 100       290102 if ( $@ ) {
    100          
322 1 50       13 if ( _INSTANCE($@, 'PPI::Exception') ) {
323 1         4 $@->throw;
324             } else {
325 0         0 my $errstr = $@;
326 0         0 $errstr =~ s/^(.*) at line .+$/$1/;
327 0         0 PPI::Exception->throw( $errstr );
328             }
329             } elsif ( $rv ) {
330 47948         314547 return $rv;
331             }
332              
333 30284 50       66636 if ( defined $line_rv ) {
334             # End of file, but we can still return things from the buffer
335 30284 50       91918 if ( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) {
336 0         0 $self->{token_cursor}++;
337 0         0 return $token;
338             }
339              
340             # Set our token end of file flag
341 30284         55402 $self->{token_eof} = 1;
342 30284         156354 return 0;
343             }
344              
345             # Error, pass it up to our caller
346 0         0 undef;
347             }
348              
349             =pod
350              
351             =head2 all_tokens
352              
353             When not being used as an iterator, the C<all_tokens> method tells
354             the Tokenizer to parse the entire file and return all of the tokens
355             in a single ARRAY reference.
356              
357             It should be noted that C<all_tokens> does B<not> interfere with the
358             use of the Tokenizer object as an iterator (does not modify the token
359             cursor) and use of the two different mechanisms can be mixed safely.
360              
361             Returns a reference to an ARRAY of L<PPI::Token> objects on success
362             or throws an exception on error.
363              
364             =cut
365              
366             sub all_tokens {
367 5     5 1 34 my $self = shift;
368              
369             # Catch exceptions and return undef, so that we
370             # can start to convert code to exception-based code.
371 5         15 my $ok = eval {
372             # Process lines until we get EOF
373 5 50       21 unless ( $self->{token_eof} ) {
374 5         11 my $rv;
375 5         22 while ( $rv = $self->_process_next_line ) {}
376 5 50       14 unless ( defined $rv ) {
377 0         0 PPI::Exception->throw("Error while processing source");
378             }
379              
380             # Clean up the end of the tokenizer
381 5         19 $self->_clean_eof;
382             }
383 5         12 1;
384             };
385 5 50       18 if ( !$ok ) {
386 0         0 my $errstr = $@;
387 0         0 $errstr =~ s/^(.*) at line .+$/$1/;
388 0         0 PPI::Exception->throw( $errstr );
389             }
390              
391             # End of file, return a copy of the token array.
392 5         13 return [ @{$self->{tokens}} ];
  5         35  
393             }
394              
395             =pod
396              
397             =head2 increment_cursor
398              
399             Although exposed as a public method, C<increment_cursor> is implemented
400             for expert use only, when writing lexers or other components that work
401             directly on token streams.
402              
403             It manually increments the token cursor forward through the file, in effect
404             "skipping" the next token.
405              
406             Returns true if the cursor is incremented, C<0> if already at the end of
407             the file, or C<undef> on error.
408              
409             =cut
410              
411             sub increment_cursor {
412             # Do this via the get_token method, which makes sure there
413             # is actually a token there to move to.
414 0 0   0 1 0 $_[0]->get_token and 1;
415             }
416              
417             =pod
418              
419             =head2 decrement_cursor
420              
421             Although exposed as a public method, C<decrement_cursor> is implemented
422             for expert use only, when writing lexers or other components that work
423             directly on token streams.
424              
425             It manually decrements the token cursor backwards through the file, in
426             effect "rolling back" the token stream. And indeed that is what it is
427             primarily intended for, when the component that is consuming the token
428             stream needs to implement some sort of "roll back" feature in its use
429             of the token stream.
430              
431             Returns true if the cursor is decremented, C<0> if already at the
432             beginning of the file, or C<undef> on error.
433              
434             =cut
435              
436             sub decrement_cursor {
437 0     0 1 0 my $self = shift;
438              
439             # Check for the beginning of the file
440 0 0       0 return 0 unless $self->{token_cursor};
441              
442             # Decrement the token cursor
443 0         0 $self->{token_eof} = 0;
444 0         0 --$self->{token_cursor};
445             }
446              
447              
448              
449              
450              
451             #####################################################################
452             # Working With Source
453              
454             # Fetches the next line from the input line buffer
455             # Returns undef at EOF.
456             sub _get_line {
457 107380     107380   167078 my $self = shift;
458 107380 100       311279 return undef unless $self->{source}; # EOF hit previously
459              
460             # Pull off the next line
461 91367         143483 my $line = shift @{$self->{source}};
  91367         255242  
462              
463             # Flag EOF if we hit it
464 91367 100       245204 $self->{source} = undef unless defined $line;
465              
466             # Return the line (or EOF flag)
467 91367         231537 return $line; # string or undef
468             }
469              
470             # Fetches the next line, ready to process
471             # Returns 1 on success
472             # Returns 0 on EOF
473             sub _fill_line {
474 105000     105000   165830 my $self = shift;
475 105000         191301 my $inscan = shift;
476              
477             # Get the next line
478 105000         252049 my $line = $self->_get_line;
479 105000 100       246119 unless ( defined $line ) {
480             # End of file
481 32343 100       72993 unless ( $inscan ) {
482 30289         74640 delete $self->{line};
483 30289         54955 delete $self->{line_cursor};
484 30289         47345 delete $self->{line_length};
485 30289         92640 return 0;
486             }
487              
488             # In the scan version, just set the cursor to the end
489             # of the line, and the rest should just cascade out.
490 2054         4537 $self->{line_cursor} = $self->{line_length};
491 2054         5938 return 0;
492             }
493              
494             # Populate the appropriate variables
495 72657         185204 $self->{line} = $line;
496 72657         136566 $self->{line_cursor} = -1;
497 72657         143715 $self->{line_length} = length $line;
498 72657         129431 $self->{line_count}++;
499              
500 72657         205339 1;
501             }
502              
503             # Get the current character
504             sub _char {
505 0     0   0 my $self = shift;
506 0         0 substr( $self->{line}, $self->{line_cursor}, 1 );
507             }
508              
509              
510              
511              
512              
513             ####################################################################
514             # Per line processing methods
515              
516             # Processes the next line
517             # Returns 1 on successful completion
518             # Returns 0 if EOF
519             # Returns undef on error
520             sub _process_next_line {
521 99436     99436   168452 my $self = shift;
522              
523             # Fill the line buffer
524 99436         163550 my $rv;
525 99436 100       230198 unless ( $rv = $self->_fill_line ) {
526 30289 50       74055 return undef unless defined $rv;
527              
528             # End of file, finalize last token
529 30289         86188 $self->_finalize_token;
530 30289         88106 return 0;
531             }
532              
533             # Run the __TOKENIZER__on_line_start
534 69147         341768 $rv = $self->{class}->__TOKENIZER__on_line_start( $self );
535 69147 100       175106 unless ( $rv ) {
536             # If there are no more source lines, then clean up
537 28255 100 66     82873 if ( ref $self->{source} eq 'ARRAY' and ! @{$self->{source}} ) {
  28255         88190  
538 310         1083 $self->_clean_eof;
539             }
540              
541             # Defined but false means next line
542 28255 50       94991 return 1 if defined $rv;
543 0         0 PPI::Exception->throw("Error at line $self->{line_count}");
544             }
545              
546             # If we can't deal with the entire line, process char by char
547 40892         119768 while ( $rv = $self->_process_next_char ) {}
548 40891 50       110533 unless ( defined $rv ) {
549 0         0 PPI::Exception->throw("Error at line $self->{line_count}, character $self->{line_cursor}");
550             }
551              
552             # Trigger any action that needs to happen at the end of a line
553 40891         170517 $self->{class}->__TOKENIZER__on_line_end( $self );
554              
555             # If there are no more source lines, then clean up
556 40891 100 100     159199 unless ( ref($self->{source}) eq 'ARRAY' and @{$self->{source}} ) {
  38594         153727  
557 16576         55073 return $self->_clean_eof;
558             }
559              
560 24315         72155 return 1;
561             }
562              
563              
564              
565              
566              
567             #####################################################################
568             # Per-character processing methods
569              
570             # Process on a per-character basis.
571             # Note that due to the high number of times this gets
572             # called, it has been fairly heavily in-lined, so the code
573             # might look a bit ugly and duplicated.
574             sub _process_next_char {
575 454535     454535   728658 my $self = shift;
576              
577             ### FIXME - This checks for a screwed up condition that triggers
578             ### several warnings, amongst other things.
579 454535 50 33     1805215 if ( ! defined $self->{line_cursor} or ! defined $self->{line_length} ) {
580             # $DB::single = 1;
581 0         0 return undef;
582             }
583              
584 454535         710093 $self->{line_cursor}++;
585 454535 100       866484 return 0 if $self->_at_line_end;
586              
587             # Pass control to the token class
588 413644         636155 my $result;
589 413644 100       1332407 unless ( $result = $self->{class}->__TOKENIZER__on_char( $self ) ) {
590             # undef is error. 0 is "Did stuff ourself, you don't have to do anything"
591 78882 50       342537 return defined $result ? 1 : undef;
592             }
593              
594             # We will need the value of the current character
595 334761         806516 my $char = substr( $self->{line}, $self->{line_cursor}, 1 );
596 334761 100       758044 if ( $result eq '1' ) {
597             # If __TOKENIZER__on_char returns 1, it is signaling that it thinks that
598             # the character is part of it.
599              
600             # Add the character
601 59061 50       128112 if ( defined $self->{token} ) {
602 59061         149954 $self->{token}->{content} .= $char;
603             } else {
604 0 0       0 defined($self->{token} = $self->{class}->new($char)) or return undef;
605             }
606              
607 59061         199358 return 1;
608             }
609              
610             # We have been provided with the name of a class
611 275700 100       828179 if ( $self->{class} ne "PPI::Token::$result" ) {
    100          
612             # New class
613 106783         285265 $self->_new_token( $result, $char );
614             } elsif ( defined $self->{token} ) {
615             # Same class as current
616 33516         77442 $self->{token}->{content} .= $char;
617             } else {
618             # Same class, but no current
619 135401 50       424601 defined($self->{token} = $self->{class}->new($char)) or return undef;
620             }
621              
622 275700         922816 1;
623             }
624              
625             sub _at_line_end {
626 454535     454535   796325 my ($self) = @_;
627 454535         1333309 return $self->{line_cursor} >= $self->{line_length};
628             }
629              
630              
631              
632              
633              
634             #####################################################################
635             # Altering Tokens in Tokenizer
636              
637             # Finish the end of a token.
638             # Returns the resulting parse class as a convenience.
639             sub _finalize_token {
640 407744     407744   654981 my $self = shift;
641 407744 100       905021 return $self->{class} unless defined $self->{token};
642              
643             # Add the token to the token buffer
644 377453         542530 push @{ $self->{tokens} }, $self->{token};
  377453         907990  
645 377453         647310 $self->{token} = undef;
646              
647             # Return the parse class to that of the zone we are in
648 377453         1195313 $self->{class} = $self->{zone};
649             }
650              
651             # Creates a new token and sets it in the tokenizer
652             # The defined() in here prevent a ton of calls to PPI::Util::TRUE
653             sub _new_token {
654 242050     242050   370398 my $self = shift;
655             # throw PPI::Exception() unless @_;
656 242050 100       695468 my $class = substr( $_[0], 0, 12 ) eq 'PPI::Token::'
657             ? shift : 'PPI::Token::' . shift;
658              
659             # Finalize any existing token
660 242050 100       727180 $self->_finalize_token if defined $self->{token};
661              
662             # Create the new token and update the parse class
663 242050 50       954957 defined($self->{token} = $class->new($_[0])) or PPI::Exception->throw;
664 242050         472507 $self->{class} = $class;
665              
666 242050         435881 1;
667             }
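The class-name handling at the top of C<_new_token> accepts either a short suffix or a fully-qualified class name, using the fixed 12-character length of C<PPI::Token::> to test the prefix cheaply. The equivalent normalization in Python, as a sketch:

```python
PREFIX = "PPI::Token::"  # 12 characters, matching substr($_[0], 0, 12)

def normalize_token_class(name):
    """Accept 'Word' or 'PPI::Token::Word'; return the full class name."""
    return name if name.startswith(PREFIX) else PREFIX + name
```

This lets token classes hand back short names like `'Whitespace'` from their character handlers without every call site repeating the prefix.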
668              
669             # At the end of the file, we need to clean up the results of the erroneous
670             # space that we inserted at the beginning of the process.
671             sub _clean_eof {
672 16891     16891   32118 my $self = shift;
673              
674             # Finish any partially completed token
675 16891 100       53175 $self->_finalize_token if $self->{token};
676              
677             # Find the last token, and if it has no content, kill it.
678             # There appears to be some evidence that such "null tokens" are
679             # somehow getting created accidentally.
680 16891         37007 my $last_token = $self->{tokens}->[ -1 ];
681 16891 50       61524 unless ( length $last_token->{content} ) {
682 0         0 pop @{$self->{tokens}};
  0         0  
683             }
684              
685             # Now, if the last character of the last token is a space we added,
686             # chop it off, deleting the token if there's nothing else left.
687 16891 100       50629 if ( $self->{source_eof_chop} ) {
688 15478         37064 $last_token = $self->{tokens}->[ -1 ];
689 15478         107799 $last_token->{content} =~ s/ $//;
690 15478 100       58279 unless ( length $last_token->{content} ) {
691             # Popping token
692 13205         26475 pop @{$self->{tokens}};
  13205         30672  
693             }
694              
695             # The hack involving adding an extra space is now reversed, and
696             # now nobody will ever know. The perfect crime!
697 15478         42627 $self->{source_eof_chop} = '';
698             }
699              
700 16891         80661 1;
701             }
702              
703              
704              
705              
706              
707             #####################################################################
708             # Utility Methods
709              
710             # Context
711             sub _last_token {
712 0     0   0 $_[0]->{tokens}->[-1];
713             }
714              
715             sub _last_significant_token {
716 3090     3090   6509 my $self = shift;
717 3090         5514 my $cursor = $#{ $self->{tokens} };
  3090         8060  
718 3090         13122 while ( $cursor >= 0 ) {
719 4176         8680 my $token = $self->{tokens}->[$cursor--];
720 4176 100       19791 return $token if $token->significant;
721             }
722 401         1302 return;
723             }
724              
725             # Get an array ref of previous significant tokens.
726             # Like _last_significant_token except it gets more than just one token
727             # Returns array with 0 to x entries
728             sub _previous_significant_tokens {
729 153268     153268   241521 my $self = shift;
730 153268   50     343011 my $count = shift || 1;
731 153268         223479 my $cursor = $#{ $self->{tokens} };
  153268         327830  
732              
733 153268         260441 my @tokens;
734 153268         374339 while ( $cursor >= 0 ) {
735 248241         442262 my $token = $self->{tokens}->[$cursor--];
736 248241 100       745760 next if not $token->significant;
737 160199         294957 push @tokens, $token;
738 160199 100       416038 last if @tokens >= $count;
739             }
740              
741 153268         423548 return @tokens;
742             }
743              
744             my %OBVIOUS_CLASS = (
745             'PPI::Token::Symbol' => 'operator',
746             'PPI::Token::Magic' => 'operator',
747             'PPI::Token::Number' => 'operator',
748             'PPI::Token::ArrayIndex' => 'operator',
749             'PPI::Token::Quote::Double' => 'operator',
750             'PPI::Token::Quote::Interpolate' => 'operator',
751             'PPI::Token::Quote::Literal' => 'operator',
752             'PPI::Token::Quote::Single' => 'operator',
753             'PPI::Token::QuoteLike::Backtick' => 'operator',
754             'PPI::Token::QuoteLike::Command' => 'operator',
755             'PPI::Token::QuoteLike::Readline' => 'operator',
756             'PPI::Token::QuoteLike::Regexp' => 'operator',
757             'PPI::Token::QuoteLike::Words' => 'operator',
758             );
759              
760             my %OBVIOUS_CONTENT = (
761             '(' => 'operand',
762             '{' => 'operand',
763             '[' => 'operand',
764             ';' => 'operand',
765             '}' => 'operator',
766             );
767              
768              
769             my %USUALLY_FORCES = map { $_ => 1 } qw( sub package use no );
770              
771             # Try to determine operator/operand context, if possible.
772             # Returns "operator", "operand", or "" if unknown.
773             sub _opcontext {
774 7305     7305   12208 my $self = shift;
775 7305         19060 my @tokens = $self->_previous_significant_tokens(1);
776 7305         12406 my $p0 = $tokens[0];
777 7305 100       33830 return '' if not $p0;
778 7173         16108 my $c0 = ref $p0;
779              
780             # Map the obvious cases
781 7173 100       31992 return $OBVIOUS_CLASS{$c0} if defined $OBVIOUS_CLASS{$c0};
782 2257 100       9686 return $OBVIOUS_CONTENT{$p0} if defined $OBVIOUS_CONTENT{$p0};
783              
784             # Most of the time after an operator, we are an operand
785 1682 100       15548 return 'operand' if $p0->isa('PPI::Token::Operator');
786              
787             # If there's NOTHING, it's operand
788 1476 50       5027 return 'operand' if $p0->content eq '';
789              
790             # Otherwise, we don't know
791 1476         6074 return ''
792             }
793              
794             # Assuming we are currently parsing the word 'x', return true
795             # if previous tokens imply the x is an operator, false otherwise.
796             sub _current_x_is_operator {
797 1108     1108   2494 my ( $self ) = @_;
798 1108 100       1799 return if !@{$self->{tokens}};
  1108         4197  
799              
800 905         2979 my ($prev, $prevprev) = $self->_previous_significant_tokens(2);
801 905 50       4217 return if !$prev;
802              
803 905 100       7556 return !$self->__current_token_is_forced_word if $prev->isa('PPI::Token::Word');
804              
805             return (!$prev->isa('PPI::Token::Operator') || $X_CAN_FOLLOW_OPERATOR{$prev})
806 746   100     7417 && (!$prev->isa('PPI::Token::Structure') || $X_CAN_FOLLOW_STRUCTURE{$prev})
807             && !$prev->isa('PPI::Token::Label')
808             ;
809             }
810              
811              
812             # Assuming we are at the end of parsing the current token that could be a word,
813             # a wordlike operator, or a version string, try to determine whether context
814             # before or after it forces it to be a bareword. This method is only useful
815             # during tokenization.
816             sub __current_token_is_forced_word {
817 33255     33255   81083 my ( $t, $word ) = @_;
818              
819             # Check if forced by preceding tokens.
820              
821 33255         83848 my ( $prev, $prevprev ) = $t->_previous_significant_tokens(2);
822 33255 100       116539 if ( !$prev ) {
823 8867         36149 pos $t->{line} = $t->{line_cursor};
824             }
825             else {
826 24388         61750 my $content = $prev->{content};
827              
828             # We are forced if we are a method name.
829             # '->' will always be an operator, so we don't check its type.
830 24388 100       65742 return 1 if $content eq '->';
831              
832             # If we are contained in a pair of curly braces, we are probably a
833             # forced bareword hash key. '{' is never a word or operator, so we
834             # don't check its type.
835 24259         96072 pos $t->{line} = $t->{line_cursor};
836 24259 100 100     104532 return 1 if $content eq '{' and $t->{line} =~ /\G\s*\}/gc;
837              
838             # sub, package, use, and no all indicate that what immediately follows
839             # is a word not an operator or (in the case of sub and package) a
840             # version string. However, we don't want to be fooled by 'package
841             # package v10' or 'use no v10'. We're a forced package unless we're
842             # preceded by 'package sub', in which case we're a version string.
843             # We also have to make sure that the sub/package/etc doing the forcing
844             # is not a method call.
845 24031 100       80691 if( $USUALLY_FORCES{$content}) {
846 5631 100 66     34067 return if defined $word and $word =~ /^v[0-9]+$/ and ( $content eq "use" or $content eq "no" );
      100        
      100        
847 5621 100       36584 return 1 if not $prevprev;
848 236 100 100     989 return 1 if not $USUALLY_FORCES{$prevprev->content} and $prevprev->content ne '->';
849 6         46 return;
850             }
851             }
852             # pos on $t->{line} is guaranteed to be set at this point.
853              
854             # Check if forced by following tokens.
855              
856             # If the word is followed by => it is probably a word, not a regex.
857 27267 100       113639 return 1 if $t->{line} =~ /\G\s*=>/gc;
858              
859             # Otherwise we probably aren't forced
860 26477         230783 return '';
861             }
862              
863             sub _features {
864 35     35   60 my $self = shift;
865 35 50 0     132 return @_ ? $self->{features} = shift : $self->{features} || {};
866             }
867              
868             sub _current_token_has_signatures_active {
869 8494     8494   14675 my $self = shift;
870 8494         31700 return $self->{features}{signatures};
871             }
872              
873             1;
874              
875             =pod
876              
877             =head1 NOTES
878              
879             =head2 How the Tokenizer Works
880              
881             Understanding the Tokenizer is not for the faint-hearted. It is by far
882             the most complex and twisty piece of perl I've ever written that is actually
883             still built properly and isn't a terrible spaghetti-like mess. In fact, you
884             probably want to skip this section.
885              
886             But if you really want to understand, well then here goes.
887              
888             =head2 Source Input and Clean Up
889              
890             The Tokenizer starts by taking source in a variety of forms, sucking it
891             all in and merging into one big string, and doing our own internal line
892             split, using a "universal line separator" which allows the Tokenizer to
893             take source for any platform (and even supports a few known types of
894             broken newlines caused by mixed mac/pc/*nix editor screw ups).
895              
896             The resulting array of lines is used to feed the tokenizer, and is also
897             accessed directly by the heredoc-logic to do the line-oriented part of
898             here-doc support.
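
The idea can be sketched as follows. This is illustrative only, not PPI's
actual implementation: normalise all known newline styles to "\n", then
split while keeping the newline attached to each line.

```perl
# Sketch of a "universal line separator" split over an already-slurped
# source string. Handles Unix (\012), DOS (\015\012), classic Mac
# (\015) and a known doubled-CR breakage (\015\015\012).
my $source = "unix\012" . "dos\015\012" . "mac\015" . "mixed\015\015\012";
$source =~ s/(?:\015{1,2}\012|\015|\012)/\n/g;    # universal newline
my @lines = split /(?<=\n)/, $source;             # keep the "\n"
# @lines is now ("unix\n", "dos\n", "mac\n", "mixed\n")
```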
899              
900             =head2 Doing Things the Old Fashioned Way
901              
902             Due to the complexity of perl, and after 2 previously aborted parser
903             attempts, in the end the tokenizer was fashioned around a line-buffered
904             character-by-character method.
905              
906             That is, the Tokenizer pulls and holds a line at a time into a line buffer,
907             and then iterates a cursor along it. At each cursor position, a method is
908             called in whatever token class we are currently in, which will examine the
909             character at the current position, and handle it.
910              
911             As the handler methods in the various token classes are called, they
912             build up an output token array for the source code.
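
A highly simplified sketch of that loop follows. The boundary logic here
is invented for illustration; the real Tokenizer dispatches each character
to a handler method on whatever token class is current.

```perl
# Toy version of the line-buffered, character-by-character loop: walk
# a cursor along the line, accumulate characters into the current
# token, and finalize the token at a boundary.
my $line    = 'my $x;';
my $cursor  = 0;
my $current = '';
my @tokens;
while ( $cursor < length $line ) {
    my $char = substr $line, $cursor, 1;
    if ( $char =~ /[\s;]/ ) {                  # token boundary
        push @tokens, $current if length $current;
        push @tokens, $char if $char eq ';';
        $current = '';
    }
    else {
        $current .= $char;                     # grow the current token
    }
    $cursor++;
}
push @tokens, $current if length $current;
# @tokens is now ('my', '$x', ';')
```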
913              
914             Various parts of the Tokenizer use look-ahead, arbitrary-distance
915             look-behind (although currently the maximum is three significant tokens),
916             or both, and various other heuristic guesses.
917              
918             I've been told it is officially termed a I<"backtracking parser
919             with infinite lookaheads">.
920              
921             =head2 State Variables
922              
923             Aside from the current line and the character cursor, the Tokenizer
924             maintains a number of different state variables.
925              
926             =over
927              
928             =item Current Class
929              
930             The Tokenizer maintains the current token class at all times. Much of the
931             time is just going to be the "Whitespace" class, which is what the base of
932             a document is. As the tokenizer executes the various character handlers,
933             the class changes a lot as it moves along. In fact, in some instances,
934             the character handler may not handle the character directly itself, but
935             rather change the "current class" and then hand off to the character
936             handler for the new class.
937              
938             Because of this, and some other things I'll deal with later, the number of
939             times the character handlers are called does not in fact have a direct
940             relationship to the number of actual characters in the document.
941              
942             =item Current Zone
943              
944             Rather than create a class stack to allow for infinitely nested layers of
945             classes, the Tokenizer recognises just a single layer.
946              
947             To put it a different way, in various parts of the file, the Tokenizer will
948             recognise different "base" or "substrate" classes. When a Token such as a
949             comment or a number is finalised by the tokenizer, it "falls back" to the
950             base state.
951              
952             This allows proper tokenization of special areas such as __DATA__
953             and __END__ blocks, which also contain things like comments and POD,
954             without allowing the creation of any significant Tokens inside these areas.
955              
956             For the main part of a document we use L<PPI::Token::Whitespace> for this,
957             with the idea being that code is "floating in a sea of whitespace".
958              
959             =item Current Token
960              
961             The final main state variable is the "current token". This is the Token
962             that is currently being built by the Tokenizer. For certain types, it
963             can be manipulated and morphed and change class quite a bit while being
964             assembled, as the Tokenizer's understanding of the token content changes.
965              
966             When the Tokenizer is confident that it has seen the end of the Token, it
967             will be "finalized", which adds it to the output token array and resets
968             the current class to that of the zone that we are currently in.
969              
970             I should also note at this point that the "current token" variable is
971             optional. The Tokenizer is capable of knowing what class it is currently
972             set to, without actually having accumulated any characters in the Token.
973              
974             =back
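
Taken together, the state described above can be pictured as a hash. The
field names below mirror the discussion and are illustrative, not a
guaranteed description of the object's internals.

```perl
# Sketch of the Tokenizer's main state, for orientation only.
my %tokenizer_state = (
    class       => 'PPI::Token::Whitespace',  # Current Class
    zone        => 'PPI::Token::Whitespace',  # Current Zone ("base" class)
    token       => undef,                     # Current Token (optional)
    line        => '',                        # the line buffer
    line_cursor => 0,                         # the character cursor
    tokens      => [],                        # finalized output tokens
);
```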
975              
976             =head2 Making It Faster
977              
978             As I'm sure you can imagine, calling several different methods for each
979             character and running regexes and other complex heuristics made the first
980             fully working version of the tokenizer extremely slow.
981              
982             During testing, I created a metric to measure parsing speed called
983             LPGC, or "lines per gigacycle". A gigacycle is simply a billion CPU
984             cycles on a typical single-core CPU, and so a Tokenizer running at
985             "1000 lines per gigacycle" should tokenize around 1200 lines of code
986             per second when running on a 1200 MHz processor.
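
As a quick worked example of the metric:

```perl
# LPGC is lines tokenized divided by CPU gigacycles consumed.
my $lines      = 12_000;
my $cycles     = 12_000_000_000;               # 12 gigacycles
my $gigacycles = $cycles / 1_000_000_000;
my $lpgc       = $lines / $gigacycles;         # 1000 LPGC
# A 1200 MHz CPU burns 1.2 gigacycles per second, so at 1000 LPGC it
# tokenizes roughly 1000 * 1.2 = 1200 lines per second.
```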
987              
988             The first working version of the tokenizer ran at only 350 LPGC, so
989             tokenizing a typical large module such as L<ExtUtils::MakeMaker> took
990             10-15 seconds. This sluggishness made it impractical for many uses.
991              
992             So in the current parser, there are multiple layers of optimisation
993             very carefully built into the basic design. This has brought the tokenizer
994             up to a more reasonable 1000 LPGC, at the expense of making the code
995             quite a bit twistier.
996              
997             =head2 Making It Faster - Whole Line Classification
998              
999             The first step in the optimisation process was to add a new handler
1000             that enables several of the more basic classes (whitespace, comments)
1001             to be parsed a line at a time. At the start of each line, a
1002             special optional handler (only supported by a few classes) is called to
1003             check and see if the entire line can be parsed in one go.
1004              
1005             This is used mainly to handle things like POD, comments, empty lines,
1006             and a few other minor special cases.
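
The shape of such a check might look like the following. This is a hedged
sketch with an invented name and patterns; the real mechanism is an
optional line-start handler supported by only a few token classes.

```perl
# Invented whole-line classifier: if a cheap pattern matches the
# entire line's nature, the char-by-char machinery can be skipped.
sub classify_whole_line {
    my ($line) = @_;
    return 'comment'    if $line =~ /^\s*#/;
    return 'whitespace' if $line =~ /^\s*$/;
    return 'pod'        if $line =~ /^=\w+/;
    return undef;    # no shortcut; fall back to char-by-char parsing
}
```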
1007              
1008             =head2 Making It Faster - Inlining
1009              
1010             The second stage of the optimisation involved inlining a small
1011             number of critical methods that were repeated an extremely high number
1012             of times. Profiling suggested that there were about 1,000,000 individual
1013             method calls per gigacycle, and by cutting these by two thirds a significant
1014             speed improvement was gained, in the order of about 50%.
1015              
1016             You may notice that many methods in the C<PPI::Tokenizer> code look
1017             very nested and longhand. This is primarily due to this inlining.
1018              
1019             At around this time, some statistics code that existed in the early
1020             versions of the parser was also removed, as it was determined that
1021             it was consuming around 15% of the CPU for the entire parser, while
1022             making the core more complicated.
1023              
1024             A judgment call was made that with the difficulties likely to be
1025             encountered with future planned enhancements, and given the relatively
1026             high cost involved, the statistics features would be removed from the
1027             Tokenizer.
1028              
1029             =head2 Making It Faster - Quote Engine
1030              
1031             Once inlining had reached diminishing returns, it became obvious from
1032             the profiling results that a huge amount of time was being spent
1033             stepping a char at a time through long, simple and "syntactically boring"
1034             code such as comments and strings.
1035              
1036             The existing regex engine was expanded to also encompass quotes and
1037             other quote-like things, and a special abstract base class was added
1038             that provided a number of specialised parsing methods that would "scan
1039             ahead", looking out ahead to find the end of a string, and updating
1040             the cursor to leave it in a valid position for the next call.
1041              
1042             This is also the point at which the number of character handler calls began
1043             to greatly differ from the number of characters. But it has been done
1044             in a way that allows the parser to retain the power of the original
1045             version at the critical points, while skipping through the "boring bits"
1046             as needed for additional speed.
1047              
1048             The addition of this feature allowed the tokenizer to exceed 1000 LPGC
1049             for the first time.
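
In the spirit of that mechanism, a skip-ahead scan for a single-quoted
string might look like this. It is an illustrative sketch: the name and
signature are invented, not PPI's actual Quote Engine API.

```perl
# Given a line and a cursor sitting just past an opening "'", jump
# straight to the first unescaped closing quote in one regex step,
# returning the string body and the new cursor position.
sub scan_single_quote {
    my ( $line, $cursor ) = @_;
    pos($line) = $cursor;
    if ( $line =~ /\G((?:\\.|[^'\\])*)'/gc ) {
        return ( $1, pos($line) );    # content, cursor past close quote
    }
    return;    # string does not terminate on this line
}
```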
1050              
1051             =head2 Making It Faster - The "Complete" Mechanism
1052              
1053             As it became evident that great speed increases were available by using
1054             this "skipping ahead" mechanism, a new handler method was added that
1055             explicitly handles the parsing of an entire token, where the structure
1056             of the token is relatively simple. Tokens such as symbols fit this case,
1057             as once we are past the initial sigil and word char, we know that we
1058             can skip ahead and "complete" the rest of the token much more easily.
1059              
1060             A number of these have been added for most or possibly all of the common
1061             cases, with most of these "complete" handlers implemented using regular
1062             expressions.
1063              
1064             In fact, so many have been added that at this point, you could arguably
1065             reclassify the tokenizer as a "hybrid regex, char-by-char heuristic
1066             tokenizer". More tokens are now consumed in "complete" methods in a
1067             typical program than are handled by the normal char-by-char methods.
1068              
1069             Many of these complete-handlers were implemented during the writing
1070             of the Lexer, and this has allowed the full parser to maintain around
1071             1000 LPGC despite the increasing weight of the Lexer.
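
A sketch of what a "complete" handler for symbols does (again
illustrative; the real handlers live in the individual token classes):

```perl
# Once the sigil and first word character are seen, one regex can
# consume the entire symbol token, including package separators.
sub complete_symbol {
    my ( $line, $cursor ) = @_;    # cursor sits on the sigil
    pos($line) = $cursor;
    return unless $line =~ /\G([\$\@\%][A-Za-z_]\w*(?:::\w+)*)/gc;
    return ( $1, pos($line) );     # whole token, cursor past its end
}
```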
1072              
1073             =head2 Making It Faster - Porting To C (In Progress)
1074              
1075             While it would be extraordinarily difficult to port all of the Tokenizer
1076             to C, work has started on a L<PPI::XS> "accelerator" package which acts as
1077             a separate and automatically-detected add-on to the main PPI package.
1078              
1079             L<PPI::XS> implements faster versions of a variety of functions scattered
1080             over the entire PPI codebase, from the Tokenizer Core, Quote Engine, and
1081             various other places, and implements them identically in XS/C.
1082              
1083             In particular, the skip-ahead methods from the Quote Engine would appear
1084             to be extremely amenable to being done in C, and a number of other
1085             functions could be cherry-picked one at a time and implemented in C.
1086              
1087             Each method is heavily tested to ensure that the functionality is
1088             identical, and a versioning mechanism is included to ensure that if a
1089             function gets out of sync, L<PPI::XS> will degrade gracefully and just
1090             not replace that single method.
1091              
1092             =head1 TO DO
1093              
1094             - Add an option to reset or seek the token stream...
1095              
1096             - Implement more Tokenizer functions in L<PPI::XS>
1097              
1098             =head1 SUPPORT
1099              
1100             See the L<support section|PPI/SUPPORT> in the main module.
1101              
1102             =head1 AUTHOR
1103              
1104             Adam Kennedy E<lt>adamk@cpan.orgE<gt>
1105              
1106             =head1 COPYRIGHT
1107              
1108             Copyright 2001 - 2011 Adam Kennedy.
1109              
1110             This program is free software; you can redistribute
1111             it and/or modify it under the same terms as Perl itself.
1112              
1113             The full text of the license can be found in the
1114             LICENSE file included with this module.
1115              
1116             =cut