File Coverage

blib/lib/PPI/Tokenizer.pm
Criterion Covered Total %
statement 223 249 89.5
branch 118 144 81.9
condition 27 37 72.9
subroutine 29 33 87.8
pod 5 5 100.0
total 402 468 85.9


line stmt bran cond sub pod time code
1             package PPI::Tokenizer;
2              
3             =pod
4              
5             =head1 NAME
6              
7             PPI::Tokenizer - The Perl Document Tokenizer
8              
9             =head1 SYNOPSIS
10              
11             # Create a tokenizer for a file, array or string
12             $Tokenizer = PPI::Tokenizer->new( 'filename.pl' );
13             $Tokenizer = PPI::Tokenizer->new( \@lines );
14             $Tokenizer = PPI::Tokenizer->new( \$source );
15            
16             # Return all the tokens for the document
17             my $tokens = $Tokenizer->all_tokens;
18            
19             # Or we can use it as an iterator
20             while ( my $Token = $Tokenizer->get_token ) {
21             print "Found token '$Token'\n";
22             }
23            
24             # If we REALLY need to manually nudge the cursor, you
25             # can do that too (the lexer needs this ability to do rollbacks)
26             $is_incremented = $Tokenizer->increment_cursor;
27             $is_decremented = $Tokenizer->decrement_cursor;
28              
29             =head1 DESCRIPTION
30              
31             PPI::Tokenizer is the class that provides Tokenizer objects for use in
32             breaking strings of Perl source code into Tokens.
33              
34             By the time you are reading this, you probably need to know a little
35             about the difference between how perl parses Perl "code" and how PPI
36             parses Perl "documents".
37              
38             "perl" itself (the interpreter) uses a heavily modified lex specification
39             to specify its parsing logic, maintains several types of state as it
40             goes, and incrementally tokenizes, lexes AND EXECUTES at the same time.
41              
42             In fact, it is provably impossible to use perl's parsing method without
43             simultaneously executing code. A formal mathematical proof has been
44             published demonstrating this impossibility.
45              
46             This is where the truism "Only perl can parse Perl" comes from.
47              
48             PPI uses a completely different approach by abandoning the (impossible)
49             ability to parse Perl the same way that the interpreter does, and instead
50             parsing the source as a document, using a document structure independently
51             derived from the Perl documentation and approximating the perl
52             interpreter's interpretation as closely as possible.
53              
54             It was touch and go for a long time whether we could get it close enough,
55             but in the end it turned out that it could be done.
56              
57             In this approach, the tokenizer C<PPI::Tokenizer> is implemented separately
58             from the lexer L<PPI::Lexer>.
59              
60             The job of C<PPI::Tokenizer> is to take pure source as a string and break it
61             up into a stream/set of tokens; it contains most of the "black magic" used
62             in PPI. By comparison, the lexer implements a relatively straightforward
63             tree structure, and has an implementation that is uncomplicated (compared
64             to the insanity in the tokenizer, at least).
65              
66             The Tokenizer uses an immense amount of heuristics, guessing and cruft,
67             supported by a very B<flexible> internal API, but fortunately it was
68             possible to largely encapsulate the black magic, so there is not a lot that
69             gets exposed to people using the C<PPI::Tokenizer> itself.
70              
71             =head1 METHODS
72              
73             Despite the incredible complexity, the Tokenizer itself only exposes a
74             relatively small number of methods, with most of the complexity implemented
75             in private methods.
76              
77             =cut
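The token-stream behaviour described above can be sketched with a few lines of client code. This is an illustrative example, not part of the module: it assumes the PPI distribution is installed and simply prints the class and content of each token produced.

```perl
use strict;
use warnings;
use PPI::Tokenizer ();

# Tokenize a small snippet and show each token's class and content.
my $source    = 'my $x = 1 + 2;';
my $Tokenizer = PPI::Tokenizer->new( \$source );
while ( my $Token = $Tokenizer->get_token ) {
    printf "%-30s %s\n", ref($Token), "'$Token'";
}
```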
78              
79             # Make sure everything we need is loaded so
80             # we don't have to go and load all of PPI.
81 68     68   383 use strict;
  68         145  
  68         2544  
82 68     68   267 use Params::Util qw{_INSTANCE _SCALAR0 _ARRAY0};
  68         153  
  68         3895  
83 68     68   306 use List::Util 1.33 ();
  68         1156  
  68         1178  
84 68     68   283 use PPI::Util ();
  68         99  
  68         1231  
85 68     68   504 use PPI::Element ();
  68         110  
  68         932  
86 68     68   258 use PPI::Token ();
  68         112  
  68         837  
87 68     68   235 use PPI::Exception ();
  68         122  
  68         822  
88 68     68   25127 use PPI::Exception::ParserRejection ();
  68         171  
  68         1173  
89 68     68   260 use PPI::Document ();
  68         96  
  68         185856  
90              
91             our $VERSION = '1.291';
92              
93             # The x operator cannot follow most Perl operators, implying that
94             # anything beginning with x following an operator is a word.
95             # These are the exceptions.
96             my %X_CAN_FOLLOW_OPERATOR = map { $_ => 1 } qw( -- ++ );
97              
98             # The x operator cannot follow most structure elements, implying that
99             # anything beginning with x following a structure element is a word.
100             # These are the exceptions.
101             my %X_CAN_FOLLOW_STRUCTURE = map { $_ => 1 } qw( } ] \) );
102              
103             # Something that looks like the x operator but follows a word
104             # is usually that word's argument.
105             # These are the exceptions.
106             # chop, chomp, dump are ambiguous because they can have either parms
107             # or no parms.
108             my %X_CAN_FOLLOW_WORD = map { $_ => 1 } qw(
109             endgrent
110             endhostent
111             endnetent
112             endprotoent
113             endpwent
114             endservent
115             fork
116             getgrent
117             gethostent
118             getlogin
119             getnetent
120             getppid
121             getprotoent
122             getpwent
123             getservent
124             setgrent
125             setpwent
126             time
127             times
128             wait
129             wantarray
130             __SUB__
131             );
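The three hashes above are plain existence lookups: the tokenizer consults them to decide whether a bare `x` is the repetition operator or the start of a word. A minimal sketch of that kind of check (the `$prev` variable and the reduced word list are illustrative only):

```perl
use strict;
use warnings;

# Reduced copy of the lookup table, for illustration only.
my %X_CAN_FOLLOW_WORD = map { $_ => 1 } qw( fork time wait );

my $prev = 'time';    # the significant word token preceding the 'x'
if ( $X_CAN_FOLLOW_WORD{$prev} ) {
    # 'time x 3' - this word takes no arguments, so x is the operator
} else {
    # 'foo x 3'  - the x is probably foo's argument
}
```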
132              
133              
134              
135             #####################################################################
136             # Creation and Initialization
137              
138             =pod
139              
140             =head2 new $file | \@lines | \$source
141              
142             The main C<new> constructor creates a new Tokenizer object. These
143             objects have no configuration parameters, and can only be used once,
144             to tokenize a single perl source file.
145              
146             It takes as argument either a normal scalar containing source code,
147             a reference to a scalar containing source code, or a reference to an
148             ARRAY containing newline-terminated lines of source code.
149              
150             Returns a new C<PPI::Tokenizer> object on success, or throws a
151             L<PPI::Exception> exception on error.
152              
153             =cut
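A sketch of the three constructor forms and basic error handling (assumes PPI is installed; `script.pl` is a hypothetical filename):

```perl
use strict;
use warnings;
use PPI::Tokenizer ();

# From a filename (hypothetical path); new() throws a PPI::Exception
# on failure, so wrap it in eval if you want to recover.
my $Tokenizer = eval { PPI::Tokenizer->new('script.pl') };
warn "Could not tokenize: $@" unless $Tokenizer;

# From an ARRAY reference of newline-terminated lines.
my @lines = ( "my \$x = 1;\n", "print \$x;\n" );
$Tokenizer = PPI::Tokenizer->new( \@lines );

# From a SCALAR reference to the source string.
my $source = join '', @lines;
$Tokenizer = PPI::Tokenizer->new( \$source );
```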
154              
155             sub new {
156 16939   33 16939 1 61836 my $class = ref($_[0]) || $_[0];
157              
158             # Create the empty tokenizer struct
159 16939         146690 my $self = bless {
160             # Source code
161             source => undef,
162             source_bytes => undef,
163             document => undef,
164              
165             # Line buffer
166             line => undef,
167             line_length => undef,
168             line_cursor => undef,
169             line_count => 0,
170              
171             # Parse state
172             token => undef,
173             class => 'PPI::Token::BOM',
174             zone => 'PPI::Token::Whitespace',
175             feature_set => undef,
176              
177             # Output token buffer
178             tokens => [],
179             token_cursor => 0,
180             token_eof => 0,
181              
182             # Perl 6 blocks
183             perl6 => [],
184             }, $class;
185              
186 16939 100       58065 if ( ! ref $_[1] ) {
    100          
    50          
187 518         2499 my $source = PPI::Util::_slurp($_[1]);
188 518 100       1822 PPI::Exception->throw("Tokenizer failed to open file: $source")
189             if not ref $source;
190 516         1926 $self->{source} = $$source;
191              
192             } elsif ( _SCALAR0($_[1]) ) {
193 1         8 PPI::Exception->throw("Did not pass a string: ${$_[1]}")
194 16418 100       19467 if _SCALAR0( $self->{source} = ${$_[1]} );
  16418         50431  
195              
196             } elsif ( _ARRAY0($_[1]) ) {
197 3         17 $self->{source} = join '', map "$_\n", @{$_[1]};
  3         15  
198              
199             } else {
200             # We don't support whatever this is
201 0         0 PPI::Exception->throw(ref($_[1]) . " is not supported as a source provider");
202             }
203              
204             # We can't handle a null string
205 16936         32172 $self->{source_bytes} = length $self->{source};
206 16936 100       29689 if ( $self->{source_bytes} ) {
207             # Split on local newlines
208 16932         526030 $self->{source} =~ s/(?:\015{1,2}\012|\015|\012)/\n/g;
209 16932         257321 $self->{source} = [ split /(?<=\n)/, $self->{source} ];
210              
211             } else {
212 4         6 $self->{source} = [ ];
213             }
214              
215             ### EVIL
216             # I'm explaining this earlier than I should so you can understand
217             # why I'm about to do something that looks very strange. There's
218             # a problem with the Tokenizer, in that tokens tend to change
219             # classes as each letter is added, but they don't get allocated
220             # their definite final class until the "end" of the token, the
221             # detection of which occurs in about a hundred different places,
222             # all through various crufty code (that triples the speed).
223             #
224             # However, in general, this does not apply to tokens in which a
225             # whitespace character is valid, such as comments, whitespace and
226             # big strings.
227             #
228             # So what we do is add a space to the end of the source. This
229             # triggers normal "end of token" functionality for all cases. Then,
230             # once the tokenizer hits end of file, it examines the last token to
231             # manually either remove the ' ' token, or chop it off the end of
232             # a longer one in which the space would be valid.
233 16936 100   77168   67769 if ( List::Util::any { /^__(?:DATA|END)__\s*$/ } @{$self->{source}} ) {
  77168 100       164112  
  16936 100       65148  
234 10         30 $self->{source_eof_chop} = '';
235             } elsif ( ! defined $self->{source}->[0] ) {
236 4         7 $self->{source_eof_chop} = '';
237             } elsif ( $self->{source}->[-1] =~ /\s$/ ) {
238 1128         3368 $self->{source_eof_chop} = '';
239             } else {
240 15794         30982 $self->{source_eof_chop} = 1;
241 15794         29040 $self->{source}->[-1] .= ' ';
242             }
243              
244 16936         72279 $self;
245             }
246              
247             sub _document {
248 16845     16845   20626 my $self = shift;
249 16845 50       40540 return @_ ? $self->{document} = shift : $self->{document};
250             }
251              
252              
253              
254              
255              
256             #####################################################################
257             # Main Public Methods
258              
259             =pod
260              
261             =head2 get_token
262              
263             When using the PPI::Tokenizer object as an iterator, the C<get_token>
264             method is the primary method that is used. It increments the cursor
265             and returns the next Token in the output array.
266              
267             The actual parsing of the file is done only as-needed, and a line at
268             a time. When C<get_token> hits the end of the token array, it will
269             cause the parser to pull in the next line and parse it, continuing
270             as needed until there are more tokens on the output array that
271             C<get_token> can then return.
272              
273             This means that a number of Tokenizer objects can be created, and
274             they won't consume significant CPU until you actually begin to pull
275             tokens from them.
276              
277             Returns a L<PPI::Token> object on success, C<0> if the Tokenizer had
278             reached the end of the file, or C<undef> on error.
279              
280             =cut
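Because C<0> (end of file) and C<undef> (error) are both false, an iterator loop that wants to tell them apart has to test definedness before truth. An illustrative sketch, assuming PPI is installed:

```perl
use strict;
use warnings;
use PPI::Tokenizer ();

my $source    = 'print "hello";';
my $Tokenizer = PPI::Tokenizer->new( \$source );
while ( 1 ) {
    my $Token = $Tokenizer->get_token;
    die "Tokenizer error" unless defined $Token;   # undef => error
    last                  unless $Token;           # 0     => end of file
    print "Found token '$Token'\n";
}
```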
281              
282             sub get_token {
283 597242     597242 1 597115 my $self = shift;
284              
285             # Shortcut for EOF
286 597242 50 66     897028 if ( $self->{token_eof}
287 13618         25490 and $self->{token_cursor} > scalar @{$self->{tokens}}
288             ) {
289 0         0 return 0;
290             }
291              
292             # Return the next token if we can
293 597242 100       1211490 if ( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) {
294 514738         514292 $self->{token_cursor}++;
295 514738         1229847 return $token;
296             }
297              
298 82504         82321 my $line_rv;
299              
300             # Catch exceptions and return undef, so that we
301             # can start to convert code to exception-based code.
302 82504         92491 my $rv = eval {
303             # No token, we need to get some more
304 82504         138376 while ( $line_rv = $self->_process_next_line ) {
305             # If there is something in the buffer, return it
306             # The defined() prevents a ton of calls to PPI::Util::TRUE
307 73403 100       148176 if ( defined( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) ) {
308 52042         54681 $self->{token_cursor}++;
309 52042         90118 return $token;
310             }
311             }
312 30461         39025 return undef;
313             };
314 82504 100       178833 if ( $@ ) {
    100          
315 1 50       8 if ( _INSTANCE($@, 'PPI::Exception') ) {
316 1         2 $@->throw;
317             } else {
318 0         0 my $errstr = $@;
319 0         0 $errstr =~ s/^(.*) at line .+$/$1/;
320 0         0 PPI::Exception->throw( $errstr );
321             }
322             } elsif ( $rv ) {
323 52042         211512 return $rv;
324             }
325              
326 30461 50       41178 if ( defined $line_rv ) {
327             # End of file, but we can still return things from the buffer
328 30461 50       50737 if ( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) {
329 0         0 $self->{token_cursor}++;
330 0         0 return $token;
331             }
332              
333             # Set our token end of file flag
334 30461         33958 $self->{token_eof} = 1;
335 30461         83090 return 0;
336             }
337              
338             # Error, pass it up to our caller
339 0         0 undef;
340             }
341              
342             =pod
343              
344             =head2 all_tokens
345              
346             When not being used as an iterator, the C<all_tokens> method tells
347             the Tokenizer to parse the entire file and return all of the tokens
348             in a single ARRAY reference.
349              
350             It should be noted that C<all_tokens> does B<not> interfere with the
351             use of the Tokenizer object as an iterator (does not modify the token
352             cursor) and use of the two different mechanisms can be mixed safely.
353              
354             Returns a reference to an ARRAY of L<PPI::Token> objects on success
355             or throws an exception on error.
356              
357             =cut
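A short sketch of batch use (assumes PPI is installed). Note from the implementation below that the returned ARRAY reference is a shallow copy, so pushing to or popping from it does not disturb the Tokenizer's internal buffer:

```perl
use strict;
use warnings;
use PPI::Tokenizer ();

my $source    = 'my @list = ( 1, 2, 3 );';
my $Tokenizer = PPI::Tokenizer->new( \$source );
my $tokens    = $Tokenizer->all_tokens;   # ARRAY ref of PPI::Token objects

printf "%d tokens in total\n", scalar @$tokens;

# Filter down to significant tokens (skip whitespace and comments).
my @significant = grep { $_->significant } @$tokens;
```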
358              
359             sub all_tokens {
360 5     5 1 21 my $self = shift;
361              
362             # Catch exceptions and return undef, so that we
363             # can start to convert code to exception-based code.
364 5         8 my $ok = eval {
365             # Process lines until we get EOF
366 5 50       14 unless ( $self->{token_eof} ) {
367 5         6 my $rv;
368 5         18 while ( $rv = $self->_process_next_line ) {}
369 5 50       8 unless ( defined $rv ) {
370 0         0 PPI::Exception->throw("Error while processing source");
371             }
372              
373             # Clean up the end of the tokenizer
374 5         12 $self->_clean_eof;
375             }
376 5         8 1;
377             };
378 5 50       10 if ( !$ok ) {
379 0         0 my $errstr = $@;
380 0         0 $errstr =~ s/^(.*) at line .+$/$1/;
381 0         0 PPI::Exception->throw( $errstr );
382             }
383              
384             # End of file, return a copy of the token array.
385 5         7 return [ @{$self->{tokens}} ];
  5         22  
386             }
387              
388             =pod
389              
390             =head2 increment_cursor
391              
392             Although exposed as a public method, C<increment_cursor> is implemented
393             for expert use only, when writing lexers or other components that work
394             directly on token streams.
395              
396             It manually increments the token cursor forward through the file, in effect
397             "skipping" the next token.
398              
399             Returns true if the cursor is incremented, C<0> if already at the end of
400             the file, or C<undef> on error.
401              
402             =cut
403              
404             sub increment_cursor {
405             # Do this via the get_token method, which makes sure there
406             # is actually a token there to move to.
407 0 0   0 1 0 $_[0]->get_token and 1;
408             }
409              
410             =pod
411              
412             =head2 decrement_cursor
413              
414             Although exposed as a public method, C<decrement_cursor> is implemented
415             for expert use only, when writing lexers or other components that work
416             directly on token streams.
417              
418             It manually decrements the token cursor backwards through the file, in
419             effect "rolling back" the token stream. Indeed, that is its primary
420             purpose: it lets the component that is consuming the token stream
421             implement some sort of "roll back" feature in its use
422             of the token stream.
423              
424             Returns true if the cursor is decremented, C<0> if already at the
425             beginning of the file, or C<undef> on error.
426              
427             =cut
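The two cursor methods together support simple peek-and-rollback logic in a consumer. A sketch, where `$Tokenizer` is an existing PPI::Tokenizer and the `want_token()` predicate is hypothetical:

```perl
# Peek at the next token; if it is not what we want, roll it back so
# the next get_token call returns it again.
my $Token = $Tokenizer->get_token;
if ( $Token and !want_token($Token) ) {    # want_token() is hypothetical
    $Tokenizer->decrement_cursor;
}
```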
428              
429             sub decrement_cursor {
430 0     0 1 0 my $self = shift;
431              
432             # Check for the beginning of the file
433 0 0       0 return 0 unless $self->{token_cursor};
434              
435             # Decrement the token cursor
436 0         0 $self->{token_eof} = 0;
437 0         0 --$self->{token_cursor};
438             }
439              
440              
441              
442              
443              
444             #####################################################################
445             # Working With Source
446              
447             # Fetches the next line from the input line buffer
448             # Returns undef at EOF.
449             sub _get_line {
450 111750     111750   104141 my $self = shift;
451 111750 100       173560 return undef unless $self->{source}; # EOF hit previously
452              
453             # Pull off the next line
454 95605         88574 my $line = shift @{$self->{source}};
  95605         151303  
455              
456             # Flag EOF if we hit it
457 95605 100       159863 $self->{source} = undef unless defined $line;
458              
459             # Return the line (or EOF flag)
460 95605         140333 return $line; # string or undef
461             }
462              
463             # Fetches the next line, ready to process
464             # Returns 1 on success
465             # Returns 0 on EOF
466             sub _fill_line {
467 109357     109357   104355 my $self = shift;
468 109357         115035 my $inscan = shift;
469              
470             # Get the next line
471 109357         149874 my $line = $self->_get_line;
472 109357 100       159733 unless ( defined $line ) {
473             # End of file
474 32520 100       45913 unless ( $inscan ) {
475 30466         47347 delete $self->{line};
476 30466         34335 delete $self->{line_cursor};
477 30466         31171 delete $self->{line_length};
478 30466         54562 return 0;
479             }
480              
481             # In the scan version, just set the cursor to the end
482             # of the line, and the rest should just cascade out.
483 2054         2608 $self->{line_cursor} = $self->{line_length};
484 2054         3840 return 0;
485             }
486              
487             # Populate the appropriate variables
488 76837         114403 $self->{line} = $line;
489 76837         87653 $self->{line_cursor} = -1;
490 76837         94446 $self->{line_length} = length $line;
491 76837         80131 $self->{line_count}++;
492              
493 76837         132537 1;
494             }
495              
496             # Get the current character
497             sub _char {
498 0     0   0 my $self = shift;
499 0         0 substr( $self->{line}, $self->{line_cursor}, 1 );
500             }
501              
502              
503              
504              
505              
506             ####################################################################
507             # Per line processing methods
508              
509             # Processes the next line
510             # Returns 1 on successful completion
511             # Returns 0 if EOF
512             # Returns undef on error
513             sub _process_next_line {
514 103881     103881   105247 my $self = shift;
515              
516             # Fill the line buffer
517 103881         99693 my $rv;
518 103881 100       148858 unless ( $rv = $self->_fill_line ) {
519 30466 50       47340 return undef unless defined $rv;
520              
521             # End of file, finalize last token
522 30466         51210 $self->_finalize_token;
523 30466         55260 return 0;
524             }
525              
526             # Run the __TOKENIZER__on_line_start
527 73415         192547 $rv = $self->{class}->__TOKENIZER__on_line_start( $self );
528 73415 100       109732 unless ( $rv ) {
529             # If there are no more source lines, then clean up
530 28445 100 66     45453 if ( ref $self->{source} eq 'ARRAY' and ! @{$self->{source}} ) {
  28445         54325  
531 308         736 $self->_clean_eof;
532             }
533              
534             # Defined but false means next line
535 28445 50       54881 return 1 if defined $rv;
536 0         0 PPI::Exception->throw("Error at line $self->{line_count}");
537             }
538              
539             # If we can't deal with the entire line, process char by char
540 44970         80521 while ( $rv = $self->_process_next_char ) {}
541 44969 50       69136 unless ( defined $rv ) {
542 0         0 PPI::Exception->throw("Error at line $self->{line_count}, character $self->{line_cursor}");
543             }
544              
545             # Trigger any action that needs to happen at the end of a line
546 44969         112050 $self->{class}->__TOKENIZER__on_line_end( $self );
547              
548             # If there are no more source lines, then clean up
549 44969 100 100     100220 unless ( ref($self->{source}) eq 'ARRAY' and @{$self->{source}} ) {
  42674         94296  
550 16623         29101 return $self->_clean_eof;
551             }
552              
553 28346         52350 return 1;
554             }
555              
556              
557              
558              
559              
560             #####################################################################
561             # Per-character processing methods
562              
563             # Process on a per-character basis.
564             # Note that due to the high number of times this gets
565             # called, it has been fairly heavily in-lined, so the code
566             # might look a bit ugly and duplicated.
567             sub _process_next_char {
568 705254     705254   679230 my $self = shift;
569              
570             ### FIXME - This checks for a screwed up condition that triggers
571             ### several warnings, amongst other things.
572 705254 50 33     1551890 if ( ! defined $self->{line_cursor} or ! defined $self->{line_length} ) {
573             # $DB::single = 1;
574 0         0 return undef;
575             }
576              
577 705254         688510 $self->{line_cursor}++;
578 705254 100       827682 return 0 if $self->_at_line_end;
579              
580             # Pass control to the token class
581 660285         634383 my $result;
582 660285 100       1184626 unless ( $result = $self->{class}->__TOKENIZER__on_char( $self ) ) {
583             # undef is error. 0 is "Did stuff ourself, you don't have to do anything"
584 94068 50       239091 return defined $result ? 1 : undef;
585             }
586              
587             # We will need the value of the current character
588 566216         755002 my $char = substr( $self->{line}, $self->{line_cursor}, 1 );
589 566216 100       768544 if ( $result eq '1' ) {
590             # If __TOKENIZER__on_char returns 1, it is signaling that it thinks that
591             # the character is part of it.
592              
593             # Add the character
594 96222 50       129446 if ( defined $self->{token} ) {
595 96222         132975 $self->{token}->{content} .= $char;
596             } else {
597 0 0       0 defined($self->{token} = $self->{class}->new($char)) or return undef;
598             }
599              
600 96222         184958 return 1;
601             }
602              
603             # We have been provided with the name of a class
604 469994 100       746203 if ( $self->{class} ne "PPI::Token::$result" ) {
    100          
605             # New class
606 228473         305009 $self->_new_token( $result, $char );
607             } elsif ( defined $self->{token} ) {
608             # Same class as current
609 44589         58722 $self->{token}->{content} .= $char;
610             } else {
611             # Same class, but no current
612 196932 50       340282 defined($self->{token} = $self->{class}->new($char)) or return undef;
613             }
614              
615 469994         881797 1;
616             }
617              
618             sub _at_line_end {
619 705254     705254   785038 my ($self) = @_;
620 705254         1197660 return $self->{line_cursor} >= $self->{line_length};
621             }
622              
623              
624              
625              
626              
627             #####################################################################
628             # Altering Tokens in Tokenizer
629              
630             # Finish the end of a token.
631             # Returns the resulting parse class as a convenience.
632             sub _finalize_token {
633 610515     610515   614217 my $self = shift;
634 610515 100       841677 return $self->{class} unless defined $self->{token};
635              
636             # Add the token to the token buffer
637 580047         532926 push @{ $self->{tokens} }, $self->{token};
  580047         840484  
638 580047         636109 $self->{token} = undef;
639              
640             # Return the parse class to that of the zone we are in
641 580047         1042892 $self->{class} = $self->{zone};
642             }
643              
644             # Creates a new token and sets it in the tokenizer
645             # The defined() in here prevent a ton of calls to PPI::Util::TRUE
646             sub _new_token {
647 383113     383113   404817 my $self = shift;
648             # throw PPI::Exception() unless @_;
649 383113 100       628877 my $class = substr( $_[0], 0, 12 ) eq 'PPI::Token::'
650             ? shift : 'PPI::Token::' . shift;
651              
652             # Finalize any existing token
653 383113 100       648810 $self->_finalize_token if defined $self->{token};
654              
655             # Create the new token and update the parse class
656 383113 50       785566 defined($self->{token} = $class->new($_[0])) or PPI::Exception->throw;
657 383113         463336 $self->{class} = $class;
658              
659 383113         429272 1;
660             }
661              
662             # At the end of the file, we need to clean up the results of the erroneous
663             # space that we inserted at the beginning of the process.
664             sub _clean_eof {
665 16936     16936   18684 my $self = shift;
666              
667             # Finish any partially completed token
668 16936 100       29479 $self->_finalize_token if $self->{token};
669              
670             # Find the last token, and if it has no content, kill it.
671             # There appears to be some evidence that such "null tokens" are
672             # somehow getting created accidentally.
673 16936         22537 my $last_token = $self->{tokens}->[ -1 ];
674 16936 50       29115 unless ( length $last_token->{content} ) {
675 0         0 pop @{$self->{tokens}};
  0         0  
676             }
677              
678             # Now, if the last character of the last token is a space we added,
679             # chop it off, deleting the token if there's nothing else left.
680 16936 100       28896 if ( $self->{source_eof_chop} ) {
681 15521         19645 $last_token = $self->{tokens}->[ -1 ];
682 15521         64946 $last_token->{content} =~ s/ $//;
683 15521 100       30839 unless ( length $last_token->{content} ) {
684             # Popping token
685 13236         13686 pop @{$self->{tokens}};
  13236         18521  
686             }
687              
688             # The hack involving adding an extra space is now reversed, and
689             # now nobody will ever know. The perfect crime!
690 15521         24993 $self->{source_eof_chop} = '';
691             }
692              
693 16936         48720 1;
694             }
695              
696              
697              
698              
699              
700             #####################################################################
701             # Utility Methods
702              
703             # Context
704             sub _last_token {
705 0     0   0 $_[0]->{tokens}->[-1];
706             }
707              
708             sub _last_significant_token {
709 3127     3127   3524 my $self = shift;
710 3127         3613 my $cursor = $#{ $self->{tokens} };
  3127         4310  
711 3127         6254 while ( $cursor >= 0 ) {
712 4195         5431 my $token = $self->{tokens}->[$cursor--];
713 4195 100       12184 return $token if $token->significant;
714             }
715 402         776 return;
716             }
717              
718             # Get an array ref of previous significant tokens.
719             # Like _last_significant_token except it gets more than just one token
720             # Returns array with 0 to x entries
721             sub _previous_significant_tokens {
722 165598     165598   163569 my $self = shift;
723 165598   50     231169 my $count = shift || 1;
724 165598         146617 my $cursor = $#{ $self->{tokens} };
  165598         204779  
725              
726 165598         176653 my @tokens;
727 165598         228368 while ( $cursor >= 0 ) {
728 267863         290835 my $token = $self->{tokens}->[$cursor--];
729 267863 100       468781 next if not $token->significant;
730 172714         179463 push @tokens, $token;
731 172714 100       272766 last if @tokens >= $count;
732             }
733              
734 165598         278044 return @tokens;
735             }
736              
737             my %OBVIOUS_CLASS = (
738             'PPI::Token::Symbol' => 'operator',
739             'PPI::Token::Magic' => 'operator',
740             'PPI::Token::Number' => 'operator',
741             'PPI::Token::ArrayIndex' => 'operator',
742             'PPI::Token::Quote::Double' => 'operator',
743             'PPI::Token::Quote::Interpolate' => 'operator',
744             'PPI::Token::Quote::Literal' => 'operator',
745             'PPI::Token::Quote::Single' => 'operator',
746             'PPI::Token::QuoteLike::Backtick' => 'operator',
747             'PPI::Token::QuoteLike::Command' => 'operator',
748             'PPI::Token::QuoteLike::Readline' => 'operator',
749             'PPI::Token::QuoteLike::Regexp' => 'operator',
750             'PPI::Token::QuoteLike::Words' => 'operator',
751             );
752              
753             my %OBVIOUS_CONTENT = (
754             '(' => 'operand',
755             '{' => 'operand',
756             '[' => 'operand',
757             ';' => 'operand',
758             '}' => 'operator',
759             );
760              
761              
762             my %USUALLY_FORCES = map { $_ => 1 } qw( sub package use no );
763              
764             # Try to determine operator/operand context, if possible.
765             # Returns "operator", "operand", or "" if unknown.
766             sub _opcontext {
767 11030     11030   12603 my $self = shift;
768 11030         18529 my @tokens = $self->_previous_significant_tokens(1);
769 11030         13731 my $p0 = $tokens[0];
770 11030 100       26595 return '' if not $p0;
771 10912         14973 my $c0 = ref $p0;
772              
773             # Map the obvious cases
774 10912 100       30795 return $OBVIOUS_CLASS{$c0} if defined $OBVIOUS_CLASS{$c0};
775 2321 100       4746 return $OBVIOUS_CONTENT{$p0} if defined $OBVIOUS_CONTENT{$p0};
776              
777             # Most of the time after an operator, we are an operand
778 1733 100       6837 return 'operand' if $p0->isa('PPI::Token::Operator');
779              
780             # If there's NOTHING, it's operand
781 1509 50       2655 return 'operand' if $p0->content eq '';
782              
783             # Otherwise, we don't know
784 1509         2969 return ''
785             }
786              
787             # Assuming we are currently parsing the word 'x', return true
788             # if previous tokens imply the x is an operator, false otherwise.
789             sub _current_x_is_operator {
790 1113     1113   1485 my ( $self ) = @_;
791 1113 100       1032 return if !@{$self->{tokens}};
  1113         2667  
792              
793 915         1816 my ($prev, $prevprev) = $self->_previous_significant_tokens(2);
794 915 100       2753 return if !$prev;
795              
796 914 100       3426 return !$self->__current_token_is_forced_word if $prev->isa('PPI::Token::Word');
797              
798             return (!$prev->isa('PPI::Token::Operator') || $X_CAN_FOLLOW_OPERATOR{$prev})
799 748   100     4424 && (!$prev->isa('PPI::Token::Structure') || $X_CAN_FOLLOW_STRUCTURE{$prev})
800             && !$prev->isa('PPI::Token::Label')
801             ;
802             }
803              
804              
805             # Assuming we are at the end of parsing the current token that could be a word,
806             # a wordlike operator, or a version string, try to determine whether context
807             # before or after it forces it to be a bareword. This method is only useful
808             # during tokenization.
809             sub __current_token_is_forced_word {
810 33401     33401   49836 my ( $t, $word ) = @_;
811              
812             # Check if forced by preceding tokens.
813              
814 33401         46648 my ( $prev, $prevprev ) = $t->_previous_significant_tokens(2);
815 33401 100       69526 if ( !$prev ) {
816 8860         19761 pos $t->{line} = $t->{line_cursor};
817             }
818             else {
819 24541         39425 my $content = $prev->{content};
820              
821             # We are forced if we are a method name.
822             # '->' will always be an operator, so we don't check its type.
823 24541 100       39667 return 1 if $content eq '->';
824              
825             # If we are contained in a pair of curly braces, we are probably a
826             # forced bareword hash key. '{' is never a word or operator, so we
827             # don't check its type.
828 24411         47370 pos $t->{line} = $t->{line_cursor};
829 24411 100 100     59069 return 1 if $content eq '{' and $t->{line} =~ /\G\s*\}/gc;
830              
831             # sub, package, use, and no all indicate that what immediately follows
832             # is a word not an operator or (in the case of sub and package) a
833             # version string. However, we don't want to be fooled by 'package
834             # package v10' or 'use no v10'. We're a forced package unless we're
835             # preceded by 'package sub', in which case we're a version string.
836             # We also have to make sure that the sub/package/etc doing the forcing
837             # is not a method call.
838 24181 100       49339 if( $USUALLY_FORCES{$content}) {
839 5631 100 66     18776 return if defined $word and $word =~ /^v[0-9]+$/ and ( $content eq "use" or $content eq "no" );
      100        
      100        
840 5621 100       25010 return 1 if not $prevprev;
841 236 100 100     465 return 1 if not $USUALLY_FORCES{$prevprev->content} and $prevprev->content ne '->';
842 6         32 return;
843             }
844             }
845             # pos on $t->{line} is guaranteed to be set at this point.
846              
847             # Check if forced by following tokens.
848              
849             # If the word is followed by => it is probably a word, not a regex.
850 27410 100       65221 return 1 if $t->{line} =~ /\G\s*=>/gc;
851              
852             # Otherwise we probably aren't forced
853 26619         121515 return '';
854             }
855              
856             sub _features {
857 37     37   78 my ( $self, $arg ) = @_;
858 37 50 0     138 return $arg ? $self->{feature_set} = $arg : $self->{feature_set} || {};
859             }
860              
861 12314     12314   38002 sub _current_token_has_signatures_active { shift->{feature_set}{signatures} }
862              
863             1;
864              
865             =pod
866              
867             =head1 NOTES
868              
869             =head2 How the Tokenizer Works
870              
871             Understanding the Tokenizer is not for the faint-hearted. It is by far
872             the most complex and twisty piece of perl I've ever written that is actually
873             still built properly and isn't a terrible spaghetti-like mess. In fact, you
874             probably want to skip this section.
875              
876             But if you really want to understand, well then here goes.
877              
878             =head2 Source Input and Clean Up
879              
880             The Tokenizer starts by taking source in a variety of forms, sucking it
881             all in and merging into one big string, and doing our own internal line
882             split, using a "universal line separator" which allows the Tokenizer to
883             take source for any platform (and even supports a few known types of
884             broken newlines caused by mixed mac/pc/*nix editor screw ups).
885              
886             The resulting array of lines is used to feed the tokenizer, and is also
887             accessed directly by the heredoc-logic to do the line-oriented part of
888             here-doc support.
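As a sketch of the idea (an illustration only, not PPI's actual implementation, which differs in detail), a "universal line separator" split can be done by normalising all known newline flavours before splitting:

```perl
# A minimal sketch of a "universal line separator" split.
sub split_universal_lines {
    my ($source) = @_;
    # Normalise DOS (\r\n), old Mac (\r) and Unix (\n) endings -- and
    # the broken \r\r\n produced by mixed-editor mishaps -- to \n.
    $source =~ s/(?:\015{1,2}\012|\015|\012)/\n/g;
    # Split while keeping the newline attached to each line.
    return split /(?<=\n)/, $source;
}
```

Keeping the newline attached to each line matters, because the tokenizer treats the newline as part of the line it terminates.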
889              
890             =head2 Doing Things the Old Fashioned Way
891              
892             Due to the complexity of perl, and after 2 previously aborted parser
893             attempts, in the end the tokenizer was fashioned around a line-buffered
894             character-by-character method.
895              
896             That is, the Tokenizer pulls and holds a line at a time into a line buffer,
897             and then iterates a cursor along it. At each cursor position, a method is
898             called in whatever token class we are currently in, which will examine the
899             character at the current position, and handle it.
900              
901             As the handler methods in the various token classes are called, they
902             build up an output token array for the source code.
903              
904             Various parts of the Tokenizer use look-ahead, arbitrary-distance
905             look-behind (although currently the maximum is three significant tokens),
906             or both, and various other heuristic guesses.
907              
908             I've been told it is officially termed a I<"backtracking parser
909             with infinite lookaheads">.
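The loop described above can be sketched as follows. This is a toy version with hypothetical names; PPI's real handlers live in the token classes themselves and are considerably more involved:

```perl
# A minimal, self-contained sketch of the line-buffered cursor loop.
# All names here are illustrative only, not PPI's API.
my %handler = (
    whitespace => sub {
        my ($state, $char) = @_;
        if ($char =~ /\w/) {
            # A word begins: switch class, open a new token, and
            # consume nothing so the new handler sees this char.
            $state->{class} = 'word';
            push @{ $state->{tokens} }, '';
            return 0;
        }
        return 1;    # treat everything else as filler in this sketch
    },
    word => sub {
        my ($state, $char) = @_;
        if ($char =~ /\w/) {
            $state->{tokens}[-1] .= $char;    # grow the current token
            return 1;
        }
        $state->{class} = 'whitespace';       # token done: fall back
        return 0;
    },
);

sub tokenize_line {
    my ($line) = @_;
    my $state  = { class => 'whitespace', tokens => [] };
    my $cursor = 0;
    while ( $cursor < length $line ) {
        my $char = substr $line, $cursor, 1;
        # Dispatch on the current class; the handler may consume the
        # character, change the class, or both.
        $cursor += $handler{ $state->{class} }->($state, $char);
    }
    return @{ $state->{tokens} };
}
```

Note that a handler returning zero hands the same character to a different class's handler, which is exactly why handler calls need not match character counts.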
910              
911             =head2 State Variables
912              
913             Aside from the current line and the character cursor, the Tokenizer
914             maintains a number of different state variables.
915              
916             =over
917              
918             =item Current Class
919              
920             The Tokenizer maintains the current token class at all times. Much of the
921             time is just going to be the "Whitespace" class, which is what the base of
922             a document is. As the tokenizer executes the various character handlers,
923             the class changes a lot as it moves along. In fact, in some instances,
924             the character handler may not handle the character directly itself, but
925             rather change the "current class" and then hand off to the character
926             handler for the new class.
927              
928             Because of this, and some other things I'll deal with later, the number of
929             times the character handlers are called does not in fact have a direct
930             relationship to the number of actual characters in the document.
931              
932             =item Current Zone
933              
934             Rather than create a class stack to allow for infinitely nested layers of
935             classes, the Tokenizer recognises just a single layer.
936              
937             To put it a different way, in various parts of the file, the Tokenizer will
938             recognise different "base" or "substrate" classes. When a Token such as a
939             comment or a number is finalised by the tokenizer, it "falls back" to the
940             base state.
941              
942             This allows proper tokenization of special areas such as __DATA__
943             and __END__ blocks, which also contain things like comments and POD,
944             without allowing the creation of any significant Tokens inside these areas.
945              
946             For the main part of a document we use L<PPI::Token::Whitespace> for this,
947             with the idea being that code is "floating in a sea of whitespace".
948              
949             =item Current Token
950              
951             The final main state variable is the "current token". This is the Token
952             that is currently being built by the Tokenizer. For certain types, it
953             can be manipulated and morphed and change class quite a bit while being
954             assembled, as the Tokenizer's understanding of the token content changes.
955              
956             When the Tokenizer is confident that it has seen the end of the Token, it
957             will be "finalized", which adds it to the output token array and resets
958             the current class to that of the zone that we are currently in.
959              
960             I should also note at this point that the "current token" variable is
961             optional. The Tokenizer is capable of knowing what class it is currently
962             set to, without actually having accumulated any characters in the Token.
963              
964             =back
965              
966             =head2 Making It Faster
967              
968             As I'm sure you can imagine, calling several different methods for each
969             character and running regexes and other complex heuristics made the first
970             fully working version of the tokenizer extremely slow.
971              
972             During testing, I created a metric to measure parsing speed called
973             LPGC, or "lines per gigacycle". A gigacycle is simply a billion CPU
974             cycles on a typical single-core CPU, and so a Tokenizer running at
975             "1000 lines per gigacycle" should tokenize around 1200 lines of
976             code per second when running on a 1200 MHz processor.
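Since gigahertz is just gigacycles per second, converting LPGC to real-world throughput is a single multiplication:

```perl
# lines/second = LPGC x clock speed in GHz (gigacycles per second)
my $lpgc = 1000;                        # lines per gigacycle
my $ghz  = 1.2;                         # a 1200 MHz processor
my $lines_per_second = $lpgc * $ghz;    # 1200 lines per second
```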
977              
978             The first working version of the tokenizer ran at only 350 LPGC, so to
979             tokenize a typical large module such as L<ExtUtils::MakeMaker> took
980             10-15 seconds. This sluggishness made it impractical for many uses.
981              
982             So in the current parser, there are multiple layers of optimisation
983             very carefully built into the basic design. This has brought the tokenizer
984             up to a more reasonable 1000 LPGC, at the expense of making the code
985             quite a bit twistier.
986              
987             =head2 Making It Faster - Whole Line Classification
988              
989             The first step in the optimisation process was to add a new handler to
990             enable several of the more basic classes (whitespace, comments) to be
991             able to be parsed a line at a time. At the start of each line, a
992             special optional handler (only supported by a few classes) is called to
993             check and see if the entire line can be parsed in one go.
994              
995             This is used mainly to handle things like POD, comments, empty lines,
996             and a few other minor special cases.
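A sketch of that fast path follows; the name and patterns are hypothetical (PPI's real line-start handlers are methods on the token classes):

```perl
# Hypothetical whole-line fast path: classify an entire line in one
# go where possible, otherwise signal a fall back to char-by-char.
sub classify_whole_line {
    my ($line) = @_;
    return 'whitespace' if $line =~ /^\s*$/;
    return 'comment'    if $line =~ /^\s*#/;
    return 'pod'        if $line =~ /^=\w+/;
    return undef;    # not a simple line; use the slow path
}
```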
997              
998             =head2 Making It Faster - Inlining
999              
1000             The second stage of the optimisation involved inlining a small
1001             number of critical methods that were repeated an extremely high number
1002             of times. Profiling suggested that there were about 1,000,000 individual
1003             method calls per gigacycle, and by cutting these by two thirds a significant
1004             speed improvement was gained, in the order of about 50%.
1005              
1006             You may notice that many methods in the C<PPI::Tokenizer> code look
1007             very nested and longhand. This is primarily due to this inlining.
1008              
1009             At around this time, some statistics code that existed in the early
1010             versions of the parser was also removed, as it was determined that
1011             it was consuming around 15% of the CPU for the entire parser, while
1012             making the core more complicated.
1013              
1014             A judgment call was made that with the difficulties likely to be
1015             encountered with future planned enhancements, and given the relatively
1016             high cost involved, the statistics features would be removed from the
1017             Tokenizer.
1018              
1019             =head2 Making It Faster - Quote Engine
1020              
1021             Once inlining had reached diminishing returns, it became obvious from
1022             the profiling results that a huge amount of time was being spent
1023             stepping a char at a time through long, simple and "syntactically boring"
1024             code such as comments and strings.
1025              
1026             The existing regex engine was expanded to also encompass quotes and
1027             other quote-like things, and a special abstract base class was added
1028             that provided a number of specialised parsing methods that would "scan
1029             ahead", looking ahead to find the end of a string, and updating
1030             the cursor to leave it in a valid position for the next call.
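For instance, the body of a single-quoted string can be consumed with one regex "scan ahead" instead of one handler call per character. This is a simplified sketch; the real Quote Engine also handles multi-line strings and many more quote-like constructs:

```perl
# Simplified "scan ahead": given the position just after an opening
# single quote, jump the cursor straight past the closing quote.
sub scan_single_quoted {
    my ($line, $pos) = @_;
    pos($line) = $pos;
    # Consume escaped characters or anything that is not a quote or
    # backslash, then require the closing quote itself.
    if ( $line =~ /\G((?:\\.|[^'\\])*)'/gc ) {
        return ( $1, pos($line) );    # (string body, new cursor)
    }
    return;    # no close on this line; the string continues
}
```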
1031              
1032             This is also the point at which the number of character handler calls began
1033             to greatly differ from the number of characters. But it has been done
1034             in a way that allows the parser to retain the power of the original
1035             version at the critical points, while skipping through the "boring bits"
1036             as needed for additional speed.
1037              
1038             The addition of this feature allowed the tokenizer to exceed 1000 LPGC
1039             for the first time.
1040              
1041             =head2 Making It Faster - The "Complete" Mechanism
1042              
1043             As it became evident that great speed increases were available by using
1044             this "skipping ahead" mechanism, a new handler method was added that
1045             explicitly handles the parsing of an entire token, where the structure
1046             of the token is relatively simple. Tokens such as symbols fit this case,
1047             as once we are past the initial sigil and word char, we know that we
1048             can skip ahead and "complete" the rest of the token much more easily.
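For example, a simple symbol can be "completed" with a single regex once its sigil has been seen. This is a hypothetical simplification of the real complete handlers:

```perl
# Hypothetical "complete" handler for a simple symbol token: one
# regex consumes the whole token instead of a call per character.
sub complete_symbol {
    my ($line, $pos) = @_;    # $pos is the index of the sigil
    pos($line) = $pos;
    return $1 if $line =~ /\G([\$\@%]\w+(?:(?:::|')\w+)*)/gc;
    return undef;             # not a simple symbol after all
}
```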
1049              
1050             A number of these have been added for most or possibly all of the common
1051             cases, with most of these "complete" handlers implemented using regular
1052             expressions.
1053              
1054             In fact, so many have been added that at this point, you could arguably
1055             reclassify the tokenizer as a "hybrid regex, char-by-char heuristic
1056             tokenizer". More tokens are now consumed in "complete" methods in a
1057             typical program than are handled by the normal char-by-char methods.
1058              
1059             Many of these complete-handlers were implemented during the writing
1060             of the Lexer, and this has allowed the full parser to maintain around
1061             1000 LPGC despite the increasing weight of the Lexer.
1062              
1063             =head2 Making It Faster - Porting To C (In Progress)
1064              
1065             While it would be extraordinarily difficult to port all of the Tokenizer
1066             to C, work has started on a L<PPI::XS> "accelerator" package which acts as
1067             a separate and automatically-detected add-on to the main PPI package.
1068              
1069             L<PPI::XS> implements faster versions of a variety of functions scattered
1070             over the entire PPI codebase, from the Tokenizer Core, Quote Engine, and
1071             various other places, and implements them identically in XS/C.
1072              
1073             In particular, the skip-ahead methods from the Quote Engine would appear
1074             to be extremely amenable to being done in C, and a number of other
1075             functions could be cherry-picked one at a time and implemented in C.
1076              
1077             Each method is heavily tested to ensure that the functionality is
1078             identical, and a versioning mechanism is included to ensure that if a
1079             function gets out of sync, L<PPI::XS> will degrade gracefully and just
1080             not replace that single method.
1081              
1082             =head1 TO DO
1083              
1084             - Add an option to reset or seek the token stream...
1085              
1086             - Implement more Tokenizer functions in L<PPI::XS>
1087              
1088             =head1 SUPPORT
1089              
1090             See the L<support section|PPI/SUPPORT> in the main module.
1091              
1092             =head1 AUTHOR
1093              
1094             Adam Kennedy E<lt>adamk@cpan.orgE<gt>
1095              
1096             =head1 COPYRIGHT
1097              
1098             Copyright 2001 - 2011 Adam Kennedy.
1099              
1100             This program is free software; you can redistribute
1101             it and/or modify it under the same terms as Perl itself.
1102              
1103             The full text of the license can be found in the
1104             LICENSE file included with this module.
1105              
1106             =cut