File Coverage

blib/lib/PPI/Tokenizer.pm
Criterion Covered Total %
statement 222 248 89.5
branch 115 142 80.9
condition 27 37 72.9
subroutine 29 33 87.8
pod 5 5 100.0
total 398 465 85.5


line stmt bran cond sub pod time code
1             package PPI::Tokenizer;
2              
3             =pod
4              
5             =head1 NAME
6              
7             PPI::Tokenizer - The Perl Document Tokenizer
8              
9             =head1 SYNOPSIS
10              
11             # Create a tokenizer for a file, array or string
12             $Tokenizer = PPI::Tokenizer->new( 'filename.pl' );
13             $Tokenizer = PPI::Tokenizer->new( \@lines );
14             $Tokenizer = PPI::Tokenizer->new( \$source );
15            
16             # Return all the tokens for the document
17             my $tokens = $Tokenizer->all_tokens;
18            
19             # Or we can use it as an iterator
20             while ( my $Token = $Tokenizer->get_token ) {
21             print "Found token '$Token'\n";
22             }
23            
24             # If we REALLY need to manually nudge the cursor, you
25             # can do that too (The lexer needs this ability to do rollbacks)
26             $is_incremented = $Tokenizer->increment_cursor;
27             $is_decremented = $Tokenizer->decrement_cursor;
28              
29             =head1 DESCRIPTION
30              
31             PPI::Tokenizer is the class that provides Tokenizer objects for use in
32             breaking strings of Perl source code into Tokens.
33              
34             By the time you are reading this, you probably need to know a little
35             about the difference between how perl parses Perl "code" and how PPI
36             parses Perl "documents".
37              
38             "perl" itself (the interpreter) uses a heavily modified lex specification
39             to specify its parsing logic, maintains several types of state as it
40             goes, and incrementally tokenizes, lexes AND EXECUTES at the same time.
41              
42             In fact, it is provably impossible to use perl's parsing method without
43             simultaneously executing code. A formal mathematical proof has been
44             published demonstrating this.
45              
46             This is where the truism "Only perl can parse Perl" comes from.
47              
48             PPI uses a completely different approach by abandoning the (impossible)
49             ability to parse Perl the same way that the interpreter does, and instead
50             parsing the source as a document, using a document structure independently
51             derived from the Perl documentation and approximating the perl interpreter
52             interpretation as closely as possible.
53              
54             It was touch and go for a long time whether we could get it close enough,
55             but in the end it turned out that it could be done.
56              
57             In this approach, the tokenizer C<PPI::Tokenizer> is implemented separately
58             from the lexer L<PPI::Lexer>.
59              
60             The job of C<PPI::Tokenizer> is to take pure source as a string and break it
61             up into a stream/set of tokens, and it contains most of the "black magic" used
62             in PPI. By comparison, the lexer implements a relatively straightforward
63             tree structure, and has an implementation that is uncomplicated (compared
64             to the insanity in the tokenizer, at least).
65              
66             The Tokenizer uses an immense amount of heuristics, guessing and cruft,
67             supported by a very flexible internal API, but fortunately it was
68             possible to largely encapsulate the black magic, so there is not a lot that
69             gets exposed to people using the C<PPI::Tokenizer> itself.
70              
71             =head1 METHODS
72              
73             Despite the incredible complexity, the Tokenizer itself only exposes a
74             relatively small number of methods, with most of the complexity implemented
75             in private methods.
76              
77             =cut
78              
79             # Make sure everything we need is loaded so
80             # we don't have to go and load all of PPI.
81 68     68   381 use strict;
  68         113  
  68         2399  
82 68     68   257 use Params::Util qw{_INSTANCE _SCALAR0 _ARRAY0};
  68         103  
  68         3500  
83 68     68   269 use List::Util 1.33 ();
  68         1026  
  68         1187  
84 68     68   252 use PPI::Util ();
  68         94  
  68         1213  
85 68     68   300 use PPI::Element ();
  68         127  
  68         1011  
86 68     68   232 use PPI::Token ();
  68         130  
  68         909  
87 68     68   255 use PPI::Exception ();
  68         107  
  68         835  
88 68     68   25402 use PPI::Exception::ParserRejection ();
  68         166  
  68         1224  
89 68     68   302 use PPI::Document ();
  68         101  
  68         185069  
90              
91             our $VERSION = '1.290';
92              
93             # The x operator cannot follow most Perl operators, implying that
94             # anything beginning with x following an operator is a word.
95             # These are the exceptions.
96             my %X_CAN_FOLLOW_OPERATOR = map { $_ => 1 } qw( -- ++ );
97              
98             # The x operator cannot follow most structure elements, implying that
99             # anything beginning with x following a structure element is a word.
100             # These are the exceptions.
101             my %X_CAN_FOLLOW_STRUCTURE = map { $_ => 1 } qw( } ] \) );
102              
103             # Something that looks like the x operator but follows a word
104             # is usually that word's argument.
105             # These are the exceptions.
106             # chop, chomp, dump are ambiguous because they can have either parms
107             # or no parms.
108             my %X_CAN_FOLLOW_WORD = map { $_ => 1 } qw(
109             endgrent
110             endhostent
111             endnetent
112             endprotoent
113             endpwent
114             endservent
115             fork
116             getgrent
117             gethostent
118             getlogin
119             getnetent
120             getppid
121             getprotoent
122             getpwent
123             getservent
124             setgrent
125             setpwent
126             time
127             times
128             wait
129             wantarray
130             __SUB__
131             );
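The three exception tables above combine into a simple membership test. A minimal sketch of that test (Python, purely illustrative; the function name and token-kind strings are hypothetical, not part of PPI):

```python
# Hypothetical sketch of the lookup the %X_CAN_FOLLOW_* tables support:
# decide whether something beginning with "x" after a given token is
# the repetition operator or an ordinary word.
X_CAN_FOLLOW_OPERATOR = {"--", "++"}
X_CAN_FOLLOW_STRUCTURE = {"}", "]", ")"}

def x_is_operator(prev_kind, prev_content):
    """Guess whether an "x" following the previous token is an operator."""
    if prev_kind == "operator":
        # "x" cannot follow most operators, so it must start a word --
        # except after the listed exceptions "++" and "--".
        return prev_content in X_CAN_FOLLOW_OPERATOR
    if prev_kind == "structure":
        # Likewise for structure elements, except after "}", "]", ")".
        return prev_content in X_CAN_FOLLOW_STRUCTURE
    # Other cases need the %X_CAN_FOLLOW_WORD list and more context.
    return False
```

The word case (`%X_CAN_FOLLOW_WORD`) is omitted here because it depends on knowing which builtins can take no arguments.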
132              
133              
134              
135             #####################################################################
136             # Creation and Initialization
137              
138             =pod
139              
140             =head2 new $file | \@lines | \$source
141              
142             The main C<new> constructor creates a new Tokenizer object. These
143             objects have no configuration parameters, and can only be used once,
144             to tokenize a single Perl source file.
145              
146             It takes as argument either a normal scalar containing the name of a file,
147             a reference to a scalar containing source code, or a reference to an
148             ARRAY containing newline-terminated lines of source code.
149              
150             Returns a new C<PPI::Tokenizer> object on success, or throws a
151             L<PPI::Exception> on error.
152              
153             =cut
154              
155             sub new {
156 16938   33 16938 1 59482 my $class = ref($_[0]) || $_[0];
157              
158             # Create the empty tokenizer struct
159 16938         139892 my $self = bless {
160             # Source code
161             source => undef,
162             source_bytes => undef,
163             document => undef,
164              
165             # Line buffer
166             line => undef,
167             line_length => undef,
168             line_cursor => undef,
169             line_count => 0,
170              
171             # Parse state
172             token => undef,
173             class => 'PPI::Token::BOM',
174             zone => 'PPI::Token::Whitespace',
175             feature_set => undef,
176              
177             # Output token buffer
178             tokens => [],
179             token_cursor => 0,
180             token_eof => 0,
181              
182             # Perl 6 blocks
183             perl6 => [],
184             }, $class;
185              
186 16938 100       57469 if ( ! ref $_[1] ) {
    100          
    50          
187 518         2850 my $source = PPI::Util::_slurp($_[1]);
188 518 100       1558 PPI::Exception->throw("Tokenizer failed to open file: $source")
189             if not ref $source;
190 516         1822 $self->{source} = $$source;
191              
192             } elsif ( _SCALAR0($_[1]) ) {
193 16417         19034 $self->{source} = ${$_[1]};
  16417         33076  
194              
195             } elsif ( _ARRAY0($_[1]) ) {
196 3         7 $self->{source} = join '', map "$_\n", @{$_[1]};
  3         18  
197              
198             } else {
199             # We don't support whatever this is
200 0         0 PPI::Exception->throw(ref($_[1]) . " is not supported as a source provider");
201             }
202              
203             # We can't handle a null string
204 16936         31843 $self->{source_bytes} = length $self->{source};
205 16936 100       28929 if ( $self->{source_bytes} ) {
206             # Split on local newlines
207 16932         509857 $self->{source} =~ s/(?:\015{1,2}\012|\015|\012)/\n/g;
208 16932         247393 $self->{source} = [ split /(?<=\n)/, $self->{source} ];
209              
210             } else {
211 4         10 $self->{source} = [ ];
212             }
213              
214             ### EVIL
215             # I'm explaining this earlier than I should so you can understand
216             # why I'm about to do something that looks very strange. There's
217             # a problem with the Tokenizer, in that tokens tend to change
218             # classes as each letter is added, but they don't get allocated
219             # their definite final class until the "end" of the token, the
220             # detection of which occurs in about a hundred different places,
221             # all through various crufty code (that triples the speed).
222             #
223             # However, in general, this does not apply to tokens in which a
224             # whitespace character is valid, such as comments, whitespace and
225             # big strings.
226             #
227             # So what we do is add a space to the end of the source. This
228             # triggers normal "end of token" functionality for all cases. Then,
229             # once the tokenizer hits end of file, it examines the last token to
230             # manually either remove the ' ' token, or chop it off the end of
231             # a longer one in which the space would be valid.
232 16936 100   77085   66895 if ( List::Util::any { /^__(?:DATA|END)__\s*$/ } @{$self->{source}} ) {
  77085 100       143813  
  16936 100       68929  
233 10         29 $self->{source_eof_chop} = '';
234             } elsif ( ! defined $self->{source}->[0] ) {
235 4         9 $self->{source_eof_chop} = '';
236             } elsif ( $self->{source}->[-1] =~ /\s$/ ) {
237 1132         3066 $self->{source_eof_chop} = '';
238             } else {
239 15790         27535 $self->{source_eof_chop} = 1;
240 15790         27605 $self->{source}->[-1] .= ' ';
241             }
242              
243 16936         72031 $self;
244             }
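The newline normalization and sentinel-space trick performed above can be sketched as follows (Python, illustrative only; the `__DATA__`/`__END__` special case is omitted, and `prepare_source` is a hypothetical name, not PPI API):

```python
import re

def prepare_source(source):
    """Sketch of new()'s source preparation: normalize platform newlines,
    split into lines while keeping the terminators, then append a
    sentinel space so the final token gets an explicit "end"."""
    # Normalize CRLF / CR / LF variants to "\n"
    source = re.sub(r"\r{1,2}\n|\r|\n", "\n", source)
    # Split while keeping the trailing "\n" on each line
    lines = [line for line in re.split(r"(?<=\n)", source) if line]
    eof_chop = False
    if lines and not re.search(r"\s$", lines[-1]):
        # No trailing whitespace on the last line: add the sentinel
        # space, and remember to chop it back off at end of file.
        eof_chop = True
        lines[-1] += " "
    return lines, eof_chop
```

The `eof_chop` flag plays the role of `source_eof_chop`: it records that the added space must be removed again once tokenizing reaches end of file.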
245              
246             sub _document {
247 16845     16845   17929 my $self = shift;
248 16845 50       37931 return @_ ? $self->{document} = shift : $self->{document};
249             }
250              
251              
252              
253              
254              
255             #####################################################################
256             # Main Public Methods
257              
258             =pod
259              
260             =head2 get_token
261              
262             When using the PPI::Tokenizer object as an iterator, the C<get_token>
263             method is the primary method that is used. It increments the cursor
264             and returns the next Token in the output array.
265              
266             The actual parsing of the file is done only as-needed, and a line at
267             a time. When C<get_token> hits the end of the token array, it will
268             cause the parser to pull in the next line and parse it, continuing
269             as needed until there are more tokens on the output array that
270             get_token can then return.
271              
272             This means that a number of Tokenizer objects can be created, and
273             won't consume significant CPU until you actually begin to pull tokens
274             from them.
275              
276             Returns a L<PPI::Token> object on success, C<0> if the Tokenizer has
277             reached the end of the file, or C<undef> on error.
278              
279             =cut
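The buffer-plus-cursor protocol described above can be sketched in miniature (Python, illustrative; `LazyTokenCursor` and its line-feeding mechanism are hypothetical stand-ins for the real per-line parsing):

```python
class LazyTokenCursor:
    """Illustrative sketch of get_token's protocol: a token buffer and a
    cursor, with the buffer refilled lazily one "line" at a time, and a
    sentinel 0 returned once the input is exhausted."""
    def __init__(self, line_tokens):
        self.pending = list(line_tokens)  # stand-in for unparsed lines
        self.tokens = []                  # output token buffer
        self.cursor = 0

    def _process_next_line(self):
        # Hypothetical: parsing one line appends its tokens to the buffer.
        if not self.pending:
            return False  # end of file
        self.tokens.extend(self.pending.pop(0))
        return True

    def get_token(self):
        # Refill the buffer only when the cursor has run past it.
        while self.cursor >= len(self.tokens):
            if not self._process_next_line():
                return 0  # end of file
        token = self.tokens[self.cursor]
        self.cursor += 1
        return token
```

Note how a line that yields no tokens (an empty list here) simply causes another refill pass, mirroring the `while ( $line_rv = $self->_process_next_line )` loop below.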
280              
281             sub get_token {
282 595835     595835 1 578989 my $self = shift;
283              
284             # Shortcut for EOF
285 595835 50 66     887931 if ( $self->{token_eof}
286 13526         26173 and $self->{token_cursor} > scalar @{$self->{tokens}}
287             ) {
288 0         0 return 0;
289             }
290              
291             # Return the next token if we can
292 595835 100       1167787 if ( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) {
293 513491         498706 $self->{token_cursor}++;
294 513491         1195896 return $token;
295             }
296              
297 82344         88026 my $line_rv;
298              
299             # Catch exceptions and return undef, so that we
300             # can start to convert code to exception-based code.
301 82344         91632 my $rv = eval {
302             # No token, we need to get some more
303 82344         133541 while ( $line_rv = $self->_process_next_line ) {
304             # If there is something in the buffer, return it
305             # The defined() prevents a ton of calls to PPI::Util::TRUE
306 73337 100       146562 if ( defined( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) ) {
307 51974         59380 $self->{token_cursor}++;
308 51974         88125 return $token;
309             }
310             }
311 30369         38714 return undef;
312             };
313 82344 100       182534 if ( $@ ) {
    100          
314 1 50       8 if ( _INSTANCE($@, 'PPI::Exception') ) {
315 1         3 $@->throw;
316             } else {
317 0         0 my $errstr = $@;
318 0         0 $errstr =~ s/^(.*) at line .+$/$1/;
319 0         0 PPI::Exception->throw( $errstr );
320             }
321             } elsif ( $rv ) {
322 51974         202913 return $rv;
323             }
324              
325 30369 50       43004 if ( defined $line_rv ) {
326             # End of file, but we can still return things from the buffer
327 30369 50       48400 if ( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) {
328 0         0 $self->{token_cursor}++;
329 0         0 return $token;
330             }
331              
332             # Set our token end of file flag
333 30369         32396 $self->{token_eof} = 1;
334 30369         80154 return 0;
335             }
336              
337             # Error, pass it up to our caller
338 0         0 undef;
339             }
340              
341             =pod
342              
343             =head2 all_tokens
344              
345             When not being used as an iterator, the C<all_tokens> method tells
346             the Tokenizer to parse the entire file and return all of the tokens
347             in a single ARRAY reference.
348              
349             It should be noted that C<all_tokens> does B<not> interfere with the
350             use of the Tokenizer object as an iterator (does not modify the token
351             cursor) and use of the two different mechanisms can be mixed safely.
352              
353             Returns a reference to an ARRAY of L<PPI::Token> objects on success
354             or throws an exception on error.
355              
356             =cut
357              
358             sub all_tokens {
359 5     5 1 17 my $self = shift;
360              
361             # Catch exceptions and return undef, so that we
362             # can start to convert code to exception-based code.
363 5         6 my $ok = eval {
364             # Process lines until we get EOF
365 5 50       12 unless ( $self->{token_eof} ) {
366 5         6 my $rv;
367 5         12 while ( $rv = $self->_process_next_line ) {}
368 5 50       9 unless ( defined $rv ) {
369 0         0 PPI::Exception->throw("Error while processing source");
370             }
371              
372             # Clean up the end of the tokenizer
373 5         8 $self->_clean_eof;
374             }
375 5         8 1;
376             };
377 5 50       8 if ( !$ok ) {
378 0         0 my $errstr = $@;
379 0         0 $errstr =~ s/^(.*) at line .+$/$1/;
380 0         0 PPI::Exception->throw( $errstr );
381             }
382              
383             # End of file, return a copy of the token array.
384 5         5 return [ @{$self->{tokens}} ];
  5         19  
385             }
386              
387             =pod
388              
389             =head2 increment_cursor
390              
391             Although exposed as a public method, C<increment_cursor> is implemented
392             for expert use only, when writing lexers or other components that work
393             directly on token streams.
394              
395             It manually increments the token cursor forward through the file, in effect
396             "skipping" the next token.
397              
398             Returns true if the cursor is incremented, C<0> if already at the end of
399             the file, or C<undef> on error.
400              
401             =cut
402              
403             sub increment_cursor {
404             # Do this via the get_token method, which makes sure there
405             # is actually a token there to move to.
406 0 0   0 1 0 $_[0]->get_token and 1;
407             }
408              
409             =pod
410              
411             =head2 decrement_cursor
412              
413             Although exposed as a public method, C<decrement_cursor> is implemented
414             for expert use only, when writing lexers or other components that work
415             directly on token streams.
416              
417             It manually decrements the token cursor backwards through the file, in
418             effect "rolling back" the token stream. That is its primary purpose:
419             supporting components that consume the token stream and need to
420             implement some sort of "roll back" feature in their use of the
421             token stream.
422              
423             Returns true if the cursor is decremented, C<0> if already at the
424             beginning of the file, or C<undef> on error.
425              
426             =cut
427              
428             sub decrement_cursor {
429 0     0 1 0 my $self = shift;
430              
431             # Check for the beginning of the file
432 0 0       0 return 0 unless $self->{token_cursor};
433              
434             # Decrement the token cursor
435 0         0 $self->{token_eof} = 0;
436 0         0 --$self->{token_cursor};
437             }
438              
439              
440              
441              
442              
443             #####################################################################
444             # Working With Source
445              
446             # Fetches the next line from the input line buffer
447             # Returns undef at EOF.
448             sub _get_line {
449 111552     111552   112007 my $self = shift;
450 111552 100       183432 return undef unless $self->{source}; # EOF hit previously
451              
452             # Pull off the next line
453 95522         88885 my $line = shift @{$self->{source}};
  95522         152662  
454              
455             # Flag EOF if we hit it
456 95522 100       156843 $self->{source} = undef unless defined $line;
457              
458             # Return the line (or EOF flag)
459 95522         137968 return $line; # string or undef
460             }
461              
462             # Fetches the next line, ready to process
463             # Returns 1 on success
464             # Returns 0 on EOF
465             sub _fill_line {
466 109163     109163   110557 my $self = shift;
467 109163         112357 my $inscan = shift;
468              
469             # Get the next line
470 109163         152717 my $line = $self->_get_line;
471 109163 100       161039 unless ( defined $line ) {
472             # End of file
473 32405 100       44667 unless ( $inscan ) {
474 30374         45981 delete $self->{line};
475 30374         33423 delete $self->{line_cursor};
476 30374         30762 delete $self->{line_length};
477 30374         54152 return 0;
478             }
479              
480             # In the scan version, just set the cursor to the end
481             # of the line, and the rest should just cascade out.
482 2031         2834 $self->{line_cursor} = $self->{line_length};
483 2031         3757 return 0;
484             }
485              
486             # Populate the appropriate variables
487 76758         113625 $self->{line} = $line;
488 76758         88467 $self->{line_cursor} = -1;
489 76758         87679 $self->{line_length} = length $line;
490 76758         84760 $self->{line_count}++;
491              
492 76758         131030 1;
493             }
494              
495             # Get the current character
496             sub _char {
497 0     0   0 my $self = shift;
498 0         0 substr( $self->{line}, $self->{line_cursor}, 1 );
499             }
500              
501              
502              
503              
504              
505             ####################################################################
506             # Per line processing methods
507              
508             # Processes the next line
509             # Returns 1 on success completion
510             # Returns 0 if EOF
511             # Returns undef on error
512             sub _process_next_line {
513 103723     103723   110851 my $self = shift;
514              
515             # Fill the line buffer
516 103723         98909 my $rv;
517 103723 100       148667 unless ( $rv = $self->_fill_line ) {
518 30374 50       42932 return undef unless defined $rv;
519              
520             # End of file, finalize last token
521 30374         53925 $self->_finalize_token;
522 30374         55049 return 0;
523             }
524              
525             # Run the __TOKENIZER__on_line_start
526 73349         192059 $rv = $self->{class}->__TOKENIZER__on_line_start( $self );
527 73349 100       110111 unless ( $rv ) {
528             # If there are no more source lines, then clean up
529 28446 100 66     44266 if ( ref $self->{source} eq 'ARRAY' and ! @{$self->{source}} ) {
  28446         54399  
530 308         710 $self->_clean_eof;
531             }
532              
533             # Defined but false means next line
534 28446 50       54494 return 1 if defined $rv;
535 0         0 PPI::Exception->throw("Error at line $self->{line_count}");
536             }
537              
538             # If we can't deal with the entire line, process char by char
539 44903         77887 while ( $rv = $self->_process_next_char ) {}
540 44902 50       73765 unless ( defined $rv ) {
541 0         0 PPI::Exception->throw("Error at line $self->{line_count}, character $self->{line_cursor}");
542             }
543              
544             # Trigger any action that needs to happen at the end of a line
545 44902         108495 $self->{class}->__TOKENIZER__on_line_end( $self );
546              
547             # If there are no more source lines, then clean up
548 44902 100 100     99842 unless ( ref($self->{source}) eq 'ARRAY' and @{$self->{source}} ) {
  42629         95048  
549 16623         29700 return $self->_clean_eof;
550             }
551              
552 28279         49375 return 1;
553             }
554              
555              
556              
557              
558              
559             #####################################################################
560             # Per-character processing methods
561              
562             # Process on a per-character basis.
563             # Note that due the high number of times this gets
564             # called, it has been fairly heavily in-lined, so the code
565             # might look a bit ugly and duplicated.
566             sub _process_next_char {
567 703952     703952   657651 my $self = shift;
568              
569             ### FIXME - This checks for a screwed up condition that triggers
570             ### several warnings, amongst other things.
571 703952 50 33     1502413 if ( ! defined $self->{line_cursor} or ! defined $self->{line_length} ) {
572             # $DB::single = 1;
573 0         0 return undef;
574             }
575              
576 703952         673094 $self->{line_cursor}++;
577 703952 100       811065 return 0 if $self->_at_line_end;
578              
579             # Pass control to the token class
580 659050         631521 my $result;
581 659050 100       1147299 unless ( $result = $self->{class}->__TOKENIZER__on_char( $self ) ) {
582             # undef is error. 0 is "Did stuff ourself, you don't have to do anything"
583 93701 50       239721 return defined $result ? 1 : undef;
584             }
585              
586             # We will need the value of the current character
587 565348         739503 my $char = substr( $self->{line}, $self->{line_cursor}, 1 );
588 565348 100       748711 if ( $result eq '1' ) {
589             # If __TOKENIZER__on_char returns 1, it is signaling that it thinks that
590             # the character is part of it.
591              
592             # Add the character
593 96035 50       122910 if ( defined $self->{token} ) {
594 96035         126398 $self->{token}->{content} .= $char;
595             } else {
596 0 0       0 defined($self->{token} = $self->{class}->new($char)) or return undef;
597             }
598              
599 96035         178460 return 1;
600             }
601              
602             # We have been provided with the name of a class
603 469313 100       709178 if ( $self->{class} ne "PPI::Token::$result" ) {
    100          
604             # New class
605 227979         299628 $self->_new_token( $result, $char );
606             } elsif ( defined $self->{token} ) {
607             # Same class as current
608 44568         56860 $self->{token}->{content} .= $char;
609             } else {
610             # Same class, but no current
611 196766 50       332295 defined($self->{token} = $self->{class}->new($char)) or return undef;
612             }
613              
614 469313         859031 1;
615             }
616              
617             sub _at_line_end {
618 703952     703952   745948 my ($self) = @_;
619 703952         1175695 return $self->{line_cursor} >= $self->{line_length};
620             }
621              
622              
623              
624              
625              
626             #####################################################################
627             # Altering Tokens in Tokenizer
628              
629             # Finish the end of a token.
630             # Returns the resulting parse class as a convenience.
631             sub _finalize_token {
632 609119     609119   575557 my $self = shift;
633 609119 100       824516 return $self->{class} unless defined $self->{token};
634              
635             # Add the token to the token buffer
636 578743         523691 push @{ $self->{tokens} }, $self->{token};
  578743         797870  
637 578743         606541 $self->{token} = undef;
638              
639             # Return the parse class to that of the zone we are in
640 578743         1020570 $self->{class} = $self->{zone};
641             }
642              
643             # Creates a new token and sets it in the tokenizer
644             # The defined() in here prevent a ton of calls to PPI::Util::TRUE
645             sub _new_token {
646 381975     381975   361972 my $self = shift;
647             # throw PPI::Exception() unless @_;
648 381975 100       622299 my $class = substr( $_[0], 0, 12 ) eq 'PPI::Token::'
649             ? shift : 'PPI::Token::' . shift;
650              
651             # Finalize any existing token
652 381975 100       638362 $self->_finalize_token if defined $self->{token};
653              
654             # Create the new token and update the parse class
655 381975 50       750259 defined($self->{token} = $class->new($_[0])) or PPI::Exception->throw;
656 381975         451827 $self->{class} = $class;
657              
658 381975         421639 1;
659             }
660              
661             # At the end of the file, we need to clean up the results of the erroneous
662             # space that we inserted at the beginning of the process.
663             sub _clean_eof {
664 16936     16936   18313 my $self = shift;
665              
666             # Finish any partially completed token
667 16936 100       29391 $self->_finalize_token if $self->{token};
668              
669             # Find the last token, and if it has no content, kill it.
670             # There appears to be some evidence that such "null tokens" are
671             # somehow getting created accidentally.
672 16936         21267 my $last_token = $self->{tokens}->[ -1 ];
673 16936 50       28853 unless ( length $last_token->{content} ) {
674 0         0 pop @{$self->{tokens}};
  0         0  
675             }
676              
677             # Now, if the last character of the last token is a space we added,
678             # chop it off, deleting the token if there's nothing else left.
679 16936 100       27323 if ( $self->{source_eof_chop} ) {
680 15519         19133 $last_token = $self->{tokens}->[ -1 ];
681 15519         64724 $last_token->{content} =~ s/ $//;
682 15519 100       29032 unless ( length $last_token->{content} ) {
683             # Popping token
684 13247         14044 pop @{$self->{tokens}};
  13247         18090  
685             }
686              
687             # The hack involving adding an extra space is now reversed, and
688             # now nobody will ever know. The perfect crime!
689 15519         23851 $self->{source_eof_chop} = '';
690             }
691              
692 16936         47971 1;
693             }
694              
695              
696              
697              
698              
699             #####################################################################
700             # Utility Methods
701              
702             # Context
703             sub _last_token {
704 0     0   0 $_[0]->{tokens}->[-1];
705             }
706              
707             sub _last_significant_token {
708 3108     3108   3583 my $self = shift;
709 3108         3670 my $cursor = $#{ $self->{tokens} };
  3108         4977  
710 3108         6774 while ( $cursor >= 0 ) {
711 4162         5520 my $token = $self->{tokens}->[$cursor--];
712 4162 100       11266 return $token if $token->significant;
713             }
714 407         745 return;
715             }
716              
717             # Get an array ref of previous significant tokens.
718             # Like _last_significant_token except it gets more than just one token
719             # Returns array with 0 to x entries
720             sub _previous_significant_tokens {
721 164856     164856   167121 my $self = shift;
722 164856   50     231821 my $count = shift || 1;
723 164856         151620 my $cursor = $#{ $self->{tokens} };
  164856         211513  
724              
725 164856         169589 my @tokens;
726 164856         221409 while ( $cursor >= 0 ) {
727 266969         285924 my $token = $self->{tokens}->[$cursor--];
728 266969 100       466872 next if not $token->significant;
729 171887         179301 push @tokens, $token;
730 171887 100       264276 last if @tokens >= $count;
731             }
732              
733 164856         280101 return @tokens;
734             }
735              
736             my %OBVIOUS_CLASS = (
737             'PPI::Token::Symbol' => 'operator',
738             'PPI::Token::Magic' => 'operator',
739             'PPI::Token::Number' => 'operator',
740             'PPI::Token::ArrayIndex' => 'operator',
741             'PPI::Token::Quote::Double' => 'operator',
742             'PPI::Token::Quote::Interpolate' => 'operator',
743             'PPI::Token::Quote::Literal' => 'operator',
744             'PPI::Token::Quote::Single' => 'operator',
745             'PPI::Token::QuoteLike::Backtick' => 'operator',
746             'PPI::Token::QuoteLike::Command' => 'operator',
747             'PPI::Token::QuoteLike::Readline' => 'operator',
748             'PPI::Token::QuoteLike::Regexp' => 'operator',
749             'PPI::Token::QuoteLike::Words' => 'operator',
750             );
751              
752             my %OBVIOUS_CONTENT = (
753             '(' => 'operand',
754             '{' => 'operand',
755             '[' => 'operand',
756             ';' => 'operand',
757             '}' => 'operator',
758             );
759              
760              
761             my %USUALLY_FORCES = map { $_ => 1 } qw( sub package use no );
762              
763             # Try to determine operator/operand context, if possible.
764             # Returns "operator", "operand", or "" if unknown.
765             sub _opcontext {
766 10943     10943   12005 my $self = shift;
767 10943         18373 my @tokens = $self->_previous_significant_tokens(1);
768 10943         12973 my $p0 = $tokens[0];
769 10943 100       27098 return '' if not $p0;
770 10823         15762 my $c0 = ref $p0;
771              
772             # Map the obvious cases
773 10823 100       29976 return $OBVIOUS_CLASS{$c0} if defined $OBVIOUS_CLASS{$c0};
774 2232 100       4789 return $OBVIOUS_CONTENT{$p0} if defined $OBVIOUS_CONTENT{$p0};
775              
776             # Most of the time after an operator, we are an operand
777 1651 100       6627 return 'operand' if $p0->isa('PPI::Token::Operator');
778              
779             # If there's NOTHING, it's operand
780 1468 50       2666 return 'operand' if $p0->content eq '';
781              
782             # Otherwise, we don't know
783 1468         3163 return ''
784             }
785              
786             # Assuming we are currently parsing the word 'x', return true
787             # if previous tokens imply the x is an operator, false otherwise.
788             sub _current_x_is_operator {
789 1158     1158   1535 my ( $self ) = @_;
790 1158 100       1096 return if !@{$self->{tokens}};
  1158         2622  
791              
792 928         1835 my ($prev, $prevprev) = $self->_previous_significant_tokens(2);
793 928 50       2974 return if !$prev;
794              
795 928 100       3625 return !$self->__current_token_is_forced_word if $prev->isa('PPI::Token::Word');
796              
797             return (!$prev->isa('PPI::Token::Operator') || $X_CAN_FOLLOW_OPERATOR{$prev})
798 766   100     5062 && (!$prev->isa('PPI::Token::Structure') || $X_CAN_FOLLOW_STRUCTURE{$prev})
799             && !$prev->isa('PPI::Token::Label')
800             ;
801             }
802              
803              
804             # Assuming we are at the end of parsing the current token that could be a word,
805             # a wordlike operator, or a version string, try to determine whether context
806             # before or after it forces it to be a bareword. This method is only useful
807             # during tokenization.
808             sub __current_token_is_forced_word {
809 33384     33384   52834 my ( $t, $word ) = @_;
810              
811             # Check if forced by preceding tokens.
812              
813 33384         46410 my ( $prev, $prevprev ) = $t->_previous_significant_tokens(2);
814 33384 100       68241 if ( !$prev ) {
815 8885         19566 pos $t->{line} = $t->{line_cursor};
816             }
817             else {
818 24499         36881 my $content = $prev->{content};
819              
820             # We are forced if we are a method name.
821             # '->' will always be an operator, so we don't check its type.
822 24499 100       37172 return 1 if $content eq '->';
823              
824             # If we are contained in a pair of curly braces, we are probably a
825             # forced bareword hash key. '{' is never a word or operator, so we
826             # don't check its type.
827 24369         46075 pos $t->{line} = $t->{line_cursor};
828 24369 100 100     64637 return 1 if $content eq '{' and $t->{line} =~ /\G\s*\}/gc;
829              
830             # sub, package, use, and no all indicate that what immediately follows
831             # is a word not an operator or (in the case of sub and package) a
832             # version string. However, we don't want to be fooled by 'package
833             # package v10' or 'use no v10'. We're a forced package unless we're
834             # preceded by 'package sub', in which case we're a version string.
835             # We also have to make sure that the sub/package/etc doing the forcing
836             # is not a method call.
837 24142 100       46576 if( $USUALLY_FORCES{$content}) {
838 5631 100 66     15816 return if defined $word and $word =~ /^v[0-9]+$/ and ( $content eq "use" or $content eq "no" );
      100        
      100        
839 5621 100       23731 return 1 if not $prevprev;
840 236 100 100     536 return 1 if not $USUALLY_FORCES{$prevprev->content} and $prevprev->content ne '->';
841 6         27 return;
842             }
843             }
844             # pos on $t->{line} is guaranteed to be set at this point.
845              
846             # Check if forced by following tokens.
847              
848             # If the word is followed by => it is probably a word, not a regex.
849 27396 100       66355 return 1 if $t->{line} =~ /\G\s*=>/gc;
850              
851             # Otherwise we probably aren't forced
852 26606         119786 return '';
853             }
854              
855             sub _features {
856 37     37   55 my ( $self, $arg ) = @_;
857 37 50 0     91 return $arg ? $self->{feature_set} = $arg : $self->{feature_set} || {};
858             }
859              
860 12282     12282   37435 sub _current_token_has_signatures_active { shift->{feature_set}{signatures} }
861              
862             1;
863              
864             =pod
865              
866             =head1 NOTES
867              
868             =head2 How the Tokenizer Works
869              
870             Understanding the Tokenizer is not for the faint-hearted. It is by far
871             the most complex and twisty piece of perl I've ever written that is actually
872             still built properly and isn't a terrible spaghetti-like mess. In fact, you
873             probably want to skip this section.
874              
875             But if you really want to understand, well then here goes.
876              
877             =head2 Source Input and Clean Up
878              
879             The Tokenizer starts by taking source in a variety of forms, sucking it
880             all in and merging into one big string, and doing our own internal line
881             split, using a "universal line separator" which allows the Tokenizer to
882             take source for any platform (and even supports a few known types of
883             broken newlines caused by mixed mac/pc/*nix editor screw ups).
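The "universal line separator" idea can be sketched as follows. This is a minimal illustration of the technique, not PPI's actual implementation:

```perl
use strict;
use warnings;

# Split source on any common newline convention: Unix (\012),
# DOS/Windows (\015\012), classic Mac (\015), and the broken
# \015\015\012 sequence some mixed-platform editors produce.
sub split_universal {
    my ($source) = @_;
    return split /(?:\015{1,2}\012|\015|\012)/, $source;
}

my @lines = split_universal("one\ntwo\r\nthree\rfour");
print scalar(@lines), "\n";   # 4
print "$lines[2]\n";          # three
```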
884              
885             The resulting array of lines is used to feed the tokenizer, and is also
886             accessed directly by the heredoc-logic to do the line-oriented part of
887             here-doc support.
888              
889             =head2 Doing Things the Old Fashioned Way
890              
891             Due to the complexity of perl, and after 2 previously aborted parser
892             attempts, in the end the tokenizer was fashioned around a line-buffered
893             character-by-character method.
894              
895             That is, the Tokenizer pulls and holds a line at a time into a line buffer,
896             and then iterates a cursor along it. At each cursor position, a method is
897             called in whatever token class we are currently in, which will examine the
898             character at the current position, and handle it.
899              
900             As the handler methods in the various token classes are called, they
901             build up an output token array for the source code.
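The loop described above can be sketched roughly like this. The handler logic here is a toy stand-in; in real PPI the dispatch goes to per-class character-handler methods rather than a single sub:

```perl
use strict;
use warnings;

# A toy line-buffered, character-by-character tokenizer loop.
# At each cursor position the current "class" (here: just an
# if/else) examines the character and either extends the current
# token or finalises it and starts a new one.
sub toy_tokenize {
    my ($line) = @_;
    my @tokens;
    my $current = '';
    for my $cursor ( 0 .. length($line) - 1 ) {
        my $char = substr $line, $cursor, 1;
        if ( $char =~ /\s/ ) {
            # Whitespace finalises the token being built, if any
            push @tokens, $current if length $current;
            $current = '';
        }
        else {
            $current .= $char;
        }
    }
    push @tokens, $current if length $current;
    return @tokens;
}

print join( '|', toy_tokenize("my \$x = 1;") ), "\n";  # my|$x|=|1;
```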
902              
903             Various parts of the Tokenizer use look-ahead, arbitrary-distance
904             look-behind (although currently the maximum is three significant tokens),
905             or both, and various other heuristic guesses.
906              
907             I've been told it is officially termed a I<"backtracking parser
908             with infinite lookaheads">.
909              
910             =head2 State Variables
911              
912             Aside from the current line and the character cursor, the Tokenizer
913             maintains a number of different state variables.
914              
915             =over
916              
917             =item Current Class
918              
919             The Tokenizer maintains the current token class at all times. Much of the
920             time it is just going to be the "Whitespace" class, which is what the base of
921             a document is. As the tokenizer executes the various character handlers,
922             the class changes a lot as it moves along. In fact, in some instances,
923             the character handler may not handle the character directly itself, but
924             rather change the "current class" and then hand off to the character
925             handler for the new class.
926              
927             Because of this, and some other things I'll deal with later, the number of
928             times the character handlers are called does not in fact have a direct
929             relationship to the number of actual characters in the document.
930              
931             =item Current Zone
932              
933             Rather than create a class stack to allow for infinitely nested layers of
934             classes, the Tokenizer recognises just a single layer.
935              
936             To put it a different way, in various parts of the file, the Tokenizer will
937             recognise different "base" or "substrate" classes. When a Token such as a
938             comment or a number is finalised by the tokenizer, it "falls back" to the
939             base state.
940              
941             This allows proper tokenization of special areas such as __DATA__
942             and __END__ blocks, which also contain things like comments and POD,
943             without allowing the creation of any significant Tokens inside these areas.
944              
945             For the main part of a document we use L<PPI::Token::Whitespace> for this,
946             with the idea being that code is "floating in a sea of whitespace".
947              
948             =item Current Token
949              
950             The final main state variable is the "current token". This is the Token
951             that is currently being built by the Tokenizer. For certain types, it
952             can be manipulated and morphed and change class quite a bit while being
953             assembled, as the Tokenizer's understanding of the token content changes.
954              
955             When the Tokenizer is confident that it has seen the end of the Token, it
956             will be "finalized", which adds it to the output token array and resets
957             the current class to that of the zone that we are currently in.
958              
959             I should also note at this point that the "current token" variable is
960             optional. The Tokenizer is capable of knowing what class it is currently
961             set to, without actually having accumulated any characters in the Token.
962              
963             =back
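The three state variables above can be modelled in a heavily simplified sketch. The field names are chosen to match the description; the real tokenizer's internals are considerably more involved:

```perl
use strict;
use warnings;

# A stripped-down model of the tokenizer state described above.
my %state = (
    class => 'PPI::Token::Whitespace',  # current token class
    zone  => 'PPI::Token::Whitespace',  # current "base" class
    token => undef,                     # token being built, if any
);

# Finalising a token: push it onto the output array and fall
# back to the class of the zone we are currently in.
sub finalize {
    my ( $state, $tokens ) = @_;
    push @$tokens, $state->{token} if defined $state->{token};
    $state->{token} = undef;
    $state->{class} = $state->{zone};
}

my @tokens;
$state{class} = 'PPI::Token::Word';
$state{token} = { class => $state{class}, content => 'print' };
finalize( \%state, \@tokens );
print scalar(@tokens), " ", $state{class}, "\n";  # 1 PPI::Token::Whitespace
```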
964              
965             =head2 Making It Faster
966              
967             As I'm sure you can imagine, calling several different methods for each
968             character and running regexes and other complex heuristics made the first
969             fully working version of the tokenizer extremely slow.
970              
971             During testing, I created a metric to measure parsing speed called
972             LPGC, or "lines per gigacycle". A gigacycle is simply a billion CPU
973             cycles on a typical single-core CPU, and so a Tokenizer running at
974             "1000 lines per gigacycle" should tokenize around 1200 lines of code
975             per second when running on a 1200 MHz processor.
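As a quick worked example of the metric:

```perl
use strict;
use warnings;

# LPGC = lines tokenized per billion CPU cycles.
my $lpgc    = 1000;  # tokenizer speed, lines per gigacycle
my $cpu_ghz = 1.2;   # a 1200 MHz CPU executes 1.2 gigacycles per second
my $lines_per_second = $lpgc * $cpu_ghz;
print "$lines_per_second\n";   # 1200
```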
976              
977             The first working version of the tokenizer ran at only 350 LPGC, so to
978             tokenize a typical large module such as L<ExtUtils::MakeMaker> took
979             10-15 seconds. This sluggishness made it impractical for many uses.
980              
981             So in the current parser, there are multiple layers of optimisation
982             very carefully built into the basic design. This has brought the tokenizer
983             up to a more reasonable 1000 LPGC, at the expense of making the code
984             quite a bit twistier.
985              
986             =head2 Making It Faster - Whole Line Classification
987              
988             The first step in the optimisation process was to add a new handler to
989             enable several of the more basic classes (whitespace, comments) to be
990             able to be parsed a line at a time. At the start of each line, a
991             special optional handler (only supported by a few classes) is called to
992             check and see if the entire line can be parsed in one go.
993              
994             This is used mainly to handle things like POD, comments, empty lines,
995             and a few other minor special cases.
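The idea can be illustrated with a handler that claims an entire line when it is trivially classifiable. The regexes here are simplified stand-ins for the real checks:

```perl
use strict;
use warnings;

# Return a token class if the whole line can be consumed in one
# go, or undef if it needs ordinary char-by-char treatment.
sub classify_whole_line {
    my ($line) = @_;
    return 'PPI::Token::Whitespace' if $line =~ /^\s*$/;
    return 'PPI::Token::Comment'    if $line =~ /^\s*#/;
    return 'PPI::Token::Pod'        if $line =~ /^=\w+/;
    return undef;
}

print classify_whole_line("# a comment\n"), "\n";  # PPI::Token::Comment
print classify_whole_line("=pod\n"), "\n";         # PPI::Token::Pod
```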
996              
997             =head2 Making It Faster - Inlining
998              
999             The second stage of the optimisation involved inlining a small
1000             number of critical methods that were repeated an extremely high number
1001             of times. Profiling suggested that there were about 1,000,000 individual
1002             method calls per gigacycle, and by cutting these by two thirds a significant
1003             speed improvement was gained, in the order of about 50%.
1004              
1005             You may notice that many methods in the C<PPI::Tokenizer> code look
1006             very nested and longhand. This is primarily due to this inlining.
1007              
1008             At around this time, some statistics code that existed in the early
1009             versions of the parser was also removed, as it was determined that
1010             it was consuming around 15% of the CPU for the entire parser, while
1011             making the core more complicated.
1012              
1013             A judgment call was made that with the difficulties likely to be
1014             encountered with future planned enhancements, and given the relatively
1015             high cost involved, the statistics features would be removed from the
1016             Tokenizer.
1017              
1018             =head2 Making It Faster - Quote Engine
1019              
1020             Once inlining had reached diminishing returns, it became obvious from
1021             the profiling results that a huge amount of time was being spent
1022             stepping a char at a time through long, simple and "syntactically boring"
1023             code such as comments and strings.
1024              
1025             The existing regex engine was expanded to also encompass quotes and
1026             other quote-like things, and a special abstract base class was added
1027             that provided a number of specialised parsing methods that would "scan
1028             ahead", looking ahead to find the end of a string, and updating
1029             the cursor to leave it in a valid position for the next call.
1030              
1031             This is also the point at which the number of character handler calls began
1032             to greatly differ from the number of characters. But it has been done
1033             in a way that allows the parser to retain the power of the original
1034             version at the critical points, while skipping through the "boring bits"
1035             as needed for additional speed.
1036              
1037             The addition of this feature allowed the tokenizer to exceed 1000 LPGC
1038             for the first time.
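A minimal version of such a scan-ahead, for a single-quoted string, might look like this. PPI's real quote engine handles many more cases, but the principle is the same:

```perl
use strict;
use warnings;

# Starting just after an opening single quote, find the unescaped
# closing quote in one regex step instead of stepping char by char.
# Returns the string content and the new cursor position (just past
# the closing quote), or nothing if the string is not terminated.
sub scan_single_quote {
    my ( $line, $cursor ) = @_;
    pos($line) = $cursor;
    # Consume plain chars and escaped pairs, then the closing quote.
    if ( $line =~ /\G((?:[^'\\]|\\.)*)'/gc ) {
        my $content = $1;
        return ( $content, $cursor + length($content) + 1 );
    }
    return;
}

# The opening quote is at offset 8, so scanning starts at 9.
my ( $content, $new_cursor ) = scan_single_quote( q{my $x = 'it\\'s';}, 9 );
print "$content\n";   # it\'s
```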
1039              
1040             =head2 Making It Faster - The "Complete" Mechanism
1041              
1042             As it became evident that great speed increases were available by using
1043             this "skipping ahead" mechanism, a new handler method was added that
1044             explicitly handles the parsing of an entire token, where the structure
1045             of the token is relatively simple. Tokens such as symbols fit this case,
1046             as once we are past the initial sigil and word char, we know that we
1047             can skip ahead and "complete" the rest of the token much more easily.
1048              
1049             A number of these have been added for most or possibly all of the common
1050             cases, with most of these "complete" handlers implemented using regular
1051             expressions.
1052              
1053             In fact, so many have been added that at this point, you could arguably
1054             reclassify the tokenizer as a "hybrid regex, char-by-char heuristic
1055             tokenizer". More tokens are now consumed in "complete" methods in a
1056             typical program than are handled by the normal char-by-char methods.
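For instance, a "complete" handler for a symbol such as C<$foo> can consume the whole token with one regex once the sigil has been seen. This is a simplified sketch of the approach, not PPI's actual pattern:

```perl
use strict;
use warnings;

# Given a line and a cursor sitting on a sigil, consume the
# entire symbol token in a single regex step.
sub complete_symbol {
    my ( $line, $cursor ) = @_;
    pos($line) = $cursor;
    if ( $line =~ /\G([\$\@\%][\w:]*\w)/gc ) {
        return $1;
    }
    return;
}

print complete_symbol( 'my $Foo::bar = 1;', 3 ), "\n";  # $Foo::bar
```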
1057              
1058             Many of these complete-handlers were implemented during the writing
1059             of the Lexer, and this has allowed the full parser to maintain around
1060             1000 LPGC despite the increasing weight of the Lexer.
1061              
1062             =head2 Making It Faster - Porting To C (In Progress)
1063              
1064             While it would be extraordinarily difficult to port all of the Tokenizer
1065             to C, work has started on a L<PPI::XS> "accelerator" package which acts as
1066             a separate and automatically-detected add-on to the main PPI package.
1067              
1068             L<PPI::XS> implements faster versions of a variety of functions scattered
1069             over the entire PPI codebase, from the Tokenizer Core, Quote Engine, and
1070             various other places, and implements them identically in XS/C.
1071              
1072             In particular, the skip-ahead methods from the Quote Engine would appear
1073             to be extremely amenable to being done in C, and a number of other
1074             functions could be cherry-picked one at a time and implemented in C.
1075              
1076             Each method is heavily tested to ensure that the functionality is
1077             identical, and a versioning mechanism is included to ensure that if a
1078             function gets out of sync, L<PPI::XS> will degrade gracefully and just
1079             not replace that single method.
1080              
1081             =head1 TO DO
1082              
1083             - Add an option to reset or seek the token stream...
1084              
1085             - Implement more Tokenizer functions in L<PPI::XS>
1086              
1087             =head1 SUPPORT
1088              
1089             See the L<support section|PPI/SUPPORT> in the main module.
1090              
1091             =head1 AUTHOR
1092              
1093             Adam Kennedy E<lt>adamk@cpan.orgE<gt>
1094              
1095             =head1 COPYRIGHT
1096              
1097             Copyright 2001 - 2011 Adam Kennedy.
1098              
1099             This program is free software; you can redistribute
1100             it and/or modify it under the same terms as Perl itself.
1101              
1102             The full text of the license can be found in the
1103             LICENSE file included with this module.
1104              
1105             =cut