File Coverage

blib/lib/PPI/Tokenizer.pm
Criterion Covered Total %
statement 222 248 89.5
branch 116 142 81.6
condition 27 37 72.9
subroutine 29 33 87.8
pod 5 5 100.0
total 399 465 85.8


line stmt bran cond sub pod time code
1             package PPI::Tokenizer;
2              
3             =pod
4              
5             =head1 NAME
6              
7             PPI::Tokenizer - The Perl Document Tokenizer
8              
9             =head1 SYNOPSIS
10              
11             # Create a tokenizer for a file, array or string
12             $Tokenizer = PPI::Tokenizer->new( 'filename.pl' );
13             $Tokenizer = PPI::Tokenizer->new( \@lines );
14             $Tokenizer = PPI::Tokenizer->new( \$source );
15            
16             # Return all the tokens for the document
17             my $tokens = $Tokenizer->all_tokens;
18            
19             # Or we can use it as an iterator
20             while ( my $Token = $Tokenizer->get_token ) {
21             print "Found token '$Token'\n";
22             }
23            
24             # If we REALLY need to manually nudge the cursor, you
25             # can do that too (The lexer needs this ability to do rollbacks)
26             $is_incremented = $Tokenizer->increment_cursor;
27             $is_decremented = $Tokenizer->decrement_cursor;
28              
29             =head1 DESCRIPTION
30              
31             PPI::Tokenizer is the class that provides Tokenizer objects for use in
32             breaking strings of Perl source code into Tokens.
33              
34             By the time you are reading this, you probably need to know a little
35             about the difference between how perl parses Perl "code" and how PPI
36             parses Perl "documents".
37              
38             "perl" itself (the interpreter) uses a heavily modified lex specification
39             to specify its parsing logic, maintains several types of state as it
40             goes, and incrementally tokenizes, lexes AND EXECUTES at the same time.
41              
42             In fact, it is provably impossible to use perl's parsing method without
43             simultaneously executing code. A formal mathematical proof has been
44             published demonstrating the method.
45              
46             This is where the truism "Only perl can parse Perl" comes from.
47              
48             PPI uses a completely different approach by abandoning the (impossible)
49             ability to parse Perl the same way that the interpreter does, and instead
50             parsing the source as a document, using a document structure independently
51             derived from the Perl documentation and approximating the perl
52             interpreter's interpretation as closely as possible.
53              
54             It was touch and go for a long time whether we could get it close enough,
55             but in the end it turned out that it could be done.
56              
57             In this approach, the tokenizer C<PPI::Tokenizer> is implemented separately
58             from the lexer L<PPI::Lexer>.
59              
60             The job of C<PPI::Tokenizer> is to take pure source as a string and break it
61             up into a stream/set of tokens, and contains most of the "black magic" used
62             in PPI. By comparison, the lexer implements a relatively straightforward
63             tree structure, and has an implementation that is uncomplicated (compared
64             to the insanity in the tokenizer at least).
65              
66             The Tokenizer uses an immense amount of heuristics, guessing and cruft,
67             supported by a very B<flexible> internal API, but fortunately it was
68             possible to largely encapsulate the black magic, so there is not a lot that
69             gets exposed to people using the C<PPI::Tokenizer> itself.
70              
71             =head1 METHODS
72              
73             Despite the incredible complexity, the Tokenizer itself only exposes a
74             relatively small number of methods, with most of the complexity implemented
75             in private methods.
76              
77             =cut
78              
79             # Make sure everything we need is loaded so
80             # we don't have to go and load all of PPI.
81 67     67   348 use strict;
  67         139  
  67         2261  
82 67     67   234 use Params::Util qw{_INSTANCE _SCALAR0 _ARRAY0};
  67         122  
  67         3445  
83 67     67   316 use List::Util 1.33 ();
  67         1047  
  67         1166  
84 67     67   238 use PPI::Util ();
  67         101  
  67         1103  
85 67     67   246 use PPI::Element ();
  67         362  
  67         885  
86 67     67   237 use PPI::Token ();
  67         147  
  67         790  
87 67     67   242 use PPI::Exception ();
  67         104  
  67         871  
88 67     67   24011 use PPI::Exception::ParserRejection ();
  67         159  
  67         1167  
89 67     67   261 use PPI::Document ();
  67         93  
  67         174948  
90              
91             our $VERSION = '1.287';
92              
93             # The x operator cannot follow most Perl operators, implying that
94             # anything beginning with x following an operator is a word.
95             # These are the exceptions.
96             my %X_CAN_FOLLOW_OPERATOR = map { $_ => 1 } qw( -- ++ );
97              
98             # The x operator cannot follow most structure elements, implying that
99             # anything beginning with x following a structure element is a word.
100             # These are the exceptions.
101             my %X_CAN_FOLLOW_STRUCTURE = map { $_ => 1 } qw( } ] \) );
102              
103             # Something that looks like the x operator but follows a word
104             # is usually that word's argument.
105             # These are the exceptions.
106             # chop, chomp, dump are ambiguous because they can have either parms
107             # or no parms.
108             my %X_CAN_FOLLOW_WORD = map { $_ => 1 } qw(
109             endgrent
110             endhostent
111             endnetent
112             endprotoent
113             endpwent
114             endservent
115             fork
116             getgrent
117             gethostent
118             getlogin
119             getnetent
120             getppid
121             getprotoent
122             getpwent
123             getservent
124             setgrent
125             setpwent
126             time
127             times
128             wait
129             wantarray
130             __SUB__
131             );
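As an aside for readers tracing this heuristic: the three exception tables boil down to set-membership checks keyed on the previous token. A minimal Python sketch of that lookup (the function name `x_can_be_operator` and the three-way `prev_kind` split are illustrative simplifications, not PPI's API):

```python
# Exceptions under which a following "x" can still be the repetition
# operator (mirrors %X_CAN_FOLLOW_OPERATOR, %X_CAN_FOLLOW_STRUCTURE,
# and %X_CAN_FOLLOW_WORD from the source above).
X_CAN_FOLLOW_OPERATOR = {"--", "++"}
X_CAN_FOLLOW_STRUCTURE = {"}", "]", ")"}
X_CAN_FOLLOW_WORD = {
    "endgrent", "endhostent", "endnetent", "endprotoent", "endpwent",
    "endservent", "fork", "getgrent", "gethostent", "getlogin",
    "getnetent", "getppid", "getprotoent", "getpwent", "getservent",
    "setgrent", "setpwent", "time", "times", "wait", "wantarray",
    "__SUB__",
}

def x_can_be_operator(prev_kind, prev_content):
    """Decide whether an "x" after the previous token may be the x
    operator.  prev_kind ('operator', 'structure', or 'word') is a
    hypothetical stand-in for the previous token's PPI class."""
    if prev_kind == "operator":
        return prev_content in X_CAN_FOLLOW_OPERATOR
    if prev_kind == "structure":
        return prev_content in X_CAN_FOLLOW_STRUCTURE
    if prev_kind == "word":
        return prev_content in X_CAN_FOLLOW_WORD
    return True  # other token kinds don't block the x operator
```

So after `++` an `x` may still be the operator, while after most words (`print x...`) it is treated as that word's argument.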
132              
133              
134              
135             #####################################################################
136             # Creation and Initialization
137              
138             =pod
139              
140             =head2 new $file | \@lines | \$source
141              
142             The main C<new> constructor creates a new Tokenizer object. These
143             objects have no configuration parameters, and can only be used once,
144             to tokenize a single perl source file.
145              
146             It takes as argument either a normal scalar containing source code,
147             a reference to a scalar containing source code, or a reference to an
148             ARRAY containing newline-terminated lines of source code.
149              
150             Returns a new C<PPI::Tokenizer> object on success, or throws a
151             L<PPI::Exception> exception on error.
152              
153             =cut
154              
155             sub new {
156 16933   33 16933 1 61972 my $class = ref($_[0]) || $_[0];
157              
158             # Create the empty tokenizer struct
159 16933         142366 my $self = bless {
160             # Source code
161             source => undef,
162             source_bytes => undef,
163             document => undef,
164              
165             # Line buffer
166             line => undef,
167             line_length => undef,
168             line_cursor => undef,
169             line_count => 0,
170              
171             # Parse state
172             token => undef,
173             class => 'PPI::Token::BOM',
174             zone => 'PPI::Token::Whitespace',
175             feature_set => undef,
176              
177             # Output token buffer
178             tokens => [],
179             token_cursor => 0,
180             token_eof => 0,
181              
182             # Perl 6 blocks
183             perl6 => [],
184             }, $class;
185              
186 16933 100       58799 if ( ! ref $_[1] ) {
    100          
    50          
187 515         2662 my $source = PPI::Util::_slurp($_[1]);
188 515 100       1709 PPI::Exception->throw("Tokenizer failed to open file: $source")
189             if not ref $source;
190 513         2038 $self->{source} = $$source;
191              
192             } elsif ( _SCALAR0($_[1]) ) {
193 16415         19135 $self->{source} = ${$_[1]};
  16415         33118  
194              
195             } elsif ( _ARRAY0($_[1]) ) {
196 3         4 $self->{source} = join '', map "$_\n", @{$_[1]};
  3         16  
197              
198             } else {
199             # We don't support whatever this is
200 0         0 PPI::Exception->throw(ref($_[1]) . " is not supported as a source provider");
201             }
202              
203             # We can't handle a null string
204 16931         31312 $self->{source_bytes} = length $self->{source};
205 16931 100       30840 if ( $self->{source_bytes} ) {
206             # Split on local newlines
207 16927         571059 $self->{source} =~ s/(?:\015{1,2}\012|\015|\012)/\n/g;
208 16927         276090 $self->{source} = [ split /(?<=\n)/, $self->{source} ];
209              
210             } else {
211 4         9 $self->{source} = [ ];
212             }
213              
214             ### EVIL
215             # I'm explaining this earlier than I should so you can understand
216             # why I'm about to do something that looks very strange. There's
217             # a problem with the Tokenizer, in that tokens tend to change
218             # classes as each letter is added, but they don't get allocated
219             # their definite final class until the "end" of the token, the
220             # detection of which occurs in about a hundred different places,
221             # all through various crufty code (that triples the speed).
222             #
223             # However, in general, this does not apply to tokens in which a
224             # whitespace character is valid, such as comments, whitespace and
225             # big strings.
226             #
227             # So what we do is add a space to the end of the source. This
228             # triggers normal "end of token" functionality for all cases. Then,
229             # once the tokenizer hits end of file, it examines the last token to
230             # manually either remove the ' ' token, or chop it off the end of
231             # a longer one in which the space would be valid.
232 16931 100   76941   65833 if ( List::Util::any { /^__(?:DATA|END)__\s*$/ } @{$self->{source}} ) {
  76941 100       153552  
  16931 100       69426  
233 10         28 $self->{source_eof_chop} = '';
234             } elsif ( ! defined $self->{source}->[0] ) {
235 4         8 $self->{source_eof_chop} = '';
236             } elsif ( $self->{source}->[-1] =~ /\s$/ ) {
237 1130         2997 $self->{source_eof_chop} = '';
238             } else {
239 15787         29093 $self->{source_eof_chop} = 1;
240 15787         28479 $self->{source}->[-1] .= ' ';
241             }
242              
243 16931         72193 $self;
244             }
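Stripped of bookkeeping, the constructor's source handling performs three steps: normalize newlines, split while keeping each line's terminator, and apply the trailing-space trick described in the EVIL comment above. A Python sketch under those assumptions (`prepare_source` is an invented name; the real method also accepts filenames and array refs):

```python
import re

def prepare_source(source):
    """Model sub new's source handling: normalize newlines, split into
    lines that keep their "\n", and append a space unless the last line
    already ends in whitespace (or an __END__/__DATA__ line exists)."""
    if not source:
        return [], ""
    # Normalize CRLF / CR / LF to "\n", as in the Perl substitution
    # s/(?:\015{1,2}\012|\015|\012)/\n/g.
    source = re.sub(r"\r{1,2}\n|\r|\n", "\n", source)
    # Split but keep the newline on each line (split /(?<=\n)/).
    lines = re.findall(r"[^\n]*\n|[^\n]+", source)
    if any(re.match(r"^__(?:DATA|END)__\s*$", line) for line in lines):
        eof_chop = ""          # trailing space never reaches the tokens
    elif re.search(r"\s$", lines[-1]):
        eof_chop = ""          # already ends in whitespace, nothing to add
    else:
        eof_chop = True        # remember to chop the space off at EOF
        lines[-1] += " "
    return lines, eof_chop
```

For example, `prepare_source("my $x")` yields `(["my $x "], True)`: the space forces normal end-of-token handling, and the flag tells `_clean_eof` to remove it again later.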
245              
246             sub _document {
247 16840     16840   19777 my $self = shift;
248 16840 50       38416 return @_ ? $self->{document} = shift : $self->{document};
249             }
250              
251              
252              
253              
254              
255             #####################################################################
256             # Main Public Methods
257              
258             =pod
259              
260             =head2 get_token
261              
262             When using the PPI::Tokenizer object as an iterator, the C<get_token>
263             method is the primary method that is used. It increments the cursor
264             and returns the next Token in the output array.
265              
266             The actual parsing of the file is done only as-needed, and a line at
267             a time. When C<get_token> hits the end of the token array, it will
268             cause the parser to pull in the next line and parse it, continuing
269             as needed until there are more tokens on the output array that
270             get_token can then return.
271              
272             This means that a number of Tokenizer objects can be created, and
273             won't consume significant CPU until you actually begin to pull tokens
274             from them.
275              
276             Returns a L<PPI::Token> object on success, C<0> if the Tokenizer has
277             reached the end of the file, or C<undef> on error.
278              
279             =cut
280              
281             sub get_token {
282 595948     595948 1 590452 my $self = shift;
283              
284             # Shortcut for EOF
285 595948 50 66     910719 if ( $self->{token_eof}
286 13579         26825 and $self->{token_cursor} > scalar @{$self->{tokens}}
287             ) {
288 0         0 return 0;
289             }
290              
291             # Return the next token if we can
292 595948 100       1219153 if ( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) {
293 513705         523611 $self->{token_cursor}++;
294 513705         1228165 return $token;
295             }
296              
297 82243         84026 my $line_rv;
298              
299             # Catch exceptions and return undef, so that we
300             # can start to convert code to exception-based code.
301 82243         89945 my $rv = eval {
302             # No token, we need to get some more
303 82243         149175 while ( $line_rv = $self->_process_next_line ) {
304             # If there is something in the buffer, return it
305             # The defined() prevents a ton of calls to PPI::Util::TRUE
306 73141 100       155455 if ( defined( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) ) {
307 51825         54739 $self->{token_cursor}++;
308 51825         89810 return $token;
309             }
310             }
311 30417         38778 return undef;
312             };
313 82243 100       190390 if ( $@ ) {
    100          
314 1 50       70 if ( _INSTANCE($@, 'PPI::Exception') ) {
315 1         6 $@->throw;
316             } else {
317 0         0 my $errstr = $@;
318 0         0 $errstr =~ s/^(.*) at line .+$/$1/;
319 0         0 PPI::Exception->throw( $errstr );
320             }
321             } elsif ( $rv ) {
322 51825         227615 return $rv;
323             }
324              
325 30417 50       43641 if ( defined $line_rv ) {
326             # End of file, but we can still return things from the buffer
327 30417 50       49442 if ( my $token = $self->{tokens}->[ $self->{token_cursor} ] ) {
328 0         0 $self->{token_cursor}++;
329 0         0 return $token;
330             }
331              
332             # Set our token end of file flag
333 30417         32990 $self->{token_eof} = 1;
334 30417         82925 return 0;
335             }
336              
337             # Error, pass it up to our caller
338 0         0 undef;
339             }
340              
341             =pod
342              
343             =head2 all_tokens
344              
345             When not being used as an iterator, the C<all_tokens> method tells
346             the Tokenizer to parse the entire file and return all of the tokens
347             in a single ARRAY reference.
348              
349             It should be noted that C<all_tokens> does B<not> interfere with the
350             use of the Tokenizer object as an iterator (does not modify the token
351             cursor) and use of the two different mechanisms can be mixed safely.
352              
353             Returns a reference to an ARRAY of L<PPI::Token> objects on success
354             or throws an exception on error.
355              
356             =cut
357              
358             sub all_tokens {
359 5     5 1 15 my $self = shift;
360              
361             # Catch exceptions and return undef, so that we
362             # can start to convert code to exception-based code.
363 5         7 my $ok = eval {
364             # Process lines until we get EOF
365 5 50       12 unless ( $self->{token_eof} ) {
366 5         5 my $rv;
367 5         11 while ( $rv = $self->_process_next_line ) {}
368 5 50       8 unless ( defined $rv ) {
369 0         0 PPI::Exception->throw("Error while processing source");
370             }
371              
372             # Clean up the end of the tokenizer
373 5         8 $self->_clean_eof;
374             }
375 5         6 1;
376             };
377 5 50       22 if ( !$ok ) {
378 0         0 my $errstr = $@;
379 0         0 $errstr =~ s/^(.*) at line .+$/$1/;
380 0         0 PPI::Exception->throw( $errstr );
381             }
382              
383             # End of file, return a copy of the token array.
384 5         7 return [ @{$self->{tokens}} ];
  5         18  
385             }
386              
387             =pod
388              
389             =head2 increment_cursor
390              
391             Although exposed as a public method, C<increment_cursor> is implemented
392             for expert use only, when writing lexers or other components that work
393             directly on token streams.
394              
395             It manually increments the token cursor forward through the file, in effect
396             "skipping" the next token.
397              
398             Returns true if the cursor is incremented, C<0> if already at the end of
399             the file, or C<undef> on error.
400              
401             =cut
402              
403             sub increment_cursor {
404             # Do this via the get_token method, which makes sure there
405             # is actually a token there to move to.
406 0 0   0 1 0 $_[0]->get_token and 1;
407             }
408              
409             =pod
410              
411             =head2 decrement_cursor
412              
413             Although exposed as a public method, C<decrement_cursor> is implemented
414             for expert use only, when writing lexers or other components that work
415             directly on token streams.
416              
417             It manually decrements the token cursor backwards through the file, in
418             effect "rolling back" the token stream. And indeed that is what it is
419             primarily intended for, when the component that is consuming the token
420             stream needs to implement some sort of "roll back" feature in its use
421             of the token stream.
422              
423             Returns true if the cursor is decremented, C<0> if already at the
424             beginning of the file, or C<undef> on error.
425              
426             =cut
427              
428             sub decrement_cursor {
429 0     0 1 0 my $self = shift;
430              
431             # Check for the beginning of the file
432 0 0       0 return 0 unless $self->{token_cursor};
433              
434             # Decrement the token cursor
435 0         0 $self->{token_eof} = 0;
436 0         0 --$self->{token_cursor};
437             }
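Taken together, `get_token`, `increment_cursor`, and `decrement_cursor` are bounded moves over a token buffer plus an EOF flag. A stripped-down Python sketch of just that bookkeeping (the class name `CursorBuffer` is invented; the real tokenizer also refills the buffer lazily from the source):

```python
class CursorBuffer:
    """Token buffer with a cursor, mirroring token_cursor/token_eof."""

    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.cursor = 0
        self.eof = False

    def get_token(self):
        # Return the next token, or 0 once the buffer is exhausted.
        if self.cursor < len(self.tokens):
            token = self.tokens[self.cursor]
            self.cursor += 1
            return token
        self.eof = True
        return 0

    def increment_cursor(self):
        # Skip a token; true only if one was actually there to skip.
        return self.get_token() != 0

    def decrement_cursor(self):
        # Roll back one token; 0 if already at the beginning.
        if not self.cursor:
            return 0
        self.eof = False          # rolling back clears the EOF state
        self.cursor -= 1
        return True
```

This is the rollback facility the lexer relies on: after hitting EOF, a `decrement_cursor` clears the flag and the same token can be fetched again.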
438              
439              
440              
441              
442              
443             #####################################################################
444             # Working With Source
445              
446             # Fetches the next line from the input line buffer
447             # Returns undef at EOF.
448             sub _get_line {
449 111478     111478   105701 my $self = shift;
450 111478 100       178540 return undef unless $self->{source}; # EOF hit previously
451              
452             # Pull off the next line
453 95373         92663 my $line = shift @{$self->{source}};
  95373         156772  
454              
455             # Flag EOF if we hit it
456 95373 100       156340 $self->{source} = undef unless defined $line;
457              
458             # Return the line (or EOF flag)
459 95373         148428 return $line; # string or undef
460             }
461              
462             # Fetches the next line, ready to process
463             # Returns 1 on success
464             # Returns 0 on EOF
465             sub _fill_line {
466 109101     109101   109663 my $self = shift;
467 109101         114200 my $inscan = shift;
468              
469             # Get the next line
470 109101         156399 my $line = $self->_get_line;
471 109101 100       166347 unless ( defined $line ) {
472             # End of file
473 32475 100       46924 unless ( $inscan ) {
474 30422         45815 delete $self->{line};
475 30422         33886 delete $self->{line_cursor};
476 30422         32629 delete $self->{line_length};
477 30422         53775 return 0;
478             }
479              
480             # In the scan version, just set the cursor to the end
481             # of the line, and the rest should just cascade out.
482 2053         2889 $self->{line_cursor} = $self->{line_length};
483 2053         3591 return 0;
484             }
485              
486             # Populate the appropriate variables
487 76626         122229 $self->{line} = $line;
488 76626         90405 $self->{line_cursor} = -1;
489 76626         94357 $self->{line_length} = length $line;
490 76626         83530 $self->{line_count}++;
491              
492 76626         131347 1;
493             }
494              
495             # Get the current character
496             sub _char {
497 0     0   0 my $self = shift;
498 0         0 substr( $self->{line}, $self->{line_cursor}, 1 );
499             }
500              
501              
502              
503              
504              
505             ####################################################################
506             # Per line processing methods
507              
508             # Processes the next line
509             # Returns 1 on success completion
510             # Returns 0 if EOF
511             # Returns undef on error
512             sub _process_next_line {
513 103575     103575   104449 my $self = shift;
514              
515             # Fill the line buffer
516 103575         108313 my $rv;
517 103575 100       153018 unless ( $rv = $self->_fill_line ) {
518 30422 50       43052 return undef unless defined $rv;
519              
520             # End of file, finalize last token
521 30422         52155 $self->_finalize_token;
522 30422         55111 return 0;
523             }
524              
525             # Run the __TOKENIZER__on_line_start
526 73153         191541 $rv = $self->{class}->__TOKENIZER__on_line_start( $self );
527 73153 100       115893 unless ( $rv ) {
528             # If there are no more source lines, then clean up
529 28373 100 66     49490 if ( ref $self->{source} eq 'ARRAY' and ! @{$self->{source}} ) {
  28373         58691  
530 308         844 $self->_clean_eof;
531             }
532              
533             # Defined but false means next line
534 28373 50       61267 return 1 if defined $rv;
535 0         0 PPI::Exception->throw("Error at line $self->{line_count}");
536             }
537              
538             # If we can't deal with the entire line, process char by char
539 44780         77870 while ( $rv = $self->_process_next_char ) {}
540 44779 50       76253 unless ( defined $rv ) {
541 0         0 PPI::Exception->throw("Error at line $self->{line_count}, character $self->{line_cursor}");
542             }
543              
544             # Trigger any action that needs to happen at the end of a line
545 44779         109358 $self->{class}->__TOKENIZER__on_line_end( $self );
546              
547             # If there are no more source lines, then clean up
548 44779 100 100     96820 unless ( ref($self->{source}) eq 'ARRAY' and @{$self->{source}} ) {
  42484         98134  
549 16618         31319 return $self->_clean_eof;
550             }
551              
552 28161         57358 return 1;
553             }
554              
555              
556              
557              
558              
559             #####################################################################
560             # Per-character processing methods
561              
562             # Process on a per-character basis.
563             # Note that due the high number of times this gets
564             # called, it has been fairly heavily in-lined, so the code
565             # might look a bit ugly and duplicated.
566             sub _process_next_char {
567 703888     703888   674284 my $self = shift;
568              
569             ### FIXME - This checks for a screwed up condition that triggers
570             ### several warnings, amongst other things.
571 703888 50 33     1578562 if ( ! defined $self->{line_cursor} or ! defined $self->{line_length} ) {
572             # $DB::single = 1;
573 0         0 return undef;
574             }
575              
576 703888         684810 $self->{line_cursor}++;
577 703888 100       836213 return 0 if $self->_at_line_end;
578              
579             # Pass control to the token class
580 659109         637099 my $result;
581 659109 100       1193995 unless ( $result = $self->{class}->__TOKENIZER__on_char( $self ) ) {
582             # undef is error. 0 is "Did stuff ourself, you don't have to do anything"
583 93825 50       242003 return defined $result ? 1 : undef;
584             }
585              
586             # We will need the value of the current character
587 565283         746661 my $char = substr( $self->{line}, $self->{line_cursor}, 1 );
588 565283 100       755103 if ( $result eq '1' ) {
589             # If __TOKENIZER__on_char returns 1, it is signaling that it thinks that
590             # the character is part of it.
591              
592             # Add the character
593 96033 50       126342 if ( defined $self->{token} ) {
594 96033         133003 $self->{token}->{content} .= $char;
595             } else {
596 0 0       0 defined($self->{token} = $self->{class}->new($char)) or return undef;
597             }
598              
599 96033         180368 return 1;
600             }
601              
602             # We have been provided with the name of a class
603 469250 100       746769 if ( $self->{class} ne "PPI::Token::$result" ) {
    100          
604             # New class
605 228343         304106 $self->_new_token( $result, $char );
606             } elsif ( defined $self->{token} ) {
607             # Same class as current
608 44497         58820 $self->{token}->{content} .= $char;
609             } else {
610             # Same class, but no current
611 196410 50       345356 defined($self->{token} = $self->{class}->new($char)) or return undef;
612             }
613              
614 469250         870107 1;
615             }
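The contract in `_process_next_char` — the current class either claims the character (result `1`) or names a new class to switch to — is easiest to see as a run-grouping state machine. A deliberately simplified Python sketch, where a `classify(char)` callback stands in for `__TOKENIZER__on_char` and returning the current class plays the role of returning 1:

```python
def tokenize(line, classify):
    """Group consecutive characters whose class is unchanged into one
    token; a class change finalizes the current token and starts a new
    one (a simplification of the on_char dispatch above)."""
    tokens = []
    cls, buf = None, ""
    for char in line:
        new_cls = classify(char)
        if new_cls == cls:
            buf += char                    # same class: extend the token
        else:
            if buf:
                tokens.append((cls, buf))  # finalize the previous token
            cls, buf = new_cls, char       # start a new token
    if buf:
        tokens.append((cls, buf))          # finalize the last token
    return tokens

def classify(char):
    # Toy stand-in for the per-class __TOKENIZER__on_char methods.
    if char.isspace():
        return "Whitespace"
    if char.isdigit():
        return "Number"
    return "Word"
```

`tokenize("my 42", classify)` returns `[("Word", "my"), ("Whitespace", " "), ("Number", "42")]`; the real tokenizer differs mainly in that each token class carries its own transition logic.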
616              
617             sub _at_line_end {
618 703888     703888   756729 my ($self) = @_;
619 703888         1207913 return $self->{line_cursor} >= $self->{line_length};
620             }
621              
622              
623              
624              
625              
626             #####################################################################
627             # Altering Tokens in Tokenizer
628              
629             # Finish the end of a token.
630             # Returns the resulting parse class as a convenience.
631             sub _finalize_token {
632 609223     609223   573060 my $self = shift;
633 609223 100       825814 return $self->{class} unless defined $self->{token};
634              
635             # Add the token to the token buffer
636 578799         535213 push @{ $self->{tokens} }, $self->{token};
  578799         812106  
637 578799         618530 $self->{token} = undef;
638              
639             # Return the parse class to that of the zone we are in
640 578799         1040773 $self->{class} = $self->{zone};
641             }
642              
643             # Creates a new token and sets it in the tokenizer
644             # The defined() in here prevent a ton of calls to PPI::Util::TRUE
645             sub _new_token {
646 382387     382387   388503 my $self = shift;
647             # throw PPI::Exception() unless @_;
648 382387 100       639382 my $class = substr( $_[0], 0, 12 ) eq 'PPI::Token::'
649             ? shift : 'PPI::Token::' . shift;
650              
651             # Finalize any existing token
652 382387 100       655410 $self->_finalize_token if defined $self->{token};
653              
654             # Create the new token and update the parse class
655 382387 50       777437 defined($self->{token} = $class->new($_[0])) or PPI::Exception->throw;
656 382387         461250 $self->{class} = $class;
657              
658 382387         432527 1;
659             }
660              
661             # At the end of the file, we need to clean up the results of the erroneous
662             # space that we inserted at the beginning of the process.
663             sub _clean_eof {
664 16931     16931   20446 my $self = shift;
665              
666             # Finish any partially completed token
667 16931 100       29286 $self->_finalize_token if $self->{token};
668              
669             # Find the last token, and if it has no content, kill it.
670             # There appears to be some evidence that such "null tokens" are
671             # somehow getting created accidentally.
672 16931         22478 my $last_token = $self->{tokens}->[ -1 ];
673 16931 50       31435 unless ( length $last_token->{content} ) {
674 0         0 pop @{$self->{tokens}};
  0         0  
675             }
676              
677             # Now, if the last character of the last token is a space we added,
678             # chop it off, deleting the token if there's nothing else left.
679 16931 100       29033 if ( $self->{source_eof_chop} ) {
680 15515         22282 $last_token = $self->{tokens}->[ -1 ];
681 15515         64383 $last_token->{content} =~ s/ $//;
682 15515 100       29584 unless ( length $last_token->{content} ) {
683             # Popping token
684 13238         13392 pop @{$self->{tokens}};
  13238         17918  
685             }
686              
687             # The hack involving adding an extra space is now reversed, and
688             # now nobody will ever know. The perfect crime!
689 15515         24959 $self->{source_eof_chop} = '';
690             }
691              
692 16931         47812 1;
693             }
694              
695              
696              
697              
698              
699             #####################################################################
700             # Utility Methods
701              
702             # Context
703             sub _last_token {
704 0     0   0 $_[0]->{tokens}->[-1];
705             }
706              
707             sub _last_significant_token {
708 3133     3133   3594 my $self = shift;
709 3133         3494 my $cursor = $#{ $self->{tokens} };
  3133         4469  
710 3133         6148 while ( $cursor >= 0 ) {
711 4202         6064 my $token = $self->{tokens}->[$cursor--];
712 4202 100       11845 return $token if $token->significant;
713             }
714 404         719 return;
715             }
716              
717             # Get an array ref of previous significant tokens.
718             # Like _last_significant_token except it gets more than just one token
719             # Returns array with 0 to x entries
720             sub _previous_significant_tokens {
721 165003     165003   163302 my $self = shift;
722 165003   50     230646 my $count = shift || 1;
723 165003         151495 my $cursor = $#{ $self->{tokens} };
  165003         210096  
724              
725 165003         174586 my @tokens;
726 165003         236998 while ( $cursor >= 0 ) {
727 266603         296814 my $token = $self->{tokens}->[$cursor--];
728 266603 100       499118 next if not $token->significant;
729 172047         194263 push @tokens, $token;
730 172047 100       277314 last if @tokens >= $count;
731             }
732              
733 165003         285878 return @tokens;
734             }
735              
736             my %OBVIOUS_CLASS = (
737             'PPI::Token::Symbol' => 'operator',
738             'PPI::Token::Magic' => 'operator',
739             'PPI::Token::Number' => 'operator',
740             'PPI::Token::ArrayIndex' => 'operator',
741             'PPI::Token::Quote::Double' => 'operator',
742             'PPI::Token::Quote::Interpolate' => 'operator',
743             'PPI::Token::Quote::Literal' => 'operator',
744             'PPI::Token::Quote::Single' => 'operator',
745             'PPI::Token::QuoteLike::Backtick' => 'operator',
746             'PPI::Token::QuoteLike::Command' => 'operator',
747             'PPI::Token::QuoteLike::Readline' => 'operator',
748             'PPI::Token::QuoteLike::Regexp' => 'operator',
749             'PPI::Token::QuoteLike::Words' => 'operator',
750             );
751              
752             my %OBVIOUS_CONTENT = (
753             '(' => 'operand',
754             '{' => 'operand',
755             '[' => 'operand',
756             ';' => 'operand',
757             '}' => 'operator',
758             );
759              
760              
761             my %USUALLY_FORCES = map { $_ => 1 } qw( sub package use no );
762              
763             # Try to determine operator/operand context, if possible.
764             # Returns "operator", "operand", or "" if unknown.
765             sub _opcontext {
766 10949     10949   12219 my $self = shift;
767 10949         19514 my @tokens = $self->_previous_significant_tokens(1);
768 10949         13152 my $p0 = $tokens[0];
769 10949 100       27496 return '' if not $p0;
770 10827         15712 my $c0 = ref $p0;
771              
772             # Map the obvious cases
773 10827 100       31716 return $OBVIOUS_CLASS{$c0} if defined $OBVIOUS_CLASS{$c0};
774 2265 100       4884 return $OBVIOUS_CONTENT{$p0} if defined $OBVIOUS_CONTENT{$p0};
775              
776             # Most of the time after an operator, we are an operand
777 1692 100       7237 return 'operand' if $p0->isa('PPI::Token::Operator');
778              
779             # If there's NOTHING, it's operand
780 1488 50       3063 return 'operand' if $p0->content eq '';
781              
782             # Otherwise, we don't know
783 1488         3321 return ''
784             }
785              
786             # Assuming we are currently parsing the word 'x', return true
787             # if previous tokens imply the x is an operator, false otherwise.
788             sub _current_x_is_operator {
789 1124     1124   1584 my ( $self ) = @_;
790 1124 100       1475 return if !@{$self->{tokens}};
  1124         2690  
791              
792 925         1999 my ($prev, $prevprev) = $self->_previous_significant_tokens(2);
793 925 100       3008 return if !$prev;
794              
795 924 100       3732 return !$self->__current_token_is_forced_word if $prev->isa('PPI::Token::Word');
796              
797             return (!$prev->isa('PPI::Token::Operator') || $X_CAN_FOLLOW_OPERATOR{$prev})
798 764   100     5150 && (!$prev->isa('PPI::Token::Structure') || $X_CAN_FOLLOW_STRUCTURE{$prev})
799             && !$prev->isa('PPI::Token::Label')
800             ;
801             }
802              
803              
804             # Assuming we are at the end of parsing the current token that could be a word,
805             # a wordlike operator, or a version string, try to determine whether context
806             # before or after it forces it to be a bareword. This method is only useful
807             # during tokenization.
808             sub __current_token_is_forced_word {
809 33298     33298   49578 my ( $t, $word ) = @_;
810              
811             # Check if forced by preceding tokens.
812              
813 33298         51487 my ( $prev, $prevprev ) = $t->_previous_significant_tokens(2);
814 33298 100       69443 if ( !$prev ) {
815 8858         19215 pos $t->{line} = $t->{line_cursor};
816             }
817             else {
818 24440         38056 my $content = $prev->{content};
819              
820             # We are forced if we are a method name.
821             # '->' will always be an operator, so we don't check its type.
822 24440 100       41442 return 1 if $content eq '->';
823              
824             # If we are contained in a pair of curly braces, we are probably a
825             # forced bareword hash key. '{' is never a word or operator, so we
826             # don't check its type.
827 24311         48514 pos $t->{line} = $t->{line_cursor};
828 24311 100 100     63066 return 1 if $content eq '{' and $t->{line} =~ /\G\s*\}/gc;
829              
830             # sub, package, use, and no all indicate that what immediately follows
831             # is a word not an operator or (in the case of sub and package) a
832             # version string. However, we don't want to be fooled by 'package
833             # package v10' or 'use no v10'. We're a forced package unless we're
834             # preceded by 'package sub', in which case we're a version string.
835             # We also have to make sure that the sub/package/etc doing the forcing
836             # is not a method call.
837 24082 100       50176 if( $USUALLY_FORCES{$content}) {
838 5631 100 66     19169 return if defined $word and $word =~ /^v[0-9]+$/ and ( $content eq "use" or $content eq "no" );
      100        
      100        
839 5621 100       22797 return 1 if not $prevprev;
840 236 100 100     492 return 1 if not $USUALLY_FORCES{$prevprev->content} and $prevprev->content ne '->';
841 6         28 return;
842             }
843             }
844             # pos on $t->{line} is guaranteed to be set at this point.
845              
846             # Check if forced by following tokens.
847              
848             # If the word is followed by => it is probably a word, not a regex.
849 27309 100       69634 return 1 if $t->{line} =~ /\G\s*=>/gc;
850              
851             # Otherwise we probably aren't forced
852 26519         133256 return '';
853             }
854              
855             sub _features {
856 37     37   60 my ( $self, $arg ) = @_;
857 37 50 0     93 return $arg ? $self->{feature_set} = $arg : $self->{feature_set} || {};
858             }
859              
860 12308     12308   38735 sub _current_token_has_signatures_active { shift->{feature_set}{signatures} }
861              
862             1;
863              
864             =pod
865              
866             =head1 NOTES
867              
868             =head2 How the Tokenizer Works
869              
870             Understanding the Tokenizer is not for the faint-hearted. It is by far
871             the most complex and twisty piece of perl I've ever written that is actually
872             still built properly and isn't a terrible spaghetti-like mess. In fact, you
873             probably want to skip this section.
874              
875             But if you really want to understand, well then here goes.
876              
877             =head2 Source Input and Clean Up
878              
879             The Tokenizer starts by taking source in a variety of forms, sucking it
880             all in and merging it into one big string, and doing our own internal line
881             split, using a "universal line separator" which allows the Tokenizer to
882             take source for any platform (and even supports a few known types of
883             broken newlines caused by mixed mac/pc/*nix editor screw ups).
884              
885             The resulting array of lines is used to feed the tokenizer, and is also
886             accessed directly by the heredoc-logic to do the line-oriented part of
887             here-doc support.
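The "universal line separator" split described above can be sketched like this; the regex is an illustrative reconstruction in the spirit of the real code, not necessarily PPI's exact implementation.

```perl
use strict;
use warnings;

# Split on \012 (Unix), \015\012 (DOS), \015 (classic Mac), and the
# broken \015\015\012 sequence produced by some mixed-editor files.
my $source = "one\012two\015\012three\015four\015\015\012five";
my @lines  = split /(?:\015{1,2}\012|\015|\012)/, $source;
# @lines now holds: "one", "two", "three", "four", "five"
```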
888              
889             =head2 Doing Things the Old Fashioned Way
890              
891             Due to the complexity of perl, and after 2 previously aborted parser
892             attempts, in the end the tokenizer was fashioned around a line-buffered
893             character-by-character method.
894              
895             That is, the Tokenizer pulls and holds a line at a time into a line buffer,
896             and then iterates a cursor along it. At each cursor position, a method is
897             called in whatever token class we are currently in, which will examine the
898             character at the current position, and handle it.
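That loop can be illustrated with a deliberately tiny, invented two-class tokenizer (plain word-vs-whitespace splitting; PPI's real class dispatch is far richer):

```perl
use strict;
use warnings;

# Toy illustration (NOT PPI's real code) of the cursor-driven loop:
# hold one line in a buffer, walk a cursor along it, and let the
# current "class" decide what happens at each position.
sub toy_tokenize {
    my ($line) = @_;
    my ( $class, $token, @tokens ) = ( 'space', '' );
    for my $cursor ( 0 .. length($line) - 1 ) {
        my $char = substr $line, $cursor, 1;
        my $next = $char =~ /\w/ ? 'word' : 'space';
        if ( $next ne $class and length $token ) {
            push @tokens, $token;    # finalise the current token
            $token = '';
        }
        ( $class, $token ) = ( $next, $token . $char );
    }
    push @tokens, $token if length $token;
    return @tokens;                  # e.g. 'my $x' => ('my', ' $', 'x')
}
```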
899              
900             As the handler methods in the various token classes are called, they
901             build up an output token array for the source code.
902              
903             Various parts of the Tokenizer use look-ahead, arbitrary-distance
904             look-behind (although currently the maximum is three significant tokens),
905             or both, and various other heuristic guesses.
906              
907             I've been told it is officially termed a I<"backtracking parser
908             with infinite lookaheads">.
909              
910             =head2 State Variables
911              
912             Aside from the current line and the character cursor, the Tokenizer
913             maintains a number of different state variables.
914              
915             =over
916              
917             =item Current Class
918              
919             The Tokenizer maintains the current token class at all times. Much of the
920             time it is just going to be the "Whitespace" class, which is what the base of
921             a document is. As the tokenizer executes the various character handlers,
922             the class changes a lot as it moves along. In fact, in some instances,
923             the character handler may not handle the character directly itself, but
924             rather change the "current class" and then hand off to the character
925             handler for the new class.
926              
927             Because of this, and some other things I'll deal with later, the number of
928             times the character handlers are called does not in fact have a direct
929             relationship to the number of actual characters in the document.
930              
931             =item Current Zone
932              
933             Rather than create a class stack to allow for infinitely nested layers of
934             classes, the Tokenizer recognises just a single layer.
935              
936             To put it a different way, in various parts of the file, the Tokenizer will
937             recognise different "base" or "substrate" classes. When a Token such as a
938             comment or a number is finalised by the tokenizer, it "falls back" to the
939             base state.
940              
941             This allows proper tokenization of special areas such as __DATA__
942             and __END__ blocks, which also contain things like comments and POD,
943             without allowing the creation of any significant Tokens inside these areas.
944              
945             For the main part of a document we use L<PPI::Token::Whitespace> for this,
946             with the idea being that code is "floating in a sea of whitespace".
947              
948             =item Current Token
949              
950             The final main state variable is the "current token". This is the Token
951             that is currently being built by the Tokenizer. For certain types, it
952             can be manipulated and morphed and change class quite a bit while being
953             assembled, as the Tokenizer's understanding of the token content changes.
954              
955             When the Tokenizer is confident that it has seen the end of the Token, it
956             will be "finalized", which adds it to the output token array and resets
957             the current class to that of the zone that we are currently in.
958              
959             I should also note at this point that the "current token" variable is
960             optional. The Tokenizer is capable of knowing what class it is currently
961             set to, without actually having accumulated any characters in the Token.
962              
963             =back
964              
965             =head2 Making It Faster
966              
967             As I'm sure you can imagine, calling several different methods for each
968             character and running regexes and other complex heuristics made the first
969             fully working version of the tokenizer extremely slow.
970              
971             During testing, I created a metric to measure parsing speed called
972             LPGC, or "lines per gigacycle". A gigacycle is simply a billion CPU
973             cycles on a typical single-core CPU, and so a Tokenizer running at
974             "1000 lines per gigacycle" should generate around 1200 lines of tokenized
975             code per second when running on a 1200 MHz processor.
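A quick sanity check of the arithmetic behind the metric:

```perl
use strict;
use warnings;

# lines/second = LPGC x clock speed in gigacycles/second
my $lpgc = 1000;    # lines per gigacycle
my $ghz  = 1.2;     # a 1200 MHz CPU runs 1.2 gigacycles per second
my $lines_per_second = $lpgc * $ghz;    # 1200
```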
976              
977             The first working version of the tokenizer ran at only 350 LPGC, so to
978             tokenize a typical large module such as L<ExtUtils::MakeMaker> took
979             10-15 seconds. This sluggishness made it impractical for many uses.
980              
981             So in the current parser, there are multiple layers of optimisation
982             very carefully built into the basic design. This has brought the tokenizer
983             up to a more reasonable 1000 LPGC, at the expense of making the code
984             quite a bit twistier.
985              
986             =head2 Making It Faster - Whole Line Classification
987              
988             The first step in the optimisation process was to add a new handler to
989             enable several of the more basic classes (whitespace, comments) to be
990             parsed a line at a time. At the start of each line, a
991             special optional handler (only supported by a few classes) is called to
992             check and see if the entire line can be parsed in one go.
993              
994             This is used mainly to handle things like POD, comments, empty lines,
995             and a few other minor special cases.
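A hedged sketch of such a line-level shortcut (the helper name and plain-hash tokens are invented for illustration; PPI's real hook and token classes differ):

```perl
use strict;
use warnings;

# If the whole line is trivially classifiable (a comment or an empty
# line), emit one token for it and report success; otherwise return
# false so the caller falls back to char-by-char parsing.
sub try_whole_line {
    my ( $t, $line ) = @_;
    if ( $line =~ /^\s*#/ ) {
        push @{ $t->{tokens} }, { class => 'Comment', content => $line };
        return 1;
    }
    if ( $line =~ /^\s*$/ ) {
        push @{ $t->{tokens} }, { class => 'Whitespace', content => $line };
        return 1;
    }
    return 0;
}
```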
996              
997             =head2 Making It Faster - Inlining
998              
999             The second stage of the optimisation involved inlining a small
1000             number of critical methods that were repeated an extremely high number
1001             of times. Profiling suggested that there were about 1,000,000 individual
1002             method calls per gigacycle, and by cutting these by two thirds a significant
1003             speed improvement was gained, in the order of about 50%.
1004              
1005             You may notice that many methods in the C<PPI::Tokenizer> code look
1006             very nested and long hand. This is primarily due to this inlining.
1007              
1008             At around this time, some statistics code that existed in the early
1009             versions of the parser was also removed, as it was determined that
1010             it was consuming around 15% of the CPU for the entire parser, while
1011             making the core more complicated.
1012              
1013             A judgment call was made that with the difficulties likely to be
1014             encountered with future planned enhancements, and given the relatively
1015             high cost involved, the statistics features would be removed from the
1016             Tokenizer.
1017              
1018             =head2 Making It Faster - Quote Engine
1019              
1020             Once inlining had reached diminishing returns, it became obvious from
1021             the profiling results that a huge amount of time was being spent
1022             stepping a char at a time though long, simple and "syntactically boring"
1023             code such as comments and strings.
1024              
1025             The existing regex engine was expanded to also encompass quotes and
1026             other quote-like things, and a special abstract base class was added
1027             that provided a number of specialised parsing methods that would "scan
1028             ahead", looking out ahead to find the end of a string, and updating
1029             the cursor to leave it in a valid position for the next call.
1030              
1031             This is also the point at which the number of character handler calls began
1032             to greatly differ from the number of characters. But it has been done
1033             in a way that allows the parser to retain the power of the original
1034             version at the critical points, while skipping through the "boring bits"
1035             as needed for additional speed.
1036              
1037             The addition of this feature allowed the tokenizer to exceed 1000 LPGC
1038             for the first time.
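The "scan ahead" idea can be sketched for single-quoted strings like so (an invented stand-alone helper, not the Quote Engine's actual interface): instead of stepping one character at a time, a single regex jumps straight to the closing delimiter while honouring backslash escapes.

```perl
use strict;
use warnings;

# Given the line buffer and a cursor sitting just after the opening
# quote, consume up to and including the closing quote in one match.
sub scan_single_quoted {
    my ( $line, $cursor ) = @_;
    pos($line) = $cursor;
    if ( $line =~ /\G((?:\\.|[^'\\])*)'/gc ) {
        return ( $1, pos($line) );    # string body, cursor after quote
    }
    return;    # no closing quote here; string continues on a later line
}
```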
1039              
1040             =head2 Making It Faster - The "Complete" Mechanism
1041              
1042             As it became evident that great speed increases were available by using
1043             this "skipping ahead" mechanism, a new handler method was added that
1044             explicitly handles the parsing of an entire token, where the structure
1045             of the token is relatively simple. Tokens such as symbols fit this case,
1046             as once we are past the initial sigil and word char, we know that we
1047             can skip ahead and "complete" the rest of the token much more easily.
1048              
1049             A number of these have been added for most or possibly all of the common
1050             cases, with most of these "complete" handlers implemented using regular
1051             expressions.
1052              
1053             In fact, so many have been added that at this point, you could arguably
1054             reclassify the tokenizer as a "hybrid regex, char-by-char heuristic
1055             tokenizer". More tokens are now consumed in "complete" methods in a
1056             typical program than are handled by the normal char-by-char methods.
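A "complete" handler for the symbol case mentioned above might be sketched like this (a simplified stand-in; the real handlers live in the individual token classes and cover more sigil forms):

```perl
use strict;
use warnings;

# Once we know the cursor sits on the sigil of a symbol like $foo,
# one regex can consume the whole token instead of stepping char
# by char through it.
sub complete_symbol {
    my ( $line, $cursor ) = @_;
    pos($line) = $cursor;
    if ( $line =~ /\G([\$\@\%][\w:]+)/gc ) {
        return ( $1, pos($line) );    # token content, new cursor
    }
    return;
}
```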
1057              
1058             Many of these complete-handlers were implemented during the writing
1059             of the Lexer, and this has allowed the full parser to maintain around
1060             1000 LPGC despite the increasing weight of the Lexer.
1061              
1062             =head2 Making It Faster - Porting To C (In Progress)
1063              
1064             While it would be extraordinarily difficult to port all of the Tokenizer
1065             to C, work has started on a L<PPI::XS> "accelerator" package which acts as
1066             a separate and automatically-detected add-on to the main PPI package.
1067              
1068             L<PPI::XS> implements faster versions of a variety of functions scattered
1069             over the entire PPI codebase, from the Tokenizer Core, Quote Engine, and
1070             various other places, and implements them identically in XS/C.
1071              
1072             In particular, the skip-ahead methods from the Quote Engine would appear
1073             to be extremely amenable to being done in C, and a number of other
1074             functions could be cherry-picked one at a time and implemented in C.
1075              
1076             Each method is heavily tested to ensure that the functionality is
1077             identical, and a versioning mechanism is included to ensure that if a
1078             function gets out of sync, L<PPI::XS> will degrade gracefully and just
1079             not replace that single method.
1080              
1081             =head1 TO DO
1082              
1083             - Add an option to reset or seek the token stream...
1084              
1085             - Implement more Tokenizer functions in L<PPI::XS>
1086              
1087             =head1 SUPPORT
1088              
1089             See the L<support section|PPI/SUPPORT> in the main module.
1090              
1091             =head1 AUTHOR
1092              
1093             Adam Kennedy E<lt>adamk@cpan.orgE<gt>
1094              
1095             =head1 COPYRIGHT
1096              
1097             Copyright 2001 - 2011 Adam Kennedy.
1098              
1099             This program is free software; you can redistribute
1100             it and/or modify it under the same terms as Perl itself.
1101              
1102             The full text of the license can be found in the
1103             LICENSE file included with this module.
1104              
1105             =cut