File Coverage

blib/lib/App/Rsnapshot/XML/Tiny.pm

Criterion	Covered	Total	%
statement	79	79	100.0
branch	63	64	98.4
condition	19	21	90.4
subroutine	4	4	100.0
pod	1	1	100.0
total	166	169	98.2

line	stmt	bran	cond	sub	pod	time	code
1							package App::Rsnapshot::XML::Tiny;
2
3	13			13		34091	use strict;
	13					24
	13					535
4
5							require Exporter;
6
7	13			13		66	use vars qw($VERSION @EXPORT_OK @ISA);
	13					20
	13					21856
8
9							$VERSION = '1.12';
10							@EXPORT_OK = qw(parsefile);
11							@ISA = qw(Exporter);
12
13							# localising prevents the warningness leaking out of this module
14							local $^W = 1; # can't use warnings as that's a 5.6-ism
15
16							=head1 NAME
17
18							App::Rsnapshot::XML::Tiny - simple lightweight parser for a subset of XML
19
20							=head1 DESCRIPTION
21
22							App::Rsnapshot::XML::Tiny is a simple lightweight parser for a subset of XML
23
24							=head1 SYNOPSIS
25
26							use App::Rsnapshot::XML::Tiny qw(parsefile);
27							open($xmlfile, 'something.xml);
28							my $document = parsefile($xmlfile);
29
30							This will leave C<$document> looking something like this:
31
32							[
33							{
34							type => 'e',
35							attrib => { ... },
36							name => 'rootelementname',
37							content => [
38							...
39							more elements and text content
40							...
41							]
42							}
43							]
44
45							=head1 FUNCTIONS
46
47							The C function is optionally exported. By default nothing is
48							exported. There is no objecty interface.
49
50							=head2 parsefile
51
52							This takes at least one parameter, optionally more. The compulsory
53							parameter may be:
54
55							=over 4
56
57							=item a filename
58
59							in which case the file is read and parsed;
60
61							=item a string of XML
62
63							in which case it is read and parsed. How do we tell if we've got a string
64							or a filename? If it begins with C<_TINY_XML_STRING_> then it's
65							a string. That prefix is, of course, ignored when it comes to actually
66							parsing the data. This is intended primarily for use by wrappers which
67							want to retain compatibility with Ye Aunciente Perl. Normal users who want
68							to pass in a string would be expected to use L.
69
70							=item a glob-ref or IO::Handle object
71
72							in which case again, the file is read and parsed.
73
74							=back
75
76							The former case is for compatibility with older perls, but makes no
77							attempt to properly deal with character sets. If you open a file in a
78							character-set-friendly way and then pass in a handle / object, then the
79							method should Do The Right Thing as it only ever works with character
80							data.
81
82							The remaining parameters are a list of key/value pairs to make a hash of
83							options:
84
85							=over 4
86
87							=item fatal_declarations
88
89							If set to true, E!ENTITY...E and E!DOCTYPE...E declarations
90							in the document
91							are fatal errors - otherwise they are ignored.
92
93							=item no_entity_parsing
94
95							If set to true, the five built-in entities are passed through unparsed.
96							Note that special characters in CDATA and attributes may have been turned
97							into C<&>, C<<> and friends.
98
99							=item strict_entity_parsing
100
101							If set to true, any unrecognised entities (ie, those outside the core five
102							plus numeric entities) cause a fatal error. If you set both this and
103							C (but why would you do that?) then the latter takes
104							precedence.
105
106							Obviously, if you want to maximise compliance with the XML spec, you should
107							turn on fatal_declarations and strict_entity_parsing.
108
109							=back
110
111							The function returns a structure describing the document. This contains
112							one or more nodes, each being either an 'element' node or a 'text' mode.
113							The structure is an arrayref which contains a single 'element' node which
114							represents the document entity. The arrayref is redundant, but exists for
115							compatibility with L.
116
117							Element nodes are hashrefs with the following keys:
118
119							=over 4
120
121							=item type
122
123							The node's type, represented by the letter 'e'.
124
125							=item name
126
127							The element's name.
128
129							=item attrib
130
131							A hashref containing the element's attributes, as key/value pairs where
132							the key is the attribute name.
133
134							=item content
135
136							An arrayref of the element's contents. The array's contents is a list of
137							nodes, in the order they were encountered in the document.
138
139							=back
140
141							Text nodes are hashrefs with the following keys:
142
143							=over 4
144
145							=item type
146
147							The node's type, represented by the letter 't'.
148
149							=item content
150
151							A scalar piece of text.
152
153							=back
154
155							=cut
156
157							my %regexps = (
158							name => '[:a-z][\\w:\\.-]*'
159							);
160
161							my $strict_entity_parsing; # mmm, global. don't worry, parsefile sets it
162							# explicitly every time
163							sub parsefile {
164	212			212	1	39290	my($arg, %params) = @_;
165	212					701	my($file, $elem) = ('', { content => [] });
166	212					525	local $/; # sluuuuurp
167
168	212					281	$strict_entity_parsing = $params{strict_entity_parsing};
169
170	212	100				604	if(ref($arg) eq '') { # we were passed a filename or a string
171	210	100				408	if($arg =~ /^_TINY_XML_STRING_/) { # it's a string
172	22					227	$file = substr($arg, 17);
173							} else {
174	188					332	local *FH;
175	188	100				8015	open(FH, $arg) \|\| die(__PACKAGE__."::parsefile: Can't open $arg\n");
176	187					3729	$file = ;
177	187					1795	close(FH);
178							}
179	2					50	} else { $file = <$arg>; }
180	211	100	66			1555	die("No elements\n") if (!defined($file) \|\| $file =~ /^\s*$/);
181
182							# illegal low-ASCII chars
183	209	100				649	die("Not well-formed\n") if($file =~ /[\x00-\x08\x0b\x0c\x0e-\x1f]/);
184
185							# turn CDATA into PCDATA
186	204					324	$file =~ s{}{
187	8					22	$_ = $1.chr(0); # this makes sure that empty CDATAs become
188	8					14	s/([&<>])/ # the empty string and aren't just thrown away.
189	4	100				19	$1 eq '&' ? '&' :
		100
190							$1 eq '<' ? '<' :
191							'>'
192							/eg;
193	8					23	$_;
194							}egs;
195
196	217	100				1074	die("Not well-formed\n") if(
197							$file =~ /]]>/ \|\| # ]]> not delimiting CDATA
198							$file =~ //s \|\| # ---> can't end a comment
199	204	100	100			9368	grep { $_ && /--/ } ($file =~ /^\s+\|\|\s+$/gs) # -- in comm
			100
200							);
201
202							# strip leading/trailing whitespace and comments (which don't nest - phew!)
203	194					7392	$file =~ s/^\s+\|\|\s+$//gs;
204
205							# turn quoted > in attribs into >
206							# double- and single-quoted attrib values get done seperately
207	194					12232	while($file =~ s/($regexps{name}\s=\s"[^"])>([^"]")/$1>$2/gsi) {}
208	194					12934	while($file =~ s/($regexps{name}\s=\s'[^'])>([^']')/$1>$2/gsi) {}
209
210	194	100	100			1449	if($params{fatal_declarations} && $file =~ /
211	111					1054	die("I can't handle this document\n");
212							}
213
214							# ignore empty tokens/whitespace tokens
215	83	100				1197	foreach my $token (grep { length && $_ !~ /^\s+$/ }
	1290					11686
216							split(/(<[^>]+>)/, $file)) {
217	749	100	100			1318440	if(
		100
		100
		100
218							$token =~ /<\?$regexps{name}.*?\?>/is \|\| # PI
219							$token =~ /^
220							) {
221	6					18	next;
222							} elsif($token =~ m!^!i) { # close tag
223	241	100				863	die("Not well-formed\n\tat $token\n") if($elem->{name} ne $1);
224	238					633	$elem = delete $elem->{parent};
225							} elsif($token =~ /^<$regexps{name}(\s[^>])(\s*\/)?>/is) { # open tag
226	351					2940	my($tagname, $attribs_raw) = ($token =~ m!<(\S)(.?)(\s*/)?>!s);
227							# first make attribs into a list so we can spot duplicate keys
228	351					4389	my $attrib = [
229							# do double- and single- quoted attribs seperately
230							$attribs_raw =~ /\s($regexps{name})\s=\s"([^"]*?)"/gi,
231							$attribs_raw =~ /\s($regexps{name})\s=\s'([^']*?)'/gi
232							];
233	351	100				501	if(@{$attrib} == 2 * keys %{{@{$attrib}}}) {
	351					559
	351					378
	351					1373
234	350					404	$attrib = { @{$attrib} }
	350					718
235	1					12	} else { die("Not well-formed - duplicate attribute\n"); }
236
237							# now trash any attribs that we did manage to parse and see
238							# if there's anything left
239	350					3491	$attribs_raw =~ s/\s($regexps{name})\s=\s"([^"]*?)"//gi;
240	350					2130	$attribs_raw =~ s/\s($regexps{name})\s=\s'([^']*?)'//gi;
241	350	100	100			1277	die("Not well-formed\n$attribs_raw") if($attribs_raw =~ /\S/ \|\| grep { /
	154					625
	340					1130
242
243	338	100				731	unless($params{no_entity_parsing}) {
244	337					369	foreach my $key (keys %{$attrib}) {
	337					1679
245	152					342	$attrib->{$key} = _fixentities($attrib->{$key})
246							}
247							}
248							$elem = {
249	334					1639	content => [],
250							name => $tagname,
251							type => 'e',
252							attrib => $attrib,
253							parent => $elem
254							};
255	334					400	push @{$elem->{parent}->{content}}, $elem;
	334					736
256							# now handle self-closing tags
257	334	100				1670	if($token =~ /\s*\/>$/) {
258	70					160	$elem->{name} =~ s/\/$//;
259	70					509	$elem = delete $elem->{parent};
260							}
261							} elsif($token =~ /^
262	13					145	die("I can't handle this document\n\tat $token\n");
263							} else { # ordinary content
264	138					204	$token =~ s/\x00//g; # get rid of our CDATA marker
265	138	100				318	unless($params{no_entity_parsing}) { $token = _fixentities($token); }
	137					271
266	130					209	push @{$elem->{content}}, { content => $token, type => 't' };
	130					696
267							}
268							}
269	42	50				286	die("Not well-formed\n") if(exists($elem->{parent}));
270	42	100				92	die("Junk after end of document\n") if($#{$elem->{content}} > 0);
	42					235
271	30					218	die("No elements\n") if(
272	30	100	66			61	$#{$elem->{content}} == -1 \|\| $elem->{content}->[0]->{type} ne 'e'
273							);
274	29					6139	return $elem->{content};
275							}
276
277							sub _fixentities {
278	289			289		412	my $thingy = shift;
279
280	289	100				507	my $junk = ($strict_entity_parsing) ? '\|.*' : '';
281	289					1506	$thingy =~ s/&((#(\d+\|x[a-fA-F0-9]+);)\|lt;\|gt;\|quot;\|apos;\|amp;$junk)/
282	219	100				1531	$3 ? (
		100
		100
		100
		100
		100
		100
283							substr($3, 0, 1) eq 'x' ? # using a =~ match here clobbers $3
284							chr(hex(substr($3, 1))) : # so don't "fix" it!
285							chr($3)
286							) :
287							$1 eq 'lt;' ? '<' :
288							$1 eq 'gt;' ? '>' :
289							$1 eq 'apos;' ? "'" :
290							$1 eq 'quot;' ? '"' :
291							$1 eq 'amp;' ? '&' :
292							die("Illegal ampersand or entity\n\tat $1\n")
293							/ge;
294	277					929	$thingy;
295							}
296
297							=head1 COMPATIBILITY
298
299							=head2 With other modules
300
301							The C function is so named because it is intended to work in a
302							similar fashion to L with the L style.
303							Instead of saying this:
304
305							use XML::Parser;
306							use XML::Parser::EasyTree;
307							$XML::Parser::EasyTree::Noempty=1;
308							my $p=new XML::Parser(Style=>'EasyTree');
309							my $tree=$p->parsefile('something.xml');
310
311							you would say:
312
313							use App::Rsnapshot::XML::Tiny;
314							my $tree = App::Rsnapshot::XML::Tiny::parsefile('something.xml');
315
316							Any valid document that can be parsed like that using App::Rsnapshot::XML::Tiny should
317							produce identical results if you use the above example of how to use
318							L.
319
320							If you find a document where that is not the case, please report it as
321							a bug.
322
323							=head2 With perl 5.004
324
325							The module is intended to be fully compatible with every version of perl
326							back to and including 5.004, and may be compatible with even older
327							versions of perl 5.
328
329							The lack of Unicode and friends in older perls means that App::Rsnapshot::XML::Tiny
330							does nothing with character sets. If you have a document with a funny
331							character set, then you will need to open the file in an appropriate
332							mode using a character-set-friendly perl and pass the resulting file
333							handle to the module.
334
335							=head2 The subset of XML that we understand
336
337							=over 4
338
339							=item Element tags and attributes
340
341							Including "self-closing" tags like Epie type = 'steak n kidney' /E;
342
343							=item Comments
344
345							Which are ignored;
346
347							=item The five "core" entities
348
349							ie C<&>, C<<>, C<>>, C<'> and C<">;
350
351							=item Numeric entities
352
353							eg C<A> and C<A>;
354
355							=item CDATA
356
357							This is simply turned into PCDATA before parsing. Note how this may interact
358							with the various entity-handling options;
359
360							=back
361
362							The following parts of the XML standard are handled incorrectly or not at
363							all - this is not an exhaustive list:
364
365							=over 4
366
367							=item Namespaces
368
369							While documents that use namespaces will be parsed just fine, there's no
370							special treatment of them. Their names are preserved in element and
371							attribute names like 'rdf:RDF'.
372
373							=item DTDs and Schemas
374
375							This is not a validating parser. declarations are ignored
376							if you've not made them fatal.
377
378							=item Entities and references
379
380							declarations are ignored if you've not made them fatal.
381							Unrecognised entities are ignored by default, as are naked & characters.
382							This means that if entity parsing is enabled you won't be able to tell
383							the difference between C< > and C< >. If your
384							document might use any non-core entities then please consider using
385							the C option, and then use something like
386							L.
387
388							=item Processing instructions
389
390							These are ignored.
391
392							=item Whitespace
393
394							We do not guarantee to correctly handle leading and trailing whitespace.
395
396							=item Character sets
397
398							This is not practical with older versions of perl
399
400							=back
401
402							=head1 PHILOSOPHY and JUSTIFICATION
403
404							While feedback from real users about this module has been uniformly
405							positive and helpful, some people seem to take issue with this module
406							because it doesn't implement every last jot and tittle of the XML
407							standard and merely implements a useful subset. A very useful subset,
408							as it happens, which can cope with common light-weight XML-ish tasks
409							such as parsing the results of queries to the Amazon Web Services.
410							Many, perhaps most, users of XML do not in fact need a full implementation
411							of the standard, and are understandably reluctant to install large complex
412							pieces of software which have many dependencies. In fact, when they
413							realise what installing and using a full implementation entails, they
414							quite often don't want it. Another class of users, people
415							distributing applications, often can not rely on users being able to
416							install modules from the CPAN, or even having tools like make or a shell
417							available. App::Rsnapshot::XML::Tiny exists for those people.
418
419							=head1 BUGS and FEEDBACK
420
421							I welcome feedback about my code, including constructive criticism.
422							Bug reports should be made using L or by email,
423							and should include the smallest possible chunk of code, along with
424							any necessary XML data, which demonstrates the bug. Ideally, this
425							will be in the form of a file which I can drop in to the module's
426							test suite. Please note that such files must work in perl 5.004.
427
428							If you are feeling particularly generous you can encourage me in my
429							open source endeavours by buying me something from my wishlist:
430							L
431
432							=head1 SEE ALSO
433
434							=over 4
435
436							=item For more capable XML parsers:
437
438							L
439
440							L
441
442							=item The requirements for a module to be Tiny
443
444							L
445
446							=back
447
448							=head1 AUTHOR
449
450							David Cantrell EFE
451
452							Thanks to David Romano for some compatibility patches for Ye Aunciente Perl;
453
454							to Matt Knecht and David Romano for prodding me to support attributes,
455							and to Matt for providing code to implement it in a quick n dirty minimal
456							kind of way;
457
458							to the people on L and elsewhere who have been kind
459							enough to point out ways it could be improved;
460
461							to Sergio Fanchiotti for pointing out a bug in handling self-closing tags,
462							and for reporting another bug that I introduced when fixing the first one;
463
464							to 'Corion' for finding a bug with localised filehandles and providing a fix.
465
466							=head1 COPYRIGHT and LICENCE
467
468							Copyright 2007 David Cantrell
469
470							This module is free-as-in-speech software, and may be used, distributed,
471							and modified under the same terms as Perl itself.
472
473							=head1 CONSPIRACY
474
475							This module is also free-as-in-mason software.
476
477							=cut
478
479							'zero';