File Coverage

blib/lib/App/Rsnapshot/XML/Tiny.pm
Criterion Covered Total %
statement 79 79 100.0
branch 63 64 98.4
condition 19 21 90.4
subroutine 4 4 100.0
pod 1 1 100.0
total 166 169 98.2


line stmt bran cond sub pod time code
1             package App::Rsnapshot::XML::Tiny;
2              
3 13     13   34091 use strict;
  13         24  
  13         535  
4              
5             require Exporter;
6              
7 13     13   66 use vars qw($VERSION @EXPORT_OK @ISA);
  13         20  
  13         21856  
8              
9             $VERSION = '1.12';
10             @EXPORT_OK = qw(parsefile);
11             @ISA = qw(Exporter);
12              
13             # localising prevents the warningness leaking out of this module
14             local $^W = 1; # can't use warnings as that's a 5.6-ism
15              
16             =head1 NAME
17              
18             App::Rsnapshot::XML::Tiny - simple lightweight parser for a subset of XML
19              
20             =head1 DESCRIPTION
21              
22             App::Rsnapshot::XML::Tiny is a simple lightweight parser for a subset of XML
23              
24             =head1 SYNOPSIS
25              
26             use App::Rsnapshot::XML::Tiny qw(parsefile);
27             open($xmlfile, 'something.xml);
28             my $document = parsefile($xmlfile);
29              
30             This will leave C<$document> looking something like this:
31              
32             [
33             {
34             type => 'e',
35             attrib => { ... },
36             name => 'rootelementname',
37             content => [
38             ...
39             more elements and text content
40             ...
41             ]
42             }
43             ]
44              
45             =head1 FUNCTIONS
46              
47             The C function is optionally exported. By default nothing is
48             exported. There is no objecty interface.
49              
50             =head2 parsefile
51              
52             This takes at least one parameter, optionally more. The compulsory
53             parameter may be:
54              
55             =over 4
56              
57             =item a filename
58              
59             in which case the file is read and parsed;
60              
61             =item a string of XML
62              
63             in which case it is read and parsed. How do we tell if we've got a string
64             or a filename? If it begins with C<_TINY_XML_STRING_> then it's
65             a string. That prefix is, of course, ignored when it comes to actually
66             parsing the data. This is intended primarily for use by wrappers which
67             want to retain compatibility with Ye Aunciente Perl. Normal users who want
68             to pass in a string would be expected to use L.
69              
70             =item a glob-ref or IO::Handle object
71              
72             in which case again, the file is read and parsed.
73              
74             =back
75              
76             The former case is for compatibility with older perls, but makes no
77             attempt to properly deal with character sets. If you open a file in a
78             character-set-friendly way and then pass in a handle / object, then the
79             method should Do The Right Thing as it only ever works with character
80             data.
81              
82             The remaining parameters are a list of key/value pairs to make a hash of
83             options:
84              
85             =over 4
86              
87             =item fatal_declarations
88              
89             If set to true, E!ENTITY...E and E!DOCTYPE...E declarations
90             in the document
91             are fatal errors - otherwise they are *ignored*.
92              
93             =item no_entity_parsing
94              
95             If set to true, the five built-in entities are passed through unparsed.
96             Note that special characters in CDATA and attributes may have been turned
97             into C<&>, C<<> and friends.
98              
99             =item strict_entity_parsing
100              
101             If set to true, any unrecognised entities (ie, those outside the core five
102             plus numeric entities) cause a fatal error. If you set both this and
103             C (but why would you do that?) then the latter takes
104             precedence.
105              
106             Obviously, if you want to maximise compliance with the XML spec, you should
107             turn on fatal_declarations and strict_entity_parsing.
108              
109             =back
110              
111             The function returns a structure describing the document. This contains
112             one or more nodes, each being either an 'element' node or a 'text' mode.
113             The structure is an arrayref which contains a single 'element' node which
114             represents the document entity. The arrayref is redundant, but exists for
115             compatibility with L.
116              
117             Element nodes are hashrefs with the following keys:
118              
119             =over 4
120              
121             =item type
122              
123             The node's type, represented by the letter 'e'.
124              
125             =item name
126              
127             The element's name.
128              
129             =item attrib
130              
131             A hashref containing the element's attributes, as key/value pairs where
132             the key is the attribute name.
133              
134             =item content
135              
136             An arrayref of the element's contents. The array's contents is a list of
137             nodes, in the order they were encountered in the document.
138              
139             =back
140              
141             Text nodes are hashrefs with the following keys:
142              
143             =over 4
144              
145             =item type
146              
147             The node's type, represented by the letter 't'.
148              
149             =item content
150              
151             A scalar piece of text.
152              
153             =back
154              
155             =cut
156              
157             my %regexps = (
158             name => '[:a-z][\\w:\\.-]*'
159             );
160              
161             my $strict_entity_parsing; # mmm, global. don't worry, parsefile sets it
162             # explicitly every time
163             sub parsefile {
164 212     212 1 39290 my($arg, %params) = @_;
165 212         701 my($file, $elem) = ('', { content => [] });
166 212         525 local $/; # sluuuuurp
167              
168 212         281 $strict_entity_parsing = $params{strict_entity_parsing};
169              
170 212 100       604 if(ref($arg) eq '') { # we were passed a filename or a string
171 210 100       408 if($arg =~ /^_TINY_XML_STRING_/) { # it's a string
172 22         227 $file = substr($arg, 17);
173             } else {
174 188         332 local *FH;
175 188 100       8015 open(FH, $arg) || die(__PACKAGE__."::parsefile: Can't open $arg\n");
176 187         3729 $file = ;
177 187         1795 close(FH);
178             }
179 2         50 } else { $file = <$arg>; }
180 211 100 66     1555 die("No elements\n") if (!defined($file) || $file =~ /^\s*$/);
181              
182             # illegal low-ASCII chars
183 209 100       649 die("Not well-formed\n") if($file =~ /[\x00-\x08\x0b\x0c\x0e-\x1f]/);
184              
185             # turn CDATA into PCDATA
186 204         324 $file =~ s{}{
187 8         22 $_ = $1.chr(0); # this makes sure that empty CDATAs become
188 8         14 s/([&<>])/ # the empty string and aren't just thrown away.
189 4 100       19 $1 eq '&' ? '&' :
    100          
190             $1 eq '<' ? '<' :
191             '>'
192             /eg;
193 8         23 $_;
194             }egs;
195              
196 217 100       1074 die("Not well-formed\n") if(
197             $file =~ /]]>/ || # ]]> not delimiting CDATA
198             $file =~ //s || # ---> can't end a comment
199 204 100 100     9368 grep { $_ && /--/ } ($file =~ /^\s+||\s+$/gs) # -- in comm
      100        
200             );
201              
202             # strip leading/trailing whitespace and comments (which don't nest - phew!)
203 194         7392 $file =~ s/^\s+||\s+$//gs;
204            
205             # turn quoted > in attribs into >
206             # double- and single-quoted attrib values get done seperately
207 194         12232 while($file =~ s/($regexps{name}\s*=\s*"[^"]*)>([^"]*")/$1>$2/gsi) {}
208 194         12934 while($file =~ s/($regexps{name}\s*=\s*'[^']*)>([^']*')/$1>$2/gsi) {}
209              
210 194 100 100     1449 if($params{fatal_declarations} && $file =~ /
211 111         1054 die("I can't handle this document\n");
212             }
213              
214             # ignore empty tokens/whitespace tokens
215 83 100       1197 foreach my $token (grep { length && $_ !~ /^\s+$/ }
  1290         11686  
216             split(/(<[^>]+>)/, $file)) {
217 749 100 100     1318440 if(
    100          
    100          
    100          
218             $token =~ /<\?$regexps{name}.*?\?>/is || # PI
219             $token =~ /^
220             ) {
221 6         18 next;
222             } elsif($token =~ m!^!i) { # close tag
223 241 100       863 die("Not well-formed\n\tat $token\n") if($elem->{name} ne $1);
224 238         633 $elem = delete $elem->{parent};
225             } elsif($token =~ /^<$regexps{name}(\s[^>]*)*(\s*\/)?>/is) { # open tag
226 351         2940 my($tagname, $attribs_raw) = ($token =~ m!<(\S*)(.*?)(\s*/)?>!s);
227             # first make attribs into a list so we can spot duplicate keys
228 351         4389 my $attrib = [
229             # do double- and single- quoted attribs seperately
230             $attribs_raw =~ /\s($regexps{name})\s*=\s*"([^"]*?)"/gi,
231             $attribs_raw =~ /\s($regexps{name})\s*=\s*'([^']*?)'/gi
232             ];
233 351 100       501 if(@{$attrib} == 2 * keys %{{@{$attrib}}}) {
  351         559  
  351         378  
  351         1373  
234 350         404 $attrib = { @{$attrib} }
  350         718  
235 1         12 } else { die("Not well-formed - duplicate attribute\n"); }
236            
237             # now trash any attribs that we *did* manage to parse and see
238             # if there's anything left
239 350         3491 $attribs_raw =~ s/\s($regexps{name})\s*=\s*"([^"]*?)"//gi;
240 350         2130 $attribs_raw =~ s/\s($regexps{name})\s*=\s*'([^']*?)'//gi;
241 350 100 100     1277 die("Not well-formed\n$attribs_raw") if($attribs_raw =~ /\S/ || grep { /
  154         625  
  340         1130  
242              
243 338 100       731 unless($params{no_entity_parsing}) {
244 337         369 foreach my $key (keys %{$attrib}) {
  337         1679  
245 152         342 $attrib->{$key} = _fixentities($attrib->{$key})
246             }
247             }
248             $elem = {
249 334         1639 content => [],
250             name => $tagname,
251             type => 'e',
252             attrib => $attrib,
253             parent => $elem
254             };
255 334         400 push @{$elem->{parent}->{content}}, $elem;
  334         736  
256             # now handle self-closing tags
257 334 100       1670 if($token =~ /\s*\/>$/) {
258 70         160 $elem->{name} =~ s/\/$//;
259 70         509 $elem = delete $elem->{parent};
260             }
261             } elsif($token =~ /^
262 13         145 die("I can't handle this document\n\tat $token\n");
263             } else { # ordinary content
264 138         204 $token =~ s/\x00//g; # get rid of our CDATA marker
265 138 100       318 unless($params{no_entity_parsing}) { $token = _fixentities($token); }
  137         271  
266 130         209 push @{$elem->{content}}, { content => $token, type => 't' };
  130         696  
267             }
268             }
269 42 50       286 die("Not well-formed\n") if(exists($elem->{parent}));
270 42 100       92 die("Junk after end of document\n") if($#{$elem->{content}} > 0);
  42         235  
271 30         218 die("No elements\n") if(
272 30 100 66     61 $#{$elem->{content}} == -1 || $elem->{content}->[0]->{type} ne 'e'
273             );
274 29         6139 return $elem->{content};
275             }
276              
277             sub _fixentities {
278 289     289   412 my $thingy = shift;
279              
280 289 100       507 my $junk = ($strict_entity_parsing) ? '|.*' : '';
281 289         1506 $thingy =~ s/&((#(\d+|x[a-fA-F0-9]+);)|lt;|gt;|quot;|apos;|amp;$junk)/
282 219 100       1531 $3 ? (
    100          
    100          
    100          
    100          
    100          
    100          
283             substr($3, 0, 1) eq 'x' ? # using a =~ match here clobbers $3
284             chr(hex(substr($3, 1))) : # so don't "fix" it!
285             chr($3)
286             ) :
287             $1 eq 'lt;' ? '<' :
288             $1 eq 'gt;' ? '>' :
289             $1 eq 'apos;' ? "'" :
290             $1 eq 'quot;' ? '"' :
291             $1 eq 'amp;' ? '&' :
292             die("Illegal ampersand or entity\n\tat $1\n")
293             /ge;
294 277         929 $thingy;
295             }
296              
297             =head1 COMPATIBILITY
298              
299             =head2 With other modules
300              
301             The C function is so named because it is intended to work in a
302             similar fashion to L with the L style.
303             Instead of saying this:
304              
305             use XML::Parser;
306             use XML::Parser::EasyTree;
307             $XML::Parser::EasyTree::Noempty=1;
308             my $p=new XML::Parser(Style=>'EasyTree');
309             my $tree=$p->parsefile('something.xml');
310              
311             you would say:
312              
313             use App::Rsnapshot::XML::Tiny;
314             my $tree = App::Rsnapshot::XML::Tiny::parsefile('something.xml');
315              
316             Any valid document that can be parsed like that using App::Rsnapshot::XML::Tiny should
317             produce identical results if you use the above example of how to use
318             L.
319              
320             If you find a document where that is not the case, please report it as
321             a bug.
322              
323             =head2 With perl 5.004
324              
325             The module is intended to be fully compatible with every version of perl
326             back to and including 5.004, and may be compatible with even older
327             versions of perl 5.
328              
329             The lack of Unicode and friends in older perls means that App::Rsnapshot::XML::Tiny
330             does nothing with character sets. If you have a document with a funny
331             character set, then you will need to open the file in an appropriate
332             mode using a character-set-friendly perl and pass the resulting file
333             handle to the module.
334              
335             =head2 The subset of XML that we understand
336              
337             =over 4
338              
339             =item Element tags and attributes
340              
341             Including "self-closing" tags like Epie type = 'steak n kidney' /E;
342              
343             =item Comments
344              
345             Which are ignored;
346              
347             =item The five "core" entities
348              
349             ie C<&>, C<<>, C<>>, C<'> and C<">;
350              
351             =item Numeric entities
352              
353             eg C<A> and C<A>;
354              
355             =item CDATA
356              
357             This is simply turned into PCDATA before parsing. Note how this may interact
358             with the various entity-handling options;
359              
360             =back
361              
362             The following parts of the XML standard are handled incorrectly or not at
363             all - this is not an exhaustive list:
364              
365             =over 4
366              
367             =item Namespaces
368              
369             While documents that use namespaces will be parsed just fine, there's no
370             special treatment of them. Their names are preserved in element and
371             attribute names like 'rdf:RDF'.
372              
373             =item DTDs and Schemas
374              
375             This is not a validating parser. declarations are ignored
376             if you've not made them fatal.
377              
378             =item Entities and references
379              
380             declarations are ignored if you've not made them fatal.
381             Unrecognised entities are ignored by default, as are naked & characters.
382             This means that if entity parsing is enabled you won't be able to tell
383             the difference between C<&nbsp;> and C< >. If your
384             document might use any non-core entities then please consider using
385             the C option, and then use something like
386             L.
387              
388             =item Processing instructions
389              
390             These are ignored.
391              
392             =item Whitespace
393              
394             We do not guarantee to correctly handle leading and trailing whitespace.
395              
396             =item Character sets
397              
398             This is not practical with older versions of perl
399              
400             =back
401              
402             =head1 PHILOSOPHY and JUSTIFICATION
403              
404             While feedback from real users about this module has been uniformly
405             positive and helpful, some people seem to take issue with this module
406             because it doesn't implement every last jot and tittle of the XML
407             standard and merely implements a useful subset. A very useful subset,
408             as it happens, which can cope with common light-weight XML-ish tasks
409             such as parsing the results of queries to the Amazon Web Services.
410             Many, perhaps most, users of XML do not in fact need a full implementation
411             of the standard, and are understandably reluctant to install large complex
412             pieces of software which have many dependencies. In fact, when they
413             realise what installing and using a full implementation entails, they
414             quite often don't *want* it. Another class of users, people
415             distributing applications, often can not rely on users being able to
416             install modules from the CPAN, or even having tools like make or a shell
417             available. App::Rsnapshot::XML::Tiny exists for those people.
418              
419             =head1 BUGS and FEEDBACK
420              
421             I welcome feedback about my code, including constructive criticism.
422             Bug reports should be made using L or by email,
423             and should include the smallest possible chunk of code, along with
424             any necessary XML data, which demonstrates the bug. Ideally, this
425             will be in the form of a file which I can drop in to the module's
426             test suite. Please note that such files must work in perl 5.004.
427              
428             If you are feeling particularly generous you can encourage me in my
429             open source endeavours by buying me something from my wishlist:
430             L
431              
432             =head1 SEE ALSO
433              
434             =over 4
435              
436             =item For more capable XML parsers:
437              
438             L
439              
440             L
441              
442             =item The requirements for a module to be Tiny
443              
444             L
445              
446             =back
447              
448             =head1 AUTHOR
449              
450             David Cantrell EFE
451              
452             Thanks to David Romano for some compatibility patches for Ye Aunciente Perl;
453              
454             to Matt Knecht and David Romano for prodding me to support attributes,
455             and to Matt for providing code to implement it in a quick n dirty minimal
456             kind of way;
457              
458             to the people on L and elsewhere who have been kind
459             enough to point out ways it could be improved;
460              
461             to Sergio Fanchiotti for pointing out a bug in handling self-closing tags,
462             and for reporting another bug that I introduced when fixing the first one;
463              
464             to 'Corion' for finding a bug with localised filehandles and providing a fix.
465              
466             =head1 COPYRIGHT and LICENCE
467              
468             Copyright 2007 David Cantrell
469              
470             This module is free-as-in-speech software, and may be used, distributed,
471             and modified under the same terms as Perl itself.
472              
473             =head1 CONSPIRACY
474              
475             This module is also free-as-in-mason software.
476              
477             =cut
478              
479             'zero';