" align = 'center' summary = 'body'".

Note the leading space.

=item o content

This is an arrayref of bits and pieces of content.

Consider this fragment of HTML:

<p>I did <i>not</i> say I <i>liked</i> debugging.</p>

When parsing 'I did ', the number of child nodes (of <p>) is 0, since <i> has not yet been detected.

So, 'I did ' is stored in the 0th element of the arrayref belonging to <p>.

Likewise, 'not' is stored in the 0th element of the arrayref belonging to the node <i>.

Next, ' say I ' is stored in the 1st element of the arrayref belonging to <p>,
because it follows the 1st child node (<i>).

Likewise, ' debugging' is stored in the 2nd element of the arrayref belonging to <p>.

This way, the input string can be reproduced by successively outputting the elements of the arrayref of content
interspersed with the contents of the child nodes (processed recursively).

Note: If you are processing this tree, never forget that there can be content after the last child node has been closed,
but before the current node is closed.

Note: The DOCTYPE declaration is stored as the 0th element of the content of the root node.

=item o depth

The nesting depth of the tag within the document.

The root is at depth 0, '<html>' is at depth 1, '<head>' and '<body>' are at depth 2, and so on.

It's just there in case you need it.

=item o name

So, the tag '<html>' will mean the name is 'html'.

Tag names are stored in lower-case.

The root of the tree is called 'root', and holds the DOCTYPE, if any, as content.

The root has the node 'html' as the only child, of course.

=item o node_type

This holds 'global' before '<head>' and between '</head>' and '<body>', and after '</body>'.

It holds 'head' for all nodes from '<head>' to '</head>', and holds 'body' from '<body>' to '</body>'.

It's just there in case you need it.

=back

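The way the content arrayref interleaves with the child nodes can be sketched as follows. This is only an illustration: the plain-hash nodes and the node() and rebuild() helpers are hypothetical stand-ins, not this module's tree nodes or API, and attributes are ignored. It assumes a paragraph 'p' holding two 'i' children.

```perl
use strict;
use warnings;

# Hypothetical plain-hash nodes standing in for real tree nodes.
# content  => arrayref of text pieces.
# children => arrayref of child nodes.
sub node {
    my ($name, $content, @children) = @_;
    return {name => $name, content => $content, children => [@children]};
}

my $p = node('p', ['I did ', ' say I ', ' debugging.'],
    node('i', ['not']),
    node('i', ['liked']),
);

# Rebuild the markup by alternating content elements and child nodes.
sub rebuild {
    my ($node)   = @_;
    my @content  = @{$node->{content}};
    my @children = @{$node->{children}};
    my $html     = "<$node->{name}>";

    # Element $i of the content precedes child $i ...
    for my $i (0 .. $#children) {
        $html .= $content[$i] // '';
        $html .= rebuild($children[$i]);
    }

    # ... and anything left over follows the last child, before the close tag.
    $html .= join '', grep { defined } @content[scalar(@children) .. $#content];

    return $html . "</$node->{name}>";
}

my $html = rebuild($p);
print "$html\n";
```

The final join is the step that is easy to forget: it emits the content stored after the last child node has been closed.
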
=head2 How are tags and attributes handled?

Tags are stored in lower-case, in a tree managed by L<Tree::Simple>.

Attributes are stored in the same case as in the original HTML.

The root of the tree is returned by L</root()>.

=head2 How are HTML comments handled?

They are treated as content. This includes the prefix '<!--' and the suffix '-->'.

=head2 How is DOCTYPE handled?

It is treated as content belonging to the root of the tree.

=head2 How is the XML declaration handled?

It is treated as content belonging to the root of the tree.

=head2 Does this module handle all HTML pages?

No, never.

=head2 Which versions of HTML does this module handle?

Up to version 4.

=head2 What do I do if this module does not handle my HTML page?

Make yourself a nice cup of tea, and then fix your page.

=head2 Does this validate the HTML input?

No.

For example, if you feed in an HTML page without the title tag, this module does not care.

=head2 How do I view the output HTML?

There are various ways.

=over 4

=item o See scripts/parse.html.pl

=item o By installing HTML::Revelation, of course!

Sample output:

L.

=back

=head2 How do I test this module (or my file)?

Preferably, see the previous question, or...

Suggested steps:

Note: There are quite a few files involved. Proceed with caution.

=over 4

=item o Select an HTML file to test

Call this input.html.

=item o Run input.html through reveal.pl

reveal.pl ships with HTML::Revelation.

Call the output file output.1.html.

=item o Run input.html through parse.html.pl

parse.html.pl ships with HTML::Parser::Simple.

Call the output file parsed.html.

=item o Run parsed.html through reveal.pl

Call the output file output.2.html.

=item o Compare output.1.html and output.2.html

If they match, or even if they don't match, you're finished.

=back

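The final comparison step can be scripted. The sketch below uses the core modules File::Compare and File::Temp; the two temporary files are stand-ins for output.1.html and output.2.html, and the write_temp() helper is hypothetical. It does not run reveal.pl or parse.html.pl, whose command-line options are not covered here.

```perl
use strict;
use warnings;
use File::Compare qw(compare);
use File::Temp qw(tempfile);

# Write a string to a temporary file and return the file's name.
sub write_temp {
    my ($text) = @_;
    my ($fh, $name) = tempfile(UNLINK => 1);
    print {$fh} $text;
    close $fh or die "Cannot close $name: $!";
    return $name;
}

# Stand-ins for output.1.html and output.2.html.
my $first  = write_temp("<p>same</p>\n");
my $second = write_temp("<p>same</p>\n");

# compare() returns 0 when the files are equal, 1 when they differ, -1 on error.
my $result = compare($first, $second);
print $result == 0 ? "Files match\n" : "Files differ\n";
```

In practice, replace the two temporary file names with the paths of the real output files.
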
=head2 Will you implement a 'quirks' mode to handle my special HTML file?

No, never.

Help with quirks: L.

=head2 Is there anything I should be aware of?

Yes. If your HTML file is not nice, the interpretation of tag nesting will not match
your preconceptions.

In such cases, do not seek to fix the code. Instead, fix your (faulty) preconceptions, and fix your HTML file.

The 'a' tag, for example, is defined to be an inline tag, but the 'div' tag is a block-level tag.

I don't define 'a' to be inline; others do, e.g. L and hence HTML::Tagset.

Inline means:

	<a><div>NAME</div></a>

will I<not> be parsed as an 'a' containing a 'div'.

The 'a' tag will be closed before the 'div' is opened. So, the result will look like:

	<a></a><div>NAME</div>

To achieve what was presumably intended, use 'span':

	<a><span>NAME</span></a>

Some people (cough cough) have had to redo their entire websites due to this very problem.

Of course, this is just one of a vast set of possible problems.

You have been warned.

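The rule that an open inline tag is closed before a block-level tag is opened can be sketched like this. It is a simplified illustration, not the module's own code: the %block and %inline tables are tiny hypothetical samples, and renest() is a made-up helper that works on a pre-tokenised list rather than on raw HTML.

```perl
use strict;
use warnings;

# Hypothetical, deliberately tiny tag tables; real lists are much longer.
my %block  = map { $_ => 1 } qw(div p table);
my %inline = map { $_ => 1 } qw(a span i b);

# Re-nest a flat list of tokens: ['open', $tag], ['close', $tag], ['text', $string].
sub renest {
    my (@token) = @_;
    my @stack;
    my $output = '';

    for my $item (@token) {
        my ($type, $value) = @$item;

        if ($type eq 'open') {
            # A block-level tag forces every open inline tag to be closed first.
            if ($block{$value}) {
                $output .= '</' . pop(@stack) . '>' while @stack && $inline{$stack[-1]};
            }
            push @stack, $value;
            $output .= "<$value>";
        }
        elsif ($type eq 'close') {
            # Ignore a close tag whose element is no longer open.
            next unless grep { $_ eq $value } @stack;

            # Close everything down to, and including, the matching tag.
            while (@stack) {
                my $tag = pop @stack;
                $output .= "</$tag>";
                last if $tag eq $value;
            }
        }
        else {
            $output .= $value;
        }
    }

    # Close anything still open at the end of the input.
    $output .= '</' . pop(@stack) . '>' while @stack;

    return $output;
}

my $with_div  = renest(['open', 'a'], ['open', 'div'],  ['text', 'NAME'], ['close', 'div'],  ['close', 'a']);
my $with_span = renest(['open', 'a'], ['open', 'span'], ['text', 'NAME'], ['close', 'span'], ['close', 'a']);

print "$with_div\n$with_span\n";
```

With the 'div' input, the 'a' ends up empty and the 'div' becomes its sibling; with the 'span' input, the nesting survives unchanged.
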
=head2 Why did you use Tree::Simple but not Tree or Tree::Fast or Tree::DAG_Node?

During testing, Tree::Fast crashed, so I replaced it with Tree and everything worked. Spooky.

Late news: Tree does not cope with an arrayref stored in the metadata, so I've switched to Tree::DAG_Node.

Stop press: As an experiment I switched to Tree::Simple. Since it also works I'll just keep using it.

=head2 Why isn't this module called HTML::Parser::PurePerl?

=over 4

=item o The API

That name sounds like a pure Perl version of the same API as used by HTML::Parser.

But the APIs are not, and are not meant to be, compatible.

=item o The tie-in

Some people might falsely assume HTML::Parser can automatically fall back to HTML::Parser::PurePerl in the absence of a compiler.

=back

=head2 How do I output my own stuff while traversing the tree?

=over 4

=item o The sophisticated way

As always with OO code, sub-class! In this case, you write a new version of the traverse() method.

See L<HTML::Parser::Simple::Reporter>, for example. It overrides the traverse() method.

=item o The crude way

Alternatively, implement another method in your sub-class, e.g. process(), which recurses like traverse().
Then call parse() and process().

=back

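The sub-classing approach can be sketched with stand-in classes. Everything here is hypothetical: Demo::Parser is NOT HTML::Parser::Simple, the plain-hash tree is not a Tree::Simple tree, and the traverse() signature is an assumption made for the example. Only the pattern (a sub-class replacing traverse() to emit its own output) is the point.

```perl
use strict;
use warnings;

# Stand-in base class (hypothetical); it is NOT HTML::Parser::Simple.
package Demo::Parser;

sub new {
    my ($class, %arg) = @_;
    return bless {%arg}, $class;
}

# Default traversal: visit each node depth-first and do nothing with it.
sub traverse {
    my ($self, $node) = @_;
    $self->traverse($_) for @{$node->{children}};
    return;
}

# The sub-class replaces traverse() to build its own report.
package Demo::Reporter;

our @ISA = ('Demo::Parser');

sub traverse {
    my ($self, $node, $depth) = @_;
    $depth //= 0;

    # Record the node name, indented by its nesting depth.
    push @{$self->{lines}}, ('  ' x $depth) . $node->{name};
    $self->traverse($_, $depth + 1) for @{$node->{children}};
    return;
}

package main;

# A hand-built tree shaped like the one described for a parsed page.
my $tree = {name => 'root', children => [
    {name => 'html', children => [
        {name => 'head', children => []},
        {name => 'body', children => []},
    ]},
]};

my $reporter = Demo::Reporter->new(lines => []);
$reporter->traverse($tree);
print "$_\n" for @{$reporter->{lines}};
```

The crude way differs only in the method name: leave traverse() alone and add a separate recursive method, such as process(), to the sub-class.
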
=head2 Is the code on github?

Yes. See: git://github.com/ronsavage/html--parser--simple.git

=head2 How is the source formatted?

I edit with UltraEdit. That means, in general, leading 4-space tabs.

All vertical alignment within lines is done manually with spaces.

Perl::Critic is off the agenda.

=head2 Why did you choose Moos?

For this year's (2012) Google Code-in, I had a quick look at 122 class-building classes, and decided
L<Moos> was suitable, given it is pure-Perl and has the trigger feature I needed.

See L.

=head1 Credits

This Perl HTML parser has been converted from a JavaScript one written by John Resig.

L.

Well done John!

Note also the comments published here:

L.

=head1 Author

C<HTML::Parser::Simple> was written by Ron Savage I<E<lt>ron@savage.net.auE<gt>> in 2009.

Home page: L.

=head1 Copyright

Australian copyright (c) 2009 Ron Savage.

All Programs of mine are 'OSI Certified Open Source Software';
you can redistribute them and/or modify them under the terms of
The Artistic License, a copy of which is available at:
http://www.opensource.org/licenses/index.html

=cut