File Coverage

blib/lib/Text/TEI/Markup.pm

Criterion	Covered	Total	%
statement	13	15	86.6
branch			n/a
condition			n/a
subroutine	5	5	100.0
pod			n/a
total	18	20	90.0

line	stmt	sub	time	code
1				package Text::TEI::Markup;
2
3	1	1	48448	use strict;
	1		2
	1		36
4	1	1	4	use vars qw( $VERSION @EXPORT_OK );
	1		1
	1		45
5	1	1	5	use Encode;
	1		2
	1		80
6	1	1	4	use Exporter 'import';
	1		1
	1		20
7	1	1	1621	use XML::LibXML;
	0
	0
8
9				use utf8;
10
11				$VERSION = '1.9';
12				@EXPORT_OK = qw( &to_xml &word_tag_wrap );
13
14				=head1 NAME
15
16				Text::TEI::Markup - a transcription markup syntax for TEI XML
17
18				=head1 SYNOPSIS
19
20				use Text::TEI::Markup qw( to_xml );
21				my $xml_string = to_xml( file => $markup_file,
22				template => $template_xml_string,
23				%opts ); # see below for available options
24
25				use Text::TEI::Markup qw( word_tag_wrap );
26				my $word_wrapped_xml = word_tag_wrap( $tei_xml_string );
27
28				=head1 DESCRIPTION
29
30				TEI XML is a wonderful thing. The elements defined therein allow a
31				transcriber to record and represent just about any feature of a text that
32				he or she encounters.
33
34				The problem is the transcription itself. When I am transcribing a
35				manuscript, especially if that manuscript is in a bunch of funny characters
36				on the keymap for another language, I do not want to be switching back and
37				forth between keyboard layouts in order to type "
38				arrow-arrow-arrow-arrow-arrow " every six seconds. It's prone to
39				typo, it's astonishingly slow, and it makes my wrists hurt just to think
40				about it. I also don't really want to fire up an XML editor, select the
41				words or characters that need to be tagged, and click a lot. That way is
42				not prone to typo, but it's still pretty darn slow, and it makes my wrists
43				hurt B to think about.
44
45				Text::TEI::Markup is my solution to that problem. It defines a bunch of
46				single- or double-character sigils that represent tags. These are a lot
47				faster and easier to type; I don't have to worry about typos; and I can do
48				it all with a plain text editor, thus minimizing use of the mouse.
49
50				I have tried to pick sigils that don't conflict with characters that are
51				found in manuscripts. I have succeeded for my particular set of
52				manuscripts, but I have not succeeded for the general case. If you like the
53				idea behind this module, you are still almost guaranteed to hate the sigils
54				I've picked. That's okay; you can re-define them.
55
56				=head2 Extra bonus solution: word wrapping with and
57
58				Even if you are happy as a clam in the graphical XML editor of your choice,
59				this module exports a function that may be useful to you. The TEI P5
60				guidelines include a module called "analysis", which allows the user to tag
61				sentences, clauses, words, morphemes, or any other sort of semantic segment
62				of a text. This is really good for programmatic applications, but very
63				boring and repetitive to have to tag.
64
65				The function B solves part of this problem for you. It takes
66				an XML string as input, looks for words (defined by whitespace separation)
67				and returns an XML string with each of these words wrapped in an
68				appropriate tag. If the word has complex elements (e.g. editorial
69				expansion), it will be wrapped in a 70	be in a simple tag. It handles line breaks and page breaks within
71				words, as long as there is no trailing whitespace before the (or
72				) tag, and as long as the whitespace after the tag contains a carriage
73				return.
74
75				=head1 MARKUP SYNTAX
76
77				The input file has a header and a body. The header begins with a '=HEAD'
78				tag, and consists of a colon-separated list of key_value pairs. These keys,
79				which are case insensitive, get directly substituted into an XML template;
80				the idea is that your TEI header won't change very much between files, so
81				you write it once with template values, pass it to &to_xml, and the
82				substitution happens as if by magic. The keyword /MAIN/i is reserved for
83				the content between the tags - that is, all the content that
84				will be generated after the '=BODY' tag.
85
86				A very simple template looks like this:
87
88
89
90
91
92
93				__TITLE__
94
95
96				Transcription by
97				__MYNAME__
98
99
100
101
102
103
104				__MAIN__
105
106
107
108
109				Your input file should then begin something like this:
110
111				=HEAD
112				title:My Summer Vacation: a novel
113				author:John Smith
114				myinitials:tla
115				myname:Tara L Andrews
116				=BODY
117				The ^real^ text b\e\gins +(above)t+here.
118				...
119
120
121				The real work begins after the '=BODY' tag. The currently-defined sigil
122				list is:
123
124				%SIGILS = (
125				'comment' => '##',
126				'add' => '+',
127				'del' => '-',
128				'subst' => "\x{b1}", # Unicode PLUS-MINUS SIGN
129				'div' => "\x{a7}", # Unicode SECTION SIGN
130				'p' => "\x{b6}", # Unicode PILCROW SIGN
131				'ex' => '\\',
132				'expan' => '^',
133				'supplied' => '@',
134				'abbr' => [ '{', '}' ],
135				'num' => '%',
136				'pb' => [ '[', ']' ],
137				'cb' => '\|',
138				'hi' => '*',
139				'unclear' => '?',
140				'q' => "\x{2020}", # Unicode DAGGER
141				);
142
143				Non-identical matched sets of sigla (e.g. '{}' for abbreviations) should be
144				specified in a listref, as seen here.
145
146				Whitespace is only significant at the end of lines. If a line which
147				contains non-tag text (i.e. words) ends in whitespace, it is assumed that
148				the previous word is a complete word. If the line ends with a
149				non-whitespace character, it is assume that the word continues onto the
150				next line.
151
152				All the sigils must be balanced, and they must nest properly. Remember that
153				this is a shorthand for XML. I could be convinced to try to autocorrect
154				some unbalanced sigils, but it would be worth at least a few pints of cider
155				(or, of course, a patch.)
156
157				=head2 Tag arguments
158
159				Certain of the tags can be passed extra arguments:
160
161				=over 4
162
163				=item C
164
165				Anything that appears in parentheses immediately after the add/del opening
166				sigil ( + or - in the examples above) will get added as an attribute. If
167				the string in parentheses has no '=' sign in it, the attribute for the
168				"add" tag will be "place", and the attribute for the "del" tag will be
169				"type". Ergo:
170
171				+(margin)This is an addition+
172				-(overwrite)and a deletion- to the sentence.
173
174				will get translated to
175
176				This is an addition
177				~~and a deletion~~ to the sentence.
178
179				This behavior ought to be more configurable and/or flexible; make it worth
180				my while.
181
182				=item C
183
184				A number value can calculated using a number_conversion function, or it can
185				simply be specified. It is also possible to specify the type of number being
186				represented (Binal, Binal, Btion, Bentage). The arguments
187				are separated with a comma, and in the order "value", "type". So for example:
188
189				The lead was taken by the Exeter %(8)VIII%. This was their
190				%(13,ord)thirteenth% straight win.
191
192				will become:
193
194				The lead was taken by the Exeter VIII. This was their
195				thirteenth straight win.
196
197				=item C
198
199				When text highlighting is encoded, it is almost always a good idea to say
200				something about how the highlight was rendered. This information can be passed
201				as an argument:
202
203				(red)IN the beginning was the word
204
205				will become
206
207				IN the beginning was the word
208
209				=back
210
211				=head1 SUBROUTINES
212
213				=over 4
214
215				=item B( file => '$filename', %opts );
216
217				Takes the name of a file that holds a marked-up version of text. Returns a
218				TEI XML string to represent that text. Options include:
219
220				=over 4
221
222				=item C
223
224				a string containing the XML template that you want to use for the markup.
225				If none is specified, there is a default. That default is useful for me,
226				but is very unlikely to be useful for you. =item C
227
228				a mode string to pass to the open() call on the file. Default "<:utf8".
229
230				=item C
231
232				a subroutine ref that will calculate the value of number representations.
233				Useful for, e.g., Latin numerals. This is optional - if nothing is passed,
234				no number value calculation will be attempted. =item C
235
236				a hashref containing the preferred sigil representations of TEI tags.
237				Defaults to the list above.
238
239				=item C
240
241				Defaults to "true". If you pass a false value, the word wrapping will be
242				skipped.
243
244				=item C
245
246				Defaults to 0. Controls whether rudimentary formatting is applied to the
247				XML returned. Possible values are 0, 1, and "more than 1". See
248				XML::LibXML::Document::serialize for more information. (Personally I just
249				xmllint it separately.)
250
251				=back
252
253				The return string is run through the basic formatting mechanism provided by
254				XML::LibXML. You may wish to pass it through a pretty printer more to your
255				taste.
256
257				=cut
258
259				# Default list of funky signs I use.
260				# TODO: Add header support
261				my %SIGILS = (
262				'comment' => '##',
263				'add' => '+',
264				'del' => '-',
265				'subst' => "\x{b1}",
266				'div' => "\x{a7}",
267				'p' => "\x{b6}",
268				'ex' => '\\',
269				'expan' => '^',
270				'supplied' => '@',
271				'abbr' => [ '{', '}' ],
272				'num' => '%',
273				'pb' => [ '[', ']' ],
274				'cb' => '\|',
275				'hi' => '*',
276				'unclear' => '?',
277				'q' => "\x{2020}",
278				);
279
280				my @DTL = ;
281				my $DEFAULT_TEMPLATE = join( '', @DTL );
282
283				sub to_xml {
284				my %opts = (
285				'number_conversion' => undef,
286				'fileopen_mode' => '<:utf8',
287				'wrap_words' => 1,
288				'sigils' => \%SIGILS,
289				'template' => undef,
290				'format' => 0,
291				@_,
292				);
293
294				unless( defined( $opts{'file'} ) ) {
295				warn "No file specified! Doing nothing.";
296				return undef;
297				}
298
299				if( defined $opts{'number_conversion'}
300				&& ref( $opts{'number_conversion'} ) ne 'CODE' ) {
301				warn "number_conversion argument must be a subroutine ref";
302				$opts{'number_conversion'} = undef;
303				}
304
305				my $inbody;
306
307				my $rc = open( FILE, $opts{'fileopen_mode'}, $opts{'file'} );
308				unless( $rc ) {
309				warn "Could not open $opts{'file'}: $@";
310				return undef;
311				}
312
313				my $tmpl;
314				if( defined $opts{'template'} ) {
315				$tmpl = $opts{'template'};
316				} else {
317				$tmpl = $DEFAULT_TEMPLATE;
318				}
319
320				my $main_xml;
321
322				my( $in_p, $in_div ) = ( undef, undef );
323				while() {
324				s/\R+$//g; # chomp, no matter the newline char
325				next if /^\s*$/;
326				s/^\s*//; # but keep trailing spaces - they're significant!
327				_current_context( $_ );
328
329				if( /^=BODY/ ) {
330				$inbody = 1;
331				# Have we found a responsible person?
332				unless( exists $opts{'resp'} ) {
333				warn "No responsible person specified for edits!";
334				}
335				next;
336				}
337
338				if( /^(\w+)\s:\s(.*)$/ ) {
339				# Make the header template substitution.
340				_make_warning( "We are in the BODY section but this looks like a header" )
341				if $inbody;
342				my( $key, $val ) = ( lc( $1 ), $2 );
343				$val =~ s/\s+$//;
344				if( $key eq 'main' ) {
345				warn "You cannot use '$key' as a substitution key!";
346				} else {
347				$tmpl =~ s/__${key}__/$val/gi;
348				}
349				if( $key eq 'transcriberid' ) {
350				$opts{'resp'} = '#' . $val;
351				}
352				}
353
354				if( $inbody ) {
355				# Send it to the parser.
356				my $line;
357				## TODO: Upgrade to perl 5.10 to get state variables.
358				( $line, $in_div, $in_p ) = _process_line( $_, $in_div, $in_p, %opts );
359				$main_xml .= $line;
360				}
361				}
362				close FILE;
363
364				$tmpl =~ s/__MAIN__/$main_xml/;
365				if( $opts{'wrap_words'} ) {
366				$tmpl = word_tag_wrap( $tmpl, $opts{'format'} );
367				} else {
368				# Just make sure it parses, and format it if asked.
369				my $parser = XML::LibXML->new();
370				my $doc;
371				my $ok = eval{ $doc = $parser->parse_string( $tmpl ); };
372				unless( $ok ) {
373				warn "Parsing of the new XML doc failed: $@";
374				return undef;
375				}
376				$tmpl = decode( $doc->encoding, $doc->serialize( $opts{'format'} ) );
377				}
378				return $tmpl;
379				}
380
381				sub _process_line {
382				my( $line, $in_div, $in_p, %opts ) = @_;
383				chomp $line;
384				my $checkline = $line; # This should be well-formed by the end
385				my $clopts = { %opts, 'nowarn' => 1 };
386				# Look for paragraph and div markers, i.e. our tags that can span multiple lines
387				# and that should be disregarded in the checkline.
388				my $sigils = $opts{'sigils'};
389				my( $divsig, $pgsig ) = ( $sigils->{'div'}, $sigils->{'p'} );
390				while( $line =~ /\Q$divsig\E(\d*)/g ) {
391				my $divno = $1;
392				# Calculate the starting position.
393				my $pos = pos( $line ) - 1;
394				$pos -= length( $divno ) if $divno;
395
396				if( $in_div ) {
397				_make_warning( "Nonsensical division number at end-division tag; are your '$divsig' tags balanced?" )
398				if $divno;
399				substr( $line, $pos, 1, '' );
400				} else {
401				my $divstr = '";
402				substr( $line, $pos, pos( $line ) - $pos, $divstr );
403				}
404				$in_div = !$in_div;
405				}
406				$checkline =~ s/\Q$divsig\E//g;
407
408				while( $line =~ /\Q$pgsig\E/g ) {
409				my $p_str = '<' . ( $in_p ? '/' : '' ) . 'p>';
410				substr( $line, pos( $line ) - 1, 1, $p_str );
411				$in_p = !$in_p;
412				}
413				$checkline =~ s/\Q$pgsig\E//g;
414
415				# Add and delete tags. Do this first so that we do not stomp later
416				# instances of the dash (e.g. in XML comments).
417				my $add_del_re = qr/([-+])($([^$]+)\))?(.*?)\1/;
418				while( $line =~ /$add_del_re/g ) {
419				my( $op, $attr, $word ) = ( $1, $3, $4 );
420				# Calculate starting position.
421				my $pos = pos( $line ) - ( length( $word ) + 2 );
422				# Also for the checkline.
423				$checkline =~ /$add_del_re/g;
424				my $cpos = pos( $checkline ) - ( length( $word ) + 2 );
425				$pos -= ( length( $attr ) + 2 ) if $attr;
426				$cpos -= ( length( $attr ) + 2 ) if $attr;
427				# Figure out what the attribute string, if any, should be.
428				my $attr_str;
429				if( $attr && $attr =~ /\=/ ) {
430				$attr_str = $attr;
431				} elsif ( $attr ) {
432				$attr_str = ( $op eq '+' ? "place" : "type" )
433				. "=\"$attr\"";
434				}
435				my $interp_str = '<' . ( $op eq '+' ? 'add' : 'del' )
436				. ( $attr_str ? " $attr_str" : '' )
437				. ">$word';
438				substr( $line, $pos, pos( $line ) - $pos, $interp_str );
439				substr( $checkline, $cpos, pos( $checkline ) - $cpos, $interp_str );
440				}
441
442				# All the tags that are not very special cases.
443				foreach my $tag ( qw( subst abbr hi ex expan num unclear q supplied ) ) {
444				my $tag_sig = $sigils->{$tag};
445				my( $tag_open, $tag_close );
446				if( ref( $tag_sig ) eq 'ARRAY' ) {
447				( $tag_open, $tag_close ) = @$tag_sig;
448				} else {
449				$tag_open = $tag_close = $tag_sig;
450				}
451				$line =~ s\|\Q$tag_open\E(.*?)\Q$tag_close\E\|_open_tag( $tag, $1, \%opts ) . ""\|ge;
452				$checkline =~ s\|\Q$tag_open\E(.*?)\Q$tag_close\E\|_open_tag( $tag, $1, $clopts ) . ""\|ge;
453				}
454
455				# Standalone tags that aren't special cases. Currently only cb.
456				foreach my $tag ( qw( cb ) ) {
457				my $tag_sig = $sigils->{$tag};
458				$line =~ s\|\Q$tag_sig\E\|"<$tag/>"\|ge;
459				$checkline =~ s\|\Q$tag_sig\E\|"<$tag/>"\|ge;
460				}
461
462
463				# Page breaks. Defined by the delimiters, plus an optional
464				# page/folio number & recto/verso indicator, on a line by itself.
465				# Of course other languages may use other sigils to indicate recto
466				# verso, so do not look for 'r' and 'v' specifically.
467				my $pb_sig = $sigils->{'pb'};
468				my ( $pb_open, $pb_close );
469				if( ref( $pb_sig ) eq 'ARRAY' ) {
470				( $pb_open, $pb_close ) = @$pb_sig;
471				} else {
472				$pb_open = $pb_sig;
473				$pb_close = $pb_sig;
474				}
475				$line =~ s\|^\Q$pb_open\E(\d+(.)?)\Q$pb_close\E\s*$\|\|;
476				$checkline =~ s\|^\Q$pb_open\E(\d+(.)?)\Q$pb_close\E\s*$\|\|;
477
478				# XML comments. Convert ## text ## to
479				my $com_sig = $sigils->{'comment'};
480				my ( $com_open, $com_close );
481				if( ref( $com_sig ) eq 'ARRAY' ) {
482				( $com_open, $com_close ) = @$com_sig;
483				} else {
484				$com_open = $com_close = $com_sig;
485				}
486				$line =~ s\|\Q$com_open\E(.*?)\Q$com_close\E\|\|g;
487				$checkline =~ s\|\Q$com_open\E(.*?)\Q$com_close\E\|\|g;
488
489				# At this point our check-line should be well-balance. Send a warning if not.
490				my $parser = XML::LibXML->new();
491				my $fragment;
492				my $ok = eval{ $fragment = $parser->parse_balanced_chunk( "$checkline" ); };
493				unless( $ok ) {
494				_make_warning( "Sigils are not properly nested." );
495				}
496
497				# Finally, every line with text outside an XML tag must have a line
498				# break. Any lb tag should be inside a cb, p, or div tag.
499				my $testline = $line;
500				$testline =~ s/<[^>]*>//g;
501				if( $testline =~ /\S/ ) {
502				no warnings 'uninitialized';
503				$line =~ s!( \|\|)?$!$1!;
504				}
505
506				# Return the expanded line.
507				return( "$line\n", $in_div, $in_p );
508				}
509
510				sub _open_tag {
511				my( $tag, $text, $opts ) = @_;
512
513				my $opened_tag;
514				# Does the tag take a parenthesized argument?
515				my $arg = '';
516				if( $text =~ /^$([^$]+)\)(.*)$/ ) {
517				( $arg, $text ) = ( $1, $2 );
518				}
519				if( $tag =~ /^(ex\|expan\|supplied)$/ ) {
520				# It takes a resp agent.
521				$opened_tag = '<'. $tag .' resp="' . $opts->{'resp'} . "\">$text";
522				} elsif ( $tag eq 'q' ) {
523				# Special case - we mean a biblical quote.
524				$opened_tag = '' . $text;
525				} elsif ( $tag eq 'num' ) {
526				# Derive the number's value if requested.
527				my $nv;
528				if( $arg ) {
529				my $nt;
530				my %ntabbr = (
531				'ord' => 'ordinal',
532				'card' => 'cardinal',
533				'frac' => 'fraction',
534				'perc' => 'percentage' );
535				( $nv, $nt ) = split( /,/, $arg );
536				$nt = $ntabbr{$nt} \|\| $nt;
537				if( $nt ) {
538				$opened_tag = sprintf( '%s',
539				$nv, $nt, $text );
540				} else {
541				$opened_tag = sprintf( '%s', $nv, $text );
542				}
543				}
544				unless( defined $nv ) {
545				my $numconvert = $opts->{'number_conversion'};
546				if( defined $numconvert ) {
547				# Strip any XML markup from the element contents.
548				my $parser = XML::LibXML->new();
549				my $fragment;
550				my $ok = eval{ $fragment = $parser->parse_balanced_chunk( $text ); };
551				if( $ok ) {
552				$nv = &$numconvert( uc( $fragment->textContent() ) );
553				} else {
554				_make_warning( "Unbalanced chunk in number tag: $text" )
555				unless $opts->{nowarn};
556				}
557				$opened_tag = sprintf( '%s', $nv, $text )
558				if defined $nv;
559				}
560				}
561				} elsif ( $tag eq 'hi' ) {
562				unless( $arg ) {
563				_make_warning( "What kind of highlighting is this?" )
564				unless $opts->{nowarn};
565				$arg = 'DEFAULT';
566				}
567				$arg =~ s/\s+/_/g;
568				$opened_tag = sprintf( '<%s rend="%s">%s', $tag, $arg, $text );
569				}
570
571				# The default
572				$opened_tag = "<$tag>$text" unless $opened_tag;
573				return $opened_tag;
574				}
575
576				sub _make_warning {
577				my $message = shift;
578				my $context = _current_context();
579				my $warning = "($.) $context\n\tPossible problem! $message";
580				warn $warning;
581				}
582
583
584				## Utility to keep track of where we are
585				{
586				my $curr_line;
587
588				sub _current_context {
589				if( @_ ) {
590				$curr_line = shift;
591				}
592				return $curr_line;
593				}
594				}
595
596				=item B( $xml_string )
597
598				Takes a string containing a TEI XML document, and returns that
599				document with all its words wrapped in (or ) tags. A
600				"word" is defined as a series of text characters separated by
601				whitespace. A word can have a line break, or even a page break, in
602				the middle; if this is the case, there I be any whitespace
603				between the end of the first word segment and the (or )
604				tag. Conversely, there I be whitespace separating the (or
605				) from a complete word.
606
607				=cut
608
609				sub word_tag_wrap {
610				my( $xml, $format ) = @_;
611
612				my $ret;
613				my $doc;
614				my $root;
615				if( !ref( $xml ) ) {
616				$ret = 'string';
617				my $parser = XML::LibXML->new();
618				$doc = $parser->parse_string( $xml );
619				$root = $doc->getDocumentElement();
620				} elsif( ref( $xml ) eq 'XML::LibXML::Document' ) {
621				$ret = 'xml';
622				$root = $xml->getDocumentElement();
623				} elsif( ref( $xml ) eq 'XML::LibXML::Element' ) {
624				$ret = 'xml';
625				$root = $xml;
626				} else {
627				die "Passed argument is neither string, Document, or Element";
628				}
629
630				my @paragraphs;
631				foreach my $t ( $root->getElementsByTagName( 'text' ) ) {
632				# Get the paragraphs in this text node; if it is already the same as
633				# a paragraph in our list, skip it.
634				foreach my $p ( $t->getElementsByTagName( 'p' ) ) {
635				next if grep { $_->isSameNode( $p ) } @paragraphs;
636				push( @paragraphs, $p );
637				}
638				}
639				foreach my $p ( @paragraphs ) {
640				my $new_p = _wrap_children( $p );
641				# Remove the final whitespace from the paragraphs
642				my $lc = $new_p->lastChild;
643				if( ref( $lc ) eq 'XML::LibXML::Text' && $lc->data =~ /^\s+$/ ) {
644				$new_p->removeChild( $lc );
645				}
646				$p->replaceNode( $new_p );
647				}
648
649				# Annoyingly, we have to decode the encoding that takes place when
650				# the string is returned.
651				if( $ret eq 'string' ) {
652				$format = 0 unless $format;
653				return decode( $doc->encoding(), $doc->serialize( $format ) );
654				} # else the doc has been modified and we need return nothing.
655				}
656
657				sub _wrap_children {
658				my $node = shift;
659				my @children = $node->childNodes;
660
661				# Make a new version of the element in question, with its name & attributes
662				my $new_node = XML::LibXML::Element->new( $node->nodeName );
663				# Set the namespace
664				my $docns = $node->namespaceURI;
665				$new_node->setNamespace( $docns );
666				foreach my $attr ( $node->attributes ) {
667				my( $aname, $aval ) = split( /=/, $attr );
668				$aname =~ s/\s+//g;
669				$aval =~ s/\"//g;
670				$new_node->setAttribute( $aname, $aval );
671				}
672				my $open_word_node = undef;
673				foreach my $c ( @children ) {
674				# Is it a text node?
675				if( ref( $c ) eq 'XML::LibXML::Text' ) {
676				# Get the text.
677				my $str = $c->textContent;
678				# Strip out carriage returns and their surrounding spaces.
679				# Carriage returns should only occur after elements,
680				# and the spaces around them should therefore be insignificant.
681				$str =~ s/^\s\n\s//gs;
682				# If there is nothing at all but a newline + initial spaces,
683				# pretend that the node isn't there at all.
684				next unless $str;
685
686				# Get the individual words.
687				my @words = split( /\s+/, $str );
688
689				# Finish out the last word if we need to.
690				if( $open_word_node ) {
691				# If there are any words in this text string, the
692				# first one should be used to close out the open node.
693				# If the first word is empty, it's a space and the
694				# word should just be closed. If there are no words
695				# at all, it was just a space. If the first word was
696				# all there is, we haven't encountered a space yet and
697				# need to keep the word open.
698				if( @words ) {
699				my $first = shift @words;
700				$open_word_node->appendText( $first ) if $first;
701				} else {
702				$open_word_node = undef unless @words;
703				}
704				}
705
706				foreach( @words ) {
707				# Skip whitespace "words"
708				next unless /\S/;
709
710				# Make a new node for the word
711				my $word_node = XML::LibXML::Element->new( 'w' );
712				$word_node->setNamespace( $docns );
713				$word_node->appendText( $_ );
714				$new_node->appendChild( $word_node );
715				$new_node->appendText(' ');
716				# ...and keep it open until we find a new word or a space
717				$open_word_node = $word_node;
718				}
719
720				# Close the last word node if our text node ends in a space.
721				if( $str =~ /\s+$/s ) {
722				$open_word_node = undef;
723				}
724				} else {
725				my $wrapped_child;
726				if ( ref( $c ) ne 'XML::LibXML::Comment' && $c->textContent ne ''
727				&& $c->textContent =~ /\s+/ ) {
728				# Recurse on any node that itself contains whitespace-separated text.
729				my $new_c = _wrap_children( $c );
730				$wrapped_child = ( $c->toString() ne $new_c->toString() );
731				$c = $new_c;
732				}
733
734				# If there is an open word node, make it a seg and append
735				# our result there; if the child has text content but no
736				# word children, wrap it in a new seg; otherwise just pass
737				# it on through.
738				if( $open_word_node ) {
739				$open_word_node->setNodeName( 'seg' );
740				$open_word_node->setAttribute( 'type', 'word' );
741				$open_word_node->appendChild( $c );
742				} elsif( ref( $c ) eq 'XML::LibXML::Comment' \|\| $c->textContent eq ''
743				\|\| $wrapped_child ) {
744				$new_node->appendChild( $c );
745				} else {
746				my $segment_node = XML::LibXML::Element->new( 'seg' );
747				$segment_node->setNamespace( $docns );
748				$segment_node->setAttribute( 'type', 'word' );
749				$segment_node->appendChild( $c );
750				$new_node->appendChild( $segment_node );
751				$new_node->appendText(' ');
752				# Keep it open in case there is not a leading space on the next
753				# text node.
754				$open_word_node = $segment_node;
755				}
756				}
757				}
758
759				return $new_node;
760				}
761
762				1;
763
764				=back
765
766				=head1 BUGS / TODO
767
768				The XML is not currently validated against a schema. This is mostly
769				because I have been unable to get RelaxNG validation to work against
770				certain TEI schemas.
771
772				This module is currently in a state that I know to be useful to me.
773				If it looks like it might be useful to you, but something is bugging
774				you about it, report it!
775
776				=head1 LICENSE
777
778				This package is free software and is provided "as is" without express
779				or implied warranty. You can redistribute it and/or modify it under
780				the same terms as Perl itself.
781
782				=head1 AUTHOR
783
784				Tara L Andrews, L
785
786
787				=cut
788
789				__DATA__