File Coverage

blib/lib/Treex/Core.pm

Criterion	Covered	Total	%
statement	7	9	77.7
branch			n/a
condition			n/a
subroutine	3	3	100.0
pod			n/a
total	10	12	83.3

line	stmt	sub	time	code
1				package Treex::Core;
2				$Treex::Core::VERSION = '2.20160630';
3	8	8	43612	use strict;
	8		22
	8		231
4	8	8	49	use warnings;
	8		18
	8		216
5	8	8	4473	use Treex::Core::Document;
	0
	0
6				use Treex::Core::Node;
7				use Treex::Core::Bundle;
8				use Treex::Core::Scenario;
9
10				1;
11
12				__END__
13
14				=pod
15
16				=encoding utf8
17
18				=head1 NAME
19
20				Treex::Core - interface to linguistic structures and processing units in Treex
21
22				=head1 VERSION
23
24				version 2.20160630
25
26				=head1 SYNOPSIS
27
28				use Treex::Core;
29
30				my $doc = Treex::Core::Document->new;
31
32				my $bundle = $doc->create_bundle;
33				my $zone = $bundle->create_zone('en');
34				my $atree = $zone->create_atree;
35
36				my $predicate = $atree->create_child({form=>'loves'});
37
38				foreach my $argument (qw(John Mary)) {
39				my $child = $atree->create_child( { form=>$argument } );
40				$child->set_parent($predicate);
41				}
42
43				$doc->save('demo.treex');
44
45
46				=head1 DESCRIPTION
47
48				C<Treex::Core> is a library of modules for processing linguistic data,
49				especially tree-shaped syntactic representations of natural language
50				sentences, both for language analysis and synthesis purposes.
51
52				C<Treex::Core> is meant to be as language-universal as possible.
53				It makes only a few assumptions: the language's written form must be
54				representable by Unicode characters, and it should be possible to segment
55				texts in such a language into sentences (or sentence-like units) and words
56				(or word-like units).
57
58				C<Treex::Core> is tightly coupled with the tree editor TrEd, which
59				makes browsing the linguistic data structures very comfortable.
60
61				C<Treex::Core> uses TrEd's L<Treex::PML> for the memory
62				representation, as well as for storing the data into *.treex files, using
63				the XML-based Prague Markup Language.
64
65
66				=head2 Zones parametrized by language codes and selectors
67
68				Treex documents can contain parallel texts in two or more languages,
69				as well as alternative linguistic representations (such as two
70				dependency parses of the same sentence, resulting from different parsers).
71				Such contents of the same type are separated by introducing zones.
72
73				Zones (classes derived from L<Treex::Core::Zone>) are
74				parametrized by ISO language codes, and optionally also by so-called
75				selectors. A selector can be any string identifying the source or purpose of the
76				given piece of data. It can distinguish, e.g., a reference translation from
77				machine-translated text, or the most probable parse of a given sentence from
78				the second most probable parse. In Treex data structures, zones are used at
79				two levels:
80
81				- L<Treex::Core::DocZone> - allows multiple texts to be
82				stored in the same document
83
84				- L<Treex::Core::BundleZone> - allows multiple
85				sentences and their representations to be stored in each bundle.
86
87				As for Treex processing units (scenarios and blocks, see below), each
88				processing unit either limits itself to a certain zone, or it can be
89				zone-parametrized (especially in the case of language-universal blocks).
90
91				=head2 Data structure units
92
93				In Treex, linguistic representations of running texts are organized
94				in the following hierarchy:
95
96				=head3 Documents
97
98				The smallest independently storable unit is a document
99				(L<Treex::Core::Document>).
100
101				Technically, each document consists of a set of document zones, and of a
102				sequence of bundles.
103
104				=head3 Document zone
105
106				A document can contain one or more zones
107				(L<Treex::Core::DocZone>), each of them containing a text.
108
109				=head3 Bundle
110
111				A bundle (L<Treex::Core::Bundle>) corresponds to a
112				sentence (or a tuple of parallel or alternative sentences) and all its (or
113				their) linguistic analyses.
114
115				Technically, a bundle contains a set of bundle zones.
116
117				=head3 Bundle zone
118
119				A bundle zone (L<Treex::Core::BundleZone>) contains one sentence
120				and at most one linguistic analysis of it for each layer of analysis. The
121				following layers are currently distinguished:
122
123				* a-layer - analytical layer (surface syntax dependency layer) merged with the
124				morphological layer, as defined in the Prague Dependency Treebank.
125
126				* t-layer - tectogrammatical layer (deep-syntactic dependency)
127
128				* p-layer - phrase-structure layer
129
130				* n-layer - named entity layer
131
132				Each layer representation has the form of a tree, represented by the tree's root node.
133
134				=head3 Node
135
136				Each node has a parent (unless it is the root) and a set of predefined
137				attributes, depending on the layer it belongs to. There is an abstract class
138				L<Treex::Core::Node> defining the functionality which is
139				common to all types of trees (such as functions for accessing a node's parent or
140				children). Functionality specific to the individual linguistic layers is
141				implemented in the derived classes:
142
143				* L<Treex::Core::Node::A>
144
145				* L<Treex::Core::Node::T>
146
147				* L<Treex::Core::Node::P>
148
149				* L<Treex::Core::Node::N>
150
151				=head3 Attributes
152
153				Nodes contain attribute-value pairs. Some attributes are universal (such as
154				the identifier), but most of them are specific to a certain layer. Although node
155				instances are regular Moose objects (i.e., blessed hashes), a node's attributes
156				should be accessed exclusively via predefined accessors.
157
158				Attribute values can be plain or further structured using PML data types (e.g.
159				sequences), according to the PML schema.
160
161
162				=head2 Processing units
163
164				=head3 Block
165
166				Blocks (descendants of L<Treex::Core::Block>) are the
167				smallest processing units applicable to Treex documents.
168
169				=head3 Scenario
170
171				Scenarios (instances of L<Treex::Core::Scenario>) are
172				sequences of blocks. Blocks from a scenario are applied to a document one
173				after another.
174
175				=head2 Support for visualizing Treex trees in TrEd
176
177				C<Treex::Core> also contains a TrEd extension for browsing .treex files.
178				The extension itself is only a thin wrapper for the viewing functionality
179				implemented in L<Treex::Core::TredView>.
180
181
182				=head1 AUTHOR
183
184				Zdeněk Žabokrtský <zabokrtsky@ufal.mff.cuni.cz>
185
186				Martin Popel <popel@ufal.mff.cuni.cz>
187
188				David Mareček <marecek@ufal.mff.cuni.cz>
189
190				=head1 COPYRIGHT AND LICENSE
191
192				Copyright © 2011 by Institute of Formal and Applied Linguistics, Charles University in Prague
193
194				This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.