line |
stmt |
bran |
cond |
sub |
pod |
time |
code |
1
|
|
|
|
|
|
|
package Treex::Core; |
2
|
|
|
|
|
|
|
$Treex::Core::VERSION = '2.20150928'; |
3
|
8
|
|
|
8
|
|
61175
|
use strict; |
|
8
|
|
|
|
|
16
|
|
|
8
|
|
|
|
|
203
|
|
4
|
8
|
|
|
8
|
|
39
|
use warnings; |
|
8
|
|
|
|
|
12
|
|
|
8
|
|
|
|
|
252
|
|
5
|
8
|
|
|
8
|
|
5145
|
use Treex::Core::Document; |
|
0
|
|
|
|
|
|
|
|
0
|
|
|
|
|
|
|
6
|
|
|
|
|
|
|
use Treex::Core::Node; |
7
|
|
|
|
|
|
|
use Treex::Core::Bundle; |
8
|
|
|
|
|
|
|
use Treex::Core::Scenario; |
9
|
|
|
|
|
|
|
|
10
|
|
|
|
|
|
|
1; |
11
|
|
|
|
|
|
|
|
12
|
|
|
|
|
|
|
__END__ |
13
|
|
|
|
|
|
|
|
14
|
|
|
|
|
|
|
=pod |
15
|
|
|
|
|
|
|
|
16
|
|
|
|
|
|
|
=encoding utf8 |
17
|
|
|
|
|
|
|
|
18
|
|
|
|
|
|
|
=head1 NAME |
19
|
|
|
|
|
|
|
|
20
|
|
|
|
|
|
|
Treex::Core - interface to linguistic structures and processing units in Treex |
21
|
|
|
|
|
|
|
|
22
|
|
|
|
|
|
|
=head1 VERSION |
23
|
|
|
|
|
|
|
|
24
|
|
|
|
|
|
|
version 2.20150928 |
25
|
|
|
|
|
|
|
|
26
|
|
|
|
|
|
|
=head1 SYNOPSIS |
27
|
|
|
|
|
|
|
|
28
|
|
|
|
|
|
|
use Treex::Core; |
29
|
|
|
|
|
|
|
|
30
|
|
|
|
|
|
|
my $doc = Treex::Core::Document->new; |
31
|
|
|
|
|
|
|
|
32
|
|
|
|
|
|
|
my $bundle = $doc->create_bundle; |
33
|
|
|
|
|
|
|
my $zone = $bundle->create_zone('en'); |
34
|
|
|
|
|
|
|
my $atree = $zone->create_atree; |
35
|
|
|
|
|
|
|
|
36
|
|
|
|
|
|
|
my $predicate = $atree->create_child({form=>'loves'}); |
37
|
|
|
|
|
|
|
|
38
|
|
|
|
|
|
|
foreach my $argument (qw(John Mary)) { |
39
|
|
|
|
|
|
|
my $child = $atree->create_child( { form=>$argument } ); |
40
|
|
|
|
|
|
|
$child->set_parent($predicate); |
41
|
|
|
|
|
|
|
} |
42
|
|
|
|
|
|
|
|
43
|
|
|
|
|
|
|
$doc->save('demo.treex'); |
44
|
|
|
|
|
|
|
|
45
|
|
|
|
|
|
|
|
46
|
|
|
|
|
|
|
=head1 DESCRIPTION |
47
|
|
|
|
|
|
|
|
48
|
|
|
|
|
|
|
C<Treex::Core> is a library of modules for processing linguistic data, |
49
|
|
|
|
|
|
|
especially tree-shaped syntactic representations of natural language |
50
|
|
|
|
|
|
|
sentences, both for language analysis and synthesis purposes. |
51
|
|
|
|
|
|
|
|
52
|
|
|
|
|
|
|
C<Treex::Core> is meant to be as language universal as possible. |
53
|
|
|
|
|
|
|
It makes only a few assumptions: the language's written form must be |
54
|
|
|
|
|
|
|
representable by Unicode characters, and it should be possible to segment |
55
|
|
|
|
|
|
|
texts in such a language into sentences (or sentence-like units) and words |
56
|
|
|
|
|
|
|
(or word-like units). |
57
|
|
|
|
|
|
|
|
58
|
|
|
|
|
|
|
C<Treex::Core> is tightly coupled with the tree editor TrEd, which |
59
|
|
|
|
|
|
|
makes browsing the linguistic data structures very comfortable. |
60
|
|
|
|
|
|
|
|
61
|
|
|
|
|
|
|
C<Treex::Core> uses TrEd's L<Treex::PML> for the in-memory |
62
|
|
|
|
|
|
|
representation, as well as for storing the data into *.treex files, using |
63
|
|
|
|
|
|
|
the XML-based Prague Markup Language. |
64
|
|
|
|
|
|
|
|
65
|
|
|
|
|
|
|
|
66
|
|
|
|
|
|
|
=head2 Zones parametrized by language codes and selectors |
67
|
|
|
|
|
|
|
|
68
|
|
|
|
|
|
|
Treex documents can contain parallel texts in two or more languages, |
69
|
|
|
|
|
|
|
as well as alternative linguistic representations (such as two |
70
|
|
|
|
|
|
|
dependency parses of the same sentence produced by different parsers). |
71
|
|
|
|
|
|
|
Such contents of the same type are separated by introducing zones. |
72
|
|
|
|
|
|
|
|
73
|
|
|
|
|
|
|
Zones (classes derived from L<Treex::Core::Zone>) are |
74
|
|
|
|
|
|
|
parametrized by language ISO codes, and optionally also by so-called |
75
|
|
|
|
|
|
|
selectors. A selector can be any string identifying the source or purpose of the |
76
|
|
|
|
|
|
|
given piece of data. It can distinguish, e.g., a reference translation from |
77
|
|
|
|
|
|
|
machine-translated text, or the most probable parse of a given sentence from |
78
|
|
|
|
|
|
|
the second most probable parse. In Treex data structures, zones are used at |
79
|
|
|
|
|
|
|
two levels: |
80
|
|
|
|
|
|
|
|
81
|
|
|
|
|
|
|
- L<Treex::Core::DocZone> - allows multiple texts to be |
82
|
|
|
|
|
|
|
stored in the same document |
83
|
|
|
|
|
|
|
|
84
|
|
|
|
|
|
|
- L<Treex::Core::BundleZone> - allows multiple |
85
|
|
|
|
|
|
|
sentences and their representations in each bundle. |
86
|
|
|
|
|
|
|
|
87
|
|
|
|
|
|
|
As for Treex processing units (scenarios and blocks, see below), each |
88
|
|
|
|
|
|
|
processing unit either limits itself to a certain zone, or it can be |
89
|
|
|
|
|
|
|
zone-parametrized (especially in the case of language-universal blocks). |
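As a rough illustration of the zone parametrization described above, the following sketch creates two English bundle zones that differ only in their selector. It assumes the Treex::Core distribution is installed and that C<create_zone> and C<get_zone> take a language code followed by an optional selector. The selector names C<ref> and C<mt> and the sentences are made up for this example.

```perl
use strict;
use warnings;
use Treex::Core;

my $doc    = Treex::Core::Document->new;
my $bundle = $doc->create_bundle;

# Two zones for the same language, told apart by a selector
# ('ref' and 'mt' are arbitrary labels chosen for this sketch).
my $ref_zone = $bundle->create_zone( 'en', 'ref' );
my $mt_zone  = $bundle->create_zone( 'en', 'mt' );

$ref_zone->set_sentence('John loves Mary.');
$mt_zone->set_sentence('John love Mary.');

# Look a zone up again by the same (language, selector) pair.
my $found = $bundle->get_zone( 'en', 'mt' );
print $found->sentence, "\n";
```

Keeping the selector as a free-form string means the same code can address a reference translation, a system output, or an alternative parse without any change to the data model.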
90
|
|
|
|
|
|
|
|
91
|
|
|
|
|
|
|
=head2 Data structure units |
92
|
|
|
|
|
|
|
|
93
|
|
|
|
|
|
|
In Treex, linguistic representations of running texts are organized |
94
|
|
|
|
|
|
|
in the following hierarchy: |
95
|
|
|
|
|
|
|
|
96
|
|
|
|
|
|
|
=head3 Documents |
97
|
|
|
|
|
|
|
|
98
|
|
|
|
|
|
|
The smallest independently storable unit is a document |
99
|
|
|
|
|
|
|
(L<Treex::Core::Document>). |
100
|
|
|
|
|
|
|
|
101
|
|
|
|
|
|
|
Technically, each document consists of a set of document zones, and of a |
102
|
|
|
|
|
|
|
sequence of bundles. |
103
|
|
|
|
|
|
|
|
104
|
|
|
|
|
|
|
=head3 Document zone |
105
|
|
|
|
|
|
|
|
106
|
|
|
|
|
|
|
A document can contain one or more zones |
107
|
|
|
|
|
|
|
(L<Treex::Core::DocZone>), each of them containing a text. |
108
|
|
|
|
|
|
|
|
109
|
|
|
|
|
|
|
=head3 Bundle |
110
|
|
|
|
|
|
|
|
111
|
|
|
|
|
|
|
A bundle (L<Treex::Core::Bundle>) corresponds to a |
112
|
|
|
|
|
|
|
sentence (or a tuple of parallel or alternative sentences) and all its (or |
113
|
|
|
|
|
|
|
their) linguistic analyses. |
114
|
|
|
|
|
|
|
|
115
|
|
|
|
|
|
|
Technically, a bundle contains a set of bundle zones. |
116
|
|
|
|
|
|
|
|
117
|
|
|
|
|
|
|
=head3 Bundle zone |
118
|
|
|
|
|
|
|
|
119
|
|
|
|
|
|
|
A bundle zone (L<Treex::Core::BundleZone>) contains one sentence |
120
|
|
|
|
|
|
|
and at most one linguistic analysis of it for each layer of analysis. The |
121
|
|
|
|
|
|
|
following layers are currently distinguished: |
122
|
|
|
|
|
|
|
|
123
|
|
|
|
|
|
|
* a-layer - analytical layer (surface syntax dependency layer) merged with the |
124
|
|
|
|
|
|
|
morphological layer, as defined in the Prague Dependency Treebank. |
125
|
|
|
|
|
|
|
|
126
|
|
|
|
|
|
|
* t-layer - tectogrammatical layer (deep-syntactic dependency) |
127
|
|
|
|
|
|
|
|
128
|
|
|
|
|
|
|
* p-layer - phrase-structure layer |
129
|
|
|
|
|
|
|
|
130
|
|
|
|
|
|
|
* n-layer - named entity layer |
131
|
|
|
|
|
|
|
|
132
|
|
|
|
|
|
|
Each layer representation has the form of a tree, represented by the tree's root node. |
133
|
|
|
|
|
|
|
|
134
|
|
|
|
|
|
|
=head3 Node |
135
|
|
|
|
|
|
|
|
136
|
|
|
|
|
|
|
Each node has a parent (unless it is the root) and a set of predefined |
137
|
|
|
|
|
|
|
attributes, depending on the layer it belongs to. There is an abstract class |
138
|
|
|
|
|
|
|
L<Treex::Core::Node> defining the functionality which is |
139
|
|
|
|
|
|
|
common to all types of trees (such as functions for accessing a node's parent or |
140
|
|
|
|
|
|
|
children). Functionality specific to the individual linguistic layers is |
141
|
|
|
|
|
|
|
implemented in the derived classes: |
142
|
|
|
|
|
|
|
|
143
|
|
|
|
|
|
|
* L<Treex::Core::Node::A> |
144
|
|
|
|
|
|
|
|
145
|
|
|
|
|
|
|
* L<Treex::Core::Node::T> |
146
|
|
|
|
|
|
|
|
147
|
|
|
|
|
|
|
* L<Treex::Core::Node::P> |
148
|
|
|
|
|
|
|
|
149
|
|
|
|
|
|
|
* L<Treex::Core::Node::N> |
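The generic tree navigation mentioned above can be sketched on a small a-layer tree. This is a minimal example that assumes Treex::Core is installed; it reuses the C<create_child> call from the SYNOPSIS and relies on C<get_children>, C<get_parent> and C<get_descendants> being provided by L<Treex::Core::Node>.

```perl
use strict;
use warnings;
use Treex::Core;

my $doc   = Treex::Core::Document->new;
my $zone  = $doc->create_bundle->create_zone('en');
my $atree = $zone->create_atree;    # technical root of the a-layer tree

# Build "John loves Mary" with the verb governing both arguments.
my $predicate = $atree->create_child( { form => 'loves' } );
for my $form (qw(John Mary)) {
    $predicate->create_child( { form => $form } );
}

# Navigation methods shared by nodes of all layers.
my @children = $predicate->get_children;
my $parent   = $children[0]->get_parent;
my @all      = $atree->get_descendants;

printf "%d children, %d nodes below the root\n",
    scalar @children, scalar @all;
```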
150
|
|
|
|
|
|
|
|
151
|
|
|
|
|
|
|
=head3 Attributes |
152
|
|
|
|
|
|
|
|
153
|
|
|
|
|
|
|
Nodes contain attribute-value pairs. Some attributes are universal (such as |
154
|
|
|
|
|
|
|
identifier), but most of them are specific to a certain layer. Even though node |
155
|
|
|
|
|
|
|
instances are regular Moose objects (i.e., blessed hashes), a node's attributes |
156
|
|
|
|
|
|
|
should be accessed exclusively via predefined accessors. |
157
|
|
|
|
|
|
|
|
158
|
|
|
|
|
|
|
Attribute values can be plain or further structured using PML data types (e.g. |
159
|
|
|
|
|
|
|
sequences), according to the PML schema. |
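A short sketch of attribute access through accessors rather than through the underlying hash follows. It assumes Treex::Core is installed and that a-layer nodes provide C<form>, C<lemma> and C<tag> with matching C<set_*> writers, alongside the generic C<get_attr>. The tag value C<VBZ> is only an illustrative choice.

```perl
use strict;
use warnings;
use Treex::Core;

my $doc   = Treex::Core::Document->new;
my $atree = $doc->create_bundle->create_zone('en')->create_atree;
my $anode = $atree->create_child( { form => 'loves' } );

# Layer-specific accessors defined for a-layer nodes.
$anode->set_lemma('love');
$anode->set_tag('VBZ');

# Generic access by attribute name, available on every node.
my $form = $anode->get_attr('form');

print join( "\t", $form, $anode->lemma, $anode->tag ), "\n";
```

Going through the accessors keeps calling code independent of how the values are stored inside the node object.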
160
|
|
|
|
|
|
|
|
161
|
|
|
|
|
|
|
|
162
|
|
|
|
|
|
|
=head2 Processing units |
163
|
|
|
|
|
|
|
|
164
|
|
|
|
|
|
|
=head3 Block |
165
|
|
|
|
|
|
|
|
166
|
|
|
|
|
|
|
Blocks (descendants of L<Treex::Core::Block>) are the |
167
|
|
|
|
|
|
|
smallest processing units applicable to Treex documents. |
168
|
|
|
|
|
|
|
|
169
|
|
|
|
|
|
|
=head3 Scenario |
170
|
|
|
|
|
|
|
|
171
|
|
|
|
|
|
|
Scenarios (instances of L<Treex::Core::Scenario>) are |
172
|
|
|
|
|
|
|
sequences of blocks. Blocks from a scenario are applied to a document one |
173
|
|
|
|
|
|
|
after another. |
174
|
|
|
|
|
|
|
|
175
|
|
|
|
|
|
|
=head2 Support for visualizing Treex trees in TrEd |
176
|
|
|
|
|
|
|
|
177
|
|
|
|
|
|
|
C<Treex::Core> also contains a TrEd extension for browsing .treex files. |
178
|
|
|
|
|
|
|
The extension itself is only a thin wrapper for the viewing functionality |
179
|
|
|
|
|
|
|
implemented in L<Treex::Core::TredView>. |
180
|
|
|
|
|
|
|
|
181
|
|
|
|
|
|
|
|
182
|
|
|
|
|
|
|
=head1 AUTHOR |
183
|
|
|
|
|
|
|
|
184
|
|
|
|
|
|
|
Zdeněk Žabokrtský <zabokrtsky@ufal.mff.cuni.cz> |
185
|
|
|
|
|
|
|
|
186
|
|
|
|
|
|
|
Martin Popel <popel@ufal.mff.cuni.cz> |
187
|
|
|
|
|
|
|
|
188
|
|
|
|
|
|
|
David Mareček <marecek@ufal.mff.cuni.cz> |
189
|
|
|
|
|
|
|
|
190
|
|
|
|
|
|
|
=head1 COPYRIGHT AND LICENSE |
191
|
|
|
|
|
|
|
|
192
|
|
|
|
|
|
|
Copyright © 2011 by Institute of Formal and Applied Linguistics, Charles University in Prague |
193
|
|
|
|
|
|
|
|
194
|
|
|
|
|
|
|
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. |