File Coverage

blib/lib/Treex/Core.pm
Criterion Covered Total %
statement 18 18 100.0
branch n/a
condition n/a
subroutine 6 6 100.0
pod n/a
total 24 24 100.0


line stmt bran cond sub pod time code
1             package Treex::Core;
2             $Treex::Core::VERSION = '2.20210102';
3 8     8   309604 use strict;
  8         64  
  8         241  
4 8     8   40 use warnings;
  8         16  
  8         232  
5 8     8   4081 use Treex::Core::Document;
  8         33  
  8         379  
6 8     8   83 use Treex::Core::Node;
  8         21  
  8         219  
7 8     8   48 use Treex::Core::Bundle;
  8         20  
  8         195  
8 8     8   5352 use Treex::Core::Scenario;
  8         97  
  8         477  
9              
10             1;
11              
12             __END__
13              
14             =pod
15              
16             =encoding utf8
17              
18             =head1 NAME
19              
20             Treex::Core - interface to linguistic structures and processing units in Treex
21              
22             =head1 VERSION
23              
24             version 2.20210102
25              
26             =head1 SYNOPSIS
27              
28             use Treex::Core;
29            
30             my $doc = Treex::Core::Document->new;
31            
32             my $bundle = $doc->create_bundle;
33             my $zone = $bundle->create_zone('en');
34             my $atree = $zone->create_atree;
35            
36             my $predicate = $atree->create_child({form=>'loves'});
37            
38             foreach my $argument (qw(John Mary)) {
39                 my $child = $atree->create_child( { form=>$argument } );
40                 $child->set_parent($predicate);
41             }
42            
43             $doc->save('demo.treex');
44              
45              
46             =head1 DESCRIPTION
47              
48             C<Treex::Core> is a library of modules for processing linguistic data,
49             especially tree-shaped syntactic representations of natural language
50             sentences, both for language analysis and synthesis purposes.
51              
52             C<Treex::Core> is meant to be as language-universal as possible.
53             It makes only a few assumptions: the language's written form must be
54             representable by Unicode characters, and it should be possible to segment
55             texts in such a language into sentences (or sentence-like units) and words
56             (or word-like units).
57              
58             C<Treex::Core> is tightly coupled with the tree editor TrEd, which
59             makes browsing the linguistic data structures very comfortable.
60              
61             C<Treex::Core> uses TrEd's L<Treex::PML> for the in-memory
62             representation, as well as for storing the data in *.treex files, using
63             the XML-based Prague Markup Language.
64              
65              
66             =head2 Zones parametrized by language codes and selectors
67              
68             Treex documents can contain parallel texts in two or more languages,
69             as well as alternative linguistic representations (such as two
70             dependency parses of the same sentence, resulting from different parsers).
71             Such contents of the same type are separated by introducing zones.
72              
73             Zones (classes derived from L<Treex::Core::Zone>) are
74             parametrized by ISO language codes, and optionally also by so-called
75             selectors. A selector can be any string identifying the source or purpose of the
76             given piece of data. It can distinguish, e.g., a reference translation from
77             machine-translated text, or the most probable parse of a given sentence from
78             the second most probable parse. In Treex data structures, zones are used at
79             two levels:
80              
81             - L<Treex::Core::DocZone> - allows multiple texts to be
82             stored in the same document
83              
84             - L<Treex::Core::BundleZone> - allows multiple sentences and
85             their representations to be stored in each bundle.
86              
87             As for Treex processing units (scenarios and blocks, see below), each
88             processing unit either limits itself to a certain zone, or it can be
89             zone-parametrized (especially in the case of language-universal blocks).
90              
91             =head2 Data structure units
92              
93             In Treex, linguistic representations of running texts are organized
94             in the following hierarchy:
95              
96             =head3 Document
97              
98             The smallest independently storable unit is a document
99             (L<Treex::Core::Document>).
100              
101             Technically, each document consists of a set of document zones, and of a
102             sequence of bundles.
103              
104             =head3 Document zone
105              
106             A document can contain one or more zones
107             (L<Treex::Core::DocZone>), each of them containing a text.
108              
109             =head3 Bundle
110              
111             A bundle (L<Treex::Core::Bundle>) corresponds to a
112             sentence (or a tuple of parallel or alternative sentences) and all its (or
113             their) linguistic analyses.
114              
115             Technically, a bundle contains a set of bundle zones.
116              
117             =head3 Bundle zone
118              
119             A bundle zone (L<Treex::Core::BundleZone>) contains one sentence
120             and at most one linguistic analysis of it for each layer of analysis. The
121             following layers are currently distinguished:
122              
123             * a-layer - analytical layer (surface syntax dependency layer) merged with the
124             morphological layer, as defined in the Prague Dependency Treebank.
125              
126             * t-layer - tectogrammatical layer (deep-syntactic dependency)
127              
128             * p-layer - phrase-structure layer
129              
130             * n-layer - named entity layer
131              
132             Each layer representation has the form of a tree, represented by the tree's root node.
133              
134             =head3 Node
135              
136             Each node has a parent (unless it is the root) and a set of predefined
137             attributes, depending on the layer it belongs to. There is an abstract class
138             L<Treex::Core::Node> defining the functionality which is
139             common to all types of trees (such as functions for accessing a node's parent or
140             children). Functionality specific to the individual linguistic layers is
141             implemented in the derived classes:
142              
143             * L<Treex::Core::Node::A>
144              
145             * L<Treex::Core::Node::T>
146              
147             * L<Treex::Core::Node::P>
148              
149             * L<Treex::Core::Node::N>
150              
151             =head3 Attributes
152              
153             Nodes contain attribute-value pairs. Some attributes are universal (such as
154             the identifier), but most of them are specific to a certain layer. Although node
155             instances are regular Moose objects (i.e., blessed hashes), a node's attributes
156             should be accessed exclusively via predefined accessors.
157              
158             Attribute values can be plain or further structured using PML data types (e.g.
159             sequences), according to the PML schema.
160              
161              
162             =head2 Processing units
163              
164             =head3 Block
165              
166             Blocks (descendants of L<Treex::Core::Block>) are the
167             smallest processing units applicable to Treex documents.
168              
169             =head3 Scenario
170              
171             Scenarios (instances of L<Treex::Core::Scenario>) are
172             sequences of blocks. Blocks from a scenario are applied to a document one
173             after another.
174              
175             =head2 Support for visualizing Treex trees in TrEd
176              
177             C<Treex::Core> also contains a TrEd extension for browsing .treex files.
178             The extension itself is only a thin wrapper for the viewing functionality
179             implemented in L<Treex::Core::TredView>.
180              
181              
182             =head1 AUTHOR
183              
184             Zdeněk Žabokrtský <zabokrtsky@ufal.mff.cuni.cz>
185              
186             Martin Popel <popel@ufal.mff.cuni.cz>
187              
188             David Mareček <marecek@ufal.mff.cuni.cz>
189              
190             =head1 COPYRIGHT AND LICENSE
191              
192             Copyright © 2011 by Institute of Formal and Applied Linguistics, Charles University in Prague
193              
194             This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.