File Coverage

blib/lib/Lingua/StarDict/Writer.pm
Criterion Covered Total %
statement 87 87 100.0
branch 6 8 75.0
condition n/a
subroutine 14 14 100.0
pod 2 3 66.6
total 109 112 97.3


line stmt bran cond sub pod time code
1             package Lingua::StarDict::Writer;
2              
3 2     2   58918 use 5.008;
  2         6  
4 2     2   11 use strict;
  2         3  
  2         35  
5 2     2   8 use warnings;
  2         4  
  2         71  
6 2     2   1446 use Path::Tiny;
  2         20539  
  2         120  
7 2     2   1088 use Unicode::UTF8;
  2         954  
  2         78  
8 2     2   946 use Time::Piece;
  2         16583  
  2         11  
9 2     2   1161 use Moo;
  2         18813  
  2         11  
10              
11              
12 2     2   3302 use Lingua::StarDict::Writer::Entry;
  2         7  
  2         640  
13              
14             =encoding utf8
15             =cut
16              
17             =head1 NAME
18              
19             Lingua::StarDict::Writer - A module that allows to create a StarDict dictionary
20              
21             =head1 VERSION
22              
23             Version 0.01
24              
25             =cut
26              
27             our $VERSION = '0.01';
28              
29              
30             =head1 SYNOPSIS
31              
32             A module that allows to create a StarDict-compatible dictionary, with multipart
33             and multitype entries.
34              
35             use Lingua::StarDict::Writer;
36              
37             my $stardict_writer = StarDict::Writer->new(name=>'My Cool Dictionary', date=>"2020-12-31");
38              
39             $stardict_writer->entry('42')->add_part(type=> "t", data => "ˈfɔɹti tuː");
40             $stardict_writer->entry('42')->add_part(type=> "m", data => "Answer to the Ultimate Question of Life, the Universe, and Everything");
41              
42             $stardict_writer->entry('Perl')->add_part(type=> "t", data => "pɛʁl");
43             $stardict_writer->entry('Perl')->add_part(type=> "h", data => "The best programming language ever");
44              
45             $stardict_writer->write;
46              
47             =head1 DESCRIPTION
48              
49             StarDict is a popular dictionary format, supported by many dictionary and book reading programs.
50              
51             StarDict entry may consist of several parts of various text or media types.
52              
53             This module allows to create a new StarDict dictionary with entries consisting of parts of
54             arbitrary types.
55              
56             =head1 METHODS
57              
58             =head2 new ( option_name => 'value')
59              
60             Constructs and returns a new C object. This object will accept parts for the
61             dictionary entry via C method. You can write the resulting dictionary with C method.
62             C method accepts arguments represented as C<< name=>value >> options hash. Following options are available:
63              
64             =over
65              
66             =item * C - sets a name for the dictionary. It will be specified in StarDict dictionary C<.ifo> file as
67             dictionary name. When you call C method, writer will create C dir in the C dir, and all
68             dictionary files that will be written there will use C as the base part of file name.
69             By default, the name will be set to C<"Some Dictionary written by Lingua::StarDict::Writer"> if none is provided.
70              
71             =item * C - date of dictionary creation in C format. Will be saved in Stardict C<.ifo> file.
72             By default, current date will be used.
73              
74             =item * C - path where dictionary files will be saved. By default, current dir will be used as the C.
75              
76             =back
77              
78             =head2 entry($entry_title)
79              
80             Returns dictionary entry named C<$entry_title>, if entry C<$entry_title> does not exist, a new empty empty
81             dictionary entry will be created and returned. The only reason you may want to get dictionary entry is to
82             add a new part using C method (See below)
83              
84             Entries can be added in arbitrary order, they will be sorted alphabetically using StarDict sorting algorithm, when
85             dictionary is written.
86              
87             =head2 entry->($enry_title)->add_part(type => $part_type, data => $part_data)
88              
89             Adds new part to an entry.
90              
91             =over
92              
93             =item * C - part type, coded as one Latin letter as specified in StarDictFileFormat. (C<'m'> for plaintext, C<'h'> for html,
94             C<'t'> for pronunciation, etc. See StarDictFileFormat in "See Also" chapter for more info). By default C<'m'> type will
95             be used if none is specified.
96              
97             =item * C - Content of added entry part: a text string that can be formatted using chosen C markup.
98              
99             =back
100              
101             Parts will be saved in the entry in the order they were added.
102              
103             =head2 write
104              
105             This method will write all entries to the disk formatted as StarDict dictionary. C<.dict> C<.idx> and C<.ifo> files
106             will be placed in a C dir at the path you've specified in C option. You should put them to C or
107             C<~/.stardict/dic> path to make them visible to StarDict.
108              
109             =head1 ENCODING ISSUE
110              
111             All methods expect to recieve data encoded as perl character strings, not byte string (I.e. Cyrillic "я" should be encoded as C<\x{44f}>,
112             and not as C<\x{d1}\x{8f}>. If you have read utf-8 source data from a file, database or from web, make sure that utf-8 bytes you've got
113             are converted to perl characters. See L for more info.
114              
115              
116             =head1 CAVEATS
117              
118             =over
119              
120             =item * C mode is not implemented. Use custom sequence mode instead.
121              
122             =item * Support for binary parts is not implemented.
123              
124             =item * Dictionary compression is not implemented.
125              
126             =item * Synonyms is not implemented.
127              
128             =back
129              
130             etc...
131              
132             =head1 SEE ALSO
133              
134             =over
135              
136             =item * L - StarDict
137             format description. A copy of this file can be found in this package in C dir.
138              
139             =item * L - another module for writing StarDict dictionaries. It supports only
140             single-part plain text entries.
141              
142             =back
143              
144             =cut
145              
146             has 'name' => (
147             is => 'rw',
148             default => "Some Dictionary written by Lingua::StarDict::Writer",
149             );
150              
151             has 'date' => (
152             is => 'rw',
153             default => sub {localtime->ymd},
154             );
155              
156             has 'output_dir' => (
157             is => 'rw',
158             default => "./",
159             );
160              
161             has '_entries' => (
162             is => 'rw',
163             default => sub {+{}}
164             );
165              
166             sub entry
167             {
168 39     39 1 21398 my $self = shift;
169 39         53 my $name = shift;
170              
171 39 100       99 if (! exists $self->_entries->{$name})
172             {
173 18         668 $self->_entries->{$name} = Lingua::StarDict::Writer::Entry->new(name=>$name); # Create new entry if does not exist
174             }
175 39         157 return $self->_entries->{$name};
176             }
177              
178             sub write
179             {
180 3     3 1 704 my $self = shift;
181 3         18 my $dir = path($self->output_dir,$self->name);
182              
183 3 50       262 unless(-d $dir)
184             {
185 3 50       143 mkdir($dir,0755) or die "Cant create directory $dir\n";
186             }
187              
188 3         304 my $dict_fh = $dir->child($self->name.".dict")->openw_raw;
189 3         556 my $idx_fh = $dir->child($self->name.".idx" )->openw_raw;
190 3         431 my $ifo_fh = $dir->child($self->name.".ifo" )->openw_utf8;
191              
192 3         11068 my $byte_count = 0;
193 3         8 my @ordered_keys = sort {stardict_strcmp($a,$b)} keys(%{$self->_entries});
  41         54  
  3         24  
194              
195 3         9 foreach my $word (@ordered_keys)
196             {
197 17         23 my $start_pos = $byte_count;
198 17         19 my $word_bytes;
199             {
200 2     2   17 use warnings FATAL => 'utf8';
  2         4  
  2         195  
  17         18  
201 17         25 $word_bytes = Unicode::UTF8::encode_utf8($word);
202             }
203 17         50 print $idx_fh pack('a*x',$word_bytes);
204 17         29 print $idx_fh pack('N',$byte_count);
205              
206 17         19 foreach my $part (@{$self->entry($word)->_parts})
  17         23  
207             {
208 19         31 my $data = $part->{data};
209 19         20 my $data_bytes;
210             {
211 2     2   13 use warnings FATAL => 'utf8';
  2         5  
  2         503  
  19         22  
212 19         31 $data_bytes = Unicode::UTF8::encode_utf8($data);
213             }
214 19         55 print $dict_fh $part->{type};
215 19         27 print $dict_fh "$data_bytes\0";
216 19         30 $byte_count += length($data_bytes) + 1 + 1; # one for media type char, one for \0
217             }
218              
219 17         33 print $idx_fh pack('N',$byte_count-$start_pos);
220             }
221              
222 3         6 my $word_count = scalar (keys %{$self->_entries});
  3         6  
223              
224 3         13 print $ifo_fh "StarDict's dict ifo file\n";
225 3         6 print $ifo_fh "version=2.4.2\n";
226 3         8 print $ifo_fh "wordcount=$word_count\n";
227 3         45 print $ifo_fh "bookname=".$self->name."\n";
228 3         17 print $ifo_fh "idxfilesize=", tell($idx_fh),"\n";
229 3         12 print $ifo_fh "date=".$self->date."\n";
230             # if($^O eq "MSWin32"){ print $ifo_fh "sametypesequence=m\n";} # do not support sametypesequence for now
231             # else { print $ifo_fh "sametypesequence=x\n";}
232              
233 3         124 close $dict_fh;
234 3         57 close $idx_fh;
235 3         119 close $ifo_fh;
236             }
237              
238              
239             # g_ascii_strcasecmp, strcmp, stardict_strcmp_old (formerly known as stardict_strcmp)
240             # are pure perl reimplementation of sort functons used by StarDirct for index lookup.
241             # Index file should be ordered with exactly the same functions that is used for lookup
242             # g_ascii_strcasecmp, strcmp, stardict_strcmp_old are left here commented out for historical
243             # reasons.
244             # stardict_strcmp is a perl-way implementation of sort functions that do same ordering
245             # as function mentioned above
246              
247             #sub g_ascii_strcasecmp
248             #{
249             # # pure perl re-implementation of g_ascii_strcasecmp
250             # my $s1 = shift;
251             # my $s2 = shift;
252             # no locale;
253             # $s1=~s/([A-Z])/lc($1)/ge;
254             # $s2=~s/([A-Z])/lc($1)/ge;
255             # while (length($s1) || length($s2))
256             # {
257             # return -1 if length($s1)==0;
258             # return 1 if length($s2)==0;
259             # $s1=~s/^(.)//;
260             # my $c1 = $1;
261             # $s2=~s/^(.)//;
262             # my $c2 = $1;
263             # return ord($c1)-ord($c2) if $c1 ne $c2;
264             # }
265             # return 0;
266             #}
267              
268              
269             #sub strcmp
270             #{
271             # # pure perl re-implementation of strcmp
272             # my $s1 = shift;
273             # my $s2 = shift;
274             # no locale;
275             # while (length($s1) || length($s2))
276             # {
277             # return -1 if length($s1)==0;
278             # return 1 if length($s2)==0;
279             # $s1=~s/^(.)//;
280             # my $c1 = $1;
281             # $s2=~s/^(.)//;
282             # my $c2 = $1;
283             # return ord($c1)-ord($c2) if $c1 ne $c2;
284             # }
285             # return 0;
286             #}
287              
288             #sub stardict_strcmp_old
289             #{
290             # # pure perl re-implementation of stardict_strcmp
291             # my $s1 = shift;
292             # my $s2 = shift;
293             #
294             # my $i = g_ascii_strcasecmp($s1, $s2);
295             # return $i if $i;
296             # return strcmp($s1,$s2);
297             #}
298              
299              
300              
301             # StarDict expects index file to be sorted in a specific way.
302             # UTF-8 strings are treated as bytes, all latin (and only latin) characters
303             # are taken to lower case, and then strings are compared. If strings copmared
304             # that way found equal, they are compared again, now without converting latin
305             # letters to lower case.
306              
307             # For more info see doc/StarDictFileFormat or StarDict code. You should look for
308             # strcmp and stardict_strcmp functions there, and for g_ascii_strcasecmp in glibc.
309              
310             sub stardict_strcmp
311             {
312 41     41 0 52 my $s1 = shift;
313 41         43 my $s2 = shift;
314              
315 41         50 my $s1_bytes;
316             my $s2_bytes;
317             {
318 2     2   14 use warnings FATAL => 'utf8';
  2         5  
  2         249  
  41         43  
319 41         58 $s1_bytes = Unicode::UTF8::encode_utf8($s1); # Convert sting from unicde characters to bytes
320 41         56 $s2_bytes = Unicode::UTF8::encode_utf8($s2);
321             }
322              
323 41         65 my $s1_lc_bytes = $s1_bytes =~ tr/A-Z/a-z/r; # do lower case
324 41         55 my $s2_lc_bytes = $s2_bytes =~ tr/A-Z/a-z/r;
325              
326 41         51 my $res = $s1_lc_bytes cmp $s2_lc_bytes; # Compare lower case string represented as bytes
327              
328 41 100       70 $res = $s1_bytes cmp $s2_bytes unless $res; # if equal, compare unlowercased string represented as bytes;
329              
330 41         57 return $res;
331             }
332              
333              
334             =head1 AUTHOR
335              
336             Nikolay Shaplov, C<< >>
337              
338             =head1 CREDITS
339              
340             Special thanks to B from C C<#perl> for deep code review.
341              
342              
343             =head1 BUGS
344              
345             Please report any bugs or feature requests through
346             the web interface at L
347              
348             =head1 SUPPORT
349              
350             You can find documentation for this module with the perldoc command.
351              
352             perldoc Lingua::StarDict::Writer
353              
354              
355             You can also look for information at:
356              
357             =over 4
358              
359             =item * The project's bug tracker (report bugs and request features here)
360              
361             L
362              
363             =item * AnnoCPAN: Annotated CPAN documentation
364              
365             L
366              
367             =item * CPAN Ratings
368              
369             L
370              
371             =item * Search CPAN
372              
373             L
374              
375             =back
376              
377              
378             =head1 ACKNOWLEDGEMENTS
379              
380              
381             =head1 LICENSE AND COPYRIGHT
382              
383             Copyright 2021 Nikolay Shaplov.
384              
385             This program is free software; you can redistribute it and/or modify it
386             under the terms of the the Artistic License (2.0). You may obtain a
387             copy of the full license at:
388              
389             L
390              
391             Any use, modification, and distribution of the Standard or Modified
392             Versions is governed by this Artistic License. By using, modifying or
393             distributing the Package, you accept this license. Do not use, modify,
394             or distribute the Package, if you do not accept this license.
395              
396             If your Modified Version has been derived from a Modified Version made
397             by someone other than you, you are nevertheless required to ensure that
398             your Modified Version complies with the requirements of this license.
399              
400             This license does not grant you the right to use any trademark, service
401             mark, tradename, or logo of the Copyright Holder.
402              
403             This license includes the non-exclusive, worldwide, free-of-charge
404             patent license to make, have made, use, offer to sell, sell, import and
405             otherwise transfer the Package with respect to any patent claims
406             licensable by the Copyright Holder that are necessarily infringed by the
407             Package. If you institute patent litigation (including a cross-claim or
408             counterclaim) against any party alleging that the Package constitutes
409             direct or contributory patent infringement, then this Artistic License
410             to you shall terminate on the date that such litigation is filed.
411              
412             Disclaimer of Warranty: THE PACKAGE IS PROVIDED BY THE COPYRIGHT HOLDER
413             AND CONTRIBUTORS "AS IS' AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES.
414             THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
415             PURPOSE, OR NON-INFRINGEMENT ARE DISCLAIMED TO THE EXTENT PERMITTED BY
416             YOUR LOCAL LAW. UNLESS REQUIRED BY LAW, NO COPYRIGHT HOLDER OR
417             CONTRIBUTOR WILL BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, OR
418             CONSEQUENTIAL DAMAGES ARISING IN ANY WAY OUT OF THE USE OF THE PACKAGE,
419             EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
420              
421             =cut
422              
423             1; # End of Lingua::StarDict::Writer