File Coverage

script/jupiter
Criterion Covered Total %
statement 331 352 94.0
branch 58 84 69.0
condition 38 50 76.0
subroutine 49 52 94.2
pod n/a
total 476 538 88.4


line stmt bran cond sub pod time code
1             #! /usr/bin/env perl
2             # Planet Jupiter is a feed aggregator that creates a single HTML file
3             # Copyright (C) 2020–2021 Alex Schroeder <alex@gnu.org>
4              
5             # This program is free software: you can redistribute it and/or modify
6             # it under the terms of the GNU Affero General Public License as published by
7             # the Free Software Foundation, either version 3 of the License, or
8             # (at your option) any later version.
9             #
10             # This program is distributed in the hope that it will be useful,
11             # but WITHOUT ANY WARRANTY; without even the implied warranty of
12             # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
13             # GNU Affero General Public License for more details.
14             #
15             # You should have received a copy of the GNU Affero General Public License
16             # along with this program. If not, see <https://www.gnu.org/licenses/>.
17              
18             package Jupiter;
19              
20 15     15   5310177 use Encode::Locale;
  15         63904  
  15         717  
21 15     15   128 use Encode;
  15         43  
  15         1287  
22 15     15   666 use utf8;
  15         43  
  15         82  
23              
24             binmode(STDOUT, ":utf8");
25             binmode(STDERR, ":utf8");
26              
27             =encoding UTF8
28              
29             =head1 NAME
30              
31             jupiter - turn a list of feeds into an HTML page, a river of news
32              
33             =head1 SYNOPSIS
34              
35             To update the feeds from one or more OPML files:
36              
37             B<jupiter update> I<feed.opml> … [I</regex/> …]
38              
39             To generate F<index.html>:
40              
41             B<jupiter html> I<feed.opml>
42              
43             =head1 DESCRIPTION
44              
45             Planet Jupiter is used to pull together the latest updates from a bunch of other
46             sites and display them on a single web page, the "river of news". The sites we
47             get our updates from are defined in an OPML file.
48              
49             A river of news, according to Dave Winer, is a feed aggregator. New items appear
50             at the top and old items disappear at the bottom. When it's gone, it's gone.
51             There is no count of unread items. The goal is to fight the I<fear of missing
52             out> (FOMO).
53              
54             Each item looks similar to every other: headline, link, an extract, maybe a date
55             and an author. Extracts contain only the beginning of the article's text; all
56             markup is removed; there are no images. The goal is to make the page easy to skim.
57             Scroll down until you find something interesting and follow the link to the
58             original article if you want to read it.
59              
60             =head2 The OPML file
61              
62             You B<need> an OPML file. It's an XML file linking to I<feeds>. Here's an
63             example listing just one feed. In order to add more, add more C<outline>
64             elements with the C<xmlUrl> attribute. The exact order and nesting do not
65             matter. People can I<import> these OPML files into their own feed readers, so
66             it may be worth spending a bit more effort to make the file presentable.
67              
68             <opml version="2.0">
69               <body>
70                 <outline title="Alex Schroeder"
71                          xmlUrl="https://alexschroeder.ch/wiki?action=rss"/>
72               </body>
73             </opml>
74              
75             =head2 Update the feeds in your cache
76              
77             This is how you update the feeds in a file called C<feed.opml>. It downloads all
78             the feeds linked to in the OPML file and stores them in the cache directory.
79              
80             jupiter update feed.opml
81              
82             The directory used to keep a copy of all the feeds in the OPML file has the same
83             name as the OPML file but without the .opml extension. In other words, if your
84             OPML file is called C<feed.opml> then the cache directory is called C<feed>.
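
To illustrate, here is a minimal sketch of how the cache directory and the
name of a feed's cache file are derived, using the same modules the script
itself uses (the feed URL is made up):

    use File::Basename qw(fileparse);
    use Mojo::Util qw(slugify);

    my ($name, $path) = fileparse('feed.opml', '.opml');
    my $cache_dir  = "$path/$name";   # the "feed" directory next to the OPML file
    # the cache file is the cache directory plus a slug of the feed URL
    my $cache_file = "$cache_dir/" . slugify('https://example.org/feed.rss');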
85              
86             This operation takes a long time because it requests an update from all the sites
87             listed in your OPML file. Don't run it too often or you'll annoy the site
88             owners.
89              
90             The OPML file must use the .opml extension. You can update the feeds for
91             multiple OPML files in one go.
92              
93             =head2 Adding just one feed
94              
95             After a while, the list of feeds in your OPML starts getting unwieldy. When you
96             add a new feed, you might not want to fetch all of them. In this case, provide a
97             regular expression surrounded by slashes to the C<update> command:
98              
99             jupiter update feed.opml /example/
100              
101             Assuming a feed with a URL or title that matches the regular expression is
102             listed in your OPML file, only that feed is going to get updated.
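
Under the hood, each filter is applied to both the feed URL and the feed title
from the OPML file, roughly like this (a sketch only; C<@filters> holds the
regular expressions given on the command line, and C<@feeds> and C<@keep> are
illustrative names):

    # keep a feed if any filter matches its URL or its title
    my @keep = grep {
      my $feed = $_;
      grep { $feed->{url} =~ /$_/ or $feed->{title} =~ /$_/ } @filters;
    } @feeds;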
103              
104             There is no need to escape slashes in the regular expression: C<//rss/> works
105             just fine. Beware shell escaping, however. Most likely, you need to surround the
106             regular expression with single quotes if it contains spaces:
107              
108             jupiter update feed.opml '/Halberds & Helmets/'
109              
110             Notice how we assume that named entities such as C<&amp;> have already been
111             parsed into the appropriate strings.
112              
113             =head2 Generate the HTML
114              
115             This is how you generate the C<index.html> file based on the feeds of your
116             C<feed.opml>. It assumes that you have already updated all the feeds (see
117             above).
118              
119             jupiter html feed.opml
120              
121             See L</OPTIONS> for ways to change how the HTML is generated.
122              
123             =head2 Generate the RSS feed
124              
125             This happens at the same time as the HTML generation. It takes all the
126             entries that are being added to the HTML and puts them into a feed.
127              
128             See L</OPTIONS> for ways to change how the feed is generated.
129              
130             =head2 Why separate the two steps?
131              
132             The first reason is that tinkering with the templates involves running the
133             program again and again, and you don't want to contact all the sites whenever
134             you update your templates.
135              
136             The other reason is that it allows you to create subsets. For example, you can
137             fetch the feeds for three different OPML files:
138              
139             jupiter update osr.opml indie.opml other.opml
140              
141             And then you can create three different HTML files:
142              
143             jupiter html osr.html osr.opml
144             jupiter html indie.html indie.opml
145             jupiter html rpg.html osr.opml indie.opml other.opml
146              
147             For an example of how it might look, check out the setup for the planets I run:
148             L<https://alexschroeder.ch/cgit/planet/about/>
149              
150             =head2 What about the JSON file?
151              
152             There's a JSON file that gets generated and updated as you run Planet Jupiter.
153             Its name depends on the OPML files used. It records metadata for every feed in
154             the OPML file that isn't stored in the feeds themselves.
155              
156             C<message> is the HTTP status message, or a similar message such as "No entry
157             newer than 90 days." This is set when you update the feeds in your cache.
158              
159             C<code> is the HTTP status code; this code could be the real status code from
160             the server (such as 404 for a "not found" status) or one generated by Jupiter
161             such that it matches the status message (such as 206 for a "partial content"
162             status when there aren't any recent entries in the feed). This is set when you
163             update the feeds in your cache.
164              
165             C<title> is the site's title. When you update the feeds in your cache, it is
166             taken from the OPML file. That's how the feed can have a title even if the
167             download failed. When you generate the HTML, the feeds in the cache are parsed
168             and if a title is provided, it is stored in the JSON file and overrides the
169             title in the OPML file.
170              
171             C<link> is the site's link for humans. When you generate the HTML, the feeds in
172             the cache are parsed and if a link is provided, it is stored in the JSON file.
173             If the OPML element contained a C<htmlUrl> attribute, however, that takes
174             precedence. The reasoning: if a podcast is hosted on a platform that generates
175             a link you don't like, and you know the link to the human-readable blog
176             elsewhere, use the C<htmlUrl> attribute in the OPML file to override it.
177              
178             C<last_modified> and C<etag> are two caching headers taken from the HTTP
179             response; they cannot be changed by data in the feed.
180              
181             If we run into problems downloading a feed, this setup allows us to still link
182             to the feeds that aren't working, using their correct names, and describing the
183             error we encountered.
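
For illustration, here is a sketch of how you could inspect the metadata for
one feed from such a JSON file; the file name and feed URL are made up, and
the keys are the ones described above:

    use File::Slurper qw(read_binary);
    use Mojo::JSON qw(decode_json);

    # the JSON file sits next to feed.opml and is keyed by feed URL
    my $data = decode_json read_binary('feed.json');
    my $info = $data->{'https://example.org/feed.rss'};
    printf "%s (%s): %s\n", $info->{title} // '?',
        $info->{code} // '-', $info->{message} // '-';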
184              
185             =head2 Logging
186              
187             Use the C<--log=LEVEL> option to set the log level. Valid values for LEVEL are debug,
188             info, warn, error, and fatal.
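
For example, to see what the program is doing while updating:

    jupiter --log=debug update feed.opml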
189              
190             =head1 LICENSE
191              
192             GNU Affero General Public License
193              
194             =head1 INSTALLATION
195              
196             Using C<cpan>:
197              
198             cpan App::jupiter
199              
200             Manual install:
201              
202             perl Makefile.PL
203             make
204             make install
205              
206             =head2 Dependencies
207              
208             To run Jupiter on Debian we need:
209              
210             C<libmodern-perl-perl> for L<Modern::Perl>
211              
212             C<libmojolicious-perl> for L<Mojo::Template>, L<Mojo::UserAgent>, L<Mojo::Log>,
213             L<Mojo::JSON>, and L<Mojo::Util>
214              
215             C<libxml-libxml-perl> for L<XML::LibXML>
216              
217             C<libfile-slurper-perl> for L<File::Slurper>
218              
219             C<libdatetime-perl> for L<DateTime>
220              
221             C<libdatetime-format-mail-perl> for L<DateTime::Format::Mail>
222              
223             C<libdatetime-format-iso8601-perl> for L<DateTime::Format::ISO8601>
224              
225             Unfortunately, L<Mojo::UserAgent::Role::Queued> isn't packaged for Debian.
226             Therefore, let's build it and install it as a Debian package.
227              
228             sudo apt-get install libmodule-build-tiny-perl
229             sudo apt-get install dh-make-perl
230             sudo dh-make-perl --build --cpan Mojo::UserAgent::Role::Queued
231             dpkg --install libmojo-useragent-role-queued-perl_1.15-1_all.deb
232              
233             To generate the C<README.md> from the source file, you need F<pod2markdown>,
234             which is part of the C<libpod-markdown-perl> package.
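
Presumably something like the following does the trick (the exact path
depends on your checkout):

    pod2markdown script/jupiter > README.md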
235              
236             =head1 FILES
237              
238             There are a number of files in the F<share> directory which you can use as
239             starting points.
240              
241             F<template.html> is the HTML template.
242              
243             F<default.css> is a small CSS file used by F<template.html>.
244              
245             F<personalize.js> is a small JavaScript file used by F<template.html> that
246             allows visitors to jump from one article to the next using C<J> and C<K>.
247              
248             F<jupiter.png> is used by F<template.html> as the icon.
249              
250             F<jupiter.svg> is used by F<template.html> as the logo.
251              
252             F<feed.png> is used by F<template.html> as the icon for the feeds in the
253             sidebar.
254              
255             F<feed.rss> is the feed template.
256              
257             =head1 OPTIONS
258              
259             HTML generation uses a template, C<template.html>. It is written for
260             C<Mojo::Template> and you can find it in the F<share> directory of your
261             distribution. The default templates use other files, such as the logo, the feed
262             icon, a CSS file, and a small JavaScript snippet to enable navigation using the
263             C<J> and C<K> keys (see above).
264              
265             You can specify a different HTML file to generate:
266              
267             B<jupiter html> I<your.html feed.opml>
268              
269             If you specify two HTML files, the first is the HTML file to generate and the
270             second is the template to use. Both must use the C<.html> extension.
271              
272             B<jupiter html> I<your.html your-template.html feed.opml>
273              
274             Feed generation uses a template, C<feed.rss>. It writes all the entries into a
275             file called C<feed.xml>. Again, the template is written for C<Mojo::Template>.
276              
277             You can specify up to two XML, RSS, or Atom files. They must use one of these
278             three extensions: C<.xml>, C<.rss>, or C<.atom>. The first is the name of the
279             feed to generate, the second is the template to use:
280              
281             B<jupiter html> I<atom.xml template.xml planet.html template.html feed.opml>
282              
283             In the above case, Planet Jupiter will write a feed called F<atom.xml> based on
284             F<template.xml> and an HTML file called F<planet.html> based on F<template.html>,
285             using the cached entries matching the feeds in F<feed.opml>.
286              
287             =cut
288              
289 15     15   14037 use DateTime;
  15         6905971  
  15         802  
290 15     15   8386 use DateTime::Format::Mail;
  15         85456  
  15         606  
291 15     15   9241 use DateTime::Format::ISO8601;
  15         6593858  
  15         878  
292 15     15   167 use File::Basename;
  15         32  
  15         1353  
293 15     15   700 use File::Slurper qw(read_binary write_binary read_text write_text);
  15         3465  
  15         1040  
294 15     15   115 use List::Util qw(uniq min shuffle);
  15         36  
  15         1157  
295 15     15   127 use Modern::Perl;
  15         46  
  15         207  
296 15     15   14477 use Mojo::Log;
  15         398978  
  15         189  
297 15     15   1246 use Mojo::JSON qw(decode_json encode_json);
  15         19516  
  15         982  
298 15     15   8119 use Mojo::Template;
  15         65861  
  15         99  
299 15     15   9139 use Mojo::UserAgent;
  15         436255  
  15         135  
300 15     15   9604 use Pod::Simple::Text;
  15         82213  
  15         687  
301 15     15   10188 use XML::LibXML;
  15         455640  
  15         112  
302 15     15   2430 use Mojo::Util qw(slugify trim xml_escape html_unescape);
  15         48  
  15         1024  
303 15     15   146 use File::ShareDir 'dist_file';
  15         34  
  15         711  
304              
305 15     15   94 use vars qw($log);
  15         46  
  15         85135  
306             our $log = Mojo::Log->new;
307              
308             my $xpc = XML::LibXML::XPathContext->new;
309             $xpc->registerNs('atom', 'http://www.w3.org/2005/Atom');
310             $xpc->registerNs('html', 'http://www.w3.org/1999/xhtml');
311             $xpc->registerNs('dc', 'http://purl.org/dc/elements/1.1/');
312             $xpc->registerNs('itunes', 'http://www.itunes.com/dtds/podcast-1.0.dtd');
313              
314             my $undefined_date = DateTime->from_epoch( epoch => 0 );
315              
316             my (%wday, %month, $wday_re, $month_re);
317             %wday = qw (lun. Mon mar. Tue mer. Wed jeu. Thu ven. Fri sam. Sat dim. Sun);
318             %month = qw (janv. Jan févr. Feb mars Mar avr. Apr mai May juin Jun
319             juil. Jul août Aug sept. Sep oct. Oct nov. Nov déc. Dec);
320             $wday_re = join('|', map { quotemeta } keys %wday) unless $wday_re;
321             $month_re = join('|', map { quotemeta } keys %month) unless $month_re;
322              
323             # Our tests don't want to call main
324             __PACKAGE__->main unless caller;
325              
326             sub main {
327 0     0   0 my ($log_level) = grep /^--log=/, @ARGV;
328 0 0       0 $log->level(substr($log_level, 6)) if $log_level;
329 0         0 my ($command) = grep /^[a-z]+$/, @ARGV;
330 0   0     0 $command ||= 'help';
331 0 0       0 if ($command eq 'update') {
    0          
332 0         0 update_cache(@ARGV);
333             } elsif ($command eq 'html') {
334 0         0 make_html(@ARGV);
335             } else {
336 0         0 my $parser = Pod::Simple::Text->new();
337 0         0 $parser->parse_file($0);
338             }
339             }
340              
341             sub update_cache {
342 18     18   1337878 my ($feeds, $files) = read_opml(@_);
343 18         100 make_directories($feeds);
344 18         158 load_feed_metadata($feeds, $files);
345 18         272 my $ua = Mojo::UserAgent->new->with_roles('+Queued')
346             ->max_redirects(3)
347             ->max_active(5);
348 18         88165 make_promises($ua, $feeds);
349 18         21937 fetch_feeds($feeds);
350 18         21680 save_feed_metadata($feeds, $files);
351 18         9111 cleanup_cache($feeds);
352             }
353              
354             sub make_promises {
355 18     18   62 my $ua = shift;
356 18         44 my $feeds = shift;
357 18         59 for my $feed (@$feeds) {
358 18         104 my $url = html_unescape $feed->{url}; # undo xml_escape for the request
359             $ua->on(start => sub {
360 18     18   17851 my ($ua, $tx) = @_;
361 18 50       104 $tx->req->headers->if_none_match($feed->{etag}) if ($feed->{etag});
362 18 50       144 $tx->req->headers->if_modified_since($feed->{last_modified}) if ($feed->{last_modified});
363 18         575 });
364             $feed->{promise} = $ua->get_p($url)
365             ->catch(sub {
366 0     0   0 $feed->{message} = "@_";
367 0         0 $feed->{code} = 521;
368             # returning 0 in the case of an error is important
369 0         0 0; })
370             # sleeping to stop blogger.com from blocking us
371 18     18   275 ->finally(sub { $log->debug($url); sleep 2; });
  18         286351  
  18         36004949  
372             }
373             }
374              
375             sub fetch_feeds {
376 18     18   64 my $feeds = shift;
377 18         544 $log->info("Fetching feeds...");
378 18         118 Mojo::Promise->all(map { $_->{promise} } @$feeds)->then(sub {
379             # all returns the values in the same order!
380 18     18   24507 for (my $i = 0; $i < @_; $i++) {
381 18         116 my $feed = $feeds->[$i];
382 18         78 my $value = $_[$i];
383 18         73 my $tx = $value->[0];
384             # relies on catch returning 0 above
385 18 50       218 next unless $tx;
386 18         310 $feed->{message} = $tx->result->message;
387 18         1651 $feed->{code} = $tx->result->code;
388 18         940 $feed->{last_modified} = $tx->result->headers->last_modified;
389 18         1514 $feed->{etag} = $tx->result->headers->etag;
390             # save raw bytes if this is a success
391 18 50       1013 eval { write_binary($feed->{cache_file}, $tx->result->body) } if $tx->result->is_success;
  18         1568  
392 18 50       9852 warn "Unable to write $feed->{cache_file}: $@\n" if $@;
393             }
394             })->catch(sub {
395 0     0   0 warn "Something went wrong: @_";
396 18         230 })->wait;
397             }
398              
399             sub load_feed_metadata {
400 37     37   115 my $feeds = shift;
401 37         76 my $files = shift;
402 37         157 for my $file (@$files) {
403 37         198 my $filename = "$file->{path}/$file->{name}";
404 37 100       932 next unless -r "$filename.json";
405 23         308 my $data = decode_json read_binary("$filename.json");
406 23         10604 for my $feed (@$feeds) {
407 22         85 my $url = $feed->{url};
408 22 50       97 next unless $url;
409             # don't overwrite title and htmlUrl from OPML file
410 22 100       177 $feed->{title} = $data->{$url}->{title} if $data->{$url}->{title};
411 22   100     197 $feed->{link} ||= $data->{$url}->{link};
412             # all the other metadata is loaded from the JSON file
413 22         78 $feed->{message} = $data->{$url}->{message};
414 22         134 $feed->{code} = $data->{$url}->{code};
415 22         82 $feed->{last_modified} = $data->{$url}->{last_modified};
416 22         89 $feed->{etag} = $data->{$url}->{etag};
417 23         96 } grep { $_->{opml_file} eq $file->{file} } @$feeds;
  22         152  
418             }
419             }
420              
421             sub save_feed_metadata {
422 37     37   979 my $feeds = shift;
423 37         106 my $files = shift;
424 37         126 for my $file (@$files) {
425 37         143 my $name = $file->{name};
426             my %messages = map {
427 36         106 my $feed = $_;
428 36         131 $feed->{url} => { map { $_ => $feed->{$_} } grep { $feed->{$_} } qw(title link message code last_modified etag) };
  131         670  
  216         501  
429 37         119 } grep { $_->{opml_file} eq $file->{file} } @$feeds;
  36         212  
430 37         388 write_binary("$file->{path}/$file->{name}.json", encode_json \%messages);
431             }
432             }
433              
434             sub cleanup_cache {
435 18     18   81 my $feeds = shift;
436 18         64 my %good = map { $_ => 1 } @{cache_files($feeds)};
  18         153  
  18         111  
437 18         75 my @unused = grep { not $good{$_} } @{existing_files($feeds)};
  18         165  
  18         105  
438 18 50       661 if (@unused) {
439 0         0 $log->info("Removing unused files from the cache...");
440 0         0 foreach (@unused) { $log->info($_) }
  0         0  
441 0         0 unlink @unused;
442             }
443             }
444              
445             sub existing_files {
446 18     18   67 my $feeds = shift;
447 18         50 my @files;
448 18         73 for my $dir (uniq map { $_->{cache_dir} } @$feeds) {
  18         161  
449 18 50 33     3466 push(@files, <"$dir/*">) if $dir and -d $dir;
450             }
451 18         228 return \@files;
452             }
453              
454             sub cache_files {
455 18     18   64 my $feeds = shift;
456 18         94 my @files = map { $_->{cache_file} } @$feeds;
  18         148  
457 18         93 return \@files;
458             }
459              
460             sub make_directories {
461 18     18   41 my $feeds = shift;
462 18         53 for my $dir (uniq map { $_->{cache_dir} } @$feeds) {
  18         159  
463 18 100 66     670 if ($dir and not -d $dir) {
464 14         3052 mkdir $dir;
465             }
466             }
467             }
468              
469             sub make_html {
470 19     19   40041 my ($feeds, $files) = read_opml(@_);
471 19         132 load_feed_metadata($feeds, $files); # load messages and codes for feeds
472 19         118 my $globals = globals($files);
473 19         135 my $entries = entries($feeds, 4); # set data for feeds, too
474 19         109 add_data($feeds, $entries); # extract data from the xml
475 19         266 save_feed_metadata($feeds, $files); # save title and link for feeds
476 19         26412 $entries = limit($entries, 100);
477 19         97 write_text(html_file(@_), apply_template(read_text(html_template_file(@_)), $globals, $feeds, $entries));
478 19         226270 write_text(feed_file(@_), apply_template(read_text(feed_template_file(@_)), $globals, $feeds, $entries));
479             }
480              
481             sub html_file {
482 19     19   178 my ($html) = grep /\.html$/, @_;
483 19   50     164 return $html || 'index.html';
484             }
485              
486             sub html_template_file {
487 19     19   165 my ($html, $template) = grep /\.html$/, @_;
488 19   33     225 $template ||= dist_file('App-jupiter', 'template.html');
489 19 50       5353 die "HTML template $template not found\n" unless -r $template;
490 19         177 return $template;
491             }
492              
493             sub feed_file {
494 19     19   313 my ($feed) = grep /\.(xml|rss|atom)$/, @_;
495 19 50       199 return $feed if $feed;
496 0         0 return 'feed.xml';
497             }
498              
499             sub feed_template_file {
500 19     19   148 my ($feed, $template) = grep /\.(xml|rss|atom)$/, @_;
501 19 50       79 return $template if $template;
502 19         112 return dist_file('App-jupiter', 'feed.rss');
503             }
504              
505             sub apply_template {
506 38     38   8450 my $mnt = Mojo::Template->new;
507 38         475 return $mnt->render(@_);
508             }
509              
510             =head1 TEMPLATES
511              
512             The page template is called with three references: C<globals> (a hash), C<feeds>
513             (a list of hashes), and C<entries> (a list of hashes). Their keys are documented
514             below. The values of these hashes are all I<escaped HTML> except where noted
515             (dates and file names, for example).
516              
517             The technical details of how to write the templates are documented in the man
518             page for L<Mojo::Template>.
519              
520             =head2 Globals
521              
522             There are not many global keys.
523              
524             B<date> is the publication date of the HTML page, in ISO date format:
525             YYYY-MM-DD.
526              
527             B<files> is the list of OPML files used.
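
As a small sketch (not the actual F<template.html>), a template might pick up
these values like this; see L<Mojo::Template> for the syntax:

    % my ($globals, $feeds, $entries) = @_;
    <p>Generated on <%= $globals->{date} %>
    from <%= scalar @{$globals->{files}} %> OPML file(s).</p>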
528              
529             =cut
530              
531             sub globals {
532 19     19   71 my $files = shift;
533 19         192 my @time = gmtime;
534 19         285 my $today = DateTime->now->ymd;
535 19         11292 return {date => $today, files => $files};
536             }
537              
538             =head2 Writing templates for feeds
539              
540             Feeds have the following keys available:
541              
542             B<title> is the title of the feed.
543              
544             B<url> is the URL of the feed (RSS or Atom). This is not the link to the site!
545              
546             B<link> is the URL of the web page (HTML). This is the link to the site.
547              
548             B<opml_file> is the file name where this feed is listed.
549              
550             B<cache_dir> is the directory where this feed is cached.
551              
552             B<message> is the HTTP status message or other warning or error that we got
553             while fetching the feed.
554              
555             B<code> is the HTTP status code we got while fetching the feed.
556              
557             B<doc> is the L<XML::LibXML::Document>. Could be either Atom or RSS!
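
A sidebar fragment using these keys might look like the following sketch
(again, not the actual template):

    % for my $feed (@$feeds) {
    <li><a href="<%= $feed->{link} %>"><%= $feed->{title} %></a>
    % if ($feed->{code} && $feed->{code} != 200) {
    (<%= $feed->{message} %>)
    % }
    </li>
    % }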
558              
559             =cut
560              
561             # Creates list of feeds. Each feed is a hash with keys title, url, opml_file,
562             # cache_dir and cache_file.
563             sub read_opml {
564 37     37   159 my (@feeds, @files);
565 37         323 my @filters = map { decode(locale => substr($_, 1, -1)) } grep /^\/.*\/$/, @_;
  2         16  
566 37         678 for my $file (grep /\.opml$/, @_) {
567 37         443 my $doc = XML::LibXML->load_xml(location => $file); # this better have no errors!
568 37         23007 my @nodes = $doc->findnodes('//outline[./@xmlUrl]');
569 37         7219 my ($name, $path) = fileparse($file, '.opml', '.xml');
570 37         219 for my $node (@nodes) {
571 37         270 my $title = xml_escape $node->getAttribute('title');
572 37         1454 my $url = xml_escape $node->getAttribute('xmlUrl');
573 37 100 100     910 next if @filters > 0 and not grep { $url =~ /$_/ or $title =~ /$_/ } @filters;
  2 100       143  
574 36         146 my $link = xml_escape $node->getAttribute('htmlUrl');
575 36         902 push @feeds, {
576             title => $title, # title in the OPML file
577             url => $url, # feed URL in the OPML file
578             link => $link, # web URL in the OPML file
579             opml_file => $file,
580             cache_dir => "$path/$name",
581             cache_file => "$path/$name/" . slugify($url),
582             };
583             }
584 37 50       2340 warn "No feeds found in the OPML file $file\n" unless @nodes;
585 37         400 push @files, { file => $file, path => $path, name => $name };
586             }
587 37         2172 @feeds = shuffle @feeds;
588 37         155 return \@feeds, \@files;
589             }
590              
591             sub entries {
592 19     19   51 my $feeds = shift;
593 19         57 my $limit = shift;
594 19         103 my $date = DateTime->now(time_zone => 'UTC')->subtract( days => 90 ); # compute once
595 19         40459 my $now = DateTime->now(time_zone => 'UTC');
596 19         6659 my @entries;
597 19         90 for my $feed (@$feeds) {
598 18 50       566 next unless -r $feed->{cache_file};
599 18         90 my $doc = eval { XML::LibXML->load_xml(recover => 2, location => $feed->{cache_file} )};
  18         237  
600 18 50       8673 if (not $doc) {
601 0         0 $feed->{message} = xml_escape "Parsing error: $@";
602 0         0 $feed->{code} = 422; # unprocessable
603 0         0 next;
604             }
605 18         291 $feed->{doc} = $doc;
606 18         206 my @nodes = $xpc->findnodes("/rss/channel/item | /atom:feed/atom:entry", $doc);
607 18 100       1517 if (not @nodes) {
608 2         9 $feed->{message} = "Empty feed";
609 2         10 $feed->{code} = 204; # no content
610 2         8 next;
611             }
612             # if this is an Atom feed, we need to sort the entries ourselves (older entries at the end)
613             my @candidates = map {
614 16         53 my $entry = {};
  40         90  
615 40         99 $entry->{element} = $_;
616 40         134 $entry->{id} = id($_);
617 40   66     131 $entry->{date} = updated($_) || $undefined_date;
618 40         1146 $entry;
619             } @nodes;
620 16         71 @candidates = grep { DateTime->compare($_->{date}, $now) <= 0 } @candidates;
  40         1862  
621 16         1801 @candidates = unique(sort { DateTime->compare( $b->{date}, $a->{date} ) } @candidates);
  38         1865  
622 16         212 @candidates = @candidates[0 .. min($#candidates, $limit - 1)];
623             # now that we have limited the candidates, let's add more metadata from the feed
624 16         56 for my $entry (@candidates) {
625 30         138 $entry->{feed} = $feed;
626             # these two are already escaped
627 30         69 $entry->{blog_title} = $feed->{title};
628 30         76 $entry->{blog_url} = $feed->{url};
629             }
630 16         100 add_age_warning($feed, \@candidates, $date);
631 16         455 push @entries, @candidates;
632             }
633 19         243 return \@entries;
634             }
635              
636             sub add_age_warning {
637 16     16   39 my $feed = shift;
638 16         51 my $entries = shift;
639 16         33 my $date = shift;
640             # feed modification date is smaller than the date given
641 16         84 my ($node) = $xpc->findnodes("/rss/channel | /atom:feed", $feed->{doc});
642 16         831 my $feed_date = updated($node);
643 16 100 100     385 if ($feed_date and DateTime->compare($feed_date, $date) == -1) {
644 11         961 $feed->{message} = "No feed updates in 90 days";
645 11         37 $feed->{code} = 206; # partial content
646 11         98 return;
647             } else {
648             # or no entry found with a modification date equal or bigger than the date given
649 5         229 for my $entry (@$entries) {
650 11 100       396 return if DateTime->compare($entry->{date}, $date) >= 0;
651             }
652 4         273 $feed->{message} = "No entry newer than 90 days";
653 4         19 $feed->{code} = 206; # partial content
654             }
655             }
656              
657             sub updated {
658 113     113   27354 my $node = shift;
659 113 50       397 return unless $node;
660 113 100       797 my @nodes = $xpc->findnodes('pubDate | atom:published | atom:updated', $node) or return;
661 106         5726 my $date = $nodes[0]->textContent;
662             my $dt = eval { DateTime::Format::Mail->parse_datetime($date) }
663             || eval { DateTime::Format::ISO8601->parse_datetime($date) }
664 106   100     185 || eval { DateTime::Format::Mail->parse_datetime(french($date)) };
665 106         75439 return $dt;
666             }
667              
668             sub french {
669 60     60   105 my $date = shift;
670 60         656 $date =~ s/^($wday_re)/$wday{$1}/;
671 60         564 $date =~ s/\b($month_re)/$month{$1}/;
672 60         273 return $date;
673             }
674              
675             sub id {
676 40     40   77 my $node = shift;
677 40 50       140 return unless $node;
678 40         344 my $id = $xpc->findvalue('guid | atom:id', $node); # id is mandatory for Atom
679 40   100     4576 $id ||= $node->findvalue('link'); # one of the following three is mandatory for RSS
680 40   100     2736 $id ||= $node->findvalue('title');
681 40   100     1697 $id ||= $node->findvalue('description');
682 40         857 return $id;
683             }
684              
685             sub unique {
686 35     35   535 my %seen;
687             my @unique;
688 35         87 for my $node (@_) {
689 69 100       220 next if $seen{$node->{id}};
690 61         149 $seen{$node->{id}} = 1;
691 61         144 push(@unique, $node);
692             }
693 35         184 return @unique;
694             }
695              
696             sub limit {
697 19     19   73 my $entries = shift;
698 19         41 my $limit = shift;
699             # we want the most recent entries overall
700 19         73 @$entries = sort { DateTime->compare( $b->{date}, $a->{date} ) } unique(@$entries);
  18         799  
701 19         638 return [@$entries[0 .. min($#$entries, $limit - 1)]];
702             }
703              
704             =head2 Writing templates for entries
705              
706             Entries have the following keys available:
707              
708             B<title> is the title of the post.
709              
710             B<link> is the URL to the post on the web (probably an HTML page).
711              
712             B<blog_title> is the title of the site.
713              
714             B<blog_link> is the URL for the site on the web (probably an HTML page).
715              
716             B<blog_url> is the URL for the site's feed (RSS or Atom).
717              
718             B<authors> are the authors (or the Dublin Core contributor), a list of strings.
719              
720             B<date> is the publication date, as a DateTime object.
721              
722             B<day> is the publication date, in ISO date format: YYYY-MM-DD, for the UTC
723             timezone. The UTC timezone is picked so that the day doesn't jump back and forth
724             when sorting entries by date.
725              
726             B<content> is the full post content, as a string or as encoded HTML.
727              
728             B<excerpt> is the post content, limited to 500 characters, with paragraph
729             separators instead of HTML elements, as HTML. It is not encoded because the idea
730             is that it only gets added to the HTML and not to the feed, and the HTML it
731             contains is very controlled (only the pilcrow sign inside a span to indicate
732             paragraph breaks).
733              
734             B<categories> are the categories, a list of strings.
735              
736             B<element> is for internal use only. It contains the L<XML::LibXML::Element>
737             object. This could be RSS or Atom!
738              
739             B<feed> is for internal use only. It's a reference to the feed this entry
740             belongs to.
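
Here is a sketch of what a loop over the entries might look like in a
template (not the actual F<template.html>). Since the C<Mojo::Template>
instance used by the script does not turn on auto-escaping, C<< <%= %> >>
inserts values exactly as they are, which is what we want for values that are
already escaped:

    % for my $entry (@$entries) {
    <div class="post">
      <h2><a href="<%= $entry->{link} %>"><%= $entry->{title} %></a></h2>
      <p class="meta"><%= $entry->{day} %>,
        <a href="<%= $entry->{blog_link} %>"><%= $entry->{blog_title} %></a></p>
      <p><%= $entry->{excerpt} || $entry->{content} %></p>
    </div>
    % }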
741              
742             =cut
743              
744             sub add_data {
745 19     19   45 my $feeds = shift;
746 19         38 my $entries = shift;
747             # A note on the use of xml_escape: whenever we get data from the feed itself,
748             # it needs to be escaped if it gets printed into the HTML. For example: the
749             # feed contains a feed title of "Foo &amp; Bar". findvalue returns "Foo &
750             # Bar". When the template inserts the title, however, we want "Foo &amp; Bar",
751             # not "Foo & Bar". Thus: any text we get from the feed needs to be escaped
752             # if there's a chance we're going to print it again.
753 19         58 for my $feed (@$feeds) {
754 18 50       66 next unless $feed->{doc};
755             # title and url in the feed overrides title and xmlUrl set in the OPML (XML already escaped)
756 18   0     187 $feed->{title} = xml_escape($xpc->findvalue('/rss/channel/title | /atom:feed/atom:title', $feed->{doc})) || $feed->{title} || "";
757 18   50     1964 $feed->{url} = xml_escape($xpc->findvalue('/atom:feed/atom:link[@rel="self"]/@href', $feed->{doc})) || $feed->{url} || "";
758             # link in the feed does not override htmlUrl in the OPML (XML already escaped)
759 18   100     1842 $feed->{link} = $feed->{link} || xml_escape($xpc->findvalue('/rss/channel/link | /atom:feed/atom:link[@rel="alternate"][@type="text/html"]/@href', $feed->{doc})) || "";
760             # if they just pasted a domain "foo" then "//foo" is a valid URL
761 18 100       1215 $feed->{link} = "//" . $feed->{link} unless $feed->{link} =~ /\/\//;
762             }
763 19         213 for my $entry (@$entries) {
764             # copy from the feed (XML is already escaped)
765 30         258 $entry->{blog_link} = $entry->{feed}->{link};
766 30         84 $entry->{blog_title} = $entry->{feed}->{title};
767 30         72 $entry->{blog_url} = $entry->{feed}->{url};
768             # parse the elements
769 30         53 my $element = $entry->{element};
770 30   100     102 $entry->{title} = xml_escape $xpc->findvalue('title | atom:title', $element) || "Untitled";
771 30         2565 my @links = map { xml_escape $_->to_literal } map { $xpc->findnodes($_, $element) }
  16         560  
  90         2826  
772             # sorted by preferences!
773             qw(link atom:link[@rel="alternate"][@type="text/html"]/@href atom:link/@href);
774 30   100     868 $entry->{link} = shift(@links) || "";
775 30         251 my @authors = map { xml_escape strip_html($_->to_literal) } $xpc->findnodes(
  8         403  
776             'author | atom:author/atom:name | atom:contributor/atom:name | dc:creator | dc:contributor', $element);
777 30 100       1678 @authors = map { xml_escape strip_html($_->to_literal) } $xpc->findnodes(
  10         503  
778             '/atom:feed/atom:author/atom:name | '
779             . '/atom:feed/atom:contributor/atom:name | '
780             . '/rss/channel/dc:creator | '
781             . '/rss/channel/dc:contributor | '
782             . '/rss/channel/webMaster ', $element) unless @authors;
783 30 100       988 $entry->{authors} = @authors ? \@authors : undef; # key must exist in the hash
784 30 100       290 if (DateTime->compare($entry->{date}, $undefined_date) == 0) {
785 5         437 $entry->{day} = "(no date found)";
786             } else {
787 25         2178 $entry->{day} = $entry->{date}->clone->set_time_zone('UTC')->ymd; # operate on a clone
788             }
789 30         4354 my @categories = map { xml_escape strip_html($_->to_literal) } $xpc->findnodes('category | atom:category/@term', $element);
  6         311  
790 30 100       1177 $entry->{categories} = @categories ? \@categories : undef; # key must exist in the hash
791 30         152 $entry->{excerpt} = '';
792 30         81 $entry->{content} = '';
793 30         104 my @nodes = $xpc->findnodes('description[text()!=""] | atom:content[text()!=""]', $element);
794 30 100       1436 @nodes = $xpc->findnodes('summary[text()!=""] | atom:summary[text()!=""] | itunes:summary[text()!=""]', $element) unless @nodes;
795 30         444 my $content = shift(@nodes);
796             # The default is that the content is either plain text or escaped HTML,
797             # which is what we want for RSS. For Atom feeds, type="xhtml" means that the
798             # content is XHTML, so it needs to be escaped.
799 30 100       101 if ($content) {
800 23   100     361 my $is_xhtml = $content->hasAttribute("type") && $content->getAttribute("type") eq "xhtml";
801 23         270 $entry->{excerpt} = excerpt($content->to_literal);
802 23         155 for my $child ($content->childNodes()) {
803 27         726 my $c = $child->toString();
804 27 100       99 $c = xml_escape $c if $is_xhtml;
805 27         278 $entry->{content} .= $c;
806             }
807             }
808             }
809             }
810              
811             sub excerpt {
812 23     23   54 my $content = shift;
813 23 50       82 return '(no excerpt)' unless $content;
814 23         50 my $doc = eval { XML::LibXML->load_html(recover => 2, string => $content) };
  23         119  
815 23         6816 my $separator = "¶";
816 23         101 for my $node ($doc->findnodes('//style | //script | //time | //*[contains(@class, "post-header")] | //*[contains(@class, "post-share-buttons")] | //*[contains(@class, "post-footer")] | //*[contains(@class, "comments")]'),
817             ){
818 1         127 $node->parentNode->removeChild($node);
819             }
820 23         1907 for my $node ($doc->findnodes('//p | //br | //blockquote | //li | //td | //th | //div | //h1 | //h2 | //h3 | //h4 | //h5 | //h6')) {
821 23         1284 $node->appendTextNode($separator);
822             }
823 23         83 my $text = strip_html($doc->textContent());
824 23         384 $text =~ s/( +|----+)/ /g;
825             # collapse whitespace and trim
826 23         274 $text =~ s/\s+/ /g;
827 23         116 $text = trim $text;
828             # replace paragraph repeats with their surrounding spaces
829 23         484 $text =~ s/ ?¶( ?¶)* ?/¶/g;
830 23         75 $text =~ s/^¶//;
831 23         107 $text =~ s/¶$//;
832 23         78 my $len = length($text);
833 23         78 $text = substr($text, 0, 500);
834 23 50       97 $text .= "…" if $len > 500;
835 23         68 $text = xml_escape $text;
836 23         291 $text =~ s/¶/<span class="paragraph">¶ <\/span>/g;
837 23         488 return $text;
838             }
839              
840             # When there's a value that's supposed to be text but isn't, then we can try to
841             # turn it to HTML and from there to text... This is an ugly hack and I wish it
842             # wasn't necessary.
843             sub strip_html {
844 47     47   985 my $str = shift;
845 47 50       123 return '' unless $str;
846 47         71 my $doc = eval { XML::LibXML->load_html(string => $str) };
  47         165  
847 47 100       12608 return $str unless $doc;
848 45         767 return $doc->textContent();
849             }
850              
851             =head1 SEE ALSO
852              
853             OPML 2.0, L<http://dev.opml.org/spec2.html>
854              
855             RSS 2.0, L<https://cyber.harvard.edu/rss/rss.html>
856              
857             Atom Syndication, L<https://tools.ietf.org/html/rfc4287>
858              
859             River of News,
860             L<http://scripting.com/2014/06/02/whatIsARiverOfNewsAggregator.html>
861              
862             =cut
863              
864             1;