File Coverage

script/jupiter
Criterion Covered Total %
statement 327 348 93.9
branch 55 80 68.7
condition 35 44 79.5
subroutine 49 52 94.2
pod n/a
total 466 524 88.9


line stmt bran cond sub pod time code
1             #! /usr/bin/env perl
2             # Planet Jupiter is a feed aggregator that creates a single HTML file
3             # Copyright (C) 2020–2021 Alex Schroeder <alex@gnu.org>
4              
5             # This program is free software: you can redistribute it and/or modify
6             # it under the terms of the GNU Affero General Public License as published by
7             # the Free Software Foundation, either version 3 of the License, or
8             # (at your option) any later version.
9             #
10             # This program is distributed in the hope that it will be useful,
11             # but WITHOUT ANY WARRANTY; without even the implied warranty of
12             # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
13             # GNU Affero General Public License for more details.
14             #
15             # You should have received a copy of the GNU Affero General Public License
16             # along with this program. If not, see <https://www.gnu.org/licenses/>.
17              
18             package Jupiter;
19              
20 14     14   5103869 use Encode::Locale;
  14         61314  
  14         677  
21 14     14   103 use Encode;
  14         24  
  14         1047  
22 14     14   691 use utf8;
  14         40  
  14         92  
23              
24             binmode(STDOUT, ":utf8");
25             binmode(STDERR, ":utf8");
26              
27             =encoding UTF8
28              
29             =head1 NAME
30              
31             jupiter - turn a list of feeds into a HTML page, a river of news
32              
33             =head1 SYNOPSIS
34              
35             To update the feeds from one or more OPML files:
36              
37             B<jupiter update> I<feed.opml> … [I</regex/> …]
38              
39             To generate F<index.html>:
40              
41             B<jupiter html> I<feed.opml>
42              
43             =head1 DESCRIPTION
44              
45             Planet Jupiter is used to pull together the latest updates from a bunch of other
46             sites and display them on a single web page, the "river of news". The sites we
47             get our updates from are defined in an OPML file.
48              
49             A river of news, according to Dave Winer, is a feed aggregator. New items appear
50             at the top and old items disappear at the bottom. When it's gone, it's gone.
51             There is no count of unread items. The goal is to fight the I<fear of missing
52             out> (FOMO).
53              
54             Each item looks similar to every other: headline, link, an extract, maybe a date
55             and an author. Extracts contain but the beginning of the article's text; all
56             markup is removed; no images. The goal is to make it the page easy to skim.
57             Scroll down until you find something interesting and follow the link to the
58             original article if you want to read it.
59              
60             =head2 The OPML file
61              
62             You B<need> an OPML file. It's an XML file linking to I<feeds>. Here's an
63             example listing just one feed. In order to add more, add more C<outline>
64             elements with the C<xmlUrl> attribute. The exact order and nesting does not
65             matter. People can I<import> these OPML files into their own feed readers and
66             thus it may make sense to spend a bit more effort in making it presentable.
67              
68             <opml version="2.0">
69             <body>
70             <outline title="Alex Schroeder"
71             xmlUrl="https://alexschroeder.ch/wiki?action=rss"/>
72             </body>
73             </opml>
74              
75             =head2 Update the feeds in your cache
76              
77             This is how you update the feeds in a file called C<feed.opml>. It downloads all
78             the feeds linked to in the OPML file and stores them in the cache directory.
79              
80             jupiter update feed.opml
81              
82             The directory used to keep a copy of all the feeds in the OPML file has the same
83             name as the OPML file but without the .opml extension. In other words, if your
84             OPML file is called C<feed.opml> then the cache directory is called C<feed>.
85              
86             This operation takes long because it requests an update from all the sites
87             listed in your OPML file. Don't run it too often or you'll annoy the site
88             owners.
89              
90             The OPML file must use the .opml extension. You can update the feeds for
91             multiple OPML files in one go.
92              
93             =head2 Adding just one feed
94              
95             After a while, the list of feeds in your OPML starts getting unwieldy. When you
96             add a new feed, you might not want to fetch all of them. In this case, provide a
97             regular expression surrounded by slashes to the C<update> command:
98              
99             jupiter update feed.opml /example/
100              
101             Assuming a feed with a URL or title that matches the regular expression is
102             listed in your OPML file, only that feed is going to get updated.
103              
104             There is no need to escape slashes in the regular expression: C<//rss/> works
105             just fine. Beware shell escaping, however. Most likely, you need to surround the
106             regular expression with single quotes if it contains spaces:
107              
108             jupiter update feed.opml '/Halberds & Helmets/'
109              
110             Notice how we assume that named entities such as C<&amp;> have already been
111             parsed into the appropriate strings.
112              
113             =head2 Generate the HTML
114              
115             This is how you generate the C<index.html> file based on the feeds of your
116             C<feed.opml>. It assumes that you have already updated all the feeds (see
117             above).
118              
119             jupiter html feed.opml
120              
121             See L</OPTIONS> for ways to change how the HTML is generated.
122              
123             =head2 Generate the RSS feed
124              
125             This happens at the same time as when you generate the HTML. It takes all the
126             entries that are being added to the HTML and puts the into a feed.
127              
128             See L</OPTIONS> for ways to change how the HTML is generated.
129              
130             =head2 Why separate the two steps?
131              
132             The first reason is that tinkering with the templates involves running the
133             program again and again, and you don't want to contact all the sites whenever
134             you update your templates.
135              
136             The other reason is that it allows you to create subsets. For example, you can
137             fetch the feeds for three different OPML files:
138              
139             jupiter update osr.opml indie.opml other.opml
140              
141             And then you can create three different HTML files:
142              
143             jupiter html osr.html osr.opml
144             jupiter html indie.html indie.opml
145             jupiter html rpg.html osr.opml indie.opml other.opml
146              
147             For an example of how it might look, check out the setup for the planets I run.
148             L<https://alexschroeder.ch/cgit/planet/about/>
149              
150             =head2 What about the JSON file?
151              
152             There's a JSON file that gets generated and updated as you run Planet Jupiter.
153             It's name depends on the OPML files used. It records metadata for every feed in
154             the OPML file that isn't stored in the feeds themselves.
155              
156             C<message> is the HTTP status message, or a similar message such as "No entry
157             newer than 90 days." This is set when update the feeds in your cache.
158              
159             C<message> is the HTTP status code; this code could be the real status code from
160             the server (such as 404 for a "not found" status) or one generated by Jupiter
161             such that it matches the status message (such as 206 for a "partial content"
162             status when there aren't any recent entries in the feed). This is set when
163             update the feeds in your cache.
164              
165             C<title> is the site's title. When you update the feeds in your cache, it is
166             taken from the OPML file. That's how the feed can have a title even if the
167             download failed. When you generate the HTML, the feeds in the cache are parsed
168             and if a title is provided, it is stored in the JSON file and overrides the
169             title in the OPML file.
170              
171             C<link> is the site's link for humans. When you generate the HTML, the feeds in
172             the cache are parsed and if a link is provided, it is stored in the JSON file.
173             If the OPML element contained a C<htmlURL> attribute, however, that takes
174             precedence. The reasoning is that when a podcast is hosted on a platform which
175             generates a link that you don't like and you know the link to the human-readable
176             blog elsehwere, use the C<htmlURL> attribute in the OPML file to override this.
177              
178             C<last_modified> and C<etag> are two headers used for caching from the HTTP
179             response that cannot be changed by data in the feed.
180              
181             If we run into problems downloading a feed, this setup allows us to still link
182             to the feeds that aren't working, using their correct names, and describing the
183             error we encountered.
184              
185             =head2 Logging
186              
187             Use the C<--log=LEVEL> to set the log level. Valid values for LEVEL are debug,
188             info, warn, error, and fatal.
189              
190             =head1 LICENSE
191              
192             GNU Affero General Public License
193              
194             =head1 INSTALLATION
195              
196             Using C<cpan>:
197              
198             cpan App::jupiter
199              
200             Manual install:
201              
202             perl Makefile.PL
203             make
204             make install
205              
206             =head2 Dependencies
207              
208             To run Jupiter on Debian we need:
209              
210             C<libmodern-perl-perl> for L<Modern::Perl>
211              
212             C<libmojolicious-perl> for L<Mojo::Template>, L<Mojo::UserAgent>, L<Mojo::Log>,
213             L<Mojo::JSON>, and L<Mojo::Util>
214              
215             C<libxml-libxml-perl> for L<XML::LibXML>
216              
217             C<libfile-slurper-perl> for L<File::Slurper>
218              
219             C<libdatetime-perl> for L<DateTime>
220              
221             C<libdatetime-format-mail-perl> for L<DateTime::Format::Mail>
222              
223             C<libdatetime-format-iso8601-perl> for L<DateTime::Format::ISO8601>
224              
225             Unfortunately, L<Mojo::UserAgent::Role::Queued> isn't packaged for Debian.
226             Therefore, let's build it and install it as a Debian package.
227              
228             sudo apt-get install libmodule-build-tiny-perl
229             sudo apt-get install dh-make-perl
230             sudo dh-make-perl --build --cpan Mojo::UserAgent::Role::Queued
231             dpkg --install libmojo-useragent-role-queued-perl_1.15-1_all.deb
232              
233             To generate the C<README.md> from the source file, you need F<pod2markdown>
234             which you get in C<libpod-markdown-perl>.
235              
236             =head1 FILES
237              
238             There are a number of files in the F<share> directory which you can use as
239             starting points.
240              
241             F<template.html> is the HTML template.
242              
243             F<default.css> is a small CSS file used by F<template.html>.
244              
245             F<personalize.js> is a small Javascript file used by F<template.html> used to
246             allow visitors to jump from one article to the next using C<J> and C<K>.
247              
248             F<jupiter.png> is used by F<template.html> as the icon.
249              
250             F<jupiter.svg> is used by F<template.html> as the logo.
251              
252             F<feed.png> is used by F<template.html> as the icon for the feeds in the
253             sidebar.
254              
255             F<feed.rss> is the feed template.
256              
257             =head1 OPTIONS
258              
259             HTML generation uses a template, C<template.html>. It is written for
260             C<Mojo::Template> and you can find it in the F<share> directory of your
261             distribution. The default templates use other files, such as the logo, the feed
262             icon, a CSS file, and a small Javascript snippet to enable navigation using the
263             C<J> and C<K> keys (see above).
264              
265             You can specify a different HTML file to generate:
266              
267             B<jupiter html> I<your.html feed.opml>
268              
269             If you specify two HTML files, the first is the HTML file to generate and the
270             second is the template to use. Both must use the C<.html> extension.
271              
272             B<jupiter html> I<your.html your-template.html feed.opml>
273              
274             Feed generation uses a template, C<feed.rss>. It writes all the entries into a
275             file called C<feed.xml>. Again, the template is written for C<Mojo::Template>.
276              
277             You can specify up to two XML, RSS or ATOM files. They must uses one of these
278             three extensions: C<.xml>, C<.rss>, or C<.atom>. The first is the name of the
279             feed to generate, the second is the template to use:
280              
281             B<jupiter html> I<atom.xml template.xml planet.html template.html feed.opml>
282              
283             In the above case, Planet Jupiter will write a feed called F<atom.xml> based on
284             F<template.xml> and a HTML file called F<planet.html> based on F<template.html>,
285             using the cached entries matching the feeds in F<feed.opml>.
286              
287             =cut
288              
289 14     14   12637 use DateTime;
  14         5780885  
  14         760  
290 14     14   9842 use DateTime::Format::Mail;
  14         81529  
  14         594  
291 14     14   9431 use DateTime::Format::ISO8601;
  14         1736565  
  14         1178  
292 14     14   211 use File::Basename;
  14         35  
  14         1424  
293 14     14   782 use File::Slurper qw(read_binary write_binary read_text write_text);
  14         3316  
  14         907  
294 14     14   87 use List::Util qw(uniq min shuffle);
  14         24  
  14         973  
295 14     14   101 use Modern::Perl;
  14         25  
  14         153  
296 14     14   12902 use Mojo::Log;
  14         363759  
  14         216  
297 14     14   1210 use Mojo::JSON qw(decode_json encode_json);
  14         19092  
  14         840  
298 14     14   9084 use Mojo::Template;
  14         62865  
  14         172  
299 14     14   9570 use Mojo::UserAgent;
  14         449184  
  14         164  
300 14     14   9918 use Pod::Simple::Text;
  14         79017  
  14         621  
301 14     14   9836 use XML::LibXML;
  14         433036  
  14         135  
302 14     14   2517 use Mojo::Util qw(slugify trim xml_escape html_unescape);
  14         35  
  14         971  
303 14     14   91 use File::ShareDir 'dist_file';
  14         29  
  14         670  
304              
305 14     14   87 use vars qw($log);
  14         29  
  14         77052  
306             our $log = Mojo::Log->new;
307              
308             my $xpc = XML::LibXML::XPathContext->new;
309             $xpc->registerNs('atom', 'http://www.w3.org/2005/Atom');
310             $xpc->registerNs('html', 'http://www.w3.org/1999/xhtml');
311             $xpc->registerNs('dc', 'http://purl.org/dc/elements/1.1/');
312             $xpc->registerNs('itunes', 'http://www.itunes.com/dtds/podcast-1.0.dtd');
313              
314             my $undefined_date = DateTime->from_epoch( epoch => 0 );
315              
316             my (%wday, %month, $wday_re, $month_re);
317             %wday = qw (lun. Mon mar. Tue mer. Wed jeu. Thu ven. Fri sam. Sat dim. Sun);
318             %month = qw (janv. Jan févr. Feb mars Mar avr. Apr mai May juin Jun
319             juil. Jul août Aug sept. Sep oct. Oct nov. Nov déc. Dec);
320             $wday_re = join('|', map { quotemeta } keys %wday) unless $wday_re;
321             $month_re = join('|', map { quotemeta } keys %month) unless $month_re;
322              
323             # Our tests don't want to call main
324             __PACKAGE__->main unless caller;
325              
326             sub main {
327 0     0   0 my ($log_level) = grep /^--log=/, @ARGV;
328 0 0       0 $log->level(substr($log_level, 6)) if $log_level;
329 0         0 my ($command) = grep /^[a-z]+$/, @ARGV;
330 0   0     0 $command ||= 'help';
331 0 0       0 if ($command eq 'update') {
    0          
332 0         0 update_cache(@ARGV);
333             } elsif ($command eq 'html') {
334 0         0 make_html(@ARGV);
335             } else {
336 0         0 my $parser = Pod::Simple::Text->new();
337 0         0 $parser->parse_file($0);
338             }
339             }
340              
341             sub update_cache {
342 17     17   1159484 my ($feeds, $files) = read_opml(@_);
343 17         94 make_directories($feeds);
344 17         133 load_feed_metadata($feeds, $files);
345 17         263 my $ua = Mojo::UserAgent->new->with_roles('+Queued')
346             ->max_redirects(3)
347             ->max_active(5);
348 17         81205 make_promises($ua, $feeds);
349 17         19431 fetch_feeds($feeds);
350 17         12879 save_feed_metadata($feeds, $files);
351 17         5651 cleanup_cache($feeds);
352             }
353              
354             sub make_promises {
355 17     17   44 my $ua = shift;
356 17         34 my $feeds = shift;
357 17         51 for my $feed (@$feeds) {
358 17         103 my $url = html_unescape $feed->{url}; # undo xml_escape for the request
359             $ua->on(start => sub {
360 17     17   16220 my ($ua, $tx) = @_;
361 17 50       94 $tx->req->headers->if_none_match($feed->{etag}) if ($feed->{etag});
362 17 50       100 $tx->req->headers->if_modified_since($feed->{last_modified}) if ($feed->{last_modified});
363 17         503 });
364             $feed->{promise} = $ua->get_p($url)
365             ->catch(sub {
366 0     0   0 $feed->{message} = "@_";
367 0         0 $feed->{code} = 521;
368             # returning 0 in the case of an error is important
369 0         0 0; })
370             # sleeping to stop blogger.com from blocking us
371 17     17   241 ->finally(sub { $log->debug($url); sleep 2; });
  17         256230  
  17         34004009  
372             }
373             }
374              
375             sub fetch_feeds {
376 17     17   56 my $feeds = shift;
377 17         562 $log->info("Fetching feeds...");
378 17         103 Mojo::Promise->all(map { $_->{promise} } @$feeds)->then(sub {
379             # all returns the values in the same order!
380 17     17   14740 for (my $i = 0; $i < @_; $i++) {
381 17         81 my $feed = $feeds->[$i];
382 17         49 my $value = $_[$i];
383 17         48 my $tx = $value->[0];
384             # relies on catch returning 0 above
385 17 50       181 next unless $tx;
386 17         243 $feed->{message} = $tx->result->message;
387 17         1053 $feed->{code} = $tx->result->code;
388 17         471 $feed->{last_modified} = $tx->result->headers->last_modified;
389 17         956 $feed->{etag} = $tx->result->headers->etag;
390             # save raw bytes if this is a success
391 17 50       627 eval { write_binary($feed->{cache_file}, $tx->result->body) } if $tx->result->is_success;
  17         1133  
392 17 50       8362 warn "Unable to write $feed->{cache_file}: $@\n" if $@;
393             }
394             })->catch(sub {
395 0     0   0 warn "Something went wrong: @_";
396 17         219 })->wait;
397             }
398              
399             sub load_feed_metadata {
400 35     35   83 my $feeds = shift;
401 35         65 my $files = shift;
402 35         103 for my $file (@$files) {
403 34         143 my $filename = "$file->{path}/$file->{name}";
404 34 100       821 next unless -r "$filename.json";
405 21         230 my $data = decode_json read_binary("$filename.json");
406 21         8206 for my $feed (@$feeds) {
407 21         65 my $url = $feed->{url};
408             # don't overwrite title and htmlUrl from OPML file
409 21 50       131 $feed->{title} = $data->{$url}->{title} if $data->{$url}->{title};
410 21   100     174 $feed->{link} ||= $data->{$url}->{link};
411             # all the other metadata is loaded from the JSON file
412 21         62 $feed->{message} = $data->{$url}->{message};
413 21         79 $feed->{code} = $data->{$url}->{code};
414 21         91 $feed->{last_modified} = $data->{$url}->{last_modified};
415 21         70 $feed->{etag} = $data->{$url}->{etag};
416 21         61 } grep { $_->{opml_file} eq $file->{file} } @$feeds;
  21         127  
417             }
418             }
419              
420             sub save_feed_metadata {
421 35     35   530 my $feeds = shift;
422 35         79 my $files = shift;
423 35         115 for my $file (@$files) {
424 34         111 my $name = $file->{name};
425             my %messages = map {
426 34         70 my $feed = $_;
427 34         107 $feed->{url} => { map { $_ => $feed->{$_} } grep { $feed->{$_} } qw(title link message code last_modified etag) };
  124         498  
  204         404  
428 34         97 } grep { $_->{opml_file} eq $file->{file} } @$feeds;
  34         200  
429 34         374 write_binary("$file->{path}/$file->{name}.json", encode_json \%messages);
430             }
431             }
432              
433             sub cleanup_cache {
434 17     17   54 my $feeds = shift;
435 17         44 my %good = map { $_ => 1 } @{cache_files($feeds)};
  17         91  
  17         100  
436 17         50 my @unused = grep { not $good{$_} } @{existing_files($feeds)};
  17         90  
  17         66  
437 17 50       460 if (@unused) {
438 0         0 $log->info("Removing unused files from the cache...");
439 0         0 foreach (@unused) { $log->info($_) }
  0         0  
440 0         0 unlink @unused;
441             }
442             }
443              
444             sub existing_files {
445 17     17   44 my $feeds = shift;
446 17         38 my @files;
447 17         49 for my $dir (uniq map { $_->{cache_dir} } @$feeds) {
  17         126  
448 17         1808 push(@files, <"$dir/*">);
449             }
450 17         115 return \@files;
451             }
452              
453             sub cache_files {
454 17     17   38 my $feeds = shift;
455 17         54 my @files = map { $_->{cache_file} } @$feeds;
  17         92  
456 17         71 return \@files;
457             }
458              
459             sub make_directories {
460 17     17   36 my $feeds = shift;
461 17         48 for my $dir (uniq map { $_->{cache_dir} } @$feeds) {
  17         111  
462 17 100       620 if (not -d $dir) {
463 13         1545 mkdir $dir;
464             }
465             }
466             }
467              
468             sub make_html {
469 18     18   30384 my ($feeds, $files) = read_opml(@_);
470 18         135 load_feed_metadata($feeds, $files); # load messages and codes for feeds
471 18         81 my $globals = globals($files);
472 18         91 my $entries = entries($feeds, 4); # set data for feeds, too
473 18         95 add_data($feeds, $entries); # extract data from the xml
474 18         277 save_feed_metadata($feeds, $files); # save title and link for feeds
475 18         7709 $entries = limit($entries, 100);
476 18         95 write_text(html_file(@_), apply_template(read_text(html_template_file(@_)), $globals, $feeds, $entries));
477 18         204442 write_text(feed_file(@_), apply_template(read_text(feed_template_file(@_)), $globals, $feeds, $entries));
478             }
479              
480             sub html_file {
481 18     18   230 my ($html) = grep /\.html$/, @_;
482 18   50     171 return $html || 'index.html';
483             }
484              
485             sub html_template_file {
486 18     18   106 my ($html, $template) = grep /\.html$/, @_;
487 18   33     184 $template ||= dist_file('App-jupiter', 'template.html');
488 18 50       4834 die "HTML template $template not found\n" unless -r $template;
489 18         145 return $template;
490             }
491              
492             sub feed_file {
493 18     18   263 my ($feed) = grep /\.(xml|rss|atom)$/, @_;
494 18 50       155 return $feed if $feed;
495 0         0 return 'feed.xml';
496             }
497              
498             sub feed_template_file {
499 18     18   210 my ($feed, $template) = grep /\.(xml|rss|atom)$/, @_;
500 18 50       88 return $template if $template;
501 18         139 return dist_file('App-jupiter', 'feed.rss');
502             }
503              
504             sub apply_template {
505 36     36   9252 my $mnt = Mojo::Template->new;
506 36         517 return $mnt->render(@_);
507             }
508              
509             =head1 TEMPLATES
510              
511             The page template is called with three hash references: C<globals>, C<feeds>,
512             and C<entries>. The keys of these three hash references are documented below.
513             The values of these hashes are all I<escaped HTML> except where noted (dates and
514             file names, for example).
515              
516             The technical details of how to write the templates are documented in the man
517             page for L<Mojo::Template>.
518              
519             =head2 Globals
520              
521             There are not many global keys.
522              
523             B<date> is the the publication date of the HTML page, in ISO date format:
524             YYYY-MM-DD.
525              
526             B<files> is the list of OPML files used.
527              
528             =cut
529              
530             sub globals {
531 18     18   44 my $files = shift;
532 18         254 my @time = gmtime;
533 18         229 my $today = DateTime->now->ymd;
534 18         8686 return {date => $today, files => $files};
535             }
536              
537             =head2 Writing templates for feeds
538              
539             Feeds have the following keys available:
540              
541             B<title> is the title of the feed.
542              
543             B<url> is the URL of the feed (RSS or Atom). This is not the link to the site!
544              
545             B<link> is the URL of the web page (HTML). This is the link to the site.
546              
547             B<opml_file> is the file name where this feed is listed.
548              
549             B<cache_dir> is the directory where this feed is cached.
550              
551             B<message> is the HTTP status message or other warning or error that we got
552             while fetching the feed.
553              
554             B<code> is the HTTP status code we got while fetching the feed.
555              
556             B<doc> is the L<XML::LibXML::Document>. Could be either Atom or RSS!
557              
558             =cut
559              
560             # Creates list of feeds. Each feed is a hash with keys title, url, opml_file,
561             # cache_dir and cache_file.
562             sub read_opml {
563 35     35   110 my (@feeds, @files);
564 35         272 my @filters = map { decode(locale => substr($_, 1, -1)) } grep /^\/.*\/$/, @_;
  2         18  
565 35         510 for my $file (grep /\.opml$/, @_) {
566 35         435 my $doc = XML::LibXML->load_xml(location => $file); # this better have no errors!
567 35         18355 my @nodes = $doc->findnodes('//outline[./@xmlUrl]');
568 35         5452 my ($name, $path) = fileparse($file, '.opml', '.xml');
569             push @feeds, map {
570 35         152 my $title = xml_escape $_->getAttribute('title');
  35         227  
571 35         1118 my $url = xml_escape $_->getAttribute('xmlUrl');
572 35 100 100     660 next if @filters > 0 and not grep { $url =~ /$_/ or $title =~ /$_/ } @filters;
  2 100       62  
573 34         114 my $link = xml_escape $_->getAttribute('htmlUrl');
574             {
575 34         686 title => $title, # title in the OPML file
576             url => $url, # feed URL in the OPML file
577             link => $link, # web URL in the OPML file
578             opml_file => $file,
579             cache_dir => "$path/$name",
580             cache_file => "$path/$name/" . slugify($url),
581             }
582             } @nodes;
583 34 50       1878 warn "No feeds found in the OPML file $file\n" unless @nodes;
584 34         345 push @files, { file => $file, path => $path, name => $name };
585             }
586 35         1806 @feeds = shuffle @feeds;
587 35         136 return \@feeds, \@files;
588             }
589              
590             sub entries {
591 18     18   38 my $feeds = shift;
592 18         35 my $limit = shift;
593 18         77 my $date = DateTime->now(time_zone => 'UTC')->subtract( days => 90 ); # compute once
594 18         31634 my @entries;
595 18         62 for my $feed (@$feeds) {
596 17 50       666 next unless -r $feed->{cache_file};
597 17         60 my $doc = eval { XML::LibXML->load_xml(recover => 2, location => $feed->{cache_file} )};
  17         198  
598 17 50       8524 if (not $doc) {
599 0         0 $feed->{message} = xml_escape "Parsing error: $@";
600 0         0 $feed->{code} = 422; # unprocessable
601 0         0 next;
602             }
603 17         167 $feed->{doc} = $doc;
604 17         139 my @nodes = $xpc->findnodes("/rss/channel/item | /atom:feed/atom:entry", $doc);
605 17 100       1315 if (not @nodes) {
606 2         7 $feed->{message} = "Empty feed";
607 2         5 $feed->{code} = 204; # no content
608 2         6 next;
609             }
610             # if this is an Atom feed, we need to sort the entries ourselves (older entries at the end)
611             my @candidates = map {
612 15         45 my $entry = {};
  34         70  
613 34         86 $entry->{element} = $_;
614 34         105 $entry->{id} = id($_);
615 34   66     114 $entry->{date} = updated($_) || $undefined_date;
616 34         851 $entry;
617             } @nodes;
618 15         113 @candidates = unique(sort { DateTime->compare( $b->{date}, $a->{date} ) } @candidates);
  30         1522  
619 15         203 @candidates = @candidates[0 .. min($#candidates, $limit - 1)];
620             # now that we have limited the candidates, let's add more metadata from the feed
621 15         46 for my $entry (@candidates) {
622 29         54 $entry->{feed} = $feed;
623             # these two are already escaped
624 29         67 $entry->{blog_title} = $feed->{title};
625 29         65 $entry->{blog_url} = $feed->{url};
626             }
627 15         76 add_age_warning($feed, \@candidates, $date);
628 15         441 push @entries, @candidates;
629             }
630 18         242 return \@entries;
631             }
632              
633             sub add_age_warning {
634 15     15   37 my $feed = shift;
635 15         23 my $entries = shift;
636 15         34 my $date = shift;
637             # feed modification date is smaller than the date given
638 15         76 my ($node) = $xpc->findnodes("/rss/channel | /atom:feed", $feed->{doc});
639 15         758 my $feed_date = updated($node);
640 15 100 100     314 if ($feed_date and DateTime->compare($feed_date, $date) == -1) {
641 10         945 $feed->{message} = "No feed updates in 90 days";
642 10         25 $feed->{code} = 206; # partial content
643 10         106 return;
644             } else {
645             # or no entry found with a modification date equal or bigger than the date given
646 5         319 for my $entry (@$entries) {
647 11 100       368 return if DateTime->compare($entry->{date}, $date) >= 0;
648             }
649 4         327 $feed->{message} = "No entry newer than 90 days";
650 4         19 $feed->{code} = 206; # partial content
651             }
652             }
653              
654             sub updated {
655 106     106   27275 my $node = shift;
656 106 50       448 return unless $node;
657 106 100       857 my @nodes = $xpc->findnodes('pubDate | atom:updated', $node) or return;
658 99         5948 my $date = $nodes[0]->textContent;
659             my $dt = eval { DateTime::Format::Mail->parse_datetime($date) }
660             || eval { DateTime::Format::ISO8601->parse_datetime($date) }
661 99   100     205 || eval { DateTime::Format::Mail->parse_datetime(french($date)) };
662 99         73299 return $dt;
663             }
664              
665             sub french {
666 60     60   124 my $date = shift;
667 60         849 $date =~ s/^($wday_re)/$wday{$1}/;
668 60         733 $date =~ s/\b($month_re)/$month{$1}/;
669 60         332 return $date;
670             }
671              
672             sub id {
673 34     34   59 my $node = shift;
674 34 50       81 return unless $node;
675 34         267 my $id = $xpc->findvalue('guid | atom:id', $node); # id is mandatory for Atom
676 34   100     3430 $id ||= $node->findvalue('link'); # one of the following three is mandatory for RSS
677 34   100     1861 $id ||= $node->findvalue('title');
678 34   100     1380 $id ||= $node->findvalue('description');
679 34         481 return $id;
680             }
681              
682             sub unique {
683 33     33   584 my %seen;
684             my @unique;
685 33         94 for my $node (@_) {
686 63 100       182 next if $seen{$node->{id}};
687 59         150 $seen{$node->{id}} = 1;
688 59         117 push(@unique, $node);
689             }
690 33         157 return @unique;
691             }
692              
693             sub limit {
694 18     18   50 my $entries = shift;
695 18         42 my $limit = shift;
696             # we want the most recent entries overall
697 18         75 @$entries = sort { DateTime->compare( $b->{date}, $a->{date} ) } unique(@$entries);
  18         795  
698 18         617 return [@$entries[0 .. min($#$entries, $limit - 1)]];
699             }
700              
701             =head2 Writing templates for entries
702              
703             Entries have the following keys available:
704              
705             B<title> is the title of the post.
706              
707             B<link> is the URL to the post on the web (probably a HTML page).
708              
709             B<blog_title> is the title of the site.
710              
711             B<blog_link> is the URL for the site on the web (probably a HTML page).
712              
713             B<blog_url> is the URL for the site's feed (RSS or Atom).
714              
715             B<authors> are the authors (or the Dublin Core contributor), a list of strings.
716              
717             B<date> is the publication date, as a DateTime object.
718              
719             B<day> is the publication date, in ISO date format: YYYY-MM-DD, for the UTC
720             timezone. The UTC timezone is picked so that the day doesn't jump back and forth
721             when sorting entries by date.
722              
723             B<content> is the full post content, as string or encoded HTML.
724              
725             B<excerpt> is the post content, limited to 500 characters, with paragraph
726             separators instead of HTML elements, as HTML. It is not encoded because the idea
727             is that it only gets added to the HTML and not to the feed, and the HTML it
728             contains is very controlled (only the pilcrow sign inside a span to indicate
729             paragraph breaks).
730              
731             B<categories> are the categories, a list of strings.
732              
733             B<element> is for internal use only. It contains the L<XML::LibXML::Element>
734             object. This could be RSS or Atom!
735              
736             B<feed> is for internal use only. It's a reference to the feed this entry
737             belongs to.
738              
739             =cut
740              
741             sub add_data {
742 18     18   40 my $feeds = shift;
743 18         41 my $entries = shift;
744             # A note on the use of xml_escape: whenever we get data from the feed itself,
745             # it needs to be escaped if it gets printed into the HTML. For example: the
746             # feed contains a feed title of "Foo &amp; Bar". findvalue returns "Foo &
747             # Bar". When the template inserts the title, however, we want "Foo &amp; Bar",
748             # not "Foo & Bar". Thus: any text we get from the feed needs to be escaped
749             # if there's a chance we're going to print it again.
750 18         70 for my $feed (@$feeds) {
751 17 50       69 next unless $feed->{doc};
752             # title and url in the feed overrides title and xmlUrl set in the OPML (XML already escaped)
753 17   0     181 $feed->{title} = xml_escape($xpc->findvalue('/rss/channel/title | /atom:feed/atom:title', $feed->{doc})) || $feed->{title} || "";
754 17   50     1942 $feed->{url} = xml_escape($xpc->findvalue('/atom:feed/atom:link[@rel="self"]/@href', $feed->{doc})) || $feed->{url} || "";
755             # link in the feed does not override htmlUrl in the OPML (XML already escaped)
756 17   100     1584 $feed->{link} = $feed->{link} || xml_escape($xpc->findvalue('/rss/channel/link | /atom:feed/atom:link[@rel="alternate"][@type="text/html"]/@href', $feed->{doc})) || "";
757             # if they just pasted a domain "foo" then "//foo" is a valid URL
758 17 100       1045 $feed->{link} = "//" . $feed->{link} unless $feed->{link} =~ /\/\//;
759             }
760 18         252 for my $entry (@$entries) {
761             # copy from the feed (XML is already escaped)
762 29         214 $entry->{blog_link} = $entry->{feed}->{link};
763 29         73 $entry->{blog_title} = $entry->{feed}->{title};
764 29         56 $entry->{blog_url} = $entry->{feed}->{url};
765             # parse the elements
766 29         56 my $element = $entry->{element};
767 29   100     98 $entry->{title} = xml_escape $xpc->findvalue('title | atom:title', $element) || "Untitled";
768 29         2422 my @links = map { xml_escape $_->to_literal } map { $xpc->findnodes($_, $element) }
  16         504  
  87         2649  
769             # sorted by preferences!
770             qw(link atom:link[@rel="alternate"][@type="text/html"]/@href atom:link/@href);
771 29   100     726 $entry->{link} = shift(@links) || "";
772 29         218 my @authors = map { xml_escape strip_html($_->to_literal) } $xpc->findnodes(
  7         255  
773             'author | atom:author/atom:name | atom:contributor/atom:name | dc:creator | dc:contributor', $element);
774 29 100       1207 @authors = map { xml_escape strip_html($_->to_literal) } $xpc->findnodes(
  10         512  
775             '/atom:feed/atom:author/atom:name | '
776             . '/atom:feed/atom:contributor/atom:name | '
777             . '/rss/channel/dc:creator | '
778             . '/rss/channel/dc:contributor | '
779             . '/rss/channel/webMaster ', $element) unless @authors;
780 29 100       837 $entry->{authors} = @authors ? \@authors : undef; # key must exist in the hash
781 29 100       272 if (DateTime->compare($entry->{date}, $undefined_date) == 0) {
782 5         404 $entry->{day} = "(no date found)";
783             } else {
784 24         2146 $entry->{day} = $entry->{date}->clone->set_time_zone('UTC')->ymd; # operate on a clone
785             }
786 29         4430 my @categories = map { xml_escape strip_html($_->to_literal) } $xpc->findnodes('category | atom:category/@term', $element);
  6         278  
787 29 100       1069 $entry->{categories} = @categories ? \@categories : undef; # key must exist in the hash
788 29         150 $entry->{excerpt} = '';
789 29         73 $entry->{content} = '';
790 29         86 my @nodes = $xpc->findnodes('description[text()!=""] | atom:content[text()!=""]', $element);
791 29 100       1235 @nodes = $xpc->findnodes('summary[text()!=""] | atom:summary[text()!=""] | itunes:summary[text()!=""]', $element) unless @nodes;
792 29         378 my $content = shift(@nodes);
793             # The default is that the content is either plain text or escaped HTML,
794             # which is what we want for RSS. For Atom feeds, type="xhtml" means that the
795             # content is XHTML, so it needs to be escaped.
796 29 100       96 if ($content) {
797 23   100     315 my $is_xhtml = $content->hasAttribute("type") && $content->getAttribute("type") eq "xhtml";
798 23         267 $entry->{excerpt} = excerpt($content->to_literal);
799 23         157 for my $child ($content->childNodes()) {
800 27         745 my $c = $child->toString();
801 27 100       106 $c = xml_escape $c if $is_xhtml;
802 27         276 $entry->{content} .= $c;
803             }
804             }
805             }
806             }
807              
808             sub excerpt {
809 23     23   62 my $content = shift;
810 23 50       69 return '(no excerpt)' unless $content;
811 23         36 my $doc = eval { XML::LibXML->load_html(recover => 2, string => $content) };
  23         119  
812 23         6956 my $separator = "¶";
813 23         109 for my $node ($doc->findnodes('//style')) {
814 1         60 $node->parentNode->removeChild($node);
815             }
816 23         937 for my $node ($doc->findnodes('//p | //br | //blockquote | //li | //td | //th | //div')) {
817 23         957 $node->appendTextNode($separator);
818             }
819 23         81 my $text = strip_html($doc->textContent());
820 23         403 $text =~ s/( +|----+)/ /g;
821             # collapse whitespace and trim
822 23         290 $text =~ s/\s+/ /g;
823 23         153 $text = trim $text;
824             # replace paragraph repeats with their surrounding spaces
825 23         502 $text =~ s/ ?¶( ?¶)* ?/¶/g;
826 23         69 $text =~ s/^¶//;
827 23         106 $text =~ s/¶$//;
828 23         71 my $len = length($text);
829 23         85 $text = substr($text, 0, 500);
830 23 50       80 $text .= "…" if $len > 500;
831 23         78 $text = xml_escape $text;
832 23         279 $text =~ s/¶/<span class="paragraph">¶ <\/span>/g;
833 23         142 return $text;
834             }
835              
836             # When there's a value that's supposed to be text but isn't, then we can try to
837             # turn it to HTML and from there to text... This is an ugly hack and I wish it
838             # wasn't necessary.
839             sub strip_html {
840 46     46   582 my $str = shift;
841 46 50       118 return '' unless $str;
842 46         66 my $doc = eval { XML::LibXML->load_html(string => $str) };
  46         162  
843 46 100       12223 return $str unless $doc;
844 44         728 return $doc->textContent();
845             }
846              
847             =head1 SEE ALSO
848              
849             OPML 2.0, L<http://dev.opml.org/spec2.html>
850              
851             RSS 2.0, L<https://cyber.harvard.edu/rss/rss.html>
852              
853             Atom Syndication, L<https://tools.ietf.org/html/rfc4287>
854              
855             River of News,
856             L<http://scripting.com/2014/06/02/whatIsARiverOfNewsAggregator.html>
857              
858             =cut
859              
860             1;