File Coverage

script/jupiter
Criterion Covered Total %
statement 331 352 94.0
branch 58 84 69.0
condition 38 50 76.0
subroutine 49 52 94.2
pod n/a
total 476 538 88.4


line stmt bran cond sub pod time code
1             #! /usr/bin/env perl
2             # Planet Jupiter is a feed aggregator that creates a single HTML file
3             # Copyright (C) 2020–2021 Alex Schroeder <alex@gnu.org>
4              
5             # This program is free software: you can redistribute it and/or modify
6             # it under the terms of the GNU Affero General Public License as published by
7             # the Free Software Foundation, either version 3 of the License, or
8             # (at your option) any later version.
9             #
10             # This program is distributed in the hope that it will be useful,
11             # but WITHOUT ANY WARRANTY; without even the implied warranty of
12             # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
13             # GNU Affero General Public License for more details.
14             #
15             # You should have received a copy of the GNU Affero General Public License
16             # along with this program. If not, see <https://www.gnu.org/licenses/>.
17              
18             package Jupiter;
19              
20 15     15   5310177 use Encode::Locale;
  15         63904  
  15         717  
21 15     15   128 use Encode;
  15         43  
  15         1287  
22 15     15   666 use utf8;
  15         43  
  15         82  
23              
24             binmode(STDOUT, ":utf8");
25             binmode(STDERR, ":utf8");
26              
27             =encoding UTF8
28              
29             =head1 NAME
30              
31             jupiter - turn a list of feeds into an HTML page, a river of news
32              
33             =head1 SYNOPSIS
34              
35             To update the feeds from one or more OPML files:
36              
37             B<jupiter update> I<feed.opml> … [I</regex/> …]
38              
39             To generate F<index.html>:
40              
41             B<jupiter html> I<feed.opml>
42              
43             =head1 DESCRIPTION
44              
45             Planet Jupiter is used to pull together the latest updates from a bunch of other
46             sites and display them on a single web page, the "river of news". The sites we
47             get our updates from are defined in an OPML file.
48              
49             A river of news, according to Dave Winer, is a feed aggregator. New items appear
50             at the top and old items disappear at the bottom. When it's gone, it's gone.
51             There is no count of unread items. The goal is to fight the I<fear of missing
52             out> (FOMO).
53              
54             Each item looks similar to every other: headline, link, an extract, maybe a date
55             and an author. Extracts contain only the beginning of the article's text; all
56             markup is removed; there are no images. The goal is to make the page easy to skim.
57             Scroll down until you find something interesting and follow the link to the
58             original article if you want to read it.
59              
60             =head2 The OPML file
61              
62             You B<need> an OPML file. It's an XML file linking to I<feeds>. Here's an
63             example listing just one feed. In order to add more, add more C<outline>
64             elements with the C<xmlUrl> attribute. The exact order and nesting do not
65             matter. People can I<import> these OPML files into their own feed readers, so
66             it may be worth spending a bit more effort to make the file presentable.
67              
68             <opml version="2.0">
69               <body>
70                 <outline title="Alex Schroeder"
71                          xmlUrl="https://alexschroeder.ch/wiki?action=rss"/>
72               </body>
73             </opml>
74              
75             =head2 Update the feeds in your cache
76              
77             This is how you update the feeds in a file called C<feed.opml>. It downloads all
78             the feeds linked to in the OPML file and stores them in the cache directory.
79              
80             jupiter update feed.opml
81              
82             The directory used to keep a copy of all the feeds in the OPML file has the same
83             name as the OPML file but without the .opml extension. In other words, if your
84             OPML file is called C<feed.opml> then the cache directory is called C<feed>.
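
To illustrate, here is a minimal sketch of how the cache directory and the
name of a feed's cache file are derived, using the same modules the script
itself uses (the feed URL is made up):

    use File::Basename qw(fileparse);
    use Mojo::Util qw(slugify);

    my ($name, $path) = fileparse('feed.opml', '.opml');
    my $cache_dir  = "$path/$name";   # the "feed" directory next to the OPML file
    # the cache file is the cache directory plus a slug of the feed URL
    my $cache_file = "$cache_dir/" . slugify('https://example.org/feed.rss');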
85              
86             This operation takes a long time because it requests an update from all the sites
87             listed in your OPML file. Don't run it too often or you'll annoy the site
88             owners.
89              
90             The OPML file must use the .opml extension. You can update the feeds for
91             multiple OPML files in one go.
92              
93             =head2 Adding just one feed
94              
95             After a while, the list of feeds in your OPML starts getting unwieldy. When you
96             add a new feed, you might not want to fetch all of them. In this case, provide a
97             regular expression surrounded by slashes to the C<update> command:
98              
99             jupiter update feed.opml /example/
100              
101             Assuming a feed with a URL or title that matches the regular expression is
102             listed in your OPML file, only that feed is going to get updated.
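
Under the hood, each filter is applied to both the feed URL and the feed title
from the OPML file, roughly like this (a sketch only; C<@filters> holds the
regular expressions given on the command line, and C<@feeds> and C<@keep> are
illustrative names):

    # keep a feed if any filter matches its URL or its title
    my @keep = grep {
      my $feed = $_;
      grep { $feed->{url} =~ /$_/ or $feed->{title} =~ /$_/ } @filters;
    } @feeds;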
103              
104             There is no need to escape slashes in the regular expression: C<//rss/> works
105             just fine. Beware shell escaping, however. Most likely, you need to surround the
106             regular expression with single quotes if it contains spaces:
107              
108             jupiter update feed.opml '/Halberds & Helmets/'
109              
110             Notice how we assume that named entities such as C<&amp;> have already been
111             parsed into the appropriate strings.
112              
113             =head2 Generate the HTML
114              
115             This is how you generate the C<index.html> file based on the feeds of your
116             C<feed.opml>. It assumes that you have already updated all the feeds (see
117             above).
118              
119             jupiter html feed.opml
120              
121             See L</OPTIONS> for ways to change how the HTML is generated.
122              
123             =head2 Generate the RSS feed
124              
125             This happens at the same time as the HTML generation. It takes all the
126             entries that are being added to the HTML and puts them into a feed.
127              
128             See L</OPTIONS> for ways to change how the feed is generated.
129              
130             =head2 Why separate the two steps?
131              
132             The first reason is that tinkering with the templates involves running the
133             program again and again, and you don't want to contact all the sites whenever
134             you update your templates.
135              
136             The other reason is that it allows you to create subsets. For example, you can
137             fetch the feeds for three different OPML files:
138              
139             jupiter update osr.opml indie.opml other.opml
140              
141             And then you can create three different HTML files:
142              
143             jupiter html osr.html osr.opml
144             jupiter html indie.html indie.opml
145             jupiter html rpg.html osr.opml indie.opml other.opml
146              
147             For an example of how it might look, check out the setup for the planets I run:
148             L<https://alexschroeder.ch/cgit/planet/about/>
149              
150             =head2 What about the JSON file?
151              
152             There's a JSON file that gets generated and updated as you run Planet Jupiter.
153             Its name depends on the OPML files used. It records metadata for every feed in
154             the OPML file that isn't stored in the feeds themselves.
155              
156             C<message> is the HTTP status message, or a similar message such as "No entry
157             newer than 90 days." This is set when you update the feeds in your cache.
158              
159             C<code> is the HTTP status code; this code could be the real status code from
160             the server (such as 404 for a "not found" status) or one generated by Jupiter
161             such that it matches the status message (such as 206 for a "partial content"
162             status when there aren't any recent entries in the feed). This is set when you
163             update the feeds in your cache.
164              
165             C<title> is the site's title. When you update the feeds in your cache, it is
166             taken from the OPML file. That's how the feed can have a title even if the
167             download failed. When you generate the HTML, the feeds in the cache are parsed
168             and if a title is provided, it is stored in the JSON file and overrides the
169             title in the OPML file.
170              
171             C<link> is the site's link for humans. When you generate the HTML, the feeds in
172             the cache are parsed and if a link is provided, it is stored in the JSON file.
173             If the OPML element contained a C<htmlUrl> attribute, however, that takes
174             precedence. The reasoning: if a podcast is hosted on a platform that generates
175             a link you don't like, and you know the link to the human-readable blog
176             elsewhere, use the C<htmlUrl> attribute in the OPML file to override it.
177              
178             C<last_modified> and C<etag> are two caching headers taken from the HTTP
179             response; they cannot be changed by data in the feed.
180              
181             If we run into problems downloading a feed, this setup allows us to still link
182             to the feeds that aren't working, using their correct names, and describing the
183             error we encountered.
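
For illustration, here is a sketch of how you could inspect the metadata for
one feed from such a JSON file; the file name and feed URL are made up, and
the keys are the ones described above:

    use File::Slurper qw(read_binary);
    use Mojo::JSON qw(decode_json);

    # the JSON file sits next to feed.opml and is keyed by feed URL
    my $data = decode_json read_binary('feed.json');
    my $info = $data->{'https://example.org/feed.rss'};
    printf "%s (%s): %s\n", $info->{title} // '?',
        $info->{code} // '-', $info->{message} // '-';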
184              
185             =head2 Logging
186              
187             Use the C<--log=LEVEL> option to set the log level. Valid values for LEVEL are debug,
188             info, warn, error, and fatal.
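
For example, to see what the program is doing while updating:

    jupiter --log=debug update feed.opml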
189              
190             =head1 LICENSE
191              
192             GNU Affero General Public License
193              
194             =head1 INSTALLATION
195              
196             Using C<cpan>:
197              
198             cpan App::jupiter
199              
200             Manual install:
201              
202             perl Makefile.PL
203             make
204             make install
205              
206             =head2 Dependencies
207              
208             To run Jupiter on Debian we need:
209              
210             C<libmodern-perl-perl> for L<Modern::Perl>
211              
212             C<libmojolicious-perl> for L<Mojo::Template>, L<Mojo::UserAgent>, L<Mojo::Log>,
213             L<Mojo::JSON>, and L<Mojo::Util>
214              
215             C<libxml-libxml-perl> for L<XML::LibXML>
216              
217             C<libfile-slurper-perl> for L<File::Slurper>
218              
219             C<libdatetime-perl> for L<DateTime>
220              
221             C<libdatetime-format-mail-perl> for L<DateTime::Format::Mail>
222              
223             C<libdatetime-format-iso8601-perl> for L<DateTime::Format::ISO8601>
224              
225             Unfortunately, L<Mojo::UserAgent::Role::Queued> isn't packaged for Debian.
226             Therefore, let's build it and install it as a Debian package.
227              
228             sudo apt-get install libmodule-build-tiny-perl
229             sudo apt-get install dh-make-perl
230             sudo dh-make-perl --build --cpan Mojo::UserAgent::Role::Queued
231             dpkg --install libmojo-useragent-role-queued-perl_1.15-1_all.deb
232              
233             To generate the C<README.md> from the source file, you need F<pod2markdown>,
234             which is part of the C<libpod-markdown-perl> package.
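
Presumably something like the following does the trick (the exact path
depends on your checkout):

    pod2markdown script/jupiter > README.md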
235              
236             =head1 FILES
237              
238             There are a number of files in the F<share> directory which you can use as
239             starting points.
240              
241             F<template.html> is the HTML template.
242              
243             F<default.css> is a small CSS file used by F<template.html>.
244              
245             F<personalize.js> is a small JavaScript file used by F<template.html> that
246             allows visitors to jump from one article to the next using C<J> and C<K>.
247              
248             F<jupiter.png> is used by F<template.html> as the icon.
249              
250             F<jupiter.svg> is used by F<template.html> as the logo.
251              
252             F<feed.png> is used by F<template.html> as the icon for the feeds in the
253             sidebar.
254              
255             F<feed.rss> is the feed template.
256              
257             =head1 OPTIONS
258              
259             HTML generation uses a template, C<template.html>. It is written for
260             C<Mojo::Template> and you can find it in the F<share> directory of your
261             distribution. The default templates use other files, such as the logo, the feed
262             icon, a CSS file, and a small JavaScript snippet to enable navigation using the
263             C<J> and C<K> keys (see above).
264              
265             You can specify a different HTML file to generate:
266              
267             B<jupiter html> I<your.html feed.opml>
268              
269             If you specify two HTML files, the first is the HTML file to generate and the
270             second is the template to use. Both must use the C<.html> extension.
271              
272             B<jupiter html> I<your.html your-template.html feed.opml>
273              
274             Feed generation uses a template, C<feed.rss>. It writes all the entries into a
275             file called C<feed.xml>. Again, the template is written for C<Mojo::Template>.
276              
277             You can specify up to two XML, RSS, or Atom files. They must use one of these
278             three extensions: C<.xml>, C<.rss>, or C<.atom>. The first is the name of the
279             feed to generate, the second is the template to use:
280              
281             B<jupiter html> I<atom.xml template.xml planet.html template.html feed.opml>
282              
283             In the above case, Planet Jupiter will write a feed called F<atom.xml> based on
284             F<template.xml> and an HTML file called F<planet.html> based on F<template.html>,
285             using the cached entries matching the feeds in F<feed.opml>.
286              
287             =cut
288              
289 15     15   14037 use DateTime;
  15         6905971  
  15         802  
290 15     15   8386 use DateTime::Format::Mail;
  15         85456  
  15         606  
291 15     15   9241 use DateTime::Format::ISO8601;
  15         6593858  
  15         878  
292 15     15   167 use File::Basename;
  15         32  
  15         1353  
293 15     15   700 use File::Slurper qw(read_binary write_binary read_text write_text);
  15         3465  
  15         1040  
294 15     15   115 use List::Util qw(uniq min shuffle);
  15         36  
  15         1157  
295 15     15   127 use Modern::Perl;
  15         46  
  15         207  
296 15     15   14477 use Mojo::Log;
  15         398978  
  15         189  
297 15     15   1246 use Mojo::JSON qw(decode_json encode_json);
  15         19516  
  15         982  
298 15     15   8119 use Mojo::Template;
  15         65861  
  15         99  
299 15     15   9139 use Mojo::UserAgent;
  15         436255  
  15         135  
300 15     15   9604 use Pod::Simple::Text;
  15         82213  
  15         687  
301 15     15   10188 use XML::LibXML;
  15         455640  
  15         112  
302 15     15   2430 use Mojo::Util qw(slugify trim xml_escape html_unescape);
  15         48  
  15         1024  
303 15     15   146 use File::ShareDir 'dist_file';
  15         34  
  15         711  
304              
305 15     15   94 use vars qw($log);
  15         46  
  15         85135  
306             our $log = Mojo::Log->new;
307              
308             my $xpc = XML::LibXML::XPathContext->new;
309             $xpc->registerNs('atom', 'http://www.w3.org/2005/Atom');
310             $xpc->registerNs('html', 'http://www.w3.org/1999/xhtml');
311             $xpc->registerNs('dc', 'http://purl.org/dc/elements/1.1/');
312             $xpc->registerNs('itunes', 'http://www.itunes.com/dtds/podcast-1.0.dtd');
313              
314             my $undefined_date = DateTime->from_epoch( epoch => 0 );
315              
316             my (%wday, %month, $wday_re, $month_re);
317             %wday = qw (lun. Mon mar. Tue mer. Wed jeu. Thu ven. Fri sam. Sat dim. Sun);
318             %month = qw (janv. Jan févr. Feb mars Mar avr. Apr mai May juin Jun
319             juil. Jul août Aug sept. Sep oct. Oct nov. Nov déc. Dec);
320             $wday_re = join('|', map { quotemeta } keys %wday) unless $wday_re;
321             $month_re = join('|', map { quotemeta } keys %month) unless $month_re;
322              
323             # Our tests don't want to call main
324             __PACKAGE__->main unless caller;
325              
326             sub main {
327 0     0   0 my ($log_level) = grep /^--log=/, @ARGV;
328 0 0       0 $log->level(substr($log_level, 6)) if $log_level;
329 0         0 my ($command) = grep /^[a-z]+$/, @ARGV;
330 0   0     0 $command ||= 'help';
331 0 0       0 if ($command eq 'update') {
    0          
332 0         0 update_cache(@ARGV);
333             } elsif ($command eq 'html') {
334 0         0 make_html(@ARGV);
335             } else {
336 0         0 my $parser = Pod::Simple::Text->new();
337 0         0 $parser->parse_file($0);
338             }
339             }
340              
341             sub update_cache {
342 18     18   1337878 my ($feeds, $files) = read_opml(@_);
343 18         100 make_directories($feeds);
344 18         158 load_feed_metadata($feeds, $files);
345 18         272 my $ua = Mojo::UserAgent->new->with_roles('+Queued')
346             ->max_redirects(3)
347             ->max_active(5);
348 18         88165 make_promises($ua, $feeds);
349 18         21937 fetch_feeds($feeds);
350 18         21680 save_feed_metadata($feeds, $files);
351 18         9111 cleanup_cache($feeds);
352             }
353              
354             sub make_promises {
355 18     18   62 my $ua = shift;
356 18         44 my $feeds = shift;
357 18         59 for my $feed (@$feeds) {
358 18         104 my $url = html_unescape $feed->{url}; # undo xml_escape for the request
359             $ua->on(start => sub {
360 18     18   17851 my ($ua, $tx) = @_;
361 18 50       104 $tx->req->headers->if_none_match($feed->{etag}) if ($feed->{etag});
362 18 50       144 $tx->req->headers->if_modified_since($feed->{last_modified}) if ($feed->{last_modified});
363 18         575 });
364             $feed->{promise} = $ua->get_p($url)
365             ->catch(sub {
366 0     0   0 $feed->{message} = "@_";
367 0         0 $feed->{code} = 521;
368             # returning 0 in the case of an error is important
369 0         0 0; })
370             # sleeping to stop blogger.com from blocking us
371 18     18   275 ->finally(sub { $log->debug($url); sleep 2; });
  18         286351  
  18         36004949  
372             }
373             }
374              
375             sub fetch_feeds {
376 18     18   64 my $feeds = shift;
377 18         544 $log->info("Fetching feeds...");
378 18         118 Mojo::Promise->all(map { $_->{promise} } @$feeds)->then(sub {
379             # all returns the values in the same order!
380 18     18   24507 for (my $i = 0; $i < @_; $i++) {
381 18         116 my $feed = $feeds->[$i];
382 18         78 my $value = $_[$i];
383 18         73 my $tx = $value->[0];
384             # relies on catch returning 0 above
385 18 50       218 next unless $tx;
386 18         310 $feed->{message} = $tx->result->message;
387 18         1651 $feed->{code} = $tx->result->code;
388 18         940 $feed->{last_modified} = $tx->result->headers->last_modified;
389 18         1514 $feed->{etag} = $tx->result->headers->etag;
390             # save raw bytes if this is a success
391 18 50       1013 eval { write_binary($feed->{cache_file}, $tx->result->body) } if $tx->result->is_success;
  18         1568  
392 18 50       9852 warn "Unable to write $feed->{cache_file}: $@\n" if $@;
393             }
394             })->catch(sub {
395 0     0   0 warn "Something went wrong: @_";
396 18         230 })->wait;
397             }
398              
399             sub load_feed_metadata {
400 37     37   115 my $feeds = shift;
401 37         76 my $files = shift;
402 37         157 for my $file (@$files) {
403 37         198 my $filename = "$file->{path}/$file->{name}";
404 37 100       932 next unless -r "$filename.json";
405 23         308 my $data = decode_json read_binary("$filename.json");
406 23         10604 for my $feed (@$feeds) {
407 22         85 my $url = $feed->{url};
408 22 50       97 next unless $url;
409             # don't overwrite title and htmlUrl from OPML file
410 22 100       177 $feed->{title} = $data->{$url}->{title} if $data->{$url}->{title};
411 22   100     197 $feed->{link} ||= $data->{$url}->{link};
412             # all the other metadata is loaded from the JSON file
413 22         78 $feed->{message} = $data->{$url}->{message};
414 22         134 $feed->{code} = $data->{$url}->{code};
415 22         82 $feed->{last_modified} = $data->{$url}->{last_modified};
416 22         89 $feed->{etag} = $data->{$url}->{etag};
417 23         96 } grep { $_->{opml_file} eq $file->{file} } @$feeds;
  22         152  
418             }
419             }
420              
421             sub save_feed_metadata {
422 37     37   979 my $feeds = shift;
423 37         106 my $files = shift;
424 37         126 for my $file (@$files) {
425 37         143 my $name = $file->{name};
426             my %messages = map {
427 36         106 my $feed = $_;
428 36         131 $feed->{url} => { map { $_ => $feed->{$_} } grep { $feed->{$_} } qw(title link message code last_modified etag) };
  131         670  
  216         501  
429 37         119 } grep { $_->{opml_file} eq $file->{file} } @$feeds;
  36         212  
430 37         388 write_binary("$file->{path}/$file->{name}.json", encode_json \%messages);
431             }
432             }
433              
434             sub cleanup_cache {
435 18     18   81 my $feeds = shift;
436 18         64 my %good = map { $_ => 1 } @{cache_files($feeds)};
  18         153  
  18         111  
437 18         75 my @unused = grep { not $good{$_} } @{existing_files($feeds)};
  18         165  
  18         105  
438 18 50       661 if (@unused) {
439 0         0 $log->info("Removing unused files from the cache...");
440 0         0 foreach (@unused) { $log->info($_) }
  0         0  
441 0         0 unlink @unused;
442             }
443             }
444              
445             sub existing_files {
446 18     18   67 my $feeds = shift;
447 18         50 my @files;
448 18         73 for my $dir (uniq map { $_->{cache_dir} } @$feeds) {
  18         161  
449 18 50 33     3466 push(@files, <"$dir/*">) if $dir and -d $dir;
450             }
451 18         228 return \@files;
452             }
453              
454             sub cache_files {
455 18     18   64 my $feeds = shift;
456 18         94 my @files = map { $_->{cache_file} } @$feeds;
  18         148  
457 18         93 return \@files;
458             }
459              
460             sub make_directories {
461 18     18   41 my $feeds = shift;
462 18         53 for my $dir (uniq map { $_->{cache_dir} } @$feeds) {
  18         159  
463 18 100 66     670 if ($dir and not -d $dir) {
464 14         3052 mkdir $dir;
465             }
466             }
467             }
468              
469             sub make_html {
470 19     19   40041 my ($feeds, $files) = read_opml(@_);
471 19         132 load_feed_metadata($feeds, $files); # load messages and codes for feeds
472 19         118 my $globals = globals($files);
473 19         135 my $entries = entries($feeds, 4); # set data for feeds, too
474 19         109 add_data($feeds, $entries); # extract data from the xml
475 19         266 save_feed_metadata($feeds, $files); # save title and link for feeds
476 19         26412 $entries = limit($entries, 100);
477 19         97 write_text(html_file(@_), apply_template(read_text(html_template_file(@_)), $globals, $feeds, $entries));
478 19         226270 write_text(feed_file(@_), apply_template(read_text(feed_template_file(@_)), $globals, $feeds, $entries));
479             }
480              
481             sub html_file {
482 19     19   178 my ($html) = grep /\.html$/, @_;
483 19   50     164 return $html || 'index.html';
484             }
485              
486             sub html_template_file {
487 19     19   165 my ($html, $template) = grep /\.html$/, @_;
488 19   33     225 $template ||= dist_file('App-jupiter', 'template.html');
489 19 50       5353 die "HTML template $template not found\n" unless -r $template;
490 19         177 return $template;
491             }
492              
493             sub feed_file {
494 19     19   313 my ($feed) = grep /\.(xml|rss|atom)$/, @_;
495 19 50       199 return $feed if $feed;
496 0         0 return 'feed.xml';
497             }
498              
499             sub feed_template_file {
500 19     19   148 my ($feed, $template) = grep /\.(xml|rss|atom)$/, @_;
501 19 50       79 return $template if $template;
502 19         112 return dist_file('App-jupiter', 'feed.rss');
503             }
504              
505             sub apply_template {
506 38     38   8450 my $mnt = Mojo::Template->new;
507 38         475 return $mnt->render(@_);
508             }
509              
510             =head1 TEMPLATES
511              
512             The page template is called with three references: C<globals> (a hash), C<feeds>
513             (a list of hashes), and C<entries> (a list of hashes). Their keys are documented
514             below. The values of these hashes are all I<escaped HTML> except where noted
515             (dates and file names, for example).
516              
517             The technical details of how to write the templates are documented in the man
518             page for L<Mojo::Template>.
519              
520             =head2 Globals
521              
522             There are not many global keys.
523              
524             B<date> is the publication date of the HTML page, in ISO date format:
525             YYYY-MM-DD.
526              
527             B<files> is the list of OPML files used.
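
As a small sketch (not the actual F<template.html>), a template might pick up
these values like this; see L<Mojo::Template> for the syntax:

    % my ($globals, $feeds, $entries) = @_;
    <p>Generated on <%= $globals->{date} %>
    from <%= scalar @{$globals->{files}} %> OPML file(s).</p>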
528              
529             =cut
530              
531             sub globals {
532 19     19   71 my $files = shift;
533 19         192 my @time = gmtime;
534 19         285 my $today = DateTime->now->ymd;
535 19         11292 return {date => $today, files => $files};
536             }
537              
538             =head2 Writing templates for feeds
539              
540             Feeds have the following keys available:
541              
542             B<title> is the title of the feed.
543              
544             B<url> is the URL of the feed (RSS or Atom). This is not the link to the site!
545              
546             B<link> is the URL of the web page (HTML). This is the link to the site.
547              
548             B<opml_file> is the file name where this feed is listed.
549              
550             B<cache_dir> is the directory where this feed is cached.
551              
552             B<message> is the HTTP status message or other warning or error that we got
553             while fetching the feed.
554              
555             B<code> is the HTTP status code we got while fetching the feed.
556              
557             B<doc> is the L<XML::LibXML::Document>. Could be either Atom or RSS!
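
A sidebar fragment using these keys might look like the following sketch
(again, not the actual template):

    % for my $feed (@$feeds) {
    <li><a href="<%= $feed->{link} %>"><%= $feed->{title} %></a>
    % if ($feed->{code} && $feed->{code} != 200) {
    (<%= $feed->{message} %>)
    % }
    </li>
    % }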
558              
559             =cut
560              
561             # Creates list of feeds. Each feed is a hash with keys title, url, opml_file,
562             # cache_dir and cache_file.
563             sub read_opml {
564 37     37   159 my (@feeds, @files);
565 37         323 my @filters = map { decode(locale => substr($_, 1, -1)) } grep /^\/.*\/$/, @_;
  2         16  
566 37         678 for my $file (grep /\.opml$/, @_) {
567 37         443 my $doc = XML::LibXML->load_xml(location => $file); # this better have no errors!
568 37         23007 my @nodes = $doc->findnodes('//outline[./@xmlUrl]');
569 37         7219 my ($name, $path) = fileparse($file, '.opml', '.xml');
570 37         219 for my $node (@nodes) {
571 37         270 my $title = xml_escape $node->getAttribute('title');
572 37         1454 my $url = xml_escape $node->getAttribute('xmlUrl');
573 37 100 100     910 next if @filters > 0 and not grep { $url =~ /$_/ or $title =~ /$_/ } @filters;
  2 100       143  
574 36         146 my $link = xml_escape $node->getAttribute('htmlUrl');
575 36         902 push @feeds, {
576             title => $title, # title in the OPML file
577             url => $url, # feed URL in the OPML file
578             link => $link, # web URL in the OPML file
579             opml_file => $file,
580             cache_dir => "$path/$name",
581             cache_file => "$path/$name/" . slugify($url),
582             };
583             }
584 37 50       2340 warn "No feeds found in the OPML file $file\n" unless @nodes;
585 37         400 push @files, { file => $file, path => $path, name => $name };
586             }
587 37         2172 @feeds = shuffle @feeds;
588 37         155 return \@feeds, \@files;
589             }
590              
591             sub entries {
592 19     19   51 my $feeds = shift;
593 19         57 my $limit = shift;
594 19         103 my $date = DateTime->now(time_zone => 'UTC')->subtract( days => 90 ); # compute once
595 19         40459 my $now = DateTime->now(time_zone => 'UTC');
596 19         6659 my @entries;
597 19         90 for my $feed (@$feeds) {
598 18 50       566 next unless -r $feed->{cache_file};
599 18         90 my $doc = eval { XML::LibXML->load_xml(recover => 2, location => $feed->{cache_file} )};
  18         237  
600 18 50       8673 if (not $doc) {
601 0         0 $feed->{message} = xml_escape "Parsing error: $@";
602 0         0 $feed->{code} = 422; # unprocessable
603 0         0 next;
604             }
605 18         291 $feed->{doc} = $doc;
606 18         206 my @nodes = $xpc->findnodes("/rss/channel/item | /atom:feed/atom:entry", $doc);
607 18 100       1517 if (not @nodes) {
608 2         9 $feed->{message} = "Empty feed";
609 2         10 $feed->{code} = 204; # no content
610 2         8 next;
611             }
612             # if this is an Atom feed, we need to sort the entries ourselves (older entries at the end)
613             my @candidates = map {
614 16         53 my $entry = {};
  40         90  
615 40         99 $entry->{element} = $_;
616 40         134 $entry->{id} = id($_);
617 40   66     131 $entry->{date} = updated($_) || $undefined_date;
618 40         1146 $entry;
619             } @nodes;
620 16         71 @candidates = grep { DateTime->compare($_->{date}, $now) <= 0 } @candidates;
  40         1862  
621 16         1801 @candidates = unique(sort { DateTime->compare( $b->{date}, $a->{date} ) } @candidates);
  38         1865  
622 16         212 @candidates = @candidates[0 .. min($#candidates, $limit - 1)];
623             # now that we have limited the candidates, let's add more metadata from the feed
624 16         56 for my $entry (@candidates) {
625 30         138 $entry->{feed} = $feed;
626             # these two are already escaped
627 30         69 $entry->{blog_title} = $feed->{title};
628 30         76 $entry->{blog_url} = $feed->{url};
629             }
630 16         100 add_age_warning($feed, \@candidates, $date);
631 16         455 push @entries, @candidates;
632             }
633 19         243 return \@entries;
634             }
635              
636             sub add_age_warning {
637 16     16   39 my $feed = shift;
638 16         51 my $entries = shift;
639 16         33 my $date = shift;
640             # feed modification date is smaller than the date given
641 16         84 my ($node) = $xpc->findnodes("/rss/channel | /atom:feed", $feed->{doc});
642 16         831 my $feed_date = updated($node);
643 16 100 100     385 if ($feed_date and DateTime->compare($feed_date, $date) == -1) {
644 11         961 $feed->{message} = "No feed updates in 90 days";
645 11         37 $feed->{code} = 206; # partial content
646 11         98 return;
647             } else {
648             # or no entry found with a modification date equal or bigger than the date given
649 5         229 for my $entry (@$entries) {
650 11 100       396 return if DateTime->compare($entry->{date}, $date) >= 0;
651             }
652 4         273 $feed->{message} = "No entry newer than 90 days";
653 4         19 $feed->{code} = 206; # partial content
654             }
655             }
656              
657             sub updated {
658 113     113   27354 my $node = shift;
659 113 50       397 return unless $node;
660 113 100       797 my @nodes = $xpc->findnodes('pubDate | atom:published | atom:updated', $node) or return;
661 106         5726 my $date = $nodes[0]->textContent;
662             my $dt = eval { DateTime::Format::Mail->parse_datetime($date) }
663             || eval { DateTime::Format::ISO8601->parse_datetime($date) }
664 106   100     185 || eval { DateTime::Format::Mail->parse_datetime(french($date)) };
665 106         75439 return $dt;
666             }
667              
668             sub french {
669 60     60   105 my $date = shift;
670 60         656 $date =~ s/^($wday_re)/$wday{$1}/;
671 60         564 $date =~ s/\b($month_re)/$month{$1}/;
672 60         273 return $date;
673             }
674              
675             sub id {
676 40     40   77 my $node = shift;
677 40 50       140 return unless $node;
678 40         344 my $id = $xpc->findvalue('guid | atom:id', $node); # id is mandatory for Atom
679 40   100     4576 $id ||= $node->findvalue('link'); # one of the following three is mandatory for RSS
680 40   100     2736 $id ||= $node->findvalue('title');
681 40   100     1697 $id ||= $node->findvalue('description');
682 40         857 return $id;
683             }
684              
685             sub unique {
686 35     35   535 my %seen;
687             my @unique;
688 35         87 for my $node (@_) {
689 69 100       220 next if $seen{$node->{id}};
690 61         149 $seen{$node->{id}} = 1;
691 61         144 push(@unique, $node);
692             }
693 35         184 return @unique;
694             }
695              
696             sub limit {
697 19     19   73 my $entries = shift;
698 19         41 my $limit = shift;
699             # we want the most recent entries overall
700 19         73 @$entries = sort { DateTime->compare( $b->{date}, $a->{date} ) } unique(@$entries);
  18         799  
701 19         638 return [@$entries[0 .. min($#$entries, $limit - 1)]];
702             }
703              
704             =head2 Writing templates for entries
705              
706             Entries have the following keys available:
707              
708             B<title> is the title of the post.
709              
710             B<link> is the URL to the post on the web (probably an HTML page).
711              
712             B<blog_title> is the title of the site.
713              
714             B<blog_link> is the URL for the site on the web (probably an HTML page).
715              
716             B<blog_url> is the URL for the site's feed (RSS or Atom).
717              
718             B<authors> are the authors (or the Dublin Core contributor), a list of strings.
719              
720             B<date> is the publication date, as a DateTime object.
721              
722             B<day> is the publication date, in ISO date format: YYYY-MM-DD, for the UTC
723             timezone. The UTC timezone is picked so that the day doesn't jump back and forth
724             when sorting entries by date.
725              
726             B<content> is the full post content, as a string or as encoded HTML.
727              
728             B<excerpt> is the post content, limited to 500 characters, with paragraph
729             separators instead of HTML elements, as HTML. It is not encoded because the idea
730             is that it only gets added to the HTML and not to the feed, and the HTML it
731             contains is very controlled (only the pilcrow sign inside a span to indicate
732             paragraph breaks).
733              
734             B<categories> are the categories, a list of strings.
735              
736             B<element> is for internal use only. It contains the L<XML::LibXML::Element>
737             object. This could be RSS or Atom!
738              
739             B<feed> is for internal use only. It's a reference to the feed this entry
740             belongs to.
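
Here is a sketch of what a loop over the entries might look like in a
template (not the actual F<template.html>). Since the C<Mojo::Template>
instance used by the script does not turn on auto-escaping, C<< <%= %> >>
inserts values exactly as they are, which is what we want for values that are
already escaped:

    % for my $entry (@$entries) {
    <div class="post">
      <h2><a href="<%= $entry->{link} %>"><%= $entry->{title} %></a></h2>
      <p class="meta"><%= $entry->{day} %>,
        <a href="<%= $entry->{blog_link} %>"><%= $entry->{blog_title} %></a></p>
      <p><%= $entry->{excerpt} || $entry->{content} %></p>
    </div>
    % }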
741              
742             =cut
743              
744             sub add_data {
745 19     19   45 my $feeds = shift;
746 19         38 my $entries = shift;
747             # A note on the use of xml_escape: whenever we get data from the feed itself,
748             # it needs to be escaped if it gets printed into the HTML. For example: the
749             # feed contains a feed title of "Foo &amp; Bar". findvalue returns "Foo &
750             # Bar". When the template inserts the title, however, we want "Foo &amp; Bar",
751             # not "Foo & Bar". Thus: any text we get from the feed needs to be escaped
752             # if there's a chance we're going to print it again.
753 19         58 for my $feed (@$feeds) {
754 18 50       66 next unless $feed->{doc};
755             # title and url in the feed overrides title and xmlUrl set in the OPML (XML already escaped)
756 18   0     187 $feed->{title} = xml_escape($xpc->findvalue('/rss/channel/title | /atom:feed/atom:title', $feed->{doc})) || $feed->{title} || "";
757 18   50     1964 $feed->{url} = xml_escape($xpc->findvalue('/atom:feed/atom:link[@rel="self"]/@href', $feed->{doc})) || $feed->{url} || "";
758             # link in the feed does not override htmlUrl in the OPML (XML already escaped)
759 18   100     1842 $feed->{link} = $feed->{link} || xml_escape($xpc->findvalue('/rss/channel/link | /atom:feed/atom:link[@rel="alternate"][@type="text/html"]/@href', $feed->{doc})) || "";
760             # if they just pasted a domain "foo" then "//foo" is a valid URL
761 18 100       1215 $feed->{link} = "//" . $feed->{link} unless $feed->{link} =~ /\/\//;
762             }
763 19         213 for my $entry (@$entries) {
764             # copy from the feed (XML is already escaped)
765 30         258 $entry->{blog_link} = $entry->{feed}->{link};
766 30         84 $entry->{blog_title} = $entry->{feed}->{title};
767 30         72 $entry->{blog_url} = $entry->{feed}->{url};
768             # parse the elements
769 30         53 my $element = $entry->{element};
770 30   100     102 $entry->{title} = xml_escape $xpc->findvalue('title | atom:title', $element) || "Untitled";
771 30         2565 my @links = map { xml_escape $_->to_literal } map { $xpc->findnodes($_, $element) }
  16         560  
  90         2826  
772             # sorted by preferences!
773             qw(link atom:link[@rel="alternate"][@type="text/html"]/@href atom:link/@href);
774 30   100     868 $entry->{link} = shift(@links) || "";
775 30         251 my @authors = map { xml_escape strip_html($_->to_literal) } $xpc->findnodes(
  8         403  
776             'author | atom:author/atom:name | atom:contributor/atom:name | dc:creator | dc:contributor', $element);
777 30 100       1678 @authors = map { xml_escape strip_html($_->to_literal) } $xpc->findnodes(
  10         503  
778             '/atom:feed/atom:author/atom:name | '
779             . '/atom:feed/atom:contributor/atom:name | '
780             . '/rss/channel/dc:creator | '
781             . '/rss/channel/dc:contributor | '
782             . '/rss/channel/webMaster ', $element) unless @authors;
783 30 100       988 $entry->{authors} = @authors ? \@authors : undef; # key must exist in the hash
784 30 100       290 if (DateTime->compare($entry->{date}, $undefined_date) == 0) {
785 5         437 $entry->{day} = "(no date found)";
786             } else {
787 25         2178 $entry->{day} = $entry->{date}->clone->set_time_zone('UTC')->ymd; # operate on a clone
788             }
789 30         4354 my @categories = map { xml_escape strip_html($_->to_literal) } $xpc->findnodes('category | atom:category/@term', $element);
  6         311  
790 30 100       1177 $entry->{categories} = @categories ? \@categories : undef; # key must exist in the hash
791 30         152 $entry->{excerpt} = '';
792 30         81 $entry->{content} = '';
793 30         104 my @nodes = $xpc->findnodes('description[text()!=""] | atom:content[text()!=""]', $element);
794 30 100       1436 @nodes = $xpc->findnodes('summary[text()!=""] | atom:summary[text()!=""] | itunes:summary[text()!=""]', $element) unless @nodes;
795 30         444 my $content = shift(@nodes);
796             # The default is that the content is either plain text or escaped HTML,
797             # which is what we want for RSS. For Atom feeds, type="xhtml" means that the
798             # content is XHTML, so it needs to be escaped.
799 30 100       101 if ($content) {
800 23   100     361 my $is_xhtml = $content->hasAttribute("type") && $content->getAttribute("type") eq "xhtml";
801 23         270 $entry->{excerpt} = excerpt($content->to_literal);
802 23         155 for my $child ($content->childNodes()) {
803 27         726 my $c = $child->toString();
804 27 100       99 $c = xml_escape $c if $is_xhtml;
805 27         278 $entry->{content} .= $c;
806             }
807             }
808             }
809             }
810              
811             sub excerpt {
812 23     23   54 my $content = shift;
813 23 50       82 return '(no excerpt)' unless $content;
814 23         50 my $doc = eval { XML::LibXML->load_html(recover => 2, string => $content) };
  23         119  
815 23         6816 my $separator = "¶";
816 23         101 for my $node ($doc->findnodes('//style | //script | //time | //*[contains(@class, "post-header")] | //*[contains(@class, "post-share-buttons")] | //*[contains(@class, "post-footer")] | //*[contains(@class, "comments")]'),
817             ){
818 1         127 $node->parentNode->removeChild($node);
819             }
820 23         1907 for my $node ($doc->findnodes('//p | //br | //blockquote | //li | //td | //th | //div | //h1 | //h2 | //h3 | //h4 | //h5 | //h6')) {
821 23         1284 $node->appendTextNode($separator);
822             }
823 23         83 my $text = strip_html($doc->textContent());
824 23         384 $text =~ s/( +|----+)/ /g;
825             # collapse whitespace and trim
826 23         274 $text =~ s/\s+/ /g;
827 23         116 $text = trim $text;
828             # replace paragraph repeats with their surrounding spaces
829 23         484 $text =~ s/ ?¶( ?¶)* ?/¶/g;
830 23         75 $text =~ s/^¶//;
831 23         107 $text =~ s/¶$//;
832 23         78 my $len = length($text);
833 23         78 $text = substr($text, 0, 500);
834 23 50       97 $text .= "…" if $len > 500;
835 23         68 $text = xml_escape $text;
836 23         291 $text =~ s/¶/<span class="paragraph">¶ <\/span>/g;
837 23         488 return $text;
838             }
839              
840             # When there's a value that's supposed to be text but isn't, then we can try to
841             # turn it to HTML and from there to text... This is an ugly hack and I wish it
842             # wasn't necessary.
843             sub strip_html {
844 47     47   985 my $str = shift;
845 47 50       123 return '' unless $str;
846 47         71 my $doc = eval { XML::LibXML->load_html(string => $str) };
  47         165  
847 47 100       12608 return $str unless $doc;
848 45         767 return $doc->textContent();
849             }
850              
851             =head1 SEE ALSO
852              
853             OPML 2.0, L<http://dev.opml.org/spec2.html>
854              
855             RSS 2.0, L<https://cyber.harvard.edu/rss/rss.html>
856              
857             Atom Syndication, L<https://tools.ietf.org/html/rfc4287>
858              
859             River of News,
860             L<http://scripting.com/2014/06/02/whatIsARiverOfNewsAggregator.html>
861              
862             =cut
863              
864             1;