Archaeological Extraction of Blockquotes

OPML is much in the air these days: Ton is experimenting with federated bookshelves, and Paul is using OPML of yesteryear to explore his feed-reading past.

Which got me thinking about blog post archaeology, and using the blogs that I read every day as a corpus to explore in different ways.

My first thought was: export my list of feeds as OPML, then write code to parse the OPML to get the RSS feed for each blog, then write more code to retrieve the archive of each blog, and then write more code to parse the body of each post. In theory that would all be possible, as many languages have plug-and-play libraries to make parsing OPML and RSS relatively easy.
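As a rough illustration of the first step in that plan, here is a minimal sketch of pulling feed URLs out of an OPML subscription list with PHP's bundled SimpleXML extension. The sample OPML, the example.com URLs, and the function name feedUrlsFromOpml are all made up for illustration; a real export would be read from a file.

```php
<?php
// Hypothetical sample OPML; a real export from a feed reader would be read from a file.
$opml = <<<XML
<opml version="2.0">
  <head><title>Sample subscriptions</title></head>
  <body>
    <outline text="Blogs">
      <outline type="rss" text="Example Blog" xmlUrl="https://example.com/feed.xml"/>
      <outline type="rss" text="Another Blog" xmlUrl="https://example.org/rss"/>
    </outline>
  </body>
</opml>
XML;

// Hypothetical helper: return the feed URL of every outline, at any
// nesting depth, that carries an xmlUrl attribute.
function feedUrlsFromOpml(string $opml): array {
    $xml = simplexml_load_string($opml);
    if ($xml === false) {
        return []; // not well-formed XML
    }
    $urls = [];
    foreach ($xml->xpath('//outline[@xmlUrl]') as $outline) {
        $urls[] = (string) $outline['xmlUrl'];
    }
    return $urls;
}

print_r(feedUrlsFromOpml($opml));
```

Fetching each feed and walking its archive would still be separate work, which is part of why the shortcut described next is appealing.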

But then I realized that my RSS reader, FreshRSS, maintains a long archive of blog posts in its local database. And I thought, as a first experiment, it might be interesting to extract all the quotes from that archive (anything wrapped in a blockquote element in the body of a post) by way of providing an alternate interface for experiencing the posts all over again.

Here’s what I did to make this happen:

I used the command line interface for FreshRSS to export a JSON representation of the archive, one file per blog:

cd freshness
./export-zip-for-user.php --user peter > peter.zip

I copied the resulting peter.zip file to my local machine, unzipped it into a folder called peter, and then used the following PHP, which depends on PHP Simple HTML DOM Parser, to generate an HTML file of the quotes:

<?php

require_once("simplehtmldom/simple_html_dom.php");

$path = "./peter";

if ($handle = opendir($path)) {
    while (false !== ($file = readdir($handle))) {
        if ('.' === $file) continue;
        if ('..' === $file) continue;
        parseJSON($path . '/' .  $file);
    }
    closedir($handle);
}

// Print a heading for one exported feed, then every blockquote found in its posts.
function parseJSON($file) {
	$json = file_get_contents($file);
	$feed = json_decode($json);
	if ($feed) {
		// Remove 'List of ' and ' articles' from the exported feed title, leaving the blog name.
		print "<h1>" . str_replace(' articles', '', str_replace('List of ', '', $feed->title)) . "</h1>\n";
		foreach ($feed->items as $item) {
			$html = str_get_html($item->content->content);
			if ($html) {
				// Query once and reuse the result for both the test and the loop.
				$quotes = $html->find('blockquote');
				if ($quotes) {
					echo "<h2><a href=\"" . $item->id . "\">" . $item->title . "</a></h2>\n";
					foreach ($quotes as $element) {
						echo "<blockquote style='border: 1px solid grey; padding: 20px'>" . $element->innertext . "</blockquote>\n";
					}
				}
			}
		}
	}
}

I ran the script, dumping the result into an HTML file:

php parse.php > quotes.html
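As an aside, the same extraction can be sketched without a third-party parser, using PHP's bundled DOM extension. This is a swapped-in technique, not the approach used in the script above, and the function name extractBlockquotes and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: return the inner HTML of each blockquote in an HTML fragment.
function extractBlockquotes(string $html): array {
    $doc = new DOMDocument();
    // Real-world markup triggers parser warnings; collect them quietly instead.
    libxml_use_internal_errors(true);
    // The leading XML declaration is a common trick to make loadHTML treat the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();

    $quotes = [];
    foreach ($doc->getElementsByTagName('blockquote') as $node) {
        // Serialize the children to get the equivalent of innertext.
        $inner = '';
        foreach ($node->childNodes as $child) {
            $inner .= $doc->saveHTML($child);
        }
        $quotes[] = trim($inner);
    }
    return $quotes;
}

// Made-up post body for illustration.
$sample = '<p>Intro</p><blockquote><p>Quoted text.</p></blockquote>';
foreach (extractBlockquotes($sample) as $quote) {
    echo $quote . "\n";
}
```

Note that getElementsByTagName also returns nested blockquotes, so a quote inside a quote would appear twice.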

It turns out that the blogs I follow include a lot of quotes, and the resulting file, quotes.html, is, to some degree, impenetrably useless.

Which got me thinking: what if I rejigged this output as an OPML file, which, among other things, I could load into OmniOutliner to browse?

So, I rejigged the code:

<?php

require_once("simplehtmldom/simple_html_dom.php");

$path = "./peter";

print '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
print '<opml version="2.0"><head><title>Quotes in Posts</title></head>';
print '<body>' . "\n";

if ($handle = opendir($path)) {
    while (false !== ($file = readdir($handle))) {
        if ('.' === $file) continue;
        if ('..' === $file) continue;
        parseJSON($path . '/' .  $file);
    }
    closedir($handle);
}

print '</body>';
print '</opml>';

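// Emit one <outline> per feed, nesting posts and then their blockquotes beneath it.
function parseJSON($file) {
	$json = file_get_contents($file);
	$feed = json_decode($json);
	if ($feed) {
		print "<outline text=\"" . str_replace(' articles', '', str_replace('List of ', '', htmlspecialchars($feed->title))) . "\">\n";
		foreach ($feed->items as $item) {
			$html = str_get_html($item->content->content);
			if ($html) {
				// Query once and reuse the result for both the test and the loop.
				$quotes = $html->find('blockquote');
				if ($quotes) {
					echo "<outline text=\"" .  htmlspecialchars($item->title) . "\">\n";
					foreach ($quotes as $element) {
						// Decode HTML entities before escaping so "&amp;" is not double-encoded as "&amp;amp;".
						$text = html_entity_decode(strip_tags($element->innertext), ENT_QUOTES | ENT_HTML5, 'UTF-8');
						echo "<outline text=\"" . htmlspecialchars($text, ENT_QUOTES | ENT_XML1, 'UTF-8') . "\"></outline>\n";
					}
					print "</outline>\n";
				}
			}
		}
		print "</outline>\n";
	}
}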

And, sure enough, the result is somewhat less impenetrable. And kind of cool:

Visualizing block quotes in OPML in OmniOutliner.
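The OPML script assembles its XML by concatenating strings, which works as long as every attribute value is escaped correctly. A minimal sketch of an alternative is to let PHP's bundled XMLWriter handle the escaping. The function name quotesToOpml and the nested-array input shape (blog title, then post title, then a list of quotes) are assumptions made for illustration, not part of the script above.

```php
<?php
// Hypothetical helper: $feeds maps blog title => [post title => [quote, ...]].
function quotesToOpml(array $feeds): string {
    $w = new XMLWriter();
    $w->openMemory();
    $w->setIndent(true);
    $w->startDocument('1.0', 'UTF-8');
    $w->startElement('opml');
    $w->writeAttribute('version', '2.0');
    $w->startElement('head');
    $w->writeElement('title', 'Quotes in Posts');
    $w->endElement(); // head
    $w->startElement('body');
    foreach ($feeds as $blog => $posts) {
        $w->startElement('outline');
        $w->writeAttribute('text', (string) $blog);
        foreach ($posts as $post => $quotes) {
            $w->startElement('outline');
            $w->writeAttribute('text', (string) $post);
            foreach ($quotes as $quote) {
                $w->startElement('outline');
                // XMLWriter escapes &, <, and quotes in attribute values.
                $w->writeAttribute('text', $quote);
                $w->endElement(); // quote
            }
            $w->endElement(); // post
        }
        $w->endElement(); // blog
    }
    $w->endElement(); // body
    $w->endElement(); // opml
    $w->endDocument();
    return $w->outputMemory();
}

// Made-up data for illustration.
$opml = quotesToOpml([
    'Example Blog' => ['A "quoted" post' => ['Fish & chips <today>']],
]);
echo $opml;
```

Because the writer tracks open elements itself, it is harder to end up with an unclosed outline than it is when printing tags by hand.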

The result also shows one of the limitations of HTML as currently practiced, which generally leaves quotes without machine-readable attribution. More semantic HTML, as illustrated here, would help alleviate that:

<figure>
    <blockquote cite="https://www.huxley.net/bnw/four.html">
        <p>Words can be like X-rays, if you use them properly—they’ll go through anything. You read and you’re pierced.</p>
    </blockquote>
    <figcaption>—Aldous Huxley, <cite>Brave New World</cite></figcaption>
</figure>

I’ll try to start doing that with my own quotes.
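If quotes were marked up that way, the attribution could be extracted along with the text. Here is a minimal sketch using PHP's bundled DOM extension and XPath; the function name quotesWithAttribution and the shortened sample markup are hypothetical, and it assumes the figure, blockquote, and figcaption structure shown above.

```php
<?php
// Hypothetical helper: pair each blockquote that sits directly inside a
// <figure> with its cite URL and the text of the figure's caption.
function quotesWithAttribution(string $html): array {
    $doc = new DOMDocument();
    // The bundled parser predates <figure>, so silence its "invalid tag" warnings.
    libxml_use_internal_errors(true);
    // The leading XML declaration is a common trick to make loadHTML treat the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $results = [];
    foreach ($xpath->query('//figure[blockquote]') as $figure) {
        $quote = $xpath->query('blockquote', $figure)->item(0);
        $caption = $xpath->query('figcaption', $figure)->item(0);
        $results[] = [
            'text' => trim($quote->textContent),
            'cite' => $quote->getAttribute('cite'), // empty string when absent
            'caption' => $caption ? trim($caption->textContent) : '',
        ];
    }
    return $results;
}

// Shortened version of the markup above, for illustration.
$sample = '<figure>'
    . '<blockquote cite="https://www.huxley.net/bnw/four.html"><p>You read and you’re pierced.</p></blockquote>'
    . '<figcaption>Aldous Huxley, <cite>Brave New World</cite></figcaption>'
    . '</figure>';
print_r(quotesWithAttribution($sample));
```

Blockquotes that are not wrapped in a figure are skipped by this sketch, which is most of the quotes in the archive as it stands today.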

Monday, May 10, 2021 at 10:23 am

Peter Rukavina
