Bypassing Islandora to Harvest Island Newspapers Data Directly

Because of Robertson Library’s significant investment in Islandora as a content repository tool, dissing Islandora as being an annoying piece of cruft between me and the data I want to harvest isn’t singing from the company song book. And of course Islandora is much more than an annoying piece of cruft: it truly is a very useful, flexible tool that enables projects like IslandNewspapers.ca and innumerable other repository projects to be managed from a web front-end.

But sometimes you want your fingers on the beating heart of the data, not abstracted through an intermediary, and fortunately Islandora doesn’t prevent that, as it’s possible to communicate directly with Apache Solr and Fedora, bypassing the Islandora layer.

For example, I had a request from a friend who was using IslandNewspapers.ca to do some research on the Mayne family: there are 1847 pages of The Guardian in IslandNewspapers.ca with one or more instances of the word “Mayne” on them, and she wanted to avoid manually visiting 1847 pages in IslandNewspapers.ca to pull out that information.

Here’s what I did to help her along.

I reverse-engineered the Solr query that a search for “Mayne” in IslandNewspapers.ca runs. The “rows” parameter is set to 20 by default, so I changed it to 2000 to ensure I could pull out all of the instances of Mayne in a single query.

To pull out those Solr search results, I just grabbed the output of that query using wget, like this:

wget "http://islandnewspapers.ca:8080/solr/select?defType=dismax&facet=true&facet.mincount=0&facet.limit=20&facet.field=PARENT_century_s&facet.field=PARENT_decade_s&facet.field=PARENT_year_s&facet.field=PARENT_month_s&qt=standard&facet.date=PARENT_dateIssued_dt&f.PARENT_dateIssued_dt.facet.date.start=NOW%2FYEAR-120YEARS&f.PARENT_dateIssued_dt.facet.date.end=NOW&f.PARENT_dateIssued_dt.facet.date.gap=%2B1YEAR&f.PARENT_dateIssued_dt.facet.mincount=0&facet.date.start=NOW%2FYEAR-20YEARS&facet.date.end=NOW&facet.date.gap=%2B1YEAR&hl=true&hl.fl=OCR_t&hl.fragsize=400&hl.simple.pre=%3Cspan+class%3D%22islandora-solr-highlight%22%3E&hl.simple.post=%3C%2Fspan%3E&qf=OCR_t^10.0&version=1.2&wt=json&json.nl=map&q=Mayne&start=0&rows=2000&indent=on&debugQuery=true" -O Mayne.json

(Update: it turns out that I was able to access Solr directly because I was connected to the UPEI VPN; Solr isn’t exposed to the Internet-at-large right now, so y’all in the outside world cannot, alas, run that query as I could).
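For anyone who would rather script that request than hand-edit a long URL, here is a minimal Python sketch that builds a trimmed-down version of the same query. The endpoint and the parameter names come from the wget command above; dropping the facet and debug parameters is my assumption about what is needed just to get the “highlighting” object back, and the function names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlretrieve

# Endpoint taken from the wget command above (reachable only from the UPEI VPN).
SOLR_SELECT = "http://islandnewspapers.ca:8080/solr/select"


def build_solr_url(term, rows=2000):
    """Return a Solr select URL that searches the OCR text for `term`.

    Assumption: the facet.* and debugQuery parameters from the original
    query are not needed to get highlighted snippets, so they are omitted.
    """
    params = [
        ("defType", "dismax"),
        ("qf", "OCR_t^10.0"),   # search the OCR text field
        ("q", term),
        ("start", 0),
        ("rows", rows),         # the site's default is 20
        ("hl", "true"),         # ask for keyword-in-context snippets
        ("hl.fl", "OCR_t"),
        ("hl.fragsize", 400),
        ("wt", "json"),
        ("json.nl", "map"),
    ]
    return SOLR_SELECT + "?" + urlencode(params)


def fetch_results(term, filename, rows=2000):
    """Download the JSON response for `term` to `filename` (needs VPN access)."""
    urlretrieve(build_solr_url(term, rows), filename)
    return filename
```

Calling `fetch_results("Mayne", "Mayne.json")` would stand in for the wget command, with `urlencode` taking care of the percent-escaping.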

The result was a 53 MB JSON file that, key for my purposes, contains a “highlighting” object that lists each of the 1847 instances of “Mayne”, with the Fedora “PID” (the unique identifier of that specific page in The Guardian) and the keyword-in-context snippet, like this:

  "highlighting":{
    "guardian:19450730-010":{
      "OCR_t":["; Mrs. Hutghlngs, Cyril George. F0. Keith Warren. Bigger. Sask. High Jump — Wendell Mayne, Knight, Ronald Earl. F'l.-Lieut., Herman Mayne. Saul! sic. Marie. Ont Spike driving contest, ladies- Ajqyrigk, Peter William. 110.. lis’ team won over Mr. Everett London, Ont. 6015151115‘,- ldi Mi l 4 =,;- B ll, 90.. St. Mary's. ug-o -war. a es- ss E sic 01?.\" n‘ R055 e Maynes teafn won over Mrs. Wen- PTAM, New"]},
    "guardian:19261030-011":{
      "OCR_t":["- ed to the pupils are as follows: canes t-cnAlns Best sheaf of 100 heads 0t‘ oats. —1- Melburne MscDoweil, Pleas- ant Vulley 2 Cimeuce Haslaxn, Springfield 3 Rich Kelly, Stanchol 4 E3191 H088\". Solltil Granville. ‘Wheat—1 Ethel Hogan, 2 Lloyd Stanchei 4; Clgrence Springfield. Corn-—1 Allison Mayne, Spring- field 2 ‘Priscilla Frizzeil, Stsnchel 3 Lloyd Frlzzei. Tbreslled ‘Beans-l Alice Weeks"]},

I cut-and-pasted those 1847 objects into a text file all their own, then did some BBEdit-massaging to end up with a tab-delimited ASCII file that looked like this:

guardian:19450730-010	; Mrs. Hutghlngs, Cyril George. F0. Keith Warren. Bigger. Sask. High Jump — Wendell Mayne	 Knight	 Ronald Earl. F'l.-Lieut.	 Herman Mayne. Saul! sic. Marie. Ont Spike driving contest	 ladies- Ajqyrigk	 Peter William. 110.. lis’ team won over Mr. Everett London	 Ont. 6015151115‘	- ldi Mi l 4 =	;- B ll	 90.. St. Mary's. ug-o -war. a es- ss E sic 01?.\" n‘ R055 e Maynes teafn won over Mrs. Wen- PTAM	 New"
guardian:19261030-011	- ed to the pupils are as follows: canes t-cnAlns Best sheaf of 100 heads 0t‘ oats. —1- Melburne MscDoweil, Pleas- ant Vulley 2 Cimeuce Haslaxn, Springfield 3 Rich Kelly, Stanchol 4 E3191 H088\. Solltil Granville. ‘Wheat—1 Ethel Hogan	 2 Lloyd Stanchei 4; Clgrence Springfield. Corn-—1 Allison Mayne	 Spring- field 2 ‘Priscilla Frizzeil	 Stsnchel 3 Lloyd Frlzzei. Tbreslled ‘Beans-l Alice Weeks																																								
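The cut-and-paste step could also be scripted. The sketch below is an alternative to the BBEdit massaging, not what I actually did: it reads the “highlighting” object from the JSON and writes one PID-and-snippet pair per line, which keeps each snippet in a single column. The function names are hypothetical; the `highlighting` and `OCR_t` keys are the ones shown in the sample above.

```python
import json


def highlighting_rows(solr_response):
    """Yield (pid, snippet) pairs from a parsed Solr JSON response."""
    for pid, fields in solr_response.get("highlighting", {}).items():
        # Solr returns a list of snippets per field; join them, then
        # collapse tabs, newlines and runs of spaces to single spaces
        # so that the snippet cannot break the tab-delimited layout.
        snippets = fields.get("OCR_t", [])
        text = " ".join(" ".join(snippets).split())
        yield pid, text


def write_tsv(json_path, tsv_path):
    """Convert a saved Solr response into a tab-delimited file.

    Returns the number of rows written.
    """
    with open(json_path, encoding="utf-8") as f:
        response = json.load(f)
    count = 0
    with open(tsv_path, "w", encoding="utf-8") as f:
        for pid, text in highlighting_rows(response):
            f.write(f"{pid}\t{text}\n")
            count += 1
    return count
```

With the file from the wget step, `write_tsv("Mayne.json", "Mayne.tsv")` would produce the two-column file in one pass.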

Finally, I wrote a little PHP script to cycle through that file, and to harvest the JPEG2000 and OCR data streams out of Islandora and into files in a local directory.
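My script was PHP, but the idea is simple enough to sketch in Python. Treat the following as an illustration under stated assumptions rather than the script itself: the base URL, the `/islandora/object/<PID>/datastream/<DSID>/view` URL pattern, and the `JP2` and `OCR` datastream IDs are assumptions about how this Islandora site is laid out, and all the function names are hypothetical.

```python
from pathlib import Path
from urllib.parse import quote
from urllib.request import urlretrieve

# Assumptions, not taken from the original script:
BASE = "http://islandnewspapers.ca"          # site base URL
DATASTREAMS = {"JP2": "jp2", "OCR": "txt"}   # datastream ID -> local file extension


def datastream_url(pid, dsid, base=BASE):
    """Build the (assumed) Islandora URL for one datastream of one page."""
    return f"{base}/islandora/object/{quote(pid, safe=':')}/datastream/{dsid}/view"


def local_name(pid, dsid):
    """Turn a PID like 'guardian:19450730-010' into a safe local filename."""
    return f"{pid.replace(':', '_')}.{DATASTREAMS[dsid]}"


def read_pids(tsv_path):
    """Return the first column (the PID) of every non-blank line."""
    with open(tsv_path, encoding="utf-8") as f:
        return [line.split("\t", 1)[0] for line in f if line.strip()]


def harvest(tsv_path, out_dir):
    """Download every configured datastream for every PID in the file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for pid in read_pids(tsv_path):
        for dsid in DATASTREAMS:
            urlretrieve(datastream_url(pid, dsid), str(out / local_name(pid, dsid)))
```

Something like `harvest("Mayne.tsv", "mayne")` would then fill a local `mayne` directory with one image and one text file per page.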

Running the script left me with an index.html file with links to the keyword-highlighted version of the newspaper image in IslandNewspapers.ca, to the remote JPEG2000 image of each page, and to a locally-stored ASCII file of the OCR’d text, along with the keyword-in-context snippet returned by Solr.
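Generating that kind of index page takes only a few lines. Here is a hedged Python sketch of the idea; the dictionary keys (`pid`, `page_url`, `jp2_url`, `ocr_file`, `snippet`) and the function name are hypothetical, and the caller is expected to supply the actual URLs, since I'm not reproducing the site's link formats here.

```python
from html import escape


def render_index(entries, title="Search results"):
    """Return an HTML page listing one entry per newspaper page.

    Each entry is a dict with the (hypothetical) keys:
    pid, page_url, jp2_url, ocr_file, snippet.
    """
    items = []
    for entry in entries:
        # Escape everything: OCR text is full of stray quotes and brackets.
        page = escape(entry["page_url"], quote=True)
        jp2 = escape(entry["jp2_url"], quote=True)
        ocr = escape(entry["ocr_file"], quote=True)
        items.append(
            "<li>"
            f'<a href="{page}">{escape(entry["pid"])}</a> '
            f'(<a href="{jp2}">JPEG2000</a>, <a href="{ocr}">OCR text</a>)'
            f'<br>{escape(entry["snippet"])}'
            "</li>"
        )
    body = "\n".join(items)
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><meta charset=\"utf-8\"><title>{escape(title)}</title></head>\n"
        f"<body><ul>\n{body}\n</ul></body></html>\n"
    )
```

Writing the returned string to `index.html` next to the harvested files gives a page that can be opened straight from the local directory.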

With this file and its links, my Mayne friend is now ready to dig into the data in a much more efficient manner.
