At the risk of dipping into a well I’ve visited often: the only pieces of information I need to have on this slip are the title of the book and the date it’s due; everything else just makes it harder to find that out.

Recent explorations of lost and found sounds unearthed the audio of a short-lived podcast I produced 10 years ago called The 3LA Podcast. I’ve resurrected the 10 episodes for your archival listening pleasure.

I posted my first sound to SoundCloud in October 2007, about a month after the service launched. That first track, a bootleg recording of a Garnet Rogers concert at the Trailside in Mount Stewart, has an internal track ID of 488; the last track I posted, of Oliver’s birthday greeting to me earlier this month, has a track ID of 257472209, meaning that some 257 million tracks have been posted in the intervening 9 years. That’s a lot of sound.

I joined SoundCloud on the strength of having heard co-founders Alexander Ljung and Eric Wahlforss at the reboot conference in Copenhagen; they were interesting polymaths and I reasoned that anything they would launch would be similarly interesting. It didn’t hurt that I was working with [[Plazes]] at the time, and Plazes and SoundCloud inhabited the same Berlin neighbourhood, both geographically and spiritually.

And I love sound; I’m a collector of sorts. I’ll record a waterfall here, record a podcast there. SoundCloud was a piece of enabling infrastructure for that: before it was widely possible to post easily-playable sound online, SoundCloud made upload-and-share easy. I’m not exactly a prolific collector: I’ve only posted 155 sounds in 9 years. But I’m nothing if not a diverse collector: I’ve gathered the sound of my parents’ cold-room door opening (before they moved house), of a café in Tokyo, music hacks with The Island Hymn, saxophones on Victoria Row, a Bruce Guthro bootleg, and myriad CBC Radio interviews.

When my yearly invoice for SoundCloud Pro Unlimited, at $115/year, arrived early this month, though, I decided that it was time to repatriate my sound: to get it out of SoundCloud and into the same Drupal content management system I use to manage everything else I produce online.

A lot has changed since SoundCloud launched: it’s become much easier, via HTML5, to host and serve audio to the browser, and there are lots of JavaScript libraries that make this even easier, libraries that integrate well with Drupal. And SoundCloud itself has evolved: it has narrowed and professionalized its focus and entered the same marketplace as Spotify, Google Play and Apple Music. It didn’t feel like the right home for my sound hacking anymore.

So I wrote a script to use the SoundCloud API — always and still one of the greatest features of the site — to pull metadata and media files for all the sounds I’ve uploaded to SoundCloud over the years (what a motley collection these media files were: WAV, MP3, AMR, M4A). I converted all of the audio to MP3 (which is now playable directly in all modern browsers), created a Drupal content type, and imported each sound’s metadata. And so at the end of all the links to sounds above you’ll find their repatriated versions.
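For the curious, the heart of the script looked something like this (a minimal sketch rather than the exact code; it assumes a registered client_id, and the endpoint and field names are from memory, so they may have changed):

<?php

// A sketch of pulling track metadata and media files from the SoundCloud API.
// Assumes a registered client_id; the /users/{id}/tracks endpoint and the
// field names (title, created_at, download_url, ...) are from memory and may
// have changed since.

$clientId = "YOUR_CLIENT_ID";   // placeholder
$userId   = "YOUR_USER_ID";     // placeholder: numeric SoundCloud user ID

$url = "https://api.soundcloud.com/users/$userId/tracks.json"
     . "?client_id=$clientId&limit=200";
$tracks = json_decode(file_get_contents($url), true);

foreach ($tracks as $track) {
  // Keep the metadata needed for the Drupal import.
  $meta = array(
    "id"          => $track["id"],
    "title"       => $track["title"],
    "created_at"  => $track["created_at"],
    "description" => $track["description"],
  );
  file_put_contents("meta/" . $track["id"] . ".json", json_encode($meta));

  // Grab the original media file (WAV, MP3, AMR, M4A, whatever was uploaded).
  $media = file_get_contents($track["download_url"] . "?client_id=" . $clientId);
  file_put_contents("media/" . $track["id"], $media);
}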

There’s a reverse chronological list of all the sounds here, and sounds are also integrated into lists of related posts, like this one for Tokyo, elevating sound from something I stash elsewhere to “first level content,” so to speak.

I wish SoundCloud all the best, and I feel bad about leaving after such a long run; it continues to be the most interesting place for sound, and there’s nothing like sticking a keyword in the search — banjo punk, for example — and letting it roll. There’s no better way to let new sound wash over you.

I’ve still got some SoundCloud embeds to update in archival posts around the blog, so there will be a transition period. And I’m not closing my SoundCloud account, just cutting it off at the un-unlimited knees, so links out there will continue to work.

Twelve years ago, in the late summer of 2004, Steven Garrity, Dan James and I started a podcast called Live from the Formosa Tea House, recorded very occasionally over lunch at the eponymous Formosa Tea House on Prince Street in Charlottetown. The primary claim to fame of the podcast was that it was among the first of the medium, following quickly on early experiments by Dave Winer and Adam Curry; the entire 6-episode, 7-year run of the podcast happened largely outside the current podcasting resurgence. 

Over the years the various MP3 files of the 6 episodes ended up scattered here and there, the victim of whatever blogging platform was in vogue at the time. I’ve spent the last couple of days consolidating all of the audio I’ve recorded over the years right here on this website, and thus, for the first time, you can listen to all 6 hours in one place. Enjoy.

Episode №   Recorded On
One         September 8, 2004
Two         September 16, 2004
Three       January 31, 2005
Four        July 13, 2005
Five        November 2, 2005
Six         June 27, 2011

As the consideration of the estimates in the Legislative Assembly of Prince Edward Island continues today, it’s useful to take a look back 100 years to the day when our MLA ancestors were engaged in exactly the same activity, albeit with considerably more furor and journalistic bluster attached.

Thanks to the PEI Legislative Documents Online project, you can review many of the house records from that 1916 session.

While tracking down a copy of Living and Learning: The Report of the Provincial Committee on Aims and Objectives of Education in the Schools of Ontario (known popularly as the “Hall-Dennis Report”) yesterday at Robertson Library I found, helpfully catalogued beside it, a copy of Education or Molasses: A critical look at the Hall-Dennis Report.

Given that the former has been at the core of both how I experienced education as a child and how I regard it as an adult, the latter provides me with an opportunity to examine some of my deeply-held beliefs in a new light.

It’s a delightfully acerbic read.

I can’t remember why this is in my pocket.

“Keep this coupon for guaranteed immortality?”

“Keep this coupon if you ever want to see your son again?”

“Keep this coupon to claim your coat?”

I can never throw it away.

Some days your heart just sings. Thanks to members of the Legislative Assembly, this is one of those days. From last Thursday evening’s Hansard, a mention of my quick hack to demonstrate the utility of HTML in legislative documents.

Chair: The hon. Leader of the Third Party.

Dr. Bevan-Baker: This isn’t related to aquaculture or fisheries and agriculture specifically, but many in the House will probably remember Peter Rukavina was here the other night –

Mr. McIsaac: Yes.

Dr. Bevan-Baker: – and Peter is a great advocate of open data. Today he watched us here in the House with our desks overflowing with paper and trying to get from one section to another and accommodate them, and he just spent – he was just here yesterday so he did this in 24 hours – but he transmitted a whole section from a PDF, which is, as programmers, that’s where data goes to die, into an HTML form which allows you to access all kinds of stuff. He sent me the link here. I’d really like to send that around the House.

An Hon. Member: (Indistinct).

Dr. Bevan-Baker: It’s so much less work, less paper. It’s much more user-friendly. There’s just so much here which I think we could improve this process so we’re not kind of trying to find our way through bits of paper.

Mr. McIsaac: Yeah, I –

Dr. Bevan-Baker: I just wanted to comment on that.

Mr. McIsaac: It’s interesting because I saw Peter here yesterday and he sent me a note, too, and he said this is actually his favourite part of the House is doing estimates.

Dr. Bevan-Baker: Yeah.

Mr. McIsaac: Not Question Period or motions or whatever. But I’ll tell you an example. I went to the priorities committee today with my tablet and I got there and couldn’t get on. I didn’t have any hard copies. So I’ll apologize for having the hard copy but it’s really helpful.

Dr. Bevan-Baker: Yeah. No, I don’t think we need to apologize for where we are, but as a joke actually then he says: I move that for the next fiscal year we set ourselves a goal of crafting a thoroughly modern version of the estimates. I’d go with that.

An Hon. Member: Hear, hear!

Dr. Bevan-Baker: Anyway –

Mr. Trivers: Modernize it (Indistinct)

Dr. Bevan-Baker: – I’ll send that around so everybody has access to it.

Mr. McIsaac: Sounds good

I appreciate Dr. Bevan-Baker raising the issue, and I appreciate the spirit of collegiality his comment was met with by other members.

A couple of days ago I wrote about my reverse engineering of the video archives of the Legislative Assembly of Prince Edward Island, and I suggested, at the end, that additional hijinks could now ensue.

When I read How I OCR Hundreds of Hours of Video, I knew that’s where I had to look next: the author of that post, Waldo Jaquith, uses optical character recognition — in essence “getting computers to read the words in images” — with video of the General Assembly of Virginia, to do automated indexing of speakers and bills. I reasoned that a similar approach could be used for Prince Edward Island, as our video here also has lower thirds listing the name of the member speaking.

So I tried it. And it worked! Here’s a walk-through of the toolchain I used, which is adapted from Waldo’s.

The structure of the video archive I outlined earlier lends itself well to grabbing a still frame of video every 10 seconds, from the beginning of each 10-second-long transport stream.

I’ll start by illustrating the process of doing OCR on a single frame, and then run through the automation of the process for an entire part of the day.

Each 10-second transport stream has 306 frames. I don’t need all of them; I just need one, so I use FFmpeg to extract a single JPEG like this, run against this transport stream file:

ffmpeg -ss 1 -i "media_w1108428848_014.ts" -qscale:v 2 -vframes 1 "media_w1108428848_014.jpg"

The result is a JPEG like this:

JPEG frame capture from Legislative Assembly video

I only need the area of the frame that includes the “lower third” to do the OCR, so I use ImageMagick to crop this out:

convert "media_w1108428848_014.jpg" -crop 439x60+64+360 +repage -compress none -depth 8 "media_w1108428848_014.tif"

This crops out a 439 pixel by 60 pixel rectangle starting 64 pixels from the left and 360 pixels from the top, this section here:

Cropped Video Section

The lower third is different for members with multiple titles, like the Premier, and for backbench members, which is why such a large vertical swath is needed to ensure that all members’ names can be grabbed.

The resulting TIFF file looks like this:

Lower Third Cropped Out

Next I use ImageMagick again to convert all of the cropped lower thirds to black and white, with:

convert "media_w1108428848_014.tif" -negate -fx '.8*r+.8*g+0*b' -compress none -depth 8 "bw-media_w1108428848_014.tif"

Resulting in black and white images like this:

Black and white lower third

Now I’m ready to do the OCR, for which, like Waldo, I use Tesseract:

tesseract "bw-media_w1108428848_014.tif" "bw-media_w1108428848_014"

This results in a text file with the converted text:

Hon H Wade Maclauchlan

mmm-‘v
Mun (-1 n1 hI-Anralui l‘nl‘ln nan-Iv

Tesseract did an almost perfect job on the member’s name — Hon. H. Wade MacLauchlan. It missed the periods, but that’s understandable, as they got blown out in the conversion to black and white. And it read the fourth letter of the Premier’s last name as a lower-case rather than an upper-case “L”, but, again, the tail on the “L” got blown out by the conversion.

And that’s it, really: grab a frame, crop out the lower third, convert to black and white, OCR. 

All I need now is a script to pull a series of transport streams and do this as a batch; this is what I came up with:

#!/bin/bash

# Fetch a range of 10-second transport streams from the Legislative Assembly
# video server, grab one frame from each, crop out the lower third, convert
# it to black and white, and OCR it.
#
# Arguments: datestamp (e.g. 20160422A), starting minute, duration in minutes.

DATESTAMP=$1

# Pull the playlist and extract the unique ID from the last chunk's filename.
curl -Ss http://198.167.125.144:1935/leg/mp4:${DATESTAMP}.mp4/playlist.m3u8 > /tmp/playlist.m3u8
IFS=_ array=(`tail -1 /tmp/playlist.m3u8`)
IFS=. array=(${array[1]})
UNIQUEID="${array[0]}"

# Convert minutes into 10-second chunk numbers (6 chunks per minute).
START=$(($2 * 6 - 1))
DURATION=$(($3 * 6))
END=$(($START + $DURATION))

echo "Getting video for ${DATESTAMP}"

while [ ${START} -lt ${END} ]; do
  echo "Getting chunk ${START}"
  # Zero-pad the chunk number so the local files sort nicely.
  PADDED=`printf %03d $START`
  echo "Changing to ${PADDED}"
  # Fetch the transport stream, capture a single frame, crop the lower third,
  # convert it to black and white, and OCR it.
  curl -Ss "http://198.167.125.144:1935/leg/mp4:${DATESTAMP}.mp4/media_${UNIQUEID}_${START}.ts"  > "ts/media_${UNIQUEID}_${PADDED}.ts"
  ffmpeg -ss 1 -i "ts/media_${UNIQUEID}_${PADDED}.ts" -qscale:v 2 -vframes 1 "frames/media_${UNIQUEID}_${PADDED}.jpg"
  convert "frames/media_${UNIQUEID}_${PADDED}.jpg" -crop 439x60+64+360 +repage -compress none -depth 8 "cropped/media_${UNIQUEID}_${PADDED}.tif"
  convert "cropped/media_${UNIQUEID}_${PADDED}.tif" -negate -fx '.8*r+.8*g+0*b' -compress none -depth 8 "bw/media_${UNIQUEID}_${PADDED}.tif"
  tesseract "bw/media_${UNIQUEID}_${PADDED}.tif" "ocr/media_${UNIQUEID}_${PADDED}"
  let START=START+1
done

With this script in place, and directories set up for each of the generated files — ts/, frames/, cropped/, bw/ and ocr/ — I’m ready to go, using arguments identical to my earlier script. So, for example, if I want to OCR 90 minutes of the Legislative Assembly from the morning of April 22, 2016, starting at the second minute, I do this:

./get-video.sh 20160422A 2 90

I leave that running for a while, and I end up with an ocr directory filled with OCRed text from each of the transport streams, files that look like this:

, . 1
Hon J Alan Mclsaac
MHn-Jrl (v0 Axul: HIVIHP thi | l'llr‘HF"

and this:

_ 4'

Hon. Allen F. Roac‘h

As Waldo wrote in his post:

Although Tesseract’s OCR is better than anything else out there, it’s also pretty bad, by any practical measurement.

And that’s borne out in my experiments: the OCR is pretty good, but it’s not consistent enough to use for anything without some post-processing. And for that, I used the same technique Waldo did, computing the Levenshtein distance between the text from each OCRed frame and a list of Members of the Legislative Assembly.

From the Members page on the Legislative Assembly website, I prepared a CSV containing a row for each member and their party designation, with a couple of additional rows to allow me to react to frames where no member was identified:

Bradley Trivers,C
Bush Dumville,L
Colin LaVie,C
Darlene Compton,C
Hal Perry,L
Hon. Allen F. Roach,L
Hon. Doug W. Currie,L
Hon. Francis (Buck) Watts,N
Hon. H. Wade MacLaughlan,L
Hon. Heath MacDonald,L
Hon. J. Alan McIsaac,L
Hon. Jamie Fox,C
Hon. Paula Biggar,L
Hon. Richard Brown,L
Hon. Robert L. Henderson,L
Hon. Robert Mitchell,L
Hon. Tina Mundy,L
James Aylward,C
Janice Sherry,L
Jordan Brown,L
Kathleen Casey,L
Matthew MacKay,C
Pat Murphy,L
Peter Bevan-Baker,G
Sidney MacEwen,C
Sonny Gallant,L
Steven Myers,C
None,N
2nd Session,N

The idea is that for each OCRed frame I take the text and compare it to each of the names on this list; the name on the list with the lowest Levenshtein distance value is the likeliest speaker. 

For example, for this OCRed text:

e'b

Hon. Paula Blggav

I get this set of Levenshtein distances:

Bradley Trivers -> 20
Bush Dumville -> 18
Colin LaVie -> 18
Darlene Compton -> 20
Hal Perry -> 17
Hon. Allen F. Roach -> 18
Hon. Doug W. Currie -> 18
Hon. Francis (Buck) Watts -> 22
Hon. H. Wade MacLaughlan -> 19
Hon. Heath MacDonald -> 19
Hon. J. Alan McIsaac -> 17
Hon. Jamie Fox -> 16
Hon. Paula Biggar -> 8
Hon. Richard Brown -> 17
Hon. Robert L. Henderson -> 22
Hon. Robert Mitchell -> 20
Hon. Tina Mundy -> 16
James Aylward -> 19
Janice Sherry -> 19
Jordan Brown -> 17
Kathleen Casey -> 18
Matthew MacKay -> 18
Pat Murphy -> 18
Peter Bevan-Baker -> 19
Sidney MacEwen -> 19
Sonny Gallant -> 17
Steven Myers -> 19
None -> 19
2nd Session -> 20

The smallest Levenshtein distance is for Hon. Paula Biggar, with a value of 8, so that’s the name I connect with this frame.
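In PHP that matching step boils down to a handful of lines; here’s a minimal sketch (the names and OCR text below just stand in for the real inputs, which come from the files described above):

<?php

// A minimal sketch of the matching step: compare the OCRed text against each
// member name and keep the name with the smallest Levenshtein distance.
// $ocr and $memberNames stand in for the real inputs.

$ocr = "e'b\n\nHon. Paula Blggav";
$memberNames = array("Hon. Jamie Fox", "Hon. Paula Biggar", "Hon. Tina Mundy");

$bestName = null;
$bestDistance = PHP_INT_MAX;

foreach ($memberNames as $name) {
  $distance = levenshtein($name, trim($ocr));
  if ($distance < $bestDistance) {
    $bestDistance = $distance;
    $bestName = $name;
  }
}

echo "$bestName ($bestDistance)\n";  // Hon. Paula Biggar has the smallest distance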

Ninety minutes of video from Friday morning results in 540 frame captures and 540 OCRed snippets of text.

With the snippets of text extracted, I run a PHP script on the result, dumping out an HTML file with a thumbnail for each frame, coloured to match the party of the member I identified as speaking from the OCR:

<?php

// Map each party designation in member-names.txt to a background colour.
$colors = array("L" => "#F00",  // Liberal
                "C" => "#00F",  // Conservative
                "G" => "#0F0",  // Green
                "N" => "#FFF"   // None
                );

// Load the member list, one "Name,Party" row per line.
$names = file_get_contents("member-names.txt");
$members = explode("\n", $names);
foreach ($members as $key => $value) {
  if ($value != '') {
    list($name, $party) = explode(",", $value);
    $m[] = array("name" => $name, "party" => $party);
  }
}

$fp = fopen("index.html", "w");

// Walk through the OCRed text files, one per 10-second chunk.
if ($handle = opendir('./ocr')) {
  while (false !== ($entry = readdir($handle))) {
    if ($entry != "." && $entry != ".." && $entry != '.DS_Store') {
      $ocr = file_get_contents("./ocr/" . $entry);
      $jpeg = "frames/" . basename($entry, ".txt") . ".jpg";
      $ts = "ts/" . basename($entry, ".txt") . ".ts";

      // Strip everything but letters and newlines from the OCR output.
      $ocr = preg_replace('/[^a-z\n]+/i', ' ', $ocr);

      // Find the member whose name is the smallest Levenshtein distance
      // from the OCRed text.
      $mindist = 9999;
      unset($found);
      foreach ($m as $key => $value) {
        $d = levenshtein($value['name'], trim($ocr));
        if ($d < $mindist) {
          $mindist = $d;
          $found = $value;
        }
      }

      // Emit a thumbnail coloured by the matched member's party, linked to
      // the transport stream for the chunk.
      fwrite($fp, "<div style='float: left; background: " . $colors[$found['party']] . "'>\n");
      fwrite($fp, "<a href='$ts'><img src='$jpeg' style='width: 64px; height: auto; padding: 5px'></a></div>");
    }
  }
  closedir($handle);
}

fclose($fp);

The resulting HTML file looks like this in a browser:

Friday Morning in the House, colour-coded

The frames that are coloured white are frames where there was either no lower third, or where the lower third didn’t contain the name of the member speaking. It’s not a perfect process: the last dozen frames or so, for example, are from the consideration of the estimates, where there’s no member’s name in the lower third, but my script doesn’t know that, and it simply finds the member’s name with the smallest Levenshtein distance from the jumble of text it does find there; some fine-tuning of the matching process could avoid this.

Changing the output of the PHP script so that members’ names are included, the thumbnails are a little larger, and each thumbnail is linked to the transport stream of the associated video gives me a visual navigator for the morning’s video.
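The change amounts to something like this, a sketch of the modified output rather than the exact code, reusing the same variables as the script above:

<?php

// Sketch of the change: the two fwrite() calls in the script above become
// something like this, with a bigger thumbnail, a link to the chunk's
// transport stream, and the matched member's name underneath. The markup
// details are illustrative rather than the exact HTML I generated.
fwrite($fp, "<div style='float: left; padding: 5px; background: " . $colors[$found['party']] . "'>\n");
fwrite($fp, "<a href='$ts'><img src='$jpeg' style='width: 128px; height: auto'></a><br>\n");
fwrite($fp, "<small>" . htmlspecialchars($found['name']) . "</small></div>\n");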

One more experiment, this time representing each OCRed frame as a two-pixel-wide part of a bar, allowing the entire morning to be visualized by party:

The Morning Visualized
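Generating the bar is just another tweak to the same output; roughly, and again only a sketch reusing the same variables:

<?php

// Sketch of the bar version: each OCRed frame becomes a 2-pixel-wide sliver,
// coloured by the matched member's party; laid end to end the slivers make
// up the bar for the whole morning. Assumes the same $fp, $found and $colors
// as the script above.
fwrite($fp, "<div style='float: left; width: 2px; height: 40px; background: "
          . $colors[$found['party']] . "'></div>\n");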

Leaving thumbnails and party colours out of it completely, here are the members ranked by how many of the 540 ten-second chunks they appear in the first frame of (the counts don’t sum to 540 because the remaining frames had no lower third and thus no identified speaker); a rough sketch of how a tally like this can be produced follows the list:

  42 Hon. Paula Biggar
  37 Peter Bevan-Baker
  36 James Aylward
  30 Hon. Robert L. Henderson
  28 Hon. Allen F. Roach
  19 Hon. J. Alan McIsaac
  17 Hon. Jamie Fox
  15 Steven Myers
  15 Bradley Trivers
  13 Sidney MacEwen
  13 Hon. H. Wade MacLaughlan
  12 Hal Perry
  11 Colin LaVie
   9 Hon. Doug W. Currie
   8 Hon. Robert Mitchell
   8 Hon. Heath MacDonald
   6 Hon. Tina Mundy
   5 Darlene Compton
   5 Bush Dumville
   4 Sonny Gallant
   4 Jordan Brown
   3 Kathleen Casey
   2 Hon. Richard Brown
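
The tally itself is simple once every frame has a matched name; here’s a rough sketch, assuming the matched names are collected into an array as the loop above runs:

<?php

// Sketch of the tally: count how many frames each matched name appears in
// and print the counts in descending order. $matchedNames stands in for an
// array with one matched name per frame, collected inside the loop above.
$matchedNames = array("Hon. Paula Biggar", "Peter Bevan-Baker", "Hon. Paula Biggar", "None");

$counts = array_count_values($matchedNames);
arsort($counts);

foreach ($counts as $name => $count) {
  if ($name == "None" || $name == "2nd Session") {
    continue;  // frames with no identified speaker
  }
  printf("%4d %s\n", $count, $name);
}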

Visualized as a bar chart, this data looks like this:

Bar Chart of Frames per Member

And finally, here’s a party breakdown (it’s important to note that this is only a very rough take on the “which party gets the most speaking time” question because I’m only looking at the first frame of every 10-second video chunk):

Pie Chart Showing Frames per Party

Peter Bevan-Baker, Leader of the Green Party, is the only speaker in the Green slice; he’s the second-most-frequent speaker, at 37 frame chunks, but the other parties spread their speaking across more members, which is why the Green Party represents only about 11% of the identified frame chunks (37 of the 342 with a matched name).

As with much of the information that public bodies emit, the Legislative Assembly of PEI could make this sort of analysis much easier by releasing time-coded open data in addition to the video — a sort of “structured data Hansard,” if you will. Without that, we’re left using blunt instruments like OCR which, though fun, involve a lot of futzing that shouldn’t really be required.
