The Third Session of the Sixty-fifth General Assembly wrapped up on Wednesday after 14 sitting days and I’ve written some code to do some statistical analysis of how the session went.
The code is rough and hacky, largely because the Legislative Assembly’s Hansard is published only as mostly-unstructured PDF files. There’s just enough structure to ferret out who’s speaking and what they said, but enough variation in the formatting that the process can’t be relied upon to work 100% of the time.
Look at this snippet from November 20, 2018, for example:
While the name of each person who speaks is rendered in bold in the PDF file, this doesn’t survive the conversion from PDF to text, so post-conversion I’m left with:
Cost of campaign to Island taxpayers Question to the Minister of Workforce and Advanced Learning: How much did that promotion cost Island tax payers? Speaker: The hon. Minister of Workforce and Advanced Learning. Mr. Gallant: Thank you, Mr. Speaker. I thank the hon. member for the question and that was a very good promotion and we had, I think, there was approximately seven people that contacted us and there was over 200 contacted Workforce PEI.
At this point I’m left to deduce the names of speakers by looking for patterns common to all, which boils down to something like “one or more capitalized words that start the line and are followed by a colon.” Which, in regular expression terms, looks like this:
So in this brief excerpt this pulls out “Speaker” and “Mr. Gallant,” helpfully. But also, unhelpfully, “Advanced Learning,” which gets caught by the same pattern matching.
Things are further complicated by the presence of names in Hansard that are not Members of the Legislative Assembly or clerical staff, like “Barry Jackson,” here:
The pattern matching also misses things like “Leader of the Opposition” (“of the” not being capitalized).
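To make those failure modes concrete, here is a minimal Python sketch of a pattern of that general shape. It is not the expression used in the actual script. The line breaks in the sample are assumed (the excerpt above doesn’t preserve them), and the “Leader of the Opposition” line is made up purely for illustration.

```python
import re

# One or more capitalized words at the start of a line, followed by a colon.
# An illustrative pattern only, not the one from the original script.
SPEAKER_RE = re.compile(r"^([A-Z][\w.'-]*(?: [A-Z][\w.'-]*)*):", re.MULTILINE)

# Line breaks here are assumed; the last line is a made-up example.
SAMPLE = """Cost of campaign to Island taxpayers
Question to the Minister of Workforce and
Advanced Learning: How much did that promotion cost Island tax payers?
Speaker: The hon. Minister of Workforce and Advanced Learning.
Mr. Gallant: Thank you, Mr. Speaker.
Leader of the Opposition: Thank you, Mr. Speaker.
"""


def find_speakers(text):
    """Return every line-initial label that looks like a speaker name."""
    return SPEAKER_RE.findall(text)


print(find_speakers(SAMPLE))
# ['Advanced Learning', 'Speaker', 'Mr. Gallant']
```

The false positive (“Advanced Learning”) and the miss (“Leader of the Opposition”) both fall straight out of the “capitalized words only” rule. Loosening the rule to allow lowercase joining words would catch the Leader of the Opposition, at the cost of more false positives.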
That all said, it’s possible, even with these deficiencies, to get some rough, inaccurate-but-not-too-much-so statistical summaries from Hansard.
To start, I used a Python script I wrote a few years ago to harvest all of the Hansard PDFs on the Legislative Assembly website and convert them to text files (using pdftotext).
Next, for the heavy lifting, I processed the individual text files with a PHP script that used the aforementioned pattern matching to extract each person speaking, what they said, and the number of words spoken, along with the date, into a CSV file.
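The actual processing was done in PHP. As a rough illustration of the same idea, here is a Python sketch that treats everything between one speaker label and the next as a single utterance. The pattern, the function names, and the `date` and `words` column names are assumptions; only `text` and `speaker` are column names confirmed by the queries later in this post.

```python
import csv
import io
import re

# Illustrative speaker pattern (an assumption, not the original expression).
SPEAKER_RE = re.compile(
    r"^([A-Z][\w.'-]*(?: [A-Z][\w.'-]*)*):[ \t]*", re.MULTILINE
)


def extract_utterances(text, date):
    """Split converted Hansard text into (date, text, speaker, words) rows."""
    rows = []
    matches = list(SPEAKER_RE.finditer(text))
    for i, match in enumerate(matches):
        # An utterance runs until the next speaker label, or to end of file.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        spoken = " ".join(text[match.end():end].split())
        rows.append((date, spoken, match.group(1), len(spoken.split())))
    return rows


def to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["date", "text", "speaker", "words"])
    writer.writerows(rows)
    return out.getvalue()


# Line breaks in this sample are assumed.
SAMPLE = """Mr. R. Brown: Thank you, Mr. Chairman.
Chair: Are you going to make an opening
comment, minister?
"""

rows = extract_utterances(SAMPLE, "2018-11-20")
print(to_csv(rows))
```

Any text that appears before the first matched label is dropped by this sketch, and a false-positive label will split an utterance in two, which is one way the counts end up slightly off.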
I limited the script to the 14 sitting days from this session, and ended up with a file hansard.csv. Each line of that file represents one utterance by one speaker, like this:
2018-11-20,"Thank you, Mr. Chairman.","Mr. R. Brown",4
2018-11-20,"Are you going to make an opening comment, minister?",Chair,9
There are 11,782 “utterances” like this in the file, which means that, give or take, 11,782 times over 14 days someone spoke in the Legislative Assembly on something.
Taking that CSV file into a spreadsheet and creating a pivot table by speaker and total number of words, and then hand-editing out the aberrations, I end up with 73 people speaking in total (with some duplication, as when a member is acting as Chair of Committee of the Whole House they appear as, for example, “Chair (Casey)”, not as “Ms. Casey”).
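The pivot table step can also be done without a spreadsheet. This is a small Python sketch, assuming the CSV has a header row with `speaker` and `words` columns (the `words` name is an assumption). The third data row in the sample is made up so that one speaker has more than one utterance.

```python
import csv
import io
from collections import Counter


def words_by_speaker(csv_file):
    """Total the word counts per speaker, largest total first."""
    totals = Counter()
    for row in csv.DictReader(csv_file):
        totals[row["speaker"]] += int(row["words"])
    return totals.most_common()


# Header names are assumed; the last row is a made-up example.
SAMPLE = """date,text,speaker,words
2018-11-20,"Thank you, Mr. Chairman.",Mr. R. Brown,4
2018-11-20,"Are you going to make an opening comment, minister?",Chair,9
2018-11-20,"Thank you.",Mr. R. Brown,2
"""

for speaker, total in words_by_speaker(io.StringIO(SAMPLE)):
    print(f"{speaker}: {total}")
```

With the real file, the same function would be called with `open("hansard.csv", newline="")`. The hand-editing of aberrations still has to happen somewhere, since false-positive “speakers” show up as their own rows in the totals.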
With that done, I can start to pull together some summary data.
Members of the Legislative Assembly
There were 398,088 words spoken by Members of the Legislative Assembly; broken down by speaker:
|Mr. J. Brown||35995|
|Mr. R. Brown||23573|
|Leader of the Opposition||17771|
A total of 24 people were invited onto the floor as “strangers”–including me!–this session, speaking a total of 29,040 words. By stranger:
|Dr. Wendy Verhoek-Oftedahl||288|
Others recorded in Hansard include the Speaker, the Clerk and Clerk Assistants, the Chairs of Committee of the Whole House, and the generic “An Hon. Member”:
|An Hon. Member||820|
|Some Hon. Members||805|
|Clerk Assistant (Doiron)||360|
|Clerk Assistant (Reddin)||183|
|Some Hon. Member||2|
An Hon. Member
When the identity of the member speaking cannot be discerned, “An Hon. Member” is credited. Stringing together these utterances makes for interesting found poetry; here’s an example:
Could you table a copy?
It starts (Indistinct)
Call the hour.
No one is clapping for that one.
Forget the facts.
Don’t confuse the (Indistinct)
There’s the truth.
The excellent q utility runs SQL queries directly against CSV files, which makes it easy to count mentions of particular words or phrases. For example, here’s a query to count mentions of “Stars for Life,” the non-profit organization I sit on the board of:
q -H -d, "SELECT count(*) from ./hansard.csv where text like '%Stars for Life%'"
The result is 12. To find out who mentioned Stars for Life, I can do:
q -H -d, "SELECT speaker,count(*) as mentions from ./hansard.csv where text like '%Stars for Life%' group by speaker"
Leader of the Opposition,1
Minister,1
Mr. MacEwen,2
Ms. Casey,4
Ms. Mundy,1
Peter Rukavina,2
Speaker,1
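If q isn’t available, roughly the same query can be run with Python’s built-in sqlite3 module by loading the CSV into an in-memory table. This is a sketch under assumptions: the table name `hansard` and the `date` and `words` column names are invented here, and the sample rows use placeholder speakers and made-up text rather than real Hansard content.

```python
import csv
import io
import sqlite3


def load_csv(csv_file):
    """Load a Hansard-style CSV (with header row) into an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE hansard (date TEXT, text TEXT, speaker TEXT, words INTEGER)"
    )
    conn.executemany(
        "INSERT INTO hansard VALUES (:date, :text, :speaker, :words)",
        csv.DictReader(csv_file),
    )
    return conn


def mentions_by_speaker(conn, phrase):
    """Count utterances containing the phrase, grouped by speaker."""
    return conn.execute(
        "SELECT speaker, count(*) AS mentions FROM hansard "
        "WHERE text LIKE ? GROUP BY speaker ORDER BY speaker",
        (f"%{phrase}%",),
    ).fetchall()


# Placeholder speakers and made-up text, for illustration only.
SAMPLE = """date,text,speaker,words
2018-11-20,"Stars for Life does good work.",Speaker A,6
2018-11-20,"Call the hour.",Speaker B,3
2018-11-21,"I toured Stars for Life.",Speaker A,5
2018-11-21,"Thanks to Stars for Life.",Speaker B,5
"""

conn = load_csv(io.StringIO(SAMPLE))
print(mentions_by_speaker(conn, "Stars for Life"))
# [('Speaker A', 2), ('Speaker B', 1)]
```

Note that SQLite’s `LIKE` is case-insensitive for ASCII characters by default, so this also matches “stars for life.”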
Donald Trump was invoked several times this session, and I can find out by whom with:
q -H -d, "SELECT speaker,count(*) as mentions from ./hansard.csv where text like '%Trump%' group by speaker"
An Hon. Member,1
Mr. R. Brown,3
To see what was said about Donald Trump:
q -H -d, "SELECT speaker,text as mentions from ./hansard.csv where text like '%Trump%'"
which results in:
Mr. R. Brown,"You’re (Indistinct), you’re like Donald Trump, you want to divide the Island and not unite the Island."
Mr. R. Brown,"Okay, Trump."
Mr. R. Brown,"You know what; 30 million litres of fuel have been saved. 90,000 tonnes of carbon have been taken out of the air; 90 ,000 tonnes have been taken out, 90,000 tonnes, 30,000 tandem loads of pollution have been taken out of the air from little old PEI. But no, do we hear thanks from the Green Party? No, Islanders are wrong. They then go on about the economy and you know, over the last couple of months there’s been a discussion going on about populism and populist leaders, polarizing things, we have it in Donald Trump, polarizing things. They’re going to give you great things and they’re going to do great things for you. We have the Green Party is saying: You have to put a carbon tax in, you have to tax – they’re sinners, you have to tax them –"
An Hon. Member,(Indistinct) Donald Trump.
Take It For a Ride Yourself
If you want to take this out for a ride yourself, here are the things you’ll need:
To make this kind of analysis easier, having access to a version of Hansard where speakers and text are available as structured elements would be a big improvement.
For example, something like this:
<session date="2018-11-15">
  <utterance>
    <text>You’re (Indistinct), you’re like Donald Trump, you want to divide the Island and not unite the Island.</text>
    <wordcount>19</wordcount>
    <speaker>
      <name>Mr. R. Brown</name>
      <role>Minister of Communities, Land and Environment</role>
    </speaker>
  </utterance>
</session>
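With a structure like that, the pattern-matching guesswork disappears and the analysis becomes a few lines of standard-library code. Here is a sketch in Python, assuming the hypothetical element names from the example (`session`, `utterance`, `text`, `speaker`, `name`); no such feed exists today.

```python
import xml.etree.ElementTree as ET

# A well-formed copy of the hypothetical structure, for illustration.
SAMPLE = """
<session date="2018-11-15">
  <utterance>
    <text>You’re (Indistinct), you’re like Donald Trump, you want to divide the Island and not unite the Island.</text>
    <wordcount>19</wordcount>
    <speaker>
      <name>Mr. R. Brown</name>
      <role>Minister of Communities, Land and Environment</role>
    </speaker>
  </utterance>
</session>
""".strip()


def find_mentions(xml_text, phrase):
    """Return (date, speaker, text) for each utterance containing phrase."""
    session = ET.fromstring(xml_text)
    hits = []
    for utterance in session.iter("utterance"):
        text = utterance.findtext("text", default="")
        if phrase.lower() in text.lower():
            speaker = utterance.findtext("speaker/name")
            hits.append((session.get("date"), speaker, text))
    return hits


for date, speaker, text in find_mentions(SAMPLE, "Trump"):
    print(date, speaker)
```

Because the speaker and role are explicit elements, there is nothing to misidentify: no “Advanced Learning” false positives and no missed “Leader of the Opposition.”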
Other jurisdictions have moved in this direction, and I believe Prince Edward Island is bound to have this capability eventually.
If you have any ideas about more accurate parsing of the Hansard PDF files, please let me know.