Insider Hacking [Not]

I had a note about this in my original post about OpenCorporations, but it bears repeating, as I was just asked about it by a reporter from The Guardian: in addition to my daring exploits of late I was also, as it happens, the programmer who originally crafted the code for the Province’s Corporate Register, as part of a multi-year project developing www.gov.pe.ca that ended amicably five years ago.

I haven’t had access to the code that runs the register, or to the database that serves it, since 2003, however, and the “spidering” of the register that fed the OpenCorporations project was done simply by indexing public web pages. This is something that anyone could do (and, indeed, is something that other search engines did freely for many years until the register was modified last week).

Indeed I used a similar approach when I created spidering applications to index Charlottetown Building Permits and City Council Minutes in 2006.
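For anyone unfamiliar with the term, "spidering" here means nothing more than reading a public page, collecting the links on it, and following them. A minimal sketch of the link-collecting step is below. The sample HTML, the example.com address, and the names LinkCollector and extract_links are all made up for illustration; this is not the code behind OpenCorporations or the register.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own address.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the list of absolute link targets found in an HTML string."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page content standing in for a public listing page.
SAMPLE_PAGE = (
    '<ul>'
    '<li><a href="/register/1">Company One</a></li>'
    '<li><a href="detail?id=2">Company Two</a></li>'
    '</ul>'
)

links = extract_links(SAMPLE_PAGE, "http://www.example.com/register/index")
for link in links:
    print(link)
```

A real spider would fetch each collected link in turn and repeat the process, which is all a general-purpose search engine does when it indexes a public site.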

Thursday, December 11, 2008 at 12:11 pm

Peter Rukavina

OpenCorporations

Comments

I can’t agree. It was a

I can’t agree. It was a breach of trust placed in you by the Province.

It’s in the story which I won’t repeat since I’m getting tired.

Sorry

Public data is public data

Public data is public data after all. Very amateurish move by the government on all fronts - both blocking access and now obviously smearing you….

Some days I am nostalgic for the “island way of life” and others decidedly not.

Here’s a comment I just

Here’s a comment I just submitted to Stephen Pate’s site in response to his linked-to post:

This is a strangely prissy stand you seem to be taking. Even if the coding knowledge Peter used had been proprietary and either of monetary or strategic value, Peter did not use it for a commercial purpose or make it any more widely available to others than it already was. What’s proprietary to the government, furthermore, is proprietary to the people of PEI, on whose behalf and authority it supposedly acts every time and in every way it acts at all. The government is not just “a client,” in other words, and neither is the disposition of legal records—required by law to be public—just any kind of information which relates to just any kind of job one performs under contract. If you want to talk about something that rises to the standard of ethics, consider if the previously inconvenient-to-access data in question had been about the number of citizens killed by government administered vaccinations; or numbers of civilians killed by operations in Afghanistan. If you inflate the moral significance of Peter’s coding and publishing of the Web site it looks a lot more like whistle-blowing than a violation of the public trust—which is what customer disservice or betrayal gets called when the customer is the government.

That’s funny Oliver, I don’t

That’s funny Oliver, I don’t see your well reasoned post anywhere on his site.

Maybe he censors?

Oh wait a minute, he says so right here:

“All comments are screened for appropriateness. Commenting is a privilege, not a right. Good comments will be cherished, bad comments will be deleted.”

That line is in the comment section under his post raging at the audacity of the Guardian for deleting one of his comments. Too funny!

What a hypocrite!!! Let me criticize you and call you unethical. But don’t question my alleged goodness!

The sad part about it is, he probably doesn’t even realize the hypocrisy of his hypocriticalness.

Well, as I wrote, I’d only

Well, as I wrote, I’d only “just” submitted it, and just because you screen doesn’t make you pledged to screen on-demand or to ensure notifications of new comments get forwarded automatically to your mobile phone. I imagine he’d rather respond to it than censor, or at least to present it without paying the compliment of an answer. Jeez, you’re making me plead on his behalf, and all I wanted to do was to defend Peter.

I’m sorry for that. I’m just

I’m sorry for that.
I’m just pointing out why I don’t give this particular person any credibility.
I see he has allowed your comment.
But the point I was trying to make is that this person rants and raves about people “censoring” him, and goes all half cocked about alleged “unethicalness” of good people like Ruk, and then tries to drive business to his various tinfoil hat blogs.
Yet his own policy is rather, shall we say, unsupporting of free speech.

just theoretically, if you

just theoretically, if you set up an OCR system like ocropus to deal with the “security feature” for the registry, i wonder if this would be breaching any policy? “are you human” technology definitely discourages harvesting from general web search engines, and i would guess yahoo and google are unlikely to want to invest the cpu cycles and groundwork to use OCR for deeper indexing since it’s a significant effort for a single site - but for an individual effort like OpenCorporations, it’s possible that the current roadblock could be surmounted.

Roadblocks are installed for

Roadblocks are installed for a purpose, and as Peter’s been taking pains to say, he didn’t set out to defeat any perceivable purpose in improving public access to the data in question by scraping and posting it elsewhere. It wasn’t an act of civil disobedience, but a civil service volunteered in good faith. Intentionally circumventing an obvious roadblock can’t be done in good faith except in civil disobedience. Actually, unlike somebody who had never been paid by the government to put that data on the Web where it could be freely accessed, Peter presumably had not only his good faith but a reasonable expectation that the government would thank him. He couldn’t expect that sentiment in passing a barrier we can reasonably assume it set up specifically for him. Just because the government pisses on Peter and the public doesn’t make it up to him to piss back.

Oliver, I think I should hire

Oliver, I think I should hire you to spin for me.

Sorry, i wasn’t clear, i didn

Sorry, i wasn’t clear, i didn’t mean “surmount” in a “sneak through the village gates” sort of way, though i am not convinced that this isn’t sometimes a good idea when it comes to public data, my point was that there are workarounds which can be built if the intent is to block other harvesting agents, i.e., if the gov’t doesn’t mind peter’s service but has other bot problems. I asked about this because the only bot that seems to be targeted via robots.txt on the site is psbot, and since it has some legitimate reputation for bad behavior, i wondered if it factored in somehow. Has the gov’t articulated that this data can’t be re-purposed, or even that they are unhappy with peter’s service? Needing to infer gov’t policies on the re-use of public data seems like an exercise that should never be necessary.
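To make the robots.txt point concrete: a rule that names only one bot leaves every other crawler permitted by default. The sketch below uses Python's standard urllib.robotparser to show this. The robots.txt text is an assumed reconstruction of a psbot-only rule, not a copy of the actual file on the government site, and the example.com address and the helper name allowed are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: a single rule that shuts out psbot only.
ROBOTS_TXT = """\
User-agent: psbot
Disallow: /
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Report whether robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


psbot_ok = allowed("psbot", "http://www.example.com/register/")
other_ok = allowed("SomeOtherBot", "http://www.example.com/register/")
print("psbot allowed:", psbot_ok)
print("other bot allowed:", other_ok)
```

Under a file like this, a well-behaved crawler other than psbot has no way to learn from robots.txt that it is unwelcome, which is why stating the policy there, or saying it directly, matters.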

I think that, through the

I think that, through the media, government has indeed articulated that “this data can’t be re-purposed” and also that they are unhappy with OpenCorporations: Attorney General staff were quoted in a TV news story suggesting that they took their actions in direct response to my “inappropriate reformatting” of the corporations information.

Whatever their reasoning, this seems a pretty clear-cut indication that they have introduced the CAPTCHA to signal “don’t spider here.”

Wow, ok, i never caught the

Wow, ok, i never caught the tv news layer. In sheer technical terms, CAPTCHA means you have to work harder to harvest (i think the W3C estimates that fairly rudimentary ocr can get around 80-90% of CAPTCHA blocks), and it definitely keeps away most of the common and ill-behaved bots, but why this wouldn’t be addressed in robots.txt or a direct communication astounds me. CAPTCHA also causes major accessibility issues for many persons with disabilities but it sounds like access is not a major concern in the site’s design anyway.

Spin, Peter? I wouldn’t know

Spin, Peter? I wouldn’t know how. What I write comes straight from the heart and cold-filtered through pure reason to clarity. For spin I’d have to charge extra.