Let's deepen MSM-citizen collaboration to ask government for data we can mine, by @Info_Aus

Open Data Reporter at No Fibs
Rosie Williams (BA Sociology) owns and runs InfoAus.net, a site that hosts database journalism projects that engage the public with open data.
Rosie is a long time activist who began lobbying the federal government at the age of 16. You can find out more about Rosie at InfoAus.net


By Rosie Williams,

10 January, 2014

Source:  infoaus.net

In its report Life in the Clickstream: The Future of Journalism, Australia’s Media Alliance discusses how new technologies and economic events in recent years have put the future of an entire industry under threat.

The internet brought with it a cultural expectation that content be free. The separation of online classifieds from journalistic publications into stand-alone sites means that papers are left to fund their activities primarily from their readers, who are accustomed to the price of their news being subsidised by advertisers.

If mainstream media publications survive this ‘perfect storm’ it might only be through changing their relationship with their audience to a more collaborative approach.

It is interesting to think of the effect a more collaborative attitude might have on a society which has historically seen the mainstream media as representing the interests of advertisers rather than readers. In recent times we have seen Fairfax Media receive a Walkley for creating a searchable database of information from the federal government’s pecuniary interests register.

Interns transcribed the PDFs to create the database, which provided a rich source for stories about discrepancies in what ought to have been declared. In this case the transcription had to be done by hand, as the interests data are published as scanned images inside PDFs, which can’t be mechanically scraped.

Once published, this data was then able to be mechanically scraped from the SMH site and made available on ScraperWiki by open government hacker Alexander Sadlier who then made it available to me for re-use in eXpenseAus – my own version of a searchable database.

I have combined this data with the travel and office expenses data published by Finance & Deregulation, drawing both on my own work and on data scraped by Nick Evershed from The Guardian for their own project. The result is a page that lists politician visits (data scraped by Nick) alongside the locations of their investment properties from the pecuniary interests register (scraped by Alex), so the public can see whether politicians have made travel expense claims for trips that may include visits to investment property locations.
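As a rough illustration of the kind of matching described above, here is a minimal Python sketch. It is not the code behind eXpenseAus; the field names (`politician`, `date`, `location`) and the sample rows are invented for the example, and it assumes both datasets have already been reduced to simple records with comparable place names.

```python
# Hypothetical sample rows; the real datasets' fields are not described in the article.
visits = [
    {"politician": "A. Example", "date": "2013-03-02", "location": "Cairns"},
    {"politician": "A. Example", "date": "2013-05-10", "location": "Canberra"},
    {"politician": "B. Sample", "date": "2013-04-18", "location": " hobart "},
]
properties = [
    {"politician": "A. Example", "location": "Cairns"},
    {"politician": "B. Sample", "location": "Hobart"},
    {"politician": "B. Sample", "location": "Perth"},
]


def normalise(place):
    """Lower-case and trim a place name so trivial differences don't block a match."""
    return place.strip().lower()


def visits_near_properties(visits, properties):
    """Return the visits whose location matches a property declared by the same politician."""
    owned = {}
    for row in properties:
        owned.setdefault(row["politician"], set()).add(normalise(row["location"]))
    return [
        visit
        for visit in visits
        if normalise(visit["location"]) in owned.get(visit["politician"], set())
    ]


matches = visits_near_properties(visits, properties)
for match in matches:
    print(match["politician"], match["date"], match["location"].strip())
```

A match here only means a claimed trip and a declared property share a place name. It flags something worth a closer look and says nothing about whether the claim was improper.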

Nick, Alexander (who also worked on LobbyLens) and I are part of a new movement of people who have a keen interest in opening government data for use by researchers and the public. The fact that the Pecuniary Interests Register is online at all (instead of available for inspection by appointment only) is attributable to the work of the Open Australia Foundation, a not-for-profit group who make government information more accessible through their projects (as do I) by creating databases and lobbying for data to be made available for this use.

The Guardian UK took this effort to a new level a few years ago when they set up their own lobbying effort to get information useful to the public made freely available. I think a media collaboration on open data in Australia would be very useful and it would be simple to implement. For every article relating to open data/FOI published in mainstream or independent publications or blogs, there could be a badge on the page saying that the article is part of a drive for open data.

This badge could lead to a page (on the publisher’s own site or a separate site) which explains to the public very simply what open data is about and why it is in the public interest to have data published in usable formats, as well as linking to all the other badged articles (either within each publication or in a collaborative list).

Even journalists do not necessarily understand the difference between data made available in a PDF and the same data in a CSV file that can be sucked straight into a database for republishing – a point made very clear in this blog by the Open Knowledge Foundation.
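To make the difference concrete, here is a minimal Python sketch of what "sucked straight into a database" can look like once data arrives as CSV. The column names and figures are made up for illustration, and the in-memory SQLite database stands in for whatever database a real project would use.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content; a real file would be opened with open(path, newline="").
CSV_TEXT = """politician,date,location,amount
A. Example,2013-03-02,Cairns,412.50
B. Sample,2013-04-18,Hobart,230.00
"""


def load_csv(conn, csv_file):
    """Create a simple table and insert every CSV row into it; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS claims "
        "(politician TEXT, date TEXT, location TEXT, amount REAL)"
    )
    reader = csv.DictReader(csv_file)
    rows = [
        (r["politician"], r["date"], r["location"], float(r["amount"]))
        for r in reader
    ]
    conn.executemany("INSERT INTO claims VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = load_csv(conn, io.StringIO(CSV_TEXT))
total = conn.execute("SELECT SUM(amount) FROM claims").fetchone()[0]
print(loaded, "rows loaded; total claimed:", total)
```

The same rows locked inside a scanned PDF would first have to be retyped or run through character recognition and checked by hand before a single query could be written.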

The last year has seen a move toward collaboration with the public as two mainstream publications (SMH) & (Guardian) sought the help of the public in providing the research labour to detect potentially dodgy travel claims. I propose we take this collaboration one step further and encourage the media to join forces in asking the government to release data in usable formats – as it is already required to do under its own policies.

Such a campaign would considerably raise public awareness of open data and open government, and the government would no longer be able to enjoy the current lack of accountability regarding its standard of transparency. Information is gathered by the government for its citizens, and as citizens we have a right to access that data and have it made available in ways we can actually engage with. The media has an essential role in making the public aware of such issues and I believe the time is right to move forward on this issue.

  1. FelineCyclist says

    It’s not just for journalistic purposes that government data should be provided in searchable format. Myki provides users with their travel history in a PDF, rather than CSV format. Repeated requests for CSV have been refused on a range of bases, including that the “database can’t produce the history in CSV” and that there is a policy against releasing CSV because people might manipulate it or change it (presumably they’re worried about dodgy claims for refunds or something). It is very annoying that Myki collects all of this information about me and then won’t give it to me in a form that I can use – it’s even more annoying come tax time.

  2. Peter Bayley (syntheticloveblog) says

    There are plenty of PDF extraction tools around which, admittedly, produce a variety of results but which would be a lot better than manual conversion. Just Google “PDF Text extraction” or “PDF Image extraction”.

    • The PDFs I have been working with on InfoAus.net, e.g. the Portfolio Budget Statements, are far too complex for these data extraction tools to work. The govt has claimed they are going to release the budget data from next May in CSV or Excel format thanks to lobbying. No one has put the budget into a searchable database prior to my project for this very reason. My son scraped the data manually but it took a long time. Those who have such skills are not inclined to spend that kind of time on unpaid projects, and no one will fund them unless people are willing to pay for the outcome.

      With the pecuniary interests data (what politicians own, are given, and owe) each politician fills it in by hand and updates through letters to the Department of Senate or Department of House of Reps. The form and updates are then made into an image and whacked into a PDF at http://www.aph.gov.au/members/register

      These PDFs can’t be scraped at all, and more people are interested in this data than in many other datasets. The govt is obliged through its policies on Open government/Open data to publish in usable formats, as is done in the UK & US, but we are a long way behind.

  3. Will Abbott, Murdoch, Rinehart, the IPA, the Federal government in general and others who do not wish to convey either truth or knowledge be on board with searchable databases? Maybe around the same time as the pope endorses women to full equality, same-sex marriage, full accountability for child abuse and pays taxes – not impossible, just extremely unlikely.

    • The director of the IPA actually gave me a personal thank-you via email for my work increasing budget transparency (BudgetAus, the federal budget in an online searchable database), and Fairfax have already put the pecuniary interests data (e.g. shares owned by politicians) online, so if you think conservatives and the media are against openness and transparency, I can’t really agree with you. The new government has already instituted hypothecated tax receipts from this year (these tell you how much of your tax dollars are going into different areas of public spending).

        “Increasing budget transparency” WOW.

        Will such openness and honesty extend to asylum seekers, government schools, reasons for funding private schools, reasons for defunding anything designed to protect the environment, an explanation for Cory Bernardi, middle and upper income welfare, approval for coal/gas mining, cracking down on a minute percentage of organised crime because they ride motorbikes….?

      • Probably not. Increasing budget transparency is probably more important than you realise. Knowing exactly where money is spent, cut or how it is being hidden by various mechanisms is important. It is not a right or left issue and it is just as important to asylum seekers as it is to anyone else.

        Labor was a great supporter of budget transparency, putting in place Operation Sunlight as an election promise back in 2008 http://infoaus.net/wp/where-to-now-for-operation-sunlight/

        By the time the review of politician entitlements came around in 2010, they were far less interested in budget transparency. http://infoaus.net/wp/so-this-is-what-happened-to-budget-transparency-in-australia/

      • Please do not patronise me – I believe in a completely open and transparent form of government – pretty much the opposite of what we have with the Abbott government.

        Frankly, if the IPA thanked me, I would take time to review where I had gone wrong or failing that, ask myself what did the IPA have to gain from such work on a database.

        Giving links to what promises Labor has broken has little to do with right here and right now. Just so you know, I have not voted for either the LNP or Labor for around 20 years.