Category Archives: opendata

The Who, What and Why of Personal Data (new paper in JOAL)

Some of my PhD research has been published in the current issue of the Journal of Open Access to Law, from Cornell University.

“Data protection laws require organisations to be transparent about how they use personal data. This article explores the potential of machine-readable privacy notices to address this transparency challenge. We analyse a large source of open data comprised of semi-structured privacy notifications from hundreds of thousands of organisations in the UK, to investigate the reasons for data collection, the types of personal data collected and from whom, and the types of recipients who have access to the data. We analyse three specific sectors in detail; health, finance, and data brokerage. Finally, we draw recommendations for possible future applications of open data to privacy policies and transparency notices.”

It’s available under open access at the JOAL website (or download the PDF)

A Study of International Personal Data Transfers

Whilst researching open registers of data controllers, I was left with some interesting data on international data transfers which didn’t make it into my main research paper. This formed the basis of a short paper for the 2014 Web Science conference which took place last month.

The paper presents a brief analysis of the destinations of 16,000 personal data transfers from the UK. Each ‘transfer’ represents an arrangement between a data controller in the UK to send data to a country located overseas. Many of these destinations are simply listed by the rather general categories of ‘European Economic Area’ or ‘Worldwide’, so the analysis focuses on just those transfers where specific countries were mentioned.

I found that even when we adjust for the size of their existing UK export market, countries whose data protection regimes are approved as ‘adequate’ by the European Commission had higher rates of data transfers. This indicates that easing legal restrictions on cross-border transfers does indeed positively correlate with a higher number of transfers (although the direction of causation can’t be established). I was asked by the organisers to produce a graphic to illustrate the findings, so I’m sharing that below.


Open Research in Practice: responding to peer review with GitHub

I wrote a tweet last week that got a bit of unexpected (but welcome) attention;

I got a number of retweets and replies in response, including:

The answers are: Yes, yes, yes and yes. I thought I’d respond in more detail and pen some short reflections on github, collaboration and open research.

The backstory; I had submitted a short paper (co-authored with my colleague David Matthews from the Maths department) to a conference workshop (WWW2014: Social Machines). While the paper was accepted (yay!), the reviewers had suggested a number of substantial revisions. Since the whole thing had to be written in Latex, with various associated style, bibliography and other files, and version control was important, we decided to create a github repo for the revision process. I’d seen Github used for paper authoring before by another colleague and wanted to try it out.

Besides version control, we also decided to make use of other features of Github, including ‘issues’. I took the long list of reviewer comments and filed them as a new issue. We then separated these out into themes which were given their own sub-issues. From here, we could clearly see what needed changing, and discuss how we were going to do it. Finally, once we were satisfied with an issue, that could be closed.

At first I considered making the repository private – I was a little bit nervous to put all the reviewer comments up on a public repo, especially as some were fairly critical of our first draft (although, in hindsight, entirely fair). In the end, we opted for the open approach – that way, if anyone is interested they can see the process we went through. While I doubt anyone will be interested in such a level of detail for this paper, opening up the paper revision process as a matter of principle is probably a good idea.

With the movement to open up the ‘grey literature’ of science – preliminary data, unfinished papers, failed experiments – it seems logical to extend this to the post-peer-review revision process. For very popular and / or controversial papers, it would be interesting to see how authors have dealt with reviewer comments. It could help provide context for subsequent debates and responses, as well as demystify what can be a strange and scary process for early-career researchers like myself.

I’m sure there are plenty of people more steeped in the ways of open science who’ve given this a lot more thought. New services like FigShare, and open access publishers like PLoS and PeerJ, are experimenting with opening up all the whole process of academic publishing. There are also dedicated paper authoring tools that extend on git-like functionality – next time, I’d like to try one of the collaborative web-based Latex editors like ShareLatex or WriteLatex. Either way, I’d recommend adopting git or something git-like, for co-authoring papers and the post-peer-review revision process. The future of open peer review looks bright – and integrating this with an open, collaborative revision process is a no-brainer.

Next on my reading list for this topic is this book on the future of academic publishing by Kathleen FitzPatrick – Planned Obsolescence
— UPDATE: Chad Kohalyk just alerted me to a relevant new feature rolled out by Github yesterday – a better way to track diffs in rendered prose. Thanks!

Is an elaborate art joke?

Last week I was sent a link to – a new web application where you can upload your data from Facebook, and choose whether to license it directly to marketers or make it available as open data. It’s a neat idea which has been explored by a number of other startups (e.g., YesProfile, Teckler).

Obviously, uploading all of your Facebook data to a random website raises a whole host of privacy concerns – exactly what you’d expect a rock-solid privacy policy / terms-of-service to address. Unfortunately, there doesn’t seem to be any such terms for If you click the Terms of Service button on the registration page it takes you nowhere.

Looking at the page source, the html anchor points to an empty ‘#’ id, which suggests that there is not some problem with the link, but that there was nowhere to link to in the first place; suspicious! If I was serious about starting a service like this, the very first thing I’d do is draft a terms-of-service and privacy policy. Then before launching the website, I’d triple-check to make sure it appears prominently on the registration form.

Looking at the ‘Browse Open Data’ part of the website, you can look at the supposedly de-identified Facebook profiles that other users have submitted. These include detailed data and metadata like number of friends, hometown, logins, etc. The problem is, despite the removal of names, the information on these profiles is almost certainly enough to re-identify the individual in the majority of cases.

These two glaring privacy issues and technical problems make me think this whole thing might just be an elaborate hoax. In which case, Ha ha. Well, done, you got me. After digging a little deeper, it looks like the website is a project from Commodify, Inc., an artist-run startup, and Moddr, who describe themselves as;

Rotterdam-based media/hacker/co-working space and DIY/FOSS/OSHW fablab for artgeeks, part of the venue WORM: Institute for Avantgardistic Recreation

They’re behind a few other projects in a similar vein, such as ‘Give Me My Data‘. I remembered seeing a very amusing presentation on the Web 2.0 Suicide Machine project by Walter Langelaar a year or two ago.

So I registered using a temporary dummy email addresses, to have a look around, but I didn’t get to upload my (fake) data because the data upload page says it’s currently being updated. I tried sending an email to the mailing address moderator ( listed as ) but it bounced.

If this is intended as a real service, then it’s pretty awful as far as privacy is concerned. If it’s intended as a humorous art project, then that’s fine – as long as as there are no real users who have been duped into participating.

Transparent Privacy Protection: Let’s open up the regulators

Should Government agencies tasked with protecting our privacy make their investigations more transparent and open?

I spotted this story on (eminent IT law professor) Michael Geist’s blog, discussing a recent study by the Canadian Privacy Commissioner Jennifer Stoddart into how well popular e-commerce and media websites in Canada protect their user’s personal information and seek informed consent. This is important work; the kind of pro-active investigation into privacy practices that sets a good example to other authorities tasked with protecting citizen’s personal data.

However, while the results of the study have been published, the Commissioner declined to name names of those websites it investigated. Geist rightly points out that this secrecy denies individuals the opportunity to reassess their use of the offending websites. Amid calls from the Commissioner for greater transparency in data protection generally – such as better security breach notification – this decision goes against the trend, and seems, to me, a missed opportunity.

This isn’t just about naming and shaming the bad guys. It is as much about encouraging good practice where it appears. But this evaluation should take place in the open. Privacy and Data Protection commissioners should leverage the power of public pressure to improve company privacy practices, rather than relying solely on their own enforcement powers.

Identifying the subjects of such investigations is not a radical suggestion. It has already happened in a number of high-profile investigations undertaken by the Canadian Privacy Commissioner (into Google and Facebook), as well by its relevant counterparts in other countries. The Irish Data Protection Commissioner has made the results of its investigation into Facebook openly available. The UK Information Commissioners Office regularly identifies the targets of its investigations. While the privacy of individual data controllers should be respected, the privacy of individual data subjects should come before the ‘privacy’ of organisations and businesses.

As I wrote in my last blog post, openness and transparency from those government agencies tasked with enforcing data protection has the potential to alleviate modern privacy concerns. The data and knowledge they hold should be considered basic public infrastructure for sound privacy decisions. Opening up data protection registers could help reveal who is doing what with our personal data. Investigations undertaken by the authorities into websites’ privacy practices are another important source of information to empower individual users. The more information we have about who is collecting our data and how well they are protecting it, the better we can assess their trustworthiness.

Reflections on an Open Internet of Things

Last weekend I attended the Open Internet of Things Assembly here in London. You can read more comprehensive accounts of the weekend here. The purpose was to collaboratively draft a set of recommendations/standards/criteria to establish what it takes to be ‘open’ in the emerging ‘Internet of Things’. This vague term describes an emerging reality where our bodies, homes, cities and environment bristle with devices and sensors interacting with each other over the internet.

A huge amount of data is currently collected through traditional internet use – searches, clicks, purchases. The proliferation of internet-connected objects envisaged by Internet-of-Things enthusiasts would make the current ‘data deluge’ seem insignificant by comparison.

At this stage, asking what an Internet of Things is for would be a bit like travelling back to 1990 to ask Tim Berners-Lee what the World Wide Web was ‘for’. It’s just not clear yet. Like the web, it probably has some great uses, and some not so great ones. And, like the web, much of its positive potential probably depends on it being ‘open’. This means that anyone can participate, both at the level of infrastructure – connecting ‘things’ to the internet, and at the level of data – utilising the flows of data that emerge from that infrastructure.

The final document we came up with which attempts to define what it takes to be ‘open’ in the internet of things is available here. A number of salient points arose for me over the course of the weekend.

When it comes to questions of rights, privacy and control, we can all agree that there is an important distinction to be made between personal and non-personal data. What also emerged over the weekend for me were the shades of grey between this apparently clear-cut distinction. Saturday morning’s discussions were divided into four categories – the body, the home, the city, and the environment – which I think are spread relatively evenly across the spectrum between personal and non-personal.

Some language emerged to describe these differences – notably, the idea of a ‘data subject’ as someone who the data is ‘about’. Whilst helpful, this term also points to further complexities. Data about one person at one time can later be mined or combined with other data sets to yield data about somebody else. I used to work at a start-up which analysed an individual’s phone call data to reveal insights into their productivity. We quickly realised that when it comes to interpersonal connections, data about you is inextricably linked to data about other people – and this gets worse the more data you have. This renders any straightforward analysis of personal vs. non-personal data inadequate.

During a session on privacy and control, we considered whether the right to individual anonymity in public data sets is technologically realistic. Cambridge computer scientist Ross Anderson‘s work concludes that absolute anonymity is impossible – datasets can always be mined and ‘triangulated’ with others to reveal individual identities. It is only possible to increase or decrease the costs of de-anonymisation. Perhaps the best that can be said is that it is incumbent on those who publicly publish data to make efforts to limit personal identification.

Unlike its current geographically-untethered incarnation, the internet of things will be bound to the physical spaces in which its ‘things’ are embedded. This means we need to reconsider the meaning of and distinction between public and private space. Adam Greenfield spoke of the need for a ‘jurisprudence of open public objects’. Who has stewardship over ‘things’ embedded in public spaces? Do owners of private property have exclusive jurisdiction over the operation of the ‘things’ embedded on it, or do the owners of the thing have some say? And do the ‘data subjects’, who may be distinct from the first two parties, have a say? Mark Lizar pointed out that under existing U.S. law, you can mount a CCTV camera on your roof, pointed at your neighbours back garden (but any footage you capture is not admissible in court). Situations like this are pretty rare right now but will be part and parcel of the internet of things.

I came away thinking that the internet of things will be both wonderful and terrible, but I’m hopeful that the good people involved in this event can tip the balance towards the former and away from the latter.


I’ve been a fan of the Open Rights Group – the UK’s foremost digital rights organisation – for a few years now, but yesterday was my first time attending ORGcon, their annual gathering. The turnout was impressive; upon arrival I was pleasantly surprised to see a huge queue stretching out of Westminster University and down Regent’s Street.

The day kicked off with a rousing keynote from Cory Doctorow on ‘The Coming War On General-Purpose Computing’ (a version of the talk he gave at the last Chaos Communication Camp, [video]). In his typical sardonic style, Doctorow argued that in an age when computers are everywhere – in household objects, medical implants, cars – we must defend our right to break into them and examine exactly what they are doing.  Manufacturers don’t want their gadgets to be general-purpose computers, because this enables users to do things that scare them. They will disable computers that could be programmed to do anything, lock them down and turn them into appliances which operate outside of our control and obscured from our oversight.

Doctorow mocked the naive attempts of the copyright industries to achieve this using digital locks – but warned of the coming legal and technological measures which are likely to be campaigned for by industries with much greater lobbying power. In the post-talk Q&A session, an audience member linked the topic to the teaching of IT in schools; the need for children to understand from an early age how to look inside gadgets, understand how they work and that they may be operating against the users best interests.

As is always the way with parallel sessions, throughout the day I found myself wanting to be in multiple places at once. I opted to hear Wendy Seltzer give a nice summary of the current state of digital rights activism. She likened the grassroots response to SOPA and PIPA to an immune system fighting a virus. She warned that, like an overactive immune system, we run the risk of attacking the innocuous. If we cry wolf too often, legislators may cease to listen. She went on to imply that the current anti-ACTA movement is guilty of this. Personally, I think that as long as such protest is well informed, it cannot do any harm and hopefully will do some good. Legislators are only just beginning to recognise how serious these issues are to the ‘net generation’, and the more we can do to make that clear, the better.

The next hour was spent in a crowded and stuffy room, watching my Southampton colleague Tim Davies grill Chris Taggart (OpenCorporates), Rufus Pollock (OKFN), and Heather Brooke (journalist and author) about ‘Raw, Big, Linked, Open: is all this data doing us any good?’ The discussion was interesting and good to see this topic, which has until recently been confined to a relatively niche community, brought to an ORG audience.

After discussing university campus-based ORG actions over lunch, I went along to a discussion of the future of copyright reform in the UK in the wake of the Hargreaves report. Peter Bradwell went through ORG’s submission to the government’s consultation on the Hargreave’s measures. Saskia Wazkel from Consumer Focus gave a comprehensive talk and had some interesting things to say about the role of consumers and artists themselves in copyright reform. Emily Goodhand (more commonly known as @copyrightgirl on twitter) spoke about the University of Reading’s submission, and her perspective of as Copyright and Compliance officer there. Finally Professor Charlotte Waelde, head of Exeter Law School, took the common call for more evidence-based copyright policy and urged us to ask ‘What would evidence-based copyright policy actually look like?’. Particularly interesting for me, as both an interdisciplinary researcher and believer in evidence-based policy, was her question about what mixture of disciplines are needed to create conclusions to inform policy. It was also encouraging to see an almost entirely female panel and chair in what is too often a male-dominated community.

I spent the next session attending an open space discussion proposed by Steve Lawson, a musician, about the future of music in the digital age. It was great to hear the range of opinions – from data miners, web developers and a representative from the UK Pirate Party – and hear about some the innovations in this space. I hope to talk to Steve in more detail soon in lieu of a book I’m working on about consumer ethics/activism for the pirate generation.

Finally, we were sent off with a talk from Larry Lessig, on ‘recognising the fight we’re in’. His speech took in a bunch of different issues: open access to scholarly literature; the economics of the radio spectrum (featuring a hypothetical three way battle between economist Robert Coase, dictator Joseph Stalin and singer Hetty Lamar [whom I’d never heard of but apparently co-invented ‘frequency hopping’ which paved the way for modern day wireless communication]); and corruption in the US political system, the topic of his latest book.

In the Q+A I asked his opinion on academic piracy (the time honoured practice of swapping PDFs to get around lack of institutional access, which has now evolved into the twitter hashtag phenomenon #icanhazPDF), and whether he prefers the ‘green’ or ‘gold’ routes to open access. He seemed to generally endorse PDF-swapping. He came down on the side of ‘gold’ open access (where publishers become open-access), rather than ‘green’ (where academic departments self-archive), citing the importance of being able to do data-mining. I’m not convinced that data-mining isn’t possible under green OA; so long as self-archiving repositories are set up right (for example, Southampton’s eprints software is designed to enable this kind of thing).

After Lessig’s talk, about a hundred sweaty, thirsty digital rights activists descended on a nearby pub, then pizza, then said our goodbyes until next time. All round it was a great conference; roll on ORGcon2013.