
Self-sufficient programming: RSA cryptosystem with plain Python

[WARNING: this is an exercise purely for fun: it is definitely very insecure, do not use this code in a real system and do not attempt to write your own cryptographic functions!]

Despite working in computer science, these days I barely have the need to write any code in my day-to-day work. This is not unusual for someone working in CS, especially those of us at the ‘softer’ edges. So I decided to set myself a little project to get back into it.

I’ve also been thinking about how the development stack for most programming languages is so dependent on a byzantine labyrinth of third party libraries and packages. This is great in many ways, because whatever you want to do, someone else has probably already done it better and more securely.

But it also means that most applications are dependent on code written and maintained by other people. In some cases this can lead to an unexpected global mess, such as the time when a developer of several popular NPM libraries pulled them from the package management system, breaking the ‘Jenga tower of JavaScript’.

While I don’t think this is actually the answer to the above problems, it could be a fun exercise to see how many of the basic building blocks of modern computing could be re-created from scratch. By me. Someone whose programming has always been a little shoddy and now very out of practice. Armed with nothing more than a few scribbled notes and old slides covering undergraduate-level CS, and using only basic Python (i.e. the Standard Library, no external packages). This is a challenge, because while Python supposedly ships with all the basic stuff you should need (‘batteries included’), in practice this is arguably not true.


What project to pick? I decided to have a go at implementing the RSA cryptosystem for asymmetric key cryptography, a pretty fundamental part of modern computer security. This seemed like a good challenge for two reasons:

  • RSA is not in the Python standard library, and requires various functions which I’d naturally go looking for in external libraries (e.g. finding coprimes).
  • In debates about national security, it is often said that governments can’t effectively ban encryption because it’s not any particular piece of software, it’s just math. Anyone who knows RSA and how to program can implement it themselves given a general purpose computer. But how many people does that include? Can even I, a non-cryptographer with a basic knowledge of how it works in theory, create an actual working version of RSA?

So as an entirely pedagogical / auto-didactical weekend project I’m going to have a go. I’m sure that it will be highly insecure, but I’ll be happy if I can make something that just about works.

I’m allowing myself access to material describing the RSA system (e.g. lecture slides, the original paper), and to stackexchange for basic Python syntax etc. that I’ve forgotten. But no use of external libraries or peeking at existing Python implementations of RSA.

So far I’ve got the following steps of the key generation process:

Pick two large prime numbers:

First we need to pick two large prime numbers at random. To start, we’ll get the user to type some random keys on the keyboard (this will be familiar if you’ve used e.g. GnuPG):

user_entropy = input("please generate some entropy by typing lots of random characters: ")

entropy = 0
for letter in user_entropy:
    entropy = entropy + ord(letter)

That turns the user input into a single number. Then we need to find the nearest prime number, with two functions:

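def isPrime(num):
    # numbers below 2 are not prime
    if num < 2:
        return False
    for i in range(2,num):
        if (num % i) == 0:
            # found a divisor, so num is not prime
            return False
    return True

def find_nearest_prime(num):
    # search upwards from num; returns None if nothing is found below 100000
    while num < 100000:
        if isPrime(num):
            return num
        num += 1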
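Testing every number below num is what makes isPrime so slow. Here is a minimal sketch of a quicker check, still plain trial division but only up to the square root of num. The name is_prime_fast is my own illustration, not part of the code above, and it assumes Python 3.8 or later for math.isqrt.

```python
import math

def is_prime_fast(num):
    """Trial division, testing only odd divisors up to sqrt(num)."""
    if num < 2:
        return False
    if num < 4:
        return True  # 2 and 3 are prime
    if num % 2 == 0:
        return False
    # any factor larger than sqrt(num) is paired with one smaller than it
    for i in range(3, math.isqrt(num) + 1, 2):
        if num % i == 0:
            return False
    return True

print([p for p in range(30) if is_prime_fast(p)])
# [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

This is still hopeless for primes of the size real RSA needs, but it makes toy keys much quicker to generate.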

Get N and Φ(N)

Now we have two prime numbers, we multiply them together to get the composite number N which will be the second part of the private and public keys.

n = prime1*prime2

We also need Φ(N) (the ‘totient’):

phi_n = ((prime1-1)*(prime2-1))
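As a quick sanity check, here are the two calculations worked through with a pair of toy primes. The values 61 and 53 are assumptions picked for illustration, far too small for real use.

```python
# Toy primes, assumed for illustration only
prime1 = 61
prime2 = 53

n = prime1 * prime2                  # 61 * 53 = 3233
phi_n = (prime1 - 1) * (prime2 - 1)  # 60 * 52 = 3120

print(n, phi_n)
# 3233 3120
```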

Get e (public key)

Now we have to find e, the public key component. The easy condition is that it has to be between 1 and Φ(N). The trickier condition is that it has to be coprime with both N and Φ(N). Two numbers are coprime if they have no common factors other than 1. So the first thing we need is a function to find the factors of a number:

def get_factors(num):
    factors = []
    # include num itself, so that a prime shares a factor with its own multiples
    for i in range(2,num+1):
        if ((num % i) == 0):
            factors.append(i)
    return factors

Now we can write a function to check if two numbers have common factors other than 1:

def isCoprime(num1,num2):
    num1_factors = get_factors(num1)
    num2_factors = get_factors(num2)
    if set(num1_factors).isdisjoint(set(num2_factors)):
        # print('no common factors - they coprime!')
        return True
    # print('there are common factors, not coprime')
    return False

Now we can write a function to find values for e that will satisfy those conditions:

def find_e(n,phi_n):
    candidates = []
    for i in range(3,n):
        if isPrime(i):
            if((isCoprime(i,n)) and (isCoprime(i,phi_n))):
                # i passes both coprimality checks, so keep it
                candidates.append(i)
    return candidates

This returns a list of potential values for e which we can pick.
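Listing every factor of both numbers is the expensive part of the coprimality check. A different technique, sketched here as an alternative rather than what my code does, is to use the greatest common divisor: two numbers are coprime exactly when their gcd is 1. The standard library has math.gcd, and Euclid's algorithm is short enough to write by hand. The function names are my own.

```python
from math import gcd

def is_coprime_gcd(num1, num2):
    """Two numbers are coprime exactly when their gcd is 1."""
    return gcd(num1, num2) == 1

def gcd_euclid(a, b):
    """Euclid's algorithm written out by hand, for comparison with math.gcd."""
    while b:
        a, b = b, a % b
    return a

# 3120 = 2^4 * 3 * 5 * 13, so 17 is coprime with it and 3 is not
print(is_coprime_gcd(17, 3120))  # True
print(is_coprime_gcd(3, 3120))   # False
print(gcd_euclid(3120, 17))      # 1
```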

Get d (private key)

How about the private key for decrypting messages (d)? This should be the multiplicative inverse of e; that means that when multiplied by e, it should equal 1. My notes say this is equivalent to: e * d = 1 mod N. This is where I’ve run into trouble. My initial attempts to define this function don’t seem to have worked when I tested the encryption and decryption functions on it. It’s also incredibly slow.

At this point I’m also not sure if the problem lies somewhere else in my code. I made some changes and now I’m waiting for it to calculate d again. Watch this space …

[UPDATE: I got it working …]

OK, so it seems like it wasn’t working before because I’d transcribed the multiplicative inverse condition wrong. I had (d * e) % N, but it’s actually (d * e) % Φ(N). So the correct function is:

def find_d(prime1,n):
    # relies on e and phi_n already being defined at module level
    for i in range(prime1,n):
        if (((i*e) % phi_n) == 1):
            return i
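Brute-force search for d works for toy keys but gets slow very quickly. The usual way to find a modular inverse is the extended Euclidean algorithm. Below is a minimal sketch of it, offered as an alternative to my loop; mod_inverse is a name I made up, and e = 17 with Φ(N) = 3120 are assumed toy values (from primes 61 and 53).

```python
def mod_inverse(e, phi_n):
    """Return d such that (d * e) % phi_n == 1, using extended Euclid."""
    # Invariant: old_r == old_s * e (mod phi_n) and r == s * e (mod phi_n)
    old_r, r = e, phi_n
    old_s, s = 1, 0
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_s, s = s, old_s - q * s
    if old_r != 1:
        raise ValueError("e and phi_n are not coprime, so there is no inverse")
    # old_s may be negative, so bring it into the range 0..phi_n-1
    return old_s % phi_n

d = mod_inverse(17, 3120)
print(d, (d * 17) % 3120)
# 2753 1
```

On Python 3.8 and later the built-in pow(e, -1, phi_n) should give the same answer.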

So that’s the basic key generation functions done. I put in some command line interactions to get this all done, and save the keypair to a text file in the working directory.

Encryption and decryption

With the keypairs saved as dictionaries, the encryption and decryption functions are relatively simple:

def encrypt(pt):
    return (pt ** public_key['e']) % public_key['n']

def decrypt(ct):
    return (ct ** private_key['d']) % public_key['n']
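Computing pt ** e in full before reducing it builds an enormous intermediate number. Python's built-in pow takes an optional third argument and does the modular reduction as it goes. Here is a small sketch using it; the key values are assumed toy numbers (n = 61 * 53, e = 17, d = 2753) and the _fast function names are my own.

```python
# Toy key values, assumed for illustration only
public_key = {'e': 17, 'n': 3233}
private_key = {'d': 2753, 'n': 3233}

def encrypt_fast(pt):
    # pow(base, exp, mod) keeps every intermediate result below n
    return pow(pt, public_key['e'], public_key['n'])

def decrypt_fast(ct):
    return pow(ct, private_key['d'], private_key['n'])

ct = encrypt_fast(65)
print(ct, decrypt_fast(ct))
```

The plaintext has to be an integer smaller than n for the round trip to work.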

At the moment, this only works to encrypt integers rather than text strings. There are various ways we could handle encoding text to integers.
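One simple option, sketched here as an assumption rather than something I have built into the script, is to treat the UTF-8 bytes of the text as one big integer using int.from_bytes, and reverse it with int.to_bytes. The helper names are hypothetical.

```python
def text_to_int(text):
    """Encode text as UTF-8 bytes and read them as one big-endian integer."""
    return int.from_bytes(text.encode('utf-8'), 'big')

def int_to_text(number):
    """Reverse of text_to_int: work out the byte length, then decode."""
    length = (number.bit_length() + 7) // 8
    return number.to_bytes(length, 'big').decode('utf-8')

m = text_to_int('hi')
print(m, int_to_text(m))
# 26729 hi
```

The resulting integer must be smaller than N, so with small keys the text would have to be split into chunks first. Leading null characters are also lost in the round trip.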

Functional … just!

So there we go. A day was just enough to put together a minimally functional RSA cryptosystem.

The main issue is that even for pairs of small prime numbers, it takes a while to find e. Keysizes in the 10-20’s range are pretty quick to compute, but NIST recommends asymmetric keys should be at least 2048-bits. Trying to generate a key this big means leaving the script running for a long, long time.

There are probably loads of ways I could improve the code. Also it would be better to default to a higher value for e. Finally, the default key management is basically nothing (an unencrypted plaintext file).

How to comply with GDPR Article 22? Automated credit decisions

This post explores automated decision-making systems in the context of the EU General Data Protection Regulation (GDPR). Recent discussions have focused on what exactly the GDPR requires of data controllers who are implementing automated decision-making systems. In particular, what information should be provided to those who are subject to automated decisions? I’ll outline the legal context and then present an example of what that might mean at a technical level, working through a very simple machine learning task in an imaginary credit lending scenario.

The legal context

In May next year, the GDPR will come into force in EU member states (including the UK). Part of the Regulation that has gained a fair amount of attention recently is Article 22, which sets out rights and obligations around the use of automated decision making. Article 22 gives individuals the right to object to decisions made about them purely on the basis of automated processing (where those decisions have significant / legal effects). Other provisions in the GDPR (in Articles 13, 14, and 15) give data subjects the right to obtain information about the existence of an automated decision-making system, the ‘logic involved’ and its significance and envisaged consequences. Article 22 is an updated version of Article 15 in the old Data Protection Directive. Member states implemented the Directive into domestic law around a couple of decades ago, but the rights in Article 15 of the Directive have barely been exercised. To put it bluntly, no one really has a grip on what it meant in practice, and we’re now in a similar situation with the new regulation.

In early proposals for the GDPR, the new Article 22 (Article 20 in earlier versions) looked like it might be a powerful new right providing greater transparency in the age of big data profiling and algorithmic decision-making. However, the final version of the text significantly watered it down, to the extent that it is arguably weaker in some respects than the previous Article 15. One of the significant ambiguities is around whether Articles 13, 14, 15, or 22 give individuals a ‘right to an explanation’, that is, an ex post explanation of why a particular automated decision was made about them.

Explaining automated decision-making

The notion of a ‘right to an explanation’ for an automated decision was popularised in a paper which garnered a lot of media attention in summer of last year. However, as Sandra Wachter and colleagues argue in a recent paper, the final text carefully avoids mentioning such a right in the operative provisions. Instead, the GDPR only gives the subjects of automated decisions the right to obtain what Wachter et al. describe as an ex ante ‘explanation of system functionality’. Under Articles 15 (1) h, and 14 (2) g, data controllers must provide ‘meaningful information about the logic involved, as well as the significance and the envisaged consequences of such processing’ to the data subject. But neither of these provisions amounts to an ex post explanation for a particular decision that has been made. The only suggestion of such a right appears in a Recital (71), which says appropriate safeguards should include the ability of data subjects ‘to obtain an explanation of the decision reached after such assessment’.

Since Recitals are not operative provisions and therefore not binding, this suggests that there is no ex post ‘right to explanation’ for specific decisions in the GDPR. It is conceivable that such a right becomes established by the court at a later date, especially if it were to pay close attention to Recital 71. It is also possible that member state DPAs, and the new European Data Protection Board in its harmonisation role, interpret Article 22 in this way, and advise data controllers accordingly. But until then, it looks like data controllers will not be required to provide explanations for specific automated decisions.

Having followed discussions about this aspect of the GDPR since the first proposal text was released in 2012, one of the difficulties has been a lack of specific and detailed examples of how these provisions are supposed to operate in practice. This makes it hard to get a grip on the supposed distinction between a ‘right to explanation of a decision’ and a mere ‘right to an explanation of system functionality’.

If I’m entitled as a data subject to an explanation of a system’s ‘functionality’, and its ‘likely effects’ on me, that could mean a lot of things. It could be very general or quite detailed and specific. At a certain level of generality, such an explanation could be completely uninformative (e.g. ‘based on previous data the model will make a prediction or classification, which will be used to make a decision’). On the other hand, if the system were to be characterised in a detailed way,  showing how particular outputs relate to particular inputs (e.g. ‘applicants with existing debts are 3x less likely to be offered credit’), it might be possible for me to anticipate the likely outcome of a decision applied to me. But without looking at specific contexts, system implementations, and feasible transparency measures, it’s difficult to interpret which of these might be feasibly required by the GDPR.

A practical example: automated credit decisions

Even if legal scholars and data protection officers did have a clear idea about what the GDPR requires in the case of automated decision-making systems, it’s another matter for that to be implemented at a technical level in practice. In that spirit, let’s work through a specific case in which a data controller might attempt to implement an automated decision-making system.

Lots of different things could be considered as automated decision-making systems, but the ones that are getting a lot of attention these days are systems based on models trained on historical data using machine learning algorithms, whose outputs will be used to make decisions. To illustrate the technology, I’m going to explain how one might build a very simple system using real data (note: this is not intended to be an example of ‘proper’ data science; I’m deliberately going to miss out some important parts of the process, such as evaluation, in order to make it simpler).

Imagine a bank wants to implement an automated system to determine whether or not an individual should be granted credit. The bank takes a bunch of data from previous customers, such as their age, whether or not they have children, and the number of days they have had a negative balance (in reality, they’d probably use many more features, but let’s stick with these three for simplicity). Each customer has been labelled as a ‘good’ or ‘bad’ credit risk. The bank then wants to use a machine learning algorithm to train a model on this existing data to classify new customers as ‘good’ or ‘bad’. Good customers will be automatically granted credit, and bad customers will be automatically denied.

German credit dataset

Luckily for our purposes, a real dataset like this exists from a German bank, shared by Professor Hans Hofmann from Hamburg University in 1994. Each row represents a previous customer, with each column representing an attribute, such as age or employment status, and a final column in which the customer’s credit risk has been labelled (either 1 for ‘Good’, or 2 for ‘Bad’).

For example, the 1,000th customer in the dataset has the following attributes:

‘A12 45 A34 A41 4576 A62 A71 3 A93 A101 4 A123 27 A143 A152 1 A173 1 A191 A201 1’

The ‘A41’ attribute in the 4th column indicates that this customer is requesting the credit in order to purchase a used car (a full description of the attribute codes can be found here). The final column represents the classification of this customer’s credit risk (in this case 1 = ‘good’).

Building a model

Let’s imagine I’m a data scientist at the bank and I want to be able to predict the target variable (risk score of ‘good’ or ‘bad’) using the attributes. I’m going to use Python, including the pandas module to wrangle the underlying CSV file into an appropriate format (a ‘data frame’), and the scikit-learn module to do the classification.

import pandas as pd
from sklearn import tree

Next, I’ll load in the German credit dataset, including the column headings (remember, for simplicity we’re only going to look at three features – how long they’ve been in negative balance, their age and the number of dependents):

features = ["duration", "age", "num_depend", "risk"]
df = pd.read_csv("../Downloads/", sep=" ", header=0, names=features)

The target variable, the thing we want to predict, is ‘risk’ (where 1 = ‘good’ and 2 = ‘bad’). Let’s label the target variable y and the features X.

y = df[["risk"]]
X = df[["duration", "age", "num_depend"]]  # the three features only, leaving out the 'risk' target

Now I’ll apply a basic Decision Tree classifier to this data. This algorithm partitions the data points (i.e. the customers) into smaller and smaller groups according to differences in the values of their attributes which relate to their classification (i.e. ‘people over 30’, ‘people without dependents’). This is by no means the most sophisticated technique for this task, but it is simple enough for our purposes. We end up with a model which can take as input any customer with the relevant set of attributes and return a classification of that customer as a good or bad credit risk.
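For reference, the training step described above can be sketched as follows. This is a minimal sketch, not the exact code I ran: the rows in the data frame are made-up stand-ins rather than the German credit data, and fixing random_state is my own assumption to make the result repeatable.

```python
import pandas as pd
from sklearn import tree

# Made-up stand-in rows, NOT taken from the German credit dataset
df = pd.DataFrame({
    "duration":   [5, 50, 10, 30, 20, 45],
    "age":        [60, 21, 48, 33, 55, 26],
    "num_depend": [1, 1, 2, 1, 2, 1],
    "risk":       [1, 2, 1, 1, 1, 2],
})

X = df[["duration", "age", "num_depend"]]  # features
y = df["risk"]                             # target: 1 = 'good', 2 = 'bad'

# Fit a decision tree classifier to the labelled customers
clf = tree.DecisionTreeClassifier(random_state=0)
clf = clf.fit(X, y)

# A fully grown tree reproduces the labels of its own training rows
print(clf.predict(X))
```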

The bank can then use this model to automatically make a decision about whether or not to grant or deny credit to the customer. Imagine a customer, Alice, makes an application for credit, and provides the following attributes:

Alice = {'duration' : 20, 'age' : 40, 'num_depend' : 1}

We then use our model to classify Alice:

# convert the Python dict into a pandas Series
Alice = pd.Series(Alice)
# reshape the values since sklearn doesn't accept 1d arrays
Alice = Alice.values.reshape(1, -1)
print(clf.predict(Alice))

The output of our model for Alice is 2 (i.e. ‘bad’), so Alice is not granted the credit.

Logic, significance and consequences of automated decision taking

How could the bank provide Alice with ‘meaningful information about the logic involved, as well as the significance and the envisaged consequences of such processing’?

One proposal might be to provide Alice with a representation of the decision tree model that resulted from the training. This is what that looks like:

Reading from the top-down, each fork in the tree shows the criteria for placing an individual in one of two sides of the fork.

If Alice knows which personal attributes the bank knows about her (i.e. 40 years old, 1 dependent, 20 days in negative balance), she could potentially use this decision tree to work out whether or not this system would decide that she was a good credit risk. Reading from the top: the first fork asks whether the individual has 15.5 days or less in negative balance; since Alice has 20 days in negative balance, she is placed in the right hand category. The next fork asks whether Alice has 43.5 days or less in negative balance, which she does. The next fork asks whether Alice is 23.5 years old or less, which she isn’t. The final fork on this branch asks if Alice has been in negative balance for 34.5 days or more, which she hasn’t, and at this point the model concludes that Alice is a bad credit risk.

While it’s possible for Alice to follow the logic of this decision tree, it might not provide a particularly intuitive or satisfactory explanation to Alice as to why the model gives the outputs it does. But it does at least give Alice some warning about the logic and the effects of this model.
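If a picture of the tree isn’t practical, scikit-learn can also print a fitted tree as indented text rules with tree.export_text. A minimal sketch, trained on hypothetical toy rows rather than the real dataset, so the thresholds it prints will not match the ones described above:

```python
from sklearn import tree

# Hypothetical toy training rows: [duration, age, num_depend]
X = [[10, 40, 1], [30, 25, 0], [5, 50, 2], [40, 30, 1]]
y = [1, 2, 1, 2]  # 1 = 'good', 2 = 'bad'

clf = tree.DecisionTreeClassifier(random_state=0).fit(X, y)

# One line per fork, with the predicted class at each leaf
rules = tree.export_text(clf, feature_names=["duration", "age", "num_depend"])
print(rules)
```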

Another way that the bank might provide Alice with ‘meaningful information about the logic involved, as well as the significance and the envisaged consequences of such processing’ would be to allow Alice to try out what decisions the model would recommend based on a variety of different values for the attributes it considers. For instance, what if Alice was older, or younger? Would she receive a different decision?

The bank could show Alice the age threshold at which she would be considered a good or bad credit risk. If we begin with age = 40, we find that Alice is classified as a bad credit risk. The same is true for Alice at 41, 42, and 43. However, at age 44, Alice’s credit risk classification would tip over from bad to good. That small exercise in experimentation may give Alice an intuitive sense of at least one aspect of the logic of the decision-making system and its envisaged effects. We could do something similar with the other attributes – what if Alice had only had a negative balance for 10 days? What if Alice had more or fewer children?
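This kind of ‘what if’ experiment is easy to automate. The sketch below sweeps one attribute and reports the first value at which the decision changes. find_flip and toy_predict are hypothetical names; toy_predict is a stand-in rule that I have assumed in place of the trained model, chosen so that the flip happens at age 44.

```python
def find_flip(predict, applicant, attribute, values):
    """Return the first value of `attribute` that changes the decision, or None."""
    baseline = predict(applicant)
    for value in values:
        # copy the applicant, changing only the attribute being explored
        trial = dict(applicant, **{attribute: value})
        if predict(trial) != baseline:
            return value
    return None

# Hypothetical stand-in for the trained model: 'good' (1) from age 44 upwards
def toy_predict(applicant):
    return 1 if applicant['age'] >= 44 else 2

alice = {'duration': 20, 'age': 40, 'num_depend': 1}
print(find_flip(toy_predict, alice, 'age', range(40, 61)))
# 44
```

With a real classifier, predict would wrap clf.predict and convert the dictionary into the row format the model expects.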

If this kind of interactive, exploratory analysis were made available to Alice before she applied for credit, it might help her to decide whether or not she felt OK about this kind of automated decision-making system. It might help her decide whether she wants to object to it, as Article 22 entitles her to do. Rather than being presented in dry mathematical terms, the relationships between these variables and the target variable could be presented in colloquial and user-friendly ways; Alice could be told ‘you’re 4 years too young’ and ‘you’ve been in the red for too long’ to be offered credit.

At a later date, the data on which this model is trained might change, and thus the resulting model might give a different answer for Alice. But it is still possible to take a snapshot of the model at a particular time and, on that basis, provide potentially meaningful interfaces through which Alice could understand the logic, significance and effects of the system on her ability to gain credit.

Explanation: unknown

The point of this exercise is to put the abstract discussions surrounding the GDPR’s provisions on automated decision making into a specific context. If data controllers were to provide dynamic, exploratory systems which allow data subjects to explore the relationships between inputs and outputs, they may actually be functionally equivalent to an ex post explanation for a particular decision. From this perspective, the supposed distinction between an ex ante ‘explanation of system functionality’ and an ex post ‘explanation of a specific decision’ becomes less important. What’s important is that Alice can explore the logic and determine the likely effects of the automated decision-making system given her personal circumstances.

Some important questions remain. It’s easy enough, with a simple, low-dimensional model, to explore the relationships between certain features and the target variable. But it’s not clear how these relationships can be meaningfully presented to the data subject, especially for the more complex models that arise from other machine learning methods. And we know very little about how those who are subject to such automated decisions would judge their fairness, and what grounds they might have for objecting to them. Might Alice reject the putative relationship between a particular feature and the target variable? Might she object to the sampling techniques (in this case, Alice might quite reasonably argue that the attributes of German bank customers in 1994 have little bearing on her as a non-German credit applicant in 2017)? Perhaps Alice would reject the thresholds at which applicants are judged as ‘good’ or ‘bad’?

I hope this simplistic, but specific and somewhat realistic example can serve as a starting point for focused discussion on the options for usable, human-centered transparency around automated decision-making. This is a significant challenge which will require more research at the intersection of machine learning, law and human-computer interaction. While there has been some promising work on transparent / interpretable machine learning in recent years (e.g. ‘Quantitative Input Influence’ and LIME), relatively little research has focused on the human factors of these systems. We know very little about how people might interpret and evaluate these forms of transparency, and how that might be affected by their circumstances and relative position in the context in which the decision is made.

These questions are important to explore if we want to create automated decision-making systems which adhere not just to the letter of data protection law, but also its spirit. The duty to provide information on the logic, significance and effects of algorithmic decision-making will mean very little, if it doesn’t provide data subjects with the ability to make an informed and reasonable decision about whether to subject themselves to such decisions.

White House report on big data and discrimination

The White House has recently published another report on the social and ethical impacts of big data, entitled ‘Big Data: A Report on Algorithmic Systems, Opportunity and Civil Rights’. This could be considered the third in a trilogy, following 2012’s ‘Consumer Data Privacy in a Networked World’ and 2014’s ‘Big Data: Seizing Opportunities, Preserving Values’.

Each report has been a welcome contribution in an evolving debate about data and society, with impacts around the world as well as in the US context. They also reflect, I think, the progress that’s been made in the general direction of understanding in this complex and fast-moving policy area.

The 2012 report was largely an affirmation of commitment to the decades-old fair information practice principles, which noted the challenges posed by new technology. The 2014 report addressed the possibility that big data might lead to forms of unintended discrimination, but didn’t demonstrate any advanced understanding of the potential mechanisms behind such effects. In a paper written shortly after, Solon Barocas and Andrew Selbst commented that ‘because the origin of the discriminatory effects remains unexplored, the 2014 report’s approach does not address the full scope of the problem’.

The latest report does begin to dig more deeply into the heart of big data’s discrimination problem. It describes a number of policy areas – including credit, employment, higher education and criminal justice – in which there is a ‘problem’ to which a ‘big data opportunity’ might be a solution, along with a civil rights ‘challenge’ which must be overcome.

This framing is not without its problems. One might reasonably suspect that the problems in these policy areas are themselves at least partly the result of government mismanagement or market failure, and that advocating a big data ‘solution’ would merely be a sticking plaster.

In any case, the report does well to note some of the perils and promise of big data in these areas. It acknowledges some of the complex processes by which big data may have disparate impacts – thus filling the gap in understanding identified by Barocas and Selbst in their 2014 paper. It also alludes to ways in which big data could help us detect discrimination and thus help prevent it (something I have written about recently). It advocates what it calls ‘equal opportunity by design’ approaches to algorithmic hiring. Towards the end of the report, it refers to ‘promising avenues for research and development that could address fairness and discrimination in algorithmic systems, such as those that would enable the design of machine learning systems that constrain disparate impact or construction of algorithms that incorporate fairness properties into their design and execution’. This may be a reference to nascent interdisciplinary research on computational fairness, transparency and accountability (see e.g. the FAT-ML workshop).

While I’d like to see more recognition of the latter, both among the wider academic community and in policy discussions, I hope that its inclusion in the White House report signals a positive direction in the big data debate over the coming years.

‘Privacy and consumer markets’ – talk at 31c3

I just gave a talk at the 31st annual Chaos Communication Congress in Hamburg. The blurb:

“The internet may be the nervous system of the 21st century, but its main business purpose is helping marketers work out how to make people buy stuff. This talk maps out a possible alternative, where consumers co-ordinate online, pooling their data and resources to match demand with supply.”

It was live-streamed and the video should be up on the ccc-tv soon. Slides from the talk are available here in PDF or ODP.

Thanks to all the organisers for running such a great event!

YouGov Profiles

I haven’t blogged here in a while. But I did write this piece on YouGov’s Profiler app –  a rather fun but warped view on the research company’s consumer profiling data.

It’s published in The Conversation – if you haven’t come across them yet, I strongly recommend taking a look. They publish topical and well-informed opinion pieces from academics, and their motto is ‘academic rigour, journalistic flair’. Best of all, all the articles are licensed under a Creative Commons (BY-ND) license – ensuring they can be republished and shared as widely as possible.

What do they know about me? Open data on how organisations use personal data

I recently wrote a guest post for the Open Knowledge Foundation’s Personal Data and Privacy Working Group. It delves into the UK register of data controllers – a data source I’ve written about before and which forms the basis of a forthcoming research paper. This time, I’m looking through the data in light of some of the recent controversies we’ve seen in the media including and the construction worker’s blacklist fiasco…

Publishing this information in obscure, unreadable and hidden privacy policies and impact assessments is not enough to achieve meaningful transparency. There’s simply too much of it out there to capture in a piecemeal fashion, in hidden web pages and PDFs. To identify the good and bad things companies do with our personal information, we need more data, in a more detailed, accurate, machine-readable and open format. In the long run, we need to apply the tools of ‘big data’ to drive new services for better privacy management in the public and private sector, as well as for individuals themselves.

You can read the rest here. Thanks to the OKF/ORG for kick-starting such interesting discussions through the mailing list – I’m looking forward to continuing them at the OKF event in Berlin this summer and elsewhere. If you want to participate, do join the working group.

Care.Data: Why we need a new social contract for personal health data

In an ideal world, our collective medical records would be a public good, carefully stewarded by responsible institutions, used to derive medical insights and manage public health better. This is the basic premise of the scheme, and construed as such it suggests a simple moral equation with an obvious answer; give up a little individual privacy for the greater public good. The problem is, our world is not ideal. We’re in the midst of multiple crises of trust in government, the private sector and the ability of our existing global digital infrastructure to adequately deal with the challenges of personal data.

The NHS conducted a privacy impact assessment for the scheme, to identify and weigh its risks and benefits. In discussing why citizens might choose to opt out of sharing their own data (as 40% of surveyed GPs said they would), the final paragraph is both infuriating and revealing:

‘However, some people may believe that any use of patient identifiable data without explicit patient consent is unacceptable. These people are unlikely to be supportive of care.data whatever its potential benefits and may object to the use of personal confidential data for wider healthcare purposes.’

In other words, there are some people who will selfishly exercise their individual rights to privacy (for whatever misguided reasons), to the cost and detriment of the public good.

While the leaflet promoting the scheme encourages donating one’s data as a contribution to the public health service, even left-wing Bevanites have reason to be sceptical. While many of us instinctively trust ‘our NHS’, the truth is large parts of it are no longer ‘ours’, and the scheme is a perfect example. As expected, the contract to provide the ‘data extraction’ service was won by an unaccountable private sector provider (Atos, who are also responsible for disability benefit assessments), while some of the main beneficiaries of all the data itself will be a plethora of commercial entities.

This is not to say that private sector use of health data is inherently bad. The trouble with the scheme goes deeper than that; it is a microcosm of a much wider malaise about the future of personal data and the value of privacy.

The social contract governing the use of our health information was written for a different age, where ‘records’ meant paper, folders and filing cabinets rather than entries in giant, mine-able databases. This social contract (if it ever even existed) never granted a mandate for the new kinds of purposes HSCIC proposes.

Such a mandate would have to be based on a realistic and robust assessment of the long-term risks and a stronger regulatory framework for downstream users. Crucially, it would need to proactively engage citizens, enabling them to make informed choices about their personal data and its role in our national information infrastructure. Rather than seizing this opportunity to negotiate a new deal around data sharing, the architects of this scheme have attempted to usher it in quietly through the back door.

Thankfully, there are alternative ways to reap the benefits of aggregated health data. One example is a Swiss initiative: a patient data co-operative, owned and run by its members. By giving patients themselves a stake and a say in the governance of their data, the project aims to harness that data to ‘benefit the individual citizen and society without discrimination and invasion into privacy’.

Personal data collected unethically is like bad debt. You can aggregate it into complex derivatives, but in the end it’s still toxic. If the NHS start out on the wrong foot with health data, no amount of beneficial re-use will shore up public trust when things go wrong.

5 Stars of Personal Data Access

As a volunteer ‘data donor’ at the Midata Innovation Lab, I’ve recently been attempting to get my data back from a range of suppliers. As our lives become more data-driven, an increasing number of people want access to a copy of the data gathered about them by service providers, personal devices and online platforms. Whether it’s financial transactions data, activity records from a Fitbit or Nike Fuelband, or gas and electricity usage, access to our own data has the potential to drive new services that help us manage our lives and gain self-insight. But anyone who has attempted to get their own data back from service providers will know the process is not always simple. I encountered a variety of complicated access procedures, data formats, and degrees of detail.

For instance, BT gave me access to my latest bill as a CSV file, but previous months were only available as PDF documents. And my broadband usage was displayed as a web page in a separate part of the site. Wouldn’t it be useful to have everything – broadband usage, landline, and billing – in one file, covering, say, the last year of service? Or, even better, a secure API which would allow trusted applications to access the latest data directly from my BT account, so I don’t have to?
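To make the ‘everything in one file’ idea concrete, here is a minimal sketch in plain Python that joins a billing export and a usage export on a shared month column. The column names (`month`, `amount_gbp`, `broadband_gb`) and the sample figures are invented for illustration; I am not claiming any supplier’s real exports look like this.

```python
import csv
import io

# Hypothetical exports: the column names and values are made up for this sketch.
BILLING_CSV = "month,amount_gbp\n2014-01,32.50\n2014-02,31.00\n"
USAGE_CSV = "month,broadband_gb\n2014-02,48.2\n2014-03,51.7\n"

FIELDS = ["month", "amount_gbp", "broadband_gb"]


def read_by_month(text):
    """Parse CSV text into a dict keyed by the 'month' column."""
    return {row["month"]: row for row in csv.DictReader(io.StringIO(text))}


def merge_by_month(billing_csv, usage_csv):
    """Combine both exports into one row per month, leaving gaps blank."""
    billing = read_by_month(billing_csv)
    usage = read_by_month(usage_csv)
    merged = []
    for month in sorted(set(billing) | set(usage)):
        merged.append({
            "month": month,
            "amount_gbp": billing.get(month, {}).get("amount_gbp", ""),
            "broadband_gb": usage.get(month, {}).get("broadband_gb", ""),
        })
    return merged


def to_csv(rows):
    """Serialise the merged rows back to a single CSV string."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


merged = merge_by_month(BILLING_CSV, USAGE_CSV)
print(to_csv(merged))
```

Months that appear in only one export are kept with the missing value left blank, so the gaps in what a supplier provides stay visible rather than being silently dropped.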

Another problem was that in order to get my data, I sometimes had to sign up for unwanted services. My mobile network provider, GiffGaff, require me to opt in to their marketing messages in order to receive my monthly usage report. Fitbit users need to pay for a premium account to get access to the raw data from their own device.

Wouldn’t it be nice to rate these services according to a set of best practices? In 2010, when the open data movement was in its infancy, Tim Berners-Lee defined ‘Five Stars of Open Data’ to describe how ‘open’ a data source is. If it’s on the web under an open license, it gets one star. Five stars means that it is in a machine-readable, non-proprietary format, and uses URIs and links to other data for context. While we don’t necessarily want our private, personal data to be ‘open’ in Berners-Lee’s sense, we do want standard ways to get access to our personal data from a service. So, here are my suggested ‘Five Stars of Personal Data Access’ (to be read as complementary, not necessarily hierarchical):

1. My data is made available to me for free in a digital form. For instance, through a web dashboard, or email, rather than as a paper statement. There are no strings attached; I do not need to pay for premium services or sign up to marketing alerts to read it.

2. My data is machine-readable (such as CSV rather than PDF).

3. My data is in a non-proprietary format (such as CSV, XML or JSON, rather than Excel).

4. My data is complete; all the relevant fields are included in the same place. For instance, usage history and billing are included in the same file or feed.

5. My data is up-to-date; available as a regularly-updated feed, rather than a static file I have to look up and download. This could be via a secure API that I can connect trusted third-party services to.
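Since the stars are complementary rather than hierarchical, a rating is really just a checklist. As a rough sketch, this is how it could be recorded in plain Python; the class and field names are my own shorthand for the five criteria above, and the example supplier is hypothetical rather than a rating of any real company.

```python
from dataclasses import dataclass, fields


@dataclass
class DataAccess:
    """One flag per star; the names are shorthand for the five criteria."""
    free_and_digital: bool   # 1. free, digital, no strings attached
    machine_readable: bool   # 2. e.g. CSV rather than PDF
    non_proprietary: bool    # 3. e.g. CSV, XML or JSON rather than Excel
    complete: bool           # 4. all relevant fields in one place
    up_to_date: bool         # 5. regularly updated feed or secure API


def stars(service):
    """Count the criteria met; the order does not matter."""
    return sum(getattr(service, f.name) for f in fields(service))


def missing(service):
    """List the criteria a service still fails."""
    return [f.name for f in fields(service) if not getattr(service, f.name)]


# A hypothetical supplier: free CSV download, but split across files and static.
example_supplier = DataAccess(
    free_and_digital=True,
    machine_readable=True,
    non_proprietary=True,
    complete=False,
    up_to_date=False,
)

print(stars(example_supplier), "stars; missing:", missing(example_supplier))
```

Listing the missing criteria alongside the count matters here, because two services with the same number of stars can fall short in quite different ways.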

The Midata programme has considered these issues from the outset, calling for suppliers to adopt common procedures and formats. Simplifying this process is an important step towards a world where individuals are empowered by their own data. My initial attempts to get my data back from suppliers point to a number of areas for improvement, which I’ve tried to reflect in these star ratings. Of course, there’s lots of room for debate over the definitions I’ve given here. And I’m sure there are other important aspects I’ve missed out. What would you add?