Self-sufficient programming: RSA cryptosystem with plain Python

[WARNING: this is an exercise purely for fun: it is definitely very insecure, do not use this code in a real system and do not attempt to write your own cryptographic functions!]

Despite working in computer science, these days I barely have the need to write any code in my day-to-day work. This is not unusual for someone working in CS, especially those of us at the ‘softer’ edges. So I decided to set myself a little project to get back into it.

I’ve also been thinking about how the development stack for most programming languages is so dependent on a byzantine labyrinth of third party libraries and packages. This is great in many ways, because whatever you want to do, someone else has probably already done it better and more secure.

But it also means that most applications are dependent on code written and maintained by other people. In some cases this can lead to an unexpected global mess, such as the time when a developer of several popular NPM libraries pulled them from the package management system, breaking the ‘Jenga tower of Javascript‘.

While I don’t think this is actually the answer to the above problems, it could be a fun exercise to see how many of the basic building blocks of modern computing could be re-created from scratch. By me. Someone whose programming has always been a little shoddy and now very out of practice. Armed with nothing more than a few scribbled notes and old slides covering undergraduate-level CS, and using only basic Python (i.e. the Standard Library, no external packages). This is a challenge, because while Python supposedly ships with all the basic stuff you should need (‘batteries included’), in practice this is arguably not true.


What project to pick? I decided to have a go at implementing the RSA cryptosystem for asymmetric key cryptography, a pretty fundamental part of modern computer security. This seemed like a good challenge for two reasons:

  • RSA is not in the Python standard library, and requires various functions which I’d naturally go looking for in external libraries (e.g. finding coprimes).
  • In debates about national security, it is often said that governments can’t effectively ban encryption because it’s not any particular piece of software, it’s just math. Anyone who knows RSA and how to program can implement it themselves given a general purpose computer. But how many people does that include? Can even I, a non-cryptographer with a basic knowledge of how it works in theory, create an actual working version of RSA?

So as an entirely pedagogical / auto-didactical weekend project I’m going to have a go. I’m sure that it will be highly insecure, but I’ll be happy if I can make something that just about works.

I’m allowing myself access to material describing the RSA system (e.g. lecture slides, the original paper), and to stackexchange for basic Python syntax etc. that I’ve forgotten. But no use of external libraries or peeking at existing Python implementations of RSA.

So far I’ve got the following steps of the key generation process:

Pick two large prime numbers:

First we need to pick two large prime numbers at random. So first, we’ll get the user to type some random keys on the keyboard (this will be familiar if you’ve used e.g. Gnu-PG):

user_entropy = input("please generate some entropy by typing lots of random characters: ")

entropy = 0
for letter in user_entropy:
entropy = entropy + ord(letter)

That turns the user input into a single number. Then we need to find the nearest prime number, with two functions:

def isPrime(num):
for i in range(2,num):
if (num % i) == 0:
prime = False
prime = True
return prime
def find_nearest_prime(num):
while num < 100000:
if isPrime(num):
return num
num += 1

Get N and Φ(N)

Now we have two prime numbers, we multiply then together to get the composite number N which will be the second part of the private and public keys.

n = prime1*prime2

We also need Φ(N) (the ‘totient’):

phi_n = ((prime1-1)*(prime2-1))

Get e (public key)

Now we have to find e, the public key component. The easy bit condition is that it has to be between 1 and Φ(N). The more tricky condition is that it has to be coprime with both N and Φ(N). Two numbers are coprime if they have no common factors other than 1. So first thing we need is a function to find the factors of a number:

def get_factors(num):
factors = []
for i in range(2,num):
if ((num % i) == 0):
return factors

Now we can write a function to check if two numbers have common factors other than 1:

def isCoprime(num1,num2):
num1_factors = get_factors(num1)
num2_factors = get_factors(num2)
if set(num1_factors).isdisjoint(set(num2_factors)):
# print('no common factors - they coprime!')
return True
# print('there are common factors, not coprime')
return False

Now we can write a function to find values for e that will satisfy those conditions:

def find_e(n,phi_n):
candidates = []
for i in range(3,n):
if isPrime(i):
if((isCoprime(i,n)) and (isCoprime(i,phi_n))):
return candidates

This returns a list of potential values for e which we can pick.

Get d (private key)

How about the private key for decrypting messages (d)? This should be the multiplicative inverse of e; that means that when multiplied by e, it should equal 1. My notes say this is equivalent to: e * d = 1 mod N. This is where I’ve run into trouble. My initial attempts to define this function don’t seem to have worked when I tested the encryption and decryption functions on it. It’s also incredibly slow.

At this point I’m also not sure if the problem lies somewhere else in my code. I made some changes and now I’m waiting for it to calculate d again. Watch this space …

[UPDATE: I got it working …]

OK, so it seems like it wasn’t working before because I’d transcribed the multiplicative inverse condition wrong. I had had (d * e) % N, but it’s actually (d * e) Φ(N). So the correct function is:

def find_d(prime1,n):
for i in range(prime1,n):
if (((i*e) % phi_n) == 1):
return i

So that’s the basic key generation functions done. I put in some command line interactions to get this all done, and save the keypair to a text file in the working directory.

Encryption and decryption

With the keypairs saved as dictionaries, the encryption and decryption functions are relatively simple:

def encrypt(pt):
return (pt ** public_key['e']) % public_key['n']
def decrypt(ct):
return (ct ** private_key['d'] % public_key['n'])

At the moment, this only works to encrypt integers rather than text strings. There are various ways we could handle encoding text to integers.

Functional … just!

So there we go. A day was just enough to put together a minimally functional RSA cryptosystem.

The main issue is that even for pairs of small prime numbers, it takes a while to find e. Keysizes in the 10-20’s range are pretty quick to compute, but NIST recommends asymmetric keys should be at least 2048-bits. Trying to generate a key this big means leaving the script running for a long, long time.

There are probably loads of ways I could improve the code. Also it would be better to default to a higher value for e. Finally, the default key management is basically nothing (a unencrypted plaintext file).

How to comply with GDPR Article 22? Automated credit decisions

This post explores automated decision-making systems in the context of the EU General Data Protection Regulation (GDPR). Recent discussions have focused on what exactly the GDPR requires of data controllers who are implementing automated decision-making systems. In particular, what information should be provided to those who are subject to automated decisions? I’ll outline the legal context and then present an example of what that might mean at a technical level, working through a very simple machine learning task in an imaginary credit lending scenario.

The legal context

In May next year, the GDPR will come into force in EU member states (including the UK). Part of the Regulation that has gained a fair amount of attention recently is Article 22, which sets out rights and obligations around the use of automated decision making. Article 22 gives individuals the right to object to decisions made about them purely on the basis of automated processing (where those decisions have significant / legal effects). Other provisions in the GDPR (in Articles 13,14, and 15) give data subjects the right to obtain information about the existence of an automated decision-making system, the ‘logic involved’ and its significance and envisaged consequences. Article 22 is an updated version of Article 15 in the old Data Protection Directive. Member states implemented the Directive into domestic law around a couple of decades ago, but the rights in Article 15 of the Directive have barely been exercised. To put it bluntly, no one really has a grip on what it meant in practice, and we’re now in a similar situation with the new regulation.

In early proposals for the GDPR, the new Article 22 (Article 20 in earlier versions) looked like it might be a powerful new right providing greater transparency in the age of big data profiling and algorithmic decision-making. However, the final version of the text significantly watered it down, to the extent that it is arguably weaker in some respects than the previous Article 15. One of the significant ambiguities is around whether Articles 13, 14, 15, or 22 give individuals a ‘right to an explanation’, that is, an ex post explanation of why a particular automated decision was made about them.

Explaining automated decision-making

The notion of a ‘right to an explanation’ for an automated decision was popularised in a paper which garnered a lot of media attention in summer of last year. However, as Sandra Wachter and colleagues argue in a recent paper, the final text carefully avoids mentioning such a right in the operative provisions. Instead, the GDPR only gives the subjects of automated decisions the right to obtain what Wachter et al describe as an ex ante ‘explanation of system functionality’. Under Articles 15 (1) h, and 14 (2) g, data controllers must provide ‘meaningful information about the logic involved, as well as the significance and the envisaged consequences of such processing’ to the data subject. But neither of these provisions amount to an ex post explanation for a particular decision that has been made. The only suggestion of such a right appears in a Recital (71), which says appropriate safeguards should include the ability of data subjects ‘to obtain an explanation of the decision reached after such assessment’.

Since Recitals are not operative provisions and therefore not binding, this suggests that there is no ex post ‘right to explanation’ for specific decisions in the GDPR. It is conceivable that such a right becomes established by the court at a later date, especially if it were to pay close attention to Recital 71. It is also possible that member state DPAs, and the new European Data Protection Board in its harmonisation role, interpret Article 22 in this way, and advise data controllers accordingly. But until then, it looks like data controllers will not be required to provide explanations for specific automated decisions.

Having followed discussions about this aspect of the GDPR since the first proposal text was released in 2012, one of the difficulties has been a lack of specific and detailed examples of how these provisions are supposed to operate in practice. This makes it hard to get a grip on the supposed distinction between a ‘right to explanation of a decision’ and a mere ‘right to an explanation of system functionality’.

If I’m entitled as a data subject to an explanation of a system’s ‘functionality’, and its ‘likely effects’ on me, that could mean a lot of things. It could be very general or quite detailed and specific. At a certain level of generality, such an explanation could be completely uninformative (e.g. ‘based on previous data the model will make a prediction or classification, which will be used to make a decision’). On the other hand, if the system were to be characterised in a detailed way,  showing how particular outputs relate to particular inputs (e.g. ‘applicants with existing debts are 3x less likely to be offered credit’), it might be possible for me to anticipate the likely outcome of a decision applied to me. But without looking at specific contexts, system implementations, and feasible transparency measures, it’s difficult to interpret which of these might be feasibly required by the GDPR.

A practical example: automated credit decisions

Even if legal scholars and data protection officers did have a clear idea about what the GDPR requires in the case automated decision making systems, it’s another matter for that to be implemented at a technical level in practice. In that spirit, let’s work through a specific case in which a data controller might attempt to implement an automated decision-making system.

Lots of different things could be considered as automated decision-making systems, but the ones that are getting a lot of attention these days are systems based on models trained on historical data using machine learning algorithms, whose outputs will be used to make decisions. To illustrate the technology, I’m going to explain how one might build a very simple system using real data (note: this is not intended to be an example of ‘proper’ data science; I’m deliberately going to miss out some important parts of the process, such as evaluation, in order to make it simpler).

Imagine a bank wants to implement an automated system to determine whether or not an individual should be granted credit. The bank takes a bunch of data from previous customers, such as their age, whether or not they have children, and the number of days they have had a negative balance (in reality, they’d probably use many more features, but let’s stick with these three for simplicity). Each customer has been labelled as a ‘good’ or ‘bad’ credit risk. The bank then wants to use a machine learning algorithm to train a model on this existing data to classify new customers as ‘good’ or ‘bad’. Good customers will be automatically granted credit, and bad customers will be automatically denied.

German credit dataset

Luckily for our purposes, a real dataset like this exists from a German bank, shared by Professor Hans Hofman from Hamburg University in 1994. Each row represents a previous customer, with each column representing an attribute, such as age or employment status, and a final column in which the customer’s credit risk has been labelled (either 1 for ‘Good’, or 2 for ‘Bad’).

For example, the 1,000th customer in the dataset has the following attributes:

‘A12 45 A34 A41 4576 A62 A71 3 A93 A101 4 A123 27 A143 A152 1 A173 1 A191 A201 1’

The ‘A41’ attribute in the 4th column indicates that this customer is requesting the credit in order to purchase a used car (a full description of the attribute codes can be found here The final column represents the classification of this customer’s credit risk (in this case 1 = ‘good’).

Building a model

Let’s imagine I’m a data scientist at the bank and I want to be able to predict the target variable (risk score of ‘good’ or ‘bad’) using the attributes. I’m going to use Python, including the pandas module to wrangle the underlying CSV file into an appropriate format (a ‘data frame’), and the scikit-learn module to do the classification.

import pandas as pd
from sklearn import tree

Next, I’ll load in the german credit dataset, including the column headings (remember, for simplicity we’re only going to look at three features – how long they’ve been in negative balance, their age and the number of dependents):

features = ["duration", "age", "num_depend", "risk"]
df = pd.read_csv("../Downloads/", sep=" ", header=0, names=features)

The target variable, the thing we want to predict, is ‘risk’ (where 1 = ‘good’ and 2 = ‘bad’). Let’s label the target variable y and the features X.

y = df[["risk"]]
X = df[features]

Now I’ll apply a basic Decision Tree classifier to this data. This algorithm partitions the data points (i.e. the customers) into smaller and smaller groups according to differences in the values of their attributes which relate to their classification (i.e. ‘people over 30’, ‘people without dependents’). This is by no means the most sophisticated technique for this task, but it is simple enough for our purposes. We end up with a model which can take as input any customer with the relevant set of attributes and return a classification of that customer as a good or bad credit risk.

The bank can then use this model to automatically make a decision about whether or not to grant or deny credit to the customer. Imagine a customer, Alice, makes an application for credit, and provides the following attributes;

Alice = {'duration' : 10, 'age' : 40, 'num_depend' : 1}

We then use our model to classify Alice:

# convert the python Dict into a pandas dataframe
Alice = pd.Series(Alice)
# reshape the values since sklearn doesn't accept 1d arrays
Alice = Alice.values.reshape(1, -1)
print clf.predict(Alice)

The output of our model for Alice is 2 (i.e. ‘bad’), so Alice is not granted the credit.

Logic, significance and consequences of automated decision taking

How could the bank provide Alice with ‘meaningful information about the logic involved, as well as the significance and the envisaged consequences of such processing’?

One proposal might be to provide Alice with a representation of the decision tree model that resulted from the training. This is what that looks like:

Reading from the top-down, each fork in the tree shows the criteria for placing an individual in one of two sides of the fork.

If Alice knows which personal attributes the bank knows about her (i.e. 40 years old, 1 dependent, 20 days in negative balance), she could potentially use this decision tree to work out whether or not this system would decide that she was a good credit risk. Reading from the top: the first fork asks whether the individual has 15.5 days or less in negative balance; since Alice has 20 days in negative balance, she is placed in the right hand category. The next fork asks whether the Alice has 43.5 days or less in negative balance, which she does. The next fork asks whether Alice is 23.5 years old or less, which she isn’t. The final fork on this branch asks if Alice has been in negative balance for 34.5 days or more, which she hasn’t, and at this point the model concludes that Alice is a bad credit risk.

While it’s possible for Alice to follow the logic of this decision tree, it might not provide a particularly intuitive or satisfactory explanation to Alice as to why the model gives the outputs it does. But it does at least give Alice some warning about the logic and the effects of this model.

Another way that the bank might provide Alice with information ‘meaningful information about the logic involved, as well as the significance and the envisaged consequences of such processing’ would be to allow Alice to try out what decisions the model would recommend based on a variety of different values for the attributes it considers. For instance, what if Alice was older, or younger? Would she receive a different decision?

The bank could show Alice the age threshold at which she would be considered a good or bad credit risk. If we begin with age = 40, we find that Alice is classified as a bad credit risk. The same is true for Alice at 41, 42, and 43. However, at age 44, Alice’s credit risk classification would tip over from bad to good. That small exercise in experimentation may give Alice an intuitive sense of at least one aspect of the logic of the decision-making system and its envisaged effects. We could do something similar with the other attributes – what if Alice had only had a negative balance for 10 days? What if Alice had more or less children?

If this kind of interactive, exploratory analysis were made available to Alice before she applied for credit, it might help her to decide whether or not she felt OK about this kind of automated-decision making system. It might help her decide whether she wants to object to it, as Article 22 entitles her to do. Rather than being presented in dry mathematical terms, the relationships between these variables and the target variable could be presented in colloquial and user-friendly ways; Alice could be told ‘you’re 4 years too young’ and ‘you’ve been in the red for too long’ to be offered credit.

At a later date, the data on which this model is trained might change, and thus the resulting model might give a different answer for Alice. But it is still possible to take a snapshot of the model at a particular time and, on that basis, provide potentially meaningful interfaces through which Alice could understand the logic, significance and effects of the system on her ability to gain credit.

Explanation: unknown

The point of this exercise is to put the abstract discussions surrounding the GDPR’s provisions on automated decision making into a specific context. If data controllers were to provide dynamic, exploratory systems which allow data subjects to explore the relationships between inputs and outputs, they may actually be functionally equivalent to an ex post explanation for a particular decision. From this perspective, the supposed distinction between an ex ante ‘explanation of system functionality’ and an ex post ‘explanation of a specific decision’ becomes less important. What’s important is that Alice can explore the logic and determine the likely effects of the automated decision-making system given her personal circumstances.

Some important questions remain. It’s easy enough, with a simple, low-dimensional model, to explore the relationships between certain features and the target variable. But it’s not clear how these relationships can be meaningfully presented to the data subject, especially for the more complex models that arise from other machine learning methods. And we know very little about how those who are subject to such automated decisions would judge their fairness, and what grounds they might have for objecting to them. Might Alice reject the putative relationship between a particular feature and the target variable? Might she object to the sampling techniques (in this case, Alice might quite reasonably argue that the attributes of German bank customers in 1994 have little bearing on her as a non-German credit applicant in 2017)? Perhaps Alice would reject the thresholds at which applicants are judged as ‘good’ or ‘bad’?

I hope this simplistic, but specific and somewhat realistic example can serve as a starting point for focused discussion on the options for usable, human-centered transparency around automated decision-making. This is a significant challenge which will require more research at the intersection of machine learning, law and human-computer interaction. While there has been some promising work on transparent / interpretable machine learning in recent years (e.g. ‘Quantitative Input Influence‘ and LIME), relatively little research has focused on the human factors of these systems. We know very little about how people might interpret and evaluate these forms of transparency, and how that might be affected by their circumstances and relative position in the context in which the decision is made.

These questions are important to explore if we want to create automated decision-making systems which adhere not just to the letter of data protection law, but also its spirit. The duty to provide information on the logic, significance and effects of algorithmic decision-making will mean very little, if it doesn’t provide data subjects with the ability to make an informed and reasonable decision about whether to subject themselves to such decisions.

When good algorithms go bad…

I recently spoke on a panel at Strata called ‘When good algorithms go bad’, organised by the good people at DataKind and sponsored by Pivotal Labs and O’Reilly Media. My co-panellists were Madeleine Greenhalgh (Cabinet Office), Shahzia Holtom (Pivotal Labs), and Hetan Shah (Royal Statistical Society), with Duncan Ross (DataKind) chairing what was a very stimulating discussion about the ethics of data science.

I won’t attempt to summarise the excellent contributions of the other speakers (and the audience), but here are some relevant (I hope!) references for those interested in delving further into some of the topics raised:

Many thanks to DataKind UK for organising and inviting me!

White House report on big data and discrimination

The White House has recently published another report on the social and ethical impacts of big data, entitled ‘Big Data: A Report on Algorithmic Systems, Opportunity and Civil Rights’. This could be considered the third in a trilogy, following 2012’s ‘Consumer Data Privacy in a Networked World’ and 2014’s ‘Big Data: Seizing Opportunities, Preserving Values’.

Each report has been a welcome contribution in an evolving debate about data and society, with impacts around the world as well as in the US context. They also reflect, I think, the progress that’s been made in the general direction of understanding in this complex and fast-moving policy area.

The 2012 report was largely an affirmation of commitment to the decades-old fair information practice principles, which noted the challenges posed by new technology. The 2014 report addressed the possibility that big data might lead to forms of unintended discrimination, but didn’t demonstrate any advanced understanding of the potential mechanisms behind such effects. In a paper written shortly after, Solon Barocas and Andrew Selbst commented that ‘because the origin of the discriminatory effects remains unexplored, the 2014 report’s approach does not address the full scope of the problem’.

The latest report does begin to dig more deeply into the heart of big data’s discrimination problem. It describes a number of policy areas – including credit, employment, higher education and criminal justice – in which there is a ‘problem’ to which a ‘big data opportunity’ might be a solution, along with a civil rights ‘challenge’ which must be overcome.

This framing is not without its problems. One might reasonably suspect that the problems in these policy areas are themselves at least partly the result of government mismanagement or market failure, and that advocating a big data ’solution’ would merely be a sticking plaster.

In any case, the report does well to note some of the perils and promise of big data in these areas. It acknowledges some of the complex processes by which big data may have disparate impacts – thus filling the gap in understanding identified by Barocas and Selbst in their 2014 paper. It also alludes to ways in which big data could also help us detect discrimination and thus help prevent it (something I have written about recently). It advocates what it calls ’equal opportunity by design’ approaches to algorithmic hiring. Towards the end of the report, it refers to ‘promising avenues for research and development that could address fairness and discrimination in algorithmic systems, such as those that would enable the design of machine learning systems that constrain disparate impact or construction of algorithms that incorporate fairness properties into their design and execution’. This may be a reference to nascent interdisciplinary research on computational fairness, transparency and accountability (see e.g. the FAT-ML workshop).

While I’d like to see more recognition of the latter, both among the wider academic community and in policy discussions, I hope that its inclusion in the White House report signals a positive direction in the big data debate over the coming years.

Predictions, predictions

The Crystal Ball by John William Waterhouse

It is prediction time. Around about this time of year, pundits love to draw on their specialist expertise to predict significant events for the year ahead. The honest ones also revisit their previous yearly predictions to see how they did.

Philip Tetlock is an expert on expert judgement. He conducted a series of experiments between 1984-2003 to uncover whether experts were better than average at predicting outcomes in their given specialism. Predictions had to be specific and quantifiable, in areas ranging from economics to politics and internatonal relations. He found that many so-called experts were pretty bad at telling the future, and some were even worse than the man or woman on the street. Most worryingly, there was an inverse relationship between an experts fame and the accuracy of their predictions.

But he also discovered certain factors that make some experts better predictors than others. He found that training in probabilistic reasoning, avoidance of common cognitive biases, and evaluating previous guesses enabled experts to make more accurate forecasts. Predictions based on the combined judgements of multiple individuals also proved helpful.

This work has continued in the Good Judgement Project, a research project involving several thousand volunteer forecasters, which now has a commercial spin-off, Good Judgement Inc. The project has experimented with various methods of training, forming forecasters into complementary teams, and aggregation algorithms. By rigourously and systematically testing these and other factors, the system aims to uncover the determinants of accurate forecasts. It has already proven highly successful, winning a CIA-funded competition several years in a row (‘Intelligence Advanced Research Projects Activity – Aggregative Contingent Estimation’ (IARPA-ACE)).

The commercial spin-off began as an invite-only scheme, but now it has a new part called ‘Good Judgement Open’ which allows anyone to sign up and have a go. I’ve just signed up and made my first prediction in response to the following question:

“Before the end of 2016, will a North American country, the EU, or an EU member state impose sanctions on another country in response to a cyber attack or cyber espionage?”

You can view the question and the current forecast from users of the site. Users compete to be the most prescient in their chosen areas of expertise.

It’s an interesting concept. I expect it will also prove to be a pretty shrewd way of harvest intelligence and recruit superforecasters for Good Judgement Inc. In this sense the business model is like Facebook and Google, i.e. monetising user data, although not in order to sell targeted advertising.

I can think of a number of ways the site could be improved further. I’d like to be given a tool which helps me break down my prediction into various necessary and jointly sufficient elements and allow me to place a probability estimate on each. For instance, let’s say the question about international sanctions in response to a cyberattack depends on several factors; the likelihood of severe attacks, the ability of digital forensics to determine the origin of the attack, and the likelihood of sanctions as a response. I have more certainty about some of these factors than others, so a system which split them into parts, and periodically revise them, would be helpful (in Bayesian terms, I’d like to be explicit about my priors and posteriors).

One could also experiment with keeping predictions secret until after the outcome of some event. This would mean one forecaster’s prediction wouldn’t contaminate the predictions of others (perhaps if they were well known as a reliable forecaster). This would allow for forecastors to say ‘I knew it!’ without saying ‘I told you so’ (we could call it the ‘Reverse Cassandra’). Of course you’d need some way to prove that you didn’t just write the prediction after the event and back-date it. Or create a prediction for every possible outcome and then selectively reveal the correct one, a classic con illustrated by TV magician Derren Brown. If you wanted to get really fancy, you could do that kind of thing with cryptography and the blockchain.

After looking into this a bit more, I came across this blog by someone called gwern, who appears to be incredibly knowledgeable about prediction markets (and much else besides).

Galileo and Big Data: The Philosophy of Data Science

The heliocentric view of the solar system was initially rejected… for good reasons.

I recently received a targeted email from a recruitment agency, advertising relevant job openings. Many of them were for ‘Data Science’ roles. At recruitment agencies. There’s a delicious circularity here; somewhere, a data scientist is employed by a recruitment agency to use data science to target recruitment opportunities to potential data scientists for data science roles at recruitment agencies… to recruit more data scientists.

For a job title that barely existed a few years ago, Data Science is clearly a growing role. There are many ways to describe a data scientist, but ‘experts who use analytical techniques from statistics, computing and other disciplines to create value from new (‘big’) data’ provides a good summary (from Nesta).

For me, one of the most interesting and novel aspects of these new data-intensive scientific methods is the extent to which they seem to change the scientific method. In fact, the underpinnings of this new approach can be helpfully explained in terms an old debate in the philosophy of science.

Science is typically seen as a process of theory-driven hypothesis testing. For instance, Galileo’s heliocentric theory made predictions about the movements of the stars; he tested these predictions with his telescope; and his telescopic observations appeared to confirm his predictions and, at least in his eyes, proved his theory (if Galileo were a pedantic modern scientist, he might instead have described it more cautiously as ‘rejecting the null hypothesis’). As we know, Galileo’s theory didn’t gain widespread acceptance for many years (indeed, it still isn’t universally accepted today).

We could explain the establishment’s dismissal of Galileo’s findings as a result of religious dogma. But as Paul Feyerabend argued in his seminal book Against Method (1975), Galileo’s case was actually quite weak (despite being true). His theory appeared to directly contradict many ordinary observations. For instance, rather than birds getting thrown off into the sky, as one might expect if the earth were moving, they can fly a stable course. It was only later that such phenomena could be accommodated by the heliocentric view.

So Galileo’s detractors were, at the time, quite right to reject his theory. Many of their objections came from doubts about the veracity of the data gathered from Galileo’s unfamiliar telescopic equipment. Even if a theory makes a prediction that is borne out by an observation, we might still rationally reject it if it contradicts other observations, if there are competing theories with more explanatory power, and if – as with Galileo’s telescope – there are doubts about the equipment used to derive the observation.

Fast forward three centuries, to 1906, when French physicist Pierre Duhem published La Théorie Physique. Son Objet, sa Structure (The Aim and Structure of Physical Theory). Duhem argued that physicists cannot simply test one hypothesis in isolation. Because whatever the result, one could always reject the ‘auxiliary’ hypotheses that support the observation. These include hypotheses about the background conditions, about the efficacy of your measurement equipment, and so on. No experiment can conclusively confirm or deny the hypothesis because there will always be background assumptions that are open to question.

This idea became known as the Duhem-Quine thesis, after the American philosopher and logician W. V. O. Quine argued that this problem applied not only to physics, but to all of science and even to the truths of mathematics and logic.

The Duhem-Quine thesis is echoed in the work of leading proponents of data-driven science, who argue that a paradigm shift is taking place. Rather than coming up with a hypothesis that predicts a linear relationship between one set of variables and another (‘output’) variable, data-driven science can simply explore every possible relationship (including highly complex, non-linear functions) between any set of variables. This shift has been described as going from data to algorithmic models, from ‘model-based’ to ‘model-free’ science, and from parametric to ‘non-parametric’ modelling (see, for instance, Stuart Russel & Peter Norvig’s 2009 Artificial Intelligence, chapter 18).

While this paradigm shift may be less radical than the Duhem-Quine thesis (it certainly doesn’t cast doubt on the foundations of mathematical truths, as Quine’s holism appears to do), it does embrace one of its key messages; you cannot test a single hypothesis in isolation with a single observation, since there are always auxiliary hypotheses involved – other potential causal factors which might be relevant to the output variable. Modern data science techniques attempt to include as many potential causal factors as possible, and to automatically test an extensive range of complex, non-linear relationships between them and the output variables.

Contemporary philosophers of science are beginning to draw these kinds of connections between their discipline and this paradigm shift in data science. For instance, Wolfgang Pietsch’s work on theory-ladenness in data-intensive science compares the methodology behind common data science techniques (namely, classificatory trees and non-parametric regression) to theories of eliminative induction originating in the work of John Stuart Mill.

There are potential dangers in pursuing theory-free and model-free analysis. We may end up without comprehensible explanations for the answers our machines give us. This may be scientifically problematic, because we often want science to explain a phenomenon rather than just predict it. It may also be problematic in an ethical sense, because machine learning is increasingly used in ways that affect people’s lives, from predicting criminal behaviour to targeting job offers. If data scientists can’t explain the underlying reasons why their models make certain predictions, why they target certain people rather than others, we cannot evaluate the fairness of decisions based on those models.

The Who, What and Why of Personal Data (new paper in JOAL)

Some of my PhD research has been published in the current issue of the Journal of Open Access to Law, from Cornell University.

“Data protection laws require organisations to be transparent about how they use personal data. This article explores the potential of machine-readable privacy notices to address this transparency challenge. We analyse a large source of open data comprised of semi-structured privacy notifications from hundreds of thousands of organisations in the UK, to investigate the reasons for data collection, the types of personal data collected and from whom, and the types of recipients who have access to the data. We analyse three specific sectors in detail; health, finance, and data brokerage. Finally, we draw recommendations for possible future applications of open data to privacy policies and transparency notices.”

It’s available under open access at the JOAL website (or download the PDF)

Searching for Truthiness, Part 2: Knowledge-Based Trust

In the last post I explored two approaches to making computers do smart things, in particular relating to search engines. The knowledge representation approach (affiliated with traditional AI and the semantic web) involves creating ontologies, defining objects and relations, and getting software to make logical inferences over them. What I called the statistical approach (also known as machine learning) involves using data, often generated by human activity, to detect patterns and make a probabilistic assessment of the right answer. In the case of search, what we click on in response to queries and inbound hyperlinks are used to rank search results.

This brings us to the recent paper by some engineers at Google, on what they call knowledge-based trust (KBT). The problem faced by the statistical approach is that it is based on what millions of ordinary, fallible humans do on the web. That includes clicking on and linking to pages with sensational but unsubstantiated headlines, or dubious medical information. This means our biases get picked up by the system alongside our better judgement. If you train a computer with flawed data, it’s going to return flawed results; garbage in, garbage out. What the paper proposes is a new way to suppress (or at least, downgrade) such content based on the number of facts it contains.

But how can a search engine determine the factual content of a web page, if all it measures are clicks and links? It can’t. This is where the knowledge representation approach comes back to the rescue. By comparing statements extracted from web pages with a pre-existing body of knowledge, the researchers hope that a search engine could assess the trustworthiness of a page.

Google have been working on both the knowledge representation and statistical approaches for a long time. This proposal is one example of how the two approaches could be usefully integrated. Those little information boxes that crop up for certain Google searches are another. Try searching ‘Tiger vs Shark‘ and the first thing you’ll see above the normal search results is a tabular comparison of their respective properties – useful for those ‘who would win in a fight between x and y’ questions. These factoids are driven by a pre-existing body of structured data.

But hold on, where does this pre-existing body of knowledge come from, and why should we trust it, especially if it’s used to re-order search results? It comes from the ‘Knowledge Vault‘, Google’s repository of machine-readable information about the world, from geography, biology, history – you name it, they probably have it. It’s based on a collaboratively generated database called Freebase, created (or, perhaps more accurately, ‘curated’) since 2007 by Metaweb, and acquired by Google in 2010. It’s now due to shut down and be replaced by Wikidata, another source of structured data, extracted from Wikipedia.

So while our collective clicks and links may be a bad measure of truthiness, perhaps our collaborative encyclopedia entries can serve as a different standard for truth-assessment. Of course, if this standard is flawed, then the knowledge-based-trust score is going to be equally flawed (garbage in, garbage out). If you think Wikipedia (and hence Wikidata) is dodgy, then you won’t be very impressed by KBT-enhanced search results. If, on the other hand, you think it’s good enough, then it could lead to a welcome improvement. But we can’t escape some of the foundational epistemic questions whichever approach we adopt. In attempting to correct one source of bias, we introduce another. Whether the net effect is positive, or the biases cancel each other out, I don’t know. But what I do know is that isn’t just a question for software engineers to answer.

The main content of the paper itself is highly technical and, dare I say, boring for those of us outside of this branch of computer science. Its main contribution is a solution to the problem of distinguishing noise in the knowledge extraction process from falsehood in the source, something which has so far held back the practical application of such techniques to search ranking. But the discussion that the paper has prompted poses some very important social and political questions.

Risks of the Knowledge-Based Trust approach

The most immediate concern has come from the search engine optimisation community. Will SEO experts now be recommending websites to ‘push up the fact quotient’ on their content? Will marketers have even more reason to infiltrate Wikipedia in an effort to push their ‘facts’ into Wikidata? What about all the many contexts in which we assert untrue claims for contextually acceptable and obvious reasons (e.g. fiction, parody, or hyperbole)? Will they have a harder time getting hits?

And what about all the claims that are ‘unverifiable’ and have no ‘truth value’, as the logical positivists (see previous post) would have said? While KBT would only be one factor in the search rankings, it would still punish content containing many of these kinds of claims. Do we want an information environment that’s skewed towards statements that can be verified and against those that are unverifiable?

The epistemological status of what the researchers call ‘facts’ is also intriguing. The researchers seem to acknowledge that the knowledge base might not be completely accurate, when they include sentences like “facts extracted by automatic methods such as KV may be wrong”. This does seem to be standard terminology in this branch of computer science, but for philosophers, linguists, logicians, sociologists and others, the loose use of the ‘f’ word will ring alarm bells. Even putting aside these academic perspectives, our everyday use of ‘fact’ usually implies truth. It would be far less confusing for to simply call them statements, which can be either true or false.

Finally, while I don’t think it presents a serious danger right now, and indeed it could improve search engines in some ways, moving in this direction has risks for public debate, education and free speech. One danger is that sources containing claims that are worth exploring, but have insufficient evidence, will be systematically suppressed. If there’s no way for a class of maybe-true claims to get into the Knowledge Vault or Wikidata or whatever knowledge base is used, then you have to work extra hard to get people to even consider them. Whatever process is used to revise and expand the knowledge base will inevitably become highly contested, raising conflicts that may often prove irreconcilable.

It will be even harder if your claim directly contradicts the ‘facts’ found in the search engine’s knowledge base. If your claim is true, then society loses out. And even if your claim is false, as John Stuart Mill recognised, society may still benefit from having received opinion challenged:

“Even if the received opinion be not only true, but the whole truth; unless it is suffered to be, and actually is, vigorously and earnestly contested, it will, by most of those who receive it, be held in the manner of a prejudice, with little comprehension or feeling of its rational grounds.” – On Liberty (1859)

Search engines that rank claims by some single standard of truthiness are just one more way that free speech can be gradually, messily eroded. Of course, the situation we have now – the tyranny of the linked and clicked – may be erosive in different, better or worse ways. Either way, the broader problem is that search engines – especially those with a significant majority of the market – can have profound effects on the dissemination of information and misinformation in society. We need to understand these effects and find ways to deal with their political and social consequences.