Category Archives: datascience

When good algorithms go bad…

I recently spoke on a panel at Strata called ‘When good algorithms go bad’, organised by the good people at DataKind and sponsored by Pivotal Labs and O’Reilly Media. My co-panellists were Madeleine Greenhalgh (Cabinet Office), Shahzia Holtom (Pivotal Labs), and Hetan Shah (Royal Statistical Society), with Duncan Ross (DataKind) chairing what was a very stimulating discussion about the ethics of data science.

I won’t attempt to summarise the excellent contributions of the other speakers (and the audience), but here are some relevant (I hope!) references for those interested in delving further into some of the topics raised:

Many thanks to DataKind UK for organising and inviting me!

Predictions, predictions

The Crystal Ball by John William Waterhouse

It is prediction time. Around about this time of year, pundits love to draw on their specialist expertise to predict significant events for the year ahead. The honest ones also revisit their previous yearly predictions to see how they did.

Philip Tetlock is an expert on expert judgement. Between 1984 and 2003 he conducted a series of experiments to uncover whether experts were better than average at predicting outcomes in their given specialism. Predictions had to be specific and quantifiable, in areas ranging from economics to politics and international relations. He found that many so-called experts were pretty bad at telling the future, and some were even worse than the man or woman on the street. Most worryingly, there was an inverse relationship between an expert’s fame and the accuracy of their predictions.

But he also discovered certain factors that make some experts better predictors than others. He found that training in probabilistic reasoning, avoidance of common cognitive biases, and evaluating previous guesses enabled experts to make more accurate forecasts. Predictions based on the combined judgements of multiple individuals also proved helpful.
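The benefit of combining judgements can be illustrated with a toy example. The probabilities below are invented, and the scoring rule (the Brier score, a standard squared-error measure for probabilistic forecasts, where lower is better) is one common choice among several:

```python
# Hypothetical illustration: scoring individual forecasts vs. their average.
# All probabilities are invented for the example.

def brier(forecast: float, outcome: int) -> float:
    """Squared error between a probability forecast and a 0/1 outcome (lower is better)."""
    return (forecast - outcome) ** 2

# Three forecasters' probabilities that an event will occur; it did (outcome = 1).
forecasts = [0.9, 0.4, 0.7]
outcome = 1

individual_scores = [brier(p, outcome) for p in forecasts]
crowd_forecast = sum(forecasts) / len(forecasts)   # simple unweighted average
crowd_score = brier(crowd_forecast, outcome)

# The averaged forecast scores at least as well as the average individual score
# (a consequence of Jensen's inequality for a squared-error rule).
assert crowd_score <= sum(individual_scores) / len(individual_scores)
```

Here the individual scores are 0.01, 0.36 and 0.09 (mean ≈ 0.153), while the averaged forecast of ≈ 0.67 scores ≈ 0.11 — the crowd beats its average member.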

This work has continued in the Good Judgement Project, a research project involving several thousand volunteer forecasters, which now has a commercial spin-off, Good Judgement Inc. The project has experimented with various methods of training, forming forecasters into complementary teams, and aggregation algorithms. By rigorously and systematically testing these and other factors, the system aims to uncover the determinants of accurate forecasts. It has already proven highly successful, winning a forecasting competition funded by the US intelligence community (the Intelligence Advanced Research Projects Activity’s ‘Aggregative Contingent Estimation’ programme, IARPA-ACE) several years in a row.

The commercial spin-off began as an invite-only scheme, but it now has a new arm, ‘Good Judgement Open’, which allows anyone to sign up and have a go. I’ve just signed up and made my first prediction in response to the following question:

“Before the end of 2016, will a North American country, the EU, or an EU member state impose sanctions on another country in response to a cyber attack or cyber espionage?”

You can view the question and the current forecast from users of the site. Users compete to be the most prescient in their chosen areas of expertise.

It’s an interesting concept. I expect it will also prove to be a pretty shrewd way of harvesting intelligence and recruiting superforecasters for Good Judgement Inc. In this sense the business model is like Facebook and Google, i.e. monetising user data, although not in order to sell targeted advertising.

I can think of a number of ways the site could be improved further. I’d like a tool which helps me break down my prediction into its necessary and jointly sufficient elements, and lets me place a probability estimate on each. For instance, suppose the question about international sanctions in response to a cyberattack depends on several factors: the likelihood of severe attacks, the ability of digital forensics to determine the origin of an attack, and the likelihood of sanctions as a response. I have more certainty about some of these factors than others, so a system which splits the question into parts, and lets me periodically revise each one, would be helpful (in Bayesian terms, I’d like to be explicit about my priors and posteriors).
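A minimal sketch of that decomposition, with invented probabilities: if the factors are treated as independent and jointly sufficient, the combined estimate is simply their product, and revising any one factor automatically updates the whole forecast.

```python
# Hypothetical decomposition of a forecast into sub-estimates.
# The factor names and probabilities are invented for illustration.

factors = {
    "severe cyber attack occurs": 0.6,
    "forensics attribute the attack": 0.5,
    "sanctions chosen as the response": 0.3,
}

def combined_estimate(factors: dict) -> float:
    """Product of the factor probabilities, assuming independence."""
    result = 1.0
    for p in factors.values():
        result *= p
    return result

print(f"Initial estimate: {combined_estimate(factors):.2f}")   # 0.6 * 0.5 * 0.3 = 0.09

# Revise a single prior; the overall forecast updates with it.
factors["forensics attribute the attack"] = 0.7
print(f"Revised estimate: {combined_estimate(factors):.2f}")   # 0.6 * 0.7 * 0.3 = 0.13
```

A real tool would need to handle dependence between factors, but even this naive product forces the forecaster to make each assumption explicit.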

One could also experiment with keeping predictions secret until after the outcome of some event. This would stop one forecaster’s prediction from contaminating the predictions of others (a particular risk when the forecaster is known to be reliable). It would also allow forecasters to say ‘I knew it!’ without saying ‘I told you so’ (we could call it the ‘Reverse Cassandra’). Of course, you’d need some way to prove that you didn’t just write the prediction after the event and back-date it, or create a prediction for every possible outcome and then selectively reveal the correct one, a classic con illustrated by TV magician Derren Brown. If you wanted to get really fancy, you could do that kind of thing with cryptography and the blockchain.
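The cryptographic version of this is a standard commit-reveal scheme, which needs nothing more exotic than a hash function. A sketch using only Python’s standard library (the prediction text is invented for the example):

```python
# Commit-reveal for sealed predictions: publish the hash now, reveal the text later.
# A random salt stops anyone brute-forcing short or guessable predictions.

import hashlib
import secrets

def commit(prediction: str) -> tuple:
    """Return (commitment, salt). Publish the commitment; keep the salt secret."""
    salt = secrets.token_hex(16)
    digest = hashlib.sha256((salt + prediction).encode()).hexdigest()
    return digest, salt

def verify(prediction: str, salt: str, commitment: str) -> bool:
    """After the event, reveal prediction and salt so anyone can check the hash."""
    return hashlib.sha256((salt + prediction).encode()).hexdigest() == commitment

commitment, salt = commit("Sanctions will be imposed before the end of 2016")
assert verify("Sanctions will be imposed before the end of 2016", salt, commitment)
assert not verify("No sanctions will be imposed", salt, commitment)
```

Publishing the hash (on a blog, or indeed a blockchain) timestamps the commitment, and committing to only one text defeats the Derren Brown con: you cannot later reveal a different prediction without the hashes failing to match.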

After looking into this a bit more, I came across this blog by someone called gwern, who appears to be incredibly knowledgeable about prediction markets (and much else besides).

Galileo and Big Data: The Philosophy of Data Science

The heliocentric view of the solar system was initially rejected… for good reasons.

I recently received a targeted email from a recruitment agency, advertising relevant job openings. Many of them were for ‘Data Science’ roles. At recruitment agencies. There’s a delicious circularity here; somewhere, a data scientist is employed by a recruitment agency to use data science to target recruitment opportunities to potential data scientists for data science roles at recruitment agencies… to recruit more data scientists.

For a job title that barely existed a few years ago, Data Science is clearly a growing role. There are many ways to describe a data scientist, but ‘experts who use analytical techniques from statistics, computing and other disciplines to create value from new (‘big’) data’ provides a good summary (from Nesta).

For me, one of the most interesting and novel aspects of these new data-intensive scientific methods is the extent to which they seem to change the scientific method. In fact, the underpinnings of this new approach can be helpfully explained in terms of an old debate in the philosophy of science.

Science is typically seen as a process of theory-driven hypothesis testing. For instance, Galileo’s heliocentric theory made predictions about the movements of the stars; he tested these predictions with his telescope; and his telescopic observations appeared to confirm his predictions and, at least in his eyes, proved his theory (if Galileo were a pedantic modern scientist, he might instead have described it more cautiously as ‘rejecting the null hypothesis’). As we know, Galileo’s theory didn’t gain widespread acceptance for many years (indeed, it still isn’t universally accepted today).

We could explain the establishment’s dismissal of Galileo’s findings as a result of religious dogma. But as Paul Feyerabend argued in his seminal book Against Method (1975), Galileo’s case was actually quite weak (despite being true). His theory appeared to directly contradict many ordinary observations. For instance, rather than birds getting thrown off into the sky, as one might expect if the earth were moving, they can fly a stable course. It was only later that such phenomena could be accommodated by the heliocentric view.

So Galileo’s detractors were, at the time, quite right to reject his theory. Many of their objections came from doubts about the veracity of the data gathered from Galileo’s unfamiliar telescopic equipment. Even if a theory makes a prediction that is borne out by an observation, we might still rationally reject it if it contradicts other observations, if there are competing theories with more explanatory power, and if – as with Galileo’s telescope – there are doubts about the equipment used to derive the observation.

Fast forward three centuries, to 1906, when French physicist Pierre Duhem published La Théorie Physique. Son Objet, sa Structure (The Aim and Structure of Physical Theory). Duhem argued that physicists cannot simply test one hypothesis in isolation, because whatever the result, one could always reject the ‘auxiliary’ hypotheses that support the observation. These include hypotheses about the background conditions, about the efficacy of your measurement equipment, and so on. No experiment can conclusively confirm or deny the hypothesis because there will always be background assumptions that are open to question.

This idea became known as the Duhem-Quine thesis, after the American philosopher and logician W. V. O. Quine argued that this problem applied not only to physics, but to all of science and even to the truths of mathematics and logic.

The Duhem-Quine thesis is echoed in the work of leading proponents of data-driven science, who argue that a paradigm shift is taking place. Rather than coming up with a hypothesis that predicts a linear relationship between one set of variables and another (‘output’) variable, data-driven science can simply explore every possible relationship (including highly complex, non-linear functions) between any set of variables. This shift has been described as going from data models to algorithmic models, from ‘model-based’ to ‘model-free’ science, and from parametric to ‘non-parametric’ modelling (see, for instance, Stuart Russell & Peter Norvig’s 2009 Artificial Intelligence, chapter 18).

While this paradigm shift may be less radical than the Duhem-Quine thesis (it certainly doesn’t cast doubt on the foundations of mathematical truths, as Quine’s holism appears to do), it does embrace one of its key messages: you cannot test a single hypothesis in isolation with a single observation, since there are always auxiliary hypotheses involved – other potential causal factors which might be relevant to the output variable. Modern data science techniques attempt to include as many potential causal factors as possible, and to automatically test an extensive range of complex, non-linear relationships between them and the output variables.
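The parametric/non-parametric contrast can be made concrete in a few lines of plain Python. The data and the choice of methods below are invented for illustration: a straight-line fit commits in advance to a functional form (the parametric hypothesis), while nearest-neighbour prediction assumes no form at all and simply lets the data speak.

```python
# Toy contrast: a 'model-based' parametric fit vs. a 'model-free' predictor.
# Data and methods are invented for illustration; stdlib only.

def linear_fit(xs, ys):
    """Ordinary least squares for y = a*x + b; returns a prediction function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return lambda x: a * x + b

def nearest_neighbour(xs, ys):
    """Predict the y of the closest training x -- no parametric model at all."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

# The true relationship is non-linear: y = x**2.
train_x = [-3, -2, -1, 0, 1, 2, 3]
train_y = [x ** 2 for x in train_x]

line = linear_fit(train_x, train_y)
knn = nearest_neighbour(train_x, train_y)

# At x = 2.9 the true value is 8.41: the line predicts 4.0 (the best a straight
# line can do on symmetric quadratic data), the neighbour predicts 9.
print(line(2.9), knn(2.9))
```

The linear hypothesis fails here not because the data are noisy but because the hypothesis itself was wrong; the non-parametric method sidesteps that commitment, at the cost (as the next paragraphs discuss) of offering no explanation of *why* its predictions work.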

Contemporary philosophers of science are beginning to draw these kinds of connections between their discipline and this paradigm shift in data science. For instance, Wolfgang Pietsch’s work on theory-ladenness in data-intensive science compares the methodology behind common data science techniques (namely, classificatory trees and non-parametric regression) to theories of eliminative induction originating in the work of John Stuart Mill.

There are potential dangers in pursuing theory-free and model-free analysis. We may end up without comprehensible explanations for the answers our machines give us. This may be scientifically problematic, because we often want science to explain a phenomenon rather than just predict it. It may also be problematic in an ethical sense, because machine learning is increasingly used in ways that affect people’s lives, from predicting criminal behaviour to targeting job offers. If data scientists can’t explain the underlying reasons why their models make certain predictions, why they target certain people rather than others, we cannot evaluate the fairness of decisions based on those models.