Category Archives: ethics

When good algorithms go bad…

I recently spoke on a panel at Strata called ‘When good algorithms go bad’, organised by the good people at DataKind and sponsored by Pivotal Labs and O’Reilly Media. My co-panellists were Madeleine Greenhalgh (Cabinet Office), Shahzia Holtom (Pivotal Labs), and Hetan Shah (Royal Statistical Society), with Duncan Ross (DataKind) chairing what was a very stimulating discussion about the ethics of data science.

I won’t attempt to summarise the excellent contributions of the other speakers (and the audience), but here are some relevant (I hope!) references for those interested in delving further into some of the topics raised:

Many thanks to DataKind UK for organising and inviting me!

Galileo and Big Data: The Philosophy of Data Science

The heliocentric view of the solar system was initially rejected… for good reasons.

I recently received a targeted email from a recruitment agency, advertising relevant job openings. Many of them were for ‘Data Science’ roles. At recruitment agencies. There’s a delicious circularity here; somewhere, a data scientist is employed by a recruitment agency to use data science to target recruitment opportunities to potential data scientists for data science roles at recruitment agencies… to recruit more data scientists.

For a job title that barely existed a few years ago, Data Science is clearly a growing role. There are many ways to describe a data scientist, but ‘experts who use analytical techniques from statistics, computing and other disciplines to create value from new (‘big’) data’ provides a good summary (from Nesta).

For me, one of the most interesting and novel aspects of these new data-intensive methods is the extent to which they seem to change the scientific method itself. In fact, the underpinnings of this new approach can be helpfully explained in terms of an old debate in the philosophy of science.

Science is typically seen as a process of theory-driven hypothesis testing. For instance, Galileo’s heliocentric theory made predictions about the movements of the stars; he tested these predictions with his telescope; and his telescopic observations appeared to confirm his predictions and, at least in his eyes, proved his theory (if Galileo were a pedantic modern scientist, he might instead have described it more cautiously as ‘rejecting the null hypothesis’). As we know, Galileo’s theory didn’t gain widespread acceptance for many years (indeed, it still isn’t universally accepted today).

We could explain the establishment’s dismissal of Galileo’s findings as a result of religious dogma. But as Paul Feyerabend argued in his seminal book Against Method (1975), Galileo’s case was actually quite weak (despite being true). His theory appeared to directly contradict many ordinary observations. For instance, rather than birds getting thrown off into the sky, as one might expect if the earth were moving, they can fly a stable course. It was only later that such phenomena could be accommodated by the heliocentric view.

So Galileo’s detractors were, at the time, quite right to reject his theory. Many of their objections came from doubts about the veracity of the data gathered from Galileo’s unfamiliar telescopic equipment. Even if a theory makes a prediction that is borne out by an observation, we might still rationally reject it if it contradicts other observations, if there are competing theories with more explanatory power, and if – as with Galileo’s telescope – there are doubts about the equipment used to derive the observation.

Fast forward three centuries, to 1906, when French physicist Pierre Duhem published La Théorie Physique. Son Objet, sa Structure (The Aim and Structure of Physical Theory). Duhem argued that physicists cannot test one hypothesis in isolation, because whatever the result, one could always instead reject the ‘auxiliary’ hypotheses that support the observation. These include hypotheses about the background conditions, about the efficacy of your measurement equipment, and so on. No experiment can conclusively confirm or deny the hypothesis, because there will always be background assumptions that are open to question.

This idea became known as the Duhem-Quine thesis, after the American philosopher and logician W. V. O. Quine argued that this problem applied not only to physics, but to all of science and even to the truths of mathematics and logic.

The Duhem-Quine thesis is echoed in the work of leading proponents of data-driven science, who argue that a paradigm shift is taking place. Rather than coming up with a hypothesis that predicts a linear relationship between one set of variables and another (‘output’) variable, data-driven science can simply explore every possible relationship (including highly complex, non-linear functions) between any set of variables. This shift has been described as going from data models to algorithmic models, from ‘model-based’ to ‘model-free’ science, and from parametric to ‘non-parametric’ modelling (see, for instance, Stuart Russell & Peter Norvig’s 2009 Artificial Intelligence, chapter 18).

While this paradigm shift may be less radical than the Duhem-Quine thesis (it certainly doesn’t cast doubt on the foundations of mathematical truths, as Quine’s holism appears to do), it does embrace one of its key messages: you cannot test a single hypothesis in isolation with a single observation, since there are always auxiliary hypotheses involved – other potential causal factors which might be relevant to the output variable. Modern data science techniques attempt to include as many potential causal factors as possible, and to automatically test an extensive range of complex, non-linear relationships between them and the output variables.
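To make the parametric/non-parametric contrast concrete, here is a minimal sketch in plain Python (my own illustration, not taken from any of the works cited): a parametric model commits in advance to a functional form, while a non-parametric model lets the data themselves determine the shape of the relationship.

```python
def fit_linear(xs, ys):
    """Parametric: assume y = a*x + b and estimate a, b by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return lambda x: a * x + b

def fit_knn(xs, ys, k=3):
    """Non-parametric: predict by averaging the k nearest observations.

    No functional form is assumed; the 'model' is just the data."""
    def predict(x):
        nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k
    return predict

# A non-linear relationship the parametric model cannot express: y = x^2
xs = list(range(-5, 6))
ys = [x * x for x in xs]

linear = fit_linear(xs, ys)
knn = fit_knn(xs, ys)

print(linear(0))  # 10.0 – the best straight line misses the curve entirely
print(knn(0))     # ~0.67 – k-NN tracks the curve locally
```

The point is not that non-parametric methods are always better, but that they trade the single, isolated hypothesis (‘the relationship is linear’) for a search over a vast space of possible relationships.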

Contemporary philosophers of science are beginning to draw these kinds of connections between their discipline and this paradigm shift in data science. For instance, Wolfgang Pietsch’s work on theory-ladenness in data-intensive science compares the methodology behind common data science techniques (namely, classificatory trees and non-parametric regression) to theories of eliminative induction originating in the work of John Stuart Mill.

There are potential dangers in pursuing theory-free and model-free analysis. We may end up without comprehensible explanations for the answers our machines give us. This may be scientifically problematic, because we often want science to explain a phenomenon rather than just predict it. It may also be problematic in an ethical sense, because machine learning is increasingly used in ways that affect people’s lives, from predicting criminal behaviour to targeting job offers. If data scientists can’t explain the underlying reasons why their models make certain predictions, why they target certain people rather than others, we cannot evaluate the fairness of decisions based on those models.

Searching for Truthiness, Part 2: Knowledge-Based Trust

In the last post I explored two approaches to making computers do smart things, in particular relating to search engines. The knowledge representation approach (affiliated with traditional AI and the semantic web) involves creating ontologies, defining objects and relations, and getting software to make logical inferences over them. What I called the statistical approach (also known as machine learning) involves using data, often generated by human activity, to detect patterns and make a probabilistic assessment of the right answer. In the case of search, our clicks in response to queries, along with inbound hyperlinks, are used to rank search results.

This brings us to the recent paper by some engineers at Google, on what they call knowledge-based trust (KBT). The problem faced by the statistical approach is that it is based on what millions of ordinary, fallible humans do on the web. That includes clicking on and linking to pages with sensational but unsubstantiated headlines, or dubious medical information. This means our biases get picked up by the system alongside our better judgement. If you train a computer with flawed data, it’s going to return flawed results: garbage in, garbage out. What the paper proposes is a new way to suppress (or at least downgrade) such content, based on the accuracy of the facts it contains.

But how can a search engine determine the factual content of a web page, if all it measures are clicks and links? It can’t. This is where the knowledge representation approach comes back to the rescue. By comparing statements extracted from web pages with a pre-existing body of knowledge, the researchers hope that a search engine could assess the trustworthiness of a page.
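As a rough sketch of the idea (the data, names and scoring here are invented for illustration – the actual paper is far more sophisticated, modelling extraction noise and source accuracy jointly), one can imagine scoring a page by the fraction of its extracted (subject, predicate, object) triples that agree with a knowledge base:

```python
# A toy knowledge base of triples the search engine already trusts.
KNOWLEDGE_BASE = {
    ("obama", "born_in", "honolulu"),
    ("paris", "capital_of", "france"),
    ("earth", "orbits", "sun"),
}

def kbt_score(extracted_triples):
    """Fraction of a page's extracted triples confirmed by the knowledge base.

    Triples whose (subject, predicate) the KB knows nothing about are
    ignored rather than counted as false."""
    known = {(s, p) for s, p, _ in KNOWLEDGE_BASE}
    checkable = [t for t in extracted_triples if (t[0], t[1]) in known]
    if not checkable:
        return None  # nothing to judge the page on
    correct = sum(1 for t in checkable if t in KNOWLEDGE_BASE)
    return correct / len(checkable)

reliable_page = [("paris", "capital_of", "france"),
                 ("earth", "orbits", "sun")]
dubious_page = [("obama", "born_in", "kenya"),
                ("earth", "orbits", "sun")]

print(kbt_score(reliable_page))  # 1.0
print(kbt_score(dubious_page))   # 0.5
```

Even this toy version makes the central difficulty visible: the score is only as good as the knowledge base it is checked against, and pages making claims the KB knows nothing about cannot be scored at all.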

Google have been working on both the knowledge representation and statistical approaches for a long time. This proposal is one example of how the two approaches could be usefully integrated. Those little information boxes that crop up for certain Google searches are another. Try searching ‘Tiger vs Shark’ and the first thing you’ll see above the normal search results is a tabular comparison of their respective properties – useful for those ‘who would win in a fight between x and y’ questions. These factoids are driven by a pre-existing body of structured data.

But hold on, where does this pre-existing body of knowledge come from, and why should we trust it, especially if it’s used to re-order search results? It comes from the ‘Knowledge Vault’, Google’s repository of machine-readable information about the world, covering geography, biology, history – you name it, they probably have it. It’s based on Freebase, a collaboratively generated database created (or, perhaps more accurately, ‘curated’) since 2007 by Metaweb, which was acquired by Google in 2010. Freebase is now due to shut down and be replaced by Wikidata, another source of structured data, extracted from Wikipedia.

So while our collective clicks and links may be a bad measure of truthiness, perhaps our collaborative encyclopedia entries can serve as a different standard for truth-assessment. Of course, if this standard is flawed, then the knowledge-based-trust score is going to be equally flawed (garbage in, garbage out). If you think Wikipedia (and hence Wikidata) is dodgy, then you won’t be very impressed by KBT-enhanced search results. If, on the other hand, you think it’s good enough, then it could lead to a welcome improvement. But we can’t escape some of the foundational epistemic questions whichever approach we adopt. In attempting to correct one source of bias, we introduce another. Whether the net effect is positive, or the biases cancel each other out, I don’t know. But what I do know is that it isn’t just a question for software engineers to answer.

The main content of the paper itself is highly technical and, dare I say, boring for those of us outside of this branch of computer science. Its main contribution is a solution to the problem of distinguishing noise in the knowledge extraction process from falsehood in the source, something which has so far held back the practical application of such techniques to search ranking. But the discussion that the paper has prompted poses some very important social and political questions.

Risks of the Knowledge-Based Trust approach

The most immediate concern has come from the search engine optimisation community. Will SEO experts now be recommending websites to ‘push up the fact quotient’ on their content? Will marketers have even more reason to infiltrate Wikipedia in an effort to push their ‘facts’ into Wikidata? What about all the many contexts in which we assert untrue claims for contextually acceptable and obvious reasons (e.g. fiction, parody, or hyperbole)? Will they have a harder time getting hits?

And what about all the claims that are ‘unverifiable’ and have no ‘truth value’, as the logical positivists (see previous post) would have said? While KBT would only be one factor in the search rankings, it would still punish content containing many of these kinds of claims. Do we want an information environment that’s skewed towards statements that can be verified and against those that are unverifiable?

The epistemological status of what the researchers call ‘facts’ is also intriguing. The researchers seem to acknowledge that the knowledge base might not be completely accurate, when they include sentences like “facts extracted by automatic methods such as KV may be wrong”. This does seem to be standard terminology in this branch of computer science, but for philosophers, linguists, logicians, sociologists and others, the loose use of the ‘f’ word will ring alarm bells. Even putting aside these academic perspectives, our everyday use of ‘fact’ usually implies truth. It would be far less confusing to simply call them statements, which can be either true or false.

Finally, while I don’t think it presents a serious danger right now, and indeed it could improve search engines in some ways, moving in this direction has risks for public debate, education and free speech. One danger is that sources containing claims that are worth exploring, but have insufficient evidence, will be systematically suppressed. If there’s no way for a class of maybe-true claims to get into the Knowledge Vault or Wikidata or whatever knowledge base is used, then you have to work extra hard to get people to even consider them. Whatever process is used to revise and expand the knowledge base will inevitably become highly contested, raising conflicts that may often prove irreconcilable.

It will be even harder if your claim directly contradicts the ‘facts’ found in the search engine’s knowledge base. If your claim is true, then society loses out. And even if your claim is false, as John Stuart Mill recognised, society may still benefit from having received opinion challenged:

“Even if the received opinion be not only true, but the whole truth; unless it is suffered to be, and actually is, vigorously and earnestly contested, it will, by most of those who receive it, be held in the manner of a prejudice, with little comprehension or feeling of its rational grounds.” – On Liberty (1859)

Search engines that rank claims by some single standard of truthiness are just one more way that free speech can be gradually, messily eroded. Of course, the situation we have now – the tyranny of the linked and clicked – may be erosive in different, better or worse ways. Either way, the broader problem is that search engines – especially those with a significant majority of the market – can have profound effects on the dissemination of information and misinformation in society. We need to understand these effects and find ways to deal with their political and social consequences.

Snowden, Morozov and the ‘Internet Freedom Lobby’

The dust from whistleblower Edward Snowden’s revelations has still not settled, and his whistle looks set to carry on blowing into this new year. Enough time has elapsed since the initial furore to allow us to reflect on its broader implications. One interesting consequence of the Snowden story is the way it has changed the debate about Silicon Valley and the ‘internet freedom’ lobby. In the past, some commentators have (rightly or wrongly) accused this lobby of cosying up to Silicon Valley companies and preaching a naive kind of cyberutopianism.

The classic proponent of this view is the astute (though unnecessarily confrontational) journalist Evgeny Morozov, but variations on his theme can be found in the work of BBC documentarian-in-residence Adam Curtis (whose series ‘All Watched Over by Machines of Loving Grace’ wove together an intellectual narrative from 1960s-era hippies, through Ayn Randian libertarianism, to modern Silicon Valley ideology). According to these storytellers, big technology companies and non-profit groups have made Faustian bargains based on their perceived mutual interest in keeping the web ‘free from government interference’. In fact, they say, this pact only served to increase the power of both the state and the tech industry, at the expense of democracy.

Whilst I agree (as Snowden has made clear) that modern technology has facilitated something of a digital land grab, the so-called ‘internet freedom lobby’ are not to blame. One thing that was irksome about these critiques was the lack of distinction between parts of this ‘lobby’. Who exactly are they talking about?

Sure, there are a few powerful ideological libertarians and profiteering social media pundits in the Valley, but there has long been a political movement arguing for digital rights which has had very little to do with that ilk. Morozov’s critique always jarred with me whenever I came across one of the many principled, privacy-conscious technophiles who could hardly be accused of Randian individualism or of cosying up to powerful elites.

If there is any truth in the claim, it is this; on occasion, the interests of internet users have coincided with the interests of technology companies. For instance, when a web platform is forced to police behaviour on behalf of the Hollywood lobby, both the platform and its users lose. More broadly, much of the free/libre/open source world is funded directly or indirectly from the profits of tech companies.

But the Snowden revelations have driven a rhetorical wedge further between those interests. Before Snowden, people like Morozov could paint digital rights activists as naive cheerleaders of tech companies – and in some cases they may have been right. But they ignored the many voices in those movements who stood both for the emancipatory power of the web as a communications medium, and against its dangers as a surveillance platform. After Snowden, the privacy wing of the digital rights community has taken centre stage and can no longer be ignored.

At a dialectical level, Silicon Valley sceptics like Morozov should be pleased. If any of his targets in the digital rights debate have indeed been guilty of naivety about the dangers of digital surveillance, the Snowden revelations have shown them the cold light of day and proved Morozov right. But in another sense, Snowden proved him wrong. Snowden is a long-term supporter of the Electronic Frontier Foundation, whose founders and supporters Morozov has previously mocked. Snowden’s revelations, and their reception by digital rights advocates, show that they were never soft on digital surveillance, by state or industry.

Of course, one might say Snowden’s revelations were the evidence that Morozov needed to finally silence any remaining Silicon Valley cheerleaders. As he said in a recent Columbia Journalism Review interview: “I’m destroying the internet-centric world that has produced me. If I’m truly successful, I should become irrelevant.”

Do you need a Personal Charity Manager?

‘Charity’ – by Flickr user Howard Lake, under CC BY-SA 2.0 license

As an offshoot of some recent work, I’ve been thinking a lot about intermediaries and user agents, who act on behalf of individuals to help them achieve their goals. Whether they are web browsers and related plugins that remember stuff for you or help you stay focused, or energy switching platforms like Cheap Energy Club that help you get the best deal on energy, these intermediaries provide value by helping you to follow through on your best intentions. I don’t trust myself to keep on top of the best mobile phone tariff for me, so I delegate that to a third party. I know that when I’m tired or bored, I’ll get distracted by YouTube, so I use a browser plugin to remove that option when I’m supposed to be working.

Intermediaries, user agents, personal information managers, impartial advisers – however you refer to them, they help us by overcoming our in-built tendencies to forget, to make bad choices in the heat of the moment, or to disregard important information. Behavioural economics has revealed us to be fundamentally less rational in our everyday behaviour than we think. Research into the very real concept of willpower shows that all the little everyday decisions we have to take exact a toll on our mental energy, meaning that even with the best intentions, it’s very unlikely that we consistently make the best choices day-to-day. The modern world is incredibly complex, so anything that helps us make more informed decisions, and actually act consistently in line with those decisions on a daily basis, has got to be a good thing.

Most of these intermediary systems operate on our interactions with the market and public services, but few look at our interactions with ‘third sector’ organisations. This is an enormous opportunity. Nowhere else is the gap between good intentions and actual behaviour more apparent than in the area of charitable giving. If asked in a reflective state of mind, most people would agree that they could and should do more to make the world a better place. Most people would agree that expensive cups of coffee, new clothes, or a holiday are not as important as alleviating world hunger or curing malaria. Even if home comforts are deserved, we would probably like to cut down on them just a little bit, if doing so would significantly help the needy (ethicist Peter Singer suggests donating just 10% of your income to an effective charity).

But on a day-to-day basis, this perspective fades into the background. I want a coffee, I can easily afford to buy one, so why not? And anyway, how do you know the money you donate to charity is actually going to do anything? International aid is horribly complex, so how can an ordinary person with a busy life possibly work out what’s effective? High net worth individuals employ full time philanthropy consultants to do that for them. So even if we recognise on an abstract, rational level that we ought to do something, the burden of working out what to do, the hassle of remembering to do it, and the mental effort of resisting more immediate conflicting urges, are ultimately overwhelming. The result is inertia – doing nothing at all.

Many charities attempt to bypass this by catching our attention with adverts which tug at the heartstrings and present eye-catching statistics. As a result, until recently I went about giving to charity in a completely haphazard way – one-off donations to whoever managed to grab my attention at the right moment. But wouldn’t it be better if we could take our rational, considered ethical commitments and find ways to embed them in our lives, to make them easy to adhere to, reducing the mental and administrative burden? I’ve found several organisations that can help you work out how to give more effectively and stay committed to giving (see Giving What We Can). But there is even more scope for intermediaries to provide holistic systems to help you develop and achieve your ethical goals.

Precisely what form they take (browser plugins, online services, or real, human support?), and what we call them (Personal Charity Managers, Ethical Assistants, Philanthropic Nudges, Moral Software Agents), I won’t attempt to predict. They wouldn’t be a panacea; ethical intermediaries will never replace careful, considered moral deliberation, rigorous debate about right and wrong, and practising virtue in daily life. But as services that practically help us follow through on our carefully considered moral beliefs, and manage our charitable giving, they could be revolutionary.

Is an elaborate art joke?

Last week I was sent a link to – a new web application where you can upload your data from Facebook, and choose whether to license it directly to marketers or make it available as open data. It’s a neat idea which has been explored by a number of other startups (e.g., YesProfile, Teckler).

Obviously, uploading all of your Facebook data to a random website raises a whole host of privacy concerns – exactly what you’d expect a rock-solid privacy policy and terms of service to address. Unfortunately, there don’t seem to be any such terms. If you click the Terms of Service button on the registration page, it takes you nowhere.

Looking at the page source, the HTML anchor points to an empty ‘#’ id, which suggests not that there is some problem with the link, but that there was nowhere to link to in the first place; suspicious! If I were serious about starting a service like this, the very first thing I’d do is draft terms of service and a privacy policy. Then, before launching the website, I’d triple-check that they appear prominently on the registration form.

Looking at the ‘Browse Open Data’ part of the website, you can look at the supposedly de-identified Facebook profiles that other users have submitted. These include detailed data and metadata like number of friends, hometown, logins, etc. The problem is, despite the removal of names, the information on these profiles is almost certainly enough to re-identify the individual in the majority of cases.
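The re-identification risk is easy to demonstrate with a toy sketch (the data below are entirely made up): even with names stripped out, a handful of quasi-identifiers like hometown and friend count can single out one profile in a dataset.

```python
# Invented 'de-identified' profiles – names removed, metadata retained.
deidentified_profiles = [
    {"id": "u1", "hometown": "Rotterdam", "friends": 342, "logins": 1201},
    {"id": "u2", "hometown": "Utrecht",   "friends": 342, "logins": 88},
    {"id": "u3", "hometown": "Rotterdam", "friends": 57,  "logins": 430},
]

def reidentify(profiles, **known_attributes):
    """Return the profiles consistent with what an attacker already knows."""
    return [p for p in profiles
            if all(p.get(k) == v for k, v in known_attributes.items())]

# Knowing only a target's hometown and friend count is enough here:
matches = reidentify(deidentified_profiles,
                     hometown="Rotterdam", friends=342)
print(len(matches))  # 1 – the 'anonymous' profile is unique
```

The more attributes a profile exposes, the more likely this intersection shrinks to a single person – which is exactly the worry with publishing detailed per-profile metadata as ‘open data’.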

These two glaring privacy and technical problems make me think this whole thing might just be an elaborate hoax. In which case: ha ha, well done, you got me. After digging a little deeper, it looks like the website is a project from Commodify, Inc., an artist-run startup, and Moddr, who describe themselves as:

Rotterdam-based media/hacker/co-working space and DIY/FOSS/OSHW fablab for artgeeks, part of the venue WORM: Institute for Avantgardistic Recreation

They’re behind a few other projects in a similar vein, such as ‘Give Me My Data’. I remembered seeing a very amusing presentation on the Web 2.0 Suicide Machine project by Walter Langelaar a year or two ago.

So I registered using a temporary dummy email address to have a look around, but I didn’t get to upload my (fake) data because the data upload page says it’s currently being updated. I tried sending an email to the listed moderator address, but it bounced.

If this is intended as a real service, then it’s pretty awful as far as privacy is concerned. If it’s intended as a humorous art project, then that’s fine – as long as there are no real users who have been duped into participating.

Data on Strike

What happens to a smart city when there’s no access to personal data?

Last week I had the pleasure of attending the Digital Revolutions Oxford summer school, a gathering of PhD students doing research into the ‘digital economy’. On the second day, we were asked to form teams and engage in some wild speculation. Our task was to imagine a news headline in 2033, covering some significant event that relates to the research we are currently undertaking. My group took this as an opportunity to explore various utopian / dystopian themes relating to power struggles over personal data, smart cities and prosthetic limbs.

The headline we came up with was ‘Data Strike: Citizens refuse to give their data to Governments and Corporations’. Our hypothesis was that as ‘smart cities’ materialise, essential pieces of infrastructure will become increasingly dependent on the personal data of the city’s inhabitants. For instance, the provision of goods and services will be carefully calibrated to respond and adjust to the circumstances of individual consumers. Management of traffic flow and transportation systems will depend on uninterrupted access to every individual’s location data. Distributed public health systems will feed back data live from our immune systems to the health authorities.

In a smart city, personal data itself is as critical a piece of infrastructure as you can get. And as any observer of strike action will know, critical infrastructure can quickly be brought to a halt if the people it depends on decide not to co-operate. What would happen in a smart city if its inhabitants decided to go on a data strike? We imagined a city-wide personal data blackout, where individuals turn off or deliberately scramble their personal devices, wreaking havoc on the city’s systems. Supply chains would misfire as targeted consumers disappear from view. Public health monitoring signals would be scrambled. Self-driving cars would no longer know when to pick up and drop off passengers – or when to stop for pedestrians.

We ventured out into the streets of Oxford to see what ‘the public’ thought about our sensational predictions, and whether they would join the strike. I had trouble selling the idea of a ‘data co-operative’ to sceptical passengers waiting at the train station, but was surprised by the general level of concern and awareness about the use of personal data. As a break from dry academic work, this exercise in science fiction was a bit of light relief. But I think we touched on a serious point. Smart cities need information infrastructure, but ensuring good governance of this infrastructure will be paramount. Otherwise we may sleepwalk into a smart future where convenience and efficiency are promoted at the expense of privacy, autonomy and equality. We had better embed these values into smart infrastructure now, while the idea of a data strike still sounds ridiculous.

Thanks to Research Councils UK’s Digital Economy Theme, Know Innovation and the Oxford CDT in healthcare innovation, for funding / organising / hosting the event. More comprehensive coverage can be found over on Chris Phethean’s write-up.