Category Archives: algorithm

When good algorithms go bad…

I recently spoke on a panel at Strata called ‘When good algorithms go bad’, organised by the good people at DataKind and sponsored by Pivotal Labs and O’Reilly Media. My co-panellists were Madeleine Greenhalgh (Cabinet Office), Shahzia Holtom (Pivotal Labs), and Hetan Shah (Royal Statistical Society), with Duncan Ross (DataKind) chairing what was a very stimulating discussion about the ethics of data science.

I won’t attempt to summarise the excellent contributions of the other speakers (and the audience), but here are some relevant (I hope!) references for those interested in delving further into some of the topics raised:

Many thanks to DataKind UK for organising and inviting me!

Searching for Truthiness, Part 2: Knowledge-Based Trust

In the last post I explored two approaches to making computers do smart things, in particular relating to search engines. The knowledge representation approach (affiliated with traditional AI and the semantic web) involves creating ontologies, defining objects and relations, and getting software to make logical inferences over them. What I called the statistical approach (also known as machine learning) involves using data, often generated by human activity, to detect patterns and make a probabilistic assessment of the right answer. In the case of search, what we click on in response to queries and inbound hyperlinks are used to rank search results.

This brings us to the recent paper by some engineers at Google, on what they call knowledge-based trust (KBT). The problem faced by the statistical approach is that it is based on what millions of ordinary, fallible humans do on the web. That includes clicking on and linking to pages with sensational but unsubstantiated headlines, or dubious medical information. This means our biases get picked up by the system alongside our better judgement. If you train a computer with flawed data, it’s going to return flawed results; garbage in, garbage out. What the paper proposes is a new way to suppress (or at least, downgrade) such content based on the number of facts it contains.

But how can a search engine determine the factual content of a web page, if all it measures are clicks and links? It can’t. This is where the knowledge representation approach comes back to the rescue. By comparing statements extracted from web pages with a pre-existing body of knowledge, the researchers hope that a search engine could assess the trustworthiness of a page.

Google have been working on both the knowledge representation and statistical approaches for a long time. This proposal is one example of how the two approaches could be usefully integrated. Those little information boxes that crop up for certain Google searches are another. Try searching ‘Tiger vs Shark‘ and the first thing you’ll see above the normal search results is a tabular comparison of their respective properties – useful for those ‘who would win in a fight between x and y’ questions. These factoids are driven by a pre-existing body of structured data.

But hold on, where does this pre-existing body of knowledge come from, and why should we trust it, especially if it’s used to re-order search results? It comes from the ‘Knowledge Vault‘, Google’s repository of machine-readable information about the world, from geography, biology, history – you name it, they probably have it. It’s based on a collaboratively generated database called Freebase, created (or, perhaps more accurately, ‘curated’) since 2007 by Metaweb, and acquired by Google in 2010. It’s now due to shut down and be replaced by Wikidata, another source of structured data, extracted from Wikipedia.

So while our collective clicks and links may be a bad measure of truthiness, perhaps our collaborative encyclopedia entries can serve as a different standard for truth-assessment. Of course, if this standard is flawed, then the knowledge-based-trust score is going to be equally flawed (garbage in, garbage out). If you think Wikipedia (and hence Wikidata) is dodgy, then you won’t be very impressed by KBT-enhanced search results. If, on the other hand, you think it’s good enough, then it could lead to a welcome improvement. But we can’t escape some of the foundational epistemic questions whichever approach we adopt. In attempting to correct one source of bias, we introduce another. Whether the net effect is positive, or the biases cancel each other out, I don’t know. But what I do know is that isn’t just a question for software engineers to answer.

The main content of the paper itself is highly technical and, dare I say, boring for those of us outside of this branch of computer science. Its main contribution is a solution to the problem of distinguishing noise in the knowledge extraction process from falsehood in the source, something which has so far held back the practical application of such techniques to search ranking. But the discussion that the paper has prompted poses some very important social and political questions.

Risks of the Knowledge-Based Trust approach

The most immediate concern has come from the search engine optimisation community. Will SEO experts now be recommending websites to ‘push up the fact quotient’ on their content? Will marketers have even more reason to infiltrate Wikipedia in an effort to push their ‘facts’ into Wikidata? What about all the many contexts in which we assert untrue claims for contextually acceptable and obvious reasons (e.g. fiction, parody, or hyperbole)? Will they have a harder time getting hits?

And what about all the claims that are ‘unverifiable’ and have no ‘truth value’, as the logical positivists (see previous post) would have said? While KBT would only be one factor in the search rankings, it would still punish content containing many of these kinds of claims. Do we want an information environment that’s skewed towards statements that can be verified and against those that are unverifiable?

The epistemological status of what the researchers call ‘facts’ is also intriguing. The researchers seem to acknowledge that the knowledge base might not be completely accurate, when they include sentences like “facts extracted by automatic methods such as KV may be wrong”. This does seem to be standard terminology in this branch of computer science, but for philosophers, linguists, logicians, sociologists and others, the loose use of the ‘f’ word will ring alarm bells. Even putting aside these academic perspectives, our everyday use of ‘fact’ usually implies truth. It would be far less confusing for to simply call them statements, which can be either true or false.

Finally, while I don’t think it presents a serious danger right now, and indeed it could improve search engines in some ways, moving in this direction has risks for public debate, education and free speech. One danger is that sources containing claims that are worth exploring, but have insufficient evidence, will be systematically suppressed. If there’s no way for a class of maybe-true claims to get into the Knowledge Vault or Wikidata or whatever knowledge base is used, then you have to work extra hard to get people to even consider them. Whatever process is used to revise and expand the knowledge base will inevitably become highly contested, raising conflicts that may often prove irreconcilable.

It will be even harder if your claim directly contradicts the ‘facts’ found in the search engine’s knowledge base. If your claim is true, then society loses out. And even if your claim is false, as John Stuart Mill recognised, society may still benefit from having received opinion challenged:

“Even if the received opinion be not only true, but the whole truth; unless it is suffered to be, and actually is, vigorously and earnestly contested, it will, by most of those who receive it, be held in the manner of a prejudice, with little comprehension or feeling of its rational grounds.” – On Liberty (1859)

Search engines that rank claims by some single standard of truthiness are just one more way that free speech can be gradually, messily eroded. Of course, the situation we have now – the tyranny of the linked and clicked – may be erosive in different, better or worse ways. Either way, the broader problem is that search engines – especially those with a significant majority of the market – can have profound effects on the dissemination of information and misinformation in society. We need to understand these effects and find ways to deal with their political and social consequences.

Searching for Truthiness, Part 1: Logical Positivism vs. Statistics

Wittgenstein (second from right), whose early work inspired logical positivism


Recent coverage of a research paper by some Google engineers has ruffled some feathers in the world of SEO. The paper demonstrates a method for what they call a ‘knowledge-based trust’ approach to ranking search results. Instead of using ‘exogenous’ signals like the number of inbound hyperlinks to a web resource (as in the traditional Google PageRank algorithm), the KBT approach factors in ‘endogenous’ signals, namely, the ‘correctness of factual information’ found on the resource.

To understand what this change means, I think it’s worth briefly considering two approaches to knowledge: one is based on statistical measures and exemplified by modern search engines; the other has its roots in a key movement in 20th century philosophy.

One of the fundamental suppositions of analytic philosophy is that there is an objective, rigorous method for pursuing answers to complex questions. The idea that our ethical, political or metaphysical beliefs aren’t just matters of subjective opinion but can be interrogated, revised and improved using objective analytical methods that transcend mere rhetoric.

A group of philosophers in the 1920’s took this idea to an extreme in a movement called logical positivism. They believed that every sentence in any human language could in principle be classified as either verifiable or unverifiable. ‘Analytic’ statements, like those in mathematics, can be verified through logic. ‘Synthetic’ statements, like ‘water is h20’ can be verified through scientific experiment. Every other kind of statement, according to the logical positivists, was an expression of feeling, an exhortation to action or just plain nonsense, and unless you already agree with it there’s no objective way you could be convinced.

The allure of verificationism was that it offered a systematic way to assess any deductive argument. Take every statement, determine an appropriate method of verification for the statement, discarding any which are unverifiable. Sort the statements into premises and conclusions, and determine the truth value of each premise by reference to trusted knowledge sources. Finally, assess whether the conclusions validly follow from the premises using the methods of formal logic. To use a tired syllogism as an example, take the premises ‘All men are mortal’, ‘Socrates is a man’, and the conclusion ‘Socrates is mortal’. The premises can be verified as true through reference to biology and the historical record. Each statement can then be rendered in predicate logic so that the entire argument can be shown to be sound.

While I doubt that the entirety of intellectual debate and enquiry can be reduced down in this way without losing some essential meaning (not to mention rhetorical force), it certainly provides a useful model for certain aspects of reasoning. For better or worse, this model has been used time and time again in attempts to build artificial intelligence. Armed with predicate logic, ontologies to classify things, and lots of fact-checked machine-readable statements, computers can all sorts of clever things.

Search engines could not only find pages based on keywords but do little bits of reasoning to help give us new information that isn’t explicitly written anywhere but can be inferred from a stock of pre-existing information. This is a perfect job for computers because they are great at following well defined rules incredibly fast over massive amounts of data. This is the purpose of projects like Freebase and Wikidata – to take the knowledge we’ve built up in natural language and translate it into machine readable data (stored as key-value pairs or triples). It’s the vision of the semantic web outlined by Tim Berners-Lee.

The search engines we know and love are based on a different approach. This is less focused on logic and knowledge representation and more on statistics. Rather than attempting to represent and reason about the world, the statistical approach tries to get computers to learn how to perform a task based on data (usually generated as a by-product of human activity). For instance, the relevance of a response to a search query isn’t determined by the ‘meaning’ of the query and pre-digested statements about the world, but by the number of inbound links and clicks on a page. We gave up trying to get computers to understand what we’re talking about, and allowed them to guess what we’re after based on the sheer brute force of correlation.

In the next post I’ll look at how Google might integrate these two approaches to improve search engine results.