Searching for Truthiness, Part 1: Logical Positivism vs. Statistics

Wittgenstein (second from right), whose early work inspired logical positivism


Recent coverage of a research paper by some Google engineers has ruffled some feathers in the world of SEO. The paper demonstrates a method for what they call a ‘knowledge-based trust’ approach to ranking search results. Instead of using ‘exogenous’ signals like the number of inbound hyperlinks to a web resource (as in the traditional Google PageRank algorithm), the KBT approach factors in ‘endogenous’ signals, namely, the ‘correctness of factual information’ found on the resource.

To understand what this change means, I think it’s worth briefly considering two approaches to knowledge: one is based on statistical measures and exemplified by modern search engines; the other has its roots in a key movement in 20th century philosophy.

One of the fundamental suppositions of analytic philosophy is that there is an objective, rigorous method for pursuing answers to complex questions. The idea that our ethical, political or metaphysical beliefs aren’t just matters of subjective opinion but can be interrogated, revised and improved using objective analytical methods that transcend mere rhetoric.

A group of philosophers in the 1920’s took this idea to an extreme in a movement called logical positivism. They believed that every sentence in any human language could in principle be classified as either verifiable or unverifiable. ‘Analytic’ statements, like those in mathematics, can be verified through logic. ‘Synthetic’ statements, like ‘water is h20’ can be verified through scientific experiment. Every other kind of statement, according to the logical positivists, was an expression of feeling, an exhortation to action or just plain nonsense, and unless you already agree with it there’s no objective way you could be convinced.

The allure of verificationism was that it offered a systematic way to assess any deductive argument. Take every statement, determine an appropriate method of verification for the statement, discarding any which are unverifiable. Sort the statements into premises and conclusions, and determine the truth value of each premise by reference to trusted knowledge sources. Finally, assess whether the conclusions validly follow from the premises using the methods of formal logic. To use a tired syllogism as an example, take the premises ‘All men are mortal’, ‘Socrates is a man’, and the conclusion ‘Socrates is mortal’. The premises can be verified as true through reference to biology and the historical record. Each statement can then be rendered in predicate logic so that the entire argument can be shown to be sound.

While I doubt that the entirety of intellectual debate and enquiry can be reduced down in this way without losing some essential meaning (not to mention rhetorical force), it certainly provides a useful model for certain aspects of reasoning. For better or worse, this model has been used time and time again in attempts to build artificial intelligence. Armed with predicate logic, ontologies to classify things, and lots of fact-checked machine-readable statements, computers can all sorts of clever things.

Search engines could not only find pages based on keywords but do little bits of reasoning to help give us new information that isn’t explicitly written anywhere but can be inferred from a stock of pre-existing information. This is a perfect job for computers because they are great at following well defined rules incredibly fast over massive amounts of data. This is the purpose of projects like Freebase and Wikidata – to take the knowledge we’ve built up in natural language and translate it into machine readable data (stored as key-value pairs or triples). It’s the vision of the semantic web outlined by Tim Berners-Lee.

The search engines we know and love are based on a different approach. This is less focused on logic and knowledge representation and more on statistics. Rather than attempting to represent and reason about the world, the statistical approach tries to get computers to learn how to perform a task based on data (usually generated as a by-product of human activity). For instance, the relevance of a response to a search query isn’t determined by the ‘meaning’ of the query and pre-digested statements about the world, but by the number of inbound links and clicks on a page. We gave up trying to get computers to understand what we’re talking about, and allowed them to guess what we’re after based on the sheer brute force of correlation.

In the next post I’ll look at how Google might integrate these two approaches to improve search engine results.