Searching for Truthiness, Part 1: Logical Positivism vs. Statistics

Wittgenstein (second from right), whose early work inspired logical positivism


Recent coverage of a research paper by some Google engineers has ruffled some feathers in the world of SEO. The paper demonstrates a method for what they call a ‘knowledge-based trust’ (KBT) approach to ranking search results. Instead of using ‘exogenous’ signals like the number of inbound hyperlinks to a web resource (as in the traditional Google PageRank algorithm), the KBT approach factors in ‘endogenous’ signals, namely, the ‘correctness of factual information’ found on the resource.

To understand what this change means, I think it’s worth briefly considering two approaches to knowledge: one is based on statistical measures and exemplified by modern search engines; the other has its roots in a key movement in 20th century philosophy.

One of the fundamental suppositions of analytic philosophy is that there is an objective, rigorous method for pursuing answers to complex questions. The idea is that our ethical, political or metaphysical beliefs aren’t just matters of subjective opinion, but can be interrogated, revised and improved using objective analytical methods that transcend mere rhetoric.

A group of philosophers in the 1920s took this idea to an extreme in a movement called logical positivism. They believed that every sentence in any human language could in principle be classified as either verifiable or unverifiable. ‘Analytic’ statements, like those in mathematics, can be verified through logic. ‘Synthetic’ statements, like ‘water is H2O’, can be verified through scientific experiment. Every other kind of statement, according to the logical positivists, was an expression of feeling, an exhortation to action or just plain nonsense, and unless you already agree with it, there’s no objective way you could be convinced.

The allure of verificationism was that it offered a systematic way to assess any deductive argument. Take every statement in the argument, determine an appropriate method of verification for each, and discard any that are unverifiable. Sort the remaining statements into premises and conclusions, and determine the truth value of each premise by reference to trusted knowledge sources. Finally, assess whether the conclusions validly follow from the premises using the methods of formal logic. To use a tired syllogism as an example, take the premises ‘All men are mortal’ and ‘Socrates is a man’, and the conclusion ‘Socrates is mortal’. The premises can be verified as true through reference to biology and the historical record. Each statement can then be rendered in predicate logic so that the entire argument can be shown to be sound.

While I doubt that the entirety of intellectual debate and enquiry can be reduced in this way without losing some essential meaning (not to mention rhetorical force), it certainly provides a useful model for certain aspects of reasoning. For better or worse, this model has been used time and time again in attempts to build artificial intelligence. Armed with predicate logic, ontologies to classify things, and lots of fact-checked machine-readable statements, computers can do all sorts of clever things.

Search engines could not only find pages based on keywords but do little bits of reasoning to help give us new information that isn’t explicitly written anywhere but can be inferred from a stock of pre-existing information. This is a perfect job for computers because they are great at following well defined rules incredibly fast over massive amounts of data. This is the purpose of projects like Freebase and Wikidata – to take the knowledge we’ve built up in natural language and translate it into machine readable data (stored as key-value pairs or triples). It’s the vision of the semantic web outlined by Tim Berners-Lee.
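To make this concrete, here is a minimal sketch of the triple-based approach, in the spirit of Freebase or Wikidata (the data and rule format are my own illustration, not any real project's schema). Facts are stored as subject-predicate-object triples, and a simple inference rule derives a statement that is written nowhere explicitly:

```python
# Facts stored as subject-predicate-object triples, as in a toy triple store.
facts = {
    ("Socrates", "is_a", "man"),
    ("water", "is", "H2O"),
}

# One universally quantified rule: for all x, if x is_a man, then x is_a mortal.
rules = [
    (("is_a", "man"), ("is_a", "mortal")),
]

def infer(facts, rules):
    """Derive new triples by applying each rule to every matching fact."""
    derived = set(facts)
    for (body_pred, body_obj), (head_pred, head_obj) in rules:
        for (subj, pred, obj) in facts:
            if pred == body_pred and obj == body_obj:
                # Instantiate the rule's variable with the matching subject.
                derived.add((subj, head_pred, head_obj))
    return derived

conclusions = infer(facts, rules)
# A statement stored nowhere explicitly, inferred from the stock of facts:
print(("Socrates", "is_a", "mortal") in conclusions)  # True
```

This is exactly the sort of well-defined rule-following over data that computers excel at, which is what makes the semantic web vision so appealing in principle.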

The search engines we know and love are based on a different approach. This is less focused on logic and knowledge representation and more on statistics. Rather than attempting to represent and reason about the world, the statistical approach tries to get computers to learn how to perform a task based on data (usually generated as a by-product of human activity). For instance, the relevance of a response to a search query isn’t determined by the ‘meaning’ of the query and pre-digested statements about the world, but by the number of inbound links and clicks on a page. We gave up trying to get computers to understand what we’re talking about, and allowed them to guess what we’re after based on the sheer brute force of correlation.
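As a crude stand-in for PageRank-style scoring, here is a sketch of ranking purely by inbound link counts (the domains and link graph are made up; real PageRank also weights each link by the rank of the page it comes from):

```python
# A toy illustration of 'exogenous' ranking: pages are scored purely by
# inbound links, with no attempt to understand what they mean.
links = {  # hypothetical link graph: page -> pages it links to
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example"],
}

# Count inbound links per page.
inbound = {}
for src, targets in links.items():
    for target in targets:
        inbound[target] = inbound.get(target, 0) + 1

# Rank pages by popularity: the most-linked-to page comes first.
ranking = sorted(inbound, key=inbound.get, reverse=True)
print(ranking[0])  # c.example, with three inbound links
```

Nothing here represents what any page is about; relevance emerges entirely from the by-products of human linking behaviour.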

In the next post I’ll look at how Google might integrate these two approaches to improve search engine results.

‘Privacy and consumer markets’ – talk at 31c3

I just gave a talk at the 31st annual Chaos Communication Congress in Hamburg. The blurb:

“The internet may be the nervous system of the 21st century, but its main business purpose is helping marketers work out how to make people buy stuff. This talk maps out a possible alternative, where consumers co-ordinate online, pooling their data and resources to match demand with supply.”

It was live-streamed and the video should be up on the ccc-tv soon. Slides from the talk are available here in PDF or ODP.

Thanks to all the organisers for running such a great event!

How to improve how we prove; from paper-and-ink to digital verified attributes

‘Stamp of Approval’ by Sudhamshu Hebbar, CC-BY 2.0

Personal information management services (PIMS) are an emerging class of digital tools designed to help people manage and use data about themselves. At the core of this is information about your identity and credentials, without which you cannot prove who you are or that you have certain attributes. This is a boring but necessary part of accessing services, claiming benefits and compensation, and a whole range of other general ‘life admin’ tasks.

Currently the infrastructure for managing these processes is stuck somewhere in the Victorian era, dominated by rubber stamps, handwritten signatures and paper forms, dealt with through face-to-face interactions with administrators and shipped around through snail mail. A new wave of technology aims to radically simplify this infrastructure through digital identities, certificates and credentials. Examples include GOV.UK Verify, the UK government identity scheme, and services like MiiCard and Mydex which allow individuals to store and re-use digital proofs of identity and status. The potential savings from these new services are estimated at £3 billion in the UK alone (disclosure: I was part of the research team behind this report).

Yesterday I learned a powerful first-hand lesson about the current state of identity management, and the dire need for PIMS to replace it. It all started when I realised that a train ticket, which I’d bought in advance, would be invalid because my discount rail card expired before the date of travel. After discovering I could not simply pay off the excess to upgrade to a regular ticket, I realised my only option would be to renew the railcard.

That may sound simple, but it was not. To be eligible for the discount, I’d need to prove to the railcard operator that I’m currently a post-graduate student. They require a specific class of (very busy) University official to fill in, sign and stamp their paper form and verify a passport photo. There is a semi-online application system, but this still requires a University administrator to complete the paperwork and send a scanned copy, and then there’s an additional waiting time while a new railcard is sent by post from an office in Scotland.

So I’d need to make a face-to-face visit to one of the qualified University administrators with all the documents, and hope that they were available and willing to deal with them. Like many post-graduate students, I live in a different city, so this involved a 190-minute, £38 round trip by train. When I arrived, the first administrator I asked to sign the documentation told me that I would have to leave the documentation with their office for an unspecified number of days (days!) while they ‘checked their system’ to verify that I am who I say I am.

I tried to communicate the absurdity of the situation: I had travelled 60 miles to get a University-branded pattern of ink stamped on a piece of paper, in order to verify my identity to the railcard company, but the University administrators couldn’t stamp said paper because they needed several days to check a database to verify that I exist and I am me – while I stood before them with my passport, driving licence, proof of address and my student identity card.

Finally I was lucky enough to speak to another administrator whom I know personally, who was able to deal with the paperwork in a matter of seconds. In the end, the only identity system which worked was a face to face interaction predicated on interpersonal trust; a tried-and-tested protocol which pre-dates the scanned passport, the Kafka-esque rubber stamp, and the pen-pushing Victorian clerk.

Here’s how an effective digital identity system would have solved this problem. Upon enrolment, the university would issue me with a digital certificate, verifying my status as a postgraduate, which would be securely stored and regularly refreshed in my personal data store (PDS). When the time comes to renew my discount railcard, I would simply log in to my PDS and accept a connection from the railcard operator’s site. I pay the fee and they extend the validity of my existing railcard.

From the user experience perspective, that’s all there is to it – a few clicks and it’s done. In the background, there’s a bit more complexity. My PDS would receive a request from the railcard operator’s system for the relevant digital certificate (essentially a cryptographically signed token generated by the University’s system). After verifying the authenticity of the request, my PDS sends a copy of the certificate. The operator’s back-end system then checks the validity of the certificate against the public key of the issuer (in this case, the university). If it all checks out, the operator has assurance from the University that I am eligible for the discount. It should take a matter of seconds.
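Here is a rough sketch of that background exchange. To keep it self-contained, an HMAC with a shared secret stands in for the university's public-key signature, and every name and field below is hypothetical; a real deployment would use asymmetric signatures, so the operator would need only the university's public key:

```python
import hashlib
import hmac
import json
import time

# Stand-in for the university's signing key. In a real system this would be
# an asymmetric key pair, with only the public half known to verifiers.
UNIVERSITY_KEY = b"demo-signing-secret"

def issue_certificate(student_id, status, valid_until):
    """University side: sign a claim about the student's status."""
    claim = json.dumps({"id": student_id, "status": status,
                        "valid_until": valid_until}, sort_keys=True)
    sig = hmac.new(UNIVERSITY_KEY, claim.encode(), hashlib.sha256).hexdigest()
    return {"claim": claim, "sig": sig}

def verify_certificate(cert, now):
    """Railcard operator side: check the signature, then the claim itself."""
    expected = hmac.new(UNIVERSITY_KEY, cert["claim"].encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, cert["sig"]):
        return False  # signature invalid: not really issued by the university
    claim = json.loads(cert["claim"])
    return claim["status"] == "postgraduate" and claim["valid_until"] > now

cert = issue_certificate("s1234567", "postgraduate", valid_until=2_000_000_000)
print(verify_certificate(cert, now=int(time.time())))  # True
```

Note that the certificate only reveals the claims the operator needs (status and validity), which is what makes the minimal-disclosure property described below possible.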

From a security perspective, it’s harder to fake a signature made out of cryptography than one made out of ink (ironically, it would probably have been less effort for me to forge the ink signature than to obtain it legitimately). Digital proofs can also be better for privacy, as they reveal the minimal amount of information about me that the railcard operator needs to determine my eligibility, and the data is only shared when I permit it.

Identity infrastructure is important for reasons beyond convenience and security – it’s also about equality and access. I’m lucky that I can afford to pay the costs when these boring parts of ‘life admin’ go wrong – paying for a full price ticket wouldn’t have put my bank balance in the red. But if you’re at the bottom of the economic ladder, you have much more to lose when you can’t access the discounted services, benefits and compensation you are entitled to. Reforming our outdated systems could therefore have a disproportionately positive impact for the least well-off.

YouGov Profiles

I haven’t blogged here in a while. But I did write this piece on YouGov’s Profiler app –  a rather fun but warped view on the research company’s consumer profiling data.

It’s published in The Conversation – if you haven’t come across them yet, I strongly recommend taking a look. They publish topical and well-informed opinion pieces from academics, and their motto is ‘academic rigour, journalistic flair’. Best of all, all the articles are licensed under a Creative Commons (BY-ND) license – ensuring they can be republished and shared as widely as possible.

Public Digital Infrastructure: Who Pays?

Glen Canyon Bridge & Dam, Page, Arizona, by flickr user Thaddeus Roan under CC-BY 2.0

Every day, we risk our personal security and privacy by relying on lines of code written by a bunch of under-funded non-profits and unpaid volunteers. These essential pieces of infrastructure go unnoticed and neglected; that is, until they fail.

Take OpenSSL, one of the most common tools for encrypting internet traffic. It means that things like confidential messages and credit card details aren’t transferred as plain text. It probably saves you from identity fraud, theft, stalking, blackmail, and general inconvenience dozens of times a day. When a critical security flaw (known as ‘Heartbleed’) was discovered in OpenSSL’s code last April, there was just one person paid to work full-time on the project – the rest of it being run largely by volunteers.

What about the Network Time Protocol? It keeps most of the world’s computers’ clocks synchronised so that everything is, you know, on time. NTP has been developed and maintained over the last 20 years by one university professor and a team of volunteers.

Then there is OpenSSH, which is used to securely log in to remote computers across a network – used every day by systems administrators to keep IT systems, servers, and websites working whilst keeping out intruders. That’s maintained by another under-funded team who recently started a fundraising drive because they could barely afford to keep the lights on in their office.

Projects like these are essential pieces of public digital infrastructure; they are the fire brigade of the internet, the ambulance service for our digital lives, the giant dam holding back a flood of digital sewage. But our daily dependence on them is largely invisible and unquantified, so it’s easy to ignore their importance. There is no equivalent to pictures of people being rescued from burning buildings. The image of a programmer auditing some code is not quite as visceral.

So these projects survive on small handouts, with occasional larger ones from big technology companies. Whilst it’s great that commercial players want to help secure the open source code they use in their products, this alone is not an ideal solution. Imagine if the ambulance service were funded by ad-hoc injections of cash from various private hospitals, who had no obligation to maintain their contributions. Or if firefighters only got new trucks and equipment when some automobile manufacturer thought it would be good PR.

There’s a good reason to make this kind of critical public infrastructure open source. Proprietary code can only be audited behind closed doors, so everyone who relies on it has to trust the provider to discover its flaws, fix them, and be honest when they fail. Open source code, on the other hand, can be audited by anyone. The idea, often summarised as ‘given enough eyeballs, all bugs are shallow’, is that if everyone can go looking for bugs, they are much more likely to be found.

But just because anyone can, that doesn’t mean that someone will. It’s a little like the story of four people named Everybody, Somebody, Anybody, and Nobody:

There was an important job to be done and Everybody was sure that Somebody would do it. Anybody could have done it, but Nobody did it. Somebody got angry about that because it was Everybody’s job. Everybody thought that Anybody could do it, but Nobody realized that Everybody wouldn’t do it. It ended up that Everybody blamed Somebody when Nobody did what Anybody could have done.

Everybody would benefit if Somebody audited and improved OpenSSL/NTP/OpenSSH/etc, but Nobody has sufficient incentive to do so. Neither proprietary software nor the open source world is delivering the quality of critical public digital infrastructure we need.

One solution to this kind of market failure is to treat critical infrastructure as a public good, deserving of public funding. Public goods are traditionally defined as ‘non-rival’, meaning that one person’s use of the good does not reduce its availability to others, and ‘non-excludable’, meaning that it is not possible to exclude certain people from using it. The examples given above certainly meet these criteria. Code is infinitely reproducible at nearly zero marginal cost, and its use, absent any patents or copyrights, is impossible to constrain.

The costs of creating and sustaining a global, secure, open and free-as-in-freedom digital infrastructure are tiny in comparison to the benefits. But direct, ongoing public funding for those who maintain this infrastructure is rare. Meanwhile, we find that billions have been spent on intelligence agencies whose goal is to make security tools less secure. Rather than undermining such infrastructure, governments should be pooling their resources to improve it.

Related: The Linux Foundation have an initiative to address this situation, with the admirable backing of some industry heavyweights.
While any attempt to list all the critical projects of the internet is likely to be incomplete and lead to disagreement, Jonathan Wilkes and volunteers have nevertheless begun one.

‘Surprise Minimisation’

A little while ago I wrote a short post for the IAPP on the notion of ‘surprise minimisation’. In summary, I’m not that keen on it:

I’m left struggling to see the point of introducing yet another term in an already jargon-filled debate. Taken at face-value, recommending surprise minimisation seems no better than simply saying “don’t use data in ways people might not like”—if anything, it’s worse because it unhelpfully equates surprise with objection, and vice-versa. The available elaborations of the concept don’t add much either, as they seem to boil down to an ill-defined mixture of existing principles.

Why Surprise Minimisation is a Misguided Principle

A Study of International Personal Data Transfers

Whilst researching open registers of data controllers, I was left with some interesting data on international data transfers which didn’t make it into my main research paper. This formed the basis of a short paper for the 2014 Web Science conference which took place last month.

The paper presents a brief analysis of the destinations of 16,000 personal data transfers from the UK. Each ‘transfer’ represents an arrangement by a data controller in the UK to send data to a country overseas. Many of these destinations are simply listed under the rather general categories of ‘European Economic Area’ or ‘Worldwide’, so the analysis focuses on just those transfers where specific countries were mentioned.

I found that even when we adjust for the size of their existing UK export market, countries whose data protection regimes are approved as ‘adequate’ by the European Commission had higher rates of data transfers. This indicates that easing legal restrictions on cross-border transfers does indeed positively correlate with a higher number of transfers (although the direction of causation can’t be established). I was asked by the organisers to produce a graphic to illustrate the findings, so I’m sharing that below.
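The adjustment works roughly like this (the figures below are invented purely to illustrate the method, not results from the paper):

```python
# A toy illustration of adjusting transfer counts by export-market size.
# All numbers are hypothetical, chosen only to show the calculation.
transfers = {"US": 900, "NZ": 60, "CN": 120}          # declared data transfers
export_share = {"US": 0.15, "NZ": 0.002, "CN": 0.04}  # share of UK export market

# Transfers per unit of export share: a high value means more data flows
# to that country than its trade with the UK alone would predict.
adjusted = {country: transfers[country] / export_share[country]
            for country in transfers}

for country, rate in sorted(adjusted.items(), key=lambda kv: -kv[1]):
    print(country, rate)
```

With these invented numbers, New Zealand (which holds an EU adequacy decision) comes out far ahead once trade volume is accounted for, which is the shape of the real finding described above.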


What do they know about me? Open data on how organisations use personal data

I recently wrote a guest post for the Open Knowledge Foundation’s Personal Data and Privacy working group. It delves into the UK register of data controllers – a data source I’ve written about before and which forms the basis of a forthcoming research paper. This time, I’m looking through the data in light of some of the recent controversies we’ve seen in the media, including the construction workers’ blacklist fiasco…

Publishing this information in obscure, unreadable and hidden privacy policies and impact assessments is not enough to achieve meaningful transparency. There’s simply too much of it out there to capture in a piecemeal fashion, in hidden web pages and PDFs. To identify the good and bad things companies do with our personal information, we need more data, in a more detailed, accurate, machine-readable and open format. In the long run, we need to apply the tools of ‘big data’ to drive new services for better privacy management in the public and private sector, as well as for individuals themselves.

You can read the rest here. Thanks to the OKF/ORG for kick-starting such interesting discussions through the mailing list – I’m looking forward to continuing them at the OKF event in Berlin this summer and elsewhere. If you want to participate, do join the working group.

Open Research in Practice: responding to peer review with GitHub

I wrote a tweet last week that got a bit of unexpected (but welcome) attention:

I got a number of retweets and replies in response, including:

The answers are: Yes, yes, yes and yes. I thought I’d respond in more detail and pen some short reflections on github, collaboration and open research.

The backstory: I had submitted a short paper (co-authored with my colleague David Matthews from the Maths department) to a conference workshop (WWW2014: Social Machines). While the paper was accepted (yay!), the reviewers had suggested a number of substantial revisions. Since the whole thing had to be written in LaTeX, with various associated style, bibliography and other files, and version control was important, we decided to create a GitHub repo for the revision process. I’d seen GitHub used for paper authoring before by another colleague and wanted to try it out.

Besides version control, we also decided to make use of other features of GitHub, including ‘issues’. I took the long list of reviewer comments and filed them as a new issue. We then separated these out into themes, which were given their own sub-issues. From here, we could clearly see what needed changing, and discuss how we were going to do it. Finally, once we were satisfied with an issue, it could be closed.

At first I considered making the repository private – I was a little bit nervous to put all the reviewer comments up on a public repo, especially as some were fairly critical of our first draft (although, in hindsight, entirely fair). In the end, we opted for the open approach – that way, if anyone is interested they can see the process we went through. While I doubt anyone will be interested in such a level of detail for this paper, opening up the paper revision process as a matter of principle is probably a good idea.

With the movement to open up the ‘grey literature’ of science – preliminary data, unfinished papers, failed experiments – it seems logical to extend this to the post-peer-review revision process. For very popular and / or controversial papers, it would be interesting to see how authors have dealt with reviewer comments. It could help provide context for subsequent debates and responses, as well as demystify what can be a strange and scary process for early-career researchers like myself.

I’m sure there are plenty of people more steeped in the ways of open science who’ve given this a lot more thought. New services like FigShare, and open access publishers like PLoS and PeerJ, are experimenting with opening up the whole process of academic publishing. There are also dedicated paper-authoring tools that build on git-like functionality – next time, I’d like to try one of the collaborative web-based LaTeX editors like ShareLaTeX or WriteLaTeX. Either way, I’d recommend adopting git, or something git-like, for co-authoring papers and the post-peer-review revision process. The future of open peer review looks bright – and integrating it with an open, collaborative revision process is a no-brainer.

Next on my reading list for this topic is this book on the future of academic publishing by Kathleen Fitzpatrick – Planned Obsolescence.
— UPDATE: Chad Kohalyk just alerted me to a relevant new feature rolled out by Github yesterday – a better way to track diffs in rendered prose. Thanks!

Care.Data: Why we need a new social contract for personal health data

In an ideal world, our collective medical records would be a public good, carefully stewarded by responsible institutions, used to derive medical insights and manage public health better. This is the basic premise of the care.data scheme, and construed as such it suggests a simple moral equation with an obvious answer: give up a little individual privacy for the greater public good. The problem is, our world is not ideal. We’re in the midst of multiple crises of trust in government, in the private sector, and in the ability of our existing global digital infrastructure to adequately deal with the challenges of personal data.

The NHS conducted a privacy impact assessment for the scheme, to identify and weigh its risks and benefits. In discussing why citizens might choose to opt out of sharing their own data (as 40% of surveyed GPs said they would), the final paragraph is both infuriating and revealing:

‘However, some people may believe that any use of patient identifiable data without explicit patient consent is unacceptable. These people are unlikely to be supportive of whatever its potential benefits and may object to the use of personal confidential data for wider healthcare purposes.’

In other words, there are some people who will selfishly exercise their individual rights to privacy (for whatever misguided reasons), to the cost and detriment of the public good.

Although the leaflet promoting the scheme encourages donating one’s data as a contribution to the public health service, even left-wing Bevanites have reason to be sceptical. While many of us instinctively trust ‘our NHS’, the truth is large parts of it are no longer ‘ours’, and the care.data scheme is a perfect example. As expected, the contract to provide the ‘data extraction’ service was won by an unaccountable private sector provider (Atos, who are also responsible for disability benefit assessments), while some of the main beneficiaries of the data itself will be a plethora of commercial entities.

This is not to say that private sector use of health data is inherently bad. The trouble with the scheme goes deeper than that; it is a microcosm of a much wider malaise about the future of personal data and the value of privacy.

The social contract governing the use of our health information was written for a different age, where ‘records’ meant paper, folders and filing cabinets rather than entries in giant, mine-able databases. This social contract (if it ever even existed) never granted a mandate for the new kinds of purposes HSCIC proposes.

Such a mandate would have to be based on a realistic and robust assessment of the long-term risks and a stronger regulatory framework for downstream users. Crucially, it would need to proactively engage citizens, enabling them to make informed choices about their personal data and its role in our national information infrastructure. Rather than seizing this opportunity to negotiate a new deal around data sharing, the architects of this scheme have attempted to usher it in through the back door.

Thankfully, there are alternative ways to reap the benefits of aggregated health data. One example is a Swiss initiative: a patient data co-operative, owned and run by its members. By giving patients themselves a stake and a say in the governance of their data, the project aims to harness that data to ‘benefit the individual citizen and society without discrimination and invasion into privacy’.

Personal data collected unethically is like bad debt. You can aggregate it into complex derivatives, but in the end it’s still toxic. If the NHS start out on the wrong foot with health data, no amount of beneficial re-use will shore up public trust when things go wrong.