An old friend contacted me after many years, having found me through the internet. As there are many people with my name listed on the internet (several senior academics, for instance, all much published and cited), this started me thinking about dataset analysis again.
Thanks to Valery's Blog I found a relevant paper: "You Are What You Say: Privacy Risks of Public Mentions" by Dan Frankowski, Dan Cosley, Shilad Sen, Loren Terveen and John Riedl. They argue that data in two sparse relation spaces can be linked, and mined to provide more information than the user might have expected:
“It is common to segregate different aspects […of our lives…] in different places: you might write opinionated rants about movies in your blog under a pseudonym while participating in a forum or web site for scholarly discussion of medical ethics under your real name. However, it may be possible to link these separate identities, because the movies, journal articles, or authors you mention are from a sparse relation space whose properties (e.g., many items related to by only a few users) allow reidentification. This re-identification violates people’s intentions to separate aspects of their life and can have negative consequences; it also may allow other privacy violations, such as obtaining a stronger identifier like name and address.”
When I googled the way my friend probably looked for me on the internet, my name jumped right out. The way he probably found me was a combination of my name and something he knew about me. No-one else with my name – in fact relatively few people – has that link (i.e. it was a sparse relation space). The intersection of my name and this other link identified me. (If you just enter my name, I'm number 42 out of 43,600 on Google, and you have no easy way of being sure which is me. If you google my name plus "simulation", for example, I'm the first 6 items out of 397.)
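To make the intersection idea concrete, here is a minimal sketch in Python. The records, names and topics are invented purely for illustration (they are not real search results), and `candidates` is a hypothetical helper: a name alone leaves several candidates, while the name plus one rarely shared link leaves just one.

```python
# Invented toy records: four people share a name, but only one of them
# also has the "simulation" link. None of this is real data.
PEOPLE = [
    {"id": 1, "name": "J. Smith", "topics": {"physics", "teaching"}},
    {"id": 2, "name": "J. Smith", "topics": {"medicine", "ethics"}},
    {"id": 3, "name": "J. Smith", "topics": {"marketing", "simulation"}},
    {"id": 4, "name": "J. Smith", "topics": {"history"}},
    {"id": 5, "name": "A. Jones", "topics": {"simulation", "physics"}},
]


def candidates(records, name, topic=None):
    """Return the records matching the name and, if given, the extra topic."""
    matches = [r for r in records if r["name"] == name]
    if topic is not None:
        matches = [r for r in matches if topic in r["topics"]]
    return matches


by_name = candidates(PEOPLE, "J. Smith")
by_name_and_topic = candidates(PEOPLE, "J. Smith", "simulation")
print(len(by_name), len(by_name_and_topic))
```

Neither the name nor the topic is unique on its own in this toy data; it is only the combination that picks out a single record.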
Some people expose enormous amounts of their lives on the internet: blogs, sites, personal photos on Flickr, etc. I'm not quite that bad, but as I don't have any need to hide one half of my life from the other, and as I'm involved in marketing, it doesn't worry me that I can be googled. (Indeed the Googlebot hit this blog site over 33,000 times last year – roughly one hit every fifteen minutes, on average – with Inktomi and the MSN bot not far behind. People were referred to this site by Google searches over 1300 times – more than three times a day on average.)
What is interesting, though, is this idea of the sparse relation space. The article says:
“A sparse relation space is a dataset that (a) relates people to items; (b) is sparse, having relatively few relationships recorded per person; and (c) involves a relatively large space of items. Examples of sparse relation spaces include customer purchase data from Target stores, music played on an online music player like iTunes, articles edited in Wikipedia, and books mentioned by bloggers on their public blogs. Sparse relation spaces differ from databases… which have a fixed number of columns (like zip code) and values present for most users…. Privacy loss may occur whenever an agent has access to two sparse relation spaces with overlapping users and items. If there is no overlap, there is no risk. However, overlap is a real possibility as people’s relations to items are increasingly available, whether explicitly revealed (forums, blogs, ratings) or implicitly collected (web logs, purchase history). Risks are most severe when one of the datasets is identified, i.e., has personally identifying data such as a social security number or a name and address. Non-identified datasets lack such data but can be used with an identified dataset to leak sensitive information.”
It's the sparseness of the relationships which helps identification: if the space were dense, with most people related to most items, any one relation would be shared by too many people to single anyone out. When you think about it, though, most of our relations are sparse, in the sense that they are datasets with few relationships recorded per person. (Even human relationships: I see my dentist twice a year, my accountant once a year, etc.)
So perhaps a fruitful field for large dataset analysis consists in hoovering up small datasets and deducing links between them. The article goes on to examine algorithms for making these links, and ways in which database owners, or users, might make identification more difficult.
What's interesting about the paper is the idea that intersections don't just have to be between facts or tags (like a name plus a key word) but can also be deduced between fuzzy facts, e.g. from anonymous postings. Presumably this is how Holmes works, though Tangram appears to do something else entirely. For any one relation space you don't have enough information to deduce more. But put two together, and coincidences between them start to appear, which allow you to make increasingly probable assumptions.
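The "put two together" step can be sketched roughly as follows. This is my own illustration, not necessarily the algorithm the paper uses: it scores each user in an identified dataset against the items mentioned by a pseudonymous poster, giving rare items more weight through an inverse-frequency-style weighting. The users, film titles and function names are all invented for the example.

```python
import math
from collections import Counter

# Invented "identified" relation space: user name -> items they relate to.
IDENTIFIED = {
    "alice": {"Star Wars", "Obscure Film A", "Obscure Film B"},
    "bob": {"Star Wars", "Titanic"},
    "carol": {"Star Wars", "Titanic", "Obscure Film C"},
}

# Invented items mentioned by one pseudonymous poster in a second dataset.
PSEUDONYM_ITEMS = {"Star Wars", "Obscure Film A", "Obscure Film B"}


def item_weights(dataset):
    """Weight each item by log(users / users mentioning it): rarer is heavier."""
    n_users = len(dataset)
    counts = Counter(item for items in dataset.values() for item in items)
    return {item: math.log(n_users / count) for item, count in counts.items()}


def rank_candidates(target_items, dataset):
    """Rank identified users by the summed weight of items shared with the target."""
    weights = item_weights(dataset)
    scores = {
        user: sum(weights[item] for item in items & target_items)
        for user, items in dataset.items()
    }
    # Highest score first; ties are broken alphabetically by user name.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


ranking = rank_candidates(PSEUDONYM_ITEMS, IDENTIFIED)
for user, score in ranking:
    print(f"{user}: {score:.3f}")
```

In this toy data everyone mentions Star Wars, so that coincidence carries no weight at all; it is the two obscure films, each related to by a single user, that point to one candidate.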