Work behind the scenes

On Saturday morning I read two articles by Tommaso Venturini, one of them co-written with Bruno Latour. The co-written one, “Fill in the Gap”, argues that, for the social sciences, “the advent of digital data poses a … challenge … Micro-macro models have run their course. The time is now to develop the formal techniques necessary to unfold the origami of collective existence and this should be the aim of the renewed alliance between the social and natural sciences. For the next few years, at least, efforts should be shifted from simulating to mapping, from simple explanations to complex observations.” The second essay, “Three maps and three misunderstandings: A digital mapping of climate diplomacy”, is perhaps an example of what they mean. It uses the published proceedings of the negotiations of the United Nations Framework Convention on Climate Change (UNFCCC) between 1995 and 2013 as a corpus for analysing climate change negotiations.

First, the authors identify “three common misunderstandings about digital methods”:
1. “a split-conception of digital traces that is both too restrictive and too ambitious”: “as classic statistics has made very clear, the representativeness of a sample depends only indirectly upon its size. The quality of a sample rather lies in its similarity to the sampled population and in its capacity to include the same variability … in digital research, the work done to adapt the information to the investigation’s objectives (data) is most of the time preceded by the work done by someone else to make such information available (traces). The trace/data distinction invites us to consider both operations with the greatest attention as they both influence research results”
2. “a vacillation between disregard and distrust on the conditions of production of digital traces”, on which they say: “the work done to adapt the information to the investigation’s objectives (data) is most of the time preceded by the work done by someone else to make such information available (traces). The trace/data distinction invites us to consider both operations with the greatest attention as they both influence research results … Most digital traces are collected for purposes of marketing (such as loyalty or credit cards), surveillance (as in air travel), technical optimization (as in telecommunication networks), or information sharing (as in the ENB reports we address in this paper). In one way or the other, they are second-hand data, the production of which is not directly controlled by the researchers. Using these traces requires therefore questioning the conditions of their production.”
3. “a tendency to mistake ‘digital’ for ‘automatic’”: “computerized research is neither faster nor easier. The experience of a digital project is the experience of a series of successive impediments. First, choices must be made on how to harvest the traces: what we are tracking is not as easy to collect as we had thought. The traces are messier than expected, and transforming them into data is problematic (how to correct the mistakes, detect the duplicates, manage normalization, remove irrelevant results?). Second, we have to select the most appropriate analysis tools: analysis algorithms are numerous, but they are all poorly documented and must all be adjusted to the available data. Third, results must be visualized, but how to choose among dozens of possible results and visualizations? And after having overcome all these impediments, we often obtain disappointing results that force us to reconsider all choices made throughout the process.”

The result is a set of interesting graphics, showing the focus or salience of different countries and different topics at different times in the negotiations. It’s a useful piece of research, though I have some problems with their methods. Only one source is used. Key terms are chosen by “an original approach to text analysis that combines automatic extractions and manual selection of the key issue-terms”. I don’t think this is original, if only because automatic selection is usually disappointing and most people add some sort of manual check. Also, in politics and diplomacy, a term can have different meanings depending on who uses it (Brexit, for example!).
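
To show what I mean by automatic extraction plus a manual check, here is a minimal Python sketch. It is not the authors' method. The toy sentences, the stopword list and the `approved` set are all invented for illustration, and the "automatic" step is nothing cleverer than counting adjacent word pairs.

```python
from collections import Counter
import re

# Hypothetical toy corpus standing in for negotiation reports.
documents = [
    "Parties discussed adaptation funding and technology transfer.",
    "The contact group debated technology transfer and adaptation funding again.",
    "The contact group on adaptation funding met in the afternoon.",
]

# Assumed, hand-made stopword list (not from the paper).
STOPWORDS = {"the", "and", "on", "in", "again"}


def candidate_terms(docs, min_count=2):
    """Automatic step: count adjacent word pairs, skipping stopwords."""
    counts = Counter()
    for doc in docs:
        words = re.findall(r"[a-z]+", doc.lower())
        for first, second in zip(words, words[1:]):
            if first not in STOPWORDS and second not in STOPWORDS:
                counts[f"{first} {second}"] += 1
    # Keep only pairs that recur often enough to be worth a human look.
    return {term: n for term, n in counts.items() if n >= min_count}


def manual_selection(candidates, approved):
    """Manual step: keep only the candidates a human reader has approved."""
    return {term: n for term, n in candidates.items() if term in approved}


candidates = candidate_terms(documents)

# Assumed human judgement: "contact group" recurs but is procedure, not an issue.
approved = {"adaptation funding", "technology transfer"}
issue_terms = manual_selection(candidates, approved)
```

The counting step happily proposes “contact group” alongside the real issues, and only the hand-made `approved` set removes it. Nothing in either step knows who is using a term or what they mean by it, which is the Brexit problem again.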

I went straight from this to a workshop at the Photographers’ Gallery, given by Adam Brown and Nicolas Maleve. I confess I had no idea what to expect. In the event, we started by photographing each other, in the style of Irving Penn’s Small Trades series. Then we all cropped the photos by hand in Photoshop to 150 by 150 px head shots and fed these to a machine learning algorithm written by Nicolas (in Python and OpenGL). This learned what constituted a human face, and then worked on the original images to identify the faces, and then to swap bits of other faces in and out. (The latter bit was just for fun; the former is a basic building block of surveillance technology.)

The result was both very amusing and a practical lesson in the work involved in preparing such machine learning databases. Often the classification, which has to be done by humans, is done through Amazon’s Mechanical Turk, which I once described as an API for the poor. It is a long and ultimately boring job, and if you only get paid 5 cents per match, there is little incentive to do it well. Amazon say they have quality control systems, but I wonder how good these are.
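
I don’t know how Amazon’s own quality control works, so the following is only a sketch of one generic approach: have several people label the same image and accept a label only when enough of them agree. The image ids, the labels and the 0.6 threshold are all made up for illustration.

```python
from collections import Counter


def majority_label(labels, min_agreement=0.6):
    """Return (label, agreement), or (None, agreement) if agreement is too low."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    agreement = votes / len(labels)
    if agreement < min_agreement:
        return None, agreement
    return label, agreement


# Hypothetical annotations: image id -> labels from three different workers.
annotations = {
    "img_001": ["face", "face", "face"],
    "img_002": ["face", "not_face", "face"],
    "img_003": ["face", "not_face", "unsure"],
}

accepted = {}
needs_review = []
for image_id, labels in annotations.items():
    label, agreement = majority_label(labels)
    if label is None:
        # Too much disagreement: send the image back for another look.
        needs_review.append(image_id)
    else:
        accepted[image_id] = label
```

Even this crude check means every image is labelled three times over and the disputed ones a fourth time, so the boring job gets longer, not shorter.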

There is a triangular relationship between user, annotator and algorithm, which is not as simple as it might seem at first sight.

The impact of referring to the Penn Small Trades series was also to make us think about photography as a trade. How would images of a photographer look now, or in twenty years’ time?

I spent Sunday with all this going round in my head, cooking fish.


On Monday morning I found two links on Twitter. The first was to an NY Times article on ‘Maintenance’, which points to declining standards in infrastructure maintenance. These, it argues, are not solely due to lack of money but also to “an impoverished and immature conception of technology, one that fetishizes innovation as a kind of art and demeans upkeep as mere drudgery.”

As I understand it, most academic projects are funded on a fixed-term basis. Once the year or whatever is up, the project ends. There is no more money and the researchers disperse. So who does the maintenance on training sets? Who takes in new data, e.g. to Venturini’s climate change database? Who updates the criteria for data selection and massaging? In some cases, industry works within a longer-term framework. Aircraft, for example, have very prescriptive maintenance schedules and procedures. But I wonder about training sets such as the Sun Database, which Nicolas showed us. (This was written up in 2010 here.) For work since 2010, Google Scholar finds links to the YuPenn dataset and its development Yu++, and also the ‘Maryland’ dataset, and probably several others besides. It is clear that the tendency is to start over: not to rewrite or update an existing dataset, but to start a new one, with an extra category or some other technical tweak that makes it new (and probably not backward-compatible).

The second Twitter link was to Kintsugi, “the Japanese art of repairing broken pottery with lacquer dusted or mixed with powdered gold, silver, or platinum, a method similar to the maki-e technique. As a philosophy, it treats breakage and repair as part of the history of an object, rather than something to disguise … highlighting the cracks and repairs as simply an event in the life of an object rather than allowing its service to end at the time of its damage or breakage”. I suppose this is an attitude that Taleb would call anti-fragile.


Tommaso Venturini, Pablo Jensen and Bruno Latour (2015) “Fill in the Gap. A New Alliance for Social and Natural Sciences”, Journal of Artificial Societies and Social Simulation 18 (2) 11

Tommaso Venturini et al (2014) “Three maps and three misunderstandings: A digital mapping of climate diplomacy”, in Big Data & Society, July–December 2014: 1–19 DOI: 10.1177/2053951714543804
