The problem with corpuses…

A previous post describes several attempts to provide ‘creative’ programmes – systems to write poetry, music, or even complete musicals by computer.

My concern is that most of the presenters started by saying ‘we built a corpus of jazz chords / stories from hit musicals / pictures / words’ and then went on to analyse the probability that b would (or should) occur given that a already had, or some similar probability function.
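To make the ‘probability of b given a’ idea concrete, here is a minimal sketch in Python. It is not any presenter's actual system: the chord list is a made-up toy corpus and the function name `bigram_probabilities` is my own. It simply counts which item follows which, and turns the counts into conditional probabilities.

```python
from collections import Counter, defaultdict


def bigram_probabilities(sequence):
    """Estimate P(b | a) from adjacent pairs (a, b) in a sequence."""
    follower_counts = defaultdict(Counter)
    for a, b in zip(sequence, sequence[1:]):
        follower_counts[a][b] += 1
    # Normalise each row of counts so the probabilities for a given `a` sum to 1.
    return {
        a: {b: n / sum(followers.values()) for b, n in followers.items()}
        for a, followers in follower_counts.items()
    }


# Hypothetical toy "corpus" of chord symbols, invented for illustration only.
corpus = ["Dm7", "G7", "Cmaj7", "Dm7", "G7", "Cmaj7", "Am7", "Dm7", "G7", "Em7"]

probs = bigram_probabilities(corpus)
print(probs["G7"])  # probabilities of each chord seen after G7 in the toy corpus
```

Note that any transition that never occurs in the corpus simply has no entry, which amounts to a probability of zero. Real systems usually smooth this, but the model can still only redistribute what the corpus contains.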

This is a perfectly viable method for, say, linguistic analysis of normal texts. Tom Hedges’ jazz example, for instance, claims that, given a piece of music, the software can detect who composed it, by analysing the chord sequences.

There is some quite complex maths behind these systems.

The jazz example uses, and compares, several Markov and Bayesian classifiers.

Stephen McGregor’s poetry example says: “This paper puts forth a method for discovering computationally-derived conceptual spaces that reflect human conceptualization of musical and poetic creativity. We describe a lexical space that is defined through co-occurrence statistics, and compare the dimensions of this space with human responses …. This novel method finds low-dimensional subspaces that represent particular conceptual regions within a vector space model of distributional semantics. Word-vectors from these discovered conceptual spaces are considered, and argued to be useful for the evaluation of creativity.”

My concern is partly technical and partly philosophical.

Firstly, it’s easy to get caught up finding better maths and better corpuses. However, by using a corpus you can only ‘create’, with varying degrees of sophistication, within the envelope of what has already been done. I suspect, for example, if you had a corpus of all paintings made by ‘establishment’ painters in Europe before 1900, you could never use it to generate Les Demoiselles d’Avignon, although it might have generated any number of paintings that would have been commercially successful at the time. (See my recent post on Jonas Lundh.)

If you had a corpus of 19th century words, ‘digital’ and ‘computer’ would hardly appear, and if they did, they would have very different meanings (so they would be in different ‘conceptual spaces’).

This limitation would be especially dangerous if this sort of ‘creativity’ became widespread and the works it produced were fed back into the reference corpuses for future ‘creative’ systems to draw on, thus ‘freezing’ the corpuses for a long period.

Secondly, Richard Price, in reading poetry composed by a programme that used two separate corpuses (one based on Wikipedia texts, one on the BL’s own reference collection of 19th century prose), pointed out that the two ‘felt different’. (I can’t remember what he actually said, but that was his argument as I heard it.) This is only to be expected. Language changes.

As Wittgenstein would say, so do language games. A corpus is probably a collection of very many ‘language games’, making up a new somewhat blurred game of its own. (Nineteenth century assumptions, values and terminology, for instance.) So when you draw up your ‘reference corpus’, you have to consider what you are getting: the corpus will impose game rules on all the derivative works you produce whilst using it.

Lastly, I am not sure to what extent these are ‘one shot’ algorithms. That is, they work out a set of probabilities and produce a set of words or chords or pixels, and that’s it.

It would be good to have a ‘fitness function’ to decide how good the resulting artwork was, and then to use a genetic algorithm to try various options and maximise the fitness. I think this is what human creatives do, except for those times when a poem, for example, just ‘comes to them’ and they write it straight out with minimal alterations later. So, again using the example of a poem, you could have fitness functions to assess coherence, grammar, etc., as well as poetic rules if you wanted (e.g. rhyme schemes?), and the programme could grow a poem that optimised these rules.
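A toy sketch of what such a loop might look like, in Python, is given here. Everything in it is an assumption made for illustration: the vocabulary, the list of ‘coherent’ word pairs, the crude rhyme test and the parameter values are all invented, and the fitness function is deliberately simplistic. The point is only the shape of the process: generate candidates, score them, keep the better ones, recombine and mutate, and repeat.

```python
import random

# Hypothetical vocabulary and "coherent" word pairs, invented for this sketch.
VOCAB = ["the", "moon", "sings", "a", "silver", "tune", "soon", "river", "sleeps"]
GOOD_PAIRS = {
    ("the", "moon"), ("moon", "sings"), ("sings", "a"), ("a", "silver"),
    ("silver", "tune"), ("the", "river"), ("river", "sleeps"),
}


def fitness(line):
    """Toy score: one point per 'coherent' adjacent pair, plus one for a rhyme."""
    coherence = sum((a, b) in GOOD_PAIRS for a, b in zip(line, line[1:]))
    rhyme = 1 if line[-1].endswith("oon") else 0  # crude stand-in for a rhyme scheme
    return coherence + rhyme


def evolve(rng, length=5, pop_size=30, generations=40, mutation_rate=0.2):
    """Grow a line of `length` words; return the best line and the best score per generation."""
    population = [[rng.choice(VOCAB) for _ in range(length)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: pop_size // 2]   # selection: keep the fitter half
        children = [population[0]]              # elitism: the best line survives unchanged
        while len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, length)      # one-point crossover
            child = mum[:cut] + dad[cut:]
            if rng.random() < mutation_rate:    # occasional random word swap
                child[rng.randrange(length)] = rng.choice(VOCAB)
            children.append(child)
        population = children
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


best, history = evolve(random.Random(0))
print(" ".join(best), fitness(best))
```

Because the best line is always carried forward, the best score can never fall from one generation to the next. Whether a higher score means a better poem depends entirely on the fitness function, which here knows nothing beyond its small list of word pairs.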

Of course, specifying the ‘fitness function’ would be very difficult. It may well be that human creativity lies not so much in generating new combinations – this can be done with random numbers – as in knowing which ones ‘work’.
