Let data speak for themselves?

The Edge World Question for 2007 is "What are you optimistic about?" As in previous years, it has thrown up some fascinating answers. One is an essay by Bart Kosko, who argues that data are increasingly being freed from the tyranny of modelling and allowed to tell their own story.

His argument is that, historically, we have made sense of data by imposing models on them. Sometimes, these "rely on an arcane ability to guess at nature with symbolic mathematics. It is hard to see a direct evolutionary basis for such refined symbol manipulation although there may be several indirect bases that involve language skills and spatial reasoning." Some models may be productive, but some carry a high degree of observer bias. Often, especially with complex data, we don't (or can't) realise how high this is.

Increasing computing power, however, allows us to let the data speak for themselves. He cites several ways in which this can be done:

Neural networks:
"Feedforward neural networks further reduced the expert to a supervisor who gives the network preferred input-output pairs to train its synaptic throughput structure. But this comes at the expense of having no logical audit trail in the network's innards that can explain what patterns the network encodes or what patterns it partly or wholly forgets when it learns new input-output patterns. Unsupervised neural networks tend to be less powerful but omit more modeler bias because the user does not give the network preferred outputs or teaching signals. All these AI systems are model-free in the sense that the user does not need to state a math model of the process at hand. But each still has some form of a functional math model that converts inputs to outputs."
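The supervised training Kosko describes can be sketched in a few lines of plain Python. This is a minimal illustration, not his method: a single sigmoid neuron taught the OR function purely from preferred input-output pairs. The function name `train_neuron` and all parameters are my own; note that after training, the learned weights are just three opaque numbers, with no symbolic audit trail of what was encoded.

```python
import math
import random

def train_neuron(pairs, epochs=5000, lr=0.5, seed=0):
    """Supervised training from input-output pairs alone: gradient descent
    adjusts the 'synaptic' weights; no symbolic model is ever stated."""
    rng = random.Random(seed)
    w = [rng.uniform(-1, 1) for _ in range(3)]  # two input weights + a bias
    sig = lambda z: 1 / (1 + math.exp(-z))
    for _ in range(epochs):
        for (x1, x2), y in pairs:
            out = sig(w[0] * x1 + w[1] * x2 + w[2])
            grad = (out - y) * out * (1 - out)   # squared-error gradient
            w[0] -= lr * grad * x1
            w[1] -= lr * grad * x2
            w[2] -= lr * grad
    return w

# The "expert as supervisor": only examples of OR are supplied.
pairs = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = train_neuron(pairs)
print(w)  # three weights -- correct behaviour, but an opaque explanation
```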

Bayesian models:
"Closed-form statistics also produced Bayesian models as a type of equation-based expert system where the expert can inject his favorite probability curve on the problem at hand. These models have the adaptive benefit that successive data often washes away the expert's initial math guesses just as happens in an adaptive fuzzy system. The AI systems are Bayesian in this sense of letting experts encode expertise directly into a knowledge structure—but again the knowledge structure itself is a model of sorts and thus an imposition on the data."
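The "washing away" of the expert's initial guess is easy to see in the simplest conjugate case. A hedged sketch (my own toy example, not Kosko's): a Beta prior on a coin's bias updated by observed flips. The expert's confident 90%-heads prior dominates with no data, but drowns in the empirical frequency as observations accumulate.

```python
def beta_posterior(prior_a, prior_b, heads, tails):
    """Conjugate Beta-Binomial update: returns the posterior mean,
    which drifts from the prior guess toward the observed frequency."""
    a, b = prior_a + heads, prior_b + tails
    return a / (a + b)

# An expert's strong prior that a coin is 90% heads: Beta(9, 1).
print(beta_posterior(9, 1, 0, 0))        # -> 0.9 (no data: pure prior)
print(beta_posterior(9, 1, 50, 50))      # -> ~0.54 (data pulling it down)
print(beta_posterior(9, 1, 5000, 5000))  # -> ~0.50 (prior washed away)
```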

The bootstrap:
"The bootstrap has produced a revolution of sorts in statistics… in effect puts the data set in a bingo hopper and lets the user sample from the data set over and over again just so long as the user puts the data back in the hopper after drawing and recording it. Computers easily let one turn an initial set of 100 data points into tens of thousands of resampled sets of 100 points each. Efron and many others showed that these virtual samples contain further information about the original data set. This gives a statistical free lunch except for the extensive computation involved—but that grows a little less expensive each day…. Bootstrap resampling has started to invade almost every type of statistical decision making. Statisticians have even shown how to apply it in complex cases of time-series and dependent data." (Further outline here and a more technical explanation here.)

Exciting stuff. What new pictures might emerge slowly from the data swamp? I've pedantically referred to data in the plural here (datum is the singular) because it seems to give them more independence and authority. One day, will data have rights, like humans do and animals should?
