How does data scale?

Fascinating post on Good Math, Bad Math arguing that we have much more data now and can use it to discover new relationships, but that scale also brings new problems.

Statistics: Google have recently “sorted 1TB (stored on the Google File System as 10 billion 100-byte records in uncompressed text files) on 1,000 computers in 68 seconds. By comparison, the previous 1TB sorting record is 209 seconds on 910 computers.” and “took six hours and two minutes to sort 1PB (10 trillion 100-byte records) on 4,000 computers… writing it to 48,000 hard drives.”
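To put those numbers in perspective, here is a rough back-of-the-envelope throughput calculation (my own arithmetic based on the figures quoted above, not from the original post):

```python
# Rough throughput implied by the quoted sort benchmarks.
# Figures come from the quote above; the arithmetic is illustrative only.
TB = 10**12   # bytes (10 billion 100-byte records)
PB = 10**15   # bytes (10 trillion 100-byte records)

# 1 TB on 1,000 machines in 68 seconds
tb_total = TB / 68                      # ~14.7 GB/s aggregate
tb_per_machine = tb_total / 1000        # ~14.7 MB/s per machine

# 1 PB on 4,000 machines in 6 hours 2 minutes
pb_seconds = 6 * 3600 + 2 * 60          # 21,720 seconds
pb_total = PB / pb_seconds              # ~46 GB/s aggregate
pb_per_machine = pb_total / 4000        # ~11.5 MB/s per machine

print(f"1 TB sort: {tb_total/1e9:.1f} GB/s aggregate, {tb_per_machine/1e6:.1f} MB/s per machine")
print(f"1 PB sort: {pb_total/1e9:.1f} GB/s aggregate, {pb_per_machine/1e6:.1f} MB/s per machine")
```

In other words, no single machine is doing anything heroic; the records are simply spread across thousands of ordinary disks and sorted in parallel.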

Databases: interesting sideswipe at RDBMS programmers: sometimes a simpler application can be faster and better, e.g. if it allows parallel processing. (Google's own MapReduce, by the way, has nothing to do with cartographic maps; it's a parallel processing system named after the functional map and reduce operations.) “One petabyte is a thousand terabytes, or, to put this amount in perspective, it is 12 times the amount of archived web data in the U.S. Library of Congress as of May 2008. In comparison, consider that the aggregate size of data processed by all instances of MapReduce at Google was on average 20PB per day in January 2008.”
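Since the quote leans on MapReduce, a toy single-process sketch of the map/reduce programming model may help; this is a purely illustrative word count, not Google's implementation, and all the names here are my own:

```python
# Toy sketch of the map/reduce model: a map function emits key/value pairs,
# a shuffle step groups values by key, and a reduce function combines each group.
# A real MapReduce system runs the map and reduce steps in parallel across
# thousands of machines; here everything happens in one process.
from collections import defaultdict

def map_phase(document):
    # Emit (word, 1) for every word in the document.
    for word in document.lower().split():
        yield word, 1

def reduce_phase(word, counts):
    # Combine all counts emitted for one word.
    return word, sum(counts)

def run_mapreduce(documents):
    grouped = defaultdict(list)
    for doc in documents:                      # map step (parallelisable per document)
        for key, value in map_phase(doc):
            grouped[key].append(value)         # shuffle: group values by key
    return dict(reduce_phase(k, v) for k, v in grouped.items())  # reduce step

print(run_mapreduce(["big data gets bigger", "big data needs parallel processing"]))
# {'big': 2, 'data': 2, 'gets': 1, 'bigger': 1, 'needs': 1, 'parallel': 1, 'processing': 1}
```

The point of the sideswipe above is that this model is deliberately simpler than a relational database, and that simplicity is exactly what makes it easy to parallelise.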

Advantages of large datasets: “Scale means that you can do some amazing things that used to be impossible.” GMBM quotes an example from medicine: comparing the genome of one virus against the genomes of all other known viruses.
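As a toy illustration of that kind of one-against-everything comparison (my own sketch; the k-mer/Jaccard scoring and the names here are assumptions, not the method used in the medical example):

```python
# Compare one query genome against every known genome by shared k-mers.
# Purely illustrative: real viral genomics uses far more sophisticated methods.
def kmers(sequence, k=8):
    # All substrings of length k in the sequence.
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def similarity(a, b, k=8):
    # Jaccard similarity of the two genomes' k-mer sets.
    ka, kb = kmers(a, k), kmers(b, k)
    return len(ka & kb) / len(ka | kb) if (ka | kb) else 0.0

def rank_against_all(query, known_genomes, k=8):
    # At scale, known_genomes can be every virus ever sequenced.
    scores = {name: similarity(query, seq, k) for name, seq in known_genomes.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

The interesting part is the size of `known_genomes`, not the scoring function: what used to be a comparison against a handful of reference sequences becomes a comparison against everything.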

Disadvantages: “there are things that would be easy for small-scale data that become unmanageable on a large scale…. We've drawn that line in terms of something called algorithmic complexity. The line has traditionally been that anything whose complexity can be bounded by some polynomial computed from the size of its input is computable, while anything whose complexity is bounded by some exponential in the size of its input is not. But scale changes that. When you're working with large-scale data, polynomial complexity is not really doable. Even something that grows very slowly, like sorting, becomes close to impossible at scale.”
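A quick, illustrative calculation of raw operation counts makes the growth rates concrete (the 10^8 operations per second for a single machine is my own assumption, chosen only for illustration):

```python
# How different growth rates behave at "big data" sizes.
# OPS_PER_SECOND is an assumption for illustration, not a measured figure.
import math

OPS_PER_SECOND = 1e8
n = 10**10   # 10 billion records, as in the 1 TB sort above

for label, ops in [
    ("n        (one linear pass)", n),
    ("n log2 n (comparison sort)", n * math.log2(n)),
    ("n^2      (compare all pairs)", n ** 2),
]:
    seconds = ops / OPS_PER_SECOND
    print(f"{label}: {ops:.2e} ops, about {seconds:.2e} s on one machine")
# One pass: ~100 seconds; sorting: ~1 hour of pure comparisons;
# all pairs: ~1e12 seconds, i.e. tens of thousands of years.
```

Even this understates the problem for sorting, because 1 TB does not fit in one machine's memory, so disk and network I/O dominate the real cost; that is why the benchmark above spreads the work over a thousand machines.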

Disadvantage two: “There's another important downside to scale. When we look at large quantities of information, what we're really doing is searching for patterns. And being the kind of creatures that we are, and given the nature of the laws of probability, we are going to find patterns. Distinguishing between a real, legitimate pattern and something random that just happens to look like a pattern can be somewhere between difficult and impossible. Using things like Bayesian methods to screen out the false positives can help, but scale means that scientists need to learn new methods – both the new ways of doing things that they couldn't do before, and the new ways of recognizing when they've screwed up.”
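That "patterns in pure noise" problem is easy to demonstrate with a small simulation (all thresholds and sample sizes below are my own illustrative choices):

```python
# Run many tests on purely random data and count how many look "significant".
# There is nothing to find here, yet a fixed threshold still flags hundreds of
# apparent patterns once enough comparisons are made.
import random

random.seed(0)
N_TESTS = 10_000      # at real scale this could be millions of comparisons
SAMPLES = 30

false_positives = 0
for _ in range(N_TESTS):
    # Two unrelated fair coins; any difference between them is pure noise.
    a = sum(random.random() < 0.5 for _ in range(SAMPLES))
    b = sum(random.random() < 0.5 for _ in range(SAMPLES))
    # Crude test: flag a "pattern" if the head counts differ a lot
    # (|a - b| >= 8 is roughly a 1-in-20 event for these sample sizes).
    if abs(a - b) >= 8:
        false_positives += 1

print(f"{false_positives} 'patterns' found in {N_TESTS} tests of pure randomness")
```

This is the multiple-comparisons problem the quote is gesturing at: the more hypotheses you test, the more convincing-looking noise you collect, which is why screening methods (Bayesian or otherwise) become essential rather than optional.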
