spiffre.eu

The Statistical Nature of 21st Century Information

on January 19, 2010 | Internet

Not so long ago, somebody I know was telling me how he had never used Wikipedia. His argument was that it wasn’t to be trusted:

“I heard that when a malicious change is made, it only takes a couple of minutes to fix it; but if someone changes Napoléon’s D.O.B., the person responsible for finding out which is which doesn’t necessarily know the right answer.”

It used to be that we dreamt of a monolithic computer that would know everything (hello, Multivac); you’d walk up to it and ask a question, either directly or through a punched card (depending on the decade you were in). And the computer would know the answer, the way you know how to conjugate verbs or an academic knows what Napoléon’s dates are: by referring to an academic source of information. Look around all you want nowadays; you won’t find such a thing. We do, however, have a system that gives the same result through very different means: by simply indexing the whole world, we have a statistical source of information.

As an example, we’ve tried for years (decades, even) to teach word processors to correct our mistakes: typos, spelling mistakes, and grammar errors. A 12-year-old could do it. And yet a computer still can’t, not really (admittedly, a lot of progress has been made, but it took programmers decades to get there).

Now, on the other hand, if you simply google “napoleon 1769 1821”, without a doubt (and without even clicking on any result), you know you have the right, official, 100% certified dates for Napoléon. As another example, let’s say you’re not sure whether “unconstitutional” or “inconstitutional” is the proper spelling; just type in both and see for yourself, again without accessing any specific website other than Google.

Does Google know what Napoléon’s dates are? Does it have any clue as to the proper spelling of the word “unconstitutional”? Has it been taught any grammar rules? Absolutely not. And yet you can have certainty through two indicators: the number of results, and the type of results – are the top links leading to official, commercial or academic websites?
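For the programmers in the room, here is a rough sketch of those two indicators in Python. It is only an illustration: the function names are made up for this post, the numbers are placeholders rather than real result counts, and treating ".edu" and ".gov" hosts as the "official or academic" ones is an assumption of mine. Nothing here actually talks to a search engine; you would type the counts and links in by hand.

```python
from urllib.parse import urlparse


def pick_by_hit_count(hit_counts):
    """Indicator 1: return the candidate with the most results and its share of all results."""
    if not hit_counts:
        raise ValueError("need at least one candidate")
    total = sum(hit_counts.values())
    if total == 0:
        raise ValueError("no results for any candidate")
    best = max(hit_counts, key=hit_counts.get)
    return best, hit_counts[best] / total


def authoritative_share(top_links, suffixes=(".edu", ".gov")):
    """Indicator 2: fraction of the top links whose host ends in an 'official' suffix.

    The suffix list is an assumption made for this sketch, not a rule from anywhere.
    """
    if not top_links:
        return 0.0
    hosts = [(urlparse(link).hostname or "") for link in top_links]
    return sum(host.endswith(suffixes) for host in hosts) / len(hosts)


# Placeholder numbers for illustration only, not real search results.
counts = {"unconstitutional": 900, "inconstitutional": 100}
best, share = pick_by_hit_count(counts)
print(best, share)
```

The larger the share, the more lopsided the vote; a near 50/50 split would suggest that both spellings are in use, or that neither is settled.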

The point of this, of course, is that a single, authoritative source of information can be mistaken, deliberately corrupted, unintentionally biased, unavailable, etc., while a statistical source – a mass of information stored all over the place – is a distributed, incorruptible source of information.

For an in-depth explanation, there’s an entire chapter in The Long Tail by Chris Anderson (some of it can be read here, in case you’re too lazy and/or cheap to buy the book).

