COMMENTARY: HE WHO MINES DATA MAY STRIKE FOOL'S GOLD
Michael Drosnin has performed a tremendous public service by writing The Bible Code, the fast-selling new book that claims to find hidden messages in the Bible about dinosaurs, Bill Clinton, and the Land of Magog. Not because Drosnin is correct, but because his methodology is so bad that it's a valuable example of how not to read data.
The pitfall Drosnin tumbled into threatens to ensnare any unwary practitioner of "data mining," the popular technique for building predictive models of the real world by discerning patterns in masses of computer data. Done right, data mining can help discover drugs, forecast recessions, weed out credit-card fraud, and pinpoint sales prospects. Done wrong, it produces bogus correlations that range from useless to dangerous.
The error Drosnin committed in The Bible Code was a data-mining classic. He wrote out the Hebrew Bible on a huge grid of letters and used a computer to look for words that appear across, up, down, or diagonally. The cryptic "messages" consist of seemingly related words that appear near each other--for instance, dinosaur and asteroid.
GARBAGE IN. It's best not to spill too much ink on The Bible Code. Drosnin says that he used the code to foresee the assassination of Israeli Prime Minister Yitzhak Rabin, among other events. But his approach is immune to statistical verification--or rebuttal, for that matter. Eliyahu Rips, the Israeli mathematician whom Drosnin credits as the code's discoverer, says he doesn't support the book. Its main value, then, is to illustrate a principle enunciated by Andrew W. Lo, a finance professor at Massachusetts Institute of Technology: "Given enough time, enough attempts, and enough imagination, almost any pattern can be teased out of any data set."
Experts from economists to epidemiologists have made similar mistakes. It was once common to mine health records in search of "hot spots" with above-average cancer rates. Epidemiologists would then develop hypotheses about what might have caused the apparent outbreak. This terrorized residents, usually for no good reason. Some places have above-average cancer rates by pure chance.
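A minimal simulation sketch of that last point, using made-up numbers rather than real health records: every town below has exactly the same underlying risk, yet a search for towns far above the average still turns up "hot spots." The function names, population size, rate, and threshold are all illustrative assumptions.

```python
import random

def simulate_town_cases(population, rate, rng):
    """Count cases in one town where every resident faces the same risk."""
    return sum(1 for _ in range(population) if rng.random() < rate)

def find_hot_spots(n_towns=500, population=1000, rate=0.005,
                   threshold=1.6, seed=0):
    """Return (all case counts, the counts flagged as 'hot spots').

    All figures are hypothetical illustration values, not real data.
    """
    rng = random.Random(seed)
    expected = population * rate  # average number of cases per town
    cases = [simulate_town_cases(population, rate, rng) for _ in range(n_towns)]
    # Flag any town whose count is at least `threshold` times the average.
    hot = [c for c in cases if c >= threshold * expected]
    return cases, hot

if __name__ == "__main__":
    cases, hot = find_hot_spots()
    print(f"{len(hot)} of {len(cases)} towns look like hot spots by chance alone")
```

Because nothing distinguishes one simulated town from another, any explanation devised for the flagged towns would be explaining noise.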
Data mining can lead to costly misinterpretations. ProCyte Corp. in Kirkland, Wash., was dismayed in 1992 when a clinical trial found that its new drug, Iamin, didn't seem to promote general healing of diabetic ulcer wounds. So the company searched through subsets of the data and found that Iamin seemed to work on certain foot wounds. But that was a statistical fluke, as it turned out after another expensive and fruitless clinical trial. Not allowed drug status, Iamin is now sold as a wound dressing.
Finance is rife with wrong-headed data mining. David J. Leinweber, managing director of First Quadrant Corp. in Pasadena, Calif., which manages $20 billion in assets, likes to illustrate the problem with "Stupid Data-Miner Tricks." For example, he sifted through a United Nations CD-ROM and discovered that historically, the single best predictor of the Standard & Poor's 500-stock index was butter production in Bangladesh.
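The mechanism behind such a "discovery" can be sketched in a few lines. This is not Leinweber's analysis, only an illustration under assumed numbers: a random "index" is compared against thousands of equally random candidate series, and the best match is kept. The names `pearson` and `best_spurious_predictor`, and the counts of series and years, are hypothetical.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def best_spurious_predictor(n_series=2000, n_years=12, seed=1):
    """Sift through random series and return the strongest correlation found."""
    rng = random.Random(seed)
    # The "index" to be predicted is pure noise.
    target = [rng.gauss(0, 1) for _ in range(n_years)]
    best_r = 0.0
    for _ in range(n_series):
        # Each candidate "predictor" is also pure noise, unrelated to the target.
        candidate = [rng.gauss(0, 1) for _ in range(n_years)]
        r = pearson(candidate, target)
        if abs(r) > abs(best_r):
            best_r = r
    return best_r

if __name__ == "__main__":
    print(f"best correlation among unrelated series: {best_spurious_predictor():.2f}")
```

With only a dozen data points and thousands of candidates, a strong-looking correlation is close to guaranteed, even though no series has anything to do with the target.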
The lesson: A formula that happens to fit the data of the past won't necessarily have any predictive value. That's true even of the Index of Leading Economic Indicators, which the Commerce Dept. turned over to the Conference Board in 1995. University of Pennsylvania economist Francis X. Diebold says the Commerce Dept.'s periodic rejiggering of the index made it fit the historical data more closely but didn't improve it as a forecasting tool.
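One way to see the gap between fitting the past and forecasting is to hold back data the formula never saw. The sketch below is a toy under stated assumptions, not a model of any real indicator: the "market" and every candidate rule are coin flips, the rule with the best record on the past is selected, and it is then scored on a future period. The names `hit_rate` and `backtest_then_forecast` and all sizes are hypothetical.

```python
import random

def hit_rate(signal, target):
    """Fraction of periods in which signal and target point the same way."""
    matches = sum(1 for s, t in zip(signal, target) if s == t)
    return matches / len(target)

def backtest_then_forecast(n_rules=500, n_past=20, n_future=20, seed=2):
    """Pick the rule that best fits the past; return (past fit, future fit)."""
    rng = random.Random(seed)

    def coin_flips(n):
        # +1 means "up", -1 means "down"; every value is a fair coin flip.
        return [rng.choice((-1, 1)) for _ in range(n)]

    past, future = coin_flips(n_past), coin_flips(n_future)
    # Each rule is noise across both periods, so none has real predictive power.
    rules = [(coin_flips(n_past), coin_flips(n_future)) for _ in range(n_rules)]
    best_past, best_future = max(rules, key=lambda rule: hit_rate(rule[0], past))
    return hit_rate(best_past, past), hit_rate(best_future, future)

if __name__ == "__main__":
    in_sample, out_of_sample = backtest_then_forecast()
    print(f"fit on the past: {in_sample:.0%}, on the future: {out_of_sample:.0%}")
```

Averaged over many runs, the selected rule's record on the past looks impressive while its record on the future hovers around a coin toss, which is the pattern to expect whenever a formula is chosen only because it fits history.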
The problem could get worse. With desktop computers becoming more powerful, data-mining tools are being used by people who are clueless about statistics. It's human nature to search for patterns--whether constellations in the stars or faces in the clouds. And computers allow that impulse to run wild. Says Alexis DePlanque, a senior research analyst at META Group in Stamford, Conn.: "We need to be sure we're not just empowering people to shoot themselves in the foot." That's true whether the data come from supermarket scanners or the Bible.
By Peter Coy