Edge 248: June 30, 2008
(11,000 words)

THE THIRD CULTURE

THE END OF THEORY
Will the Data Deluge Make the Scientific Method Obsolete?
By Chris Anderson

THE REALITY CLUB

George Dyson, Kevin Kelly, Stewart Brand, W. Daniel Hillis
respond to "The End of Theory"

IN THE NEWS

DER TAGESSPIEGEL

A NEW HUMANISM
("EIN NEUER HUMANISMUS")
By Kai Kupferschmidt

THE NEW REPUBLIC
Hedonic Man
By Alan Wolfe

WALL STREET JOURNAL
How the Rich Spend Their Time: Stressed
By Robert Frank

THE NEW YORKER
The Itch
By Atul Gawande

THE NEW YORKER
What Was I Thinking?
By Elizabeth Kolbert

WALL STREET JOURNAL
Free To Choose, But Often Wrong
By David A. Shaywitz

WASHINGTON POST
How Rich People Spend Their Time




 


Sixty years ago, digital computers made information readable. Twenty years ago, the Internet made it reachable. Ten years ago, the first search engine crawlers made it a single database. Now Google and like-minded companies are sifting through the most measured age in history, treating this massive corpus as a laboratory of the human condition. They are the children of the Petabyte Age.

The Petabyte Age is different because more is different. Kilobytes were stored on floppy disks. Megabytes were stored on hard disks. Terabytes were stored in disk arrays. Petabytes are stored in the cloud. As we moved along that progression, we went from the folder analogy to the file cabinet analogy to the library analogy to — well, at petabytes we ran out of organizational analogies.

[... continue below]

THE REALITY CLUB: George Dyson, Kevin Kelly, Stewart Brand, W. Daniel Hillis

GEORGE DYSON: Just as we may eventually take the brain apart, neuron by neuron, and never find the model, we may discover that true AI came into existence without anyone ever developing a coherent model of reality or an unambiguous theory of intelligence. Reality, with all its ambiguities, does the job just fine. It may be that our true destiny as a species is to build an intelligence that proves highly successful, whether we understand how it works or not.

KEVIN KELLY: My guess is that this emerging method will be one additional tool in the evolution of the scientific method. It will not replace any current methods (sorry, no end of science!) but will complement established theory-driven science. Let's call this data-intensive approach to problem solving Correlative Analytics. I think Chris squanders a unique opportunity by titling his thesis "The End of Theory" because this is a negation, the absence of something. Rather, it is the beginning of something, and this is when you have a chance to accelerate that birth by giving it a positive name.

STEWART BRAND: Digital humanity apparently crossed from one watershed to another over the last few years. Now we are noticing. Noticing usually helps. We'll converge on one or two names for the new watershed and watch what induction tells us about how it works and what it's good for.

W. DANIEL HILLIS: Chris Anderson says that "this approach to science — hypothesize, model, test — is becoming obsolete". No doubt the statement is intended to be provocative, but I do not see even a little bit of truth in it. I share his enthusiasm for the possibilities created by petabyte datasets and parallel computing, but I do not see why large amounts of data will undermine the scientific method. We will begin, as always, by looking for simple patterns in what we have observed and use that to hypothesize what is true elsewhere. Where our extrapolations work, we will believe in them, and when they do not, we will make new models and test their consequences. We will extrapolate from the data first and then establish a context later. This is the way science has worked for hundreds of years.

[...continue below]



article


Der Tagesspiegel
June 26, 2008


A NEW HUMANISM ("EIN NEUER HUMANISMUS")
There are things that don't mix well: water and oil, for example. Or the natural sciences and the humanities. But the two are now approaching each other.

By Kai Kupferschmidt

...as early as 1959, the physicist and writer Charles Percy Snow lamented that the humanities and the natural sciences were drifting apart. Snow coined the phrase "two cultures". At the same time, he saw the need for a "third culture" that would bring humanities scholars and natural scientists together in a common culture.

In the mid-nineties, the American literary agent John Brockman presented his own idea of the third culture, one different from the culture Snow had imagined. Brockman noted that natural scientists such as the biologist Richard Dawkins or the physicist Roger Penrose had taken over the role previously played by literary intellectuals by writing books that explained science to the public. This, Brockman argued, was the third culture.

Meanwhile, Snow's original idea is slowly becoming a reality. A second third culture is opening up. In Germany, traditional humanities scholars and natural scientists are moving closer together. Despite practical problems, there is a growing will on both sides to understand each other ...

Der Tagesspiegel: German Original

Google Translation

...


article


THE NEW REPUBLIC
July 9, 2008


HEDONIC MAN
The new economics and the pursuit of happiness.

Alan Wolfe

The collaboration of Kahneman and Tversky produced one of the major intellectual accomplishments of the late twentieth century: a series of ingeniously designed experiments that raised uncomfortable questions about "utility maximization," which was the major assumption of microeconomics. To wit: it makes no difference in theory whether you lose a ticket to a play or lose the $10 that the ticket cost, but when people lose the ticket they are far less likely to buy another one than when they lose the money. Kahneman and Tversky's explanation is that we create a mental account such that it makes sense to us to pay $10 to see a play but not $20, even though the utility sacrificed by losing the ticket and the money is identical.

Tversky died of cancer in 1996. Kahneman won the Nobel Prize in economics in 2002, and is an emeritus professor at Princeton. Between them, they rattled the role of reason in the pantheon of human motives. They made clear that even if we think we know what is in our own best interest, we frequently make decisions based on misinformation, myopia, and plain quirkiness. The picture of human nature that they developed was--in contrast to the world of homo economicus-- ironic, skeptical, almost wickedly complex.

...

See "A Short Course In Thinking About Thinking: A 'Master Class' By Danny Kahneman" [9.25.07]


article


THE WALL STREET JOURNAL
June 26, 2008


HOW THE RICH SPEND THEIR TIME: STRESSED
By Robert Frank

Leisure class gives way to workaholic elite scrambling to maintain their place in life

...The leisure class has given way to what I call the workaholic wealthy -- an elite of BlackBerry-crazed, network-obsessed, peripatetic travelers who have to keep scrambling to maintain their place in life.

According to research by Daniel Kahneman, the Nobel Prize-winning behavioral economist, quoted in an article in the Washington Post, "being wealthy is often a powerful predictor that people spend less time doing pleasurable things and more time doing compulsory things and feeling stressed."

...


article


NEW YORKER
June 30, 2008

ANNALS OF MEDICINE

THE ITCH
Its mysterious power may be a clue to a new theory about brains and bodies.
by Atul Gawande

The theory—and a theory is all it is right now—has begun to make sense of some bewildering phenomena. Among them is an experiment that Ramachandran performed with volunteers who had phantom pain in an amputated arm. They put their surviving arm through a hole in the side of a box with a mirror inside, so that, peering through the open top, they would see their arm and its mirror image, as if they had two arms. Ramachandran then asked them to move both their intact arm and, in their mind, their phantom arm—to pretend that they were conducting an orchestra, say. The patients had the sense that they had two arms again. Even though they knew it was an illusion, it provided immediate relief. ...

...


article


NEW YORKER
June 25, 2008

BOOKS

WHAT WAS I THINKING?
The latest reasoning about our irrational ways.
by Elizabeth Kolbert

As an academic discipline, Ariely’s field—behavioral economics—is roughly twenty-five years old. It emerged largely in response to work done in the nineteen-seventies by the Israeli-American psychologists Amos Tversky and Daniel Kahneman. (Ariely, too, grew up in Israel.) When they examined how people deal with uncertainty, Tversky and Kahneman found that there were consistent biases to the responses, and that these biases could be traced to mental shortcuts, or what they called “heuristics.” Some of these heuristics were pretty obvious—people tend to make inferences from their own experiences, so if they’ve recently seen a traffic accident they will overestimate the danger of dying in a car crash—but others were more surprising, even downright wacky. For instance, Tversky and Kahneman asked subjects to estimate what proportion of African nations were members of the United Nations. They discovered that they could influence the subjects’ responses by spinning a wheel of fortune in front of them to generate a random number: when a big number turned up, the estimates suddenly swelled.

...


article


THE WALL STREET JOURNAL
June 24, 2008


FREE TO CHOOSE, BUT OFTEN WRONG
By David A. Shaywitz

...When psychologists Daniel Kahneman and the late Amos Tversky conducted an experimental survey in the early 1980s asking people to answer this simple question, they discovered, to their surprise, that most respondents picked "b," even though this was the narrower choice and hence the less likely one. It seems that saliency – in this case, Linda's passionate political profile – trumps logic.

Over the past quarter-century, Mr. Kahneman and his colleagues have gone on to identify a range of flaws in our critical faculties, reshaping the study of economics by challenging the assumption that a person, when faced with a choice, can be counted on to make a rational decision.

...


article


WASHINGTON POST
June 23, 2008

HOW RICH PEOPLE SPEND THEIR TIME

People invariably believe that money can make them happy -- and rich people usually do report being happier than poor people do. But if this is the case, shouldn't wealthy people spend a lot more time doing enjoyable things than poor people?

Nobel Prize-winning behavioral economist Daniel Kahneman has found, however, that being wealthy is often a powerful predictor that people spend less time doing pleasurable things, and more time doing compulsory things and feeling stressed. ...

...



THE END OF THEORY
Will the Data Deluge Make the Scientific Method Obsolete?

By Chris Anderson

Sixty years ago, digital computers made information readable. Twenty years ago, the Internet made it reachable. Ten years ago, the first search engine crawlers made it a single database. Now Google and like-minded companies are sifting through the most measured age in history, treating this massive corpus as a laboratory of the human condition. They are the children of the Petabyte Age.

The Petabyte Age is different because more is different. Kilobytes were stored on floppy disks. Megabytes were stored on hard disks. Terabytes were stored in disk arrays. Petabytes are stored in the cloud. As we moved along that progression, we went from the folder analogy to the file cabinet analogy to the library analogy to — well, at petabytes we ran out of organizational analogies.

Introduction

According to Chris Anderson, we are at the end of science, that is, science as we know it: "The quest for knowledge used to begin with grand theories. Now it begins with massive amounts of data. Welcome to the Petabyte Age."

"At the petabyte scale, information is not a matter of simple three- and four-dimensional taxonomy and order but of dimensionally agnostic statistics. It calls for an entirely different approach, one that requires us to lose the tether of data as something that can be visualized in its totality. It forces us to view data mathematically first and establish a context for it later."

In response to Anderson's essay, Stewart Brand notes that:

Digital humanity apparently crossed from one watershed to another over the last few years. Now we are noticing. Noticing usually helps. We'll converge on one or two names for the new watershed and watch what induction tells us about how it works and what it's good for.

The "crossing" that Anderson has named in his essay has been developing in science for several years, and in the Edge community in particular.

For example, during the TED Conference in 2005, before, and during, the annual Edge Dinner, there were illuminating informal conversations involving Craig Venter (who pioneered the use of high volume genome sequencing using vast amounts of computational power), Danny Hillis (designer of the "Connection Machine", the massively parallel supercomputer), and Sergey Brin and Larry Page of Google: new and radical DNA sequencing techniques meet computational robots meet server farms in search of a synthetic source of energy.

And in August, 2007, at the Edge event "Life: What A Concept", Venter made the following point:

I have come to think of life in much more a gene-centric view than even a genome-centric view, although it kind of oscillates. And when we talk about the transplant work, genome-centric becomes more important than gene-centric. From the first third of the Sorcerer II expedition we discovered roughly 6 million new genes, which doubled the number in the public databases when we put them in a few months ago, and in 2008 we are likely to double that entire number again. We're just at the tip of the iceberg of what the divergence is on this planet. We are in a linear phase of gene discovery, and maybe in a linear phase of discovery of unique biological entities, if you call those species, and I think eventually we can have databases that represent the gene repertoire of our planet.

One question is, can we extrapolate back from this data set to describe the most recent common ancestor? I don't necessarily buy that there is a single ancestor. It's counterintuitive to me. I think we may have thousands of recent common ancestors and they are not necessarily so common.

Andrian Kreye, editor of the Feuilleton of Sueddeutsche Zeitung wrote on his paper's editorial pages that the event was "a crucial moment in history. After all, it's where the dawning of the age of biology was officially announced".

In the July/August Seed Salon with novelist Tom Wolfe, neuroscientist Michael Gazzaniga explains how the conversation began to change when neuroscience took off in the '80s and '90s:

There was a hunger for the big picture: What does it mean? How do we put it together into a story? Ultimately, everything's got to have a narrative in science, as in life.

Now, thanks to Anderson, the new narrative has a name: "Welcome to the Petabyte Age."

JB


CHRIS ANDERSON is Editor-in-Chief of Wired magazine. A physicist by training, he previously served in editorial positions at Nature and Science, the two premier science journals. He is the author of the bestselling The Long Tail: Why the Future of Business Is Selling Less of More.

Chris Anderson's Edge Bio page

[permalink]

THE REALITY CLUB: George Dyson, Kevin Kelly, Stewart Brand

GEORGE DYSON: Just as we may eventually take the brain apart, neuron by neuron, and never find the model, we may discover that true AI came into existence without anyone ever developing a coherent model of reality or an unambiguous theory of intelligence. Reality, with all its ambiguities, does the job just fine. It may be that our true destiny as a species is to build an intelligence that proves highly successful, whether we understand how it works or not.

KEVIN KELLY: My guess is that this emerging method will be one additional tool in the evolution of the scientific method. It will not replace any current methods (sorry, no end of science!) but will complement established theory-driven science. Let's call this data-intensive approach to problem solving Correlative Analytics. I think Chris squanders a unique opportunity by titling his thesis "The End of Theory" because this is a negation, the absence of something. Rather, it is the beginning of something, and this is when you have a chance to accelerate that birth by giving it a positive name.

STEWART BRAND: Digital humanity apparently crossed from one watershed to another over the last few years. Now we are noticing. Noticing usually helps. We'll converge on one or two names for the new watershed and watch what induction tells us about how it works and what it's good for.

[...continue below]

[Originally published as the cover story, "The End of Science," in Wired magazine, Issue 16.07]


THE END OF THEORY
Will the Data Deluge Make the Scientific Method Obsolete?

Illustration: Marian Bantjes

"All models are wrong, but some are useful."

So proclaimed statistician George Box 30 years ago, and he was right. But what choice did we have? Only models, from cosmological equations to theories of human behavior, seemed to be able to consistently, if imperfectly, explain the world around us. Until now. Today companies like Google, which have grown up in an era of massively abundant data, don't have to settle for wrong models. Indeed, they don't have to settle for models at all.

Sixty years ago, digital computers made information readable. Twenty years ago, the Internet made it reachable. Ten years ago, the first search engine crawlers made it a single database. Now Google and like-minded companies are sifting through the most measured age in history, treating this massive corpus as a laboratory of the human condition. They are the children of the Petabyte Age.

The Petabyte Age is different because more is different. Kilobytes were stored on floppy disks. Megabytes were stored on hard disks. Terabytes were stored in disk arrays. Petabytes are stored in the cloud. As we moved along that progression, we went from the folder analogy to the file cabinet analogy to the library analogy to — well, at petabytes we ran out of organizational analogies.

At the petabyte scale, information is not a matter of simple three- and four-dimensional taxonomy and order but of dimensionally agnostic statistics. It calls for an entirely different approach, one that requires us to lose the tether of data as something that can be visualized in its totality. It forces us to view data mathematically first and establish a context for it later. For instance, Google conquered the advertising world with nothing more than applied mathematics. It didn't pretend to know anything about the culture and conventions of advertising — it just assumed that better data, with better analytical tools, would win the day. And Google was right.

Google's founding philosophy is that we don't know why this page is better than that one: If the statistics of incoming links say it is, that's good enough. No semantic or causal analysis is required. That's why Google can translate languages without actually "knowing" them (given equal corpus data, Google can translate Klingon into Farsi as easily as it can translate French into German). And why it can match ads to content without any knowledge or assumptions about the ads or the content.
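To make the flavor of that philosophy concrete, here is a deliberately tiny sketch (in Python, over an invented four-page web) that ranks pages by nothing more than the statistics of incoming links. It is not Google's actual algorithm, which weights links far more cleverly, but it captures the point: no semantic analysis anywhere.

```python
# A toy illustration of ranking by link statistics alone (not Google's actual
# algorithm): pages are ordered purely by how many other pages point to them,
# with no analysis of their content.
from collections import Counter

# Hypothetical link graph: each page lists the pages it links to.
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html", "a.html"],
}

incoming = Counter(target for targets in links.values() for target in targets)

# Rank pages by incoming-link count, highest first.
for page, count in incoming.most_common():
    print(page, count)
```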

Speaking at the O'Reilly Emerging Technology Conference this past March, Peter Norvig, Google's research director, offered an update to George Box's maxim: "All models are wrong, and increasingly you can succeed without them."

This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.

The big target here isn't advertising, though. It's science. The scientific method is built around testable hypotheses. These models, for the most part, are systems visualized in the minds of scientists. The models are then tested, and experiments confirm or falsify theoretical models of how the world works. This is the way science has worked for hundreds of years.

Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. Once you have a model, you can connect the data sets with confidence. Data without a model is just noise.

But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete. Consider physics: Newtonian models were crude approximations of the truth (wrong at the atomic level, but still useful). A hundred years ago, statistically based quantum mechanics offered a better picture — but quantum mechanics is yet another model, and as such it, too, is flawed, no doubt a caricature of a more complex underlying reality. The reason physics has drifted into theoretical speculation about n-dimensional grand unified models over the past few decades (the "beautiful story" phase of a discipline starved of data) is that we don't know how to run the experiments that would falsify the hypotheses — the energies are too high, the accelerators too expensive, and so on.

Now biology is heading in the same direction. The models we were taught in school about "dominant" and "recessive" genes steering a strictly Mendelian process have turned out to be an even greater simplification of reality than Newton's laws. The discovery of gene-protein interactions and other aspects of epigenetics has challenged the view of DNA as destiny and even introduced evidence that environment can influence inheritable traits, something once considered a genetic impossibility.

In short, the more we learn about biology, the further we find ourselves from a model that can explain it.

There is now a better way. Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
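What "letting the algorithms find the patterns" looks like in miniature: generate a pile of numbers with no hypothesis attached, compute every pairwise correlation, and report the strongest ones. The data below is synthetic, and the method is the crudest possible stand-in for what real clusters do at petabyte scale.

```python
# A minimal sketch of hypothesis-free pattern finding: build a matrix of
# observations (rows) by variables (columns), compute every pairwise
# correlation, and simply report the strongest ones. No prior model says
# which variables should be related.
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_vars = 1_000, 50
data = rng.normal(size=(n_obs, n_vars))
data[:, 7] = data[:, 3] * 0.8 + rng.normal(scale=0.5, size=n_obs)  # hidden link

corr = np.corrcoef(data, rowvar=False)          # n_vars x n_vars matrix
pairs = [(abs(corr[i, j]), i, j)
         for i in range(n_vars) for j in range(i + 1, n_vars)]

for strength, i, j in sorted(pairs, reverse=True)[:5]:
    print(f"var{i} ~ var{j}: |r| = {strength:.2f}")
```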

The best practical example of this is the shotgun gene sequencing by J. Craig Venter. Enabled by high-speed sequencers and supercomputers that statistically analyze the data they produce, Venter went from sequencing individual organisms to sequencing entire ecosystems. In 2003, he started sequencing much of the ocean, retracing the voyage of Captain Cook. And in 2005 he started sequencing the air. In the process, he discovered thousands of previously unknown species of bacteria and other life-forms.

If the words "discover a new species" call to mind Darwin and drawings of finches, you may be stuck in the old way of doing science. Venter can tell you almost nothing about the species he found. He doesn't know what they look like, how they live, or much of anything else about their morphology. He doesn't even have their entire genome. All he has is a statistical blip — a unique sequence that, being unlike any other sequence in the database, must represent a new species.

This sequence may correlate with other sequences that resemble those of species we do know more about. In that case, Venter can make some guesses about the animals — that they convert sunlight into energy in a particular way, or that they descended from a common ancestor. But besides that, he has no better model of this species than Google has of your MySpace page. It's just data. By analyzing it with Google-quality computing resources, though, Venter has advanced biology more than anyone else of his generation.
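A toy version of that purely statistical call, assuming nothing about real metagenomic pipelines: if a fragment shares too few short subsequences (k-mers) with anything in a reference set, flag it as unlike any known sequence. The sequences here are invented and the threshold arbitrary.

```python
# A toy "statistical blip" detector: a fragment that shares too few k-mers
# with anything in the reference database is flagged as unlike anything known.
# Real metagenomic pipelines are far more sophisticated; this only shows the
# shape of a purely statistical call.
def kmers(seq, k=8):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def best_similarity(query, database, k=8):
    q = kmers(query, k)
    best = 0.0
    for ref in database:
        r = kmers(ref, k)
        overlap = len(q & r) / max(1, min(len(q), len(r)))  # Jaccard-like score
        best = max(best, overlap)
    return best

known = ["ATGGCATTACGGATCCGGATTACAGGCATTAGC",
         "TTGACCGGTTAACCGGATCGATCGGATTACAAG"]
fragment = "CCCTTAGGGCCCAATTGGGCCCTTTAAAGGGCC"

if best_similarity(fragment, known) < 0.2:
    print("no close match in database: candidate new sequence")
```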

This kind of thinking is poised to go mainstream. In February, the National Science Foundation announced the Cluster Exploratory, a program that funds research designed to run on a large-scale distributed computing platform developed by Google and IBM in conjunction with six pilot universities. The cluster will consist of 1,600 processors, several terabytes of memory, and hundreds of terabytes of storage, along with the software, including IBM's Tivoli and open source versions of Google File System and MapReduce. Early CluE projects will include simulations of the brain and the nervous system and other biological research that lies somewhere between wetware and software.

Learning to use a "computer" of this scale may be challenging. But the opportunity is great: The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.

There's no reason to cling to our old ways. It's time to ask: What can science learn from Google?


PETABYTE EXAMPLES

FEEDING THE MASSES:
Data In, Crop Predictions Out
CHASING THE QUARK:
Sometimes You Need to Throw Information Away
WINNING THE LAWSUIT:
Data Miners Dig for Dirt
TRACKING THE NEWS:
A Smarter Way to Predict Riots and Wars
WATCHING THE SKIES:
Space Is Big — But Not Too Big to Map
SCANNING OUR SKELETONS:
Bone Images Show Wear and Tear
TRACKING AIR FARES:
Elaborate Algorithms Predict Ticket Prices
PREDICTING THE VOTE:
Pollsters Identify Tiny Voting Blocs
PRICING TERRORISM:
Insurers Gauge Risks, Costs
VISUALIZING BIG DATA:
Bar Charts for Words
SPOTTING THE HOT ZONES:
Now We Can Monitor Epidemics Hour by Hour
SORTING THE WORLD:
Google Invents New Way to Manage Data

Feeding the Masses: Data In, Crop Predictions Out
By Ben Paynter


The Iowa agriculture landscape: Green areas are more productive for soy, corn, and wheat; red are least.
Image: Firstborn

Farmer's Almanac is finally obsolete. Last October, agricultural consultancy Lanworth not only correctly projected that the US Department of Agriculture had overestimated the nation's corn crop, it nailed the margin: roughly 200 million bushels. That's just 1.5 percent fewer kernels but still a significant shortfall for tight markets, causing a 13 percent price hike and jitters in the emerging ethanol industry. When the USDA downgraded expectations a month after Lanworth's prediction, the little Illinois-based company was hailed as a new oracle among soft-commodity traders — who now pay the firm more than $100,000 a year for a timely heads-up on fluctuations in wheat, corn, and soybean supplies.

The USDA bases its estimates on questionnaires and surveys — the agency calls a sample of farmers and asks what's what. Lanworth uses satellite images, digital soil maps, and weather forecasts to project harvests at the scale of individual fields. It even looks at crop conditions and rotation patterns — combining all the numbers to determine future yields.
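"Combining all the numbers" can be pictured, at its simplest, as a regression over field-level features. The sketch below uses a toy linear formula with invented coefficients and data; Lanworth's actual models are proprietary and vastly richer.

```python
# A deliberately simple sketch of combining field-level features (vegetation
# index, soil quality, rainfall) into a yield estimate. The coefficients and
# data are invented for illustration only.
fields = [
    {"ndvi": 0.72, "soil": 0.9, "rain_mm": 480},
    {"ndvi": 0.55, "soil": 0.7, "rain_mm": 300},
    {"ndvi": 0.81, "soil": 0.8, "rain_mm": 520},
]

def predicted_yield(field):
    # bushels per acre, toy linear model
    return 40 * field["ndvi"] + 60 * field["soil"] + 0.05 * field["rain_mm"]

total = sum(predicted_yield(f) for f in fields)
print(f"projected average yield: {total / len(fields):.1f} bu/acre")
```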

Founded in 2000, Lanworth started by mapping forests for land managers and timber interests. Tracking trends in sleepy woodlands required just a few outer-space snapshots a year. But food crops are a fast-moving target. Now the company sorts 100 gigs of intel every day, adding to a database of 50 terabytes and counting. It's also moving into world production-prediction — wheat fields in Russia, Kazakhstan, and Ukraine are already in the data set, as are corn and soy plots in Brazil and Argentina. The firm expects to reach petabyte scale in five years. "There are questions about how big the total human food supply is and whether we as a country are exposed to risk," says Lanworth's director of information services, Nick Kouchoukos. "We're going after the global balance sheet."

Back to Contents

Chasing the Quark: Sometimes You Need to Throw Information Away
By David Harris


The Large Hadron Collider might find particles like the Higgs Boson — shown here as a simulation.
Photo: CERN

The ultimate digital camera will be demo'd at the Large Hadron Collider near Geneva later this year. While proton beams race in opposite directions around a 17-mile underground ring, crossing and recrossing the Swiss-French border on each circuit, six particle detectors will snap a billion "photos" per second of the resulting impacts. The ephemeral debris from those collisions may hold answers to some of the most exciting questions in physics.

The LHC, expected to run 24/7 for most of the year, will generate about 10 petabytes of data per second. That staggering flood of information would instantly overwhelm any conceivable storage technology, so hardware and software filters will reduce the take to roughly 100 events per second that seem most promising for analysis. Even so, the collider will record about 15 petabytes of data each year, the equivalent of 15,000 terabyte-size hard drives filled to the brim. Hidden in all those 1s and 0s might be extra dimensions of space, the mysterious missing dark matter, or a whole new world of exotic superparticles.
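Back-of-envelope arithmetic with the figures above shows just how aggressive that filtering is. (The effective beam time per year is an assumption, not a figure from the article.)

```python
# Rough arithmetic on the quoted LHC numbers: how much data is discarded
# between the detectors and the storage systems. Assumes roughly 10 million
# seconds of effective beam time a year, an invented round figure.
PB = 1e15  # bytes

raw_rate = 10 * PB                 # ~10 PB/s coming off the detectors
recorded_per_year = 15 * PB        # ~15 PB/year actually written to disk
seconds_of_running = 1e7           # assumed effective beam time per year

recorded_rate = recorded_per_year / seconds_of_running
print(f"recorded rate: ~{recorded_rate / 1e9:.1f} GB/s")
print(f"fraction kept: ~1 in {raw_rate / recorded_rate:,.0f}")
```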

Back to Contents

Winning the Lawsuit: Data Miners Dig for Dirt
By John Bringardner


Infographic: Bob Dinetz

Way back in the 20th century, when Ford Motor Company was sued over a faulty ignition switch, its lawyers would gird for the discovery process: a labor-intensive ordeal that involved disgorging thousands of pages of company records. These days, the number of pages commonly involved in commercial litigation discovery has ballooned into the billions. Attorneys on the hunt for a smoking gun now want to see not just the final engineering plans but the emails, drafts, personal data files, and everything else ever produced in the lead-up to the finished product.

Welcome to e-discovery. Firms like Fios, Attenex, and hundreds of others now specialize in the scanning, indexing, and data-mining of discovery documents. (The industry got a boost in December 2006, when new federal rules went into effect requiring parties to produce discovery documents in electronic format.) E-discovery vendors pulled in $2 billion in 2006 — and that figure is expected to double by 2009.

So how has this evidentiary deluge changed the practice of law? Consider that five years ago, newly minted corporate litigators spent much of their time digging through warehouses full of paper documents. Today they're back at their desks, sorting through PDFs, emails, and memos on their double monitors — aided by semantic search technologies that scan for keywords and phrases. In another five years, don't be surprised to find juries chuckling over a plaintiff's incriminating IMs, voice messages, video conferences, and Twitters.

Back to Contents

Tracking the News: A Smarter Way to Predict Riots and Wars
By Adam Rogers

Small outbreaks of violence, like recent food riots in Haiti, can prefigure a larger crisis.
Photo: AP

Whether news of current events is good or bad, there is always a lot of it. Worldwide, an estimated 18,000 Web sites publish breaking stories in at least 40 languages. That universe of information contains early warnings about everything from natural disasters to political unrest — if you can read the data.

When the European Commission asked its researchers to come up with a way to monitor news feeds in 2002, all it really wanted was to see what the press was saying about the EU. The commission's Joint Research Center developed software that monitors 1,540 Web sites running some 40,000 articles a day. There's no database per se, just about 10 gigabytes of information flowing past a pattern-matching algorithm every day — 3.5 terabytes a year. When the system, called Europe Media Monitor, evolves to include online video, the daily dose of information could be measured in terabytes.

So what patterns does EMM find? Besides sending SMS and email news alerts to eurocrats and regular people alike, EMM counts the number of stories on a given topic and looks for the names of people and places to create geotagged "clusters" for given events, like food riots in Haiti or political unrest in Zimbabwe. Burgeoning clusters and increasing numbers of stories indicate a topic of growing importance or severity. Right now EMM looks for plain old violence; project manager Erik van der Goot is tweaking the software to pick up natural and humanitarian disasters, too. "That has crisis-room applications, where you have a bunch of people trying to monitor a situation," Van der Goot says. "We map a cluster of news reports on a screen in the front of the room — they love that."
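The "burgeoning cluster" idea reduces to a simple comparison: count stories per topic per day and flag topics whose volume jumps well above their recent average. The sketch below uses invented counts and an arbitrary threshold; EMM's real pipeline is far more elaborate.

```python
# A minimal spike detector for story volume: flag any topic whose count today
# is well above its recent daily average. Data and threshold are invented.
from collections import defaultdict
from statistics import mean

# (day, topic) -> number of stories; a quiet week, then a surge on day 7
counts = defaultdict(int)
for day in range(7):
    counts[(day, "haiti_food")] = 3
    counts[(day, "zimbabwe_vote")] = 5
counts[(7, "haiti_food")] = 24
counts[(7, "zimbabwe_vote")] = 6

def alert(topic, today=7, window=7, factor=3.0):
    history = [counts[(d, topic)] for d in range(today - window, today)]
    return counts[(today, topic)] > factor * mean(history)

for topic in ("haiti_food", "zimbabwe_vote"):
    if alert(topic):
        print(f"ALERT: story volume surging for {topic}")
```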

EMM gives snapshots of the now. But "the big thing everyone would like to do is early warning of conflict and state failure," says Clive Best, a physicist formerly with the JRC. Other research groups, like the one run by Eric Horvitz at Microsoft Research, are working on that. "We have lots of data, and lots of things we can try to model predictively," says Horvitz. "People think in terms of trends, but I want to build a data set where I can mark something as a surprise — a surprising conflict or surprising turn in the economy."

Horvitz is developing a system that picks out the words national leaders use to describe one another, trying to predict the onset of aggression. EMM has something similar, called tonality detection. Essentially, it's understanding the verbs as well as the nouns. Because once you know how people feel about something, you're a step closer to being able to guess what they'll do next.

Back to Contents

Spotting the Hot Zones: Now We Can Monitor Epidemics Hour by Hour
By Sharon Weinberger

Illustration: Studio Tonne

If you want to stop a disease outbreak — or a bioterrorist attack — you have to act fast. But health information typically moves at the pace of the receptionist at your doctor's office. The goal of Essence, the Department of Defense's Electronic Surveillance System for the Early Notification of Community-based Epidemics, is to pick up the tempo. Begun in 1999 to collect health data in the Washington, DC, area, Essence now monitors much of the Military Health System, which includes 400 facilities around the world.

"You don't have to be accurate to detect things," says Jay Mansfield, director of strategic information systems at the Global Emerging Infections Surveillance and Response System, one of the agencies that developed Essence. "But you do need to be precise." Reports from every clinic, doctor, and pharmacy get broken into broad syndrome categories rather than specific diseases. One doctor might diagnose bronchitis and another pneumonia, but Essence doesn't care. It's just looking for similar illnesses and where and when they occur. "It's like a fire alarm," Mansfield says. "It goes off if there's smoke, so you can get in the kitchen and see what's going on."

Because 100 megabytes of data come in every day — the team stores 18 months' worth, about 2.5 terabytes — there's often more smoke than fire. A pharmacy running out of antidiarrheals could signal an outbreak of E. coli or just a two-for-one sale. Essence expanded to include new sources (like radiology and laboratory tests) this spring, which means the data issues just got even more complicated. The trick is parsing the data as it comes in so that patterns emerge in hours instead of days. "We detected a gastrointestinal outbreak in Korea," Mansfield says. "I called my boss, and he asked me, 'When did it happen?'"

Korea is 13 hours ahead of Washington. So Mansfield simply answered: "Tomorrow."
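Mansfield's fire-alarm analogy can be captured in a few lines: map specific diagnoses into broad syndrome buckets, then sound an alarm when a bucket's daily count jumps above its usual level. Everything below (the diagnoses, the baselines, the threshold) is invented for illustration.

```python
# A toy syndromic "fire alarm": bucket specific diagnoses into broad syndrome
# categories, then flag any bucket whose daily count far exceeds its baseline.
SYNDROME = {
    "bronchitis": "respiratory",
    "pneumonia": "respiratory",
    "influenza": "respiratory",
    "gastroenteritis": "gastrointestinal",
    "e. coli infection": "gastrointestinal",
}

todays_reports = ["bronchitis", "pneumonia", "pneumonia", "influenza",
                  "gastroenteritis", "bronchitis", "pneumonia"]
usual_daily_count = {"respiratory": 2, "gastrointestinal": 3}

counts = {}
for diagnosis in todays_reports:
    bucket = SYNDROME[diagnosis]
    counts[bucket] = counts.get(bucket, 0) + 1

for bucket, count in counts.items():
    if count > 2 * usual_daily_count.get(bucket, 1):
        print(f"smoke detected: {bucket} reports at {count}, "
              f"usual is {usual_daily_count[bucket]}")
```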

Back to Contents

Sorting the World: Google Invents New Way to Manage Data
By Patrick Di Justo


Used to be that if you wanted to wrest usable information from a big mess of data, you needed two things: First, a meticulously maintained database, tagged and sorted and categorized. And second, a giant computer to sift through that data using a detailed query.

But when data sets get to the petabyte scale, the old way simply isn't feasible. Maintenance — tag, sort, categorize, repeat — would gobble up all your time. And a single computer, no matter how large, can't crunch that many numbers.

Google's solution for working with colossal data sets is an elegant approach called MapReduce. It eliminates the need for a traditional database and automatically splits the work across a server farm of PCs. For those not inside the Googleplex, there's an open source version of the software library called Hadoop.

MapReduce can handle almost any type of information you throw at it, from photos to phone numbers. In the example below, we count the frequency of specific words in Google Books.

How Google Crunches the Numbers

Infographic: Office

1. Collect
MapReduce doesn't depend on a traditional structured database, where information is categorized as it's collected. We'll just gather up the full text of every book Google has scanned.

2. Map
You write a function to map the data: "Count every use of every word in Google Books." That request is then split among all the computers in your army, and each agent is assigned a hunk of data to work with. Computer A gets War and Peace, for example. That machine knows what words that book contains, but not what's inside Anna Karenina.

3. Save
Each of the hundreds of PCs doing a map writes the results to its local hard drive, cutting down on data transfer time. The computers that have been assigned "reduce" functions grab the lists from the mappers.

4. Reduce
The Reduce computers correlate the lists of words. Now you know how many times a particular word is used, and in which books.

5. Solve
The result? A data set about your data. In our example, the final list of words is stored separately so it can be quickly referenced or queried: "How often does Tolstoy mention Moscow? Paris?" You don't have to plow through unrelated data to get the answer.
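Here are the five steps above, collapsed into a single-machine word count written in the MapReduce style. On a real cluster (Hadoop, or Google's internal system) the map calls, the shuffle, and the reduce calls would each run on many computers in parallel; the books here are reduced to a few invented words.

```python
# A minimal word count in the MapReduce style, run on one machine purely for
# illustration of the map -> shuffle -> reduce flow described above.
from collections import defaultdict

books = {
    "War and Peace": "moscow paris moscow war peace",
    "Anna Karenina": "moscow train love moscow",
}

# Map: each "mapper" sees one book and emits (word, 1) pairs.
def map_phase(title, text):
    return [(word, 1) for word in text.split()]

# Shuffle: group all emitted pairs by key (the word).
grouped = defaultdict(list)
for title, text in books.items():
    for word, count in map_phase(title, text):
        grouped[word].append(count)

# Reduce: each "reducer" sums the counts for one word.
word_totals = {word: sum(counts) for word, counts in grouped.items()}

print(word_totals["moscow"])   # how often does Tolstoy mention Moscow? -> 4
```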

Back to Contents

Watching the Skies: Space Is Really Big — But Not Too Big to Map
By Michael D. Lemonick


In images from the Sloan Digital Sky Survey, asteroids (circled in green) appear to move over time. Galaxies like NGC4517A, at lower right, don't.
Photo: Sloan Digital Sky Survey

In 1930, a young astronomer named Clyde Tombaugh found Pluto. He did it with a high tech marvel called a blink comparator; he put two photographs of the same patch of sky taken on different nights into the contraption and flipped back and forth between them. Stars would stay fixed, but objects like comets, asteroids, and planets moved.

Astronomers have since traded photographic plates for massive digital images. But Tombaugh's method — take a picture of the sky, take another one, compare — is still used to detect fast-changing stellar phenomena, like supernovae or asteroids headed toward Earth.
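Reduced to arithmetic, Tombaugh's blink comparison is just image subtraction: difference two exposures of the same patch of sky and look for pixels that changed. Real survey pipelines must first align the images and model the noise; this toy version assumes two perfectly registered 5x5 "images".

```python
# A toy blink comparison: subtract two exposures of the same sky patch and
# report pixels that changed between nights. Alignment and noise handling,
# which dominate real pipelines, are ignored here.
import numpy as np

night1 = np.zeros((5, 5))
night2 = np.zeros((5, 5))
night1[1, 1] = 1.0          # a star: present both nights
night2[1, 1] = 1.0
night2[3, 4] = 0.8          # something new: an asteroid, perhaps

difference = night2 - night1
moved = np.argwhere(np.abs(difference) > 0.5)
print("changed pixels:", moved.tolist())     # -> [[3, 4]]
```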

True, imaging the entire sky, and understanding those images, won't be easy. The first telescope that will be able to collect all that data, the Large Synoptic Survey Telescope, won't be finished until 2014. Perched atop Cerro Pachón, a mountain in northern Chile, the LSST will have a 27.5-foot mirror and a field of view 50 times the size of the full moon seen from Earth. Its digital camera will suck down 3.5 gigapixels of imagery every 17 seconds. "At that rate," says Michael Strauss, a Princeton astrophysicist, "the numbers get very big very fast."

The LSST builds on the most ambitious attempt to catalog the heavens so far, the Sloan Digital Sky Survey. Operating from a New Mexico mountaintop, the SDSS has returned about 25 terabytes of data since 1998, most of that in images. It has measured the precise distance to a million galaxies and has discovered about 500,000 quasars. But the Sloan's mirror is just one-tenth the power of the mirror planned for LSST, and its usable field of view just one-seventh the size. Sloan has been a workhorse, but it simply doesn't have the oomph to image the entire night sky, over and over, to look for things that change.

The LSST will cover the sky every three days. And within the petabytes of information it collects may lurk things nobody has even imagined — assuming astronomers can figure out how to teach their computers to look for objects no one has ever seen. It's the first attempt to sort astronomical data on this scale, says Princeton astrophysicist Robert Lupton, who oversaw data processing for the SDSS and is helping design the LSST. But the new images may allow him and his colleagues to watch supernovae explode, find undiscovered comets, and maybe even spot that killer asteroid.

Back to Contents

Scanning Our Skeletons: Bone Images Show Wear and Tear
By Thomas Goetz



What can you learn from 80 million x-rays? The secrets of aging, among other things. Sharmila Majumdar, a radiologist at UC San Francisco, is using an arsenal of computed tomography scans to understand how our bones wear out from the inside.

It works like this: A CT scanner takes superhigh-resolution x-rays of a bone, then combines those individual images into a three-dimensional structure. The results are incredibly detailed; a scan of a single segment of bone can run 30 gigs.

Majumdar's method is to churn through the data to identify patterns in how the trabeculae — the material inside bone — change in people who have diseases like osteoporosis and arthritis. In one day of imaging, it's not uncommon for the lab to generate nearly a terabyte of data. Researchers also aggregate the data from many subjects, putting hundreds of terabytes to work. Majumdar hopes to learn why some patients suffer severe bone loss but others don't. "We don't know the mechanism of bone loss," she notes. "Once we learn that, we can create therapies to address it."

How to Look Inside Our Bones
This slice of a human hip joint is 82 microns thick — about half the width of a human hair. Other machines used by the lab can go as fine as 6 microns — the size of a human red blood cell. Each bone is scanned about 1,000 times, creating, in this case, a clear look at osteoporosis in action.
Using image processing, the slices are combined into a 3-D model, creating a picture of what the bone looks like from the outside ...
... and from the inside. This image of a human vertebra shows the internal microstructure of bone, called the trabeculae.
The lab then analyzes the model for weaknesses in density and strength. In this image, the thicker structures are color-coded green, while thinner material is colored red. Majumdar's lab combines hundreds of models to detect bone-loss patterns that help us understand how humans age.
Back to Contents

Tracking Air Fares: Elaborate Algorithms Predict Ticket Prices
By Cliff Kuang

"Flight Patterns" shows 141,000 aircraft paths over a 24-hour period.
Image: Aaron Koblin

In 2001, Oren Etzioni was on a plane chatting up his seat mates when he realized they had all paid less for their tickets than he did. "I thought, 'Don't get mad, get even,'" he says. So he came home to his computer lab at the University of Washington, got his hands on some fare data, and plugged it into a few basic prediction algorithms. He wanted to see if they could reliably foresee changes in ticket prices. It worked: Not only did the algorithms accurately anticipate when fares would go up or down, they gave reasonable estimates of what the new prices would be.
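A deliberately basic fare-trend predictor in the spirit of that first experiment (not Farecast's actual model): compare the recent average price with the slightly older average and guess whether the fare is heading up or down. The fares below are invented.

```python
# A toy fare-trend predictor: if the recent average price is below the
# slightly older average, guess that the fare is falling. Not Farecast's
# model; the price history is invented.
from statistics import mean

# Hypothetical daily fares for one route, most recent last.
fares = [312, 305, 298, 301, 295, 288, 284, 279]

def predict_direction(history, window=3):
    recent = mean(history[-window:])
    earlier = mean(history[-2 * window:-window])
    return "falling" if recent < earlier else "rising"

print("fare looks to be", predict_direction(fares))   # -> falling
```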

Etzioni's prediction model has grown far more complex since then, and the company he founded in 2003, Farecast, now tracks information on 1100 billion fares originating at 79 US airports. The database knows when airline prices are going to change and has uncovered a host of other secrets about air travel. Here's a dose of expert advice from the Farecast data vault:

1. Common wisdom is wrong ...
The lowest price tends to hit between eight and two weeks before departure. Buying tickets farther in advance usually doesn't save money.

2. ... except when it's right
The rule fails during peak demand: Friday departures for spring break, and Sunday returns during the summer, Thanksgiving, and Christmas. For these, now is never too early.

3. When the price drops, jump

Fifty percent of reductions are gone in two days. If you see a tasty fare, snatch it up.

4. If prices seem high, hold off
Behavioral economists call it framing: If last year's $200 flight is now $250, you'll probably find that too dear and won't buy. Everyone else is thinking the same thing. So when airlines hike the price of a route, they often have to cut rates later to boost sales.

5. The day you fly matters
Used to be, you could count on a cheaper fare if you stayed over a Saturday night. But during spring break and summer, weekend trips are in high demand, so flights on Friday, Saturday, and Sunday can easily cost $50 more than those midweek.

6. So does the day you buy
Price drops usually come early in the week. So a ticket bought on Saturday might be cheaper the next Tuesday. That's particularly true outside the summer rush, making fall the best time for a last-minute getaway.

7. Markups vary by destination
Flights to Europe in July can be $350 higher than in May or September. If you want a summer vacation, domestic and Caribbean travel is cheaper to begin with and doesn't rise as high.

8. Stay an extra day
At the end of holidays, there's usually a stampede to the airport. One more day with the in-laws can save you upwards of $100 — if you can stand it.

Back to Contents

Predicting the Vote: Pollsters Identify Tiny Voting Blocs
By Garrett M. Graff


Infographic: Build

Want to know exactly how many Democratic-leaning Asian Americans making more than $30,000 live in the Austin, Texas, television market? Catalist, the Washington, DC, political data-mining shop, knows the answer. CTO Vijay Ravindran says his company has compiled nearly 15 terabytes of data for this election year — orders of magnitude larger than the databases available just four years ago. (In 2004, Howard Dean's formidable campaign database clocked in at less than 100 GB, meaning that in one election cycle the average data set has grown 150-fold.) In the next election cycle, we should be measuring voter data in petabytes.

Large-scale data-mining and micro-targeting was pioneered by the 2004 Bush-Cheney campaign, but Democrats, aided by privately financed Catalist, are catching up. They're documenting the political activity of every American 18 and older: where they registered to vote, how strongly they identify with a given party, what issues cause them to sign petitions or make donations. (Catalist is matched by the Republican National Committee's Voter Vault and Aristotle Inc.'s immense private bipartisan trove of voter information.)

As databases grow, fed by more than 450 commercially and privately available data layers as well as firsthand info collected by the campaigns, candidates are able to target voters from ever-smaller niches. Not just blue-collar white males, but married, home-owning white males with a high school diploma and a gun in the household. Not just Indian Americans, but Indian Americans earning more than $80,000 who recently registered to vote.
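The slicing itself is unglamorous: filter a voter file by several attributes at once and count what's left. The sketch below runs on a three-record invented file; the real versions run the same kind of query against hundreds of data layers.

```python
# A sketch of micro-targeting as a multi-attribute filter over a voter file.
# The records and thresholds are invented for illustration.
voters = [
    {"ethnicity": "Indian American", "income": 95_000, "newly_registered": True},
    {"ethnicity": "Indian American", "income": 62_000, "newly_registered": True},
    {"ethnicity": "Asian American",  "income": 41_000, "newly_registered": False},
]

niche = [v for v in voters
         if v["ethnicity"] == "Indian American"
         and v["income"] > 80_000
         and v["newly_registered"]]

print(f"voters in this niche: {len(niche)}")   # -> 1
```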

Bill and Hillary's pollster, Mark Penn, has been promoting the dream of narrowcasting and microtrends for years (he invented "tech fatales," US women who drive decisions about electronics purchases). Penn was just a cycle or two early. The technology is finally catching up to his theories.

Back to Contents

Pricing Terrorism: Insurers Gauge Risks, Costs
By Vince Beiser

In the aftermath of the September 11, 2001, attacks, Congress passed a law requiring commercial and casualty insurance companies to offer terrorism coverage. That was reassuring to jittery business owners but a major hassle for insurers, who, after all, are in the business of predicting risk. They might not know when something like a hurricane or earthquake will hit, but decades of data tell them where and how hard such an event is likely to be. But terrorists try to do the unexpected, and the range of what they might attempt is vast. A recent study published by the American Academy of Actuaries estimated that a truck bomb going off in Des Moines, Iowa, could cost insurers $3 billion; a major anthrax attack on New York City could cost $778 billion.

How do you predict a threat that's unpredictable by design? By marshaling trainloads of data on every part of the equation that is knowable. Then you make highly educated guesses about the rest.

The Target
A random office building isn't likely to be in terrorists' crosshairs, but it could become collateral damage in a strike on, say, a nearby courthouse. To get help quantifying these risks, insurance companies turn to specialized catastrophe-modeling firms like AIR Worldwide. In 2002, AIR enlisted a group of experts formerly with the CIA, FBI, and Energy and Defense departments to brainstorm 36 categories of targets: corporate headquarters, airports, bridges, and so forth. AIR researchers then assembled a database of more than 300,000 actual locations around the US.
The Risk
AIR's experts estimated the odds of an attack on each target type, taking into account various terror groups and the range of weapons they might use. They figured, for instance, that an animal research lab might have a higher risk of being hit by animal rights extremists than a post office.
The Damage
A decade ago, insurance companies — and the reinsurance companies that indemnify them — had only a rough idea of what they were covering. Today they have fine-grained details about nearly every property, down to the type of roofing and window glass. Terabytes of this data are run through models that factor in the area near the target, records of industrial accidents, and results of bomb tests. Those calculations yield estimates of casualties and damage, depending on whether the building was the target or a collateral hit.

The Bottom Line
Actuaries then convert all that mayhem into dollars, figuring out what the insurer will have to pay to repair buildings, replace equipment, and cover loss of life and medical care. What does that mean in terms of premiums? Typical coverage against terrorist attack for a five-story office building in Topeka, Kansas: $5,000 a year. That same building in lower Manhattan? $100,000. Even for mad bombers, it's all about location, location, location.
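Underneath all the modeling is a simple piece of arithmetic: probability times loss, summed over scenarios, plus a loading. The numbers below are invented purely to show that arithmetic, not drawn from AIR's models.

```python
# A toy expected-loss calculation: sum probability-weighted losses over
# invented scenarios, then apply a loading to get an indicated premium.
scenarios = [
    # (annual probability, insured loss in dollars)
    (1e-4, 50_000_000),    # major attack nearby, building is collateral damage
    (1e-3, 2_000_000),     # small incident, limited damage
]

expected_loss = sum(p * loss for p, loss in scenarios)
premium = expected_loss * 1.5          # load for expenses, uncertainty, profit
print(f"expected annual loss: ${expected_loss:,.0f}")
print(f"indicated premium:    ${premium:,.0f}")
```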

Infographics: Bryan Christie

Back to Contents

Visualizing Big Data: Bar Charts for Words
By Mark Horowitz


A visualization of thousands of Wikipedia edits that were made by a single software bot. Each color corresponds to a different page.
Image: Fernanda B. Viégas, Martin Wattenberg, and Kate Hollenbach

The biggest challenge of the Petabyte Age won't be storing all that data, it'll be figuring out how to make sense of it. Martin Wattenberg, a mathematician and computer scientist at IBM's Watson Research Center in Cambridge, Massachusetts, is a pioneer in the art of visually representing and analyzing complex data sets. He and his partner at IBM, Fernanda Viégas, created Many Eyes, a collaborative site where users can share their own dynamic, interactive representations of big data. He spoke with Wired's Mark Horowitz:

Wired: How do you define "big" data?

Wattenberg: You can talk about terabytes and exabytes and zettabytes, and at a certain point it becomes dizzying. The real yardstick to me is how it compares with a natural human limit, like the sum total of all the words you'll hear in your lifetime. That's surely less than a terabyte of text. Any more than that and it becomes incomprehensible by a single person, so we have to turn to other means of analysis: people working together, or computers, or both.
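A quick back-of-envelope check of that claim, with generous and entirely invented assumptions:

```python
# Rough estimate of the text of everything one person hears in a lifetime.
# All assumptions are invented round numbers chosen to be generous.
words_per_minute = 150        # fast conversational speech
waking_minutes_per_day = 16 * 60
days = 100 * 365              # a 100-year life, every waking hour full of talk
bytes_per_word = 6            # average English word plus a space

total_bytes = words_per_minute * waking_minutes_per_day * days * bytes_per_word
print(f"{total_bytes / 1e12:.2f} TB")   # roughly 0.03 TB, far below a terabyte
```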

Wired: Why is a numbers guy like you so interested in large textual data sets?

Wattenberg: Language is one of the best data-compression mechanisms we have. The information contained in literature, or even email, encodes our identity as human beings. The entire literary canon may be smaller than what comes out of particle accelerators or models of the human brain, but the meaning coded into words can't be measured in bytes. It's deeply compressed. Twelve words from Voltaire can hold a lifetime of experience.

Wired: What will happen when we have digital access to everything, like all of English literature or all the source code ever written?

Wattenberg: There's something about completeness that's magical. The idea that you can have everything at your fingertips and process it in ways that were impossible before is incredibly exciting. Even simple algorithms become more effective when trained on big sets. Perhaps we'll find out more about plagiarism and literary borrowing when we have the spread of literature before us. We think of our current age as one of intellectual remixing and mashups, but maybe it's always been that way. You can only do that kind of analysis when you have the full spectrum of data.

Wired: Is that why, on Many Eyes, you have visualizations of Wikipedia using simple word trees and tag clouds?

Wattenberg: Wikipedia also has this idea of completeness. The information there again probably totals less than a terabyte, but it's huge in terms of encompassing human knowledge. Today, if you're analyzing numbers, there are a million ways to make a bar chart. If you're analyzing text, it's hard. I think the only way to understand a lot of this data is through visualization.

Back to Contents


On THE END OF THEORY By Chris Anderson

Responses by George Dyson, Kevin Kelly, Stewart Brand, W. Daniel Hillis

GEORGE DYSON: Just as we may eventually take the brain apart, neuron by neuron, and never find the model, we may discover that true AI came into existence without anyone ever developing a coherent model of reality or an unambiguous theory of intelligence. Reality, with all its ambiguities, does the job just fine. It may be that our true destiny as a species is to build an intelligence that proves highly successful, whether we understand how it works or not.

KEVIN KELLY: My guess is that this emerging method will be one additional tool in the evolution of the scientific method. It will not replace any current methods (sorry, no end of science!) but will complement established theory-driven science. Let's call this data-intensive approach to problem solving Correlative Analytics. I think Chris squanders a unique opportunity by titling his thesis "The End of Theory" because this is a negation, the absence of something. Rather, it is the beginning of something, and this is when you have a chance to accelerate that birth by giving it a positive name.

STEWART BRAND: Digital humanity apparently crossed from one watershed to another over the last few years. Now we are noticing. Noticing usually helps. We'll converge on one or two names for the new watershed and watch what induction tells us about how it works and what it's good for.

W. DANIEL HILLIS: Chris Anderson says that "this approach to science — hypothesize, model, test — is becoming obsolete". No doubt the statement is intended to be provocative, but I do not see even a little bit of truth in it. I share his enthusiasm for the possibilities created by petabyte datasets and parallel computing, but I do not see why large amounts of data will undermine the scientific method. We will begin, as always, by looking for simple patterns in what we have observed and use that to hypothesize what is true elsewhere. Where our extrapolations work, we will believe in them, and when they do not, we will make new models and test their consequences. We will extrapolate from the data first and then establish a context later. This is the way science has worked for hundreds of years.

[permalink]


GEORGE DYSON [6.29.08]

For a long time we have been stuck on the idea that the brain somehow contains a "model" of reality, and that Artificial Intelligence will be achieved when we figure out how to model that model within a machine. What's a model? We presume two requirements: a) Something that works; and b) Something we understand. You can have (a) without (b). Our large, distributed, petabyte-scale creations are starting to grasp reality in ways that work just fine but that we don't necessarily understand.

Just as we may eventually take the brain apart, neuron by neuron, and never find the model, we may discover that true AI came into existence without anyone ever developing a coherent model of reality or an unambiguous theory of intelligence. Reality, with all its ambiguities, does the job just fine. It may be that our true destiny as a species is to build an intelligence that proves highly successful, whether we understand how it works or not.

The massively distributed collective associative memory that constitutes the "Overmind" (or Kevin's OneComputer) is already forming associations, recognizing patterns, and making predictions—though this does not mean thinking the way we do, or on any scale that we can comprehend.

The sudden flood of large data sets and the opening of entirely new scientific territory promise a return to the excitement at the birth of (modern) Science in the 17th century, when, as Newton, Boyle, Hooke, Petty, and the rest of them saw it, it was "the Business of Natural Philosophy" to find things out. What Chris Anderson is hinting at is that Science will increasingly belong to a new generation of Natural Philosophers who are not only reading Nature directly, but are beginning to read the Overmind.


KEVIN KELLY [6.29.08]

There's a dawning sense that extremely large databases of information, starting at the petabyte level, could change how we learn things. The traditional way of doing science entails constructing a hypothesis to match observed data or to solicit new data. Here's a bunch of observations; what theory explains the data sufficiently well that we can predict the next observation?

It may turn out that tremendously large volumes of data are sufficient to skip the theory part and still predict the next observation. Google was one of the first to notice this. For instance, take Google's spell checker. When you misspell a word while googling, Google suggests the proper spelling. How does it know this? How does it predict the correctly spelled word? It is not because it has a theory of good spelling, or has mastered spelling rules. In fact, Google knows nothing about spelling rules at all.

Instead, Google operates on a very large dataset of observations which show that for any given spelling of a word, x number of people say "yes" when asked if they meant to spell word "y." Google's spelling engine consists entirely of these datapoints, rather than any notion of what correct English spelling is. That is why the same system can correct spelling in any language.
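
A minimal sketch of that idea, with a small hypothetical query log standing in for Google's data (this illustrates the approach Kelly describes, not Google's actual system):

    # Data-only spelling suggestion: no rules, just logged (typed, accepted) pairs.
    from collections import Counter, defaultdict

    # Hypothetical log of what users typed and which correction they accepted.
    query_log = [
        ("recieve", "receive"),
        ("recieve", "receive"),
        ("recieve", "recipe"),
        ("teh", "the"),
        ("teh", "the"),
    ]

    # Count, for each observed spelling, which corrections users accepted.
    accepted = defaultdict(Counter)
    for typed, correction in query_log:
        accepted[typed][correction] += 1

    def suggest(word):
        # Return the correction users most often accepted for this spelling;
        # with no data, leave the word alone.
        if word in accepted:
            return accepted[word].most_common(1)[0][0]
        return word

    print(suggest("recieve"))  # -> receive
    print(suggest("teh"))      # -> the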

In fact, Google uses the same philosophy of learning via massive data for their translation programs. They can translate from English to French, or from German to Chinese, by matching up huge datasets of human-translated material. For instance, Google trained their French/English translation engine by feeding it Canadian documents, which are often released in both English and French versions. The Googlers have no theory of language, especially of French, and no AI translator. Instead they have zillions of datapoints which in aggregate link "this to that" from one language to another.

Once you have such a translation system tweaked, it can translate from any language to another. And the translation is pretty good. Not expert level, but enough to give you the gist. You can take a Chinese web page and at least get a sense of what it means in English. Yet, as Peter Norvig, head of research at Google, once boasted to me, "Not one person who worked on the Chinese translator spoke Chinese." There was no theory of Chinese, no understanding. Just data. (If anyone ever wanted a disproof of Searle's riddle of the Chinese Room, here it is.)
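
In the same spirit, here is a toy sketch of statistics-only translation over a tiny, made-up parallel corpus. Real systems use vastly more data and proper alignment models, but the principle is the one Kelly is describing: counts in, translations out, no grammar anywhere.

    # Toy "translation" from co-occurrence counts alone (illustrative, not Google's system).
    from collections import Counter, defaultdict

    parallel_corpus = [
        ("the house", "la maison"),
        ("the blue house", "la maison bleue"),
        ("the car", "la voiture"),
    ]

    cooccur = defaultdict(Counter)   # cooccur[english_word][french_word] = count
    french_totals = Counter()        # overall frequency of each French word

    for english, french in parallel_corpus:
        french_words = french.split()
        french_totals.update(french_words)
        for e in english.split():
            for f in french_words:
                cooccur[e][f] += 1

    def translate_word(e_word):
        # Pick the French word that co-occurs with e_word most distinctively.
        candidates = cooccur.get(e_word)
        if not candidates:
            return e_word
        # Normalize by overall frequency so common words like "la" don't always win.
        return max(candidates, key=lambda f: candidates[f] / french_totals[f])

    # Word-by-word gloss, no reordering: prints ['la', 'bleue', 'maison']
    print([translate_word(w) for w in "the blue house".split()])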

If you can learn how to spell without knowing anything about the rules or grammar of spelling, and if you can learn how to translate languages without having any theory or concepts about grammar of the languages you are translating, then what else can you learn without having a theory?

Chris Anderson is exploring the idea that perhaps you could do science without having theories.

This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.

Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.

There may be something to this observation. Many sciences such as astronomy, physics, genomics, linguistics, and geology are generating extremely large datasets and constant streams of data at the petabyte level today. They'll be at the exabyte level in a decade. Using old-fashioned "machine learning," computers can extract patterns in this ocean of data that no human could ever possibly detect. These patterns are correlations. They may or may not be causative, but we can learn new things. Therefore they accomplish what science does, although not in the traditional manner.
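
A small sketch of what such hypothesis-free pattern hunting looks like in practice, on made-up data: scan every pair of variables for strong correlations and report whatever turns up, causal or not.

    # Correlative analytics in miniature: no hypotheses, just a scan for strong pairs.
    import numpy as np

    rng = np.random.default_rng(0)
    n_samples, n_vars = 1000, 50
    data = rng.normal(size=(n_samples, n_vars))
    # Plant one real relationship so there is something to find.
    data[:, 7] = 0.8 * data[:, 3] + 0.2 * rng.normal(size=n_samples)

    corr = np.corrcoef(data, rowvar=False)   # 50 x 50 correlation matrix
    strong = [
        (i, j, round(corr[i, j], 2))
        for i in range(n_vars) for j in range(i + 1, n_vars)
        if abs(corr[i, j]) > 0.5
    ]
    # Typically only the planted pair (3, 7) survives the threshold.
    print(strong)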

What Anderson is suggesting is that sometimes enough correlations are sufficient. There is a good parallel in health. A lot of doctoring works on the correlative approach. The doctor may not ever find the actual cause of an ailment, or understand it if he/she did, but he/she can correctly predict the course and treat the symptom. But is this really science? You can get things done, but if you don't have a model, is it something others can build on?

We don't know yet. The technical term for this approach in science is Data Intensive Scalable Computation (DISC). Other terms are "Grid Datafarm Architecture" or "Petascale Data Intensive Computing." The emphasis in these techniques is on the data-intensive nature of the computation, rather than on the computing cluster itself. The online industry calls this style of investigation a type of "analytics." Cloud computing companies like Google, IBM, and Yahoo, and some universities have been holding workshops on the topic. In essence these pioneers are trying to exploit cloud computing, or the OneMachine, for large-scale science. The current tools include massively parallel software platforms like MapReduce and Hadoop (See: "A Cloudbook For The Cloud"), cheap storage, and gigantic clusters of data centers. So far, very few scientists outside of genomics are employing these new tools. The intent of the NSF's Cluster Exploratory program is to match scientists with large database-driven observations with computer scientists who have access to, and expertise with, cluster/cloud computing.
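
For readers who have not met MapReduce, here is a single-machine sketch of the pattern (the canonical word-count example); platforms like Hadoop run the same map, shuffle, and reduce phases across thousands of machines.

    # Word count in the MapReduce style, run locally for illustration only.
    from collections import defaultdict
    from itertools import chain

    documents = [
        "correlation is enough",
        "correlation is not causation",
    ]

    def map_phase(doc):
        # Emit a (word, 1) pair for every word in one document.
        return [(word, 1) for word in doc.split()]

    def shuffle(pairs):
        # Group all emitted values by key, as the framework does between phases.
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        return grouped

    def reduce_phase(key, values):
        # Combine the values for one key into a final count.
        return key, sum(values)

    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
    print(counts)  # {'correlation': 2, 'is': 2, 'enough': 1, 'not': 1, 'causation': 1}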

My guess is that this emerging method will be one additional tool in the evolution of the scientific method. It will not replace any current methods (sorry, no end of science!) but will complement established theory-driven science. Let's call this data-intensive approach to problem solving Correlative Analytics. I think Chris squanders a unique opportunity by titling his thesis "The End of Theory" because this is a negation, the absence of something. Rather it is the beginning of something, and this is when you have a chance to accelerate that birth by giving it a positive name. A non-negative name will also help clarify the thesis. I am suggesting Correlative Analytics rather than No Theory because I am not entirely sure that these correlative systems are model-free. I think there is an emergent, unconscious, implicit model embedded in the system that generates answers. Even if none of the English speakers working on Google's Chinese Room have a theory of Chinese, we can still think of the Room as having one. The model may be beyond the perception and understanding of the creators of the system, and since it works it is not worth trying to uncover it. But it may still be there. It just operates at a level we don't have access to.

But the models' invisibility doesn't matter because they work. It is not the end of theories, but the end of theories we understand. George Dyson says this much better in his response to Chris Anderson's article (see above).

What George Dyson is suggesting is that this new method of doing science — gathering a zillion data points and then having the OneMachine calculate a correlative answer — can also be thought of as a method of communicating with a new kind of scientist, one who can create models at levels of abstraction (in the zillionics realm) beyond our own powers.

So far Correlative Analytics, or the Google Way of Science, has primarily been deployed in sociological realms, like language translation, or marketing. That's where the zillionic data has been: all those zillions of data points generated by our collective life online. But as more of our observations and measurements of nature are captured 24/7, in real time, by an increasing variety of sensors and probes, science too will enter the realm of zillionics, and its data will be readily processed by the new tools of Correlative Analytics. In this part of science, we may get answers that work, but which we don't understand. Is this partial understanding? Or a different kind of understanding?

Perhaps understanding and answers are overrated. "The problem with computers," Pablo Picasso is rumored to have said, "is that they only give you answers." These huge data-driven correlative systems will give us lots of answers — good answers — but that is all they will give us. That's what the OneComputer does — gives us good answers. In the coming world of cloud computing perfectly good answers will become a commodity. The real value of the rest of science then becomes asking good questions.

[See "The Google Way of Science" on Kevin Kelly's Blog — The Technium]


STEWART BRAND [6.29.08]

Digital humanity apparently crossed from one watershed to another over the last few years. Now we are noticing. Noticing usually helps. We'll converge on one or two names for the new watershed and watch what induction tells us about how it works and what it's good for.


W. DANIEL HILLIS [6.30.08]

I am a big fan of Google, and I love looking for mathematical patterns in data, but Chris Anderson’s essay "The End of Theory: Will the Data Deluge Makes the Scientific Method Obsolete?" sets up a false distinction. He claims that using a large collection of data to "view data mathematically first and establish a context for it later" is somehow different from "the way science has worked for hundreds of years." I disagree.

Science always begins by looking for patterns in the data, and the first simple models are always just extrapolations of what we have seen before. Astronomers were able to accurately predict the motions of planets long before Newton’s theories. They did this by gathering lots of data and looking for mathematical patterns.

The "new" method that Chris Anderson describes has always been the starting point: gather lots of data, and assume it is representative of other situations. This works well as long as we do not try to extrapolate too far from what has been observed. It is a very simple kind of model, a model that says, "what we will see next will be very much what we have seen so far". This is usually a good guess.

Existing data always gives us our first hypothesis. Humans and other animals are probably hard-wired for that kind of extrapolation. Mathematical tools like differential equations and statistics were developed to help us do a better job of it. These tools of science have been used for centuries, and computers have let us apply them to larger data sets. They have also allowed us to collect more data to extrapolate from. The data-based methods that we apply to petabytes are the methods that we have always tried first.
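
A small illustration of the kind of extrapolation Hillis has in mind, using made-up observations: fit a least-squares line to what has been seen so far and predict the next point, a guess that holds only as long as the trend does.

    # Extrapolate from data: fit a straight line to past points, predict the next one.
    import numpy as np

    t = np.arange(10)                                  # observation times 0..9
    observed = 2.0 * t + 1.0 + np.random.default_rng(1).normal(0, 0.5, size=10)

    slope, intercept = np.polyfit(t, observed, deg=1)  # the "model": a least-squares line
    prediction = slope * 10 + intercept                # extrapolate to t = 10

    # Should come out near 21, as long as the trend keeps holding.
    print(f"predicted next value: {prediction:.2f}")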

The experimental method (hypothesize, model, test) is what allows science to get beyond what can be extrapolated from existing data. Hypotheses are most interesting when they predict something that is different from what we have seen so far. For instance, Newton’s model could predict the paths of undiscovered planets, whereas the old-fashioned data-based models could not. Einstein’s model, in turn, predicted measurements that would have surprised Newton. Models are interesting precisely because they can take us beyond the data.

Chris Anderson says that "this approach to science — hypothesize, model, test — is becoming obsolete". No doubt the statement is intended to be provocative, but I do not see even a little bit of truth in it. I share his enthusiasm for the possibilities created by petabyte datasets and parallel computing, but I do not see why large amounts of data will undermine the scientific method. We will begin, as always, by looking for simple patterns in what we have observed and use that to hypothesize what is true elsewhere. Where our extrapolations work, we will believe in them, and when they do not, we will make new models and test their consequences. We will extrapolate from the data first and then establish a context later. This is the way science has worked for hundreds of years.

Chris Anderson is correct in his intuition that something is different about these new large databases, but he has misidentified what it is. What is interesting is that for the first time we have significant quantitative data about the variation of individuals: their behavior, their interaction, even their genes. These huge new databases give us a measure of the richness of the human condition. We can now look at ourselves with the tools we developed to study the stars.





