Edge
248, June 30, 2008
The Petabyte Age is different because more is different. Kilobytes were stored on floppy disks. Megabytes were stored on hard disks. Terabytes were stored in disk arrays. Petabytes are stored in the cloud. As we moved along that progression, we went from the folder analogy to the file cabinet analogy to the library analogy to, well, at petabytes we ran out of organizational analogies.
THE REALITY CLUB: George Dyson, Kevin Kelly, Stewart Brand, W. Daniel Hillis
GEORGE DYSON: Just as we may eventually take the brain apart, neuron by neuron, and never find the model, we may discover that true AI came into existence without anyone ever developing a coherent model of reality or an unambiguous theory of intelligence. Reality, with all its ambiguities, does the job just fine. It may be that our true destiny as a species is to build an intelligence that proves highly successful, whether we understand how it works or not.
KEVIN KELLY: My guess is that this emerging method will be one additional tool in the evolution of the scientific method. It will not replace any current methods (sorry, no end of science!) but will complement established theory-driven science. Let's call this data-intensive approach to problem solving Correlative Analytics. I think Chris squanders a unique opportunity by titling his thesis "The End of Theory" because this is a negation, the absence of something. Rather it is the beginning of something, and this is when you have a chance to accelerate that birth by giving it a positive name.
STEWART BRAND: Digital humanity apparently crossed from one watershed to another over the last few years. Now we are noticing. Noticing usually helps. We'll converge on one or two names for the new watershed and watch what induction tells us about how it works and what it's good for.
W. DANIEL HILLIS: Chris Anderson says that "this approach to science (hypothesize, model, test) is becoming obsolete". No doubt the statement is intended to be provocative, but I do not see even a little bit of truth in it. I share his enthusiasm for the possibilities created by petabyte datasets and parallel computing, but I do not see why large amounts of data will undermine the scientific method. We will begin, as always, by looking for simple patterns in what we have observed and use that to hypothesize what is true elsewhere. Where our extrapolations work, we will believe in them, and when they do not, we will make new models and test their consequences. We will extrapolate from the data first and then establish a context later. This is the way science has worked for hundreds of years.
By Kai Kupferschmidt ...as early as 1959 the physicist and writer Charles Percy Snow lamented that the humanities and natural sciences were adrift. Snow coined the phrase "two cultures". At the same time, he saw a need for a "third culture" that would require a common culture of humanities scholars and natural scientists. The mid-nineties saw the American literary agent John Brockman present his idea of the third culture. It was different from the one imagined by Snow. Brockman noted that natural scientists such as the biologist Richard Dawkins or the physicist Roger Penrose had taken over the function which had previously been played by literary scholars by writing books that explained science to the public. Brockman said that this was the third culture. Meanwhile, Snow's original idea is slowly becoming a reality. A second third culture is opening up. In Germany, traditional humanities scholars and scientists are moving together. Despite practical problems, there is a growing will on both sides to understand each other ...
Alan Wolfe The collaboration of Kahneman and Tversky produced one of the major intellectual accomplishments of the late twentieth century: a series of ingeniously designed experiments that raised uncomfortable questions about "utility maximization," which was the major assumption of microeconomics. To wit: it makes no difference in theory whether you lose a ticket to a play or lose the $10 that the ticket cost, but when people lose the ticket they are far less likely to buy another one than when they lose the money. Kahneman and Tversky's explanation is that we create a mental account such that it makes sense to us to pay $10 to see a play but not $20, even though the utility sacrificed by losing the ticket and the money is identical. Tversky died of cancer in 1996. Kahneman won the Nobel Prize in economics in 2002, and is an emeritus professor at Princeton. Between them, they rattled the role of reason in the pantheon of human motives. They made clear that even if we think we know what is in our own best interest, we frequently make decisions based on misinformation, myopia, and plain quirkiness. The picture of human nature that they developed was--in contrast to the world of homo economicus-- ironic, skeptical, almost wickedly complex. See "A Short Course In Thinking About Thinking: A 'Master Class' By Danny Kahneman" [9.25.07] |
Leisure class gives way to workaholic elite scrambling to maintain their place in life ...The leisure class has given way to what I call the workaholic wealthy -- an elite of BlackBerry-crazed, network-obsessed, peripatetic travelers who have to keep scrambling to maintain their place in life. According to research by Daniel Kahneman, the Nobel Prize-winning behavioral economist, quoted in an article in the Washington Post, "being wealthy is often a powerful predictor that people spend less time doing pleasurable things and more time doing compulsory things and feeling stressed." |
ANNALS OF MEDICINE THE ITCH The theory—and a theory is all it is right now—has begun to make sense of some bewildering phenomena. Among them is an experiment that Ramachandran performed with volunteers who had phantom pain in an amputated arm. They put their surviving arm through a hole in the side of a box with a mirror inside, so that, peering through the open top, they would see their arm and its mirror image, as if they had two arms. Ramachandran then asked them to move both their intact arm and, in their mind, their phantom arm—to pretend that they were conducting an orchestra, say. The patients had the sense that they had two arms again. Even though they knew it was an illusion, it provided immediate relief. ... |
BOOKS WHAT WAS I THINKING? As an academic discipline, Ariely’s field—behavioral economics—is roughly twenty-five years old. It emerged largely in response to work done in the nineteen-seventies by the Israeli-American psychologists Amos Tversky and Daniel Kahneman. (Ariely, too, grew up in Israel.) When they examined how people deal with uncertainty, Tversky and Kahneman found that there were consistent biases to the responses, and that these biases could be traced to mental shortcuts, or what they called “heuristics.” Some of these heuristics were pretty obvious—people tend to make inferences from their own experiences, so if they’ve recently seen a traffic accident they will overestimate the danger of dying in a car crash—but others were more surprising, even downright wacky. For instance, Tversky and Kahneman asked subjects to estimate what proportion of African nations were members of the United Nations. They discovered that they could influence the subjects’ responses by spinning a wheel of fortune in front of them to generate a random number: when a big number turned up, the estimates suddenly swelled. |
...When psychologists Daniel Kahneman and the late Amos Tversky conducted an experimental survey in the early 1980s asking people to answer this simple question, they discovered, to their surprise, that most respondents picked "b," even though this was the narrower choice and hence the less likely one. It seems that saliency – in this case, Linda's passionate political profile – trumps logic. Over the past quarter-century, Mr. Kahneman and his colleagues have gone on to identify a range of flaws in our critical faculties, reshaping the study of economics by challenging the assumption that a person, when faced with a choice, can be counted on to make a rational decision. |
HOW RICH PEOPLE SPEND THEIR TIME People invariably believe that money can make them happy -- and rich people usually do report being happier than poor people do. But if this is the case, shouldn't wealthy people spend a lot more time doing enjoyable things than poor people? Nobel Prize-winning behavioral economist Daniel Kahneman has found, however, that being wealthy is often a powerful predictor that people spend less time doing pleasurable things, and more time doing compulsory things and feeling stressed. ... |
THE END OF THEORY
Sixty years ago, digital computers made information readable. Twenty years ago, the Internet made it reachable. Ten years ago, the first search engine crawlers made it a single database. Now Google and like-minded companies are sifting through the most measured age in history, treating this massive corpus as a laboratory of the human condition. They are the children of the Petabyte Age.
The Petabyte Age is different because more is different. Kilobytes were stored on floppy disks. Megabytes were stored on hard disks. Terabytes were stored in disk arrays. Petabytes are stored in the cloud. As we moved along that progression, we went from the folder analogy to the file cabinet analogy to the library analogy to, well, at petabytes we ran out of organizational analogies.
Introduction
"At the petabyte scale, information is not a matter of simple three- and four-dimensional taxonomy and order but of dimensionally agnostic statistics. It calls for an entirely different approach, one that requires us to lose the tether of data as something that can be visualized in its totality. It forces us to view data mathematically first and establish a context for it later."
In response to Anderson's essay, Stewart Brand notes that:
The "crossing" that Anderson has named in his essay has been developing in science for several years, and in the Edge community in particular. For example, during the TED Conference in 2005, before and during the annual Edge Dinner, there were illuminating informal conversations involving Craig Venter (who pioneered the use of high-volume genome sequencing using vast amounts of computational power), Danny Hillis (designer of the "Connection Machine", the massively parallel supercomputer), and Sergey Brin and Larry Page of Google: new and radical DNA sequencing techniques meet computational robots meet server farms in search of a synthetic source of energy. And in August 2007, at the Edge event "Life: What A Concept", Venter made the following point:
Andrian Kreye, editor of the Feuilleton of Sueddeutsche Zeitung wrote on his paper's editorial pages that the event was "a crucial moment in history. After all, it's where the dawning of the age of biology was officially announced". In the July/August Seed Salon with novelist Tom Wolfe, neuroscientist Michael Gazzaniga explains how the conversation began to change when neuroscience took off in the '80s and '90s:
Now, thanks to Anderson, the new narrative has a name. "Welcome to the Petabyte Age". —JB
[Originally published as the cover story, "The End of Science", Wired Magazine: Issue 16.07]
THE END OF THEORY
"All models are wrong, but some are useful." So proclaimed statistician George Box 30 years ago, and he was right. But what choice did we have? Only models, from cosmological equations to theories of human behavior, seemed to be able to consistently, if imperfectly, explain the world around us. Until now. Today companies like Google, which have grown up in an era of massively abundant data, don't have to settle for wrong models. Indeed, they don't have to settle for models at all. Sixty years ago, digital computers made information readable. Twenty years ago, the Internet made it reachable. Ten years ago, the first search engine crawlers made it a single database. Now Google and like-minded companies are sifting through the most measured age in history, treating this massive corpus as a laboratory of the human condition. They are the children of the Petabyte Age. The Petabyte Age is different because more is different. Kilobytes were stored on floppy disks. Megabytes were stored on hard disks. Terabytes were stored in disk arrays. Petabytes are stored in the cloud. As we moved along that progression, we went from the folder analogy to the file cabinet analogy to the library analogy to — well, at petabytes we ran out of organizational analogies. At the petabyte scale, information is not a matter of simple three- and four-dimensional taxonomy and order but of dimensionally agnostic statistics. It calls for an entirely different approach, one that requires us to lose the tether of data as something that can be visualized in its totality. It forces us to view data mathematically first and establish a context for it later. For instance, Google conquered the advertising world with nothing more than applied mathematics. It didn't pretend to know anything about the culture and conventions of advertising — it just assumed that better data, with better analytical tools, would win the day. And Google was right. 
Google's founding philosophy is that we don't know why this page is better than that one: If the statistics of incoming links say it is, that's good enough. No semantic or causal analysis is required. That's why Google can translate languages without actually "knowing" them (given equal corpus data, Google can translate Klingon into Farsi as easily as it can translate French into German). And why it can match ads to content without any knowledge or assumptions about the ads or the content. Speaking at the O'Reilly Emerging Technology Conference this past March, Peter Norvig, Google's research director, offered an update to George Box's maxim: "All models are wrong, and increasingly you can succeed without them." This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves. The big target here isn't advertising, though. It's science. The scientific method is built around testable hypotheses. These models, for the most part, are systems visualized in the minds of scientists. The models are then tested, and experiments confirm or falsify theoretical models of how the world works. This is the way science has worked for hundreds of years. Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. Once you have a model, you can connect the data sets with confidence. Data without a model is just noise. But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete. 
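The idea that "the statistics of incoming links" can rank pages with no semantic analysis can be illustrated with a small sketch. This is a textbook PageRank-style power iteration over a made-up four-page web, not a description of Google's production system; the function name `rank_by_links`, the damping value, and the `toy_web` data are all assumptions chosen for illustration.

```python
def rank_by_links(links, damping=0.85, iterations=100):
    """Score pages purely from link structure (PageRank-style power iteration).

    links maps each page to the list of pages it links to.
    Nothing about page content is ever inspected.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    scores = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_scores = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on, split across its outgoing links.
                share = damping * scores[page] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * scores[page] / n
                for other in pages:
                    new_scores[other] += share
        scores = new_scores
    return scores


# Hypothetical toy web: three pages link to "c", and "c" links back to "a".
toy_web = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
scores = rank_by_links(toy_web)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

The page with the most incoming links ends up with the highest score, and the only input is who links to whom.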
Consider physics: Newtonian models were crude approximations of the truth (wrong at the atomic level, but still useful). A hundred years ago, statistically based quantum mechanics offered a better picture — but quantum mechanics is yet another model, and as such it, too, is flawed, no doubt a caricature of a more complex underlying reality. The reason physics has drifted into theoretical speculation about n-dimensional grand unified models over the past few decades (the "beautiful story" phase of a discipline starved of data) is that we don't know how to run the experiments that would falsify the hypotheses — the energies are too high, the accelerators too expensive, and so on. Now biology is heading in the same direction. The models we were taught in school about "dominant" and "recessive" genes steering a strictly Mendelian process have turned out to be an even greater simplification of reality than Newton's laws. The discovery of gene-protein interactions and other aspects of epigenetics has challenged the view of DNA as destiny and even introduced evidence that environment can influence inheritable traits, something once considered a genetic impossibility. In short, the more we learn about biology, the further we find ourselves from a model that can explain it. There is now a better way. Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot. The best practical example of this is the shotgun gene sequencing by J. Craig Venter. Enabled by high-speed sequencers and supercomputers that statistically analyze the data they produce, Venter went from sequencing individual organisms to sequencing entire ecosystems. In 2003, he started sequencing much of the ocean, retracing the voyage of Captain Cook. 
And in 2005 he started sequencing the air. In the process, he discovered thousands of previously unknown species of bacteria and other life-forms. If the words "discover a new species" call to mind Darwin and drawings of finches, you may be stuck in the old way of doing science. Venter can tell you almost nothing about the species he found. He doesn't know what they look like, how they live, or much of anything else about their morphology. He doesn't even have their entire genome. All he has is a statistical blip: a unique sequence that, being unlike any other sequence in the database, must represent a new species. This sequence may correlate with other sequences that resemble those of species we do know more about. In that case, Venter can make some guesses about the animals: that they convert sunlight into energy in a particular way, or that they descended from a common ancestor. But besides that, he has no better model of this species than Google has of your MySpace page. It's just data. By analyzing it with Google-quality computing resources, though, Venter has advanced biology more than anyone else of his generation. This kind of thinking is poised to go mainstream. In February, the National Science Foundation announced the Cluster Exploratory, a program that funds research designed to run on a large-scale distributed computing platform developed by Google and IBM in conjunction with six pilot universities. The cluster will consist of 1,600 processors, several terabytes of memory, and hundreds of terabytes of storage, along with the software, including IBM's Tivoli and open source versions of Google File System and MapReduce. Early CluE projects will include simulations of the brain and the nervous system and other biological research that lies somewhere between wetware and software. Learning to use a "computer" of this scale may be challenging.
But the opportunity is great: The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all. There's no reason to cling to our old ways. It's time to ask: What can science learn from Google? |
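For readers unfamiliar with MapReduce, mentioned above as part of the Cluster Exploratory software, the following is a minimal single-process sketch of the map, shuffle, and reduce pattern. It is an illustration only: the helper names (`map_phase`, `shuffle`, `reduce_phase`) and the two sample documents are assumptions, and a real deployment would distribute each phase across many machines rather than run it in one Python process.

```python
from collections import defaultdict


def map_phase(documents, mapper):
    """Apply the mapper to every document, yielding (key, value) pairs."""
    for doc in documents:
        yield from mapper(doc)


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(doc):
    # Emit (word, 1) for every word; no knowledge of the text is needed.
    for word in doc.lower().split():
        yield word, 1


def count_reducer(key, values):
    # Sum the 1s emitted for this word.
    return sum(values)


# Hypothetical sample input.
docs = ["more is different", "more data more patterns"]
counts = reduce_phase(shuffle(map_phase(docs, word_mapper)), count_reducer)
print(counts)
```

Because the mapper and reducer only ever see one document or one key at a time, the same two small functions can in principle be run over far more data simply by adding machines.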
Feeding the Masses: Data In, Crop Predictions Out
Farmer's Almanac is finally obsolete. Last October, agricultural consultancy Lanworth not only correctly projected that the US Department of Agriculture had overestimated the nation's corn crop, it nailed the margin: roughly 200 million bushels. That's just 1.5 percent fewer kernels but still a significant shortfall for tight markets, causing a 13 percent price hike and jitters in the emerging ethanol industry. When the USDA downgraded expectations a month after Lanworth's prediction, the little Illinois-based company was hailed as a new oracle among soft-commodity traders — who now pay the firm more than $100,000 a year for a timely heads-up on fluctuations in wheat, corn, and soybean supplies. |
Chasing the Quark: Sometimes You Need to Throw Information Away
The ultimate digital camera will be demo'd at the Large Hadron Collider near Geneva later this year. While proton beams race in opposite directions around a 17-mile underground ring, crossing and recrossing the Swiss-French border on each circuit, six particle detectors will snap a billion "photos" per second of the resulting impacts. The ephemeral debris from those collisions may hold answers to some of the most exciting questions in physics. |
Winning the Lawsuit: Data Miners Dig for Dirt
Way back in the 20th century, when Ford Motor Company was sued over a faulty ignition switch, its lawyers would gird for the discovery process: a labor-intensive ordeal that involved disgorging thousands of pages of company records. These days, the number of pages commonly involved in commercial litigation discovery has ballooned into the billions. Attorneys on the hunt for a smoking gun now want to see not just the final engineering plans but the emails, drafts, personal data files, and everything else ever produced in the lead-up to the finished product. |
Tracking the News: A Smarter Way to Predict Riots and Wars
Spotting the Hot Zones: Now We Can Monitor Epidemics Hour by Hour
Sorting the World: Google Invents New Way to Manage Data
1. Collect |
Watching the Skies: Space Is Really Big — But Not Too Big to Map
Scanning Our Skeletons: Bone Images Show Wear and Tear
Tracking Air Fares: Elaborate Algorithms Predict Ticket Prices
Predicting the Vote: Pollsters Identify Tiny Voting Blocs
Pricing Terrorism: Insurers Gauge Risks, Costs
In the aftermath of the September 11, 2001, attacks, Congress passed a law requiring commercial and casualty insurance companies to offer terrorism coverage. That was reassuring to jittery business owners but a major hassle for insurers, who, after all, are in the business of predicting risk. They might not know when something like a hurricane or earthquake will hit, but decades of data tell them where and how hard such an event is likely to be. But terrorists try to do the unexpected, and the range of what they might attempt is vast. A recent study published by the American Academy of Actuaries estimated that a truck bomb going off in Des Moines, Iowa, could cost insurers $3 billion; a major anthrax attack on New York City could cost $778 billion.
The Bottom Line |
Visualizing Big Data: Bar Charts for Words
On THE END OF THEORY By Chris Anderson
Responses by George Dyson, Kevin Kelly, Stewart Brand, W. Daniel Hillis
GEORGE DYSON [6.29.08] For a long time we have been stuck on the idea that the brain somehow contains a "model" of reality, and that Artificial Intelligence will be achieved when we figure out how to model that model within a machine. What's a model? We presume two requirements: a) Something that works; and b) Something we understand. You can have (a) without (b). Our large, distributed, petabyte-scale creations are starting to grasp reality in ways that work just fine but that we don't necessarily understand. Just as we may eventually take the brain apart, neuron by neuron, and never find the model, we may discover that true AI came into existence without anyone ever developing a coherent model of reality or an unambiguous theory of intelligence. Reality, with all its ambiguities, does the job just fine. It may be that our true destiny as a species is to build an intelligence that proves highly successful, whether we understand how it works or not. The massively-distributed collective associative memory that constitutes the "Overmind" (or Kevin's OneComputer) is already forming associations, recognizing patterns, and making predictions—though this does not mean thinking the way we do, or on any scale that we can comprehend. The sudden flood of large data sets and the opening of entirely new scientific territory promises a return to the excitement at the birth of (modern) Science in the 17th century, when, as Newton, Boyle, Hooke, Petty, and the rest of them saw it, it was "the Business of Natural Philosophy" to find things out. What Chris Anderson is hinting at is that Science will increasingly belong to a new generation of Natural Philosophers who are not only reading Nature directly, but are beginning to read the Overmind. |
KEVIN KELLY [6.29.08] There's a dawning sense that extremely large databases of information, starting in the petabyte level, could change how we learn things. The traditional way of doing science entails constructing a hypothesis to match observed data or to solicit new data. Here's a bunch of observations; what theory explains the data sufficiently so that we can predict the next observation? It may turn out that tremendously large volumes of data are sufficient to skip the theory part in order to make a predicted observation. Google was one of the first to notice this. For instance, take Google's spell checker. When you misspell a word when googling, Google suggests the proper spelling. How does it know this? How does it predict the correctly spelled word? It is not because it has a theory of good spelling, or has mastered spelling rules. In fact Google knows nothing about spelling rules at all. Instead Google operates a very large dataset of observations which show that for any given spelling of a word, x number of people say "yes" when asked if they meant to spell word "y." Google's spelling engine consists entirely of these datapoints, rather than any notion of what correct English spelling is. That is why the same system can correct spelling in any language. In fact, Google uses the same philosophy of learning via massive data for their translation programs. They can translate from English to French, or German to Chinese by matching up huge datasets of humanly translated material. For instance, Google trained their French/English translation engine by feeding it Canadian documents which are often released in both English and French versions. The Googlers have no theory of language, especially of French, no AI translator. Instead they have zillions of datapoints which in aggregate link "this to that" from one language to another. Once you have such a translation system tweaked, it can translate from any language to another. And the translation is pretty good. 
Not expert level, but enough to give you the gist. You can take a Chinese web page and at least get a sense of what it means in English. Yet, as Peter Norvig, head of research at Google, once boasted to me, "Not one person who worked on the Chinese translator spoke Chinese." There was no theory of Chinese, no understanding. Just data. (If anyone ever wanted a disproof of Searle's riddle of the Chinese Room, here it is.) If you can learn how to spell without knowing anything about the rules or grammar of spelling, and if you can learn how to translate languages without having any theory or concepts about grammar of the languages you are translating, then what else can you learn without having a theory? Chris Anderson is exploring the idea that perhaps you could do science without having theories.
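The spelling example above can be made concrete with a short sketch. It assumes a log of (typed query, suggestion the user accepted) pairs and simply returns the most frequently accepted suggestion; the function name `build_suggester` and the sample log are hypothetical, and this is a toy rendering of the idea as described here, not Google's actual spelling engine.

```python
from collections import Counter, defaultdict


def build_suggester(observations):
    """Build a spelling suggester from (typed, accepted) observation pairs.

    No spelling rules are encoded anywhere: the table only records how often
    users who typed one string accepted another.
    """
    table = defaultdict(Counter)
    for typed, accepted in observations:
        table[typed][accepted] += 1

    def suggest(typed):
        if typed not in table:
            # Never observed: there is no data, so leave the input alone.
            return typed
        return table[typed].most_common(1)[0][0]

    return suggest


# Hypothetical observation log; the last entry is French to show that
# the same code works in any language for which observations exist.
log = [
    ("recieve", "receive"),
    ("recieve", "receive"),
    ("recieve", "relieve"),
    ("teh", "the"),
    ("bonjuor", "bonjour"),
]
suggest = build_suggester(log)
print(suggest("recieve"), suggest("teh"), suggest("bonjuor"))
```

The limitation is visible in the fallback branch: a string that nobody has typed before gets no correction at all, which is why the approach depends on very large volumes of observations.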
There may be something to this observation. Many sciences such as astronomy, physics, genomics, linguistics, and geology are generating extremely large datasets and constant streams of data at the petabyte level today. They'll be at the exabyte level in a decade. Using old-fashioned "machine learning," computers can extract patterns in this ocean of data that no human could ever possibly detect. These patterns are correlations. They may or may not be causative, but we can learn new things. Therefore they accomplish what science does, although not in the traditional manner. What Anderson is suggesting is that sometimes enough correlations are sufficient. There is a good parallel in health. A lot of doctoring works on the correlative approach. The doctor may not ever find the actual cause of an ailment, or understand it if he/she did, but he/she can correctly predict the course and treat the symptom. But is this really science? You can get things done, but if you don't have a model, is it something others can build on? We don't know yet. The technical term for this approach in science is Data Intensive Scalable Computation (DISC). Other terms are "Grid Datafarm Architecture" or "Petascale Data Intensive Computing." The emphasis in these techniques is on the data-intensive nature of computation, rather than on the computing cluster itself. The online industry calls this approach to investigation a type of "analytics." Cloud computing companies like Google, IBM, and Yahoo, and some universities have been holding workshops on the topic. In essence these pioneers are trying to exploit cloud computing, or the OneMachine, for large-scale science. The current tools include massively parallel software platforms like MapReduce and Hadoop (See: "A Cloudbook For The Cloud"), cheap storage, and gigantic clusters of data centers. So far, very few scientists outside of genomics are employing these new tools.
The intent of the NSF's Cluster Exploratory program is to match scientists owning large, database-driven observations with computer scientists who have access to and expertise with cluster/cloud computing. My guess is that this emerging method will be one additional tool in the evolution of the scientific method. It will not replace any current methods (sorry, no end of science!) but will complement established theory-driven science. Let's call this data-intensive approach to problem solving Correlative Analytics. I think Chris squanders a unique opportunity by titling his thesis "The End of Theory" because this is a negation, the absence of something. Rather it is the beginning of something, and this is when you have a chance to accelerate that birth by giving it a positive name. A non-negative name will also help clarify the thesis. I am suggesting Correlative Analytics rather than No Theory because I am not entirely sure that these correlative systems are model-free. I think there is an emergent, unconscious, implicit model embedded in the system that generates answers. If none of the English speakers working on Google's Chinese Room have a theory of Chinese, we can still think of the Room as having a theory. The model may be beyond the perception and understanding of the creators of the system, and since it works it is not worth trying to uncover it. But it may still be there. It just operates at a level we don't have access to. But the models' invisibility doesn't matter because they work. It is not the end of theories, but the end of theories we understand. George Dyson says this much better in his response to Chris Anderson's article (see above). So far Correlative Analytics, or the Google Way of Science, has primarily been deployed in sociological realms, like language translation, or marketing. That's where the zillionic data has been. All those zillions of data points generated by our collective life online.
But as more of our observations and measurements of nature are captured 24/7, in real time, by an increasing variety of sensors and probes, science too will enter the field of zillionics, and its data will be easily processed by the new tools of Correlative Analytics. In this part of science, we may get answers that work, but which we don't understand. Is this partial understanding? Or a different kind of understanding? Perhaps understanding and answers are overrated. "The problem with computers," Pablo Picasso is rumored to have said, "is that they only give you answers." These huge data-driven correlative systems will give us lots of answers, good answers, but that is all they will give us. That's what the OneComputer does: it gives us good answers. In the coming world of cloud computing, perfectly good answers will become a commodity. The real value of the rest of science then becomes asking good questions. [See "The Google Way of Science" on Kevin Kelly's blog, The Technium]
STEWART BRAND [6.29.08] Digital humanity apparently crossed from one watershed to another over the last few years. Now we are noticing. Noticing usually helps. We'll converge on one or two names for the new watershed and watch what induction tells us about how it works and what it's good for.
W. DANIEL HILLIS [6.30.08] I am a big fan of Google, and I love looking for mathematical patterns in data, but Chris Anderson’s essay "The End of Theory: Will the Data Deluge Make the Scientific Method Obsolete?" sets up a false distinction. He claims that using a large collection of data to "view data mathematically first and establish a context for it later" is somehow different from "the way science has worked for hundreds of years." I disagree. Science always begins by looking for patterns in the data, and the first simple models are always just extrapolations of what we have seen before. Astronomers were able to accurately predict the motions of planets long before Newton’s theories. They did this by gathering lots of data and looking for mathematical patterns. The "new" method that Chris Anderson describes has always been the starting point: gather lots of data, and assume it is representative of other situations. This works well as long as we do not try to extrapolate too far from what has been observed. It is a very simple kind of model, a model that says, "what we will see next will be very much what we have seen so far". This is usually a good guess. Existing data always gives us our first hypothesis. Humans and other animals are probably hard-wired for that kind of extrapolation. Mathematical tools like differential equations and statistics were developed to help us do a better job of it. These tools of science have been used for centuries, and computers have let us apply them to larger data sets. They have also allowed us to collect more data to extrapolate from. The data-based methods that we apply to petabytes are the methods that we have always tried first. The experimental method (hypothesize, model, test) is what allows science to get beyond what can be extrapolated from existing data. Hypotheses are most interesting when they predict something that is different from what we have seen so far.
For instance, Newton’s model could predict the paths of undiscovered planets, whereas the old-fashioned data-based models could not. Einstein’s model, in turn, predicted measurements that would have surprised Newton. Models are interesting precisely because they can take us beyond the data. Chris Anderson says that "this approach to science — hypothesize, model, test — is becoming obsolete". No doubt the statement is intended to be provocative, but I do not see even a little bit of truth in it. I share his enthusiasm for the possibilities created by petabyte datasets and parallel computing, but I do not see why large amounts of data will undermine the scientific method. We will begin, as always, by looking for simple patterns in what we have observed and use them to hypothesize what is true elsewhere. Where our extrapolations work, we will believe in them, and when they do not, we will make new models and test their consequences. We will extrapolate from the data first and then establish a context later. This is the way science has worked for hundreds of years. Chris Anderson is correct in his intuition that something is different about these new large databases, but he has misidentified what it is. What is interesting is that for the first time we have significant quantitative data about the variation of individuals: their behavior, their interaction, even their genes. These huge new databases give us a measure of the richness of the human condition. We can now look at ourselves with the tools we developed to study the stars.