Computers Will Let Data Tell More Of Their Own Story

I am optimistic that the rapid growth in computing power will let measured data tell more of their own story—rather than tell the story of the "model" that someone imposes on the data. The slow but steady movement away from classical model-based science tracks the growth in computers and digital processors.

Almost all traditional science and engineering has been model-based. Equations define the simplest models or functional relationships among input and output variables. Usually some super-smart thinker first makes an inspired guess at the model equations. The grand examples are Newton's guess at the inverse-square law of gravity and then Einstein's later and even bolder guess at the non-Euclidean geometry of the spacetime continuum.

Most models have lesser scope and far humbler origins. The modeler often guesses at a linear or quadratic or other simple relationship between the inputs and outputs even though the world itself appears to be quite nonlinear in general and often nonstationary as well. A standard modeling trick is to let a random noise term account for the difference between the nonlinear and largely unknown world and the far simpler model. Thus the humble noise or nuisance term carries much of the model's explanatory burden. Then the modeler compares the model to some data and looks for a pattern match to some degree. Other models can compete with the first model based on their pattern matches with the data.
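A minimal sketch of that noise-term trick, under assumptions that are not in the essay: the "world" below is a made-up curve (a sine wave plus a trend), and the modeler imposes a straight line by ordinary least squares. Whatever the line cannot capture lands in the residual, which the model then treats as noise.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


rng = random.Random(0)  # fixed seed so the run is repeatable
xs = [i / 10 for i in range(100)]  # inputs from 0.0 to 9.9

# Hypothetical nonlinear "world": sine wave plus trend plus small measurement noise.
ys = [math.sin(x) + 0.3 * x + rng.gauss(0.0, 0.1) for x in xs]

intercept, slope = fit_line(xs, ys)

# The residual is everything the straight line fails to explain.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
residual_sd = math.sqrt(sum(r * r for r in residuals) / (len(xs) - 2))

print(f"fitted line: y = {intercept:.3f} + {slope:.3f} x")
print(f"residual standard deviation: {residual_sd:.3f}")
```

The measurement noise in this sketch has standard deviation 0.1, so any residual spread well above that reflects the unmodeled sine structure rather than true randomness.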

Model-based science has produced most of our technological achievements. And it will likely always be at the core of the science curriculum. But it does rely on an arcane ability to guess at nature with symbolic mathematics. It is hard to see a direct evolutionary basis for such refined symbol manipulation although there may be several indirect bases that involve language skills and spatial reasoning.

A more immediate issue is that we tend to over-teach models in the science and engineering curriculum. One reason for this is that it is easy to teach closed-form equations and related technical theorems. Just state and derive the model result and apply it to examples based on numbers or on other equations. Equations make for wonderful homework problems. And it is especially easy to test on model equations and their consequences. It is not so easy to teach or test on data-intensive problems that can involve large tables of numerical data.

Another reason for over-teaching models is that so many of our textbooks in science and engineering have their roots in the pre-computing era surrounding World War II. That was the Shannon era of great analytical minds and authors such as probabilist Joseph Doob and chemist Linus Pauling and economist Paul Samuelson and many others. Students performed computations with slide rules and then later with pocket calculators. Science and engineering textbooks today still largely build on those earlier texts from the pre-computer age where so often mathematical assumptions trumped raw data.

Rising computer power led to the first large break with the math-model approach in various species of artificial intelligence. Computer scientists programmed expert-system search trees directly from words. Some put uncertainty math models on the trees but the tree structure itself used words or text strings. The non-numerical structure let experts directly encode their expertise in verbal rules. That removed the old math models but still left the problem of literally doing only what the expert or modeler said to do.

Adaptive fuzzy rule-based systems allowed experts to state rules in words while the fuzzy system itself remained numeric. Data could in principle overcome modeler bias by adapting the rule structure in new directions as the data poured in. That reduced expert or modeler input to little more than giving initial conditions and occasional updates to the inference structure. Still all these AI tree-based knowledge systems suffer from the curse of dimensionality in some form of combinatorial rule explosion.

Feedforward neural networks further reduced the expert to a supervisor who gives the network preferred input-output pairs to train its synaptic throughput structure. But this comes at the expense of having no logical audit trail in the network's innards that can explain what patterns the network encodes or what patterns it partly or wholly forgets when it learns new input-output patterns. Unsupervised neural networks tend to be less powerful but omit more modeler bias because the user does not give the network preferred outputs or teaching signals. All these AI systems are model-free in the sense that the user does not need to state a math model of the process at hand. But each still has some form of a functional math model that converts inputs to outputs.

Statistics has arguably been the real leader in the shift from models to data, even though classical linear regression has been imposing lines and planes on the world for over two centuries. Neural and fuzzy learning systems turn out ultimately to have the structure of nonlinear but still statistical approximators. Closed-form statistics also produced Bayesian models as a type of equation-based expert system where the expert can inject his favorite probability curve on the problem at hand. These models have the adaptive benefit that successive data often wash away the expert's initial math guesses just as happens in an adaptive fuzzy system. The AI systems are Bayesian in this sense of letting experts encode expertise directly into a knowledge structure. But again the knowledge structure itself is a model of sorts and thus an imposition on the data.
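The washing-away effect can be sketched with the simplest textbook case, a Beta prior on a coin-flip rate. This is an illustration and not a method from the essay: the two experts, their priors, and the true rate of 0.3 are all made-up assumptions. Two experts start with opposite guesses, and the gap between their posterior estimates shrinks as the same data reach both.

```python
import random


def posterior_mean(prior_a, prior_b, successes, trials):
    """Posterior mean of a rate under a Beta(prior_a, prior_b) prior."""
    return (prior_a + successes) / (prior_a + prior_b + trials)


rng = random.Random(0)   # fixed seed so the run is repeatable
true_rate = 0.3          # hypothetical rate that generates the data

optimist = (8.0, 2.0)    # hypothetical expert whose prior mean is 0.8
pessimist = (2.0, 8.0)   # hypothetical expert whose prior mean is 0.2

gaps = {}
successes = 0
for trial in range(1, 1001):
    if rng.random() < true_rate:
        successes += 1
    if trial in (10, 100, 1000):
        high = posterior_mean(*optimist, successes, trial)
        low = posterior_mean(*pessimist, successes, trial)
        gaps[trial] = high - low
        print(f"after {trial:4d} trials: {high:.3f} vs {low:.3f}")
```

With these particular priors the gap works out to 6 / (10 + n) after n trials no matter what the data say, so the initial disagreement of 0.6 falls below 0.01 by the thousandth observation.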

The hero of data-based reasoning is the bootstrap resample. The bootstrap has produced a revolution of sorts in statistics since statistician Bradley Efron introduced it in 1979 when personal computers were becoming more available. The bootstrap in effect puts the data set in a bingo hopper and lets the user sample from the data set over and over again just so long as the user puts the data back in the hopper after drawing and recording it. Computers easily let one turn an initial set of 100 data points into tens of thousands of resampled sets of 100 points each. Efron and many others showed that these virtual samples contain further information about the original data set. This gives a statistical free lunch except for the extensive computation involved, and that computation grows a little less expensive each day. A glance at most multi-edition textbooks on statistics will show the growing influence of the bootstrap and related resampling techniques in the later editions.
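The bingo-hopper picture translates almost directly into code. The sketch below assumes a made-up data set of 100 numbers. The function name `bootstrap_resamples` is a hypothetical helper and does not come from any library.

```python
import random


def bootstrap_resamples(data, n_resamples, seed=0):
    """Draw n_resamples virtual data sets from data, with replacement.

    Each virtual set has the same size as the original. Drawing an index
    and leaving the data list untouched is the same as putting the ball
    back in the hopper after recording it.
    """
    rng = random.Random(seed)
    n = len(data)
    return [
        [data[rng.randrange(n)] for _ in range(n)]
        for _ in range(n_resamples)
    ]


# Hypothetical original sample of 100 measurements.
data_rng = random.Random(1)
original = [data_rng.gauss(50.0, 10.0) for _ in range(100)]

resamples = bootstrap_resamples(original, 10_000)
print(len(resamples), "virtual data sets of", len(resamples[0]), "points each")
```

Because the draws are with replacement, a given virtual set usually repeats some original points and omits others. That variation from one virtual set to the next is what carries the extra information.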

Consider the model-based baggage that goes into the standard 95% confidence interval for a population mean. Such confidence intervals appear expressly in most medical studies and reports and appear implicitly in media poll results as well as appearing throughout science and engineering. The big assumption is that the data come reasonably close to a bell curve even if it has thick tails. A similar assumption occurs when instructors grade on a "curve" even though the student grades often deviate substantially from a bell curve (such as clusters of good and poor grades). Sometimes one or more statistical tests will justify the bell-curve assumption to varying degrees, and some of the tests themselves make assumptions about the data. The simplest bootstrap confidence interval makes no such assumption. The user computes a sample mean for each of the thousands of virtual data sets. Then the user rank-orders these thousands of computed sample means from smallest to largest and picks the appropriate percentile estimates. Suppose there were 1000 virtual sample sets and thus 1000 computed sample means. The bootstrap interval picks the 25th sample mean in that ordering for the lower bound of the 95% confidence interval and picks the 975th sample mean in that ordering for the upper bound. Done.
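The recipe is short enough to sketch in full. The code below follows the steps just described: resample, compute each mean, sort, and read off the 25th and 975th values out of 1000. The skewed data set and the function names are made-up assumptions for illustration, and other percentile conventions differ by a rank or so at the endpoints.

```python
import random


def percentile_ranks(n_resamples, level=0.95):
    """1-based ranks of the lower and upper bounds in the sorted means."""
    tail = (1.0 - level) / 2.0
    lower = max(1, round(n_resamples * tail))
    upper = min(n_resamples, round(n_resamples * (1.0 - tail)))
    return lower, upper


def bootstrap_mean_interval(data, n_resamples=1000, level=0.95, seed=0):
    """Simplest percentile bootstrap interval for the mean."""
    rng = random.Random(seed)
    n = len(data)
    means = []
    for _ in range(n_resamples):
        # One virtual data set: n draws with replacement.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()  # rank-order from smallest to largest
    lower_rank, upper_rank = percentile_ranks(n_resamples, level)
    return means[lower_rank - 1], means[upper_rank - 1]


# Hypothetical skewed sample that is clearly not a bell curve.
data_rng = random.Random(2)
data = [data_rng.expovariate(1 / 20.0) for _ in range(100)]

sample_mean = sum(data) / len(data)
lower, upper = bootstrap_mean_interval(data)
print(f"sample mean: {sample_mean:.2f}")
print(f"95% bootstrap interval: ({lower:.2f}, {upper:.2f})")
```

Nothing in the code refers to a bell curve or to any other distribution. The only inputs are the data, the number of resamples, and the desired confidence level.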

Bootstrap intervals tend to give results similar to those of model-based intervals for test cases where the user generates the original data from a normal bell curve or the like. The same holds for bootstrap hypothesis tests. But in the real world we do not know the "true" distribution that generated the observed data. So why not avoid the clear potential for modeler bias and just use the bootstrap estimate in the first place?
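That agreement on bell-curve data is easy to check on a synthetic test case. The sketch below makes several assumptions that are not in the essay: the data are 100 made-up normal draws, the model-based interval uses the large-sample normal quantile (about 1.96) where a textbook might use Student's t, and both function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def normal_theory_interval(data, level=0.95):
    """Model-based interval: mean plus or minus z times the standard error."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    center = mean(data)
    half_width = z * stdev(data) / math.sqrt(len(data))
    return center - half_width, center + half_width


def bootstrap_percentile_interval(data, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean."""
    rng = random.Random(seed)
    n = len(data)
    # rng.choices draws n values with replacement from data.
    means = sorted(sum(rng.choices(data, k=n)) / n for _ in range(n_resamples))
    lower_rank = max(1, round(n_resamples * (1 - level) / 2))
    upper_rank = min(n_resamples, round(n_resamples * (1 + level) / 2))
    return means[lower_rank - 1], means[upper_rank - 1]


# Hypothetical test case: data that really do come from a bell curve.
data_rng = random.Random(3)
data = [data_rng.gauss(100.0, 15.0) for _ in range(100)]

model_low, model_high = normal_theory_interval(data)
boot_low, boot_high = bootstrap_percentile_interval(data)
print(f"model-based: ({model_low:.2f}, {model_high:.2f})")
print(f"bootstrap:   ({boot_low:.2f}, {boot_high:.2f})")
```

On data like these the two intervals should nearly coincide. The interesting comparison comes from swapping in skewed or clustered data, where only the bootstrap interval avoids leaning on the bell-curve assumption.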

Bootstrap resampling has started to invade almost every type of statistical decision making. Statisticians have even shown how to apply it in complex cases of time-series and dependent data. It still tends to appear in statistics texts as a special topic after the student learns the traditional model-based methods. And there may be no easy way to give student scientists and engineers an in-class exam on bootstrap resampling with real data.

Still the trend is toward ever more data-based methods in science and engineering, and thus toward letting the data tell more of their own story (if there is a story to tell). Math models have tradition and human psychology on their side. But our math models grow at an approximately linear rate while data processing grows exponentially.

Computing power will out.