Victoria Stodden
Associate Professor of Information Sciences, University of Illinois at Urbana-Champaign

In statistical modeling, the use of the Greek letter “epsilon” explicitly recognizes that uncertainty is intrinsic to our world. The statistical paradigm envisions two components: the data or measurements drawn from the world we observe, and the underlying processes that generated those data. Epsilon appears in mathematical descriptions of these underlying processes and represents the inherent randomness with which the data we observe are generated. By collecting and modeling data we hope to make better guesses at the mathematical form of these processes, on the idea that a better understanding of the data-generating mechanism will let us do a better job of modeling and predicting the world around us.
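One common way to write this down, offered here only as an illustration, is the additive-error model. The symbols are assumptions rather than notation from the essay: y is an observed measurement, x the inputs, f the unknown underlying process, and epsilon a random term with mean zero.

```latex
y = f(x) + \varepsilon, \qquad \mathbb{E}[\varepsilon] = 0
```

Modeling aims to improve our guess at f; the epsilon term remains however good that guess becomes.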

That use of epsilon is a direct recognition of the inability of data-driven research to perfectly predict the future, no matter the computing or data collection resources. It codifies that uncertainty exists in the world itself. We may be able to understand the structure of this uncertainty better over time, but the statistical paradigm asserts uncertainty as fundamental.
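A minimal simulation sketch can make the point concrete. It assumes a hypothetical process, y = 2x + 1 plus Gaussian noise with standard deviation sigma, chosen purely for illustration; the function name and parameters are not from the essay. Even when the prediction uses the true process exactly, the average squared error stays close to sigma squared rather than falling to zero.

```python
import random


def simulate_irreducible_error(n=20000, sigma=1.0, seed=0):
    """Mean squared error of a predictor that knows the true process exactly.

    Hypothetical setup for illustration: y = 2x + 1 + epsilon,
    with epsilon drawn from a Gaussian with mean 0 and std dev sigma.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = 0.0
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)        # a perfectly measured input
        epsilon = rng.gauss(0.0, sigma)   # inherent randomness in the process
        y = 2.0 * x + 1.0 + epsilon       # what we actually observe
        prediction = 2.0 * x + 1.0        # best possible guess: the true f(x)
        total += (y - prediction) ** 2
    return total / n


mse = simulate_irreducible_error()
print(f"mean squared error with the true process known: {mse:.3f}")
```

Increasing `n` or measuring `x` more precisely does not shrink this error; only a smaller `sigma`, a property of the world in this toy setup, would.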

So we can never expect perfect predictions, even if we manage to take perfect measurements. This inherent uncertainty means doubt isn't a negative, or a weakness, but a mature recognition that our knowledge is imperfect. The statistical paradigm is increasingly being used as we continue to collect and analyze vast amounts of data, and as the output of algorithms and models grows as a source of information. We are seeing the impact across society: evidence-based policy, evidence-based medicine, more sophisticated pricing and market prediction models, social media customized to our online browsing patterns... The intelligent use of information derived from statistical models relies on understanding uncertainty, as do policy making and our cultural understanding of this information source. The 21st century is surely the century of data, and the stakes in understanding its use correctly are high.