We humans are pretty impressive when it comes to being able to extract information, to discern patterns from lots of little itsy bitsy data points. Take a musician sitting down with a set of instructions on a piece of paper — sheet music — and being able to turn it into patterned sound. And one step further is the very well-trained musician who can sit and read through printed music, even an entire orchestral score, and hear it in his head, and even feel swept up in emotion at various points in the reading. Even more remarkable is the judge in a composition competition, reading through a work that she has never heard before, able to turn that novel combination of notes into sounds in her head that can be judged to be hackneyed and derivative, or beautiful and original.
And, obviously, we do it in the scientific realm in a pretty major way. We come to understand how something works by being able to make sense of how a bunch of different independent variables interact in generating some endpoint. Oh, so that's how mitochondria have evolved to solve that problem, that's what a temperate zone rain forest does to balance those different environmental forces challenging it. Now I know.
The trouble is that it is getting harder to do that in the life sciences, and this is where something is going to have to happen which will change everything.
The root of the problem is technology outstripping our ability to really make use of it. This isn't so much about the ability to get increasingly reductive biological information. It was some time ago that scientists figured out how to sequence a gene, identify a mutation, get the crystallographic structure of a protein, or measure ion flow through a single channel in a cell.
The recent development has been the ability to get staggeringly large amounts of that type of information. We have not just sequenced genes, but sequenced our entire human genome. And we can compare it to that of other species, or look at genome-wide differences between human populations, or even individuals, or gather information about tens of thousands of different genes. And then we can look at expression of those genes: which ones are active at which time in which cell types in which individuals in which populations in which species.
We can do epigenomics, where instead of cataloging which genes exist in an individual, we can examine which genes have been modified in a long-term manner to make it easier or harder to activate them (in each particular cell type). Or we can do proteomics, examining which proteins and in what abundance have been made as the end product of the activation of those genes, or post-translational proteomics, examining how those proteins have been modified to change their functions.
Meanwhile, the same ability to generate massive amounts of data has emerged in other realms of the life sciences. For example, it is possible to do near-continuous samplings of blood glucose levels, producing minute-by-minute determinations, or do ambulatory cardiology, generating heartbeat data 24/7 for days from an individual going about her business, or use state-of-the-art electrophysiological techniques to record the electrical activity of scores of individual neurons simultaneously.
So we are poised to be able to do massive genomo-epigenomo-proteonomo-glyco-endo-neurono-orooni-omic comparisons of the Jonas Brothers with Nelson Mandela with a dinosaur pelvis with Wall-E and thus better understand the nature of life.
The problem, of course, is that we haven't a clue what to do with that much data. By that, I don't mean "merely" how to store, or quantitatively analyze, or present it visually. I mean how to really think about it.
You can already see evidence of this problem in too many microarray papers (this is the approach where you can ask, "In this particular type of tissue, which genes are more active and which less active than usual under this particular circumstance?"). With the fanciest versions of this approach, you've got yourself thousands of bits of information at the end. And far too often, what is done with all this suggests that the scientists have hit a wall in terms of being able to squeeze insight out of their study.
For example, the conclusion in the paper might be, "Eleventy genes are more active under this circumstance, whereas umpteen genes are less active, and that's how things work." Or maybe the punch line is, "Of those eleventy genes that are more active, an awful lot of them have something to do with, say, metabolism, how's about that?" Or in a sheepish tone, it might be, "So changes occurred in the activity of eleventy + umpteen different genes, and we don't know what most of them do, but here are three that we do know about and which plausibly have something to do with this circumstance, so we're now going to focus on those three that we already know something about and ignore the rest."
In other words, the technologies have outstripped our abilities to be insightful far too often. We have some crutches — computer graphics allow us to display a three-dimensional scatter plot, rotate it, change it over time. But we still barely hold on.
The thing that is going to change everything will have to wait for, probably, our grandkids. It will come from their growing up with games and emergent networks and who knows what else that (obviously) we can't even imagine. And they'll be able to navigate that stuff as effortlessly as we troglodytes can currently change radio stations while driving while talking to a passenger. In other words, we're not going to get much out of these vast data sets until we have people who can intuit in six dimensions. And then, watch out.