Assistant Professor, Department of Sociology, University of Minnesota-Twin Cities; Faculty Member, Minnesota Population Center
Length-Biased Sampling

Here are three puzzles.

  • American fertility fluctuated dramatically in the decades surrounding the Second World War. Parents created the smallest families during the Great Depression, and the largest families during the postwar Baby Boom. Yet children born during the Great Depression came from larger families than those born during the Baby Boom. How can this be?
  • About half of the prisoners released in any given year in the United States will end up back in prison within five years. Yet the proportion of prisoners ever released who will ever end up back in prison, over their whole lifetime, is just one third. How can this be?
  • People whose cancers are caught early by random screening often live longer than those whose cancers are detected later, after they are symptomatic. Yet those same random screenings might not save any lives. How can this be?

And here is a twist: these are all the same puzzle.

The solution is adopting the right perspective. Consider the family puzzle. One side is about what parents do; the other side is about what kids experience. If families all had the same number of kids, these perspectives would coincide: The context parents create is the context kids live in. But when families aren’t all the same size, it matters whose perspective you take.

Imagine trying to figure out the average family size in a particular neighborhood. You could ask the parents how many kids they have. Big families and small families will count equally. Or you could ask the children how many siblings they have. A family with five kids will show up in the data five times, and childless families won’t show up at all. The question is the same: How big is your family? But when you ask kids instead of parents, the answers are weighted by the size of the family. This isn’t a data error so much as a trick of reality: The average kid actually has a bigger family than the average parent does. And (as the great demographer Sam Preston has pointed out), during the Great Depression, when families were either very small or very large, this effect was magnified—so the average child came from a very large family even though the average adult produced a small family.
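The arithmetic behind the two perspectives can be sketched in a few lines of Python. The neighborhoods below are invented for illustration, not historical data, and the function names are hypothetical. The only assumption is the one in the paragraph above: asking parents counts each family once, while asking children counts a family of k kids k times.

```python
def parent_view_mean(family_sizes):
    """Average children per family: every family counts once."""
    return sum(family_sizes) / len(family_sizes)


def child_view_mean(family_sizes):
    """Average family size reported by children.

    A family with k children contributes k answers of "k",
    so childless families contribute nothing.
    """
    total_children = sum(family_sizes)
    return sum(k * k for k in family_sizes) / total_children


# Hypothetical neighborhoods (made-up numbers).
polarized = [0, 0, 0, 0, 0, 0, 1, 1, 8, 8]   # mostly tiny families, a few huge ones
clustered = [2, 3, 3, 3, 3, 3, 3, 3, 3, 4]   # nearly everyone has about three kids

for name, sizes in [("polarized", polarized), ("clustered", clustered)]:
    print(name, round(parent_view_mean(sizes), 2), round(child_view_mean(sizes), 2))
```

With these made-up numbers, parents in the polarized neighborhood average 1.8 children, yet their children report an average family of about 7.2. In the clustered neighborhood the two answers nearly agree, at 3.0 and about 3.07. The gap between the perspectives equals the variance of family size divided by its mean, which is why unequal family sizes drive the paradox.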

The recidivism puzzle is the family puzzle on a slant. When we look at released prisoners at a moment in time, we see the ones who leave prison most often, and they are also the ones who return most often. We see, as William Rhodes and his colleagues recently pointed out, the repeat offenders. Meanwhile, a person drawn from the whole population that ever leaves prison has 2-to-1 odds of never going back.

Snapshots bias samples: When some people experience something—like a prison release—more often than others, looking at a random moment in time guarantees a non-random assortment of people.
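A toy calculation shows how both recidivism figures can hold at once. This is a sketch under loud simplifications: the cohort is invented, the five-year window is ignored, and every release except a person's last is assumed to be followed by a return to prison. The function names are hypothetical.

```python
def share_of_people_who_return(releases_per_person):
    """Fraction of ever-released people who go back at least once."""
    returners = sum(1 for r in releases_per_person if r > 1)
    return returners / len(releases_per_person)


def share_of_releases_followed_by_return(releases_per_person):
    """Fraction of release events that are followed by a return.

    Assumption: each person's final release is the only one
    not followed by a return to prison.
    """
    total_releases = sum(releases_per_person)
    releases_with_return = sum(r - 1 for r in releases_per_person)
    return releases_with_return / total_releases


# Hypothetical cohort: 200 people released once, 100 people released four times.
cohort = [1] * 200 + [4] * 100

print(share_of_people_who_return(cohort))            # counts people
print(share_of_releases_followed_by_return(cohort))  # counts releases
```

In this invented cohort only a third of the people ever return, but half of all releases are followed by a return, because the repeat offenders supply most of the releases. Counting releases rather than people is what a snapshot of a single year's releases does.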

And the cancer screenings? Screenings reveal cancer at an intermediary stage—when it is advanced enough to be detectable, but not so advanced that the patient would have shown up for testing without being screened. And this intermediary, detectable stage generally lasts longer for cancers that spread slowly. The more time the cancer spends in the detectable stage, the more likely it is to be detected. So the screenings disproportionately find the slower-growing, less-lethal cancers, whether or not early detection does anything to diminish their lethality. Assigning screenings randomly to people necessarily assigns screenings selectively to tumor types.
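The same weighting can be written down for screenings. The sketch below assumes, purely for illustration, that a randomly timed screening catches a tumor with probability proportional to the years it spends in the detectable, pre-symptomatic stage. The tumor types, their shares, and their durations are made-up numbers, and the function name is hypothetical.

```python
def screen_detected_mix(tumor_types):
    """Composition of tumors found by randomly timed screening.

    tumor_types maps a name to (share_of_all_tumors, years_detectable).
    Assumption: chance of being caught is proportional to years_detectable.
    """
    weights = {
        name: share * years
        for name, (share, years) in tumor_types.items()
    }
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical: slow and fast tumors are equally common, but slow ones
# stay in the detectable stage four times as long.
tumors = {"slow": (0.5, 8.0), "fast": (0.5, 2.0)}

mix = screen_detected_mix(tumors)
print(mix)
```

Under these assumptions, slow-growing tumors are half of all tumors but four fifths of the tumors that screening finds. Nothing in the calculation refers to treatment, so screen-detected patients would appear to do better even if early detection changed no outcomes.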

The twist in these puzzles is “length-biased sampling”: observing clusters in proportion to their size, so that bigger clusters turn up more often. Length-biased sampling reveals how lifespans (of people, of post-prison careers, of diseases) bundle time the way families bundle children.

All this may seem like a methodological point, and indeed, researchers go awry when we ask about one level but unwittingly answer about another. But length-biased sampling also explains how our social positions can give us very different experiences of the world—as when, if a small group of men each harasses many women, few men know a harasser, but many women are harassed.

Most fundamentally, length-biased sampling is the deep structure of nested categories. It’s not just that the categories can play by different rules, but that they must.

Consider again those differently sized families, now stretching out over generations. If we each had the same number of children as our parents did, small families would beget a small number of new small families—and large families would beget larger and larger numbers of families with many children of their own. With each passing generation, the larger a family is, the more common families of its size would become. The mushrooming of families with many kids would sprout into wild, unchecked population growth.
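The runaway growth in that thought experiment is easy to trace. The sketch below takes the paragraph's premise literally and adds a simplification that is an assumption of this example: each child founds exactly one new family of the same size as the one they grew up in, ignoring that it takes two parents to form a family. The starting counts are invented and the function name is hypothetical.

```python
def next_generation(families):
    """Advance one generation.

    families maps family size -> number of families of that size.
    Assumption: every child founds one family of the size they grew up in,
    so n families of size k produce n * k families of size k.
    """
    return {size: count * size for size, count in families.items()}


# Hypothetical start: equal numbers of one-child and three-child families.
families = {1: 100, 3: 100}
history = [families]
for _ in range(3):
    families = next_generation(families)
    history.append(families)

for generation, counts in enumerate(history):
    total_children = sum(size * count for size, count in counts.items())
    print(generation, counts, total_children)
```

After three generations the one-child families still number 100 while the three-child families number 2,700, and the number of children has grown from 400 to 8,200. Family-level stability, in this toy version, rules out population-level stability.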

This implies that, as Preston’s analysis of family sizes showed us, population stability requires family instability: If the population is to stay roughly the same size, most children must grow up to have fewer of their own kids than their parents did, each generation rejecting tradition anew. And indeed, most of us do. Adages about the rebelliousness of youth may have their roots in culture or in developmental psychology, yes, but their truth is also demographic: Between a whole country and the families it comprises, stability can occur at one level or the other, but never both.

Categories nestle inside one another, tumor inside person inside family inside nation. They nestle not as Russian dolls, a regress of replicas, but rather layered like rock and soil, each layer composing the world differently. Whether we see equality or divergence, stasis or change, depends in part on the level at which we look. Length-biased sampling is the logic that links the levels into solid ground, and the tunnel that lets us walk between them.