two concepts about sampling that were tricky for me

Jun 28, 2017

Here are a couple of related points about statistics that took me a long time to grasp, and which really improved my intuitive understanding of statistical arguments.

1. It's not the sample size, it's the sampling mechanism.

Well, OK. It's somewhat the sample size, obviously. My point is that most people who encounter a study's methodology are much more likely to remark on the sample size - and pronounce it too small - than to remark on the sampling mechanism. I can't tell you how often I've seen studies with an n = 100 dismissed by commenters online as too small to take seriously. Depending on the design of the study, and the variables being evaluated, 100 can be a good-enough sample size. In fact, under certain circumstances (medical testing of rare conditions, say) an n of 30 is sufficient to draw some conclusions about populations.

We can't say with 100% accuracy what a population's average for a given trait is when we use inferential statistics. (We actually can't say that with 100% accuracy even when taking a census, but that's another discussion.) But we can say with a chosen level of confidence that the average lies in a particular range, which can often be quite small, and from which we can make predictions of remarkable accuracy - provided the sampling mechanism was adequately random. By random, we mean that every member of the population has an equal chance of being selected for the sample. If there are factors that make one group more or less likely to be selected for the sample, that is statistical bias (as opposed to the random error that comes from sampling itself).

It's important to understand the declining influence of sample size in reducing statistical error as sample size grows. Because calculating confidence intervals and margins of error involves placing the n under a square root sign, each additional observation buys less precision than the one before it: the margin of error shrinks in proportion to 1/√n, so you have to quadruple the sample size just to cut it in half. Here's the formula for margin of error:

Z* σ/√(n)

where Z is a Z-value that you look up in a chart for a given confidence level (often 95% or 99%), σ is the standard deviation, and n is your number of observations. You see two clear things here: first, spread (standard deviation) is super important to how confident we can be about the accuracy of an average. (Report spread when reporting an average!) Second, we get declining improvements to accuracy as we increase sample size. That means that after a point, adding hundreds more observations gets you less power than you got from adding 10 at lower ns. Given the resources involved in data collection, this can make expanding sample size a low-value proposition.
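To make the diminishing returns concrete, here is a minimal Python sketch of the formula above. The standard deviation of 15 and the particular sample sizes are assumptions picked purely for illustration, and 1.96 is the usual approximate Z-value for 95% confidence.

```python
import math

def margin_of_error(z, sigma, n):
    """Margin of error for a sample mean: z * sigma / sqrt(n)."""
    return z * sigma / math.sqrt(n)

Z_95 = 1.96    # approximate Z-value for a 95% confidence level
SIGMA = 15.0   # assumed standard deviation (illustrative only)

for n in (10, 20, 100, 400, 1000, 1100):
    print(f"n = {n:>5}: margin of error = {margin_of_error(Z_95, SIGMA, n):.3f}")

# How much precision do extra observations buy at different starting points?
gain_small = margin_of_error(Z_95, SIGMA, 10) - margin_of_error(Z_95, SIGMA, 20)
gain_large = margin_of_error(Z_95, SIGMA, 1000) - margin_of_error(Z_95, SIGMA, 1100)
print(f"Adding 10 observations at n = 10 shrinks the margin by {gain_small:.3f}")
print(f"Adding 100 observations at n = 1000 shrinks it by {gain_large:.3f}")
```

Under these assumed numbers, going from 100 to 400 observations only halves the margin of error, and the 100 extra observations at n = 1000 buy far less than the 10 extra at n = 10.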

Now compare a rigorously controlled study with an n = 30 which was drawn with a random sampling mechanism to, say, those surveys that ESPN.com used to run all the time. Those very often get sample sizes in the hundreds of thousands. But the sampling mechanism is a nightmare. They're voluntary response instruments that are biased in any number of ways: underrepresenting people without internet access, people who aren't interested in sports, people who go to SI.com instead of ESPN.com, on and on. The value of the 30 person instrument is far higher than that of the ESPN.com data. The sampling mechanism makes the sample size irrelevant.
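A toy simulation can show the same thing. Everything in this sketch is made up for illustration: a population of 200,000 where 40% would answer "yes" to some question, and a voluntary-response poll in which "yes" people are assumed to be three times as likely to respond as "no" people. It is not a model of any real ESPN.com survey, just a sketch of how a huge self-selected sample can miss by more than a tiny random one.

```python
import random

random.seed(0)

POP_SIZE = 200_000
TRUE_SHARE = 0.40  # assumed true share of "yes" answers in the population

# 1 = would answer "yes", 0 = would answer "no"
population = [1 if random.random() < TRUE_SHARE else 0 for _ in range(POP_SIZE)]
true_mean = sum(population) / POP_SIZE

# Voluntary response: the chance of responding depends on the answer itself
# (assumed rates, chosen only to create a self-selection effect).
RESPONSE_RATE = {1: 0.60, 0: 0.20}
volunteers = [x for x in population if random.random() < RESPONSE_RATE[x]]
biased_estimate = sum(volunteers) / len(volunteers)
biased_error = abs(biased_estimate - true_mean)

# Simple random samples of 30, repeated many times to see the typical miss
TRIALS = 1000
random_errors = []
for _ in range(TRIALS):
    sample = random.sample(population, 30)
    random_errors.append(abs(sum(sample) / 30 - true_mean))
typical_random_error = sum(random_errors) / TRIALS

print(f"true share:                       {true_mean:.3f}")
print(f"voluntary poll (n = {len(volunteers)}): {biased_estimate:.3f}")
print(f"error of the voluntary poll:      {biased_error:.3f}")
print(f"average error of random n = 30:   {typical_random_error:.3f}")
```

The voluntary poll collects tens of thousands of responses and still lands far from the true share, because more responses only make it more precisely wrong. The random samples of 30 are noisy, but their misses are smaller on average and are not pushed in one direction.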

Sample size does matter, but in common discussions of statistics, its importance is misunderstood, and the value of increasing sample size declines as n grows.

2. For any reasonable definition of a sample, population size relative to sample size is irrelevant for the statistical precision of findings.

A 1,000 person sample, if drawn with some sort of rigorous random sampling mechanism, is exactly as descriptive and predictive when drawn randomly from the ~570,000 person population of Wyoming as it is when drawn randomly from the ~315 million person population of the United States. (If intended as samples of people in Wyoming and people in the United States respectively, of course.)

I have found this one very hard to wrap my mind around, but it's the case. The standard formulas for margin of error, confidence intervals, and the like do not involve any reference to the size of the total population. You can think about it this way: each time you draw another observation at random from some population, the odds of your sample being unlike the population go down regardless of the size of that population. The mistake lies in thinking that the point of increasing sample size is to bring it closer in proportion to the population. In reality, the point is just to increase the number of attempts in order to reduce the possibility that previous attempts produced statistically unlikely results. Even if you had an infinite population, every additional random draw from that population would decrease the chance that you're pulling an unrepresentative sample.
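If you'd rather see this than take the formula's word for it, here is a rough simulation sketch. The two populations (20,000 and 1,000,000 people) are scaled-down stand-ins rather than the real Wyoming and U.S. figures, and the trait's mean of 100 and standard deviation of 15 are arbitrary assumptions. The idea is to draw many random samples of 500 from each population and compare how much the sample averages bounce around.

```python
import random
import statistics

random.seed(1)

def spread_of_sample_means(population, n, trials):
    """Standard deviation of the sample mean across repeated random samples."""
    means = [statistics.fmean(random.sample(population, n)) for _ in range(trials)]
    return statistics.stdev(means)

SIGMA = 15.0  # assumed standard deviation of the trait (illustrative)
small_pop = [random.gauss(100, SIGMA) for _ in range(20_000)]
large_pop = [random.gauss(100, SIGMA) for _ in range(1_000_000)]

N, TRIALS = 500, 500
spread_small = spread_of_sample_means(small_pop, N, TRIALS)
spread_large = spread_of_sample_means(large_pop, N, TRIALS)
theory = SIGMA / N ** 0.5  # sigma / sqrt(n), with no population size anywhere

print(f"population of    20,000: spread of sample means = {spread_small:.3f}")
print(f"population of 1,000,000: spread of sample means = {spread_large:.3f}")
print(f"sigma / sqrt(n):                                  {theory:.3f}")
```

The two spreads should come out close to each other and close to σ/√n, even though one population is fifty times the size of the other. Any small difference between them is mostly simulation noise.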

The essential caveat lies in "for any reasonable definition of a sample." Yes, testing 900 out of a population of 1,000 is more accurate than testing 900 out of a population of 1,000,000. But nobody would ever call 90% of a population a sample. You see different thresholds for where a sample begins and ends; some people say that anything larger than 1/100th of the total population is no longer a sample, but it varies. The point holds: when we're dealing with real-world samples, where the population we care about is vastly larger than any reasonable sample size, the population size is irrelevant to the calculation of error in our statistical inferences. This one is quite counterintuitive and took me a long time to really grasp.
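For the curious, the usual textbook adjustment for the 900-out-of-1,000 situation is the finite population correction, which multiplies the standard error by √((N − n)/(N − 1)), where N is the population size and n is the sample size. That formula isn't part of the margin of error formula quoted earlier; I'm bringing it in here only as a sketch of how little population size matters once the sample is a small fraction of the population.

```python
import math

def finite_population_correction(n, population_size):
    """Multiplier applied to the standard error when sampling without
    replacement from a finite population: sqrt((N - n) / (N - 1))."""
    return math.sqrt((population_size - n) / (population_size - 1))

cases = [
    (900, 1_000),          # "sample" that is 90% of the population
    (900, 1_000_000),      # same n, much larger population
    (1_000, 570_000),      # roughly Wyoming
    (1_000, 315_000_000),  # roughly the United States
]

for n, big_n in cases:
    fpc = finite_population_correction(n, big_n)
    print(f"n = {n:>5}, N = {big_n:>11,}: multiply the error by {fpc:.6f}")
```

A multiplier near 1 means the population size makes essentially no difference. That is the case for both the Wyoming-sized and U.S.-sized populations, and it only drops well below 1 when the "sample" swallows most of the population.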

Freddie deBoer
