Another interesting discussion from Daniel Kahneman’s book “Thinking, Fast and Slow” concerns the so-called ‘law of small numbers’. I like this real-life example, as it’s a great demonstration of why confidence intervals are so important in statistics.
Kahneman cites a study finding that the counties in the US with the lowest incidence of kidney cancer are typically rural and sparsely populated. It’s easy to conclude that the low incidence of cancer is due to the ‘clean living’ rural lifestyle, which is lower stress than city living and exposes the residents of these regions to fewer carcinogenic compounds.
However, the same study also shows that the counties with the highest incidence of kidney cancer are typically rural and sparsely populated. One could conclude that this is down to factors like poverty and poor education which are more prevalent in rural areas.
This apparent paradox is simply down to the difference in sample sizes between sparsely and densely populated regions. As the sample size grows, the sample mean tends towards the mean of the general population. Hence, densely populated counties, with their large sample sizes, will typically show an incidence of kidney cancer somewhere near the mean for the entire US population. In contrast, when the sample size is small, the sample mean varies much more from one sample to the next. Hence, the sparsely populated counties are more likely to show an incidence of cancer that is markedly different from that of the general population. That is,
“small samples yield extreme results more often than large samples do”
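You can see this effect directly with a quick simulation. The sketch below uses made-up numbers (a 1% “incidence” rate, county populations of 50 and 5,000) purely for illustration, not the actual US kidney cancer figures. Every county is drawn from the same underlying rate, yet the small counties still produce both the highest and the lowest observed rates:

```python
import random

random.seed(42)

TRUE_RATE = 0.01       # hypothetical true incidence, identical for every county
SMALL, LARGE = 50, 5000  # hypothetical county populations
TRIALS = 1000            # number of simulated counties of each size

def observed_rate(population):
    """Simulate one county and return its observed incidence rate."""
    cases = sum(1 for _ in range(population) if random.random() < TRUE_RATE)
    return cases / population

small_rates = [observed_rate(SMALL) for _ in range(TRIALS)]
large_rates = [observed_rate(LARGE) for _ in range(TRIALS)]

# The small counties span a far wider range of observed rates,
# even though the true rate is the same everywhere.
print(f"small counties: min={min(small_rates):.4f}, max={max(small_rates):.4f}")
print(f"large counties: min={min(large_rates):.4f}, max={max(large_rates):.4f}")
```

Running this, the small counties include some with zero observed cases and some with several times the true rate, while the large counties all cluster close to 1% — exactly the pattern in the kidney cancer data, with no lifestyle explanation required.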
It can often be easy to construct a plausible explanation for why a particular sample shows an unusually high (or low) incidence of whatever you’re measuring, and to base decisions or policy on that explanation. But first we must stop and consider how confident we are in what the data is telling us.