Sparse Data Bias

SPARSE DATA BIAS:

A bias in a statistical estimate that arises when a statistical method is applied to a dataset with too few data points (or none) in the categories being evaluated¹,². For example, it can occur when logistic regression is used to draw conclusions about the relationship between an exposure and a disease while the number of data points in some exposure, disease, and/or confounder categories is very small (or zero).
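As a rough illustration of how small counts distort an estimate, the Python sketch below computes the exact average of the sample log odds ratio for a simple two-group comparison, by enumerating every possible table rather than simulating. The function name, the group sizes, and the risks of 0.7 and 0.3 are hypothetical choices made for this example, and tables containing an empty cell are excluded because the sample odds ratio is undefined for them.

```python
from math import comb, log


def expected_log_odds_ratio(n, p_exposed, p_unexposed):
    """Exact mean of the sample log odds ratio for two groups of size n.

    Assumption of this sketch: tables with an empty cell are excluded,
    so the mean is conditional on all four cells being non-zero.
    """
    def mean_log_odds(p):
        # Average log(k / (n - k)) over Binomial(n, p), skipping k = 0 and k = n.
        total = 0.0
        weight = 0.0
        for k in range(1, n):
            prob = comb(n, k) * p**k * (1 - p) ** (n - k)
            total += prob * log(k / (n - k))
            weight += prob
        return total / weight

    # The two groups are independent, so the mean of the difference
    # is the difference of the two group means.
    return mean_log_odds(p_exposed) - mean_log_odds(p_unexposed)


# Hypothetical risks: 0.7 in the exposed group, 0.3 in the unexposed group.
true_log_or = log((0.7 / 0.3) / (0.3 / 0.7))

# Bias = average estimate minus the true value, for a small and a larger study.
bias_small = expected_log_odds_ratio(10, 0.7, 0.3) - true_log_or
bias_large = expected_log_odds_ratio(200, 0.7, 0.3) - true_log_or

print("bias with 10 per group: ", round(bias_small, 3))
print("bias with 200 per group:", round(bias_large, 3))
```

In this setup the average estimate lies above the true log odds ratio in both cases, and the gap is much wider with 10 people per group than with 200. This is only a two-group example; stratified or regression analyses are more involved.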

Sparse Data Bias exaggerates the apparent association between an exposure and a disease: a risk ratio (or other measure of association) is pushed further from the null value, whether above or below it¹. The threat of Sparse Data Bias is highest in research studies that use sophisticated statistical techniques to evaluate multiple levels of different variables at once (e.g. by stuffing regression models with many variables simultaneously). Sparse Data Bias may occur in these studies even if the dataset appears large overall, since the data become thinner (sparser) as the number of comparisons increases. For this reason, studies that use machine learning methods may also be at risk of this bias. As a rule of thumb, it has been suggested that a dataset should contain at least ten events/data points per variable in any given analysis to minimize Sparse Data Bias³, although the bias may still exist in some contexts with more than ten. Also see: Small Sample Bias, Finite Sample Bias, Small Study Bias, Wrong Sample Size Bias, and Significance Bias.
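The arithmetic behind a large dataset becoming sparse can be sketched in a few lines of Python. The study size, the number of events, the number of levels per variable, and both function names are hypothetical values chosen for illustration. The ten-per-variable threshold is the rule of thumb mentioned above, not a guarantee.

```python
from math import prod


def events_per_stratum(n_events, levels_per_variable):
    """Average number of events per stratum when the listed categorical
    variables are fully cross-classified (one stratum per level combination)."""
    n_strata = prod(levels_per_variable)
    return n_events / n_strata


def events_per_variable(n_events, n_model_terms):
    """Events per estimated model term, for comparison with the
    ten-per-variable rule of thumb."""
    return n_events / n_model_terms


# Hypothetical study: 10,000 participants, of whom 200 develop the disease.
n_events = 200

# Stratify on 1, 3, then 5 categorical variables with 4 levels each.
for n_vars in (1, 3, 5):
    average = events_per_stratum(n_events, [4] * n_vars)
    print(n_vars, "variable(s):", round(average, 2), "events per stratum on average")

# A regression model with 25 terms fitted to the same 200 events.
epv = events_per_variable(n_events, 25)
print("events per variable:", epv, "| below ten:", epv < 10)
```

With five four-level variables there are 1,024 strata, so the 200 events average well under one per stratum even though the study has 10,000 participants. An average also hides the fact that real data are unevenly spread, so many strata would contain no events at all.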

References:

1. Greenland S, Mansournia MA, Altman DG. Sparse data bias: a problem hiding in plain sight. BMJ. 2016;352:i1981.

2. Rothman KJ, Mosquin PL. Sparse-data bias accompanying overly fine stratification in an analysis of beryllium exposure and lung cancer risk. Ann Epidemiol. 2013;23(2):43-8.

3. Richardson DB, Cole SR, Ross RK, Poole C, Chu H, Keil AP. Meta-Analysis and Sparse-Data Bias. Am J Epidemiol. 2021;190(2):336-40.