Meng, X.-L. (2018). Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election. The Annals of Applied Statistics, 12(2), 685–726. doi:10.1214/18-aoas1161sf
Meng gives a fundamental identity for the difference between \bar G_N, the population mean of some quantity G in a population of size N, and the mean \bar G_n of a sample of size n, however that sample was selected:
\bar G_n - \bar G_N = \rho_{R,G} \times \sqrt{\frac{1 - f}{f}} \times \sigma_G,
where \rho_{R,G} is the population correlation between sampling response (R = 0 or 1 for each potential respondent, depending on whether they respond to the survey) and G, f = n/N, and \sigma_G is the standard deviation of G. Hence the error factors into three pieces that Meng calls “Data Quality”, “Data Quantity”, and “Problem Difficulty”. Meng demonstrates that data quality matters most: under genuine random sampling \rho_{R,G} shrinks like 1/\sqrt{N}, so even a small but systematic correlation produces huge error, relative to a random sample of the same size, when the population is large.
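A useful consequence (a quick derivation of my own, dropping the finite-population correction in the benchmark): square the identity and average over the response mechanism R to get the MSE of \bar G_n, then set that equal to the variance \sigma_G^2 / n_{\text{eff}} of a simple random sample. This gives Meng’s effective sample size

n_{\text{eff}} \approx \frac{f}{1 - f} \times \frac{1}{E[\rho_{R,G}^2]},

the size of the random sample that would achieve the same MSE.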
For example, if the correlation between response and outcome is just \rho_{R,G} = 0.05, then sampling half the population (f = 1/2) gives an MSE for \bar G_n equivalent to that of a random sample of just n_{\text{eff}} = 1/0.05^2 = 400 people. For a poll of the roughly 231.6 million eligible US voters, that is a 99.99965% reduction in sample size or, equivalently, in estimation efficiency.
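A minimal numeric check of this example in Python (the function names are mine, and the eligible-voter count is approximate):

```python
import math

def meng_error(rho, f, sigma):
    """Error of the sample mean under Meng's identity:
    data quality x data quantity x problem difficulty."""
    return rho * math.sqrt((1 - f) / f) * sigma

def effective_sample_size(rho, f):
    """Size of a simple random sample with the same MSE,
    ignoring the finite-population correction."""
    return (f / (1 - f)) / rho ** 2

N = 231_557_000   # eligible US voters in 2016, roughly
f = 0.5           # "sample" half the population
rho = 0.05        # correlation between response and outcome
sigma = 0.5       # SD of a 50/50 binary outcome

print(f"nominal n:   {f * N:,.0f}")                          # 115,778,500
print(f"effective n: {effective_sample_size(rho, f):,.0f}")  # 400
print(f"reduction:   {1 - effective_sample_size(rho, f) / (f * N):.5%}")  # 99.99965%
print(f"bias:        {meng_error(rho, f, sigma):+.3f}")      # +0.025, i.e. 2.5 points
```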
Note, however, that this identity is specifically for point estimation of means; it is not as useful if we are interested in estimating other quantities (such as relative comparisons between subgroups), or if we want to compare the quality of different surveys. See this preprint for details.
Because of my involvement with the COVID-19 Trends and Impact Survey (CTIS), a large-scale survey collecting responses across the United States daily for over two years, I’m interested in understanding how survey biases vary across space and time, and perhaps how they can be corrected.
Gelman, A., Goel, S., Rivers, D., & Rothschild, D. (2016). The mythical swing voter. Quarterly Journal of Political Science, 11(1), 103–130. doi:10.1561/100.00015031
Shows that in conventional political tracking polls, apparent swings in voting intentions – such as those after presidential debates – can be explained by swings in sample composition (supporters of the candidate doing badly become temporarily less likely to respond) rather than by actual changes in population voting intentions. Typical survey weighting does not remove these swings, because it adjusts for demographics but not for partisanship; weighting on party identification largely eliminates them.
Bisbee, J. (2019). BARP: Improving Mister P using Bayesian additive regression trees. American Political Science Review, 113(4), 1060–1065. doi:10.1017/s0003055419000480
Replaces the multilevel regression model in MRP (multilevel regression and poststratification) with Bayesian additive regression trees (BART), reducing the need to hand-specify the regression. As a poststratification approach, this makes sense for a survey like CTIS, which has the large sample size (and rich set of demographic variables) needed to support such a flexible poststratification model.
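For intuition, here is a minimal sketch of the poststratification step with a flexible outcome model. This is not BARP itself: I substitute scikit-learn’s gradient boosting (another sum-of-trees model) for BART, and the column names, codings, and cell counts are all made up.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical survey microdata: coded demographics plus a binary outcome.
survey = pd.DataFrame({
    "age_group": [0, 1, 2, 3, 0, 2, 1, 3],
    "sex":       [0, 0, 1, 1, 1, 0, 1, 0],
    "education": [0, 1, 2, 1, 0, 2, 1, 0],
    "outcome":   [0, 1, 1, 0, 1, 1, 0, 1],
})

# Hypothetical poststratification frame: one row per demographic cell,
# with each cell's population count from the census.
cells = pd.DataFrame({
    "age_group": [0, 1, 2, 3],
    "sex":       [0, 1, 0, 1],
    "education": [1, 1, 2, 0],
    "N":         [1200, 950, 800, 1500],
})

features = ["age_group", "sex", "education"]

# Flexible sum-of-trees model standing in for BART: outcome given demographics.
model = GradientBoostingClassifier().fit(survey[features], survey["outcome"])

# Predict each cell's outcome rate, then average weighted by cell population.
cells["p_hat"] = model.predict_proba(cells[features])[:, 1]
estimate = (cells["p_hat"] * cells["N"]).sum() / cells["N"].sum()
print(f"poststratified estimate: {estimate:.3f}")
```

Real BARP fits a Bayesian model (so each cell prediction comes with uncertainty) and treats the predictors as categorical, but the predict-per-cell-then-weight step is the same.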