Misclassification bias in statistical learning

In what way can we reduce misclassification bias in statistical learning so that we obtain more accurate classifier-based statistics?
There are two conflicting developments that affect the field of official statistics. On the one hand, there is an increasing demand for the swift availability of detailed and highly accurate statistical information. The current craving for accurate information about excess deaths due to COVID-19 is a striking example. On the other hand, national statistical institutes (NSIs) that produce official statistics on such topics have to endure budget cuts and are obliged to reduce the survey burden on companies and citizens. The consequence of these two conflicting developments is that NSIs will have to rely increasingly on new types of data (i.e., big data) that must be processed and analysed by new types of methods (viz. statistical learning methods).

This thesis focuses on a specific group of statistical learning methods, namely classifiers. When the output of a classifier is aggregated, one obtains classifier-based statistics. If a classifier is not perfect, the resulting classifier-based statistics suffer from misclassification bias. To correct for that bias, a test set containing perfect information on the true classifications is required. A key challenge is selecting a correction method, in particular when dealing with time series that are non-stationary (i.e., that suffer from concept drift). This raises an open problem in the literature: no solid theoretical analyses of methods correcting for misclassification bias in finite populations exist. Hence, the problem statement is formulated as follows: In what way can we reduce misclassification bias in statistical learning so that we obtain more accurate classifier-based statistics?
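To make the notion of misclassification bias concrete, the following minimal sketch considers a binary classifier whose predicted positives are simply counted. It is an illustration under stated assumptions, not necessarily one of the correction methods analysed in the thesis: the symbols alpha (true proportion of positives), p11 (sensitivity) and p00 (specificity) are introduced here for illustration, and p11 and p00 are treated as known, whereas in practice they would have to be estimated from the test set. The correction shown is the standard inversion of the misclassification probabilities for the binary case.

```python
def expected_classify_and_count(alpha: float, p11: float, p00: float) -> float:
    """Expected proportion of predicted positives.

    alpha: true proportion of positives (assumed symbol)
    p11:   sensitivity, P(predicted 1 | true 1) (assumed symbol)
    p00:   specificity, P(predicted 0 | true 0) (assumed symbol)
    """
    # True positives that are recognised, plus true negatives wrongly flagged.
    return alpha * p11 + (1 - alpha) * (1 - p00)


def misclassification_bias(alpha: float, p11: float, p00: float) -> float:
    """Bias of the uncorrected estimate: its expectation minus the truth."""
    return expected_classify_and_count(alpha, p11, p00) - alpha


def corrected_proportion(observed: float, p11: float, p00: float) -> float:
    """Invert the misclassification probabilities to recover the proportion.

    Assumes p11 and p00 are known exactly; with estimated values the
    corrected figure carries extra variance that is ignored here.
    """
    denominator = p11 + p00 - 1
    if denominator == 0:
        raise ValueError("classifier is uninformative: p11 + p00 equals 1")
    return (observed - (1 - p00)) / denominator


if __name__ == "__main__":
    alpha, p11, p00 = 0.10, 0.90, 0.95
    observed = expected_classify_and_count(alpha, p11, p00)
    print("expected uncorrected estimate:", observed)
    print("bias:", misclassification_bias(alpha, p11, p00))
    print("corrected estimate:", corrected_proportion(observed, p11, p00))
```

Even with a sensitivity of 0.90 and a specificity of 0.95, the uncorrected proportion in this example overshoots a true value of 0.10 by roughly a third, because false positives from the large negative class outnumber the missed positives. The correction removes this bias only to the extent that p11 and p00 are accurate, which is why a test set with true classifications is needed.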

The conclusion of this thesis is that statistical learning methods can be used in the field of official statistics as long as misclassification bias is adequately corrected for. Our recommendation is to implement statistical learning methods (and the correction methods for misclassification bias discussed in this thesis) either to create new official statistics or to improve existing ones. Finally, we argue that domain experts are of vital importance to the successful implementation of statistical learning methods within official statistics.

Meertens, Q. A. (2021). Misclassification bias in statistical learning. Dissertation, University of Amsterdam, handle:11245.1/4b031bbd-5a46-4181-b0f1-52b38a3b63a6