Imputation (statistics)

In statistics, imputation is the substitution of some value for a missing data point or a missing component of a data point. Once all missing values have been imputed, the dataset can be analysed using standard techniques for complete data. Ideally, however, the analysis should take into account that imputed values carry more uncertainty than values that were actually observed, which generally requires some modification of the standard complete-data analysis methods. Many imputation techniques are available.

A once-common method of imputation was hot-deck imputation, in which a missing value was imputed from a randomly selected similar record. The term "hot deck" dates back to the storage of data on punched cards, and indicates that the information donors come from the same dataset as the recipients. The stack of cards was "hot" because it was currently being processed.
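The idea of drawing a donor at random from similar records in the same dataset can be sketched as follows. This is only an illustration: the function name `hot_deck_impute`, the record layout, and the rule that "similar" means agreeing exactly on the `match_on` fields are assumptions made for the example, not part of any standard definition.

```python
import random


def hot_deck_impute(records, target, match_on, rng=None):
    """Return copies of `records` with missing `target` values filled in.

    Each missing value is copied from a randomly chosen donor record in
    the same dataset that agrees with the recipient on every field in
    `match_on` (an assumed, deliberately simple notion of similarity).
    """
    rng = rng or random.Random()
    completed = []
    for rec in records:
        filled = dict(rec)  # never modify the caller's records
        if rec[target] is None:
            donors = [
                d[target]
                for d in records
                if d[target] is not None
                and all(d[k] == rec[k] for k in match_on)
            ]
            if donors:  # leave the value missing if no donor matches
                filled[target] = rng.choice(donors)
        completed.append(filled)
    return completed


# Hypothetical survey data: the last respondent did not report income.
survey = [
    {"region": "north", "income": 41000},
    {"region": "north", "income": 39000},
    {"region": "south", "income": 52000},
    {"region": "north", "income": None},
]
hot_deck_result = hot_deck_impute(survey, "income", ["region"], random.Random(0))
print(hot_deck_result[-1])
```

The imputed income is always one of the two values observed in the "north" region, so every imputed value is one that actually occurs in the data.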

Cold-deck imputation, by contrast, selects donors from another dataset. Because computing power has advanced rapidly and punched cards are no longer used, more sophisticated methods of imputation, such as nearest neighbour hot-deck imputation and the approximate Bayesian bootstrap, have generally superseded the original random and sorted hot-deck techniques.
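As a rough illustration of nearest neighbour hot-deck imputation, the sketch below takes the donor to be the complete record closest to the recipient on a set of auxiliary variables. The squared Euclidean distance, the function name `nearest_neighbour_impute`, and the example fields are assumptions chosen for brevity. Practical implementations typically scale the variables and handle ties more carefully.

```python
def nearest_neighbour_impute(records, target, aux):
    """Fill missing `target` values from the closest complete record.

    Closeness is the squared Euclidean distance over the auxiliary
    fields in `aux` (an assumption; other distances are possible).
    """
    donors = [r for r in records if r[target] is not None]
    completed = []
    for rec in records:
        filled = dict(rec)
        if rec[target] is None and donors:
            nearest = min(
                donors,
                key=lambda d: sum((d[k] - rec[k]) ** 2 for k in aux),
            )
            filled[target] = nearest[target]
        completed.append(filled)
    return completed


# Hypothetical records: the third person's income is missing.
people = [
    {"age": 25, "hours": 40, "income": 30000},
    {"age": 47, "hours": 38, "income": 58000},
    {"age": 45, "hours": 40, "income": None},
]
nn_result = nearest_neighbour_impute(people, "income", ["age", "hours"])
print(nn_result[-1]["income"])  # 58000, copied from the 47-year-old donor
```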

Because standard analysis techniques do not reflect the additional uncertainty introduced by imputing missing data, further adjustments (such as multiple imputation or a Rao–Shao correction) are necessary.
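To show how multiple imputation can carry the extra uncertainty into the final result, the sketch below pools estimates from several independently imputed datasets using the combining formulas usually called Rubin's rules. These formulas are not given in the text above and are brought in here as an assumption. The function name `pool_estimates` and the numbers are purely illustrative.

```python
from statistics import mean, variance


def pool_estimates(estimates, variances):
    """Pool results from m >= 2 imputed datasets.

    estimates: the point estimate obtained from each completed dataset
    variances: the squared standard error from each completed dataset
    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)        # average complete-data variance
    between = variance(estimates)   # spread across imputations (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Illustrative numbers from three hypothetical imputed datasets.
pooled_estimate, total_variance = pool_estimates(
    [10.0, 12.0, 11.0], [4.0, 4.0, 4.0]
)
print(pooled_estimate, total_variance)
```

The total variance exceeds the average within-imputation variance whenever the estimates differ across imputations. That difference is the additional uncertainty that a single imputed dataset, analysed with standard methods, would not show.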

Alternatives to imputing missing data

Imputation is not the only method available for handling missing data. It usually gives better results than listwise deletion (in which all subjects with any missing values are omitted from the analysis) and may be competitive with a maximum likelihood approach in many circumstances. The expectation-maximization algorithm is a method for finding maximum likelihood estimates that has been widely applied to missing data problems. Other successful methods include computational intelligence methods.[1]
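For comparison, listwise deletion as described above can be sketched in a few lines. The helper name `listwise_delete` and the example records are hypothetical, and `None` is assumed to mark a missing value.

```python
def listwise_delete(records):
    """Keep only the records that have no missing (None) values."""
    return [r for r in records if all(v is not None for v in r.values())]


# Hypothetical subjects: two of the four have one missing field each.
subjects = [
    {"age": 34, "weight": 70.5, "income": 41000},
    {"age": 51, "weight": None, "income": 52000},
    {"age": 29, "weight": 64.0, "income": None},
    {"age": 45, "weight": 82.3, "income": 39000},
]
complete_cases = listwise_delete(subjects)
print(len(subjects), len(complete_cases))  # 4 2
```

In this example half of the subjects are dropped even though each of them is missing only one of three fields, so the observed values in those records are discarded along with the missing ones.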

In machine learning, it is sometimes possible to train a classifier directly on the original data without imputing it first. This has been reported to yield better performance when the missing data are structurally absent rather than missing because of measurement noise.[citation needed]

References

  1. T. Marwala (2009). Computational Intelligence for Missing Data Imputation, Estimation, and Management: Knowledge Optimization Techniques. Information Science Reference. ISBN 978-1-60566-336-4.