ArXiv Preprint
Missing data is common in datasets from many areas, such as medicine, sports,
and finance. To enable reliable analyses of such data, the missing values are
often imputed, and the imputation method should yield a low root mean square
error (RMSE) between the imputed and the true values. In addition, some
critical applications require that the logic behind the imputation be
explainable, which is especially difficult for complex methods such as those
based on deep learning. This motivates us to introduce a conditional
Distribution-based Imputation of Missing Values (DIMV) algorithm. DIMV finds
the conditional distribution of a feature with missing entries given the fully
observed features. As will be illustrated in the
paper, DIMV (i) gives a low RMSE for the imputed values relative to the
state-of-the-art methods under comparison; (ii) is explainable; (iii) can
provide an approximate confidence region for the missing values in a given
sample; (iv) works for both small- and large-scale data; (v) in many scenarios,
does not require as many parameters as deep learning approaches and can
therefore be used on mobile devices or in web browsers; and (vi) is robust to
violations of the normality assumption on which its theory relies. In
addition to DIMV, we introduce the DPER* algorithm, which speeds up DPER in
estimating the mean and covariance matrix from the data, and we confirm the
speed-up experimentally.
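
The idea of imputing from a conditional distribution can be illustrated with
a minimal sketch. This is not the authors' DIMV implementation: it assumes the
rows are approximately multivariate normal, that a mean vector and covariance
matrix are already available (for example from an estimator such as DPER), and
it conditions each row's missing entries on whatever entries are observed in
that row. The function name conditional_gaussian_impute and the toy numbers are
hypothetical and chosen only for illustration.

```python
import numpy as np


def conditional_gaussian_impute(X, mean, cov):
    """Fill NaNs with the Gaussian conditional mean given each row's observed entries.

    Returns the imputed array and an array of conditional variances
    (NaN wherever the entry was observed). Assumes `mean` and `cov`
    describe a multivariate normal model for the rows of X.
    """
    X = np.asarray(X, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    imputed = X.copy()
    cond_var = np.full(X.shape, np.nan)

    for i, row in enumerate(X):
        miss = np.isnan(row)
        if not miss.any():
            continue  # nothing to impute in this row
        obs = ~miss
        if not obs.any():
            # Nothing to condition on: fall back to the marginal distribution.
            imputed[i, miss] = mean[miss]
            cond_var[i, miss] = np.diag(cov)[miss]
            continue

        s_oo = cov[np.ix_(obs, obs)]    # covariance among observed features
        s_mo = cov[np.ix_(miss, obs)]   # cross-covariance missing vs. observed
        s_mm = cov[np.ix_(miss, miss)]  # covariance among missing features

        # w = s_mo @ inv(s_oo), computed with a linear solve instead of an inverse.
        w = np.linalg.solve(s_oo, s_mo.T).T

        # Conditional mean and conditional (co)variance of the missing block.
        imputed[i, miss] = mean[miss] + w @ (row[obs] - mean[obs])
        cond_var[i, miss] = np.diag(s_mm - w @ s_mo.T)

    return imputed, cond_var


# Toy example: two features with correlation 0.8, one missing entry.
mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
X = np.array([[1.0, np.nan], [0.5, 0.2]])
filled, var = conditional_gaussian_impute(X, mean, cov)
```

In this toy case the missing entry is imputed as 0.8 with conditional variance
0.36. Under the normality assumption, the conditional variance is what would
give an approximate interval around an imputed value, for instance the imputed
value plus or minus 1.96 times the conditional standard deviation.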
Mai Anh Vu, Thu Nguyen, Tu T. Do, Nhan Phan, Pål Halvorsen, Michael A. Riegler, Binh T. Nguyen
2023-02-02