In Accident; analysis and prevention
Estimates of how contributing factors affect crash injury severity, and predictions of injury severity outcomes, are often biased by missing data in crash datasets that contain incomplete records. Because both estimation and prediction would improve greatly if the missing values were recovered, this study proposes a sequential approach to handle incomplete crash datasets and rank the contributors to the injury severity of crashes on mountainous freeways in China. The sequential approach consists of two parts: (i) multivariate imputation by chained equations imputes the missing values of the independent variables; (ii) a random forest classifier analyses the correlation between the dependent and the independent variables. The first part selects the imputation method according to whether each independent variable is binary, categorical or continuous, whereas the second part classifies the correlations using the random forest classifier. The proposed method was applied to a case study of mountainous freeways in China and compared with an analysis of the raw dataset to evaluate its effectiveness. The results show that the method significantly improves classification accuracy compared with existing methods. Moreover, the classifier ranked the contributors to the injury severity of traffic crashes on mountainous freeways, in order of importance: vehicle type, crash type, road longitudinal gradient, crash cause, curve radius, and deflection angle. Interestingly, environmental factors were found to be of lower importance.
Li Linchao, Prato Carlo G, Wang Yonggang
Machine learning, Missing values, Mountainous roads, Multiple imputation, Traffic safety
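The first part of the sequential approach described in the abstract, imputation by chained equations, can be illustrated with a minimal sketch. It is not the authors' implementation: it handles only two continuous columns, uses deterministic linear-regression fills in a single chain, and omits both the random draws and the multiple completed datasets that full multiple imputation produces. The function names (`fit_line`, `chained_impute`) and the toy values are hypothetical and are not taken from the study's data. The second part, ranking contributors with a random forest classifier, would then be run on the completed table and is not shown here.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    return my - b * mx, b


def chained_impute(rows, n_iter=10):
    """Fill None entries in a table with two continuous columns.

    Assumption: a simplified, deterministic variant of chained equations.
    Each column is regressed in turn on the other column, and only the
    cells that were missing in the input are overwritten.
    """
    n_cols = 2
    missing = [[value is None for value in row] for row in rows]
    # Step 0: initialise every missing cell with its column's observed mean.
    col_means = [
        mean(row[j] for row in rows if row[j] is not None) for j in range(n_cols)
    ]
    data = [
        [col_means[j] if row[j] is None else row[j] for j in range(n_cols)]
        for row in rows
    ]
    for _ in range(n_iter):
        for target in range(n_cols):
            other = 1 - target
            # Fit only on rows where the target value was actually observed.
            observed = [i for i in range(len(data)) if not missing[i][target]]
            a, b = fit_line(
                [data[i][other] for i in observed],
                [data[i][target] for i in observed],
            )
            # Re-predict the cells that were missing in the original table.
            for i in range(len(data)):
                if missing[i][target]:
                    data[i][target] = a + b * data[i][other]
    return data


# Hypothetical toy table (not from the study): the second column equals
# 2 * first + 1, with one value removed from each column.
toy_rows = [(1, 3), (2, 5), (3, None), (None, 9), (5, 11)]
completed = chained_impute(toy_rows)
```

Because the toy columns are exactly linearly related, the chained regressions recover the removed values. With real crash records the fills would only approximate the missing entries, and binary or categorical variables would need a different model per column, for example logistic regression, as the abstract notes.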