In Methods of Information in Medicine
BACKGROUND : Public health emergencies leave little time to develop novel surveillance efforts. Understanding which preexisting clinical datasets are fit for surveillance use is of high value. Covid-19 offers a natural applied informatics experiment for understanding the fitness of clinical datasets for use in disease surveillance.
OBJECTIVES : This study evaluates the agreement among legacy surveillance time series and assesses their relative fitness for use in understanding the severity of the Covid-19 emergency in the United States. Here, fitness for use means the statistical agreement between events across series.
METHODS : Thirteen weekly clinical event series from before and during the Covid-19 era for the United States were collected and integrated into a (multi) time series event data model. The Centers for Disease Control and Prevention (CDC) Covid-19 attributable mortality, CDC's excess mortality model, national Emergency Medical Services (EMS) calls, and Medicare encounter-level claims were the data sources considered in this study. Cases were indexed by week from January 2015 through June 2021 and fit to distributed random forest models. For each series of interest, the models returned the variable importance of the remaining time series when used to predict it.
RESULTS : Model r² statistics ranged from 0.78 to 0.99 for the share of the volumes predicted correctly. Prehospital EMS data was of high value, and cardiac arrest prior to EMS arrival was on average the best predictor (tied with study week). Covid-19 Medicare claims volumes can predict Covid-19 death certificates (agreement), while generic viral respiratory Medicare claim volumes cannot predict Medicare Covid-19 claims (disagreement).
CONCLUSIONS : Prehospital EMS data should be considered when evaluating the severity of Covid-19 because prehospital cardiac arrest known to EMS was the strongest predictor on average across indices.
Key Words: Random Forest, Covid-19, Public Health, Statistical Methods, Syndromic Surveillance
1. Introduction
Creating long-term, multi-source, national surveillance data services for emerging disease response is a complex topic to which Covid-19 has given new importance [1-5]. Public health emergencies seldom leave surplus time or resources to stand up novel methods while responding, which makes (disease-specific) preparedness all the more essential [6-8]. More often than not, epidemic response is managed using preexisting data services, often legacy data series from earlier epidemics [9-11]. Epidemic preparedness in the United States is generally weak, and the Covid-19 response is largely drawn from preexisting pandemic influenza emergency plans [12,13]. During a public health emergency, the clinical knowledge needed to respond is developed by case surveillance drawn from preexisting data series. Covid-19 has presented an unusual opportunity to evaluate agreement across surveillance efforts within the United States. The ability to detect clinical findings in meaningful ways, using surveillance nets and epidemiological methods that were not necessarily designed to detect them, is a high priority for the future management of emerging infectious diseases. Strikingly, the difference in Covid-19 mortality between SARS-impacted countries (China, South Korea, Australia) and the United States may come down to which emergency response plan was last implemented (SARS vs. Swine Flu) and the fitness of surveillance (case-specific vs. general population), rather than to deeper cultural, economic, or racial differences, as have been proposed in popular media [14-20].
2. Objectives
In this study, public health surveillance data are processed using a machine learning approach to discover how well each surveillance event series agrees with the others when one series is predicted from the rest. Specifically, this study seeks to assess the agreement between event series and to contrast the value of traditional surveillance methods (death certificates, influenza and respiratory infection claims volumes) with non-traditional sources such as national Emergency Medical Services (EMS) call volume data in the Covid-19 era in the United States.
3. Methods
3.1 Statistic of Interest
Variable importance is the statistic of interest in this study. Variable importance means that, when predicting the dependent variable, an independent variable with comparatively higher predictive value (association) than another has higher (predictive) use value. With weekly event series data, a series with high variable importance, that is, one that helps the machine learning model learn, predict, or guess the dependent weekly event series correctly, could reflect co-occurring or mutually observed events. High variable importance scores for series from different sources suggest that those series are observing the same real-world event across surveillance efforts, because they support prediction better than noise and better than the other candidate series (other independent variables). Of special interest are independent variables with high variable importance that come from a different data source than the dependent variable. High-importance variables from the same source are most likely high in value because they are distributed across study weeks similarly to their parent or sister series, and so are not necessarily interesting. A series of events can be said to have 'agreement value' if it has high statistical agreement with other series from a different source. Low statistical agreement suggests 'out of era' events, or events that are not driven by the same causes as the other series considered here.
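As an illustration only, the notion of variable importance can be sketched with a very small random forest written in plain Python. This is not the distributed random forest implementation used in the study. It assumes one common definition of importance (the total reduction in squared error credited to each splitting variable), and the source does not state which definition its models report. The series names, the synthetic weekly data, and all parameter values below are hypothetical.

```python
import math
import random

def sse(ys):
    """Sum of squared errors around the mean of ys (0.0 for an empty list)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(rows, ys, features):
    """Return (gain, feature, threshold) of the best single split, or None."""
    parent, best = sse(ys), None
    for f in features:
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, ys) if r[f] <= t]
            right = [y for r, y in zip(rows, ys) if r[f] > t]
            gain = parent - sse(left) - sse(right)
            if best is None or gain > best[0]:
                best = (gain, f, t)
    return best

def grow(rows, ys, features, depth, rng, importance):
    """Grow one tree, adding each split's error reduction to its feature."""
    if depth == 0 or len(ys) < 4:
        return
    # Each node only sees a random subset of the candidate series.
    candidates = rng.sample(features, max(1, len(features) * 2 // 3))
    split = best_split(rows, ys, candidates)
    if split is None or split[0] <= 0:
        return
    gain, f, t = split
    importance[f] += gain
    for keep_left in (True, False):
        part = [(r, y) for r, y in zip(rows, ys) if (r[f] <= t) == keep_left]
        grow([r for r, _ in part], [y for _, y in part],
             features, depth - 1, rng, importance)

def variable_importance(data, target, n_trees=30, depth=2, seed=0):
    """Normalized importance of every other series for predicting target."""
    rng = random.Random(seed)
    features = [name for name in data[0] if name != target]
    importance = dict.fromkeys(features, 0.0)
    for _ in range(n_trees):
        sample = [rng.choice(data) for _ in data]  # bootstrap the study weeks
        grow(sample, [row[target] for row in sample],
             features, depth, rng, importance)
    total = sum(importance.values()) or 1.0
    return {f: v / total for f, v in importance.items()}

# Synthetic weekly data (hypothetical): three series share one latent wave,
# while the influenza series is independent noise.
rng = random.Random(42)
data = []
for week in range(80):
    wave = 50 + 40 * math.sin(week / 6)
    data.append({
        "covid_deaths": 0.5 * wave + rng.gauss(0, 2),
        "ems_cardiac_arrest": wave + rng.gauss(0, 3),
        "covid_claims": 3 * wave + rng.gauss(0, 30),
        "flu_claims": rng.gauss(100, 20),
    })

scores = variable_importance(data, "covid_deaths")
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.2f}")
```

In this toy setup the two series that share the latent wave with the dependent series receive most of the importance, while the unrelated series receives little. Calling `variable_importance` once per series, with each series in turn as the target, mirrors the idea of predicting every series of interest from the remaining ones.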
To examine noise and disagreement, influenza and respiratory infection claims volumes are considered below alongside Covid-19 claims volumes. Claims volumes are traditionally used in influenza surveillance. As a test of the efficacy of the models described here, Covid-19 volumes should 'outperform' influenza volumes, as the Covid-19 era is largely understood to be influenza sparse. In this way, respiratory and influenza events can be understood as a control arm as well as a model output of independent interest.