
In Journal of Medical Internet Research; h5-index 88.0

BACKGROUND: Assessment of the quality of medical evidence available online is a critical step in the systematic review of clinical evidence. Existing tools that automate parts of this task validate the quality of individual studies but not of entire bodies of evidence, and they focus on a restricted set of quality criteria.

OBJECTIVE: We propose a quality assessment task that consists of providing an overall quality rating for each outcome, as well as finer-grained justifications for the different quality criteria, following the GRADE (Grading of Recommendations Assessment, Development and Evaluation) framework. For this, we construct a new dataset and develop a machine-learning baseline system (EvidenceGRADEr). Our goal is to work towards evaluating the quality of a body of evidence (BoE) for a specific clinical question, rather than assessing the quality of individual primary studies.
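
To make the proposed task concrete, the sketch below models one labelled instance as a simple data structure; all class and field names are illustrative assumptions, not taken from the paper.

```python
# A minimal sketch of one labelled instance in the proposed task; the names
# here are illustrative only and are not taken from the paper.
from dataclasses import dataclass, field
from enum import Enum


class GradeLevel(Enum):
    HIGH = "high"
    MODERATE = "moderate"
    LOW = "low"
    VERY_LOW = "very low"


@dataclass
class BodyOfEvidence:
    # PICO criteria identifying the body of evidence (BoE)
    population: str
    intervention: str
    comparison: str
    outcome: str
    # Overall quality grade assigned to the outcome
    grade: GradeLevel
    # Binary flags for the GRADE criteria that justify the grade
    criteria: dict = field(default_factory=lambda: {
        "risk_of_bias": False,
        "imprecision": False,
        "inconsistency": False,
        "indirectness": False,
        "publication_bias": False,
    })
```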

METHODS: We algorithmically extracted quality-related data from all summaries of findings found in the Cochrane Database of Systematic Reviews (CDSR). Each BoE is defined by a set of PICO (population, intervention, comparison, outcome) criteria and is assigned a quality grade (high/moderate/low/very low) together with the quality criteria (justifications) that influenced that decision. We extracted statistical data, metadata about the review, and parts of the review text as supporting input for grading each BoE. After pruning the resulting dataset with various quality checks, we used it to train several variants of a feature-rich neural model. The predictions were compared against the labels originally assigned by the authors of the systematic reviews.
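
As a rough sketch of the evaluation protocol (10-fold cross-validation over labelled BoEs), the snippet below uses scikit-learn with placeholder features and a simple logistic regression classifier standing in for the authors' feature-rich neural model; it illustrates the protocol only, not the paper's implementation.

```python
# 10-fold cross-validation sketch with placeholder data; a logistic
# regression stands in for the paper's feature-rich neural model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))    # placeholder features (text, statistics, metadata)
y = rng.integers(0, 2, size=200)  # placeholder binary labels for one quality criterion

scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True,
                                           random_state=0).split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    p, r, f, _ = precision_recall_fscore_support(
        y[test_idx], clf.predict(X[test_idx]), average="binary", zero_division=0)
    scores.append((p, r, f))

print("mean P/R/F1 across folds:", np.round(np.mean(scores, axis=0), 2))
```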

RESULTS: Our quality assessment dataset, CDSR-QoE, contains 13,440 instances, or BoEs labelled for quality, originating from 2,252 systematic reviews published online between 2002 and 2020. Based on 10-fold cross-validation, the best neural binary classifiers for the quality criteria detect risk of bias at .78 F1 (P: .68, R: .92) and imprecision at .75 F1 (P: .66, R: .86), while performance on the inconsistency, indirectness, and publication bias criteria is lower (F1 in the range of .30-.40). Predicting the overall quality grade as one of the four levels yields .50 F1. When casting the task as a binary problem by merging the GRADE classes (high+moderate vs. low+very low quality evidence), we attain .74 F1. We also find that the results vary depending on what supporting information is provided as input to the models.
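
The criterion-level scores are internally consistent with the standard definition F1 = 2PR/(P+R); the quick check below uses only the precision and recall figures quoted above, and the merging map mirrors the high+moderate vs. low+very low binarisation (the dict form itself is just an illustration).

```python
# Sanity check: the reported F1 values follow from F1 = 2PR / (P + R),
# using the precision/recall figures quoted in the abstract.
def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r)

print(round(f1(0.68, 0.92), 2))  # risk of bias -> 0.78
print(round(f1(0.66, 0.86), 2))  # imprecision  -> 0.75

# Merging the four GRADE levels into a binary label as described above;
# the mapping shown is an illustrative assumption.
MERGE = {"high": 1, "moderate": 1, "low": 0, "very low": 0}
```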

CONCLUSIONS: Different factors affect the quality of evidence in the context of systematic reviews of medical evidence. Some of them (risk of bias and imprecision) can be assessed automatically with reasonable accuracy, while other quality dimensions, such as indirectness, inconsistency, and publication bias, prove more challenging for machine learning, largely because they are much rarer. This technology could substantially reduce reviewer workload in the future and expedite quality assessment as part of evidence synthesis.

Simon Suster, Timothy Baldwin, Jey Han Lau, Antonio Jimeno Yepes, David Martinez Iraola, Yulia Otmakhova, Karin Verspoor

2023-Jan-31