In Environment International; h5-index 0.0
Systematic reviews involve mining literature databases to identify relevant studies. Identifying potentially relevant studies can be informed by computational tools that compare text similarity between candidate studies and selected key (i.e., seed) references.
Challenge: Using computational approaches to identify relevant studies for risk assessments is challenging, as these assessments examine multiple chemical effects across lifestages (e.g., human health risk assessments) or specific effects of multiple chemicals (e.g., cumulative risk). The broad scope of potentially relevant literature can make selection of seed references difficult.
Approach: We developed a generalized computational scoping strategy to identify human health relevant studies for multiple chemicals and multiple effects. We used semi-supervised machine learning to prioritize studies for manual review, with training data derived from references cited in the hazard identification sections of several US EPA Integrated Risk Information System (IRIS) assessments. These generic training data, or seed studies, were clustered with the unclassified corpus to group studies based on text similarity. Clusters containing a high proportion of seed studies were prioritized for manual review. Chemical names were removed from seed studies prior to clustering, resulting in a generic, chemical-independent method for identifying potentially human health relevant studies. To test the recall of this novel literature searching strategy, we developed a case study focused on identifying the array of chemicals that have been studied with respect to in utero exposure. We then evaluated the general strategy of using generic, chemical-independent training data with two previous IRIS assessments by comparing studies predicted relevant to those used in the assessments (i.e., total relevant).
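The cluster prioritization step described under Approach can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the whole-word matching rule for chemical names, and the 0.2 seed-fraction threshold are all assumptions made for illustration, and the clustering itself is taken as given (any algorithm that assigns each document a cluster label would do).

```python
import re
from collections import defaultdict


def strip_chemical_names(text, chemical_names):
    """Remove the listed chemical names (case-insensitive, whole words).

    Hypothetical helper: the abstract only says names were removed,
    not how the matching was done.
    """
    for name in chemical_names:
        pattern = r"\b" + re.escape(name) + r"\b"
        text = re.sub(pattern, " ", text, flags=re.IGNORECASE)
    return " ".join(text.split())  # collapse leftover whitespace


def prioritize_clusters(cluster_of, seed_ids, min_seed_fraction=0.2):
    """Pick clusters with a high proportion of seed studies.

    cluster_of: dict mapping document id -> cluster label, covering seed
                studies and unclassified candidates clustered together.
    seed_ids:   set of document ids that are seed studies.
    min_seed_fraction: assumed cutoff; the abstract gives no value.

    Returns the prioritized cluster labels and the sorted candidate
    (non-seed) ids inside them, i.e. the studies to review manually.
    """
    members = defaultdict(list)
    for doc_id, label in cluster_of.items():
        members[label].append(doc_id)

    prioritized = set()
    for label, docs in members.items():
        seed_fraction = sum(d in seed_ids for d in docs) / len(docs)
        if seed_fraction >= min_seed_fraction:
            prioritized.add(label)

    to_review = sorted(
        d for d, label in cluster_of.items()
        if label in prioritized and d not in seed_ids
    )
    return prioritized, to_review
```

In this sketch a cluster is prioritized purely on its seed fraction, so candidates that land in seed-poor clusters are never surfaced for review; the cutoff therefore trades reviewer workload against recall.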
Outcome: A keyword search designed to retrieve studies that examined the in utero effects of environmental chemicals identified over 54,000 candidate references. Clustering algorithms were applied using 1456 studies from multiple IRIS assessments, with chemical names removed, as training data or seeds (i.e., semi-supervised learning). Using a six-algorithm ensemble approach, 2602 articles, or approximately 5% of candidate references, were "voted" relevant by four or more clustering algorithms, and manual review confirmed that nearly 50% of these studies were relevant. Further evaluations on two IRIS assessments, using a nine-algorithm ensemble approach and a set of generic, chemical-independent, externally derived seed studies, correctly identified 77-83% of hazard identification studies published in the assessments and eliminated the need to manually screen more than 75% of search results on average.
Limitations: The chemical-independent approach used to build the training literature set provides a broad and unbiased picture across a variety of endpoints and environmental exposures but does not systematically identify all available data. Variance between actual and predicted relevant studies will be greater because of the external and non-random origin of seed study selection. This approach depends on access to readily available generic training data that can be used to locate relevant references in an unclassified corpus.
Impact: A generic approach to identifying human health relevant studies could be an important first step in literature evaluation for risk assessments. This initial scoping approach could facilitate faster literature evaluation by focusing reviewer efforts, and could potentially minimize reviewer bias in the selection of key studies. Using externally derived training data is particularly applicable to databases with very low search precision, where identifying training data may be cost-prohibitive.
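The ensemble "voting" rule and the two evaluation quantities reported under Outcome (share of relevant studies recovered, share of search results that need no manual screening) can be sketched as follows. This is an illustrative reading of the abstract, not the authors' code: the function names are hypothetical, and each algorithm's output is assumed to be a set of document ids it flagged as relevant.

```python
from collections import Counter


def ensemble_vote(predictions, min_votes):
    """Return ids flagged relevant by at least `min_votes` algorithms.

    predictions: iterable with one collection of document ids per
                 clustering algorithm (assumed representation).
    """
    votes = Counter()
    for predicted in predictions:
        votes.update(set(predicted))  # one vote per algorithm per document
    return {doc_id for doc_id, n in votes.items() if n >= min_votes}


def recall(predicted_relevant, truly_relevant):
    """Fraction of truly relevant studies that were predicted relevant."""
    if not truly_relevant:
        return 0.0
    return len(predicted_relevant & truly_relevant) / len(truly_relevant)


def screening_reduction(n_predicted, n_candidates):
    """Fraction of candidate references that need no manual screening."""
    return 1 - n_predicted / n_candidates


# Toy example: six algorithms, a document needs four or more votes.
predictions = [
    {"a", "b"}, {"a", "b", "c"}, {"a"},
    {"a", "b"}, {"b", "d"}, {"a"},
]
voted = ensemble_vote(predictions, min_votes=4)
```

Raising `min_votes` shrinks the set sent to manual review at the likely cost of recall; the four-of-six rule in the text is one point on that trade-off.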
Cawley Michelle, Beardslee Renee, Beverly Brandy, Hotchkiss Andrew, Kirrane Ellen, Sams Reeder, Varghese Arun, Wignall Jessica, Cowden John