
General

Predictive and interpretable models via the stacked elastic net.

In Bioinformatics (Oxford, England)

MOTIVATION : Machine learning in the biomedical sciences should ideally provide predictive and interpretable models. When predicting outcomes from clinical or molecular features, applied researchers often want to know which features have effects, whether these effects are positive or negative, and how strong these effects are. Regression analysis includes this information in the coefficients but typically renders less predictive models than more advanced machine learning techniques.

RESULTS : Here we propose an interpretable meta-learning approach for high-dimensional regression. The elastic net provides a compromise between estimating weak effects for many features and strong effects for some features. It has a mixing parameter to weight between ridge and lasso regularisation. Instead of selecting one weighting by tuning, we combine multiple weightings by stacking. We do this in a way that increases predictivity without sacrificing interpretability.
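The stacking idea can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not the starnet implementation (which is an R package): several elastic nets with different ridge-lasso mixing weights are fitted, and their out-of-fold predictions are combined by a non-negative meta-learner so the ensemble remains a sign-preserving combination of interpretable base models. The data, mixing grid, and meta-learner are choices made for the example.

```python
# Sketch of stacking elastic nets across ridge-lasso mixing weights.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV, LinearRegression
from sklearn.model_selection import cross_val_predict

X, y = make_regression(n_samples=100, n_features=50, noise=5.0, random_state=0)

mixings = [0.05, 0.25, 0.5, 0.75, 0.95]        # l1_ratio: near-ridge ... near-lasso
base_models = [ElasticNetCV(l1_ratio=r, cv=5) for r in mixings]

# Out-of-fold predictions of each base learner become the meta-features.
Z = np.column_stack([cross_val_predict(m, X, y, cv=5) for m in base_models])

# Non-negative least squares as the stacking layer: the ensemble prediction is
# a non-negative combination of elastic nets, which preserves coefficient signs.
meta = LinearRegression(positive=True).fit(Z, y)
for m in base_models:
    m.fit(X, y)

def predict(X_new):
    """Combine the refitted base models with the learned stacking weights."""
    Z_new = np.column_stack([m.predict(X_new) for m in base_models])
    return meta.predict(Z_new)
```

Using out-of-fold rather than in-sample predictions for the stacking layer avoids rewarding overfitted base learners, which is the standard precaution in stacked generalization.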

AVAILABILITY AND IMPLEMENTATION : The R package starnet is available on GitHub: https://github.com/rauschenberger/starnet.

SUPPLEMENTARY INFORMATION : Supplementary data are available at Bioinformatics online.

Rauschenberger Armin, Glaab Enrico, van de Wiel Mark

2020-May-21

General

Optimizing an eDNA protocol for estuarine environments: Balancing sensitivity, cost and time.

In PloS one ; h5-index 176.0

Environmental DNA (eDNA) analysis has gained traction as a precise and cost-effective method for species and waterways management. To date, publications on eDNA protocol optimization have focused primarily on DNA yield. As a result, it has not been possible to weigh the cost and speed of specific components of an eDNA protocol, such as the water filtration and DNA extraction methods, when designing or choosing one, even though these two parameters are essential for the experimental design of a project. Here we evaluate and rank 27 different eDNA protocols in the context of Chinook salmon (Oncorhynchus tshawytscha) eDNA detection in an estuarine environment, presenting a comprehensive evaluation of multiple protocol parameters that balances time, cost, and DNA yield. We collected samples composed of 500 mL of estuarine water from Deverton Slough (38°11'16.7"N 121°58'34.5"W) and 500 mL of tank water containing 1.3 juvenile Chinook salmon per liter. We then compared extraction methods, filter types, and use of an inhibitor removal kit with respect to DNA yield, processing time, and protocol cost. Lastly, we used an MCMC algorithm together with machine learning to understand the DNA yield of each step of the protocol as well as the interactions between those steps. Glass fiber filtration was found to be the most resilient to high turbidity, filtering the samples in 2.32 ± 0.08 min, compared with 14.16 ± 1.86 min and 6.72 ± 1.99 min for nitrocellulose and paper filter N1, respectively. The filtration DNA yield percentages for paper filter N1, glass fiber, and nitrocellulose were 0.00045 ± 0.00013, 0.00107 ± 0.00013, and 0.00172 ± 0.00013, respectively. The DNA extraction yield percentages for Qiagen, dipstick, NaOH, magnetic beads, and direct dipstick ranged from 0.047 ± 0.0388 to 0.475 ± 0.0357.
For estuarine waters, which are challenging for eDNA studies due to high turbidity, variable salinity, and the presence of PCR inhibitors, we found that a protocol combining glass fiber filters, magnetic beads, and an extra PCR inhibitor removal step best balances time, cost, and yield. In addition, we provide a generalized decision tree for determining the optimal eDNA protocol for other studies in aquatic systems. Our findings should be applicable to most aquatic environments and provide a clear guide for determining which eDNA protocol to use under different study constraints.

Sanches Thiago M, Schreier Andrea D

2020

General

Mangrove forest classification and aboveground biomass estimation using an atom search algorithm and adaptive neuro-fuzzy inference system.

In PloS one ; h5-index 176.0

BACKGROUND : Advances in earth observation and machine learning techniques have created new options for forest monitoring, primarily because of the various possibilities that they provide for classifying forest cover and estimating aboveground biomass (AGB).

METHODS : This study introduces a novel model that incorporates atom search optimization (ASO) and an adaptive neuro-fuzzy inference system (ANFIS) for mangrove forest classification and AGB estimation. The Ca Mau coastal area was selected as a case study because it is considered the best-preserved mangrove forest area in Vietnam and is being investigated for the impacts of land-use change on forest quality. The model was trained and validated with Sentinel-1A imagery in VH and VV polarizations and multispectral information from a SPOT image. In addition, feature selection was carried out to choose the optimal combination of predictor variables. Model performance was benchmarked against conventional methods, namely support vector regression, multilayer perceptron, random subspace, and random forest, using three statistical indicators: root mean square error (RMSE), mean absolute error (MAE), and coefficient of determination (R2).

RESULTS : The results showed that all three indicators of the proposed model were statistically better than those of the benchmarked methods. Specifically, the hybrid model achieved RMSE = 70.882, MAE = 55.458, and R2 = 0.577 for AGB estimation.
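The three benchmark indicators are standard and easy to reproduce for any prediction vector. A minimal sketch with scikit-learn follows; the AGB values here are made up for illustration, not the study's data.

```python
# Computing the three benchmarking indicators: RMSE, MAE, and R2.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([120.0, 95.0, 180.0, 140.0, 60.0])   # observed AGB (illustrative)
y_pred = np.array([110.0, 100.0, 170.0, 150.0, 70.0])  # model estimates (illustrative)

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors more
mae = mean_absolute_error(y_true, y_pred)           # average absolute deviation
r2 = r2_score(y_true, y_pred)                       # variance explained vs. the mean
print(rmse, mae, r2)
```

Reporting all three together is informative because RMSE and MAE diverge when the error distribution has heavy tails, while R2 situates both against a predict-the-mean baseline.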

CONCLUSION : From the experiments, such hybrid integration can be recommended for use as an alternative solution for biomass estimation. In a broader context, the fast growth of metaheuristic search algorithms has created new scientifically sound solutions for better analysis of forest cover.

Pham Minh Hai, Do Thi Hoai, Pham Van-Manh, Bui Quang-Thanh

2020

General

Drought index prediction using advanced fuzzy logic model: Regional case study over Kumaon in India.

In PloS one ; h5-index 176.0

A new version of the fuzzy logic model, the co-active neuro-fuzzy inference system (CANFIS), is introduced for predicting the standardized precipitation index (SPI). Multi-scale drought information from six meteorological stations located in Uttarakhand State, India, is used. SPI at multiple time scales, including 1, 3, 6, 9, 12, and 24 months, was computed for prediction, with inputs abstracted by autocorrelation function (ACF) and partial-ACF (PACF) analysis at the 5% significance level. The proposed CANFIS model was validated against two models: a classical artificial intelligence model, the multilayer perceptron neural network (MLPNN), and a regression model, multiple linear regression (MLR). Several performance evaluation metrics (root mean square error, Nash-Sutcliffe efficiency, coefficient of correlation, and Willmott index) and graphical visualizations (scatter plots and Taylor diagrams) were computed to evaluate model performance. Results indicated that the CANFIS model predicted the SPI better than the other models, with prediction results differing across meteorological stations. The proposed model can underpin a reliable expert intelligent system for predicting meteorological drought at multiple time scales, support decision making on remedial schemes at the study stations, and help maintain sustainable water resources management.

Malik Anurag, Kumar Anil, Salih Sinan Q, Kim Sungwon, Kim Nam Won, Yaseen Zaher Mundher, Singh Vijay P

2020

General

Innovation in Chinese internet companies: A meta-frontier analysis.

In PloS one ; h5-index 176.0

The innovation of a particular company benefits the whole industry when its technology transfers to others. Similarly, the development and innovation of internet companies influence the development and innovation of the industry as a whole. This investigation applies meta-frontier analysis to estimate and analyze innovation in internet companies in China. A unique dataset of Chinese internet companies from 2000 to 2017 is used to estimate and compare innovation over the study period. Changes in the technology gap ratio (TGR) and shifts in the production function are translated into innovation, an aspect overlooked by previous studies. We find that the production function of internet companies is moving upward in the presence of external factors such as the invention of smartphones, mobile internet, mobile payments, and artificial intelligence. Consequently, a sudden increase in the TGR is captured due to the innovation of some companies. Hence, the average technical efficiency (TE) of the industry falls, caused by the increased distance of other companies from the industry production function. However, the innovation advantage diffuses when other companies start imitating, and the average TE rises again. A steady increase in the TGR index reveals that the continuous innovation-based growth of some companies lifts the production frontier upward, providing the opportunity for other companies to imitate and sustaining continuous growth in the industry. This study provides a novel methodological approach to measuring innovation as well as practical implications through an empirical estimation of innovation in Chinese internet companies.
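The meta-frontier decomposition behind the TGR reduces to a simple identity: a firm's technical efficiency against the meta-frontier equals its efficiency against its group frontier times the technology gap ratio. A toy numeric illustration (all values made up) follows.

```python
# Meta-frontier identity: TE_meta = TE_group * TGR.
y_observed = 60.0        # a firm's actual output
y_group_frontier = 80.0  # best attainable output within the firm's group
y_meta_frontier = 100.0  # best attainable output across all groups

te_group = y_observed / y_group_frontier   # efficiency within the group: 0.75
tgr = y_group_frontier / y_meta_frontier   # group frontier vs. meta-frontier: 0.8
te_meta = y_observed / y_meta_frontier     # efficiency against the meta-frontier: 0.6

# The decomposition holds by construction:
print(abs(te_meta - te_group * tgr) < 1e-12)
```

In this framing, innovation shows up as the meta-frontier moving upward: even a firm that stays fully efficient within its group sees its TGR fall until it adopts the new technology.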

Hafeez Sadaf, Arshad Noreen Izza, Rahim Lukman Bin A B, Shabbir Muhammad Farooq, Iqbal Jawad

2020

General

Machine learning provides evidence that stroke risk is not linear: The non-linear Framingham stroke risk score.

In PloS one ; h5-index 176.0

Current stroke risk assessment tools presume that the impact of risk factors is linear and cumulative. However, both novel risk factors and their interplay in stroke incidence are difficult to reveal with traditional additive models. The goal of this study was to improve upon the established Revised Framingham Stroke Risk Score by designing an interactive Non-Linear Stroke Risk Score. Leveraging machine learning algorithms, our work aimed to increase the accuracy of event prediction and to uncover new relationships in an interpretable fashion. A two-phase approach was used to create the stroke risk prediction score. First, clinical examinations of the Framingham offspring cohort were utilized as the training dataset for the predictive model. Optimal Classification Trees were used to develop a tree-based model predicting 10-year risk of stroke. Unlike classical methods, this algorithm adaptively chooses the splits on the independent variables, introducing non-linear interactions among them. Second, the model was validated with a multi-ethnic cohort from the Boston Medical Center. Our stroke risk score suggests a key dichotomy between patients with a history of cardiovascular disease and the rest of the population. While it agrees with known findings, it also identified 23 unique stroke risk profiles and highlighted new non-linear relationships, such as the role of T-wave abnormality on electrocardiography and of hematocrit levels in a patient's risk profile. Our results suggest that the non-linear approach significantly improves upon the baseline c-statistic (training 87.43% (CI 0.85-0.90) vs. 73.74% (CI 0.70-0.76); validation 75.29% (CI 0.74-0.76) vs. 65.93% (CI 0.64-0.67)), even in multi-ethnic populations. The clinical implications of the new risk score include prioritization of risk factor modification and personalized care at the patient level, with improved targeting of interventions for stroke prevention.
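Optimal Classification Trees are a proprietary method, so the sketch below substitutes a standard CART tree to illustrate the general comparison: a non-linear tree model versus an additive (logistic) baseline, each scored by the c-statistic, which for binary outcomes equals the area under the ROC curve. The data are synthetic, not the Framingham or Boston Medical Center cohorts.

```python
# Comparing a non-linear tree model to an additive baseline via the c-statistic.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

linear = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)   # additive baseline
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

# c-statistic = area under the ROC curve of the predicted risk probabilities
c_linear = roc_auc_score(y_te, linear.predict_proba(X_te)[:, 1])
c_tree = roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1])
print(c_linear, c_tree)
```

A shallow tree keeps the risk profiles readable as explicit if-then paths, which mirrors the interpretability goal of the study, though on synthetic data neither model is guaranteed to dominate.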

Orfanoudaki Agni, Chesley Emma, Cadisch Christian, Stein Barry, Nouh Amre, Alberts Mark J, Bertsimas Dimitris

2020