General General

Hub gene identification and prognostic model construction for isocitrate dehydrogenase mutation in glioma.

In Translational oncology

Our study aimed to identify hub genes related to isocitrate dehydrogenase (IDH) mutation in glioma and to develop a prognostic model for IDH-mutant glioma patients. First, ten hub genes significantly associated with IDH status were identified by weighted gene co-expression network analysis (WGCNA). Functional enrichment analysis showed that the most enriched terms for these hub genes were cadherin binding and glutathione metabolism. Three of the hub genes were significantly linked with the survival of glioma patients. A total of 328 IDH-mutant glioma samples were split into a training set (N = 228) and a test set (N = 100). In the training set, consensus clustering identified two IDH-mutant subtypes with significantly different pathological features. A 31-gene signature was identified by the least absolute shrinkage and selection operator (LASSO) algorithm and used to establish a differential prognostic model for IDH-mutant patients. The test set was then used to validate the model, which proved to be of high value in stratifying samples by prognosis. Functional annotation revealed that the genes related to the model were mainly enriched in nuclear division, DNA replication, and the cell cycle. Collectively, this study provides novel insights into the molecular mechanism of IDH mutation in glioma and constructs a prognostic model that can be effective for predicting the prognosis of glioma patients with IDH mutation. These results might promote the development of IDH-targeted agents for glioma therapy and contribute to accurate prognostication and management of IDH-mutant glioma patients.
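To illustrate how a LASSO penalty can reduce a large gene panel to a small signature, here is a minimal sketch of lasso regression solved by cyclic coordinate descent. It is not the authors' pipeline: the abstract does not say which LASSO variant was used (prognostic studies often use a LASSO-penalised Cox model), so this sketch substitutes ordinary least squares for brevity. The function names, the toy matrix, and the penalty value are all assumptions chosen for illustration.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; anything within +/-penalty becomes 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_iter=200):
    """Minimise (1/2n) * ||y - Xw||^2 + penalty * ||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # correlation of feature j with the partial residual
            norm = 0.0  # squared norm of feature j
            for i in range(n):
                # Prediction from every feature except j.
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                norm += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, penalty) / (norm / n) if norm else 0.0
    return w


# Toy data (hypothetical): three "genes", only the first has a strong effect.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [3.1, 2.9, -2.9, -3.1]  # equals 3 * gene0 + 0.1 * gene1
weights = lasso_coordinate_descent(X, y, penalty=0.5)
selected = [j for j, w_j in enumerate(weights) if w_j != 0.0]
```

In this toy example the penalty removes the weak and the irrelevant feature entirely, which is the property that makes LASSO useful for deriving a compact gene signature.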

Jia Yanfei, Yang Wenzhen, Tang Bo, Feng Qian, Dong Zhiqiang


Glioma, IDH mutation, Machine learning, Prognosis, WGCNA

General General

Are Classification Deep Neural Networks Good for Blind Image Watermarking?

In Entropy (Basel, Switzerland)

Image watermarking is usually decomposed into three steps: (i) a feature vector is extracted from an image; (ii) it is modified to embed the watermark; and (iii) it is projected back into the image space while avoiding the creation of visual artefacts. The feature extraction is usually based on a classical image representation such as the Discrete Wavelet Transform or the Discrete Cosine Transform. These transformations require very accurate synchronisation between embedding and detection and usually rely on various registration mechanisms for that purpose. This paper investigates a new family of transformations based on Deep Neural Networks trained with supervision for a classification task. The motivation comes from the Computer Vision literature, which has demonstrated the robustness of these features against light geometric distortions. In addition, the adversarial sample literature provides means to implement the inverse transform needed in the third step mentioned above. As far as zero-bit watermarking is concerned, this paper shows that the approach is feasible, as it yields good quality of the watermarked images and an intrinsic robustness. We also test more advanced tools from Computer Vision, such as aggregation schemes with weak geometry and retraining with a dataset augmented with classical image processing attacks.
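In zero-bit watermarking the detector only decides whether a mark is present. One common decision rule tests whether the extracted feature vector lies inside a cone around a secret direction. The sketch below shows that rule on plain lists of numbers. Whether the paper uses exactly this detector is an assumption, and the function names, the secret direction, and the 20-degree half-angle are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_watermarked(features, secret_direction, max_angle_deg):
    """True when features fall inside the cone of half-angle max_angle_deg
    around secret_direction."""
    return cosine(features, secret_direction) >= math.cos(math.radians(max_angle_deg))


secret = [1.0, 0.0, 0.0]                                  # hypothetical secret key direction
marked = is_watermarked([0.9, 0.1, 0.0], secret, 20.0)    # close to the secret direction
unmarked = is_watermarked([0.2, 1.0, 0.3], secret, 20.0)  # far from it
```

Under this rule, embedding amounts to moving the feature vector into the cone, and robustness depends on how far distortions can push it back out.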

Vukotić Vedran, Chappelier Vivien, Furon Teddy


Deep Learning, digital watermarking, feature extraction

General General

Machine learning model for predicting malaria using clinical information.

In Computers in biology and medicine

BACKGROUND : Rapid diagnosis is crucial for controlling malaria. Various studies have aimed at developing machine learning models to diagnose malaria using blood smear images; however, this approach has many limitations. This study developed a machine learning model for malaria diagnosis using patient information.

METHODS : To construct datasets, we extracted patient information from PubMed abstracts published from 1956 to 2019. We used two datasets: a dataset of parasitic diseases only, and a total dataset that also includes information about other diseases. We compared six machine learning models: support vector machine, random forest (RF), multilayered perceptron, AdaBoost, gradient boosting (GB), and CatBoost. In addition, the synthetic minority oversampling technique (SMOTE) was employed to address the data imbalance problem.
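The core idea of SMOTE can be sketched in a few lines: each synthetic minority sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration rather than the study's code. It assumes numeric features (categorical patient attributes such as nationality would need encoding or a categorical variant), and the function name, parameters, and toy points are hypothetical.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base (squared Euclidean distance),
        # excluding base itself.
        neighbours = sorted(
            (point for point in minority if point is not base),
            key=lambda point: sum((a - b) ** 2 for a, b in zip(base, point)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority class (hypothetical): four points on the unit square.
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority_points, n_new=5)
```

Oversampling should be applied to the training split only, so that synthetic points never leak into the data used for evaluation.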

RESULTS : On the parasitic-disease-only dataset, RF was the best model with or without SMOTE. On the total dataset, GB was the best model; however, after applying SMOTE, RF performed best. With the imbalanced data, nationality was the most important feature in malaria prediction. With the data balanced by SMOTE, the most important feature was symptom.

CONCLUSIONS : The results demonstrated that machine learning techniques can be successfully applied to predict malaria using patient information.

Lee You Won, Choi Jae Woo, Shin Eun-Hee


Case reports, Diagnosis, Machine learning, Malaria, Patient information

General General

Utilizing a multi-class classification approach to detect therapeutic and recreational misuse of opioids on Twitter.

In Computers in biology and medicine

BACKGROUND : Opioid misuse (OM) is a major health problem in the United States, and can lead to addiction and fatal overdose. We sought to employ natural language processing (NLP) and machine learning to categorize Twitter chatter based on the motive of OM.

MATERIALS AND METHODS : We collected data from Twitter using opioid-related keywords, and manually annotated 6988 tweets into three classes: No-OM, Pain-related-OM, and Recreational-OM. The No-OM class represents tweets indicating no use/misuse, and the Pain-related-OM and Recreational-OM classes represent misuse for pain or for recreation/addiction, respectively. We trained and evaluated multi-class classifiers, and performed term-level k-means clustering to assess whether there were terms closely associated with the three classes.

RESULTS : On a held-out test set of 1677 tweets, a transformer-based classifier (XLNet) achieved the best performance, with an F1-score of 0.71 for the Pain-related-OM class and 0.79 for the Recreational-OM class. Macro- and micro-averaged F1-scores over all classes were 0.82 and 0.92, respectively. Content analysis using clustering revealed distinct clusters of terms associated with each class.
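The gap between the macro- and micro-averaged F1-scores is what one would expect when one class is much larger than the others: macro-averaging weights every class equally, while micro-averaging pools all decisions and is therefore dominated by the majority class. The sketch below computes both on a toy label set. The labels and predictions are invented for illustration and are not the study's data.

```python
def f1_scores(y_true, y_pred, labels):
    """Return per-class F1, macro-averaged F1, and micro-averaged F1."""
    per_class = {}
    tp_total = fp_total = fn_total = 0
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class[label] = 2 * tp / denom if denom else 0.0
        tp_total += tp
        fp_total += fp
        fn_total += fn
    macro = sum(per_class.values()) / len(labels)  # every class counts equally
    denom = 2 * tp_total + fp_total + fn_total
    micro = 2 * tp_total / denom if denom else 0.0  # every decision counts equally
    return per_class, macro, micro


# Toy example (hypothetical): a dominant "no" class and two rare classes.
y_true = ["no"] * 8 + ["pain", "rec"]
y_pred = ["no"] * 8 + ["no", "rec"]  # the single "pain" tweet is missed
per_class, macro_f1, micro_f1 = f1_scores(y_true, y_pred, ["no", "pain", "rec"])
```

Here one missed minority-class tweet pulls the macro average down sharply while barely affecting the micro average.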

DISCUSSION : While some past studies have attempted to automatically detect opioid misuse, none have further characterized the motive for misuse. Our multi-class classification approach using XLNet showed promising performance, including in detecting the subtle differences between pain-related and recreation-related misuse. The distinct clustering of class-specific keywords may help conduct targeted data collection, overcoming under-representation of minority classes.

CONCLUSION : Machine learning can help identify pain-related and recreation-related OM content on Twitter, potentially enabling the study of the characteristics of individuals exhibiting such behavior.

Fodeh Samah Jamal, Al-Garadi Mohammed, Elsankary Osama, Perrone Jeanmarie, Becker William, Sarker Abeed


Classification, Deep learning, Opioid abuse, Opioid misuse, Pain, Recreational opioid misuse, Twitter, Word2Vec

General General

Language models are an effective representation learning technique for electronic health record data.

In Journal of biomedical informatics ; h5-index 55.0

Widespread adoption of electronic health records (EHRs) has fueled the use of machine learning to build prediction models for various clinical outcomes. However, this process is often constrained by the relatively small number of patient records available for training the model. We demonstrate that patient representation schemes inspired by techniques in natural language processing can increase the accuracy of clinical prediction models by transferring information learned from the entire patient population to the task of training a specific model, where only a subset of the population is relevant. Such patient representation schemes enable a 3.5% mean improvement in AUROC on five prediction tasks compared to standard baselines, with the average improvement rising to 19% when only a small number of patient records are available for training the clinical prediction model.
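For readers less familiar with the metric, AUROC can be read as the probability that a randomly chosen positive patient receives a higher risk score than a randomly chosen negative one, with ties counted as half. The sketch below computes it directly from that definition. The labels and scores are invented for illustration, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0   # positive correctly ranked above negative
            elif pos_score == neg_score:
                wins += 0.5   # a tie counts as half
    return wins / (len(positives) * len(negatives))


# Toy example (hypothetical): one of the four positive/negative pairs is mis-ranked.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auroc = auroc(toy_labels, toy_scores)
```

Because AUROC depends only on the ranking of scores, it allows prediction models with differently calibrated outputs to be compared on the same task.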

Steinberg Ethan, Jung Ken, Fries Jason A, Corbin Conor K, Pfohl Stephen R, Shah Nigam H


Electronic health record, Machine learning, Representation learning, Risk stratification, Transfer learning

General General

Annotating social determinants of health using active learning, and characterizing determinants using neural event extraction.

In Journal of biomedical informatics ; h5-index 55.0

Social determinants of health (SDOH) affect health outcomes, and knowledge of SDOH can inform clinical decision-making. Automatically extracting SDOH information from clinical text requires data-driven information extraction models trained on annotated corpora that are heterogeneous and frequently include critical SDOH. This work presents a new corpus with SDOH annotations, a novel active learning framework, and the first extraction results on the new corpus. The Social History Annotation Corpus (SHAC) includes 4,480 social history sections with detailed annotation for 12 SDOH characterizing the status, extent, and temporal information of 18K distinct events. We introduce a novel active learning framework that selects samples for annotation using a surrogate text classification task as a proxy for a more complex event extraction task. The active learning framework successfully increases the frequency of health risk factors and improves automatic extraction of these events over undirected annotation. An event extraction model trained on SHAC achieves high extraction performance for substance use status (0.82-0.93 F1), employment status (0.81-0.86 F1), and living status type (0.81-0.93 F1) on data from three institutions.

Lybarger Kevin, Ostendorf Mari, Yetisgen Meliha


Active learning, Machine learning, Natural language processing, Social determinants of health