
General General

Graph-based regularization for regression problems with alignment and highly-correlated designs.

In SIAM journal on mathematics of data science

Sparse models for high-dimensional linear regression and machine learning have received substantial attention over the past two decades. Model selection, or determining which features or covariates are the best explanatory variables, is critical to the interpretability of a learned model. Much of the current literature assumes that covariates are only mildly correlated. However, in many modern applications, covariates are highly correlated and do not exhibit key properties (such as the restricted eigenvalue condition, restricted isometry property, or other related assumptions). This work considers a high-dimensional regression setting in which a graph governs both correlations among the covariates and the similarity among regression coefficients, meaning that there is alignment between the covariates and regression coefficients. Using side information about the strength of correlations among features, we form a graph with edge weights corresponding to pairwise covariances. This graph is used to define a graph total variation regularizer that promotes similar weights for correlated features. This work shows how the proposed graph-based regularization yields mean-squared error guarantees for a broad range of covariance graph structures. These guarantees are optimal for many specific covariance graphs, including block and lattice graphs. Our proposed approach outperforms other methods for highly correlated designs in a variety of experiments on synthetic data and real biochemistry data.
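To make the regularizer concrete, here is a minimal sketch of a graph total variation penalty. It assumes the common form, a weighted sum of |beta_j - beta_k| over the edges of the graph, with nonnegative weights. The abstract does not give the exact definition, so the paper's version (for example its handling of negative covariances or its normalization) may differ. The function name and the toy weight matrix are illustrative only.

```python
import numpy as np

def graph_total_variation(beta, weights):
    """Sum of w[j, k] * |beta[j] - beta[k]| over each undirected edge j < k."""
    beta = np.asarray(beta, dtype=float)
    weights = np.asarray(weights, dtype=float)
    j, k = np.triu_indices(len(beta), k=1)  # each pair counted once
    return float(np.sum(weights[j, k] * np.abs(beta[j] - beta[k])))

# Toy block graph (assumed): features 0-1 and 2-3 are strongly correlated.
W = np.array([
    [0.0, 0.9, 0.0, 0.0],
    [0.9, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.8],
    [0.0, 0.0, 0.8, 0.0],
])
aligned = graph_total_variation([1.0, 1.0, -2.0, -2.0], W)    # equal within blocks
misaligned = graph_total_variation([1.0, -1.0, 2.0, 0.0], W)  # differs within blocks
```

Coefficients that agree within each correlated block incur no penalty, while coefficients that differ across a heavily weighted edge are penalized in proportion to the edge weight.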

Li Yuan, Mark Benjamin, Raskutti Garvesh, Willett Rebecca, Song Hyebin, Neiman David


General General

Gaussian Embedding for Large-scale Gene Set Analysis.

In Nature machine intelligence

Gene sets, including protein complexes and signaling pathways, have proliferated greatly, in large part as a result of high-throughput biological data. Leveraging gene sets to gain insight into biological discovery requires computational methods for converting them into a useful form for available machine learning models. Here, we study the problem of embedding gene sets as compact features that are compatible with available machine learning codes. We present Set2Gaussian, a novel network-based gene set embedding approach, which represents each gene set as a multivariate Gaussian distribution rather than a single point in the low-dimensional space, according to the proximity of these genes in a protein-protein interaction network. We demonstrate that Set2Gaussian improves gene set member identification, accurately stratifies tumors, and finds concise gene sets for gene set enrichment analysis. We further show how Set2Gaussian allows us to identify a previously unknown clinical prognostic and predictive subnetwork around NEFM in sarcoma, which we validate in independent cohorts.
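The idea of describing a set by a distribution rather than a point can be illustrated with a small sketch. This is not Set2Gaussian itself: it assumes gene embedding vectors are already available and simply moment-matches a diagonal Gaussian to the members of a set, whereas the abstract describes an approach driven by proximity in a protein-protein interaction network. All names and the toy embeddings are hypothetical.

```python
import numpy as np

def fit_set_gaussian(embeddings, members, min_var=1e-3):
    """Moment-match a diagonal Gaussian to the member genes' embedding vectors."""
    points = embeddings[members]
    mean = points.mean(axis=0)
    var = points.var(axis=0) + min_var  # floor keeps the density well defined
    return mean, var

def log_density(x, mean, var):
    """Log-density of x under a diagonal Gaussian N(mean, diag(var))."""
    return float(-0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var))

# Hypothetical 2-D gene embeddings: genes 0-2 lie close together, gene 3 is far away.
gene_embeddings = np.array([[0.0, 0.1], [0.1, 0.0], [-0.1, -0.1], [3.0, 3.0]])
mean, var = fit_set_gaussian(gene_embeddings, members=[0, 1, 2])
near = log_density(gene_embeddings[0], mean, var)
far = log_density(gene_embeddings[3], mean, var)
```

The variance term is what a single-point embedding cannot express: a tight set and a diffuse set with the same center would score a candidate gene differently.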

Wang Sheng, Flynn Emily R, Altman Russ B


General General

Simulating realistic fetal neurosonography images with appearance and growth change using cycle-consistent adversarial networks and an evaluation.

In Journal of medical imaging (Bellingham, Wash.)

Purpose: We present an original method for simulating realistic fetal neurosonography images, specifically generating third-trimester pregnancy ultrasound images from second-trimester images. Our method was developed using unpaired data, as pairwise data were not available. We also report original insights on the general appearance differences between second- and third-trimester fetal head transventricular (TV) plane images. Approach: We design a cycle-consistent adversarial network (Cycle-GAN) to simulate visually realistic third-trimester images from unpaired second- and third-trimester ultrasound images. Simulation realism is evaluated qualitatively by experienced sonographers who blindly graded real and simulated images. A quantitative evaluation is also performed whereby a validated deep-learning-based image recognition algorithm (ScanNav®) acts as the expert reference to allow hundreds of real and simulated images to be automatically analyzed and compared efficiently. Results: Qualitative evaluation shows that the human expert cannot tell the difference between real and simulated third-trimester scan images. 84.2% of the simulated third-trimester images could not be distinguished from the real third-trimester images. As a quantitative baseline, on 3000 images, the visibility drop of the choroid, CSP, and mid-line falx between real second- and real third-trimester scans was computed by ScanNav® and found to be 72.5%, 61.5%, and 67%, respectively. The visibility drop of the same structures between real second-trimester and simulated third-trimester was found to be 77.5%, 57.7%, and 56.2%, respectively. Therefore, the real and simulated third-trimester images were considered to be visually similar to each other. Our evaluation also shows that the third-trimester simulation of a conventional GAN is much easier to distinguish, and the visibility drop of the structures is smaller than with our proposed method.
Conclusions: The results confirm that it is possible to simulate realistic third-trimester images from second-trimester images using a modified Cycle-GAN, which may be useful for deep learning researchers with a restricted availability of third-trimester scans but with access to ample second-trimester images. We also show convincing simulation improvements, both qualitatively and quantitatively, using the Cycle-GAN method compared with a conventional GAN. Finally, the use of a machine learning-based reference (in this case ScanNav®) for large-scale quantitative image analysis evaluation is also, to our knowledge, a first.
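The quantitative comparison in the Results can be laid out side by side. The sketch below only tabulates the visibility drops reported in the abstract and the gap between the real and simulated third-trimester figures. The abstract states no numeric threshold for "visually similar", so none is assumed here, and the variable names are illustrative.

```python
# Visibility drops (%) reported in the abstract, relative to real second-trimester scans.
real_third = {"choroid": 72.5, "CSP": 61.5, "mid-line falx": 67.0}
simulated_third = {"choroid": 77.5, "CSP": 57.7, "mid-line falx": 56.2}

# Signed gap in percentage points: simulated minus real.
gaps = {name: round(simulated_third[name] - real_third[name], 1) for name in real_third}
largest_gap = max(abs(g) for g in gaps.values())

for name, gap in gaps.items():
    print(f"{name}: {gap:+.1f} percentage points")
```

On these figures the simulated images overshoot the real drop for the choroid and undershoot it for the CSP and the mid-line falx, with the falx showing the widest gap.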

Xu Yangdi, Lee Lok Hin, Drukker Lior, Yaqub Mohammad, Papageorghiou Aris T, Noble Alison J


cycle-consistent adversarial network, quantitative evaluation, realistic simulation, second-trimester scan, third-trimester scan, transventricular plane

Radiology Radiology

Vessel wall MR imaging of intracranial atherosclerosis.

In Cardiovascular diagnosis and therapy

Intracranial atherosclerotic disease (ICAD) is one of the most common causes of ischemic stroke worldwide. Along with high recurrent stroke risk from ICAD, its association with cognitive decline and dementia leads to a substantial decrease in quality of life and a high economic burden. Atherosclerotic lesions can range from slight wall thickening with plaques that are angiographically occult to severely stenotic lesions. Recent advances in intracranial high-resolution vessel wall MR (VW-MR) imaging have enabled imaging beyond the lumen to characterize the vessel wall and its pathology. This technique has opened new avenues of research for identifying vulnerable plaque in the setting of acute ischemic stroke as well as assessing ICAD burden and its associations with its sequelae, such as dementia. We now understand more about the intracranial arterial wall, its ability to remodel with disease and how we can use VW-MR to identify angiographically occult lesions and assess medical treatment responses, for example, to statin therapy. Our growing understanding of ICAD with intracranial VW-MR imaging can profoundly impact diagnosis, therapy, and prognosis for ischemic stroke with the possibility of lesion-based risk models to tailor and personalize treatment. In this review, we discuss the advantages of intracranial VW-MR imaging for ICAD, the potential of bioimaging markers to identify vulnerable intracranial plaque, and future directions of artificial intelligence and its utility for lesion scoring and assessment.

Song Jae W, Wasserman Bruce A


Black blood MR imaging, intracranial atherosclerosis, ischemic stroke, vessel wall MR imaging (VW-MR imaging)

Radiology Radiology

Cardiovascular/stroke risk predictive calculators: a comparison between statistical and machine learning models.

In Cardiovascular diagnosis and therapy

Background: Statistically derived cardiovascular risk calculators (CVRC) that use conventional risk factors generally underestimate or overestimate the risk of cardiovascular disease (CVD) or stroke events, primarily because they do not integrate plaque burden. This study investigates the role of machine learning (ML)-based CVD/stroke risk calculators (CVRCML) and compares them against statistically derived CVRC (CVRCStat) based on (I) conventional factors or (II) conventional factors combined with plaque burden (integrated factors).

Methods: The proposed study is divided into 3 parts: (I) statistical calculator: initially, the 10-year CVD/stroke risk was computed using 13 types of CVRCStat (without and with plaque burden), and binary risk stratification of the patients was performed using the predefined thresholds and risk classes; (II) ML calculator: using the same risk factors (without and with plaque burden) as adopted in the 13 different CVRCStat, the patients were again risk-stratified using CVRCML based on a support vector machine (SVM); and finally, (III) both types of calculators were evaluated using AUC based on ROC analysis, which was computed using the combination of the predicted class and an endpoint equivalent to CVD/stroke events.
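Step (III) can be sketched as follows. This is a generic pairwise AUC computation, not the study's code: it assumes the predicted class is coded 1 for high risk and 0 for low risk, and the endpoint 1 for an observed event and 0 otherwise. The function name and toy data are hypothetical.

```python
def auc(scores, labels):
    """Probability that a random event case outranks a random non-event case (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: 1 = high-risk class / event observed, 0 = low-risk / no event.
predicted_class = [1, 1, 0, 1, 0, 0]
endpoint = [1, 1, 1, 0, 0, 0]
toy_auc = auc(predicted_class, endpoint)
```

When the score is a binary class rather than a continuous risk, the ROC curve has a single operating point and the AUC reduces to the average of sensitivity and specificity, which is why ties are counted as half a win.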

Results: For this Institutional Review Board-approved study, 202 patients (156 males and 46 females) of Japanese ethnicity, with a mean age of 69±11 years, were recruited. The AUCs for the 13 different types of CVRCStat calculators were: AECRS2.0 (AUC 0.83, P<0.001), QRISK3 (AUC 0.72, P<0.001), WHO (AUC 0.70, P<0.001), ASCVD (AUC 0.67, P<0.001), FRScardio (AUC 0.67, P<0.01), FRSstroke (AUC 0.64, P<0.001), MSRC (AUC 0.63, P=0.03), UKPDS56 (AUC 0.63, P<0.001), NIPPON (AUC 0.63, P<0.001), PROCAM (AUC 0.59, P<0.001), RRS (AUC 0.57, P<0.001), UKPDS60 (AUC 0.53, P<0.001), and SCORE (AUC 0.45, P<0.001), while the AUC for the CVRCML with integrated risk factors was 0.88 (P<0.001), a 42% increase in performance. The overall risk-stratification accuracy for the CVRCML with integrated risk factors was 92.52%, which was higher than that of all the CVRCStat.
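The abstract does not say which baseline the 42% increase refers to. The sketch below computes the relative gain of the ML calculator's AUC over a few candidate baselines built from the listed figures. Treating AECRS2.0 as the only integrated statistical model follows the wording of the Conclusions and is an assumption, as are the variable names.

```python
# AUCs of the 13 statistical calculators as listed in the abstract.
stat_auc = {
    "AECRS2.0": 0.83, "QRISK3": 0.72, "WHO": 0.70, "ASCVD": 0.67,
    "FRScardio": 0.67, "FRSstroke": 0.64, "MSRC": 0.63, "UKPDS56": 0.63,
    "NIPPON": 0.63, "PROCAM": 0.59, "RRS": 0.57, "UKPDS60": 0.53, "SCORE": 0.45,
}
ml_auc = 0.88  # CVRCML with integrated risk factors

def relative_gain(baseline):
    """Percentage increase of the ML AUC over a baseline AUC."""
    return 100.0 * (ml_auc - baseline) / baseline

mean_all = sum(stat_auc.values()) / len(stat_auc)
# Assumption: every calculator except AECRS2.0 uses conventional factors only.
conventional = [v for k, v in stat_auc.items() if k != "AECRS2.0"]
mean_conventional = sum(conventional) / len(conventional)

gain_vs_all = relative_gain(mean_all)                    # about 38.5
gain_vs_conventional = relative_gain(mean_conventional)  # about 42.1
gain_vs_best = relative_gain(stat_auc["AECRS2.0"])       # about 6.0
```

Of these candidates, only the mean over the twelve calculators other than AECRS2.0 comes out near 42%, so that is one plausible reading of the reported figure. Against the best statistical calculator the gain is much smaller.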

Conclusions: The ML-based CVD/stroke risk calculator predicted 10-year CVD/stroke risk better than the 13 different types of statistically derived risk calculators, including the integrated model AECRS 2.0.

Jamthikar Ankush, Gupta Deep, Saba Luca, Khanna Narendra N, Araki Tadashi, Viskovic Klaudija, Mavrogeni Sophie, Laird John R, Pareek Gyan, Miner Martin, Sfikakis Petros P, Protogerou Athanasios, Viswanathan Vijay, Sharma Aditya, Nicolaides Andrew, Kitas George D, Suri Jasjit S


10-year risk, Atherosclerosis, cardiovascular disease (CVD), integrated models, machine learning-based calculator, statistical risk calculator, stroke

General General

Classifications of Neurodegenerative Disorders Using a Multiplex Blood Biomarkers-Based Machine Learning Model.

In International journal of molecular sciences ; h5-index 102.0

Easily accessible biomarkers for Alzheimer's disease (AD), Parkinson's disease (PD), frontotemporal dementia (FTD), and related neurodegenerative disorders are urgently needed in an aging society to assist early-stage diagnoses. In this study, we aimed to develop machine learning algorithms using multiplex blood-based biomarkers to identify patients with different neurodegenerative diseases. Plasma samples (n = 377) were obtained from healthy controls, patients with AD spectrum (including mild cognitive impairment (MCI)), PD spectrum with variable cognitive severity (including PD with dementia (PDD)), and FTD. We measured plasma levels of amyloid-beta 42 (Aβ42), Aβ40, total Tau, p-Tau181, and α-synuclein using an immunomagnetic reduction-based immunoassay. We observed increased levels of all biomarkers except Aβ40 in the AD group when compared to the MCI and control groups. The plasma α-synuclein levels increased in PDD when compared to PD with normal cognition. We applied machine learning-based frameworks, including a linear discriminant analysis (LDA), for feature extraction and several classifiers, using features from these blood-based biomarkers to classify these neurodegenerative disorders. We found that the random forest (RF) was the best classifier to separate different dementia syndromes. Using RF, the established LDA model had an average accuracy of 76% when classifying AD, PD spectrum, and FTD. Moreover, we found 83% and 63% accuracies when differentiating the individual disease severity of subgroups in the AD and PD spectrum, respectively. The developed LDA model with the RF classifier can assist clinicians in distinguishing various neurodegenerative disorders.
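The role of LDA as a feature extractor can be illustrated with a two-class Fisher discriminant. This is a simplified sketch, not the study's pipeline: it uses synthetic numbers in place of biomarker measurements, handles only two groups, and substitutes a midpoint threshold for the random forest classifier. All names and data are hypothetical.

```python
import numpy as np

def fisher_direction(X0, X1):
    """Two-class Fisher LDA direction: w proportional to Sw^{-1} (mu1 - mu0)."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: sum of the two classes' centered outer products.
    Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    w = np.linalg.solve(Sw, mu1 - mu0)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
# Synthetic stand-ins for five plasma biomarker levels in two groups (not real data).
group_a = rng.normal(loc=0.0, scale=1.0, size=(40, 5))
group_b = rng.normal(loc=1.0, scale=1.0, size=(40, 5))

w = fisher_direction(group_a, group_b)
score_a, score_b = group_a @ w, group_b @ w  # one LDA feature per sample
threshold = 0.5 * (score_a.mean() + score_b.mean())
accuracy = 0.5 * ((score_a < threshold).mean() + (score_b >= threshold).mean())
```

The projection compresses the five measurements into a single feature chosen to separate the groups. In the study's setting, features of this kind would then be passed to a classifier such as a random forest.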

Lin Chin-Hsien, Chiu Shu-I, Chen Ta-Fu, Jang Jyh-Shing Roger, Chiu Ming-Jang


Alzheimer’s disease, Parkinson’s disease, biomarkers, classification, deep learning model, frontotemporal dementia, linear discriminant analysis, multivariate imputation by chained equations, neurodegenerative disorders