ArXiv Preprint
This paper presents a federated learning (FL) approach to training an AI model
for SARS-CoV-2 variant identification. We analyze the SARS-CoV-2
spike sequences in a distributed way, without data sharing, to detect different
variants of the rapidly mutating coronavirus. A vast amount of SARS-CoV-2
sequencing data is available thanks to genomic monitoring initiatives in
several nations. However, privacy concerns involving patient health information
and national public health conditions could hinder openly sharing this data. In
this work, we propose a lightweight FL paradigm in which remote nodes
cooperatively train a prediction model on SARS-CoV-2 spike protein sequences,
each using only its locally stored data. Our method maintains the
confidentiality of local data (which may be stored in different locations) yet
allows us to reliably detect and identify both known and unknown variants
of the novel coronavirus SARS-CoV-2. We compare the performance of our approach
on spike sequence data with recently proposed state-of-the-art methods for
spike sequence classification. Using the proposed approach, we achieve an
overall accuracy of $93\%$ on the coronavirus variant identification task. To
the best of our knowledge, this is the first work in the federated learning
paradigm for biological sequence analysis. Because the proposed model is
distributed by nature, it could scale easily to ``Big Data''. We plan to use
this proof-of-concept to implement a privacy-preserving pandemic response
strategy.
Prakash Chourasia, Taslim Murad, Zahra Tayebi, Sarwan Ali, Imdad Ullah Khan, Murray Patterson
2023-02-17
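
As an illustration of the federated setup described in the abstract, the following is a minimal sketch in which each node trains on its own spike-like sequences and only model weights are sent for aggregation. Several pieces are assumptions, not details from the abstract: the aggregation rule (sample-weighted federated averaging), the logistic-regression classifier, the one-hot encoding, and the toy data generator with its hypothetical two "variants" are all stand-ins chosen for brevity, and the authors' actual model and aggregation scheme may differ.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(seq):
    """Flatten a sequence into a one-hot vector (20 entries per position)."""
    vec = [0.0] * (len(seq) * len(AMINO_ACIDS))
    for pos, aa in enumerate(seq):
        vec[pos * len(AMINO_ACIDS) + AMINO_ACIDS.index(aa)] = 1.0
    return vec


def predict(weights, x):
    """Logistic-regression probability of the positive class."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def local_update(global_weights, data, lr=0.5, epochs=5):
    """Train on one node's private data, starting from the global weights.

    Only the updated weights and the sample count leave the node.
    """
    w = list(global_weights)
    for _ in range(epochs):
        for x, y in data:
            err = predict(w, x) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w, len(data)


def federated_average(updates):
    """Average node weights, weighted by each node's sample count (assumed rule)."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(w[i] * n for w, n in updates) / total for i in range(dim)]


def make_node(rng, n):
    """Toy data (hypothetical): 'variant' 1 has D at position 0, 'variant' 0 has G."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        tail = "".join(rng.choice(AMINO_ACIDS) for _ in range(4))
        data.append((one_hot(("D" if label else "G") + tail), float(label)))
    return data


rng = random.Random(0)
nodes = [make_node(rng, n) for n in (30, 50, 20)]  # three nodes, private data
weights = [0.0] * (5 * len(AMINO_ACIDS))           # shared global model

for _ in range(10):  # communication rounds
    updates = [local_update(weights, data) for data in nodes]
    weights = federated_average(updates)

test_set = make_node(rng, 40)
accuracy = sum((predict(weights, x) > 0.5) == (y == 1.0)
               for x, y in test_set) / len(test_set)
print(f"held-out accuracy on toy data: {accuracy:.2f}")
```

In this sketch the raw sequences never leave `make_node`'s per-node lists; the server-side step `federated_average` sees only weight vectors and sample counts, which is the property the abstract emphasizes.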