ArXiv Preprint
The UK COVID-19 Vocal Audio Dataset is designed for the training and
evaluation of machine learning models that classify SARS-CoV-2 infection status
or associated respiratory symptoms using vocal audio. The UK Health Security
Agency recruited voluntary participants through the national Test and Trace
programme and the REACT-1 survey in England from March 2021 to March 2022,
during periods of dominant transmission of the Alpha and Delta SARS-CoV-2
variants and some Omicron variant sublineages. Audio recordings of volitional coughs,
exhalations, and speech were collected through the 'Speak up to help beat
coronavirus' digital survey, alongside demographic, self-reported symptom, and
respiratory condition data, and were linked to SARS-CoV-2 test results. The UK
COVID-19 Vocal Audio Dataset represents the largest collection of SARS-CoV-2
PCR-referenced audio recordings to date. PCR results were linked to 70,794 of
72,999 participants and 24,155 of 25,776 positive cases. Respiratory symptoms
were reported by 45.62% of participants. The dataset has additional potential
uses in bioacoustics research: 11.30% of participants reported asthma, and
27.20% have linked influenza PCR test results.
Jobie Budd, Kieran Baker, Emma Karoune, Harry Coppock, Selina Patel, Ana Tendero Cañadas, Alexander Titcomb, Richard Payne, David Hurley, Sabrina Egglestone, Lorraine Butler, Jonathon Mellor, George Nicholson, Ivan Kiskin, Vasiliki Koutra, Radka Jersakova, Rachel A. McKendry, Peter Diggle, Sylvia Richardson, Björn W. Schuller, Steven Gilmour, Davide Pigoli, Stephen Roberts, Josef Packham, Tracey Thornley, Chris Holmes
2022-12-15