The COVID-19 Open Research Dataset (CORD-19) was released in March 2020 to allow the machine learning and wider research community to develop techniques for answering scientific questions about COVID-19. The dataset consists of a large collection of scientific literature, including over 100,000 full-text papers. Annotating training data to normalise variability in biological entities can improve the performance of downstream analysis and interpretation. To facilitate and enhance the use of the CORD-19 data in these applications, in late March 2020 we performed a comprehensive annotation process using the named entity recognition tool TERMite, along with a number of large reference ontologies and vocabularies covering domains including genes, proteins, drugs and virus strains. This additional annotation identified and tagged over 45 million entity mentions within the corpus, corresponding to 62,746 unique biomedical entities. The latest version of the annotated data, as well as older versions, is openly available under the GPL-2.0 License at https://github.com/SciBiteLabs/CORD19 for the community to use.
Giles, O.; Huntley, R.; Karlsson, A.; Lomax, J.; Malone, J.