In Studies in health technology and informatics ; h5-index 23.0
Sample size is an important indicator of the power of randomized controlled trials (RCTs). In this paper, we designed a total sample size extractor using a combination of syntactic and machine learning methods, and evaluated it on 300 Covid-19 abstracts (Covid-Set) and 100 generic RCT abstracts (General-Set). To improve the performance, we applied transfer learning from a large public corpus of annotated abstracts. We achieved an average F1 score of 0.73 on the Covid-Set testing set, and 0.60 on the General-Set using exact matches. The F1 scores for loose matches on both datasets were over 0.74. Compared with the state-of-the-art tool, our extractor reports total sample sizes directly and improved F1 scores by at least 4% without transfer learning. We demonstrated that transfer learning improved the sample size extraction accuracy and minimized human labor on annotations.
Lin Fengyang, Liu Hao, Moon Paul, Weng Chunhua
Natural Language Processing, Randomized Controlled Trial, Sample Size