ArXiv Preprint
Keyphrase generation is the task of generating a set of words or phrases
that highlight the main topics of a document. Few datasets exist for
keyphrase generation in the biomedical domain, and those that do are too
small to train generative models. In this paper, we
introduce kp-biomed, the first large-scale biomedical keyphrase generation
dataset with more than 5M documents collected from PubMed abstracts. We train
and release several generative models and conduct a series of experiments
showing that using large-scale datasets significantly improves performance
for present and absent keyphrase generation. The dataset is available under
the CC-BY-NC v4.0 license at https://huggingface.co/datasets/taln-ls2n/kpbiomed.
Mael Houbre, Florian Boudin, Beatrice Daille
2022-11-22