
In Data in Brief

Breast cancer is one of the leading causes of death in women worldwide. Its main causes may include inheritance, changes in environmental conditions, or mutations in certain cancer-causing genes. These genes are not few in number; on the contrary, a wide range of genes are involved in the development and progression of the different stages of breast cancer. In this article, we explore the association of genes with breast cancer and classify them into three association classes: positive, negative, and neutral. Among the available biomedical literature resources for a disease, HuGE Navigator is a major resource comprising continually updated human genome epidemiology data maintained by the Centers for Disease Control and Prevention. However, the literature finder module of HuGE Navigator yields only PubMed IDs for a specific disease, which are then used to retrieve abstract data from PubMed. These abstracts are filtered to retain only those reference sentences that contain at least one gene term and at least one disease term. The reference sentences were then used to apply double-fold cross-validation, to compile the most comprehensive list of genes, and to classify each gene as positive, negative, or neutral, along with the reference sentences confirming its association with the disease. The positively associated data generated here can be used for breast cancer modelling or for meta-analysis of breast cancer. The data generated in the present work can also serve as standard reference data for training text mining-based biological literature classifiers that predict the class of published literature, not only for breast cancer but for other diseases as well.
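
The sentence-filtering step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the term lists, the function names, and the simple regular-expression sentence splitter are all assumptions made for illustration, and a real pipeline would use curated gene and disease vocabularies.

```python
import re

# Hypothetical term lists, for illustration only (not taken from the dataset).
GENE_TERMS = {"BRCA1", "BRCA2", "TP53"}
DISEASE_TERMS = {"breast cancer", "breast carcinoma"}


def split_sentences(abstract):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", abstract)
    return [p.strip() for p in parts if p.strip()]


def has_term(sentence, terms, case_sensitive):
    """Return True if any term occurs in the sentence as a whole word or phrase."""
    flags = 0 if case_sensitive else re.IGNORECASE
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", sentence, flags)
        for term in terms
    )


def reference_sentences(abstract):
    """Keep sentences with at least one gene term and one disease term."""
    return [
        s for s in split_sentences(abstract)
        if has_term(s, GENE_TERMS, case_sensitive=True)
        and has_term(s, DISEASE_TERMS, case_sensitive=False)
    ]


example = (
    "BRCA1 mutations were associated with breast cancer risk. "
    "Samples were collected from 200 patients. "
    "No link between TP53 and breast carcinoma was observed."
)
kept = reference_sentences(example)
```

In this sketch, gene symbols are matched case-sensitively and disease terms case-insensitively; that is a design assumption, chosen because gene symbols are conventionally written in upper case while disease names vary in capitalization.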

Raj Sushrutha, Anil Athira P, Shukla Anshita, Anoosha Kadambala, Srivastava Alok

2022-Dec

Breast cancer association, Breast cancer genes, Double cross validation, Machine learning, Meta-analysis, System modeling, Text classifier, Text mining