Evaluating the effect of unbalanced data in biomedical document classification

DOI: 10.2390/biecoll-jib-2011-177
Submitted: July 14, 2011
Published: September 16, 2011
PubMed ID: 21926440

Rosalía Laza, Reyes Pavón, Miguel Reboiro-Jato and Florentino Fdez-Riverola

Correspondence should be addressed to:
Rosalía Laza
ESEI, Escuela Superior de Ingeniería Informática, University of Vigo, Edificio Politécnico, Campus Universitario As Lagoas s/n, 32004, Ourense, Spain
rlaza@uvigo.es


Abstract

Document classification has become an active research field, partly because of the increasing availability of biomedical information in digital form, which needs to be catalogued and organized. In this context, machine learning techniques are usually applied to text classification through a general inductive process that automatically builds a text classifier from a set of pre-classified documents. Within this domain, imbalanced data is a well-known problem in many practical applications of knowledge discovery, and its effects on the performance of standard classifiers are notable. In this paper, we investigate the application of a Bayesian Network (BN) model to the triage of documents, which are represented by the association of different MeSH terms. Our results show that BNs are adequate for describing conditional independencies between MeSH terms and that the MeSH ontology is a valuable resource for representing Medline documents at different abstraction levels. Moreover, we perform an extensive experimental evaluation to investigate whether the classification of Medline documents using a BN classifier poses additional challenges when dealing with class-imbalanced prediction. The evaluation involves two methods: under-sampling and cost-sensitive learning. We conclude that the BN classifier is sensitive to both balancing strategies and that existing techniques can improve its overall performance.
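
As a rough illustration of the two balancing strategies named in the abstract, the following Python sketch shows random under-sampling and a cost-sensitive decision rule in their simplest generic form. It is not the authors' implementation and does not involve a Bayesian Network: the function names, the class labels "relevant" and "irrelevant", and the cost values are all hypothetical, chosen only to make the idea concrete.

```python
import random
from collections import Counter


def undersample(labels, seed=0):
    """Random under-sampling: keep the whole smallest class and an equally
    sized random subset of every larger class. Returns sorted indices."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    kept = []
    for idx in by_class.values():
        kept.extend(rng.sample(idx, n_min))
    return sorted(kept)


def cost_sensitive_label(p_relevant, cost_fn=5.0, cost_fp=1.0):
    """Cost-sensitive decision: choose the label with the lower expected cost.

    p_relevant is a classifier's estimate of P(relevant | document).
    cost_fn and cost_fp are illustrative (assumed) misclassification costs
    for a false negative and a false positive respectively.
    """
    expected_cost_if_rejected = p_relevant * cost_fn
    expected_cost_if_accepted = (1.0 - p_relevant) * cost_fp
    if expected_cost_if_accepted < expected_cost_if_rejected:
        return "relevant"
    return "irrelevant"


# Toy imbalanced label set: 3 relevant documents against 12 irrelevant ones.
labels = ["relevant"] * 3 + ["irrelevant"] * 12
kept = undersample(labels)
balanced = Counter(labels[i] for i in kept)
```

With these assumed costs, a document is labelled relevant whenever its estimated probability exceeds 1/6, so the minority class is favoured without discarding any training data, whereas under-sampling reaches a similar effect by shrinking the majority class before training.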

Reference

Rosalía Laza, Reyes Pavón, Miguel Reboiro-Jato and Florentino Fdez-Riverola. Evaluating the effect of unbalanced data in biomedical document classification. Journal of Integrative Bioinformatics, 8(3):177, 2011. Online Journal: http://journal.imbio.de/index.php?paper_id=177