Improving imbalanced scientific text classification using sampling strategies and dictionaries

DOI: 10.2390/biecoll-jib-2011-176
Submitted: July 13, 2011
Published: September 15, 2011
PubMed ID: 21926439

Lourdes Borrajo, Rubén Romero, Eva Lorenzo Iglesias and Carmen María Redondo Marey

Correspondence should be addressed to:
Eva Iglesias
Univ. of Vigo, Computer Science Dept., Campus As Lagoas s/n, 32004 Ourense, Spain
se.ogivu@nullave


Abstract

Many real-world applications suffer from the class imbalance problem, in which one class is represented by far fewer cases than the others. Systems for the retrieval and classification of scientific documents are among those affected. Sampling strategies such as Oversampling and Subsampling are popular ways of tackling class imbalance. In this work, we study their effects on three types of classifiers (kNN, SVM and Naive Bayes) applied to searches of the PubMed scientific database. A second aim of this paper is to study the use of dictionaries in the classification of biomedical texts. Experiments are conducted with three different dictionaries (BioCreative, NLPBA, and an ad hoc subset of the UniProt database named Protein) using the classifiers and sampling strategies mentioned above. The best results were obtained with the NLPBA and Protein dictionaries and the SVM classifier using the Subsampling balancing technique. These results were compared with those obtained by other authors using the TREC Genomics 2005 public corpus.
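
As an illustration of the two sampling strategies named in the abstract, the following minimal Python sketch balances a labelled document set by random oversampling (duplicating minority-class examples) and random subsampling (discarding majority-class examples). The abstract does not say which sampling variants the authors used, so random selection is an assumption here, and the function names, the seed parameter and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def _group_by_label(samples):
    """Group (text, label) pairs by their label."""
    groups = defaultdict(list)
    for text, label in samples:
        groups[label].append((text, label))
    return groups


def random_oversample(samples, seed=0):
    """Duplicate randomly chosen examples of the smaller classes
    until every class is as large as the largest one.
    (Assumed variant: sampling with replacement.)"""
    rng = random.Random(seed)
    groups = _group_by_label(samples)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        # Draw the missing examples with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


def random_subsample(samples, seed=0):
    """Keep a random subset of each class so that every class
    is as small as the smallest one.
    (Assumed variant: sampling without replacement.)"""
    rng = random.Random(seed)
    groups = _group_by_label(samples)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    return balanced


# Toy imbalanced corpus: 2 relevant and 8 non-relevant documents.
corpus = [(f"relevant doc {i}", "relevant") for i in range(2)]
corpus += [(f"other doc {i}", "non-relevant") for i in range(8)]

oversampled = random_oversample(corpus)
subsampled = random_subsample(corpus)
```

Either balanced set would then be passed to the classifier for training. Oversampling keeps every original document at the cost of repeated minority examples, whereas subsampling discards majority documents and yields a smaller training set.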

Reference

Lourdes Borrajo, Rubén Romero, Eva Lorenzo Iglesias and Carmen María Redondo Marey. Improving imbalanced scientific text classification using sampling strategies and dictionaries. Journal of Integrative Bioinformatics, 8(3):176, 2011. Online Journal: http://journal.imbio.de/index.php?paper_id=176