A Multistage Feature Selection Model for Document Classification Using Information Gain and Rough Set
Journal Title: International Journal of Advanced Research in Artificial Intelligence (IJARAI) - Year 2014, Vol 3, Issue 11
Abstract
The number of documents is growing rapidly, so organizing them in digitized form makes text categorization a challenging problem. A major difficulty in text categorization is the large number of features. Many of these features are noisy, irrelevant, or redundant, and may mislead the classifier. It is therefore important to reduce the dimensionality of the data to a smaller subset of features that provides the most information gain. Feature selection techniques reduce the dimensionality of the feature space and can also improve overall accuracy and performance, which makes feature selection an efficient way to address these issues. We therefore propose a multistage feature selection model to improve the overall accuracy and performance of classification. In the first stage, the documents are preprocessed. In the second stage, each term in the documents is ranked by its importance for classification using information gain. In the third stage, the rough set technique is applied to the highly ranked terms to carry out feature reduction. Finally, document classification is performed on the core features using Naive Bayes and KNN classifiers. Experiments are carried out on three UCI datasets: Reuters 21578, Classic 04, and Newsgroup 20. The results show better accuracy and performance for the proposed model.
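The term-ranking stage can be illustrated with a minimal sketch. It assumes each document has already been preprocessed into a set of tokens and that a term's information gain is measured on its presence or absence in a document. The function names (`entropy`, `information_gain`, `rank_terms`) and the toy corpus are hypothetical and are not taken from the paper; the rough set reduction and the classifiers are not shown.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(docs, labels, term):
    """Information gain of a term's presence/absence with respect to the labels."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    total = len(labels)
    # Expected entropy of the labels after splitting on the term.
    conditional = (
        (len(with_term) / total) * entropy(with_term)
        + (len(without_term) / total) * entropy(without_term)
    )
    return entropy(labels) - conditional


def rank_terms(docs, labels):
    """Return (term, score) pairs sorted by descending gain, ties broken by term."""
    vocab = set().union(*docs)
    scores = {t: information_gain(docs, labels, t) for t in vocab}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy corpus (hypothetical data): each document is a set of preprocessed tokens.
docs = [
    {"oil", "price", "market"},
    {"oil", "export", "market"},
    {"match", "goal", "market"},
    {"match", "team", "goal"},
]
labels = ["econ", "econ", "sport", "sport"]

ranking = rank_terms(docs, labels)
top_terms = [term for term, _ in ranking[:3]]
```

In this toy corpus, terms that occur in only one class ("goal", "match", "oil") receive the maximum gain of 1 bit, while "market", which appears in both classes, scores much lower. Only the highly ranked terms would be passed on to the rough set stage.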
Authors and Affiliations
Mrs. Leena. Patil, Dr. Mohammed Atique
A New Technique to Manage Big Bioinformatics Data Using Genetic Algorithms
The continuous growth of data, mainly medical data at laboratories, has become very complex to use and manage in traditional ways. So researchers have started studying the genetic information field, which increas...
Vicarious Calibration Data Screening Method Based on Variance of Surface Reflectance and Atmospheric Optical Depth Together with Cross Calibration
Vicarious calibration data screening method based on the measured atmospheric optical depth and the variance of the measured surface reflectance at the test sites is proposed. Reliability of the various calibration...
Analysis of Gumbel Model for Software Reliability Using Bayesian Paradigm
In this paper, we have illustrated the suitability of Gumbel Model for software reliability data. The model parameters are estimated using likelihood based inferential procedure: classical as well as Bayesian. The quasi...
Identification Filtering with fuzzy estimations
A digital identification filter interacts with an output reference model signal known as a black-box output system. The identification technique commonly needs the transition and gain matrixes. Both estimation cases are...
Implementation of Computer Assisted CIPP Model for Evaluation Program of HIV/AIDS Countermeasures in Bali
One fact of the economic development of tourism in Bali is the establishment of tourism facilities to support the Bali tourism industry. Consequently, it has had the effect that large numbers of...