Feature Selection and Vectorization in Legal Case Documents Using Chi-Square Statistical Analysis and Naïve Bayes Approaches
Journal Title: IOSR Journals (IOSR Journal of Computer Engineering) - Year 2015, Vol 17, Issue 2
Abstract
Most machine learning techniques used for text classification require that document features be selected effectively, because of the large volume of data encountered in the classification process, and that term weights be built from document vectors before they are fed into the respective classifier algorithms. In this work, the most important features are selected from the raw documents by applying more extensive pre-processing techniques, and the resulting features are ranked using the chi-square statistical approach so that irrelevant features are eliminated and the more relevant features across the entire corpus are retained. The top-ranked features are then converted to word vectors based on the number of occurrences of words in the documents or categories concerned, using the probabilistic characteristics of Naïve Bayes as a vectorizer for machine learning classifiers. This hybrid vector space model was tested on legal text categories. The study found that the pre-processing and ranking technique discovered better features, and that better term weights were built from the documents for the machine learning classifiers used in the text classification process.
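The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the function names (`chi_square`, `rank_terms`, `count_vector`, `nb_term_weights`), the choice to score each term by its best chi-square value over all categories, and the use of Laplace-smoothed multinomial Naïve Bayes estimates as term weights are all assumptions made for illustration. The chi-square statistic itself is the standard one for a 2x2 term/category contingency table.

```python
from collections import Counter


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/category contingency table.

    n11: documents in the category that contain the term
    n10: documents outside the category that contain the term
    n01: documents in the category that lack the term
    n00: documents outside the category that lack the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denom == 0:
        # A term present in every document (or none) carries no signal.
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def rank_terms(docs, labels, top_k):
    """Rank terms by their best chi-square score over all categories (assumed rule)."""
    vocab = sorted({t for d in docs for t in d})
    doc_sets = [set(d) for d in docs]
    scores = {}
    for term in vocab:
        best = 0.0
        for cat in sorted(set(labels)):
            n11 = sum(1 for s, y in zip(doc_sets, labels) if term in s and y == cat)
            n10 = sum(1 for s, y in zip(doc_sets, labels) if term in s and y != cat)
            n01 = sum(1 for s, y in zip(doc_sets, labels) if term not in s and y == cat)
            n00 = len(docs) - n11 - n10 - n01
            best = max(best, chi_square(n11, n10, n01, n00))
        scores[term] = best
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(vocab, key=lambda t: (-scores[t], t))
    return ranked[:top_k], scores


def count_vector(doc, features):
    """Occurrence-count vector of one document over the selected features."""
    counts = Counter(doc)
    return [counts[t] for t in features]


def nb_term_weights(docs, labels, features):
    """Laplace-smoothed P(term | category), as in multinomial Naive Bayes (assumed)."""
    weights = {}
    for cat in sorted(set(labels)):
        totals = Counter()
        for d, y in zip(docs, labels):
            if y == cat:
                totals.update(t for t in d if t in features)
        denom = sum(totals.values()) + len(features)
        weights[cat] = [(totals[t] + 1) / denom for t in features]
    return weights


# Hypothetical, already pre-processed (tokenized) legal documents.
docs = [
    ["court", "appeal", "land", "dispute"],
    ["land", "title", "court", "dispute"],
    ["court", "murder", "trial", "appeal"],
    ["murder", "sentence", "court", "trial"],
]
labels = ["civil", "civil", "criminal", "criminal"]

features, scores = rank_terms(docs, labels, top_k=4)
vectors = [count_vector(d, features) for d in docs]
weights = nb_term_weights(docs, labels, features)
```

In this toy corpus, "court" appears in every document and so receives a chi-square score of zero, while terms confined to one category rank highest. The count vectors and per-category weights are what would then be passed to a downstream classifier.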
Authors and Affiliations
Obasi, Chinedu Kingsley; Ugwu, Chidiebere