Efficient Document Clustering System Based on Probability Distribution of K-Means (PD K-Means) Model

Abstract

In document clustering systems, documents with the same similarity scores may fall into different clusters rather than the same cluster, because the similarity distance between pairs of documents is calculated from geometric measurements. To address this, a probability distribution of K-Means (PD K-Means) algorithm is proposed. In this system, documents are clustered using the proposed probability distribution equation instead of a similarity measure between objects. The system also addresses the initial centroid problem of K-Means by using the Systematic Selection of Initial Centroid (SSIC) approach. As a result, it not only generates compact and stable results but also eliminates the initial cluster problem of K-Means. In the experiments, F-measure values increase by about 0.28 on the 20 Newsgroups dataset, and by about 0.26 on R8 and 0.14 on R52 from the Reuters-21578 dataset. The evaluations demonstrate that the proposed solution outperforms the original method and can be applied to various standard and unsupervised datasets.
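The abstract reports its gains as F-measure values but does not say how the score is computed. As a rough illustration only, the sketch below uses one common definition of a clustering F-measure: each true class is matched to its best-scoring cluster, and the per-class scores are weighted by class size. Whether the authors used this exact variant is an assumption, and the function name `clustering_f_measure` and the toy labels are hypothetical.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_ids):
    """Class-weighted best-match F-measure for a flat clustering.

    Assumption: this is one common definition, not necessarily the
    variant used in the paper.
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("label and cluster lists must have equal length")
    n = len(true_labels)
    if n == 0:
        raise ValueError("need at least one document")

    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_ids)
    # overlap[(cls, clu)] = number of documents of class cls in cluster clu
    overlap = Counter(zip(true_labels, cluster_ids))

    total = 0.0
    for cls, cls_size in class_sizes.items():
        best = 0.0
        for clu, clu_size in cluster_sizes.items():
            n_ij = overlap[(cls, clu)]
            if n_ij == 0:
                continue
            precision = n_ij / clu_size
            recall = n_ij / cls_size
            f = 2 * precision * recall / (precision + recall)
            best = max(best, f)
        # Weight each class by its share of the collection.
        total += (cls_size / n) * best
    return total


# Hypothetical toy data: six documents, two true classes.
labels = ["sport", "sport", "sport", "politics", "politics", "politics"]
perfect = [0, 0, 0, 1, 1, 1]   # clusters match the classes exactly
mixed = [0, 0, 1, 1, 1, 0]     # one document of each class is misplaced

print(clustering_f_measure(labels, perfect))
print(clustering_f_measure(labels, mixed))
```

Under this definition a score of 1.0 means every class is recovered exactly by some cluster, so a reported increase of about 0.28 would be a large shift on a 0-to-1 scale.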

Authors and Affiliations

Tin Thu Zar Win, Nang Aye Aye Htwe, Moe Moe Aye

How To Cite

Tin Thu Zar Win, Nang Aye Aye Htwe, Moe Moe Aye (2018). Efficient Document Clustering System Based on Probability Distribution of K-Means (PD K-Means) Model. International Journal of the Computer, the Internet and Management, 26(1), 15-20. https://europub.co.uk./articles/-A-598664