A High-Performing Similarity Measure for Categorical Dataset with SF-Tree Clustering Algorithm

Abstract

Tasks such as clustering and classification assume the existence of a similarity measure to assess the similarity (or dissimilarity) of a pair of observations or clusters. The key difference between most clustering methods lies in their similarity measures. This article proposes a new similarity measure called PWO (“Probability of the Weights between Overlapped items”) for clustering categorical datasets; proves that PWO is a metric; presents a framework implementation to detect the best similarity value for different datasets; and improves the F-tree clustering algorithm with a semi-supervised method to refine the results. Experimental evaluation on real categorical datasets (Mushrooms, KrVskp, Congressional Voting, Soybean-Large, Soybean-Small, Hepatitis, Zoo, Lenses, and Adult-Stretch) shows that PWO is more effective at measuring the similarity between categorical data than state-of-the-art algorithms; that clustering based on PWO with a pre-defined number of clusters yields a good separation of classes, with high purity averaging 80% coverage of the real classes; and that the overlap estimator perfectly estimates the overlap threshold from a small sample of around 5% of the dataset.

Authors and Affiliations

Mahmoud A. Mahdi, Samir E. Abdelrahman, Reem Bahgat

  • DOI 10.14569/IJACSA.2018.090565

How To Cite

Mahmoud A. Mahdi, Samir E. Abdelrahman, Reem Bahgat (2018). A High-Performing Similarity Measure for Categorical Dataset with SF-Tree Clustering Algorithm. International Journal of Advanced Computer Science & Applications, 9(5), 496-509. https://europub.co.uk./articles/-A-319134