A Grammatical Inference Sequential Mining Algorithm for Protein Fold Recognition

Abstract

Protein fold recognition plays an important role in computational protein analysis because it can help determine the function of proteins whose structure is unknown. In this paper, a Classified Sequential Pattern mining technique for Protein Fold Recognition (CSPF) is proposed. The CSPF technique consists of two main phases: the sequential pattern mining phase and the fold recognition phase. In the sequential pattern mining phase, which serves as the training phase, the Mix & Test algorithm is developed based on Grammatical Inference. The Mix & Test algorithm minimizes I/O costs by scanning the database only once, discovers subsequence combinations directly from sequences in memory without searching the whole sequence file, requires no database projection, handles gaps, and works with variable-length sequences without having to align them. In addition, a parallelized version of the Mix & Test algorithm is applied to speed up its performance. In the fold recognition phase, the folds of unknown proteins are predicted via a proposed testing function. To evaluate performance, 36 SCOP protein folds are used; the accuracy rate is 75.84% for training data and 59.7% for testing data.
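
The two phases described in the abstract can be illustrated with a toy sketch. This is not the authors' Mix & Test algorithm or their testing function, whose details the abstract does not give; it only assumes a generic form of gapped sequential pattern mining followed by a simple vote. The names `is_subsequence`, `mine_patterns`, and `predict_fold`, the fixed pattern length, and the support threshold are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def is_subsequence(pattern, sequence):
    """Return True if the symbols of pattern occur in sequence in order.

    Gaps between matched symbols are allowed, so no alignment is needed.
    """
    remaining = iter(sequence)
    return all(symbol in remaining for symbol in pattern)


def mine_patterns(training, length=2, min_support=2):
    """Training phase (assumed form): one pass over (fold, sequence) pairs.

    Every gapped subsequence of the given length is counted once per
    sequence; patterns seen in at least min_support sequences of a fold
    are kept for that fold.
    """
    support = defaultdict(lambda: defaultdict(int))
    for fold, sequence in training:
        seen = {"".join(c) for c in combinations(sequence, length)}
        for pattern in seen:
            support[fold][pattern] += 1
    return {
        fold: {p for p, n in counts.items() if n >= min_support}
        for fold, counts in support.items()
    }


def predict_fold(patterns_by_fold, sequence):
    """Recognition phase (assumed form): pick the fold whose mined
    patterns match the query sequence most often."""
    scores = {
        fold: sum(is_subsequence(p, sequence) for p in patterns)
        for fold, patterns in patterns_by_fold.items()
    }
    return max(scores, key=scores.get)


# Made-up toy sequences, not SCOP data.
training = [("a", "ACDE"), ("a", "ACE"), ("b", "WYF"), ("b", "WGF")]
patterns = mine_patterns(training)
print(predict_fold(patterns, "AKCE"))
```

Enumerating every gapped subsequence grows quickly with pattern length, so this sketch is only practical for short patterns; it is meant to show the idea of matching with gaps on unaligned, variable-length sequences, not to reproduce the reported accuracy.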

Authors and Affiliations

Taysir Soliman, Ahmed Eldin, Marwa Ghareeb, Mohammed Marie

  • EP ID EP116496
  • DOI 10.14569/IJACSA.2014.051214

How To Cite

Taysir Soliman, Ahmed Eldin, Marwa Ghareeb, Mohammed Marie (2014). A Grammatical Inference Sequential Mining Algorithm for Protein Fold Recognition. International Journal of Advanced Computer Science & Applications, 5(12), 97-106. https://europub.co.uk./articles/-A-116496