Improving Data Collection on Article Clustering by Using Distributed Focused Crawler

Journal Title: Data Science: Journal of Computing and Applied Informatics - Year 2017, Vol 1, Issue 1

Abstract

Collecting or harvesting data from the Internet is often done with a web crawler. A general web crawler can be developed to focus on a certain topic; this type of web crawler is called a focused crawler. To improve data collection performance, creating a focused crawler alone is not enough, since its benefit is efficient use of network bandwidth and storage capacity. This research proposes a distributed focused crawler that improves web crawler performance while remaining efficient in network bandwidth and storage capacity. The distributed focused crawler implements crawl scheduling, site ordering to determine the URL queue, and focused crawling using Naïve Bayes. This research also tests web crawling performance by running the crawler with multiple threads and observing CPU and memory utilization. The conclusion is that web crawling performance decreases when too many threads are used: CPU and memory utilization become very high, while the performance of the distributed focused crawler becomes low.
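
The abstract names two building blocks, a Naïve Bayes relevance filter and an ordered URL queue, without giving implementation details. The following is a minimal sketch of how such pieces might look, not the authors' implementation. The class names `NaiveBayesRelevance` and `Frontier`, the two labels, the whitespace tokenization, and the numeric priority are all assumptions made for illustration.

```python
import heapq
import math
from collections import Counter


class NaiveBayesRelevance:
    """Tiny multinomial Naive Bayes with Laplace smoothing (illustrative only).

    Assumes at least one training text has been seen for each label.
    """

    LABELS = ("relevant", "irrelevant")  # hypothetical label names

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()
        self.vocab = set()

    def train(self, text, label):
        words = text.lower().split()  # naive whitespace tokenization
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def log_score(self, text, label):
        # log P(label) + sum of log P(word | label)
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        for word in text.lower().split():
            count = self.word_counts[label][word] + 1  # Laplace smoothing
            score += math.log(count / (total_words + len(self.vocab)))
        return score

    def is_relevant(self, text):
        return self.log_score(text, "relevant") > self.log_score(text, "irrelevant")


class Frontier:
    """URL queue ordered by priority; the lowest value is crawled first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, priority):
        # Skip URLs that were already queued so no page is fetched twice.
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (priority, url))

    def pop(self):
        return heapq.heappop(self._heap)[1]

    def __len__(self):
        return len(self._heap)
```

In a crawler built along these lines, each fetched page would be passed to `is_relevant`, and only the links of relevant pages would be pushed onto the `Frontier`, with a priority derived from whatever site ordering is in use.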

Authors and Affiliations

Dani Gunawan, Amalia Amalia, Atras Najwan


  • EP ID EP435197
  • DOI 10.32734/jocai.v1.i1-82

How To Cite

Dani Gunawan, Amalia Amalia, Atras Najwan (2017). Improving Data Collection on Article Clustering by Using Distributed Focused Crawler. Data Science: Journal of Computing and Applied Informatics, 1(1), 1-12. https://europub.co.uk./articles/-A-435197