Parallel web crawling

From Computer Science Wiki
Latest revision as of 15:08, 11 January 2018

Web Science[1]


Parallel web crawling

A parallel crawler is a crawler that runs multiple processes in parallel. The goal is to maximize the download rate while minimizing the overhead from parallelization and avoiding repeated downloads of the same page. Because two different crawling processes can discover the same URL, the crawling system needs a policy for assigning newly discovered URLs to processes so that no page is downloaded more than once.[2]
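
As an illustration of such an assignment policy, the sketch below shows one possible approach: each URL is mapped to a crawling process by hashing its host name, so the same URL always lands in the same process's queue no matter which process discovered it. This is a minimal sketch, not a description of any particular crawler; the function names `assign_worker` and `partition`, the choice of SHA-256, and the decision to hash the host rather than the full URL are all assumptions made for the example.

```python
import hashlib
from urllib.parse import urlsplit


def assign_worker(url: str, num_workers: int) -> int:
    """Return the index of the crawling process responsible for this URL.

    Assumption for this sketch: responsibility is decided by hashing the
    host name, so every URL on the same host goes to the same process.
    """
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a stable integer.
    return int.from_bytes(digest[:8], "big") % num_workers


def partition(urls, num_workers: int):
    """Distribute discovered URLs into one queue per crawling process.

    URLs that were already seen are skipped, so each page is queued
    for download at most once.
    """
    queues = [[] for _ in range(num_workers)]
    seen = set()
    for url in urls:
        if url in seen:
            continue  # discovered again, possibly by another process
        seen.add(url)
        queues[assign_worker(url, num_workers)].append(url)
    return queues


if __name__ == "__main__":
    discovered = [
        "http://example.com/a",
        "http://example.com/b",
        "http://example.org/",
        "http://example.com/a",  # duplicate discovery
    ]
    for index, queue in enumerate(partition(discovered, 4)):
        print(index, queue)
```

Because the mapping depends only on the URL itself, two processes that find the same link independently agree on who should download it without consulting each other, which keeps coordination overhead low. A fixed mapping like this does not rebalance work if one host turns out to be much larger than the others.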

Standards

These standards are taken from the IB Computer Science Subject Guide.[3]

  • Discuss the use of parallel web crawling.

References

  1. http://www.flaticon.com/
  2. https://en.wikipedia.org/wiki/Web_crawler#Parallelization_policy
  3. IB Diploma Programme Computer science guide (first examinations 2014). Cardiff, Wales, United Kingdom: International Baccalaureate Organization. January 2012.