Parallel web crawling

From Computer Science Wiki
Web Science[1]


A parallel crawler is a web crawler that runs multiple crawling processes in parallel. The goal is to maximize the download rate while minimizing the overhead of parallelization and avoiding repeated downloads of the same page. Because the same URL can be discovered by two different crawling processes, the crawling system needs a policy for assigning newly discovered URLs to processes so that no page is downloaded more than once.[2]
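
As an illustration of such an assignment policy, the sketch below shows one possible approach: each discovered URL is assigned to a crawling process by hashing its host name, so a URL always goes to the same process no matter which process found it. This is a minimal sketch and not a complete crawler. The function names `assign_worker` and `partition`, the choice of SHA-256, and the decision to hash the host rather than the full URL are all assumptions made for this example.

```python
import hashlib
from urllib.parse import urlsplit


def assign_worker(url: str, num_workers: int) -> int:
    """Map a URL to a worker index by hashing its host name.

    Hypothetical helper: hashing the host is an assumption of this
    sketch; a real system might hash the full URL or assign dynamically.
    """
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a stable integer.
    return int.from_bytes(digest[:8], "big") % num_workers


def partition(urls, num_workers):
    """Group discovered URLs into one queue per worker, skipping repeats."""
    queues = [[] for _ in range(num_workers)]
    seen = set()
    for url in urls:
        if url in seen:  # the same URL may be found by two processes
            continue
        seen.add(url)
        queues[assign_worker(url, num_workers)].append(url)
    return queues


if __name__ == "__main__":
    discovered = [
        "http://example.com/a",
        "http://example.com/b",
        "http://example.org/",
        "http://example.com/a",  # duplicate discovery
    ]
    for index, queue in enumerate(partition(discovered, 4)):
        print(index, queue)
```

Because the worker index depends only on the URL's host, two processes that discover the same link independently agree on which process should download it. This avoids repeated downloads without requiring a central coordinator for every URL.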

Standards

These standards are taken from the IB Computer Science Subject Guide.[3]

  • Discuss the use of parallel web crawling.

References

  1. http://www.flaticon.com/
  2. https://en.wikipedia.org/wiki/Web_crawler#Parallelization_policy
  3. IB Diploma Programme Computer science guide (first examinations 2014). Cardiff, Wales, United Kingdom: International Baccalaureate Organization. January 2012.