Parallel web crawling

Revision as of 16:58, 9 January 2018

Web Science^[1]

Parallel web crawling

A parallel crawler is a crawler that runs multiple crawling processes in parallel. The goal is to maximize the download rate while minimizing the overhead of parallelization and avoiding repeated downloads of the same page. Because the same URL can be discovered by two different crawling processes, the crawling system needs a policy for assigning newly discovered URLs to processes, so that each page is downloaded only once.^[2]
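One possible assignment policy can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes a static, hash-based policy in which every URL is owned by exactly one crawling process, chosen by hashing the URL's host name. The names `owner_of`, `Worker`, and `dispatch` are hypothetical and exist only for this example, and no pages are actually downloaded.

```python
import hashlib
from urllib.parse import urlsplit


def owner_of(url: str, num_workers: int) -> int:
    """Return the index of the worker that owns this URL (assumed policy: hash of host)."""
    host = urlsplit(url).netloc.lower()
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a stable integer across processes.
    return int.from_bytes(digest[:8], "big") % num_workers


class Worker:
    """Stand-in for one crawling process; it only tracks URLs, it does not fetch them."""

    def __init__(self, index: int) -> None:
        self.index = index
        self.seen = set()    # URLs this worker has already accepted
        self.frontier = []   # URLs waiting to be downloaded by this worker

    def accept(self, url: str) -> bool:
        """Queue the URL unless this worker has seen it before."""
        if url in self.seen:
            return False
        self.seen.add(url)
        self.frontier.append(url)
        return True


def dispatch(url: str, workers: list) -> bool:
    """Forward a newly discovered URL to its owning worker."""
    return workers[owner_of(url, len(workers))].accept(url)


# Two processes discover the same URL; only the first dispatch queues it.
workers = [Worker(i) for i in range(3)]
first = dispatch("https://example.org/a", workers)
second = dispatch("https://example.org/a", workers)
```

Because ownership depends only on the URL, each worker needs to remember just the URLs it owns, and no shared "already downloaded" table is required. Hashing the host rather than the full URL keeps all pages of a site on one worker, which is a design choice of this sketch and not something the text above prescribes.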

Standards

Discuss the use of parallel web crawling.

References

[1] http://www.flaticon.com/

[2] https://en.wikipedia.org/wiki/Web_crawler#Parallelization_policy
