Parallel web crawling

From Computer Science Wiki
Revision as of 15:58, 9 January 2018 by Mr. MacKenty
Web Science[1]


Parallel web crawling

A parallel crawler is a crawler that runs multiple crawling processes in parallel. The goal is to maximize the download rate while minimizing the overhead of parallelization and avoiding repeated downloads of the same page. Because the same URL can be discovered by two different crawling processes, the crawling system needs a policy for assigning newly discovered URLs to processes, so that each page is downloaded only once.[2]
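One possible assignment policy can be sketched in Python. This is a minimal illustration, not a method described on this page: it assumes a fixed number of crawling processes and assigns each URL by hashing its host name, so the same URL always lands with the same process no matter which process discovered it. The names assign_worker and partition are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def assign_worker(url: str, num_workers: int) -> int:
    """Map a URL to a worker index by hashing its host name.

    Assumption: hashing by host keeps all pages of one site on one
    process; hashing the full URL would be another valid choice.
    """
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a stable integer.
    return int.from_bytes(digest[:8], "big") % num_workers


def partition(urls, num_workers):
    """Group discovered URLs into one de-duplicated queue per worker."""
    queues = [[] for _ in range(num_workers)]
    seen = set()
    for url in urls:
        if url in seen:
            continue  # already assigned, so it is not downloaded twice
        seen.add(url)
        queues[assign_worker(url, num_workers)].append(url)
    return queues
```

With a policy like this, a process that finds a URL belonging to another process would hand it over rather than download it. How that hand-over happens (for example, through a shared queue) is left out of the sketch.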

Standards

  • Discuss the use of parallel web crawling.

References