Parallel web crawling

From Computer Science Wiki
Web Science[1]



A parallel crawler is a crawler that runs multiple crawling processes in parallel. The goal is to maximize the download rate while minimizing the overhead of parallelization and avoiding repeated downloads of the same page. Because the same URL can be discovered by two different crawling processes, the crawling system needs a policy for assigning newly discovered URLs to processes, so that each page is downloaded only once.[2]
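One possible assignment policy can be sketched as follows. This is a minimal illustration, not a description of any particular crawler: it assumes a fixed number of processes and assigns each URL to a process by hashing its host name, so that whichever process discovers a URL, the same owner always receives it. The names owner_of, CrawlerProcess and route are hypothetical, and the "processes" are simulated as plain objects in a single program rather than real operating system processes.

```python
import hashlib
from urllib.parse import urlsplit


def owner_of(url: str, num_processes: int) -> int:
    """Return the index of the (hypothetical) process responsible for url.

    The host name is hashed with SHA-256 so that the result is the same in
    every process and on every run, unlike Python's built-in hash() for str.
    """
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_processes


class CrawlerProcess:
    """Stand-in for one crawling process: a seen-set plus a download queue."""

    def __init__(self, index: int) -> None:
        self.index = index
        self.seen = set()    # URLs this process has already accepted
        self.frontier = []   # URLs waiting to be downloaded

    def accept(self, url: str) -> bool:
        """Queue url unless it was accepted before; report whether it was new."""
        if url in self.seen:
            return False
        self.seen.add(url)
        self.frontier.append(url)
        return True


def route(url: str, processes: list) -> bool:
    """Hand a newly discovered URL to its owning process."""
    owner = processes[owner_of(url, len(processes))]
    return owner.accept(url)


if __name__ == "__main__":
    processes = [CrawlerProcess(i) for i in range(4)]
    # The same URL is "discovered" twice, as if by two different processes.
    for url in ["http://example.com/a", "http://example.org/b", "http://example.com/a"]:
        status = "queued" if route(url, processes) else "duplicate, skipped"
        print(url, "->", status)
```

Because ownership depends only on the URL, duplicate detection needs no shared state: each process checks only its own seen-set. The trade-off in this sketch is that URLs must be forwarded between processes, which is part of the parallelization overhead mentioned above.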

Standards

These standards are taken from the IB Computer Science Subject Guide.[3]

  • Discuss the use of parallel web crawling. (Discuss: offer a considered and balanced review that includes a range of arguments, factors or hypotheses. Opinions or conclusions should be presented clearly and supported by appropriate evidence.)

References

  1. http://www.flaticon.com/
  2. https://en.wikipedia.org/wiki/Web_crawler#Parallelization_policy
  3. IB Diploma Programme Computer science guide (first examinations 2014). Cardiff, Wales, United Kingdom: International Baccalaureate Organization. January 2012.