Web crawler functions

From Computer Science Wiki

A Web crawler, sometimes called a spider, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing (web spidering).

Web search engines and some other sites use Web crawling or spidering software to update their own web content or their indices of other sites' web content. Web crawlers copy pages for processing by a search engine, which indexes the downloaded pages so users can search more efficiently.[2]

web crawler = web spider = bot = web robot

How a web crawler functions

Typically, bots perform tasks that are both simple and structurally repetitive, at a much higher rate than would be possible for a human alone. The largest use of bots is in web spidering (web crawler), in which an automated script fetches, analyzes and files information from web servers at many times the speed of a human. More than half of all web traffic is made up of bots.

The usual starting points are lists of heavily used servers and very popular pages. The spider will begin with a popular site, indexing the words on its pages and following every link found within the site. In this way, the spidering system quickly begins to travel, spreading out across the most widely used portions of the Web.

In pseudocode, we might imagine a web crawler working like this:

queue = LoadSeed()
while (queue is not empty)
{
    dequeue url
    request document
    store document for later processing
    parse document for links
    add unseen links to queue
}
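
The pseudocode above can be turned into a runnable sketch. To keep the example self-contained, it crawls a tiny in-memory "web" instead of making real HTTP requests; the page names, their contents, and the helper names (`crawl`, `PAGES`) are made up for illustration:

```python
import re
from collections import deque

# A tiny in-memory "web" standing in for real HTTP responses
# (all page names and contents here are hypothetical).
PAGES = {
    "a.html": '<a href="b.html">B</a> <a href="c.html">C</a>',
    "b.html": '<a href="a.html">A</a>',
    "c.html": "no links here",
}

def crawl(seeds):
    queue = deque(seeds)          # queue = LoadSeed()
    seen = set(seeds)             # remember every URL ever enqueued
    stored = {}                   # documents kept for later processing
    while queue:                  # while (queue is not empty)
        url = queue.popleft()     # dequeue url
        doc = PAGES.get(url, "")  # request document
        stored[url] = doc         # store document for later processing
        for link in re.findall(r'href="([^"]+)"', doc):  # parse for links
            if link not in seen:  # add unseen links to queue
                seen.add(link)
                queue.append(link)
    return stored

print(sorted(crawl(["a.html"])))  # → ['a.html', 'b.html', 'c.html']
```

Note the `seen` set: the pseudocode's "add unseen links to queue" is what keeps the crawler from looping forever when pages link back to each other.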

Do you understand this?

Web crawler and meta-data
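
One common link between meta-data and crawlers is the robots meta-tag: a page author can place directives such as noindex or nofollow in a `<meta name="robots">` tag, and a well-behaved crawler reads them to decide whether to index the page or follow its links. A minimal sketch using Python's built-in html.parser (the sample page is hypothetical):

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects directives found in <meta name="robots" ...> tags."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            content = attrs.get("content", "")
            self.directives |= {d.strip().lower() for d in content.split(",")}

# Hypothetical page that asks crawlers not to index it,
# but allows them to follow its links.
html = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
parser = RobotsMetaParser()
parser.feed(html)
print("noindex" in parser.directives)  # → True
```

A crawler would check `parser.directives` before adding the page to its index or enqueueing its links.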

Parallel web crawling
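
Because a crawler spends most of its time waiting on the network, the work of fetching pages can be split across several workers running at once. A minimal sketch using a thread pool, where a made-up link graph (`LINKS`, `fetch_links` are illustrative stand-ins for real HTTP fetching and parsing) and each "level" of the frontier is fetched concurrently:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical link graph standing in for pages fetched over HTTP.
LINKS = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": [],
}

def fetch_links(url):
    # In a real crawler this would download and parse the page.
    return LINKS.get(url, [])

def parallel_crawl(seeds, workers=4):
    seen = set(seeds)
    frontier = list(seeds)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            # Fetch the whole frontier concurrently, one level at a time.
            results = pool.map(fetch_links, frontier)
            frontier = []
            for links in results:
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        frontier.append(link)
    return seen

print(sorted(parallel_crawl(["a"])))  # → ['a', 'b', 'c', 'd']
```

Real parallel crawlers must also coordinate so that workers do not fetch the same URL twice or overload a single server; here the shared `seen` set plays that deduplication role.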

Standards

  • Describe how a web crawler functions.
  • Discuss the relationship between data in a meta-tag and how it is accessed by a web crawler.
  • Discuss the use of parallel web crawling.

References