Principles of searching algorithms used by search engines

From Computer Science Wiki
Web Science[1]
Most popular search algorithms establish a "page rank" for each page based on how many other pages link to it. Search algorithms weight the links between pages: all else being equal, a page with 10 links to it has a higher weight than a page with 2 links to it. Not all links are equal, however; a link from an important page counts for more than a link from an obscure one. These algorithms work by measuring the quantity and quality of links to a page rather than the actual content on the page.

Note: from the IB: Students will be expected to understand only the principles of the PageRank and HITS algorithms


PageRank

PageRank is a search algorithm used by Google. It works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.[2]
The image below[3] is a graphical representation of PageRank. Note that circle B is large because many other pages link to it. Now look at circle C. Why is C so large with so few links? (The answer is below.)

400px-PageRanks-Example.png
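
The principle can be sketched in a few lines of Python. This is a minimal illustration, not Google's actual implementation: the function name pagerank, the damping factor of 0.85, the fixed number of iterations, and the six-page example graph are all assumptions made for this example. Each page repeatedly shares its current rank equally among the pages it links to, so a link from a highly ranked page is worth more than a link from a low-ranked one.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate page importance from link structure (illustrative sketch).

    links maps each page to the list of pages it links to.
    damping and iterations are assumed values, not taken from the article.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal rank everywhere

    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical graph: B has many incoming links; C and A have one each.
example_links = {
    "A": ["B"],
    "B": ["C"],
    "C": ["B"],
    "D": ["A", "B"],
    "E": ["B"],
    "F": ["B"],
}

ranks = pagerank(example_links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this hypothetical graph, C and A each have exactly one incoming link, yet C ends up ranked far above A, because C's single link comes from the highly ranked page B while A's comes from the low-ranked page D.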

HITS

This is also known as "hubs and authorities". The idea behind hubs and authorities stemmed from a particular insight into how web pages were created when the Internet was first forming: certain web pages, known as hubs, served as large directories that were not themselves authoritative in the information they held, but were compilations of a broad catalog of information that led users directly to other, authoritative pages. In other words, a good hub is a page that points to many other pages, and a good authority is a page that is linked to by many different hubs.[4]
HITS stands for Hyperlink-Induced Topic Search. It assigns two scores to each page: its authority, which estimates the value of the content of the page, and its hub value, which estimates the value of its links to other pages.

HITS identifies good authorities and hubs for a topic by assigning two numbers to a page: an authority weight and a hub weight. These weights are defined recursively. A page has a higher authority weight if it is pointed to by pages with high hub weights, and a higher hub weight if it points to many pages with high authority weights.[5]
HITSExample.png
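
The recursive definition can be illustrated with a short Python sketch. It is a simplified version of the idea, not a reference implementation: the function name hits, the fixed iteration count, the normalisation step, and the example pages are assumptions made for this example. Each round recomputes every authority score from the hub scores of the pages linking in, then every hub score from the authority scores of the pages linked to.

```python
import math


def hits(links, iterations=50):
    """Compute hub and authority scores (illustrative sketch).

    links maps each page to the list of pages it links to.
    Returns two dictionaries: (hub_scores, authority_scores).
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority: sum of the hub scores of the pages that link to this page.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]

        # Hub: sum of the authority scores of the pages this page links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}

        # Normalise so the scores stay bounded from round to round.
        auth_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / auth_norm for p, v in auth.items()}
        hub = {p: v / hub_norm for p, v in hub.items()}

    return hub, auth


# Hypothetical pages: two directory-style pages and a blog linking to three sites.
example_links = {
    "directory1": ["siteX", "siteY", "siteZ"],
    "directory2": ["siteX", "siteY"],
    "blog": ["siteX"],
}

hub_scores, auth_scores = hits(example_links)
print("Best hub:", max(hub_scores, key=hub_scores.get))
print("Best authority:", max(auth_scores, key=auth_scores.get))
```

In this hypothetical graph, directory1 is the strongest hub because it points to the most authorities, and siteX is the strongest authority because every hub points to it. The sites themselves have a hub score of zero because they link to nothing.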

Do you understand this?

Answer to the PageRank question above: circle C is large because it is linked to from an authoritative (highly ranked) source. Compare this to circle A, which isn't linked to from an authoritative source.

Standards

These standards are taken from the IB Computer Science Subject Guide.[6]

  • Outline the principles of searching algorithms used by search engines.

References