Principles of searching algorithms used by search engines

From Computer Science Wiki
Web Science[1]
Most popular search algorithms establish a "page rank" for each page based on how many other pages link to it. Search algorithms weight the links between pages: all else being equal, a page which has 10 links to it has a higher weight than a page which has 2 links to it. However, not all links are the same; a link from an important page counts for more than a link from an obscure one. These algorithms therefore work by measuring the quantity and quality of links to a page rather than the actual content on the page.

Note: from the IB: Students will be expected to understand only the principles of the PageRank and HITS algorithms

PageRank

PageRank is a search algorithm used by Google. It works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.[2]
The image below[3] is a graphical representation of PageRank. Note that circle B is large because many other pages link to it. Now look at circle C. Why is C so large with so few links? (The answer is below.)

400px-PageRanks-Example.png
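
The principle can be tried out on a tiny, made-up web. The sketch below is a simplified, iterative version of the idea and not Google's actual implementation. The page names, the `pagerank` function, the damping factor of 0.85 (a commonly quoted value) and the iteration count are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate a rank for every page.

    links maps each page name to the list of pages it links to.
    damping and iterations are illustrative assumptions.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)

    # Start with the same rank for every page.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small base amount of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page shares its rank equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links shares its rank with everyone.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: "endorsed" and "minor" each receive exactly one link.
links = {
    "p1": ["popular"],
    "p2": ["popular"],
    "p3": ["popular", "minor"],
    "minor": ["popular"],
    "popular": ["endorsed"],
    "endorsed": ["popular"],
}

ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page:10s} {score:.3f}")
```

Running the sketch and comparing the scores of `endorsed` and `minor` shows how the same number of incoming links can lead to very different ranks, depending on which page the link comes from.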

HITS

This is also known as "hubs and authorities". The idea behind hubs and authorities stemmed from an insight into how web pages were created when the Internet was originally forming: certain web pages, known as hubs, served as large directories that were not themselves authoritative in the information they held, but were compilations of a broad catalog of information that led users directly to other, authoritative pages. In other words, a good hub is a page that points to many other pages, and a good authority is a page that is linked to by many different hubs.[4]
A hub has outgoing links; an authority has incoming links.
Hyperlink-Induced Topic Search (HITS) assigns two scores to each page: its authority, which estimates the value of the content of the page, and its hub value, which estimates the value of its links to other pages.

HITS identifies good authorities and hubs for a topic by assigning two numbers to a page: an authority weight and a hub weight. These weights are defined recursively. A page has a higher authority weight if it is pointed to by pages with high hub weights. A page has a higher hub weight if it points to many pages with high authority weights.[5]
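
The recursive definition above can be sketched as a loop that keeps updating both weights until they settle. This is a minimal illustration and not a full implementation: real HITS is run on a set of pages related to a particular query, while here the whole (made-up) graph is used. The page names, the `hits` function and the iteration count are assumptions.

```python
import math


def hits(links, iterations=20):
    """Return (hub, authority) weights for every page.

    links maps each page name to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {page: 1.0 for page in pages}
    authority = {page: 1.0 for page in pages}

    for _ in range(iterations):
        # Authority weight: sum of the hub weights of pages that link to it.
        authority = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                authority[target] += hub[source]

        # Hub weight: sum of the authority weights of the pages it links to.
        hub = {
            page: sum(authority[target] for target in links.get(page, []))
            for page in pages
        }

        # Normalise so the weights do not grow without bound.
        authority_norm = math.sqrt(sum(v * v for v in authority.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        authority = {page: v / authority_norm for page, v in authority.items()}
        hub = {page: v / hub_norm for page, v in hub.items()}

    return hub, authority


# Hypothetical toy web: directory-style pages pointing at content pages.
links = {
    "directory1": ["paperA", "paperB", "paperC"],
    "directory2": ["paperA", "paperB"],
    "blog": ["paperA"],
}

hub_scores, authority_scores = hits(links)
print("best hub:", max(hub_scores, key=hub_scores.get))
print("best authority:", max(authority_scores, key=authority_scores.get))
```

In this toy graph the directory pages end up with hub weight but no authority weight, and the content pages end up with authority weight but no hub weight, which mirrors the "a hub has outgoing links, an authority has incoming links" summary.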


HITSExample.png

Do you understand this?

Circle C is large because the link it receives comes from an authoritative, highly ranked page, so that single link carries a lot of weight. Compare this to circle A, which is not linked to from an authoritative source.

Standards

These standards are taken from the IB Computer Science Subject Guide[6]

  • Outline the principles of searching algorithms used by search engines.

References