Principles of searching algorithms used by search engines: Difference between revisions

From Computer Science Wiki
[[file:Connection.png|right|frame|Web Science<ref>http://www.flaticon.com/</ref>]]


  Most popular search algorithms assign each page a "page rank" based on how many other pages link to it. Search algorithms weight the links between pages: a page with 10 incoming links generally has a higher weight than a page with 2. Not all links count equally, however. '''Search algorithms work by measuring the quantity and quality of links to a page rather than the actual content on the page'''.


''Note: from the IB: Students will be expected to understand only the principles of the PageRank and HITS algorithms''


=== PageRank ===  
PageRank, a search algorithm used by Google, works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.<ref>https://en.wikipedia.org/wiki/PageRank</ref>
<br />
The image below<ref>https://en.wikipedia.org/wiki/PageRank#/media/File:PageRanks-Example.svg</ref> (click to enlarge) is a graphical representation of PageRank. Note that circle B is large because many other pages link to it. Now look at circle C: why is C so large when so few pages link to it? (The answer is below.)


[[File:400px-PageRanks-Example.png|400px]]
=== HITS ===
This is also known as "hubs and authorities". The idea behind Hubs and Authorities stemmed from a particular insight into the creation of web pages when the Internet was originally forming; that is, certain web pages, known as hubs, served as large directories that were not actually authoritative in the information that they held, but were used as compilations of a broad catalog of information that led users directly to other authoritative pages. In other words, a good hub represented a page that pointed to many other pages, and a good authority represented a page that was linked by many different hubs.<ref>https://en.wikipedia.org/wiki/HITS_algorithm</ref>
<br />
A hub has outgoing links; an authority has incoming links.
<br />
<br />
Hyperlink-Induced Topic Search assigns two scores for each page: its authority, which estimates the value of the content of the page, and its hub value, which estimates the value of its links to other pages.
<ref>http://www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture4/lecture4.html</ref>


[https://www.youtube.com/watch?v=-kiKUYM9Qq8 A great video is here]


 
<br />
=== Difference between HITS and PageRank ===
[[File:HITSExample.png|400px]]
 


== Do you understand this? ==
Circle C is large because the page that links to it is itself an authoritative source. Compare this with circle A, which is not linked from an authoritative source.
== Standards ==
These standards are taken from the IB Computer Science Subject Guide.<ref>IB Diploma Programme Computer science guide (first examinations 2014). Cardiff, Wales, United Kingdom: International Baccalaureate Organization. January 2012.</ref>


* Outline the principles of searching algorithms used by search engines.

Latest revision as of 14:49, 22 November 2022

Web Science[1]
Most popular search algorithms assign each page a "page rank" based on how many other pages link to it. Search algorithms weight the links between pages: a page with 10 incoming links generally has a higher weight than a page with 2. Not all links count equally, however. Search algorithms work by measuring the quantity and quality of links to a page rather than the actual content on the page.
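
As a minimal sketch of the "quantity" half of this idea, the Python below counts how many distinct pages link to each page. The link graph and the function name `count_inbound_links` are hypothetical and chosen only for illustration; a real search engine would build the graph by crawling the web.

```python
from collections import Counter

# Hypothetical link graph: each page maps to the pages it links to.
links = {
    "A": ["C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C", "A"],
}

def count_inbound_links(links):
    """Return the number of distinct pages that link to each page."""
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):  # a page linking twice still counts once
            counts[target] += 1
    return dict(counts)

inbound = count_inbound_links(links)
print(inbound)
```

Counting alone treats every link as equal, which is why ranking algorithms also weigh where each link comes from.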

Note: from the IB: Students will be expected to understand only the principles of the PageRank and HITS algorithms

PageRank

PageRank, a search algorithm used by Google, works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.[2]
The image below[3] (click to enlarge) is a graphical representation of PageRank. Note that circle B is large because many other pages link to it. Now look at circle C: why is C so large when so few pages link to it? (The answer is below.)

400px-PageRanks-Example.png
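
The principle can be sketched as a simple iterative calculation. The Python below is an illustrative simplification, not Google's implementation: the link graph is hypothetical (it is not the graph in the image), the function name `pagerank` is our own, and the damping factor of 0.85 and the even redistribution of rank from pages with no outgoing links are common conventions assumed here.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Estimate PageRank scores by repeatedly passing rank along links."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with every page equally important.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its rank
                # evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical graph: B has many inbound links; C and A have one each.
links = {
    "A": [],
    "B": ["C"],
    "C": ["B"],
    "D": ["A", "B"],
    "E": ["B"],
    "F": ["B"],
}

ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this hypothetical graph, C and A each have exactly one inbound link, yet C scores far higher because its single link comes from the highly ranked page B, while A's comes from the low-ranked page D.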

HITS

This is also known as "hubs and authorities". The idea behind Hubs and Authorities stemmed from a particular insight into the creation of web pages when the Internet was originally forming; that is, certain web pages, known as hubs, served as large directories that were not actually authoritative in the information that they held, but were used as compilations of a broad catalog of information that led users directly to other authoritative pages. In other words, a good hub represented a page that pointed to many other pages, and a good authority represented a page that was linked by many different hubs.[4]
A hub has outgoing links; an authority has incoming links.
Hyperlink-Induced Topic Search assigns two scores for each page: its authority, which estimates the value of the content of the page, and its hub value, which estimates the value of its links to other pages.

HITS identifies good authorities and hubs for a topic by assigning two numbers to each page: an authority weight and a hub weight. These weights are defined recursively: a page has a higher authority weight if it is pointed to by pages with high hub weights, and a higher hub weight if it points to many pages with high authority weights.[5]
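
The recursive definition can be sketched as a loop that alternates the two updates. The Python below is a minimal illustration under stated assumptions: the link graph, the page names, and the function name `hits` are hypothetical, and the scores are rescaled after every round (a common convention) so that they stay bounded.

```python
import math

def hits(links, iterations=20):
    """Return (hub, authority) score dictionaries for a link graph."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}

    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking to it.
        auth = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]

        # Hub score: sum of the authority scores of the pages it links to.
        hub = {page: sum(auth[t] for t in links.get(page, [])) for page in pages}

        # Rescale both sets of scores so they do not grow without bound.
        auth_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {page: v / auth_norm for page, v in auth.items()}
        hub = {page: v / hub_norm for page, v in hub.items()}

    return hub, auth

# Hypothetical graph: two directory-style pages pointing at content pages.
links = {
    "hub1": ["pageA", "pageB"],
    "hub2": ["pageA", "pageB", "pageC"],
    "pageC": ["pageA"],
}

hub, auth = hits(links)
print("best hub:", max(hub, key=hub.get))
print("best authority:", max(auth, key=auth.get))
```

Here `hub1` receives no inbound links, so its authority score is zero even though its hub score is high; this shows why HITS keeps the two scores separate.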

A great video is here


HITSExample.png

Do you understand this?

Circle C is large because the page that links to it is itself an authoritative source. Compare this with circle A, which is not linked from an authoritative source.

Standards

These standards are taken from the IB Computer Science Subject Guide.[6]

  • Outline the principles of searching algorithms used by search engines.

References