Until the early 1990s, search engines used text-based ranking systems to decide which pages were most relevant to a given query. These systems took a user-entered keyword and counted how often that word appeared on each page in the engine's index. The pages containing the keyword most often were labeled most relevant and appeared at the top of the "suggested" websites list; pages where the word appeared less frequently fell toward the end of the list, and the pattern continued this way. However, this approach came with a number of problems. For example, say I wanted to learn more about Purdue University, and I typed the word "Purdue" into my search engine. Ideally, a source with vast and comprehensive information about the university should be the prioritized web page. Yet what if I had created a website consisting solely of the word "Purdue" repeated thousands of times, with no other information? That page would likely be my top suggestion, even though it contains minimal to no actual material about the school. The underlying problem with text-based ranking systems is that keyword frequency does not correctly measure the relevance of a web page to the user's interests: one page may contain a great deal more vital information than another and still mention the actual keyword less often.
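The flaw described above can be made concrete with a short sketch of such a text-based ranker. The page texts and names here are hypothetical stand-ins for crawled pages, not real data:

```python
# A minimal sketch of the pre-PageRank, text-based ranking idea:
# rank pages purely by how often the keyword appears on each one.

def keyword_frequency_rank(pages, keyword):
    """Return page names sorted by raw keyword count, highest first."""
    counts = {name: text.lower().split().count(keyword.lower())
              for name, text in pages.items()}
    return sorted(counts, key=counts.get, reverse=True)

# Hypothetical pages: one stuffed with the keyword, one with real content.
pages = {
    "spam_page": "Purdue " * 1000,
    "real_page": "Purdue University is a public research university in Indiana.",
}

# The stuffed page wins, illustrating the flaw described above.
print(keyword_frequency_rank(pages, "Purdue"))  # ['spam_page', 'real_page']
```

The keyword-stuffed page outranks the informative one, which is exactly the failure mode that motivated PageRank.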
Researchers debated the topic earnestly, eager to find an algorithm more effective than the existing text-based ranking systems. In 1998, Stanford graduate students Larry Page and Sergey Brin proposed the "PageRank" algorithm, an alternative method of matching websites to keywords. The PageRank algorithm is based on the assumption that a website's relevance is directly correlated with the quality and number of the pages that link to it. Essentially, more important websites are likely to receive more links from other websites. For example, if we create a certain web page a and include a hyperlink to web page b, then we consider b important and relevant to our topic. If there are lots of pages linking to b, the common consensus is that b stores meaningful information and will likely be a top suggestion. We can iteratively assign a rank to each web page based on the ranks of the pages pointing to it. To understand this algorithm and its connection to linear algebra, it helps to picture the World Wide Web as a directed graph, with the web pages as nodes and the links between them as edges. Let us work through an example of the PageRank algorithm by imagining a minuscule Internet consisting of just four websites connected to each other through hyperlinks on each page. For simplicity, we can call them "Page 1", "Page 2", "Page 3", and "Page 4". They reference each other in the way described by the picture below.
We can translate this picture of interconnected websites into a directed graph with each website as a node. Edges are created from the hyperlinks listed on each web page. For example, if my current page 3 contains a link to page 1, I will create an edge stemming from node 3 and directed toward node 1. After all of these relationships have been correctly identified and constructed, the directed graph should look like the picture below, with nodes 1-4 corresponding to pages 1-4.
The final step in creating this directed graph is to assign a weight to each edge. An edge weight can be described as "the likelihood of following that edge", or in our case, "the importance of the edge relative to the node it is directed from". We will operate under the condition that, when on a certain web page, you must choose one of its hyperlinks to navigate to, each with equal probability. For example, because web page 1 has links to pages 2, 3, and 4, the probability of going from page 1 to each of the other web pages is a uniform 1/3. Because node 2 only has links to pages 3 and 4, the weight assigned to each of its edges is 1/2; in other words, node 2 transfers half of its importance to node 3 and the other half to node 4. Filling in these edge weights yields a graph corresponding to the figure below.
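The edge-weight rule is simple to express in code: each page splits its importance equally among its out-links. The out-links for pages 1-3 follow the text above; the out-links assumed for page 4 (to pages 1 and 3) are an illustrative assumption, since the original figure is not reproduced here:

```python
# Out-link lists for the four-page example. Pages 1-3 follow the text;
# page 4's links are ASSUMED for illustration (the figure is not shown).
links = {
    1: [2, 3, 4],   # page 1 links to pages 2, 3, and 4
    2: [3, 4],      # page 2 links to pages 3 and 4
    3: [1],         # page 3 links only to page 1
    4: [1, 3],      # assumption
}

# Each page divides its importance uniformly over its out-links.
edge_weights = {
    (src, dst): 1 / len(dsts)
    for src, dsts in links.items()
    for dst in dsts
}

print(edge_weights[(1, 2)])  # 1/3: page 1 has three out-links
print(edge_weights[(3, 1)])  # 1.0: page 1 is page 3's only out-link
```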
This graph, in turn, can be translated into a transition matrix between the nodes. The transition matrix is an alternative way of describing the relationships between nodes, and it makes the necessary calculations convenient to carry out. Notice that all the values across the diagonal are 0. This is because, in our model, a web page cannot link to itself, which corresponds to an edge weight of 0. All other entries in the matrix are listed as they were in the directed graph. The nodes listed horizontally across the top are the "from" nodes, and the nodes listed vertically on the left are the "to" nodes (e.g., the edge weight from node 3 to node 1 is 1, because the link to page 1 is the sole hyperlink on page 3).
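The transition matrix and the iterative rank assignment described above can be sketched together. The columns for pages 1-3 follow the text; the column for page 4 (assumed links to pages 1 and 3) is an illustrative assumption, since the original figure is not shown:

```python
import numpy as np

# Transition matrix A: column j holds the edge weights leaving page j+1,
# so each column sums to 1. Column 4 is an ASSUMPTION for illustration.
A = np.array([
    [0,   0,   1, 1/2],   # to page 1
    [1/3, 0,   0, 0  ],   # to page 2
    [1/3, 1/2, 0, 1/2],   # to page 3
    [1/3, 1/2, 0, 0  ],   # to page 4
])

# Iteratively assign ranks: start from a uniform rank vector and apply A
# repeatedly until the vector settles (its stationary distribution).
v = np.full(4, 1 / 4)
for _ in range(100):
    v = A @ v

print(np.round(v, 3))  # page 1 ends up with the highest rank
```

Each multiplication by A redistributes every page's current rank along its out-links, which is exactly the "transfer of importance" described for node 2 above.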
Source: Essay Sauce, "Review and Summary of Cornell's 'PageRank' Algorithm". Available from: https://www.essaysauce.com/information-technology-essays/review-and-summary-of-cornells-pagerank-algorithm/ [Accessed 18-10-19].