Muszum, our music search engine, uses focused web crawlers to download relevant web pages. URL servers pass lists of URLs to the focused crawlers, which are trained to identify music-related webpages before actually downloading them. The crawled webpages are placed in a queue and the full HTML of each page is compressed; these are then stored in the repository, where each document is prefixed by its docID, length and URL respectively.
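As a rough sketch of that flow (assuming Python, with the relevance check left as a placeholder callable, and ignoring politeness rules, timeouts and error handling):

    import bz2
    import urllib.request

    def crawl(urls, classifier, repository):
        # `urls` come from the URL server; `classifier` stands in for the trained
        # check that decides whether a page is music related before downloading.
        for doc_id, url in enumerate(urls):
            if not classifier(url):
                continue
            with urllib.request.urlopen(url) as resp:
                html = resp.read()
            # Compress the full HTML and store it prefixed by docID, length and URL.
            repository.append((doc_id, len(html), url, bz2.compress(html)))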
The indexer fetches documents from the repository, uncompresses them and parses them. Each document is converted into a set of hits, which are then distributed into a set of barrels. The sorter takes the barrels, which are sorted by docID, and resorts them by wordID to generate the inverted index.
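A toy illustration of this forward-to-inverted conversion, assuming a hit is simply a (wordID, position) pair and a barrel is an in-memory list (real barrels are on-disk and partitioned by wordID range):

    from collections import defaultdict

    def parse_hits(text, lexicon):
        # Convert a parsed document into hits: (wordID, position) pairs,
        # growing the lexicon as new words are seen.
        hits = []
        for pos, word in enumerate(text.lower().split()):
            word_id = lexicon.setdefault(word, len(lexicon))
            hits.append((word_id, pos))
        return hits

    def sort_barrel(forward_barrel):
        # The forward barrel holds (docID, hits) ordered by docID; the sorter
        # regroups the same hits by wordID to produce the inverted index.
        inverted = defaultdict(list)
        for doc_id, hits in forward_barrel:
            for word_id, pos in hits:
                inverted[word_id].append((doc_id, pos))
        return dict(inverted)

    lexicon = {}
    barrel = [(0, parse_hits("live jazz album reviews", lexicon)),
              (1, parse_hits("jazz guitar lessons", lexicon))]
    inverted_index = sort_barrel(barrel)   # wordID -> doclist of (docID, position)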
Another important function of the indexer is to parse out all the links in every webpage; these are stored in an anchors file, which holds information such as each link's source, destination and anchor text. The URLresolver reads the anchors file and converts relative URLs into absolute URLs, which are in turn converted into docIDs; the anchor text is then placed into the forward index, associated with the docID it points to. The URLresolver also generates the linkage graph, which is used to compute the PageRanks for all our documents.
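A simplified sketch of the URLresolver step, assuming anchor records of (source docID, href, anchor text) and a lookup table from URL to docID (both our own illustrative structures):

    from urllib.parse import urljoin

    def resolve_anchors(anchors, base_urls, url_to_docid, links, anchor_index):
        # anchors: (source docID, href, anchor text) triples from the anchors file.
        for src_docid, href, text in anchors:
            absolute = urljoin(base_urls[src_docid], href)   # relative -> absolute
            dst_docid = url_to_docid.setdefault(absolute, len(url_to_docid))
            # The anchor text is attached to the document the link points to,
            # and the (source, destination) pair becomes an edge in the
            # linkage graph later used to compute PageRank.
            anchor_index.setdefault(dst_docid, []).append(text)
            links.append((src_docid, dst_docid))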
The doc index stores information about all our documents, which we will go into in more detail in the next section. Muszum utilises DumpLexicon to take the list created by the sorter, together with the lexicon produced by the indexer, and generate a new lexicon to be used by the searcher.
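A toy view of what that merged lexicon might hold, assuming each entry simply maps a word to its wordID and the offset of its doclist in the inverted barrels (names here are illustrative):

    def dump_lexicon(indexer_lexicon, doclist_offsets):
        # indexer_lexicon: word -> wordID, built while indexing.
        # doclist_offsets: wordID -> offset of that word's doclist after sorting.
        # The searcher only needs word -> (wordID, doclist offset).
        return {word: (wid, doclist_offsets.get(wid))
                for word, wid in indexer_lexicon.items()}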
The ranking algorithm we will use:
PR(u) = PR(x)/N(x) + PR(y)/N(y)
The PageRank of u is equal to the PageRank of x and of y, each divided by the number of outbound links within x and y respectively. We can never know the back links to all our crawled documents, only the forward links. To deal with dangling links, we decided to distribute the PageRank of these pages to the top documents with similar wordIDs rather than removing the webpage completely. In cases where we encounter rank sinks, we will use the E vector to distribute any redundant rank back into the system.
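A small power-iteration sketch of this scheme. The damping factor and uniform E vector follow the standard PageRank formulation; spreading a dangling page's rank over an externally supplied set of "similar" documents is our own simplification here:

    def pagerank(links, n_docs, similar_docs, damping=0.85, iters=30):
        # links: (source docID, destination docID) pairs -- forward links only.
        out_links = [[] for _ in range(n_docs)]
        for src, dst in links:
            out_links[src].append(dst)

        pr = [1.0 / n_docs] * n_docs
        e = 1.0 / n_docs                          # uniform E vector entry
        for _ in range(iters):
            nxt = [(1.0 - damping) * e] * n_docs  # rank redistributed via E
            for u in range(n_docs):
                # Dangling pages pass their rank to documents judged similar
                # (e.g. sharing wordIDs) instead of being dropped.
                targets = out_links[u] or similar_docs.get(u, range(n_docs))
                share = damping * pr[u] / len(targets)
                for v in targets:
                    nxt[v] += share
            pr = nxt
        return pr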
Our query handler is responsible for parsing the query: stemming, removing stop words and converting the remaining words into wordIDs. Utilising cache history, we can offer suggested searches and provide stored HTTP responses for fast retrieval, which reduces the number of requests made to the server. Using localisation, we can provide more relevant information to the user. Once the query is passed to the system, it seeks to the start of the doclist in the short barrel for every word until it finds documents matching all the search terms. The rank of those documents is then calculated for the query, and the results are sorted and returned.
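A simplified sketch of that query path, assuming a crude suffix-stripping stemmer (a real deployment would use something like Porter stemming) and the word-to-wordID lexicon and inverted-index shapes sketched earlier:

    STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to"}

    def stem(word):
        # Crude placeholder stemmer.
        for suffix in ("ing", "ers", "er", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def handle_query(query, lexicon, inverted_index, pageranks):
        # Parse: lowercase, drop stop words, stem, convert to wordIDs.
        terms = [stem(w) for w in query.lower().split() if w not in STOP_WORDS]
        word_ids = [lexicon[t] for t in terms if t in lexicon]
        if not word_ids:
            return []
        # Walk the doclist for every word and keep only documents that
        # match all the search terms.
        doc_sets = [{doc for doc, _ in inverted_index.get(w, [])} for w in word_ids]
        matches = set.intersection(*doc_sets)
        # Toy ranking: hit count for the query weighted by the document's PageRank.
        def rank(d):
            hits = sum(doc == d for w in word_ids for doc, _ in inverted_index.get(w, []))
            return hits * pageranks.get(d, 1.0)
        return sorted(matches, key=rank, reverse=True)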
Using a knowledge graph, we can begin to gain a better understanding of what our users like, which will help us provide them with relevant ads or suggestions. User profiles can help group users with similar searches in order to suggest songs of interest and news relating to artists or genres of music. Using user profiles, we can show only ads that will interest the user. Knowledge of how long a user spends on a specific webpage, and whether or not it was the last page they visited, will help us determine the relevance of a webpage, which we can then incorporate into our ranking algorithm.
3.
Focused Crawler: Our focused crawler aims to search only a subset of the web relating to a specific topic; in our case we target music-related terminology to ensure we build the best music search engine. Using priority-based focused crawling, we store and retrieve pages via a priority queue ordered according to the configuration the crawler is set to, which lets it distinguish between important and unimportant information.
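One way such a priority queue could look; the keyword-counting relevance score is only a stand-in for whatever scoring configuration the crawler is actually given:

    import heapq

    MUSIC_TERMS = ("music", "album", "artist", "band", "lyrics", "concert")

    def relevance(url, anchor_text=""):
        # Toy score: count music-related terms around the link; a trained
        # classifier or tuned configuration would replace this in practice.
        text = (url + " " + anchor_text).lower()
        return sum(term in text for term in MUSIC_TERMS)

    class CrawlFrontier:
        """Priority queue of URLs: the most relevant URL is crawled first."""

        def __init__(self):
            self._heap = []
            self._seen = set()

        def push(self, url, anchor_text=""):
            if url not in self._seen:
                self._seen.add(url)
                heapq.heappush(self._heap, (-relevance(url, anchor_text), url))

        def pop(self):
            return heapq.heappop(self._heap)[1] if self._heap else None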
Repository: The repository stores the complete HTML of every crawled webpage. Each of our webpages is compressed here; due to memory restrictions we decided to use bzip2 to compress our documents, as it produces a better compression ratio. Once we grow, we will look to improve compression speed by switching to either zlib or gzip. In the repository the documents are prefixed by docID, length and URL, like so:
docid | ecode | urllen | pagelength | url | page
No other data structures are needed to access it, which enhances data consistency and makes development easier.
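A sketch of how one such record could be packed and read back; the field widths, and treating ecode as a small status/encoding code, are our assumptions:

    import bz2
    import struct

    HEADER = struct.Struct("<q h h i")   # docid, ecode, urllen, pagelength

    def write_record(out, docid, ecode, url, html):
        # bzip2-compress the page and prefix it with the fixed header and URL,
        # matching the docid | ecode | urllen | pagelength | url | page layout.
        page = bz2.compress(html)
        url_bytes = url.encode("utf-8")
        out.write(HEADER.pack(docid, ecode, len(url_bytes), len(page)))
        out.write(url_bytes)
        out.write(page)

    def read_record(inp):
        header = inp.read(HEADER.size)
        if not header:
            return None                   # end of repository
        docid, ecode, urllen, pagelength = HEADER.unpack(header)
        url = inp.read(urllen).decode("utf-8")
        page = bz2.decompress(inp.read(pagelength))
        return docid, ecode, url, page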
Doc Index: The doc index holds information on each document. We use ISAM (Indexed Sequential Access Method) as a static index structure to achieve fixed index nodes. When an ISAM file is created, the index nodes are fixed, and their pointers do not change during inserts and deletes. ISAM also handles nodes that exceed capacity: new records are stored in overflow chains. The information stored includes the document status, a pointer into the repository, a document checksum and other statistics. Documents which have been crawled also contain a pointer into a variable-width file called docinfo, which stores the title and URL. Our URLresolver converts URLs into docIDs, which are then stored in the doc index.
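A rough sketch of a fixed-width doc index entry, with the field sizes chosen purely for illustration; the overflow pointer stands in for ISAM's overflow chains:

    import struct
    import zlib

    # Fixed-width entry: docid, status, pointer into the repository,
    # page checksum, pointer into an overflow chain (-1 if unused).
    ENTRY = struct.Struct("<q b q I q")

    def make_entry(docid, status, repo_offset, page, overflow=-1):
        # The checksum lets us detect a corrupted or changed page.
        return ENTRY.pack(docid, status, repo_offset, zlib.crc32(page), overflow)

    def read_entry(buf, i):
        # Entries are fixed width, so the i-th entry sits at a computable
        # offset -- this is what lets ISAM keep its index nodes static.
        return ENTRY.unpack_from(buf, i * ENTRY.size)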