Muszum, our music search engine, uses focused web crawlers to download relevant web pages. URL servers pass lists of URLs to the focused crawlers, which are trained to identify music-related webpages before actually downloading them. The crawled webpages are placed in a queue and the full HTML of each page is compressed; these are then stored in the repository, where each document is prefixed by its docID, length and URL respectively.
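As a rough sketch of that flow (assuming Python, with the relevance check left as a placeholder callable, and ignoring politeness rules, timeouts and error handling):

    import bz2
    import urllib.request

    def crawl(urls, classifier, repository):
        # `urls` come from the URL server; `classifier` stands in for the trained
        # check that decides whether a page is music related before downloading.
        for doc_id, url in enumerate(urls):
            if not classifier(url):
                continue
            with urllib.request.urlopen(url) as resp:
                html = resp.read()
            # Compress the full HTML and store it prefixed by docID, length and URL.
            repository.append((doc_id, len(html), url, bz2.compress(html)))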
The indexer fetches documents from the repository, uncompresses them and parses them. Each document is converted into a set of hits, which are then distributed into a set of barrels. The sorter takes the barrels, which are sorted by docID, and resorts them by wordID to generate the inverted index.
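A toy illustration of this forward-to-inverted conversion, assuming a hit is simply a (wordID, position) pair and a barrel is an in-memory list (real barrels are on-disk and partitioned by wordID range):

    from collections import defaultdict

    def parse_hits(text, lexicon):
        # Convert a parsed document into hits: (wordID, position) pairs,
        # growing the lexicon as new words are seen.
        hits = []
        for pos, word in enumerate(text.lower().split()):
            word_id = lexicon.setdefault(word, len(lexicon))
            hits.append((word_id, pos))
        return hits

    def sort_barrel(forward_barrel):
        # The forward barrel holds (docID, hits) ordered by docID; the sorter
        # regroups the same hits by wordID to produce the inverted index.
        inverted = defaultdict(list)
        for doc_id, hits in forward_barrel:
            for word_id, pos in hits:
                inverted[word_id].append((doc_id, pos))
        return dict(inverted)

    lexicon = {}
    barrel = [(0, parse_hits("live jazz album reviews", lexicon)),
              (1, parse_hits("jazz guitar lessons", lexicon))]
    inverted_index = sort_barrel(barrel)   # wordID -> doclist of (docID, position)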
Another important function of the indexer is to parse out all the links in every webpage; these are stored in an anchors file, which holds information such as each link's source, destination and anchor text. The URLresolver reads the anchors file and converts relative URLs into absolute URLs, which are in turn converted into docIDs; the anchor text is then placed into the forward index, associated with the docID it points to. The URLresolver also generates the linkage graph, which is used to compute the PageRanks for all our documents.
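A simplified sketch of the URLresolver step, assuming anchor records of (source docID, href, anchor text) and a lookup table from URL to docID (both our own illustrative structures):

    from urllib.parse import urljoin

    def resolve_anchors(anchors, base_urls, url_to_docid, links, anchor_index):
        # anchors: (source docID, href, anchor text) triples from the anchors file.
        for src_docid, href, text in anchors:
            absolute = urljoin(base_urls[src_docid], href)   # relative -> absolute
            dst_docid = url_to_docid.setdefault(absolute, len(url_to_docid))
            # The anchor text is attached to the document the link points to,
            # and the (source, destination) pair becomes an edge in the
            # linkage graph later used to compute PageRank.
            anchor_index.setdefault(dst_docid, []).append(text)
            links.append((src_docid, dst_docid))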
The doc index stores information about all our documents, which we will go into in more detail in the next section. Muszum utilises DumpLexicon to take the list created by the sorter, together with the lexicon produced by the indexer, and generate a new lexicon to be used by the searcher.
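A toy view of what that merged lexicon might hold, assuming each entry simply maps a word to its wordID and the offset of its doclist in the inverted barrels (names here are illustrative):

    def dump_lexicon(indexer_lexicon, doclist_offsets):
        # indexer_lexicon: word -> wordID, built while indexing.
        # doclist_offsets: wordID -> offset of that word's doclist after sorting.
        # The searcher only needs word -> (wordID, doclist offset).
        return {word: (wid, doclist_offsets.get(wid))
                for word, wid in indexer_lexicon.items()}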
The ranking algorithm we will use:
PR(u) = PR(x)/N(x) + PR(y)/N(y)
The PageRank of u is equal to the PageRank of x and of y, each divided by the number of outbound links within x and y respectively. We can never know the back links to all our crawled documents, only the forward links. To deal with dangling links, we decided to distribute the PageRank of these pages to the top documents with similar wordIDs rather than removing the webpage completely. In cases where we encounter rank sinks, we will use the E vector to distribute any redundant rank back into the system.
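A small power-iteration sketch of this scheme. The damping factor and uniform E vector follow the standard PageRank formulation; spreading a dangling page's rank over an externally supplied set of "similar" documents is our own simplification here:

    def pagerank(links, n_docs, similar_docs, damping=0.85, iters=30):
        # links: (source docID, destination docID) pairs -- forward links only.
        out_links = [[] for _ in range(n_docs)]
        for src, dst in links:
            out_links[src].append(dst)

        pr = [1.0 / n_docs] * n_docs
        e = 1.0 / n_docs                          # uniform E vector entry
        for _ in range(iters):
            nxt = [(1.0 - damping) * e] * n_docs  # rank redistributed via E
            for u in range(n_docs):
                # Dangling pages pass their rank to documents judged similar
                # (e.g. sharing wordIDs) instead of being dropped.
                targets = out_links[u] or similar_docs.get(u, range(n_docs))
                share = damping * pr[u] / len(targets)
                for v in targets:
                    nxt[v] += share
            pr = nxt
        return pr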
Our query handler is responsible for parsing the query: stemming, removing stop words and converting the remaining words into wordIDs. Utilising cache history, we can offer suggested searches and provide stored HTTP responses for fast retrieval, which reduces the number of requests made to the server. Using localisation, we can provide more relevant information to the user. Once the query is passed to the system, it seeks to the start of the doclist in the short barrel for every word until it finds documents matching all the search terms. The rank of those documents is then calculated for the query, and the results are sorted and returned.
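A simplified sketch of that query path, assuming a crude suffix-stripping stemmer (a real deployment would use something like Porter stemming) and the word-to-wordID lexicon and inverted-index shapes sketched earlier:

    STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to"}

    def stem(word):
        # Crude placeholder stemmer.
        for suffix in ("ing", "ers", "er", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def handle_query(query, lexicon, inverted_index, pageranks):
        # Parse: lowercase, drop stop words, stem, convert to wordIDs.
        terms = [stem(w) for w in query.lower().split() if w not in STOP_WORDS]
        word_ids = [lexicon[t] for t in terms if t in lexicon]
        if not word_ids:
            return []
        # Walk the doclist for every word and keep only documents that
        # match all the search terms.
        doc_sets = [{doc for doc, _ in inverted_index.get(w, [])} for w in word_ids]
        matches = set.intersection(*doc_sets)
        # Toy ranking: hit count for the query weighted by the document's PageRank.
        def rank(d):
            hits = sum(doc == d for w in word_ids for doc, _ in inverted_index.get(w, []))
            return hits * pageranks.get(d, 1.0)
        return sorted(matches, key=rank, reverse=True)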
Using a knowledge graph, we can begin to gain a better understanding of what our users like, which will help us provide them with relevant ads or suggestions. User profiles can help group users with similar searches in order to suggest songs of interest and news relating to artists or genres of music. Using user profiles, we can show only ads that will interest the user. Knowledge of how long a user spends on a specific webpage, and whether or not it was the last page they visited, will help us determine the relevance of a webpage, which we can then incorporate into our ranking algorithm.
3.
Focused Crawler: Our focused crawler aims to search only a subset of the web relating to a specific topic; in our case we target music-related terminology to ensure we build the best music search engine. Using priority-based focused crawling, we store and retrieve pages via a priority queue ordered according to the configuration the crawler is set to, which lets it distinguish between important and unimportant information.
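One way such a priority queue could look; the keyword-counting relevance score is only a stand-in for whatever scoring configuration the crawler is actually given:

    import heapq

    MUSIC_TERMS = ("music", "album", "artist", "band", "lyrics", "concert")

    def relevance(url, anchor_text=""):
        # Toy score: count music-related terms around the link; a trained
        # classifier or tuned configuration would replace this in practice.
        text = (url + " " + anchor_text).lower()
        return sum(term in text for term in MUSIC_TERMS)

    class CrawlFrontier:
        """Priority queue of URLs: the most relevant URL is crawled first."""

        def __init__(self):
            self._heap = []
            self._seen = set()

        def push(self, url, anchor_text=""):
            if url not in self._seen:
                self._seen.add(url)
                heapq.heappush(self._heap, (-relevance(url, anchor_text), url))

        def pop(self):
            return heapq.heappop(self._heap)[1] if self._heap else None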
Repository: The repository stores the complete HTML of every crawled webpage. Each of our webpages is compressed here; due to memory restrictions we decided to use bzip2 to compress our documents, as it produces a better compression ratio. Once we grow, we will look to improve compression speed by switching to either zlib or gzip. In the repository the documents are prefixed by docID, length and URL, like so:
docid | ecode | urllen | pagelength | url | page
No other data structures are needed to access it, which enhances data consistency and makes development easier.
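A sketch of how one such record could be packed and read back; the field widths, and treating ecode as a small status/encoding code, are our assumptions:

    import bz2
    import struct

    HEADER = struct.Struct("<q h h i")   # docid, ecode, urllen, pagelength

    def write_record(out, docid, ecode, url, html):
        # bzip2-compress the page and prefix it with the fixed header and URL,
        # matching the docid | ecode | urllen | pagelength | url | page layout.
        page = bz2.compress(html)
        url_bytes = url.encode("utf-8")
        out.write(HEADER.pack(docid, ecode, len(url_bytes), len(page)))
        out.write(url_bytes)
        out.write(page)

    def read_record(inp):
        header = inp.read(HEADER.size)
        if not header:
            return None                   # end of repository
        docid, ecode, urllen, pagelength = HEADER.unpack(header)
        url = inp.read(urllen).decode("utf-8")
        page = bz2.decompress(inp.read(pagelength))
        return docid, ecode, url, page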
Doc Index: The doc index holds information on each document. We use ISAM (Indexed Sequential Access Method) as a static index structure to achieve fixed index nodes. When an ISAM file is created, the index nodes are fixed, and their pointers do not change during inserts and deletes. ISAM also handles nodes that exceed capacity: new records are stored in overflow chains. The information stored includes the document status, a pointer into the repository, a document checksum and other statistics. Documents which have been crawled also contain a pointer into a variable-width file called docinfo, which stores the title and URL. Our URLresolver converts URLs into docIDs, which are then stored in the doc index.
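A rough sketch of a fixed-width doc index entry, with the field sizes chosen purely for illustration; the overflow pointer stands in for ISAM's overflow chains:

    import struct
    import zlib

    # Fixed-width entry: docid, status, pointer into the repository,
    # page checksum, pointer into an overflow chain (-1 if unused).
    ENTRY = struct.Struct("<q b q I q")

    def make_entry(docid, status, repo_offset, page, overflow=-1):
        # The checksum lets us detect a corrupted or changed page.
        return ENTRY.pack(docid, status, repo_offset, zlib.crc32(page), overflow)

    def read_entry(buf, i):
        # Entries are fixed width, so the i-th entry sits at a computable
        # offset -- this is what lets ISAM keep its index nodes static.
        return ENTRY.unpack_from(buf, i * ENTRY.size)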