Generalize Recommendation System Using Web Graph Mining

Abstract: A large and rapidly growing amount of information is extracted from the web every day. Queries submitted to search engines are generally written in natural language and are often only one or two words long. Search engines are unable to interpret natural language, so it is difficult to extract the information that matches a user’s interest. This is where recommendation techniques come into the picture. Many recommendation techniques are available nowadays, each with its own advantages and disadvantages. Recommendation techniques are designed to support various types of data sources, such as text, images, audio and video. An easy way to deal with all these types of data sources is to model them as a graph and then apply recommendation algorithms to it. The proposed system uses several algorithms to predict the user’s interest and then combines their outcomes to provide efficient results. Constructing a graph from the data sources makes it possible to handle large amounts of data easily.
Keywords: Recommendation; web mining; web graph; data warehouse; learning system; recommender.
1. Introduction
Web mining is the extraction of interesting patterns from web data. Data available on the web is generally in the form of content, structure or usage, which leads to content mining, structure mining and usage mining. Content mining is a process of text extraction that mainly focuses on unstructured data. Web structure mining extracts data from hyperlinks and yields only a summary of the web pages. Web usage mining extracts data from log files in the form of patterns. However, the data available on the web is huge, and extracting interesting information from it is very difficult because the data is heterogeneous. Various data sources, platforms, tools and techniques are used to implement this data. Thus there is a need for recommendation techniques that solve these compatibility problems. Generally, recommendation is carried out by submitting queries to a search engine, and there are some problems related to queries. Queries are sometimes only one or two words long, which makes it difficult to find semantically relevant data. The results returned for these queries are based on the ranking given to the pages and may not contain data related to the user’s interest. In addition, search engines may not take personalization into account; that is, they do not use historical data (the data previously accessed by the user, which represents the user’s interest) to return relevant results. Different algorithms are generally used to solve these problems, but there is a need for one generalized method that solves all of them. Designing such a generalized method is very difficult because the data available on the web is in heterogeneous formats.
2. Literature Survey
This section presents an overview of different recommendation techniques and the algorithms related to them. Collaborative filtering [1]-[3] is a method that aggregates ratings or preferences on items and makes recommendations from this historical data. The method also shares rating information between users, which helps other users find data of interest to them. For example, if users A and B give similar ratings to item I, or behave similarly (purchasing, watching movies, etc.), they may share the same area of interest, so the system can suggest to user A items previously referred to by user B, and vice versa. Two types of algorithms are used for collaborative filtering: memory-based and model-based. Memory-based algorithms operate on all the ratings given by the users in the database. They are further classified into two types: user-based and item-based. User-based algorithms consider users who have the same interests. Item-based algorithms calculate the similarity between pairs of items and group the items accordingly. For recommendation, this approach uses a user-item rating matrix, but because the data available on the web is huge and diverse, collecting this matrix becomes difficult. Collaborative filtering algorithms also often perform poorly as the data size increases. These two challenges limit the use of this method. Image recommendation [4]-[5] is another interesting recommendation application on the web. This technique focuses on recommending interesting images to users based on their preferences. Normally, these systems first ask users to rate some images as liked or disliked, and then recommend images based on the users’ tastes. Here the quality of the recommendations depends on the number of dimensions used.
Accuracy of recommendations alone is not sufficient for predicting the user’s interest; this is where personalization comes into the picture. As image data on the web grows tremendously, mining images for recommendation is becoming difficult. ContextSeer [6] is a method developed to handle this huge amount of data. It uses tags and canonical images, which act as supplementary information for recommendation. It uses re-ranking and the cannoG algorithm to improve the quality of recommendation and to find canonical images without clustering. The wc-tf-idf method is used for feature selection. Content-based filtering [7] selects items depending on the relation between the items and the user’s preferences, and is based on the user’s previous rating preferences. Given a set of items that could be recommended to a user, these items are compared with the items the user preferred previously, and the best-suited items are recommended. To provide the best results, user profiles are created. A user profile contains information about the items preferred by that user. Sometimes item profiles are also created, which contain information about the ratings and features of each item. The data for creating user and item profiles is collected by taking feedback from users on different items. This approach does not give good recommendations if the rated data or feedback does not contain enough information about an item. It also fails when the number of items increases, because the number of items in the same category then increases, which decreases the effectiveness of the system.
3. Recommendation System Architecture
The architecture of the recommendation system and its elements are explained below. In this system, data is extracted from the web and stored in a data warehouse. Pre-processing is then carried out, which includes removing duplicates, special symbols, etc., and the data is then sent to the recommender engine for recommendation.
3.1. Web Usage Warehouse
It is a central repository of data created by integrating data from multiple data sources. The warehouse stores current as well as historical data. It also maintains a copy of the information from the source transaction systems. It integrates data from multiple systems, which gives a centralized view of the data.
3.2. Recommender
It collects the data from the web and stores the information in bipartite records.
3.3. Recommendation engine
It takes data sets as input and generates a recommendation set for the user by matching the user’s current activity against the discovered patterns. It is an online process, hence its efficiency and scalability are important factors.
3.4. Learning module
It periodically analyzes all recorded data to identify patterns for generating recommendations. It also uses user feedback to improve the quality of recommendations.
4. Proposed System
The proposed system works in four stages: data extraction, data pre-processing, graph construction and performance analysis. Fig. 2 shows the working of the proposed system.
4.1. Data extraction
The first step in the proposed system is data set extraction. Click-through data records the activities of web users; it captures information related to the interests of users and the semantic relationships between users, queries and clicked web documents. A dataset specifies queries and the metadata related to them. Every line of click-through data contains a user ID, a query issued by the user, a URL on which the user clicked, the rank of that URL, and the time at which the query was submitted for search.
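The record layout described above can be sketched as a small parser. This is a minimal illustration only: the tab separator, the field order, the `ClickRecord` and `parse_click_line` names and the sample line are all assumptions, not details given by the dataset description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClickRecord:
    """One line of click-through data with the five fields listed in the text."""
    user_id: str
    query: str
    url: str
    rank: int
    timestamp: str


def parse_click_line(line: str) -> ClickRecord:
    # Assumption: fields are tab-separated and appear in the order
    # user ID, query, clicked URL, rank, submission time.
    user_id, query, url, rank, timestamp = line.rstrip("\n").split("\t")
    # Normalise the query so that equal queries compare equal later on.
    return ClickRecord(user_id, query.strip().lower(), url, int(rank), timestamp)


# Hypothetical sample line, made up for illustration only.
sample = "u42\tWeb Graph Mining\thttp://example.com/about.htm\t3\t2012-05-01 10:15:00"
record = parse_click_line(sample)
```

Keeping each record immutable makes it safe to reuse the same objects in the later pre-processing and graph construction stages.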
4.2. Data Pre-Processing
The data set is the raw data recorded by the search engine and contains a lot of noise, which affects the effectiveness of our query suggestion algorithm. This module keeps frequent, well-formatted data. Pre-processing therefore needs to remove the noisy data from the dataset. Examples of noise removal include stemming, tag elimination, word splitting and stop-word removal. Stemming creates a term that captures the common meaning behind related words, e.g. computation, compute and computer. Tag elimination removes unnecessary tagging and untagging operations from automatically generated programs. Word splitting splits a paragraph into words, which are then used by the subsequent pre-processing methods. Stop-word removal filters out the articles, prepositions, conjunctions and pronouns that occur in the document; such words have no value for retrieval purposes.
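The four pre-processing steps can be sketched roughly as follows. The stop-word list, the suffix list and the treatment of tags as HTML-style markup are illustrative assumptions; the naive suffix stripping stands in for a real stemmer and is not the method used by the proposed system.

```python
import re

# Assumption: a tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "to", "is", "it", "he", "she", "they"}

# Assumption: naive suffix stripping, tried in this order, stands in for a real stemmer.
SUFFIXES = ("ation", "ing", "er", "es", "ed", "e", "s")


def strip_tags(text):
    """Tag elimination, assuming tags are HTML-style markup such as <b>...</b>."""
    return re.sub(r"<[^>]+>", " ", text)


def split_words(text):
    """Split text into lower-case alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(word):
    """Strip the first matching suffix, keeping at least four characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tag elimination, word splitting, stop-word removal, then stemming."""
    words = split_words(strip_tags(text))
    return [stem(word) for word in words if word not in STOP_WORDS]


terms = preprocess("<b>The computation of a computer</b>")
```

With these rules, computation, compute and computer all reduce to the same term, which is the behaviour the stemming step is meant to provide.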
Fig. 2. Working of proposed system
4.3. Graph Construction
Pre-processed data is given to the graph [9]-[10] construction algorithm. A graph handles this data easily and more effectively because it gives the logical relationships between objects. Modelling the web as a graph makes it easy to manipulate this huge amount of data. Modelling the data as a graph requires connecting nodes with edges; a web graph connects its nodes with directed edges. Web pages and hyperlinks are the components of the graph. Web pages are the documents on the web that act as resources for search engines. Each web page contains textual content and hyperlinks that connect it to other web pages. Each web page has a unique URL and is accessed through a web browser. A hyperlink is a reference to another place in the same document or to another web page, and is used to navigate from one web page to another. Here, the web pages act as the nodes of the graph and the hyperlinks act as the edges between the nodes. Fig. 3 shows an example of a web graph in which web pages such as privacy.htm, people.aspx, about.htm and productes.aspx are connected by directed hyperlinks. Through hyperlinks, a user may navigate from one web page to another. Web pages act as information resources, which are in heterogeneous formats.
Fig. 3. Example of web graph
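A directed web graph of this kind can be sketched with adjacency sets. The `WebGraph` class is a hypothetical helper, and the specific links between the pages are invented for illustration, since the text names the pages but does not list the edges.

```python
from collections import defaultdict


class WebGraph:
    """Directed graph whose nodes are web pages and whose edges are hyperlinks."""

    def __init__(self):
        self.out_links = defaultdict(set)  # page -> pages it links to
        self.in_links = defaultdict(set)   # page -> pages linking to it

    def add_link(self, source, target):
        self.out_links[source].add(target)
        self.in_links[target].add(source)
        # Touch the opposite maps so every page is registered as a node,
        # even if it has no outgoing or no incoming links.
        self.out_links[target]
        self.in_links[source]

    def pages(self):
        return set(self.out_links)


graph = WebGraph()
# Assumption: these hyperlinks are hypothetical; only the page names come from the text.
for source, target in [
    ("about.htm", "people.aspx"),
    ("about.htm", "privacy.htm"),
    ("people.aspx", "productes.aspx"),
    ("productes.aspx", "about.htm"),
]:
    graph.add_link(source, target)
```

Storing both outgoing and incoming links costs extra memory but lets later algorithms walk the graph in either direction without rescanning every edge.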
4.4. Apply algorithm
After the graphs are constructed, algorithms are applied to them to find the top n recommendations. The item-to-item collaborative filtering algorithm and the Pearson correlation based collaborative filtering algorithm are applied to find similarities between items and between users, respectively. The slope one algorithm is used to find the ratings of unrated items.
4.5. Top n recommendation
After the algorithms are applied, the recommendation engine generates recommendation sets, which are recommended to the user.
4.6. Learning Module
This module analyses the feedback given by the user, collects it and passes it to the recommendation engine for processing. The feedback contains fields such as good, bad and average, relating to the website’s working, the results it generates, etc.
The proposed system performs the following functions.
4.7. User Profile
When a user first logs in to the site, he is presented with some topics and rates them according to his interests. The user’s interest is predicted from these ratings or preferences, and recommendations are given to the user accordingly.
4.8. Rating Prediction
In this step, the rating of an unrated item is evaluated using the slope one collaborative filtering algorithm. The rating value is estimated from the available rating values of other users.
4.9. Graph construction
After the data is extracted from the data warehouse, the first task is to create the graph of the available data. The graph increases the efficiency of the algorithm because, instead of processing the whole data set, graphs and their sub-graphs are created. Applying the subsequent algorithms to a sub-graph takes less execution time.
4.10. Recommendation
The recommendation technique predicts the user’s interest according to the preferences and ratings given by the user. It is estimated using a collaborative filtering algorithm. There are different types of collaborative filtering algorithms for determining relationships between items, between items and persons, etc.
4.11. Image recommendation
Another interesting recommendation application on the web is image recommendation, which focuses on recommending interesting images to web users based on their preferences. Normally, these systems first ask users to rate some images as liked or disliked, and then recommend images to the users based on their tastes.
4.12. Personalization feature
Two algorithms are used for this module: collaborative filtering and content-based filtering. Collaborative filtering has three methods. It gives the relationship between users and items as well as the relationship between items. Content-based filtering provides the personalization feature; it gives the relationship between users.
5. Collaborative Filtering Algorithm
This technique is divided into two parts, as follows.
5.1. Model-based algorithm
Memory-based recommendation systems are not always as fast and scalable as required, especially in actual systems that generate real-time recommendations from very large datasets. To achieve these goals, model-based recommendation systems are used. A model-based recommendation system builds a model from highly rated data. In other words, it extracts some information from the dataset and uses that as a ‘model’ to make recommendations without having to use the complete dataset every time. This approach potentially offers benefits such as speed and scalability.
5.1.1. Item-based collaborative filtering
Item-based collaborative filtering is a model-based algorithm for making recommendations. In this algorithm, the similarities between different items in the dataset are calculated using vector-based similarity measures, and these similarity values are then used to predict ratings for user-item pairs that are not present in the dataset.
Input: Ratings given by different users to items
Output: Similarity between two items
For each item I1 in the product catalogue
    For each customer C who purchased I1
        For each item I2 purchased by customer C
            Record that a customer purchased I1 and I2
    For each item I2
        Compute the similarity between I1 and I2
Similarity Measurements
As a vector-based similarity, this formulation views two items and their ratings as vectors, and defines the similarity between them as the cosine of the angle between these vectors. It is shown in equation (1).
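\[ \mathrm{sim}(I_1, I_2) = \cos(\vec{I_1}, \vec{I_2}) = \frac{\vec{I_1} \cdot \vec{I_2}}{\lVert \vec{I_1} \rVert \, \lVert \vec{I_2} \rVert} \tag{1} \]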
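The vector-based similarity of equation (1) can be sketched as below. The function names and the dictionary layout are assumptions made for illustration; the sketch also assumes that each item's rating vector lists the same users in the same order, and it returns 0.0 for an all-zero vector as a convention rather than as part of the method.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an unrated (all-zero) item is similar to nothing
    return dot / (norm_a * norm_b)


def item_similarities(ratings):
    """ratings maps item -> list of ratings by U1, U2, ...; returns pairwise similarities."""
    items = sorted(ratings)
    return {
        (first, second): cosine_similarity(ratings[first], ratings[second])
        for index, first in enumerate(items)
        for second in items[index + 1:]
    }


# Ratings by three users for two items.
similarities = item_similarities({"I1": [5, 4, 3], "I2": [1, 2, 3]})
```

For the vectors (5, 4, 3) and (1, 2, 3) the dot product is 22 and the norms are the square roots of 50 and 14, so the similarity is about 0.83.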
5.2. Content based filtering
This approach derives the relationships between users. The active user’s preferences are compared with those of other users. The similarity between users is calculated as follows.
Input: Ratings given by different users to different items
Output: Similarity between two users
For each user U1 in the user catalogue
    For each product P purchased by U1
        For each user U2 who purchased product P
            Record that product P was purchased by U1 and U2
    For each user U2
        Compute the similarity between U1 and U2
Similarity Measurements
The correlation between different users is needed to find out how similar they are. The correlation factor takes values from -1 to 1. Here, 1 means that the users have rated the items in the same way and -1 means that they are not similar. The vector similarity treats two users as vectors in n-dimensional space, where n is the number of items in the database. For any two vectors, the system compares the angle between them. If the two vectors generally point in the same direction, they get a positive similarity; if they point in opposite directions, they get a negative similarity. Taking the cosine of the angle between the two vectors gives a value from -1 to 1.
Pearson correlation coefficient
The Pearson correlation coefficient is the basic correlation formula for samples, adapted for rating information. It measures how much two users vary together from their normal votes, that is, the direction and magnitude of each vote in comparison with the user’s voting average. If the users vary in the same way on the items they have rated in common, they get a positive correlation; otherwise, they get a negative correlation. It is calculated using equation (2). Here x and y are the rating values given by the two users and n is the number of items rated by both users.
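\[ r = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left[ n \sum x^2 - \left( \sum x \right)^2 \right] \left[ n \sum y^2 - \left( \sum y \right)^2 \right]}} \tag{2} \]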
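The Pearson correlation coefficient can be sketched in its sum-based form, with x, y and n as described above. The function name is an assumption, and returning 0.0 when one user's ratings are all equal is a convention chosen here, since the coefficient is undefined in that case.

```python
import math


def pearson(x, y):
    """Pearson correlation of two users' ratings over their n commonly rated items."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("x and y must be non-empty and of equal length")
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        return 0.0  # assumption: constant ratings carry no correlation information
    return numerator / denominator


# Two users who rate four items in exactly opposite order.
opposite = pearson([5, 4, 3, 2], [1, 2, 3, 4])
```

The first user's ratings fall by one per item while the second user's rise by one, so the coefficient for this pair is -1.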
5.3. Slope one rating prediction
This is a rating-based collaborative filtering technique. It finds the rating of an unrated item by using the ratings of other users, considering two items and two users at a time. As a simple example, suppose there are two users U1 and U2 and two items I1 and I2. If user U1 gave rating 1 to item I1 and rating 2 to item I2, and user U2 gave rating 3 to I1, then the slope one algorithm predicts that U2 would give item I2 a rating of 3 + (2 - 1) = 4.
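The worked example can be reproduced with a short sketch of slope one prediction. This version uses the weighted variant, in which each item's average rating difference is weighted by the number of users who rated both items; that choice, the function name and the nested-dictionary layout are assumptions, not details stated in the text.

```python
def slope_one_predict(ratings, user, target):
    """Predict the rating of `target` for `user`.

    ratings maps user -> {item: rating}. Returns None when no other user
    has rated both `target` and an item that `user` has rated.
    """
    weighted_sum = 0.0
    total_weight = 0
    for item, user_rating in ratings[user].items():
        if item == target:
            continue
        # Rating differences (target - item) from other users who rated both items.
        differences = [
            other_ratings[target] - other_ratings[item]
            for other, other_ratings in ratings.items()
            if other != user and target in other_ratings and item in other_ratings
        ]
        if not differences:
            continue
        average_difference = sum(differences) / len(differences)
        weighted_sum += (user_rating + average_difference) * len(differences)
        total_weight += len(differences)
    if total_weight == 0:
        return None
    return weighted_sum / total_weight


# The example from the text: U1 rated I1 = 1 and I2 = 2, U2 rated I1 = 3.
ratings = {"U1": {"I1": 1, "I2": 2}, "U2": {"I1": 3}}
predicted = slope_one_predict(ratings, "U2", "I2")  # 3 + (2 - 1) = 4.0
```

With only one co-rating pair the weighting has no effect, so the result matches the hand calculation above.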
6. Method of experiment and results
The proposed system has been implemented in Java in the Eclipse environment. It can be executed on Windows or Linux platforms. The following results were obtained by applying the Pearson correlation coefficient algorithm and the item-to-item collaborative filtering algorithm.
6.1. Pearson Correlation Coefficient
Here, two users are compared with each other in five different cases of ratings. The Pearson correlation coefficient formula is used to find the relationship between them. The value of the correlation factor indicates how similar the users are to each other. A positive value means that they rate the items in the same way, and a negative value means that they rate the items in opposite ways. Here, U represents a user and I represents an item. Table 1 presents the results for the five cases.
Table 1: Pearson correlation factor
Case    User  I1  I2  I3  I4  PCC
Case 1  U1    5   4   3   2   1.0
        U2    5   2   3   4
Case 2  U1    5   4   3   2   -1.0
        U2    1   2   3   4
Case 3  U1    5   4   3   1   -0.95
        U2    1   1   2   5
Case 4  U1    4   3   4   4   0.87
        U2    4   2   3   4
Case 5  U1    5   4   3   1   0.66
        U2    3   5   1   1
6.2. Cosine Similarity
Here, items are compared with each other. The relationship between the items is found using the cosine-based similarity formula. The threshold value is set to 0.80. Items whose CS value is above the threshold are considered similar to each other, and items whose CS value is below it are not. Table 2 presents the results for cosine similarity.
Table 2: Cosine similarity
Case    Item  U1  U2  U3  CS
Case 1  I1    5   4   3   1.0
        I2    5   4   3
Case 2  I1    5   4   3   0.83
        I2    1   2   3
Case 3  I1    1   1   3   0.76
        I2    5   4   3
Case 4  I1    1   2   1   0.97
        I2    5   4   3
Case 5  I1    5   5   5   1.0
        I2    4   4   4
7. Conclusion
The recommendation system gives results according to the user’s interest. The system first finds the ratings of unrated items and then applies the algorithms to them, and thus gives better results. The item-based collaborative filtering algorithm determines the relationships between items, and the content-based filtering algorithm determines the relationships between users. The results of these two algorithms are used to determine the interest of the user, and items are recommended to the user accordingly.
REFERENCES
[1] G. Linden, B. Smith, and J. York, ‘Amazon.com Recommendations: Item-to-Item Collaborative Filtering,’ IEEE Internet Computing, vol. 7, no. 1, pp. 76-80, Jan./Feb. 2003.
[2] J.S. Breese, D. Heckerman, and C. Kadie, ‘Empirical Analysis of Predictive Algorithms for Collaborative Filtering,’ Proc. 14th Conf. Uncertainty in Artificial Intelligence (UAI), 1998.
[3] A.S. Das, M. Datar, A. Garg, and S. Rajaram, ‘Google News Personalization: Scalable Online Collaborative Filtering,’ WWW ’07: Proc. 16th Int’l Conf. World Wide Web, pp. 271-280, 2007.
[4] L. von Ahn and L. Dabbish, ‘Labeling Images with a Computer Game,’ CHI ’04: Proc. SIGCHI Conf. Human Factors in Computing Systems, pp. 319-326, 2004.
[5] G. Pass, A. Chowdhury, and C. Torgeson, ‘A Picture of Search,’ Proc. First Int’l Conf. Scalable Information Systems, Hong Kong, June 2006.
[6] Y.-H. Yang, P.-T. Wu, C.-W. Lee, K.-H. Lin, W.H. Hsu, and H. Chen, ‘ContextSeer: Context Search and Recommendation at Query Time for Shared Consumer Photos,’ Proc. 16th ACM Int’l Conf. Multimedia, pp. 199-208, 2008.
[7] R. van Meteren and M. van Someren, ‘Using Content-Based Filtering for Recommendation,’ NetlinQ Group, Gerard Brandtstraat, Amsterdam, 2010.
[8] H. Ma, I. King, and M.R. Lyu, ‘Mining Web Graphs for Recommendations,’ IEEE Trans. Knowledge and Data Engineering, 2012.
[9] D. Nemirovsky, ‘Web Graph and PageRank Algorithm,’ Department of Technology of Programming, Faculty of Applied Mathematics and Control Processes, St. Petersburg State University, Russia, 2009.
[10] M. Deshpande and G. Karypis, ‘Item-Based Top-n Recommendation,’ ACM Trans. Information Systems, vol. 22, no. 1, pp. 143-177, 2004.