
Essay: Automated Web Usage Mining using Hybrid K-Nearest Neighbor and Support Vector Machine

Essay details and download:

  • Subject area(s): Sample essays
  • Reading time: 10 minutes
  • Price: Free download
  • Published: 1 April 2019*
  • Last Modified: 23 July 2024
  • File format: Text
  • Words: 2,765 (approx)
  • Number of pages: 12 (approx)




Automated Web Usage Mining with Hybridization of K-Nearest Neighbor and Support Vector Machine

Jyoti Gupta, Ritu ma’am

Abstract: Nowadays, the World Wide Web has grown enormously, and with it the expectations placed on it. As the internet grows day by day, the number of online users also rises. Extracting interesting information and knowledge from such huge data demands new logic and new methods. Users spend most of their time on the internet, and each user's behaviour differs from the others, which usually makes finding the right information or product on a site a time-consuming task. This paper studies automatic web usage data mining and a recommendation system based on the behaviour of the current user, captured through his/her clickstream data. The objective of this paper is to address the problem that classification becomes difficult when the data grows out of bound, and to improve prediction accuracy by hybridizing the KNN data mining algorithm with an SVM for better results.

Keywords: web usage mining, pre-processing, KNN, SVM, log files.


Introduction:

Data Mining: Data mining is a process, belonging to computer science, for examining large data sets in search of patterns; here, "large data sets" stands for Big Data. Data mining is an automatic process used to extract meaningful information from data storage, so that this information can be used for various purposes. Extraction of meaningful data is performed by pattern matching and is achieved through cluster analysis, anomaly analysis, and dependency analysis; spatial indices are used to support all of the above operations. A matched pattern is a brief summary of the data stored in the data warehouse, and these patterns are used for future prediction and by decision-making systems to take the right decisions.

For example, in machine learning systems this acquired information can be used for predictive analysis. As another example, data mining can retrieve groups of correlated data from a database, which can then be used for predictive analysis in the near future. Data analysis, data collection, and the compilation of data are not part of data mining itself, but they are included in the process of Knowledge Discovery in Databases (KDD). This iterative process consists of the following steps:

• Data cleaning: It is a phase in which noisy and unwanted data are removed from the collection.

• Data integration: At this stage, multiple data sources, mostly heterogeneous, may be combined in a common source.

• Data selection: At this step, the data relevant to the analysis is decided on and extracted from the data collection.

• Data transformation: Data transformation is also called Data Consolidation. It is the phase in which the selected data is transformed into forms appropriate for the mining procedure.

• Data mining: It is the essential step in which intelligent techniques are applied to extract potentially useful patterns.

• Pattern evaluation: In this step, the truly interesting patterns representing knowledge are identified based on the given measures.

• Knowledge representation: It is the final phase of KDD, in which the discovered knowledge is visually represented to the user. This essential step uses visualization techniques to help users understand and interpret the data mining results.
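The KDD steps above can be sketched as a chain of small functions. This is an illustrative toy, not a real mining system: the records, field names and the trivial "pattern" (pages visited per user) are all our own assumptions.

```python
# A minimal sketch of the KDD pipeline: cleaning -> selection ->
# transformation -> mining. Records and helpers are illustrative.

raw_records = [
    {"page": "/home", "user": "u1", "bytes": 1024},
    {"page": None,    "user": "u1", "bytes": 0},      # noisy record
    {"page": "/shop", "user": "u2", "bytes": 2048},
]

def clean(records):
    """Data cleaning: drop noisy/incomplete records."""
    return [r for r in records if r["page"] is not None]

def select(records, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: consolidate into a form suited to mining."""
    return [(r["user"], r["page"]) for r in records]

def mine(pairs):
    """Data mining: a trivial 'pattern' -- pages visited per user."""
    patterns = {}
    for user, page in pairs:
        patterns.setdefault(user, []).append(page)
    return patterns

patterns = mine(transform(select(clean(raw_records), ["user", "page"])))
print(patterns)  # {'u1': ['/home'], 'u2': ['/shop']}
```

The pattern evaluation and knowledge representation steps would then operate on the `patterns` dictionary.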

WEB MINING IN BRIEF

Based on the different kinds of significance and the different methods used to obtain information, web mining can be divided into three major parts: Web Content mining, Web Structure mining and Web Usage mining. Web content mining can be defined as the automatic search and retrieval of information and resources available from millions of sites and online databases through search engines/web spiders. Web structure mining operates on the Web's hyperlink structure.

A. Web Usage Mining: –

Web usage mining is an application of data mining techniques to discover useful patterns from web data, in order to understand and effectively serve the needs of Web-based applications. It can be described as the analysis and discovery of user access patterns.

B. How to perform Web Usage mining: –

Web Usage Mining consists of three phases: pre-processing, pattern discovery and pattern analysis. In Web Usage Mining, all the secondary data derived from the interactions of users while surfing the web is discovered. It gathers useful data thoroughly, filters out unrelated usage data, and helps to establish the web site. Web usage mining can also be classified according to the kinds of usage data examined. In our context, the usage data is web log data, which contains information about user navigation. As our work focuses on web usage mining, it is the application of data mining techniques to discover usage patterns from web data. Data is generally collected from users' interactions with the web, such as web/proxy server logs. Usage mining tools discover and predict user behaviour; with these, the designer gets ideas on how to improve the web site, attract visitors, or give regular users a personalized and adaptive service. The major problem with Web Usage Mining is the volume of data it deals with. With the growth of the internet, Web data has become huge in nature, and a lot of transactions take place every second. It is not only the volume of data: the data is also not completely structured. It is in a semi-structured format, so it needs to be pre-processed before the actual extraction of the required information.

B.1 Data Collection: –

The web log files on the web server are the prime source of data for Web Usage Mining. Data can be collected from the following three locations.

• Web Servers

• Web proxy servers

• Client browsers

Web servers can be configured to write different fields into the log file in different formats. The most common fields used by web servers are the following: IP Address, Login Name, User Name, Request Type, Status, Bytes transferred, Referrer, Visiting Path, Path traversed, Timestamp, Page last visited, Success rate, User agent, URL, etc.
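Many of the fields listed above can be pulled out of one log line with a regular expression. The sketch below assumes the widely used Combined Log Format; the group names and the sample line are ours, not from the text.

```python
import re

# A sketch of extracting common fields from one web server log line,
# assuming the Combined Log Format (Apache/Nginx style).

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

line = ('192.168.0.1 - alice [10/Oct/2023:13:55:36 +0000] '
        '"GET /index.html HTTP/1.1" 200 2326 '
        '"http://example.com/start" "Mozilla/5.0"')

entry = LOG_PATTERN.match(line).groupdict()
print(entry["ip"], entry["status"], entry["request"])
```

Each parsed `entry` dictionary then becomes one record for the pre-processing phase described next.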

B.2 Data pre-processing: –

The data collected from web server logs is often incomplete and creates uncertainty. In the web usage mining process, pre-processing is an important phase for cleaning, correcting and completing the input data so that knowledge can be mined effectively. The pre-processing phase takes about 80% of the time of the whole process.

1. Data Cleaning: In the process of data cleaning, irrelevant items are removed, such as requests for images, graphics, multimedia, etc. This process reduces the size of the file to a great extent.

2. User Identification and Session Generation: After data cleaning, unique users must be identified. User identification is done by:

I. User login information.

II. Cookies, which identify the visitors of a web site by storing a unique ID.

III. Treating the same IP address with a different user agent as a new user.

The user agents are said to be different if they represent different web browsers or operating systems in terms of type and version. Identification of the user sessions is very important, because it largely affects the quality of the pattern discovery result. A session is the time duration spent on the web pages; it is identified using the timestamp details of the web pages.

3. Data Conversion: It is the conversion of the log file data into the format needed by the mining algorithms.
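The three pre-processing steps above can be sketched as follows. The sample entries, the list of "irrelevant" extensions, and the 30-minute session timeout are our own assumptions (the timeout is a common heuristic, not prescribed by the text); users are identified by the IP + user agent pair, as in rule III.

```python
from datetime import datetime, timedelta

# A sketch of pre-processing: cleaning, user identification
# (IP + user agent), and timestamp-based session generation.

entries = [
    {"ip": "1.1.1.1", "agent": "Firefox", "url": "/home",     "time": "2023-10-10 10:00:00"},
    {"ip": "1.1.1.1", "agent": "Firefox", "url": "/logo.png", "time": "2023-10-10 10:00:01"},
    {"ip": "1.1.1.1", "agent": "Firefox", "url": "/shop",     "time": "2023-10-10 10:05:00"},
    {"ip": "1.1.1.1", "agent": "Chrome",  "url": "/home",     "time": "2023-10-10 10:06:00"},
    {"ip": "1.1.1.1", "agent": "Firefox", "url": "/cart",     "time": "2023-10-10 11:00:00"},
]

IRRELEVANT = (".png", ".jpg", ".gif", ".css", ".js")

def clean(entries):
    """Data cleaning: drop image/graphic/multimedia requests."""
    return [e for e in entries if not e["url"].endswith(IRRELEVANT)]

def sessions(entries, timeout=timedelta(minutes=30)):
    """Group entries per user (same IP + user agent) and start a new
    session whenever the gap between requests exceeds the timeout."""
    result, last_seen = {}, {}
    for e in entries:
        user = (e["ip"], e["agent"])
        t = datetime.strptime(e["time"], "%Y-%m-%d %H:%M:%S")
        sess = result.setdefault(user, [[]])
        if user in last_seen and t - last_seen[user] > timeout:
            sess.append([])
        sess[-1].append(e["url"])
        last_seen[user] = t
    return result

print(sessions(clean(entries)))
```

Note how the same IP with a different user agent (`Chrome`) yields a separate user, and the 55-minute gap splits the Firefox user's requests into two sessions.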

B.3 Pattern Discovery: –

Discovering the desired patterns and extracting understandable knowledge from pre-processed data is a difficult task. Here we briefly discuss some techniques for discovering patterns in the processed data.

1. Association Rules: Association rule generation can be used to relate pages that are most often referenced together; it discovers the correlations between pages referenced together in a single server session/user session.
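The core idea of relating pages referenced together in a session can be illustrated by counting page-pair co-occurrences. Real association-rule mining (e.g. Apriori) also derives support and confidence thresholds; this toy, with made-up sessions, only computes the raw counts.

```python
from itertools import combinations
from collections import Counter

# A toy sketch of the idea behind association rules on sessions:
# count how often pairs of pages occur together in one session.

page_sessions = [
    ["/home", "/shop", "/cart"],
    ["/home", "/shop"],
    ["/home", "/blog"],
]

pair_counts = Counter()
for session in page_sessions:
    for pair in combinations(sorted(set(session)), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(1))  # the most frequently co-visited pair
```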

2. Sequential Patterns: By using this approach, Web marketers are able to predict future visiting patterns, which helps in placing advertisements aimed at certain user groups. Its disadvantage is that it is difficult to discover the interesting patterns when a huge amount of data is present.

3. Clustering: Clustering is a technique which groups together a set of items having similar characteristics. In the Web Usage domain, two kinds of clusters can be found: usage clusters and page clusters. Clustering of users finds groups of users exhibiting similar browsing patterns. Clustering of pages, on the other hand, discovers groups of pages having related content. This information is useful for Internet search engines and Web assistance providers.

4. Classification: Classification is done to identify the characteristics that indicate the group to which each case belongs. The resulting patterns can be used both to understand the existing data and to predict how new instances will behave. Classification techniques follow three approaches: statistical, machine learning and neural networks.

B.4 Pattern Analysis: – The final step of the entire process of Web Usage Mining is Pattern Analysis.

The objective of this procedure is to select the interesting patterns and filter out the uninteresting ones. The patterns are analysed using techniques such as Data and Knowledge Querying, OLAP techniques and Usability analysis. Data and Knowledge Querying methods use a tool like SQL. In OLAP techniques, the results of pattern discovery are loaded into a data cube, after which OLAP operations are performed. After this, visualization techniques, such as graphing patterns or assigning colours to different values, are used to highlight overall patterns.

BACKGROUND OF SUPPORT VECTOR MACHINES AND k-NEAREST NEIGHBOR CLASSIFICATION APPROACHES

Support Vector Machines (SVM) is one of the discriminative classification approaches, and it is commonly recognized to be among the more accurate ones. The SVM classification approach is based on the principle of Structural Risk Minimization (SRM) from statistical learning theory, an inductive principle for model selection. It is used for learning from finite training data and provides a method for controlling the generalization ability of learning machines that use a small training set. The essence of this principle is to find a hypothesis that guarantees the lowest true error. SVM uses both positive and negative training data sets, which are not required by all other classification methods. SVM needs positive and negative training sets in order to seek the decision surface, also known as the hyperplane, that best separates the positive from the negative data in the n-dimensional space. Support vectors are the document representatives closest to the decision surface. The performance of SVM classification remains unchanged even if documents that do not belong to the support vectors are removed from the training data.

SVM maps input vectors to a higher-dimensional vector space, where an optimal separating hyperplane is constructed. Our discussion starts with the simplest case: a linear SVM on two categories of separable data. In a linear classifier, two groups of separable data can be divided by a hyperplane. In Fig. 1, white dots show data points of one category, while black dots show data points of the other. The hyperplane which maximizes the margin is called the optimal separating hyperplane. The margin is the sum of the distances from the hyperplane to the closest training vectors of each category.

The hyperplanes that separate two groups of data can be expressed by Eq. 1:

w · x + b = 0 (Eq. 1)

In the equation above, x represents a set of training vectors, w represents the vector perpendicular to the separating hyperplane, and b represents the offset parameter, which allows the margin to be increased. There is actually an infinite number of hyperplanes that could separate the data in a vector space into two groups, as represented by the dashed lines in Fig. 1. The solid line represents the optimal separating hyperplane, and the margin in this case is d1 + d2.
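The decision rule that follows from Eq. 1 can be sketched directly: a point is classified by the sign of w · x + b. The weight vector and offset below are hand-picked for illustration, not learned by an SVM.

```python
# A sketch of the linear decision rule: classify a point x by the
# sign of w . x + b. Values of w and b are illustrative, not trained.

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def classify(x, w, b):
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    return 1 if dot(w, x) + b >= 0 else -1

w, b = [1.0, 1.0], -3.0         # hyperplane x1 + x2 = 3
print(classify([4, 2], w, b))   # 1  (positive side)
print(classify([1, 1], w, b))   # -1 (negative side)
```

Training an SVM amounts to choosing the w and b that maximize the margin between the two classes.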

k-nearest neighbour classification approach

The k-nearest neighbour (KNN) classification method is an instance-based learning algorithm. In KNN, objects are categorized based on the closest feature space in the training set. The training data is mapped into a multi-dimensional feature space, and this feature space is partitioned into regions based on the categories of the training set. A particular point in the feature space is assigned to a particular category if that is the most frequent category among its k nearest training data points. During the classifying stage, the KNN classification approach finds the k closest labelled training samples for an unlabelled input sample and assigns the input sample to the category that appears most frequently within the k-subset. KNN stands out from the other classification approaches by its simplicity: it requires only a training set with a small number of training samples, an integer which specifies the value of k, and a metric to measure closeness.

Fig. 4: Feature space of a 3-dimensional KNN classifier (dots, blue triangles and squares represent data points from ω1, ω2 and ω3, respectively; Xu represents an unlabelled input sample)

Fig. 4 shows an example of the feature space of a KNN classifier with three categories, where ω1, ω2 and ω3 represent three different categories with associated training samples; dots, blue triangles and squares represent data points from ω1, ω2 and ω3, respectively, and Xu represents an unlabelled input sample to be classified.

Euclidean distance is used for computing the distance between the vectors. The key element of this method is the availability of a similarity measure for identifying the neighbours of a particular document. The training phase consists only of storing the categories and the feature vectors of the training set. In the classifying phase, the distances between the input sample and all stored vectors are computed and the k closest samples are selected. The category of an input sample is predicted based on the categories assigned to its nearest points. In the case of N dimensions, the Euclidean distance between two points p and q is computed by the formula

d(p, q) = sqrt( (p1 − q1)² + (p2 − q2)² + … + (pN − qN)² )

where pi (or qi) is the coordinate of p (or q) in dimension i. Based on the example shown in Fig. 4, the classification task is to annotate the input sample Xu with its right category.

In this case, the value of k has been set to 5, and Euclidean distance is used to compute the distance between Xu and the training samples in the feature space. Of the 5 closest neighbours, 4 belong to ω1 and 1 belongs to ω3. As the result of KNN classification, Xu is annotated with the category ω1, the majority category.

The KNN classification approach is outstanding in its simplicity, and it performs well even on classification tasks with multi-categorized documents. The major drawback of KNN is that it uses all features in the distance computation, which makes the method computationally intensive, especially when the training set grows. Besides this, the accuracy of k-nearest neighbour classification is severely degraded by the presence of noisy or irrelevant features, especially as the number of attributes grows.
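The procedure described above fits in a few lines: Euclidean distance, the k closest training samples, and a majority vote. The training points and labels below are illustrative, chosen to mirror the Fig. 4 example where 4 of the 5 nearest neighbours share one category.

```python
import math
from collections import Counter

# A minimal KNN classifier: Euclidean distance, k nearest training
# samples, majority vote. Training data is illustrative.

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(x, training, k=5):
    """training is a list of (point, label) pairs."""
    neighbours = sorted(training, key=lambda pl: euclidean(x, pl[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

training = [
    ((1.0, 1.0), "w1"), ((1.2, 0.8), "w1"), ((0.9, 1.1), "w1"),
    ((0.8, 1.3), "w1"), ((5.0, 5.0), "w2"), ((5.1, 4.9), "w2"),
    ((9.0, 1.0), "w3"),
]

print(knn_classify((1.0, 1.2), training, k=5))  # "w1"
```

With k=5, the five nearest neighbours of the query point are the four "w1" samples plus one "w2" sample, so the majority vote yields "w1".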

Proposed work

We propose an integration of the nearest neighbour classifier and the support vector machine to classify objects.

The KNN algorithm is used in the KNN-SVM hybrid classification approach as the first step of the classification task. In the situation where all k neighbours belong to the same class, there is no conflict in the classification of the given input data, so a straightforward classification is performed by the simple KNN algorithm without involving the SVM. On the other hand, if the k neighbours do not all belong to the same class, the hybrid classification approach (KNN-SVM) is used to perform the classification task. In the proposed KNN-SVM hybrid approach, the k-Nearest Neighbour algorithm is used to prune the original training set, so that the SVM, which acts as the classification engine of this hybrid model, does not need to be trained on the original training set, but on a pruned training set from which unnecessary training samples have been eliminated. This ensures a faster training process for the SVM on a smaller training set, without sacrificing classification performance.
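The hybrid flow above can be sketched as follows. Note the second stage here is a stand-in stub (a nearest-neighbour fallback of our own devising), not a real SVM: in the proposed model it would be an SVM trained on the pruned neighbourhood.

```python
import math

# A sketch of the hybrid KNN-SVM flow: run KNN first; if all k
# neighbours agree, return that label directly; otherwise hand the
# pruned neighbourhood to a second-stage classifier. The second stage
# below is a hypothetical stub standing in for the SVM.

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def second_stage(x, pruned):
    """Stand-in for the SVM: label of the single nearest sample in the
    pruned set (a placeholder, not the real SVM training/prediction)."""
    return min(pruned, key=lambda pl: euclidean(x, pl[0]))[1]

def hybrid_classify(x, training, k=5):
    neighbours = sorted(training, key=lambda pl: euclidean(x, pl[0]))[:k]
    labels = {label for _, label in neighbours}
    if len(labels) == 1:                 # all k neighbours agree: plain KNN
        return labels.pop()
    return second_stage(x, neighbours)   # conflict: defer to second stage

training = [
    ((1.0, 1.0), "w1"), ((1.1, 0.9), "w1"), ((0.9, 1.1), "w1"),
    ((1.2, 1.2), "w1"), ((0.8, 0.8), "w1"), ((5.0, 5.0), "w2"),
]

print(hybrid_classify((1.0, 1.0), training, k=5))  # "w1" via plain KNN
```

Only ambiguous inputs ever reach the second stage, which is what keeps the expensive classifier's training set small.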

Flow of data:

Comparative Analysis:

Metric      | KNN  | KNN-SVM
Accuracy    | Less | More
Error rate  | More | Less

Conclusion:

All information is available on the Internet, but it is not easy for every user to find the relevant information in a short span of time. To overcome this problem, recommendation systems were introduced to the Web world. In this paper, we addressed the problem that K-Nearest Neighbor based techniques find it difficult to perform classification when the data is varied or goes out of bound; to solve this, where the data does not remain within bounds, we use the hybridization of k-Nearest Neighbor and Support Vector Machine. In future work, an implementation of this proposed model will be provided.

About this essay:

If you use part of this page in your own work, you need to provide a citation, as follows:

Essay Sauce, Automated Web Usage Mining using Hybrid K-Nearest Neighbor and Support Vector Machine. Available from:<https://www.essaysauce.com/sample-essays/2017-10-31-1509448519-2/> [Accessed 16-04-26].


* This essay may have been previously published on EssaySauce.com and/or Essay.uk.com at an earlier date than indicated.