Convert web documents to clustered documents (outline)

The volume of data in the digital world is growing rapidly, which badly affects forensic analysis. There is therefore a need for a quick method that can group the required documents. A number of algorithms, such as k-means and agglomerative clustering, are used for clustering. The system preprocesses the unstructured input into a structured format, then extracts four important features of each document: numeric words, proper nouns, title sentences, and term weights. Restricting the analysis to these four features keeps the method simple. The system then ignores unwanted file extensions and considers only extensions that are rich in text, such as .pdf, .doc, and .txt. As the final step of clustering, the system compares all the documents with one another to build a score matrix that contains the aggregate feature scores. Grouping these scored values yields the clustered documents.
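The four features can be illustrated with a minimal sketch. The essay does not say how each feature is computed, so the rules below are assumptions: numeric words are all-digit tokens, proper nouns are capitalized words that do not start a sentence, title sentences are sentences that share a word with the title, and term weights are relative term frequencies. The function name `extract_features` is hypothetical.

```python
import re
from collections import Counter

WORD = re.compile(r"[A-Za-z0-9]+")

def extract_features(title, text):
    """Return heuristic versions of the four per-document features."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    words = WORD.findall(text)

    # Numeric words: tokens made only of digits (assumed definition).
    numeric_words = [w for w in words if w.isdigit()]

    # Proper nouns: capitalized tokens that are not the first word of a
    # sentence (a rough heuristic, not a part-of-speech tagger).
    proper_nouns = []
    for sentence in sentences:
        tokens = WORD.findall(sentence)
        proper_nouns += [t for t in tokens[1:] if t[0].isupper()]

    # Title sentences: sentences sharing at least one word with the title.
    title_words = {w.lower() for w in WORD.findall(title)}
    title_sentences = [
        s for s in sentences
        if title_words & {w.lower() for w in WORD.findall(s)}
    ]

    # Term weights: relative frequency of each lowercased term.
    counts = Counter(w.lower() for w in words)
    total = sum(counts.values()) or 1
    term_weights = {term: n / total for term, n in counts.items()}

    return {
        "numeric_words": numeric_words,
        "proper_nouns": proper_nouns,
        "title_sentences": title_sentences,
        "term_weights": term_weights,
    }
```

A real system would more likely use a part-of-speech tagger for proper nouns and a TF-IDF style weight for terms; the heuristics here only show the shape of the feature record.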
The system first creates an interactive web crawler, which parses the web pages, collects the data, and saves it in .txt format. The folder in which this web data is stored is then given as input to the system, which preprocesses the data to extract the features. Fuzzy logic is then applied to obtain a classification pattern of the feature scores, and this pattern is fed to the weighted matrix method to create semantic clusters of the web page documents.
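The folder input step, together with the rule of keeping only text-rich extensions, can be sketched as follows. The function name `select_documents` and the constant `TEXT_RICH_EXTENSIONS` are hypothetical; the extension list is the one named in the essay.

```python
from pathlib import Path

# Extensions the essay names as rich in text.
TEXT_RICH_EXTENSIONS = {".pdf", ".doc", ".txt"}

def select_documents(folder):
    """Return the sorted files in `folder` that have a text-rich extension."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in TEXT_RICH_EXTENSIONS
    )
```

The comparison is made on the lowercased suffix, so a file such as `report.PDF` is kept while images and other binary files are skipped.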
Figure 1: Overall System Diagram
The main aim is to convert many web documents into clustered documents. The web document clustering system consists of a web crawler, data preprocessing, feature extraction, and a weighted score matrix. The web crawler collects the web pages that will be converted into clustered information. Data preprocessing consists of special symbol removal, stop word removal, and stemming. Feature extraction starts from an initial set of measured data and builds derived values (features) intended to be informative and non-redundant, facilitating the subsequent learning and generalization steps and in some cases leading to better human interpretation. Feature extraction is related to dimensionality reduction. When the input data to an algorithm is too large to be processed and is suspected to be redundant (e.g. the same measurement in both feet and meters, or the repetitiveness of images presented as pixels), it can be transformed into a reduced set of features (also called a “feature vector”). This process is called feature extraction. The extracted features are expected to contain the relevant information from the input data, so that the desired task can be performed using this reduced representation instead of the complete initial data.
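The three preprocessing steps can be sketched as below. This is a minimal illustration under stated assumptions: the stop word list is a tiny sample, and `naive_stem` is a simple suffix stripper standing in for a real stemmer such as the Porter algorithm. All names are hypothetical.

```python
import re

# Tiny illustrative stop word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "and", "to", "that"}

# Suffixes tried in order by the naive stemmer (not the Porter algorithm).
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Remove special symbols, drop stop words, then stem each token."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    tokens = [t for t in cleaned.split() if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]
```

For example, `preprocess("The crawlers are parsing web-pages, and clustered documents!")` returns `["crawler", "pars", "web", "pag", "cluster", "document"]`. The over-stemmed tokens such as `pag` show why a proper stemmer is usually preferred.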
Fuzzy logic can be used as an interpretation model for the properties of neural networks, as well as for giving a more precise description of their performance. Fuzzy operators can be conceived as generalized output functions of computing units. Fuzzy logic can also be used to specify networks directly, without having to apply a learning algorithm. An expert in a certain field can sometimes produce a simple set of control rules for a dynamical system with less effort than the work involved in training a neural network.
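As a rough illustration of how fuzzy logic could classify a feature score, the sketch below maps a score in [0, 1] to degrees of membership in three sets and combines memberships with the standard min, max, and complement operators. The set boundaries, the function names, and the rule inside `relevance` are all assumptions made for illustration; the essay does not specify them.

```python
def fuzzify(score):
    """Membership of a score in [0, 1] in the fuzzy sets low, medium, high."""
    return {
        "low": max(0.0, 1.0 - 2.0 * score),                # 1 at 0, 0 from 0.5 up
        "medium": max(0.0, 1.0 - abs(2.0 * score - 1.0)),  # peak of 1 at 0.5
        "high": max(0.0, 2.0 * score - 1.0),               # 0 up to 0.5, 1 at 1
    }

def fuzzy_and(a, b):
    """Standard fuzzy intersection (minimum)."""
    return min(a, b)

def fuzzy_or(a, b):
    """Standard fuzzy union (maximum)."""
    return max(a, b)

def fuzzy_not(a):
    """Standard fuzzy complement."""
    return 1.0 - a

def classify(score):
    """Return the label of the set in which the score has the highest membership."""
    memberships = fuzzify(score)
    return max(memberships, key=memberships.get)

def relevance(proper_noun_score, term_weight_score):
    """Hypothetical rule: proper-noun score is high AND term weight is not low."""
    return fuzzy_and(
        fuzzify(proper_noun_score)["high"],
        fuzzy_not(fuzzify(term_weight_score)["low"]),
    )
```

A rule written this way is the kind of simple expert rule the paragraph describes: it is stated directly rather than learned from training data.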
A weighted score matrix is used to define the level of importance of each criterion. Assigning meaning to weighting factors is subjective. For this reason, keep the number of weighting factors small.
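A weighted score matrix over the four features might look like the sketch below. The weights, the feature keys, and the per-feature similarity (one minus the absolute difference of scores in [0, 1]) are assumptions chosen for illustration, not values taken from the essay.

```python
# Assumed weights for the four features; they sum to 1.
WEIGHTS = {"numeric": 0.2, "proper": 0.3, "title": 0.2, "term": 0.3}

def pair_score(doc_a, doc_b, weights=WEIGHTS):
    """Weighted similarity of two documents whose feature scores lie in [0, 1]."""
    return sum(
        weight * (1.0 - abs(doc_a[feature] - doc_b[feature]))
        for feature, weight in weights.items()
    )

def score_matrix(docs, weights=WEIGHTS):
    """Compare every document with every other and return the full matrix."""
    n = len(docs)
    return [
        [pair_score(docs[i], docs[j], weights) for j in range(n)]
        for i in range(n)
    ]
```

With only four weights the subjectivity stays manageable: changing one weight has an effect on the matrix that is easy to reason about.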
Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics. Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to find clusters efficiently. Popular notions of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals, or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem.
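One simple way to turn a pairwise score matrix into groups is sketched below. It is a stand-in technique, not necessarily the one the system uses: documents are linked whenever their score reaches a threshold, and each connected group becomes a cluster (equivalent to cutting single-link clustering at that threshold). The function name and the threshold value are hypothetical.

```python
def cluster_by_threshold(matrix, threshold):
    """Group document indices whose pairwise score is at least `threshold`."""
    n = len(matrix)
    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(i):
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the clusters of every pair that is similar enough.
    for i in range(n):
        for j in range(i + 1, n):
            if matrix[i][j] >= threshold:
                parent[find(i)] = find(j)

    # Collect the members of each root into a cluster.
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Raising the threshold produces more, smaller clusters; lowering it merges them, which reflects the point that the result depends on the chosen notion of a cluster.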
2015-12-2-1449038515
