Rank Summarization for Large Scale Documents Using Active Learning



Summarization for Ranking using Active Learning

M.S. Anbarasi

Assistant Professor, Department of Information Technology, Pondicherry Engineering College

Pondicherry Engineering College as PEC

Pondicherry, India

anbarasims@pec.edu

Abirami A, Hemalatha S, Meenatchi K, Akila N

Final Year Student, Department of Information Technology, Pondicherry Engineering College

Pondicherry Engineering College as PEC

Pondicherry, India

abirami12th01@pec.edu

Abstract: Documents are being created and shared at an unprecedented rate, and document sets have become overwhelming in every aspect of file transfer. For both end users and data analysts, it is a nightmare to plow through millions of documents that contain an enormous amount of information to be searched, much of it redundant. Existing summarization methods focus on static, small-scale data sets, whereas dynamic, fast-arriving data sets are difficult to handle. In this paper we propose a system designed to deal with dynamic, fast-arriving, large-scale document sets.

Keywords: Summarization, ranking, pairwise ranking, active learning

I. INTRODUCTION

What is a summary? A summary is a shortened version of an original text. It includes the thesis and the major supporting points, and should reveal the relationship between them. How long is a summary? It may be any length, from 25% of the original down to a single sentence. The main steps in summarization are to look for the major divisions of the text and to summarize each division in one sentence. That may mean summarizing each paragraph, but often several paragraphs go together. Finally, make a list of all the major points.

Document clustering is one possible approach to the information overload problem. For example, Google indexes 4 billion URLs, and there are 200 TB of data on the Web. At this scale, document summarization is the most practical technique. Summarization represents a set of documents by a summary consisting of several sentences. Intuitively, a good summary should cover the main topics (or sub-topics) and have diversity among its sentences to reduce redundancy.

Traditional document summarization approaches, however, are not as effective in this context, given both the large volume of documents and the fast, continuous nature of their arrival. Summarizing such document sets therefore requires functionality that differs significantly from traditional summarization. Implementing it is not an easy task, since a large number of short documents, such as tweets, are meaningless, irrelevant and noisy in nature. A good solution for document summarization has to address the following main issues: (1) Efficiency: document sets are always very large in scale, hence the summarization algorithm, that is, the document cluster vector algorithm, should be highly efficient; (2) Topic evolution: it should automatically detect sub-topic changes and the moments at which they happen.

These summarization algorithms employ one main data structure to keep the important sentences of the documents in clusters. This novel compressed structure is called the document cluster vector (DCV). DCVs are considered potential sub-topic delegates and are maintained dynamically in memory during document processing.
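As a rough illustration of how such a structure might be maintained in memory, the sketch below keeps a running sum of term counts, a document count and a few representative documents per cluster. The class name `DocumentClusterVector`, its fields and the cosine-similarity rule are assumptions made for illustration only; the exact layout of a DCV is not specified here.

```python
from collections import Counter
from dataclasses import dataclass, field
import math


@dataclass
class DocumentClusterVector:
    """Hypothetical compressed summary of one cluster (layout is assumed)."""
    term_sum: Counter = field(default_factory=Counter)    # summed term counts
    n_docs: int = 0                                        # documents absorbed
    representatives: list = field(default_factory=list)   # a few sample doc ids
    max_representatives: int = 3

    def add(self, doc_id, tokens):
        """Absorb one document without storing its full text."""
        self.term_sum.update(tokens)
        self.n_docs += 1
        if len(self.representatives) < self.max_representatives:
            self.representatives.append(doc_id)

    def similarity(self, tokens):
        """Cosine similarity between a document and the cluster's term vector."""
        doc = Counter(tokens)
        dot = sum(count * self.term_sum[term] for term, count in doc.items())
        norm_doc = math.sqrt(sum(c * c for c in doc.values()))
        norm_cluster = math.sqrt(sum(c * c for c in self.term_sum.values()))
        if norm_doc == 0 or norm_cluster == 0:
            return 0.0
        return dot / (norm_doc * norm_cluster)
```

An arriving document could then be assigned to the cluster with the highest similarity, or start a new cluster when no similarity exceeds a chosen threshold; that assignment policy is likewise an assumption.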

I. SUMMARIZATION

High-level summarization

The high-level summarization module provides two types of summaries: online and historical. An online summary describes what is currently being discussed among the public; the input for generating online summaries is therefore retrieved directly from the current clusters maintained in memory. A historical summary, on the other hand, helps people understand the main happenings during a specific period, which means we need to eliminate the influence of tweet content from outside that period. As a result, retrieving the information required for historical summaries is more complicated, and this shall be our focus in the following discussion.

Document Summarization

Document summarization can be categorized along two different dimensions: abstract-based and extract-based. An extract-summary consists of sentences extracted from the document while an abstract-summary may employ words and phrases that do not appear in the original document. The summarization task can also be categorized as either generic or query-oriented. A query-oriented summary presents the information that is most relevant to the given queries, while a generic summary gives an overall sense of the document’s content.

Multi document Summarization Specifications

Multi-document summarization methods decompose the documents into sentences and work directly in the sentence space using a term-sentence matrix. However, the knowledge on the document side, i.e. the topics embedded in the documents, can help the context understanding and guide the sentence selection in the summarization procedure.

After clustering, the documents are passed to the second module, summarization. The documents from the first module contain a lot of information, and the summarization process is carried out to select the most informative data from the clustered data. It selects only the information needed for ranking; for the ranking process, this system requires the product details and the investment details. Summarization is done in two ways: online and historical. Online summarization summarizes current information, whereas historical summarization summarizes older information. In this system, if the share market details are previous data, historical summarization is carried out; if they are current or new information, online summarization is carried out.

The goal of extractive text summarization is to select the most relevant sentences of the text. The proposed method uses a statistical and linguistic approach to find the most relevant sentences. The summarization system consists of three major steps: preprocessing, extraction of feature terms, and an algorithm for ranking the sentences based on the optimized feature weights.
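The ranking step can be sketched as a weighted sum of per-sentence features. The example below is a minimal sketch, not the proposed method itself: the two features (term frequency and sentence position), the function name `rank_sentences` and the weight values are all assumptions chosen for illustration.

```python
from collections import Counter


def rank_sentences(sentences, weights=(0.7, 0.3)):
    """Rank tokenized sentences by a weighted sum of two simple features.

    sentences: list of token lists, already preprocessed.
    weights: (frequency weight, position weight); the values are assumptions.
    Returns sentence indices, best first.
    """
    term_freq = Counter(tok for sent in sentences for tok in sent)
    top = max(term_freq.values(), default=1)
    scores = []
    for idx, sent in enumerate(sentences):
        # Feature 1: average normalized frequency of the sentence's terms.
        freq = sum(term_freq[t] / top for t in sent) / len(sent) if sent else 0.0
        # Feature 2: earlier sentences receive a higher position score.
        position = 1.0 - idx / len(sentences)
        scores.append(weights[0] * freq + weights[1] * position)
    return sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
```

The top-ranked indices can then be mapped back to the original sentences to form the extract.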

Pre-processing

This step involves sentence segmentation, tokenization, stop word removal and stemming.

Sentence Segmentation

Sentence segmentation is the process of decomposing the given text document into its constituent sentences, along with the word count of each. In English, a sentence is segmented by identifying its boundary, which ends with a full stop (.), a question mark (?) or an exclamation mark (!).
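A minimal sketch of this step is given below. The function name `segment_sentences` is hypothetical, and the rule is deliberately naive: it splits after any terminator followed by whitespace, so abbreviations such as "Dr." would be split incorrectly.

```python
import re


def segment_sentences(text):
    """Split text at '.', '?' or '!' and report each sentence's word count."""
    # Split on whitespace that follows a sentence terminator.
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [(sentence, len(sentence.split())) for sentence in parts if sentence]
```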

Tokenization

Tokenization is the process of splitting the sentences into words by identifying the spaces, commas and special symbols between the words. Lists of sentences and words are maintained for further processing.
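One simple way to realize this is a regular expression that keeps runs of letters and digits and discards everything else. In the sketch below the function name `tokenize` is hypothetical, and lowercasing the tokens is an added assumption that simplifies later matching.

```python
import re


def tokenize(sentence):
    """Split a sentence into lowercase word tokens, dropping punctuation."""
    # A token is a run of letters or digits, optionally followed by an
    # apostrophe and more letters (so "don't" stays in one piece).
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", sentence.lower())
```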

Stop Word Removal

Stop words are common words that carry less meaning than keywords. These words should be eliminated; otherwise, sentences containing them can influence the generated summary.
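The filtering itself is a set-membership test, as sketched below. The stop word list shown is a tiny illustrative sample, not the list used in this work, and the names `STOP_WORDS` and `remove_stop_words` are hypothetical.

```python
# A small illustrative stop word list; a real system would use a fuller one.
STOP_WORDS = frozenset({"a", "an", "the", "is", "are", "of", "in", "on", "and", "to"})


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words, preserving their order."""
    return [tok for tok in tokens if tok.lower() not in stop_words]
```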

Stemming

A word can be found in different forms in the same document. These forms have to be converted to their original form for simplicity. A stemming algorithm is used to transform words to their canonical forms. In this work, Porter’s stemmer is used, which reduces a word to its root form using a predefined suffix list.
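The idea of suffix-based stemming can be shown with a toy stripper. The sketch below is not Porter's algorithm, which applies several ordered rule phases with conditions on the remaining stem; the suffix list, the minimum stem length and the name `simple_stem` are assumptions for illustration. In practice a tested implementation such as NLTK's `PorterStemmer` would be the safer choice.

```python
# Illustrative suffix list only; Porter's stemmer uses a richer rule set.
SUFFIXES = ("ations", "ation", "ings", "ing", "ies", "ed", "es", "s")


def simple_stem(word, min_stem=3):
    """Strip the longest matching suffix, keeping at least min_stem letters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word
```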

Fig. High-level architecture diagram

ACTIVE LEARNING TO RANK

To obtain better performance, active learning spends limited human resources on labeling the most informative examples among the unlabeled ones. This kind of active learning is called selective sampling. On the one hand, active learning consists of learning a ranking function from a training set built during learning; the quality of the ranking function is closely tied to the amount of partially labeled data used to train it. On the other hand, to build the training set of the model, it offers the user optimal selection strategies. A typical one is the query-by-committee (QBC) algorithm, which consists of two steps. The first builds a committee formed by a set of diverse hypotheses trained on the currently labeled data.

The second selects the optimal queries by measuring their informativeness, calculated as the disagreement among the committee members on their rankings. In the context of the DR ranking problem, an active learning approach has been presented that is in principle extensible to any other partially (or totally) ordered ranking task. The novelty of that approach lies in relying on standard loss minimization for rank learning through the use of normalized ranking loss estimation. Long et al. proposed a two-stage optimization that minimizes the expected DCG loss, integrating both query selection and document selection into active learning to rank. For text summarization, Truong proposed an active learning method within the framework of the ranking of alternatives, along with several strategies for selecting instances to label.
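The second QBC step can be sketched as follows, assuming the committee members have already produced a ranking of the same documents for each unlabeled query. Measuring disagreement by the number of discordant document pairs (the Kendall tau distance) is one common choice and is an assumption here, as are the function names.

```python
from itertools import combinations


def discordant_pairs(rank_a, rank_b):
    """Count document pairs that two rankings order differently."""
    pos_a = {doc: i for i, doc in enumerate(rank_a)}
    pos_b = {doc: i for i, doc in enumerate(rank_b)}
    return sum(
        (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
        for x, y in combinations(rank_a, 2)
    )


def committee_disagreement(rankings):
    """Average pairwise disagreement among the committee's rankings."""
    pairs = list(combinations(rankings, 2))
    if not pairs:
        return 0.0
    return sum(discordant_pairs(a, b) for a, b in pairs) / len(pairs)


def select_query(candidates):
    """Pick the unlabeled query whose committee rankings disagree most.

    candidates: dict mapping a query id to the list of rankings produced by
    the committee members (each ranking is a list of document ids).
    """
    return max(candidates, key=lambda q: committee_disagreement(candidates[q]))
```

The selected query would then be sent for labeling, added to the training set, and the committee retrained.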

II. DCV-RANK SUMMARIZATION ALGORITHM

Given an input document cluster set, we denote its corresponding DCV (Document Cluster Vector) set as D(c). A document set D consists of all the documents in the ft_sets in D(c). The multi-document summarization problem is to extract k documents from D so that they cover as much of the document content as possible.

Formulae and derivations

Let us first describe this problem formally. Denote F = {D1, D2, D3, …, Dn} as a collection of non-empty subsets of D, where a subset Di represents a sub-topic and |Di| is the number of its related documents. Suppose that for each Di there is a document that represents the content of Di’s sub-topic. Then selecting k documents is equivalent to selecting k subsets. The problem can now be defined as: given a number k and a collection of sets F, select k subsets from F that together cover as many documents as possible.
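Selecting k subsets to cover as many documents as possible is an instance of the maximum coverage problem, which is NP-hard. A standard greedy heuristic, which repeatedly picks the subset adding the most uncovered documents, achieves a (1 − 1/e) approximation. The sketch below shows that heuristic under the assumption that each sub-topic is given as a set of document ids; it is offered as an illustration and is not necessarily the selection procedure used in this system.

```python
def greedy_max_coverage(subsets, k):
    """Pick up to k subsets that greedily maximize the number of covered items.

    subsets: dict mapping a sub-topic id to a set of document ids.
    Returns (chosen sub-topic ids in selection order, covered document ids).
    """
    covered, chosen = set(), []
    remaining = dict(subsets)
    for _ in range(min(k, len(remaining))):
        # Choose the subset that adds the most documents not yet covered.
        best = max(remaining, key=lambda s: len(remaining[s] - covered))
        if not remaining[best] - covered:
            break  # no remaining subset covers anything new
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered
```

The representative document of each chosen sub-topic would then form the k-document summary.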

Three important aspects characterize research on summarization:

• Summaries may be produced from a single document or from multiple documents.

• Summaries should preserve important information.

• Summaries should be short.

III. EXPECTED LOSS OPTIMIZATION

QUERY LEVEL RANKING

In the case of ranking, the “action” in the ELO framework is slightly different from before, because we are not directly interested in predicting the scores; instead, we want to produce a ranking. The set of actions is therefore the set of permutations of length n and, for a given permutation π, π(i) is the rank of the i-th document. The expected loss for a given π can be written as:

$$\int_{Y} \ell(\pi, Y)\, P(Y \mid X_q, D)\, dY$$

where $\ell(\pi, Y)$ quantifies the loss in ranking according to π if the true labels are given by Y. The next section will detail the computation of the expected loss where $\ell$ is the DCG loss. As before, the ELO principle for active learning tells us to select the queries with the highest expected losses:

$$EL(q) = \min_{\pi} \int_{Y} \ell(\pi, Y)\, P(Y \mid X_q, D)\, dY$$

As an aside, note that the ranking minimizing the loss is not necessarily the one obtained by sorting the documents according to their mean predicted scores. This has already been noted in prior work.
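For the DCG loss, the expected loss of a query can be estimated by sampling. The sketch below assumes that label vectors Y can be drawn from the predictive distribution P(Y | X_q, D), that the gain of a label y is 2^y − 1 and that the discount at rank r is 1/log2(r + 1); the function names are hypothetical. It also illustrates the aside above: the ranking that minimizes the expected DCG loss sorts documents by expected gain rather than by mean score.

```python
import math


def dcg(gains_in_rank_order):
    """DCG of gains listed from rank 1 downward."""
    return sum(g / math.log2(r + 1) for r, g in enumerate(gains_in_rank_order, start=1))


def expected_dcg_loss(score_samples):
    """Monte Carlo estimate of the expected DCG loss for one query.

    score_samples: list of sampled label vectors Y (one relevance score per
    document), assumed to be drawn from P(Y | X_q, D).
    """
    n_samples = len(score_samples)
    gain_samples = [[2 ** y - 1 for y in sample] for sample in score_samples]
    # Expected best achievable DCG: each sample ranked by its own gains.
    expected_best = sum(dcg(sorted(g, reverse=True)) for g in gain_samples) / n_samples
    # Best single ranking: sort documents by expected gain, not mean score.
    expected_gain = [sum(col) / n_samples for col in zip(*gain_samples)]
    best_expected = dcg(sorted(expected_gain, reverse=True))
    return expected_best - best_expected
```

Queries with a larger estimate are those whose ranking is most uncertain and are therefore the candidates to label first.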

DOCUMENT LEVEL RANKING

Selecting the most informative document is a bit more complex, because the loss function in ranking is defined at the query level and not at the document level. We can still use the expected loss criterion, but we only consider the predictive distribution for the document of interest and treat the scores of the other documents as fixed. Then we take an expectation over the scores of the other documents. This leads to:

$$EL(q, i) = \int_{Y^{i}} \min_{\pi} \int_{y_i} \ell(\pi, Y)\, P(Y \mid X_q, D)\, dy_i\, dY^{i}$$

where $y_i$ is the score of the i-th document and $Y^{i}$ denotes the scores of the other documents.


TABLE I. DATA SETS AND NUMBER OF EXAMPLES

No.  Data set      Number of examples
1.   Base set 2k   ~2,000
2.   Base set 4k   ~4,000
3.   Base set 8k   ~8,000
4.   AL set        ~16,000
5.   Test set      ~18,000

The sample data sets and the corresponding graph are given in this paper and may be useful for future work. The size of the data set varies across experiments.

Figure 1. Example of a ranking

The document-level active learning methods are compared in terms of the DCG-10 of the resulting ranking functions on the test set. Those ranking functions are trained with the base data set and the selected examples. The x-axis denotes the number of examples selected by the active learning algorithm. For all three methods, the DCG increases with the number of added examples. This agrees with the intuition that the quality of a ranking function is positively correlated with the number of examples in the training set.
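For reference, the evaluation measure can be computed as sketched below, assuming graded relevance labels listed in the order produced by the ranking function, an exponential gain 2^rel − 1 and a log2 discount. The function names and the per-query averaging are assumptions for illustration.

```python
import math


def dcg_at_k(relevances, k=10):
    """DCG over the top-k documents, given graded relevance in ranked order."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def mean_dcg_at_k(ranked_relevances_per_query, k=10):
    """Average DCG@k over all test queries."""
    if not ranked_relevances_per_query:
        return 0.0
    total = sum(dcg_at_k(rels, k) for rels in ranked_relevances_per_query)
    return total / len(ranked_relevances_per_query)
```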

IV. CONCLUSION

In the literature surveyed, short text messages such as tweets are summarized; in this system, input documents are summarized based on our domain [1]. Clustering is an essential task in the data mining process, used to form groups, or clusters, of a given data set based on the similarity between its items [4]. K-means clustering is a clustering method in which the given data set is divided into K clusters. This work attempts to study the feasibility of the K-means clustering algorithm in data mining using the ranking method.

One exception uses expected loss optimization (ELO) to estimate which queries should be selected, but it is limited to rankers that predict absolute graded relevance [5]. The ranking engine is the component that runs the ranking algorithm on the current data, which is indexed by the crawler, in order to provide an order of relevance for the web documents with respect to the user query [6].

ACKNOWLEDGMENT

The idea was proposed by final-year students of Pondicherry Engineering College under the guidance of Dr. M. S. Anbarasi, B.E., M.E., Ph.D., Assistant Professor, Department of Information Technology, India.

References

[1] Zhenhua Wang, Lidan Shou, Ke Chen, Gang Chen, and Sharad Mehrotra, “On Summarization and Timeline Generation for Evolutionary Tweet Streams,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 5, pp. 1301–1315, May 2015.

[2] Ehsan Elhamifar, Guillermo Sapiro, Allen Yang, and S. Shankar Sastry, “A Convex Optimization Framework for Active Learning,” IEEE Transactions on Knowledge and Data Engineering, 216, ISSN 1550-5499, 1–8 Dec. 2013.

[3] Beya Boushih and Nahla Ben Amor, “Rank Aggregation Using Active Learning in Meta-searching,” IEEE, pp. 43–48, Print ISBN 978-1-4799-4648-8, 12 Nov. 2014.

[4] Navjot Kaur, Jaspreet Kaur Sahiwal, and Navneet Kaur, “Efficient K-Means Clustering Algorithm Using Ranking Method in Data Mining,” International Journal of Advanced Research in Computer Engineering & Technology, vol. 1, no. 3, pp. 85–91, ISSN 2278-1323, May 2012.

[5] Mustafa Bilgic and Paul N. Bennett, “Active Query Selection for Learning Rankers,” SIGIR’12, August 12–16, 2012, Portland, Oregon, USA. ACM 978-1-4503-1472-5/12/08.

[6] N. V. Pardakhe and R. R. Keole, “Analysis of Various Web Page Ranking Algorithms in Web Structure Mining,” International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, ISSN (Print) 2319-5940, ISSN (Online) 2278-1021, December 2013.

[7] J. A. Aslam, E. Kanoulas, V. Pavlu, S. Savev, and E. Yilmaz, “Document selection methodologies for efficient and effective learning-to-rank,” in Proc. 32nd Int. ACM SIGIR Conf. Res. Develop. Inform. Retrieval, 2009, pp. 468–475.

[8] Dingding Wang, Shenghuo Zhu, Tao Li, and Yihong Gong, “Multi-Document Summarization using Sentence-based Topic Models,” Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pp. 297–300, Suntec, Singapore, 4 August 2009.

[9] Wenbin Cai, Muhan Zhang, and Ya Zhang, “Active Learning for Web Search Ranking via Noise Injection,” ACM Trans. Web, vol. 9, no. 1, Article 3, January 2015, 31 pages.

Authors Profile

Dr. Anbarasi M. S. received the M.Tech. (Software Engineering) from the College of Engineering, Guindy, Anna University, Chennai, and completed a Ph.D. in data mining using clustering techniques in a cloud environment. At present, Dr. Anbarasi guides Ph.D. scholars in data mining, cloud computing and software engineering.

Abirami A, Hemalatha S, Meenatchi K and Akila N are final-year B.Tech. students in Information Technology at Pondicherry Engineering College. The authors have already communicated with the 6th Annual International Conference on Computer Science Education: Innovation and Technology (CSEIT 2015).
