In recent years, both the volume of Arabic content and the number of Arabic-speaking users on the Internet have increased greatly, as can be seen from table 2.1 and figure 2.1 of the top ten languages on the Internet. Arabic is a widely spoken language with more than 375 million speakers, of whom over 155 million, or over forty percent, use the Internet. This represents nearly five percent of all Internet users in the world. The number of Arabic-speaking Internet users grew by a factor of sixty over the fifteen years from 2000 to 2015. This growth in usage has outpaced the growth in information retrieval systems, summarization of Arabic text (such as documents and web pages), query processing and natural language processors.
The W3C proposed XML as a standard for use in web applications, transactions, documentation, database management systems and the exchange of information between systems over the Internet. XML allows data to be stored independently of how it will be displayed. XML has been used to create, update and query databases. XML documents that are both human-readable and machine-readable are easy to create and write, so it is easy to build applications that process them. In general, all kinds of information can be expressed as XML documents.
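As a small illustration of storing data in XML and querying it with a path expression, the following sketch uses Python's standard `xml.etree.ElementTree` module. The element and attribute names (`collection`, `document`, `term`, `id`) and both function names are assumptions made for this example; they are not taken from the RAX system, and ElementTree supports only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET


def build_collection(docs):
    """Serialize {doc_id: [terms]} to an XML string (hypothetical layout)."""
    collection = ET.Element("collection")
    for doc_id, terms in docs.items():
        doc = ET.SubElement(collection, "document", id=doc_id)
        for term in terms:
            ET.SubElement(doc, "term").text = term
    # encoding="unicode" returns a str, so the Arabic text stays readable
    return ET.tostring(collection, encoding="unicode")


def documents_containing(xml_text, term):
    """Return the ids of documents that have a <term> child equal to `term`."""
    root = ET.fromstring(xml_text)
    # [term='...'] is part of the XPath subset that ElementTree supports
    matches = root.findall("./document[term='{}']".format(term))
    return [doc.get("id") for doc in matches]


if __name__ == "__main__":
    xml_text = build_collection({"D1": ["كتاب", "مكتب"], "D2": ["كتاب"]})
    print(xml_text)
    print(documents_containing(xml_text, "مكتب"))
```

The stored data carries no display information; the same string could be rendered, transformed or loaded into an XML database without change.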
This thesis has described the RAX system, which was designed for ranking Arabic documents based on content similarity. The model is applicable to documents stored in different formats and written in the Arabic language. The design and implementation were based on existing text processing frameworks and reference Arabic grammar. The main focus of the research was on evaluating different similarity measures used for classifying Arabic documents from different domains and different categories.
In the preparation stage, the RAX system was used to process Arabic text, taking into account the character encodings used for the Arabic language (UTF-8, Windows-1256, etc.). In the implementation stage, the RAX system managed XML documents via an XML database management system using the XPath and XQuery languages. The RAX system uses cosine similarity to measure the similarity between vectors in n-dimensional space. This rests on the observation that when two vectors point in similar directions from the origin, the angle between them is small, and vice versa. In general the cosine value lies between -1 and 1: the cosines of small angles are close to 1, which means high similarity, while the cosines of angles approaching 180 degrees are close to -1, which means low similarity. For vectors whose components are all non-negative, as with term frequencies, the value lies between 0 and 1.
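The cosine measure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the RAX implementation: the function name is hypothetical, and returning 0.0 for a zero-length vector is an assumption made here to avoid division by zero.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: an all-zero vector is treated as having no similarity
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    query = [1, 1, 0]      # hypothetical query term weights
    document = [2, 3, 1]   # hypothetical document term weights
    print(cosine_similarity(query, document))
```

Because the result depends only on the angle, a document vector and a scaled copy of it receive the same score against any query.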
From appendix table 1, appendix table 2 and figure 7.4 it can be seen that query1, which had two terms, matched 60% of the collection. The top ranked documents which contained both terms were D35, D63, D58, D20, D50, D54, and D11.
From appendix table 3, appendix table 4 and figure 7.5 it can be seen that query2 had three terms, but no document in the collection matched one of them, namely Computer. It was therefore impossible to calculate the IDF for this term, because the denominator dfj was equal to zero. The RAX system excluded the term Computer from further consideration, and query2 became a query of two terms. The query2 result then matched 47% of the collection. The top ranked documents which contained both terms were D62, D29, D33, D16, D58, D11, D50, D56, D53, and D36.
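The handling of an unmatched term can be illustrated with a short sketch. It assumes the common form IDF = log10(N / df), where N is the number of documents and df is the number of documents containing the term; the logarithm base, the function name and the sample data are assumptions for this example and are not taken from the thesis.

```python
import math


def idf_weights(query_terms, documents):
    """Map each matched query term to log10(N / df); skip terms with df == 0."""
    n_docs = len(documents)
    weights = {}
    for term in query_terms:
        df = sum(1 for doc in documents if term in doc)
        if df == 0:
            # No document contains the term, so N / df is undefined:
            # the term is excluded from the query
            continue
        weights[term] = math.log10(n_docs / df)
    return weights


if __name__ == "__main__":
    # Hypothetical collection: each document is a set of stemmed terms
    docs = [{"arabic", "text"}, {"arabic"}, {"xml"}, {"text", "xml"}]
    print(idf_weights(["arabic", "computer", "xml"], docs))
```

In this sample the three-term query is reduced to two terms, mirroring what happened to query2.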
From appendix table 5, appendix table 6 and figure 7.6 it can be seen that query3, which had three terms, matched 79% of the collection. The top ranked documents which contained all three terms were D20, D51, D56, D59, D87, D23, D31, D38, D12, D48, D66, D24, D55, D63, D10, and D62.
From appendix table 7, appendix table 8 and figure 7.7 it can be seen that query4, which had four terms, matched 93% of the collection. The top ranked documents which contained all four terms were D55, D14, D43, D96, D71, and D18.
We conclude that the Arabic text was fully represented in the processing of Arabic documents.
The preparation stage of the processing of Arabic text was carried out in four steps: extraction of the full text from documents; normalization (removing diacritics, non-letters and punctuation marks); removal of stopwords from the normalized text; and stemming (removing prefixes, removing suffixes and finally extracting the roots or stems of words). A well-formed Arabic XML document was created from the stemmed text and loaded into the XDBMS, which manages end user queries over a collection of XML documents. In the implementation stage, the Arabic text in queries was processed in three steps: normalization, removal of stopwords and stemming.
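The normalization, stopword removal and stemming steps can be sketched as follows. This is a deliberately simplified illustration and not the stemmer used by the RAX system: the stopword list, the prefix and suffix lists, the minimum stem length of three letters and all function names are assumptions chosen for the example, and the sketch strips affixes only, without extracting roots.

```python
import re

# Arabic diacritics (U+064B to U+0652) and the tatweel character (U+0640)
DIACRITICS = re.compile("[\u064B-\u0652\u0640]")
# Anything that is neither a basic Arabic letter nor whitespace
NON_LETTERS = re.compile("[^\u0621-\u064A\\s]")

# Tiny illustrative lists; a real system would use far larger ones
STOPWORDS = {"في", "من", "على", "إلى", "عن"}
PREFIXES = ("ال", "و")
SUFFIXES = ("ات", "ون", "ين", "ة")


def normalize(text):
    """Remove diacritics, then replace non-letters and punctuation by spaces."""
    text = DIACRITICS.sub("", text)
    text = NON_LETTERS.sub(" ", text)
    return " ".join(text.split())


def light_stem(word, min_len=3):
    """Strip at most one prefix and one suffix, keeping at least min_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


def preprocess(text):
    """Normalize, drop stopwords, and stem the remaining tokens."""
    tokens = normalize(text).split()
    return [light_stem(t) for t in tokens if t not in STOPWORDS]


if __name__ == "__main__":
    print(preprocess("الكِتَابُ في المكتبة."))
```

Applying the same function to both documents and queries keeps the two sides comparable, which matches the three query-side steps described above.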
Furthermore, the totals of the term frequencies of the documents and of the weights for query1, query2, query3 and query4 were equal to the totals for the whole collection given in appendix table 9. There was a proportional relationship between the number of terms in a query and the size of its result. The RAX system excludes terms which are not matched. Some factors, such as the position of nodes in the XML tree and the structure of the query expressions, could affect the operation of the RAX system. System performance could be improved by changing the type of stemmer.
There are two main advantages of the RAX system. Firstly, the query results are broader and more comprehensive when the roots or stems of words are used. Secondly, the similarity measures are calculated after the query process has completed, i.e. by comparing the terms extracted from the collection of Arabic XML documents with the query terms. The ranking is therefore established according to this comparison.
In section 9, the thesis presented a survey of the security issues of the RAX system. The survey combined XML security (XML digital signature and XML encryption) with SOAP messages to create a secure environment between an end user and the RAX system model (see figure 9.1), and also studied security attacks and their countermeasures.
Regarding the hypotheses in section 6.2, we conclude that:
1. Different forms of a word have caused problems in text processing, document summarization, and information retrieval systems. So, the first hypothesis is true.
2. In every summarization process, there was information loss that was directly proportional to the size of the document (the summarized document was smaller than the original document). So, the second hypothesis is true.
3. A well-summarized document contains all of the important information, but for a large document it is impossible to obtain a well-summarized document without losing important information. So, the third hypothesis is false.
4. The summarized document always has a smaller size than the original. Consequently, parsers can process it using a small amount of memory. So, the fourth hypothesis is true, because the summarization process minimizes memory consumption.
5. As the results of the RAX system show, the documents are ordered according to the similarity measures, and this ranking is helpful in information retrieval systems. So, the fifth hypothesis is true.
6. The similarity measures between XML documents and their summarized documents are close. So, the sixth hypothesis is true.
7. The similarity between a query and its result depends on the terms of the query and the content of the summarized XML document. So, the seventh hypothesis is true.
8. If the security attacks and the countermeasures are taken into account, the security of the system will be strong. So, the eighth hypothesis is true.
9. The XML digital signature and the encryption do not affect the summarized XML document. So, the ninth hypothesis is false.
As regards future work, the RAX system could be improved in various ways. We plan to work on making it more efficient. This means that the stemmer will need to be improved in capability and effectiveness to deal with the huge volume of Arabic roots in large data sets (stopword list, compatibility between prefixes and suffixes in the stemming process, etc.). We also aim to use DTD and XML Schema to create XML documents as well as to enhance their summarization. Finally, we plan to upgrade the RAX system to find and replace any query term which has a zero term frequency.