2.1 Preface
Many current web search engines rely on an inverted index, a data structure that stores document information. An inverted index maps individual words to their respective locations in documents, but this structure destroys the semantic links between words and therefore cannot support structural user queries [8]. In other words, such systems can only retrieve documents that contain the user-specified words. Our aim is to create semantic links between the words found in the inverted index and thereby build a semantic network [9]. This network preserves the internal structure of the stored documents and enables users to issue structural queries [2]. Together, structure-preserving indexing and structural user queries allow the semantic meaning of the text to be preserved during search. All popular search engines operate on an inverted index [5], which underpins the high quality of keyword-based search. The main idea is to build a mapping from every token to the list of its positions in the documents indexed by the search engine. Page ranking and linguistic algorithms can give acceptable results for users, but they process keywords extracted from texts without links between them; that is, the search remains non-semantic [10]. However, today's semantic networks, built by commercial companies and open communities, are not fully exploited by search engines, and the powerful linguistic and statistical functions implemented in modern search engines are not used to their full extent [11].
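As a minimal illustration of the inverted index described above, the following Python sketch (with invented two-document data) maps every token to its (document, position) pairs:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to a list of (doc_id, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token].append((doc_id, pos))
    return index

docs = {
    "d1": "the cat chased the mouse",
    "d2": "the mouse ate cheese",
}
index = build_inverted_index(docs)
print(index["mouse"])  # [('d1', 4), ('d2', 1)]
```

Note that the index records only where each word occurs; the grammatical relations between "cat", "chased" and "mouse" are lost, which is exactly the limitation discussed above.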
2.2 Preserving Semantic Links
The main idea behind the present work is to store not an inverted index but the sentence structure, together with a link to its source page position, with a focus on the Arabic language. The basic sentence structure consists of three parts: predicate, subject and object, which together are called a triplet [10]. Each page is parsed to obtain linked tokens, which constitute the elements saved to the database with sentence links and source page positions. The tokens form a directed graph, i.e. a semantic network, in which subjects are linked to objects and to other subjects, and vice versa.
This structure is similar to RDF (Resource Description Framework) [10], which describes knowledge using a directed graph.
The search process is implemented with RDF-style queries over the semantic network. The user inputs a triplet in the form of three words, which is searched in the database of linked documents (the semantic network). In the future, the user will be able to use natural language as the query language. In that case, the system will be able to deal not only with triplet words but also with other syntactic structures, which implies that the indexer will need to process the source documents using an extended RDF scheme that also covers adjectives, adverbs and other parts of speech (POS) [12].
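The triplet query mechanism can be sketched as follows; the data and field names are invented for the example, and a real system would use an RDF store queried with SPARQL rather than a Python list:

```python
# Toy semantic network: (subject, predicate, object) triplets,
# each with a link back to its source page and sentence position.
triplets = [
    ("cat", "chase", "mouse", {"page": "d1", "sentence": 0}),
    ("mouse", "eat", "cheese", {"page": "d2", "sentence": 0}),
]

def query(subj=None, pred=None, obj=None):
    """Match a user triplet; None acts as a wildcard, as in an RDF query."""
    return [t for t in triplets
            if (subj is None or t[0] == subj)
            and (pred is None or t[1] == pred)
            and (obj is None or t[2] == obj)]

# "What eats something?" -- predicate fixed, subject and object open.
print(query(pred="eat"))
```

Because each stored triplet keeps its source position, the matching documents can be returned together with the exact sentence that answers the query.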
A triplet query can be matched against the database by simple direct comparison or with methods used in many popular search engines, such as a synonym dictionary and TF-IDF [13]. TF-IDF can serve as a relevance coefficient that affects a document's position in the resulting list of requested pages.
Modern semantic search algorithms are not meant to replace traditional inverted-index search engines; rather, they are implemented as an additional module or serve as the basis for a specialized fact search engine over a knowledge graph [13].
TF-IDF is a numerical statistic that measures a document's relevance within a collection. Its value depends on a word's frequency in the current document and its inverse frequency across the other documents.
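The TF-IDF statistic just described can be sketched in a few lines of Python; documents are represented as token lists, and the unsmoothed idf shown here is only one of several common formulations:

```python
import math

def tf_idf(term, doc, corpus):
    """tf = relative frequency of term in doc;
    idf = log(total docs / docs containing term)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df)
    return tf * idf

corpus = [["cat", "sat"], ["cat", "cat", "ran"], ["dog", "ran"]]
# "dog" occurs in only one of three documents, so it scores high there.
print(tf_idf("dog", corpus[2], corpus))
```

A word that appears in every document gets idf = log(1) = 0, so ubiquitous words contribute nothing to relevance, which is the behaviour wanted from a relevance coefficient.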
2.3 Basic Search Engine Functions
Our search engine indexing mechanism (robot or spider) solves the following problems [14]:
1. Detecting document external links.
2. Useful content detection.
3. Semantic structure parsing.
There are several differences between traditional and semantic search engines:
2.3.1 Traditional Search Engine [30]:
• Do not understand polysemy and synonymy.
• Do not know the meaning of the terms.
• Do not take stop words into consideration, such as a, and, is, on, of, or, the, was, with, by, after.
• When looking at a web page, a conventional search engine examines the distribution of words within the page to estimate how relevant it is to the user's query. Essentially, this means that a page containing words similar to those the user types into the search engine is considered more relevant and appears at a higher position in the results page.
• Unable to handle long-tail queries.
2.3.2 Semantic Search Engine:
• Understand polysemy and synonymy.
• Know the meaning of the terms.
• Take stop words into consideration, such as a, and, is, on, of, or, the, was, with, by, after.
• Designed to understand the context in which the words are used within the web page, so as to match it more accurately to the user's search query.
• Able to handle long-tail queries.
2.4 Structure Parsing Algorithm
To obtain a parsed sentence, the system performs the following steps:
1. Anaphora resolution: determining which entity a pronoun or noun phrase refers to.
2. Sentence segmentation: splitting the sentence into words, one word per line.
3. Token boundary identification: determining where each sentence begins and ends.
4. Part-of-speech tagging of the token array: determining the category of each word (noun, pronoun, adjective, etc.) and storing each category in an array.
5. Syntactic parsing of the POS-tagged sequences: applying rules that combine words into sentence constituents and combine those constituents into a full sentence.
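The segmentation and tagging steps above can be sketched as follows; the tiny lexicon and regex-based boundary detection are toy stand-ins for real components such as TreeTagger, and the data is invented for the example:

```python
import re

# A toy lexicon standing in for a real POS tagger's language model.
LEXICON = {"the": "DET", "dog": "NOUN", "it": "PRON",
           "barked": "VERB", "loudly": "ADV"}

def segment_sentences(text):
    # Naive sentence boundary detection on ., ! and ?
    return [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]

def tokenize(sentence):
    # One token per "word"; real tokenizers handle clitics, dates, etc.
    return re.findall(r"\w+", sentence.lower())

def pos_tag(tokens):
    # Look each token up in the toy lexicon; unknown words get "UNK".
    return [(t, LEXICON.get(t, "UNK")) for t in tokens]

text = "The dog barked. It barked loudly!"
for sent in segment_sentences(text):
    print(pos_tag(tokenize(sent)))
```

The POS-tagged token arrays produced this way are what the syntactic parsing step (step 5) consumes.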
During component selection we tried to use subsystems that include an English language model and, preferably, a Russian language model [15].
2.5 Identifying Sentence and Token Borders
POS (part-of-speech) tagging in the spider is implemented as a separate unit with its own API (Application Programming Interface). For sentence and token boundary identification, many ready-made solutions are available, and the topic is widely covered in the literature. Initially our system focuses on the English language, but later we will extend it to other supported languages, such as Arabic. We selected the extensible open-source Java TreeTagger system for the implementation. TreeTagger is fast, has low RAM and CPU consumption, offers various language models, and can be quickly replaced with any other tokenizer [11].
2.6 Dependency Parsing
The main sentence-processing instrument we intended to use is the Stanford NLP (Natural Language Processing) Group's parser, but it requires high RAM and CPU consumption. Furthermore, the Stanford Parser uses constituency grammars, which do not reflect well the structure of languages with relaxed word order, such as Russian. For such languages, dependency grammars are usually considered more appropriate. From the list of available parsers, we chose MaltParser for its parsing quality [16].
To train the parsing system to recognize a particular language, a deeply annotated text corpus (a treebank) is required. Parsers based on constituency grammar, which derives from Chomsky's generative grammar, try to divide sentences into smaller word groups in which the individual tokens are identified. An example of phrase-structure (constituency) parsing is shown in Figure 4.
For each word in a Treebank, the following data is required [17]:
1. Word position in the sentence
2. Word
3. Grammatical attributes
4. Head word position
5. Dependency type
MaltParser ships with pre-trained models for English, French and Swedish. For other languages, it is necessary to create a malt-tab training set.
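A treebank record with the five fields listed above might be read as follows; the exact tab-separated layout is an assumption for illustration, since real formats (e.g. CoNLL or malt-tab) differ in their column sets:

```python
def parse_row(line):
    """Split one tab-separated treebank line into the five fields:
    word position, word, grammatical attributes, head position, dependency type."""
    position, word, attrs, head, deprel = line.split("\t")
    return {"position": int(position), "word": word,
            "attributes": attrs, "head": int(head), "deprel": deprel}

# "cat" is token 2, a noun, attached to token 3 as its subject.
row = parse_row("2\tcat\tNN\t3\tnsubj")
print(row["deprel"])  # nsubj
```

A head position of 0 conventionally marks the root of the dependency tree, so a whole sentence can be reconstructed by following each token's head pointer.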
2.7 Structure of Ontology
According to T. R. Gruber [8], an ontology can be defined as an explicit specification of a conceptualization. The term 'ontology' comes from philosophy, where it means 'the science of what is, of the kinds and structures of objects, properties, events, processes and relations in every area of reality'.
On the internet we can see examples of ontologies, such as the Yahoo! Categories, the DMOZ Directory, the Amazon.com product catalogue, WordNet, GO (Gene Ontology), UMLS (Unified Medical Language System) and UNSPSC, a terminology for products and services.
There are many commonly used kinds of ontology, including:
(a) Terminological ontologies where concepts are word senses and instances are words,
(b) Topic ontologies where concepts are topics and instances are documents, and
(c) Data-model ontologies where concepts are tables in a database and instances are data records (such as in a database schema)[18].
There are several reasons to develop an ontology: to share a common understanding of the structure of information among people or software agents, to enable reuse of domain knowledge, to make domain assumptions explicit, to separate domain knowledge from operational knowledge, and to analyze domain knowledge. An ontology describes knowledge about a domain in terms of concepts or vocabularies within the domain and the relationships between them. Ontologies are needed to develop semantic search engines. By using semantically richer ontologies, the following benefits can be obtained. Firstly, ontologies can describe the domain knowledge and the terminology of the application in more detail; for example, relations between categories in different views can be defined. Secondly, ontologies can be used to create semantically more accurate annotations in terms of the domain knowledge. Thirdly, with the help of ontologies, the user can express queries more precisely and unambiguously, which leads to better precision and recall rates. Finally, through ontological class definitions and inference mechanisms, such as property inheritance, instance-level metadata can be enriched [10].
2.8 WordNet Project
WordNet is a lexical database, primarily for the English language, available online. It groups English words into sets of synonyms called synsets, provides short definitions and usage examples, and records a number of relations among these synonym sets or their members. WordNet can thus be seen as a combination of dictionary and thesaurus. While it is accessible to human users via a web browser, its primary use is in automatic text analysis and artificial intelligence applications [4].
The main relation among words in WordNet is synonymy, as between the words shut and close or car and automobile. Synonyms, words that denote the same concept and are interchangeable in many contexts, are grouped into unordered sets (synsets). Each of WordNet's 117,000 synsets is linked to other synsets by a small number of 'conceptual relations'. Additionally, a synset contains a brief definition ('gloss') and, in most cases, one or more short sentences illustrating the use of the synset members [20]. Word forms with several distinct meanings are represented in as many distinct synsets; thus, each form-meaning pair in WordNet is unique [19].
The most frequently encoded relation among synsets is the super-subordinate relation (also called hyponymy or the ISA relation). It links more general synsets like {furniture, piece of furniture} to increasingly specific ones like {bed} and {bunk bed}. Thus, WordNet states that the category furniture includes bed, which in turn includes bunk bed; conversely, concepts such as bed and bunk bed make up the category furniture. All noun hierarchies ultimately go up to the root node {entity}. The hyponymy relation is transitive: if an armchair is a kind of chair, and a chair is a kind of furniture, then an armchair is a kind of furniture. WordNet differentiates between Types (common nouns) and Instances (specific persons, countries and geographic entities) [15]. Thus, armchair is a type of chair, while Barack Obama is an instance of a president. Instances are always leaf (terminal) nodes in their hierarchies.
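The transitivity of the hyponymy relation can be illustrated with a toy hypernym map; the entries below mirror the furniture example but are invented data, not real WordNet synsets:

```python
# Toy is-a hierarchy (hyponym -> hypernym), rooted at "entity".
HYPERNYM = {"bunk bed": "bed", "bed": "furniture", "armchair": "chair",
            "chair": "furniture", "furniture": "entity"}

def is_a(concept, category):
    """Hyponymy is transitive: walk the hypernym chain up to the root."""
    while concept in HYPERNYM:
        concept = HYPERNYM[concept]
        if concept == category:
            return True
    return False

print(is_a("armchair", "furniture"))  # True, via chair -> furniture
```

Because every chain ends at the root, `is_a(x, "entity")` holds for every concept in the map, matching the statement that all noun hierarchies go up to {entity}.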
Meronymy, the part-whole relation, holds between synsets like {chair} and {back, backrest}, {seat} and {leg}. Parts are inherited from their superordinates: if a chair has legs, then an armchair has legs too. Parts are not inherited 'upward', as they may be characteristic only of specific kinds of things rather than of the class as a whole: chairs and kinds of chairs have legs, but not all kinds of furniture have legs.
Verb synsets are also arranged into hierarchies; verbs towards the bottom of the trees (troponyms) express increasingly specific manners characterizing an event, as in {communicate}-{talk}-{whisper}. The specific manner expressed depends on the semantic field; volume (as in the example above) is just one dimension along which verbs can be elaborated. Others are speed (move-jog-run) or intensity of emotion (like-love-idolize). Verbs describing events that necessarily and unidirectionally entail one another are linked: {buy}-{pay}, {succeed}-{try}, {show}-{see}, etc. [20].
Adjectives are organized in terms of antonymy. Pairs of 'direct' antonyms like wet-dry and young-old reflect the strong semantic contrast of their members. Each of these polar adjectives is in turn linked to a number of 'semantically similar' ones: dry is linked to parched, arid, desiccated and bone-dry, and wet to soggy, waterlogged, etc. Semantically similar adjectives are 'indirect antonyms' of the central member of the opposite pole. Relational adjectives ('pertainyms') point to the nouns they are derived from (criminal-crime).
There are only a few adverbs in WordNet (hardly, mostly, really, etc.), as the majority of English adverbs are straightforwardly derived from adjectives via morphological affixation (surprisingly, strangely, etc.) [21].
The majority of WordNet's relations connect words from the same part of speech (POS). Thus, WordNet really consists of four sub-nets, one each for nouns, verbs, adjectives and adverbs, with few cross-POS pointers. Cross-POS relations include the 'morphosemantic' links that hold among semantically similar words sharing a stem with the same meaning: observe (verb), observant (adjective), observation and observatory (nouns). In many of the noun-verb pairs the semantic role of the noun with respect to the verb has been specified: {sleeper, sleeping car} is the location for {sleep} and {painter} is the agent of {paint}, while {painting, picture} is its result [37].
2.8.1 WordNet Applications:
WordNet has been used for a number of different purposes in information systems, including[22]:
• word sense disambiguation
• information retrieval
• automatic text classification
• automatic text summarization
• machine translation and even automatic crossword puzzle generation
A common use of WordNet is to determine the similarity between words. Various algorithms have been proposed [23], including measuring the distance between words and synsets in WordNet's graph structure, for example by counting the number of edges between synsets. The intuition is that the closer two words or synsets are, the closer their meaning.
Arabic WordNet was constructed according to the methods developed for EuroWordNet (EWN; Vossen 1998), which have since been applied to dozens of languages around the world [31]. The EuroWordNet approach maximizes compatibility across WordNets and concentrates on manual encoding of the most complicated and important concepts.
Language-specific concepts and relations are encoded as needed or desired. This results in what is called a core WordNet for Arabic containing the most important synsets, embedded in a solid semantic framework [18].
2.8.2 Structure of WordNet
The database structure contains four principal entity types: item, word, form and link.
Items are conceptual entities, including synsets, ontology classes and instances.
Besides a unique identifier, an item has descriptive information such as a gloss. Items lexicalized in different languages are distinct. A word entity is a word sense, where the surface form of the word is associated with an item via its identifier [24]. A form is a special form that is considered dictionary information (not merely an inflectional variant); the forms of Arabic words that go in this table are the root and/or the broken plural form, where applicable. A link relates two items and has a type such as "equivalence," "subsuming," etc. Links connect sense items to other sense items, such as a PWN synset to an AWN synset, or a synset to a SUMO concept.
This data model has been specified in XML as an interchange format, but is also implemented in a MySQL database hosted by one of the partners. The database will be the primary deliverable of the project, and will be distributed freely to the community [9].
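The four entity types can be rendered as Python dataclasses for illustration; the field names are assumptions based on the description above, not the project's actual XML or MySQL schema:

```python
from dataclasses import dataclass

@dataclass
class Item:          # conceptual entity: synset, ontology class or instance
    item_id: str
    gloss: str

@dataclass
class Word:          # a word sense, tied to an item by its identifier
    value: str
    item_id: str

@dataclass
class Form:          # dictionary information: root or broken plural
    value: str
    kind: str        # "root" or "broken_plural"
    word: str

@dataclass
class Link:          # relates two items, with a type
    type: str        # e.g. "equivalence", "subsuming"
    item1: str
    item2: str

# A hypothetical equivalence link between a PWN synset and an AWN synset.
link = Link("equivalence", "pwn:dog-n-1", "awn:kalb-n-1")
print(link.type)
```

Whether serialized as XML for interchange or stored as MySQL tables, the same four record types carry the whole database.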
2.8.3 Arabic Ontology
An Arabic ontology is a basic step toward the creation of the Semantic Web. The basic categorization of terminologies and meanings can yield semantics, and the interrelationships between words that match in meaning give rise to the stems and branches of semantics. The goal of ontology learning is to (semi-)automatically extract relevant concepts and relations from a given corpus [25].
The life cycle of an ontology consists of six parts: ontology creation, ontology population, ontology validation, ontology deployment, ontology maintenance and ontology evolution [18].
The ontology-learning process can also be subdivided into: extracting terms, discovering synonyms, obtaining concepts, extracting concept hierarchies, defining relations among concepts, and deducing rules or axioms. These processes make ontology matching possible and ensure that the related branches of topics are available to all users [26].
Web information is usually language dependent, and the availability of information in the user's preferred language is an increasing need today. The need is especially great for the Arabic language, since an English ontology cannot simply be translated into Arabic [32].
The following figure shows an example ontology for the e-commerce domain.
2.8.4 Arabic WordNet Project (AWN):
Arabic WordNet is a lexical database structured along the same lines as Princeton WordNet. As in the structure described in Section 2.8.2, the database contains four principal entity types: item, word, form and link. The forms of Arabic words record the root and/or the broken plural form, where applicable, and links, with types such as "equivalence" and "subsuming," connect sense items to other sense items, e.g. a PWN synset to an AWN synset, or a synset to a SUMO concept [17].
2.8.5 Arabic WordNet (AWN) Challenges
Arabic belongs to the Semitic language family [26], and it differs in its grammar and linguistic forms from the Indo-European languages. Classical Arabic refers to the standard form of the language used in writing and in spoken forms on television and radio, as well as in public speeches and sermons.
• The Arabic writing system has twenty-five consonants and three long vowels, and it is written from right to left.
• Letters take different shapes according to their position in the word.
• Arabic also has short vowels. They are not part of the alphabet but are written as vowel diacritics above or under a consonant to give the desired sound and, therefore, the desired meaning.
• Texts without vowels are considered more convenient by the Arabic-speaking community, because this is the usual written and printed form in daily materials such as books, magazines, newspapers and letters.
• In texts such as the Holy Quran, classical poetry collections, textbooks and some Arabic paper dictionaries, vowel diacritical marks appear in full. It is quite usual for well-edited books, printed texts and manuscripts to have vowel diacritics partially or selectively written where words could be ambiguous or hard to read.
• Arabic speakers can read texts with the vowels explicitly marked, but they may be less successful at writing texts with the correct vowel diacritics.
• There are differences among linguists regarding the shapes of some diacritical marks for some words.
• The Arabic language does not have capital letters (for the names of people, months, days, cities, countries, continents, seas, mountains, etc.) and does not use acronyms. This increases semantic ambiguity and complicates tasks such as Information Extraction in general, and Named Entity Recognition in particular.
• Arabic is a highly derivational and inflectional language, and its vocabulary can easily be expanded through the creative use of roots and morphological patterns [33].
2.9 Semantic Similarity Algorithms
In WordNet, nouns, verbs, adjectives and adverbs are connected into taxonomic hierarchies by well-defined types of semantic relations. Here we only discuss measures based on nouns and is-a relations. Several methods have been proposed, which can be classified into two groups: edge-based methods and information-based methods. Before introducing them briefly, we define the related concepts as follows [15]:
(1) len(ci, cj): the length of the shortest path from concept ci to concept cj in WordNet.
(2) lso(ci, cj): the most specific common subsumer of ci and cj.
(3) depth(ci): the length of the path from the global root entity to concept ci.
(4) deep_max: the maximum depth(ci) in the taxonomy.
(5) hypo(c): the number of hyponyms of a given concept c.
(6) node_max: the maximum number of concepts that exist in the taxonomy.
One of the path-based measures is Wu and Palmer's. They defined a scaled measure between a pair of concepts c1 and c2 in a hierarchy as:
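In the notation defined above, the Wu-Palmer measure is usually stated as:

$$
\mathrm{sim}_{WP}(c_1, c_2) = \frac{2 \cdot depth\big(lso(c_1, c_2)\big)}{len(c_1, c_2) + 2 \cdot depth\big(lso(c_1, c_2)\big)}
$$

so that identical concepts receive similarity 1, and the similarity decreases as the path between the concepts grows or as their common subsumer becomes shallower.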
Another classical path-based measure is Leacock and Chodorow's [6]; their method also takes the maximum depth of the taxonomy into account.
An information-content-based similarity measure was first presented by Resnik [10], followed by other methods proposed by Lin, among others. All information-content-based similarity measures rely on information content (IC) values assigned to the concepts in the taxonomy, but they use IC differently.
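As an illustrative sketch of the path-based measures, the following Python code computes Wu-Palmer similarity over a toy is-a taxonomy; the taxonomy, the depth convention (root at depth 1) and the Leacock-Chodorow smoothing are assumptions for the example, and real implementations work over WordNet itself:

```python
import math

# Toy is-a taxonomy (child -> parent); stands in for WordNet's noun hierarchy.
PARENT = {"dog": "canine", "wolf": "canine", "canine": "animal",
          "cat": "feline", "feline": "animal", "animal": "entity"}

def ancestors(c):
    """Chain from the concept up to the root, inclusive."""
    chain = [c]
    while c in PARENT:
        c = PARENT[c]
        chain.append(c)
    return chain

def depth(c):
    return len(ancestors(c))          # depth("entity") == 1 by this convention

def lso(a, b):
    """Most specific common subsumer: first shared ancestor."""
    anc_a = set(ancestors(a))
    for c in ancestors(b):
        if c in anc_a:
            return c

def path_len(a, b):
    """len(a, b): edges from a up to the lso plus edges from b up to it."""
    c = lso(a, b)
    return ancestors(a).index(c) + ancestors(b).index(c)

def wu_palmer(a, b):
    d = depth(lso(a, b))
    return 2 * d / (path_len(a, b) + 2 * d)

def leacock_chodorow(a, b, deep_max=4):
    # A common variant adds 1 inside the ratio to avoid log(0) when a == b.
    return -math.log((path_len(a, b) + 1) / (2 * deep_max))

print(wu_palmer("dog", "wolf"))   # 0.75
```

Note how dog and wolf (shared subsumer canine, deep in the hierarchy) score higher than dog and cat (shared subsumer animal, nearer the root), which matches the intuition behind both measures.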