Essay: Design a system to extract ontology from unstructured information sources for enhancing user search results

Essay details:

  • Subject area(s): Computer science essays
  • Published on: August 17, 2019

Abstract

The rapid growth of information on the World Wide Web creates an increasing need for automatic systems that retrieve documents and rank them by their relevance to the user's query. Many search engines are available, but most are hit based, that is, page-rank based. Very few are ontology based, and those that exist return poor results because of weak implementations. Our proposed system addresses this drawback and presents a Multilingual Information Retrieval approach that falls into the area of Domain Specific Information Retrieval. The system uses the API of a general page-rank based search engine. The results are processed by the Web Adviser, which uses an ontology to annotate documents and creates a precise result list by expanding the query. The Web Adviser also monitors the user's information to form the meaning of the query. The web pages are then ranked by the semantic similarity between the ontological concepts extracted from the web pages and the ontological concepts represented by the user query.

 

1.6 PROBLEM STATEMENT

Design a system to extract ontology from unstructured information sources for enhancing user search results.

1.8 GOALS AND OBJECTIVES

• The system provides a way for machines to process data semantically. We use ontology learning methodologies to semantically model the significant concepts of a query along with their weighted semantic relations to other related concepts.

• Interoperability plays a major role in a multilingual ontology. Matching methods are important here because the system must automatically search for and match words with similar or dissimilar patterns.
• The main activities are:
(i) The user searches for documents using a query in any language.
(ii) The query is analyzed, and search results from Wikipedia are loaded into the system.
(iii) An ontology is extracted from these results. The system uses the generated ontology to look up translations. Simultaneously, the query is converted into English and run against the tourism domain, and the related results are extracted. These results are categorized on the basis of the generated ontology.
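
The report does not say how the semantic similarity between query concepts and page concepts is computed. As one possible illustration, the sketch below assumes that both the query and each page are represented as a mapping from concept name to weight and compares them with cosine similarity. The function names, the example URLs and the weights are hypothetical and are not taken from the project.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse concept-to-weight mappings."""
    dot = sum(weight * b.get(concept, 0.0) for concept, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty concept set is similar to nothing
    return dot / (norm_a * norm_b)


def rerank(query_concepts, documents):
    """Return (url, score) pairs sorted by similarity to the query, best first."""
    scored = [(url, cosine_similarity(query_concepts, concepts))
              for url, concepts in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Assumed example data: weighted concepts for a query and two pages.
QUERY = {"fort": 1.0, "pune": 0.8, "history": 0.5}
PAGES = {
    "https://example.org/cricket": {"cricket": 1.0, "pune": 0.2},
    "https://example.org/forts": {"fort": 0.9, "history": 0.4},
}
RANKED = rerank(QUERY, PAGES)
```

With this example data the page about forts is ranked above the page about cricket, because it shares more heavily weighted concepts with the query.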

1.9 RELEVANT MATHEMATICS ASSOCIATED WITH THE PROJECT

System Specification:

S = {S, s, X, Y, T, fmain, DD, NDD, ffriend, memory shared, CPUcount}
• S (system): The proposed system, described by the following tuple.
• s (initial state at time T): The GUI of the search engine, which provides a field where the user enters a query.
• X (input to the system): The input query. The user first enters the query, which may or may not be ambiguous and which represents what the user wants to search for.
• Y (output of the system): A list of URLs with snippets. After the user enters a query, the search engine generates a result list containing relevant and irrelevant URLs and their snippets.
• T (number of steps to be performed): 6. This is the total number of steps required to process a query and generate results.
• fmain (main algorithm): Contains process P, which consists of the input, the output and the subordinate functions. It shows how the query is processed by the different modules and how the results are generated.
• DD (deterministic data): Data is fetched from the Internet at runtime, and user information is maintained in the database. All other data related to the search is processed at runtime and shown to the user.
• NDD (non-deterministic data): The number of input queries. A user can enter any number of queries, so we cannot predict how many queries will be entered in a single session.
• ffriend: WC and IE, the friend functions of the main function. WC is the Web Crawler, which is a bot, and IE is Information Extraction, which extracts information in the browser.
• Memory shared: The database. It stores information such as the list of receivers, registration details and the number of receivers. It is the only shared memory in the system.
• CPUcount: 2. The system requires one CPU for the server and at least one CPU for the client.

Subordinate functions:
• Identify the processes as P.
S = {I, O, P}
P = {QA, OP}
Where,
• QA is the query analyzer
• OP is the output processor
• P is the set of processes

• QA = {Q, SA, Qr}
Where,
• Q is the user query
• SA is the semantic analysis performed on the query
• Qr is the resolved query, which relates the query to the domain (the ontological meaning of the query)

• OP = {Qr, processing, Info}
Where,
• Qr is the output of the query analysis process
• processing: data related to the query is searched for over the Internet, and data links are scored by their relevance to the query
• Info is the data displayed to the user
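
The sets above can be read as a two-stage pipeline in which QA turns Q into Qr and OP turns Qr into Info. The sketch below shows only that wiring, under the assumption that SA and the processing step are supplied as plain functions. All class names, field names and stub functions are hypothetical, and the report's own implementation is in Java.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ResolvedQuery:
    """Qr: the original query Q plus the domain concepts found in it."""
    original: str
    concepts: List[str]


@dataclass
class QueryAnalyzer:
    """QA = {Q, SA, Qr}: applies semantic analysis SA to Q and yields Qr."""
    semantic_analysis: Callable[[str], List[str]]

    def run(self, query: str) -> ResolvedQuery:
        return ResolvedQuery(original=query, concepts=self.semantic_analysis(query))


@dataclass
class OutputProcessor:
    """OP = {Qr, processing, Info}: searches, scores links, returns Info."""
    search: Callable[[ResolvedQuery], List[Tuple[str, float]]]

    def run(self, resolved: ResolvedQuery) -> List[str]:
        scored_links = self.search(resolved)
        ordered = sorted(scored_links, key=lambda pair: pair[1], reverse=True)
        return [url for url, _ in ordered]


def process(query: str, qa: QueryAnalyzer, op: OutputProcessor) -> List[str]:
    """P = {QA, OP}: run the two sub-processes in order."""
    return op.run(qa.run(query))


# Stub functions standing in for the real analysis and search steps.
qa = QueryAnalyzer(semantic_analysis=lambda q: q.lower().split())
op = OutputProcessor(search=lambda qr: [("https://example.org/a", 0.2),
                                        ("https://example.org/b", 0.9)])
INFO = process("Forts near Pune", qa, op)
```

Keeping SA and the search step as injected functions lets each be replaced independently, which mirrors the separation between QA and OP in the specification.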

1.10 NAMES OF CONFERENCES / JOURNALS WHERE PAPERS CAN BE PUBLISHED

1] IETE conference: International Conference on Emerging Trends in Engineering and Management Research (ICETEMR-17)

2] IJTIR- International Journal of Emerging Technologies and Innovative Research

1.11 REVIEW OF CONFERENCE / JOURNAL PAPERS SUPPORTING PROJECT IDEA

1.12 PLAN OF PROJECT EXECUTION

Fig 1.1. Plan from June to Nov

 

CHAPTER 2 TECHNICAL KEYWORDS

2.1 AREA OF PROJECT

Data Mining

2.2 TECHNICAL KEYWORDS

• Access controls
• Authentication
• Database processing
• Privacy
• Security

CHAPTER 3 INTRODUCTION

3.1 PROJECT IDEA

One of the first Multi-Language Information Retrieval (MLIR) systems was implemented in 1969 by Gerard Salton, who enhanced his SMART system to retrieve documents in two languages, English and German. The majority of information retrieval systems are monolingual, and more precisely English based. Our proposed system presents an MLIR approach that falls into the area of Domain Specific Information Retrieval.
Many search engines are available. The drawback of current conventional web search engines is the knowledge gap between users and computers: the knowledge available to the computer is much more limited than that of the user. Our proposed system uses ontology learning on documents extracted from Wikipedia. This methodology semantically models the significant concepts of a query along with their weighted semantic relations to other related concepts. The resulting ontology can be viewed as a benchmark of a topic that can be used to classify or re-rank documents by their degree of similarity to the original query.

3.2 MOTIVATION OF THE PROJECT

The main motivation of the system is to provide a multilingual search engine to the user. The system analyses the user's history from the database and provides links on the basis of the results. It converts the query from the user's language into English and then searches for results. The results depend mainly on the ontology, so that only precise results are provided.

3.3 LITERATURE SURVEY

A web search engine is a software system designed to search for information on the World Wide Web. The search results are generally presented in a list of results, often referred to as search engine results pages (SERPs). The information may be a mix of web pages, images, and other types of files. Some search engines also mine data available in databases or open directories. Unlike web directories, which are maintained only by human editors, search engines also maintain real-time information by running an algorithm on a web crawler.
• A search engine maintains the following processes in near real time:
A) Web crawling
B) Indexing
C) Searching
Web search engines get their information by crawling from site to site. The “spider” checks for the standard filename robots.txt, addressed to it, before sending certain information back to be indexed. What is indexed depends on many factors, such as the titles, page content, JavaScript, Cascading Style Sheets (CSS) and headings, as evidenced by the standard HTML markup of the informational content, or its metadata in HTML meta tags.
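
As an illustration of the robots.txt check described above, the sketch below uses Python's standard `urllib.robotparser` module to filter candidate URLs before crawling. The robots.txt content, the crawler name `TourismCrawler` and the URLs are assumed example values. A real crawler would download the file from the target site instead of using an inline string.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt, user_agent, urls):
    """Return the URLs that the given robots.txt text lets user_agent fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules from text, no network access
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Assumed example rules: every crawler is kept out of /private/.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

CANDIDATES = [
    "https://example.org/places/pune.html",
    "https://example.org/private/admin.html",
]

CRAWLABLE = allowed_urls(SAMPLE_ROBOTS, "TourismCrawler", CANDIDATES)
```

Only the first candidate remains in `CRAWLABLE`, because the second falls under the disallowed `/private/` path.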
Google is the world’s most popular search engine, with a market share of 71.11 percent as of September 2016. The world’s most popular search engines (with >1% market share) are:
Google: 71.11%
Bing: 10.56%
Baidu: 8.73%
Yahoo: 7.52%
Currently, most organizations working in a multilingual environment demand ontologies that support different natural languages. Consequently, the inclusion of multilingual information retrieval is not an option but a must. In general, ontology is the study of reality. More specifically, an ontology is an expression of a particular model of reality, including a specification of concepts, relationships among concepts, and constraints that exist in the model.

CHAPTER 4

PROBLEM DEFINITION AND SCOPE

4.1 PROBLEM STATEMENT

Design a system to extract ontology from unstructured information sources for enhancing user search results.

4.1.1 Statement of scope

The system provides a solution for processing data semantically and uses an ontology learning methodology. The system searches documents only in English and Marathi.

4.2 MAJOR CONSTRAINTS

The system will be able to search only queries specific to the tourism domain.

4.3 METHODOLOGIES OF PROBLEM SOLVING AND EFFICIENCY ISSUES
Step 1: The user must register and log in before using the search engine.
Step 2: The user searches for a place or event using the search engine.
Step 3: QA: Query analysis
Step 3.1: The user's query is tokenized.
Step 3.2: The tokens are analyzed, and a resolved query with relation to the domain is generated.
Step 3.3: The resolved query is sent to the data crawler to fetch data related to the topic.
Step 4: Data crawler
Step 4.1: The data crawler uses different APIs to collect data related to the query.
Step 4.2: The data related to the query is scored.
Step 4.3: A collection of data or links is created, each with a score for its relevance to the topic.
Step 5: Answer page
a) Data is displayed to the user based on the scores.
b) The user can visit the source of the data for more information or download documents if available.
Step 6: Stop.
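
Steps 3 to 5 can be sketched as follows. This is a minimal illustration and not the project's implementation: the small tourism vocabulary, the scoring rule (the fraction of resolved concepts that appear in a snippet) and all names are assumptions. The fetched results are passed in as data instead of being collected through real APIs, and the tokenizer is only suitable for English text.

```python
import re

# Hypothetical tourism vocabulary: surface token -> domain concept.
DOMAIN_CONCEPTS = {
    "fort": "fort", "forts": "fort",
    "temple": "temple", "temples": "temple",
    "beach": "beach", "beaches": "beach",
    "pune": "pune",
}


def tokenize(text):
    """Step 3.1: lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def resolve(tokens):
    """Step 3.2: keep tokens that map to a domain concept, without duplicates."""
    resolved = []
    for token in tokens:
        concept = DOMAIN_CONCEPTS.get(token)
        if concept and concept not in resolved:
            resolved.append(concept)
    return resolved


def score(concepts, snippet):
    """Step 4.2: fraction of the resolved concepts mentioned in the snippet."""
    if not concepts:
        return 0.0
    snippet_concepts = set(resolve(tokenize(snippet)))
    return sum(1 for c in concepts if c in snippet_concepts) / len(concepts)


def answer_page(query, fetched):
    """Steps 4.3 and 5: attach a score to each link and sort, best first."""
    concepts = resolve(tokenize(query))
    scored = [(url, score(concepts, snippet)) for url, snippet in fetched]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Assumed example input standing in for the crawler output of step 4.1.
FETCHED = [
    ("https://example.org/cricket", "Cricket scores and fixtures"),
    ("https://example.org/forts", "Historic forts around Pune"),
    ("https://example.org/weather", "Pune weather this week"),
]
PAGE = answer_page("Forts near Pune", FETCHED)
```

For the query "Forts near Pune" the resolved concepts are `fort` and `pune`, so the forts page matches both, the weather page matches one, and the cricket page matches none.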
4.4 OUTCOME

1. Login and Search page for input query.
2. Extract related documents.
3. Ontology generation.
4. Most relevant document decided by semantic analysis will be displayed to user.

4.5 APPLICATIONS

• Aerospace and defense
• Automotive
• Consumer products
• Travel
• Telecommunications
• Engineering and construction
• Banking
• Health care

4.6 HARDWARE RESOURCES REQUIRED

• Processor : Pentium IV or AMD
• Hard Disk : 500 GB
• RAM : 4 GB

4.7 SOFTWARE RESOURCES REQUIRED

• Front End
o User Interface : Java
o Programming Language : Java
o IDE/Workbench : Eclipse Mars 2
• Back End
o Database : MySQL

CHAPTER 5 PROJECT PLAN

5.1 PROJECT ESTIMATES

• Communication:
1. Requirement gathering

• Plan:
1. Domain-specific searching
2. Use of the ontology generation methodology
3. Searching speed analysis

• Modules:
1. Key phrase extraction
2. Search and matching function
3. Cluster ranking and ontology generation

• Design:
1. GUI
2. Database design
3. Front-end design
4. Back-end design

• Code:
1. Construction of the front end
2. Construction of the back end

5.1.1 Time Estimates
Month            | Goal
July             | Project selection, synopsis, literature survey
August           | SRS document preparation; presentation of the project idea
September        | Preparation of the detailed algorithm; deciding the software tools and hardware
October          | Preparation of the 1st semester report; preparing the presentation on the final layout of the project
November         | Presentation of the 1st semester's work; submission of the 1st semester report
December         | Installation of software
December-January | Coding of the graphical user interface and validation
January          | Database created; testing of the front end
February         | Coding of the main modules
March            | Testing of the main modules; presentation of the completed part of the project
April-May        | Inserting additional modules; approval of the project by the guide
May-June         | Preparing the 2nd semester final project report; approval of the report by the guide
June             | Presentation of the entire project

5.1.2 Project Resources
• Human Resources

1. Member 1: Mulla Nilofar
2. Member 2: Pathan Ayesha
3. Member 3: Shahapurkar Namrata
4. Member 4: Tayde Dipali

• Hardware Resources
1. Processor : Pentium IV or AMD
2. RAM : 4 GB

• Software
1. Operating System : Windows
2. JDK : jdk 1.7
3. Programming Language : Java 7
4. IDE : Eclipse

5.2 RISK MANAGEMENT W.R.T. NP-HARD ANALYSIS
• P problem: A problem is assigned to the P (polynomial time) class if there exists at least one algorithm that solves it such that the number of steps of the algorithm is bounded by a polynomial in n, where n is the size of the input.

• NP problem: A problem is assigned to the NP (nondeterministic polynomial time) class if it is solvable in polynomial time by a nondeterministic Turing machine.

• NP-hard: A problem is said to be NP-hard if an algorithm for solving it can be translated into one for solving any other NP problem. It is much easier to show that a problem is NP than to show that it is NP-hard.

• NP-complete: A problem which is both NP and NP-hard is called an NP-complete problem.

• In the proposed method, the following techniques are used to solve the problem:
– The system uses preprocessing, tokenization, keyword extraction and POS tagging, so it is not possible to get a 100% accurate result.

• Because the system uses different algorithms to solve the problem and to evaluate quality, we conclude that the system is NP-hard.

5.2.1 Risk Identification

To identify risks, the scope document, requirements specifications and schedule were reviewed. Answers to a questionnaire revealed some risks. Each risk is categorized as per the categories mentioned in [?]. Please refer to Table 5.1 for all the risks. The following risk identification questionnaire was used.

1. Have top software and customer managers formally committed to support the project?
2. Are end-users enthusiastically committed to the project and the system/product to be built?
3. Are requirements fully understood by the software engineering team and its customers?
4. Have customers been involved fully in the definition of requirements?
5. Do end-users have realistic expectations?

5.2.2 Risk Analysis

The risks for the project can be analyzed within the constraints of time and quality.

ID | Risk Description | Probability | Impact: Schedule | Impact: Quality | Impact: Overall
1  | Description 1    | Low         | Low              | High            | High
2  | Description 2    | Low         | Low              | High            | High

Table 5.1: Risk Table

Probability | Description
High        | Probability of occurrence is >75%
Medium      | Probability of occurrence is 26–75%
Low         | Probability of occurrence is <25%

Table 5.2: Risk Probability definitions [?]

Impact    | Description
Very high | >10% schedule impact, or unacceptable quality
High      | 5–10% schedule impact, or some parts of the project have low quality
Medium    | <5% schedule impact, or barely noticeable degradation in quality
Low       | Impact on schedule or quality can be incorporated

Table 5.3: Risk Impact definitions [?]

5.2.3 Overview of Risk Mitigation, Monitoring, Management

Following are the details for each risk.

Risk ID: 1
Risk Description: Description 1
Category: Development environment
Source: Software requirement specification document
Probability: Low
Impact: High
Response: Mitigate
Strategy: Strategy
Risk Status: Occurred

Risk ID: 2
Risk Description: Description 2
Category: Requirements
Source: Software design specification documentation review
Probability: Low
Impact: High
Response: Mitigate
Strategy: Better testing will resolve this issue.
Risk Status: Identified

5.3 PROJECT SCHEDULE

5.3.1 Project task set
Major Tasks in the Project stages are:
Task 1:- Information Gathering.
Task 2:- Requirement Analysis.
Task 3:- Literature Survey.
Task 4:- Problem Statement Definition.
Task 5:- Define Specification.
Task 6:- Project Planning.
Task 7:- Detail Design.
Task 8:- Model Design Strategy.
Task 9:- Factors Authentication and Identification.
Task 10:- System Analysis And Execution Scenario.
Task 11:- Transaction Database Design.
Task 12:- Risk Analysis.
Task 13:- Software Development.
Task 14:- Testing and QA.
Task 15:- Final Delivery.

5.3.2 Task network

5.3.3 Timeline Chart

Fig 1.1. Plan from June to Nov

Fig 1.2 Plan from Dec to April

5.4 TEAM ORGANIZATION

The manner in which staff is organized and the mechanisms for reporting are noted.

5.4.1 Team structure

The team consists of four members. Tasks are defined and distributed among the members for the proper execution of the project.
1. Mulla Nilofar
2. Pathan Ayesha
3. Shahapurkar Namrata
4. Tayde Dipali

5.4.2 Management reporting and communication

Team Leader: Team Leader will divide the task.
Team Developer: Developer will develop the code for execution.

CHAPTER 6

SOFTWARE REQUIREMENT SPECIFICATION


About this essay:

This essay was submitted to us by a student in order to help you with your studies.

If you use part of this page in your own work, you need to provide a citation, as follows:

Essay Sauce, Design a system to extract ontology from unstructured information sources for enhancing user search results. Available from:<https://www.essaysauce.com/computer-science-essays/design-a-system/> [Accessed 23-09-19].
