CHAPTER 1: INTRODUCTION
1.1. OVERVIEW
This chapter gives an overview of the topic together with the objective and motivation of the research. It defines the concepts and terminology that characterize the work and outlines the research directions. The work studies how to improve the performance of working with a time series dataset and how to make predictions from it. To address these performance and prediction issues, we propose an approach based on hashing and naive Bayes classification. Section 1.9 describes the overall structure of the thesis and briefly introduces each of its seven chapters.
1.2. DATA MINING
Data mining is the process of extracting patterns from large data sets using methods at the intersection of database systems, machine learning and statistics. It is an interdisciplinary subfield of computer science. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Data mining is also known as the analysis step of the "knowledge discovery in databases" (KDD) process. Many techniques are used in data mining to extract patterns from large databases.
Data mining is the process of discovering knowledge from large amounts of data stored in databases or data warehouses [2]. As shown in Figure 1.1, the knowledge discovery in databases (KDD) process commonly comprises the following steps:
• Data Cleaning – Data is cleansed by filling in missing values, smoothing noisy data and resolving inconsistencies in the data.
• Data Integration – Data with different representations are put together and conflicts within the data are resolved.
• Data Selection – Data relevant to the analysis task are retrieved from the database.
• Data Transformation – Data is transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
• Data Mining – Intelligent methods are applied in order to extract data patterns.
• Pattern Evaluation – The extracted data patterns are evaluated.
• Knowledge Presentation – The discovered knowledge is presented.
Figure 1.1: Step-wise block diagram of the knowledge discovery process
Data mining is an essential step in the knowledge discovery process, and every step of the process plays an important role.
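The four preparation steps listed above (cleaning, integration, selection and transformation) can be illustrated with a minimal sketch. The records, field names and helper functions below are hypothetical and chosen only for illustration; they are not taken from the dataset or implementation used in this thesis.

```python
from statistics import mean

# Hypothetical raw records from two sources; None marks a missing value.
source_a = [
    {"month": "Jan", "passengers": 112},
    {"month": "Feb", "passengers": None},
    {"month": "Mar", "passengers": 132},
]
source_b = [{"Month": "Apr", "Passengers": "129"}]  # different field names and types


def clean(records):
    """Data cleaning: fill each missing value with the mean of the known values."""
    known = [r["passengers"] for r in records if r["passengers"] is not None]
    fill = round(mean(known))
    return [
        {**r, "passengers": fill if r["passengers"] is None else r["passengers"]}
        for r in records
    ]


def integrate(a, b):
    """Data integration: convert source_b to the representation used by source_a."""
    return a + [{"month": r["Month"], "passengers": int(r["Passengers"])} for r in b]


def select(records, months):
    """Data selection: keep only the records relevant to the analysis task."""
    return [r for r in records if r["month"] in months]


def transform(records):
    """Data transformation: aggregate the selected records into a summary."""
    values = [r["passengers"] for r in records]
    return {"months": len(values), "total_passengers": sum(values)}


cleaned = clean(source_a)
integrated = integrate(cleaned, source_b)
selected = select(integrated, {"Jan", "Feb", "Mar"})
summary = transform(selected)
print(summary)
```

Filling missing values with the mean is only one possible cleaning strategy; a real pipeline would choose the strategy according to the data.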
1.3. MACHINE LEARNING
Machine Learning (ML) is an established and well-recognized research area of computer science. Machine learning can be defined as a class of techniques that allow software applications to become more accurate at predicting outcomes without being explicitly programmed. It is used in many fields, such as prediction, spam filtering, search engines, computer vision and optical character recognition (OCR). Machine learning tasks can be grouped into two categories according to the nature of the learning feedback available to the system:
1.3.1. SUPERVISED LEARNING
When the set of possible classes is known in advance, the task is called supervised learning. This type of algorithm requires humans to provide both the input and the desired output, in addition to furnishing feedback about the accuracy of predictions during training.
Fig1.1.2: Document classification as example of supervised learning
Figure 1.2: Supervised learning model
When training is complete, the algorithm applies what it has learned to new data. Figure 1.2 shows the supervised learning model.
1.3.2. UNSUPERVISED LEARNING
When the set of possible classes is not known in advance, the task is called unsupervised learning. This type of algorithm does not require humans to provide the desired output, so it does not need to be trained with the desired output. Such algorithms use iterative approaches, of which some deep learning methods are an example, to review the data and arrive at conclusions. Unsupervised learning algorithms are often used for more complex processing tasks than supervised learning systems. Figure 1.3 shows the unsupervised learning model.
Figure 1.3: Unsupervised learning model
For example, consider the task of answering a given question. If both the input and the desired output are known during training, this is supervised learning. If the input is known but the desired output is not, this is unsupervised learning. Many examples of both kinds of learning can be seen in daily life.
There are many algorithms for both supervised and unsupervised learning. Examples for supervised learning include regression, decision trees and classification. Examples for unsupervised learning include clustering and association analysis.
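The difference between the two kinds of learning can be sketched on a small one-dimensional example. In the supervised case the labels are given and one centroid is computed per known class; in the unsupervised case no labels are given and the points are split into two groups by a simple two-cluster k-means. The data, labels and function names are hypothetical, and the clustering sketch assumes the points contain at least two distinct values.

```python
def fit_centroids(points, labels):
    """Supervised: class labels are known, so compute one centroid per class."""
    sums = {}
    for x, y in zip(points, labels):
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    return {y: total / count for y, (total, count) in sums.items()}


def predict(centroids, x):
    """Assign a new point to the class with the nearest centroid."""
    return min(centroids, key=lambda y: abs(centroids[y] - x))


def two_means(points, iterations=10):
    """Unsupervised: no labels; split the points into two clusters (k-means, k=2)."""
    c0, c1 = min(points), max(points)  # initial cluster centres
    for _ in range(iterations):
        g0 = [p for p in points if abs(p - c0) <= abs(p - c1)]
        g1 = [p for p in points if abs(p - c0) > abs(p - c1)]
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return g0, g1


points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]            # illustrative values only
labels = ["low", "low", "low", "high", "high", "high"]

centroids = fit_centroids(points, labels)          # uses the labels
predicted = predict(centroids, 4.0)

group_a, group_b = two_means(points)               # never sees the labels
```

The clustering step recovers two groups without being told what they mean; naming the groups "low" and "high" is only possible in the supervised setting, where the labels are supplied.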
1.4. CLASSIFICATION
To predict time series data, we use the classification technique of data mining. Classification is a data mining (machine learning) technique used to predict group membership for data instances. For example, classification can be used to predict whether the weather on a particular day will be “sunny”, “rainy” or “cloudy”.
Figure 1.4: Steps of classification in data mining
Classification is the organization of data into given classes. It uses given class labels to order the objects in the data collection. Classification approaches normally use a training set in which all objects are already associated with known class labels. The classification algorithm learns from the training set and builds a model, which is then used to classify new objects. For example, after starting a credit policy, the manager of a store could analyze the customers’ behaviour vis-à-vis their credit and label the customers who received credit with one of three possible labels: “safe”, “risky” and “very risky”. The classification analysis would generate a model that could be used to either accept or reject credit requests in the future.
Prediction has attracted considerable attention given the potential implications of successful forecasting in a business context. There are two major types of prediction: one can either try to predict some unavailable data values or pending trends, or predict a class label for some data. The latter is tied to classification. Once a classification model has been built from a training set, the class label of an object can be predicted from the attribute values of the object and the attribute values of the classes. However, prediction more often refers to forecasting missing numerical values or increasing and decreasing trends in time-related data. The main idea is to use a large number of past values to estimate probable future values.
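The train-then-classify workflow described above can be sketched with a small categorical naive Bayes classifier applied to the credit example. This is an illustrative sketch only, not the implementation used in this thesis: the training rows, the two attributes (income level and prior default) and the use of Laplace (add-one) smoothing are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train(rows, labels):
    """Build the model: class counts and per-class counts of each attribute value."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (attribute index, label) -> value counts
    values = defaultdict(set)              # attribute index -> set of seen values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(i, label)][value] += 1
            values[i].add(value)
    return class_counts, feature_counts, values


def classify(model, row):
    """Return the label with the highest log-probability for a new object."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior of the class
        for i, value in enumerate(row):
            # Laplace (add-one) smoothing avoids zero probabilities (assumption).
            numerator = feature_counts[(i, label)][value] + 1
            denominator = count + len(values[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training set: (income level, prior default) -> credit label.
rows = [
    ("high", "no"), ("high", "no"), ("medium", "no"),
    ("low", "yes"), ("low", "no"), ("medium", "yes"),
]
labels = ["safe", "safe", "safe", "risky", "risky", "risky"]

model = train(rows, labels)
decision_a = classify(model, ("high", "no"))
decision_b = classify(model, ("low", "yes"))
```

The model is built once from the labelled training set and is then reused to label new credit requests, which mirrors the two phases described in the text.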
1.5. TIME SERIES
We use a time series dataset in this thesis. A time series is a series of data points indexed (or listed or graphed) in time order, that is, a set of observations of the values that a variable takes at different times. It is a sequence of discrete-time data. Such data may be collected at regular intervals, such as weekly (e.g. money supply), monthly (e.g. CPI), quarterly (e.g. GDP) or annually (e.g. government budget). Time series are used in statistics, econometrics, mathematical finance, weather forecasting, earthquake prediction and many other applications. The following are some reasons for working with time series data:
• Prediction of the future based on the past.
• Control of the process producing the series.
• Understanding of the mechanism generating the series.
• Description of the salient features of the series.
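Two of the uses listed above, predicting the future from the past and describing the features of a series, can be sketched as follows. The moving-average forecast is one simple technique chosen for illustration and is not the prediction method proposed in this thesis; the values and function names are hypothetical.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if window < 1 or len(series) < window:
        raise ValueError("need at least `window` observations")
    return sum(series[-window:]) / window


def first_differences(series):
    """Describe the series by its period-to-period changes."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


monthly_passengers = [112, 118, 132, 129, 121, 135]  # illustrative values only

forecast = moving_average_forecast(monthly_passengers, window=3)
changes = first_differences(monthly_passengers)
```

A larger window smooths out short-term fluctuations but reacts more slowly to a change in trend, so the window size would have to be chosen for the data at hand.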
1.6. HASHING
We use hashing to enhance the performance of working with time series data. Hashing is the transformation of a string of characters into a usually shorter, fixed-length value or key that represents the original string. We use hashing to index and retrieve data, because it is faster to find an item using the shorter hashed key than using the original value. Hashing is also used in many encryption algorithms. The algorithm that performs the transformation is called a hash function. In a database, the hash function is applied to the hash field value of a record and yields the address of the disk block in which the record is stored. Hashing is also used as an internal search structure within a program whenever a group of records is accessed exclusively by using the value of one field. This improves performance and reduces the complexity of accessing the data.
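The idea of deriving a short fixed-length key from a field value and using it to locate a record can be sketched as below. The sketch uses SHA-256 from the Python standard library and an in-memory list of buckets standing in for disk blocks; the key length, bucket count, class name and records are assumptions made for illustration and do not describe the hash function used in this thesis.

```python
import hashlib


def hash_key(value, length=8):
    """Map a string to a fixed-length hexadecimal key."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:length]


class HashIndex:
    """Toy hash index: each bucket plays the role of a disk block."""

    def __init__(self, n_buckets=16):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, field_value):
        # The hash function yields the bucket address for a hash field value.
        return int(hash_key(field_value), 16) % len(self.buckets)

    def insert(self, field_value, record):
        self.buckets[self._bucket(field_value)].append((field_value, record))

    def lookup(self, field_value):
        # Only one bucket is scanned instead of the whole collection.
        bucket = self.buckets[self._bucket(field_value)]
        return [record for value, record in bucket if value == field_value]


index = HashIndex()
for month, passengers in [("1949-01", 112), ("1949-02", 118), ("1949-03", 132)]:
    index.insert(month, {"month": month, "passengers": passengers})

found = index.lookup("1949-02")
missing = index.lookup("1950-01")
```

Because different values can fall into the same bucket, the lookup still compares the stored field value with the requested one before returning a record.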
1.7. MOTIVATION
The performance of working with project data is very important to an organization. If an organization wants to improve this performance and make predictions from its data, our approach makes this easier by using a hash-based naïve Bayes classification method. With it we create a framework for any dataset that improves the performance of working with the data and makes predictions about the data. This helps an organization in two respects: first, the performance of working with the data increases; second, predictions for a new project can be made using a framework built on the project’s past data.
1.8. OBJECTIVE
The objective of this work is to classify airline passenger data into categories using data mining approaches. The thesis has two parts: the first analyzes the performance of working with airline passenger data, and the second makes predictions about this data. The objective is met by using a naive Bayes classifier and by creating a hash key for every airline passenger data entry. In other words, the main objectives of this thesis are to make predictions for airline passenger data without reducing performance, and to increase the performance of working with the data.
1.9. ORGANIZATION OF THE THESIS
The rest of the thesis is organized as follows.
Chapter 2 presents the technical background of this thesis.
Chapter 3 presents the literature survey done for the thesis.
Chapter 4 presents the proposed work approach.
Chapter 5 presents the methodology which is used in this thesis.
Chapter 6 presents the experiments and results of this thesis.
Chapter 7 presents the conclusion and future work of this study.