Ans/ a- In business and economic environments, data mining techniques are applied before launching a sales campaign: if the results match our expectations, the campaign can start; otherwise, the decision is to cancel it.
On the other hand, advanced data mining techniques are used in the strategic management of enterprises. Some decisions cannot be taken by top management without analysis of the data stored in the company's database, or of data from external sources, to ensure that the right decision is made.
Reference: Andronie, Mihai and Crisan, Daniel (2010). Commercially Available Data Mining Tools Used in the Economic Environment. Database Systems Journal, 1(2), pp. 45-54.
b- 1. Genetic data such as the nucleotide sequences in genomic DNA are digital. However, experimental data are inherently noisy, making the search for patterns and the matching of sub-sequences difficult. Machine learning algorithms such as artificial neural nets and hidden Markov chains are a very attractive way to tackle this computationally demanding problem.
2. Classification of astronomical objects. The thousands of photographic plates that make up a large survey of the night sky contain around a billion faint objects. Having measured the attributes of each object, the problem is to classify each object as a particular type of star or galaxy. Given the number of features to consider, as well as the huge number of objects, decision-tree learning algorithms have been found accurate and reliable for this task.
Reference: The book; web site (https://www.coursehero.com/file/12735064/IT446-Assignment-1docx/)
Q2)
Ans/
Classification versus clustering
Classification:
- Given a collection of records (the training set), each record contains a set of attributes, one of which is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- In general, in classification you have a set of predefined classes and want to know which class a new object belongs to.
- For example, a company needs to analyze its customers' history to predict who will renew their internet service subscription. A model is therefore constructed to predict the categorical label, which is either Yes or No.
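The renewal example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the two attributes (months subscribed, number of support calls) and the function name `classify` are hypothetical, and k-nearest neighbours is chosen here simply because it is short to write; the answer above does not prescribe a particular classifier.

```python
import math
from collections import Counter

# Hypothetical training set: (months_subscribed, support_calls) -> class label.
TRAINING_SET = [
    ((24, 1), "Yes"),
    ((36, 0), "Yes"),
    ((18, 2), "Yes"),
    ((3, 5), "No"),
    ((6, 4), "No"),
    ((2, 6), "No"),
]


def classify(record, training_set=TRAINING_SET, k=3):
    """Predict the class of `record` by majority vote of its k nearest neighbours."""
    # Sort the labelled records by Euclidean distance to the new record.
    by_distance = sorted(training_set, key=lambda item: math.dist(item[0], record))
    # Count the class labels among the k closest records.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


print(classify((30, 1)))  # long-standing customer with few support calls
print(classify((4, 5)))   # new customer with many support calls
```

The classes (Yes and No) are fixed in advance by the training set; the model only decides which of them a new record belongs to.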
Clustering:
- Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
  - Data points in one cluster are more similar to one another.
  - Data points in separate clusters are less similar to one another.
- Clustering tries to group a set of objects and find whether there is some relationship between the objects.
- For example, if you run an educational institute, you can group your students based on their subscription patterns. Then, when you want to offer specific prices or discounts to specific students, the clustering process helps you choose the suitable group for the offer.
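As a rough sketch of how such groups could be found without any predefined classes, the following Python code runs a plain k-means loop. The student data (courses per year, monthly spend), the function name `k_means` and the choice of k-means itself are assumptions made for illustration; the features are also left unscaled to keep the example short.

```python
import math


def k_means(points, k, iterations=20):
    """Plain k-means; returns (centroids, assignments)."""
    # Naive deterministic start (an assumption): the first k points.
    centroids = list(points[:k])
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Hypothetical students: (courses_per_year, monthly_spend).
students = [(1, 20), (10, 200), (2, 25), (1, 30), (9, 180), (11, 210)]
centroids, assignments = k_means(students, k=2)
print(assignments)  # cluster index of each student
```

Unlike classification, no labels are supplied: the two groups emerge only from the similarity measure (here, Euclidean distance).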
Reference: The slides (ch. 1), the web site http://www.cs.duke.edu/courses/fall03/cps260/notes/lecture18.pdf, and the book.
Q3)
Ans/
We need preprocessing because it is an important step in the whole data mining process: it prepares the initial data set to go through the mining process smoothly and effectively. The data at the beginning may be of low quality (noisy, inconsistent, or with missing values), and that affects the results of data mining and the decisions based on them. Preprocessing ensures that data quality is high in terms of accuracy, completeness, consistency, timeliness, believability, and interpretability, which facilitates the mining process and improves its efficiency so that it produces high-quality results.
Four data preprocessing techniques are:
1- Data Integration: a technique used to merge and integrate data from multiple data sources (databases, data cubes, or files) into a coherent store. It uses statistical functions to detect and remove redundancies and to detect inconsistencies.
2- Data Transformation and Data Discretization: data is transformed or consolidated into forms that are suitable for mining. This involves operations such as normalization (scaling attribute data to fall within a small range), smoothing (removing noise from the data), aggregation and summarization, and discretization (dividing the range of a continuous attribute into intervals).
3- Data Cleaning: used to clean data by handling missing values, removing or smoothing noisy data, removing outliers, and resolving inconsistencies in order to prepare the data for the mining process.
4- Data Reduction: used to save time during the data analysis process. It obtains a reduced representation of the data set that is much smaller in volume yet produces the same, or almost the same, analytical results as the original data set. Data reduction strategies include dimensionality reduction, in which data encoding schemes are applied to get a reduced representation. Another strategy is numerosity reduction, in which data are replaced by alternative, smaller representations through parametric or nonparametric models.
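Three of the operations named in the list above (filling missing values, normalization, and discretization) are simple enough to sketch in plain Python. The function names and the sample `ages` values are hypothetical, and mean imputation, min-max scaling and equal-width binning are each just one possible choice among several; integration and reduction are not shown.

```python
def fill_missing_with_mean(values):
    """Data cleaning: replace missing entries (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Data transformation: rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no spread; map everything to new_min.
        return [new_min for _ in values]
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]


def equal_width_bins(values, n_bins):
    """Discretization: map each value to the index of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # Clamp so that the maximum value falls into the last interval.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [20, None, 40, 60]                # hypothetical attribute with a missing value
cleaned = fill_missing_with_mean(ages)   # the gap is filled with the mean, 40.0
scaled = min_max_normalize(cleaned)      # values now lie between 0.0 and 1.0
bins = equal_width_bins(cleaned, 3)      # interval index per value
print(cleaned, scaled, bins)
```

Cleaning is applied first here so that the later steps never see a missing value.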
Reference: our slide, the book (Han, J., & Kamber, M. (2012). Data mining concepts and techniques, third edition (3rd ed., pp. 84-116). Waltham, Mass.: Morgan Kaufmann.)
Q4)
Ans/
Correlation analysis versus covariance analysis
Correlation analysis:
- It is used to quantify the association between two continuous variables (e.g., between an independent and a dependent variable, or between two independent variables).
- Correlation is another way to determine how two variables are related. In addition to telling you whether the variables are positively or inversely related, correlation also tells you the degree to which the variables tend to move together.
Covariance analysis:
- Covariance indicates how two variables are related. A positive covariance means the variables are positively related, while a negative covariance means they are inversely related.
- Analysis of covariance (ANCOVA) is a separate technique: a general linear model that blends ANOVA and regression.
- The formula for calculating the covariance of sample data is: COV(x, y) = Σ (x_i − x̄)(y_i − ȳ) / (n − 1)
Example:
To use the covariance formula, first identify its variables as follows.
x = 2.1, 2.5, 4.0, and 3.6 (economic growth)
y = 8, 12, 14, and 10 (S&P 500 returns)
x̄ = 3.1 (mean of x, rounded from 3.05)
ȳ = 11 (mean of y)
Substitute these values into the covariance formula to determine the relationship between economic growth and S&P 500 returns.
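The substitution can be written out as follows. It assumes the sample form of the covariance formula, with n − 1 in the denominator, which is the form that reproduces the value 1.53 used later in this answer.

```latex
\begin{aligned}
\mathrm{COV}(x,y) &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \\
&= \frac{(2.1-3.1)(8-11) + (2.5-3.1)(12-11) + (4.0-3.1)(14-11) + (3.6-3.1)(10-11)}{4 - 1} \\
&= \frac{3.0 - 0.6 + 2.7 - 0.5}{3} = \frac{4.6}{3} \approx 1.53
\end{aligned}
```

The covariance is positive, so economic growth and S&P 500 returns are positively related in this sample.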
To calculate correlation, you must know the covariance of the two variables and the standard deviation of each variable.
Using the information from above, you know that:
COV(x,y) = 1.53
sx = 0.90
sy = 2.58
Now you can calculate the correlation coefficient by substituting the numbers above into the correlation formula, as shown below.
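Written out with the values listed above, and assuming the usual definition of the correlation coefficient as the covariance divided by the product of the two standard deviations, the calculation is:

```latex
r_{xy} = \frac{\mathrm{COV}(x,y)}{s_x \, s_y}
       = \frac{1.53}{0.90 \times 2.58}
       = \frac{1.53}{2.322}
       \approx 0.66
```

A value of about 0.66 indicates a moderately strong positive relationship on the scale from −1 to +1.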
Note: Both covariance and correlation show that the variables are positively related. Because it is standardized, correlation can also measure the degree to which the variables tend to move together.
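The figures in this example can be checked with a short Python sketch. The function names are hypothetical, and the code assumes the sample (n − 1) versions of covariance and standard deviation, which match the values 1.53, 0.90 and 2.58 quoted above.

```python
import math


def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def sample_std(values):
    """Sample standard deviation: square root of a variable's covariance with itself."""
    return math.sqrt(sample_covariance(values, values))


def correlation(xs, ys):
    """Correlation coefficient: covariance standardized by both standard deviations."""
    return sample_covariance(xs, ys) / (sample_std(xs) * sample_std(ys))


economic_growth = [2.1, 2.5, 4.0, 3.6]  # x values from the example
sp500_returns = [8, 12, 14, 10]         # y values from the example

cov_xy = sample_covariance(economic_growth, sp500_returns)
r_xy = correlation(economic_growth, sp500_returns)
print(round(cov_xy, 2), round(r_xy, 2))  # prints 1.53 0.66
```

The covariance depends on the units of x and y, while the correlation always lies between −1 and +1, which is why only the latter expresses the strength of the relationship.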