1 | P a g e
1. What is Predictive Analytics?
Consider the power of predictive analytics:
• A Canadian bank uses predictive analytics to increase campaign response rates by 600%, cut
customer acquisition costs in half, and boost campaign ROI by 100%.
• A large state university predicts whether a student will choose to enroll by applying predictive
models to applicant data and admissions history.
• A research group at a leading hospital combined predictive and text analytics to improve its ability
to classify and treat pediatric brain tumors.
• An airline increased revenue and customer satisfaction by better estimating the number of
passengers who won’t show up for a flight. This reduces the number of overbooked flights that
require re-accommodating passengers as well as the number of empty seats.
As these examples attest, predictive analytics can yield a substantial ROI. Predictive analytics can help
companies optimize existing processes, better understand customer behavior, identify unexpected
opportunities, and anticipate problems before they happen.
1.1 High Value, Low Penetration. With such stellar credentials, the perplexing thing about predictive
analytics is why so many organizations have yet to employ it. According to our research, only 21% of
organizations have "fully" or "partially" implemented predictive analytics, while 19% have a project
"under development" and a whopping 61% are still "exploring" the issue or have "no plans." Predictive
analytics is also an arcane set of techniques and technologies that bewilders many business and IT
managers.
1.2 Applications. Predictive analytics can identify the customers most likely to churn next month or to
respond to next week’s direct mail piece. It can also anticipate when factory floor machines are likely to
break down or figure out which customers are likely to default on a bank loan. Today, marketing is the
biggest user of predictive analytics, with cross-selling, campaign management, customer acquisition, and
budgeting and forecasting models at the top of the list, followed by attrition and loyalty applications.
Fig. Among business intelligence disciplines, prediction provides the most business value but is also the most
complex. Each discipline builds on the one below it—these are additive, not exclusive, in practice
1.3 Versus BI Tools. In contrast, other BI technologies—such as query and reporting tools, online
analytical processing (OLAP), dashboards, and scorecards—examine what happened in the past. They are
deductive in nature—that is, business users must have some sense of the patterns and relationships that
exist within the data based on their personal experience. They use query, reporting, and OLAP tools to
explore the data and validate their hypotheses.
Predictive analytics works the opposite way: it is inductive. It doesn’t presume anything about the data.
Rather, predictive analytics lets data lead the way. Predictive analytics employs statistics, machine
learning, neural computing, robotics, computational mathematics, and artificial intelligence techniques to
explore all the data, instead of a narrow subset of it, to ferret out meaningful relationships and patterns.
Predictive analytics is like an "intelligent" robot that rummages through all your data until it finds
something interesting to show you.
1.4 More Than Statistics. It’s also important to note that predictive analytics is more than statistics. Some
even call it statistics on steroids. Linear and logistic regressions—classic statistical techniques—are still
the workhorse of predictive models today, and nearly all analytical modelers use descriptive statistics
(e.g., mean, mode, median, standard deviation, histograms) to understand the nature of the data they want
to analyze.
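The descriptive statistics listed above are straightforward to compute with Python's standard library. The salary figures below are invented purely for illustration:

```python
import statistics

# Hypothetical salary data used to illustrate basic descriptive statistics.
salaries = [32_000, 41_000, 41_000, 55_000, 120_000]

print(statistics.mean(salaries))    # 57800
print(statistics.median(salaries))  # 41000
print(statistics.mode(salaries))    # 41000
print(statistics.stdev(salaries))   # sample standard deviation
```

Note how the large outlier (120,000) pulls the mean well above the median; this is exactly the kind of property a modeler checks before building a model.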
However, advances in computer processing power and database technology have made it possible to
employ a broader class of predictive techniques, such as decision trees, neural networks, genetic
algorithms, support vector machines, and other mathematical algorithms. These new techniques take
advantage of increased computing horsepower to perform complex calculations that often require multiple
passes through the data. They are designed to run against large volumes of data with lots of variables (i.e.,
fields or columns). They also are equipped to handle "noisy" data with various anomalies that may wreak
havoc on traditional models.
1.5 Terminology. Predictive analytics has been around for a long time but has been known by other
names. For much of the past 10 years, most people in commercial industry have used the term "data
mining" to describe the techniques and processes involved in creating predictive models. However, some
software companies—in particular, OLAP vendors—began co-opting the term in the late 1990s, claiming
their tools allow users to "mine" nuggets of valuable information within dimensional databases. To stay
above the fray, academics and researchers have used the term "knowledge discovery."
1.6 Training Models. Supervised learning is the process of creating predictive models using a set of
historical data that contains the results you are trying to predict. For example, if you want to predict which
customers are likely to respond to a new direct mail campaign, you use the results of past campaigns to
"train" a model to identify the characteristics of individuals who responded to that campaign. Supervised
learning approaches include classification, regression, and time-series analysis. Classification techniques
identify which group a new record belongs to (i.e., customer or event) based on its inherent characteristics.
1.7 Unsupervised Learning. In contrast, unsupervised learning does not use previously known results to
train its models. Rather, it uses descriptive statistics to examine the natural patterns and relationships that
occur within the data and does not predict a target value. For example, unsupervised learning techniques
can identify clusters or groups of similar records within a database (i.e., clustering) or relationships among
values in a database (i.e., association). Market basket analysis is a well-known example of an association
technique, while customer segmentation is an example of a clustering technique.
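As a toy sketch of an association technique, the following Python snippet counts how often pairs of items appear together in purchase baskets, the core of market basket analysis. The basket data is invented for the example:

```python
from itertools import combinations
from collections import Counter

def pair_support(baskets):
    """Count how often each pair of items appears together in a basket."""
    counts = Counter()
    for basket in baskets:
        # sorted() gives each pair a canonical order so counts aggregate correctly
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

# Hypothetical transaction data
baskets = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["milk", "bread"],
    ["beer", "chips"],
]
counts = pair_support(baskets)
print(counts[("beer", "diapers")])  # 2 — the pair co-occurs in two baskets
```

Real association mining (e.g., the Apriori algorithm) also computes confidence and prunes infrequent itemsets, but the co-occurrence count above is the underlying idea.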
2. The Business Value of Predictive Analytics
2.1 Incremental Improvement. Although organizations occasionally make multi-million dollar
discoveries using predictive analytics, these cases are the exception rather than the rule. Organizations that
approach predictive analytics with a "strike-it-rich" mentality will likely become frustrated and give up
before reaping any rewards. The reality is that predictive analytics provides incremental improvement to
existing business processes, not million-dollar discoveries.
"We achieve success in little percentages," says a technical lead for a predictive analytics team in a major
telecommunications firm. She convinced her company several years ago to begin building predictive
models to identify customers who might cancel their wireless phone service. "Our models have
contributed to lowering our churn rate, giving us a competitive advantage."
3. The Process of Predictive Modeling
3.1 Methodologies. Although most experts agree that predictive analytics requires great skill—and some
go so far as to suggest that there is an artistic and highly creative side to creating models—most would
never venture forth without a clear methodology to guide their work, whether explicit or implicit. In fact,
process is so important in the predictive analytics community that in 1996 several industry players created
an industry standard methodology called the Cross Industry Standard Process for Data Mining (CRISP-DM).
3.2 CRISP-DM. Although only 15% of our survey respondents follow CRISP-DM, it embodies a
common-sense approach that is mirrored in other methodologies. "Many people, including myself, adhere
to CRISP-DM without knowing it," says Tom Breur, principal of XLNT Consulting in the Netherlands.
4. Most Processes for Creating Predictive Models Incorporate the Following Steps:
4.1. Defining the Project
Although practitioners don’t spend much time defining business objectives, most agree that this phase
is most critical to success. The purpose of defining project objectives is to discourage analytical
fishing excursions where someone says, "Let's run this data through some predictive algorithms to
see what we get." These projects are doomed to fail.
Collaboration with the Business. Defining a project requires close interaction between the business
and analytic modeler. To create a predictive model, this analyst meets with all relevant groups in the
marketing department who will use or benefit from the model, such as campaign managers and direct
mail specialists, to nail down objectives, timeframes, campaign schedules, customer lists, costs,
processing schedules, how the model will be used, and expected returns.
4.2. Exploring the Data
The data exploration phase is straightforward. Modelers need to find good, clean sources of data since
models are only as good as the data used to create them. Good sources of data have a sufficient number
of records, history, and fields (i.e., variables) so there is a good chance there are patterns and
relationships in the data that have significant business value.
On average, groups pull data from 7.8 data sources to create predictive models. ("High value"
predictive projects pull from 8.6 data sources on average.) However, a quarter of groups (24%) use just
two sources, and 40% use fewer than five sources. Most organizations use a variety of different data
types from which to build analytical models, most prominently transactions (86%), demographics
(69%), and summarized data (68%).
Fig. Based on 149 respondents that have implemented predictive analytics.
4.3. Preparing the Data
Cleaning and Transforming. Once analysts select and examine data, they need to transform it into a
different format so it can be read by an analytical tool. Most analysts dread the data preparation phase,
but understand how critical it is to their success. Preparing data means first cleaning the data of any
errors and then "flattening" it into a single table with dozens, if not hundreds, of columns. During this
process, analysts often reconstitute fields, such as changing a salary field from a continuous variable
(i.e., a numeric field with unlimited values) to a range field (i.e., a field divided into a fixed number of
ranges, such as $0–$20,000, $20,001–$40,000, and so forth), a process known as "binning." From
there, analysts usually perform additional transformations to optimize the data for specific types of
algorithms.
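The binning transformation described above can be sketched in a few lines of Python. This is a simplified illustration with fixed-width bins; real preparation pipelines often use irregular, data-driven bin boundaries:

```python
def bin_salary(salary, width=20_000):
    """Convert a continuous salary value into a labelled fixed-width range."""
    lower = (salary // width) * width          # floor to the nearest bin boundary
    return f"${lower:,.0f}-${lower + width:,.0f}"

print(bin_salary(35_000))  # $20,000-$40,000
print(bin_salary(5_000))   # $0-$20,000
```

After binning, a salary column with unlimited distinct values becomes a small set of categories, which many algorithms (decision trees in particular) handle more robustly.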
4.4. Building Predictive Models
Creating analytic models is both art and science. The basic process involves running one or more
algorithms against a data set with known values for the dependent variable (i.e., what you are trying
to predict). Then, you split the data set in half and use one set to create a training model and the other
set to test the training model.
If you want to predict which customers will churn, you point your algorithm to a database of
customers who have churned in the past 12 months to "train" the model. Then, run the resulting
training model against the other part of the database to see how well it predicts which customers
actually churned. Last, you need to validate the model in real life by testing it against live data.
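The split into a training half and a test half can be sketched as follows. The record layout and the churn rule used to label the toy data are hypothetical, chosen only to make the example self-contained:

```python
import random

def split_train_test(records, train_fraction=0.5, seed=42):
    """Shuffle labelled records and split them into training and test sets."""
    shuffled = records[:]                       # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)       # fixed seed for reproducibility
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical records: (features, churned?) — short-tenure customers churn
records = [({"tenure_months": t}, t < 6) for t in range(20)]

train, test = split_train_test(records)
print(len(train), len(test))  # 10 10
```

The model is then fit on `train`, scored on `test`, and finally validated against live data, as described above.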
Iterative Process. As you can imagine, the process of training, testing, and validation is iterative.
This is where the ―art‖ of analytic modeling comes to the forefront. Most analysts identify and test
many combinations of variables to see which have the most impact. Most start the process by using
statistical and OLAP tools to identify significant trends in the data as well as previous analytical work
done internally or by expert consultants. They also may interview business users close to the subject
and rely on their own knowledge of the business to home in on the most important variables to
include in the model.
Selecting Variables. Most analysts can create a good analytic model from scratch in about three
weeks, depending on the scope of the problem and the availability and quality of data. Most start with
a few hundred variables and end up with 20 to 30. This agrees with our survey results showing that a
majority of groups (52%) create new models within "weeks" and another third (34%) create new
models in one to three months. Once a model is created, about half the groups (49%) need only
"hours" or "days" to revise an existing model for use in another application, while another 30% need
"weeks" to revise a model. In addition, about half (47%) of models have a lifespan shorter than a
year, and 16% exist for less than three months.
Fig. How Long Does It Take to Create a New Model from Scratch?
Fig. How Many Variables Do You Use in Your Models?
4.5. Deploying Analytical Models
Focus on Business Outcomes. A predictive model can be accurate but have no value. Predictive
models can fail if either (1) business users ignore their results or (2) their predictions fail to
produce a positive outcome for the business. The classic story about a grocery that discovered a
strong correlation between sales of beer and diapers illustrates the latter situation. Simply identifying
a relationship between beer and diaper sales doesn’t produce a valuable outcome. Business users must
know what to do with the results, and their decision may or may not be favorable to the business.
Fig. What Does Your Group Do with the Models It Creates?
4.6. Managing Models
The last step in the predictive analytics process is to manage predictive models. Model management
helps improve performance, control access, promote reuse, and minimize overhead. Currently, few
organizations are concerned about model management. Most analytical teams are small and projects
are handled by individual modelers, so there is little need for check in/check out and version control.
"We don't have a sophisticated way of keeping track of our models, although our analytical tools
support model management," says one practitioner. She says her four-person team, which generates
about 30 models monthly, maintains analytical models in folders on the server.
Fig. Which Best Describes Your Group’s Approach to Model Management?
5. Advances in Predictive Analytics Software
Analytical software has taken much of the labor, time, and guesswork out of creating sophisticated
analytical models.
5.1. Integrated Analytic Workbenches. Leading vendors of analytical software have introduced in the
past several years robust analytic workbenches that pre-integrate a number of functions and tasks
that analytic modelers previously completed by hand or with different tools. Today, modelers can
purchase a single analytic development environment that supports all six steps in the analytic
development process.
5.2. Graphical Modeling. One major advancement offered by these workbenches is their ability to
graphically model the flow of information and tasks required to create and score analytic models. In
the past, modelers had to hand-code these steps into SQL or a scripting program. "I can't develop
models without the types of analytic tools available today since I don't have programming skills," says
TN Marketing's Siegel. "Today, I can create one hundred little steps in a graphical workflow,
configure each step, and then hit a button to make the program run. The tool builds the programming
logic behind the scenes so I don't have to."
5.3. Automated Testing. Analytic workbenches have also improved developer productivity by
automatically running multiple models and algorithms against a data set and measuring the impacts to
see which provides the best performance. Previously, developers had to spend time testing each type
of model and algorithm separately, effectively limiting the options they could test.
5.4. Client/Server. Today’s analytic workbenches run in a client/server configuration rather than only on
a desktop. A client/server architecture consolidates queries onto the server, reducing what analysts
must download to their desktops to explore data and create analytic models. This reduces network
traffic and redundant queries, which can bog down system performance.
5.5. Text Analytics. Predictive text analytics enables organizations to explore the "unstructured"
information in text in much the same way that predictive analytics explores tabular or "structured"
data. Through text analytics, organizations can uncover hidden patterns, relationships, and trends in
text. As a result, companies gain greater insight from articles, reports, surveys, call center notes, email,
chat sessions, and other types of text documents. Predictive text analytics also allows
organizations to combine structured and unstructured information in the same models or retrieve
documents related to specific KPIs.
5.6. Analytic Data Marts. Along with the client/server workbench, most organizations implement an
analytical data mart to house much of the data that analysts want to analyze. Most organizations
refresh these analytical data marts on a monthly basis so modelers can rerun models on new data.
Having a dedicated environment for predictive modelers further offloads query processing from a
central data warehouse and operational systems, and improves performance across the systems.
6. Machine Learning Methods for Mail Spam Classifier
6.1. Naïve Bayes classifier method
The Naïve Bayes classifier was proposed for spam recognition in 1998. A Bayesian classifier
works on dependent events: the probability of an event occurring in the future can be estimated
from previous occurrences of the same event. Every word has a certain probability of occurring
in spam or ham email, stored in the filter's database. If the combined word probabilities exceed a
certain limit, the filter marks the e-mail as belonging to one category or the other. Here, only two
categories are necessary: spam or ham. Almost all statistics-based spam filters use a Bayesian
probability calculation to combine individual token statistics into an overall score.
The message is considered spam if the overall spamminess product S[M] is larger than the
hamminess product H[M]. The above description is used in the following algorithm:
Stage 1. Training:
    Parse each email into its constituent tokens.
    Generate a probability for each token W:
        S[W] = Cspam(W) / (Cham(W) + Cspam(W))
    Store the spamminess values in a database.
Stage 2. Filtering:
    For each message M:
        While M is not at its end, scan the message for the next token Ti,
        query the database for its spamminess S(Ti), and accumulate the
        message probabilities S[M] and H[M].
    Calculate the overall message filtering indication:
        I[M] = f(S[M], H[M])
    where f is a filter-dependent function, such as:
        I[M] = (1 + S[M] - H[M]) / 2
    If I[M] > threshold, the message is marked as spam;
    otherwise it is marked as non-spam.
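The two stages above can be sketched in Python. This is a minimal illustration rather than a production filter: the whitespace tokenizer, the neutral score of 0.5 for unseen tokens, the toy training messages, and the 0.5 threshold are all assumptions made for the example:

```python
from collections import Counter

def train(spam_msgs, ham_msgs):
    """Stage 1: per-token spamminess S[W] = Cspam(W) / (Cham(W) + Cspam(W))."""
    cspam, cham = Counter(), Counter()
    for m in spam_msgs:
        cspam.update(m.lower().split())
    for m in ham_msgs:
        cham.update(m.lower().split())
    return {w: cspam[w] / (cspam[w] + cham[w])
            for w in set(cspam) | set(cham)}

def classify(message, spamminess, threshold=0.5):
    """Stage 2: accumulate S[M] and H[M] as products of per-token scores,
    then compute the indication I[M] = (1 + S[M] - H[M]) / 2."""
    s_m = h_m = 1.0
    for token in message.lower().split():
        s_w = spamminess.get(token, 0.5)  # unseen tokens are neutral
        s_m *= s_w
        h_m *= 1.0 - s_w
    i_m = (1.0 + s_m - h_m) / 2.0
    return "spam" if i_m > threshold else "ham"

model = train(["win cash now", "cash prize now"],
              ["meeting at noon", "lunch at noon"])
print(classify("win cash prize", model))  # spam
```

Practical Bayesian filters additionally smooth the token probabilities and work in log space to avoid numerical underflow on long messages.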
6.2. K-nearest neighbour classifier method
The k-nearest neighbour (k-NN) classifier is an example-based classifier: the training documents
themselves are used for comparison, rather than an explicit category representation such as the
category profiles used by other classifiers. As such, there is no real training phase. Additionally,
finding the nearest neighbours can be sped up using traditional indexing methods. To decide whether
a message is spam or ham, we look at the classes of the messages that are closest to it. The
comparison between the vectors is a real-time process. This is the idea of the k-nearest neighbour
algorithm:
Stage 1. Training:
Store the training messages.
Stage 2. Filtering:
Given a message x, determine its k nearest neighbours among the messages in the
training set. If there are more spams among these neighbours, classify the given
message as spam; otherwise classify it as ham. An indexing method is used here to reduce
the comparison time, which leads to updating the sample with a complexity of O(m), where m is
the sample size. Because all of the training examples are stored in memory, this technique is also
referred to as a memory-based classifier.
Another problem with the presented algorithm is that there seems to be no parameter we
could tune to reduce the number of false positives. This problem is easily solved by changing the
classification rule to the following l/k-rule:
If l or more messages among the k nearest neighbours of x are spam, classify x as spam,
otherwise classify it as legitimate mail.
The k-nearest neighbour rule has found wide use in general classification tasks. It is also one of
the few universally consistent classification rules.
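The l/k-rule can be sketched in Python. The squared Euclidean distance, the toy two-feature message vectors, and the chosen parameter values are assumptions made for illustration:

```python
def knn_classify(x, training, k=5, l=3):
    """l/k-rule: classify x as spam only if at least l of its k nearest
    neighbours are spam; raising l reduces false positives."""
    def distance(a, b):
        # squared Euclidean distance between feature vectors
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    neighbours = sorted(training, key=lambda item: distance(item[0], x))[:k]
    spam_votes = sum(1 for _, label in neighbours if label == "spam")
    return "spam" if spam_votes >= l else "ham"

# Hypothetical training messages as (feature_vector, label) pairs
training = [((0.9, 0.8), "spam"), ((0.8, 0.9), "spam"), ((0.7, 0.9), "spam"),
            ((0.1, 0.2), "ham"), ((0.2, 0.1), "ham")]

print(knn_classify((0.85, 0.85), training, k=3, l=2))  # spam
```

With l = k the filter only flags a message when every neighbour is spam, the most conservative setting against false positives.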
6.3. Support Vector Machines classifier method
Support Vector Machines are based on the concept of decision planes that define decision
boundaries. A decision plane is one that separates a set of objects having different class
memberships. The SVM modeling algorithm finds an optimal hyperplane with the maximal
margin separating the two classes, which requires solving a constrained optimization
problem.
A k-fold cross validation randomly splits the training dataset into k approximately equal-sized
subsets, leaves out one subset, builds a classifier on the remaining samples, and then
evaluates classification performance on the unused subset. This process is repeated k times, once
for each subset, to obtain the cross validation performance over the whole training dataset. If
the training dataset is large, a small subset can be used for cross validation to decrease
computing costs. The following algorithm can be used in the classification process.
Input:
    sample x to classify;
    training set T = {(x1,y1),(x2,y2),…,(xn,yn)};
    number of nearest neighbours k.
Output:
    decision yp ∈ {-1,1}.
Find the k samples (xi,yi) with minimal values of K(xi,xi) - 2 * K(xi,x).
Train an SVM model on the k selected samples.
Classify x using this model to obtain the result yp.
Return yp.
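The k-fold cross-validation procedure described above can be sketched as follows. The interleaved fold assignment and the toy sample set are assumptions made for simplicity:

```python
def k_fold_splits(samples, k=5):
    """Yield (train, test) pairs for k-fold cross validation: each fold is
    held out once while a classifier is built on the remaining samples."""
    folds = [samples[i::k] for i in range(k)]  # interleaved assignment to k folds
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test

samples = list(range(10))  # stand-ins for labelled training samples
for train, test in k_fold_splits(samples, k=5):
    print(len(train), len(test))  # 8 2 on every fold
```

Averaging a classifier's score over the k held-out folds gives the cross-validation performance over the whole training dataset.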
Table 1. Performance of three machine learning algorithms by selecting the top 100 features

Algorithm   Spam Recall (%)   Spam Precision (%)   Accuracy (%)
NB          98.46             99.66                99.76
SVM         95.00             93.12                96.90
KNN         96.92             96.02                96.83
Fig. Spam Recall, Spam Precision and Accuracy curves of three classifiers
7. Conclusion
In this report we reviewed some of the most popular machine learning methods and their
applicability to the problem of spam e-mail classification. Descriptions of the algorithms were
presented, along with a comparison of their performance on the SpamAssassin spam corpus. The
experiment shows very promising results, especially for the algorithms that are not popular in
commercial e-mail filtering packages. Spam recall has the lowest value of the three measures
across the methods, while in terms of accuracy the Naïve Bayes and rough sets methods perform
very satisfyingly compared with the other methods. More research must be done to improve the
performance of Naïve Bayes and k-NN, either through hybrid systems or by resolving the
feature-dependence issue in the Naïve Bayes classifier, or by hybridizing the Immune approach
with rough sets. Finally, hybrid systems appear to be the most efficient way to build a successful
anti-spam filter today.