“…take our 20 best people and virtually overnight we become a mediocre company”
- Bill Gates
Companies widely regard people as their most important assets, and winning the intense competition for talent is a vital competitive edge. In the information technology age, innovation is a key factor, and the ability of companies to attract and retain talent remains a critical success factor.
The “IBM HR Analytics Employee Attrition & Performance” (IBM-HR) Kaggle dataset is a fictional but comprehensive dataset created by IBM data scientists to reflect the data typically tracked and accumulated by a company's Human Resources Department ("HRD") on personnel employment and termination.
Data mining methodology, utilising database, statistical and machine learning applications, can aid in uncovering patterns from the dataset. Any reliable data relationships uncovered will provide the basis for creating models to identify or predict future occurrences of attrition.
2. Project Objectives
So, what are the key characteristics of employees lost to attrition compared to those employees successfully retained by the company?
Using the SEMMA (Sample, Explore, Modify, Model and Assess) methodology and SAS Enterprise Miner Software, the project will explore the IBM HR dataset to uncover some distinctive patterns regarding the issue of employees' attrition and focus on identifying the relationship(s) between attrition and selected key variables such as employees':
• Education level
• Job satisfaction
• Performance rating
The objectives of the project are:
• Understand the business context;
• Use data sampling to explore, understand and detect actionable patterns or profiles;
• Apply various modelling techniques to create models for testing;
• Assess and recommend a reliable predictive model to guide HRD in HR policies, spotting employees at risk of attrition and reducing overall employee attrition.
Expected challenges to the data mining process are:
• Ensure data integrity and consistency. Data cleansing may be required to handle missing data, data gaps and outliers etc.;
• Ensure data are properly represented (e.g. scale, format, nominal, numeric data etc.);
• Problem definition and characterisation;
• Usefulness and reliability of recommended model.
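As a minimal sketch of the data representation challenge, the Python functions below rescale a numeric column to the range 0 to 1 and one-hot encode a nominal column. The function names and sample values are hypothetical and are not taken from the IBM-HR dataset; the project itself relies on SAS Enterprise Miner, so this only illustrates the idea.

```python
def min_max_scale(values):
    """Rescale numeric values linearly so the smallest maps to 0 and the largest to 1."""
    low, high = min(values), max(values)
    if high == low:
        # A constant column carries no information; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def one_hot(values):
    """Encode nominal values as 0/1 indicator columns, one column per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories


# Hypothetical sample columns (assumed for illustration only).
ages = [20, 30, 40]
departments = ["Sales", "R&D", "Sales"]

scaled_ages = min_max_scale(ages)
encoded_departments, department_names = one_hot(departments)
```

Scaling keeps attributes with large numeric ranges from dominating distance-based methods, while one-hot encoding avoids implying an order among nominal categories.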
3. Project Timeline
The proposed project timeline is planned as follows:
4. Literature Review
Data mining refers to the overall process of gathering and analysing data, developing inductive learning models, and adopting practical decisions and consequent actions based on the knowledge acquired (Vercellis, 2009).
The purpose of data mining analysis is to draw insights from past data and derive new, original general rules that are applicable to the general population. This knowledge enables descriptive, predictive or prescriptive modelling of solutions to identify or forecast future problems (Vercellis, 2009).
5. Data Mining Strategies
Below are several data mining strategies that can be used to approach data issues or problems:
Classification uses supervised learning: for a set of objects with given attribute values, a classification model identifies each object as belonging to a given class (Ali & Wasimi, 2007).
The model can then be used to predict the class of a new object. As each class can take only discrete values, the target predicted by these algorithms is categorical. Some classification algorithms include Bayesian Classification, Decision Trees, Neural Networks, K-nearest Neighbour Classifiers and Genetic Algorithms (Benoît, 2002; Dunham, 2003).
Classification algorithms are applied in bioinformatics, medical diagnosis, fraud detection, loan risk prediction, text classification etc. For example, a bank mines its database and categorises its customers according to various attributes such as income, employment status, age, savings, outstanding loans and home ownership. The data mining process utilises a classification model to identify a customer as belonging to a specified class. The output (classification category) will be used by the bank manager as a basis to approve any future loan to the customer.
Association analysis is used to discover, in the most efficient manner, patterns that describe strongly associated features in the data. Useful applications include finding groups of genes with related functionality, identifying web pages that are accessed together or understanding different elements of Earth's climate system.
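To illustrate the web page example, the sketch below computes support and confidence, two measures commonly used in association analysis, over a handful of browsing sessions. The sessions, page names and function names are hypothetical, and the brute-force pair search is only suitable for tiny datasets.

```python
from itertools import combinations

# Hypothetical browsing sessions: each set lists the pages viewed in one visit.
SESSIONS = [
    {"home", "jobs", "benefits"},
    {"home", "jobs"},
    {"home", "benefits"},
    {"jobs", "benefits"},
    {"home", "jobs", "benefits"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) as support(A and C) / support(A)."""
    base = support(antecedent, transactions)
    if base == 0:
        raise ValueError("antecedent never occurs in the transactions")
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / base


def frequent_pairs(transactions, min_support):
    """Return every pair of items whose support is at least min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for pair in combinations(items, 2):
        s = support(pair, transactions)
        if s >= min_support:
            result[pair] = s
    return result
```

Here confidence({"home"}, {"jobs"}, SESSIONS) estimates how often a session that includes the home page also includes the jobs page.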
Clustering is similar to classification except that the classes are not predefined. Clustering analysis seeks to find groups of closely related observations so that observations that belong to the same cluster are more similar to each other than to observations that belong to other clusters (Tan, Steinbach & Kumar, 2006). The similarity measure is an important criterion for a clustering algorithm (Ali & Wasimi, 2007).
Clustering algorithms are used in many areas, including document clustering, where a good clustering algorithm is able to identify and group, for example, news articles based on their respective topics. Another example is anomaly detection, where the algorithm discovers observations whose characteristics differ significantly from the rest of the data. Other applications include detecting fraud, network intrusions or unusual patterns of disease.
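As one possible illustration of grouping by similarity, the sketch below implements a plain k-means procedure with Euclidean distance as the similarity measure. k-means is chosen here as an assumption, since the essay does not name a specific clustering algorithm, and the data points are hypothetical. The centroids are seeded with the first k points so that the result is repeatable.

```python
import math


def kmeans(points, k, iterations=20):
    """Cluster 2-D points with k-means, seeded with the first k points."""
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # assignments have stabilised
        centroids = new_centroids
    labels = [
        min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points
    ]
    return centroids, labels


# Hypothetical observations forming two visibly separate groups.
POINTS = [(1.0, 1.0), (8.0, 9.0), (1.5, 2.0), (9.0, 8.0), (1.0, 1.5), (8.5, 9.5)]
centroids, labels = kmeans(POINTS, k=2)
```

No class labels are supplied; the groups emerge only from the distances between observations, which is the key difference from classification.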
Linear regression models are among the best-known learning and predictive methodologies in statistics. Regression is the process of estimating the numeric value of an attribute based on the given values of other related attributes. When the estimate is for a future time, the problem is called a prediction (Ali & Wasimi, 2007). Depending on the complexity of the relationship, the model may take the form of a simple linear regression, multiple regression or logistic regression analysis.
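For the simplest case, a single predictor, the line can be fitted by ordinary least squares. The sketch below is a minimal illustration; the years-of-experience and income figures are hypothetical and chosen to lie exactly on a line.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Estimate the numeric target for a new x value."""
    return slope * x + intercept


# Hypothetical data: years of experience and monthly income (in thousands).
YEARS = [1, 2, 3, 4, 5]
INCOME = [30, 35, 40, 45, 50]

slope, intercept = fit_line(YEARS, INCOME)
estimate = predict(6, slope, intercept)
```

Multiple regression extends the same idea to several predictors, and logistic regression adapts it to a binary outcome such as attrition.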
6. Data Mining Tools
An array of data mining tools or algorithms is available. Some popular algorithms are:
• Function Estimation-based Algorithms
An optimisation method helps this type of algorithm estimate the optimum parameters. Neural Networks and Support Vector Machines are the more popular examples of function estimation-based data mining algorithms and have wide applications in industry, from identifying clusters of valuable customers to estimating financial streams and recognising numbers written on cheques.
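The role of the optimisation method can be sketched with the smallest possible neural network: a single sigmoid neuron whose weight and bias are estimated by gradient descent. This is an illustrative assumption, not a description of any particular tool; the data, learning rate and number of epochs are hypothetical.

```python
import math


def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_neuron(xs, ys, learning_rate=0.1, epochs=2000):
    """Estimate weight w and bias b by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y  # gradient of log loss w.r.t. z
            grad_w += error * x
            grad_b += error
        # Step against the average gradient to reduce the loss.
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b


def predict(x, w, b):
    """Return the estimated probability that x belongs to class 1."""
    return sigmoid(w * x + b)


# Hypothetical one-dimensional data: small values are class 0, large are class 1.
XS = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
YS = [0, 0, 0, 1, 1, 1]

w, b = train_neuron(XS, YS)
```

Larger networks and Support Vector Machines use more elaborate optimisers, but the principle of adjusting parameters to reduce an error measure is the same.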
• Lazy Learning-based Algorithms
As the name suggests, this type of algorithm defers processing until a query needs to be answered, making it a highly efficient and accurate learning algorithm when small numbers of objects are to be classified. K-Nearest Neighbours and Lazy Bayesian Rules are examples of lazy learning-based algorithms. Applications include text categorisation and breast cancer prediction.
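A minimal K-Nearest Neighbours sketch shows what deferring the processing means in practice: there is no training step, and all the work happens when a query arrives. The training records, the two attributes and the labels are hypothetical.

```python
import math
from collections import Counter

# Hypothetical records: (job satisfaction, years since last promotion) -> outcome.
TRAINING = [
    ((1.0, 5.0), "left"),
    ((2.0, 4.0), "left"),
    ((1.0, 4.0), "left"),
    ((4.0, 1.0), "stayed"),
    ((3.0, 0.0), "stayed"),
    ((4.0, 0.0), "stayed"),
]


def knn_predict(training, query, k=3):
    """Classify query by majority vote among its k nearest training records."""
    # All computation happens here, at query time: sort by Euclidean distance.
    neighbours = sorted(training, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

For example, knn_predict(TRAINING, (1.0, 4.5)) inspects the three closest records and returns their majority label. Because every query scans the stored data, the cost grows with the size of the training set.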
• Probability-based Algorithms
Probability-based algorithms are based on Bayes' Theorem, the classical statistical result for calculating conditional probabilities. Naive Bayes and BayesNet are examples of probability-based learning algorithms. Common uses include text mining, detection of fraudulent credit card applications and medical data classification.
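The sketch below shows how Naive Bayes applies Bayes' Theorem to categorical attributes: the class prior is multiplied by one conditional probability per attribute, and the results are normalised. Laplace (add-one) smoothing is an added assumption to avoid zero probabilities, and the overtime and travel records are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical records: (works overtime, business travel) -> outcome.
ROWS = [
    ("yes", "frequent"), ("yes", "rare"), ("yes", "frequent"),
    ("no", "rare"), ("no", "rare"), ("no", "frequent"),
    ("yes", "rare"), ("no", "rare"),
]
LABELS = ["left", "left", "left", "stayed", "stayed", "stayed", "stayed", "stayed"]


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    feature_counts = Counter()   # key: (attribute index, value, label)
    values = defaultdict(set)    # distinct values seen for each attribute
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(i, value, label)] += 1
            values[i].add(value)
    return class_counts, feature_counts, values


def posterior(model, row):
    """Return P(class | row) for every class, using add-one smoothing."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(row):
            # Smoothed estimate of P(attribute value | class).
            score *= (feature_counts[(i, value, label)] + 1) / (count + len(values[i]))
        scores[label] = score
    norm = sum(scores.values())
    return {label: score / norm for label, score in scores.items()}


model = train_naive_bayes(ROWS, LABELS)
probabilities = posterior(model, ("yes", "frequent"))
```

The "naive" part is the assumption that attributes are independent given the class, which is what allows the conditional probabilities to be multiplied.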
• Tree-based Algorithms
Tree or rule-based learning (also called classification trees) is a divide-and-conquer approach. The most common uses of decision tree algorithms include marketing, sales data analysis, loan approvals and fraud detection (Ali & Wasimi, 2007).
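One step of the divide-and-conquer approach can be sketched as choosing the attribute that best separates the classes. The example below uses information gain as the splitting criterion, which is an assumption, since tree algorithms differ in the criterion they use. The loan applicants are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(rows, labels, feature):
    """Reduction in entropy obtained by splitting the rows on one feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(
        len(group) / len(labels) * entropy(group) for group in groups.values()
    )
    return entropy(labels) - remainder


def best_split(rows, labels):
    """Pick the feature with the highest information gain."""
    return max(rows[0].keys(), key=lambda f: information_gain(rows, labels, f))


# Hypothetical loan applicants and decisions.
APPLICANTS = [
    {"employed": "yes", "home_owner": "yes"},
    {"employed": "yes", "home_owner": "no"},
    {"employed": "no", "home_owner": "yes"},
    {"employed": "no", "home_owner": "no"},
]
DECISIONS = ["approve", "approve", "reject", "reject"]

root_feature = best_split(APPLICANTS, DECISIONS)
```

A full tree repeats this step on each resulting subset until the subsets are pure or a stopping rule is met.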
7. Data Mining Applications in Human Resources
Attrition is the loss of employees resulting from voluntary or involuntary resignation and is measured by the employee turnover (or wastage) metric. Attrition could result from market changes, workload, stress or organisational factors such as teamwork, pay or mismatched role expectations. Good attrition results in less productive employees leaving, and bad attrition results in high performers leaving. Both result in recruiting and training costs and loss of productivity.
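The essay does not define the turnover metric. One common convention, assumed here, divides the number of leavers in a period by the average headcount over that period. The figures in the sketch are hypothetical.

```python
def turnover_rate(separations, headcount_start, headcount_end):
    """Leavers in a period divided by the average headcount for that period."""
    average_headcount = (headcount_start + headcount_end) / 2
    if average_headcount <= 0:
        raise ValueError("average headcount must be positive")
    return separations / average_headcount


# Hypothetical year: 12 leavers, headcount growing from 100 to 140.
annual_rate = turnover_rate(separations=12, headcount_start=100, headcount_end=140)
```

With these assumed figures the average headcount is 120, so the rate is 12 / 120, or 10 per cent.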
To maintain their competitive edge, modern corporations need to successfully manage employees' turnover or attrition, boost employees' retention and avoid the costly impact of attrition. The cost of losing an employee can range from tens of thousands to millions of dollars. Data mining can uncover useful patterns and form predictions from historical data regarding factors that have a negative impact on employees' turnover and retention.
HRD can apply such insights to future metrics to monitor employees at risk of attrition. HR policies and pre-employment profiling can be tweaked to ensure a better match between the organisation and employment candidates. Such steps can contribute to a higher employee retention rate, reducing re-hiring and training costs as well as improving employees' morale.
Challenges of Data Mining
Due to challenges presented by the dataset, the data normally has to be cleaned and pre-processed prior to data mining use. Some key data challenges include:
• Size of dataset;
• Higher dimensionality;
• Missing and noisy data.
Data errors may result from duplication, incorrect entry or processing, or outliers. Common methods of dealing with these errors include excluding records with missing data points, and imputation, which assigns replacement data by mean substitution or a model-based approach.
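Mean substitution, the simplest of the imputation methods mentioned above, can be sketched as follows. Missing entries are represented by None, which is an assumption about how the data are stored, and the income values are hypothetical.

```python
from statistics import mean


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical monthly income column with two missing entries.
incomes = [3000, None, 5000, 4000, None]
cleaned = impute_mean(incomes)
```

Mean substitution keeps every record but shrinks the variance of the column, which is one reason a model-based approach is sometimes preferred.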
8. Applications of Data Mining
Nowadays, great amounts of data are collected and accumulated at a rate unprecedented in human history. Technological advances in point-of-sale and mobile application data collection, identification and smart card technologies have enabled collection in almost any sphere of human activity, including commercial, industrial and social activities, as well as for scientific and environmental purposes.
Whether we are banking, driving or shopping at a supermarket, data are collected by companies, government agencies and product developers for data mining. This results in new insights for descriptive, predictive or prescriptive purposes, enabling applications in many areas including marketing, manufacturing processes, medical diagnosis and banking. Applications that monitor fraudulent banking transactions, traffic conditions and product performance have already helped law enforcement, reduced product manufacturing costs and enabled companies to predict consumer behaviour, improve the success rates of their marketing strategies (cross-selling, up-selling etc.) and make informed decisions.
When data mining is applied to text or image recognition, it can also expedite the recognition and classification of text and images. Text mining applies to documents, books, emails and webpages, and image recognition may apply to static digital images or dynamic video images (Vercellis, 2009).