The past decades have seen vast growth in the amount of text, images and, more generally, data regularly produced and stored. Data hold information that can be analysed in order to discover knowledge. However, because of the increasing amount of data, it is becoming challenging (sometimes even impossible) to analyse them manually, extract useful information and draw conclusions. In this scenario, data mining plays a fundamental role. Data mining is the process of analysing existing data in order to extract implicit, previously unknown and useful information and patterns from such data (Witten and Frank, 2005). Data mining is a practice that examines large amounts of data; when applied to small datasets, it can return misleading patterns or fail to find any useful information.
In order to apply data mining techniques to a large dataset, several processes are required to obtain useful results. The resulting patterns can support accurate predictions about future data. This is extremely important in a number of different sectors, as data mining finds application in various fields. In particular, the application of data mining in healthcare is becoming essential, as predictions drawn from the analysis of current data can help identify diseases, treat them accurately and even prevent the loss of lives. Despite the advantages of using data mining, there are several risks associated with its usage and its introduction in a company.
Process and techniques
Data mining is one of the processes of Knowledge Discovery in Databases (KDD), which, as the name suggests, deals with finding knowledge in data. The various processes of KDD can be observed in Fig. 1. The preliminary process needed to apply any data mining technique is the collection of data. This may occur in various ways, for instance by conducting a survey, measuring objects or people, or collecting log files from computers. Once a dataset has been collected, the data need to be carefully processed (cleaned): noise (i.e. meaningless data), inconsistencies, missing values and duplicates need to be removed. Occasionally, data from multiple sources can be combined in order to achieve better results (integration process). Frequently, some attributes of the dataset are not relevant to the desired prediction; such attributes should be removed as part of the selection process. The selected attributes then need to be transformed into an appropriate form so that the data mining algorithm can process the data efficiently. After the data preprocessing steps have been carefully carried out, the data are fed into the data mining tool (data mining process). The tool processes the data and outputs some statistical results. Such results need to be evaluated in order to determine whether they are accurate or not (pattern evaluation process). If they are not, different attributes may be used or a different transformation may be applied to the dataset. Once the data mining tool returns a useful result (i.e. the tool predicts outcomes with a certain level of accuracy and in a form that can be understood), experts (e.g. physicians or doctors) can apply such knowledge to make more informed decisions.
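The cleaning, selection and transformation steps described above can be illustrated with a minimal sketch. The records, the attribute names (age, weight_kg, favourite_colour) and the choice of min-max scaling are assumptions made purely for illustration; they are not taken from any of the cited sources, and a real project would adapt each step to its own data.

```python
# Hypothetical raw records; all field names and values are illustrative assumptions.
RAW_RECORDS = [
    {"id": 1, "age": 34, "weight_kg": 70.0, "favourite_colour": "red"},
    {"id": 2, "age": None, "weight_kg": 82.5, "favourite_colour": "blue"},  # missing value
    {"id": 1, "age": 34, "weight_kg": 70.0, "favourite_colour": "red"},    # duplicate
    {"id": 3, "age": 51, "weight_kg": -5.0, "favourite_colour": "green"},  # noise
    {"id": 4, "age": 46, "weight_kg": 95.0, "favourite_colour": "red"},
]


def clean(records):
    """Cleaning: drop duplicates, records with missing values and implausible values."""
    seen_ids, cleaned = set(), []
    for record in records:
        if record["id"] in seen_ids:
            continue  # duplicate of a record already kept
        if any(value is None for value in record.values()):
            continue  # missing value
        if record["weight_kg"] <= 0:
            continue  # noise: a weight cannot be zero or negative
        seen_ids.add(record["id"])
        cleaned.append(record)
    return cleaned


def select(records, attributes):
    """Selection: keep only the attributes relevant to the desired prediction."""
    return [{name: record[name] for name in attributes} for record in records]


def transform(records, attributes):
    """Transformation: min-max scale each attribute to the range [0, 1]."""
    scaled = [dict(record) for record in records]
    for name in attributes:
        values = [record[name] for record in records]
        low, span = min(values), max(values) - min(values)
        for record in scaled:
            record[name] = (record[name] - low) / span if span else 0.0
    return scaled


def preprocess(records):
    """Run cleaning, selection and transformation in sequence."""
    relevant = ["age", "weight_kg"]  # assumed to be the relevant attributes
    return transform(select(clean(records), relevant), relevant)


if __name__ == "__main__":
    for row in preprocess(RAW_RECORDS):
        print(row)
```

The output of preprocess would then be passed to the data mining tool; if the evaluation of the results is unsatisfactory, the list of relevant attributes or the scaling step is where a different choice would be tried.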
There are a variety of techniques that can be used to build data mining tools. As explained by Brown (2012) and Oracle (2017), some of the core techniques used are:
Association: creates a correlation between two or more items. This is particularly used in marketing and sales in order to promote a product based on what other people have bought in the past. It effectively determines the probability of the co-occurrence of any two particular items.
Classification: divides and groups data together based on different attributes. It aims to predict the target class for each case in the data. For example, cars can be classified into different types by identifying different attributes (number of seats, car shape, driven wheels).
Clustering: groups data into sets (called clusters) such that data within a particular cluster are more similar to each other than to data in other clusters.
Decision trees: provide a set of rules (selection criteria) used to filter specific data.
Combinations: apply multiple techniques together in order to obtain optimal results. Commonly, classification and clustering are used in combination to refine classifications.
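As an illustration of the association technique listed above, the following minimal sketch estimates how often two items occur together (support) and how likely one item is given another (confidence). The shopping baskets are invented for the example, and the function names are hypothetical; production tools typically rely on dedicated algorithms such as Apriori rather than this direct counting.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase history; each set is one customer's basket.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def pair_support(transactions):
    """Fraction of baskets in which each pair of items occurs together."""
    counts = Counter()
    for basket in transactions:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    total = len(transactions)
    return {pair: count / total for pair, count in counts.items()}


def confidence(transactions, antecedent, consequent):
    """Estimated probability that a basket containing `antecedent` also contains `consequent`."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0
    matches = sum(consequent in t for t in with_antecedent)
    return matches / len(with_antecedent)


if __name__ == "__main__":
    print(pair_support(TRANSACTIONS))
    print(confidence(TRANSACTIONS, "butter", "bread"))
```

In this toy dataset every basket containing butter also contains bread, so a retailer might promote bread to customers who buy butter, which is the kind of marketing use described in the list.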
Data mining is widely used due to the accurate forecasts that can be made from the analysis of vast amounts of accurately recorded data. An area where data mining has found extensive application is healthcare, where it improves medical performance. In fact, as IBM (2017c) explains, thanks to the power of data-driven insights, healthcare organisations can deliver more efficient care, engage patients and consumers, and optimise business performance. Data mining applications in healthcare can be grouped into: oncology, patient prioritisation, death prediction, healthcare management, drug discovery, treatment effectiveness, hospital infection control, and fraud and abuse.
Oncology
Data mining is increasingly being used for the treatment of tumours. As IBM (2017d) explains, every year over 11,000 articles are published on breast cancer alone; one of the biggest challenges for a surgeon is to stay on top of the literature and research. In March 2017, Jupiter Medical Center, a regional medical centre based in Florida, US, became the first to adopt Watson for Oncology (IBM, 2017d). As described by IBM (2017b), thanks to Watson, physicians can quickly interpret and analyse patients' health records, surface relevant articles, and explore treatment options to reduce the variability of care; in particular, Watson is capable of analysing more than 300 medical journals, more than 200 textbooks, and nearly 15 million pages of text to provide insights about different treatment options.
Besides this practical application, various research is also being conducted on the application of data mining in oncology. In fact, as Herland et al (2014) highlight, two particular studies could be applied in order to help physicians treat their cancer patients: one study categorises leukaemia into different subclasses by analysing genes, while the second makes use of data to predict relapse among patients in the early stages of cancer.
Patient prioritisation
By effectively prioritising at-risk patients, physicians can intervene more rapidly, resulting in improved quality of the care provided (IBM, 2017e). One practical application of data mining is in patient prioritisation, where hospitals analyse patients' data in order to identify at-risk patients. IBM (2017e) explains that Mercy Health, a premier healthcare provider in Ohio, has chosen IBM Watson in order to achieve this. In fact, the organisation can rapidly gather all the pertinent claims and clinical information about its patients; Watson then analyses such data in order to produce a patient's summary, allowing doctors to look at risk scores, gaps in care and more. The organisation estimated that this application of data mining has raised its standard of patient care while increasing its portion of shared savings.
Death prediction
Research conducted by Paoin (2011) shows that data mining can be applied to predict the cause of death of deceased patients with unknown death causes. The study achieves this by using the World Health Organization (WHO) mortality database, which contains mortality statistics divided by country. Different techniques in the WEKA software have been used in this research (decision tree, Naïve Bayes, Apriori algorithm), with confidence levels ranging from 86% to 100%.
Healthcare management
As shown by Koh et al (2011), data mining techniques can be applied to analyse extensive volumes of data and use the results to compare different healthcare practices, resource utilisation, length of stay and costs of different hospitals. In fact, practical examples in healthcare management are the treatment guidelines, disease management groups and cost management adopted by Sierra Health Services (Koh et al, 2011).
Koh et al (2011) further suggest that data mining could also be applied in order to develop an automated warning system in the event of epidemics. Although this early-warning system has not yet found any practical application, the usage of data mining to predict the spread of epidemics might potentially help save thousands of lives.
Drug discovery
Drug discovery is an area of healthcare in which data mining techniques have been applied in order to identify new indications for existing drugs as well as novel drug targets, as explained by IBM (2017a). IBM also reports that the Barrow Neurological Institute (US) applied the IBM Watson for Drug Discovery platform in order to identify new targets for ALS research. In fact, based on the information that Watson returns after analysing data, physicians are able to identify new, more effective treatments for ALS.
Treatment effectiveness
Using data mining techniques, it is possible to evaluate the effectiveness of medical treatments and identify which treatment proves more effective. A practical example is United HealthCare, which analysed its treatment record data in order to discover novel ways to reduce costs while increasing the quality of its medicine (Koh et al, 2011). As the authors continue, data mining has also been applied to reduce various side-effects of patients' treatment, by analysing how patients respond to specific drugs and identifying proactive steps that can reduce the risk of affliction.
Hospital infection control
Using data mining techniques, the University of Alabama implemented a system able to identify infection control data (Obenshain, 2004). In particular, they used a surveillance system which applies association rules on culture and patient care data and generates monthly patterns that are reviewed by an expert in infection control. Obenshain (2004) concludes that infection control performed with data mining tools is more accurate, specific and sensitive than the traditional approach.
Fraud and abuse
The healthcare system can often be subject to fraudulent activities, such as counterfeit claims by physicians or clinics, overpriced laboratory tests, and fraudulent insurance and medical claims. As explained by Koh et al (2011), data mining has been successfully applied to the detection of fraudulent activities in several hospitals and insurance companies, and the results have been astonishing: all of them identified fraudulent activities and increased their annual savings by up to 20%.
Usage and Risks
In order to introduce data mining in a company, various procedures are required. In particular, the company needs to take into account the need for different types of resources for each process of data mining (described in the section Process and techniques). The first process is data collection: in order for data mining tools to use the collected data efficiently, the company needs to store them in the appropriate format (e.g. image, video, text) depending on the type of information. In order to keep large amounts of data, the company needs to invest in data storage facilities, such as expanding or building a datacenter that can hold the vast amount of data needed to feed the data mining tool. Alternatively, the use of cloud services can achieve similar results while avoiding the costs of expanding or building a datacenter. Collecting and storing data is a task that ideally needs to be carried out by specialists, as the storage of the wrong type of data may compromise the results of the data mining analysis. Hence, specialised human resources would need to be hired. In particular, the ideal team would contain statisticians, data mining experts, mathematicians and computer scientists. The size of the team will vary depending on the amount of data the company wishes to store. Cleaning and integration, and selection and transformation, are processes that require computer scientists and data mining experts. These processes place no particular constraints on the infrastructure, although software constraints may apply. The data mining process consists of using a computer system that contains data mining and data analysis algorithms. Such a tool needs to be acquired; some of the best known and most widely used data mining systems are IBM Watson, Oracle Data Mining, WEKA, RapidMiner and Orange. While some of these tools require a monthly subscription (e.g. IBM Watson), others are free, open source data mining tools (e.g. WEKA, Orange).
However, even open source tools might involve additional costs, such as training and assistance costs. The pattern evaluation process, on the other hand, is carried out by statisticians and data mining experts, who analyse how meaningful and accurate the results of the data mining tool are. The final process, knowledge presentation, is carried out by physicians, clinicians and doctors (depending on the type of data evaluated by the tool).
Although the advantages of data mining outweigh its disadvantages, there are some risks to be considered. The first issue is the evaluation of the data mining tool's performance, in particular how efficient the algorithm used is (i.e. how long the program takes to analyse the dataset). An issue related to the dataset is the scalability of data. In fact, over time the data collected will inevitably grow. This will raise the question of which data to maintain and which data to discard. Hence, an accurate analysis of relevant and irrelevant data has to be carried out periodically. The quality of the dataset is also an important concern. In fact, if the data are not accurate or contain too much noise, the data mining tool will not perform optimally, resulting in a poor analysis. Security is a further concern: data should be accessed only by authorised personnel, in order to prevent malicious activities by unauthorised people. If a datacenter is used, the security of the building(s) needs to be considered. If cloud services are used, on top of who is allowed to access the data, the country in which the data will be stored needs to be evaluated, as different countries have different laws regarding the storage of data. Protecting the data helps prevent leaks of information (e.g. through hacking) which could expose sensitive information. A further concern is ethical: the data collected cannot be used for any purposes other than the ones agreed on.
From the reasoning above, it becomes clear that data mining techniques applied to healthcare can significantly improve the care management of patients. The introduction of this methodology involves several risks and might be costly; nonetheless, the benefits that it can provide in the long run might increase the company's turnover and savings while helping to deliver better care to patients.