Analyzing 800,000 Phone Calls
to a General Practitioner
PROJECT DESCRIPTION:
The aim of this project is to develop a processing chain for analyzing a large number of audio files recorded by a general practitioner. The purpose is to build an application that can handle a very large number of audio files and extract the features and topics that the analysis requires.
Project Problem Definition:
A local general practitioner recorded 800,000 phone calls over 2.5 years, which is a very large number of audio files. Analyzing this many files with traditional methods to identify their content and contextual features is inefficient and time-consuming.
Accordingly, this project is designed to help process a large number of audio files. The analysis converts each audio file to text and then analyzes the resulting text to find contextual features of interest.
Moreover, text is easier to use than audio for different kinds of analysis. It can support medical research, such as estimating a patient's stress level by following the conversations. It can also be used to monitor customer service standards, staff training, etc.
Project Objectives:
The general objectives of this project are as follows:
1. To develop an application for batch data analytics.
2. To apply text mining to a large data set.
3. To help the health sector analyze its data more efficiently.
4. To develop an application that remains useful over any time period and with any amount of input data.
Significance of this Project:
Overall, the final product of this project can help the health sector conduct different analyses of the recorded phone calls.
CONDUCT OF THE PROJECT:
Generally, this project focuses on analyzing a huge number of audio files in limited time using limited resources. The project involves batch analysis using data analytics approaches, and it applies text mining to retrieve meaningful information that can serve various medical purposes.
Background:
As stated before, this project uses 800,000 audio files representing phone calls recorded by a local general practitioner over 2.5 years. Given the huge number of audio files that a single general practitioner produces in a short period, storage becomes a concern in this project. Furthermore, analyzing a large number of files requires specialized applications and analysis approaches. Hence, this project relies on big data analysis methods and frameworks, which also allow the application to accommodate a dramatic increase in the number of audio files in the future.
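To make the storage concern concrete, the following is a rough back-of-the-envelope sketch. Only the 800,000 calls and the 2.5 years come from this project; the audio format (8 kHz, 16-bit, mono) and the average call length of 3 minutes are hypothetical values chosen for illustration, as is the class name StorageEstimate.

```java
public class StorageEstimate {

    // Size of uncompressed PCM audio: sample rate * bytes per sample * channels * duration.
    static long bytesPerCall(int sampleRateHz, int bytesPerSample, int channels, int seconds) {
        return (long) sampleRateHz * bytesPerSample * channels * seconds;
    }

    public static void main(String[] args) {
        long calls = 800_000L;      // from the project description
        double days = 2.5 * 365;    // recording period of 2.5 years

        // Hypothetical assumptions: 8 kHz, 16-bit (2 bytes), mono, 180 seconds per call.
        long perCall = bytesPerCall(8_000, 2, 1, 180);
        long total = calls * perCall;

        System.out.printf("Average calls per day: %.0f%n", calls / days);
        System.out.printf("Estimated raw audio size: %.2f TB%n", total / 1e12);
    }
}
```

Under these assumed values the practice receives roughly 877 calls per day and the raw audio occupies about 2.3 TB, before any replication by the storage layer. Different recording settings would change the estimate proportionally.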
Big data analytics is a process used to examine large data sets to reveal correlations or insights in the data. Applying analytics makes the data more valuable and informative and can help in making more accurate decisions. Big data analytics provides methodologies for processing and analyzing a large number of files in any format or structure. It also provides good storage management for big data. The data used in this project is batch data: it is collected over time and then passed through different analysis processes.
Batch analytics frameworks such as Apache Spark and MapReduce are designed to process large volumes of data to reveal hidden patterns, measure correlations in the data, or surface other information that can help in making informed decisions.
Additionally, text analytics, also known as text mining, is used to extract specific information from a large body of text and give insight into the data. In this project, text mining is used to find contextual features of interest in the conversations contained in the audio files.
Project Development and Requirements:
The development process followed in this project is the Agile method. Agile is used for its incremental and iterative nature, which makes it possible to check the correctness of the outputs and to detect errors or inconsistencies in the analysis output throughout the development cycle. In addition, Agile allows frequent interaction with the GP and the wider National Health Service (NHS) to increase satisfaction with the product. It also helps the work stay within the time limits so that the product is finished by the deadline (3 months for this project).
a. Data requirement:
The data used for this project is human data in audio form, stored as files with the “.wav” extension. Each audio file is a conversation between a doctor and a patient. A conversation may cover one or more topics and may contain personal and sensitive information about the patient.
Accordingly, following the ethics of using recorded phone calls and the guidance of the British Medical Association (BMA), recorded phone calls used for secondary purposes, including training, healthcare assessment and research, require clear consent before the recordings are made. Therefore, all audio files in this project are used with the clear consent of the patients.
b. Software and Hardware Requirement:
The development of this project uses the Apache Spark framework with the Java programming language as its core. Apache Spark, running on the Linux operating system, handles the batch analytics processes and tools. Spark also uses the Hadoop Distributed File System (HDFS) for storage management.
The following chart shows the overall architecture of this project:
figure: project system architecture
STATEMENT OF DELIVERABLES:
The development will use computer cluster frameworks to handle a large number of input files in parallel. Among the frameworks used in big data analysis, Apache Spark is the one most likely to serve the aim of the project.
Apache Spark is an open-source framework that helps parallelize the analysis of big data. To start, Spark will be used in standalone mode with a Java core on a single node running Linux. Then, to achieve parallelization, Spark can be extended to a cluster of commodity machines that process the intended tasks in parallel.
In the standalone deployment, Apache Spark uses HDFS to manage the large volume of input data and to store the output data.
figure: spark architecture
Furthermore, the analysis process of this project consists of two major tasks, namely:
1. Converting audio files to text.
2. Text analysis.
Each task follows a different approach.
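The two tasks can be chained per file and run over the whole batch in parallel. The following is a minimal sketch in plain Java that uses parallel streams as a single-machine stand-in for the cluster; it is not Spark code. The class BatchPipeline, the file names and both stage functions are hypothetical placeholders, not part of the project's implementation.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class BatchPipeline {

    // Applies stage 1 (audio to text) and stage 2 (text analysis) to every file name
    // in parallel and collects the analysis result for each file.
    static <R> Map<String, R> run(List<String> files,
                                  Function<String, String> audioToText,
                                  Function<String, R> analyzeText) {
        return files.parallelStream()
                .collect(Collectors.toConcurrentMap(
                        file -> file,
                        file -> analyzeText.apply(audioToText.apply(file))));
    }

    public static void main(String[] args) {
        // Hypothetical file names.
        List<String> files = List.of("call-0001.wav", "call-0002.wav");

        // Placeholder stages: a real speech recognizer and a real topic model
        // would replace these two lambdas.
        Function<String, String> fakeTranscriber = file -> "transcript of " + file;
        Function<String, Integer> wordCounter = text -> text.split("\\s+").length;

        System.out.println(run(files, fakeTranscriber, wordCounter));
    }
}
```

Keeping the two stages as separate functions means either one can be replaced or tested on its own, and the same per-file structure maps naturally onto a distributed map step later.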
Audio to text:
First, all audio files (or only a subset, if needed) will be converted from audio to text. For this conversion, the Sphinx4 speech recognition library for Java provides an appropriate API. Using Sphinx4, the program will read the audio files and recognize the words in the recorded conversations. The recognized words will then be exported and saved in a text file.
Text Analysis:
The next step is to analyze the resulting text files using a topic model known as Latent Dirichlet Allocation to find contextual features of interest.
“Latent Dirichlet allocation (LDA) is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words” (Blei, Ng, & Jordan, 2003).
Moreover, by using LDA the application will be able to model the words and classify them according to their related topics.
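LDA works on word counts rather than on raw text, so each transcript first has to be reduced to a bag of words. The following is a minimal sketch of that preparation step, assuming English transcripts and a very simple tokenizer; the class BagOfWords and the sample sentence are hypothetical, and a real pipeline would likely also remove stop words.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class BagOfWords {

    // Lower-cases the transcript, splits on every run of characters that is not
    // a letter a-z or an apostrophe, and counts how often each word occurs.
    static Map<String, Integer> count(String transcript) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : transcript.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical transcript fragment.
        System.out.println(count("My back hurts. My back has hurt since Monday!"));
    }
}
```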
figure: Graphical model representation of LDA (Blei et al., 2003)
The figure above represents the LDA process. The boxes are plates that represent replication: the outer plate represents the documents, and the inner plate represents the repeated choice of topics and words within a document.
LDA uses a probabilistic formula, applied to documents containing mixed words and topics, to retrieve the words and classify them into related topics.
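The probabilistic formula is the Dirichlet density (Blei et al., 2003), given by:
$$p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}$$
where:
$\alpha$ is a k-vector of positive parameters ($\alpha_i > 0$)
$\theta$ is a k-dimensional Dirichlet random variable
$\Gamma(x)$ is the Gamma function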
By applying this formula to the documents retrieved from the first step, the application will be able to classify words into related topics and build a dictionary of the topics found in the given documents.
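For reference, Blei et al. (2003) connect the Dirichlet prior to the observed words through the joint distribution of a single document. Here $\alpha$ is the Dirichlet parameter, $\theta$ is the topic mixture of the document, $z_n$ is the topic assigned to the $n$-th word $w_n$, $N$ is the number of words in the document, and $\beta$ holds the word probabilities of each topic:

```latex
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta)
```

Read from left to right, a topic mixture is drawn once per document, then a topic is drawn for each word position, and finally the word itself is drawn from that topic. Fitting LDA to the transcripts amounts to estimating $\alpha$ and $\beta$ so that this model explains the observed words well.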
By following these tasks, the application will be able to analyze any number of audio files and categorize the topics they contain.
Application Testing strategy:
Testing a big data application requires careful testing and validation of the application's database, performance and functionality.
a) Database testing: the database content should be tested and verified at several points:
I. Pre-data validation: test and validate the input before storing it in the database. For this project, we must test whether the audio files have the format accepted by the application.
II. Process validation: validate the data used during processing and check that the data passed between steps is well formed. We focus on validating the accuracy and readability of the text produced by the audio conversion step. At this step, we also test the validity of the analytic process by observing the accuracy of the analytics output.
III. Post-data validation: test and validate the final output of the application.
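As an illustration of pre-data validation, the following is a minimal sketch that checks whether a file starts with a WAV header. A WAV file is a RIFF container, so its first four bytes are "RIFF" and bytes 8 to 11 are "WAVE". The class WavCheck is hypothetical, and the check only confirms the container type; it does not verify sample rate, encoding or audio quality.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class WavCheck {

    // A RIFF/WAVE file starts with "RIFF", four size bytes, then "WAVE".
    static boolean hasWavHeader(byte[] head) {
        if (head == null || head.length < 12) {
            return false;
        }
        String riff = new String(head, 0, 4, StandardCharsets.US_ASCII);
        String wave = new String(head, 8, 4, StandardCharsets.US_ASCII);
        return riff.equals("RIFF") && wave.equals("WAVE");
    }

    // Reads the first 12 bytes of the file; unreadable or too-short files are rejected.
    static boolean isWavFile(Path file) {
        try (InputStream in = Files.newInputStream(file)) {
            return hasWavHeader(in.readNBytes(12));
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Pass file paths on the command line to check them.
        for (String name : args) {
            String verdict = isWavFile(Path.of(name)) ? "accepted" : "rejected";
            System.out.println(name + " -> " + verdict);
        }
    }
}
```

Checking the header rather than the file extension catches files that were renamed or truncated before they reach the conversion step.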
b) Performance testing:
Performance testing measures the performance of the cluster and the application. The test starts with cluster setup, then identifies the job and the load to be executed. A custom script matched to that load is then used to test the performance. If any result is unsatisfactory, the cluster components are tuned until the output is satisfactory.
c) Functional testing:
Functional testing checks the front end of the application against the user requirements. The test mainly compares the application output with the expected output to detect any errors.
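One possible way to compare an output transcript with an expected transcript is the word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the expected text into the output, divided by the number of expected words. This metric is an illustrative choice, not one prescribed by the project, and the class WordErrorRate and the sample sentences are hypothetical.

```java
import java.util.Locale;

public class WordErrorRate {

    // Word-level edit distance between reference and hypothesis,
    // divided by the number of words in the reference.
    static double wer(String reference, String hypothesis) {
        String[] ref = tokens(reference);
        String[] hyp = tokens(hypothesis);
        if (ref.length == 0) {
            throw new IllegalArgumentException("reference must contain at least one word");
        }
        int[] prev = new int[hyp.length + 1];
        for (int j = 0; j <= hyp.length; j++) {
            prev[j] = j; // turning an empty reference into j words takes j insertions
        }
        for (int i = 1; i <= ref.length; i++) {
            int[] cur = new int[hyp.length + 1];
            cur[0] = i; // turning i reference words into nothing takes i deletions
            for (int j = 1; j <= hyp.length; j++) {
                int substitution = prev[j - 1] + (ref[i - 1].equals(hyp[j - 1]) ? 0 : 1);
                int deletion = prev[j] + 1;
                int insertion = cur[j - 1] + 1;
                cur[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            prev = cur;
        }
        return (double) prev[hyp.length] / ref.length;
    }

    // Lower-cases and splits on whitespace; an empty string yields no tokens.
    static String[] tokens(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        // Hypothetical expected and recognized sentences.
        System.out.println(wer("i have a sore throat", "i have sore throat"));
    }
}
```

A value of 0 means the output matches the expected text word for word, and larger values mean more errors. The same measure could also support the process validation step, provided a set of manually checked transcripts is available.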
Plan:
The following Gantt chart shows the timetable to be followed in the development process.
Risk assessment report:
The following risks are expected for this project:
– Testing and verifying the system performance and accuracy. Every test is mandatory, because any skipped test or missed error can affect the storage and the performance of the application.
– Communication problems between nodes in the computer cluster.
List of References:
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993–1022.