Educational settings produce many types of data, and the recent growth of technology-driven learning environments has made even more data about students and their cognitive learning available. This source of data has paved the way for Educational Data Mining (EDM) to emerge as a discipline of data mining. The main goal of data mining is to analyze large amounts of data in search of knowledge that can serve as valuable support for decision-making. Because EDM analyzes educational data using algorithms and statistics, it inherently ties several disciplines together, including computer science, statistics, educational research, and psychology. Computer scientists and statisticians design the algorithms and formulas that extract useful information from large amounts of data. Educational researchers and psychologists apply concepts of teaching and the learning process to better inform the final decision makers.
The first stage of any type of data mining is preprocessing the data. In EDM, this phase transforms raw educational data into a format that can be quickly analyzed by algorithms to answer specific questions. The preprocessing stage has three main steps: data gathering, data cleaning, and data transformation. Data gathering is the first step, and it involves collecting all relevant types of data (profile, content, and communication) into one data repository. This step is necessary because educational data is often gathered from many different sources. In addition, students of various levels often have not completed all assignments, which results in missing data, so it is important to choose the data that best represents the attributes researchers want to analyze. The second preprocessing step is data cleaning, during which problems such as missing data and outliers are discovered. When data is missing, one solution is to fill in the missing pieces with a global value, such as a constant or the attribute's overall average; another is simply to label those parts of the data as missing or NULL. If a value differs widely from the general behavior of the other data, an outlier has been detected. Usually, it is best to remove outliers because their values can significantly skew the results for the rest of the data. Data transformation is the third and final step of preprocessing educational data. During this step, data is transformed into a format that is easily evaluated by the algorithms that answer specific educational research questions.
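The three preprocessing steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the student identifiers, profile levels, and scores are invented, the outlier rule (distance from the median measured in median absolute deviations) is one common choice rather than a method taken from the EDM literature cited here, and the transformation shown is simple min-max scaling.

```python
from statistics import mean, median

def gather(profiles, scores):
    """Step 1: merge records from two sources into one repository."""
    return {
        sid: {"level": profiles.get(sid), "score": scores.get(sid)}
        for sid in sorted(set(profiles) | set(scores))
    }

def clean(repo, k=5.0):
    """Step 2: remove outlier scores, then fill missing scores with the global mean."""
    present = [r["score"] for r in repo.values() if r["score"] is not None]
    med = median(present)
    mad = median([abs(x - med) for x in present])  # median absolute deviation
    # Assumed rule: a score further than k * MAD from the median is an outlier.
    kept = {
        sid: dict(r)
        for sid, r in repo.items()
        if r["score"] is None or abs(r["score"] - med) <= k * mad
    }
    fill = mean([r["score"] for r in kept.values() if r["score"] is not None])
    for r in kept.values():
        if r["score"] is None:
            r["score"] = fill  # global value used for missing data
    return kept

def transform(repo):
    """Step 3: rescale scores to the range [0, 1] for later algorithms."""
    lo = min(r["score"] for r in repo.values())
    hi = max(r["score"] for r in repo.values())
    span = (hi - lo) or 1  # avoid division by zero when all scores are equal
    return {sid: (r["score"] - lo) / span for sid, r in repo.items()}

# Hypothetical data: s3 never submitted a score, s7 looks like an entry error.
profiles = {f"s{i}": "beginner" if i < 4 else "advanced" for i in range(1, 8)}
scores = {"s1": 70, "s2": 80, "s4": 75, "s5": 85, "s6": 90, "s7": 1000}

scaled = transform(clean(gather(profiles, scores)))
```

In this toy data set the score of 1000 is dropped as an outlier, the missing score for s3 is filled with the mean of the remaining scores, and every remaining score is then scaled relative to the lowest and highest values.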
Recorded data about website usage is stored in what is called a Web log file. A main challenge of EDM is deriving useful information from these massive logs. Web log files contain information such as user identification, interaction, page URL, and time spent. The data mining aspect of EDM follows preprocessing steps that are common to all types of data mining. Preprocessing data involves data cleaning, data normalization, data reduction, and removal of outliers.
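As a rough illustration of turning a Web log into something analyzable, the sketch below parses a made-up log and totals the time each user spent on each page. The comma-separated line format, the field order, and the sample URLs are assumptions for this example; real course management systems each use their own log formats.

```python
from collections import defaultdict

# Hypothetical log: user id, interaction, page URL, seconds spent.
LOG = """\
u1,view,/course/intro,120
u2,view,/course/intro,45
u1,submit,/course/quiz1,300
u2,view,/course/quiz1,30
malformed line
u1,view,/course/intro,60
"""

def parse_line(line):
    """Return (user, interaction, url, seconds), or None for an unusable line."""
    parts = line.strip().split(",")
    if len(parts) != 4:
        return None
    user, interaction, url, seconds = parts
    try:
        return user, interaction, url, int(seconds)
    except ValueError:
        return None

def time_per_user_page(log_text):
    """Data reduction: collapse individual log entries into per-user, per-page totals."""
    totals = defaultdict(int)
    for line in log_text.splitlines():
        record = parse_line(line)
        if record is None:
            continue  # data cleaning: skip lines that cannot be parsed
        user, _interaction, url, seconds = record
        totals[(user, url)] += seconds
    return dict(totals)

totals = time_per_user_page(LOG)
```

The resulting dictionary is much smaller than the raw log and can be passed to later analysis steps such as classification or clustering.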
Several techniques for analyzing data are widely used in EDM. However, the important bridge between analyzing data and communicating results to an audience is the way the data is presented, which is also called information visualization. The main goal of information visualization can be defined as “the display of data with the aim of maximizing comprehension rather than photographic realism” [3]. Typically, with EDM data sets, people derive more information from viewing a graphical representation of the data than from reading text. A proper graphical representation shows the data clearly and encourages a viewer to compare different pieces of data from several perspectives. One use of information visualization in EDM is to analyze educational software that gives teachers feedback on the status of students as they progress through subject materials. Information visualization can be applied to three main types of educational software: user models, online communications, and student tracking data. User models represent a general user’s knowledge and goals for a particular subject. QV is one specific type of user modeling, implemented as an interface that gives a hierarchical ordering of known and unknown concepts. In Figure 1.1, white shapes indicate known topics and black shapes indicate unknown topics. QV is useful for cases where many components need to be displayed. Another type of user modeling is implemented by ViSMod, an interactive tool that displays a Bayesian model of a student’s learning. ViSMod is interesting because it considers both instructor and student opinions to represent the complex relationships involved in the learning process. As seen in Figure 1.2, ViSMod displays this network of relationships using various colors and lengths of connecting edges.
A third type of user modeling, E-KERMIT, is an extension of the Knowledge-Based Entity Relationship Modelling Intelligent Tutor (KERMIT) that uses histograms to model a student’s knowledge of the domain or course materials. In Figure 1.3, the blackened parts of the histogram are the parts of the course that a student knows correctly, and the gray parts represent incorrect knowledge in that category of the domain. Another main type of educational software is online communications, which usually take the form of discussion forums where time spent or the number of postings can be measured. Simuligne is one tool of this type; it analyzes social network interactions and communications to build a graphical network with statistical measures assigned to each individual’s interactions. Figure 2.1 shows this type of representation. A second type of online communication modeling is called PeopleGarden. PeopleGarden uses flowers in a garden to represent the amount of time spent on a message board and the number of postings to it. In Figure 2.2, a flower’s height denotes the amount of time an individual spends on the message board, and the flower’s petals represent that individual’s postings. Student tracking data is the third major kind of educational software, and it can be used for distance learning and online learning. One example is implemented in course management systems, which record student activity in a large log. A current issue is converting this textual log into a visualization that is helpful to the instructor leading the course. ViSION is one program that displays student interactions, which helps students with group projects. Another program, CourseVis, uses graphical representations to analyze student tracking data. CourseVis focuses on giving instructors access to the social, cognitive, and behavioral aspects of distance students enrolled in online courses that use course management systems.
Figure 3.1 shows an example of CourseVis summarizing student access to coursework for an online class. A third implementation of student tracking data representation is called GISMO. GISMO stands for Graphical Interactive Student Monitoring, and it is a tool used to confirm a student’s attendance, readings, and assignment submissions for an online course.
A major resource for the EDM community is the Pittsburgh Science of Learning Center DataShop (PSLC DataShop), a large educational data repository open to researchers involved in the EDM discipline. Currently, the PSLC DataShop focuses on data from educational software such as online courses, tutoring systems, virtual labs, online assessment systems, and simulations. The PSLC DataShop has a specific format for entering data, which keeps all data in a uniform standard and in turn facilitates analysis and representation.
A core technique for analyzing educational data is classification. Classification allows researchers to predict academic success and course outcomes and to adapt the next task in a learning environment. The most common methods of classification are “decision trees, Bayesian networks, neural networks, K-nearest neighbor classifiers, support vector machines, and different kinds of regression-based techniques” [3]. The first question that needs to be answered is whether a classifier is discriminative or generative. A discriminative classifier determines a value or boundary that separates the classes, whereas a generative classifier models the distribution of each class and uses it to define the probability that data belongs to a certain class. After this analysis, researchers must consider the classification accuracy, which is the proportion of rows in the data set that are classified correctly. An important issue that may arise is overfitting, which means a model fits the actual data so closely, including its outliers and errors, that it is unable to generalize to new data.
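Classification accuracy and overfitting can be made concrete with a small sketch. The data below is invented (hours studied paired with a pass or fail label, including one deliberately noisy training row), and the two classifiers are hypothetical stand-ins: a rule that memorizes the training rows by copying the label of the nearest one, and a simple threshold rule.

```python
def accuracy(predict, rows):
    """Proportion of rows whose predicted label matches the recorded label."""
    correct = sum(1 for x, label in rows if predict(x) == label)
    return correct / len(rows)

# Hypothetical rows: (hours studied, 1 = passed, 0 = failed).
# The training row (2.5, 1) is noise: an unusual pass after little study.
train = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, 1), (2.5, 1)]
test = [(2.4, 0), (2.6, 0), (3.2, 0), (4.5, 1), (5.5, 1)]

def memorizer(x):
    """Copies the label of the closest training row, noise included."""
    return min(train, key=lambda row: abs(row[0] - x))[1]

def threshold_rule(x):
    """A simpler model: predict a pass at 3.5 hours or more (assumed cutoff)."""
    return 1 if x >= 3.5 else 0

results = {
    "memorizer": (accuracy(memorizer, train), accuracy(memorizer, test)),
    "threshold": (accuracy(threshold_rule, train), accuracy(threshold_rule, test)),
}
```

The memorizing rule classifies every training row correctly but does worse on the held-out rows because it has also learned the noisy row, while the threshold rule misses that one training row and generalizes better. Comparing accuracy on data the model has not seen is one common way to notice overfitting.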
Decision trees use a tree-like structure in which each node represents a test on a characteristic of the data and each branch from a node is a possible outcome of that test. Bayesian networks model probabilistic relationships between variables of interest, and they can be useful for finding cause-effect relationships. Neural networks are interconnected groups of nodes whose connections are adjusted throughout a learning phase, which is useful for modeling the complex associations between inputs and outputs of educational data. K-nearest neighbor algorithms are used in EDM to classify a record according to the records most similar to it, where similarity is measured on an attribute such as GPA. Support vector machines are supervised machine learning algorithms that classify data by looking for a boundary that separates one class from another. Finally, regression-based techniques are used to predict numeric values from a data set.
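To show what "a test at each node, an outcome on each branch" looks like in practice, here is a minimal hand-built decision tree. The attributes (attendance and quiz average), the thresholds, and the class labels are all hypothetical; in real use the tree would be learned from data by a decision tree algorithm rather than written by hand.

```python
# Each internal node holds a test (attribute, threshold) and two branches;
# each leaf is simply a class label.
TREE = {
    "test": ("attendance", 0.8),  # is attendance >= 0.8?
    "yes": {
        "test": ("quiz_avg", 60),  # is the quiz average >= 60?
        "yes": "pass",
        "no": "at risk",
    },
    "no": "at risk",
}

def classify(node, student):
    """Follow the branch chosen by each node's test until a leaf is reached."""
    while isinstance(node, dict):
        attribute, threshold = node["test"]
        node = node["yes"] if student[attribute] >= threshold else node["no"]
    return node

label = classify(TREE, {"attendance": 0.9, "quiz_avg": 75})
```

A student with high attendance and a good quiz average follows the two "yes" branches to the "pass" leaf, while any student who fails either test ends at an "at risk" leaf.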
Clustering groups similar data together in a way that improves information analysis. K-means clustering divides a data set into a user-specified number, k, of groups or clusters. Although k-means is easier to implement and more efficient, there is also fuzzy c-means clustering, in which each cluster corresponds to a “fuzzy” set over the entire data set, so that a data point can belong to more than one cluster to varying degrees.
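The k-means procedure can be sketched as follows. This is a bare-bones version for illustration: the points are invented (for example, hours online per week and forum postings), and the initial centroids are simply the first k points, which is an assumption made to keep the example deterministic. Practical implementations usually choose starting centroids more carefully.

```python
from math import dist

def kmeans(points, k, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    centroids = [points[i] for i in range(k)]  # assumed init: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return centroids, clusters

# Hypothetical students: (hours online per week, forum postings).
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, k=2)
```

With k set to 2, the three low-activity students end up in one cluster and the three high-activity students in the other. Fuzzy c-means would instead give every student a degree of membership in both clusters.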
Each technique is implemented by many application programs that provide an interface for applying these methods to actual data. Although there are many ways to analyze educational data, most of these approaches focus on the student rather than the teacher. The amount of information given to educators does not matter if they do not use it to make better decisions in our nation’s schools.