1. Introduction

Logical analysis of data (LAD) is a mathematical methodology that combines ideas and concepts from optimization, combinatorics, and the theory of Boolean functions. The fundamental concept in LAD is that of patterns, or rules, which have been found to play a decisive role in classification, clustering, detection of subclasses, feature selection, the development of pattern-based decision support systems, medical diagnosis, marketing, and other problems.

The research area of LAD was initiated by Peter L. Hammer in 1986, under whose guidance the methodology became successful in many data analysis applications. LAD has been applied in many data analysis disciplines, including economics and business, etiology, and oil exploration, but one of its major goals is to classify new observations using the prior knowledge contained in a supervised data set. The available information for each observation consists of a vector of attribute values together with its outcome. LAD aims at detecting logical patterns in the training set that distinguish the observations of one class from all the other observations. Other approaches to this problem have been proposed, based on different considerations and data models; established ones include neural networks, support vector machines, k-nearest neighbors, Bayesian approaches, decision trees, logistic regression, and Boolean approaches.

This overview presents some of the basic aspects of LAD, from the definition of the main concepts to efficient algorithms for pattern generation, and proposes an original enhancement to this methodology based on statistical considerations on the data. Logical analysis of data was originally developed for the analysis of datasets whose attributes take only binary (0-1) values. Since it turned out later that most real-life applications include attributes taking real values, a "binarization" method was proposed. This is done by using the training set to compute specific values for each field, called cutpoints, that split each field into binary attributes. The selected binary attributes constitute a support set and are combined to generate logical rules called patterns, which are then used to classify each unclassified record.

In this paper, we propose the following enhancement to the LAD methodology: evaluating the correlation coefficient of each binary attribute with all the other attributes, together with a correlation analysis of each attribute with the result, the result being the attribute specifying the class label to which the observation belongs. Correlation between two attributes is defined as the linear relationship between them, i.e., how the two attributes are related to each other and what kind of relationship they follow. It is measured by the correlation coefficient, which may be positive as well as negative. If the correlation coefficient of two attributes is close to +1 or -1, the two attributes are nearly linearly dependent. For a pair of independent attributes the correlation coefficient is zero, but the converse is not true: since the correlation coefficient measures only linear dependence, two attributes with zero correlation coefficient are either independent or simply not linearly related.
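The correlation coefficient mentioned above can be computed directly from its definition. The following sketch (the function name and the toy archive are illustrative, not part of the methodology's specification) computes the Pearson correlation of each binary attribute column with the result attribute:

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    # A constant attribute has zero variance; treat its correlation as 0.
    return cov / (sx * sy) if sx and sy else 0.0

# Toy archive: rows are observations, columns are binary attributes.
attributes = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 0],
]
labels = [1, 1, 0, 0]  # the "result" attribute (class label)

# Correlation of each attribute column with the result.
for j in range(3):
    col = [row[j] for row in attributes]
    print(j, round(pearson(col, labels), 3))
```

In this toy archive the first attribute coincides with the labels and gets coefficient 1.0, while the other two are uncorrelated with the result.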

2. Notation and terminology

The input information in the problems to be studied in this paper consists of an archive of past observations denoted by S. Each observation is an n-dimensional vector having as components the values of n attributes and is accompanied by the indication of the particular class (e.g., positive or negative) this observation belongs to. We can think of the classification as being defined by a partition of S into two sets S+ and S- representing the positive and the negative observations, respectively. A set of records used for evaluating the performance of the learned classifier is called test set T. We compare the classification of T given by the learned classifier, also called predicted classification, to the real classification of T: the differences are the classification errors of our classifier.

An archive (S+, S-) of the type described above can be naturally represented by a partially defined Boolean function Φ, i.e., a mapping Φ: S → {0, 1}, where S is viewed as a subset of {0, 1}^n. Any completely defined Boolean function (i.e., a mapping {0, 1}^n → {0, 1}) which agrees with all the classifications in the archive will be called an extension of Φ. An extension of Φ is a function f that agrees with Φ; that is, if x is one of the given data points then f(x) = 1 if and only if x is classified as positive in Φ. In a sense, the extension explains the given data, and it is to be hoped that it generalizes well to other data points, so far unseen. A common special class of Boolean functions frequently used for choosing an extension is the class of threshold (or linearly separable) functions, in which the classification is decided by whether a weighted sum of the attributes does or does not exceed a certain threshold.

In this technique a support set D of variables is found such that no positive data point agrees with a negative data point. Once a support set has been found, one then looks for patterns. These are conjunctions of literals which are satisfied by at least one positive example in Φ but by no negative example. We then take as the extension f the disjunction of a set of patterns which together cover all positive examples (that is, which are such that each positive example satisfies some pattern). There are some variants on this method. It is also possible to make use of negative patterns. A negative pattern is a conjunction of literals which is satisfied by at least one negative example and by no positive example.
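The definition of a positive pattern in the paragraph above can be stated as a small predicate. The following sketch (the encoding of a pattern as a dictionary of literals is an assumption made for illustration) checks whether a conjunction of literals is a positive pattern for a given archive:

```python
def satisfies(point, pattern):
    """True iff the 0/1 point satisfies every literal of the conjunction.
    A pattern is encoded as {attribute index: required value}."""
    return all(point[i] == v for i, v in pattern.items())

def is_positive_pattern(pattern, positives, negatives):
    """A positive pattern is satisfied by at least one positive
    observation and by no negative observation."""
    return (any(satisfies(p, pattern) for p in positives)
            and not any(satisfies(q, pattern) for q in negatives))

positives = [(1, 0, 1)]
negatives = [(0, 0, 1), (1, 0, 0)]
print(is_positive_pattern({0: 1, 2: 1}, positives, negatives))  # True
print(is_positive_pattern({2: 1}, positives, negatives))        # False
```

A negative pattern check is obtained by swapping the roles of the two sets.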

Consider, for example, the partially defined Boolean function given in Table 1. This table represents an archive of three positive and three negative observations (the rows of the table), expressed in terms of three attributes (the columns of the table). It can be seen that a1a3 is a positive pattern. Indeed, 1) there is no negative observation with a1 = a3 = 1 (i.e., with a1a3 = 1), and 2) there is a positive observation (the third row) with a1a3 = 1. Moreover, since a1 covers the negative observation in the fourth row and a3 covers the negative observation in the fifth row, it follows that neither a1 nor a3 alone is a positive pattern. Therefore, a1a3 is a prime positive pattern. It can be checked that the positive prime patterns of this function are a2 and a1a3, while the negative prime patterns are a1a2, a1a3, and a2a3.

3. Binarization

Logical analysis of data was initially developed for binary attributes, i.e., attributes that take the values 0 and 1. However, it was found that most real-world applications have attributes taking real values: either the data is continuous in nature, or it is categorical with more than two classes. A method to binarize such attributes was therefore proposed, known as "binarization".

The binarization method associates several binary attributes to each real-valued attribute. The value of each new binary attribute is determined by comparing the numerical value of the original attribute with a corresponding threshold: if the attribute's value exceeds the threshold, the binary attribute is assigned the value 1, otherwise 0. The basic idea of binarization is to find a minimum number of such threshold values, called cutpoints, chosen in such a way that positive and negative observations can still be distinguished.
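One common way to obtain candidate cutpoints is to take midpoints between consecutive attribute values belonging to observations of opposite classes. This is a simplified sketch under that assumption (the function name is hypothetical, and ties between equal values of opposite classes are ignored):

```python
def candidate_cutpoints(values, labels):
    """Midpoints between consecutive distinct values of one numeric
    attribute whose observations carry opposite class labels."""
    pairs = sorted(zip(values, labels))
    cuts = []
    for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
        if v1 != v2 and l1 != l2:
            cuts.append((v1 + v2) / 2)
    return cuts

# One attribute over four observations; classes switch between 2.0 and 4.0.
print(candidate_cutpoints([1.0, 2.0, 4.0, 5.0], [0, 0, 1, 1]))  # [3.0]
```

A minimum subset of such candidates that still separates the classes can then be selected, e.g. by a set-covering formulation.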

Let us consider Table 1, containing a set S+ of positive observations and a set S- of negative observations with the attributes A, B, C. To be more specific, we can consider the phenomenon S as breast cancer, where the attributes A, B, and C may represent lump size, bone density, and the age of the person, respectively.

Let us first introduce the following cutpoints

℧A = 3.0, ℧B = 2.0, ℧C = 3.0

for the attributes A, B, and C, respectively. These cutpoints convert the numerical attributes into binary ones. An observation α = (αA, αB, αC) is mapped to a binary vector y(α) = (yA, yB, yC) by setting yA = 1 iff αA > ℧A, yB = 1 iff αB > ℧B, and yC = 1 iff αC > ℧C. The result of this binarization of Table 1 is given in Table 2.

Although a single cutpoint was introduced here for each numerical attribute, the idea can be extended to multiple cutpoints: if K cutpoints are introduced for a real-valued attribute A, then A is converted into a K-dimensional Boolean vector.
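The mapping just described can be sketched in a few lines (a minimal illustration; the function name and data layout are assumptions):

```python
def binarize(observation, cutpoints):
    """Map a numeric observation to a 0/1 vector: one binary attribute
    per (attribute, cutpoint) pair, set to 1 iff the value exceeds
    the cutpoint. cutpoints[j] lists the cuts for attribute j."""
    return [1 if observation[j] > c else 0
            for j in range(len(observation))
            for c in cutpoints[j]]

# One cutpoint per attribute, matching the cuts 3.0, 2.0, 3.0 above:
print(binarize([4.0, 1.5, 3.5], [[3.0], [2.0], [3.0]]))  # [1, 0, 1]

# K = 3 cutpoints turn one attribute into a 3-dimensional Boolean vector:
print(binarize([2.5], [[1.0, 2.0, 3.0]]))  # [1, 1, 0]
```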

4. Support set minimization

The set of binary attributes generated through the binarization method is very likely to contain a number of redundant attributes. Such attributes increase the computational burden and hence should be eliminated. The major concern is therefore to reduce the size of the obtained dataset by eliminating redundant attributes in such a way that no positive observation coincides with a negative one. A set of attributes is called a support set if, when the observations are restricted to these attributes, the positive and negative observations remain disjoint. A support set is called irredundant if no proper subset of it is a support set.
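The support-set and irredundancy conditions can be made concrete as follows. This sketch checks the separation property and applies a greedy elimination (a simple illustrative strategy, not the paper's own algorithm; it yields an irredundant support set, though not necessarily a minimum one):

```python
def separates(attrs, positives, negatives):
    """True iff, restricted to the given attribute indices, no positive
    observation coincides with a negative one (attrs is a support set)."""
    proj = lambda p: tuple(p[i] for i in attrs)
    return not (set(map(proj, positives)) & set(map(proj, negatives)))

def irredundant_support_set(positives, negatives):
    """Greedily drop attributes while the separation is preserved."""
    attrs = list(range(len(positives[0])))
    for a in list(attrs):
        if separates([x for x in attrs if x != a], positives, negatives):
            attrs.remove(a)
    return attrs

positives = [(1, 0, 1), (1, 1, 1)]
negatives = [(0, 0, 1), (0, 1, 0)]
print(irredundant_support_set(positives, negatives))  # [0]
```

Here attribute 0 alone already separates the two classes, so both other attributes are redundant.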

Different methodologies have been proposed to select a small support set such that no information is lost when the redundant attributes are eliminated. One interesting approach to avoiding this loss of information is to evaluate the quality of each attribute; this evaluation then determines which binary attributes are selected to form the support set.

Another simple approach to identify a minimal support set is based on correlation analysis. The basic idea of this approach is to identify the relationship or association existing between the attributes. This relationship or connection between two or more attributes is known as correlation and is measured through correlation coefficient.

One way to minimize the support set involves computing the correlation coefficient of each attribute with the result variable; an attribute can be eliminated if the absolute value of its correlation coefficient is below a given threshold. Another way consists of calculating the correlation coefficient of each attribute with all the other attributes: all variables that are correlated with each other above a threshold are removed and replaced by a single variable, whose value is the weighted average of the values of the original attributes, the weight of each attribute being its correlation coefficient with the result.
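The second reduction above (merging mutually correlated attributes into a weighted average) can be sketched as follows. The greedy grouping and the function names are illustrative assumptions; the text does not prescribe how correlated groups are formed:

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def merge_correlated(columns, result, threshold=0.9):
    """Replace each greedily-formed group of mutually correlated
    attribute columns by one column: the average of their values,
    weighted by each attribute's correlation with the result."""
    merged, used = [], set()
    for i, ci in enumerate(columns):
        if i in used:
            continue
        group = [i] + [j for j in range(i + 1, len(columns))
                       if j not in used
                       and abs(pearson(ci, columns[j])) >= threshold]
        used.update(group)
        w = [pearson(columns[j], result) for j in group]
        tot = sum(w) or 1.0  # avoid division by zero for zero weights
        merged.append([sum(wk * columns[j][k] for wk, j in zip(w, group)) / tot
                       for k in range(len(ci))])
    return merged
```

For example, two identical columns collapse into a single column while an uncorrelated one stays separate.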

5. Pattern Generation

The key concept of logical analysis of data is the pattern. A pattern is a combination of attribute values that occurs together only in some of the observations. A positive pattern P+ covers at least one positive observation but no negative one, and a negative pattern P- is defined symmetrically. The basic idea is to select a subset of the detected patterns such that each observation point is covered by at least one pattern.

A hybrid bottom-up/top-down approach is used for pattern generation. In this approach, short patterns are generated in a bottom-up fashion; this, however, may leave some observations uncovered. To cover these observations, a top-down approach is then adopted, generating additional patterns that are subsequently simplified by removing literals from them.
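The bottom-up phase can be sketched as an enumeration of short conjunctions by increasing degree, keeping those that cover at least one positive and no negative observation (a brute-force illustration of the idea, not the paper's optimized algorithm):

```python
from itertools import combinations, product

def patterns_up_to_degree(positives, negatives, max_degree):
    """Enumerate all conjunctions of at most max_degree literals
    (an attribute fixed to 0 or 1) that cover at least one positive
    and no negative observation."""
    n = len(positives[0])
    found = []
    for d in range(1, max_degree + 1):
        for idxs in combinations(range(n), d):
            for vals in product((0, 1), repeat=d):
                pat = dict(zip(idxs, vals))
                covers = lambda x: all(x[i] == v for i, v in pat.items())
                if (any(covers(p) for p in positives)
                        and not any(covers(q) for q in negatives)):
                    found.append(pat)
    return found

positives = [(1, 1), (1, 0)]
negatives = [(0, 0)]
print(patterns_up_to_degree(positives, negatives, 1))
```

On this toy archive the degree-1 positive patterns are "attribute 0 equals 1" and "attribute 1 equals 1".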

An adequate selection of the detected patterns is used to build a classification rule for new observations. In this method, a weighted sum over both positive and negative patterns determines the class of a new observation; this weighted sum is known as the discriminant. Suppose P1, P2, …, Pr are the positive patterns and N1, N2, …, Ns are the negative patterns. The discriminant of an observation x is given by

Δ(x) = w1 P1(x) + … + wr Pr(x) + v1 N1(x) + … + vs Ns(x),

where Pi(x) (respectively, Nj(x)) equals 1 exactly when x satisfies the corresponding pattern, the weights wi are non-negative, and the weights vj are non-positive.
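The discriminant and the threshold rule discussed below can be sketched directly (patterns are encoded as dictionaries of literals; the function names and the return convention +1/-1/0, with 0 meaning "unclassified", are illustrative assumptions):

```python
def discriminant(x, pos_patterns, neg_patterns, w, v):
    """Weighted sum of the patterns satisfied by observation x;
    the weights w are non-negative, the weights v non-positive."""
    sat = lambda pat: all(x[i] == val for i, val in pat.items())
    return (sum(wi for wi, p in zip(w, pos_patterns) if sat(p))
            + sum(vj for vj, q in zip(v, neg_patterns) if sat(q)))

def classify(x, pos_patterns, neg_patterns, w, v, threshold):
    """Classify only when |discriminant| exceeds the threshold."""
    d = discriminant(x, pos_patterns, neg_patterns, w, v)
    if d > threshold:
        return 1
    if d < -threshold:
        return -1
    return 0  # insufficient evidence: left unclassified

pos_patterns, neg_patterns = [{0: 1}], [{1: 0}]
w, v = [1.0], [-1.0]  # equal-weight scheme
print(classify((1, 1), pos_patterns, neg_patterns, w, v, 0.5))  # 1
print(classify((1, 0), pos_patterns, neg_patterns, w, v, 0.5))  # 0
```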

There are multiple ways of assigning non-negative (non-positive) weights to the positive (negative) patterns. The simplest approach is to assign equal weights to all patterns, giving them equal importance. Alternatively, the weight of a pattern can be determined by the number of observation points it covers. Using the degree of a pattern as a criterion for assigning weights is another reasonable way to capture the relative importance of patterns.

The sign of the discriminant indicates whether the new observation is positive or negative, but a value close to zero is insufficient to determine its character. Therefore, a new observation is classified only if the absolute value of the discriminant exceeds a problem-dependent threshold.
