Cardiff University
MAT002: Statistical Methods
Individual Coursework
Kemos Dimitrios C1753774
Question 1
For the statistical analysis of this data set, a Two-Way ANOVA is used to investigate whether an interaction exists between the two independent variables, family type and schooling type, in their effect on the dependent variable, academic achievement. In order to continue with the analysis, we have to ensure that our data set satisfies the requirements of the Two-Way ANOVA:
The dependent variable is measured at the continuous level and is normally distributed according to the Shapiro-Wilk normality test (p = 0.610).
Our observations are independent.
The two independent variables each consist of at least two categorical, independent groups.
Homogeneity of variances exists according to Levene’s test of equality of variances (p = 0.491), demonstrated in Table 1.1.
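As a hedged illustration, the two assumption checks above can be reproduced with SciPy. The scores below are synthetic stand-ins for the coursework data, not the actual data set:

```python
import numpy as np
from scipy.stats import shapiro, levene

# Synthetic stand-in for the 2 x 2 design: four cells of five
# achievement scores each (NOT the actual coursework data).
rng = np.random.default_rng(42)
cells = [rng.normal(loc=m, scale=8.0, size=5) for m in (50, 55, 70, 60)]

# Shapiro-Wilk on the pooled scores tests the normality assumption.
w_stat, p_normality = shapiro(np.concatenate(cells))

# Levene's test across the four cells checks homogeneity of variances.
l_stat, p_levene = levene(*cells)
```

A non-significant p-value in each test (as reported above for the real data) means the assumption is not rejected.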
Assuming that all the above requirements are met, a Two-Way ANOVA was conducted, and the results obtained, according to Figure 3, are the following:
There was a statistically significant interaction between the effects of family type and schooling type on academic achievement, F(1,16) = 62.468, p < .0005.
Since a statistically significant interaction exists, we can take into consideration the simple main effects generated through the Two-Way ANOVA procedure to investigate the effects within each factor. Simple main effects analysis showed that people who grew up in a dual-family type had significantly better academic achievement, F(1,16) = 119.514, p < .005, than people who grew up in a single-family type.
Furthermore, simple main effects analysis indicated that there was no significant difference in academic achievement between home and public schooling.
Academic achievement by schooling type was also analysed with a simple main effects analysis: home schooling influenced academic achievement, F(1,16) = 15.401, p = .001, and so did public schooling, F(1,16) = 52.608, p < .005.
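The balanced Two-Way ANOVA used above can be sketched from first principles. This is a minimal NumPy implementation on synthetic data; the function name `two_way_anova` and the cell means are my own illustrative choices, not the coursework values:

```python
import numpy as np
from scipy.stats import f as f_dist

def two_way_anova(y):
    """Balanced two-way ANOVA with interaction.
    y has shape (a, b, n): levels of factor A, levels of factor B, replicates.
    Returns {effect: (F, df_effect, df_error, p)}."""
    a, b, n = y.shape
    grand = y.mean()
    mean_a = y.mean(axis=(1, 2))   # marginal means of factor A
    mean_b = y.mean(axis=(0, 2))   # marginal means of factor B
    mean_ab = y.mean(axis=2)       # cell means
    ss_a = b * n * ((mean_a - grand) ** 2).sum()
    ss_b = a * n * ((mean_b - grand) ** 2).sum()
    ss_ab = n * ((mean_ab - mean_a[:, None] - mean_b[None, :] + grand) ** 2).sum()
    ss_err = ((y - mean_ab[:, :, None]) ** 2).sum()
    df_err = a * b * (n - 1)
    ms_err = ss_err / df_err
    out = {}
    for name, ss, df in [("A", ss_a, a - 1), ("B", ss_b, b - 1),
                         ("AxB", ss_ab, (a - 1) * (b - 1))]:
        F = (ss / df) / ms_err
        out[name] = (F, df, df_err, f_dist.sf(F, df, df_err))
    return out

# Synthetic 2 x 2 design with 5 scores per cell, so df_error = 16
# as in the results reported above.
rng = np.random.default_rng(0)
scores = rng.normal(loc=[[50, 55], [70, 60]], scale=5.0, size=(5, 2, 2)).T
result = two_way_anova(scores)
```

Each effect is tested against the same error term, which is how the common df of 16 arises in all of the reported F ratios.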
Question 2
Exploring the problem and the data set involved, it is observed that the data follow a hierarchical structure, so the appropriate statistical method for the analysis is Multilevel Linear Modelling. In order to continue with the analysis, we have to ensure that our data set satisfies the requirements of Multilevel Linear Modelling:
The dependent variable is measured on a continuous scale.
A linear relationship exists between the variables.
No multicollinearity.
No outliers.
Residual errors are approximately normally distributed.
The Null Model
The first step in a multilevel analysis is to fit a null (no predictors) model to partition the variance in the outcome into its within- and between-group components. The null model for price i in model j can be presented as
Y_ij = β_0j + ε_ij    (3.1)
β_0j = γ_00 + u_0j    (3.2)
The null model provides an estimated mean price across all models. It also provides a partitioning of the variance between Level 1 (ε_ij), the variation in individual prices within models, and Level 2 (u_0j), the variation across models.
The useful information from the SPSS output for this first model is the -2LogLikelihood = 1010.155 and the total number of parameters estimated, which is 3. Furthermore, the question posed to us about explaining price across models can now be addressed. As Figure 2.1 suggests, there is significant variance to be explained within groups (Wald Z = 4.952, p < .001). However, the intercept parameter indicates that the intercepts do not vary significantly across the sample of models (Wald Z = 1.381, p = .167).
Building the Level 1 Random Intercept Model
For each price i in model j, a proposed model extending Equation 3.1 can be expressed as
Y_ij = β_0j + β_1(Age)_ij + β_2(Mileage)_ij + ε_ij    (3.3)
β_0j = γ_00 + u_0j    (3.4)
Equation 3.3 suggests that, at the individual level, the within-group Age and Mileage are related to the price.
Equation 3.4 implies that variation in intercepts can be described by a model-level intercept (γ_00) and a random parameter (u_0j).
From the SPSS output we can first quantify the difference between this model and the previous one. In the new model, -2LL = 971.754 and the number of estimated parameters is 5. Therefore:
χ²_change = 1010.155 − 971.754 = 38.401
df_change = 5 − 3 = 2
The critical value of the chi-squared distribution is 10.597 (α = .005, df = 2); since 38.401 far exceeds this, the changes made to the model are highly significant. By allowing the intercepts to vary we also obtain regression parameters for the effects of Mileage (−36.38) and Age (−907.79) on price.
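The deviance comparison between nested models can be reproduced as a likelihood-ratio test with SciPy. Here `lr_test` is a hypothetical helper of my own, the -2LL values are those quoted in the text, and the df of 2 assumes the two added fixed effects (Age and Mileage):

```python
from scipy.stats import chi2

def lr_test(neg2ll_reduced, neg2ll_full, df_change):
    """Likelihood-ratio (deviance) test for nested models fitted by ML:
    the drop in -2LL is chi-squared distributed with df equal to the
    number of added parameters."""
    stat = neg2ll_reduced - neg2ll_full
    return stat, chi2.sf(stat, df_change)

# -2LL = 1010.155 for the null model and 971.754 for the Level 1 model.
stat, p = lr_test(1010.155, 971.754, 2)
```

A p-value this far below .005 corresponds to the "highly significant" improvement in fit.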
Building the Level 2 Random Intercept Model
Next, a model-level variable is added to explain the variability in intercepts across models.
The SPSS output indicates that we are estimating 9 parameters in total in this model and that the -2LL is 969.827. Taking into account the previous values of these quantities, we have
χ²_change = 971.754 − 969.827 = 1.927
df_change = 9 − 5 = 4
The critical value of the chi-squared distribution is 9.488 (α = .05, df = 4); since 1.927 falls well below this, the changes made to the model are not statistically significant.
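For this second comparison, the deviance change (values as quoted in the text) can be set against the upper 5% chi-squared critical value for 4 degrees of freedom:

```python
from scipy.stats import chi2

stat = 971.754 - 969.827        # deviance change between the two models
crit = chi2.ppf(0.95, df=4)     # upper 5% critical value, 4 df (about 9.488)
improved = bool(stat > crit)    # does adding Level 2 predictors improve fit?
```

The deviance change of 1.927 falls short of the critical value, consistent with the non-significant Vendor and Mileage effects reported next.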
To conclude, Vendor, F(4, 49.160) = 1.821, p = .140, and Mileage, F(1, 49.152) = 1.184, p = .282, did not significantly predict the price at which a car is sold; the age of the car, on the other hand, F(1, 50.238) = 26.398, did significantly predict the price.
Question 3
Cluster analysis is a multivariate method which aims to search for patterns in a data set by grouping the observations into clusters. The main purpose of cluster analysis is to find the optimal grouping, for which the observations within each cluster are similar but the clusters themselves are dissimilar to each other (Rencher and Christensen, 2012). It is mainly used for:
Taxonomy Description: Identifying groups within the data.
Data Simplification: The ability to analyse groups of similar observations.
Relationship Identification: The simplified structure of cluster analysis portrays relationships not revealed otherwise.
The primary objective of cluster analysis is to define the structure of the data by placing the most similar observations into clusters; to do so, many techniques use an index of similarity between each pair of observations (Rencher and Christensen, 2012). This similarity expresses the degree of correspondence among objects across all the characteristics used in the analysis. There are two types of similarity measures. The first, which is less frequently used, is correlational measures, where large values of r indicate similarity. The second, and the most often used, is distance measures, where higher values represent greater dissimilarity.
There are a number of different methods that can be used to carry out a cluster analysis and they can be classified as follows:
Hierarchical Methods
Agglomerative hierarchical methods can be described as a bottom-up clustering approach in which the most similar objects are grouped first, and these initial groups are then merged according to their similarities. At the end, as the similarity decreases, all subgroups are fused into a single cluster.
Divisive hierarchical methods operate in the opposite direction to agglomerative methods, starting with one large cluster and successively splitting clusters. They are computationally demanding if all 2^(k−1) − 1 possible divisions of a cluster of k objects into two subclusters are considered at each stage (Everitt, 2010).
In hierarchical methods, divisions or fusions, once made, are irrevocable, so that when an agglomerative algorithm has joined two individuals they cannot subsequently be separated, and when a divisive algorithm has made a split it cannot be undone (Everitt, 2010).
Non-Hierarchical Methods (e.g. k-means clustering)
Non-hierarchical clustering techniques are designed to group items, rather than variables, into a collection of clusters. The number of these clusters can be specified in advance or determined as part of the clustering procedure. Because a matrix of distances does not have to be determined, and the basic data do not have to be stored during the computer run, non-hierarchical methods can be applied to much larger data sets than hierarchical techniques (Johnson, 2007).
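As a hedged sketch of the non-hierarchical approach, SciPy's `kmeans2` groups observations into a pre-specified number of clusters. The data below are illustrative, not the coursework data:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Two well-separated synthetic groups of 20 observations each.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
               rng.normal(5.0, 0.5, size=(20, 2))])

# k is specified in advance; '++' seeding spreads the initial
# centroids apart and helps avoid empty clusters.
centroids, labels = kmeans2(X, k=2, minit="++", seed=3)
```

Note that, unlike the hierarchical methods above, the assignments are not nested: re-running with a different k produces a fresh partition rather than a cut of a dendrogram.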
Before beginning the hierarchical cluster analysis, we have to validate that a low level of collinearity exists among the variables; this holds here, because upon inspection no correlation coefficient in the data provided is greater than 0.90.
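The collinearity screen just described amounts to inspecting the off-diagonal entries of the correlation matrix. A sketch with illustrative data, using the same 0.90 threshold as the text:

```python
import numpy as np

# Illustrative stand-in for the measured variables (40 cases, 3 variables).
rng = np.random.default_rng(7)
X = rng.normal(size=(40, 3))

# Off-diagonal correlations; flag any |r| > 0.90 as problematic collinearity.
R = np.corrcoef(X, rowvar=False)
off_diag = R[~np.eye(R.shape[0], dtype=bool)]
collinear = bool(np.any(np.abs(off_diag) > 0.90))
```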
The distance measure used is the squared Euclidean distance, and the linkage method is the centroid algorithm, which is recommended for use with squared Euclidean distance and is less affected by outliers (Hair, 2014). Because the variables are measured on different scales, standardising them is necessary.
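The standardise-then-cluster procedure just described can be sketched with SciPy. Note that SciPy's centroid linkage computes the Euclidean distances from the raw observations itself, and the data here are synthetic stand-ins for the cities:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import zscore

# Three synthetic groups of "cities" measured on variables with
# very different scales (NOT the coursework data).
rng = np.random.default_rng(1)
low  = rng.normal([10.0, 200.0, 5.0],  [1.0, 10.0, 0.5], size=(12, 3))
mid  = rng.normal([30.0, 400.0, 9.0],  [1.0, 10.0, 0.5], size=(14, 3))
high = rng.normal([80.0, 700.0, 14.0], [1.0, 10.0, 0.5], size=(12, 3))
X = np.vstack([low, mid, high])

# Standardise first (the variables are on different scales), then apply
# centroid linkage and cut the dendrogram into a 3-cluster solution.
Z = linkage(zscore(X, axis=0), method="centroid")
labels = fcluster(Z, t=3, criterion="maxclust")
```

The linkage matrix `Z` plays the role of SPSS's agglomeration schedule: each row records which two clusters were merged and at what distance.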
From the SPSS output, the agglomeration schedule (Table 3.1) displays the objects or clusters combined at each stage, and the dendrogram in Figure 3.1 is a visual representation of the agglomeration schedule. The next step of the analysis is to determine the number of clusters most representative of the data. Upon inspecting the agglomeration coefficients, it would be best to stop the cluster analysis after the 37th stage, eliminating the last merge, or after the 35th stage. Upon further investigation of both the 2-cluster and the 3-cluster solutions, the 3-cluster solution is found to be the most representative.
From Figure 3.2 and Table 3.2, generated from descriptive statistics, we can identify three meaningful subgroups: Cluster 1 represents cities with low SO2 content, Cluster 2 cities with medium SO2 content, and Cluster 3 cities with high SO2 content.
References
Everitt, B. (2010) Cluster Analysis. John Wiley & Sons Inc. Available at: https://www.dawsonera.com:443/abstract/9780470977804.
Hair, J. F. (2014) Multivariate data analysis. Seventh ed. Edited by W. C. Black, B. J. Babin, and R. E. Anderson. Harlow: Pearson.
Johnson, R. A. (2007) Applied multivariate statistical analysis. 6th ed. Upper Saddle River, N.J.: Pearson Prentice Hall.
Rencher, A. C. and Christensen, W. F. (2012) ‘Cluster Analysis’, in Methods of Multivariate Analysis. John Wiley & Sons, Inc., pp. 501–554. doi: 10.1002/9781118391686.ch15.