
3.5 Data analysis

The data were analysed to associate each finding with its applicable usability problem. This analysis was undertaken in three stages. The first two stages followed the multiple case study design methodology of Yin (1984), as illustrated in Figure 3.1. This method begins by analysing the individual methods within each case and interpreting the results at the case level. A comparison is then made between the cases, allowing conclusions to be drawn across the multiple cases of the overall study. Therefore, in this research, stage one involved analysing each usability method independently for each case (i.e. each e-commerce website) and identifying any usability problems found by each method of each case. The second stage involved identifying common usability problems pinpointed by each method; this was achieved by comparing each usability method across the three cases (i.e. the three e-commerce sites). A third stage was undertaken to generate a list of standardised usability problem themes and sub-themes to facilitate comparisons among the various methods. These themes and sub-themes were identified from the common usability problem areas that originated from each method, and were then used, in turn, to classify the problems that had been identified. The list was generated iteratively: after an initial analysis of the first method (the performance data and observations), new problem themes and sub-themes were added for problems that had not been covered by the standardised themes. The analysis of each method provided a description of the overall usability of the sites. This section explains how the quantitative and qualitative data obtained from the various methods employed in this research were analysed at each stage.

3.5.1 User testing analysis

Data collected during user testing were analysed in several ways. It is worth noting that the user testing participants were categorised as either novice or expert, as suggested by Nielsen (1993), who stated that 'one of the main distinctions between categories of users is that between novice and expert users.' The participants' experience with the Internet was used as the criterion for this categorisation: three years' experience using the Internet was the dividing line between novice and expert. When analysing each user testing method, the user's assignment to the novice or expert grouping was taken into consideration. This section presents the analysis of the five user testing methods.

3.5.1.1 Pre-Test questionnaires

Data collected from the pre-test questionnaires were analysed in various ways. For Sections 1 and 2, descriptive analysis was employed to describe the characteristics of novice and expert participants and their experience with online shopping. Likert scores were calculated for each statement in Sections 2 and 3 to describe the participants' overall perceptions of, and experiences with, online shopping. It should be noted that for this research, a Likert score of 1-3 was regarded as a negative response, 5-7 a positive response, and 4 a neutral response. The response values for the negative statements were reversed before calculating the Likert score. This reversal was taken into consideration during the analysis of the pre-test questionnaires, the post-test (satisfaction) questionnaires, and the heuristic checklist statements. The Mann-Whitney test was used to determine whether there was a statistically significant difference between novice and expert users' ratings of their perceptions about online shopping. This nonparametric test was the most appropriate statistical approach since the statements were measured on an ordinal scale (Conover 1971). The seven-point Likert scores were considered ordinal values because the scale does not specify whether the differences between adjacent score values are the same across the entire range. This point was illustrated by May (2001), who stated that the difference between 'agree' and 'strongly agree' is not the same as the difference between 'disagree' and 'strongly disagree'.
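As a concrete illustration of this scoring and testing procedure, the sketch below reverse-codes the negatively worded statements and runs the Mann-Whitney comparison for each statement. The file name, column names, and the set of negative statements are hypothetical; the 7-point scale, the reversal rule, and the novice/expert grouping follow the text above.

```python
# A minimal sketch of the pre-test questionnaire analysis, under the
# assumptions stated above.
import pandas as pd
from scipy.stats import mannwhitneyu

df = pd.read_csv("pretest_responses.csv")  # hypothetical file: one row per participant

NEGATIVE_STATEMENTS = ["s3", "s7"]         # hypothetical: statements worded negatively

# Reverse-code negative statements on the 7-point scale (1 <-> 7, 2 <-> 6, ...)
for col in NEGATIVE_STATEMENTS:
    df[col] = 8 - df[col]

for col in [c for c in df.columns if c.startswith("s")]:
    novice = df.loc[df["group"] == "novice", col]
    expert = df.loc[df["group"] == "expert", col]
    # Mann-Whitney U: nonparametric comparison of two independent groups
    stat, p = mannwhitneyu(novice, expert, alternative="two-sided")
    print(f"{col}: median novice={novice.median()}, "
          f"expert={expert.median()}, U={stat:.1f}, p={p:.3f}")
```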

3.5.1.2 Performance data and the observation method

Data about performance were summarised in several ways. The task timing (in seconds) was computed, and the mean time and the standard deviation were derived using descriptive statistics. The tasks' accuracy was also determined.

The performance data indicate the percentage of users who completed each task within the prescribed time limit.

It is important to note that the average of the performance data includes values from users who performed the tasks within the time limit and users who exceeded it. Users who exceeded the time limit of a task were asked to stop performing the task, and the benchmark time was then recorded for that task.
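A minimal sketch of this summarisation follows, assuming a long-format table with hypothetical columns (participant, task, time_sec, completed) and hypothetical per-task benchmark times; the capping rule and the completion percentage follow the text above, and the final filter anticipates step A of the procedure described next.

```python
# A sketch of the performance-data summary, under the assumptions stated above.
import pandas as pd

perf = pd.read_csv("performance.csv")  # hypothetical long-format file
benchmarks = {1: 60, 2: 90}            # hypothetical benchmark time (s) per task

# Users who exceeded the time limit were stopped; record the benchmark time instead.
perf["time_sec"] = perf.apply(
    lambda r: min(r["time_sec"], benchmarks[r["task"]]), axis=1)

summary = perf.groupby("task").agg(
    mean_time=("time_sec", "mean"),
    sd_time=("time_sec", "std"),
    pct_completed=("completed", "mean"),  # share finishing within the benchmark
)
summary["pct_completed"] *= 100

# Step A of the procedure described next: a task is problematic if one or more
# users failed to complete it within the benchmark.
problematic = summary.index[summary["pct_completed"] < 100].tolist()
print(summary, "\nProblematic tasks:", problematic)
```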

To highlight usability problems within the performance data, two steps were used, as suggested by Rubin (1994):

A. Identification of the problematic tasks

To compile a comprehensive list of usability problems for each site, all problematic tasks were taken into account. Instead of identifying only the most problematic tasks (e.g. those with success rates below 70 per cent, as suggested by Rubin (1994)), all the tasks that one or more users could not complete successfully within the time benchmark were considered.

B. Identification of users' problems followed by a 'source of error' analysis

To identify users' difficulties with these problematic tasks, and to investigate the roots of the usability problems behind these tasks, various sources were examined. These investigations comprised the notes from in-session observation, notes gathered from reviewing the sixty Camtasia sessions, and users' comments obtained during the testing.

These sources identified a significant number of usability problems, which were examined and categorised. They formed a list of sixteen common usability problem areas from the three sites. These sixteen common usability problems generated sixteen problem sub-themes and seven corresponding problem themes. The list of the problem themes and sub-themes that was generated from the analysis of this method is explained in the Results Chapter (Chapter 4).

To quantify the overall usability of the sites, a summary of the total number of tasks completed successfully by all users was compiled along with a list of the sources of different usability problems.

Analysis of variance (ANOVA) tests were used to test for statistically significant differences. Both one-way within-subjects ANOVA tests and a mixed ANOVA design were used.

The one-way within-subjects ANOVA test was employed for each of the ten tasks to determine whether the time spent performing each task differed significantly across sites. The within-subjects factor, the site, had three levels: site 1, site 2 and site 3. The dependent variable was the total time taken by users, in seconds, to perform a task. However, this test does not, by itself, provide a detailed analysis.

Therefore, a mixed ANOVA design was employed to obtain a more detailed analysis of the data. This type of analysis is used for studies with many factors because it can investigate both the effects of each factor individually and the interactions between factors (Brace 2006). This design was used to determine:

• whether the time for performing all the tasks on the three sites was significantly different between novice and expert users;

• whether the time spent performing all the tasks differed significantly across the three sites.

The mixed design used was a 2*3*10 mixed ANOVA. The first factor was the between-subjects factor of group, with two levels: novices and experts. The second factor was the within-subjects factor of site, with three levels: site 1, site 2, and site 3. The third factor was the within-subjects factor of task, with ten levels: task 1 to task 10. The dependent variable was the time, in seconds, the user took to perform a task.

3.5.1.3 Post-Test questionnaires – quantitative data

Data collected from the post-test questionnaires were used to establish evidence of usability problems with the sites.

The Likert scores were calculated for each statement in Section 1 of the post-test questionnaire for each site to obtain the overall results concerning the participants' satisfaction with the sites.

The post-test statements were grouped into four categories from the developed heuristic guidelines: architecture and navigation, content, design, and purchasing process, along with their corresponding sub-categories. The exceptions were three statements (17, 26, 28), which related to the overall evaluation of the tested sites and were therefore grouped under a new sub-category: overall evaluation of the sites. The statements were grouped to facilitate pinpointing usability problems. The post-test questionnaire did not include statements related to the accessibility and customer service category of the heuristic guidelines and its sub-categories; therefore, this category was not used in grouping the post-test questionnaire statements.

A Likert score rating of 1 to 3 (negative) on a post-test questionnaire statement was interpreted as indicating the existence of a usability problem from the users' viewpoint. Negative statements identified a number of usability problems with the sites. These negative statements were mapped to the problem themes and sub-themes identified by the previous method. Four statements caused three new problem sub-themes to be identified.

To examine the overall usability of the sites, two inferential statistical tests were used for each statement of the post-test questionnaire:

• The Mann-Whitney test was used to determine whether there was a statistically significant difference between the ratings given by novice users and those given by expert users.

• The Friedman test was used to determine whether there was a statistically significant difference between users' ratings across the three sites.

The Friedman test and the Mann-Whitney test are nonparametric tests.  These tests were the most appropriate statistical techniques due to the ordinal scale of measurement that was used with the collected data (as explained in Section 3.5.1.1).
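A sketch of how these two tests could be run per statement follows, using scipy.stats; the data layout and column names (participant, group, site, statement, rating) are hypothetical. The Friedman test treats the three site ratings from the same user as related samples, while the Mann-Whitney comparison (pooled over sites here for brevity) treats novice and expert ratings as independent samples.

```python
# A sketch of the two nonparametric tests applied to each post-test statement,
# under the assumptions stated above.
import pandas as pd
from scipy.stats import friedmanchisquare, mannwhitneyu

ratings = pd.read_csv("posttest_ratings.csv")  # hypothetical long-format file

for stmt, d in ratings.groupby("statement"):
    # Friedman: the same users rated all three sites (related samples)
    wide = d.pivot(index="participant", columns="site", values="rating")
    chi2, p_f = friedmanchisquare(wide["site1"], wide["site2"], wide["site3"])

    # Mann-Whitney: novice vs expert ratings (independent samples)
    nov = d.loc[d["group"] == "novice", "rating"]
    exp = d.loc[d["group"] == "expert", "rating"]
    u, p_m = mannwhitneyu(nov, exp, alternative="two-sided")
    print(f"statement {stmt}: Friedman p={p_f:.3f}, Mann-Whitney p={p_m:.3f}")
```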


3.5.1.4 Post-Test questionnaires – qualitative data

When determining the sites' usability problems, users' responses to the post-test questionnaires' open-ended questions provided qualitative data.

The users' answers were first translated from Arabic into English. They were then combined for each site and grouped under five of the developed heuristic guideline categories: architecture and navigation, content, accessibility and customer service, design, and purchasing process, along with their corresponding sub-categories.

Several usability problems were identified from the responses of users. These answers were mapped to the problem themes and sub-themes pointed out by the previous methods; nine new sub-themes were generated. Seven of these sub-themes were mapped to appropriate problem themes and the other two sub-themes generated new problem themes.

3.5.1.5 Post-Evaluation questionnaires – qualitative data

Data obtained from the post-evaluation questionnaires were first translated from Arabic into English. These data represented answers to questions which had asked users to identify the best site for each of six features. The answers were grouped into the six features of the sites: navigation, internal search, architecture, design, purchasing process, and security and privacy.

3.5.2 Heuristic evaluation analysis

Heuristic evaluators collected both qualitative and quantitative data. The collected data were analysed in several ways. This section presents the analysis of the two heuristic evaluation methods.

3.5.2.1 Heuristic evaluation analysis – qualitative data

The heuristic evaluators' comments, noted during the fifteen discussions about each site's compliance with the heuristic principles, were translated into English. They were then grouped together for each site and categorised under the categories and sub-categories of the designed heuristic guidelines. From examining the heuristic sub-categories, forty common usability problem areas were identified across the three sites. Twenty-four problems were mapped to the corresponding themes and sub-themes identified through the user testing methods. However, fifteen new problem sub-themes were identified, and one sub-theme gave rise to a new problem theme.

3.5.2.2 Heuristic evaluation checklist

For each site, a Likert score was calculated for every statement on the heuristic checklist to obtain an overall rating from the five heuristic evaluators. The heuristic checklist statements were mapped to the five identified heuristic categories and their corresponding sub-categories. Statements 87 to 89 were excluded because they required making a purchase from the site. The Likert scores were interpreted as follows: a score of 1 to 3 (negative) on a heuristic checklist statement indicated a severe usability problem. From these negative statements, a list of usability problems was compiled; these statements were then mapped to the identified problem themes and sub-themes.

The Friedman test was used to capture information about the overall usability of the sites. The goal of using this test was to determine whether there was any statistically significant difference among the heuristic evaluators' ratings for the three sites on each statement in the heuristic checklist. The reasons for using this test were explained in Sections 3.5.1.1 and 3.5.1.3.

3.6 Reliability and validity

The validity of evaluation techniques is concerned with whether a technique measures what it is intended to measure. This evaluation involves examining the technique itself and how the technique is performed (Preece et al. 2002). For example, the validity of the user testing method, according to Nielsen (1993), relates to whether the results actually reflect the usability issues the researcher wishes to test. Nielsen (1993) provided examples of typical validity problems, which included designing the wrong tasks, engaging the wrong users, or not considering time constraints and social influences.

Furthermore, Gray and Salzman (1998) defined threats to the validity of experimental studies within the context of HCI research. Gray and Salzman examined the design of five experiments that compared usability evaluation methods. They also recommended ways of addressing the kinds of validity that are most relevant to HCI research. For example:

• To provide internal validity, they recommended considering the issues of instrumentation, selection and setting:

a. Instrumentation for usability evaluation methods is concerned with biases in how human observers identify or gauge the severity of usability problems. When comparing methods or groups, the instrumentation is valid only when there is a way of rating the results that does not inappropriately favour one condition over the others. This means that the same evaluators or experimenters should not be assigned to multiple UEMs, nor asked to identify, classify or rate the usability problems across methods. Also, usability problem categories that are defined by one UEM should not be used by the experimenter to categorise problems found by another UEM.

b. Selection concerns the characteristics of the participants: for example, whether the participants are related to the manipulation of interest, and whether the participants assigned to different groups are equal in terms of relevant characteristics (e.g. knowledge or experience) that are significant to the conditions of the experiment.

c. Setting concerns the location of an experiment and the conditions under which the experiment is administered.  Researchers should ensure that all participants in each UEM perform the experiment under uniform conditions and at the same location.

• To ensure causal construct validity, researchers should provide explicit information regarding the exact operation and method used, so that a reader could apply the UEM according to that description. For example, in the case of heuristic evaluation, researchers should state which guidelines the evaluators used and whether the evaluators worked together or independently while identifying the usability problems. Furthermore, to avoid the problem of interactions between different treatments, it is highly recommended that the same participants not be matched to two or more UEMs; each group of participants should conduct only one UEM.

In this research, all these recommendations were incorporated to ensure validity. The internal validity of this research was concerned with instrumentation, selection and setting. The researcher was not assigned to multiple UEMs and did not personally identify the usability problems. Even though the researcher was involved in the collection of data and played the role of observer in the user testing sessions and heuristic evaluation sessions, the web experts identified the usability problems individually during the heuristic evaluation sessions; the researcher only reported the experts' results. Furthermore, the categorisation of usability problems discovered through each method was not the basis for categorising the usability problems obtained from the other methods. Each method was analysed individually, and the problems identified by each method were then compared to generate the problem themes and sub-themes, which were built up gradually, as mentioned in Section 3.5.

The selection issue was also observed while recruiting participants for the user testing and heuristic evaluation methods. The participants' characteristics for the user testing were drawn from the companies' profiles of their users. Also, the web experts all had a similar level of experience (i.e. more than 10 years) in designing e-commerce sites. Participants with unanticipated characteristics were not included in either experiment.

The 'setting' issue was also addressed in this research: all the participants in the user testing performed the testing at the same location under the same conditions, and they all followed the same procedure, as illustrated in Section 3.4.1.10. All the experts in the heuristic evaluation performed their investigation under the same conditions and followed the same procedure, as mentioned in Section 3.4.2.4. Even though every web expert evaluated the sites in his/her office in his/her company, similar conditions existed in each company.

Also considered in this research was causal construct validity. The data collection sections explicitly describe how each method was used in the research, and these methods are the usability methods described in the literature. Problems of interaction were avoided because the user testing participants were not the same as those who participated in the heuristic evaluation.

It should be noted that the multiple case study approach used in this research enhances the external validity and generalisability of the findings, as stated by Merriam (1998).

As indicated by Preece et al. (2002), the reliability, or consistency, of an evaluation technique is connected to 'how well a technique produces the same results on separate occasions under the same circumstances.' For instance, for user testing, reliability is demonstrated when the same result would be obtained if the test were repeated (Nielsen 1993). Preece et al. (2002) stated that, in the case of experiments, if an experiment is controlled carefully then it will have high reliability, so that if another evaluator follows exactly the same process then they should achieve the same results. Due to time constraints, it was not possible to investigate whether the same results would be obtained a second time. However, some techniques in this research, such as the post-test questionnaire, did lend themselves to reliability measurements.

For the questionnaires, reliability means that 'a measure should consistently reflect the construct that it is measuring' (Field 2009). The most common measure of reliability is Cronbach's Alpha, for which values of 0.7 to 0.8 are acceptable. These values indicate a reliable measure, whereas substantially lower values would indicate an unreliable measure (Field 2009).

The post-test questionnaire was based on a reliable measure (CSUQ) and also on other questions proposed in earlier research, as mentioned in Section 3.4.1.4.  These questions are specifically designed to measure users' satisfaction with an interface.

The Cronbach's Alpha for this measure exceeded 0.89 (Lewis 1993). The reliability of the post-test questionnaire was assessed by calculating the overall Cronbach's Alpha for each site. This showed a high level of reliability, since the Cronbach's Alpha value for each site was higher than 0.8. Specifically, the values of Cronbach's Alpha for sites 1, 2 and 3 were .939, .937 and .931, respectively.
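The computation behind these figures can be sketched directly from the standard formula for Cronbach's Alpha; the thesis does not say which tool performed the calculation, and the data matrix below is synthetic, so this is illustrative only.

```python
# A sketch of the Cronbach's Alpha computation for one site's post-test
# questionnaire, implemented from the standard formula. The data are synthetic.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each statement
    total_var = items.sum(axis=1).var(ddof=1)   # variance of each participant's total
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Synthetic example: 30 participants x 25 statements on the 7-point scale,
# assumed already reverse-coded where necessary.
rng = np.random.default_rng(0)
items = rng.integers(1, 8, size=(30, 25))
print(f"alpha = {cronbach_alpha(items):.3f}")
```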

3.7 Conclusion

This chapter presented a justification and examination of the selected research philosophy, design, and methods that facilitated the aims and objectives of this research. The chapter also discussed the various techniques that were employed to collect and analyse the data related to the three main methodologies used in this research: user testing, heuristic evaluation and Google Analytics.
