Question 1
Most uses of psychological testing fall into one of five categories: classification, diagnosis and treatment planning, self-knowledge, program evaluation, and research. Explain what is meant by each of these categories, including a practical example for each, and the advantages and disadvantages of using testing for each of these purposes.
Classification
Classification is the process by which individuals or groups are organized into clinically significant categories in order to determine the best interventions, or for predictive purposes. For example, tests that help to classify people into different intelligence categories may be used to ensure that students receive the appropriate amount and type of academic support. Students with high intelligence may require more materials or work to stay engaged. Students who fall into one of the lower intelligence categories may require more time and individual attention to learn basics that may seem intuitive to other students.
One of the advantages of using testing for classification purposes is that it can be used to make quick inferences about a person or a group without needing to perform a thorough investigation into each case. This could be of particular use in emergencies, where the provision of psychological services is time sensitive and opportunities for full psychological batteries are scarce. Simple classifications (e.g., of intelligence) may be beneficial for making quick judgements about whether a person should enter the "mainstream" of care or needs specialized assistance.
However, this “quick fix” approach carries with it many disadvantages as well. It may be tempting to assign all the characteristics of a classification to a particular individual, but by relying solely on these categories, it is easy to miss the wholeness and intricacy of the individual person, who may or may not fall cleanly into these categories. Even if a person does fall neatly into a particular classification or category, there may be other facets that are not captured by that category, which may impact the course of treatments or other interventions.
Diagnosis + Treatment Planning
Diagnosis is the process by which a person’s behaviors and affects are assigned a classification, based on a particular diagnostic system, as an indicator of a person’s general functioning and as an aid to determine the nature and necessity of possible interventions. It enables professionals working within the same field to quickly convey a large amount of information with relatively few words. For example, diagnoses are usually used in medical and mental health treatments. Assigning someone the diagnosis of borderline personality disorder indicates to other healthcare professionals that this person probably has poor interpersonal and empathetic skills, and tends to veer between extremes in their perceptions of others (unless their pathology is well-managed). Additionally, it would aid other professionals using the same diagnostic system to quickly identify appropriate treatment options.
The main advantage of using testing to arrive at diagnoses is that it provides a standardized avenue to make classification determinations. Thus, it helps to regulate the process by which healthcare professionals arrive at particular diagnoses, helping to systematize the way that treatments are administered.
One possible disadvantage of using testing to make diagnoses is that tests vary in their specificity and sensitivity. A test that is too sensitive or not sensitive enough runs the risk of producing false positives or false negatives that may affect whether a patient receives the care they actually need (Glaros & Kline, 1988).
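As a concrete sketch of the sensitivity/specificity/predictive value model described by Glaros and Kline (1988), the counts below are entirely hypothetical, but they show how the four quantities are computed from a comparison of test results against a gold-standard diagnosis:

```python
# Hypothetical counts from comparing a screening test to a gold-standard
# diagnosis (these numbers are invented for illustration, not from any study).
true_pos, false_neg = 40, 10   # examinees with the disorder: detected / missed
true_neg, false_pos = 80, 20   # examinees without it: correctly cleared / flagged

sensitivity = true_pos / (true_pos + false_neg)  # P(test+ | disorder) = 0.8
specificity = true_neg / (true_neg + false_pos)  # P(test- | no disorder) = 0.8
ppv = true_pos / (true_pos + false_pos)          # P(disorder | test+), ~0.67
npv = true_neg / (true_neg + false_neg)          # P(no disorder | test-), ~0.89
```

Moving the cutting score trades one kind of error for the other: a more lenient cutoff raises sensitivity but lowers specificity, which is exactly the risk of false positives and false negatives described above.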
Self Knowledge
Self knowledge refers to the awareness and understanding of one’s own behaviors, motivations, and way of being in the world. In the realm of psychological testing, people may take tests such as the Minnesota Multiphasic Personality Inventory (MMPI) to learn more about themselves.
One possible advantage of using testing for self knowledge is that some of these tests can be useful for better understanding one's own likes and dislikes, and for better articulating how one operates in the world. When used in a therapeutic context, certain testing may also prompt people to examine areas of their lives that might otherwise have been ignored, and provide the space to pursue self-growth (Trimboli & Kilgore, 1983).
However, tests for the purpose of self knowledge also run the risk of being treated as definitive: the uninformed consumer may make more of a test result than is appropriate, or generalize a particular result to other dimensions of their life.
Evaluation
Evaluation refers to the systematic assessment of a person or a group, in order to draw some sort of conclusion regarding a particular ability, set of knowledge, or value that this person/group may have.
For example, potential graduate students are usually required to take an exam such as the Graduate Record Examination (GRE), which is supposed to serve as an indicator of student academic success (Kuncel, Hezlett, & Ones, 2001).
Similar to diagnoses, one major advantage of using testing for evaluations is the potential for standardization. If all candidates for a particular position are being evaluated via the same test, there is reasonable certainty that their test outcomes will allow the evaluators to compare them with each other.
Testing for evaluation purposes is limited in its scope because of the specific nature of tests. Unless an evaluee is being given a battery of tests, it is likely that the results from a singular test will not provide adequate information to make a general evaluation. Therefore, for situations when a candidate is being evaluated for professional or academic roles, it is unlikely that a single test will be able to encompass all the information that is required to make a thorough assessment and decision.
Research
Research is the process of systematically exploring or analyzing any given subject. Psychology research usually explores different aspects of people's (or groups of people's) functioning, behaviors, and other aspects of life.
All the previous examples would fall within the realm of psychology research. Each of those examples could be used to draw conclusions about particular persons or groups of people, which could then be analyzed and disseminated as valuable information to the scientific community.
Testing is one of the most common ways that psychological research is conducted. It is the simplest way to execute an experimental (or quasi-experimental) procedure, and has a long history of being used in the field of psychological study.
As in the case of all the other uses of testing, many of the same issues are present in the case of testing used for research. Tests may be psychometrically unreliable or invalid (or have low validity), have poor sensitivity or specificity, be biased, or contain a host of other issues that may impact testing results and lead researchers to draw false conclusions.
References:
Glaros, A., & Kline, R. (1988). Understanding the accuracy of tests with cutting scores: The sensitivity, specificity, and predictive value model. Journal of Clinical Psychology, 44, 1013–1023. https://doi.org/10.1002/1097-4679(198811)44:6
Kuncel, N. R., Hezlett, S. A., & Ones, D. S. (2001). A comprehensive meta-analysis of the predictive validity of the Graduate Record Examinations: Implications for graduate student selection and performance. Psychological Bulletin, 127(1), 162–181.
Trimboli, F., & Kilgore, R. B. (1983). A psychodynamic approach to MMPI interpretation. Journal of Personality Assessment, 47, 614–626. http://dx.doi.org/10.1207/s15327752jpa4706_6
Question 3
Psychologists have a responsibility to be concerned about how the culture of examinees may interact with the testing situation. Describe at least three examples where cultural factors may cause an examinee to score less well than might be expected, given their abilities. Suggest ways that test creators and administrators can help reduce the irrelevant influence of cultural factors in each case.
Example 1: Stereotype Threat
Stereotype threat is the effect that the fear of confirming, through one's own performance, a negative stereotype about a group one belongs to can have on that performance. This may cause an examinee to score less well than expected because they may feel pressured to do well to combat negative stereotypes about their particular culture or subculture. This increase in pressure may heighten test anxiety, leading to poorer performance and lower scores (Cassady & Johnson, 2002).
Test creators can help to reduce the irrelevant influence of stereotype threat (unless it is the focus of their study) by creating a non-threatening environment. This may include using proctors with whom the examinee may be more comfortable (e.g., those of the same race and gender) (Tsui & O'Reilly, 1989). Additionally, emphasizing individual performance and potential for growth may help to shift participant focus from group representation to self-representation. To counteract a negative stereotype, it may be beneficial to provide positive statistics about the relevant group before the participant takes the test. However, this approach runs the risk of being affected by a different set of biases.
Example 2: Self Rating Bias of Collectivist Cultures
Self rating bias refers to the tendency of certain individuals or groups to rate themselves consistently higher or lower than their actual capabilities. People from certain East Asian and other collectivist cultural backgrounds tend to under-represent their abilities or other positive qualities in self-reports, which leads to lower self-ratings on positive attributes. For example, Chinese people are equally (if not more) giving than other people, but because of their collectivist background, they rate themselves lower on "giving" scales (Xie, Chen & Roy, 2006). Thus their test results may indicate that they are less giving than the average population, which is not an accurate depiction of this group's generosity.
In order to help correct this effect (or others like it), test makers should strive to be culturally informed about the group they are testing, and be aware of cultural tendencies that may skew test results. Additionally, test makers may also be able to attempt to statistically correct the data by using data from previous research which may indicate how much this self-rating bias affects individual ratings.
Example 3: ESL/ELL Students’ Math Scores
A report by Carnoy and García of the Economic Policy Institute indicated that ELL students tended to score lower in both math and English when compared to their native English-speaking counterparts (2017). The cause of the difference in English scores may seem obvious, but the reason for the difference in math scores is less clear.
It is possible that word problems in math sections are responsible for these differences. Many times, a student is capable of solving the mathematical problem being presented, but is unable to parse what the question is asking in order to answer it. If students cannot understand a question, or misunderstand it, their risk of an incorrect answer increases substantially. In other words, they may have sufficient math skills, but their limited English verbal skills affect their ability to use those math skills.
Test creators can help to reduce this cultural influence through a variety of means. They may be able to change the language of the test to make it simpler and easier to understand (assuming that they want to test actual math skills, and not the ability to decipher the question in its original phrasing). Alternatively, they may consider bringing in a translator for ESL students who can convey the question in the students' native language. Lastly, test creators could invest in creating translated, equally valid versions of their test.
The responsibilities of test publishers and the need for ethical standards
These steps may seem excessive. How much are test publishers responsible for making sure their tests are culturally adaptable? I would argue that test publishers have a responsibility to make their tests/measures accessible to as many people as is feasible for them, rather than what might be comfortable for them. Not every test or measure is meant to generalize or be used with every population. However, there are a number of intellectual and cognitive ability tests that are, or are intended to be, cross-culturally valid.
When claiming to measure universal constructs such as intelligence or emotional competencies, test creators have an ethical responsibility either to specify the limits of their tests (the populations for which they are relevant), or to expend resources on making their tests truly universal measures.
References:
Carnoy, M. & García, M. (2017). Five key trends in U.S. student performance: Progress by blacks and Hispanics, the takeoff of Asians, the stall of non-English speakers, the persistence of socioeconomic gaps, and the damaging effect of highly segregated schools. Retrieved from Economic Policy Institute website: https://www.epi.org/publication/five-key-trends-in-u-s-student-performance-progress-by-blacks-and-hispanics-the-takeoff-of-asians-the-stall-of-non-english-speakers-the-persistence-of-socioeconomic-gaps-and-the-damaging-effect/
Cassady, J., & Johnson, R. (2002). Cognitive test anxiety and academic performance. Contemporary Educational Psychology, 27, 270–295. http://dx.doi.org/10.1006/ceps.2001.1094
Tsui, A. S., & O'Reilly, C. A. (1989). Beyond simple demographic effects: The importance of relational demography in superior-subordinate dyads. Academy of Management Journal, 32, 402–423. http://dx.doi.org/10.2307/256368
Xie, J. L., Chen, Z., & Roy, J.-P. (2006). Cultural and personality determinants of leniency in self-rating among Chinese people. Management and Organization Review, 2, 181–207. http://dx.doi.org/10.1111/j.1740-8784.2006.00043.x
Question 4
Some psychologists argue that intelligence is too complex a construct to define or measure adequately, and yet intelligence testing is widely practiced in many contexts today. Outline the major arguments for and against the use of intelligence testing in several contexts, with particular attention to specific factors that may make such testing appropriate or inappropriate in each context.
Arguments for Intelligence Testing:
Proponents for intelligence testing may argue that while no test is perfect, one of the reasons why this testing is as widely practiced and accepted as it is today, is because there has been good evidence for the validity and usefulness of these tests (Snyderman & Rothman, 1987). They would also argue that these tests measure the most important aspects of intelligence, and are useful in making related determinations, such as educational placements.
Arguments Against Intelligence Testing:
Opponents of intelligence testing question the validity of such tests, and whether they are too vulnerable to outside influences to be accurate measures of intelligence. In particular, they point to racial and SES differences that may put certain groups of people at a disadvantage for particular items on the tests, especially in school settings where such assessments are heavily relied on to determine placements (Snyderman & Rothman, 1987). Ethnic and financial differences are not the only potential issues with intelligence testing. Because of factors such as the Flynn effect (Rindermann, Becker, & Coyle, 2017), others argue that these testing models may not be sustainable in their current forms, and lack comparative value across generations. Finally, other professionals argue that current tests do not take into account other aspects of interpersonal functioning that could be considered part of intelligence, such as emotional intelligence (Goleman, 1995). Insofar as they do not consider these realms, they provide an incomplete rendering of a person's intellectual capacity.
Factors Affecting Testing Appropriateness
The main factors affecting testing appropriateness involve the context in which a particular test is being administered, and the purposes for which it is being administered. Additionally, the contexts to which the results are generalized should also be taken into consideration. For example, administering the WAIS-IV intelligence test to adults from a rural tribe on the African continent in order to compare their intelligence levels with adults in the U.S. would be a completely inappropriate use of the test. The test is not actually useful in determining the relative intelligence of one group compared to the other, for many reasons, including (but not limited to) differences in which skills are emphasized across cultures, language differences, and life experiences. To use the WAIS-IV as a catch-all measure of raw intelligence would be invalid. However, if the goal of administering the test were simply to see how much of the information on the test was known by the adults of the tribe, then the use of the test may be appropriate.
In short: before using intelligence tests, it is important to consider whether the test itself is valid for the purpose of one's measurement. If it is valid, then its use is most likely appropriate. Of course, there are instances in which intelligence tests could be used to discriminate against certain individuals or groups (in a manner that falls outside of simple evaluative decisions); such uses would certainly not meet this criterion.
References:
Goleman, D. (1995). Emotional intelligence. New York, NY, England: Bantam Books, Inc.
Rindermann, H., Becker, D., & Coyle, T. R. (2017). Survey of expert opinion on intelligence: The Flynn effect and the future of intelligence. Personality and Individual Differences, 106, 242-247.
Snyderman, M., & Rothman, S. (1987). Survey of expert opinion on intelligence and aptitude testing. American Psychologist, 42(2), 137-144. doi:10.1037/0003-066X.42.2.137
Question 8
Validity is generally considered an important quality of a good psychological test. Discuss what validity means in psychometrics, including the different types of validity, why each is important, and how each is established. Include a discussion of some adverse consequences for using a test that does not have adequate validity.
What is Validity?
Validity within psychometrics refers to the ability of a test to measure what it purports to measure, and to do so in a way that is psychometrically precise. A test or measure has one validity: it is either valid or it is not. However, psychometricians use a variety of strategies to determine test validity by examining different aspects of the test. These are referred to as types of validity.
Types of Validity
The three main types of validity are content validity, criterion validity, and construct validity. Content validity is determined by asking whether the test questions cover enough of the domain being measured to constitute a true measure of that domain. The occasionally referenced “face validity” sometimes falls under this category. Although it is not technically a validation strategy, adequate face validity would allow an examiner to assert that a measure appears to be valid at face value (Nevo, 1985). For example, it would be a face value judgement to state that measures of a person’s nose length would not indicate their ice cream flavor preference.
Content validity is primarily established by one of two methods: the Average Congruency Percentage (ACP) or the Content Validity Index (CVI). In short, the ACP uses multiple experts' ratings of the relevance of the exam, and averages the total percentage to determine a test's validity. If the average percentage is over 90%, the test is considered valid (Mohammadipour, Rashid, Rafik-Galea & Thai, 2018). The CVI also uses experts to determine test validity, but further divides validity into item-level and overall scale validity. The relevance of each item to the construct being measured is rated by these experts on a four-point Likert scale, as is the validity of the whole scale. The proportion of experts who rated relevance as a 3 or 4 (for both individual items and the scale as a whole) is used to determine the validity of the whole test. Alternatively, the CVI scores for individual items are averaged to determine the CVI of the whole test.
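As a minimal sketch of the CVI arithmetic just described, the expert ratings and item names below are invented for illustration:

```python
# Hypothetical relevance ratings (1-4 Likert) from four experts for five items.
ratings = {
    "item1": [4, 4, 3, 4],
    "item2": [3, 4, 4, 2],
    "item3": [2, 1, 3, 2],
    "item4": [4, 3, 4, 4],
    "item5": [3, 3, 4, 3],
}

def item_cvi(scores):
    """Item-level CVI: proportion of experts rating the item relevant (3 or 4)."""
    return sum(1 for s in scores if s >= 3) / len(scores)

i_cvis = {name: item_cvi(scores) for name, scores in ratings.items()}

# Averaging variant of the scale-level CVI: the mean of the item-level CVIs.
scale_cvi = sum(i_cvis.values()) / len(i_cvis)
```

In this toy data, item3 (item-level CVI of 0.25) would be flagged for revision or removal, while the scale-level average works out to 0.80.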
The second validation strategy is “criterion validity”: the “prediction power,” so to speak, of a measure. There are two subtypes of criterion validity: concurrent validity and predictive validity. Concurrent validity describes how precisely a particular measure describes certain behaviors or abilities within the relevant domain. For example, a calculus test would have strong concurrent validity to gain an understanding of a student’s current knowledge of the particular concepts being tested. On the other hand, predictive validity is a measure of how precisely a test can predict future performance in the same, or in a related domain. For example, the GRE claims to predict student success in graduate school, although the test items may not directly correspond with a student’s chosen field of study.
Criterion validity is established by comparing participant scores on a test either with scores from a test that measures the same domain, or with measures of future success, depending on the purpose of testing. Others have argued that the best way to determine criterion validity is by using content specialists, similar to the use of experts in determining content validity (Rovinelli & Hambleton, 1976).
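In the simplest case, that comparison is a correlation between test scores and the criterion measure, and the resulting Pearson r is reported as the validity coefficient. The sketch below uses invented scores for six examinees and a hypothetical later criterion (first-year GPA):

```python
# Hypothetical admissions-test scores and a later criterion (first-year GPA)
# for the same six examinees; both lists are invented for illustration.
test_scores = [310, 320, 300, 330, 315, 305]
gpa = [3.2, 3.5, 3.0, 3.8, 3.4, 3.1]

def pearson_r(x, y):
    """Pearson correlation, computed from deviations about the means."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

validity_coefficient = pearson_r(test_scores, gpa)  # close to 1 in this toy data
```

A coefficient near zero would mean the test carries almost no predictive information about the criterion, which is precisely the failure of criterion validity discussed later in this answer.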
Construct validity is the third and final validation strategy discussed here. This strategy involves determining whether a psychometric instrument can measure its related theoretical construct. Accuracy is determined by comparing performance on the evaluated instrument to other instruments that have established construct validity. This is done primarily through the statistical method of factor analysis, which identifies clusters of items within an instrument that can be combined into underlying factors. Each individual item within the instrument is then compared to its common factor, and its correlation is computed. If the correlation with the factor is not strong enough, those particular items may be discarded to increase the overall validity of the test.
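Full factor analysis is beyond a short sketch, but the item-screening logic can be illustrated in a drastically simplified form: approximate the common factor by each examinee's total score, correlate every item with that total, and flag weakly correlated items for review. The response data, the 0.30 cutoff, and the single-factor assumption are all invented for illustration; a real analysis would extract factors statistically.

```python
# Hypothetical 1-5 responses: rows are examinees, columns are items.
responses = [
    [5, 4, 5, 2],
    [4, 4, 4, 3],
    [2, 1, 2, 5],
    [3, 3, 3, 1],
    [5, 5, 4, 4],
    [1, 2, 1, 3],
]

def pearson_r(x, y):
    """Pearson correlation, computed from deviations about the means."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Crude stand-in for the common factor: each examinee's total score.
totals = [sum(row) for row in responses]

for j in range(len(responses[0])):
    item = [row[j] for row in responses]
    r = pearson_r(item, totals)
    verdict = "keep" if r >= 0.30 else "review"  # 0.30 is an arbitrary cutoff here
    print(f"item {j + 1}: r = {r:+.2f} -> {verdict}")
```

In this toy data the fourth item barely correlates with the rest and would be a candidate for removal; note that a proper analysis would also correct for each item's own contribution to the total score.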
Adverse Consequences to Using Unvalidated/Poorly Validated Tests
Simply stated, the worst effect of using an unvalidated or poorly validated measure is that the construct that requires measurement is not accurately quantified. This has immense implications for testing in a practical sense. For example, tests or items with poor content validity may support false conclusions about the person whose qualities or characteristics are being measured. When it comes to factors such as intelligence or other professional aptitudes, an inaccurate measure may lead to the loss of educational or professional opportunities for which a candidate may in fact be suited. A poor score on a test that claims to measure these aptitudes (but in fact is lacking in content validity) could lead to some of these severe outcomes.
Items or tests with poor construct validity may be subject to many of the same outcomes, because they also would not be measuring what they intend to measure, which in this case is a particular theoretical construct. The implications may be slightly different because they do not necessarily affect a person directly, at least at the outset. However, insofar as these particular tests or measures are considered in formulating psychological constructs and therapeutic approaches, they can also have a significant impact.
In cases of poor criterion validity, predictions may be skewed and fail to present an accurate picture of the trajectory of a person's abilities and skills. Again, this may lead to large professional and personal losses.
Finally, tests of poor validity set back the development of the field by supplying false premises that may take years to be disproved, and otherwise harm the progress of science. That being said, validity is not easy to ensure, and achieving it fully (if in fact it can be achieved at all in a particular situation) will involve much trial and error.
References:
Mohammadipour, M., Rashid, S. M., Rafik-Galea, S., & Thai, Y. N. (2018). The relationships between language learning strategies and positive emotions among Malaysian ESL. International Journal of Education & Literacy Studies, 6, 86–96.
Nevo, B. (1985). Face validity revisited. Journal of Educational Measurement, 22, 287–293. http://dx.doi.org/10.1111/j.1745-3984.1985.tb01065.x
Rovinelli, R. J., & Hambleton, R. K. (1976, April). On the use of content specialists in the assessment of criterion-referenced test item validity. Paper presented at the Annual Meeting of the American Educational Research Association, San Francisco, California.
Question 12
The Wechsler Scales of Intelligence include the WPPSI-IV, the WISC-IV, and the WAIS-IV. Describe the common features shared by these tests, and how those features reflect Wechsler's theory of intelligence. Also discuss what differentiates each test from the others, including a discussion of the intended uses of each test and how those uses are reflected in the makeup of each test.
Wechsler’s Theory of Intelligence
To understand how the Wechsler Scales of Intelligence help to measure intelligence, it is helpful to first understand Wechsler's theory of intelligence. Wechsler theorized that a person's intelligence quotient (IQ) remains the same even as a person ages. He understood intelligence as the overarching capacity of a person to live and work effectively in his or her respective environment (Wechsler, 1939). He allowed that intellectual abilities may shift and change based on life circumstances and aging, but asserted that the quotient remains constant throughout, and that it can be assessed by what a person can do. This is the assumption held by most intelligence theorists, although it is more a philosophical premise than a scientifically proven fact. However, this is the premise on which the majority of work in intelligence testing is based, and acceptance of this premise is necessary to understand intelligence testing as it is today.
Common Features
The WPPSI-IV, the WISC-IV, and the WAIS-IV are intelligence scales designed for specific age ranges. Although designed for different age groups, each of these scales contains 13-15 subtests; empirically based breakdowns of composite scores and an aggregate IQ score (with some differences between tests in how this is reported); a common metric for IQ and index scores; and six subtests that make up the core of each scale.
Differences between Scales
The primary difference between the scales is the populations for which they were designed. The WPPSI-IV is designed for ages 2 years 6 months to 7 years 7 months. The WISC-IV is designed for children 6 to 16 years old (Gomez, Vance, & Watson, 2016), and the WAIS-IV is designed for adolescents and adults 16 and older. Additionally, there are a number of differences in the subtests of each of these tests, based on their intended populations.
For example, both the WISC-IV and the WAIS-IV contain composite scores for verbal comprehension, perceptual reasoning, working memory, and processing speed. The WPPSI-IV contains similar scores, but with the addition of a fluid reasoning composite score. The items are also adjusted to better fit the age groups being tested. Although the tests measure similar modalities, many of the more difficult items on the WAIS-IV have been readjusted or rewritten to better suit the age group targeted by the WISC-IV.
Intended Uses of the Tests
The Wechsler scales were intended to provide an overarching picture of a person’s functioning. Currently, they serve to provide information relevant to forming psychological or educational diagnoses (Reschly, 1997), and to provide a marker for determining intervention strategies. The Wechsler scales are also used to identify individuals with learning disabilities, or those who are exceptionally gifted in particular intellectual domains.
The organization of the subscales allows assessors to target and evaluate particular domains. Thus, even if a person's overall IQ is below average, an investigation into their subscale scores may reveal that they excel in verbal comprehension and perceptual reasoning, but lack in processing speed. Because of the added nuance from the subscales, it is possible to gain a more holistic picture of the person being assessed.