Math Ability/Performance:
Fairness in Testing Abstracts

Aronson, J., Lustina, M.J., Good, C., Keough, K., Steele, C.M. and Brown, J., When White Men Can’t Do Math: Necessary and Sufficient Factors in Stereotype Threat

Research on “stereotype threat” (Aronson, Quinn, & Spencer, 1998; Steele, 1997; Steele & Aronson, 1995) suggests that the social stigma of intellectual inferiority borne by certain cultural minorities can undermine the standardized test performance and school outcomes of members of these groups. This research tested two assumptions about the necessary conditions for stereotype threat to impair intellectual test performance. First, we tested the hypothesis that to interfere with performance, stereotype threat requires neither a history of stigmatization nor internalized feelings of intellectual inferiority, but can arise and become disruptive as a result of situational pressures alone. Two experiments tested this notion with participants for whom no stereotype of low ability exists in the domain we tested and who, in fact, were selected for high ability in that domain (math-proficient white males). In Study 1 we induced stereotype threat by invoking a comparison with a minority group stereotyped to excel at math (Asians). As predicted, these stereotype-threatened white males performed worse on a difficult math test than a nonstereotype-threatened control group. Study 2 replicated this effect and further tested the assumption that stereotype threat is in part mediated by domain identification and, therefore, most likely to undermine the performances of individuals who are highly identified with the domain being tested. The results are discussed in terms of their implications for the development of stereotype threat theory as well as for standardized testing.

Cohen, G. L., Garcia, J., Apfel, N., Master, A., Reducing the racial achievement gap: a social-psychological intervention

Two randomized field experiments tested a social-psychological intervention designed to improve minority student performance and increase our understanding of how psychological threat mediates performance in chronically evaluative real-world environments. We expected that the risk of confirming a negative stereotype aimed at one's group could undermine academic performance in minority students by elevating their level of psychological threat. We tested whether such psychological threat could be lessened by having students reaffirm their sense of personal adequacy or "self-integrity." The intervention, a brief in-class writing assignment, significantly improved the grades of African American students and reduced the racial achievement gap by 40%. These results suggest that the racial achievement gap, a major social concern in the United States, could be ameliorated by the use of timely and targeted social-psychological interventions.

Dar-Nimrod, I. and Heine, S. J., Exposure to Scientific Theories Affects Women's Math Performance

Stereotype threat occurs when stereotyped groups perform worse as their group membership is highlighted. We investigated whether stereotype threat is affected by accounts for the origins of stereotypes. In two studies, women who read of genetic causes of sex differences performed worse on math tests than those who read of experiential causes.

C. Goldin and C. Rouse, Orchestrating Impartiality: The Impact of “Blind” Auditions on Female Musicians

Claudia Goldin and Cecelia Rouse analyzed the results of musicians' auditions for positions at U.S. symphony orchestras between 1970 and 1996. Based on these data, Goldin and Rouse estimate that the use of a screen in auditions increased by 50 percent the probability that a woman would be advanced from certain preliminary rounds and increased several-fold the likelihood that a woman would be selected in the final round. Similar empirical studies have examined the acceptance of articles in academic journals and the evaluation of fellowship applications.

Gonzales, P.M., Blanton, H., and Williams, K.J. The effects of stereotype threat and double-minority status on the test performance of Latino women

This study investigated the interactive influences of diagnosticity instructions, gender, and ethnicity as they related to task performance. In a laboratory experiment of 120 male and female, Latino and White college students, both a gender-based and an ethnicity-based stereotype-threat effect were found to influence performance on a test of mathematical and spatial ability. Closer inspection revealed that the gender effect was qualified by ethnicity, whereas the ethnicity effect was not qualified by gender. This suggests that the ethnicity of Latino women sensitized them to negative stereotypes about their gender, leading to a performance decrement in a context in which stereotype threat was activated. In contrast, it appeared that the gender of Latino women did not sensitize them to negative stereotypes about their ethnicity, because both male and female Latinos evidenced ethnicity-based stereotype threat. These findings have implications for the interplay between multiple group identities as they relate to concern for confirming negative stereotypes.

Hunter, J. E., Schmidt, F. L., Racial and gender bias in ability and achievement tests: Resolving the apparent paradox

The study of potential racial and gender bias in individual test items is a major research area today. The fact that research has established that total scores on ability and achievement tests are predictively unbiased raises the question of whether there is in fact any real bias at the item level. No theoretical rationale for expecting such bias has been advanced. It appears that findings of item bias (differential item functioning; DIF) can be explained by three factors: failure to control for measurement error in ability estimates, violations of the unidimensionality assumption required by DIF detection methods, and reliance on significance testing (causing tiny artifactual DIF effects to be statistically significant because sample sizes are very large). After taking into account these artifacts, there appears to be no evidence that items on currently used tests function differently in different racial and gender groups.
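The third factor named above, significance testing with very large samples, can be illustrated with a short calculation. This is a minimal sketch and not the authors' analysis: the item pass rates (51% vs. 50%) and the sample sizes are hypothetical, and a plain two-proportion z test stands in for a full DIF procedure.

```python
import math

def two_proportion_z(p_ref, p_focal, n_per_group):
    """z statistic for a difference in proportion correct between two equal-sized groups."""
    # With equal group sizes, the pooled proportion is the simple mean.
    pooled = (p_ref + p_focal) / 2
    se = math.sqrt(pooled * (1 - pooled) * 2 / n_per_group)
    return (p_ref - p_focal) / se

# Hypothetical item: the same one-point gap in pass rates at two sample sizes.
z_small = two_proportion_z(0.51, 0.50, 1_000)
z_large = two_proportion_z(0.51, 0.50, 100_000)

print(f"n = 1,000 per group:   z = {z_small:.2f}")
print(f"n = 100,000 per group: z = {z_large:.2f}")
```

Under these assumptions the gap is identical in both cases, yet only the larger sample pushes z past the conventional 1.96 cutoff, which is why a significant result alone says little about the size of an item-level effect.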

Keller, J., Blatant stereotype threat and women’s math performance: Self-handicapping as a strategic means to cope with obtrusive negative performance expectations

Examined the impact of increased salience of negative stereotypic expectations on math performance among high school students. Results indicated that female students in the condition of heightened salience of negative stereotypic expectations underperformed in comparison to their control group counterparts. The effect of blatant stereotype threat resulted in increased self-handicapping tendencies in women, which led to significantly impaired math performance.

Keller, J. & Dauenheimer, D. Stereotype threat in the classroom: Dejection mediates the disrupting threat effect on women’s math performance

Research on stereotype threat, which is defined as the risk of confirming a negative stereotypic expectation about one's group, has demonstrated that the applicability of negative stereotypes disrupts the performance of stigmatized social groups. While it has been shown that a reduction of stereotype threat leads to improved performance by members of stigmatized groups, there is a lack of clear-cut findings about the mediating processes. The aim of the present study is to provide a better understanding of the mechanisms that stereotype threat causes in women working on mathematical problems. In addition, the study set out to test stereotype threat theory in a natural environment: high school classrooms. The experiment involved the manipulation of the gender fairness of a math test. The results indicate that the stereotype threat effect exists in this everyday setting. Moreover, it appears that dejection emotions mediate the effect of threat manipulation.

Langenfeld, T. E., Test Fairness: Internal and External Investigations of Gender Bias in Mathematics Testing

What two major approaches have been used to study gender bias in test scores? How do statistical DIF detection methods differ? How does DIF screening of items affect mean score differences?

Martens, A., Johns, M., Greenberg, J., and Schimel, J., Combating stereotype threat: The effect of self-affirmation on women’s intellectual performance

The present studies were designed to investigate the effects of self-affirmation on the performance of women under stereotype threat. In Study 1, women performed worse on a difficult math test when it was described as diagnostic of math intelligence (stereotype threat condition) than in a non-diagnostic control condition. However, when women under stereotype threat affirmed a valued attribute, they performed at levels comparable to men and to women in the no-threat control condition. In Study 2, men and women worked on a spatial rotation test and were told that women were stereotyped as inferior on such tasks. Approximately half the women and men self-affirmed before beginning the test. Self-affirmation improved the performance of women under threat, but did not affect men’s performance.

McCornack, R.L., McLeod, M.M., Gender Bias in the Prediction of College Course Performance

Is the relationship of college grades to the traditional predictors of aptitude test scores and high school grades different for men and women? The usual gender bias of underpredicting the grade point averages of women may result from gender-related course selection effects. This study controlled course selection effects by predicting single course grades rather than a composite grade from several courses. In most of the large introductory courses studied, no gender bias was found that would hold up on cross-validation in a subsequent semester. Usually, it was counterproductive to adjust grade predictions according to gender. Grade point average was predicted more accurately than single course grades.
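The notion of underprediction in this abstract can be made concrete with a small calculation: fit one common regression line for everyone, then compare the average residual by group. The sketch below is illustrative only. The scores, grades, and group labels are invented, and a single-predictor least-squares fit stands in for whatever prediction equations the study actually used.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_residual_by_group(x, y, group):
    """Average (actual - predicted) per group under a single common line."""
    slope, intercept = fit_line(x, y)
    sums, counts = {}, {}
    for xi, yi, g in zip(x, y, group):
        sums[g] = sums.get(g, 0.0) + (yi - (intercept + slope * xi))
        counts[g] = counts.get(g, 0) + 1
    return {g: sums[g] / counts[g] for g in sums}

# Hypothetical data: predictor score, course grade, and group label.
score = [1, 2, 3, 4, 1, 2, 3, 4]
grade = [2.0, 2.5, 3.0, 3.5, 2.2, 2.7, 3.2, 3.7]
group = ["M", "M", "M", "M", "F", "F", "F", "F"]

residuals = mean_residual_by_group(score, grade, group)
print(residuals)
```

A positive mean residual for a group means the common line predicts lower grades than that group actually earns, which is the sense of "underpredicting" used above. In this invented data set the "F" group is underpredicted and the "M" group is overpredicted by the same amount.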

Shepard, L., Camilli, G., & Averill, M., Comparison of procedures for detecting test-item bias with both internal and external ability criteria

Test bias is conceptualized as differential validity. Statistical techniques for detecting biased items work by identifying items that may be measuring different things for different groups; they identify deviant or anomalous items in the context of other items. The conceptual basis and technical soundness were reviewed for the following item bias methods: transformed item difficulties, item discriminations, one- and three-parameter item characteristic curve methods, and chi-square methods. Sixteen bias indices representing these approaches were computed for black-white and Chicano-white comparisons on both the verbal and nonverbal Lorge-Thorndike Intelligence Tests. In addition, bias indices were recomputed for the Lorge-Thorndike tests using an external criterion. Convergent validity among bias methods was examined in correlation matrices, by factor analysis of the method correlations, and by ratios of agreements in the items found to be "most biased" by each method. Although evidence of convergent validity was found, there will still be important practical differences in the items identified as biased by different methods. The signed full chi-square procedure may be an acceptable substitute for the theoretically preferred but more costly three-parameter signed indices. The external criterion results also reflect on the validity of the methods; arguments were advanced, however, as to why internal bias methods should not be thought of as proxies for a predictive validity model of unbiasedness.

Shepard, L., Camilli, G., & Williams, D., Accounting for statistical artifacts in item bias research

Theoretically preferred IRT bias detection procedures were applied to both a mathematics achievement and vocabulary test. The data were from black and white seniors on the High School and Beyond data files. To account for statistical artifacts, each analysis was repeated on randomly equivalent samples of blacks and whites (n's = 1,500). Furthermore, to establish a baseline for judging bias indices that might be attributable only to sampling fluctuations, bias analyses were conducted comparing randomly selected groups of whites. To assess the effect of mean group differences on the appearance of bias, pseudo-ethnic groups were created, that is, samples of whites were selected to simulate the average black-white difference. The validity and sensitivity of the IRT bias indices was supported by several findings. A relatively large number of items (10 of 29) on the math test were found to be consistently biased; they were replicated in parallel analyses. The bias indices were substantially smaller in white-white analyses. Furthermore, the indices (with the possible exception of χ²) did not find bias in the pseudo-ethnic comparison. The pattern of between-study correlations showed high consistency for parallel ethnic analyses where bias was plausibly present. Also, the indices met the discriminant validity test: the correlations were low between conditions where bias should not be present. For the math test, where a substantial number of items appeared biased, the results were interpretable. Verbal math problems were systematically more difficult for blacks. Overall, the sums-of-squares statistics (weighted by the inverse of the variance errors) were judged to be the best indices for quantifying ICC differences between groups. Not only were these statistics the most consistent in detecting bias in the ethnic comparisons, but they also intercorrelated the least in situations of no bias.

Shih, M., Pittinsky, T.L., & Ambady, N., Stereotype Susceptibility: Identity Salience and Shifts in Quantitative Performance

Recent studies have documented that performance in a domain is hindered when individuals feel that a sociocultural group to which they belong is negatively stereotyped in that domain. We report that implicit activation of a social identity can facilitate as well as impede performance on a quantitative task. When a particular social identity was made salient at an implicit level, performance was altered in the direction predicted by the stereotype associated with the identity. Common cultural stereotypes hold that Asians have superior quantitative skills compared with other ethnic groups and that women have inferior quantitative skills compared with men. We found that Asian-American women performed better on a mathematics test when their ethnic identity was activated, but worse when their gender identity was activated, compared with a control group who had neither identity activated. Cross-cultural investigation indicated that it was the stereotype, and not the identity per se, that influenced performance.

Spencer, S.J., Steele, C.M., & Quinn, D.M., Stereotype threat and women’s math performance

When women perform math, unlike men, they risk being judged by the negative stereotype that women have weaker math ability. We call this predicament stereotype threat and hypothesize that the apprehension it causes may disrupt women's math performance. In Study 1 we demonstrated that the pattern observed in the literature that women underperform on difficult (but not easy) math tests was observed among a highly selected sample of men and women. In Study 2 we demonstrated that this difference in performance could be eliminated when we lowered stereotype threat by describing the test as not producing gender differences. However, when the test was described as producing gender differences and stereotype threat was high, women performed substantially worse than equally qualified men did. A third experiment replicated this finding with a less highly selected population and explored the mediation of the effect. The implication that stereotype threat may underlie gender differences in advanced math performance, even those that have been attributed to genetically rooted sex differences, is discussed.

Steele, C.M., & Aronson, J., Stereotype threat and the intellectual test performance of African-Americans

Stereotype threat is being at risk of confirming, as self-characteristic, a negative stereotype about one's group. Studies 1 and 2 varied the stereotype vulnerability of Black participants taking a difficult verbal test by varying whether or not their performance was ostensibly diagnostic of ability, and thus, whether or not they were at risk of fulfilling the racial stereotype about their intellectual ability. Reflecting the pressure of this vulnerability, Blacks underperformed in relation to Whites in the ability-diagnostic condition but not in the nondiagnostic condition (with Scholastic Aptitude Tests controlled). Study 3 validated that ability-diagnosticity cognitively activated the racial stereotype in these participants and motivated them not to conform to it, or to be judged by it. Study 4 showed that mere salience of the stereotype could impair Blacks' performance even when the test was not ability diagnostic. The role of stereotype vulnerability in the standardized test performance of ability-stigmatized groups is discussed.

Walton, G. M. and Cohen, G. L., Stereotype Lift

When a negative stereotype impugns the ability or worth of an outgroup, people may experience stereotype lift - a performance boost that occurs when downward comparisons are made with a denigrated outgroup. In a meta-analytic review, members of non-stereotyped groups were found to perform better when a negative stereotype about an outgroup was linked to an intellectual test than when it was not (d = .24, p < .0001). Notably, people appear to link negative stereotypes to evaluative tests more or less automatically. Simply presenting a test as diagnostic of ability was thus sufficient to induce stereotype lift. Only when negative stereotypes were explicitly invalidated or rendered irrelevant to the test did the lift effect disappear.
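The effect size d reported above is a standardized mean difference. As a rough sketch of how such a value is computed for a single two-group comparison (not the meta-analytic procedure itself), the following uses the common pooled-standard-deviation form of Cohen's d. The two score lists are made up for illustration and do not come from the review.

```python
import math
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # statistics.variance is the sample variance (n - 1 denominator).
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)

# Hypothetical test scores: stereotype linked to the test vs. not linked.
linked = [12, 14, 16, 18, 20]
not_linked = [11, 13, 15, 17, 19]

d = cohens_d(linked, not_linked)
print(f"d = {d:.2f}")
```

Because d is expressed in standard-deviation units, a value such as .24 indicates that the two group means differ by about a quarter of a standard deviation, regardless of the test's raw score scale.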

Walton, G. M. and Spencer, S. J., Latent Ability: Grades and Test Scores Systematically Underestimate the Intellectual Ability of Negatively Stereotyped Students

Past research has assumed that group differences in academic performance entirely reflect genuine differences in ability. In contrast, extending research on stereotype threat, we suggest that standard measures of academic performance are biased against non-Asian ethnic minorities and against women in quantitative fields. This bias results not from the content of performance measures, but from the context in which they are assessed - from psychological threats in common academic environments, which depress the performances of people targeted by negative intellectual stereotypes. Like the time of a track star running into a stiff headwind, such performances underestimate the true ability of stereotyped students. Two meta-analyses, combining data from 18,976 students in five countries, tested this latent-ability hypothesis. Both meta-analyses found that, under conditions that reduce psychological threat, stereotyped students performed better than nonstereotyped students at the same level of past performance. Walton & Spencer discuss implications for the interpretation of and remedies for achievement gaps.