 

FIRST DRAFT: DO NOT QUOTE

 

Figuring out the Importance of Research Results: Statistical Significance versus Practical Significance

 

M. D. Gall

University of Oregon

 

Paper to be presented at the 2001 annual meeting of the American Educational Research Association

 

 

Introduction

At some point in the analysis of data from a study, researchers face the question, "How important are these results?" Tests of statistical significance and effect sizes are commonly invoked to provide the answer. Both approaches have merit but, as I demonstrate in this paper, their value for this particular purpose is quite limited. I propose several other forms of statistical analysis that are better for making judgments about the importance of research results.

Research results can be important for theory development or for the improvement of practice. My concern in this paper is with the importance of research results for the improvement of practice, specifically the improvement of educational practice. The question I am addressing is this: How should educational researchers go about determining the practical significance of their research results?

My analysis focuses primarily on results obtained in quantitative research, but I extend my argument later in the paper to the problem of determining the practical significance of results obtained from qualitative research studies.

With respect to quantitative research, I use the common experimental/control-group design as the exemplar. My line of reasoning and conclusions, though, can be extended to other quantitative research designs.

 

Approaches to Determining Practical Significance

 

Tests of Statistical Significance

It is an unfortunate circumstance that statistical methods used to test the null hypothesis are commonly called tests of statistical significance. Equally unfortunate is the tendency to make statements of the type, "The difference between the experimental and control group was significant at the .05 level," or "The correlation between the two variables was significant at the .05 level." The word "significant" misleads professional practitioners and the lay public into thinking that the research results are important for this reason. In fact, even researchers and research journal editors might be swayed into thinking that a research result is important because it is statistically significant, or the converse: that a research result is not important because it is not statistically significant.

In fact, a statistically significant result only tells us that the null hypothesis can be rejected at some level of certainty, assuming that certain conditions (most importantly, random sampling from a defined population) have been satisfied. Rejecting the null hypothesis means that we accept the alternative, namely, that the difference between the experimental and control groups is not a consequence of sampling error but rather a real difference; in other words, the samples come from different populations having different mean scores.

The finding of a real group difference is not important in and of itself. If one compares any two groups (e.g., high school freshmen and sophomores, or males and females), they are likely to differ on a great many variables. By itself, then, a significant p value is of relatively little importance. The importance of significant p values is further diminished by the fact that they are easily influenced by sample size, the value of p used as the criterion for rejecting the null hypothesis, and whether the test of statistical significance is one-tailed or two-tailed.
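To make the sample-size point concrete, the following sketch (in Python, with hypothetical means, standard deviations, and group sizes of my own choosing) shows the same 3-point group difference moving from nonsignificant to highly significant as the groups grow:

```python
# A minimal sketch of how the same raw difference yields very different
# p values at different sample sizes. The means, SDs, and sample sizes
# are hypothetical illustrations, not data from any real study.
from scipy.stats import ttest_ind_from_stats

for n in (10, 50, 200, 1000):
    # Same 3-point difference and 10-point SD in every case;
    # only the number of students per group changes.
    t, p = ttest_ind_from_stats(mean1=53.0, std1=10.0, nobs1=n,
                                mean2=50.0, std2=10.0, nobs2=n)
    print(f"n per group = {n:5d}: t = {t:5.2f}, p = {p:.4f}")
```

With 10 students per group the difference is nowhere near significant (p is about .51); with 1,000 per group the identical difference is significant at well beyond the .001 level.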

My claim, then, is that tests of statistical significance say virtually nothing about the importance of a research result. Other researchers (Thompson, XX; p. 185 of ER6) have come to a similar conclusion for the reasons mentioned above and for other reasons as well. Some of these researchers have proposed effect size as a better measure of the importance of a research result.

 

Effect Size

Effect size (ES) is a statistic used to determine the magnitude of a research result. Typically, it is used to determine the magnitude of the difference in the mean scores of two groups on a measure. The effect size does not actually specify the number of points by which the groups differ. Instead, the amount of difference is expressed in standard-deviation units. The advantage of standard-deviation units is that effect sizes calculated on different measures within the same study or across studies have the same meaning.
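For readers who want the computation spelled out, here is a minimal sketch of the usual ES calculation (Cohen's d): the raw difference between group means divided by the pooled standard deviation. The score lists are hypothetical.

```python
# A minimal sketch of the standard effect-size computation (Cohen's d).
import math

def cohens_d(group1, group2):
    n1, n2 = len(group1), len(group2)
    m1 = sum(group1) / n1
    m2 = sum(group2) / n2
    # Unbiased sample variances, then the pooled standard deviation.
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    # The mean difference expressed in standard-deviation units.
    return (m1 - m2) / pooled_sd

experimental = [34, 38, 41, 36, 40, 39, 37, 42]   # hypothetical scores
control      = [33, 37, 40, 34, 38, 36, 35, 41]   # hypothetical scores
print(f"ES = {cohens_d(experimental, control):.2f} standard-deviation units")
```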

The alleged advantage of ES is that it says something about the practical significance of a research result. The term "practical significance" implies a research result that will be viewed as having importance for the practice of education; in other words, one that will be viewed as important by teachers, school administrators, policy makers, and others concerned about the day-to-day workings of education and efforts to improve it. (Hereafter, I will use the term "practitioners" to refer collectively to these various groups.)

The claim that ES is a good measure of the practical significance of a research result (or at least a better measure than tests of statistical significance) involves an assumption that matters of magnitude are important to practitioners. The assumption is doubtless warranted. Practitioners like to find that students' test scores are going up or that some intervention is working to produce measurable gains in student learning. Judgments of "going up" and "is working" require a discernible magnitude of difference between two groups or gain over time within one group.

If a discernible magnitude of difference is good and therefore important, it does not follow that all discernible magnitudes of difference are equally good and therefore equally important. Herein lies a problem in using ES as a measure of the importance of a research result. How much more important is, for example, an ES of .81 than an ES of .33?

To determine the importance of a particular ES statistic, one must first comprehend what an ES signifies. It is an abstract statistic typically derived from mean scores and standard deviations. I doubt that effect sizes are comprehensible to the vast majority of practitioners. Researchers evidently agree with this view, because they sometimes express an ES statistic in terms of percentile differences. For example, an ES of 1.00 means that the average individual in one group scored at the 84th percentile of the other group's score distribution. This is a slightly more comprehensible expression of magnitude for practitioners. However, I think they (and perhaps researchers as well) would be hard pressed to explain how much difference is encompassed between the 50th percentile and the 84th percentile of a score distribution, especially when the characteristics of the outcome measure and the population being studied are under-specified, as they often are, in research reports.
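The percentile translation can be computed directly, assuming normally distributed scores. A minimal sketch (the ES values are illustrative):

```python
# Converting an ES to a percentile: the average member of the
# experimental group (z = ES) is located within the control group's
# score distribution, assuming normality.
from scipy.stats import norm

for es in (0.33, 0.81, 1.00):
    percentile = norm.cdf(es) * 100
    print(f"ES = {es:.2f}: average experimental student scores at the "
          f"{percentile:.0f}th percentile of the control distribution")
```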

Setting aside the issue of whether ES is a comprehensible statistic, researchers are faced with the problem of judging the importance of a research result expressed as an ES. Glass (XX) proposed that an ES of magnitude .33 or greater (true?) has practical significance for education. However, there are two reasons why an ES of .33 (or any other proposed ES) is suspect as a criterion for judging the importance of a research result.

The first reason is that ES treats all measurement scales alike. For example, an ES does not signify whether two groups differed by 3 points on a 50-point scale or by 3 points on a 10-point scale; yet practitioners are likely to view the latter difference as potentially more important than the former.
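A small illustration of this point: the two hypothetical results below yield an identical ES, even though one reflects a 3-point difference on a 50-point test and the other a 1-point difference on a 10-point test. Nothing in the ES itself recovers the scale.

```python
# Two hypothetical results with the same effect size but very different
# raw differences: the ES conceals the measurement scale entirely.
scenarios = [
    ("50-point test", 3.0, 6.0),   # (label, raw mean difference, pooled SD)
    ("10-point test", 1.0, 2.0),
]
for label, raw_diff, pooled_sd in scenarios:
    print(f"{label}: difference = {raw_diff} points, "
          f"ES = {raw_diff / pooled_sd:.2f}")   # both print ES = 0.50
```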

The second reason is that ES does not express the shape or variability of the score distributions of the two groups. (The magnitude of an ES is influenced by distribution shape and variability, but this is not the same thing as saying that an ES does, or does not, express distribution shape and variability.) While the shape and variability of score distributions typically are not important to researchers (judging by my reading of many published studies), they are, or should be, important to practitioners. As a public good, education must serve all children, and therefore the importance of an intervention should be judged not only by whether it raises the mean score of a group, but also by whether it raises the scores of students at all points in the distribution. ES does not provide the information for judging the importance of a research result by this criterion.

 

Recommendations

My argument thus far is this: (1) tests of statistical significance tell us virtually nothing about the importance of a research result; and (2) ES tells us about magnitude of difference, which is important, but it is difficult for practitioners to comprehend and too limited in the information it conveys to them. The following discussion presents several alternative statistical approaches that I think provide a better basis for judging the importance of a research result.

 

Sample Distribution on a Scoring-Guide Scale

The recent national and state emphasis on high-stakes achievement testing has brought scoring guides into prominence. The guides consist of scales, with each point on the scale representing a meaningful level of achievement on some domain of curriculum or performance. An example of a scoring guide is shown in the leftmost column of Table 1. It is an abbreviated form of a scoring guide developed by the California Department of Education to measure student performance on essay-type mathematics problems.

If we consider the case of an experimental/control-group design, researchers might analyze their data to show the percentage of individuals in each group who are at each point on the scale. Table 1 presents hypothetical data of this type.

Research results presented in this manner would be meaningful both to practitioners and researchers. Moreover, it is relatively easy to make judgments about the importance of group differences.

Using the hypothetical data set in Table 1, suppose that a score of 3 represents a minimal level of mastery to be awarded a "Pass" by the state. We can examine the results to determine what percentage of individuals in the experimental and control group achieve the criterion score of 3 or higher. The importance of a research result would be judged by the magnitude of the difference in percentages of the two groups that achieve mastery.

For example, we find in Table 1 that 65 percent of the experimental group, but only 50 percent of the control group, achieved mastery on the posttest administration. This is an important result, because the state's goal presumably is to have 100 percent of students achieve mastery, and the experimental intervention moves students toward this goal. The research result would be especially important to school districts with an unusually low percentage of students achieving mastery. These districts will be looking for any intervention that can increase this percentage.
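As a sketch of how this mastery analysis might be computed, the following uses the hypothetical posttest percentages from Table 1, with mastery defined as a scoring-guide level of 3 or higher:

```python
# Mastery analysis from a scoring-guide distribution, using the
# hypothetical posttest percentages in Table 1.
posttest = {
    # scoring-guide level: (experimental %, control %)
    6: (10, 7),
    5: (15, 3),
    4: (30, 20),
    3: (10, 20),
    2: (20, 35),
    1: (15, 15),
}
MASTERY_LEVEL = 3   # the state's hypothetical "Pass" criterion

exp = sum(e for level, (e, c) in posttest.items() if level >= MASTERY_LEVEL)
ctl = sum(c for level, (e, c) in posttest.items() if level >= MASTERY_LEVEL)
print(f"Experimental mastery: {exp}%")               # 65%
print(f"Control mastery:      {ctl}%")               # 50%
print(f"Difference:           {exp - ctl} percentage points")
```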

A similar approach to displaying research results is seen in some survey research studies. A sample of individuals is given a set of scales, each of which asks them to express their attitude or opinion about a particular topic. The scales typically include 5 to 7 points, with each point representing a particular intensity of attitude or opinion along a continuum, for example: Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree.

The advantage of labeling the points this way is that the labels are meaningful to respondents, practitioners, and researchers. The scale points become even more useful when the researchers display the percentage or number of individuals in the sample who check each point. These research results are important to the extent that researchers or practitioners are interested in the topic being measured and in how individuals distribute themselves along the continuum of attitude or opinion about the topic.
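A minimal sketch of such a display, using hypothetical responses to a single five-point scale:

```python
# Displaying survey results as the number and percentage of respondents
# at each labeled scale point. The responses are hypothetical.
from collections import Counter

LABELS = ["Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"]
responses = [4, 5, 3, 4, 4, 2, 5, 4, 3, 4, 5, 1, 4, 3, 4, 2, 4, 5, 4, 3]

counts = Counter(responses)
n = len(responses)
for point, label in enumerate(LABELS, start=1):
    k = counts.get(point, 0)
    print(f"{point}. {label:17s} {k:2d} respondents ({100 * k / n:4.1f}%)")
```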

 

Comparison of a Sample to a Meaningful Reference Group

Some research results are in the form of test scores that can be compared to a table of norms. This is the case with standardized achievement tests and measures of personality. By comparing the mean score of the sample to a table of norms, researchers and practitioners can determine how well the sample performed relative to the population on whom the test was normed. This comparison can help them draw conclusions about whether the sample is average, superior, or deficient with respect to the population. If grade norms are available, the researchers can determine whether the sample is performing at grade level, or above or below it.

Tables of norms, like the ES statistic, tell us something about magnitude of difference. However, the magnitude of difference revealed by a table of norms might not be meaningful or useful information to practitioners. For example, suppose the research sample consists of fifth-graders and they are found to be reading at the third-grade level on a particular standardized reading test. How well does the typical fifth-grader read, and how well does the typical third-grader read? And what is the magnitude of the gap? A reading specialist or a teacher at these grade levels might be able to answer these questions, but few others could. If practitioners don't know what these levels mean, they cannot form a sound judgment about whether the research result is important.

To make test scores meaningful, I think it is necessary to create norm groups that practitioners find meaningful. To give a sense of what I mean, consider a sport such as golf. Serious golfers, whether professionals or amateurs, know what a score of 72 (often called "par") means for elite professional golfers playing on golf course layouts designed for them. They also have a sense of the level of golfing ability represented by scores in the 80s, 90s, and 100s. They know what it means to be a "low handicapper" or a "high handicapper"; and they are likely to take notice of a golfer who has found an instructor, technique, or piece of equipment that has taken several strokes off his or her game.

What are examples of such meaningful reference groups in education? In the case of academic achievement tests, possibilities include students who have consistently failed courses in the curriculum domains covered by the test, and students who have performed at the highest levels in these courses. With students such as these as anchor points, and with a careful description of the test, practitioners will have a better sense of the level of achievement represented by an experimental or control group's mean test score.

 

Confidence Limits

Confidence limits provide a method for estimating population values from sample statistics, such as a sample mean or correlation coefficient. These statistics are used to estimate a range of values (the confidence limits) that is likely to include the actual population parameter. For example, if the 95 percent confidence limits for a sample mean are 26 and 34, we can infer that there is a high likelihood that the true population mean lies between these two values. More precisely, we can infer that if we drew 100 samples from the same population that we sampled for our study and computed confidence limits for each, only about five of those intervals would fail to include the actual population mean. Following the same reasoning as in tests of significance used to test the null hypothesis, we can argue that it is unlikely that our sample would be among those five; therefore, we conclude that the calculated confidence limits are likely to include the true population mean.
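As a sketch, the following computes 95 percent confidence limits around a sample mean using the normal approximation; the sample statistics are hypothetical, chosen to reproduce the limits of 26 and 34 discussed above.

```python
# 95 percent confidence limits for a sample mean, normal approximation.
# The mean, SD, and sample size are hypothetical.
import math

mean, sd, n = 30.0, 14.4, 50          # hypothetical sample statistics
se = sd / math.sqrt(n)                # standard error of the mean
z = 1.96                              # two-tailed z for 95% confidence
lower, upper = mean - z * se, mean + z * se
print(f"95% confidence limits: {lower:.1f} to {upper:.1f}")   # 26.0 to 34.0
```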

Confidence limits typically are expressed as "margins of error" in public opinion polls and election polls. For example, the Oregon Annual Social Indicators Survey for the year 2000 (as reported in The Register-Guard, March 8, 2001, p. 1A) had a "margin of error of 3.3 percent" for the various percentages reported (e.g., 82.5% of the sample indicated that they trusted state government to do what is right often or sometimes). Although policy makers and the public may not understand the statistical meaning of a margin of error, they likely would understand the point, if it was explained to them, that the reported percentage might be a bit higher or lower if the entire population had been sampled.
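The margin-of-error computation itself is simple. The sketch below uses the conservative formula evaluated at a 50 percent proportion; the sample size is my assumption, since the news report cited above gives only the margin.

```python
# Conservative poll margin of error at 95 percent confidence.
# The sample size n is an assumption (not reported in the cited article),
# chosen so the result matches the reported 3.3 percent margin.
import math

n = 880
moe = 1.96 * math.sqrt(0.25 / n)      # 0.25 = p * (1 - p) at p = 0.5
print(f"Margin of error: ±{moe * 100:.1f} percentage points")   # ≈ ±3.3
```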

In effect, confidence limits provide information about how precise a numerical value obtained from a sample is. This information can be important to practitioners, such as when political party workers have the results of a poll of likely voters and must judge from those results whether a political race is tight or strongly in one candidate's favor; their judgment might well influence subsequent campaign strategy and spending.

Are confidence limits also useful in education? In fact, there are polls of public opinion about educational matters, and the results of these polls can be important to practitioners at the local, regional, or national level. Because such polls typically involve samples, I believe that confidence limits should be calculated around the obtained statistics. One reason for my belief is that confidence limits can be explained so that they are meaningful to practitioners.

The second reason is that confidence limits help practitioners and researchers judge the importance of a result. If the confidence interval is narrow, practitioners have a high degree of assurance that the research result is a true, or nearly true, number for decision-making purposes. Conversely, if the interval is wide, practitioners must be more tentative. A wide confidence interval for a result relating to an important variable might even lead practitioners to request a replication of the study, but with a larger sample.

 

Determining the Importance of Qualitative Research Results

Qualitative research involves the intensive study of specific instances (sometimes called cases) of a phenomenon. The purpose of the research is to achieve an understanding of the phenomenon. This understanding may contribute to educational practice or theory, or it might suggest theoretical hypotheses and variables that can be studied by quantitative research methodology.

Qualitative researchers rely heavily on verbal data and artifacts, but occasionally they collect numerical data as well. For example, they might select a case to study because the individual or group earned an unusually high or low score on a test. Or they might study the effects of an intervention on an individual or group; one of the effects might be a quantitative outcome, such as performance on an achievement test.

Because qualitative research focuses on cases rather than samples, tests of statistical significance and the ES statistic are not applicable to these types of quantitative data. However, I think that there are two situations in which scoring guides and meaningful reference groups can inform judgments about the importance of qualitative research results.

The first situation involves the sampling strategy used to select cases for a qualitative research study. Patton (xx) identified 15 possible strategies for this purpose. One of them is extreme or deviant case sampling, in which the qualitative researcher selects cases that are unusual or special in some way. As I noted above, high-stakes tests involving scoring guides are becoming increasingly prominent. A qualitative researcher might choose to study one or more students who have scored at the highest level or the lowest level on such a test. If the levels of the scoring guide are defined meaningfully, practitioners are likely to view the study's results as important because they involve cases that are meaningful to them. They would examine findings about high-scoring students for insights that could help them develop interventions to benefit lower-scoring students. They would examine findings about low-scoring students for insights that help them understand these students and that point the way toward helping them achieve at a higher level.

Student performance on tests that involve a meaningful scoring guide can be used in a similar manner in some of the other qualitative-research sampling strategies described by Patton. Among them are typical case sampling, maximum variation sampling (selecting cases that illustrate the range of variation in the phenomena to be studied), and stratified purposeful sampling (selecting cases at defined points of variation).

The other situation in which scoring guides and meaningful reference groups can be useful in judging the importance of qualitative research results involves studies in which the phenomenon being studied includes an outcome measure. For example, a researcher might study the effects of an educational intervention on an individual student or group of students considered as a case. Various types of data typically would be collected. One type might be performance on a quantitative test. If the test scores indicate meaningful performance levels on a scoring guide or can be tied to a meaningful reference group, practitioners will be able to comprehend the scores and judge whether the intervention made an important difference in the learning of the individual or group that was studied. An intervention judged to be important might be pursued further. For example, it could be developed into a full-fledged program or procedure and be investigated using representative samples rather than cases.

 

Conclusions

[discuss extension to other research designs]

My purpose in this paper has been to consider whether tests of statistical significance and effect sizes provide useful information for judging the importance of research results to individuals who have a stake in the practice of education. I have shown that tests of statistical significance provide no useful information for this particular purpose. Effect sizes are more useful, because they provide information about magnitude of difference, which is of interest to practitioners. Therefore, they should be reported routinely in studies that purport to have implications for practice.

While effect sizes are useful, they are not comprehensible to most practitioners, and they do not portray the magnitude of difference in a manner that facilitates decision-making. I have suggested two alternatives to effect sizes that do not have these limitations: (1) tests for which there is a scoring guide that distinguishes important differences in levels of performance; and (2) the formation of meaningful reference groups to which the research sample's scores can be compared. I also argued that, in certain situations, the reporting of confidence limits around a sample statistic can be both comprehensible and important to practitioners.

My reservations about the use of tests of statistical significance and effect sizes parallel the increasing interest among researchers in the consequential validity of tests. Consequential validity refers to the extent to which the values implicit in the constructs measured by a test and in the intended uses of the test are consistent with the values and needs of test users, test-takers, and other stakeholders.

I do not think it is too far a stretch to apply the concept of consequential validity to the methods used to judge the importance of research results. If we do so, I think that tests of statistical significance will not fare well. Such terms as "test of statistical significance" and "a statistically significant result" can and do mislead practitioners into thinking that a research result has value for schooling. Because statistically significant results are valued by journal editors and the research community in general, individual researchers might feel a subtle coercion to design their studies to maximize the likelihood of achieving these results (by increasing the sample size or by using other methods mentioned earlier in the paper).

Effect sizes have consequential validity in that they are designed to describe the magnitude of differences, which is information of value to practitioners. By this, I mean that practitioners continually search for interventions that will improve students' learning; effect sizes provide information about whether a research result pertaining to an intervention indicates that a learning gain has occurred. At the same time, effect sizes lack consequential validity in that they are too abstract and contain too little information about the magnitude of differences to be of use to practitioners.

The alternative statistics that I recommended in this paper might have deficiencies in consequential validity of which I'm not aware. If so, we will need to develop better forms of statistical analysis. At this time, though, the greatest need is for researchers to become more sensitive to the need for consequential validity in their statistical analyses. If researchers wish to make claims about the importance of their research results for practice, they need to perform statistical analyses that are comprehensible and that inform practitioners' efforts to improve education.

 

 

References

[To be added]

Table 1

Percentage of Students at Each Level of Performance on a Scoring Guide Assessing Ability to Do Essay-Type Mathematical Problems

Performance Standard                             Experimental Group       Control Group
                                                 Pre   Post  Change     Pre   Post  Change

(6) Exemplary Response: clear, elegant            2%   10%   + 8%        8%    7%   - 1%
    explanation; clear diagram; all important
    problem elements are identified; strong
    supporting arguments

(5) Competent Response: reasonably clear          8%   15%   + 7%        2%    3%   + 1%
    explanation; might or might not include
    diagram; the most important problem
    elements are identified; solid supporting
    arguments

(4) Minor Flaws, But Satisfactory:               15%   30%   +15%       20%   20%     0%
    satisfactory completion of the problem;
    explanation or diagram might be muddled;
    understands the underlying mathematical
    ideas

(3) Serious Flaws, But Nearly Satisfactory:      15%   10%   - 5%       15%   20%   + 5%
    fails to complete or omits significant
    part of the problem; computation,
    explanations, and use of mathematical
    terms might be muddled

(2) Begins, But Fails to Complete the            40%   20%   -20%       40%   35%   - 5%
    Problem: explanation or diagram shows no
    understanding of the problem situation;
    major computational problems

(1) Unable to Begin Effectively: words and       20%   15%   - 5%       15%   15%     0%
    diagram do not represent the problem;
    does not attempt a solution