
Research and Evaluation


Topic: "What are others doing in their evaluations to take into account instrument changes across the span of a project?"


Topic started by: Erin Stack on 2/25/15

Specifically, what can you do with student data that comes from different assessments across time, for example, transitioning from a state assessment to Smarter Balanced, PARCC, or a new state assessment?


Some Statistical Issues with Changing Assessments

posted by: Michael Culbertson on 3/6/2015 3:53 pm

I see two primary challenges with shifting assessments (there are likely more): differences in scale and differences in coverage. A difference in scale means, for example, that a score of 75 on assessment A is not the same as a score of 75 on assessment B, or that a gain of 5 points on assessment A is not the same as a gain of 5 points on assessment B. A difference in coverage means that even though both assessments are, say, third-grade math tests, the assessments may emphasize different parts of the third-grade mathematics curriculum. I think differences in scale are generally easier to deal with than differences in coverage.

For differences in scale, if the change in assessments happens between pre- and post-tests for all of your participants, the ANCOVA or linear regression framework is your friend. Due to the change in scale, the difference in pre- and post-test scores (the gain score) is not terribly meaningful (which is a problem for single-group designs); however, despite the difference in scale, the pre- and post-test are still going to be correlated (assuming the tests still measure the same thing), because correlation is "scale free." One of the main reasons for including pre-tests in multiple group designs (e.g., with a treatment and a comparison group) is to capitalize on these correlations to explain additional variation in your post-test measure and thereby increase the power of your analysis. In the ANCOVA/regression framework, explanatory pre-intervention covariates (including your pre-test) can be any variable that is meaningfully correlated with your outcome variable (post-test), so a change in scale from pre-test to post-test poses little difficulty.
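The point about the pre-test needing only to be correlated with the outcome, not on the same scale, can be illustrated with simulated data. This is a minimal sketch, not anyone's actual analysis: the sample size, test scales, and true treatment effect below are all invented for illustration, and the ANCOVA is fit as an ordinary least-squares regression.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Latent ability drives both tests; the pre- and post-tests report it
# on deliberately different (invented) scales.
ability = rng.normal(0, 1, n)
treatment = rng.integers(0, 2, n)                     # 0 = comparison, 1 = treatment
pre = 50 + 10 * ability + rng.normal(0, 3, n)         # old assessment's scale
post = 200 + 25 * ability + 5 * treatment + rng.normal(0, 8, n)  # new scale

# ANCOVA as regression: post ~ intercept + treatment + pre-test covariate.
# The pre-test's different scale is absorbed by its slope coefficient.
X = np.column_stack([np.ones(n), treatment, pre])
beta, *_ = np.linalg.lstsq(X, post, rcond=None)
print(f"estimated treatment effect: {beta[1]:.2f}")   # near the simulated 5
```

Because the regression estimates its own slope for the pre-test, the mismatch in scales never enters the treatment-effect estimate; what matters is the pre/post correlation, which shrinks the residual variance and increases power.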

If the change in assessment happens for some participants but not others, things are a little more complicated. This might happen, for example, when participants from multiple cohorts are being pooled together: Cohorts that take place before the change in assessments may have both pre- and post-tests from the old assessment, and cohorts after the change may have pre- and post-tests from the new assessment. Depending on how long the intervention is, some participants may also have a pre-test from the old assessment and a post-test from the new assessment. Here, standardizing each measure (subtracting the mean and dividing by the standard deviation) puts the two assessments on a common scale (but only if they really only differ in their scale, see below!). I would recommend using state or national norms for the means and standard deviations, since they will likely be more reliable than statistics calculated from your particular sample. These norms can often be found in the technical documentation for the assessments, which many state testing programs publish on their website.
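The standardization step described above is simple to mechanize. In this sketch the norm means and standard deviations are made-up placeholder values, not statistics from any real testing program; in practice you would substitute the published state or national norms.

```python
import numpy as np

# Hypothetical published norms: {assessment: (mean, standard deviation)}.
# These numbers are illustrative only.
NORMS = {"old_test": (65.0, 12.0), "new_test": (2500.0, 90.0)}

def standardize(scores, assessment):
    """Convert raw scores to z-scores using the assessment's norm statistics."""
    mean, sd = NORMS[assessment]
    return (np.asarray(scores, dtype=float) - mean) / sd

# Cohorts tested on different assessments land on one common z-score scale.
old_cohort = standardize([59, 71, 77], "old_test")       # -> -0.5, 0.5, 1.0
new_cohort = standardize([2455, 2590, 2635], "new_test") # -> -0.5, 1.0, 1.5
```

After this transformation a z-score of, say, 0.5 means "half a standard deviation above the norm mean" for either cohort, which is what makes pooling defensible when the tests differ only in scale.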

Differences in coverage are more subtle. With respect to power, differences in coverage will decrease the correlation between the two assessments, reducing the variance in the post-test explained by the pre-test. If the change in assessments happens between pre- and post-tests for all participants, this is unfortunate, but not technically a problem (unless the effect you're trying to detect is on the verge of your minimum detectable effect size). In the case of different cohorts taking different pre-tests, the correlation between pre-test and post-test will be different for the different cohorts, according to which pre-test they took. You can account for this simply in the ANCOVA/regression framework by including a dummy variable that indicates which assessment the participant took as well as the interaction of this assessment indicator and the pre-test covariate. This allows each pre-test to correlate with the post-test to a different extent. Having a mix of participants' post-tests makes interpretation of the effect of interest a bit more complicated, but you can similarly account for differential effects on the different post-tests with another dummy variable indicating which post-test the participant took and the interaction of the post-test indicator with your variable of interest (e.g., treatment group).
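The dummy-plus-interaction design described above can be sketched as follows. Again, all scales, sample sizes, and effects are invented for illustration: two cohorts take pre-tests on very different scales, and the indicator and its interaction with the pre-test give each assessment its own intercept and slope.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
ability = rng.normal(0, 1, n)
treat = rng.integers(0, 2, n)
new_pre = rng.integers(0, 2, n)   # 1 if this cohort took the new pre-test

# Both pre-tests measure the same ability, on different (invented) scales.
pre = np.where(new_pre == 1,
               2500 + 90 * ability + rng.normal(0, 20, n),
               65 + 12 * ability + rng.normal(0, 4, n))
post = 100 + 15 * ability + 4 * treat + rng.normal(0, 5, n)

# Design matrix: intercept, treatment, pre-test, assessment dummy, and the
# dummy-by-pre-test interaction, so each pre-test gets its own slope.
X = np.column_stack([np.ones(n), treat, pre, new_pre, new_pre * pre])
beta, *_ = np.linalg.lstsq(X, post, rcond=None)
print(f"estimated treatment effect: {beta[1]:.2f}")  # near the simulated 4
```

The coefficient on the interaction term is the difference in pre/post slopes between the two assessments, so leaving it out would force a single (wrong) slope on both cohorts.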

The more difficult aspect of differences in coverage comes from the usage of pre-tests as a means of accounting for differences in non-equivalent groups (e.g., in non-randomized designs). If you switch assessments between pre- and post-test, your pre-test measure will only account for group differences in the content covered by the pre-test. Insofar as the two assessments differ in coverage, the analysis may not account for some pre-intervention differences in the post-test assessment's content. Basically, your analysis is open to a critique along the lines of "Your effect isn't really due to your intervention; it's due to the fact that your two groups started out with different levels of achievement in content area X, which is assessed by test B and not by test A." The risk of this potential bias will depend on how different the assessments are and how likely it is for groups that have the same level of achievement in the content covered by both assessments nevertheless to have different levels of achievement in the content covered only by the post-test. To make an argument that your analysis is valid and unbiased, you will have to draw mostly on expert opinion and reasoning about the nature of the specific differences between the two assessments, though showing that scores on the two assessments are highly correlated would surely bolster such an argument.

But, the issue of differences in coverage raises another issue in MSP evaluation: To what extent does the assessment you are using cover the content that you expect your intervention to address? How appropriate is a broad-based state assessment for measuring the changes you can expect from your (targeted?) intervention? Are the differences in assessment coverage relevant to the theory of your program?

I've outlined a few theoretical statistical issues. What are some of the practical issues MSP evaluators are running into with changes in assessments? How are differences in assessment coverage related to the theory of action for your MSP program?

Assessments and the Complexity of Comparisons

posted by: Mary Townsend on 3/7/2015 10:25 am

Thanks, Michael, for a rather concise summary of a very complex issue. My challenge is getting district leaders to understand that different assessments are for different audiences. For my money, teachers need to be given the time and permission to analyze actual student work for their instructional decision making. Teachers might then be liberated from the tyranny of looking at data that may or may not really be informative. Comparisons at any level have the potential to misinform.