Problem Statement

Benchmark tests are designed to measure students’ abilities so that students in need of interventions can be identified. Extreme variance in students’ benchmark percentile scores on retakes raises questions about the reliability of these tests. How can a student retake a test within the same 21-day period and receive a score so different from the first? The intent of this study is therefore to examine the reliability of STAR testing in reading and math at the middle school level. This will be achieved by analyzing test-retest pairs occurring within 21 days, guided by the research question, “To what extent does the test administered to the same students twice yield the same results from one administration to the next?” (NWEA, 2004, p. 2). When students (IV) in 6th and 7th grade retake the benchmark test within the same testing period (fall, for example), changes in percentile scores (DV) do not fit within the range of error provided by the original test, thereby calling the reliability of the test into question.

Students typically take benchmark tests in reading and math three times a year: fall, winter, and spring. One way these scores are used is for placement in Scientific Research Based Interventions (SRBI), more commonly known as Response to Intervention (RTI). For reading, students who score above the 40th percentile are considered Tier I and continue to receive regular classroom instruction; students who score between the 25th and 40th percentile receive Tier II interventions; and students who score below the 25th percentile receive Tier III interventions. For math, students who score above the 40th percentile are Tier I, students who score between the 16th and 40th percentile receive Tier II interventions, and students who score below the 16th percentile receive Tier III interventions. Tier II interventions for math occur during workshop (a study period at the end of the day) with a math classroom teacher, where students receive extra help with math homework and target specific skills. Tier II interventions for reading, however, are pull-out: students leave their applied academics classes (specials, such as PE and Music) or workshop twice a week to meet with an interventionist in a small group. Tier III reading interventions are typically more intense, delivered two to five times per week, usually during workshop. Tier III math students are pulled out of their applied academics classes and workshop twice a week to meet one-on-one with an interventionist.
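The tier cutoffs described above can be written as a small lookup. This is only an illustrative sketch: the function name `srbi_tier` is hypothetical, and because the text does not say which tier a score exactly at a cutoff belongs to, the sketch assumes such scores fall in Tier II.

```python
def srbi_tier(subject: str, percentile: float) -> str:
    """Return the SRBI tier for a percentile score, using the cutoffs in the text."""
    # The lower Tier II cutoff differs by subject: 25th for reading, 16th for math.
    tier2_floor = {"reading": 25, "math": 16}
    if subject not in tier2_floor:
        raise ValueError(f"unknown subject: {subject!r}")
    if percentile > 40:
        return "Tier I"
    # Assumption: a score exactly at a cutoff (e.g. 25 or 40) counts as Tier II.
    if percentile >= tier2_floor[subject]:
        return "Tier II"
    return "Tier III"


# The same percentile can lead to different placements in each subject.
print(srbi_tier("reading", 20))  # Tier III
print(srbi_tier("math", 20))     # Tier II
```

Because placement depends on fixed cutoffs, a retest that moves a student only a few percentile points across a boundary is enough to change the services that student receives.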

Therefore, if a student’s score on the benchmark test is above the cutoff for a given tier, the student will not receive that tier’s services. If the test initially shows results lower than the student’s actual abilities, the services the student receives will be inappropriate. A student who scores in Tier III but is really capable of Tier II work will be placed in much more intensive interventions than needed. This is a waste of school resources and instruction time. During this time, the student could otherwise be in class, and interventionists could be working more intensely with a student who has a greater need instead of spending time and energy on unnecessary instruction. Also, from personal experience, students who end up scoring much higher on their second benchmark test of the year are likely to be the ones most resistant to receiving services, because they don’t need the services as currently provided.

There have been many studies of the validity of benchmark tests, examining whether the content taught in class matches the content of the benchmark test. For example, Brown and Coughlin (2007) looked at commercially available testing products used in the Mid-Atlantic region and how valid these tests were in connection with the state assessments. Herman and Baker’s (2005) analysis of the six criteria that make benchmark tests work focuses on their validity but not their reliability. There have also been papers that describe what reliability is, the different types of reliability, what the reliability of specific tests is, and how to calculate it. Studies such as Filbin (2008) and Rodriguez (2005) look at the effect of test length and number of answer options, respectively, on reliability. I found no studies that detailed the reliability of benchmark tests when implemented in school settings. The test companies themselves make reliability coefficients available (.81–.87 for math retests, .82–.91 for reading retests (Brown & Coughlin, 2007)), and other studies make general recommendations about reliability (for example, all benchmark tests should have a “reliability of .85 or higher as measured by Cronbach’s Alpha” (Daniel & Wheeler, 2006)). To my knowledge, however, no studies have analyzed whether the reliability coefficients made available by testing companies match the reliability coefficients obtained when the tests are administered in school settings. Therefore this study will fill a gap in research knowledge and will benefit the school itself, the district, and any other districts that use STAR testing.
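For readers unfamiliar with how a retest reliability coefficient is obtained, one common estimate is the Pearson correlation between scores from the first and second administrations. The sketch below is a minimal illustration of that calculation, not the procedure used by any testing company; the function name `retest_reliability` and the sample scores are hypothetical.

```python
from math import sqrt


def retest_reliability(first, second):
    """Pearson correlation between paired first-test and retest scores."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equally long lists with at least two scores")
    n = len(first)
    mean_x = sum(first) / n
    mean_y = sum(second) / n
    # Sum of cross-products and sums of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(first, second))
    var_x = sum((x - mean_x) ** 2 for x in first)
    var_y = sum((y - mean_y) ** 2 for y in second)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for five students on the original test and the retest.
first_scores = [22, 35, 41, 58, 73]
retest_scores = [30, 33, 52, 55, 80]
print(round(retest_reliability(first_scores, retest_scores), 2))
```

A coefficient computed this way from school retest data could then be compared with the published ranges cited above.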

More specifically, audiences who will benefit from this study include administrators, curriculum directors, and reading/math specialists. This group typically makes decisions regarding benchmark tests. If this study shows that STAR benchmark tests as implemented are not reliable, or not consistent with the reliability reported by Renaissance Learning (the parent company of STAR testing), they will be interested to learn of this disparity. They also might be interested in the conditions under which students perform better. Most retests are taken in a reading or math lab, but original tests are typically taken as a whole class in a computer lab. If the majority of students score higher on the retest, a causal conclusion can be drawn that students do better in a smaller setting with the factors that go along with such settings, such as fewer distractions and less pressure to rush and guess in order to finish alongside their classmates. Students will also benefit from this study, as they will be assured their full abilities are being measured, and if they aren’t, a change can be made in testing to ensure that they are.

This quantitative study will be correlational in nature, since it will investigate the relationship between two variables, namely students’ abilities and the percentile scores on multiple tests, to measure reliability. The following research questions will be addressed through this study:

  • Is the STAR a reliable measure of student abilities in reading and math in 6th and 7th grade?
  • Do student scores on the STAR in reading and math, when retaken in the same testing period, remain in the range of scores (-/0/+) predicted by the first test?
  • Does the reliability of the test when proctored in a school setting match the reliability provided by the testing company?
  • What factors might cause students’ scores to change beyond the range of scores provided by the first test?
  • How can these variables be controlled so that scores reflect students’ actual abilities?
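As a rough illustration of how the second question might be checked, the sketch below labels each retest as below (-), within (0), or above (+) the score range reported with the first test, and computes the share of retests that stay within range. Reading the (-/0/+) notation this way is an assumption, and the function names and sample records are hypothetical.

```python
def retest_direction(range_low, range_high, retest_score):
    """Label a retest as below ("-"), within ("0"), or above ("+") the first test's range."""
    if retest_score < range_low:
        return "-"
    if retest_score > range_high:
        return "+"
    return "0"


def share_within_range(records):
    """Fraction of (range_low, range_high, retest_score) records labeled "0"."""
    labels = [retest_direction(low, high, score) for low, high, score in records]
    if not labels:
        raise ValueError("no records to summarize")
    return labels.count("0") / len(labels)


# Hypothetical records: (low end of range, high end of range, retest score).
records = [(30, 40, 35), (30, 40, 45), (10, 20, 5), (50, 60, 60)]
print([retest_direction(*r) for r in records])  # ['0', '+', '-', '0']
print(share_within_range(records))              # 0.5
```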



Brown, R. S., & Coughlin, E. (2007). The predictive validity of selected benchmark assessments used in the Mid-Atlantic Region (Issues & Answers Report, REL 2007–No. 017). Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory Mid-Atlantic. Retrieved from

Daniel, H., & Wheeler, B. (2006). The uses of benchmark tests to improve student learning. ThinkLink Learning/Discovery Education. Retrieved from

Filbin, J. (2008). Lessons from the initial peer review of alternate assessments based on modified achievement standards. Washington, DC: U.S. Department of Education, Office of Elementary and Secondary Education, Student Achievement and School Accountability.

Herman, J. L., & Baker, E. L. (2005). Making benchmark testing work. Educational Leadership, 63(3), 48–54.

Northwest Evaluation Association. (2004). Reliability and validity estimates: NWEA achievement level tests and measures of academic progress. Lake Oswego, OR: Author.

Rodriguez, M. C. (2005). Three options are optimal for multiple-choice items: A meta-analysis of 80 years of research. Educational Measurement: Issues and Practice, 24(2), 3–13.
