Reliability of Benchmark Testing at the Middle School Level
Benchmark tests are designed to measure students' abilities in order to identify students in need of interventions, and they are assumed to be reliable sources for this information. However, extreme variance in students' benchmark percentile scores on retakes raises questions about the reliability of these tests. How can a student take a test twice within the same 21-day period and receive a score so different from the first? The intent of this study is therefore to examine the reliability of STAR testing in reading and math at the middle school level. This will be achieved through a split-test analysis of one testing period, answering the research question: To what extent does the test administered to the students yield the same results from one half compared to the other? By analyzing students' results (IV) on 6th and 7th grade benchmark tests within a single testing period (fall, for example) using a split-test analysis, the reliability (DV) of the STAR reading and math tests will be shown.
Review of the Literature
Niemi, Wang, Wang, Vallone, and Griffin (2007) describe the process of creating a valid and reliable assessment. This report, although its main focus is on ensuring tests are valid, includes a lengthy section on finding and ensuring the reliability of benchmark tests. This study shows it is important to examine the reliability of the STAR math and reading tests in addition to their validity. As with the validity of these tests, their reliability affects a large testing area and a large number of schools. The study being implemented will seek to find whether the implementation of these tests in real school situations is reliable. If the reliability does not match that put forth by the testing company, the schools using these tests need to be aware of this discrepancy and of how they can improve their implementation of the test to ensure that they achieve the same reliability as under optimum testing conditions.
Brown and Coughlin (2007) looked at four benchmark tests used within the Mid-Atlantic region: Northwest Evaluation Association's Measures of Academic Progress (NWEA MAP), Renaissance Learning's STAR Math and Reading assessments, Scholastic's Study Island Math and Reading assessments, and CTB/McGraw-Hill's TerraNova Math and Reading assessments. They analyzed the validity of these tests, that is, whether the content being measured matched the curriculum and standards being taught in the individual states. Specifically, the STAR assessments were rated on a 3-point scale, with 3 being true, 2 being somewhat true, and 1 being not true. In terms of the items being measured, "Is the assessment score precise enough to use the assessment as a basis for decisions concerning individual students?" was rated as 3, stating that the test is precise enough. Comments from the researchers included "Adaptive test score standard errors are sufficiently small to use as a predictive measure of future performance" (Brown & Coughlin, 2007, p. 9). However, to the question "Is the overall predictive accuracy of the assessment adequate?", the researchers gave a score of 1 and commented "Criterion relationships vary across grade and outcome, but there is evidence that in some circumstances the coefficients are quite large. The average coefficients (mid-.60s) are modest for Math and higher for Reading (.70–.90). However, these are coefficients of concurrent validity, not predictive validity" (2007, p. 9). There appear to be very few flaws in this study.
The study conducted by Christ (2007) uses "a hybrid model that combines the quasi-simplex structure with a split-halves of scale scores is used to test the competing formulations of the [quasi simplex model]" for five scales within the National Survey of Child and Adolescent Well-Being (p. 1). One of the five scales, the Woodcock-McGrew-Werder Mini-Battery of Achievement (MBA), is most similar to the STAR Reading and Math assessments, as both measure "skills and knowledge in reading and math" in students age six and older (Christ, 2007, p. 13). Using the Wald Chi-Square Tests of QSM Assumptions, MBA Reading had a score for Constant Reliability of 116.356 and MBA Math had a score for Constant Reliability of 26.668. Using the Reliability Estimates and Standard Errors from the Hybrid Model, Constant Reliability for MBA Reading was consistently 0.887, and for MBA Math was consistently 0.918. Flaws in this study include not having a section on the implications and limitations of the research, nor does it include any conclusions; it ends after the results are presented. In addition, although it makes in-depth references to other studies at the beginning, there is no formal literature review. Overall, the parts of a quantitative study are poorly designated or missing within the paper. Although this study uses a hybrid model and not a pure split-half analysis, it provides substantial background and is similar to the planned analysis of the STAR assessments. It explains split-halves reliability in detail, including the Spearman-Brown split-half reliability coefficient equation. Like Christ's study, which analyzed the reliability of tests that place judgment on children in the care or watch of the study to determine the best conditions for them, an analysis of STAR reliability will see whether these tests are giving results that are reliable, thereby designating the proper students for various levels of intervention and class placement.
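The Spearman-Brown split-half reliability coefficient that Christ (2007) refers to takes a standard form (stated here in its textbook version, not quoted from the study itself):

```latex
% Spearman-Brown correction for split-half reliability.
% r_{xy}: Pearson correlation between students' scores on the two half-tests.
% r_{SB}: estimated reliability of the full-length test.
r_{SB} = \frac{2\, r_{xy}}{1 + r_{xy}}
```

The correction is needed because correlating two half-tests understates the reliability of the full test; halving a test shortens it, and shorter tests are less reliable.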
Research Questions and Hypotheses
This investigation will examine whether, when students (IV) in 6th and 7th grade take the benchmark test in one testing period (fall, for example), a split-test analysis will demonstrate the reliability (DV) of the STAR reading and math tests. The following research questions will be addressed with descriptive and inferential statistics through this study:
Research Question 1: Is the STAR a reliable measure of student abilities in reading and math in 6th and 7th grade?
Research Question 2: Do student scores on the STAR math match the split-test reliability provided by STAR of 0.836 and 0.857, respectively, for grades 6 and 7?
Research Question 3: Do student scores on the STAR reading match the split-test reliability provided by STAR of 0.89 and 0.90, respectively, for grades 6 and 7?
Research Question 4: Does the reliability of the test when proctored in a school setting match the reliability provided by the testing company?
Research Question 5: Are there environmental variables that cause split-test reliability to be inconsistent?
Research Question 6: How can these variables be controlled to produce scores that reflect students' actual abilities and make the test more reliable?
Significance of the Proposed Study
Students typically take benchmark tests in reading and math three times a year: fall, winter, and spring. One way these scores are used is for placement in Scientific Research-Based Intervention (SRBI), more commonly known as Response to Intervention (RTI). The tier parameters for reading are: students above the 40th percentile are considered Tier I and continue to receive regular classroom instruction; students between the 25th and 40th percentiles need Tier II interventions; and students below the 25th percentile receive Tier III interventions. For math, students who score above the 40th percentile are in Tier I; students between the 16th and 40th percentiles receive Tier II interventions; and students below the 16th percentile receive Tier III interventions. Tier II reading interventions are pull-out: students typically leave their applied academics classes (specials, such as P.E. and Music) or workshop twice a week to meet with an interventionist in a small group. Tier III reading interventions are typically more intense, with students serviced two to five times per week, usually during workshop or during an applied academics period, receiving interventions in place of attending a special. Tier II mathematics interventions occur during workshop (a study period at the end of the day), with the student's regular math teacher providing extra help with math homework and targeting specific skills. Students receiving Tier III mathematics interventions are pulled out of their applied academics classes and workshop twice a week to meet with a one-on-one interventionist.
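The tier-placement rules above amount to a simple percentile lookup. As a minimal sketch (the function names are illustrative, and the handling of scores exactly at a cutoff is an assumption, since the proposal states ranges but not which tier an exact-boundary score falls into):

```python
def reading_tier(percentile):
    """Map a STAR reading percentile to an SRBI/RTI tier.

    Assumption: a score exactly at a cutoff falls into the lower tier's
    range; the source document does not specify boundary handling.
    """
    if percentile > 40:
        return 1  # Tier I: regular classroom instruction
    if percentile >= 25:
        return 2  # Tier II: small-group pull-out twice a week
    return 3      # Tier III: intensive, 2-5 times per week


def math_tier(percentile):
    """Map a STAR math percentile to an SRBI/RTI tier (same caveat)."""
    if percentile > 40:
        return 1  # Tier I: regular classroom instruction
    if percentile >= 16:
        return 2  # Tier II: extra help during workshop
    return 3      # Tier III: one-on-one pull-out
```

For example, a student at the 30th percentile in reading would fall in Tier II, while the same percentile in math would also be Tier II because the math Tier II band reaches down to the 16th percentile.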
Therefore, if a student’s score on the benchmark test is above a certain tier’s cutoff, they will not receive services for that tier. If the test results initially show results lower than the student’s actual abilities, the services they receive will be inappropriate. If a student scores in Tier III but is really capable of Tier II work, they will be placed in a much more intensive intervention than needed. This wastes school resources and instructional time: while the student is receiving unneeded interventions, the student could otherwise be in class, and interventionists could be working more intensively with a student who has a greater need instead of spending time and energy on unnecessary instruction. Also, from personal experience, students who end up scoring much higher on their winter benchmark are likely to be the most resistant to receiving services in the fall because the intervention material is pitched too low for them.
Methods and Procedures
This quantitative study will follow a correlational research design, as it will measure the degree of association between two variables: the two split halves (half the items each) of a test. The study will follow the explanatory design, as it seeks to find the “extent to which two variables co-vary,” the two variables being students’ abilities and their percentile scores (Creswell, 2012, p. 340). Other characteristics of this study that match an explanatory design include collecting data at one point in time, namely the test results from one testing period, and drawing conclusions from the statistical test results, namely the reliability of the test.
Sample and Data Collection
The population for this study consists of 6th and 7th graders at a suburban middle school in Connecticut. The sampling procedure will be a random sample of 50 6th graders and 50 7th graders in order to obtain an overall view of the tests’ reliability. Data will be gathered via a computer-based test. This study is looking for general reliability, not reliability connected to a particular gender, ethnic group, or socioeconomic group; therefore the students chosen will represent a range of ethnicities and socioeconomic groups and will include students receiving special education services. The confidentiality of the data will be maintained by assigning each student a number from 1 to 50 along with grade and gender. For example, a sixth grade girl might be coded as 6G2.
Instruments used in this study are the STAR Reading and STAR Math assessments and Mplus, a “latent variable modeling program with a wide variety of analysis capabilities” (Mplus, 2013). The STAR benchmark tests are based on the Common Core State Standards implemented in the school; therefore validity is strong. Reliability, however, is in question, which is why this study is being conducted. Mplus software has been used effectively in other studies. The only likely source of a malfunction with Mplus would be incorrect data entry or equation use; Mplus, however, offers tutorials and training on its products, which will decrease this risk.
Data collection will begin after sixth and seventh grade students take the STAR Reading and Math assessments in Fall 2013. Copies of the tests for the 100 students who have been randomly selected (and approved by parents and administration) will be obtained from STAR. Data (test questions and scores) will be entered into a database, where a split-test reliability analysis will then be applied. The test will be divided between even and odd questions and then rescored. This data will then be analyzed using Mplus software. Depending on the closeness of the reliability scores on the two halves of the test, the reliability of the tests as implemented will be determined.
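The even/odd split and comparison described above can be sketched in a few lines. This is a minimal pure-Python illustration with invented scores; the actual analysis will be run in Mplus, and the Pearson-correlation-plus-Spearman-Brown approach shown here is one standard way to compute split-half reliability, not necessarily the exact model Mplus will fit:

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def split_half_reliability(item_scores):
    """Estimate full-test reliability from an odd/even item split.

    item_scores: one list of per-item scores for each student.
    Returns the Spearman-Brown corrected coefficient.
    """
    odd_totals = [sum(items[0::2]) for items in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(items[1::2]) for items in item_scores]  # items 2, 4, 6, ...
    r = pearson(odd_totals, even_totals)
    return 2 * r / (1 + r)  # Spearman-Brown correction

# Invented example: 5 students, 6 items each (1 = correct, 0 = incorrect)
scores = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]
print(split_half_reliability(scores))
```

A coefficient near the values STAR publishes (0.836–0.90 for these grades) would support the company's reliability claims; a markedly lower coefficient would suggest implementation conditions are degrading reliability.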
Limitations of the Study
This study has some limitations. First, it is difficult to know the exact conditions during testing; therefore outside factors, such as disruptive noise, disruptive students, and individual student conditions (overtired, hungry), are unknown. These factors could affect the students’ scores and thereby reflect poorly on the reliability. Generally, however, such factors would lower scores across the whole test, not just one part. If a distraction occurred halfway through the test, though, the test could be affected unevenly without the researcher’s knowledge. Second, the researcher works at the school in which the research is being conducted. Therefore, there may be a risk of skewing the data by selecting students less randomly than perceived in hopes of finding certain results.
A third limitation is the small scale of the study and its short time period. The study would be more fair and balanced if it looked at the reliability of all students’ tests, or if it were expanded to include more grades or schools. This study only looks at middle school students in a suburban town. By expanding the study to an urban school and a rural school, for example, one would be able to get a larger picture of how reliability plays out in multiple settings. Also, the study only looks at fall data. It is possible students are less settled into school when they take the test in the first two weeks than they are when they take it in January and May. If this study looked at the split-test reliability of all three test periods, a more accurate picture of the test’s reliability would develop. Therefore, it might be hard to generalize the results beyond this specific school, grade level, and even testing period.
References
Brown, R. S., & Coughlin, E. (2007). The predictive validity of selected benchmark assessments used in the Mid-Atlantic Region (Issues & Answers Report, REL 2007–No. 017). Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory Mid-Atlantic. Retrieved from http://ies.ed.gov/ncee/edlabs
Christ, S. (2007). Reliability Estimation and Testing Using Structural Equation Models. Conference Papers — American Sociological Association, 1.
Creswell, J. W. (2012). Educational research: Planning, conducting, and evaluating quantitative and qualitative research (4th ed.). Upper Saddle River, NJ: Pearson Education.
Daniel, H., & Wheeler, B. (2006). The uses of benchmark tests to improve student learning. ThinkLink Learning/Discovery Education. Retrieved from http://www.leadered.com/06Symposium/pdf/USES%20OF%20BENCHMARK%20TESTS.pdf
Filbin, J. (2008). Lessons from the initial peer review of alternate assessments based on modified achievement standards. Washington, DC: U.S. Department of Education, Office of Elementary and Secondary Education—Student Achievement and School Accountability.
Herman, J. L., & Baker, H. L. (2005). Making benchmark testing work. Educational Leadership, 63(3), 48–54.
Mplus. (2013). Mplus at a Glance. Retrieved from http://www.statmodel.com/glance.shtml.
Niemi, D., Wang, J., Wang, H., Vallone, J., & Griffin, N. (2007). Recommendations for building a valid benchmark assessment system: Second report to the Jackson Public Schools (CRESST Report 724). National Center for Research on Evaluation, Standards, and Student Testing (CRESST).
Renaissance Learning. (2013). Sample questions (Math). Retrieved from http://www.renlearn.com/sm/sample.aspx
Renaissance Learning. (2013). Sample questions (Reading). Retrieved from http://www.renlearn.com/sr/sample.aspx
Rodriguez, M. C. (2005). Three options are optimal for multiple-choice items: A meta-analysis of 80 years of research. Educational Measurement: Issues and Practice, 24(2), 3–13.
Informed Consent Letter (Pilot Study)
I am currently conducting research on the reliability of STAR benchmark testing at John Williams Middle School. STAR benchmark testing is used to evaluate your child’s progress three times a year. Your child has been invited to participate in a study designed to evaluate how consistent students’ results are on the test. This study will allow me to explore how reliable the testing is among students. Reliability is best explained as follows: if a student took a test multiple times, they would get around the same score each time. However, your child will only be taking the test once, along with the rest of the middle school. The study consists of analyzing the students’ responses to questions on the test after they have taken the fall test in September. The study will last up to four months, completing in December.
If you choose to allow your child to participate in this study, simply complete the attached permission form and return it to Katie Kehoegreen. Participation in this study is voluntary. If you complete the permission form, you will not be contacted again regarding the study. All student data will be anonymous; the researchers will not be able to link the data back to your child. By completing this permission form and returning it, you are consenting to allow your child’s anonymous results on the STAR test to be used for research purposes.
You will have three weeks to decide whether or not to have your child participate in this study. I will be happy to answer any questions you have about this study. If you have further questions about this project, or if you have questions concerning your child’s rights as a research subject, you may contact me at the phone number below or contact my thesis advisor, Dr. Clem Washington (860-465-5362; email: Washingtonc@easternct.edu).
I hope that you will take a few minutes to complete the attached permission form and return it within three weeks. Thank you for your consideration.
John Williams Middle School
Eastern Connecticut State University