Review of Literature
Brown and Coughlin (2007) describe the process of creating a valid and reliable assessment. Although the report's main focus is on ensuring tests are valid, it includes a lengthy section on establishing the reliability of benchmark tests. The authors examined four benchmark tests used within the Mid-Atlantic region: Northwest Evaluation Association’s Measures of Academic Progress (MAP), Renaissance Learning’s STAR Math and Reading assessments, Scholastic’s Study Island Math and Reading assessments, and CTB/McGraw-Hill’s TerraNova Math and Reading assessments. They analyzed the validity of these tests, that is, whether the content being measured matched the curriculum and standards taught in the individual states. The STAR assessments were rated on a 3-point scale, with 3 being true, 2 being somewhat true, and 1 being not true. On the item “Is the assessment score precise enough to use the assessment as a basis for decisions concerning individual students?”, the STAR assessments were rated 3, indicating that the test is precise enough. The researchers commented, “Adaptive test score standard errors are sufficiently small to use as a predictive measure of future performance” (Brown & Coughlin, 2007, p. 9). However, on the question “Is the overall predictive accuracy of the assessment adequate?”, the researchers gave a score of 1 and commented, “Criterion relationships vary across grade and outcome, but there is evidence that in some circumstances the coefficients are quite large. The average coefficients (mid-.60s) are modest for Math and higher for Reading (.70–.90). However, these are coefficients of concurrent validity, not predictive validity” (2007, p. 9). There appear to be very few flaws in this study.
This study shows it is important to examine the reliability of the STAR Math and Reading tests in addition to their validity. As with the validity of these tests, their reliability affects a large testing area and a large number of schools. The study being implemented will seek to determine whether the implementation of these tests in real school situations is reliable. If the observed reliability does not match that put forth by the testing company, the schools using these tests need to be aware of the discrepancy and of how they can improve their implementation of the test so that they obtain the same reliability as under optimum testing conditions.
In the study conducted by Christ (2007), “a hybrid model that combines the quasi-simplex structure with a split-halves of scale scores is used to test the competing formulations of the [quasi-simplex model]” for five scales within the National Survey of Child and Adolescent Well-Being (p. 1). One of the five scales, the Woodcock-McGrew-Werder Mini-Battery of Achievement (MBA), is most similar to the STAR Reading and Math assessments, as both measure “skills and knowledge in reading and math” in students age six and older (Christ, 2007, p. 13). Using the Wald Chi-Square Tests of QSM Assumptions, MBA Reading had a Constant Reliability score of 116.356 and MBA Math a score of 26.668. Using the Reliability Estimates and Standard Errors from the Hybrid Model, Constant Reliability was consistently 0.887 for MBA Reading and consistently 0.918 for MBA Math. Flaws in this study include the absence of a section on the implications and limitations of the research; nor does it include any conclusions, ending instead after the results are presented. In addition, although it makes in-depth references to other studies at the beginning, there is no formal literature review.
Overall, the parts of a quantitative study are poorly designated or missing within the paper. Although this study uses a hybrid model rather than a pure split-half analysis, it provides substantial background and is similar to the planned analysis of the STAR assessments. It explains split-halves reliability in detail, including the Spearman-Brown split-half reliability coefficient equation. Just as Christ’s study analyzed the reliability of tests used to make judgments about children under the care or watch of the state in order to determine the best conditions for them, an analysis of STAR reliability will determine whether these tests give reliable results, thereby designating the proper students for various levels of intervention and class placement.
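For reference, the Spearman-Brown split-half reliability coefficient mentioned above estimates the reliability of the full-length test from the correlation between its two halves:

r_full = (2 × r_half) / (1 + r_half)

where r_half is the Pearson correlation between students’ scores on the two halves (for example, the odd-numbered and even-numbered items). For instance, a half-test correlation of 0.80 yields an estimated full-test reliability of (2 × 0.80) / 1.80 ≈ 0.89.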
Research Questions and Hypotheses
When students (IV) in 6th and 7th grade take the benchmark test in one testing period (fall, for example), a split-test analysis will show the reliability (DV) of the STAR Reading and Math tests. This quantitative study will be correlational in nature, since it investigates the relationship between two variables, namely students’ abilities and the consistency of the percentile scores when a split-test analysis is applied. The following research questions will be addressed through this study:
- Is the STAR a reliable measure of student abilities in reading and math in 6th and 7th grade?
- Do student scores on the STAR math match the split test reliability provided by STAR of 0.836 and 0.857, respectively, for grades 6 and 7?
- Do student scores on the STAR reading match the split test reliability provided by STAR of 0.89 and 0.90, respectively, for grades 6 and 7?
- Does the reliability of the test when proctored in a school setting match the reliability provided by the testing company?
- Are there environmental variables that influence split test reliability to not be consistent?
- How can these variables be controlled so that scores reflect students’ actual abilities and the test becomes more reliable?
Design and Methodology
This quantitative study will follow a correlational research design, as it measures the degree of association between two split halves (half the items each) of a test. The study will follow the explanatory design, since it seeks to find the “extent to which two variables co-vary,” the two variables being students’ abilities and their percentile scores (Creswell, 2012, p. 340). Other characteristics of this study that match an explanatory design include collecting data at one point in time, namely the test results from one testing period, and drawing conclusions from the statistical test results, namely the reliability of the test.
The population for this study consists of 6th and 7th graders at a suburban middle school. The sampling procedure will be a random sample of 50 6th graders and 50 7th graders in order to obtain an overall view of the tests’ reliability. This study is looking for general reliability rather than reliability connected to a certain gender, ethnic, or socioeconomic group; therefore the students chosen will span a range of ethnicities and socioeconomic groups and will include students receiving special education services. The confidentiality of the data will be maintained by assigning each student a number from 1 to 50 along with grade and gender. For example, a sixth-grade girl might be coded as 6G2.
Instruments used in this study are the STAR Reading and STAR Math assessments and Mplus, a “latent variable modeling program with a wide variety of analysis capabilities” (Mplus, 2013). The STAR benchmark tests are based on the Common Core State Standards implemented in the school; therefore validity is strong. Reliability, however, is in question, which is why this study is being conducted. Mplus software has been used effectively in other studies. The only likely source of a malfunction with Mplus would be incorrect data entry or equation use; Mplus, however, offers tutorials and training on its products, which will decrease this risk.
Data collection will begin after sixth and seventh grade students take the STAR Reading and Math assessments in Fall 2013. Copies of the tests for the 100 students who have been randomly selected (and approved by parents and administration) will be obtained from STAR. Data (test questions and scores) will be entered into a database, where a split-test reliability analysis will be applied: each test will be divided between even- and odd-numbered questions and then rescored. Using Mplus software, this data will then be analyzed. Depending on how closely the reliability scores on the two halves of the test agree, the reliability of the tests as implemented will be determined.
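To make the odd/even split-and-rescore step concrete, the following Python sketch computes a split-half reliability with the Spearman-Brown correction from a hypothetical item-level score matrix (1 = correct, 0 = incorrect). The function names and the sample matrix are illustrative assumptions, not part of the study’s actual Mplus workflow.

```python
def pearson_r(xs, ys):
    """Pearson correlation between two lists of half-test scores."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def split_half_reliability(item_matrix):
    """Odd/even split-half reliability with the Spearman-Brown correction.

    item_matrix: one row per student, one column per test item (0/1 scores).
    """
    odd = [sum(row[0::2]) for row in item_matrix]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_matrix]  # items 2, 4, 6, ...
    r_half = pearson_r(odd, even)
    # Spearman-Brown: estimate full-length reliability from the half-test r.
    return 2 * r_half / (1 + r_half)
```

In the planned study the rows would be the 100 sampled students and the columns the STAR items from one testing period; the resulting coefficient would then be compared against the publisher’s reported values (e.g., 0.836 and 0.857 for STAR Math in grades 6 and 7).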
Brown, R. S., & Coughlin, E. (2007). The predictive validity of selected benchmark assessments used in the Mid-Atlantic Region (Issues & Answers Report, REL 2007–No. 017). Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory Mid-Atlantic. Retrieved from http://ies.ed.gov/ncee/edlabs
Christ, S. (2007). Reliability Estimation and Testing Using Structural Equation Models. Conference Papers — American Sociological Association, 1.
Creswell, J. W. (2012). Educational research: Planning, conducting, and evaluating quantitative and qualitative research (4th ed.). Upper Saddle River, NJ: Pearson Education.
Mplus. (2013). Mplus at a Glance. Retrieved from http://www.statmodel.com/glance.shtml.