spring 2007

High stakes

While high-stakes testing is a valuable tool for analyzing student success, it’s important that the results be correctly interpreted.

Gregory C. Hill, Ph.D.
Boise State University

As a public management scholar, I am in constant conflict over the effects of accountability measures on the performance of organizations in the public sector. Public schools provide a valuable testing ground for an empiricist such as me; there is a trough of quantifiable data available to address many performance-related questions, and an equally full slate of scholars trying to feed from that trough.

Critics say the danger of nationwide mandatory testing is that the test administrators will misinterpret the results. Courtesy 1010 News.

Prominently leading the charge as the dependent variable for outcomes are test scores. We often justify the use of test scores as being “salient” in the community, uniformly distributed to a large number of school districts, and generally accepted by political principals with something to gain or lose from the results. But is high-stakes testing measuring what we hope it’s measuring? Are we, in fact, holding public schools accountable to taxpayers, political principals, parents, future employers, or any others who have an interest in the education of Idaho children?

          In this analysis, I present the argument that high-stakes testing is not only a valuable and essential means of preserving democratic theory, in terms of holding public servants accountable, but also a reasonable and acceptable mechanism for assessing the outcomes of these hallowed public organizations. The caveat, however, is that while high-stakes testing is valuable and integral to the success of the children, we must be sure the outputs mean what we say they mean. In other words, do high-stakes tests as they are currently administered tell us if schools are failing our children, as the No Child Left Behind legislation envisions them doing, or do they measure something else? I argue that we are not measuring a school’s ability to maximize the potential in an individual child, but are instead measuring the ability of a school to reach an aggregate minimum standard: two fundamentally different outcomes. I present for debate an alternative model for testing school-age children. The model builds on what exists, adds a time-series quality, and tests not only for minimum achievement but also for maximum potential through a combination of aptitude and achievement testing.
          An oft-cited quip of guidance counselors and career prognosticators is that a bachelor’s degree is roughly the equivalent of a high school diploma a generation ago. If this is indeed true, and it may very well be, the ramifications for primary and secondary education in the United States today are important. The statement itself implies, however, that the equalization of intelligence among today’s youth is also paramount in order to provide them with the minimum education needed to gain access to a university. The logic is simple: the educational system, once designed to provide a basic knowledge of marketable skills, must now provide not only those basic skills but also prepare students to successfully earn a bachelor’s degree. This will then provide for students what a high school diploma could have provided them just a generation ago. Surely then, some way to ensure that schools are maximizing the taxpayer’s investment is germane.
          As a teacher, I find something morally disconcerting about the notion of aptitude. I would like to believe that, given the information I provide, all students who enter my classroom will be able to master the subject. However, I need to accept that some students’ best work is ‘B’ work, or in other words, some students are more apt to excel at what I teach than others. In fact, on the face of it, aptitude appears to be rather undemocratic. Testing processes currently employed, which aggregate students into a collective whole, diminish the very concept of democracy, or individual influence in our society. The analysis proposed here is focused on disaggregating the students. Disaggregation to the individual level is more, not less, concerned with individualism and the democratic ideals under which we operate. Looking at the individual’s aptitude for the individual’s sake does not diminish democratic notions of fairness and equity but endorses them. To be sure, measuring the ability of individuals against themselves reduces the argument to the very basic unit of analysis: you. What could be more democratic than that? But before we get too far ahead of ourselves, let us look at the recent history of high-stakes testing and how it has been applied in Idaho.

High-stakes testing in Idaho

          Testing and education have always coexisted, in some form or another. Socrates examined his students by asking more questions. Horace Mann was calling for standardized testing as early as 1845. Our parents often opine about the merits of learning the three Rs and participating in spelling bees. Surely, testing has always been integral to our educational culture. Traditionally, education has been left to the states to administer. Indeed, some of the most fiercely fought federalism battles have been about who has the right to educate. States often assert that due to the unique nature and circumstances of the region, they should be responsible for determining educational achievement. Conversely, the federal government argues that with money comes accountability, and if it is going to appropriate funds to the states, it should have some say in the matter. In January 2002, President George W. Bush signed legislation that, to some degree, centralized the testing of primary and secondary students at the federal level. The legislation reads, in part, “The NCLB Act will strengthen Title I accountability by requiring States to implement statewide accountability systems covering all public schools and students. These systems must be based on challenging State standards in reading and mathematics, annual testing for all students in grades 3-8, and annual statewide progress objectives ensuring that all groups of students reach proficiency within 12 years. Assessment results and State progress objectives must be broken out by poverty, race, ethnicity, disability, and limited English proficiency to ensure that no group is left behind. School districts and schools that fail to make adequate yearly progress (AYP) toward statewide proficiency goals will, over time, be subject to improvement, corrective action, and restructuring measures aimed at getting them back on course to meet State standards.
Schools that meet or exceed AYP objectives or close achievement gaps will be eligible for State Academic Achievement Awards.”

Aspects of high-stakes testing

          Basically, high-stakes testing in the United States can be defined as testing that is tied to money. Schools must meet a minimum established pass rate in order to avoid penalties, and to continue to receive federal money at the state level. Many of the actors have a stake in the outcome. If we view them through an incentive structure, it simplifies the layout. First, students have an incentive to pass the exams so they can both graduate from high school and attend a university. Second, the pass rate is reflected in teachers’ reputations and, potentially, is tied to their compensation. Third, the school principal has an incentive to see students achieve pass rates because he or she must report to the district superintendent, who has control over the principal’s employment. Fourth, the superintendent has a reputation to protect and again, like the principal, has employment concerns. Fifth, the State Board of Education has an incentive to see schools meet their achievement goals because it is accountable to both the voting public and the federal legislation. Thus, its members can lose their funding and elected positions all in one fell swoop. We can see, then, that testing is not something to be taken for granted.

Poor performance on standardized tests can devastate a school district. Courtesy Local School Directory

          States are given the liberty to administer whatever test they feel is appropriate. In Idaho, we administer the Idaho Standard Achievement Test, or ISAT. The exam is administered in grades 3-8 and once in high school, per NCLB requirements. Idaho also administers a nationwide examination, the National Assessment of Educational Progress (NAEP), to fourth- and eighth-graders. Thus, there are two objective performance indicators to measure the success of Idaho schools in educating children. The results are confounding. Looking at the data we notice, for example, that when comparing the 2005 ISAT scores for fourth-grade reading with the NAEP scores for the same grade, 87 percent of ISAT-takers were proficient compared to only 33 percent of the NAEP-takers. If we compare the two results we are left with conflicting feelings on the value of high-stakes testing. A similar pattern emerges with math performance. Irrespective of whether the examination is given at the state or national level, the exams do not necessarily test the ability of teachers to maximize students’ potential. What they do examine is the ability of students to meet some agreed-upon minimum standard, which by definition leads to a bell curve. The leptokurtic nature of the curve notwithstanding, the curve will always exist. If we are indeed to believe that “No child left behind” means what it implies, why are we not assessing children on their ability to meet their actual potential?

Meeting student potential

          It is true that I am no chemist. I struggle with concepts of chemistry, and in fact, was so perplexed by organic chemistry that I changed career paths. No, not the simple concepts of ionic and covalent bonding, but more complex processes that I dare not even attempt to recall. Even though I had taken chemistry courses in both high school and college, it was clear that I was not apt to do well in this subject. Were my consistent less-than-A grades in chemistry a function of my laziness? The many hours in libraries, labs, and study groups would suggest otherwise. It is possible that I had poor instruction and the system failed me in the acquisition of chemistry knowledge. It is just as possible that I simply am not hard-wired to understand the concepts of chemistry. Thus, reaching my potential in chemistry may in fact be earning a B- and recognizing that it’s as good as it gets. How does this relatively embarrassing example of my undergraduate failures illustrate my point? Simply: when we examine students on standardized achievement tests as if they all have at least a minimum level of aptitude in a certain subject, we are not really testing for adequate progress. A method for dealing with this inequity in testing has two parts: time-series testing and a comparison between aptitude and achievement. It seems clear that combining a test of aptitude with a test of achievement will give educators and politicians a clearer understanding of the actual learning going on in the classroom. Furthermore, disaggregating from the district level to the individual level and reviewing performance over time will also clear the murky waters of high-stakes examinations. I’ll explicate the logic in reverse order, starting with individuals over time.

Time series testing

          One way to identify educational achievement is to determine if, over a period of time, students are getting smarter. In fact, this is one of the fundamental characteristics of NCLB. In its own words, NCLB expounds, “Schools that do not make progress must provide supplemental services, such as free tutoring or after-school assistance; take corrective actions; and, if still not making adequate yearly progress after five years, make dramatic changes to the way the school is run.” With this time series emphasis in mind, let us look at a hypothetical school district in an urban setting in Idaho in 2000. The first time the students are examined the aggregate district score is below the acceptable NCLB requirements, say at 30 percent proficiency in math. The schools are then encouraged to make some changes, provide tutoring, whatever they can do to raise their aggregate score. Each year thereafter the district shows modest improvement, but does not cross the adequate performance threshold of, say, 65 percent. After five years (2005), according to NCLB, “dramatic changes” will impact the way the school is run.
          What if, however, a hypothetical cohort of students in the school is scoring 30 percent in 2000? If that same cohort scores 60 percent on the 2002 exam, isn’t this maximizing potential, even if the minimum achievement standard is 65 percent? A change from 30 percent to 60 percent is the same 30-point gain as a change from 60 percent to 90 percent, except the school, and ultimately the students and teachers, do not get credit for it. A move from 30 percent to 60 percent is still considered failing, while a move from 60 percent to 90 percent is considered acceptable educational achievement. This simple hypothetical example leads to the next aspect of a new testing model, namely, testing the students against themselves.
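The contrast in this hypothetical can be made concrete with a minimal sketch. It uses only the illustrative figures from the example above (a 65 percent cutoff, and cohorts moving from 30 to 60 and from 60 to 90 percent); the function and variable names are invented for illustration and do not correspond to any actual NCLB or ISAT scoring procedure.

```python
# Hypothetical illustration: reading the same scores by threshold and by growth.
# The cutoff and cohort figures are the illustrative numbers from the text,
# not real ISAT or NAEP data.
PROFICIENCY_THRESHOLD = 65  # percent


def meets_threshold(score, threshold=PROFICIENCY_THRESHOLD):
    """Return True if a proficiency rate reaches the minimum standard."""
    return score >= threshold


def point_gain(start, end):
    """Return the change, in percentage points, between two test years."""
    return end - start


# (first score, later score) for two hypothetical cohorts
cohorts = {
    "cohort_a": (30, 60),
    "cohort_b": (60, 90),
}

summary = {
    name: {"gain": point_gain(start, end), "passes": meets_threshold(end)}
    for name, (start, end) in cohorts.items()
}

for name, row in summary.items():
    print(name, row)
```

Both cohorts show the same gain, yet only the second clears the cutoff, which is the distinction a threshold-only reading hides.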

Aptitude and achievement

          Again, if the argument is to maximize the potential in individual students, then high-stakes testing is an appropriate method of evaluation. The testing, however, ought to be exhaustive enough to provide a clear picture of the student’s potential as well as the student’s achievement. Thus, coupling the time series analysis with an aptitude-achievement analysis would seem an appropriate measure of individual potential actualization. What do I mean by examining student potential?
          To return to my chemistry example, no matter how hard I studied chemistry, at some point I came to realize that I did not have the intellectual capacity to balance complicated chemical equations. So was my professor failing me because he couldn’t teach me these techniques? Likewise, if a fourth-grade student cannot balance mathematical equations at 65 percent proficiency, has the teacher failed? Or does the student simply lack sufficient aptitude for mathematics? One way to determine a “failing” grade on a proficiency test is to first determine the aptitude of the student, and then compare it to the student’s achievement. An aptitude test provides information on what the student is capable of learning, while an achievement test provides information on what the student has learned from what is being taught. If achievement is in line with aptitude, then we can reasonably conclude that the student is performing to potential. A student with a high aptitude score but a low achievement score, after controlling for external factors, signals a breakdown in the educational process. Conversely, a student with a low aptitude score and high achievement points to some degree of success in that the individual has been motivated to succeed beyond his or her natural capacity. Tracking the student’s progress over time, as proposed in the previous section, allows for concrete, traceable measures of performance in individual students and of the abilities of teachers and administrators, as well as providing a clearer picture for the student on a potential career path (and maybe avoiding many painful semesters of organic chemistry). Thus, we have multiple testing mechanisms to determine if, truly, the intellectual capacity and potential of students are being maximized, which is how I interpret the phrase, “no child left behind.”
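As a rough sketch of the aptitude-achievement comparison described above, the snippet below sorts students into the three cases. Everything in it is an assumption made for illustration: the two scores are taken to sit on a common 0-100 scale, the 10-point tolerance is arbitrary, the student records are invented, and no control for external factors is attempted.

```python
# Hypothetical sketch: compare each student's achievement with his or her aptitude.
# Assumptions: both scores share a 0-100 scale, and TOLERANCE is an arbitrary
# band within which achievement counts as being in line with aptitude.
TOLERANCE = 10


def classify(aptitude, achievement, tolerance=TOLERANCE):
    """Label a student by the gap between achievement and aptitude."""
    gap = achievement - aptitude
    if gap < -tolerance:
        return "below potential"  # possible breakdown in the educational process
    if gap > tolerance:
        return "above expected potential"  # succeeding beyond measured aptitude
    return "in line with potential"


# Invented (aptitude, achievement) pairs, for illustration only
students = {
    "student_a": (85, 55),
    "student_b": (50, 75),
    "student_c": (60, 58),
}

results = {name: classify(apt, ach) for name, (apt, ach) in students.items()}

for name, label in results.items():
    print(name, label)
```

Repeating the comparison each year for the same student would supply the time-series element proposed in the previous section.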

 


Further Reading

Fernandez, Sergio. 2005. “Developing and Testing an Integrative Framework of Public Sector Leadership: Evidence from the Public Education Arena.” Journal of Public Administration Research and Theory, 15: 197-217.

“Four Pillars of No Child Left Behind” at http://www.ed.gov/nclb/overview/intro/4pillars.html.

Gonzalez-Juenke, Eric. 2005. Management tenure and network time: how experience affects bureaucratic dynamics. Journal of Public Administration Research and Theory, 15(1): 113-131.

Hicklin, Alisa. 2004. Network Stability: Opportunity or Obstacles? Public Organization Review, 4: 121-133.

Hill, Gregory C. 2005. Managerial succession and organizational performance. Journal of Public Administration Research and Theory, 15: 585-597.

“No Child Left Behind” at http://www.ed.gov/nclb/landing.jhtml.

O’Toole, Jr., Laurence J., and Kenneth J. Meier. 1999. Modeling the impact of public management: Implications of structural context. Journal of Public Administration Research and Theory, 9: 505-526.

O’Toole, Jr., Laurence J., and Kenneth J. Meier. 2000. Networks, hierarchies, and public management: Modeling the nonlinearities. In Governance and performance: New perspectives, ed. Carolyn J. Heinrich and Laurence E. Lynn, Jr. Washington, D.C.: Georgetown University Press.

O’Toole, Jr., Laurence J., and Kenneth J. Meier. 2002. Plus ça change: Public management, personnel stability, and organizational performance. Journal of Public Administration Research and Theory, 13: 43-64.