2012 Presentations


Examination of unusual similarity by examinee pair characteristics
    Jeff Allen - ACT

Abstract: Detection of unusually high item response similarity usually conditions on the pair’s test performance (e.g., raw scores), but often relies on assumptions of invariance of other examinee pair characteristics. In this study, we will examine the appropriateness of such assumptions for selected examinee pair characteristics. We will examine whether the incidence of unusually high item response similarity varies according to whether the members of examinee pairs:

  • Attend the same high school
  • Took the same courses in high school
  • Are from the same racial/ethnic group
  • Are the same gender
  • Are siblings

The primary measure of item response similarity will be the number of identical incorrect item responses. For any given pair of examinees, degree of item response similarity will be expressed as p-values indicating the overall proportion of examinee pairs in benchmark samples that exceeded the given number of identical incorrect responses, conditioned on the examinee pairs’ set of raw scores (number of items answered correctly). The benchmark samples will consist of same high school / different test center examinee pairs who took the ACT in previous years (2004-2005 through 2010-2011).

Two criteria for unusually high item response similarity will be considered: p-values of .01 (one in one hundred) and smaller, and p-values of .001 (one in one thousand) and smaller. Descriptive statistics and logistic regression analyses will be used to relate examinee pair characteristics to the two criteria. Measures of central tendency (e.g., mean identical incorrect responses) and variation will also be examined across examinee pair characteristics.
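
The similarity measure and the benchmark p-value described above can be sketched as follows. This is a minimal illustration, not the study's implementation: the function names, the sample responses, and the benchmark counts are hypothetical, and the sketch assumes the benchmark list has already been restricted to pairs with the same raw-score combination. It also counts benchmark pairs that meet or exceed the observed count, which is an assumption about how ties are handled.

```python
from typing import Sequence


def identical_incorrect(resp_a: Sequence[str], resp_b: Sequence[str],
                        key: Sequence[str]) -> int:
    """Count items on which both examinees chose the same wrong option."""
    return sum(1 for a, b, k in zip(resp_a, resp_b, key) if a == b and a != k)


def empirical_p_value(observed: int, benchmark_counts: Sequence[int]) -> float:
    """Share of benchmark pairs with at least `observed` identical incorrect responses.

    Assumption: `benchmark_counts` holds only pairs whose raw scores match
    those of the pair being examined (the conditioning step is done upstream).
    """
    if not benchmark_counts:
        raise ValueError("benchmark sample is empty")
    hits = sum(1 for c in benchmark_counts if c >= observed)
    return hits / len(benchmark_counts)


# Hypothetical 10-item example; all values are made up for illustration.
key = list("ABCDABCDAB")
examinee_a = list("ABCDBBCAAC")
examinee_b = list("ABCABBCAAC")
benchmark = [0, 1, 0, 2, 1, 0, 3, 1, 0, 0]

count = identical_incorrect(examinee_a, examinee_b, key)
p_value = empirical_p_value(count, benchmark)
print(count, p_value)
```

A pair would then be compared against the .01 and .001 criteria by checking whether `p_value` falls at or below each threshold.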

The study will be based on the ACT tests of educational achievement and will be conducted separately for four subject area tests (English, mathematics, reading, and science). High school coursework, race/ethnicity, and gender will be based on students’ self-reported data. Implications for practices for detecting unusually high item response similarity will be discussed.


Macro level systems of statistical evidence indicative of cheating
    Michael Chajewski, Young-Koung Kim, Judit Antal, and Kevin Sweeney - College Board

As modes of test administration change and methods for preventing and identifying test fraud improve, new methods of committing test fraud are being invented. Cheating in the form of over-the-shoulder answer copying has become less a factor given the broader spectrum of undesirable testing behaviors. Technological improvements (small cameras, the internet, etc.) have increased the possibility of pre-knowledge across time zones. Improving screening methodology to augment existing procedures is a necessity to maintain integrity of test scores. Traditional investigations are often based on one or a few lines of investigation, derived from single theoretical bases, with the results viewed separately. Investigating possible fraud may benefit from the introduction of additional descriptive estimates of student test taking behaviors. Contemporarily, reliability and validity are conceptualized as systems of evidence, not individual indices. Test security screens should follow suit in constructing systems of evidence indicative of cheating.

The research reported in this presentation concerns a new screening methodology that allows for investigations of macro-level units such as classrooms, schools, testing centers, etc. Cohort and administration level screenings pose significant operational limitations due to sheer volume of information. The screening system in this research was developed to manage this practical consideration. The new method combines existing response string screening indices into macro-level profile indicators to estimate baseline rates of unusual response similarities. The method integrates Classical Test and Modern Test Theory approaches into a system of macro-unit screening, which will have utility beyond traditional answer copying analyses to various forms of collaborative effort, form/item pre-knowledge, and/or answer key disclosure. As new test forms are administered domestically or internationally, the macro-unit profiles could be compared against established baselines to identify abnormally high agreement within units. The new methodology would not be intended to replace test security investigations of single students or student pairs.


Nested factor analytic model comparison as a means to detect aberrant response patterns
     Mike Clark - Pearson

The person-fit methodology used in this study utilizes a novel application of the lco and M-lco person-fit statistics (Ferrando, 2007, 2009). In this study, test-takers’ responses will be fit to one- and two-factor FA models using Mplus, with the expectation that the two-factor model will have significantly better person-fit for individuals who had prior access to compromised items. Factor loadings from the two-factor model may be beneficial as indicators of possibly compromised items, as only these items should have strong loadings onto the secondary factor. Results of a moderately sized simulation study will be discussed, with hit rates and Type I and Type II error rates compared for the presently proposed lco difference method and the more traditional lz person-fit statistic (Drasgow, Levine, and Williams, 1985).
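
For readers unfamiliar with the comparison statistic, the standardized log-likelihood person-fit statistic lz for dichotomous items can be sketched as below. This illustrates only the traditional lz statistic, not the lco difference method proposed in the abstract. The function name is hypothetical, and the sketch assumes the model-implied probabilities of a correct response are already available from a fitted IRT model.

```python
import math
from typing import Sequence


def lz_statistic(responses: Sequence[int], probs: Sequence[float]) -> float:
    """Standardized log-likelihood person-fit statistic for 0/1 responses.

    `probs[i]` is the model-implied probability of a correct answer to item i
    (assumed to come from an already fitted IRT model; must lie in (0, 1)).
    """
    log_lik = expected = variance = 0.0
    for u, p in zip(responses, probs):
        q = 1.0 - p
        log_lik += u * math.log(p) + (1 - u) * math.log(q)
        expected += p * math.log(p) + q * math.log(q)
        variance += p * q * math.log(p / q) ** 2
    if variance == 0.0:
        raise ValueError("variance is zero (all probabilities equal 0.5)")
    return (log_lik - expected) / math.sqrt(variance)


# Hypothetical probabilities for five items, easiest first.
probs = [0.9, 0.8, 0.7, 0.3, 0.2]
consistent = [1, 1, 1, 0, 0]   # succeeds on easy items, fails on hard ones
aberrant = [0, 0, 0, 1, 1]     # the reverse pattern

print(lz_statistic(consistent, probs), lz_statistic(aberrant, probs))
```

Large negative values indicate response patterns that are unlikely under the model, which is why the aberrant pattern scores lower than the consistent one.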


Using multiple data forensic methods to detect potential testing improprieties
     Karla Egan and Jessalyn Smith - CTB

In 2001, Qualls wrote, “The days of ignoring allegations of cheating on standardized achievement tests are in the past,” (p. 9). Ten years later, this statement seems overly optimistic. Even though many states have mechanisms for reporting security breaches, the impetus for reporting incidents of cheating often relies on the districts and schools themselves. Since a report of cheating may invalidate test scores and cause a school to miss AYP, it is often not in a school or district’s self-interest to report itself to the state department of education. In states that rely on self-reporting, it is not unreasonable to think that breaches of test security are underreported. State education agencies should not rely on self-reports; instead, they should look for proactive ways to prevent breaches of test security and look to data forensics to help detect possible test-security breaches.

In this paper, we will explore three years of data from a large-scale testing program using multiple methods to detect potential testing improprieties, including analyses of erasures and patterns of item responses. Additionally, we will utilize the linear regression method that has been favored by reporters in their investigations of cheating (e.g., Gillum, USA Today, 3/6/2011; Walsh-Sarnecki & Pratt, Detroit Free Press, 3/6/2011). We will examine demographic differences in the schools detected by the three methods as well as patterns in security anomalies over time (e.g., have they increased as schools were in danger of missing AYP?). We will conclude the paper with recommendations for steps that states should take to improve their test security practices.


Patterns of erasure behavior for a large-scale assessment
     Chi-Yu Huang, Deborah Harris, Andrew Mroch, and Yang Lu - ACT

Erasure analysis of multiple-choice answer sheets is increasingly used to screen for test irregularities, including proctors who change students' answers after testing. Erasure behavior on each item is recorded: no erasure, changing answer from wrong to right, changing answer from wrong to wrong, or changing answer from right to wrong. Erasure analysis used in statewide assessments has typically focused on group-level analyses (e.g., classrooms). When a proctor/teacher changes a group of students’ answer sheets, the erasure behavior is likely to be consistent across the group. This type of pattern is one way that a group of students’ responses could be flagged based on erasure analyses.
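
The four item-level erasure categories can be sketched in code as follows. This is an illustrative sketch only: the function name, the category labels, and the record layout (erased mark, final mark, keyed answer) are assumptions, not a format taken from any scanning system.

```python
from collections import Counter
from typing import Optional


def classify_erasure(erased: Optional[str], final: str, key: str) -> str:
    """Classify one item by comparing the erased mark, the final mark, and the key.

    `erased` is None when no erasure was detected on the item.
    """
    if erased is None or erased == final:
        return "no_erasure"
    if final == key:
        return "wrong_to_right"
    if erased == key:
        return "right_to_wrong"
    return "wrong_to_wrong"


# Hypothetical answer sheet: (erased mark, final mark, keyed answer) per item.
sheet = [(None, "A", "A"), ("B", "A", "A"), ("A", "C", "A"),
         ("B", "C", "A"), (None, "D", "B")]

counts = Counter(classify_erasure(e, f, k) for e, f, k in sheet)
erasures = sum(n for cat, n in counts.items() if cat != "no_erasure")
wr_share = counts["wrong_to_right"] / erasures if erasures else 0.0
print(dict(counts), wr_share)
```

Counts like these can be aggregated per examinee or per classroom before any flagging criterion is applied.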

The purpose of this study is to examine the role of erasure analysis in a large-scale assessment where erasure behaviors are more likely to occur at the individual examinee level. For this assessment, examinees testing in the same room may have no relationship with one another, and proctors usually do not know examinees and have no particular motivation to change examinees’ answers. Therefore, it is likely that erasures will occur when individual examinees change answers based on (a) their own knowledge or reconsideration of items (typical erasures) or (b) copying answers from others (aberrant erasures). This study will investigate the following questions:

  • Do erasure behaviors differ between high vs. low ability examinees?
  • Does a relationship exist between erasure behaviors and item difficulty/item characteristics?
  • Do erasure behaviors differ across test forms (in the same subject)?
  • Do erasure behaviors differ across tested subject areas?
  • What flagging index and flagging criteria are appropriate when erasure analysis is used at the individual level?

The analyses conducted in this study will provide practical baseline information about likely characteristics of individual erasure patterns, which could form the basis for flagging individual examinees for aberrant erasure behavior.


Cheating on statewide assessments: Briefing on GAO work and audience participation
     Jamila Kennedy - US Government Accountability Office

Statewide assessments are used to measure student achievement and serve as the basis for school accountability systems. In 2009, the U.S. Government Accountability Office (GAO) reported that the U.S. Department of Education had not incorporated best practices in assessment security into its peer review protocols, and that certain states rely on inadequate security procedures that may negatively impact the reliability of their assessment systems.[1] Since that report was released, numerous reports have surfaced about incidences of cheating in school districts across the country. The Chairman, Senate Subcommittee on Oversight of Government Management, the Federal Workforce, and the District of Columbia, has requested that GAO evaluate (1) the extent to which states have implemented leading practices to prevent and detect cheating on statewide assessments; (2) how selected school districts implement or enhance state policies and procedures to prevent and detect cheating; and (3) to what extent the Department of Education provides oversight and assistance to states to help prevent and detect cheating.

[1] GAO, No Child Left Behind Act: Enhancements in the Department of Education’s Review Process Could Improve State Academic Assessments, GAO-09-911 (Washington, D.C.: September 2009).


Test Security for Multistage Tests: A Quality Control Perspective
     Charles Lewis, Yi-Hsuan Lee and Alina A. von Davier - Educational Testing Service

A multistage test (MST) is a computer-based assessment that may be thought of as a compromise between a linear test and a computer-adaptive test (CAT). As such, MSTs may be vulnerable to at least some of the major security threats associated with each of these types of test (e.g., copying for linear tests and item pre-knowledge for CATs). The degree of vulnerability of any particular MST to these threats, as well as others, will depend (among other things) on details of the MST assembly and administration design. To supplement these preventative measures, routine statistical monitoring of response and timing data for items, modules and tests, as well as the screening of performance of individual test takers and clusters of test takers, is essential. We strongly believe that test security procedures are properly understood as an aspect of quality control for a testing program, and that the goal of these procedures should be to report only valid test scores, while treating all test takers fairly.

For monitoring and screening procedures to be effective, it is critical that statistical baselines be developed for the different characteristics of the data that are being monitored. Since an MST has a different structure than either a linear test or a CAT, it is reasonable to expect that test takers’ behavior for MSTs will differ systematically from behavior observed for either linear tests or for CATs. While model-based screening procedures can be very useful in a test security context, we must be careful to avoid over-reliance on strong assumptions when deciding whether to question test takers’ scores or remove items from a pool. What we are looking for are data patterns that deviate substantially from our baselines in ways that are consistent with what would be expected from behavior that may lead to invalid scores.


Methods for detecting potential group-based security threats
     Dennis Maynes - Caveon Test Security

This paper reviews methods that have been used to detect group-based test security threats. Most of the methods have been devised and developed by practitioners and service providers to testing programs. As a result, research concerning Type I, Type II and overall error rates is lacking. The paper identifies studies where the methods have been used. The methods can be categorized into the following classes:

  • Erasure analysis,
  • Score change analysis,
  • Person-fit analysis,
  • Similarity analysis,
  • Response time analysis, and
  • Other measures.

The paper briefly lists and discusses known types of group-based test security threats. In this discussion, statistical analyses that have been used to detect the presumed threats will be listed. General methods based on means and counts are presented, along with the advantages of each. Some specific methods (e.g., goodness of fit tests) will also be mentioned. In this section, the importance of protecting against alpha-inflation in the data mining context will be stressed. This small segment includes a discussion of the maximum order statistic and the Bonferroni correction.
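
The alpha-inflation concern can be illustrated with a short calculation. This is a minimal sketch under the assumption that the screening tests are independent, which rarely holds exactly in practice. The function names are hypothetical. The last function gives the per-test level implied by the distribution of the smallest of m independent p-values (equivalently, the maximum order statistic of the test statistics), often called the Šidák adjustment.

```python
import math


def familywise_error(alpha: float, m: int) -> float:
    """Chance of at least one false flag across m independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_alpha(alpha: float, m: int) -> float:
    """Per-test level that keeps the familywise error at or below alpha."""
    return alpha / m


def max_order_alpha(alpha: float, m: int) -> float:
    """Per-test level derived from the smallest of m independent p-values."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)


m = 100  # hypothetical number of groups screened
print(familywise_error(0.05, m))   # unadjusted screening
print(bonferroni_alpha(0.05, m))   # Bonferroni per-test level
print(max_order_alpha(0.05, m))    # slightly less conservative alternative
```

With 100 groups screened at an unadjusted .05 level, at least one false flag is almost certain, which is the motivation for adjusting the per-test level.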

Each detection class is presented. The presentation for each one includes references to studies that have been conducted. Because research concerning the effectiveness of the methods is lacking, the presentation provides author observations concerning appropriateness of assumptions, including strengths and weaknesses of the methods.

The main purpose of this presentation is a “call to validate” the methods that are currently in use. The presentation also intends to provide an overarching framework or, minimally, a suggested taxonomy for researching methods that are designed to detect group-based test fraud.

  • No Paper Available


Score differencing methods to detect potential test fraud
     Dennis Maynes - Caveon Test Security

Person-fit analysis has been extensively studied as a means of detecting cheating on tests (see Karabatsos, 2003). In the security context, person-fit analysis has been used to detect inconsistent performance which may be related to pre-knowledge of the exam questions. When the test items may be partitioned into items which may have been disclosed and items which may be presumed to be secure (i.e., because of low exposure rates), performance on the two item subsets may be compared probabilistically. Applying knowledge of disclosed and secure items to the person-fit problem allows the analyst to make stronger inferences with respect to exam security than using general person-fit statistics.

This paper discusses score differencing using Item Response Theory (IRT) models and using the beta-binomial model (i.e., when all items are equally difficult for an examinee). Distribution theory for the difference scores using IRT is derived. Some applications of the method are discussed.
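
The general idea of score differencing can be sketched with a simple normal approximation. This is not the distribution theory derived in the paper: the function name is hypothetical, the per-item success probabilities are assumed to come from an already fitted model, and items are assumed to be answered independently.

```python
import math
from typing import Sequence


def score_difference_z(disclosed_resp: Sequence[int], disclosed_p: Sequence[float],
                       secure_resp: Sequence[int], secure_p: Sequence[float]) -> float:
    """Standardized difference between disclosed-item and secure-item scores.

    Responses are 0/1; probabilities are model-implied chances of success
    (an assumption of this sketch). Independence across items is assumed.
    """
    observed = sum(disclosed_resp) - sum(secure_resp)
    expected = sum(disclosed_p) - sum(secure_p)
    variance = (sum(p * (1 - p) for p in disclosed_p)
                + sum(p * (1 - p) for p in secure_p))
    return (observed - expected) / math.sqrt(variance)


# Hypothetical examinee: perfect on 10 possibly disclosed items,
# 5 of 10 on secure items of the same difficulty.
z = score_difference_z([1] * 10, [0.5] * 10,
                       [1, 0] * 5, [0.5] * 10)
print(z)
```

A large positive value means the examinee did much better on the possibly disclosed items than the model predicts relative to the secure items.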

After presenting the general theory, the paper applies the theory to the case when the items may be assumed to be equally difficult. It is demonstrated that the assumption of equally difficult items lends itself to the design of an embedded verification test (EVT). The embedded verification test is a subset of items that has been created to answer questions concerning the veracity of the exam score, when it is known that the exam content is highly susceptible to disclosure (e.g., through braindump websites).

The EVT design methodology is presented, including a method for designing the test so as to control error rates. Results from a practical experiment conducted by an actual certification program of the EVT design methodology are presented. After discussing the results of the experiment, considerations concerning policy and procedure for the use of the EVT to strengthen security are briefly summarized.

  • No Paper Available


Use of response similarity analysis to detect test fraud
     Dennis Maynes - Caveon Test Security

Response similarity analysis is arguably the most important test security analysis that can be conducted in order to identify and mitigate test security risks, because (1) the analysis provides very strong results, and (2) much test fraud is committed by collusion (i.e., two or more individuals working together to breach exam security) or answer-copying.

The paper presents an argument which favors using a response similarity statistic as opposed to other answer-copying statistics. As a part of this argument, clustering to detect groups of individuals that have possibly compromised the exam security is described. After establishing why response similarity statistics are important, the paper derives a bivariate response similarity statistic which uses evidence from identical incorrect answers and identical correct answers. This statistic is shown to be more powerful (using theoretical arguments) than similar univariate statistics.

The paper presents methods for interpreting the results from a similarity analysis, when pairs and groups of non-independently taken tests are detected. The null distribution for detecting non-independent test taking of the bivariate statistic is studied through simulation. Simulation results are also presented concerning the null distribution of detecting group-based non-independent test taking.

Extensions to the similarity analysis for individuals and for groups, including source-copier attribution analysis, pass-rate change analysis, and score-change analysis are briefly mentioned. Conditions that violate the assumption of independent testing, besides cheating, are also discussed.

While a power study is being planned, the paper does not provide research concerning power of the statistic. The simulations in the paper were performed to confirm that the statistic was behaving appropriately when the null hypothesis of independent test taking is true.

  • No Paper Available


Law school admission council statistical fraud detection procedures
     Peter Pashley and Jennifer Lawlor - Law School Admission Council

The Law School Admission Council (LSAC) is best known for administering the Law School Admission Test (LSAT). The LSAT is a high-stakes, large-volume assessment of verbal and analytical reasoning skills and is relied on for the purposes of admission to law schools around the world. Currently, the LSAT uses a paper-and-pencil format and is administered in large testing centers four times a year. While sections within a form of the test are presented in varying orders for security purposes, the potential for a test taker to copy from a neighboring test taker still exists. Collusion and impersonation are also concerns. Over the years, LSAC has developed an array of security measures, including a search for response inconsistencies to identify potential cases of fraud.

While the variety of detection tools routinely applied to the LSAT will be outlined, greater focus will be given to a recently developed technology to detect aberrant responding behavior. This new methodology compares test-taker performance for prespecified portions of the test and flags test takers for whom the differences are statistically significant. The application of this method is flexible in that partitions of the test may be user defined. Meaningful comparisons may include exposed versus unexposed items, scored versus unscored (i.e., pretest) items, and compromised versus uncompromised items. In paper-and-pencil testing, unusual agreement of incorrect answers for neighboring test takers may also be investigated. While LSAC’s application focuses on paper-and-pencil testing, extensions to computerized adaptive tests also exist. These procedures will be outlined in this presentation and examples of actual cases given.

A compare and contrast analysis of multiple methods
     Christie Plackner - Data Recognition Corporation

While test cheating behavior has been acknowledged as an issue in high stakes testing for some time, recent news coverage has brought it to the forefront of the minds of assessment professionals across the nation. A survey of states by USA Today (Sept 12, 2011) found that twenty states (and Washington D.C.) conducted erasure analyses, but that few did anything meaningful with the results. We hypothesize that this is not necessarily due to negligence, but more due to a lack of understanding. For example, although it is true that erasure analysis is a valuable tool in detecting teacher or administrator impropriety in terms of physical evidence, it does not accurately predict inappropriate behavior as a single indicator.

It is naïve for one to assume that the only way a teacher or administrator will influence a student’s test answer is to erase a wrong answer and replace it with a correct one. It has been found, for example, that student answers may also be influenced by having answers written on the board, making copies of the test ahead of time and teaching to the questions, having lower achieving students sit next to higher achieving ones, or by receiving instructions to not fill in the answer sheet until confirming the answer with their teacher.

This study will investigate the interrelationships among multiple methods using correlations and factor analysis. In addition to erasure analysis, the methods include:

  • Scale score gains/losses in cohort and non-cohort,
  • Performance level gains/losses in cohort and non-cohort,
  • Measurement model misfit (e.g., Rasch residuals),
  • Pattern analysis (e.g., Jacob and Levitt, 2003), and
  • Across and within subject predictions (cohort and non-cohort).


Data forensics: Erasure behavior and AYP consequences
     Vince Primoli - Data Recognition Corporation

This paper asks the question: do the penalties facing schools and teachers affect the likelihood of irregularities? The relationship between erasure behavior and Adequate Yearly Progress (AYP) status of schools is examined. As consequences can dictate actions, it is presumed that aberrant erasure rates stemming from acts of impropriety can be dictated by the formalized system of AYP accountability. The AYP statuses of schools are bucketed into categories:

  • Made AYP (Made) – met all AYP criteria for all relevant subjects.
  • Level 1 (L1) – did not meet AYP criteria for first time in current year.
  • Level 2 (L2) – did not meet AYP criteria for at least two years.
  • Level 3 (L3) – did not meet AYP criteria for at least four years.
  • Making Progress (MP) – met AYP criteria for first year of two-year probationary period and must repeat next year to be considered “back on track.”

These categories constitute a progression when examined across years; therefore directional relationships are considered. For instance, two-thirds of schools met AYP standards in both of any two subsequent years in this study; therefore, one would expect this norm group to have lower incidence of aberrant erasure rates due to acts of impropriety. Schools showing directional AYP gains are compared to this norm group. Particularly, MP-Made (schools moved from Making Progress in one year to Made AYP the next) and L3-MP (schools moved from Level 3 in one year to Making Progress the next) are found in starkest contrast to the majority that Made AYP in both years.

Summary statistics and graphics illustrate relationships and similar evidence is provided on selected school-cases. The practical implementation of data forensics ultimately leads to examination of individual school-cases. This paper provides an example of what that looks like from an erasure analysis perspective.


Local outlier detection in data forensics: data mining approach to flag unusual schools
     Mayuko Simon - Data Recognition Corporation

Statistical methods used in data forensics are adept at finding global outliers, observations that are extreme relative to the study population. Some schools, however, may not display sufficient extremity to make them outliers in comparison to all schools. Observations that are extreme with respect to their peers but not necessarily to the entire data set are local outliers. Traditional data forensic techniques lack the ability to detect local outliers; hence, we turn to data mining. Data mining offers other benefits, most prominently, an inherently multivariate and non-parametric modeling approach.

The data used were from a large-scale standardized state assessment program. A method with origins in data mining was applied to identify schools with outlying scores relative to their peer schools. An iterative algorithm, with multiple regression as its centerpiece, identified the importance of variables proportional to the corresponding regression coefficient as computed within the peer group. The variables used to predict current performance were previous scores and standard deviations, erasure rates, and student counts. Once the weights were computed, the score of each school was assessed relative to peer schools. Bootstrapping drew the empirical score distribution within the group of peer schools and calculated the empirical p-value for the observed score.
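
The final step, an empirical p-value for a school's score relative to its peers, can be sketched as follows. This covers only the resampling step under simplifying assumptions: the peer group and the school scores are taken as given, the iterative regression weighting is not reproduced, and the function name, scores, and number of draws are hypothetical.

```python
import random
from typing import Sequence


def bootstrap_p_value(observed: float, peer_scores: Sequence[float],
                      n_draws: int = 10_000, seed: int = 0) -> float:
    """Share of bootstrap draws from the peer scores that reach the observed score."""
    if not peer_scores:
        raise ValueError("peer group is empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    draws = rng.choices(list(peer_scores), k=n_draws)  # sample with replacement
    return sum(1 for s in draws if s >= observed) / n_draws


# Hypothetical peer-group scores and one school under review.
peers = [48.0, 50.5, 51.0, 49.2, 52.3, 50.1, 47.8, 51.7]
p_local = bootstrap_p_value(58.0, peers)
print(p_local)
```

A school whose score no peer reaches gets a p-value of zero under this sketch even if it would not stand out among all schools, which is the sense in which the flag is local.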

Upon examining the data for the flagged schools, four groups of schools emerged. Two groups exhibited apparently suspect behavior. A third group contained high-achieving schools. The fourth group of schools—although extreme in comparison to peer schools—did not show suspicious behavior.

A strength of this method is increased sensitivity. While some schools were incorrectly flagged for exceptional achievement, the majority of flagged schools exhibited suspicious behavior. Some schools were also flagged by other statistical methods, but this method also correctly flagged some schools that were not detected by competing methods. Manual inspection of these schools’ data revealed suspicious behavior.


A hierarchical linear modeling approach for detecting cheating and aberrance
     William Skorupski - University of Kansas, Karla Egan - CTB

The purpose of this study is to demonstrate, through Monte Carlo simulation, the utility of a newly presented method for detecting group-level cheating and aberrance (Skorupski & Egan, 2011). The method relies on vertically scaled test scores over grade levels. Using these data, the change in individual scores nested within groups (classrooms or schools) over time may be modeled. The approach is based on a hierarchical linear model (HLM), and evaluates unusually large group-by-time interaction effects as evidence of potential cheating or aberrance. The authors have previously demonstrated this method using real data from a large, statewide testing program. Some external evidence of suspected cheating was available and used to cross-validate those schools flagged as potential cheaters. This approach provided some encouraging success, but because real data were used, the accuracy of the method couldn’t be demonstrated. The purpose of the current study is to directly evaluate how well the method can identify simulated groups which are known a priori to demonstrate such aberrance (which could be evidence of cheating or something else unusual). This evaluation will be conducted by considering marginal recovery of the known parameter values from the HLM, as well as a determination of power and Type I error rates for identifying aberrant versus non-aberrant groups.


A parametric approach to detect a disproportionate number of identical item responses on a test
     Leonardo S. Sotaridona, Arianot Wibowo, and Irene Hendrawan - Measurement Incorporated

Group-level cheating analysis is particularly useful when testing personnel are interested in monitoring possible occurrences of cheating in schools, districts, counties, or the entire state. Unlike the previous method (Sotaridona & Choi, 2007), the method presented here uses a polytomous IRT model to estimate the item response probability $P_{ikj}$. Let $i = 1, 2, \ldots, N$ denote multiple-choice items with options $k = 1, 2, \ldots, K_i$, and let $M_{jj'}$ denote the number of matching item responses by an examinee pair $(j, j')$, $j' \neq j$. Assuming independent responding, the expected probability that $(j, j')$ will match on their response to item $i$ is $P_{ijj'} = \sum_{k=1}^{K_i} P_{ikj} P_{ikj'}$. The standardized value $Z_{jj'} = (M_{jj'} - E(M_{jj'})) / \sqrt{\mathrm{Var}(M_{jj'})}$ is asymptotically $N(0,1)$, where $E(M_{jj'})$ and $\mathrm{Var}(M_{jj'})$ are the expectation and variance of $M_{jj'}$, respectively. Let $(\mu_u, \sigma^2)$ denote the mean and variance of $Z_{jj'}$ within class $u$. When there is copying in a class, $\mu_u$ would deviate from its expectation $\mu$, e.g., be larger than 0. Hence, the null hypothesis $H_0: \mu_u - \mu = 0$ is tested against $H_1: \mu_u > 0$. A class is flagged for cheating if $T_u \geq z^*$, where, for a level of significance $\alpha$, $\Pr(T_u \geq z^*) = \alpha$.
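
The pairwise quantities in this paragraph can be sketched numerically. This is an illustrative sketch, not the authors' procedure: it assumes the option-level probabilities for each examinee are already estimated, that the two examinees respond independently of each other under the null hypothesis, and that items are independent, so the expectation and variance of the match count are sums over items. The function names are hypothetical, and the class-level statistic is not reproduced.

```python
import math
from typing import Sequence


def match_probability(probs_j: Sequence[float], probs_jp: Sequence[float]) -> float:
    """Chance that two independent examinees pick the same option on one item."""
    return sum(pj * pjp for pj, pjp in zip(probs_j, probs_jp))


def standardized_matches(resp_j: Sequence[int], resp_jp: Sequence[int],
                         option_probs_j: Sequence[Sequence[float]],
                         option_probs_jp: Sequence[Sequence[float]]) -> float:
    """Standardized number of matching responses for one examinee pair.

    Responses are option indices; option_probs_*[i][k] is the probability that
    the examinee selects option k on item i (assumed to be given).
    """
    matches = sum(1 for a, b in zip(resp_j, resp_jp) if a == b)
    p_match = [match_probability(pj, pjp)
               for pj, pjp in zip(option_probs_j, option_probs_jp)]
    expected = sum(p_match)
    variance = sum(p * (1.0 - p) for p in p_match)  # independence across items assumed
    return (matches - expected) / math.sqrt(variance)


# Hypothetical pair: four 4-option items, both examinees choosing uniformly at random.
uniform = [[0.25] * 4 for _ in range(4)]
z_pair = standardized_matches([0, 1, 2, 3], [0, 1, 2, 3], uniform, uniform)
print(z_pair)
```

Values like `z_pair` would be computed for every pair in a class before the class-level mean is examined.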

An adjusted test statistic replaces $(\mu, \sigma^2)$ with the sample estimates $(\hat{\mu}, \hat{\sigma}^2)$ and is evaluated from $T_u$. The goals of this paper are to replicate the study by Sotaridona & Choi using different real datasets and significantly expanded simulation studies for better generalizability, and to compare the statistical properties of the two methods. Also, the detection rates were investigated under varying conditions such as class size, number and type of items copied, and ability level. The present study was able to replicate the results obtained by Sotaridona & Choi. The error rates of the new method appeared slightly inflated, so it is recommended to set the significance level lower than the actual target value in order to compensate for the liberal nature of the test. With at least 20% copying, the new method showed promising detection rates and was consistently more powerful than the nonparametric method after adjusting the error rates. Furthermore, the detection rates increased with class size, item difficulty, and number of items copied. Detailed study design, analysis of the results, and guidance on selecting a reasonable significance level are discussed.


Answer changing behavior on computer-based tests: Implications for detecting test fraud
     Gail C. Tiemann and Neal M. Kingston - University of Kansas

In high-stakes educational achievement testing, data forensics has risen as a key approach for supporting the integrity and interpretability of student test scores. Answer-changing analysis, one of a family of test integrity analyses, has been studied in the context of paper-pencil tests for decades. Since 1929, researchers have examined answer-changing from a variety of perspectives and contexts, documenting both general frequency as well as relationships with a variety of student and item-level variables (Mathews, 1929; Benjamin et al., 1984; Al-Hamly, 2005; van der Linden et al., 2011). More recently, answer-changing analysis has been used as a mechanism for flagging potential cheating behavior (Qualls, 2001; DRC, 2011). Indeed the approach, also called erasure analysis, was one of several methods used by investigators to uncover recent and widespread cheating in Atlanta public schools (Severson, 2011).

While student answer-changing behavior has previously been described in the context of paper-pencil tests, this study will explore answer-changing on computer-based tests. A sample of computer-based summative achievement tests in mathematics and reading for grades 3, 8 and 11 from one state testing program will be used to examine test, school and student variables (item difficulties and discriminations, student ability and demographic characteristics, response time) and their relationships to answer-changing frequencies (raw frequencies and wrong-to-right proportions). Group differences in answer-changing behaviors (subject, grade-level, district) will also be reported.
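
The two frequency measures named above (raw answer-change counts and wrong-to-right proportions) can be illustrated with a minimal Python sketch. The abstract does not say how these measures are computed, so everything here is an assumption made for illustration: the record layout (`AnswerChange`), the function name `summarize_changes`, and the choice to count every recorded change separately rather than comparing only a student's first and final answer on an item.

```python
from dataclasses import dataclass


@dataclass
class AnswerChange:
    """One recorded answer change on one item (hypothetical log format)."""
    student_id: str
    item_id: str
    old_option: str
    new_option: str
    key: str  # correct option for the item


def summarize_changes(changes):
    """Return the raw change count, wrong-to-right count, and their ratio.

    Assumption: each recorded change is counted on its own, so an item
    changed twice by the same student contributes two records.
    """
    total = len(changes)
    wrong_to_right = sum(
        1 for c in changes
        if c.old_option != c.key and c.new_option == c.key
    )
    proportion = wrong_to_right / total if total else 0.0
    return {
        "total": total,
        "wrong_to_right": wrong_to_right,
        "wtr_proportion": proportion,
    }


# Illustrative records only; these are not data from the study.
example_changes = [
    AnswerChange("s1", "i1", "A", "B", "B"),  # wrong to right
    AnswerChange("s1", "i2", "B", "C", "B"),  # right to wrong
    AnswerChange("s2", "i1", "A", "C", "B"),  # wrong to wrong
    AnswerChange("s2", "i3", "D", "A", "A"),  # wrong to right
]
summary = summarize_changes(example_changes)
```

Grouping the same records by subject, grade level, or district before calling the function would give the kind of group comparison the abstract mentions.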

As state testing programs are increasingly called on to ensure the accuracy and integrity of student test data (Duncan, 2011), results from this study will advance our understanding of answer-changing in the new era of computer-based achievement tests, as well as contribute to the decades-long scholarship on student answer-changing behaviors.


How to detect cheating badly
     Howard Wainer - National Board of Medical Examiners

With the 2001 passage of No Child Left Behind and the 2009 follow-on, Race to the Top, there has been increasing pressure on schools and teachers to show gains in students’ test scores. This pressure was not accompanied by improved family background, increased community support for education or vastly increased financial resources. Nonetheless, there have been a substantial number of instances of incredible gains. In a fair number of these cases, the improvements were so remarkable that cheating was suspected. Investigations of such allegations were usually carried out by the school district or, more often, the state’s department of education. Looking at a number of these investigations reveals the lack of experience these institutions have had in such work; mistakes of design, analysis and inference were made that render their conclusions too equivocal for serious purposes.

In this talk I will discuss a mixture of such cases that have occurred over the last two years.


Identifying non-effortful student behavior on adaptive tests: Implications for test fraud detection
     Steven L. Wise and Lingling Ma - Northwest Evaluation Association

Whenever we administer a test to measure a student’s level of proficiency in some educational domain, our primary goal is to obtain a valid score. Because of the pervasive role that test scores play in a variety of educational inferences about students (e.g., academic progress, graduation, entrance into college), it is important that we have methods for evaluating the trustworthiness of such information. Individual score validity (ISV) refers to the degree to which a test score is a reasonable indicator of an examinee’s proficiency on a construct of interest.

Non-effortful test-taking is a major contributor to low ISV on educational assessments. Without adequate effort, the resulting test score usually underestimates what a student actually knows and can do. In such cases, the student’s test score will not provide trustworthy information to educators.

A computerized adaptive test (CAT) has advantages over paper-and-pencil tests and non-adaptive computer-based tests (CBTs) in both the identification and management of non-effortful behavior. Consequently, we have developed reliable and valid methods for identifying non-effortful test-taking behavior on NWEA’s MAP assessment. We will discuss these methods, and demonstrate their use with data from the MAP assessment.

While the detection of non-effortful behavior has clear value for identifying low-ISV scores, it may additionally have utility in detecting a type of potential test fraud. The U.S. Department of Education has recently encouraged states to use measures of student growth as part of an evaluation of teacher effectiveness. If growth is considered as the change in student proficiency from fall to spring of an academic year, there will be an incentive for teachers to try to “game” the system by encouraging (either subtly or overtly) non-effortful behavior during fall testing, thereby depressing their students’ fall test performance, and overstating the growth gains that occur. Effort analysis methods could be used to detect instances in which teachers have engaged in this type of deceptive behavior.
