The detection of differential item functioning (DIF) is an essential step in increasing the validity of a test for all groups. The item response theory (IRT) model comparison approach has been shown to be the most flexible and powerful method for DIF detection; however, it is computationally intensive, requiring many model refittings. The Wald test, originally employed by Lord for DIF detection, is asymptotically equivalent to this approach and requires only one model fitting. In this research, the Wald test for DIF detection was improved from Lord's original conception through modern error estimation, concurrent calibration, maximum marginal likelihood item parameter estimation, conditional DIF tests, and extensions to commonly used IRT models as well as to multiple groups. This research examined the Type I error and power of the Wald test by varying the magnitude of DIF, the mean difference between groups, the test length, and the sample size per group. Data were simulated under the graded response model and the three-parameter logistic (3PL) model. An additional simulation study compared the IRT model comparison approach to the Wald test under the two-parameter logistic model. The results indicated that the Wald test performs well in detecting DIF. Its performance improves with larger sample sizes, greater magnitudes of DIF, greater test lengths, and the random assignment estimation procedure. Larger sample sizes and greater test lengths are most critical in situations employing the 3PL model. The Wald test also performs well compared to the IRT model comparison approach, although the results of the two methods should converge asymptotically. This research also demonstrated the flexibility of the Wald test through its straightforward extension to multiple groups. An example was used to demonstrate the effectiveness of the Wald test and to compare it to the IRT model comparison approach. The Wald test was able to accurately identify the source of DIF. The IRT model comparison approach, by contrast, appeared more powerful but confounded the results of the DIF tests because it combined groups. Several considerations for designing a DIF detection framework given multiple groups were outlined, particularly the superiority of the Wald test when sample sizes are unequal.
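
As a rough illustration of the kind of statistic involved, the following minimal sketch computes a Lord-style Wald chi-square for a single two-parameter logistic item in two groups. It is not the implementation used in this research: the function name `wald_dif_2pl` and the input values are hypothetical, the item parameter estimates and their covariance matrices are assumed to be supplied already, and the two groups' estimates are treated as independent, which may not hold under concurrent calibration.

```python
import math


def wald_dif_2pl(est_ref, cov_ref, est_foc, cov_foc):
    """Wald chi-square (2 df) comparing one 2PL item's (a, b) across two groups.

    est_*: (a, b) estimates for the reference and focal groups.
    cov_*: 2x2 covariance matrices of those estimates (nested sequences).
    Assumption: the two groups' estimates are independent, so the covariance
    of the difference is the sum of the two covariance matrices.
    """
    # Difference in parameter estimates (focal minus reference).
    d0 = est_foc[0] - est_ref[0]
    d1 = est_foc[1] - est_ref[1]

    # Covariance matrix of the difference (symmetric 2x2).
    s00 = cov_ref[0][0] + cov_foc[0][0]
    s01 = cov_ref[0][1] + cov_foc[0][1]
    s11 = cov_ref[1][1] + cov_foc[1][1]

    det = s00 * s11 - s01 * s01
    if det <= 0:
        raise ValueError("covariance of the difference is not positive definite")

    # Quadratic form d' S^{-1} d, using the closed-form 2x2 inverse.
    stat = (s11 * d0 * d0 - 2.0 * s01 * d0 * d1 + s00 * d1 * d1) / det

    # Upper-tail probability of a chi-square with 2 df is exp(-x / 2).
    p_value = math.exp(-stat / 2.0)
    return stat, p_value


# Hypothetical estimates: equal slopes, difficulty shifted by 0.5 in the focal group.
cov = [[0.01, 0.0], [0.0, 0.02]]
stat, p_value = wald_dif_2pl((1.0, 0.0), cov, (1.0, 0.5), cov)
print(f"chi-square = {stat:.2f}, p = {p_value:.3f}")
```

Because the statistic needs only the estimates and their covariance matrix from one fitted model, no refitting is required for each tested item, which is the computational advantage described above.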