Out-of-level testing refers to the practice of assessing a student with a test that is intended for students at a higher or lower grade level. Although the appropriateness of out-of-level testing for accountability purposes has been questioned by educators and policymakers, incorporating out-of-level items in formative assessments for accurate feedback is recommended. This study made use of a commercial item bank with vertically scaled items across grades and simulated student responses in a computerized adaptive testing (CAT) environment. Results of the study suggested that administration of out-of-level items improved measurement accuracy and test efficiency for students who perform significantly above or below their grade-level peers. This study has direct implications for the relevance, applicability, and benefits of using out-of-level items in CAT.
In this article, the effect of the upper and lower asymptotes in item response theory models on computerized adaptive testing is shown analytically. This is done by deriving the step size between adjacent latent trait estimates under the four-parameter logistic model (4PLM) and two models it subsumes, the usual three-parameter logistic model (3PLM) and the 3PLM with upper asymptote (3PLMU). The authors show analytically that the large effect of the discrimination parameter on the step size holds true for the 4PLM and the two models it subsumes under both the maximum information method and the b-matching method for item selection. Furthermore, the lower asymptote helps reduce the positive bias of ability estimates associated with early guessing, and the upper asymptote helps reduce the negative bias induced by early slipping. Relative step size between modeling versus not modeling the upper or lower asymptote under the maximum Fisher information method (MI) and the b-matching method is also derived. It is also shown analytically why the gain from early guessing is smaller than the loss from early slipping when the lower asymptote is modeled, and vice versa when the upper asymptote is modeled. The benefit to loss ratio is quantified under both the MI and the b-matching method. Implications of the analytical results are discussed.
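As a concrete illustration of the asymptotes discussed above, the following is a minimal sketch of the 4PLM response function and the two submodels it subsumes (function and parameter names are illustrative, not taken from the article):

```python
import numpy as np

def p_4pl(theta, a, b, c, d):
    """Probability of a correct response under the 4PLM.

    c is the lower asymptote (guessing) and d the upper asymptote
    (slipping). Setting d = 1 recovers the usual 3PLM; setting c = 0
    with d < 1 gives the 3PLM with an upper asymptote (3PLMU).
    """
    return c + (d - c) / (1.0 + np.exp(-a * (theta - b)))

# With c > 0, a lucky guess at low theta is partly absorbed by the
# lower asymptote; with d < 1, an early slip at high theta is
# likewise absorbed by the upper asymptote.
p_guess = p_4pl(-2.0, a=1.5, b=0.0, c=0.2, d=1.0)   # 3PLM case
p_slip  = p_4pl( 2.0, a=1.5, b=0.0, c=0.0, d=0.85)  # 3PLMU case
```

Note that the response probability is bounded between c and d, which is what dampens the influence of a single aberrant early response on the ability estimate.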
This article introduces two new item selection methods, the modified posterior-weighted Kullback-Leibler index (MPWKL) and the generalized deterministic inputs, noisy and gate (G-DINA) model discrimination index (GDI), that can be used in cognitive diagnosis computerized adaptive testing. The efficiency of the new methods is compared with the posterior-weighted Kullback-Leibler (PWKL) item selection index using a simulation study in the context of the G-DINA model. The impact of item quality, generating models, and test termination rules on attribute classification accuracy or test length is also investigated. The results of the study show that the MPWKL and GDI perform very similarly, and have higher correct attribute classification rates or shorter mean test lengths compared with the PWKL. In addition, the GDI has the shortest implementation time among the three indices. The proportion of item usage with respect to the required attributes across the different conditions is also tracked and discussed.
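For orientation, here is a minimal sketch of the posterior-weighted Kullback-Leibler (PWKL) idea that the proposed indices build on; the names and the simplified two-outcome setting are our assumptions, not the article's exact formulation:

```python
import numpy as np

def pwkl_index(posterior, p_item):
    """Posterior-weighted KL index for one candidate dichotomous item.

    posterior : posterior probability of each attribute pattern
    p_item    : P(correct | pattern) for this item under a diagnostic
                model (e.g., G-DINA success probabilities)
    """
    posterior = np.asarray(posterior, dtype=float)
    p_item = np.asarray(p_item, dtype=float)
    # Current point estimate: the most probable attribute pattern.
    c_hat = int(np.argmax(posterior))
    p0 = p_item[c_hat]
    # KL divergence between the response distribution at the point
    # estimate and at each candidate pattern, weighted by the posterior.
    eps = 1e-12
    kl = (p0 * np.log((p0 + eps) / (p_item + eps))
          + (1 - p0) * np.log((1 - p0 + eps) / (1 - p_item + eps)))
    return float(np.sum(posterior * kl))
```

An item whose success probabilities differ sharply across plausible patterns receives a large index and would be selected next; an item that cannot separate the patterns scores near zero.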
The alignment between a test and the content domain it measures represents key evidence for the validation of test score inferences. Although procedures have been developed for evaluating the content alignment of linear tests, these procedures are not readily applicable to computerized adaptive tests (CATs), which require large item pools and do not use fixed test forms. This article describes the decisions made in the development of CATs that influence and might threaten content alignment. It outlines a process for evaluating alignment that is sensitive to these threats and gives an empirical example of the process.
This article reviews the software package SimuMCAT, which simulates unidimensional and multidimensional computerized adaptive testing with various item types (dichotomous/polytomous) and loading structures (simple-/complex-structured). In addition, the software allows users to choose from five item selection procedures, two stopping rules for variable-length tests, and test constraints that satisfy a test blueprint and limit item exposure.
This article discusses four item selection rules for designing efficient individualized tests for the random weights linear logistic test model (RWLLTM): minimum posterior-weighted D-error (D-B), minimum expected posterior-weighted D-error (EDB), maximum expected Kullback-Leibler divergence between subsequent posteriors (KLP), and maximum mutual information (MUI). The RWLLTM decomposes test items into a set of subtasks or cognitive features and assumes individual-specific effects of the features on the difficulty of the items. The model extends and improves the well-known linear logistic test model, in which feature effects are only estimated at the aggregate level. Simulations show that the efficiencies of the designs obtained with the different criteria appear to be equivalent. However, KLP and MUI are given preference over D-B and EDB because of their lower complexity, which significantly reduces the computational burden.
ISBN: (print) 9783319091501; 9783319091495
In the digital world, any conceptual assessment framework faces two main challenges: (a) the complexity of the knowledge, capacities, and skills to be assessed; (b) the increasing usability of web-based assessments, which requires innovative approaches to the development, delivery, and scoring of tests. Statistical methods play a central role in such a framework. Item response models have been the most common statistical methods used to address these measurement challenges, and they underpin computer-based adaptive tests, which select items adaptively from an item pool according to the person's ability during test administration. The test is tailored to each student. In this paper we conduct a simulation study based on the minimum error-variance criterion, varying the item exposure rate (0.1, 0.3, 0.5) and the maximum test length (18, 27, 36). The comparison is made by examining the absolute bias, the root mean square error, and the correlation. Hypothesis tests are applied to compare the true and estimated distributions. The results suggest a considerable reduction in bias as the number of items administered increases, the occurrence of a ceiling effect in very short tests, and full agreement between the true and empirical distributions for computerized tests shorter than the paper-and-pencil tests.
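For the 2PL model, the minimum error-variance criterion amounts to selecting the item with maximum Fisher information at the provisional ability estimate. The following is a minimal sketch of such selection combined with a simple running exposure cap; the function names and the cap mechanism are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

def fisher_info(theta, a, b):
    """Fisher information of 2PL items at ability theta; maximizing
    this minimizes the error variance of the provisional estimate."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1.0 - p)

def select_item(theta_hat, a, b, used, exposure, n_seen, max_rate=0.3):
    """Pick the most informative unadministered item whose running
    exposure rate stays below max_rate.

    used     : boolean mask of items already given to this examinee
    exposure : how many examinees have seen each item so far
    n_seen   : number of examinees processed so far
    """
    info = fisher_info(theta_hat, a, b)
    rate = exposure / max(n_seen, 1)
    info[used | (rate >= max_rate)] = -np.inf  # rule out ineligible items
    return int(np.argmax(info))
```

With the cap at 0.3, an otherwise optimal item is skipped once three in ten examinees have seen it, which trades a little information per step for more even pool usage, mirroring the exposure-rate conditions varied in the study.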
This paper presents an original method for evaluating the outcomes of adaptive testing under a multilevel testing strategy. The set of testing outcomes consists of atypical elements of different dimensionality. The paper defines criteria for comparing these elements, describes principles for ordering the outcome set, and derives a final score. The ordering method is presented for multistage testing (MST) with a 1-3-3 design and is used to estimate the results of computerized adaptive testing (CAT). The method is not tied to a specific testing procedure; its application to the 1-3-3 design described in the paper confirms this. To order the set of testing outcomes, the function-criteria described in the initial article are used, and a comparative analysis of the obtained results is performed. The ordering criterion for testing outcomes need not be unique; the paper illustrates this through a comparative discussion of two samples. An original testing procedure is used to present the essence of the method. This procedure is intended to be illustrative, because the described assessment method can be applied to similar strategies. The ordered outcome set is scored on a hundred-point scale according to the normal distribution. The applied results of this research are implemented as the "Adaptester" portal, available at https://***.
Several forced-choice (FC) computerized adaptive tests (CATs) have emerged in the field of organizational psychology, all of them employing ideal-point items. However, although most items developed historically follow dominance response models, research on FC CAT using dominance items is limited, dominated by simulation studies, and lacking in empirical deployment. This empirical study trialed a FC CAT with dominance items described by the Thurstonian Item Response Theory model with research participants. It investigated important practical issues such as the implications of adaptive item selection and social desirability balancing criteria for score distributions, measurement accuracy, and participant perceptions. Moreover, nonadaptive but optimal tests of similar design were trialed alongside the CATs to provide a baseline for comparison, helping to quantify the return on investment when converting an otherwise-optimized static assessment into an adaptive one. Although the benefit of adaptive item selection in improving measurement precision was confirmed, results also indicated that at shorter test lengths CAT had no notable advantage over optimal static tests. Taking a holistic view incorporating both psychometric and operational considerations, implications for the design and deployment of FC assessments in research and practice are discussed.
The article presents adaptive testing strategies for polytomously scored technology-enhanced innovative items. We investigate item selection methods that match examinees' ability levels in location and explore ways to leverage test-taking speeds during item selection. Existing approaches to selecting polytomous items are mostly based on information measures and tend to suffer from an item pool usage problem. In this study, we introduce location indices for polytomous items and show that location-matched item selection significantly alleviates the usage problem and achieves more diverse item sampling. We also consider matching items' time intensities so that testing times can be regulated across examinees. A numerical experiment using Monte Carlo simulation suggests that location-matched item selection achieves significantly better and more balanced item pool usage. Leveraging working speed in item selection distinctly reduced both the average testing time and its variation across examinees. Both procedures incurred only a marginal measurement cost (e.g., in precision and efficiency) yet showed significant improvements in administrative outcomes. The experiment in two test settings also suggested that the procedures can lead to different administrative gains depending on the test design.
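A minimal sketch of what location-matched selection for polytomous items could look like; the location definition (mean of step difficulties) and all names are illustrative assumptions, and the article's indices may differ:

```python
import numpy as np

def item_location(step_difficulties):
    """A simple location index for a polytomous item: the mean of its
    step difficulties (one of several possible definitions)."""
    return float(np.mean(step_difficulties))

def location_matched_pick(theta_hat, locations, used):
    """Select the unadministered item whose location is closest to the
    current ability estimate. Unlike information-based selection, this
    tends to spread usage across the pool rather than repeatedly
    choosing the few most informative items."""
    dist = np.abs(np.asarray(locations, dtype=float) - theta_hat)
    dist[used] = np.inf  # skip items already administered
    return int(np.argmin(dist))
```

Because the matching target moves with the provisional ability estimate, different examinees are routed to different regions of the pool, which is the mechanism behind the more balanced usage reported above.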