Which statistic is useful in describing sources of test score variability?

CTT has several weaknesses that have led to the development of other models for test scores. First, the concept of reliability depends on the group used to develop the test: if the group has a wide range of skills or abilities, the reliability will be higher than if the group has a narrow range. Reliability is thus not invariant with respect to the sample of test-takers and is therefore not a characteristic of the test itself; the same is true of the common measures of item discrimination (such as the item-total correlation) and item difficulty (the percentage answering the item correctly). Moreover, the usual assumption that the standard error of measurement is the same for test-takers at all ability levels is usually incorrect. (In some extensions of CTT this assumption is dropped, but those extensions are not well known or widely used.) CTT also does not adequately account for observed test score distributions with floor and ceiling effects, in which a large proportion of test-takers score at either the low or high end of the score range.

CTT also has difficulty handling two typical test development problems: horizontal and vertical equating. Horizontal equating arises when one wishes to develop another test with the same properties as (or at least a known relationship to) an existing test. For example, students who take college admissions tests such as the ACT or SAT should get the same score regardless of which version of the test they take. Vertical equating involves developing a series of tests that measure a wide range of ability or skill. For example, while we could develop completely independent tests of arithmetic for each elementary school grade, we might instead want to link them to form one continuous scale of mathematical skill for grades 1 through 6. While vertical and horizontal equating are not impossible within CTT, they are much more straightforward with item response theory (IRT).

These problems of CTT are partly due to some fuzziness in the theory (the population to be sampled is usually not specified in any detail). But they are also due to the fact that most data collection is not carried out on a random sample from any population, let alone one considered appropriate for the test being investigated. In practice, convenience samples are used; these generally fit some criteria that the investigator specifies but are otherwise taken as they are available to the researcher. Finally, CTT as originally conceptualized was never intended to address some of the practical testing problems described above.


URL: https://www.sciencedirect.com/science/article/pii/B0080430767007221

Alpha Reliability

Doris McGartland Rubio, in Encyclopedia of Social Measurement, 2005

Definitions

Classical test theory indicates that:

(1) X = T + E

where X is the observed score, T is the true score, and E is the error. Reliability is the proportion of observed-score variance that is not error variance:

(2) r = 1 − Var(E)/Var(X)

Because error and reliability directly correspond to one another, the type of reliability that we assess for a measure depends on the type of error that we seek to evaluate. When measurement error within a measure is of concern, we seek to ascertain how much variability in the scores can be attributed to true variability as opposed to error. Measurement error within a measure arises from content sampling and the heterogeneity of the behavior sampled. Content sampling refers to the sampling of the items that make up the measure; if the sampled items are from the same domain, measurement error within the measure will be lower. Heterogeneity of behavior can increase measurement error when the items represent different domains of behavior. Other sources of measurement error within a test include guessing, mistakes, and scoring errors.

Internal consistency indicates the extent to which responses to the items within a measure are consistent. Coefficient alpha is the most widely used measure of internal consistency. Other measures of internal consistency include the split-half coefficient and Kuder-Richardson 20 (KR-20).
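Coefficient alpha can be computed directly from a persons-by-items score matrix. The following is a minimal sketch in Python, assuming a complete (n persons × k items) matrix of item scores; the function name and example data are illustrative, not from the original chapter.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for an (n_persons, n_items) score matrix."""
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: 5 respondents answering 4 Likert-type items.
scores = np.array([
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [1, 2, 1, 2],
    [3, 3, 3, 4],
])
print(round(cronbach_alpha(scores), 3))
```

For dichotomous (0/1) items, this same computation reduces to Kuder-Richardson 20.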


URL: https://www.sciencedirect.com/science/article/pii/B0123693985003959

Reliability Assessment

Edward G. Carmines, James A. Woods, in Encyclopedia of Social Measurement, 2005

Reliability

Random error is present in any measure. Reliability focuses on assessing random error and estimating its consequences. Although it is always desirable to eliminate as much random error from the measurement process as possible, it is even more important to be able to detect the existence and impact of random error. Because random error is always present to at least a minimal extent, the basic formulation in classical test theory is that the observed score equals the true score that would be obtained if there were no measurement error plus a random error component, or X = t + e, where X is the observed score, t is the true score, and e is the random disturbance. The true score is an unobservable quantity that cannot be directly measured; theoretically, it is the average that would be obtained if a particular phenomenon were measured an infinite number of times. The random error component, or random disturbance, reflects the chance fluctuations that cause repeated observations of the same phenomenon to differ.

Classical test theory makes the following assumptions about measurement error:

Assumption 1: The expected random error is zero,

E(e) = 0.

Assumption 2: The correlation between the true score and random error is zero,

ρ(t, e) = 0.

Assumption 3: The correlation between the random error of one variable and the true score of another variable is zero,

ρ(e1, t2) = 0.

Assumption 4: The correlation between errors on distinct measurements is zero,

ρ(e1, e2) = 0.

From these assumptions, we see that the expected value of the observed score is equal to the expected value of the true score plus the expected value of the error:

E(X) = E(t) + E(e).

Because, by Assumption 1, the expected value of e is zero, E(e) = 0, it follows that

E(X) = E(t).

This formula applies to repeated measurements of a single variable for a single person. However, reliability refers to the consistency of repeated measurements across persons and not within a single person. The equation for the observed score may be rewritten so that it applies to the variances of the single observed score, true score, and random error:

Var(X) = Var(t + e) = Var(t) + 2 Cov(t, e) + Var(e).

Assumption 2 stated that the correlation (and covariance) between the true score and random error is zero, so 2 Cov(t, e) = 0. Consequently,

Var(X) = Var(t) + Var(e).

So the observed score variance equals the sum of the true score variance and the random error variance. Reliability can be expressed as the ratio of the true score variance to the observed score variance:

ρx = Var(t)/Var(X).

That is, ρx is the reliability of X as a measure of t. Alternatively, reliability can be expressed in terms of the error variance as a proportion of the observed variance:

ρx = 1 − [Var(e)/Var(X)].

This equation makes it clear that reliability varies between 0 and 1. If all of the observed variance consists of error, reliability will be 0, because 1 − (1/1) = 0. At the other extreme, if there were no random error in the measurement of some phenomenon, reliability would be 1, because 1 − (0/1) = 1.
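A small simulation makes the variance decomposition concrete. This is a minimal sketch, assuming normally distributed true scores and independent errors with the variances chosen below; all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100_000
var_t, var_e = 9.0, 3.0                   # chosen true-score and error variances
t = rng.normal(50, np.sqrt(var_t), n)     # true scores
e = rng.normal(0, np.sqrt(var_e), n)      # random error: E(e) = 0, independent of t
x = t + e                                 # observed scores

print(x.var(ddof=1))                      # ~ Var(t) + Var(e) = 12
print(t.var(ddof=1) / x.var(ddof=1))      # reliability ~ 9/12 = 0.75
print(1 - e.var(ddof=1) / x.var(ddof=1))  # the same quantity via the error form
```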


URL: https://www.sciencedirect.com/science/article/pii/B0123693985001985

Inter-Rater Reliability

Robert F. DeVellis, in Encyclopedia of Social Measurement, 2005

Classical Test Theory and Reliability

According to classical test theory, a score obtained in the process of measurement is influenced by two things: (1) the true score of the object, person, event, or other phenomenon being measured and (2) error (i.e., everything other than the true score of the phenomenon of interest). Reliability, in general, is a proportion corresponding to a ratio between two quantities. The first quantity (denominator) represents the sum total of all influences on the obtained score. The other quantity (numerator) represents the subportion of that total that can be ascribed to the phenomenon of interest, often called the true score. The reliability coefficient is the ratio of variability ascribable to the true score relative to the total variability of the obtained score. Inter-rater reliability is merely a special case of this more general definition. The distinguishing assumption is that the primary source of error is due to the observers, or raters as they are often called.
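One common way to quantify this ratio is an intraclass correlation (ICC), which estimates the share of rating variance attributable to the rated targets rather than to the raters. The sketch below computes a one-way ICC from a targets-by-raters matrix; this is only one of several ICC variants, and the function name and data are illustrative.

```python
import numpy as np

def icc_oneway(ratings: np.ndarray) -> float:
    """One-way ICC: share of rating variance due to the rated targets."""
    n, k = ratings.shape                              # targets x raters
    target_means = ratings.mean(axis=1)
    grand_mean = ratings.mean()
    msb = k * ((target_means - grand_mean) ** 2).sum() / (n - 1)          # between targets
    msw = ((ratings - target_means[:, None]) ** 2).sum() / (n * (k - 1))  # within targets
    return (msb - msw) / (msb + (k - 1) * msw)

# Hypothetical ratings: 6 targets scored by 3 raters.
ratings = np.array([
    [9, 8, 9],
    [6, 5, 6],
    [8, 8, 7],
    [7, 6, 6],
    [10, 9, 9],
    [6, 7, 6],
])
print(round(icc_oneway(ratings), 3))
```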


URL: https://www.sciencedirect.com/science/article/pii/B0123693985000955

Classical Test Theory

Wim J. van der Linden, in Encyclopedia of Social Measurement, 2005

Parameter Estimation

The statistical treatment of CTT is not well developed. One of the reasons for this is the fact that its model is not based on the assumption of parametric families for the distributions of X_jt and T_jt in Eqs. (5) and (6). Direct application of standard likelihood or Bayesian theory to the estimation of classical item and test parameters is therefore less straightforward. Fortunately, nearly all classical parameters are defined in terms of first-order and second-order (product) moments of score distributions. Such moments are well estimated by their sample equivalents (with the usual correction for the variance estimator if we are interested in unbiased estimation). CTT item and test parameters are therefore often estimated using "plug-in estimators," that is, with sample moments substituted for population moments in the definition of the parameter.
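As an illustration, classical item difficulty (the sample proportion correct) and item discrimination (the item-total correlation) are both plug-in estimators built from sample moments. A minimal sketch, using a hypothetical 0/1 response matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 0/1 response matrix: 500 test-takers x 5 items,
# generated with decreasing proportions correct.
resp = (rng.random((500, 5)) < np.array([0.9, 0.7, 0.6, 0.4, 0.2])).astype(int)

total = resp.sum(axis=1)                 # observed total scores
difficulty = resp.mean(axis=0)           # sample proportion correct per item
# Item-total correlation as a plug-in estimate of item discrimination.
discrimination = np.array(
    [np.corrcoef(resp[:, j], total)[0, 1] for j in range(resp.shape[1])]
)
print(difficulty.round(3))
print(discrimination.round(3))
```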

A famous plug-in estimator for the true score of a person is the one based on Kelley's regression line. Kelley showed that, under the classical model, the least-squares regression line for the true score on the observed score is equal to

(33) E(T | X = x) = ρ_XT² x + (1 − ρ_XT²) μ_X.

An estimate of a true score is obtained if estimates of the reliability coefficient and the population mean are plugged into this expression. This estimator is interesting because it is based on a linear combination of the person's observed score and the population mean, with weights based on the reliability coefficient. If ρ_XT² = 1, the true-score estimate is equal to the observed score x; if ρ_XT² = 0, it is equal to the population mean, μ_X. Precision-weighted estimators of this type are typical of Bayesian statistics. For this reason, Kelley's result has been hailed as the first Bayesian estimator known in the statistical literature.
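A minimal sketch of this plug-in estimator in Python, assuming the reliability has already been estimated and the sample mean stands in for μ_X (the function name and numbers are illustrative):

```python
import numpy as np

def kelley_true_score(x: np.ndarray, reliability: float) -> np.ndarray:
    """Kelley's regression estimate of true scores: shrink observed
    scores toward the mean, weighted by the reliability."""
    return reliability * x + (1 - reliability) * x.mean()

observed = np.array([35.0, 50.0, 65.0, 80.0])
print(kelley_true_score(observed, reliability=0.80))
# With reliability 1.0 the estimates equal the observed scores;
# with reliability 0.0 they collapse to the sample mean.
```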


URL: https://www.sciencedirect.com/science/article/pii/B0123693985004497

Generalizability Theory

Richard J. Shavelson, Noreen M. Webb, in Encyclopedia of Social Measurement, 2005

Generalizability Coefficient

The generalizability coefficient is analogous to classical test theory's reliability coefficient (the ratio of the universe-score variance to the expected observed-score variance; an intraclass correlation). For relative decisions and a p × I × O random-effects design, the generalizability coefficient is:

(7) Eρ²(X_pIO, μ_p) = Eρ² = E_p(μ_p − μ)² / [E_O E_I E_p(X_pIO − μ_IO)²] = σ_p² / (σ_p² + σ_δ²)
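A minimal numerical sketch, assuming the variance components of a p × I × O random-effects design have already been estimated (the values below are hypothetical). For relative decisions, σ_δ² pools the interactions involving persons, each divided by the number of conditions averaged over:

```python
# Hypothetical estimated variance components from a p x I x O design.
var_p   = 0.50   # persons (universe-score variance)
var_pi  = 0.20   # person x item interaction
var_po  = 0.10   # person x occasion interaction
var_pio = 0.30   # person x item x occasion interaction, confounded with error

n_i, n_o = 10, 2  # numbers of items and occasions averaged over

# Relative error variance and the generalizability coefficient.
var_delta = var_pi / n_i + var_po / n_o + var_pio / (n_i * n_o)
g_coef = var_p / (var_p + var_delta)
print(round(g_coef, 3))
```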


URL: https://www.sciencedirect.com/science/article/pii/B0123693985001936

Graded Response Model

Fumiko Samejima, in Encyclopedia of Social Measurement, 2005

Unique Maximum Condition for the Likelihood Function

The beauty of IRT lies in the fact that, unlike in classical test theory, θ can be estimated directly from the response pattern, without the intervention of the test score. Let A_xg(θ) be:

(28) A_xg(θ) ≡ (∂/∂θ) log P_xg(θ) = Σ_{u ≤ xg} (∂/∂θ) log M_u(θ) + (∂/∂θ) log[1 − M_(xg+1)(θ)].

Because, from Eqs. (21) and (28), the likelihood equation can be written:

(29) (∂/∂θ) log L(x | θ) = (∂/∂θ) log P_v(θ) = Σ_{xg ∈ v} (∂/∂θ) log P_xg(θ) = Σ_{xg ∈ v} A_xg(θ) ≡ 0

a straightforward computer program can be written for any model in such a way that A_xg(θ) is selected and computed for each xg ∈ v, for g = 1, 2, …, n, and added, and the value of θ that makes the sum of these n functions equal to zero is located as the MLE of θ for the individual whose performance is represented by v. Samejima called A_xg(θ) the basic function, because this function provides the basis for computer programming for estimating θ from the response pattern v.
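A minimal sketch of such a program for the logistic graded response model. Where the text locates the zero of the summed basic functions, the sketch below equivalently maximizes the summed log-likelihood over a grid of θ values; the item parameters, function names, and response pattern are all hypothetical.

```python
import numpy as np

def grm_category_probs(theta, a, b):
    """Category probabilities P_k(theta) for one logistic GRM item.
    a: discrimination; b: increasing boundary parameters (length m - 1)."""
    cum = 1.0 / (1.0 + np.exp(-a * (theta - np.asarray(b))))  # P*_1 .. P*_{m-1}
    cum = np.concatenate(([1.0], cum, [0.0]))                 # P*_0 = 1, P*_m = 0
    return cum[:-1] - cum[1:]                                 # P_k = P*_k - P*_{k+1}

def theta_mle(responses, items, grid=np.linspace(-4.0, 4.0, 801)):
    """Grid-search MLE of theta from a response pattern.
    responses[g] is the (0-based) category chosen on item g."""
    loglik = np.zeros_like(grid)
    for x, (a, b) in zip(responses, items):
        probs = np.array([grm_category_probs(t, a, b)[x] for t in grid])
        loglik += np.log(probs)  # summing log P_xg; its derivative is A_xg
    return grid[np.argmax(loglik)]

# Hypothetical test: 3 items, each with 4 ordered categories.
items = [(1.2, [-1.5, 0.0, 1.5]),
         (0.8, [-1.0, 0.5, 2.0]),
         (1.5, [-2.0, -0.5, 1.0])]
print(theta_mle(responses=[2, 3, 2], items=items))
```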

From Eqs. (28) and (29), it is obvious that a sufficient condition for the likelihood function to have a unique modal point for any response pattern is that the basic function A_xg(θ) be strictly decreasing in θ, with a nonnegative upper asymptote and a nonpositive lower asymptote. It can be seen from Eqs. (22) and (28) that this unique maximum condition will be satisfied if the item response information function I_xg(θ) is positive for all θ, except at a finite or enumerably infinite number of points.

The normal ogive model, the logistic model, the logistic positive exponent model, the acceleration model, and models derived from Bock's nominal response model all satisfy the unique maximum condition. Notably, however, the three-parameter logistic model for dichotomous responses, which has been widely used for multiple-choice test data, does not satisfy the unique maximum condition, and multiple MLEs of θ may exist for some response patterns.

What are the sources of error variance?

The main sources of error variance are test construction (e.g., content sampling), test administration, and test scoring and interpretation; other sources of error also exist.

Which statistical tool is used only for calculating the internal consistency of a test in which items are dichotomous?

Kuder-Richardson 20 (KR-20) applies specifically to tests with dichotomous (e.g., right/wrong) items. It is the special case of Cronbach's alpha, the more general measure used to assess the reliability, or internal consistency, of a set of scale or test items.

What is test-retest reliability?

Test-retest reliability is a measure of reliability obtained by administering the same test twice over a period of time to a group of individuals. The scores from Time 1 and Time 2 can then be correlated in order to evaluate the test for stability over time.
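A minimal sketch of this computation, using hypothetical scores for the same group at two administrations:

```python
import numpy as np

time1 = np.array([12, 15, 9, 20, 17, 11])   # scores at administration 1
time2 = np.array([13, 14, 10, 19, 18, 12])  # same people, administration 2

print(np.corrcoef(time1, time2)[0, 1])      # test-retest (stability) coefficient
```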

Which statistic is used to estimate or infer how far an observed score deviates from a true score?

The standard error of measurement is the tool used to estimate or infer the extent to which an observed score deviates from a true score. We may define the standard error of measurement as the standard deviation of a theoretically normal distribution of test scores obtained by one person on equivalent tests.
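In classical test theory the standard error of measurement is computed from the standard deviation of the observed scores and the test's reliability as SEM = s√(1 − r). A minimal worked sketch with hypothetical values:

```python
import numpy as np

sd_x = 15.0         # standard deviation of observed scores (hypothetical)
reliability = 0.91  # hypothetical reliability estimate

sem = sd_x * np.sqrt(1 - reliability)
print(round(sem, 2))  # ~4.5; roughly 68% of observed scores fall within
                      # one SEM of the true score
```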