Chapter 4

Different types of reliability

Three general types of estimates: (1) Test-retest, (2) Parallel forms, (3) Split-half / internal consistency (KR20, Cronbach alpha)

What is classical test theory? What are the underlying assumptions?

"True score theory" is based on the following principles:
(1) A person's score (a.k.a. "observed" or "obtained" score) is composed of two components, a true score and error. Thus, X = T + e, where: X = observed score, T = true score, e = error in measurement
(2) A true score exists for every measurable attribute of every person - a hypothetical score based on an infinite number of administrations of the test - reflects the score that would be obtained if there were no measurement error
(3) Obtained test scores contain error - error in measurement is typically indicated by the Standard Error of Measurement (SEM) - provides an estimate of how much an individual's score would be expected to change on re-testing with the same/equivalent form of the test
***SEM can be used to estimate a band or interval within which a person's true score would be expected to fall
*Importance of SEM - alerts the evaluator to the fact that scores are not exact and should be considered estimates of level, i.e., a score can appear higher or lower than the true score
-Note - there is an inverse relationship between the value of the SEM and the reliability coefficient
Assumptions:
*an individual's true score is uniform across repeated administrations of the same test, **i.e., the score is fixed, "set in stone" - in reality, is the score set in stone?
*errors occur randomly and are normally distributed
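A minimal sketch of the SEM and the true-score band described above. It assumes the standard classical-test-theory formula SEM = SD_x * sqrt(1 - r_xx), which the notes do not write out, and the numbers (an IQ-style scale with SD = 15) are made up for illustration.

```python
# Sketch: SEM and an interval within which the person's true score would be expected to fall.
# Assumes SEM = SD_x * sqrt(1 - r_xx); all values below are hypothetical.
import math

def sem(sd_x: float, reliability: float) -> float:
    """Standard error of measurement from the score SD and the reliability coefficient."""
    return sd_x * math.sqrt(1.0 - reliability)

def true_score_band(observed: float, sd_x: float, reliability: float, z: float = 1.96):
    """Band around the observed score (z = 1.96 gives roughly a 95% interval)."""
    e = sem(sd_x, reliability)
    return observed - z * e, observed + z * e

print(sem(15, 0.90))                   # about 4.7 points
print(true_score_band(110, 15, 0.90))  # roughly (100.7, 119.3)
```

Note the inverse relationship mentioned above: as the reliability coefficient rises toward 1.0, the SEM shrinks toward 0.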

What type of information does the reliability coefficient yield? How do we interpret the coefficient in terms of test scores and the true variability accounted for?

Ranges from 0.0 to 1.0
*0.0 = complete lack of reliability (all error) / 1.0 = perfect reliability / a coefficient < 1.0 reflects the extent to which measurement error is present
*can be interpreted as a percentage: .90 means 10% of the variation in scores is attributable to measurement error (90% to true differences among examinees)
General reliability guidelines:
.90 and above = test highly reliable
.70 - .89 = moderate reliability
< .70 = low reliability (below .60 = unacceptable)
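A compact way to state the percentage interpretation above, using standard classical-test-theory notation that the notes do not spell out:

```latex
r_{XX'} = \frac{\sigma^2_T}{\sigma^2_X} = 1 - \frac{\sigma^2_e}{\sigma^2_X},
\qquad \text{e.g., } r_{XX'} = .90 \;\Rightarrow\; \frac{\sigma^2_e}{\sigma^2_X} = .10
\text{ (10\% of score variance is measurement error).}
```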

What are other sources of error that can affect reliability?

(1) Questionable measurement precision
(2) Item sampling - longer test = increased reliability
(3) Construction of test items - *should be objective *test takers should have to do little to interpret the questions *the way items are worded
(4) Test administration - *environmental factors **examiner can influence the test taker **fluctuations in room temperature / mood of the test taker
(5) Scoring of the test - *Objectivity: the extent to which scores are free of the evaluator's bias - objective scores reflect true individual differences, not the judgment/opinion of the evaluator - essay tests are typically less reliable than multiple choice because they introduce more subjectivity in scoring
(6) Difficulty of the test - tests that are too easy or too difficult = lower reliability *e.g., range restriction - reliability is higher when scores are spread out over the entire scale, i.e., the test shows real differences; need variability of scores
(7) Factors related to the test taker - e.g., fatigue, illness, anxiety, inattention, hyperactivity, behavioral outbursts

Define Test-Retest reliability

the correlation between scores on a particular measure at two points in time *i.e., the stability of examinees' scores between testing and re-testing when the same questions and apparatus are used
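A minimal sketch of the computation: the test-retest coefficient is simply the Pearson correlation between the two sets of scores from the same examinees. The score vectors are made up; the same computation gives the parallel-forms coefficient when the two vectors are scores on Form A and Form B (see below).

```python
# Sketch: test-retest reliability as the Pearson r between two administrations
# of the same measure to the same examinees. Scores are hypothetical.
import numpy as np

time1 = np.array([12, 18, 25, 30, 22, 15, 28, 20])  # scores at first administration
time2 = np.array([14, 17, 27, 29, 21, 16, 26, 22])  # scores at re-test

r_test_retest = np.corrcoef(time1, time2)[0, 1]
print(f"test-retest reliability (Pearson r): {r_test_retest:.2f}")
```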

Appropriate applications of test-retest:

*Evaluation of the temporal stability of trait-like, dispositional characteristics, e.g., stability of perfectionism, intelligence
*Directly measured characteristics, e.g., baseline levels of hyperactivity without intervention

Test-retest not typically useful for:

*Measures consisting of a limited, fixed sample of items - test-retest tends to overestimate reliability
*Many psychological variables that vary depending on the day of administration, e.g., aggression, motivation, depression, anxiety (state-like characteristics)

Cautions of test-retest

(1) Carryover effects
(2) Practice effects - a specific type of carryover effect - improvement in score can result simply from having been exposed to the test the first time - scores on the second administration tend to be higher than on the first
(3) The time interval between test administrations is crucial - select and evaluate it carefully - in general, the shorter the interval between administrations, the greater the likelihood of carryover effects
**Well-evaluated tests will report test-retest estimates at different time intervals

Parallel forms reliability

*a.k.a. "equivalent forms" - typically preferable to test-retest
*Defined: two forms of the same test are developed - both forms measure a common domain - should be of equal difficulty - the same rules are applied for item selection
*reliability coefficient = the correlation between scores on both forms
*Both forms must be administered to the same set of examinees (also true for test-retest)
*Preferable to administer both forms on the same day
*a very rigorous assessment of reliability
*Not commonly used in practice - highly challenging to develop one form of a test, let alone two - can be impractical to administer both forms of the measure to the same examinees
*Pearson r is used to estimate reliability for parallel forms (and test-retest) - the same computation shown in the test-retest sketch above
*Due to the difficulties associated with creating two forms, test developers tend to base reliability estimates on one form, i.e., evaluate "internal consistency" by dividing one test into sub-components

Parallel forms reliability assumptions

*the examinee's true score is equivalent for both forms
*the SEMs for each form are equal
*the level of difficulty of the items on each form is the same

Split-half reliability

Defined: one test is split into two parts or halves; the two halves are scored separately (an odd-even split is common)
*Scores from the two halves of the test are correlated
*the process tends to underestimate the reliability of the test overall
*Underestimates because each "subtest" is only half as long as the entire test, i.e., the correlation of the halves is best viewed as a reliability estimate for half of the test
*The Spearman-Brown formula corrects for this - it estimates what the test's reliability would be if each half had been the length of the full test (see the sketch below)
**typically referred to as "corrected" split-half reliability
**formula: r_SB = 2r / (1 + r), where r = the Pearson r between the two halves
**results obtained by Spearman-Brown are usually accurate only when its assumptions are met
**typically raises the estimate of reliability for the total test
Important assumption: the variances of both halves of the test are equal; if this condition is not satisfied, the formula should not be used
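A minimal sketch of the corrected split-half procedure described above: score the odd and even halves, correlate them, then apply the Spearman-Brown formula. The item matrix is made up (rows = examinees, columns = items scored 1/0).

```python
# Sketch: "corrected" split-half reliability via an odd-even split and
# the Spearman-Brown formula r_SB = 2r / (1 + r). Data are hypothetical.
import numpy as np

items = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 0, 1, 1],
    [1, 1, 1, 0, 1, 1, 0, 1],
])

odd_half = items[:, 0::2].sum(axis=1)   # total score on items 1, 3, 5, 7
even_half = items[:, 1::2].sum(axis=1)  # total score on items 2, 4, 6, 8

r_halves = np.corrcoef(odd_half, even_half)[0, 1]  # reliability of a half-length test
r_sb = (2 * r_halves) / (1 + r_halves)             # Spearman-Brown corrected estimate

print(f"half-test correlation: {r_halves:.2f}")
print(f"corrected split-half reliability: {r_sb:.2f}")
```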

Internal Consistency

Decisions about how to split a test into two halves can cause problems:
*unequal variances
*separate scoring of halves
*ensuring the halves are of equal difficulty level
Kuder and Richardson (1937) developed a procedure for estimating reliability without splitting the test into halves - the KR20

KR20

*Avoids the problems of the split-half methods
*Simultaneously considers all possible ways of splitting the items
*Mathematical proofs show that the estimates yielded by KR20 are similar to the split-half reliabilities obtained by dividing the test in all possible ways
*Only appropriate for tests in which items are scored either correct (i.e., 1) or incorrect (i.e., 0), e.g., typical classroom tests (multiple choice, true-false, fill-in-the-blank) - a sketch of the computation follows below
KR21 - a simpler method of estimating reliability - rests on the assumption that the items are all of equal difficulty
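A minimal sketch of the KR20 computation for dichotomously scored items. The formula used here is the standard one (not written out in the notes): KR20 = (k / (k - 1)) * (1 - sum(p_j * q_j) / var(total)), where k is the number of items, p_j is the proportion passing item j, and q_j = 1 - p_j. The data are made up.

```python
# Sketch: KR-20 for items scored correct (1) or incorrect (0).
import numpy as np

def kr20(items: np.ndarray) -> float:
    """items: 2-D array, rows = examinees, columns = items scored 1 or 0."""
    k = items.shape[1]
    p = items.mean(axis=0)                     # proportion correct per item
    q = 1.0 - p
    total_var = items.sum(axis=1).var(ddof=1)  # sample variance of total scores
    return (k / (k - 1)) * (1.0 - (p * q).sum() / total_var)

scores = np.array([
    [1, 1, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [1, 0, 1, 1],
    [0, 1, 1, 0],
])
print(f"KR-20: {kr20(scores):.2f}")  # ~0.43 for this tiny made-up example
```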

Cronbach alpha

a.k.a. "coefficient alpha"
*Many times, responses to test items can't be classified as "right" or "wrong", e.g., Likert scales with responses ranging from 1-5
*Cronbach alpha is used in this case
**considered to be the most general formula for determining a reliability estimate through internal consistency
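A minimal sketch of coefficient alpha for Likert-type items. The formula used here is the standard one (not written out in the notes): alpha = (k / (k - 1)) * (1 - sum of item variances / variance of the total score). With 1/0 items it reduces to KR-20, which matches the notes' point that alpha is the most general internal-consistency formula. The responses are made up.

```python
# Sketch: Cronbach's alpha for items that are not scored right/wrong (e.g., 1-5 Likert).
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: 2-D array, rows = respondents, columns = item responses."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the summed scale score
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

likert = np.array([
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 2, 3, 3],
    [4, 4, 5, 4],
])
print(f"Cronbach's alpha: {cronbach_alpha(likert):.2f}")
```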

Other considerations for internal consistency

*All measures of internal consistency evaluate the extent to which the items measure the same trait or ability
*If a test has subscales measuring different abilities/traits, the test as a whole will not be internally consistent
*Split-half and internal consistency estimates are only appropriate for power tests, not speeded tests

Overview of test reliability

*Reliability defined: the extent to which a test or measure yields consistent results across administrations - quantifying examinee consistency/inconsistency
*same or similar score obtained across administrations = high reliability
*Reliability is the 1st characteristic of psychometric soundness
*Lack of reliability = inconsistent measurement of performance - scores do not accurately reflect the variable being measured **e.g., a bathroom scale with a loose spring **e.g., an outdoor thermometer
*Test scores are highly susceptible to measurement error **scores cannot be trusted unless we know they are obtained consistently **example: using an assessment tool to make employment decisions when only 40% of the information yielded by the tool is attributable to real individual differences - is this measure useful for making important employment decisions?

Reliability cont'd

*Reliability = a measure of the extent to which obtained scores are free of measurement error
*Variations across administrations are the result of random (chance) errors
In finding a test's reliability, we want to determine:
(1) the amount of variability (differences in scores) related to the purpose of the measure
(2) the variability due to measurement error - multiple factors that extend beyond the parameters of the test introduce error in measurement