Exam 2 Study Guide: Psychological Testing (Psych 3325)
East Carolina University, Dr. Gary Stainback, Fall 2016
Chapter 5: Reliability

The Concept of Reliability
Reliability is consistency in measurement. A reliability coefficient is an index of reliability: a proportion that indicates the ratio between the true score variance on a test and the total variance.
Observed score = true score plus error (X = T + E). Error refers to the component of the observed score that does not have to do with the testtaker's true ability or the trait being measured.
Variance = standard deviation squared.
o Total variance equals true variance plus error variance.
Reliability is the proportion of the total variance attributed to true variance.
Measurement error: all of the factors associated with the process of measuring some variable, other than the variable being measured. Measurement error breaks down into random error and systematic error.
o Random error is a source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process (i.e., noise).
o Systematic error is a source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured.

Sources of error variance
Test construction: variation may exist within items on a test or between tests (i.e., item sampling or content sampling).
Test administration: sources of error may stem from the testing environment, from testtaker variables such as pressing emotional problems, physical discomfort, lack of sleep, and the effects of drugs or medication, and from examiner-related variables such as physical appearance and demeanor.
Test scoring and interpretation: computer scoring reduces error in test scoring, but many tests still require expert interpretation (e.g., projective tests), and subjectivity in scoring can enter into behavioral assessment.
Other sources of error variance: surveys and polls usually contain some disclaimer as to the margin of error associated with their findings.
o Sampling error: the extent to which the sample of voters in the study actually was representative of voters in the election.
o Methodological error: interviewers may not have been trained properly, the wording in the questionnaire may have been ambiguous, or the items may have somehow been biased to favor one or another of the candidates.
o Example: the 1948 Dewey-Truman election, a classic case of measurement error in polling.

Reliability estimates
Test-retest reliability: an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test.
o Most appropriate for variables that should be stable over time (e.g., personality) and not appropriate for variables expected to change over time (e.g., mood).
o Estimates tend to decrease as the time between administrations increases.
o With intervals over six months, the estimate of test-retest reliability is called the coefficient of stability.
Parallel-forms and alternate-forms reliability
o Coefficient of equivalence: the degree of the relationship between various forms of a test.
o Parallel forms: for each form of the test, the means and the variances of observed test scores are equal.
o Alternate forms: different versions of a test that have been constructed so as to be parallel. They do not meet the strict requirements of parallel forms, but item content and difficulty are typically similar between tests.
o Reliability is checked by administering two forms of a test to the same group. Scores may be affected by error related to the state of the testtakers (e.g., practice, fatigue) or by item sampling.
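A minimal sketch of computing a test-retest estimate in Python (all scores and the eight-person sample are hypothetical):

```python
# Test-retest reliability: correlate scores from two administrations
# of the same test to the same people (hypothetical data).
import numpy as np

time1 = np.array([12, 15, 9, 20, 17, 11, 14, 18])  # first administration
time2 = np.array([13, 14, 10, 19, 18, 10, 15, 17])  # second administration

# Pearson r between the paired scores serves as the reliability estimate;
# with a retest interval over six months it would be called a
# coefficient of stability.
r_test_retest = np.corrcoef(time1, time2)[0, 1]
print(f"test-retest reliability estimate: {r_test_retest:.3f}")
```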
Split-half reliability is obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once. It entails three steps:
o Step 1: Divide the test into equivalent halves.
o Step 2: Calculate a Pearson r between scores on the two halves of the test.
o Step 3: Adjust the half-test reliability using the Spearman-Brown formula, which allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test.
Other methods of estimating internal consistency
o Inter-item consistency: the degree of relatedness of items on a test. It can be used to gauge the homogeneity of a test.
o Kuder-Richardson formula 20 (KR-20): the statistic of choice for determining the inter-item consistency of dichotomous items.
o Coefficient alpha: the mean of all possible split-half correlations, corrected by the Spearman-Brown formula. The most popular approach for estimating internal consistency; values range from 0 to 1.
o Average proportional distance (APD): focuses on the degree of difference between scores on test items. It involves averaging the differences between scores on all of the items and then dividing by the number of response options on the test, minus 1.
Measures of inter-scorer reliability
o Inter-scorer reliability: the degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure. It is often used with behavioral measures and guards against biases or idiosyncrasies in scoring.
o Coefficient of inter-scorer reliability: the scores from different raters are correlated with one another.
The purpose of a reliability estimate will vary depending on the nature of the variables being studied. If the purpose is to break down error variance into its constituent parts, a number of reliability estimates would be used, for example:
o 67%: true variance
o 18%: error due to test construction
o 5%: administration error
o 5%: unidentified error
o 5%: scorer error
The nature of the test will often determine the reliability metric. Some considerations include whether:
o The test items are homogeneous or heterogeneous in nature
o The characteristic, ability, or trait being measured is presumed to be dynamic or static
o The range of test scores is or is not restricted
o The test is a speed or a power test
o The test is or is not criterion-referenced
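A sketch of the three split-half steps on a small hypothetical item-response matrix, using an odd-even split (all data invented for illustration):

```python
import numpy as np

# Item-response matrix: rows = testtakers, columns = items (hypothetical).
scores = np.array([
    [1, 0, 1, 1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 1, 1],
    [0, 1, 0, 1, 0, 1, 1, 0],
    [1, 1, 1, 0, 1, 1, 1, 1],
])

# Step 1: divide the test into equivalent halves (odd-even split).
odd_half = scores[:, 0::2].sum(axis=1)
even_half = scores[:, 1::2].sum(axis=1)

# Step 2: Pearson r between the two half-test scores.
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Step 3: Spearman-Brown correction estimates full-length reliability
# from the half-test correlation: r_sb = 2r / (1 + r).
r_split_half = (2 * r_half) / (1 + r_half)
print(f"half-test r = {r_half:.3f}, corrected = {r_split_half:.3f}")
```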
True-Score Model vs. Alternatives
The true-score model is often referred to as classical test theory (CTT), perhaps the most widely used model due to its simplicity.
True score: a value that, according to classical test theory, genuinely reflects an individual's ability (or trait) level as measured by a particular test.
o CTT assumptions are more readily met than those of item response theory (IRT).
o A problematic assumption of CTT has to do with the equivalence of items on a test.
Domain-sampling theory: estimates the extent to which specific sources of variation under defined conditions are contributing to the test score.
Generalizability theory: based on the idea that a person's test scores vary from testing to testing because of variables in the testing situation.
o Instead of conceiving of variability in a person's scores as error, Cronbach encouraged test developers and researchers to describe the details of the particular test situation, or universe, leading to a specific test score.
o A universe is described in terms of its facets, including the number of items in the test, the amount of training the test scorers have had, and the purpose of the test administration.
Item response theory (IRT): provides a way to model the probability that a person with X ability will be able to perform at a level of Y.
o IRT refers to a family of methods and techniques.
o IRT incorporates considerations of item difficulty and discrimination.
o Difficulty relates to an item not being easily accomplished, solved, or comprehended.
o Discrimination refers to the degree to which an item differentiates among people with higher or lower levels of the trait, ability, or other variable being measured.

The standard error of measurement
The standard error of measurement (SEM) provides a measure of the precision of an observed test score: an estimate of the amount of error inherent in an observed score or measurement.
o Generally, the higher the reliability of the test, the lower the standard error.
o The standard error can be used to estimate the extent to which an observed score deviates from a true score.
o Confidence interval: a range or band of test scores that is likely to contain the true score.

The standard error of the difference
The standard error of the difference: a measure that can aid a test user in determining how large a difference in test scores should be expected before it is considered statistically significant. It can be used to address three types of questions:
o How did this individual's performance on test 1 compare with his or her performance on test 2?
o How did this individual's performance on test 1 compare with someone else's performance on test 1?
o How did this individual's performance on test 1 compare with someone else's performance on test 2?
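A small numeric sketch of both quantities, using the standard formulas SEM = SD * sqrt(1 - r) and SE_diff = SD * sqrt(2 - r1 - r2); the SD and reliability values below are hypothetical:

```python
import numpy as np

sd = 15.0    # standard deviation of the test (hypothetical)
r_xx = 0.90  # reliability coefficient (hypothetical)

# Standard error of measurement: SEM = SD * sqrt(1 - r).
sem = sd * np.sqrt(1 - r_xx)

# 95% confidence interval around an observed score of 110.
observed = 110
lo, hi = observed - 1.96 * sem, observed + 1.96 * sem
print(f"SEM = {sem:.2f}; 95% CI: {lo:.1f} to {hi:.1f}")

# Standard error of the difference between two scores on the same
# scale, for tests with reliabilities r1 and r2:
r1, r2 = 0.90, 0.85
se_diff = sd * np.sqrt(2 - r1 - r2)
print(f"SE of the difference = {se_diff:.2f}")
```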
Chapter 6: Validity

The concept of validity
Validity is a judgment or estimate of how well a test measures what it purports to measure in a particular context. Validation is the process of gathering and evaluating evidence about validity.
o Both test developers and test users may play a role in the validation of a test.
o Test users may validate a test with their own group of testtakers (local validation).
Validity is often conceptualized according to three categories:
o 1. Content validity: a measure of validity based on an evaluation of the subjects, topics, or content covered by the items in the test.
o 2. Criterion-related validity: a measure of validity obtained by evaluating the relationship of scores obtained on the test to scores on other tests or measures.
o 3. Construct validity: a measure of validity arrived at by executing a comprehensive analysis of how scores on the test relate to other test scores and measures, and how scores on the test can be understood within some theoretical framework for understanding the construct that the test was designed to measure.

Face validity
Face validity: a judgment concerning how relevant the test items appear to be.
o If a test appears to measure what it purports to measure "on the face of it," it could be said to be high in face validity.
o Many self-report personality tests are high in face validity, whereas projective tests such as the Rorschach tend to be low in face validity (i.e., it is not apparent what is being measured).
o A perceived lack of face validity may lead to a lack of confidence that the test measures what it purports to measure.

Content validity
Content validity: a judgment of how adequately a test samples behavior representative of the universe of behavior that the test was designed to sample.
o Do the test items adequately represent the content that should be included in the test?
o Test blueprint: a plan regarding the types of information to be covered by the items, the number of items tapping each area of coverage, the organization of the items in the test, and so on.
Quantifying content validity may be especially important for employment tests.
o Lawshe (1975) developed a method whereby raters judge each item as to whether it is essential, useful but not essential, or not necessary for job performance.
o If more than half the raters indicate that an item is essential, the item has at least some content validity (Lawshe's content validity ratio: CVR = (n_e - N/2) / (N/2), where n_e is the number of raters calling the item essential and N is the total number of raters).
o Culture and the relativity of content validity: the content validity of a test varies across cultures and time, and political considerations may also play a role.

Criterion-Related Validity
A criterion is the standard against which a test or a test score is evaluated. An adequate criterion is relevant for the matter at hand, valid for the purpose for which it is being used, and uncontaminated, meaning it is not itself part of the predictor.
Criterion-related validity breaks down into concurrent validity and predictive validity.
o Concurrent validity is an index of the degree to which a test score is related to some criterion measure obtained at the same time (concurrently).
o Predictive validity is an index of the degree to which a test score predicts some criterion, or outcome, measure in the future; tests are often evaluated as to their predictive validity.
The validity coefficient: a correlation coefficient that provides a measure of the relationship between test scores and scores on the criterion measure.
o Validity coefficients are affected by restriction or inflation of range.
Incremental validity is the degree to which an additional predictor explains something about the criterion measure that is not explained by predictors already in use: to what extent does a test predict the criterion over and above other variables?
Expectancy data
o An expectancy table shows the percentage of people within specified test-score intervals who subsequently were placed in various categories of the criterion (e.g., placed in a "passed" or "failed" category).
o In a corporate setting, test scores may be divided into intervals (e.g., poor, adequate, excellent) and examined in relation to job performance (e.g., satisfactory or unsatisfactory). Expectancy tables, or charts, may show that the higher the initial rating, the greater the probability of job success.
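A toy sketch of how the percentages in an expectancy table can be tallied (scores, intervals, and outcomes all invented):

```python
import numpy as np

# Hypothetical test scores and criterion outcomes (1 = rated successful).
scores = np.array([55, 62, 48, 71, 66, 59, 80, 45, 74, 68, 52, 77])
success = np.array([0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1])

# Score intervals for the rows of the expectancy table.
bins = [(40, 54), (55, 69), (70, 84)]
for lo, hi in bins:
    in_bin = (scores >= lo) & (scores <= hi)
    pct = 100 * success[in_bin].mean()
    print(f"scores {lo}-{hi}: {pct:.0f}% successful (n={in_bin.sum()})")
```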
Construct Validity
Construct validity: the ability of a test to measure a theorized construct (e.g., intelligence, aggression, personality) that it purports to measure.
o If a test is a valid measure of a construct, high scorers and low scorers should behave as theorized.
o All types of validity evidence, including evidence from the content- and criterion-related varieties of validity, come under the umbrella of construct validity.
Evidence of construct validity
o Evidence of homogeneity: how uniform a test is in measuring a single concept.
o Evidence of changes with age: some constructs are expected to change over time (e.g., reading rate).
o Evidence of pretest-posttest changes: test scores change as a result of some experience between a pretest and a posttest (e.g., therapy).
o Evidence from distinct groups: scores on a test vary in a predictable way as a function of membership in some group (e.g., scores on the Psychopathy Checklist for prisoners vs. civilians).
o Convergent evidence: scores on the test undergoing construct validation tend to correlate highly in the predicted direction with scores on older, more established tests designed to measure the same (or a similar) construct.
o Discriminant evidence: a validity coefficient showing little relationship between test scores and other variables with which scores on the test should not, in theory, be correlated.
o Factor analysis: a new test should load on a common factor with other tests of the same construct.

Validity and Test Bias
Bias: a factor inherent in a test that systematically prevents accurate, impartial measurement.
o Bias implies systematic variation in test scores.
o Prevention during test development is the best cure for test bias.
Rating error: a judgment resulting from the intentional or unintentional misuse of a rating scale.
o Raters may be too lenient, too severe, or reluctant to give ratings at the extremes (central tendency error).
o Halo effect: a tendency to give a particular person a higher rating than he or she objectively deserves because of a favorable overall impression.
Fairness: the extent to which a test is used in an impartial, just, and equitable way.
Chapter 7: Utility

What is Utility?
Utility: the usefulness or practical value of testing to improve efficiency.
Factors affecting utility
o Psychometric soundness: generally, the higher the criterion-related validity of a test, the greater the utility. There are exceptions, because many factors affect the utility of an instrument and utility is assessed in many different ways; valid tests are not always useful tests.
o Costs: one of the most basic elements of a utility analysis is the financial cost associated with a test. Cost in the context of test utility refers to disadvantages, losses, or expenses in both economic and noneconomic terms. Economic costs may include purchasing a test, a supply bank of test protocols, and computerized test processing. Other economic costs are more difficult to calculate, such as the cost of not testing or of testing with an inadequate instrument. Noneconomic costs include things such as human life and safety.
o Benefits: we should take into account whether the benefits of testing justify the costs of administering, scoring, and interpreting the test. Benefits can be defined as profits, gains, or advantages. Successful testing programs can yield higher worker productivity and profits for a company. Some potential benefits: an increase in the quality of workers' performance; an increase in the quantity of workers' performance; a decrease in the time needed to train workers; a reduction in the number of accidents; a reduction in worker turnover. Noneconomic benefits may include a better work environment and improved morale.

Utility analysis
Utility analysis: a family of techniques that entail a cost-benefit analysis designed to yield information relevant to a decision about the usefulness and/or practical value of a tool of assessment.
o Some utility analyses are straightforward, while others are more sophisticated, employing complicated mathematical models.
o Utility analyses often address the question of which test gives us the most "bang for the buck."
o The endpoint of a utility analysis is an educated decision as to which of several alternative courses of action is optimal in terms of costs and benefits (a sketch of one standard utility formula appears after this section).
Expectancy data: the likelihood that a testtaker will score within some interval of scores on a criterion measure.
Taylor-Russell tables provide an estimate of the percentage of employees hired by the use of a particular test who will be successful at their jobs, given different combinations of three variables: the test's validity, the selection ratio used, and the base rate.
o Here, validity refers to the validity coefficient, selection ratio refers to a numerical value that reflects the relationship between the number of people to be hired and the number of people available to be hired, and base rate refers to the percentage of people hired under the existing system for a particular position.
Naylor-Shine tables entail obtaining the difference between the means of the selected and unselected groups to derive an index of what the test (or some other tool of assessment) is adding to already established procedures.
For both the Taylor-Russell and Naylor-Shine tables, the validity coefficient comes from concurrent validation procedures.
Many other variables may play a role in selection decisions besides test results, including applicants' minority status, general physical or mental health, or drug use.
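The notes do not give an equation, but one widely cited formulation of this cost-benefit logic is the Brogden-Cronbach-Gleser utility equation; a sketch with invented numbers:

```python
# Brogden-Cronbach-Gleser utility estimate (a standard formulation;
# the chapter describes utility analysis generally, not this exact
# equation). All values below are hypothetical.
n_hired = 10       # number of people selected
tenure = 2.0       # average years selectees stay on the job
validity = 0.45    # criterion-related validity of the test
sd_y = 12000.0     # SD of job performance in dollars
mean_z = 1.0       # average standard score on the test of those hired
cost_each = 50.0   # cost of testing one applicant
n_tested = 100     # number of applicants tested

gain = n_hired * tenure * validity * sd_y * mean_z  # productivity gain
cost = cost_each * n_tested                         # total testing cost
print(f"estimated utility gain: ${gain - cost:,.0f}")
```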
Decision theory and utility
Cronbach and Gleser (1965) presented:
o A classification of decision problems.
o Various selection strategies, ranging from single-stage processes to sequential analyses.
o A quantitative analysis of the relationship between test utility, the selection ratio, the cost of the testing program, and the expected value of the outcome.
o A recommendation that in some instances job requirements be tailored to the applicant's ability instead of the other way around (adaptive treatment).

Practical considerations
The pool of job applicants: some utility models are based on the assumption that for a particular position there is a limitless pool of candidates.
o However, some jobs require such expertise or sacrifice that the pool of qualified candidates may be very small.
o The economic climate also affects the size of the pool.
o The top performers on a selection test may not accept a job offer.
The complexity of the job: the same utility models are used for a variety of positions, yet the more complex the job, the bigger the difference between people who perform well and people who perform poorly.
The cut score in use
o Relative cut scores are determined in reference to normative data (e.g., selecting people in the top 10% of test scores).
o Fixed cut scores are set on the basis of having achieved a minimum level of proficiency on a test (e.g., a driver's license exam).
o Multiple cut scores: the use of multiple cut scores for a single predictor (e.g., students may achieve grades of A, B, C, D, or E).
o Multiple hurdles: achievement of a particular cut score on a test is necessary in order to advance to the next stage of evaluation in the selection process (e.g., the Miss America contest).

Methods of setting cut scores
The Angoff method: judgments of experts are averaged to yield cut scores for the test (see the sketch at the end of this section).
o Can be used for personnel selection and for traits, attributes, and abilities.
o Problems arise if there is low agreement between experts.
The known groups method: entails collection of data on the predictor of interest from groups known to possess, and known not to possess, a trait, attribute, or ability of interest.
o After analysis of the data, a cut score is chosen that best discriminates between the groups.
o One problem with the known groups method is that no standard set of guidelines exists for selecting the contrasting groups.
IRT-based methods: in an IRT framework, each item is associated with a particular level of difficulty.
o In order to "pass" the test, the testtaker must answer items that are deemed to be above some minimum level of difficulty, which is determined by experts and serves as the cut score.
o Makes use of the item mapping method and the bookmark method.
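A minimal sketch of the Angoff averaging just described, with invented judge ratings:

```python
import numpy as np

# Each row = one expert judge; each column = the judge's estimate of
# the probability that a minimally competent person answers that item
# correctly (hypothetical values).
judgments = np.array([
    [0.6, 0.8, 0.5, 0.9, 0.7],
    [0.5, 0.7, 0.6, 0.8, 0.6],
    [0.7, 0.9, 0.5, 0.9, 0.8],
])

# Sum each judge's probabilities to get that judge's cut score, then
# average across judges. Low agreement between judges is the method's
# known weak point, so the spread is worth inspecting too.
per_judge = judgments.sum(axis=1)
print(f"cut score = {per_judge.mean():.2f} of 5 items "
      f"(judge range {per_judge.min():.1f}-{per_judge.max():.1f})")
```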
Other methods
o R.L. Thorndike (1949) proposed a norm-referenced method called the method of predictive yield, which took into account the number of positions to be filled, projections regarding the likelihood of offer acceptance, and the distribution of applicant scores.
o Discriminant analysis: a family of statistical techniques used to shed light on the relationship between identified variables (such as scores on a battery of tests) and two (and in some cases more) naturally occurring groups (such as persons judged to be successful at a job and persons judged unsuccessful at a job).

Chapter 8: Test Development

Test conceptualization --> test construction --> test tryout --> analysis --> revision (back to test tryout)
Test development is an umbrella term for all that goes into the process of creating a test.

Test conceptualization
The impetus for developing a new test is some thought that "there ought to be a test for..." The stimulus could be knowledge of psychometric problems with other tests, a new social phenomenon, or any number of things; there may be a need to assess mastery in an emerging occupation.
Some preliminary questions:
o What is the test designed to measure?
o What is the objective of the test?
o Is there a need for this test?
o Who will use this test?
o Who will take this test?
o What content will the test cover?
o How will the test be administered?
o What is the ideal format of the test?
o Should more than one form of the test be developed?
o What special training will be required of test users for administering or interpreting the test?
o What types of responses will be required of testtakers?
o Who benefits from an administration of this test?
o Is there any potential for harm as the result of an administration of this test?
o How will meaning be attributed to scores on this test?

Item development in norm-referenced and criterion-referenced tests
Generally, a good item on a norm-referenced achievement test is an item for which high scorers on the test respond correctly and low scorers respond incorrectly.
Ideally, each item on a criterion-referenced test addresses the issue of whether the respondent has met certain criteria. Development of a criterion-referenced test may entail exploratory work with at least two groups of testtakers: one group known to have mastered the knowledge or skill being measured and another group known not to have mastered it.
Test items may be pilot studied to evaluate whether they should be included in the final form of the instrument.

Test construction
Scaling: the process of setting rules for assigning numbers in measurement.
Types of scales: scales are instruments used to measure some trait, state, or ability. They may be categorized in many ways (e.g., multidimensional, unidimensional, etc.).
Test construction – scaling methods
Numbers can be assigned to responses to calculate test scores using a number of methods.
Rating scales: a grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the testtaker.
Likert scale: each item presents the testtaker with five alternative responses (sometimes seven), usually on an agree-disagree or approve-disapprove continuum.
o Likert scales are typically reliable.
All rating scales result in ordinal-level data.
Some rating scales are unidimensional, meaning that only one dimension is presumed to underlie the ratings; others are multidimensional, meaning that more than one dimension is thought to underlie the ratings.
Method of paired comparisons: for each pair of options, testtakers receive a higher score for selecting the option deemed more justifiable by the majority of a group of judges. The test score reflects the number of times the choices of a testtaker agreed with those of the judges.
Comparative scaling: entails judgments of a stimulus in comparison with every other stimulus on the scale.
Categorical scaling: stimuli (e.g., index cards) are placed into one of two or more alternative categories.
Guttman scale: items range sequentially from weaker to stronger expressions of the attitude, belief, or feeling being measured. All respondents who agree with the stronger statements of the attitude will also agree with the milder statements.
The method of equal-appearing intervals can be used to obtain data that are interval in nature.

Test construction – writing items
Item pool: the reservoir or well from which items will or will not be drawn for the final version of the test.
o Comprehensive sampling provides a basis for the content validity of the final version of the test.
Item format: includes variables such as the form, plan, structure, arrangement, and layout of individual test items.
o Selected-response format: items require testtakers to select a response from a set of alternative responses.
o Constructed-response format: items require testtakers to supply or to create the correct answer, not merely to select it.
The multiple-choice format has three elements: (1) a stem, (2) a correct alternative or option, and (3) several incorrect alternatives or options variously referred to as distractors or foils.
Other commonly used selected-response formats include matching and true-false items.
Writing items for computer administration
o Item bank: a relatively large and easily accessible collection of test questions.
o Computerized adaptive testing (CAT): an interactive, computer-administered test-taking process wherein items presented to the testtaker are based in part on the testtaker's performance on previous items. CAT provides economy in testing time and the number of items presented, and it tends to reduce floor effects and ceiling effects.

Test construction – scoring items
Cumulatively scored tests rest on the assumption that the higher the score on the test, the higher the testtaker is on the ability, trait, or other characteristic that the test purports to measure (see the sketch below).
Class scoring: responses earn credit toward placement in a particular class or category with other testtakers whose pattern of responses is presumably similar in some way (e.g., diagnostic testing).
Ipsative scoring: comparing a testtaker's score on one scale within a test to another scale within that same test.
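A minimal sketch of cumulative scoring for a Likert-type scale; the six-item scale and the two reverse-keyed items are hypothetical:

```python
import numpy as np

# One respondent's answers on a 5-point Likert scale (1 = strongly
# disagree ... 5 = strongly agree); hypothetical 6-item scale.
responses = np.array([4, 2, 5, 1, 4, 3])

# Suppose items 2 and 4 (0-indexed: 1 and 3) are reverse-keyed; flip
# them before summing: on a 1-5 scale, reversed = 6 - response.
reverse_keyed = [1, 3]
keyed = responses.copy()
keyed[reverse_keyed] = 6 - keyed[reverse_keyed]

# Cumulative scoring: a higher total = more of the trait being measured.
print(f"total score: {keyed.sum()}")
```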
Test tryout
The test should be tried out on people similar to the population for whom it was designed, with roughly 5-10 respondents per item. It should be administered in the same manner, and with the same instructions, as the final product.
What is a good item?
o A good item is reliable and valid.
o A good item discriminates among testtakers: high scorers on the test overall answer the item correctly.

Item analysis
The nature of the item analysis will vary depending on the goals of the test developer. Among the tools test developers might employ to analyze and select items are:
o An index of the item's difficulty
o An index of the item's reliability
o An index of the item's validity
o An index of item discrimination
Item-difficulty index: the proportion of respondents answering an item correctly.
o For maximum discrimination among the abilities of the testtakers, the optimal average item difficulty is approximately .5, with individual items on the test ranging in difficulty from about .3 to .8.
Item-reliability index: an indication of the internal consistency of the scale.
o Factor analysis can also provide an indication of whether items that are supposed to be measuring the same thing load on a common factor.
The item-validity index allows test developers to evaluate the validity of items in relation to a criterion measure.
The item-discrimination index indicates how adequately an item separates, or discriminates, between high scorers and low scorers on an entire test: a measure of the difference between the proportion of high scorers answering an item correctly and the proportion of low scorers answering the item correctly.
Analysis of item alternatives: the quality of each alternative within a multiple-choice item can be readily assessed with reference to the comparative performance of upper and lower scorers.
Item characteristic curves (ICC): a graphic representation of item difficulty and discrimination.
Guessing: test developers and users must decide whether they wish to correct for guessing, but to date no entirely satisfactory solution to the problem of guessing has been achieved.
Item fairness: the degree, if any, to which a test item is biased.
o A biased test item is an item that favors one particular group of examinees in relation to another when differences in group ability are controlled.
Speed tests: item analyses of tests taken under speed conditions yield misleading or uninterpretable results. The closer an item is to the end of the test, the more difficult it may appear to be.

Qualitative Item Analysis
Qualitative methods: techniques of data generation and analysis that rely primarily on verbal rather than mathematical or statistical procedures.
o Qualitative item analysis is a general term for various nonstatistical procedures designed to explore how individual test items work.
o "Think aloud" test administration: respondents are asked to verbalize their thoughts as they occur during testing.
o Expert panels: experts may be employed to conduct a qualitative item analysis.
o Sensitivity review: items are examined in relation to fairness to all prospective testtakers, with checks for offensive language, stereotypes, etc.
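A sketch of the item-difficulty and item-discrimination indices just defined, on invented data; splitting testtakers into upper and lower thirds is one common convention (texts vary, e.g., upper and lower 27%):

```python
import numpy as np

# Rows = testtakers, columns = items; 1 = correct (hypothetical data).
scores = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 1, 1],
    [1, 0, 0, 0, 0],
])

# Item-difficulty index: proportion answering each item correctly.
difficulty = scores.mean(axis=0)

# Item-discrimination index d: proportion correct in the top-scoring
# group minus proportion correct in the bottom-scoring group.
totals = scores.sum(axis=1)
order = np.argsort(totals)
n_group = len(scores) // 3              # upper and lower thirds
low, high = order[:n_group], order[-n_group:]
d = scores[high].mean(axis=0) - scores[low].mean(axis=0)

print("difficulty:", np.round(difficulty, 2))
print("discrimination d:", np.round(d, 2))
```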
Test Revision
Revision in new test development
o Items are evaluated as to their strengths and weaknesses; some items may be eliminated, and some may be replaced by others from the item pool.
o Revised tests are then administered under standardized conditions to a second sample.
o Once a test has been finalized, norms may be developed from the data, and the test is said to be standardized.
Revision in the life cycle of a test
o Existing tests may be revised if the stimulus or verbal material is dated, some words have become offensive, the norms no longer represent the population, the psychometric properties could be improved, or the theory underlying the test has changed.
o In test revision the same steps are followed as with new tests (i.e., test conceptualization, construction, item analysis, tryout, and revision).
o Cross-validation and co-validation
Cross-validation refers to the revalidation of a test on a sample of testtakers other than those on whom test performance was originally found to be a valid predictor of some criterion. Item validities inevitably become smaller when the test is administered to a second sample (validity shrinkage).
Co-validation: a test validation process conducted on two or more tests using the same sample of testtakers. Co-validation is economical for test developers.
o Quality assurance
Test developers employ examiners who have experience testing members of the population targeted by the test; examiners follow standardized procedures and undergo training.
Anchor protocols are also used in quality assurance. An anchor protocol is a test protocol scored by a highly authoritative scorer that is designed as a model for scoring and a mechanism for resolving scoring discrepancies. A discrepancy between the scoring in an anchor protocol and the scoring of another protocol is referred to as scoring drift.
The use of IRT in building and revising tests
o Items are evaluated on item characteristic curves (ICCs), in which performance on items is related to underlying ability.
o Three possible applications of IRT in building and revising tests: evaluating existing tests for the purpose of mapping test revisions; determining measurement equivalence across testtaker populations; developing item banks.

TEXTBOOK NOTES

Chapter 5: Reliability
Reliability is a synonym for dependability or consistency; it refers to consistency in measurement. A reliability coefficient is an index of reliability: a proportion that indicates the ratio between the true score variance on a test and the total variance.
The concept of reliability
If we use X to represent an observed score, T to represent a true score, and E to represent error, then the fact that an observed score equals the true score plus error may be expressed as follows: X = T + E.
Variance is the standard deviation squared.
o Variance from true differences is true variance, and variance from irrelevant random sources is error variance.
Reliability refers to the proportion of the total variance attributed to true variance.
Measurement error refers to, collectively, all of the factors associated with the process of measuring some variable, other than the variable being measured. Measurement error, much like error in general, can be categorized as being either systematic or random.
o Random error is a source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process ("noise").
o Systematic error refers to a source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured (e.g., a 12-inch ruler that is actually a tenth of an inch longer than 12 inches).
Sources of error variance
Sources of error variance include test construction, administration, scoring, and/or interpretation.
Test construction
o Item sampling or content sampling: terms that refer to variation among items within a test as well as to variation among items between tests.
Test administration
o Test environment: room temperature, level of lighting, and amount of ventilation and noise.
o Testtaker variables: pressing emotional problems, physical discomfort, lack of sleep, and the effects of drugs or medication.
o Examiner-related variables are potential sources of error variance; the examiner's physical appearance and demeanor – even the presence or absence of an examiner – are some factors for consideration here.
Test scoring and interpretation
o For a behavioral measure of social skills in an inpatient psychiatric service, the scorers or raters might be asked to rate patients with respect to the variable "social relatedness."
o Scorers and scoring systems are potential sources of error variance.
o Nonsystematic error in such an assessment situation includes forgetting, failing to notice abusive behavior, and misunderstanding.
Reliability estimates
Test-retest reliability estimates
o Test-retest reliability is an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test.
o When the interval between testings is greater than six months, the estimate of test-retest reliability is often referred to as the coefficient of stability.
Parallel-forms and alternate-forms reliability estimates
o The degree of the relationship between various forms of a test can be evaluated by means of an alternate-forms or parallel-forms coefficient of reliability, often termed the coefficient of equivalence.
o Parallel forms of a test exist when, for each form of the test, the means and the variances of observed test scores are equal. In theory, the means of scores obtained on parallel forms correlate equally with the true score, and scores obtained on parallel tests correlate equally with other measures.
o Parallel-forms reliability refers to an estimate of the extent to which item sampling and other errors have affected test scores on versions of the same test when, for each form of the test, the means and variances of observed test scores are equal.
o Alternate forms are simply different versions of a test that have been constructed so as to be parallel.
o Alternate-forms reliability refers to an estimate of the extent to which these different forms of the same test have been affected by item sampling error, or other error.
o Obtaining estimates of alternate-forms reliability and parallel-forms reliability is similar in two ways to obtaining an estimate of test-retest reliability: two test administrations with the same group are required, and test scores may be affected by factors such as motivation, fatigue, or intervening events such as practice, learning, or therapy.
An estimate of reliability can also be derived from a single administration: this entails an evaluation of the internal consistency of the test items and is referred to as an internal consistency estimate of reliability or as an estimate of inter-item consistency.
Split-half reliability estimates
o Split-half reliability is obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once.
o A useful measure of reliability when it is impractical or undesirable to assess reliability with two tests or to administer a test twice (because of factors such as time or expense).
o Step 1: Divide the test into equivalent halves.
o Step 2: Calculate a Pearson r between scores on the two halves of the test.
o Step 3: Adjust the half-test reliability using the Spearman-Brown formula.
o One acceptable way to split a test is to randomly assign items to one or the other half of the test.
o Another acceptable way is to assign odd-numbered items to one half of the test and even-numbered items to the other half; this method yields an estimate of split-half reliability that is also referred to as odd-even reliability.
o A third way is to divide the test by content so that each half contains items equivalent with respect to content and difficulty: mini parallel forms, with each half equal to the other – or as nearly equal as humanly possible – in format, stylistic, statistical, and related aspects.
o The Spearman-Brown formula allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test. It is a specific application of a more general formula used to estimate the reliability of a test that is lengthened or shortened by any number of items.
o Usually, but not always, reliability increases as test length increases.
o If test developers or users wish to shorten a test, the Spearman-Brown formula may be used to estimate the effect of the shortening on the test's reliability; the formula can also be used to determine the number of items needed to attain a desired level of reliability.
o The reliability of an instrument might be raised by creating new items, clarifying the test's instructions, or simplifying the scoring rules.
o Internal consistency estimates of reliability, such as those obtained by use of the Spearman-Brown formula, are inappropriate for measuring the reliability of heterogeneous tests and speed tests.
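The general lengthening form of the Spearman-Brown formula mentioned above is r_new = n*r / (1 + (n - 1)*r), where n is the factor by which the test is lengthened; a small sketch with hypothetical reliabilities:

```python
def spearman_brown(r: float, n: float) -> float:
    """Estimated reliability when a test is lengthened n-fold
    (n < 1 models shortening): r_new = n*r / (1 + (n - 1)*r)."""
    return (n * r) / (1 + (n - 1) * r)

# Doubling a test with reliability .70:
print(f"{spearman_brown(0.70, 2):.3f}")    # ~0.824
# Cutting the same test in half:
print(f"{spearman_brown(0.70, 0.5):.3f}")  # ~0.538
```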
Other methods of estimating internal consistency
o Inter-item consistency refers to the degree of correlation among all the items on a scale. A measure of inter-item consistency is calculated from a single administration of a single form of a test, and an index of inter-item consistency is useful in assessing the homogeneity of the test.
o Tests are said to be homogeneous if they contain items that measure a single trait; homogeneity is the extent to which items in a scale are unifactorial. Heterogeneity describes the degree to which a test measures different factors; a heterogeneous (or nonhomogeneous) test is composed of items that measure more than one trait.
o The more homogeneous a test is, the more inter-item consistency it can be expected to have. Test homogeneity is desirable because it allows relatively straightforward test-score interpretation: testtakers with the same score on a homogeneous test probably have similar abilities in the area tested, whereas testtakers with the same score on a more heterogeneous test may have quite different abilities.
o Although a homogeneous test is desirable because it so readily lends itself to clear interpretation, it is often an insufficient tool for measuring multifaceted psychological variables such as intelligence or personality.
Kuder-Richardson formula 20 (KR-20), so named because it was the 20th formula developed in a series
o Where test items are highly homogeneous, KR-20 and split-half reliability estimates will be similar. However, KR-20 is the statistic of choice for determining the inter-item consistency of dichotomous items, primarily those items that can be scored right or wrong (such as multiple-choice items). If test items are more heterogeneous, KR-20 will yield lower reliability estimates than the split-half method.
Coefficient alpha
o Coefficient alpha may be thought of as the mean of all possible split-half correlations, corrected by the Spearman-Brown formula.
o Coefficient alpha is appropriate for use on tests containing nondichotomous items and is the preferred statistic for obtaining an estimate of internal consistency.
o The formula yields an estimate of the mean of all possible test-retest, split-half coefficients.
o Unlike a Pearson r, which may range in value from -1 to +1, coefficient alpha typically ranges in value from 0 to 1. The reason is that, conceptually, coefficient alpha (much like other coefficients of reliability) is calculated to help answer questions about how similar sets of data are, whereas a Pearson r may be thought of as dealing conceptually with both dissimilarity and similarity.
o A value of alpha above .90 may be too high and indicate redundancy in the items.
Average proportional distance (APD)
o APD is a measure used to evaluate the internal consistency of a test that focuses on the degree of difference that exists between item scores.
o The general rule of thumb for interpreting an APD is that an obtained value of .2 or lower is indicative of excellent internal consistency and a value between .2 and .25 is in the acceptable range; a calculated APD above .25 is suggestive of problems with the internal consistency of the test.
o One potential advantage of the APD method over Cronbach's alpha is that the APD index is not connected to the number of items on a measure; Cronbach's alpha will be higher when a measure has more than 25 items.
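A compact sketch of coefficient alpha and the APD on invented 5-point item scores; the APD line follows the verbal description above (average difference between scores on all item pairs, divided by response options minus 1), which admits some implementation latitude:

```python
import numpy as np

# Rows = testtakers, columns = items on a 5-point scale (hypothetical).
x = np.array([
    [4, 5, 4, 3],
    [2, 3, 2, 2],
    [5, 5, 4, 5],
    [3, 4, 3, 3],
    [1, 2, 2, 1],
])
k = x.shape[1]

# Coefficient alpha: (k/(k-1)) * (1 - sum of item variances / variance
# of total scores).
item_var = x.var(axis=0, ddof=1).sum()
total_var = x.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_var / total_var)

# APD: mean absolute difference between every pair of item scores,
# divided by (number of response options - 1).
diffs = [np.abs(x[:, i] - x[:, j]).mean()
         for i in range(k) for j in range(i + 1, k)]
apd = np.mean(diffs) / (5 - 1)

print(f"alpha = {alpha:.3f}, APD = {apd:.3f} (lower APD = more consistent)")
```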
Measures of inter-scorer reliability
Inter-scorer reliability is the degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure. The simplest way of determining the degree of consistency among scorers in the scoring of a test is to calculate a coefficient of correlation: the coefficient of inter-scorer reliability.
Using and interpreting a coefficient of reliability
Three approaches to the estimation of reliability:
o Test-retest
o Alternate or parallel forms
o Internal or inter-item consistency
Dynamic versus static characteristics
o A dynamic characteristic is a trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experiences; here the best estimate of reliability would be obtained from a measure of internal consistency.
o For a trait, state, or ability presumed to be relatively unchanging (a static characteristic), the obtained measurement would not be expected to vary significantly as a function of time, and either the test-retest or the alternate-forms method would be appropriate.
Restriction or inflation of range
o In using and interpreting a coefficient of reliability, an important issue is variously referred to as restriction of range or restriction of variance (or, conversely, inflation of range or inflation of variance).
o If the variance of either variable in a correlational analysis is restricted by the sampling procedure used, then the resulting correlation coefficient tends to be lower; if the variance of either variable is inflated by the sampling procedure, then the resulting correlation coefficient tends to be higher.
o Also of critical importance is whether the range of variances employed is appropriate to the objective of the correlational analysis.
Speed tests versus power tests
o When a time limit is long enough to allow testtakers to attempt all items, and some items are so difficult that no testtaker is able to obtain a perfect score, the test is a power test.
o By contrast, a speed test generally contains items of uniform level of difficulty (typically uniformly low) so that, given generous time limits, all testtakers should be able to complete all the test items correctly. In practice, however, the time limit on a speed test is established so that few if any of the testtakers will be able to complete the entire test.
o A reliability estimate of a speed test should be based on performance from two independent testing periods, using one of the following: test-retest reliability, alternate-forms reliability, or split-half reliability from two separately timed half tests. If a split-half procedure is used, the obtained reliability coefficient is for a half test and should be adjusted using the Spearman-Brown formula.
Criterion-referenced tests
o A criterion-referenced test is designed to provide an indication of where a testtaker stands with respect to some variable or criterion, such as an educational or a vocational objective.
o Unlike norm-referenced tests, criterion-referenced tests tend to contain material that has been mastered in hierarchical fashion.
The true score model of measurement and alternatives to it
o Classical test theory (CTT), also referred to as the true score (or classical) model of measurement, is the most widely used and accepted model in the psychometric literature today.
o A true score is a value that, according to classical test theory, genuinely reflects an individual's ability (or trait) level as measured by a particular test.
Domain sampling theory and generalizability theory
o Whereas those who subscribe to CTT seek to estimate the portion of a test score that is attributable to error, proponents of domain sampling theory seek to estimate the extent to which specific sources of variation under defined conditions are contributing to the test score.
o In domain sampling theory, a test's reliability is conceived of as an objective measure of how precisely the test score assesses the domain from which the test draws a sample.
o A domain of behavior, or the universe of items that could conceivably measure that behavior, can be thought of as a hypothetical construct: one that shares certain characteristics with (and is measured by) the sample of items that make up the test.
o In one modification of domain sampling theory called generalizability theory, a "universe score" replaces the "true score."
o Generalizability theory is based on the idea that a person's test scores vary from testing to testing because of variables in the testing situation. Instead of conceiving of all variability in a person's scores as error, Cronbach encouraged test developers and researchers to describe the details of the particular test situation, or universe, leading to a specific test score. This universe is described in terms of its facets, which include things like the number of items in the test, the amount of training the test scorers have had, and the purpose of the test administration.
o According to generalizability theory, given the exact same conditions of all the facets in the universe, the exact same test score should be obtained.
o This test score is the universe score, and it is analogous to a true score in the true score model.
o A generalizability study examines how generalizable scores from a particular test are if the test is administered in different situations, and how much of an impact different facets of the universe have on the test score.
o Coefficients of generalizability are similar to reliability coefficients in the true score model.
o Decision study: developers examine the usefulness of test scores in helping the test user make decisions. The decision study is designed to tell the test user how test scores should be used and how dependable those scores are as a basis for decisions, depending on the context of their use.
Item response theory (IRT)
o Another alternative to the true score model is item response theory (IRT). The procedures of item response theory provide a way to model the probability that a person with X ability will be able to perform at a level of Y. A synonym for IRT in the academic literature is latent-trait theory.
o Examples of two characteristics of items within an IRT framework are the difficulty level of an item and the item's level of discrimination; items may be viewed as varying in terms of these, as well as other, characteristics.
o Discrimination signifies the degree to which an item differentiates among people with higher or lower levels of the trait, ability, or whatever it is that is being measured.
o Dichotomous test items: test items or questions that can be answered with only one of two alternative responses, such as true-false, yes-no, or correct-incorrect questions.
o Polytomous test items: test items or questions with three or more alternative responses, where only one is scored correct or scored as being consistent with a targeted trait or other construct.
o Latent-trait models differ in some important ways from CTT. The Rasch model is a reference to an IRT model with very specific assumptions about the underlying distribution.
IRT describes the relationship between a testtaker's response to an individual test item and that testtaker's standing, in probabilistic terms, on the construct being measured. Three assumptions are made regarding data to be analyzed within an IRT framework:
o Unidimensionality: posits that the set of items measures a single continuous latent construct. Theta level is a reference to the degree of the underlying ability or trait that the testtaker is presumed to bring to the test.
o Local independence: there is a systematic relationship between all of the test items, and that relationship has to do with the theta level of the testtaker. When the assumption of local independence is met, it means that differences in responses to items are reflective of differences in the underlying trait or ability.
o Monotonicity
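The notes name the Rasch (one-parameter) model; here is a minimal sketch of the related two-parameter logistic model, which makes the difficulty (b) and discrimination (a) parameters described above explicit (parameter values invented):

```python
import numpy as np

def icc_2pl(theta, a, b):
    """Two-parameter logistic IRT model: probability of a correct
    response given ability theta, discrimination a, difficulty b."""
    return 1 / (1 + np.exp(-a * (theta - b)))

# A hard, highly discriminating item vs. an easy, weakly
# discriminating one, evaluated across a range of theta levels.
thetas = np.linspace(-3, 3, 7)
print("theta:", thetas)
print("hard/sharp (a=2.0, b=1.0): ", np.round(icc_2pl(thetas, 2.0, 1.0), 2))
print("easy/flat  (a=0.7, b=-1.0):", np.round(icc_2pl(thetas, 0.7, -1.0), 2))
```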
Item characteristic curve (ICC): also called an item response curve, a category response curve, or an item trace line. It enables test users to better understand the range over theta for which an item is most useful in discriminating among groups of testtakers (see also the information function, or information curve).
Differential item functioning (DIF) is a key methodology for identifying biased items in questionnaires.
The standard error of measurement
o The standard error of measurement (SEM) provides a measure of the precision of an observed test score: an estimate of the amount of error inherent in an observed score or measurement.
o The relationship between the SEM and the reliability of a test is inverse: the higher the reliability of a test (or individual subtest within a test), the lower the SEM.
o The standard error of measurement is the tool used to estimate or infer the extent to which an observed score deviates from a true score. The standard error of a score is an index of the extent to which one individual's scores vary over tests presumed to be parallel.
o A confidence interval is a range or band of test scores that is likely to contain the true score.
Standard error of the difference between two scores
o Comparisons between scores are made using the standard error of the difference, a statistical measure that can aid a test user in determining how large a difference should be before it is considered statistically significant.

Chapter 6: Validity
Validity is a judgment or estimate of how well a test measures what it purports to measure in a particular context. An inference is a logical result or deduction. Tests may be shown to be valid within what we would characterize as reasonable boundaries of a contemplated usage.
Validation is the process of gathering and evaluating evidence about validity. Test users may conduct their own validation studies with their own groups of testtakers.
o Such local validation studies may yield insights regarding a particular population of testtakers as compared to the norming sample described in a test manual.
o Local validation studies are absolutely necessary when the test user plans to alter in some way the format, instructions, language, or content of the test.
Validity is conceptualized according to three categories:
o Content validity: a measure of validity based on an evaluation of the subjects, topics, or content covered by the items in the test.
o Criterion-related validity: a measure of validity obtained by evaluating the relationship of scores obtained on the test to scores on other tests or measures.
o Construct validity: a measure of validity arrived at by executing a comprehensive analysis of how scores on the test relate to other test scores and measures, and how scores on the test can be understood within some theoretical framework for understanding the construct that the test was designed to measure.
Face validity
Face validity relates to what a test appears to measure to the person being tested, rather than to what the test actually measures.