SCALES AND INDEXES
Creating scales, indexes, or any measurement/assessment instrument that might be called a test is part of the research process that is concerned with calibration. In many ways, calibration is a quick and easy way to achieve precision and accuracy, which are, of course, important goals of measurement. One can just as easily get by without creating a scale or index, but at some point, at least in estimating the reliability and validity of your study, you're going to have to look at item and response patterns. Do the items (questions) you're asking fit together in the most productive way, or do they overlap redundantly? Do the response patterns (answers) hint at ways you can improve your measuring instrument? There's a big difference between scaling and scoring a test, and since most readers are familiar with the typical multiple choice tests found in education, that's where we'll start. It's not uncommon for social sciences to draw upon the field of Education Statistics. You'll probably have to take such a course in graduate school if you plan to do any teaching. Here, you'll learn the basics of item analysis and scale construction.
TEST DEVELOPMENT
A test is a series of questions designed to measure the nature and extent of individual differences. In the educational context, most people are familiar with the achievement-type test, which is designed to capture individual differences in knowledge. A typical item looks like the following:
| 56. Which screening process is recommended for obtaining the best pool of police applicants? |
The multiple choice format is based on certain principles or rules. First of all, the sentence stem in the question should be short, clearly stated, and well-written. With a knowledge-based achievement test, it's best to stick to questions like "Which of the following is..." or "Which of the following is not...", but you should generally shy away from long-winded, complex questions unless they present a readable scenario like the following:
| 12. On October 31, 1998, Sam robbed the Acme Bank while wearing a Halloween mask and carrying a gun. While speeding from the crime scene, Sam lost control of his Jeep Cherokee and ran into a telephone pole. When the police, who had previously received a bulletin about the bank robbery, arrived at the accident scene and saw the Halloween mask and bag of money in Sam’s car, they immediately placed him under arrest for bank robbery, frisked him, and then asked him: "Where’s the gun?" Sam replied that the gun was in his glove compartment. The police retrieved the gun, placed Sam in the patrol car, and drove him to police headquarters. As Sam was getting out of the squad car, he asked: "How many years am I going to have to do for the bank robbery?" Sam's lawyer has moved to suppress both statements, because Miranda warnings were not administered until after he was inside police headquarters. The court should suppress:
A. Sam’s statement to the police about the location of his gun
B. Sam’s statement "How many years am I going to have to do for the bank robbery?"
C. neither statement
D. both statements |
The answers in multiple choice format (the length depending upon instructor preference: A, B, C, D, E or A, B, C, D) should contain one and only one true and correct answer. That doesn't mean the answer has to be the most clear, concise, comprehensive answer ever written; it just means that one response is completely true and correct. Answers should also be short, clearly stated, and well-written. All the answers other than the true and correct one are called distracters. It's important to follow the principle that every distracter be plausible enough to sucker at least 2% of respondents into guessing at it. Poor, ridiculous, or "Mickey Mouse" distracters like "The police should let Sam go because bank robbery isn't all that bad" should be avoided.
Item analysis with our multiple choice achievement test example would involve looking at distracter patterns (the 2% rule), the difficulty level, and the discrimination index. To calculate the latter two, you need to sort all your completed tests in rank order, from best to worst. Then, you take the top 27% (the best group) and the bottom 27% (the worst group), and work out the following formulas. The procedure is very similar to the Kuder-Richardson, or KR-20, coefficient discussed in the previous lecture under split-half reliability.
| Difficulty Index
(# of people in best group who got item right + # of people in worst group who got item right) / (# of people in both groups) |
| Discrimination Index
(# of people in best group who got item right - # of people in worst group who got item right) / (# of people in one group) |
The Difficulty Index is going to be a number from .00 to 1.00 (it is the proportion who got the item right, so higher numbers mean an easier item), and ideally, you want a number in the moderately difficult range (from .50 to .70). The Discrimination Index is going to be a number from -1.00 to +1.00, and ideally, you want a number in the twenties (from .20 to .29). Anything above that means you are favoring your brighter respondents. A value at, near, or below zero means that you are rewarding chance, or guessing, since four responses work out to a 25% probability of getting the item right by chance, and the 27% best-worst dichotomy you made with this formula controls for this. There are tradeoffs between difficulty and discrimination, however. As difficulty goes up, discrimination approaches zero.
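To make the arithmetic concrete, here is a minimal Python sketch of the two indexes. It assumes each completed test has already been scored as a row of 1s (right) and 0s (wrong), and it assumes the conventional denominators: the number of people in both groups for difficulty, and the number of people in one group for discrimination. The function name and the sample data are made up for illustration.

```python
def item_analysis(responses, item, fraction=0.27):
    """Return (difficulty, discrimination) for one item.

    responses: one row per test-taker, each row a list of 1 (right) / 0 (wrong).
    item:      column position of the item being analyzed.
    fraction:  share of test-takers in the best and in the worst group.
    """
    # Rank the completed tests from best to worst by total score.
    ranked = sorted(responses, key=sum, reverse=True)
    n = max(1, round(len(ranked) * fraction))  # size of each group
    best, worst = ranked[:n], ranked[-n:]

    best_right = sum(row[item] for row in best)
    worst_right = sum(row[item] for row in worst)

    # Assumed denominators: both groups for difficulty, one group for discrimination.
    difficulty = (best_right + worst_right) / (2 * n)
    discrimination = (best_right - worst_right) / n
    return difficulty, discrimination


# Hypothetical data: 10 test-takers, 4 items.
sample_tests = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

for i in range(4):
    diff, disc = item_analysis(sample_tests, i)
    print(f"item {i}: difficulty={diff:.2f} discrimination={disc:.2f}")
```

With only 10 tests, each 27% group rounds to 3 people, so the numbers jump around a lot. In practice you would want a much larger batch of tests before trusting either index.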
SCALE DEVELOPMENT
A scale is a cluster of items (questions) that taps into a unitary dimension or single domain of behavior, attitudes, or feelings. Scales are sometimes called composites, subtests, schedules, or inventories. Aptitude, attitude, interest, performance, and personality tests are all measuring instruments based on scales. A scale is always unidimensional, which means it has construct and content validity. A scale is always at the ordinal or interval level, but it's conventional for researchers to treat it as interval or higher. Scales are predictive of outcomes (like behavior, attitudes, or feelings) because they measure underlying traits (like introversion, patience, or verbal ability). It's probably an overstatement, but scales are primarily used to predict effects, as the following example shows:
| An Example of a Scale Measuring Introversion:
I blush easily. |
A great many scales can be found in the literature or in handbooks (Brodsky & Smitherman 1983), and beginning researchers are well-advised to borrow or use an established scale before attempting to create one of their own. However, most researchers are interested in breaking new ground, and have at least some hunch about what are variously called "tipping points", "the last straw", "going over the edge", or "snapping." It's this hardening, intensity, potency, or coming together of behavior, attitudes, and feelings that the researcher is calling a "trait" or something inside the person that is hopefully captured in scale construction. Scaling is all about quantifying the mysterious mental world of subjective experience.
There are four ways to construct scales:
Thurstone scales
Likert scales
Guttman scaling
Semantic differential
Thurstone scales were developed in 1929 for measuring a core attitude when you have multiple dimensions or concerns around that attitude. Take gun control, for instance. A person might have one part of their attitude relating to self-defense; another part of their attitude relating to constitutional rights; and still another part of their attitude relating to child safety. How do you determine which part of the attitude goes to the core of the matter? In Thurstone scaling, the researcher would obtain a panel of judges (say 100 of them) and then dream up every conceivable question that could be asked about gun control (say 100 questions). By administering that questionnaire to the panel, the researcher can analyze inter-item agreement among the judges, and then even use the discrimination index (explained above) to weed out what are called the nonhomogeneous items. Scaling is all about homogeneity, a term sometimes used as synonymous with being unidimensional. I know in the educational testing context, I said you wanted a discrimination index in the twenties, but using Thurstone scaling, you actually want to favor your brighter respondents and look for higher-scoring items. You will most likely end up with a scale of 15-20 homogeneous and unidimensional items.
Likert scales were developed in 1932 as the five-point bipolar response format most people are familiar with today. These scales always ask people to indicate how much they agree or disagree, approve or disapprove, or believe something to be true or false. There's really no wrong way to do a Likert scale, the most important thing being to have at least five response categories (for ordinal-treated-as-interval measurement). Some appropriate examples appear below:
| Never   Seldom   Sometimes   Often   Always |
The "don't know" in the second example is optional, and some people prefer not to use it since it's an odd response category. The examples showing "About 50/50", "Need more information", or "A bit of both" are preferable to use. You can extend the ends of the scale by adding "very" to create 7-point scales, which tend to reach the upper limits of reliability (Nunnally 1978). It's best to use as wide a scale as possible since you can always collapse the responses into condensed categories later on for analysis purposes.
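Collapsing a wide scale after the fact is a simple recoding step. The sketch below is one way to do it in Python, assuming 7-point responses coded 1 through 7 with 4 as the midpoint; the coding scheme and the function name are assumptions for illustration, not a standard.

```python
def collapse_7_to_3(score):
    """Collapse a 7-point response (coded 1..7) into 3 categories.

    Assumed coding: 1-3 = disagree side, 4 = midpoint, 5-7 = agree side.
    Returns 1 (disagree), 2 (neutral), or 3 (agree).
    """
    if not 1 <= score <= 7:
        raise ValueError(f"expected a score from 1 to 7, got {score}")
    if score <= 3:
        return 1
    if score == 4:
        return 2
    return 3


# Hypothetical raw responses from one questionnaire item.
raw = [1, 2, 3, 4, 5, 6, 7]
collapsed = [collapse_7_to_3(s) for s in raw]
print(collapsed)
```

Note that the recoding only goes one way: you can condense seven categories into three, but you can never recover seven from three, which is the argument for collecting the wider scale in the first place.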
Guttman scaling was developed in the 1940s and is a technique of mixing questions up in the sequence they are asked so that respondents don't see that several questions are related. A lot of irrelevant questions surround the important questions. The scoring system is based on how closely respondents follow a pattern of ever-hardening attitude toward some topic in the important questions. Let's take the example of attitude toward capital punishment:
| For each of the following, indicate if you SA, A, 50/50, D, or SD:
1. Crime is a serious problem in the United States.
2. Police should be given more powers.
3. More criminals should be given the death penalty.
4. The U.S. ought to do something about drug exporting countries.
5. The military ought to be used to patrol our streets.
6. Inmates on death row ought to be executed quickly.
7. Most politicians are too soft on crime.
8. Lethal injection is too merciful for those who deserve it.
9. Crime is destroying the social fabric of our society.
10. They ought to jack up the voltage when they electrocute criminals. |
In the above example, items #3, 6, 8, and 10 make up the scale for attitude toward capital punishment. Everything else is irrelevant. You should see how the relevant items lead progressively to a harder and harsher attitude. If most of the respondents you study (or the top 27% of them) hold fast to this hierarchical pattern, you've captured a very one-dimensional aspect of your construct. In addition, you can calculate something called the coefficient of reproducibility, which is simply 1 minus the ratio of breaks with the hardened response pattern to the total number of responses. Guttman scaling is very appealing, but it's not all that well-received by the scientific community. A variation is the Bogardus social distance scale, but it has properties of the semantic differential also.
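The coefficient of reproducibility is easy to compute once the responses are coded. The Python sketch below rests on several assumptions that you should adjust to your own study: responses are dichotomized (say, SA or A counted as 1, everything else as 0), the scale items are listed from mildest to harshest, and a "break" is any response that departs from the ideal pattern for someone endorsing that many items. That is only one of several ways of counting breaks, and the function name and data are hypothetical.

```python
def coefficient_of_reproducibility(patterns):
    """Return 1 - (breaks / total responses) for dichotomized scale items.

    patterns: one row per respondent, each row a list of 1 (endorsed) /
              0 (not endorsed), with items ordered mildest to harshest.
    """
    breaks = 0
    total = 0
    for row in patterns:
        endorsed = sum(row)
        # Ideal hardened pattern: the mildest items endorsed first, no gaps.
        ideal = [1] * endorsed + [0] * (len(row) - endorsed)
        breaks += sum(actual != expected for actual, expected in zip(row, ideal))
        total += len(row)
    return 1 - breaks / total


# Hypothetical answers to items 3, 6, 8, and 10, in that order.
sample_patterns = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 0],  # skips item 6 but endorses item 8: two breaks
]

cr = coefficient_of_reproducibility(sample_patterns)
print(round(cr, 3))
```

Here 2 of the 24 responses break the pattern, so the coefficient is 1 minus 2/24.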
The Semantic Differential is a technique developed in the 1950s to deal with emotions and feelings. It's based on the idea that people think dichotomously or in terms of polar opposites such as good-bad, right-wrong, strong-weak, etc. There are many varieties of the technique, the most popular one asking respondents to place their own slash mark along a line between adjectives. Let's take the example of a scale intending to measure feelings toward rap music as a cause of crime:
| On each line below and between each extreme, place a slash closest to your first impression:
HOW DO YOU FEEL ABOUT THE ARGUMENT THAT RAP MUSIC CAUSES CRIME?
Bad ------------------------------ Good
Deep ----------------------------- Shallow
Weak ----------------------------- Strong
Fair ----------------------------- Unfair
Quiet ---------------------------- Loud
Modern --------------------------- Traditional
Simple --------------------------- Complex
Fast ----------------------------- Slow
Dirty ---------------------------- Clean |
You can use the semantic differential with any adjectives you choose, and they don't even have to make sense. The point is to collect response patterns that you can analyze for scaling purposes. To quantify a semantic differential, all you do is overlay a Likert-type scale on top of it, and assume the endpoints are extremes such as "very bad" or "very good." You can also use a ruler and obtain precise numerical measurements. Throw out the items that don't correlate well with one another, and you've got a very precise and accurate scale.
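If you go the ruler route, the measurement still has to be turned into a score. The following Python sketch overlays a 7-point scale on the printed line by cutting it into seven equal segments. The line length, the 7-point choice, and the idea of reverse-scoring items whose favorable adjective sits on the left (such as "Deep" or "Fair") are assumptions made for illustration.

```python
def slash_to_score(position_mm, line_length_mm, points=7, reverse=False):
    """Convert a slash mark's distance from the left end of the line
    into a score from 1 to `points`.

    reverse=True flips the score for items whose favorable adjective
    is printed on the left side of the line.
    """
    if not 0 <= position_mm <= line_length_mm:
        raise ValueError("slash position must fall on the line")
    # Cut the line into equal segments; the right endpoint joins the last one.
    score = min(points, int(position_mm * points / line_length_mm) + 1)
    return points + 1 - score if reverse else score


# Hypothetical measurements on a 140 mm line.
print(slash_to_score(10, 140))                 # near the left extreme
print(slash_to_score(70, 140))                 # dead center
print(slash_to_score(135, 140))                # near the right extreme
print(slash_to_score(10, 140, reverse=True))   # left extreme, reverse-scored
```

Reverse-scoring matters because the items only correlate sensibly with one another when a high number means the same thing on every line.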
INDEX DEVELOPMENT
An index is a set of items (questions) that structures or focuses multiple yet distinctly related aspects of a dimension or domain of behavior, attitudes, or feelings into a single indicator or score. Indexes are sometimes called composites, inventories, tests, or questionnaires. Like scales, they can measure aptitude, attitude, interest, performance, and personality, but the only kinds of validity they have are convergent (hanging together), content, and face validity. It is possible to use some statistical techniques (like factor analysis) to give them better construct validity (or factor weights), but it is a mistake to think of indexes as multidimensional (no such word exists) since even the most abstract constructs are assumed to have unidimensional characteristics. Indexes are sometimes at the ordinal level, but mostly at the interval level. Indexes can be predictive of outcomes (again, using statistical techniques like regression), but they are designed mainly for exploring the relevant causes or underlying symptoms of traits (like criminality, psychopathy, or alcoholism). It's probably an overstatement, but indexes are used primarily to collect causes or symptoms, as the following example shows:
| An Example of an Index Measuring Delinquency:
I have defied a teacher's authority to their face. |
Indexes are usually administered in the form of surveys or questionnaires. It's only at the time of report writing that you claim to have developed an index. You'll need an ideal response rate of 35% on your questionnaire, and at least a 5-point Likert scale for the response categories. How to create good questionnaires is the subject of another lecture. There are a variety of ways to do surveys. Factor analysis, cluster analysis, or other advanced statistical techniques are typically used for item analysis of surveys.
FACTOR ANALYSIS AND CLUSTER ANALYSIS
These are advanced methods of data analysis that require special training and proficiency at using computerized statistics programs like SPSS. Factor analysis can help develop an index, test the unidimensionality of a scale, assign weights (factor loadings) to items in an index, and statistically reduce a large number of indicators to a smaller set. It works by placing all the numbers in a variance-covariance matrix and then performing multiple iterations (repeats) on this matrix until the most statistically meaningful common denominators can be found. These common denominators may or may not be theoretically significant. If you're lucky, only one factor, or common denominator, will be produced. Ordinarily, factor analysis produces 4-5 such factors, and the researcher then has to justify discarding them in favor of the core set of items for their index or scale.
Cluster analysis is a similar technique, but more in keeping with the way reliability coefficients are produced. It involves iterative computer runs on your data matrix that continually re-sort and reclassify your groupings and categories into the most elegant mathematical matrix. The result is a tree and branch diagram which shows you which items are most connected to the others. Both factor and cluster analysis are avoided by many researchers in favor of plain old-fashioned looking at inter-item correlations.
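Looking at inter-item correlations doesn't take a statistics package. As a rough sketch, the Python below computes the Pearson correlation for every pair of items using only the standard library. The item names and scores are hypothetical, and it assumes every respondent answered every item and that no item has identical answers from everyone (which would make the correlation undefined).

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def inter_item_correlations(items):
    """Return {(item_a, item_b): r} for every pair of items.

    items: dict mapping an item name to its list of scores,
           one score per respondent, in the same respondent order.
    """
    return {
        (a, b): pearson(items[a], items[b])
        for a, b in combinations(items, 2)
    }


# Hypothetical 5-point responses from five respondents.
sample_items = {
    "q3": [1, 2, 3, 4, 5],
    "q6": [1, 3, 2, 5, 4],
    "q8": [2, 1, 2, 1, 2],
}

correlations = inter_item_correlations(sample_items)
for pair, r in correlations.items():
    print(pair, round(r, 2))
```

In this made-up data, q3 and q6 hang together while q8 does not track either of them, which is the kind of item you would consider throwing out.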
REVIEW QUESTIONS:
1. What is the difference between the logic of a scale and the logic of an index?
2. Name some things that would go into a parole success prediction index.
3. Discuss the advantages and disadvantages of using 7-point Likert scales.
4. Discuss the advantages and disadvantages of using a "Don't know" category.
PRACTICUM:
1. Construct a short Guttman scaling series of questions on a topic of your choice.
2. Construct some Likert scale items on a series of questions of your choice.
INTERNET RESOURCES
Babbie's Online Practice Quiz on Scales and Indexes
Guidelines for Writing Likert-type Scales and avoiding No-Response items
Item Analysis and Reliability
List of Tests, Scales and Indexes in the Mental Measurements Yearbook
Prof. Trochim's Lecture Notes on Scaling
SPSS Online Tutorial
Statistics associated with Scales and Measures
The Factor Analysis Glossary
The Cluster Analysis Algorithms
PRINTED RESOURCES
Brodsky, S. & H. Smitherman. (1983). Handbook of Scales for Research in Crime and Delinquency. NY: Plenum.
Brown, F. (1976). Principles of Educational and Psychological Testing. NY: Holt, Rinehart, Winston.
Ebel, R. (1972). Essentials of Educational Measurement. Englewood Cliffs, NJ: Prentice Hall.
Hagan, F. (2000). Research Methods in Criminal Justice and Criminology. Boston: Allyn & Bacon.
Likert, R. (1932). "A Technique for the Measurement of Attitudes" Archives of Psychology 140, 55.
Neuman, L. & B. Wiegand. (2000). Criminal Justice Research Methods. Boston: Allyn & Bacon.
Nunnally, J. (1978). Psychometric Theory. NY: McGraw Hill.
Salkind, N. (2000). Exploring Research. Upper Saddle River, NJ: Prentice Hall.
Stevens, S. (Ed.) (1951). Handbook of Experimental Psychology. NY: Wiley.
Thurstone, L. & E. Chave. (1929). The Measurement of Attitudes. Chicago: Univ. of Chicago Press.
Last updated: Oct 09, 2006
Not an official webpage of APSU, copyright restrictions apply, see Megalinks in Criminal Justice
O'Connor, T. (Date of Last Update at bottom of page). In Part of web cited (Windows name for file at top of browser), MegaLinks in Criminal Justice. Retrieved from http://www.apsu.edu/oconnort/rest of URL accessed on today's date.