Validity and Reliability
- When you measure something, you want the results to tell you about the characteristic you are trying to measure.
- In order to get good results, two things must happen:
- your measure should be valid
- your measure should be reliable
- The validity of a measure is the extent to which differences in results
of the measurement reflect true differences among individuals on the characteristic
that is supposed to be measured.
- In other words: The measure is sensitive only to what it is supposed to measure.
- For example, an IQ test would be valid if it measured only differences in intelligence.
- It shouldn't matter how tired, nervous, or hung over you are when you take the test;
- only your intelligence should affect your score.
- The reliability of a measure is its consistency.
- A measure is reliable if you get the same result when you repeat the measurement (see the test-retest sketch below).
- If a measure is perfectly valid, it is sensitive only to differences in what it is supposed to measure.
- This means that a perfectly valid measure must also be reliable.
- If it isn't reliable, it must be giving different results from time to time.
- Something must be causing these different results.
- If this something is anything other than actual changes in what it is supposed to measure, the
measurement is sensitive to something other than what it is supposed to measure.
- This means that it isn't perfectly valid.
- If a measure is perfectly reliable, it may or may not be valid.
- The fact that it always gives the same results doesn't mean those results are valid.
- My nephew has an elastic tape measure and everything he measures is "six."
- He is completely consistent, but rarely are his measurements correct.
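- A minimal sketch of checking reliability in practice (Python with NumPy; all numbers invented for illustration): give the same measure twice and correlate the results.

```python
import numpy as np

# Hypothetical scores for 8 people who took the same test twice.
# All numbers are invented for illustration.
time1 = np.array([12, 15, 11, 18, 14, 16, 13, 17])
time2 = np.array([13, 15, 10, 18, 15, 16, 12, 16])

# Test-retest reliability: the correlation between the two
# administrations.  A value near 1.0 suggests a consistent measure.
r = np.corrcoef(time1, time2)[0, 1]
print(f"test-retest correlation: {r:.2f}")
```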
- Types of Validity
- Face validity
- Pragmatic or Predictive validity
- Criterion validity
- Construct validity
- Internal validity
- External validity
- Face validity has to do with the extent to which a measure seems to measure what it is supposed to measure.
- It "looks like" it ought to be measuring what it is designed to measure.
- It "looks like" it measures all relevant aspects of the construct.
- Face validity is necessary . . .
- but it's not enough.
- That is, you must have face validity, but you also need more.
- Pragmatic/Predictive validity and Criterion validity both use a second measure to assess the validity of the first.
- Of course the second measure must be known to be valid if it is to be used as a standard.
- Predictive or pragmatic validity is easy -- does the measure make correct predictions?
- For example, does the GRE exam predict success in graduate school?
- All you have to do is wait and see.
- If it does, i.e. if people who get high exam scores do well in graduate school, the exam has
predictive validity.
- Criterion validity is a bit more complex.
- If you have a second measure you know to be valid, you may be able to check to see if the first
measure agrees with the second one.
- If it does, you have criterion validity.
- How do you determine whether the second measure is valid?
- face validity;
- repeated confirmation, as in predictive validity.
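- A minimal sketch of both ideas (invented numbers; Python with NumPy): correlate the measure with the outcome or criterion it should agree with.

```python
import numpy as np

# Invented data: entrance-exam scores and the same students'
# later graduate GPAs (the predicted outcome / criterion).
exam_scores = np.array([310, 325, 300, 340, 315, 330, 305, 335])
grad_gpa    = np.array([3.1, 3.4, 2.9, 3.8, 3.2, 3.5, 3.0, 3.7])

# A strong positive correlation between the measure and the outcome
# it should predict supports predictive validity; the same calculation
# against a known-valid second measure supports criterion validity.
r = np.corrcoef(exam_scores, grad_gpa)[0, 1]
print(f"exam-to-GPA correlation: {r:.2f}")
```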
- Construct validity is the most important and most complex kind of
validity.
- It has to do with your "construct" or "concept" being a single unidimensional one, rather than being
a family of several different ones.
- If the construct has several aspects or components, it will be difficult for any one measure to
encompass all of them.
- This means that different approaches to measuring the construct will produce different results
- because each one gets at a different aspect, or . . .
- because each one comes from a different perspective.
- If the construct is a single thing
- (if it isn't a family of different ideas grouped together under a single name)
- you would expect that different operational definitions of the construct would result in identical or
similar measurements.
- If this happens, you would say that the construct and measurement approach together have
construct validity.
- But if your construct is actually a family of related but different ideas . . .
- . . . you would expect that different approaches to measuring it would be likely to connect with
different members of the family,
- and therefore, give a different result.
- If this happens, you have a problem with construct validity.
- This is serious because different researchers would be likely to use different operational definitions -- different measurement approaches -- and they would get different results, even though they think they are measuring the same construct.
- Several research groups are interested in something they call "religiosity," which is supposed to be how religious a person is.
- One team looks at how often people go to church.
- Another team looks at how often people pray.
- Another team looks at how spiritual people are.
- Another team looks at how much people obey the rules of their church.
- Another team looks at how much people have integrated their lives with their church (send kids to religious school, participate in church-related social activities, etc.)
- They all get different results, even when they look at the same people.
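- A sketch of how that disagreement could be made visible (simulated data, not real religiosity measures): correlate the operationalizations with one another; if they all tapped a single construct, the correlation matrix should be uniformly high.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 150  # hypothetical respondents

# Simulated scores: attendance and prayer share a common factor,
# while "spirituality" is generated independently, standing in for
# a different member of the construct family.
devotion     = rng.normal(size=n)
attendance   = devotion + rng.normal(scale=0.5, size=n)
prayer       = devotion + rng.normal(scale=0.5, size=n)
spirituality = rng.normal(size=n)

# Rows of the matrix are the three operationalizations.
print(np.round(np.corrcoef([attendance, prayer, spirituality]), 2))
# attendance and prayer correlate strongly; spirituality does not --
# a warning that "religiosity" may be a family of constructs.
```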
- Internal validity refers to the internal consistency of the set of
operational and conceptual definitions and the logical relations among them.
- External validity refers to the generalizability of the results.
- For the research project, data were collected from a sample of the population -- maybe 150 people.
- If that sample is a fair representation of the population, then the results based on the sample can be
generalized to the population.
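- A minimal sketch of that generalization step (invented data; Python with NumPy): estimate the population mean from a sample of 150, with a rough 95% confidence interval as the margin of generalization.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented measurements from a sample of 150 people.
sample = rng.normal(loc=50, scale=10, size=150)

# If the sample fairly represents the population, the sample mean
# plus a margin of error generalizes: a rough 95% confidence interval.
mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))
print(f"estimated population mean: {mean:.1f} +/- {1.96 * se:.1f}")
```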
- Measurement is not perfect.
- There is usually (always?) some error.
- Error can be classified into two types:
- systematic error
- random error
- Systematic error is what is normally called "bias."
- It shows up as results consistently being distorted in the same direction.
- It is like when you are consistently off in one direction when shooting at a target.
- Random error is not consistently in one direction.
- It varies from case to case, both in magnitude and in direction.
- It is what is normally called "unreliability."
- It is like when you are shooting at a target and your shots are scattered about.
- We generally worry more about systematic error than random error, even though it may be possible (in
some cases) to identify and compensate for systematic error.
- This is probably because biases tend to work differently for different subsets of individuals in the
population.
- If they worked in the same direction and to the same extent for all cases, they would be easily
compensated for.
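- A small simulation of the two error types (invented true value; Python with NumPy): bias shows up as a shifted mean, random error as spread around the mean.

```python
import numpy as np

rng = np.random.default_rng(2)
true_value = 100.0  # the "bull's-eye" (invented)
n = 1000

# Random error only: measurements scatter around the true value.
random_only = true_value + rng.normal(scale=5.0, size=n)

# Systematic plus random error: the same scatter, shifted downward.
biased = true_value - 8.0 + rng.normal(scale=5.0, size=n)

for name, x in [("random only", random_only), ("biased", biased)]:
    # mean - true_value estimates the bias (systematic error);
    # the standard deviation reflects the unreliability (random error).
    print(f"{name}: bias = {x.mean() - true_value:+.2f}, "
          f"spread = {x.std():.2f}")
```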
- In the first target pattern, you get consistent and accurate measurements of the bull's-eye, the construct you were aiming at.
- If you were measuring something like, for example, self-confidence, the results you would get in
this case would be very close to the actual levels of self-confidence of the people in your study.
- Any deviations from the actual values would be both small and random, in the sense that they are
spread in all directions.
- In the second pattern, your measurements are consistent, but consistently below the center of the target.
- As in the previous case, the errors are small, but systematically in one direction.
- You would say this measurement is biased.
- If you were measuring self-confidence and your indicator was the loudness of people's voices (assuming that confident people speak loudly), your measures would all be biased if your measuring equipment were improperly adjusted and consistently gave readings lower than they should be.
- In the third pattern, your measurements are in the general vicinity of the center of the target, but they are scattered some distance away.
- These measurements are not reliable.
- However, since they are scattered in all directions, there is no systematic bias.
- Although no single measure is accurate, in the long run (i.e. if your sample is large enough), the
average of the measurements may be reasonably close to the correct average value.
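- A quick simulation of that long-run claim (invented numbers): the average of unbiased but unreliable measurements drifts toward the true value as the number of measurements grows.

```python
import numpy as np

rng = np.random.default_rng(3)
true_value = 100.0  # invented

# Unreliable but unbiased: large scatter, no systematic shift.
for n in (10, 100, 10000):
    shots = true_value + rng.normal(scale=15.0, size=n)
    # The average approaches the true value as n grows; averaging
    # cancels random error, but it would not cancel a bias.
    print(f"n={n:>6}: average = {shots.mean():.2f}")
```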
- In the fourth pattern, the measurements are spread out, but they are all on one side of the center of the target.
- The wide spread indicates a lack of reliability, while the off-center location of the spread indicates bias.
- Something is pushing the results to one side of the true values; the results are systematically biased.
- These results are neither valid nor reliable.
- A different kind of pattern is also possible: three distinct clusters of points, as if there were somehow three targets or three modes of measurement.
- All the data points fall into one of the three tight clusters.
- What could cause this?
- If the measurement process were somehow tapping into three different constructs -- if you have a
problem with construct validity, this is what you might expect.
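- A sketch of how such clustering could be detected (simulated scores; scikit-learn's KMeans chosen here just for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)

# Invented 1-D measurements drawn from three separate "targets",
# as if the instrument were tapping three different constructs.
scores = np.concatenate([
    rng.normal(20, 2, 50),
    rng.normal(50, 2, 50),
    rng.normal(80, 2, 50),
]).reshape(-1, 1)

# Fitting k-means with k=3 recovers the three tight clusters.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scores)
print(np.sort(km.cluster_centers_.ravel()))
```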
- Construct validity is probably the most difficult issue to deal with
when you are studying abstract constructs.
- The best way to deal with it is to use an approach called "multiple methods, multiple measures."
- To do this, you use several approaches to measurement, where each uses different methods or
comes at the construct from a different direction.
- If the various approaches produce results that agree with one another, you can feel more confident
that you in fact have only one construct.
- If they disagree, the validity of the construct would be called into question.
- Finally, picture a hunter shooting at targets: the hunter stands for the researcher trying to measure a construct.
- There are three targets (constructs), and the hunter is looking at them through a mirror, which causes distortions and makes it difficult to aim properly.
- The mirror in this analogy represents the operational and conceptual definitions the researcher is using to guide the study.
- The difficulty is that the hunter (researcher) thinks there is one target and that the measurement process is direct and simple.