Beyond Consistency: Debunking Myths and Unveiling the Real Meaning of Test Reliability

Introduction

Every now and then I like to write about a psychometric concept that is fundamental to the science of psychometric testing and the prediction of human behavior. Three such concepts are often tossed about in conversations, courtrooms, workshops, articles, training seminars, and even sales calls: validity, reliability, and fairness. One of them, test reliability, is discussed often and, in nearly every case, incorrectly. My goal is to help test and assessment users become experts in these tools so that they are used correctly and for the right reasons. To that end, I’m going to attempt to clarify the concept of test reliability for you today.

Unfortunately, I see the concept of reliability described, again and again, as consistency of test scores. While this definition is on the edge of the truth, it is overly simplistic at best and incorrect at worst. I consistently (pun intended) see very experienced and well-intentioned I/O psychologists put forth this definition in an effort to instruct others, so I’m hopeful that this article will benefit a wide audience. A note here to assuage some readers: I/O academicians generally get this concept right, but many practitioners seem to have forgotten it, or never learned it.

What is Test/Assessment Reliability?

In short, the psychometric concept of reliability is about accuracy, not consistency. I’ll explain. We all possess a ‘true’ level of the trait or ability being assessed by a test or psychometric assessment. This is true for mathematical ability, conscientiousness, agreeableness, etc. But as of now, we have no way of actually getting inside someone’s head and measuring these ‘true’ trait or ability levels. Perhaps Microsoft, Google, OpenAI, or someone else will make that possible in a couple of years, but for now we just can’t do it. So our only option, when we need an indication of these trait or ability levels, is to administer a test or assessment and derive an estimate of them. Those estimates are known as observed scores. The more closely our estimates (observed test scores) match individuals’ true scores, the higher the reliability. Reliability ranges from 0 to 1, so if our estimates were perfect, the reliability (r_xx) would equal 1.

Ok – I’m more of a visual learner, so at this point I think an image will help to describe the concept.

Imagine you’re taking a photo of a mountain in the distance. The photo represents what you see, but there might be things affecting the clarity – perhaps there’s mist, or your hands shake a bit. The clear image of the mountain, without any disturbances, represents the ‘true’ image. The disturbances, like mist or shaky hands, represent ‘errors’ in capturing that image.

In testing, a person’s ‘true’ ability or trait is like that clear mountain image. The test score we see, just like our photo, might have disturbances or ‘errors’. Test reliability is all about measuring how close our test score (or our photo) is to that ‘true’ ability or trait (the clear mountain image).

I think that’s a great example and hope that was helpful. Now, however, we need to get into the math so you can really understand what those reliability coefficients mean. In order to do this, we need to talk about Classical Test Theory (CTT). Don’t worry, we will wrap this up with a discussion about why this all matters.

Classical Test Theory (CTT)

Classical Test Theory (CTT) is a framework for understanding and evaluating the reliability and validity of tests, especially in educational and psychological settings. It operates under several basic assumptions about test scores, and it focuses on two primary components: the true score and the error score.

Here’s a breakdown of the theory:

True Score: This is the “real” score an individual would get if there were no errors in testing. It’s the average score that a person would get if they took the same test an infinite number of times. This is because random errors can either raise or lower an observed score, but after enough administrations, they would cancel each other out.

Error Score: This is the difference between the observed score (what the individual got on the test) and the true score. Error can arise from various factors like the test-taker’s mood, environmental distractions, ambiguous questions, or other random factors.

Observed Score: This is the score we actually measure or see (i.e., the test score). In CTT, an individual’s observed score (X) on a test is conceived of as the sum of their true score (T) and error score (E):

X=T+E

Here:

X represents the observed score
T is the true score
E stands for the error component
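
If it helps to see this in action, here is a minimal Python sketch of a single test-taker. The true score of 100, the error spread of 5 points, and the function name administer are all made up for illustration, and I am assuming the errors are random and centered on zero. The sketch shows that any one observed score misses the true score, while the average over many administrations lands very close to it.

```python
import random
import statistics


def administer(true_score, error_sd, rng):
    """One test administration: observed score X = T + E."""
    error = rng.gauss(0.0, error_sd)  # E: random error centered on zero (assumed)
    return true_score + error         # X = T + E


rng = random.Random(42)   # fixed seed so the run is repeatable
true_score = 100.0        # T: hypothetical true score
error_sd = 5.0            # hypothetical spread of the error component

# The same (imaginary) person takes the same test many times.
observed = [administer(true_score, error_sd, rng) for _ in range(10_000)]
mean_observed = statistics.fmean(observed)

print(f"one administration:          {observed[0]:.1f}")
print(f"mean of 10,000 administered: {mean_observed:.2f}")
print(f"true score:                  {true_score:.2f}")
```

Of course, nobody can take a test 10,000 times without remembering the items, which is exactly why the true score is a theoretical quantity and not something we observe.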

Reliability Defined

Within CTT, reliability (r_xx) is a major concern. It is defined as the proportion of the variance in observed scores that is due to variance in true scores. A test that has high reliability will have a high proportion of true score variance relative to the total variance.

r_xx = σ_T^2 / σ_X^2

Where:

σ_T^2 is the variance of true scores.

σ_X^2 is the variance of observed scores.

If the test is perfectly reliable, all variance in observed scores is due to variance in true scores, making r_xx=1.
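
To make the ratio concrete, here is a small simulation, offered as a sketch and not as real data. It invents a population whose true scores have a standard deviation of 10 and adds random error with a standard deviation of 5 (both numbers are my assumptions). It also leans on the usual CTT assumption that true scores and errors are uncorrelated, so that observed variance is roughly true variance plus error variance.

```python
import random
import statistics

rng = random.Random(0)          # fixed seed so the run is repeatable
n_people = 20_000
true_sd, error_sd = 10.0, 5.0   # hypothetical spreads of T and E

# T: each simulated person's true score; X = T + E: what the test reports.
true_scores = [rng.gauss(50.0, true_sd) for _ in range(n_people)]
observed_scores = [t + rng.gauss(0.0, error_sd) for t in true_scores]

var_true = statistics.pvariance(true_scores)          # variance of T
var_observed = statistics.pvariance(observed_scores)  # variance of X
reliability = var_true / var_observed                 # r_xx = var(T) / var(X)

# Value implied by the chosen spreads, assuming T and E are uncorrelated.
expected = true_sd**2 / (true_sd**2 + error_sd**2)

print(f"simulated r_xx: {reliability:.3f}")
print(f"expected r_xx:  {expected:.3f}")
```

With these made-up numbers, about 80% of the spread in observed scores comes from real differences between people and the rest is noise. In real life we never see the true scores, so a simulation is the one place this ratio can be computed directly.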

There, that wasn’t so bad. If you are confused, please look online for a basic statistical description of a distribution, and you will see examples of how to calculate the mean (average) and variance (spread) that will be helpful. Here’s a video that does a pretty good job of it.

How We Estimate Reliability

The reliability coefficients that are reported for tests and assessments are actually estimates of reliability. This is because we cannot get our hands on individuals’ ‘true’ scores, which are needed to compute the statistic directly. Our estimates, however, can be quite accurate, especially if we have large samples to work with. While a full discussion of this topic is beyond the scope of this article, you should become familiar with the three most often cited reliability coefficients, or estimates of reliability. Well-meaning practitioners often refer to these different ways of estimating reliability as different types of reliability, which is incorrect: reliability is a singular theoretical construct, as described above. It’s important to understand what these estimates (i.e., reliability coefficients) really mean so that you can properly evaluate your assessments.

Test-Retest Reliability Coefficient: This involves administering the same test to the same group of individuals at two different points in time and obtaining a correlation coefficient. This method primarily captures stability over time. However, this method assumes that the trait being measured does not change over time.

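Parallel Forms Reliability Coefficient: This uses two different forms of a test that are designed to be equivalent. Both forms are administered to the same group of individuals, and the two sets of scores are correlated. This approach captures consistency across different versions of a test.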

Internal Consistency Coefficient: This evaluates the extent to which items within a test are consistent with one another in measuring the same concept. There are several methods used to assess the internal consistency of a test, but here are two commonly used approaches.

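Split-half: Divide the test into two halves (e.g., even and odd items) and correlate the scores on the two halves.
Cronbach’s alpha: A measure that estimates the average of all possible split-half reliabilities for a test. The formula for Cronbach’s alpha (α) is:

α = (k / (k - 1)) × (1 - Σσ_i^2 / σ_X^2)

Where k is the number of items, σ_i^2 is the variance of scores on item i (summed over all k items), and σ_X^2 is the variance of total observed scores.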
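
As a rough sketch of both internal-consistency approaches, the Python below simulates a 10-item test and computes each estimate. Everything about the data is an assumption on my part: 5,000 simulated people, items that all measure the same trait equally well, and item noise with a standard deviation of 2. Alpha is computed in its standard form, k / (k - 1) times one minus the ratio of summed item variances to total-score variance. The split-half correlation is stepped up with the Spearman-Brown formula, which is my addition here, because each half is only half as long as the full test. The function names are hypothetical.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    return cov / (ss_x * ss_y) ** 0.5


def cronbach_alpha(item_scores):
    """item_scores: one list per person, holding that person's k item scores."""
    k = len(item_scores[0])
    item_vars = [statistics.pvariance([p[i] for p in item_scores]) for i in range(k)]
    total_var = statistics.pvariance([sum(p) for p in item_scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def split_half(item_scores):
    """Odd/even split-half estimate with the Spearman-Brown correction."""
    odd = [sum(p[0::2]) for p in item_scores]   # items 1, 3, 5, ...
    even = [sum(p[1::2]) for p in item_scores]  # items 2, 4, 6, ...
    r_half = pearson(odd, even)
    return 2 * r_half / (1 + r_half)            # step up to full test length


rng = random.Random(7)           # fixed seed so the run is repeatable
n_people, n_items = 5_000, 10    # hypothetical sample and test length
data = []
for _ in range(n_people):
    trait = rng.gauss(0.0, 1.0)  # the person's standing on the trait
    data.append([trait + rng.gauss(0.0, 2.0) for _ in range(n_items)])

alpha = cronbach_alpha(data)
sb = split_half(data)
print(f"Cronbach's alpha:         {alpha:.3f}")
print(f"split-half (odd vs even): {sb:.3f}")
```

Because the simulated items are interchangeable, the two estimates should come out close to each other. With real items that differ in quality or content, they can diverge.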

While all of these methods can provide estimates of reliability, it’s important to remember that no test is entirely free from error. It’s also essential to note that while consistency or repeatability is a part of reliability, it’s not the entirety of it. Reliability encompasses not only the stability of scores over time but also the consistency of scores across equivalent forms of a test, the internal consistency of test items, and more.
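
Here is a sketch of why a simple correlation between two sets of scores can stand in for a quantity we cannot observe. It simulates two measurements of the same people, which you can read as a test and a retest, or as two parallel forms. The assumptions are mine and are idealized: the trait does not change between measurements, the errors on the two occasions are independent, and both have the same spread (true-score SD of 10, error SD of 5).

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    return cov / (ss_x * ss_y) ** 0.5


rng = random.Random(1)          # fixed seed so the run is repeatable
n_people = 20_000
true_sd, error_sd = 10.0, 5.0   # hypothetical spreads of T and E

true_scores = [rng.gauss(50.0, true_sd) for _ in range(n_people)]

# Two measurements of the same people: same T, fresh independent error each time.
first = [t + rng.gauss(0.0, error_sd) for t in true_scores]
second = [t + rng.gauss(0.0, error_sd) for t in true_scores]

estimate = pearson(first, second)  # uses observed scores only
target = statistics.pvariance(true_scores) / statistics.pvariance(first)

print(f"correlation between the two measurements: {estimate:.3f}")
print(f"var(T) / var(X), normally unobservable:   {target:.3f}")
```

The correlation is computed from observed scores alone, yet under these assumptions it lands very near the true-score variance ratio. When the assumptions fail, for example when the trait really does change between administrations, the coefficient will understate or otherwise misstate reliability.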

Why is Test Reliability Important?

Let’s move away from the math and back to practical applications. At this point you might naturally want to know why reliability is such an important construct and why tests should have high levels of reliability. Here’s why:

Getting to the Truth: Just as we’d want our photo to capture the mountain as accurately as possible, when we test someone’s abilities or traits, we want the score to reflect their true standing. If a test is unreliable, it’s like trying to understand the shape of the mountain from a blurry photo.

Avoiding Misjudgment: Suppose a company uses a test to determine if someone is suited for a job. If the test isn’t reliable, it might not capture the person’s true abilities. This is like showing someone a blurry photo and asking them to describe the mountain – they might get it wrong.

Consistency Isn’t Enough: If you took multiple blurry photos of that mountain, they might all look similar (consistent), but they still wouldn’t show the mountain’s true image. In the same way, a test can give consistent results, but that doesn’t mean it’s showing a person’s true abilities. Reliability ensures that test scores are both consistent and accurate reflections of the truth.

To sum it up, test reliability isn’t just about getting the same result over and over; it’s about making sure that result is a true reflection of what’s being measured, much like wanting our photo to truly capture the mountain’s image.