Notes on the meaning and interpretation of Confidence Intervals

Confidence intervals can be calculated for any sample statistic to describe its degree of precision as an estimate of the population value (or parameter). The discussion below is mostly in terms of confidence intervals on sample means, only because most people are quite familiar with that kind of t-test. But t-tests and confidence intervals are used to describe other statistics too, like regression coefficients (b-weights), correlations, etc. All that's needed is a way to calculate the standard error of the statistic, that is, the spread of its sample-able values around the true population parameter. Typically that's part of your computer output, along with the CI itself. You can construct a CI for any chosen percentage, e.g., a 90% CI, a 95% CI, a 99% CI, etc. The lower and upper bounds of the interval are referred to as confidence limits. What exactly do these percentages and confidence limits mean?

Here is what they mean; the rest of this document is just an attempt to justify this and make it more intuitive. Confidence intervals can be interpreted as saying something about the population, and something about the sample.
Population: A 95% confidence interval means that if confidence intervals were calculated for 100 replications of the same experiment, they'd all come out different, but 95 of them would be expected to contain the population parameter value (of the mean, b-weight, correlation coefficient, etc.). It makes no sense to say a particular confidence interval has a 95% chance of containing the population parameter. (More on this below...)
Sample: A 95% confidence interval tells you that any values outside of that interval would be significantly different (with p<.05) from your sample value if you tested them as hypothesized population parameter values using a t-test. Likewise it tells you that your sample value would be significantly different from any values outside of the interval. If these sound like the same thing, that's because they are -- you can think of the distance to some value from your sample mean, or from your sample mean to some value, and you'll be doing the same thing conceptually. (More on this below...) This also means that if zero is inside the interval, your sample statistic is not significantly different from zero; if zero lies outside the interval, the sample statistic is significantly different from zero. If you want those decisions to be based on p<.01 instead of .05, you'd construct a 99% CI.
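
A minimal check of that equivalence, sketched here in Python with scipy (the sample values are invented for illustration): zero falls outside the 95% CI exactly when the one-sample t-test of "population mean = 0" gives p < .05, and outside the 99% CI exactly when p < .01.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    sample = rng.normal(loc=0.5, scale=1.0, size=20)   # arbitrary example data

    t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
    print(f"p = {p_value:.4f}")

    for level, alpha in ((0.95, 0.05), (0.99, 0.01)):
        lo, hi = stats.t.interval(level, df=len(sample) - 1,
                                  loc=sample.mean(), scale=stats.sem(sample))
        # zero outside the interval <=> p < alpha, always
        print(f"{level:.0%} CI = ({lo:.3f}, {hi:.3f});",
              "zero outside:", not (lo <= 0.0 <= hi), "| p < alpha:", p_value < alpha)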

Note that Keith (2006) p. 11 has it ALMOST correct when he calculates a 95% confidence interval for a b-weight regression coefficient of 1.990 that goes from .809 to 3.171, and then interprets it as, "if we were to conduct this study 100 times, 95 times out of 100 the [true (but unknown) population value of] b would be within the range .809 to 3.171." But he should have said "would be within the range that you'd calculate for each of the 100 samples." After all, a second sample would give you different confidence limits than ".809 to 3.171," as would the third, the fourth... and the hundredth. Can they ALL have the same 95% chance of containing the population b (written β, "beta," the Greek letter used for the population counterpart of the sample b)? Which of them to believe? As Howell (2007) says on pp. 181-183, our confidence is in the METHOD, not the particular INTERVAL: we generate 100 intervals using that method, and 95 of the intervals so generated will contain the population value of b. Keith's error is to say that 95 times out of 100 the population value of b would fall in that specific first interval of .809 to 3.171. He IS correct, though, to say that a statistically significant result (p<.05) is equivalent to your sample producing a 95% confidence interval that does NOT contain zero. Conversely, if that interval were, say, -.191 to 4.171, which DOES contain zero, we would also find that when we test the H0 that b = 0 in the population, the p-value associated with that sample b is greater than .05 (i.e., non-significant).
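
Since Keith's example involves a regression b-weight, here is the same equivalence sketched for a slope, in Python with scipy (the x-y data are invented): a 95% CI built from the slope's estimated standard error, checked against the regression's own p-value for H0: b = 0.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    x = rng.normal(size=30)
    y = 2.0 * x + rng.normal(scale=3.0, size=30)   # true slope chosen arbitrarily

    res = stats.linregress(x, y)
    t_crit = stats.t.ppf(0.975, df=len(x) - 2)     # df = N - 2 for a simple slope
    lo, hi = res.slope - t_crit * res.stderr, res.slope + t_crit * res.stderr

    print(f"b = {res.slope:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
    print("zero outside CI:", not (lo <= 0.0 <= hi), "| p < .05:", res.pvalue < 0.05)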

Notice that we make the significance claim based on a p-value for the sample at hand, which is found using the sample value for the b (or the sample mean if you're doing a familiar test of means instead of regression slopes); and the estimate of its variability in the population is also based on the sample at hand. We likewise construct confidence intervals by inverting the t calculation (see Howell ch. 7) to find an interval centered around the sample value at hand, that has a width based on the population variability estimated from the sample at hand. We would never claim that our p-value represents the significance of the mean from some other sample -- it's quite specific to the one at hand. Why then would we think we could use the exact same information to make a confidence interval that somehow would describe the population value, and would thus apply to every other sample we might examine?

This may become clearer when you consider the computation of confidence intervals. To do that, first consider the t-test for a single mean being tested against a hypothesized population mean of zero: we estimate the standard error of the mean, i.e., the standard deviation of the set of all the sample means that we might draw from the population (using sample size N), around their overall mean value (which is the population mean). Then we hypothesize that that population mean is zero, and see how far our sample mean is from that value of zero, in units of the estimated standard error -- that is, we take our observed value minus our hypothesized population value (zero), and we divide that difference by the estimate of the standard error.

This is exactly like a z-score, which tells us how far a score is from the mean in units of its expected departure from the mean (the standard deviation). Where z tells us how many standard deviations from the mean a score is, t tells us how many estimated standard errors from the population value a sample mean is. And if we want to assign a probability to a z-score we just use the normal distribution, e.g., the probability of getting a z of +1.0 or greater is 16%. (You'll remember that 68% of the normal distribution falls within 1 SD of the mean, meaning 34% above and 34% below; since a z-score of +1.0 thus sits 34% above the mean, which is itself at the 50% mark, 84% of the scores are lower than z = +1.0, and the remaining 16% are higher. Or you could look at a table like a normal person.)

We can't use the normal distribution to describe the probabilities of getting certain values of t, because the smaller the N is, the more those t values are spread out compared to the normal distribution. Or more accurately, the smaller the df are, where df = N-1. The t distribution gets closer to the normal z the more df there are, but it takes a lot of df for t to be very similar to z. By consulting the t distribution for the df we have, though, we CAN find the probability of getting our t or larger, just as we could for a z-score using the normal distribution. So t is a distance from a hypothesized value, where the distance is either positive or negative and is expressed in units of the number of estimated standard errors. (I have to keep saying "ESTIMATED standard errors" because if we knew the actual population standard error, we could in fact use the normal distribution after all. Estimating the standard error requires a t distribution because it introduces more variability, since we don't know how good our estimate actually is, coming from a given sample -- except that it definitely becomes a better estimate the larger our N is.)
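
To make that concrete, here is a sketch in Python with scipy (the sample values are invented): the t for a single mean computed by hand, exactly as described, plus a look at how much fatter the t distribution's tails are than the normal's when df are small.

    import numpy as np
    from scipy import stats

    sample = np.array([0.8, 1.9, 0.4, 1.2, 2.3, 0.9, 1.5])
    n = len(sample)
    se = sample.std(ddof=1) / np.sqrt(n)      # estimated standard error of the mean
    t_by_hand = (sample.mean() - 0.0) / se    # observed minus hypothesized, over est. SE
    print(t_by_hand, stats.ttest_1samp(sample, 0.0).statistic)   # same number twice

    # The fatter tails: the probability of t >= +1.0 shrinks toward the
    # normal's 16% as df increase.
    for df in (2, 5, 30, 1000):
        print(df, stats.t.sf(1.0, df))        # one-tailed probability beyond +1.0
    print("normal:", stats.norm.sf(1.0))      # about .1587, the 16% mentioned above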

Confidence intervals simply CHOOSE a t, instead of computing one from the data. That is, we don't find out how many estimated standard errors away from the population value our observed mean is, and then find the probability of getting a sample mean that far away. Instead we START with a probability of being "extreme to a certain degree," like the (conventionally chosen) extreme 5% in the tails of the distribution. In that case we want to identify the number of estimated standard errors from the mean (that is, t) that would encompass 95% of the distribution of sample means and thus leave only 5% of the possible sample means more extreme than that. Our table or computer tells us the value of t associated with the extreme 5% cutoff, for whatever df we're dealing with. (For few df, those t distributions are really spread out, so you have to count off a higher value of t to reach the point in the distribution that has only 5% of the sample means remaining more extreme. For more df, you don't need as high a t; and for infinite df, when t literally becomes identical to z, the value in the normal distribution that leaves 5% of the sample means more extreme is a mere z = 1.96 -- i.e., 1.96 standard deviations above or below the mean.) Once we identify the proper t value, it's just a matter of translating it back into the scale of the variable we're looking at. Since t tells us how many estimated standard errors we are from the mean of all the sample means (i.e., the population mean), we just multiply t by the estimated standard error to express the same distance in our original units. Of course, the 95% of sample means we're trying to encompass includes means above AND below the overall mean, with 2.5% in the upper extreme and 2.5% in the lower extreme, since t just tells us the distance away from the population mean in either direction to encompass that 95%. So once we translate t back into the original scale by multiplying by the estimated standard error, we have to both add AND subtract that distance from our sample mean, resulting in an interval from

[sample mean - t*(est std err)] to [sample mean + t*(est std err)].
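
That formula translates directly into code. Here is a sketch in Python with scipy (invented data); the hand-built interval matches what scipy's own t.interval reports.

    import numpy as np
    from scipy import stats

    sample = np.array([5.2, 4.8, 6.1, 5.5, 4.9, 5.8, 6.3, 5.1])
    n = len(sample)
    mean, se = sample.mean(), stats.sem(sample)

    t_crit = stats.t.ppf(0.975, df=n - 1)   # the CHOSEN t: 2.5% beyond it per tail
    print(mean - t_crit * se, mean + t_crit * se)
    print(stats.t.interval(0.95, df=n - 1, loc=mean, scale=se))   # same interval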

The interval is an estimate centered around THIS sample's mean and based on THIS sample's estimated standard error; the width and location of the interval will vary with each new sample. But it is true that if we do it a hundred times, we'll expect 95% of those intervals to contain the population mean. We don't know anything about the probability of THIS sample's interval being right, because THIS sample only happens once, and when we talk about mathematical probability what we want is something that occurs a bunch of times (100 is a convenient example), so we can say that 95/100 of those times we'll catch the population value in our net. So what if the interval is different every time -- we're using the same PROCEDURE 100 times. And 95 of those times our sample mean will, by definition, come from the 95% of sample means that are closest to the population mean. Because of our choice of t at the first step, for 95 of 100 replications our procedure will be based on samples whose means and estimated standard errors do in fact produce an interval that has the population mean in it.
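
You can watch the method's 95% claim play out in a simulation. In this sketch (Python with numpy/scipy; the population mean, SD, sample size, and number of replications are arbitrary choices), we draw 100 samples from a population whose mean we know, build a 95% CI from each, and count the hits -- expect a number near 95.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    true_mean, n, reps = 10.0, 25, 100        # arbitrary population and design
    t_crit = stats.t.ppf(0.975, df=n - 1)

    hits = 0
    for _ in range(reps):
        sample = rng.normal(loc=true_mean, scale=3.0, size=n)
        half = t_crit * stats.sem(sample)     # the interval's half-width
        if sample.mean() - half <= true_mean <= sample.mean() + half:
            hits += 1
    print(f"{hits} of {reps} intervals contained the population mean")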

[A note on that choice of t at the first step: above I referred to "the number of estimated standard errors from the mean (that is, t) that would encompass 95% of the distribution of sample means and thus leave only 5% of the possible sample means more extreme than that." Shortly after that I said those extremes were both above and below the mean. That means that to cut off the extreme 5%, you don't get to simply choose the t that leaves 5% of the distribution in one tail. You have to choose a t that includes 47.5% of the distribution on either side of the mean, leaving 2.5% outside that range on either end of the distribution, for a total of 5%. Some tables and programs make this choice of t more obvious than others; just be aware of it. It's the "two-tailed" situation you're almost always interested in. When a program reports the 95% CI, though, it has made the two-tailed choice correctly, and you can pretend you knew exactly what you were doing all along.]
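
In scipy terms, the note above amounts to asking the t distribution for its 97.5th percentile rather than its 95th when you want a 95% CI (the df of 20 is an arbitrary example):

    from scipy import stats

    df = 20
    print(stats.t.ppf(0.95, df))      # 1.725: leaves 5% in ONE tail -- not the CI t
    print(stats.t.ppf(0.975, df))     # 2.086: leaves 2.5% per tail -- the t you want
    print(stats.t.ppf(0.975, 10**9))  # approaches 1.96, the normal value, as df grow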

Obviously no one actually replicates their study 100 times to try to locate the population value of interest; we do a study once, and then feel some degree of confidence based on having applied this method. A Bayesian analysis might even quantify that subjective confidence as "95%"! But that would remain a vague expression of your "degree of belief" and still would not be the same thing as being able to say there's a 95% probability of finding the population value in a particular given confidence interval. You just don't ever get to make that claim. That doesn't really matter, though; often the response to a confidence interval is simply being pleased at how narrow (and therefore precise) it appears, or more often being dismayed at how wide and non-specific it is!

Note finally that if THIS sample's interval, generated with the aim of encompassing 95% of the sample means, does in fact include zero, that's equivalent to saying that THIS sample can't tell us that zero is outside of the 95% of sample means that we're encompassing -- in other words it can't tell us that zero is in the most extreme 5%. Our sample then likewise can't tell us that a value of zero has p equal to or less than .05 and is statistically different from the sample mean. That sounds a bit odd -- isn't it the sample mean that has a p-value when being compared to zero, not vice versa as I've phrased it? A t-test encourages us to look at how far away our sample mean is, starting from the hypothesized population value (usually zero), but distance is distance -- why not consider the distance out to the hypothesized population value, starting from the sample mean? The distance from zero to the sample mean is the same as the distance from the sample mean to zero; if the distance expressed by t isn't great enough to exclude zero from the 95% of sample means we consider pretty likely to occur, then our sample mean is said to be NOT statistically significantly different from zero.

A nice thing about confidence intervals as opposed to t-tests is that we don't only have information about the distance from the sample mean to zero -- we have information about the distance from the sample mean to ANY hypothesized population value. Is our sample mean different from 2.9? Not if 2.9 falls in the 95% interval. Is it different from 35.673? Not if 35.673 falls in the 95% interval. Is it different from -6? Well, just to imagine another result, if -6 is OUTSIDE the 95% confidence interval we generated from this sample, then this sample's mean is likewise in the extreme 5% of the distribution centered around -6, and therefore IS statistically different from -6, with p<.05.

Look: if you want to say where the real population value actually is, it's problematic to use confidence intervals. But if you want to say whether a sample mean differs from some particular hypothesized value of the population mean, confidence intervals are actually a more general use of the same information that goes into a t-test, because they let you compare a sample mean to all possible population values at once instead of making you compute another t for every population value you might hypothesize. Confidence intervals are still subject to most of the other criticisms of significance testing (see Howell (2007) ch. 4 and Cohen, 1994), but given the logic of significance testing, they are a better use of information than simply reporting the p-value associated with a t-test.
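
Here is a sketch of that "all possible population values at once" point, in Python with scipy (the data and the hypothesized values are invented): one interval answers every such question, and each answer agrees with a t-test against that value.

    import numpy as np
    from scipy import stats

    sample = np.array([3.1, 2.4, 4.2, 3.8, 2.9, 3.5, 4.0, 3.3])
    lo, hi = stats.t.interval(0.95, df=len(sample) - 1,
                              loc=sample.mean(), scale=stats.sem(sample))

    for mu0 in (0.0, 2.9, 35.673, -6.0):
        in_ci = lo <= mu0 <= hi
        p = stats.ttest_1samp(sample, mu0).pvalue
        # mu0 outside the interval <=> p < .05 for the test against mu0
        print(f"mu0 = {mu0}: inside 95% CI? {in_ci}; t-test p = {p:.4g}")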

Finally, as mentioned above, all of this discussion of confidence intervals applies to any other statistic for which a t value can be calculated: b-weights, r's, and so on.