Note: Mentions of exams
refer to the exam that covers the topic being described; some of these topics
are introduced early and would appear on the first exam, some appear on the
second, and some aren't discussed till the end of the semester and therefore
obviously won't be tested until the final exam. You will be given explicit
information about which topics are on which exams. This page is simply a
convenient collection of some useful points.
what p-values mean
- The p-value at the end of a hypothesis test
tells you the probability of getting the data you got or data more extreme than
that, given that the null hypothesis is true. It does NOT tell you the probability
that the null hypothesis (or any other hypothesis) is true (or false), given the
data that you got.
- To see why that is, consider how different
those two probabilities could be, since they really have nothing to do with one
another. The analogy is that the probability of being Italian given that you're
in the mafia is pretty high, but the probability of being in the mafia given that
you're Italian is almost zero.
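To put rough numbers on that analogy (the counts below are entirely made up, just to make the two conditional probabilities concrete), here is a quick sketch:

```python
# Hypothetical counts, invented purely for illustration; the point is the asymmetry
# between P(A given B) and P(B given A).
italians = 60_000_000           # people who are Italian
mafia_members = 20_000          # people in the mafia
italian_mafia_members = 18_000  # people who are both

p_italian_given_mafia = italian_mafia_members / mafia_members  # P(Italian | in mafia)
p_mafia_given_italian = italian_mafia_members / italians       # P(in mafia | Italian)

print(f"P(Italian | in mafia) = {p_italian_given_mafia:.3f}")   # ~0.900, "pretty high"
print(f"P(in mafia | Italian) = {p_mafia_given_italian:.6f}")   # ~0.0003, "almost zero"
```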
the REAL logic of hypothesis testing
We generally use this flawed logic (even
in the textbook):
- The probability of getting data as extreme
as this is very small if the null hypothesis is true.
- We DID get this data.
- Therefore the null hypothesis is probably
not true.
To see that it's flawed, consider this parallel
argument (noting, as above, that the population of the mafia is incredibly tiny
compared to the total number of Italians):
- The probability of this person being in
the mafia is very small if the person is Italian.
- This person IS in the mafia.
- Therefore the person is probably not Italian.
The first conclusion seems okay (especially
if you apply it concretely to a long run of coin flips all coming up heads), but
the second seems pretty clearly wrong. The reason is that the real logic of the
situation involves hidden assumptions: in the second example it's the assumption
that there are NOT a lot of other ways of being in the mafia without being Italian,
and in the first example it's the assumption that there ARE a lot of other ways
the data could occur without the null hypothesis being true. Perhaps the data are
even MORE likely under some of those other ways. But we don't identify or quantify
those other relevant probabilities; we just leave them lurking in the background,
unexamined, and conclude that at least one of them must be better than the null hypothesis
at accounting for our data. Typically the one we favor is our own research hypothesis
-- that our treatment does what we think it does. But that is nowhere near proven
by our hypothesis testing procedure.
what probability is in the mathematical
sense
Probability can mean two things:
mathematical probability is the long-run frequency of an event occurring out of
a large number of possible occurrences, and personal probability is a subjective
degree of confidence that an event will occur. There's no way to talk about the
mathematical probability of an event that only happens once, even though we might
talk informally about our degree of confidence or belief in, for example, the probability
of a horse winning a particular race, or of a particular experiment's hypothesis
test decision being correct or incorrect. We only consider the mathematical probability
of getting a certain sample drawn from a population from which we might have drawn
very many other possible samples. Then we use that probability to make a decision
about a null hypothesis, but we never know for sure if we made the right decision.
paired samples / related samples /
repeated measures t-test for repeated measures designs
To do a paired samples / related samples
/ repeated measures t-test, you do 3 things (only the first two are new; a short
sketch in code follows the steps):
1) Make
a column of "Difference Scores" for each participant, by subtracting variable
2 from variable 1 (i.e., before - after, time1 - time2, condition1 - condition2,
etc.) for each participant. Do the subtraction in the same order for everyone, of
course. FORGET the original scores now; you just have ONE sample of difference scores,
and you'll do a regular single sample t-test.
2) The
null hypothesis will be that in the population, the mean of the difference scores
is 0 (H0: μ=0) -- because that's the situation you'd have if there
were no difference between the two conditions. The first mean minus the second mean
would be zero, so for the difference scores you computed, the mean of those difference
scores is zero if there's no effect of the treatment.
3) Now
do your t-test like a regular single-sample t-test (NOT a two-sample or independent
samples t-test!). Degrees of freedom is number of DIFFERENCE SCORES minus one, not
total number of original scores minus one -- which means it is still number of participants
minus one, as in the single sample t-test, since each participant gives you ONE
difference score. If your t is big enough, i.e. if your p-value is small enough,
you'll reject that null hypothesis, meaning you'll decide that you don't think those
difference scores really come from a population of difference scores with a mean
of zero. And if the mean ISN'T zero, then the treatment that caused the difference
must really have an effect.
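Here is a minimal sketch of those three steps in code (the before/after scores are made up, and SciPy isn't required for this course; it's just a convenient way to check the arithmetic):

```python
import numpy as np
from scipy import stats

# Made-up before/after scores for 8 participants
before = np.array([12, 15, 11, 14, 13, 16, 12, 15])
after  = np.array([10, 14, 12, 11, 12, 13, 11, 12])

# Step 1: one column of difference scores (same order of subtraction for everyone)
diff = before - after

# Steps 2 and 3: single-sample t-test of the difference scores against mu = 0,
# with df = number of difference scores - 1 = 7
t, p = stats.ttest_1samp(diff, popmean=0)
print(f"t({len(diff) - 1}) = {t:.3f}, p = {p:.4f}")

# Equivalent shortcut: SciPy's paired-samples test gives the same t and p
t2, p2 = stats.ttest_rel(before, after)
print(f"t({len(diff) - 1}) = {t2:.3f}, p = {p2:.4f}")
```

Both calls give identical results because the paired test IS a single-sample t-test on the one column of difference scores.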
methodological considerations for both repeated
measures t-test and repeated measures ANOVA:
reasons for using
a repeated measures design:
1) It
uses fewer subjects.
2) The
subjects act as their own control group, eliminating individual differences as an
explanation of group mean differences.
3) Some
research is explicitly concerned with how subjects change over time, so we need
to be able to compare the same subjects' performance at two or more different times.
problems with
repeated measures designs:
1) practice
effects may improve performance over time
2) fatigue
effects may worsen performance over time
3) order
effects may cause performance on one condition to be influenced by the particular
previous condition(s) or task(s) that subject did
Design strategy: counterbalancing different orders of conditions (e.g., half the participants
get condition 1 first while the other half get condition 2 first) spreads these
problematic effects equally over both conditions so they don't cause confounds.
(This can be extended to three or more conditions in a design.)
assumptions of t-tests and ANOVA
The three fundamental assumptions of
t-tests and ANOVA are the same:
- independence
of observations
- normality of the scores (in each group),
though if n>30, sample means will be approximately normally distributed even if
the scores are not, thanks to the Central Limit Theorem
- homogeneity of variance in the populations
Checking assumptions:
- independence just requires good experimental
design
- SPSS can produce histograms to check normality
of each group; there are also Q-Q plots which work better but aren't covered in
PSYC 2100.
- we believe the populations have the same
variances if in the samples (groups) the biggest variance is no more than 4 times
the size of the smallest variance -- this rule applies best to equal sample sizes.
But significance tests can also be done on the variances to decide whether they're
(significantly) different; on those tests, the hope is for NON-significance. Of those
tests, the text's choice, Hartley's F-max, is the simplest but worst and will not appear
on the exam (nor will the best, the Brown-Forsythe test, which isn't in our textbook,
though I may mention it in class). SPSS instead uses Levene's test, and
all you need to know about it is that you want it to be NON-significant (p-value
GREATER than .05), in order to conclude that the variances are NOT significantly
different. If it IS significant, it says the variances ARE significantly different
in the population (that is, the assumption is violated), in which case you can still
use the SPSS results -- you just have to read the second line of the t-test output,
labeled "Equal variances NOT assumed", which uses new degrees of freedom
modified by a magical formula that you don't have to worry about. Most stats
experts these days actually suggest using that second line of the output labeled
"Equal variances NOT assumed" NO MATTER WHAT the Levene test says,
because it will give you fewer errors in the long run -- and that certainly
simplifies things.
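As a rough sketch of that workflow outside SPSS (made-up data; SciPy is not part of the course, but its levene and ttest_ind functions report the same quantities as the SPSS output):

```python
import numpy as np
from scipy import stats

# Made-up scores for two independent groups
group1 = np.array([23, 25, 21, 30, 28, 26, 24, 27])
group2 = np.array([31, 35, 29, 40, 38, 33, 30, 44])

# Quick check of the 4-to-1 rule of thumb on the sample variances
v1, v2 = group1.var(ddof=1), group2.var(ddof=1)
print(f"variance ratio = {max(v1, v2) / min(v1, v2):.2f}")

# Levene's test: we WANT this to be non-significant (p > .05)
lev_stat, lev_p = stats.levene(group1, group2)
print(f"Levene: W = {lev_stat:.3f}, p = {lev_p:.4f}")

# The "Equal variances not assumed" line: Welch's t-test (equal_var=False)
t, p = stats.ttest_ind(group1, group2, equal_var=False)
print(f"Welch t-test: t = {t:.3f}, p = {p:.4f}")
```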
For repeated measures ANOVA with three
or more levels:
- a
fourth assumption is involved, technically called "sphericity" but
sometimes referred to as "homogeneity of covariance" or
"homogeneity of correlation" or "compound symmetry". It's a
little complicated and won't be addressed in PSYC 2100. But for your information,
it says: If you have three repeated measures conditions for one group of
subjects, take each PAIR of conditions (i.e., 1 and 2, 1 and 3, 2 and 3) and
convert them to difference scores as in the repeated measures t-test above. Now
each subject has three difference scores, which we could label "1-2", "1-3",
and "2-3". If we do that for every subject, we have three columns of
difference scores, just like we initially had three columns of raw scores for
conditions 1, 2, and 3. Sphericity means that the variances of those three
difference score columns have to all be the same in the population. It's like
the homogeneity of variance assumption above, but applied to the difference
scores. (If there were four conditions, there would be more difference score
columns: 1-2, 1-3, 1-4, 2-3, 2-4, 3-4; if there were only two conditions as in
a t-test, there would be only one difference score column 1-2 and its variance
couldn't be different from itself, so the assumption couldn't be violated!)
This is a difficult assumption to meet, so SPSS has tests of the sphericity assumption,
and corrections for its violation, built into the output for Repeated Measures
ANOVA.
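For the curious, here is a small sketch (with made-up scores) of the quantity sphericity is about: the variances of the pairwise difference-score columns. In practice, SPSS's built-in test and corrections handle this for you.

```python
import numpy as np

# Made-up scores for 6 subjects in three repeated-measures conditions
cond1 = np.array([10, 12, 9, 14, 11, 13])
cond2 = np.array([11, 14, 10, 15, 13, 15])
cond3 = np.array([13, 15, 12, 18, 14, 17])

# Pairwise difference-score columns, as in the repeated measures t-test
d12 = cond1 - cond2
d13 = cond1 - cond3
d23 = cond2 - cond3

# Sphericity says these variances are all equal in the population
for label, d in [("1-2", d12), ("1-3", d13), ("2-3", d23)]:
    print(f"variance of difference scores {label}: {d.var(ddof=1):.2f}")
```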
effect size measures
r² and Cohen's d for t-tests;
note that for the two-sample t-test, the denominator of Cohen's d uses the pooled
standard deviation, which is just the square root of the pooled variance that gets
used in the t-test formula.
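A minimal sketch of that calculation with made-up data (the pooled variance here is computed exactly as in the two-sample t-test formula, SS1 + SS2 over df1 + df2):

```python
import numpy as np

# Made-up scores for two independent groups
group1 = np.array([23, 25, 21, 30, 28, 26, 24, 27])
group2 = np.array([31, 35, 29, 40, 38, 33, 30, 44])

n1, n2 = len(group1), len(group2)
ss1 = group1.var(ddof=1) * (n1 - 1)   # SS for group 1
ss2 = group2.var(ddof=1) * (n2 - 1)   # SS for group 2

# Pooled variance, exactly as in the two-sample t-test formula
pooled_var = (ss1 + ss2) / ((n1 - 1) + (n2 - 1))
pooled_sd = np.sqrt(pooled_var)

# Cohen's d: difference between the means divided by the pooled standard deviation
d = (group1.mean() - group2.mean()) / pooled_sd
print(f"pooled SD = {pooled_sd:.3f}, Cohen's d = {d:.3f}")
```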
R² for ANOVA (which is SSBET
/ SSTOT), the proportion of variance explained or accounted for; R²
is also known as η² (eta squared).
"Partial η²" in
SPSS is the same idea but using a slightly different denominator: it's SSBET
/ (SSBET + SSW/IN) (note that "SSW/IN"
is sometimes written as "SSERROR"). With just one factor
in the design, this is exactly the same thing as "complete η²"
above, because SSTOT = SSBET + SSW/IN, making
the denominators the same. But with more than one factor in the design, the
Between factor could be factor A, or B, or the interaction term A*B. For
instance, looking at factor B as the between factor of interest, partial η²
would be SSB / (SSB + SSW/IN). The reason it's
a better measure of effect size in that case is that we wouldn't expect factor
B to explain ANY of the variation due to factor A or the interaction AB, so we
might as well leave those two factors out of the denominator. In the example
just mentioned, SSTOT = (SSA + SSB + SSAB
+ SSW/IN), so leaving out the irrelevant parts of the denominator
just leaves (SSB + SSW/IN).
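A quick numerical illustration with invented SS values (nothing here comes from a real dataset; it just shows the two denominators):

```python
# Invented SS values for a two-factor design, just to show the arithmetic
ss_a, ss_b, ss_ab, ss_within = 40.0, 60.0, 20.0, 180.0
ss_total = ss_a + ss_b + ss_ab + ss_within    # 300.0

# "Complete" eta squared for factor B uses SS total in the denominator
eta_sq_b = ss_b / ss_total                    # 60 / 300 = 0.20

# Partial eta squared for factor B leaves A and A*B out of the denominator
partial_eta_sq_b = ss_b / (ss_b + ss_within)  # 60 / 240 = 0.25

print(f"eta squared (B) = {eta_sq_b:.3f}")
print(f"partial eta squared (B) = {partial_eta_sq_b:.3f}")
```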
confidence intervals
A confidence interval (CI) is a range of
values that estimates the location of the population mean based on your sample and
its variability. It's computed by turning the t formula inside out to sort of solve
for μ. The value of t you put into the expression depends on the confidence
interval you're constructing -- for 95% confidence, use the t value for α =
.05, for 99% confidence, use the t value for α = .01, etc. (and of course,
use the t with the appropriate degrees of freedom for your sample). The same logic
applies in constructing a confidence interval for the difference between
two means (or μ1-μ2).
SINGLE SAMPLE CASE:
95% Confidence Interval (single sample):
(M - t.05*sM) ≤ μ ≤ (M + t.05*sM)
(where sM is the estimated standard error of the mean and t.05 is the two-tailed
critical t for α = .05)
TWO INDEPENDENT SAMPLES CASE:
95% Confidence Interval (two independent
samples):
[(M1-M2) - t.05*s(M1-M2)] ≤ μ1-μ2 ≤ [(M1-M2) + t.05*s(M1-M2)]
(where s(M1-M2) is the estimated standard error of the difference between the two
means)
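A sketch of the single-sample case in code (made-up scores; scipy.stats.t.ppf simply looks up the same critical t you would find in the table for your df):

```python
import numpy as np
from scipy import stats

# Made-up sample
scores = np.array([101, 97, 105, 110, 94, 103, 99, 108, 102, 96])

n = len(scores)
m = scores.mean()
s_m = scores.std(ddof=1) / np.sqrt(n)         # estimated standard error of the mean

# Two-tailed critical t for alpha = .05 and df = n - 1 (same value as the table)
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 1)

lower, upper = m - t_crit * s_m, m + t_crit * s_m
print(f"95% CI for mu: [{lower:.2f}, {upper:.2f}]")

# For a 99% CI, use the .01 cutoff value of t with the same df instead
t_crit_99 = stats.t.ppf(1 - 0.01 / 2, df=n - 1)
print(f"99% CI for mu: [{m - t_crit_99 * s_m:.2f}, {m + t_crit_99 * s_m:.2f}]")
```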
what
a confidence interval tells you...
...about the population: A 95% confidence interval means that if
confidence intervals were calculated for 100 replications of the same experiment,
they'd all come out different, but 95 of them would be expected to contain the population
parameter value of the mean (μ). It makes no sense to say a particular confidence
interval has a 95% chance of containing the population parameter; it either does
or it doesn't. But if many confidence intervals are constructed using the same procedure,
we can talk about the long-run frequency of their containing μ.
...about the sample: A 95% confidence interval tells you that
any values outside of that interval would be significantly different (with p<.05)
from your sample value if you tested them as hypothesized population parameter values
of μ in a t-test. This means that if zero is inside the interval, your sample
statistic is not significantly different from zero; if zero lies outside the interval,
the sample statistic is significantly different from zero. If you want those decisions
to be based on p<.01 instead of .05, you'd construct a 99% CI (by putting the
.01 cutoff value of t with the same df into the formula).
...about
the precision of the sample mean as an estimate of the population mean: A wide confidence interval
indicates less precision; a narrow confidence interval indicates greater precision.
Even though confidence intervals offer the same information as hypothesis testing
and p-values mathematically, they are more useful in the sense of informally suggesting
how reliable your sample mean might be, apart from any interpretations in terms
of hypothesis testing.
post hoc tests
Many "post hoc" tests exist
for comparing means "after the fact", i.e., after the main ANOVA
result, but it's sufficient for us to cover the Bonferroni adjustment (referred
to as the "Dunn test" in the 8th Ed.) and the Tukey test (Ch. 12 p. 417;
8th Ed. p. 427). The exam will NOT cover the Scheffé test, even though it is in
the text.
The Bonferroni adjustment or correction
controls the overall Type I error rate while making multiple comparisons among groups
after an ANOVA. Doing an ANOVA doesn't tell us which groups are different from each
other, so we could follow it up with t-tests between all possible pairs to find
that out. If we compare all groups to each other when we have three groups, that's
three comparisons; with four groups there are six comparisons to make, and so on.
But if each comparison is done with α=.05, that would increase the overall
Type I error rate to (roughly) α=.15 if we did three comparisons, or α=.30 if we did
six. The Bonferroni correction simply says, let's keep the overall α=.05 by
dividing that .05 by the number of comparisons we're making. With three comparisons,
divide α by 3, which is .05/3 = .017; with six, it's .05/6 = .0083. For us
to consider a comparison significant now, p<.05 isn't good enough -- we'd need
a result that gave us p<.017 (or with six comparisons among four groups p<.0083).
If each comparison is done with α=.0083, then doing six increases the α
to .05 at the worst. We've controlled the error rate. However, requiring p<.0083
is a much harder criterion to reach, and we'll be more likely to NOT reject the
null hypothesis -- even if it's false and should be rejected! That is, the tradeoff
for limiting the Type I errors is that we're more likely to make a Type II error
of not recognizing a difference or a relationship when it's really there.
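The arithmetic, as a small sketch (the number of pairwise comparisons among k groups is k*(k-1)/2, which is what math.comb(k, 2) computes):

```python
from math import comb

alpha = 0.05
for k_groups in (3, 4, 5):
    n_comparisons = comb(k_groups, 2)         # all possible pairs of groups
    worst_case_alpha = n_comparisons * alpha  # rough upper bound if uncorrected
    bonferroni_alpha = alpha / n_comparisons  # per-comparison criterion
    print(f"{k_groups} groups: {n_comparisons} comparisons, "
          f"uncorrected worst case ≈ {worst_case_alpha:.2f}, "
          f"Bonferroni per-test alpha = {bonferroni_alpha:.4f}")
```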
The Tukey HSD test is a way of finding
a compromise between controlling Type I errors and Type II errors. "HSD"
stands for "Honestly Significant Difference" and it tells you how far
apart any two means have to be to be significantly different from one another. If
the HSD is 4, then means that are at least 4 points apart (e.g., 6 and 11, or 11
and 18) are significantly different but those that are less than four points apart
(e.g., 6 and 3, or 11 and 13) are not. To get the value of the HSD, divide MSW/IN
by the group size, take the square root of that, and multiply by a number called
"q" or the "Studentized Range Statistic", which you get from
a table in the back of the book. To look up q you just have to know what your alpha
is (probably .05), what your denominator df are ("df for Error Term"),
and how many group means you're comparing. In other words, HSD = q * square root of
(MSW/IN divided by sample size n); once you've computed that, you just check to see
which of your means are at least that far apart. For the exam you may not have to calculate Tukey's
HSD or even look up q in the table, but you definitely have to know how to use the
HSD value, as described above.
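A small sketch of using an HSD value (q, MSW/IN, n, and the group means below are all invented, as if q had already been looked up in the table for your alpha, number of means, and error df):

```python
from itertools import combinations
from math import sqrt

# Invented values, as if read off an ANOVA table and the q table in the book
q = 3.77          # Studentized Range Statistic for your alpha, k means, error df
ms_within = 8.0   # MS within (error) from the ANOVA
n = 10            # number of scores per group

hsd = q * sqrt(ms_within / n)
print(f"HSD = {hsd:.2f}")

# Check every pair of (made-up) group means against the HSD
means = {"Group 1": 6.0, "Group 2": 11.0, "Group 3": 13.0}
for (name1, m1), (name2, m2) in combinations(means.items(), 2):
    verdict = "significant" if abs(m1 - m2) >= hsd else "not significant"
    print(f"{name1} vs {name2}: difference = {abs(m1 - m2):.1f} -> {verdict}")
```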