Note: Mentions of exams
refer to the exam that covers the topic being described; some of these topics
are introduced early and would appear on the first exam, some appear on the
second, and some aren't discussed till the end of the semester and therefore
obviously won't be tested until the final exam. You will be given explicit
information about which topics are on which exams. This page is simply a
convenient collection of some useful points.
what p-values mean
- The p-value at the end of a hypothesis test
tells you the probability of getting the data you got or data more extreme than
that, given that the null hypothesis is true. It does NOT tell you the probability
that the null hypothesis (or any other hypothesis) is true (or false), given the
data that you got.
- To see why that is, consider how different
those two probabilities could be, since they really have nothing to do with one
another. The analogy is that the probability of being Italian given that you're
in the mafia is pretty high, but the probability of being in the mafia given that
you're Italian is almost zero.
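To put rough numbers on that analogy (the counts below are entirely made up, just to make the two conditional probabilities concrete), here is a quick sketch:

```python
# Hypothetical counts, invented purely for illustration; the point is the asymmetry
# between P(A given B) and P(B given A).
italians = 60_000_000           # people who are Italian
mafia_members = 20_000          # people in the mafia
italian_mafia_members = 18_000  # people who are both

p_italian_given_mafia = italian_mafia_members / mafia_members  # P(Italian | in mafia)
p_mafia_given_italian = italian_mafia_members / italians       # P(in mafia | Italian)

print(f"P(Italian | in mafia) = {p_italian_given_mafia:.3f}")   # ~0.900, "pretty high"
print(f"P(in mafia | Italian) = {p_mafia_given_italian:.6f}")   # ~0.0003, "almost zero"
```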
the REAL logic of hypothesis testing
We generally use this flawed logic (even
in the textbook):
- The probability of getting data as extreme
as this is very small if the null hypothesis is true.
- We DID get this data.
- Therefore the null hypothesis is probably
not true.
To see that it's flawed, consider this parallel
argument (noting, as above, that the population of the mafia is incredibly tiny
compared to the total number of Italians):
- The probability of this person being in
the mafia is very small if the person is Italian.
- This person IS in the mafia.
- Therefore the person is probably not Italian.
The first conclusion seems okay (especially
if you apply it concretely to a long run of coin flips all coming up heads), but
the second seems pretty clearly wrong. The reason is that the real logic of the
situation involves hidden assumptions: in the second example it's the assumption
that there are NOT a lot of other ways of being in the mafia without being Italian,
and in the first example it's the assumption that there ARE a lot of other ways
the data could occur without the null hypothesis being true. Perhaps the data are
even MORE likely under some of those other ways. But we don't identify or quantify
those other relevant probabilities; we just leave them lurking in the background,
unexamined, and conclude that at least one of them must be better than the null hypothesis
at accounting for our data. Typically the one we favor is our own research hypothesis
-- that our treatment does what we think it does. But that is nowhere near proven
by our hypothesis testing procedure.
what probability is in the mathematical
sense
Probability can mean two things:
mathematical probability is the long-run frequency of an event occurring out of
a large number of possible occurrences, and personal probability is a subjective
degree of confidence that an event will occur. There's no way to talk about the
mathematical probability of an event that only happens once, even though we might
talk informally about our degree of confidence or belief in, for example, the probability
of a horse winning a particular race, or of a particular experiment's hypothesis
test decision being correct or incorrect. We only consider the mathematical probability
of getting a certain sample drawn from a population from which we might have drawn
very many other possible samples. Then we use that probability to make a decision
about a null hypothesis, but we never know for sure if we made the right decision.
paired samples / related samples /
repeated measures t-test for repeated measures designs
To do a paired samples / related samples
/ repeated measures t-test, you do 3 things (only the first two are new; a short
sketch in code follows the steps):
1) Make
a column of "Difference Scores" for each participant, by subtracting variable
2 from variable 1 (i.e., before - after, time1 - time2, condition1 - condition2,
etc.) for each participant. Do the subtraction in the same order for everyone, of
course. FORGET the original scores now; you just have ONE sample of difference scores,
and you'll do a regular single sample t-test.
2) The
null hypothesis will be that in the population, the mean of the difference scores
is 0 (H0: μ=0) -- because that's the situation you'd have if there
were no difference between the two conditions. The first mean minus the second mean
would be zero, so for the difference scores you computed, the mean of those difference
scores is zero if there's no effect of the treatment.
3) Now
do your t-test like a regular single-sample t-test (NOT a two-sample or independent
samples t-test!). Degrees of freedom is number of DIFFERENCE SCORES minus one, not
total number of original scores minus one -- which means it is still number of participants
minus one, as in the single sample t-test, since each participant gives you ONE
difference score. If your t is big enough, i.e. if your p-value is small enough,
you'll reject that null hypothesis, meaning you'll decide that you don't think those
difference scores really come from a population of difference scores with a mean
of zero. And if the mean ISN'T zero, then the treatment that caused the difference
must really have an effect.
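Here is a minimal sketch of those three steps in code (the before/after scores are made up, and SciPy isn't required for this course; it's just a convenient way to check the arithmetic):

```python
import numpy as np
from scipy import stats

# Made-up before/after scores for 8 participants
before = np.array([12, 15, 11, 14, 13, 16, 12, 15])
after  = np.array([10, 14, 12, 11, 12, 13, 11, 12])

# Step 1: one column of difference scores (same order of subtraction for everyone)
diff = before - after

# Steps 2 and 3: single-sample t-test of the difference scores against mu = 0,
# with df = number of difference scores - 1 = 7
t, p = stats.ttest_1samp(diff, popmean=0)
print(f"t({len(diff) - 1}) = {t:.3f}, p = {p:.4f}")

# Equivalent shortcut: SciPy's paired-samples test gives the same t and p
t2, p2 = stats.ttest_rel(before, after)
print(f"t({len(diff) - 1}) = {t2:.3f}, p = {p2:.4f}")
```

Both calls give identical results because the paired test IS a single-sample t-test on the one column of difference scores.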
methodological considerations for both repeated
measures t-test and repeated measures ANOVA:
reasons for using
a repeated measures design:
1) It
uses fewer subjects.
2) The
subjects act as their own control group, eliminating individual differences as an
explanation of group mean differences.
3) Some
research is explicitly concerned with how subjects change over time, so we need
to be able to compare the same subjects' performance at two or more different times.
problems with
repeated measures designs:
1) practice
effects may improve performance over time
2) fatigue
effects may worsen performance over time
3) order
effects may cause performance on one condition to be influenced by the particular
previous condition(s) or task(s) that subject did
Design strategy: counterbalancing different orders of conditions (e.g., half the participants
get condition 1 first while the other half get condition 2 first) spreads these
problematic effects equally over both conditions so they don't cause confounds.
(This can be extended to three or more conditions in a design.)
assumptions of t-tests and ANOVA
The three fundamental assumptions of
t-tests and ANOVA are the same:
- independence
of observations
- normality of the scores (in each group),
though if n>30, sample means will be approximately normally distributed even if
the scores are not, thanks to the Central Limit Theorem
- homogeneity of variance in the populations
Checking assumptions:
- independence just requires good experimental
design
- SPSS can produce histograms to check normality
of each group; there are also Q-Q plots which work better but aren't covered in
PSYC 2100.
- we believe the populations have the same
variances if in the samples (groups) the biggest variance is no more than 4 times
the size of the smallest variance -- this rule applies best to equal sample sizes.
But significance tests can also be done on the variances to decide whether they're
(significantly) different; on those tests, the hope is for NON-significance. Of those
tests, the text's choice, Hartley's F-max, is the simplest but worst and will not appear
on the exam (nor will the best, the Brown-Forsythe test, which isn't in our textbook,
though I may mention it in class). SPSS instead uses Levene's test, and
all you need to know about it is that you want it to be NON-significant (p-value
GREATER than .05), in order to conclude that the variances are NOT significantly
different. If it IS significant, it says the variances ARE significantly different
in the population (that is, the assumption is violated), in which case you can still
use the SPSS results -- you just have to read the second line of the t-test output,
labeled "Equal variances NOT assumed", which uses new degrees of freedom
modified by a magical formula that you don't have to worry about. Most stats
experts these days actually suggest using that second line of the output labeled
"Equal variances NOT assumed" NO MATTER WHAT the Levene test says,
because it will give you fewer errors in the long run -- and that certainly
simplifies things.
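As a rough sketch of that workflow outside SPSS (made-up data; SciPy is not part of the course, but its levene and ttest_ind functions report the same quantities as the SPSS output):

```python
import numpy as np
from scipy import stats

# Made-up scores for two independent groups
group1 = np.array([23, 25, 21, 30, 28, 26, 24, 27])
group2 = np.array([31, 35, 29, 40, 38, 33, 30, 44])

# Quick check of the 4-to-1 rule of thumb on the sample variances
v1, v2 = group1.var(ddof=1), group2.var(ddof=1)
print(f"variance ratio = {max(v1, v2) / min(v1, v2):.2f}")

# Levene's test: we WANT this to be non-significant (p > .05)
lev_stat, lev_p = stats.levene(group1, group2)
print(f"Levene: W = {lev_stat:.3f}, p = {lev_p:.4f}")

# The "Equal variances not assumed" line: Welch's t-test (equal_var=False)
t, p = stats.ttest_ind(group1, group2, equal_var=False)
print(f"Welch t-test: t = {t:.3f}, p = {p:.4f}")
```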
For repeated measures ANOVA with three
or more levels:
- a
fourth assumption is involved, technically called "sphericity" but
sometimes referred to as "homogeneity of covariance" or
"homogeneity of correlation" or "compound symmetry". It's a
little complicated and won't be addressed in PSYC 2100. But for your information,
it says: If you have three repeated measures conditions for one group of
subjects, take each PAIR of conditions (i.e., 1 and 2, 1 and 3, 2 and 3) and
convert them to difference scores as in the repeated measures t-test above. Now
each subject has three difference scores, which we could label "1-2", "1-3",
and "2-3". If we do that for every subject, we have three columns of
difference scores, just like we initially had three columns of raw scores for
conditions 1, 2, and 3. Sphericity means that the variances of those three
difference score columns have to all be the same in the population. It's like
the homogeneity of variance assumption above, but applied to the difference
scores. (If there were four conditions, there would be more difference score
columns: 1-2, 1-3, 1-4, 2-3, 2-4, 3-4; if there were only two conditions as in
a t-test, there would be only one difference score column 1-2 and its variance
couldn't be different from itself, so the assumption couldn't be violated!)
This is a difficult assumption to meet, so SPSS has tests of the sphericity assumption,
and corrections for its violation, built into the output for Repeated Measures
ANOVA.
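For the curious, here is a small sketch (with made-up scores) of the quantity sphericity is about: the variances of the pairwise difference-score columns. In practice, SPSS's built-in test and corrections handle this for you.

```python
import numpy as np

# Made-up scores for 6 subjects in three repeated-measures conditions
cond1 = np.array([10, 12, 9, 14, 11, 13])
cond2 = np.array([11, 14, 10, 15, 13, 15])
cond3 = np.array([13, 15, 12, 18, 14, 17])

# Pairwise difference-score columns, as in the repeated measures t-test
d12 = cond1 - cond2
d13 = cond1 - cond3
d23 = cond2 - cond3

# Sphericity says these variances are all equal in the population
for label, d in [("1-2", d12), ("1-3", d13), ("2-3", d23)]:
    print(f"variance of difference scores {label}: {d.var(ddof=1):.2f}")
```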
effect size measures
r² and Cohen's d for t-tests;
note that for the two-sample t-test, the denominator of Cohen's d uses the pooled
standard deviation, which is just the square root of the pooled variance that gets
used in the t-test formula.
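A minimal sketch of that calculation with made-up data (the pooled variance here is computed exactly as in the two-sample t-test formula, SS1 + SS2 over df1 + df2):

```python
import numpy as np

# Made-up scores for two independent groups
group1 = np.array([23, 25, 21, 30, 28, 26, 24, 27])
group2 = np.array([31, 35, 29, 40, 38, 33, 30, 44])

n1, n2 = len(group1), len(group2)
ss1 = group1.var(ddof=1) * (n1 - 1)   # SS for group 1
ss2 = group2.var(ddof=1) * (n2 - 1)   # SS for group 2

# Pooled variance, exactly as in the two-sample t-test formula
pooled_var = (ss1 + ss2) / ((n1 - 1) + (n2 - 1))
pooled_sd = np.sqrt(pooled_var)

# Cohen's d: difference between the means divided by the pooled standard deviation
d = (group1.mean() - group2.mean()) / pooled_sd
print(f"pooled SD = {pooled_sd:.3f}, Cohen's d = {d:.3f}")
```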
R² for ANOVA (which is SSBET
/ SSTOT), the proportion of variance explained or accounted for; R²
is also known as η² (eta squared).
"Partial η²" in
SPSS is the same idea but using a slightly different denominator: it's SSBET
/ (SSBET + SSW/IN) (note that "SSW/IN"
is sometimes written as "SSERROR"). With just one factor
in the design, this is exactly the same thing as "complete η²"
above, because SSTOT = SSBET + SSW/IN, making
the denominators the same. But with more than one factor in the design, the
Between factor could be factor A, or B, or the interaction term A*B. For
instance, looking at factor B as the between factor of interest, partial η²
would be SSB / (SSB + SSW/IN). The reason it's
a better measure of effect size in that case is that we wouldn't expect factor
B to explain ANY of the variation due to factor A or the interaction AB, so we
might as well leave those two factors out of the denominator. In the example
just mentioned, SSTOT = (SSA + SSB + SSAB
+ SSW/IN), so leaving out the irrelevant parts of the denominator
just leaves (SSB + SSW/IN).
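A quick numerical illustration with invented SS values (nothing here comes from a real dataset; it just shows the two denominators):

```python
# Invented SS values for a two-factor design, just to show the arithmetic
ss_a, ss_b, ss_ab, ss_within = 40.0, 60.0, 20.0, 180.0
ss_total = ss_a + ss_b + ss_ab + ss_within    # 300.0

# "Complete" eta squared for factor B uses SS total in the denominator
eta_sq_b = ss_b / ss_total                    # 60 / 300 = 0.20

# Partial eta squared for factor B leaves A and A*B out of the denominator
partial_eta_sq_b = ss_b / (ss_b + ss_within)  # 60 / 240 = 0.25

print(f"eta squared (B) = {eta_sq_b:.3f}")
print(f"partial eta squared (B) = {partial_eta_sq_b:.3f}")
```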
confidence intervals
A confidence interval (CI) is a range of
values that estimates the location of the population mean based on your sample and
its variability. It's computed by turning the t formula inside out to sort of solve
for μ. The value of t you put into the expression depends on the confidence
interval you're constructing -- for 95% confidence, use the t value for α =
.05, for 99% confidence, use the t value for α = .01, etc. (and of course,
use the t with the appropriate degrees of freedom for your sample). The same logic
applies in constructing a confidence interval for the difference between
two means (or μ1-μ2).
SINGLE SAMPLE CASE:
95% Confidence Interval (single sample):
(M - t.05*sM) ≤ μ ≤ (M + t.05*sM)
(where sM is the estimated standard error of the mean and t.05 is the two-tailed
critical t for α = .05)
TWO INDEPENDENT SAMPLES CASE:
95% Confidence Interval (two independent
samples):
[(M1-M2) - t.05*s(M1-M2)] ≤ μ1-μ2 ≤ [(M1-M2) + t.05*s(M1-M2)]
(where s(M1-M2) is the estimated standard error of the difference between the two
means)
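A sketch of the single-sample case in code (made-up scores; scipy.stats.t.ppf simply looks up the same critical t you would find in the table for your df):

```python
import numpy as np
from scipy import stats

# Made-up sample
scores = np.array([101, 97, 105, 110, 94, 103, 99, 108, 102, 96])

n = len(scores)
m = scores.mean()
s_m = scores.std(ddof=1) / np.sqrt(n)         # estimated standard error of the mean

# Two-tailed critical t for alpha = .05 and df = n - 1 (same value as the table)
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 1)

lower, upper = m - t_crit * s_m, m + t_crit * s_m
print(f"95% CI for mu: [{lower:.2f}, {upper:.2f}]")

# For a 99% CI, use the .01 cutoff value of t with the same df instead
t_crit_99 = stats.t.ppf(1 - 0.01 / 2, df=n - 1)
print(f"99% CI for mu: [{m - t_crit_99 * s_m:.2f}, {m + t_crit_99 * s_m:.2f}]")
```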
what
a confidence interval tells you...
...about the population: A 95% confidence interval means that if
confidence intervals were calculated for 100 replications of the same experiment,
they'd all come out different, but 95 of them would be expected to contain the population
parameter value of the mean (μ). It makes no sense to say a particular confidence
interval has a 95% chance of containing the population parameter; it either does
or it doesn't. But if many confidence intervals are constructed using the same procedure,
we can talk about the long-run frequency of their containing μ.
...about the sample: A 95% confidence interval tells you that
any values outside of that interval would be significantly different (with p<.05)
from your sample value if you tested them as hypothesized population parameter values
of μ in a t-test. This means that if zero is inside the interval, your sample
statistic is not significantly different from zero; if zero lies outside the interval,
the sample statistic is significantly different from zero. If you want those decisions
to be based on p<.01 instead of .05, you'd construct a 99% CI (by putting the
.01 cutoff value of t with the same df into the formula).
...about
the precision of the sample mean as an estimate of the population mean: A wide confidence interval
indicates less precision; a narrow confidence interval indicates greater precision.
Even though confidence intervals offer the same information as hypothesis testing
and p-values mathematically, they are more useful in the sense of informally suggesting
how reliable your sample mean might be, apart from any interpretations in terms
of hypothesis testing.
post hoc tests
Many "post hoc" tests exist
for comparing means "after the fact", i.e., after the main ANOVA
result, but it's sufficient for us to cover the Bonferroni adjustment (referred
to as the "Dunn test" in the 8th Ed.) and the Tukey test (Ch. 12 p. 417;
8th Ed. p. 427). The exam will NOT cover the Scheffé test, even though it is in
the text.
The Bonferroni adjustment or correction
controls the overall Type I error rate while making multiple comparisons among groups
after an ANOVA. Doing an ANOVA doesn't tell us which groups are different from each
other, so we could follow it up with t-tests between all possible pairs to find
that out. If we compare all groups to each other when we have three groups, that's
three comparisons; with four groups there are six comparisons to make, and so on.
But if each comparison is done with α=.05, that would increase the overall
Type I error rate to (roughly) α=.15 if we did three comparisons, or α=.30 if we did
six. The Bonferroni correction simply says, let's keep the overall α=.05 by
dividing that .05 by the number of comparisons we're making. With three comparisons,
divide α by 3, which is .05/3 = .017; with six, it's .05/6 = .0083. For us
to consider a comparison significant now, p<.05 isn't good enough -- we'd need
a result that gave us p<.017 (or with six comparisons among four groups p<.0083).
If each comparison is done with α=.0083, then doing six increases the α
to .05 at the worst. We've controlled the error rate. However, requiring p<.0083
is a much harder criterion to reach, and we'll be more likely to NOT reject the
null hypothesis -- even if it's false and should be rejected! That is, the tradeoff
for limiting the Type I errors is that we're more likely to make a Type II error
of not recognizing a difference or a relationship when it's really there.
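The arithmetic, as a small sketch (the number of pairwise comparisons among k groups is k*(k-1)/2, which is what math.comb(k, 2) computes):

```python
from math import comb

alpha = 0.05
for k_groups in (3, 4, 5):
    n_comparisons = comb(k_groups, 2)         # all possible pairs of groups
    worst_case_alpha = n_comparisons * alpha  # rough upper bound if uncorrected
    bonferroni_alpha = alpha / n_comparisons  # per-comparison criterion
    print(f"{k_groups} groups: {n_comparisons} comparisons, "
          f"uncorrected worst case ≈ {worst_case_alpha:.2f}, "
          f"Bonferroni per-test alpha = {bonferroni_alpha:.4f}")
```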
The Tukey HSD test is a way of finding
a compromise between controlling Type I errors and Type II errors. "HSD"
stands for "Honestly Significant Difference" and it tells you how far
apart any two means have to be to be significantly different from one another. If
the HSD is 4, then means that are at least 4 points apart (e.g., 6 and 11, or 11
and 18) are significantly different but those that are less than four points apart
(e.g., 6 and 3, or 11 and 13) are not. To get the value of the HSD, divide MSW/IN
by the group size, take the square root of that, and multiply by a number called
"q" or the "Studentized Range Statistic", which you get from
a table in the back of the book. To look up q you just have to know what your alpha
is (probably .05), what your denominator df are ("df for Error Term"),
and how many group means you're comparing. In other words, HSD = q * square root of
(MSW/IN divided by sample size n); once you've computed that, you just check to see
which of your means are at least that far apart. For the exam you may not have to calculate Tukey's
HSD or even look up q in the table, but you definitely have to know how to use the
HSD value, as described above.
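A small sketch of using an HSD value (q, MSW/IN, n, and the group means below are all invented, as if q had already been looked up in the table for your alpha, number of means, and error df):

```python
from itertools import combinations
from math import sqrt

# Invented values, as if read off an ANOVA table and the q table in the book
q = 3.77          # Studentized Range Statistic for your alpha, k means, error df
ms_within = 8.0   # MS within (error) from the ANOVA
n = 10            # number of scores per group

hsd = q * sqrt(ms_within / n)
print(f"HSD = {hsd:.2f}")

# Check every pair of (made-up) group means against the HSD
means = {"Group 1": 6.0, "Group 2": 11.0, "Group 3": 13.0}
for (name1, m1), (name2, m2) in combinations(means.items(), 2):
    verdict = "significant" if abs(m1 - m2) >= hsd else "not significant"
    print(f"{name1} vs {name2}: difference = {abs(m1 - m2):.1f} -> {verdict}")
```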