
STAT 379 sec 01
QUANTITATIVE METHODS IN THE BEHAVIORAL SCIENCES, Spring 2008
UConn Storrs Campus, BOUS 160
MON WED 10:00-11:30
Eric Lundquist

[Figures: Galton's regression line; regression to the mean]
Galton, Francis (1886). Regression towards mediocrity in hereditary stature. Journal of the Anthropological Institute 15, 246-263.

Office: BOUS 136
Office Hours: Mon Wed 4:00-5:00, and by appointment
Phone: (860) 486-4084
E-mail: Eric.Lundquist@uconn.edu

Teaching Assistants:

Donald Edmondson
Office: BOUS 212
Office Hours: Tue 1:00-3:00, and by appointment
E-mail: Donald.Edmondson@uconn.edu

Carissa Gross
Office: BOUS 366
Office Hours: Fri 1:30-3:30, and by appointment
E-mail: Carissa.Gross@uconn.edu


READING:
  1. Keith, Timothy Z. (2006). Multiple Regression and Beyond. Allyn & Bacon. ISBN: 0205326447 (ISBN-13: 9780205326440)
  2. Grimm, Lawrence G. and Yarnold, Paul R., eds. (1994). Reading and Understanding Multivariate Statistics. APA. ISBN: 1-55798-273-2 (ISBN-13: 978-1-55798-273-5)
  3. On-Line Readings and Reserve Readings (see below)

GRADING:
   
  • Homework: 30%   assigned weekly
  • Midterm: 35%   WEDNESDAY MARCH 5, 10:00 AM
  • Final: 35%   WEDNESDAY MAY 7, 10:00 AM


    TOPICS AND READING ASSIGNMENTS: to be updated throughout the semester
    K = Keith; GY = Grimm & Yarnold

    CLASS SYLLABUS in Microsoft Word format, should you lose your original. Note that the schedule below may be considerably modified from the version that appears there.

    TOPIC / READING
    Introduction K 491-510 [App. E, review of basic statistics concepts -- and optionally Howell Ch.4 and Howell Ch.7 for additional information about these concepts]
    Tabachnick & Fidell (2004) table on the General Linear Model
    Revised Summary of Techniques in the General Linear Model in HTML format, Microsoft Word format, and PDF format.
    Correlation K 499-505 [section of App. E on the correlation coefficient, with a different derivation and formula than the one I use, which is:
    r = covxy / (sx*sy), where covxy = SPxy / (N-1), and SPxy = Σ(X-Mx)(Y-My)] (a short computational sketch of this formula appears just after the correlation readings below)
    Keppel & Wickens (2004) excerpt on correlation coefficient
    Correlation article in Wikipedia: whether or not the math explained here is of interest (correlations as cosines, etc.), the two images depicting sets of scatterplots are very important to understand.
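    For anyone who wants to verify the formula above by hand outside SPSS or Excel, here is a minimal Python sketch; the numbers are made up purely for illustration, and the check against numpy's built-in corrcoef is just a sanity test of the arithmetic, not part of the course materials.

        import numpy as np

        # Hypothetical paired scores; any two variables measured on the same cases will do.
        x = np.array([64.0, 66.0, 67.0, 68.0, 70.0, 71.0, 73.0])
        y = np.array([65.0, 66.5, 66.0, 68.5, 69.0, 70.0, 71.5])

        n = len(x)
        sp_xy = np.sum((x - x.mean()) * (y - y.mean()))   # SPxy = Σ(X - Mx)(Y - My)
        cov_xy = sp_xy / (n - 1)                          # covxy = SPxy / (N - 1)
        sx, sy = x.std(ddof=1), y.std(ddof=1)             # sample standard deviations

        r = cov_xy / (sx * sy)                            # r = covxy / (sx * sy)
        print(round(r, 4))
        print(round(np.corrcoef(x, y)[0, 1], 4))          # should match the hand calculation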
    Reliability Overview of reliability theory, which makes use of correlation, from William Trochim's on-line Research Methods Knowledge Base. This is a very basic exposition but sufficient for an introduction.
    Notes on Cohen's Kappa and its relationship to Pearson's Chi-Square: describes the logic and calculation of each and how to obtain them in SPSS. IMPORTANT: this is NOT a legitimate application of the chi-square statistic, as the example requires the same case to appear in multiple categories (cases are rated twice) and this violates chi-square's independence assumption. The example is appropriate for Kappa, though.
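    As a rough supplement to those notes (which cover the SPSS route), here is a minimal Python sketch of the kappa calculation itself, worked from a hypothetical 2x2 table of two raters' category assignments; the counts are invented for illustration only.

        import numpy as np

        # Hypothetical agreement table: rows = rater 1's categories, columns = rater 2's.
        table = np.array([[20.0,  5.0],
                          [10.0, 15.0]])

        n = table.sum()
        p_observed = np.trace(table) / n          # proportion of cases on which the raters agree
        row_marg = table.sum(axis=1) / n          # rater 1's category proportions
        col_marg = table.sum(axis=0) / n          # rater 2's category proportions
        p_expected = np.sum(row_marg * col_marg)  # agreement expected by chance

        kappa = (p_observed - p_expected) / (1 - p_expected)
        print(round(kappa, 3))                    # 0.4 for these made-up counts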
    Simple Regression K Ch. 1 [simple (one-predictor) regression; calculation of regression line intercept and coefficients; significance tests for prediction equation and individual predictor; unstandardized (b) vs. standardized (beta) coefficients; confidence intervals (but note that the description on p. 11 is not completely accurate -- see Notes on Confidence Intervals and compare to Howell Ch.7 p. 182)]
    Multiple Regression: two predictors K Ch. 2 [multiple correlation coefficient; interpretation of a and b's; experimental control vs. statistical control; partial and semi-partial correlation (covered in more detail in App. D, but in a way that may not be completely accessible to you at this point, and which employs the conventions of Table 10.17 on p. 235); interpretation of b vs. beta (see especially Table 2.1 on p. 36); formulas for beta, b, and R for the two-predictor case]
    Multiple Regression: more on the two-predictor case K Ch. 3 [interpretation of R-square with Venn diagrams]
    Multiple Regression: the general case K Ch. 4 [dependency of b-weights on other variables included in the equation; prediction and explanation]
    K Ch. 5 p. 96 cross-validation; p. 97 adjusted R-square [with a different formula that is algebraically equivalent to the one I use, which is:
    R2 adj = 1 - (1 - R2)(N - 1)/(N - k - 1)]
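    To see that formula in action, here is a minimal Python sketch; the R-square, sample size, and number of predictors are invented for illustration.

        def adjusted_r_square(r_square, n, k):
            """Adjusted R-square: 1 - (1 - R^2)(N - 1)/(N - k - 1)."""
            return 1 - (1 - r_square) * (n - 1) / (n - k - 1)

        # Hypothetical example: R^2 = .40 from N = 50 cases and k = 3 predictors.
        print(round(adjusted_r_square(0.40, 50, 3), 4))   # 0.3609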
    Multiple Regression: simultaneous, sequential, and stepwise procedures K Ch. 5 [simultaneous regression means business as usual; sequential regression and F for increment to R-square; why you'll never use stepwise regression; adjusted R-square; cross-validation]
    MIDTERM REVIEW interim summary
    Lundquist STAT 379 midterm spring07.doc
    Cillessen STAT 379 midterms going back to 2001
    Katz STAT 379 midterms from before 2001
    Multiple Regression: categorical predictors K Ch. 6 [dummy coding, effect coding; relation of df in ANOVA and regression]
    Multiple Regression: categorical and continuous predictors K Ch. 7 [introduction to cross product variables and interaction in regression]
    ANCOVA K pp. 155-156 [describes ANCOVA as an instance of a regression employing both continuous and categorical predictors (Ch. 7)]
    Multiple Regression: interactions with continuous variables K Ch. 8 [moderation and mediation; curvilinear relationships]
    Powerpoint slides on mediation and moderation for printing; to view original slides click here.
    Dave Kenny's web page on mediation
    Howell Ch.15 Excerpts on Mediation & Moderation, and Logistic Regression [the earlier part is an alternative and/or supplement to the Keith pages and other listed resources on mediation and moderation]
    Preacher and Leonardelli's calculator for the Sobel z-test for partial mediation
    Notes on Interpreting Regression Interactions (Microsoft Word format)
    Notes on Mediation and SPSS (Microsoft Word format)
    Logistic Regression GY Ch. 7; K pp. 205-207
    Howell Ch.15 Excerpts on Mediation & Moderation, and Logistic Regression [the latter part is an alternative and/or supplement to the Grimm and Yarnold chapter on Logistic Regression]
    Notes On Logistic Regression, not in finished form and therefore retaining some rambling and redundancy, but hopefully useful nonetheless
    Multiple Regression: assumptions, diagnostics, other issues K Ch. 9
    Canonical Correlation Thompson, B. (2000): chapter from Grimm and Yarnold (2000) on Canonical Correlation.
    Principal Components / Factor Analysis GY Ch. 4; K pp. 305-306
    Factor Analysis by David Garson at NC State: this is very good and consistent with the Bryant and Yarnold chapter (GY Ch. 4), but more detailed on Exploratory Factor Analysis. The rest of his statistics web page is really good too; keep it in mind next time you're looking for a resource.
  • Linked from that web page is his annotated SPSS factor analysis output, which is incredibly useful. The data set, so far as I can tell, is about people asked to rate their liking for various types of music, presumably on a scale of 1 = "like very much" to 7 = "don't like at all".
    Factor Analysis by Richard Darlington -- another very authoritative web page for those who want yet another perspective.
    Some concise and helpful PowerPoint slides on Factor Analysis by Alan Pickering of Goldsmiths College, University Of London. These are actually in Word format.
    Consumer Rated Assessment Procedure, from the Journal Of Irreproducible Results; maybe they're satirizing the MMPI, or maybe they're saying that some researchers use factor analysis to make their gobbledygook look more impressive, or maybe they were just bored in the lab one night. I find it pretty funny.
    MANOVA GY Ch. 8
    Discriminant Function Analysis GY Ch. 9
    Multiway Frequency Analysis GY Ch. 6
    Path Analysis / Structural Equation Modeling K Ch. 10 and 11; GY Ch. 3
    FINAL EXAM REVIEW summary
    (note that logistic regression is treated in a separate document)
    Lundquist STAT 379 final spring07.doc
    Cillessen STAT 379 finals going back to 2001
    Katz STAT 379 finals from before 2001


    HOMEWORK ASSIGNMENTS: to be updated throughout the semester
    1. HW1 due 1/30; SPSS formatted data available here
    2. HW2 due 2/6; SPSS formatted data available here
    3. HW3 due 2/13; SPSS formatted data available here: HW3s08a.sav and HW3s08b.sav
    4. HW4 due 2/27; SPSS formatted data available here
    5. HW5 due 3/26; SPSS formatted data available here
    6. HW6 due 4/2; SPSS formatted data available here
    7. HW7 due 4/9; SPSS formatted data available here
    8. HW8 due Friday 4/18; SPSS formatted data available here
    9. HW9 due 4/30; SPSS formatted data available here
    10. OPTIONAL EXTRA CREDIT HW10 due MONDAY 5/12; SPSS formatted data available here


    NOTES AND RESOURCES

    STAT 3115 web page: Includes resources on some topics that are covered in the course on Experimental Design and Analysis Of Variance.

    Logic Of ANOVA summary: Here is a summary of some basics about ANOVA for those who may want a refresher.

    NPR commentary by Douglas Kamerow, former Assistant U.S. Surgeon General, on the leading "causes" of death in the U.S., and what factor best predicts longevity. This will get you thinking about the relation between correlation and causality, and how proximal something needs to be to count as a cause. From All Things Considered, 1/1/08.

    PBS Frontline 2/14/06: The Meth Epidemic features (especially in the first part) intriguing correlational work by investigative journalists uncovering likely causal connections between the incidence of meth-addiction-related problems and the availability of meth ingredients (like cold medicines) over time. The most relevant portions occur in the first 10-15 minutes of the show. Pretty cool stuff. And if you don't want to watch it, you can just read about the parts that are interesting to you.

    Correlation article in Wikipedia: whether or not the math explained here is of interest (correlations as cosines, etc.), the two images depicting sets of scatterplots are very important to understand.

    David Hume on causation, from An Enquiry Concerning Human Understanding (1748) -- in case you want to see where questions about causality all began. (See also the famous conclusion.)

    Overview of reliability theory, which makes use of correlation, from William Trochim's on-line Research Methods Knowledge Base. This is a very basic exposition but sufficient for an introduction.

    Some significance tests for correlation coefficients (from Howell ch. 9), describing how to test for significant differences between correlation coefficients using a t-test for two independent r's, for comparing r to a hypothesized population value, and for two non-independent r's.

    Confidence Intervals in Howell ch. 7 pp. 181-183
    Notes on the meaning and interpretation of Confidence Intervals: Howell's discussion is very good, so the somewhat lengthy little essay that I've included here is more than I intended to write; still, it may be helpful to hear it expressed in more than one way.

    G*Power Home Page: free software for power calculations.

    Excel spreadsheet for calculating values of the z, t, F, and chi-square distributions and their probabilities

    Table of Selected Values of the t Distribution:

  • In the absence of SPSS, Excel (TDIST and TINV functions), or other relevant software, use this table to find the value of t that cuts off a certain percentage of the area under the curve, which corresponds to the probability of obtaining a t of that size or larger. Since t is symmetric it doesn't matter whether it's positive or negative (i.e., whether it's in the upper or lower tail); all that counts is the absolute value which represents the obtained score's distance from the null hypothesis value in units of estimated standard errors -- analogous to a z-score which uses KNOWN standard errors or standard deviations as its units. The many curves representing the t distribution differ depending on the degrees of freedom or df, with few df giving a curve that is flatter with longer tails than the standard normal distribution (or z distribution); with more and more df, the t distribution looks more and more like the z distribution. (Note that with infinite df, which means an infinite sample size, the values for t are identical to those you'd find in the z distribution.)
  • Read the row corresponding to the correct df: for analyzing means the df are n-1 for a single sample, and for a 2 sample means comparison the df are the sum of each sample's df (or N-2, where N is the total number of observations from both groups). In correlation and regression the df are the number of observations minus the number of predictors, minus 1 (or N-k-1). The commonly used proportions listed in this version of the table are conveniently identified by two different column headings, based on whether you want the proportion of interest to be located entirely in one tail, or split between the upper and lower tails. See the diagram accompanying the table to clarify this. ALWAYS use the two-tailed version, and thus the headings under "proportion in two tails combined" -- so the 1 df value for p=.05 is 12.706, not 6.314. (One-tailed tests of so-called "directional hypotheses" map p-values onto smaller required values of t, making it easier to declare results significant, but this procedure has always been controversial and I rarely see a situation that legitimately calls for it. How often is it really the case that one group's mean MUST be higher than the other's, and it's inconceivable that their sizes could be reversed?) As an example, the t value for the p<.01 cutoff for the difference between the means of two samples of size n=10 would be 2.878. The df would be (10-1) + (10-1) = 18, and the appropriate column would be the one under 0.01 as you read the "proportion in two tails combined" headings. If your obtained t is larger than 2.878 then it clearly cuts off an even smaller proportion of the area than .01, and thus you can say the t you obtained has p<.01. (Any statistical software will tell you precisely what the p-value for your t actually is.)
  • Note that if the particular df you're looking for don't appear in the table, you should use the next LOWER df -- do NOT round df UP even if that higher df value is closer to yours. Another table with more values included appears here, and many more are available on the web. Many of these, for instance this one, will give the complementary proportion of the area for values SMALLER than t, and will do so only for one tail -- thus to find the example value of 2.878 you'd have to look for 18 df and then the 99.5% cutoff value, because p=.01 corresponds to a total of 1% of the area being more extreme and you have to split that 1% into 0.5% in the upper tail and 0.5% in the lower.
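    If you happen to have Python (with scipy) handy instead of Excel or SPSS, the table values discussed above can be reproduced directly; this is just a sketch of the two worked examples (p = .05 with 1 df, and p < .01 with 18 df), not a required tool for the course.

        from scipy import stats

        # Two-tailed critical value: put alpha/2 in each tail.
        print(round(stats.t.ppf(1 - 0.05 / 2, 1), 3))    # 12.706 for p = .05 with 1 df

        # p < .01 cutoff for two groups of n = 10 each: df = (10-1) + (10-1) = 18.
        print(round(stats.t.ppf(1 - 0.01 / 2, 18), 3))   # 2.878

        # Exact two-tailed p-value for an obtained t, e.g. t = 2.5 with 18 df.
        print(round(2 * stats.t.sf(abs(2.5), 18), 4))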

    Table of Selected Values of the F Distribution:

  • In the absence of SPSS, Excel, or other relevant software, use this table to find the value of F that cuts off a certain percentage of the area under the curve, which corresponds to the probability of obtaining an F of that size or larger. The F distribution has only one tail to consider, in the sense that the extreme values of interest are UPPER values only. The distribution's shape differs according to both the number of groups (or predictors) being analyzed, and the number of observations being made, and so picking out the relevant member of the family of F distributions requires two numbers specifying its df (one for the numerator df and one for the denominator df). Reproducing all the percentage cutoff points for the area under the curve (corresponding to the probabilities) for all possible combinations of these df would be very unwieldy. Thus only the most common cutoff values -- 5%, 10%, and 1% -- are included in this version of the table. They are organized such that the columns represent different numerator df up to 20 (appropriate for 21 group means in ANOVA, or 20 predictor variables in regression, which should be plenty), and the rows represent all values of the denominator df from 1 to 100.
  • Consulting the section of the table appropriate for the p-value you wish to examine, you find the row and column corresponding to your numerator and denominator df, and the value at that entry is the upper "critical value": the value of F beyond which the given percentage of the area under the curve is cut off. For instance, the value for the p<.01 cutoff for the difference between the means of two samples of size n=10 would be 8.285. Familiarity with ANOVA df would make it apparent that the numerator df would be [number of groups] - 1 = 2-1 = 1, and the denominator df would be the sum of the df within each group, or (10-1) + (10-1) = 18. The entry in the p=.01 portion of the table under a numerator df (called "ν1") of 1 and a denominator df (called "ν2") of 18 is 8.285, meaning that for those df the area under the curve beyond the value of 8.285 on the horizontal axis is 1% of the total, and the probability of randomly sampling scores that lead to that high an F value when there is no difference between the population means is 1%. If your obtained F is larger than 8.285 then it clearly cuts off an even smaller proportion of the area than .01, and thus you can say the F you obtained has p<.01. (Any statistical software will tell you precisely what the p-value for your F actually is.)
  • For 2 groups, either F or t can be used to yield exactly the same probability; in comparing just two groups the numerator df will always be 1 and the denominator df will be the same as the df for t. F then is the square of t -- that is, within rounding error, 8.285 is the square of 2.878.
  • Note that if the particular numerator and/or denominator df you're looking for don't appear in the table, you should use the next LOWER df -- do NOT round df UP even if that higher df value is closer to yours. A printable pdf version of the F distribution table for p=.05 and p=.01 values with numerator df up to 10 and all denominator df up to 100 is here. More versions of tables for F and other distributions appear here and at various other easily located web sites. Many web pages such as this one will calculate a p-value for any given F and df, and others will calculate F given df and a p-value, etc. But if you have access to the internet, chances are you also have access to Excel which will do the same with its FDIST, FINV, TDIST, and TINV functions, etc., or SPSS which displays all p-values for its analyses automatically.
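    The same goes for the F table: with Python and scipy the worked example above (p = .01 with 1 and 18 df) can be reproduced directly, along with the F = t-squared relationship for two groups. Again, this is just a sketch, not a course requirement.

        from scipy import stats

        # Upper critical value of F for p = .01 with numerator df = 1 and denominator df = 18.
        print(round(stats.f.ppf(1 - 0.01, 1, 18), 3))    # 8.285

        # With 1 numerator df, F is the square of the two-tailed t cutoff.
        t_crit = stats.t.ppf(1 - 0.01 / 2, 18)           # 2.878
        print(round(t_crit ** 2, 3))                     # same value, within rounding

        # Exact p-value for an obtained F, e.g. F = 5.0 with 1 and 18 df.
        print(round(stats.f.sf(5.0, 1, 18), 4))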
    Supplemental readings in statistics and psychology:

  • Some useful texts:
  • Gravetter, F. J., & Wallnau, L. B. (2006). Statistics for the Behavioral Sciences (7th ed.). Belmont, CA: Wadsworth/Thomson: a very clear introductory-level statistics text.
  • Howell, David C. (2007). Statistical Methods for Psychology (6th ed.). Thomson-Wadsworth. (ISBN-10: 0495012874; ISBN-13: 9780495012870): an introductory text of exceptional clarity and accuracy, for the grad or advanced undergrad level.
  • Keith, Timothy Z. (2006). Multiple Regression and Beyond. Allyn & Bacon. (ISBN-10: 0205326447; ISBN-13: 9780205326440): used for STAT 379 Spring 2007/2008.
  • Grimm, Lawrence G., & Yarnold, Paul R., eds. (1994). Reading and Understanding Multivariate Statistics. APA. (ISBN-10: 1-55798-273-2; ISBN-13: 978-1-55798-273-5): used for STAT 379 Spring 2007/2008.
  • Grimm, Lawrence G., & Yarnold, Paul R., eds. (2000). Reading and Understanding MORE Multivariate Statistics. APA. (ISBN-10: 1-55798-698-3; ISBN-13: 978-1-55798-698-6): companion volume to the 1994 book.
  • Pedhazur, Elazar J. (1997). Multiple Regression in Behavioral Research (3rd ed.). Thomson-Wadsworth. (ISBN-10: 0030728312; ISBN-13: 9780030728310): an advanced text and one of the best references on multiple regression and related procedures.
  • Keppel, Geoffrey, & Wickens, Thomas D. (2004). Design and Analysis: A Researcher's Handbook (4th ed.). Prentice Hall. (ISBN-10: 0135159415; ISBN-13: 9780135159415): used for STAT 242 Fall 2007.
  • Maxwell, S. E., & Delaney, H. D. (2004). Designing Experiments and Analyzing Data: A Model Comparison Perspective (2nd ed.). Mahwah, NJ: Erlbaum. (ISBN-10: 0-8058-3718-3; ISBN-13: 978-0-8058-3718-6): an advanced text on experimental design and ANOVA.


    Some important figures in the history of statistics:

  • Abraham De Moivre around 1730 derived the normal distribution as the limit of the binomial distribution when the number of binary decisions (e.g., coin tosses) is infinite.
  • Johann Carl Friedrich Gauss often gets credit for discovering the normal distribution because in 1809 he proved that it described errors of measurement (in astronomy, etc.), which is why the normal distribution is sometimes called the Gaussian distribution.
  • Adolphe Quetelet in 1835 first applied the normal distribution to biological and behavioral traits rather than merely to measurement error, describing the concept of "the average man"; he also invented the Quetelet Index which today we usually refer to as the Body Mass Index (BMI).
  • Francis Galton invented the concepts of correlation and regression around 1886. He also read and wrote at age 2-1/2, went ballooning and did experiments with electricity for fun, mapped previously unexplored African territories, taught soldiers camping procedures and how to deal with wild animals and "savages," tried to objectively determine which part of Britain had the most attractive women, studied the efficacy of prayer empirically, observed the amount of fidgeting at scientific lectures to measure the degree of boredom, invented fingerprinting and weather maps along with the meteorological terms "highs," "lows," and "fronts," coined the phrase "nature and nurture," and pioneered mental testing, twin studies of heritability, the composite photograph, the study of mental imagery, the free-association technique for probing unconscious thought processes, the psychological survey questionnaire, and... umm... eugenics. Oops.
  • Karl Pearson founded modern statistics beginning in the 1890's, inventing the chi-square distribution and test and coining the term "standard deviation" among others; he formalized the calculation of the correlation coefficient (where Galton had arrived at it graphically) and so that calculation bears his name today.
  • George Udny Yule worked on the concepts and mathematics of partial correlation and regression in the 1890's, making multiple regression as we know it possible.
  • William Sealy Gosset in 1908 worked out the sampling distribution of the mean (whose standard deviation is the "standard error" in modern terminology) for cases where the population standard deviation is unknown -- hence he is the inventor of the t-test.
  • Ronald Fisher was a key figure in bridging the gap between the Darwinian theory of natural selection and its underlying mechanism of Mendelian genetics; from about 1915 onwards he also invented experimental design as we know it today, and developed Analysis Of Variance (ANOVA) as a generalization of Gosset's work to more than two groups (Snedecor in his influential early textbook named the 'F' statistic for Fisher).
  • Jerzy Neyman and Egon Pearson (son of Karl) invented and refined many of the concepts of null hypothesis significance testing in the 1930's (e.g. the alternative hypothesis, power, Type II error, confidence intervals), though Fisher had a constant ongoing argument with everything they did -- mainly because it wasn't the way HE did it.


    If you're wondering about classes being canceled due to weather, see http://alert.uconn.edu or call 486-3768.