Statistics as summaries of data
Statistical analysis involves summarizing data. When you use those data to infer something about the population from which they are drawn, you must attach a measure of uncertainty.
Univariate statistics
statistics for one variable at a time, e.g., mean, median, SD, skewness, kurtosis...
Bivariate and multivariate statistics
statistics describing the relationship between and among variables: correlation, regression, t-test, ANOVA
Raw data vs. Summaries
Always look at plots of the raw data before doing any analysis. You could show all of the data graphically or in a spreadsheet, but this has limited utility when there is too much data or the relationships are complex - you can't visually discern what is going on. Some graphical representations are actually representations of summaries. A statistical summary attempts to distinguish the wheat from the chaff: variability related to predictors of interest vs. noise (i.e., tangential variability).
"Accounting for" variability
Outcomes vary: RTs, ratings, key choice, number of responses, size of lesion. What "causes" the observed variability? However, "cause" requires clean cause-effect interpretations, which are sometimes unfounded, so we use the more neutral phrase: what "accounts for" the observed variability. Often quantified as R² for an entire statistical model.
residual
the difference between the observed value and the mean value that the model predicts for that observation. Residual values are especially useful in regression and ANOVA procedures because they indicate the extent to which a model accounts for the variation in the observed data.
Statistical goal one
estimating numeric quantities about data (means, variances, equations of a line, group means)
Statistical goal two
identify factors that "account for" observed variability. Factors that account for variability are good predictors of outcome quantities. Predictors may be either categorical (e.g., occupation, race, drug dose) or continuous (e.g., income, trial, drug dose)
Goal of regression
* Identify the line through a scatterplot that best summarizes the linear aspect of the relationship
* This equation will be of the general form of a line: y = mx + b
* Recall that b is the intercept: the predicted value of y when x is zero
* The slope, m, is the amount of change in y for each one-unit change in x
First Order (straight-line) Model
E(y) = B0 + B1x, where E(y) is the expected value of y, B0 is the y-intercept, and B1 is the slope.
Formulas for the Least Squares estimates
Slope: b1 = SSxy / SSxx = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²; y-intercept: b0 = ȳ − b1·x̄
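The sketch below is a minimal illustration of these formulas in Python, using small made-up arrays for x and y (not course data); np.polyfit is included only as a cross-check.

```python
import numpy as np

# Hypothetical example data (made up for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 2.1, 2.8, 4.2, 4.9])

# Least-squares estimates:
#   slope:     b1 = SSxy / SSxx = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
#   intercept: b0 = ybar - b1 * xbar
x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar
print(f"slope b1 = {b1:.3f}, intercept b0 = {b0:.3f}")

# Cross-check: numpy's degree-1 polynomial fit returns [slope, intercept]
print(np.polyfit(x, y, 1))
```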
Visual straight-line fit
a straight line fit by eye to the data in Table 3.1
Assumptions of regression regarding the error distribution
* 1. Mean of the error distribution is zero.
* 2. The error distribution is normal around the regression line.
* 3. The variance of the error distribution is constant along the line.
* 4. The errors are independent (problematic assumption for within-subject designs).
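A minimal sketch (with hypothetical data) of how one might eyeball the first three assumptions from a residual plot; the variable names are illustrative, not from the course materials.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data; in practice substitute your own x and y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 2.3, 2.9, 4.1, 4.8, 6.2])
b1, b0 = np.polyfit(x, y, 1)          # least-squares slope and intercept
residuals = y - (b0 + b1 * x)

# Assumption 1: mean of the residuals is ~0 (true by construction for least squares)
print("mean residual:", residuals.mean())

# Assumptions 2-3: residuals vs. x should form a roughly symmetric band of
# constant width (normal around the line, constant variance along the line)
plt.scatter(x, residuals)
plt.axhline(0, color="gray", linestyle="--")
plt.xlabel("x")
plt.ylabel("residual")
plt.show()
```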
Doing regressions in JMP
* Use Multivariate Methods; use the red pull-down triangle and select Pairwise Correlations.
* The CI gets larger with smaller samples.
* The CI is asymmetrical because you cannot go beyond a value of 1 (r only ranges from -1 to +1).
* The correlation coefficients here are NOT normally distributed.
Test of Hypothesis for Linear correlation
H0: ρ = 0 (no linear correlation) vs. Ha: ρ ≠ 0. Test statistic: t = r√(n − 2) / √(1 − r²), compared against a t distribution with n − 2 degrees of freedom.
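As a sketch, the same test computed by hand and via scipy, with made-up data:

```python
import numpy as np
from scipy import stats

# Hypothetical data
x = np.array([2.0, 4.0, 5.0, 7.0, 8.0, 10.0])
y = np.array([1.5, 3.2, 4.1, 6.8, 7.0, 9.4])
n = len(x)

r = np.corrcoef(x, y)[0, 1]
t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)   # test statistic for H0: rho = 0
p = 2 * stats.t.sf(abs(t), df=n - 2)           # two-tailed p-value
print(f"r = {r:.3f}, t = {t:.3f}, p = {p:.4f}")

# scipy performs the same test directly
print(stats.pearsonr(x, y))
```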
s2 - residual error variance
This is the variance not accounted for by the regression model, a.k.a. MSE (mean square error). The square root of MSE is called the Root Mean Square Error, or RMSE. Locate both MSE and RMSE in the JMP output.
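A quick sketch of computing MSE and RMSE from the residuals of a fit (hypothetical data; JMP reports these values for you):

```python
import numpy as np

# Hypothetical data and least-squares fit
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.3, 1.9, 3.2, 3.8, 5.1])
b1, b0 = np.polyfit(x, y, 1)

sse = np.sum((y - (b0 + b1 * x)) ** 2)   # sum of squared residuals (SSE)
mse = sse / (len(x) - 2)                 # s^2: divide by n - 2 in simple regression
rmse = np.sqrt(mse)                      # s: back in the units of y
print(f"MSE (s^2) = {mse:.3f}, RMSE (s) = {rmse:.3f}")
```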
Estimation of s2 and s for the Straight-line (first order) model
s² = SSE / (n − 2), where SSE = Σ(y − ŷ)² is the sum of squared residuals; s = √s² (the RMSE). The divisor n − 2 reflects the two estimated parameters (slope and intercept).
slope vs intercept symbols
B1 = slope, B0 = y-intercept
Sampling distribution of b1
If the regression assumptions hold, b1 is normally distributed with mean B1 and standard error σ_b1 = σ / √SSxx, where SSxx = Σ(x − x̄)². The estimated standard error is s_b1 = s / √SSxx.
Test of Model Utility: Simple linear regression
We look at t because it is sensitive to sample size and we are assuming a large sample. [focus on the material in the red box] Values of t between -2 and +2 are not strong evidence to reject the null; values outside this range are strong evidence to reject the null.
Rejection region and calculated t-value for testing whether the slope b1 = 0
t = b1 / s_b1 = b1 / (s / √SSxx), with n − 2 degrees of freedom. Reject H0: B1 = 0 when |t| exceeds the critical value t_(α/2), i.e., when t falls in the rejection region.
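A sketch of the slope t-test on hypothetical data (JMP reports an equivalent t in its regression output):

```python
import numpy as np
from scipy import stats

# Hypothetical data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)
s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))   # RMSE
se_b1 = s / np.sqrt(np.sum((x - x.mean()) ** 2))          # standard error of the slope

t = b1 / se_b1                                            # test of H0: B1 = 0
t_crit = stats.t.ppf(0.975, df=n - 2)                     # two-tailed, alpha = .05
print(f"t = {t:.2f}, rejection region: |t| > {t_crit:.2f}")
```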
A 100(1 − α)% Confidence interval for the simple linear regression slope B1
b1 ± t_(α/2) · s_b1, where t_(α/2) is based on n − 2 degrees of freedom and s_b1 = s / √SSxx.
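A sketch using scipy's linregress, which returns the slope and its standard error directly (hypothetical data):

```python
import numpy as np
from scipy import stats

# Hypothetical data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8])

fit = stats.linregress(x, y)                  # slope, intercept, r, p, stderr
t_crit = stats.t.ppf(0.975, df=len(x) - 2)    # alpha = .05, two-tailed
ci = (fit.slope - t_crit * fit.stderr, fit.slope + t_crit * fit.stderr)
print(f"b1 = {fit.slope:.3f}, 95% CI = [{ci[0]:.3f}, {ci[1]:.3f}]")
```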
Predicting using regression
To approximate the prediction interval, use the regression equation to obtain the predicted value... • Then, look up the RMSE (the average error/residual) • Finally, the approximate 95% prediction interval is the predicted value ± 2·RMSE
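A minimal sketch of this rule of thumb; the fitted coefficients and RMSE are made-up placeholders standing in for values read from JMP output:

```python
# Hypothetical fitted model and RMSE (stand-ins for values read from JMP output)
b0, b1, rmse = 1.2, 0.85, 2.4
x_new = 10.0

y_hat = b0 + b1 * x_new                              # predicted value from the equation
lower, upper = y_hat - 2 * rmse, y_hat + 2 * rmse    # approximate 95% prediction interval
print(f"predicted y = {y_hat:.2f}, approx 95% PI = [{lower:.2f}, {upper:.2f}]")
```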
Pearson's r
Summarizes the strength and direction of the "relationship" between variables. The value will always be in [-1, +1]. Provides less information than a regression and can only assess the linear relationship; a lack of a linear relationship does not mean that there is no relationship at all.
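A one-line sketch of computing r with numpy (hypothetical data):

```python
import numpy as np

# Hypothetical data
x = np.array([3.1, 4.5, 5.2, 6.8, 7.4, 9.0])
y = np.array([2.0, 3.9, 4.1, 6.5, 6.9, 8.8])

# Pearson's r is the off-diagonal entry of the 2x2 correlation matrix; always in [-1, +1]
r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.3f}")
```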
Pearson's r Computation assumes
Variables are continuous. Data are linearly related. Homogeneity of variance across range. Normal distribution of variables. Measures are independent.
Values of r and their implications
r near +1 indicates a strong positive linear relationship, r near -1 a strong negative linear relationship, and r near 0 little or no linear relationship. The sign gives the direction of the relationship; the magnitude gives its strength.
Interpreting r
• Sample correlation (r) vs. population correlation (ρ). Statistical tests of significance aren't particularly informative - they test whether r ≠ 0, which is only of interest if r is small. r only assesses the strength of the linear relationship and underestimates the correlation when there is curvature or non-linearity. It can't be used for prediction (you need a regression model), BUT it does provide a generic metric for the quality of a regression model: r for simple regression, R for multiple regression.
challenges with correlation: Restriction of range
If one or the other variable spans too small a range, this limits the ability to accurately assess the correlation.
challenges with correlation: Influential data points
With little data, a single unusual data point can have a big impact on the correlation.
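A small sketch (made-up numbers) showing how a single added point can move r in a small sample:

```python
import numpy as np

# Hypothetical small sample with a modest positive relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 2.4, 3.1, 3.0, 3.8, 4.1])
print("r without the unusual point:", np.corrcoef(x, y)[0, 1])

# Add one unusual data point and recompute: with little data,
# a single point can substantially change the correlation
x2 = np.append(x, 12.0)
y2 = np.append(y, 0.5)
print("r with the unusual point:   ", np.corrcoef(x2, y2)[0, 1])
```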
challenges with correlation: unreliable measures
If the variables being measured are unreliable (inconsistent), then r is adversely affected.
Nonparametric correlation
Recall the assumptions: variables are continuous; data are linearly related; homogeneity of variance across range; normal distribution of variables; measures are independent. What about the rest? Definition of a "nonparametric" statistic: Spearman's nonparametric correlation can help with the 2nd, 3rd, and 4th assumptions.
parametric stats
stats that make certain assumptions about the data (e.g., normal distribution, linearity)
non parametric stats
stats that do not make assumptions about data
parametric assumptions
* Variables are continuous * Data are linearly related * Homogeneity of variance across range * Normal distribution of variables * Measures are independent
monotonic relationship
relationships where, when one variable changes, the other always changes in the same direction (the relationship does not reverse)
non-monotonic relationship
relationships where, when one variable changes, the other does not always change in the same way (e.g., the relationship changes direction after a while)
Spearman's, rs (also, rho or r)
• Turns each variable into a rank and correlates the ranks. Preferred when:
- Variables are ordinal
- Spacing is inconsistent between values (e.g., non-normal data)
- Data contain outliers
- Variance of one variable is inconsistent as the value of the other variable changes
Can help with monotonic nonlinearities, and with the 2nd, 3rd, and 4th assumptions. In JMP: Multivariate, red triangle, Nonpar Corr.
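Outside of JMP, a sketch of the same comparison with scipy, using made-up data with a monotonic but nonlinear relationship:

```python
import numpy as np
from scipy import stats

# Hypothetical data: y is a monotonic but strongly nonlinear function of x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.exp(x)

print("Pearson r  :", stats.pearsonr(x, y)[0])    # understates the relationship
print("Spearman rs:", stats.spearmanr(x, y)[0])   # ranks match perfectly, so rs = 1
```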
details about the assumption independence
Violating this assumption can be very serious. Imagine that you assessed two ratings variables multiple times for each subject - for example, ratings of multiple scenarios. You then correlate the two columns of ratings: r = -.36. But what if you correlated the two columns separately for each subject?
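A sketch of both analyses on hypothetical (randomly generated) ratings data; the column names and the grouping are illustrative only:

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: several scenarios rated on two scales by each subject
rng = np.random.default_rng(0)
ratings = pd.DataFrame({
    "subject": np.repeat(["s1", "s2", "s3", "s4"], 10),
    "rating_a": rng.normal(size=40),
    "rating_b": rng.normal(size=40),
})

# Naive correlation that ignores the dependence among each subject's ratings
print("pooled r:", ratings["rating_a"].corr(ratings["rating_b"]))

# Correlating the two columns separately for each subject respects
# the within-subject structure
per_subject = ratings.groupby("subject").apply(
    lambda g: g["rating_a"].corr(g["rating_b"])
)
print(per_subject)
```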
An alternative, less attractive, solution to dependence
• Average across the multiple ratings for a subject to obtain one measure. Appropriate for scales (multiple ratings of the same scale, e.g., conscientiousness, collapsed to create one assessment of the construct). Can result in much weaker correlations and ignores the advantages of a within-subject design: using different scenarios, situations, or stimuli lets you assess whether the measures track each other across those variations.
Coefficient of Determination: R² (r squared)
A.k.a. the proportion of variance accounted for by the model. A generic metric of model fit with a minimum value of 0 (supposedly) and a maximum of 1. Has a shortcoming as the number of predictors increases. • Provides less information than the regression equation and the correlation (it is done this way because it is easier to compute than the alternative estimate of the population value?)
adjusted R2
Models with more predictors or little data are more likely to overestimate the amount of variance accounted for (i.e., R²); the adjustment increases with more predictors and less data. The adjusted value better approximates the population R² (ρ², rho-squared).
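A sketch of both quantities for a simple regression with one predictor (hypothetical data; the adjustment shown is the standard formula, which may differ slightly from what a given package reports):

```python
import numpy as np

# Hypothetical simple-regression fit (one predictor)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.2, 2.8, 4.1, 4.5, 5.9, 6.2, 7.8, 8.1])
n, k = len(y), 1                          # n observations, k predictors

b1, b0 = np.polyfit(x, y, 1)
sse = np.sum((y - (b0 + b1 * x)) ** 2)    # residual (unexplained) variation
sst = np.sum((y - y.mean()) ** 2)         # total variation in y

r2 = 1 - sse / sst                               # proportion of variance accounted for
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)    # penalizes more predictors / less data
print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
```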