#### Unit 3: simple regressions

Statistics as summaries of data

Statistical analysis involves summarizing data. When you use a summary to infer something about the population from which the data were drawn, you must attach a measure of uncertainty.

Univariate statistics

Statistics for one variable at a time, e.g., mean, median, SD, skewness, kurtosis.

Bivariate and multivariate statistics

Relationships between and among variables, e.g., correlation, regression, t-test, ANOVA.

Raw data vs. Summaries

* Always look at plots of the raw data before doing any analysis.
* You could show all of the data graphically or in a spreadsheet, but this has limited utility when there is too much data or the relationships are complex: you can't visually discern what is going on.
* Some graphical representations are actually representations of summaries.
* A statistical summary attempts to separate the wheat from the chaff: variability related to predictors of interest vs. noise (i.e., tangential variability).

'Accounting for' variability

* Outcomes vary: RTs, ratings, key choice, number of responses, size of lesion.
* What "causes" the observed variability? This framing requires clean cause-effect interpretations, which are sometimes unfounded.
* So we use the more neutral question: what "accounts for" the observed variability?
* Often quantified as R² for an entire statistical model.

residual

the difference between the observed value and the mean value that the model predicts for that observation. Residual values are especially useful in regression and ANOVA procedures because they indicate the extent to which a model accounts for the variation in the observed data.
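
As a minimal sketch of the definition above, residuals are computed as observed minus predicted values. The data and the line's coefficients here are made up for illustration and are not taken from the course materials.

```python
# Hypothetical data and a hypothetical line y_hat = b0 + b1 * x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.5, 5.5, 8.0]
b0, b1 = 0.0, 2.0  # assumed intercept and slope, for illustration only

predicted = [b0 + b1 * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

print(residuals)  # observed minus predicted, one value per observation
```

Small residuals across the board suggest the line accounts for most of the variation in y; large ones suggest it does not.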

Statistical goal one

estimating numeric quantities about data (means, variances, equations of a line, group means)

Statistical goal two

Identify factors that "account for" observed variability. Factors that account for variability are good predictors of outcome quantities. Predictors may be either categorical (e.g., occupation, race, drug dose) or continuous (e.g., income, trial, drug dose).

Goal of regression

* Identify the line through a scatterplot that best summarizes the linear aspect of the relationship.
* This equation will be of the general form of a line: y = mx + b.
* Recall that b is the intercept: the predicted value of y when x is zero.
* The slope, m, is the amount of change in y for each one-unit change in x.

First Order (straight-line) Model

E(y) = "expected value of y"
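
For reference, the first-order model is usually written as below. This is the standard textbook form, supplied here as an assumption rather than recovered from the original card; $\beta_0$ is the y-intercept, $\beta_1$ the slope, and $\varepsilon$ the random error.

```latex
y = \beta_0 + \beta_1 x + \varepsilon, \qquad E(y) = \beta_0 + \beta_1 x
```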

Formulas for the Least Squares estimates

Slope, y-intercept
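
The card's formulas were not recovered, so the following is a sketch of the standard least-squares estimates: slope = SS_xy / SS_xx and intercept = mean(y) - slope * mean(x). The function name `least_squares` and the data are hypothetical.

```python
def least_squares(x, y):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # SS_xy: sum of cross-products of deviations; SS_xx: sum of squared x deviations
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = ss_xy / ss_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data lying exactly on the line y = 1 + 2x
slope, intercept = least_squares([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
```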

Visual straight-line fit

This is a straight-line fit to the data in Table 3.1.

Assumptions of regression regarding the error distribution

1. Mean of the error distribution is zero.
2. The error distribution is normal around the regression line.
3. The variance of the error distribution is constant along the line.
4. The errors are independent (a problematic assumption for within-subject designs).

Doing regressions in JMP

* Use Multivariate Methods, then use the red pull-down triangle and select Pairwise Correlations.
* The CI gets wider with smaller samples.
* The CI is asymmetrical because r cannot go beyond -1 or +1.
* The correlation coefficients here are NOT normally distributed.

Test of Hypothesis for Linear correlation
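
The content of this card was not recovered. As a reference sketch, the standard test of a linear correlation is usually stated as follows, where $\rho$ is the population correlation, $r$ the sample correlation, and $n$ the sample size; the statistic is compared against a t distribution with $n - 2$ degrees of freedom.

```latex
H_0: \rho = 0 \quad \text{vs.} \quad H_a: \rho \neq 0, \qquad
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}
```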


s² - residual error variance

This is the variance not accounted for by the regression model, a.k.a. MSE (mean square error). The square root of MSE is called the root mean square error (RMSE). Locate both MSE and RMSE in the JMP output.
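
A minimal sketch of how s² and s could be computed by hand for a straight-line model, assuming the usual estimator s² = SSE / (n - 2). The helper names and the data are hypothetical; JMP reports the same quantities in its output.

```python
import math

def fit_line(x, y):
    """Least-squares intercept b0 and slope b1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = ss_xy / ss_xx
    return mean_y - b1 * mean_x, b1

def residual_variance(x, y):
    """Return (MSE, RMSE), where MSE = SSE / (n - 2)."""
    b0, b1 = fit_line(x, y)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    mse = sse / (len(x) - 2)  # n - 2 because two parameters were estimated
    return mse, math.sqrt(mse)

# Made-up data
x = [1, 2, 3, 4]
y = [2.0, 4.5, 5.5, 8.0]
mse, rmse = residual_variance(x, y)
print(mse, rmse)
```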

Estimation of s² and s for the Straight-line (first order) model


slope vs intercept symbols

β1 = slope, β0 = y-intercept

Sampling distribution of b1


Test of Model Utility: Simple linear regression

Look at t because it is sensitive to sample size, and we are assuming a large sample. [Focus on the values in the red box.] t values between -2 and +2 are not strong evidence to reject the null. When t falls outside this range, there is strong evidence to reject the null.
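
The t value for the slope can be sketched as the estimated slope divided by its standard error, assuming the standard formulas s = sqrt(SSE / (n - 2)) and SE(b1) = s / sqrt(SS_xx). The function name and data are hypothetical. Note that the ±2 rule of thumb assumes a large sample; with a tiny data set like this one the exact critical value would be larger.

```python
import math

def slope_t_statistic(x, y):
    """Return (b1, standard error of b1, t = b1 / SE) for a straight-line fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = ss_xy / ss_xx
    b0 = mean_y - b1 * mean_x
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))   # RMSE
    se_b1 = s / math.sqrt(ss_xx)   # standard error of the slope
    return b1, se_b1, b1 / se_b1

# Made-up data
b1, se_b1, t = slope_t_statistic([1, 2, 3, 4], [2.0, 4.5, 5.5, 8.0])
print(b1, se_b1, t)
```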

Rejection region and calculated t-value for testing whether the slope b1 = 0


A 100(1 - α)% confidence interval for the simple linear regression slope β1
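
The formula on this card was not recovered. The standard textbook form, given here as an assumption, is shown below, where $\hat{\beta}_1$ is the estimated slope, $s$ is the RMSE, $SS_{xx} = \sum (x_i - \bar{x})^2$, and $t_{\alpha/2}$ is based on $n - 2$ degrees of freedom.

```latex
\hat{\beta}_1 \pm t_{\alpha/2}\, s_{\hat{\beta}_1}, \qquad
s_{\hat{\beta}_1} = \frac{s}{\sqrt{SS_{xx}}}
```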


Predicting using regression

To approximate the prediction interval:
* Use the regression equation to obtain the predicted value.
* Look up the RMSE (the typical size of the error/residual).
* The approximate 95% prediction interval is the predicted value ± 2 × RMSE.
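
The steps above can be sketched as a small function. The coefficients and RMSE passed in are hypothetical values standing in for numbers you would read off the regression output.

```python
def approx_prediction_interval(x_new, b0, b1, rmse):
    """Approximate 95% prediction interval: predicted value +/- 2 * RMSE."""
    predicted = b0 + b1 * x_new
    return predicted - 2 * rmse, predicted + 2 * rmse

# Hypothetical fitted line y_hat = 0.25 + 1.9x with RMSE = 0.5
low, high = approx_prediction_interval(x_new=3.0, b0=0.25, b1=1.9, rmse=0.5)
print(low, high)
```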

Pearson's r

* Summarizes the strength and direction of the "relationship" between variables.
* The value will always be in [-1, +1].
* Provides less information than a regression.
* Can only assess the linear relationship; a lack of a linear relationship does not mean that there is no relationship at all.
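
A minimal sketch of computing r by hand, using the standard formula r = SS_xy / sqrt(SS_xx * SS_yy). The made-up data illustrate the last point: y depends entirely on x, yet r is zero because the relationship is not linear.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: SS_xy / sqrt(SS_xx * SS_yy)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    ss_yy = sum((yi - mean_y) ** 2 for yi in y)
    return ss_xy / math.sqrt(ss_xx * ss_yy)

# A perfect but U-shaped relationship: y = x^2
x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]
r_curved = pearson_r(x, y)
print(r_curved)  # 0.0: no linear relationship, although y is fully determined by x
```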

Pearson's r Computation assumes

Variables are continuous. Data are linearly related. Homogeneity of variance across range. Normal distribution of variables. Measures are independent.

Values of r and their implications


Interpreting r

* Sample correlation (r) vs. population correlation (ρ).
* Statistical tests of significance aren't particularly informative: they only test whether r ≠ 0, which is only of interest if r is small.
* Only assesses the strength of the linear relationship; it underestimates curved and other nonlinear relationships.
* Can't be used for prediction (that needs a regression model), BUT it does provide a generic metric for the quality of a regression model: r for simple regression, R for multiple regression.

challenges with correlation: Restriction of range

If either variable spans too small a range, this limits the ability to accurately assess the correlation.

challenges with correlation: Influential data points

With little data, a single unusual data point can have a big impact on the correlation.
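
A small sketch of this effect with made-up numbers: five points with a moderate correlation, then the same points plus one extreme observation. The data and variable names are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: SS_xy / sqrt(SS_xx * SS_yy)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    ss_yy = sum((yi - mean_y) ** 2 for yi in y)
    return ss_xy / math.sqrt(ss_xx * ss_yy)

x = [1, 2, 3, 4, 5]
y = [2, 1, 3, 2, 3]
r_without = pearson_r(x, y)

# Add a single unusual point far from the rest of the data
r_with = pearson_r(x + [20], y + [20])

print(round(r_without, 2), round(r_with, 2))  # the one extra point pushes r close to 1
```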

challenges with correlation: unreliable measures

If the variables being measured are unreliable (inconsistent), then r is adversely affected.

Nonparametric correlation

Recall the assumptions: (1) variables are continuous, (2) data are linearly related, (3) homogeneity of variance across the range, (4) normal distribution of variables, (5) measures are independent. What about the rest? Definition of a "nonparametric" statistic. Spearman's nonparametric correlation can help with the 2nd, 3rd and 4th assumptions.

parametric stats

Stats that make certain assumptions about the data (e.g., normal distribution, linearity).

non parametric stats

Stats that do not make such assumptions about the data (e.g., they do not assume a normal distribution).

parametric assumptions

* Variables are continuous
* Data are linearly related
* Homogeneity of variance across range
* Normal distribution of variables
* Measures are independent

monotonic relationship

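Relationships where, as one variable increases, the other consistently moves in a single direction (always increasing or always decreasing), though not necessarily at a constant rate.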

non-monotonic relationship

Relationships where, as one variable increases, the other does not consistently move in one direction (e.g., the relationship changes direction after a while).

Spearman's rs (also rho or ρ)

* Turns each variable into ranks and correlates the ranks.
* Preferred when:
  * Variables are ordinal
  * Spacing is inconsistent between values (e.g., non-normal data)
  * Data contain outliers
  * Variance of one variable is inconsistent as the value of the other variable changes
* Can help with monotonic nonlinearities.
* Can help with the 2nd, 3rd, and 4th assumptions.
* In JMP: Multivariate, triangle, Nonpar Corr.
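
The rank-then-correlate idea can be sketched as follows. The helper names are hypothetical, ties are given average ranks (a common convention, assumed here), and the data are made up: y = x³ is monotonic but nonlinear, so Spearman's rs is 1 while Pearson's r is below 1.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def pearson_r(x, y):
    """Pearson correlation: SS_xy / sqrt(SS_xx * SS_yy)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    ss_yy = sum((yi - mean_y) ** 2 for yi in y)
    return ss_xy / math.sqrt(ss_xx * ss_yy)

def spearman_rs(x, y):
    """Spearman's rs: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

x = [1, 2, 3, 4, 5]
y = [xi ** 3 for xi in x]  # monotonic but nonlinear
print(spearman_rs(x, y), pearson_r(x, y))
```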

Violating the independence assumption can be very serious. Imagine that you assessed two ratings variables multiple times for each subject, for example, ratings of multiple scenarios. You then correlate the two columns of ratings and get r = -.36. But what if you correlated the two columns separately for each subject?

An alternative, less attractive, solution to dependence

* Average across the multiple ratings for a subject to obtain one measure.
* Appropriate for scales (multiple ratings are of the same scale, e.g., conscientiousness, collapsed to create one assessment of the construct).
* Can result in much weaker correlations.
* Ignores the advantages of a within-subject design: using different scenarios, situations, or stimuli can assess whether the measures track each other across these variations.

Coefficient of determination (r squared): R²

* A.k.a. the proportion of variance accounted for by the model.
* A generic metric of model fit.
* Has a minimum value of 0 (supposedly) and a maximum of 1.
* Has shortcomings as the number of predictors increases.
* Provides less information than the regression equation and the correlation.
* Is it done this way because it is easier to compute than the alternative estimate of the population value?
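
As a sketch of "proportion of variance accounted for", R² for a straight-line fit can be computed as 1 - SSE / SS_yy, where SS_yy is the total variability of y around its mean. The function name and data are hypothetical.

```python
def r_squared(x, y):
    """R^2 = 1 - SSE / SS_yy for a least-squares straight line."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    ss_yy = sum((yi - mean_y) ** 2 for yi in y)  # total variability in y
    b1 = ss_xy / ss_xx
    b0 = mean_y - b1 * mean_x
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained
    return 1 - sse / ss_yy

# Made-up data
r2 = r_squared([1, 2, 3, 4], [2.0, 4.5, 5.5, 8.0])
print(r2)
```

For simple regression this equals the square of Pearson's r.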