Statistics Unit 2

scatterplots

shows the relationship between two quantitative variables measured on the same cases
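a minimal sketch of drawing a scatterplot with matplotlib (the data and variable names here are made up for illustration):

    import matplotlib.pyplot as plt

    # hypothetical data: hours studied (explanatory) vs. exam score (response)
    hours = [1, 2, 3, 4, 5, 6, 7, 8]
    score = [52, 58, 61, 67, 70, 74, 79, 85]

    plt.scatter(hours, score)          # one point per case
    plt.xlabel("hours studied (x, explanatory)")
    plt.ylabel("exam score (y, response)")
    plt.show()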

association

- direction: a positive direction or association means that, in general, as one variable increases, so does the other. when increases in one variable generally correspond to decreases in the other, the association is negative
- form: the form we care about most is straight (linear), but also describe curved patterns, clusters, or other departures from a simple form
- strength: how tightly the points follow the form, from very strong to very weak

outlier

a point that does not fit the overall pattern seen in the scatterplot

response variable, explanatory variable, x-variable, y-variable

in a scatterplot, you must choose a role for each variable. assign to the y-axis the response variable you hope to predict or explain. assign to the x-axis the explanatory or predictor variable that accounts for, explains, predicts, or is otherwise responsible for the response variable

correlation coefficient

a numerical measure of the direction and strength of a linear association
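a quick numpy sketch of computing the correlation coefficient r (data are hypothetical):

    import numpy as np

    x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
    y = np.array([52, 58, 61, 67, 70, 74, 79, 85], dtype=float)

    # np.corrcoef returns the 2x2 correlation matrix; r is the off-diagonal entry
    r = np.corrcoef(x, y)[0, 1]

    # equivalently, r is the sum of the products of z-scores divided by n - 1
    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)
    r_alt = np.sum(zx * zy) / (len(x) - 1)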

lurking variable

a variable other than x and y that simultaneously affects both variables, thereby accounting for the correlation between the two

model

an equation or formula that simplifies and represents reality

linear model

an equation of a line. to interpret a linear model, we need to know the variables, their W's, and their units
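in symbols, a linear model has the form y^ = b0 + b1x, where b0 is the intercept and b1 is the slope (both defined below)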

predicted value

the value of y^ found for a given x-value in the data. a predicted value is found by substituting the x-value in the regression equation. the predicted values are the values on the fitted line; the points (x, y^) all lie exactly on the fitted line

residuals

the differences between the data values and the corresponding values predicted by the regression model, or, more generally, the values predicted by any model
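a short sketch showing predicted values and residuals; the slope, intercept, and data below are made up:

    import numpy as np

    x = np.array([1, 2, 3, 4, 5], dtype=float)
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    b0, b1 = 0.1, 2.0              # hypothetical intercept and slope
    y_hat = b0 + b1 * x            # predicted values: points on the fitted line
    residuals = y - y_hat          # residual = observed y - predicted y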

least squares

specifies the unique line that minimizes the variance of the residuals or, equivalently, the sum of the squared residuals
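a sketch of finding the least squares slope and intercept from summary statistics, using the standard relations b1 = r·(sy/sx) and b0 = ȳ − b1·x̄ (data are hypothetical):

    import numpy as np

    x = np.array([1, 2, 3, 4, 5], dtype=float)
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    r = np.corrcoef(x, y)[0, 1]
    b1 = r * y.std(ddof=1) / x.std(ddof=1)   # slope, in y-units per x-unit
    b0 = y.mean() - b1 * x.mean()            # intercept: the line passes through (x-bar, y-bar)

    # the same coefficients from np.polyfit; a degree-1 fit minimizes the sum of squared residuals
    b1_check, b0_check = np.polyfit(x, y, 1)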

regression to the mean

because the correlation is always less than 1.0 in magnitude, each predicted y tends to be fewer standard deviations from its mean than its corresponding x was from its mean
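written in z-scores, the regression equation is simply predicted z_y = r × z_x; since |r| < 1, a case one standard deviation above the mean in x is predicted to be only r standard deviations above the mean in y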

regression line or line of best fit

the particular linear equation that satisfies the least squares criterion is called the least squares regression line. casually, we often just call it the regression line, or line of best fit

slope

the slope, b1, gives a value in "y-units per x-unit." changes of one unit in x are associated with changes of b1 units in predicted values of y.

intercept

the intercept, b0, gives a starting value in y-units. it's the y^-value when x is 0.
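a tiny prediction sketch using a slope and intercept (the values here are hypothetical):

    b0, b1 = 0.1, 2.0            # hypothetical intercept and slope
    x_new = 4.0
    y_hat = b0 + b1 * x_new      # predicted y: 0.1 + 2.0 * 4.0 = 8.1

    # interpretation: each additional x-unit is associated with a change of b1 = 2.0 y-units
    # in the predicted value; b0 = 0.1 is the predicted y when x = 0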

extrapolation

although linear models provide an easy way to predict values of y for a given value of x, it is unsafe to predict for values of x far from the ones used to find the linear model equation. such extrapolation may pretend to see into the future, but its predictions should not be trusted

leverage

data points whose x-values are far from the mean of x are said to exert leverage on a linear model. high-leverage points pull the line close to them, and so they can have a large effect on the line, sometimes completely determining the slope and intercept

influential point

if omitting a point from the data results in a very different regression model, then that point is called an influential point
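a sketch (with made-up data) of checking for influence by refitting the line without a suspect high-leverage point; a large change in the fit marks the point as influential:

    import numpy as np

    x = np.array([1, 2, 3, 4, 5, 20], dtype=float)   # last point has an extreme x-value: high leverage
    y = np.array([2, 4, 6, 8, 10, 5], dtype=float)   # ...and it doesn't follow the pattern of the rest

    slope_all, intercept_all = np.polyfit(x, y, 1)         # fit with every point
    slope_wo, intercept_wo = np.polyfit(x[:-1], y[:-1], 1) # fit with the suspect point omitted

    # a big difference between the two slopes/intercepts means the point is influential
    print(slope_all, slope_wo)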

re-expression

we re-express data by taking the logarithm, the square root, the reciprocal, or some other mathematical operation on all values of a variable
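a short numpy sketch of the common re-expressions (the data are hypothetical and positive, so the logarithm and square root are defined):

    import numpy as np

    y = np.array([1.0, 4.0, 9.0, 25.0, 100.0])

    y_log   = np.log(y)     # logarithm
    y_sqrt  = np.sqrt(y)    # square root
    y_recip = 1.0 / y       # reciprocal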

ladder of powers

places in order the effects that many re-expressions have on the data
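a sketch of the ladder, assuming the usual ordering of rungs from intro texts (2, 1, 1/2, the logarithm in place of power 0, −1/2, −1; negative powers are negated so the order of the data is preserved):

    import numpy as np

    # the ladder of powers, from the top rung down
    ladder = {
        2:     lambda y: y**2,             # square
        1:     lambda y: y,                # raw data (no re-expression)
        0.5:   lambda y: np.sqrt(y),       # square root
        "log": np.log,                     # the "0" rung
        -0.5:  lambda y: -1 / np.sqrt(y),  # negative reciprocal square root
        -1:    lambda y: -1 / y,           # negative reciprocal
    }

    y = np.array([1.0, 4.0, 9.0, 25.0])
    stepped_down = ladder["log"](y)        # move down the ladder, e.g. to reduce right skew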