scatterplots
shows the relationship between two quantitative variables measured on the same cases
association
- direction: a positive direction or association means that, in general, as one variable increases, so does the other. when increases in one variable generally correspond to decreases in the other, the association is negative
- form: the form we care about most is straight (linear), but you should describe any other patterns you see, such as curves or clusters
outlier
a point that does not fit the overall pattern seen in the scatterplot
response variable, explanatory variable, x-variable, y-variable
in a scatterplot, you must choose a role for each variable. assign to the y-axis the response variable you hope to predict or explain. assign to the x-axis the explanatory or predictor variable that accounts for, explains, predicts, or is otherwise responsible for the y-variable
correlation coefficient
a numerical measure of the direction and strength of a linear association
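as a sketch of how this number is computed, the correlation coefficient r is the sum of the products of z-scores divided by n - 1. the small x/y lists below are made-up illustrative data, not from any real study:

```python
# compute the correlation coefficient r by hand (illustrative data)

def correlation(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # sample standard deviations (dividing by n - 1)
    sx = (sum((x - mx) ** 2 for x in xs) / (n - 1)) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / (n - 1)) ** 0.5
    # r is the sum of products of z-scores, divided by n - 1
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(correlation(xs, ys), 3))  # about 0.775: a moderately strong positive association
```

note that r has no units and always falls between -1 and +1; the sign gives the direction and the magnitude gives the strength of the linear association.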
lurking variable
a variable other than x and y that simultaneously affects both variables, accounting for the apparent correlation between the two
model
an equation or formula that simplifies and represents reality
linear model
an equation of a line. to interpret a linear model, we need to know the variables, their W's, and their units
predicted value
the value of y^ found for a given x-value in the data. a predicted value is found by substituting the x-value in the regression equation. the predicted values are the values on the fitted line; the points (x, y^) all lie exactly on the fitted line
residuals
the difference between a data value and the corresponding value predicted by the regression model - or, more generally, by any model
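a residual is observed minus predicted, y - y^. the line y^ = 2.2 + 0.6x and the data points below are illustrative assumptions chosen so the line is the least squares fit:

```python
# residual = observed y minus predicted y^ (illustrative line and data)

def predict(x):
    return 2.2 + 0.6 * x  # an assumed fitted line, y^ = 2.2 + 0.6x

data = [(1, 2), (2, 4), (3, 5), (4, 4), (5, 5)]
residuals = [y - predict(x) for x, y in data]
print(residuals)
print(sum(residuals))  # for a least squares line, the residuals sum to zero
```

a positive residual means the model underestimated that case; a negative residual means it overestimated.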
least squares
specifies the unique line that minimizes the variance of the residuals or, equivalently, the sum of the squared residuals
regression to the mean
because the correlation is always less than 1.0 in magnitude, each predicted y tends to be fewer standard deviations from its mean than the corresponding x was from its mean
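in z-score form this is easy to see: the regression line predicts z_y^ = r · z_x, so the prediction is always pulled toward the mean. the value r = 0.6 below is an assumed correlation, chosen only for illustration:

```python
# regression to the mean in z-scores: predicted z_y is r times z_x
r = 0.6  # an assumed correlation, |r| < 1

for z_x in [-2.0, -1.0, 1.0, 2.0]:
    z_yhat = r * z_x
    # since |r| < 1, the prediction is closer to the mean than x was
    assert abs(z_yhat) <= abs(z_x)
    print(z_x, "->", z_yhat)
```

an x-value 2 standard deviations above its mean yields a predicted y only 1.2 standard deviations above its mean.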
regression line or line of best fit
the particular linear equation that satisfies the least squares criterion is called the least squares regression line. casually, we often just call it the regression line, or line of best fit
slope
the slope, b1, gives a value in "y-units per x-unit." changes of one unit in x are associated with changes of b1 units in predicted values of y.
intercept
the intercept, b0, gives a starting value in y-units. it's the y^-value when x is 0.
extrapolation
although linear models provide an easy way to predict values of y for a given value of x, it is unsafe to predict for values of x far from the ones used to find the linear model equation. such extrapolation may pretend to see into the future, but its predictions should not be trusted
leverage
data points whose x-values are far from the mean of x are said to exert leverage on a linear model. high-leverage points pull the line close to them, and so they can have a large effect on the line, sometimes completely determining the slope and intercept
influential point
if omitting a point from the data results in a very different regression model, then that point is called an influential point
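leverage and influence can be seen by fitting the line with and without a far-out point. the data below are illustrative assumptions: four points on an exact line, plus one high-leverage point far from the mean of x:

```python
# refit a least squares line with and without a high-leverage point
from statistics import mean

def fit(points):
    """Return (slope, intercept) of the least squares line through points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    b1 = sum((x - mx) * (y - my) for x, y in points) / sum((x - mx) ** 2 for x in xs)
    return b1, my - b1 * mx

base = [(1, 1), (2, 2), (3, 3), (4, 4)]  # these lie exactly on y = x
slope_without, _ = fit(base)
slope_with, _ = fit(base + [(20, 2)])    # one point with x far from the mean of x

print(slope_without, slope_with)  # slope drops from 1.0 to about -0.008
```

one point was enough to flatten the slope from 1.0 to nearly zero, so omitting it gives a very different model: it is both high-leverage and influential.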
re-expression
we re-express data by taking the logarithm, the square root, the reciprocal, or some other mathematical operation on all values of a variable
ladder of powers
places in order the effects that many re-expressions have on the data
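the rungs of the ladder can be listed as powers of y. this sketch follows the common convention that power 0 means the logarithm, and that negative powers are negated to keep the transformed values in the same order as the originals:

```python
# the ladder of powers: common re-expressions ordered by power
import math

ladder = [
    (2,    lambda y: y ** 2),             # squares
    (1,    lambda y: y),                  # raw data (no re-expression)
    (0.5,  lambda y: math.sqrt(y)),       # square roots
    (0,    math.log),                     # logarithms (power "0" by convention)
    (-0.5, lambda y: -1 / math.sqrt(y)),  # negative reciprocal square roots
    (-1,   lambda y: -1 / y),             # negative reciprocals
]

y = 4.0
for power, f in ladder:
    print(power, "->", round(f(y), 3))
```

moving down the ladder pulls in the right tail of a skewed distribution more and more strongly, which is why we step down one rung at a time until the plot straightens.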