Statistics Ch. 4: Describing the Relation Between Two Variables

Response Variable

The variable whose value can be explained by the value of the explanatory or predictor variable.

Synonyms for Explanatory Variable

Predictor or Independent Variable

Synonym for Response Variable

Dependent Variable

Scatter Diagram Definition

A graph that shows the relationship between two quantitative variables measured on the same individual.

How is a scatter diagram constructed?

Each individual in the data set is represented by a point on the scatter diagram. The explanatory variable is plotted on the horizontal axis and the response variable is plotted on the vertical axis.

What's the purpose of drawing a scatter diagram?

A scatter diagram serves as the first step in helping us identify whether or not a relationship exists between two variables.

[Scatter diagram]
What sort of relationship does the above scatter diagram show?

A positively associated linear relationship between the explanatory variable on the x axis and the response variable on the y axis.

[Scatter diagram]
What type of relationship is shown here?

This is a negatively associated linear relationship between the explanatory variable and the response variable.

[Scatter diagram]
What type of relationship is shown here?

This is a nonlinear relationship.

[Scatter diagram]
What type of relationship is shown here?

This is a non-linear relationship

Positively Associated Variables

1) Applies to variables with a linear relationship.
2) Two variables are positively (+) correlated when above-average values of one variable are associated with above-average values of the other variable.
3) In other words, when one variable increases, the other tends to increase.

Negatively Associated Variables

1) Applies to variables with a linear relationship.
2) Two variables are negatively (-) correlated when above-average values of one variable are associated with below-average values of the other variable.
3) In other words, when one variable increases, the other tends to decrease.

[Scatter diagram]
What type of relationship is implied here?

No relation between the explanatory and the response variable.

What's important to remember when drawing scatter diagrams?

NEVER connect the dots with lines.

Why are scatter diagrams not sufficient to help us determine whether a relationship exists between two variables?

Because the horizontal and/or vertical scales of the graph can be manipulated and hence mislead, distort, or show a different representation of the relationship between the two variables. Hence, we need a numerical summary, called the linear correlation coefficient, to help us determine any relation that exists between two variables.

Other words for Linear Correlation Coefficient

Pearson Linear Correlation Coefficient or Pearson Product Moment Correlation Coefficient.

Linear Correlation Coefficient Definition

It is a measure of the STRENGTH and DIRECTION of the LINEAR relation between two quantitative variables.
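The definition above can be made concrete with a small computation. The sketch below is one common way to compute r by hand; the function name `pearson_r` and the study-hours data are hypothetical, made up for illustration only.

```python
import math

def pearson_r(xs, ys):
    """Sample linear correlation coefficient r for paired data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: hours studied (explanatory) and test score (response).
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 78]

r = pearson_r(hours, scores)
print(round(r, 3))  # a value close to +1: strong positive linear relation
```

The sign of the result gives the DIRECTION and its closeness to +1 or -1 gives the STRENGTH of the linear relation.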

Set of values for the linear correlation coefficient "r"

r can be any value between -1 and 1, inclusive (it can also equal -1 or 1).

If r=+1

Then a perfect positive linear relationship exists.

If r=-1

Then a perfect negative linear relationship exists.

If r=0

Then there is LITTLE or NO evidence of a LINEAR RELATIONSHIP between the two variables.

Caution about r=0

This is different from saying that there is NO RELATIONSHIP at all between the two variables.
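A quick way to see this caution in action is a data set where y depends on x exactly, but not along a line. In the sketch below (hypothetical data, with a helper `pearson_r` written just for this example), y is always x squared, yet r works out to 0.

```python
import math

def pearson_r(xs, ys):
    """Sample linear correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: a perfect nonlinear (U-shaped) relationship, y = x^2.
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(r)  # 0.0: no LINEAR relation, even though y is fully determined by x
```

This is why the scatter diagram should still be drawn: r only measures the linear part of a relationship.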

What are the units of the linear correlation coefficient "r"?

It is unitless
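Because r is unitless, changing the units of either variable should leave it unchanged. The sketch below checks this on hypothetical height and weight data (the numbers and the helper `pearson_r` are made up for illustration) by converting inches to centimeters and pounds to kilograms.

```python
import math

def pearson_r(xs, ys):
    """Sample linear correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data in inches and pounds.
heights_in = [60, 62, 65, 68, 70]
weights_lb = [110, 120, 140, 155, 170]

# Same individuals, measured in centimeters and kilograms.
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.45359237 for w in weights_lb]

r_original = pearson_r(heights_in, weights_lb)
r_metric = pearson_r(heights_cm, weights_kg)
print(math.isclose(r_original, r_metric))  # True: the units cancel out
```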

The linear correlation coefficient is .......? What does that mean?

Not resistant. That means that any value in the data set that doesn't follow the overall pattern of the data may affect the value of r. We don't need to know the formula for "r", but the formula takes into consideration ALL THE VALUES of both the explanatory and response variables, and hence extreme values of either variable can definitely affect the value of r. See pg. 193 for the formula, which we don't need to memorize.
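To see what "not resistant" means in practice, the sketch below adds a single point that breaks the pattern to an otherwise perfectly linear data set. The data and the helper `pearson_r` are hypothetical, chosen only to make the effect easy to see.

```python
import math

def pearson_r(xs, ys):
    """Sample linear correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data lying exactly on the line y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
r_clean = pearson_r(xs, ys)

# Add ONE observation that does not follow the overall pattern.
r_with_outlier = pearson_r(xs + [6], ys + [0])

print(round(r_clean, 3), round(r_with_outlier, 3))
```

A single unusual observation is enough to pull r from a perfect linear relation down to a weak one.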

If r is close to +1 or -1

The closer r is to +1, the stronger the evidence of a positive correlation between the variables; the closer r is to -1, the stronger the evidence of a negative correlation between the two variables.

How do you test for the existence and direction of a linear relationship?

First you have to determine whether a linear relationship exists or not:
First: Compute the ABSOLUTE VALUE of the linear correlation coefficient r.
Second: Look up the critical value of r in the appendix for the given sample size.
Third: If the computed absolute value is greater than the critical value, then a linear relationship exists between the two variables.
Fourth: If linear, look at the sign of the actual r (not its absolute value): if r is positive, there is a positive association between the variables; if r is negative, there is a negative association between the variables.

How do you determine whether there is a positive, negative, or no linear association when r is ambiguous, i.e. not close enough to +1 or -1 to settle the question (for example r = -0.5 or r = +0.4)?

If r is positive and greater than the critical value for the sample size, then there is a + association; if it is less than the critical value, then there is no linear relation.
If r is negative and less than the negative of the critical value (that is, its absolute value exceeds the critical value), then there is a - association; if it is greater than the negative of the critical value, then there is no linear relation.
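The decision rule above can be sketched as a small function. The name `classify_linear_relation` is hypothetical, and the critical value must be looked up in the textbook's appendix for the actual sample size; the 0.632 used below is an assumed table value for n = 10 and should be checked against that appendix.

```python
def classify_linear_relation(r, critical_value):
    """Compare |r| with the critical value, then read the sign of r."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and 1")
    if abs(r) <= critical_value:
        return "no linear relation"
    return "positive association" if r > 0 else "negative association"

# Assumed critical value for a sample of size n = 10 (verify in the appendix).
critical_value = 0.632

for r in (0.4, -0.5, 0.9, -0.8):
    print(r, "->", classify_linear_relation(r, critical_value))
```

Taking the absolute value first handles positive and negative r with one comparison, which avoids mixing up the direction of the inequality for negative values.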

When does a strong positive or a strong negative r value imply (or not imply) causality between the two variables, i.e. that a change in the value of one variable CAUSES a CHANGE IN THE VALUE of the other variable?

It depends on how the data were collected. If the data were collected in an OBSERVATIONAL STUDY, then the results, no matter how strong the r value is, show only ASSOCIATION (a strong correlation), and the relation might be due to a lurking variable not accounted for in the study. But if the data were collected through a designed EXPERIMENT, then a strong r value can support a claim of CAUSATION.

Relationship between the scatter diagram and correlation coefficient from section 1 with what we learn in section 2. Tie it to the whole chapter.

The main theme of the chapter is how to describe a relationship between two variables. Section 1 taught us how to test for a linear relationship that exists between 2 variables first by drawing a scatter diagram and then confirming a linear relationship by computing the correlation coefficient. Let's say that there is a linear relationship that exists. We then need to learn how to describe it as accurately as POSSIBLE in an equation called the least squares regression line equation. This is the focus of section 2.

How to compute the linear correlation coefficient on a calculator

1. Press 2nd, then 0, to open the catalog.
2. Scroll to DiagnosticOn and press ENTER twice until the word Done appears.
3. Press STAT, go to CALC, choose linear regression, and enter the lists, for example L1, L2 (explanatory and response variable, respectively).

The residual value represents

How close the value that we predicted using the least squares regression line is to the actual observation. The residual is the difference between the observed value of the response variable and the predicted value of the response variable (observed minus predicted). The smaller the residual in absolute value, the better our prediction was.

Residual is like

An error because the observed value is different from our predicted value.

The least squares regression line is a line that

Minimizes the SUM of the SQUARED errors (residuals). In other words, it minimizes the SUM of the SQUARED vertical distances between the observed y values and the predicted y values (which are also called y hat).

What is y hat?

Y hat is the predicted value of the response variable using the least squares regression line equation.
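The ideas of y hat, residuals, and "least squares" can be tied together in a short computation. The sketch below uses the standard slope and intercept formulas for a least squares line; the function names and the data are hypothetical, made up for illustration.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through (mean_x, mean_y)
    return slope, intercept

def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum of (observed y - predicted y)^2 for the line y_hat = slope*x + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = least_squares_line(xs, ys)
y_hats = [slope * x + intercept for x in xs]          # predicted values (y hat)
residuals = [y - y_hat for y, y_hat in zip(ys, y_hats)]  # observed minus predicted

best = sum_squared_residuals(xs, ys, slope, intercept)
other = sum_squared_residuals(xs, ys, slope + 0.1, intercept)  # a slightly different line
print(best < other)  # True: any other line has a larger sum of squared residuals
```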

What's an interesting application of y hat? Give an example.

It can be used to represent the mean value of the response variable for any given value of the explanatory variable. For instance, suppose the best-fit regression line relates the number of hours students study to their GPA, and the line predicts a GPA of 3.8 for a student who studies 12 hours per day. Then we can say that the mean GPA of all students who study 12 hours per day is 3.8.

Interpretation of a regular slope value of m=-1/1 and m=3/2 vs. interpretation of the slope of the least squares regression line.

In general: slope measures the rate of change.
m = slope = change in y over change in x.
m=-1/1: if x increases by 1, y decreases by 1.
m=3/2: if x increases by 2, then y increases by 3.
In the case of the least squares regression line:
m=-1/1: if x increases by 1, y decreases, ON AVERAGE, by 1 (this is because the regression line describes the overall trend in the data, and there is no 100% certainty for any individual).
m=3/2: if x increases by 2, then y increases, on average, by 3 (same reason).

SOMETHING VERY IMPORTANT TO REMEMBER about the usage of least squares regression line.

Do not use the least squares regression line equation for values of the explanatory variable that are NOT within the range of values in the data set, because the linear relation that we computed may not hold true for values that are smaller or larger than the scope of values in the data set.
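One way to make this rule hard to forget is to build it into the prediction step. The sketch below is a hypothetical helper, `predict_within_range`, that refuses to predict outside the observed range of the explanatory variable; the slope, intercept, and range used in the example are made-up values.

```python
def predict_within_range(x, slope, intercept, x_min, x_max):
    """Return y hat for x, but only if x lies within the observed range."""
    if not x_min <= x <= x_max:
        raise ValueError(
            f"x={x} is outside the observed range [{x_min}, {x_max}]; "
            "the linear relation may not hold there"
        )
    return slope * x + intercept

# Hypothetical line fitted on data whose x values ran from 1 to 5.
print(predict_within_range(3, 0.6, 2.2, 1, 5))  # fine: 3 is inside the range

try:
    predict_within_range(10, 0.6, 2.2, 1, 5)    # 10 is outside the range
except ValueError as err:
    print("refused:", err)
```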

If the correlation coefficient shows no linear relation, (2 things!)

1) We can't use the least squares regression line to make predictions (y hat).
2) The best prediction (y hat) is then the mean of the response variable, whatever the value of the explanatory variable.

Contingency Table aka .... define

Two-way table. It relates two CATEGORICAL or QUALITATIVE variables, for instance level of education and employment status.

Marginal distribution of a variable compared with conditional distribution

Marginal: the frequency or relative frequency of either the row or the column variable in the contingency table.
Conditional: the RELATIVE FREQUENCY (only) of each category of the response variable (e.g. employment status) given a certain value of the explanatory variable (e.g. high school graduate) (see pg. 238 if necessary).
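The difference between the two kinds of distribution is easiest to see on a small table. The sketch below uses hypothetical counts (made up for illustration, not taken from the textbook) for education level and employment status.

```python
# Hypothetical counts, made up for illustration only.
table = {
    "Not a high school graduate": {"Employed": 30, "Unemployed": 20},
    "High school graduate": {"Employed": 135, "Unemployed": 15},
}
statuses = ["Employed", "Unemployed"]

grand_total = sum(sum(row.values()) for row in table.values())

# Marginal distribution of education: row totals divided by the grand total.
marginal_education = {
    level: sum(row.values()) / grand_total for level, row in table.items()
}

# Marginal distribution of employment: column totals divided by the grand total.
marginal_employment = {
    status: sum(row[status] for row in table.values()) / grand_total
    for status in statuses
}

# Conditional distribution of employment BY education:
# each cell divided by its own row total (the "given" category is the denominator).
conditional_by_education = {
    level: {status: count / sum(row.values()) for status, count in row.items()}
    for level, row in table.items()
}

print(marginal_employment)
print(conditional_by_education)
```

In this made-up table the conditional relative frequency of "Employed" differs noticeably between the two education levels and from the marginal relative frequency, which is the kind of difference that suggests an association.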

Marginal Distributions ... why are they called so?

Because each marginal distribution appears at either the right margin or the bottom margin of the contingency table.

What's the purpose of computing the conditional distribution? What do we use?

To describe whether a relationship exists between two qualitative variables. We use relative frequencies because there are different numbers of observations in each category of data (see pg. 237 if necessary).

Give an example of how conditional frequency can help us see if there is an association between two variables.

For instance: level of high school education and employment.
If the relative frequencies for employed people who graduated high school and who did not graduate high school are both close to the relative frequency marginal distribution for employment then the level of high school education is not really a factor in employment and is NOT ASSOCIATED with better chance of employment.

Summarize how we observe association between qualitative or categorical variables.

Compare the relative frequencies of the response variable in each category of the explanatory variable. Differences in the relative frequencies between the different categories MIGHT BE attributed to differences in the explanatory variable, and hence the explanatory and response variables are associated.

Simpson's Paradox

Describes a situation in which an association between two variables inverts or goes away when a third variable is introduced into the analysis. (see pg. 241 if necessary)
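A small numerical example helps show how an association can invert. The counts below are hypothetical, invented only for this sketch: two treatments, A and B, with patients split by a third variable, severity.

```python
# Hypothetical (successes, patients) counts, made up for illustration only.
results = {
    "A": {"mild": (18, 20), "severe": (48, 80)},
    "B": {"mild": (64, 80), "severe": (10, 20)},
}

def success_rate(successes, patients):
    return successes / patients

# Success rates within each severity group (third variable included).
by_group = {
    treatment: {sev: success_rate(*counts) for sev, counts in groups.items()}
    for treatment, groups in results.items()
}

# Overall success rates (third variable ignored: groups are pooled).
overall = {
    treatment: success_rate(
        sum(s for s, _ in groups.values()),
        sum(n for _, n in groups.values()),
    )
    for treatment, groups in results.items()
}

print(by_group)  # A has the higher rate in BOTH severity groups
print(overall)   # yet B has the higher rate when the groups are pooled
```

The reversal happens because treatment A was given mostly to severe cases, which have lower success rates under either treatment.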

Hint: When asked what proportion of ..........,

The .............. is the denominator.

Conditional Distribution of Y

That means Y is the numerator. This makes sense: imagine a question asking for the conditional distribution of employed people among those who are high school graduates (see h.w. q.2 if necessary).

Conditional Distribution by X

It means that X is the denominator (see h.w. q.2 if necessary).