STAT 313 Exam 1

response variable

measures an outcome of a study

explanatory variable

the variable used to explain or predict changes in the response variable; also called the independent variable

observational units

the individual entities on which data are recorded

Association

Two variables are associated if the conditional distribution of one variable changes depending on the value of the explanatory variable on which you are conditioning.

Simpson's Paradox

when a comparison made within each of several groups reverses or disappears once the groups are combined, so the group-level averages appear to contradict the overall averages
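A small numeric sketch can make the reversal concrete. The counts below are made up purely for illustration (they do not come from the course materials), and the names `counts`, `rate`, and `overall` are hypothetical.

```python
# Hypothetical (made-up) counts of (successes, trials) for two treatments,
# split by a third variable (severity). Not data from the course.
counts = {
    "A": {"mild": (9, 10), "severe": (30, 90)},
    "B": {"mild": (80, 90), "severe": (3, 10)},
}

def rate(successes, trials):
    """Success proportion for one cell."""
    return successes / trials

def overall(treatment):
    """Success proportion after combining the severity groups."""
    cells = counts[treatment].values()
    return sum(s for s, _ in cells) / sum(n for _, n in cells)

# Within each severity group, A has the higher success rate...
a_wins_each_group = all(
    rate(*counts["A"][g]) > rate(*counts["B"][g]) for g in ("mild", "severe")
)
# ...yet B has the higher success rate once the groups are combined.
b_wins_overall = overall("B") > overall("A")

print(a_wins_each_group, b_wins_overall)
```

The reversal happens here because treatment A was mostly given to severe cases, so severity acts as a confounding variable.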

Confounding variable

a third variable that is related to both the explanatory variable and the response variable. You may recall from your first statistics course that we always need to be concerned about confounding variables with observational studies, as they may provide an alternative explanation for an observed association between the explanatory and response variables, preventing cause-and-effect conclusions.

sample

the observational units on which we collect data

Generalizability

refers to deciding an appropriate population to which we can generalize our conclusions

Identify the response variable, the explanatory variable and the blocking variable.

RV: overall liking on a 1-15 scale
EV: barrel age of the wine (2-years old vs. new)
Block: rater

Now consider the MSError in these two tables. Explain why the MSError went up after including blocks in the analysis. Then discuss the impact this had on the F-statistic for testing the effect of barrel age, and the resulting p-value for the effect of bar

The MSError went up after including blocks because raters do not account for very much variation in the taste ratings. This means the SSError did not go down by enough to overcome the loss in error df.

The goals of blocking are to expand the scope of inference (i.e., more varieties of strawberries) while at the same time reducing the MSWithin (i.e., the MSError), to make the test of the treatment effect more precise. When an experiment is designed, the

Because the goal of blocking is to reduce the MSError, we want to block on something that will account for a lot of the unexplained/extraneous variation that is left over after accounting for the treatment variable. Because blocking will cost us df in the Error (df Error will go down as a result of blocking), we want to block on a factor that will decrease the SSError enough to overcome this loss in df Error. So, a good blocking variable will be one that accounts for lots of unexplained/extraneous variation.

Shown below are two sets of boxplots. The first set (Set #1) are boxplots of the actual HVL test scores in this study. The second set (Set #2) show boxplots of fake (made-up) HVL test scores. Thinking about the discussion in class on the F-statistic, orde

Set #1 will have the smallest F-stat and Set #2 will have the largest F-stat. In both sets of data the within group variation is the same, as seen by the same standard deviations in both sets. Thus, the denominators of the respective F-statistics will be the same. Because the means in set #2 are farther apart, the numerator of the F-stat for set #2 will be larger than that of Set #1. Thus, the F-stat for Set #2 will be bigger.
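As a rough check on this reasoning, the sketch below computes a one-way F-statistic by hand for two made-up data sets that share the same within-group spread but differ in how far apart the group means are. The data and the helper name `f_statistic` are hypothetical; they are not the HVL scores.

```python
from statistics import mean

def f_statistic(groups):
    """One-way ANOVA F = MSBetween / MSWithin for a list of groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Made-up data: identical spread inside every group in both sets.
close_means = [[4, 5, 6], [5, 6, 7], [6, 7, 8]]        # means 5, 6, 7
far_means = [[4, 5, 6], [9, 10, 11], [14, 15, 16]]     # means 5, 10, 15

f_close = f_statistic(close_means)
f_far = f_statistic(far_means)
print(f_close, f_far)
```

The denominator (MSWithin) is the same for both sets, so the larger F comes entirely from the larger spread among the group means.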

In class we simulated many, many F-statistics under the conditions of the null hypothesis (re-randomized null distribution). Explain why this distribution is right skewed. Also explain what the smallest value of the re-randomized F-statistics can be and

The smallest the simulated F-stat could be is 0. That will happen if all of the group means are identical. The re-randomization process should 'balance' the data values among the 3 groups, which keeps the means fairly similar to one another and so keeps the numerator of the simulated F-statistics small. Occasionally, the groups will end up more unbalanced in terms of the data values, and we'll get a larger discrepancy among the means, giving a bigger F-statistic. Thus, the distribution of re-randomized F-stats should be right skewed.
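The re-randomization idea can be sketched in a few lines. This is only an illustration under assumed conditions: the response values are made up, there are 3 groups of 5, and the helper names (`f_statistic`, `shuffled_f`) are hypothetical rather than anything used in class.

```python
import random
from statistics import mean, median

def f_statistic(groups):
    """One-way ANOVA F = MSBetween / MSWithin for a list of groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def shuffled_f(values, sizes):
    """Shuffle the responses, deal them back into groups, and compute F."""
    pool = values[:]
    random.shuffle(pool)
    groups, start = [], 0
    for size in sizes:
        groups.append(pool[start:start + size])
        start += size
    return f_statistic(groups)

random.seed(313)  # fixed seed so the sketch is repeatable
values = [12, 15, 11, 14, 13, 16, 18, 15, 17, 14, 13, 12, 16, 15, 14]  # made-up
null_fs = [shuffled_f(values, [5, 5, 5]) for _ in range(2000)]

print(min(null_fs) >= 0)                  # F can never be negative
print(mean(null_fs) > median(null_fs))    # mean above median suggests right skew
```

Because F is a ratio of sums of squares, it is bounded below by 0 but has no upper bound, which is why the long tail sits on the right.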

Create side-by-side boxplots and summary statistics for the 5 treatment groups. Comment on what the means and standard deviations suggest about the effect of insect control method on corn yield. Copy/paste the appropriate JMP output into part (a) of your

Because the means are quite close together relative to the standard deviations (causing lots of overlap in the boxplots), it's possible there may not be any differences in the effect of insect control method on corn yield.

Do the data meet the validity conditions for using the theoretical F-distribution? Explain citing evidence from (a) and/or the design of the study to support your statements.

The ratio of the largest standard deviation to the smallest is 3.5/1.95 = 1.8 which is less than 2, so the validity condition requiring that the standard deviations be approximately equal is met.
The boxplots in (a) look fairly symmetric, although with small sample sizes it can be difficult to say for sure. None of the distributions have strong skew, nor do any have outliers, thus the condition that the data are reasonably symmetric is met.

Do the data suggest there is a statistically significant effect due to the insect control method? Use α = 0.05. Show all steps of the overall hypothesis test (provide the null and alternative hypotheses, p-value with statistical decision, conclusion in con

With a small p-value (0.0033), there is sufficient evidence to conclude there is a statistically significant effect of insect control method on corn yield. This conclusion applies to all plots of sweet corn grown under the same conditions as the experimental plots in this study.
OR: With a small p-value (0.0033), there is sufficient evidence to conclude at least one of the mean corn yields is different among the 5 insect control methods. This conclusion applies to all plots of sweet corn grown under the same conditions as the experimental plots in this study.

Explain briefly, how you know this is a randomized complete block design. As part of this explanation, identify the response variable, the explanatory variable and the blocking variable.

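This is a randomized complete block design because every paint type was applied at each location (the block), with the order of the paint types randomized separately within each location. This means each location had one stretch of road with each paint type.
RV: wear
EV: paint type
Block: location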

Create boxplots and a table of summary statistics (means, standard deviations, and sample sizes) for each paint type (ignoring location). Comment on whether or not paint type appears to explain variation in the wear.

Paint type does appear to explain some variation in wear, since there are differences in the mean wear among the 5 paint types (e.g., mean for paint 1 is 20.5, while the mean for paint 4 is 29.375).

Use JMP Fit Y by X to carry out the analysis, including location as the block. In the Fit Y by X output, under the red hotspot, find Display options and turn on the boxplots. Also, under this hotspot turn on the output for means and standard deviations. F

After adjusting for location, there is a significant effect due to paint type (F = 30.4, p-value < 0.0001).
The mean wear for each paint type has not changed from the one-way analysis; however, the standard deviations are now much smaller.
This is because the wear values have been adjusted by the location effect. In other words, the location effects have been removed from the wear values, pulling them closer together.
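One way to picture this adjustment is the sketch below. It assumes the adjustment subtracts each location's effect (location mean minus grand mean) from every value at that location; the source does not spell out JMP's internal calculation, so treat that as an assumption. The wear values are made up for 3 paints at 4 locations.

```python
from statistics import mean, stdev

# Made-up wear values: each list holds one paint type at locations 1-4.
wear = {
    "paint A": [10, 20, 30, 40],
    "paint B": [12, 23, 31, 42],
    "paint C": [15, 24, 36, 45],
}

all_values = [x for row in wear.values() for x in row]
grand_mean = mean(all_values)

# Assumed block effect: location mean minus grand mean.
block_effect = [
    mean([row[j] for row in wear.values()]) - grand_mean for j in range(4)
]

# Remove the location effect from every observation.
adjusted = {
    paint: [x - block_effect[j] for j, x in enumerate(row)]
    for paint, row in wear.items()
}

for paint in wear:
    print(paint,
          round(mean(wear[paint]), 2), round(mean(adjusted[paint]), 2),
          round(stdev(wear[paint]), 2), round(stdev(adjusted[paint]), 2))
```

In a balanced design the location effects sum to zero within each paint type, so the paint means stay put while the spread within each paint type shrinks.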

Comment on whether or not the validity conditions are met for this analysis. Cite specific numbers and/or characteristics of the graphs to support your statements. (See Exploration 2.2 - Part C

The boxplots of the location-adjusted wear values in (c) appear reasonably symmetric. Specifically, there is not extreme skew in any of the boxplots, nor are there any outliers. Thus, the first validity condition is met. The smallest standard deviation is 0.4 and the largest is 2.3, so the largest sd is almost 6 times bigger than the smallest (2.3/0.4 = 5.75), well above the cutoff of 2. The second validity condition, requiring approximately equal standard deviations, is not met for these data.
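The equal-spread rule of thumb used here and in the corn-yield answer earlier (largest sd divided by smallest sd should be less than 2) is simple arithmetic. The sketch below uses only the largest and smallest standard deviations quoted in those answers; the helper name `sd_ratio` is hypothetical.

```python
def sd_ratio(sds):
    """Ratio of the largest to the smallest standard deviation."""
    return max(sds) / min(sds)

corn_ratio = sd_ratio([3.5, 1.95])   # corn yield: largest and smallest sd
paint_ratio = sd_ratio([2.3, 0.4])   # location-adjusted wear: largest and smallest sd

print(round(corn_ratio, 2), corn_ratio < 2)     # condition met when True
print(round(paint_ratio, 2), paint_ratio < 2)   # condition not met when False
```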

Because the study was a block design, we should analyze the data as such. Run an analysis (Fit Y by X), but this time include the blocking variable (Rater). Copy/paste the ANOVA table below. How has the significance of the barrel age, after adjusting for

The evidence for a barrel age effect has gotten weaker. The F-statistic is smaller than it was in the original analysis (F = 0.1316 vs. F = 0.2081) and the p-value is larger (p = 0.7352 vs. p = 0.6605).

Consider the two ANOVA tables from (b) and (d). How has including the blocking variable changed the SSError? How has including the blocking variable changed the df for Error?

The SSError got smaller, going from 15.375 down to 12.1625.
The df Error also got smaller, going from 8 down to 4.
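The numbers above are enough to see why the MSError went up. The sketch below uses only the SSError, df Error, and F-statistics quoted in these answers; the comparison of the two ratios assumes the mean square for barrel age is the same in both tables, which holds for a balanced block design.

```python
# Values quoted in the answers above.
ss_error_before, df_error_before = 15.375, 8    # without blocks
ss_error_after, df_error_after = 12.1625, 4     # with Rater as a block
f_before, f_after = 0.2081, 0.1316

# MSError = SSError / df Error
ms_error_before = ss_error_before / df_error_before
ms_error_after = ss_error_after / df_error_after
print(ms_error_before, ms_error_after)

# SSError dropped by about 21%, but df Error was cut in half,
# so the MSError increased.
ss_drop = 1 - ss_error_after / ss_error_before
df_drop = 1 - df_error_after / df_error_before
print(round(ss_drop, 3), round(df_drop, 3))

# If the barrel-age mean square is unchanged (assumed, balanced design),
# the F-statistics should shrink by the same factor the MSError grew.
print(round(f_after / f_before, 3), round(ms_error_before / ms_error_after, 3))
```

The two ratios in the last line agree up to rounding, which is consistent with the smaller F-statistic being driven entirely by the larger MSError.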