Right-skewed distribution
The right side of the graph has a long tail
Left-skewed distribution
The left side of the graph has a long tail
What best displays categorical variables?
Frequency tables ("tally()") or bar graphs ("gf_bar()")
What best displays quantitative variables?
Histograms or relative frequency histograms ("gf_histogram()" or "gf_dhistogram()"), boxplots ("gf_boxplot()"), and scatterplots ("gf_point()")
Type I error (alpha)
False positive results
i.e., rejecting the null hypothesis when it is actually true
Why is the mean a good model for most distributions?
It balances the deviations above and below the mean
If the mean of a variable is 22.5, what would the empty model predict for each observation in that variable?
The empty model predicts 22.5 for every observation
If the mean of a variable is 22.5 and a given observation is 26.7, what is the data?
26.7
If the mean of a variable is 22.5 and a given observation is 20.1, what is the residual?
-2.4
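The two cards above follow DATA = MODEL + ERROR. A minimal sketch of that arithmetic, written in Python rather than the R used in the course (the variable names are made up for illustration):

```python
# Empty model: every observation is predicted to be the mean.
model_prediction = 22.5   # the mean (b0), from the cards above
observation = 20.1        # the data for one case

# Residual (error) = data - model prediction.
# Rounded to one decimal to hide floating-point noise.
residual = round(observation - model_prediction, 1)
print(residual)
```

A negative residual means the observation falls below the model's prediction.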
Out of these three histograms:
1. of the variable
2. of the empty model
3. of the residuals
Which two would have a similar shape?
1. of the variable
3. of the residuals
In the GLM notation, what represents the model (or prediction) of the sample?
b0
The mean is the value that minimizes SS. What happens to SS if you use any other number, higher or lower than the mean, as the model?
The SS becomes larger than it is at the mean.
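One way to check the card above is to compute SS around the mean and around nearby values. This is only a sketch in Python (not the course's R), with made-up scores and a hypothetical helper name:

```python
def sum_of_squares(values, model):
    """Sum of squared residuals around a single-number model."""
    return sum((v - model) ** 2 for v in values)

scores = [18.0, 20.0, 22.0, 25.0, 27.5]   # made-up data; the mean is 22.5
mean = sum(scores) / len(scores)

ss_at_mean = sum_of_squares(scores, mean)
ss_lower = sum_of_squares(scores, mean - 1)    # model set below the mean
ss_higher = sum_of_squares(scores, mean + 1)   # model set above the mean

print(ss_at_mean, ss_lower, ss_higher)
```

Moving the model away from the mean in either direction increases SS.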
What is the difference between a residual and standard deviation?
A residual is how far one individual's score is from the model's prediction (the mean, in the empty model)
The standard deviation summarizes all the residuals at once: roughly the typical distance of scores from the mean.
What is the difference between SSerror and SStotal on a SUPERNOVA table?
They are sums of squared residuals from different models: SSerror comes from the complex model (here, the quantitative predictor model), and SStotal comes from the simple/NULL/empty model.
In a histogram and a density histogram, what parts are the same?
The range of the x-axis, the shape of the distribution, and which values are most likely
In magnitude~depth and magnitude~longitude, why are the SStotals the same? (we are referring to earthquakes)
Because they are both from the simple model of magnitude. In other words, they use the same outcome variable.
Which of the following F-ratios indicate that the explanatory variable isn't explaining more variance per degree of freedom than the simple model?
Any value less than 1
What characteristic is the same across the simple, categorical predictor, and quantitative predictor models?
The residuals sum to zero in each model.
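A quick check of the card above for two of the three models (the empty model and a categorical predictor model). This is a sketch in Python rather than the course's R, and the data and group labels are invented:

```python
from statistics import mean

# Hypothetical outcome values with a two-level grouping variable.
outcome = [4.0, 6.0, 5.0, 9.0, 11.0, 10.0]
group = ["a", "a", "a", "b", "b", "b"]

# Empty model: one prediction (the grand mean) for every case.
grand_mean = mean(outcome)
empty_residuals = [y - grand_mean for y in outcome]

# Categorical predictor model: each case is predicted by its group mean.
group_means = {
    g: mean([y for y, gg in zip(outcome, group) if gg == g])
    for g in set(group)
}
group_residuals = [y - group_means[g] for y, g in zip(outcome, group)]

print(sum(empty_residuals), sum(group_residuals))
```

The same holds for a least-squares regression line, which the sketch leaves out to stay short.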
What can we say about the power of aggregation?
By the law of large numbers, the larger the sample, the closer the sample mean tends to be to the population mean.
What is the key feature of a quantitative predictor model?
It explains a quantitative outcome variable with a quantitative explanatory variable (e.g., magnitude explained by longitude for earthquakes)
What is the key feature of a categorical predictor model?
It explains a quantitative outcome variable with a categorical explanatory variable (e.g., weight explained by type of food eaten)
Why is it important to examine both PRE and F-ratio (in SUPERNOVA)?
PRE gives the proportion of error the model reduces (variance accounted for), and the F-ratio adjusts that for model complexity (the degrees of freedom used)
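To see why both numbers matter, here is a sketch that computes PRE and F from sums of squares. It is written in Python rather than the course's R. The function name and all the numbers are hypothetical, chosen so that two models have the same PRE but use different numbers of parameters:

```python
def pre_and_f(ss_total, ss_error, df_model, df_error):
    """Return (PRE, F) from sums of squares and degrees of freedom."""
    ss_model = ss_total - ss_error        # error reduced by the model
    pre = ss_model / ss_total             # proportional reduction in error
    ms_model = ss_model / df_model        # error reduced per parameter added
    ms_error = ss_error / df_error        # leftover error per remaining df
    return pre, ms_model / ms_error

# Both models reduce SS from 100 to 60 in a sample of n = 21 (df total = 20).
pre_small, f_small = pre_and_f(100.0, 60.0, df_model=1, df_error=19)
pre_large, f_large = pre_and_f(100.0, 60.0, df_model=4, df_error=16)

print(pre_small, f_small)
print(pre_large, f_large)
```

PRE is identical for the two models, but the F-ratio is smaller for the one that spent more parameters to get the same reduction in error.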
Is variance impacted by sample size?
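No. Variance is an average (SS divided by n - 1), so unlike SS it does not grow just because the sample is larger, and we can compare error across two samples of different sizes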
Is variance a sample statistic or a population parameter?
Sample statistic when it is computed from sample data (s²); it estimates the population parameter (σ²)
Variance
Roughly the average of the squared deviations from the mean (SS divided by n - 1 for a sample)
Standard deviation
The square root of the variance: roughly how far scores typically fall from the mean
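The variance and standard deviation cards above can be worked by hand and compared against a library. This is a sketch in Python (the course itself uses R) with made-up scores. It assumes the sample versions, which divide by n - 1:

```python
from math import sqrt
from statistics import stdev, variance

scores = [18.0, 20.0, 22.0, 25.0, 27.5]   # made-up data; the mean is 22.5
n = len(scores)
mean = sum(scores) / n

# Sum of squared deviations from the mean (SS).
ss = sum((x - mean) ** 2 for x in scores)

var_by_hand = ss / (n - 1)      # sample variance
sd_by_hand = sqrt(var_by_hand)  # sample standard deviation

# statistics.variance and statistics.stdev also divide by n - 1.
print(var_by_hand, variance(scores))
print(sd_by_hand, stdev(scores))
```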
Larger z-scores (in absolute value) mean
Larger residuals relative to the standard deviation
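A z-score is a residual from the mean divided by the standard deviation. This sketch shows the link in Python rather than the course's R, with invented scores:

```python
from statistics import mean, stdev

scores = [18.0, 20.0, 22.0, 25.0, 27.5]   # made-up data
m = mean(scores)
s = stdev(scores)

residuals = [x - m for x in scores]       # distance from the mean
z_scores = [r / s for r in residuals]     # the same distance in SD units

for x, r, z in zip(scores, residuals, z_scores):
    print(x, r, round(z, 2))
```

Within one variable the SD is fixed, so the score with the largest residual also has the largest z-score.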
What is similar about residuals from the empty model and the complex model?
Both represent the difference between the data and the model's prediction
Margin of error refers to variability of a/an
Estimate/Statistic
SS and SAD are both
measures of total error
Sampling distributions tell us that each time we take a random sample from the population...
There will be variability in the sample statistics
What is a correct interpretation of a CI for carbon and steel bikes (106.16/110.52)?
Our data would be considered likely (likely = greater than 5%) for population mean commute times for carbon bikes between 106.16 and 110.52.
With smaller sample sizes, should we use confint.default() or confint()?
confint(), because it uses the t-distribution, which is safer with smaller samples.
Does confint.default() or confint() produce a slightly wider CI?
confint()
What is true about the sampling distribution of PREs?
They aren't normally distributed
They shouldn't be modeled with a t-dist
They don't center around the sample PRE
Why doesn't the sampling distribution of PREs cluster around the sample PRE?
Because the PREs were generated by randomization, a DGP in which any differences between groups are due only to random chance, so most of the PREs fall near 0.
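The randomization idea on the card above can be sketched by shuffling group labels and recomputing PRE each time. This is Python rather than the course's R. The data, the seed, and the helper name `pre_for_groups` are all invented for illustration:

```python
import random
from statistics import mean

def pre_for_groups(outcome, group):
    """PRE of a group-means model relative to the empty model."""
    grand = mean(outcome)
    ss_total = sum((y - grand) ** 2 for y in outcome)
    means = {
        g: mean([y for y, gg in zip(outcome, group) if gg == g])
        for g in set(group)
    }
    ss_error = sum((y - means[g]) ** 2 for y, g in zip(outcome, group))
    return (ss_total - ss_error) / ss_total

outcome = [4.0, 6.0, 5.0, 9.0, 11.0, 10.0]   # hypothetical data
group = ["a", "a", "a", "b", "b", "b"]
sample_pre = pre_for_groups(outcome, group)

rng = random.Random(1)   # fixed seed so the run is repeatable
shuffled_pres = []
for _ in range(1000):
    shuffled = group[:]
    rng.shuffle(shuffled)   # break any real link between group and outcome
    shuffled_pres.append(pre_for_groups(outcome, shuffled))

print(sample_pre, mean(shuffled_pres))
```

Because shuffling removes any real group effect, the shuffled PREs pile up near the low end instead of around the sample PRE.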
What does an F value of 3.30 mean?
That the error reduced per parameter in the model is 3.30 times the error that remains per leftover degree of freedom, i.e., 3.30 times what any other parameter that could have been added to the model would be expected to reduce
If sample PRE and sample F values are higher, then...
The p-value should be lower
If sample PRE and sample F values are lower, then...
The p-value should be higher, which means the model explains little of the variation in our sample.