equal variances
tested using the F-distribution (the F-test)
F-test
performed to test whether two samples have equal variances
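A minimal sketch of the F-test idea, in Python for illustration (the data here are simulated, not from the course): the F statistic is the ratio of the two sample variances, and the p-value comes from the F-distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0, 1.0, size=30)   # simulated sample 1
b = rng.normal(0, 1.5, size=25)   # simulated sample 2, larger spread

# F statistic: ratio of sample variances (larger over smaller keeps F >= 1)
f = np.var(a, ddof=1) / np.var(b, ddof=1)
if f < 1:
    f = 1 / f
    df1, df2 = len(b) - 1, len(a) - 1
else:
    df1, df2 = len(a) - 1, len(b) - 1

# two-sided p-value from the F distribution
p = min(2 * stats.f.sf(f, df1, df2), 1.0)
```

Note the F-test is sensitive to non-normality, which is one reason Welch's t-test is often preferred over the test-then-pool workflow.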
one-sided
p-value / 2 (halve the two-sided p-value, provided the observed effect is in the hypothesized direction)
Welch
two-sample t-test that does not assume equal variances
two-sample
the pooled version assumes equal variances
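The pooled vs. Welch distinction can be sketched in Python (data simulated for illustration): scipy's `ttest_ind` switches between the two with the `equal_var` flag.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(5.0, 1.0, size=40)   # simulated group 1
y = rng.normal(5.5, 3.0, size=15)   # simulated group 2, unequal variance

# pooled two-sample t-test: assumes equal variances
t_pool, p_pool = stats.ttest_ind(x, y, equal_var=True)

# Welch's t-test: drops the equal-variance assumption (safer default)
t_welch, p_welch = stats.ttest_ind(x, y, equal_var=False)
```

When the variances really are unequal, the two tests can give noticeably different p-values, which is why Welch's version is the common default.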
Power
ability to detect important effects
the power of a hypothesis test is:
the probability of rejecting the null hypothesis when it is false
Type I error (alpha)
false positive
Type II error (beta)
false negative
we calculate the test statistic from the data under the assumption that:
H0 is true
a small p-value suggests evidence against:
the null hypothesis
observed level of significance
p-value
increase alpha, decrease beta
power increases
decrease alpha, increase beta
power decreases
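The alpha/beta/power tradeoff above can be seen by simulation. A sketch in Python (effect size, sample size, and alpha are illustrative choices, not from the notes): repeatedly draw samples under a true effect and count how often the test rejects at level alpha.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha, effect, n, reps = 0.05, 0.5, 30, 2000  # illustrative settings

# simulate samples whose true mean is shifted by the effect size,
# then count how often a one-sample t-test rejects H0: mean = 0
rejections = 0
for _ in range(reps):
    sample = rng.normal(effect, 1.0, size=n)
    _, p = stats.ttest_1samp(sample, 0.0)
    if p < alpha:
        rejections += 1

power = rejections / reps   # estimated power = P(reject | H0 false)
```

Rerunning with a larger alpha (or a larger n, or a larger effect size) raises the estimated power, matching the tradeoffs listed above.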
effect size
the smallest change that we care about
for pairs of observations, we can compare two groups using:
the differences of the pairs
the sign test is a:
non-parametric test, meaning it does not rely on the data following any particular distribution; there are therefore no normality or symmetry conditions
pros of non-parametric tests:
fewer assumptions about the underlying distribution
cons of non-parametric tests:
-generally less powerful than parametric methods
-we might have to change the hypotheses
-there is often not a corresponding estimate of the size of the differences, such as a confidence interval
dbinom
computes probabilities from the binomial distribution (the probability mass function, P(X = k))
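The notes reference R's dbinom; a Python analog (using scipy, with illustrative numbers) is `binom.pmf`, and the same binomial distribution drives the sign test's p-value, since under H0 the signs are Binomial(n, 0.5).

```python
from scipy import stats

# Python analog of R's dbinom(3, 10, 0.5): P(exactly 3 successes in 10 trials)
prob = stats.binom.pmf(3, 10, 0.5)

# sign test sketch: 8 positive differences out of 10 pairs;
# under H0 the count of positive signs is Binomial(10, 0.5)
res = stats.binomtest(8, 10, 0.5)    # two-sided exact binomial test
```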
noise
random error
linear regression
is the statistical method of fitting a line to data
x
is the predictor/explanatory/independent variable
y
is the response/dependent variable
Residual (denoted by e)
the difference between the observed response and the predicted response (e = y − ŷ)
residuals can help us to evaluate:
how well our linear model fits the data
if a pattern exists in a residual plot:
do not use a linear model
correlation
quantifies the strength of a linear relationship between two variables
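A quick numerical sketch of correlation (the data points are made up for illustration): the Pearson correlation coefficient always lies in [−1, 1], with values near ±1 indicating a strong linear relationship.

```python
import numpy as np

# hypothetical data with a strong positive linear trend
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Pearson correlation coefficient, bounded in [-1, 1]
r = np.corrcoef(x, y)[0, 1]
```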
when fitting a least squares line, we require:
-linearity
-nearly normal residuals
-constant variability
-independent observations
Linearity
the data follows a linear trend
Nearly normal residuals
the residuals are nearly normal. If the residuals don't appear to be normal, this is possibly due to outliers or influential points (check using QQ plots)
Constant Variability
the variability of the points around the least squares line should remain roughly constant
Independent Observations
the data points should be independent of each other. One case where this might not apply is in time series data where the observations are sequential and correlated - this aspect of the model cannot be captured by linear regression
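The regression ideas above can be sketched end to end in Python (simulated data, illustrative coefficients): fit a least squares line, then compute the residuals, which should be centered at zero with no pattern against x if the model fits.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, size=50)  # linear trend + random noise

# least squares fit: y ≈ b0 + b1 * x  (polyfit returns highest degree first)
b1, b0 = np.polyfit(x, y, 1)

# residuals = observed minus predicted response
residuals = y - (b0 + b1 * x)

# a residual plot (residuals vs. x) should show no pattern;
# a QQ plot of the residuals checks the nearly-normal condition
```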