Power, Sign Test, and Simple Linear Regression

equal variances

checked using the F-distribution and the F-test

F-test

performed to see if the two groups have equal variances

one-sided

p-value / 2

Welch

unequal variance

two-sample

equal variance
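The workflow in the cards above (F-test for equal variances, then pooled vs. Welch two-sample t-test, then halving the p-value for a one-sided test) can be sketched in Python with scipy; the data here are made up for illustration.

```python
# Sketch of the two-sample workflow: F-test for equal variances,
# then pooled vs. Welch t-test. Data are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(10, 2, size=30)   # group A
b = rng.normal(11, 2, size=30)   # group B

# F-test: ratio of sample variances compared to the F-distribution
# with (n1 - 1, n2 - 1) degrees of freedom.
f_stat = np.var(a, ddof=1) / np.var(b, ddof=1)
p_f = 2 * min(stats.f.cdf(f_stat, len(a) - 1, len(b) - 1),
              stats.f.sf(f_stat, len(a) - 1, len(b) - 1))

# If the variances look equal, use the pooled (equal-variance) t-test;
# otherwise use Welch's t-test (equal_var=False).
equal_var = p_f > 0.05
t_stat, p_two_sided = stats.ttest_ind(a, b, equal_var=equal_var)

# A one-sided p-value is the two-sided one halved (when the observed
# effect is in the hypothesized direction).
p_one_sided = p_two_sided / 2
```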

Power

ability to detect important effects

the power of a hypothesis test is:

the probability of rejecting the null hypothesis when it is not true; power = 1 - beta

Type I error (alpha)

false positive

Type II error (beta)

false negative

we calculate the test statistic from the data under the assumption that:

H0 is true

a small p-value suggests evidence against:

the null hypothesis

observed level of significance

p-value

increase alpha, decrease beta

power increases

decrease alpha, increase beta

power decreases

effect size

the smallest change that we care about
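The alpha/beta/power trade-off above can be computed directly for a one-sided one-sample z-test; the effect size, sample size, and alpha values below are made up for illustration.

```python
# Power of a one-sided one-sample z-test, from the definitions above.
# effect_size is in standard-deviation units; numbers are hypothetical.
from scipy.stats import norm

def z_test_power(effect_size, n, alpha):
    """P(reject H0 | true mean is shifted by effect_size std devs)."""
    z_crit = norm.ppf(1 - alpha)          # rejection cutoff under H0
    # Under the alternative, the z statistic is centered at effect_size * sqrt(n)
    return norm.sf(z_crit - effect_size * n ** 0.5)

# Increasing alpha increases power; decreasing alpha decreases power.
p_05 = z_test_power(0.5, 20, alpha=0.05)
p_01 = z_test_power(0.5, 20, alpha=0.01)
```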

for pairs of observations, we can compare two groups using:

the differences of the pairs

the sign test is a:

non-parametric test, meaning that it does not rely on the data following any particular distribution, so there are no normality or symmetry conditions

pros of non-parametric tests:

fewer assumptions about the underlying distribution

cons of non-parametric tests:

-generally less powerful than parametric methods
-we might have to change the hypotheses
-there is often no corresponding estimate of the size of the difference, such as a confidence interval

dbinom

computes the probability mass function of the binomial distribution, P(X = k), in R
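The sign test reduces paired data to the count of positive differences, which is Binomial(n, 0.5) under H0; a sketch in Python (R's `dbinom` corresponds to `scipy.stats.binom.pmf`), with hypothetical before/after measurements.

```python
# Sign test on paired data. The before/after values are hypothetical.
from scipy.stats import binom, binomtest

before = [12, 15, 9, 14, 11, 13, 10, 16]
after_ = [14, 17, 8, 15, 13, 15, 12, 18]

diffs = [b - a for a, b in zip(before, after_)]
n_pos = sum(d > 0 for d in diffs)   # number of positive signs
n = sum(d != 0 for d in diffs)      # ties (zero differences) are dropped

# Under H0 (no difference), the sign count is Binomial(n, 0.5).
p_exact = binom.pmf(n_pos, n, 0.5)  # same as dbinom(n_pos, n, 0.5) in R
result = binomtest(n_pos, n, 0.5, alternative='greater')
```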

noise

random error

linear regression

is the statistical method of fitting a line to data

x

is the predictor/explanatory/independent variable

y

is the response/dependent variable

Residual (denoted by e)

the difference between the observed response and our predicted response: e = y - ŷ

residuals can help us to evaluate:

how well our linear model fits the data
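Fitting a least-squares line and inspecting the residuals can be sketched with numpy; the x/y values below are made up for illustration.

```python
# Least-squares line and residuals. Data are hypothetical.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

slope, intercept = np.polyfit(x, y, 1)   # fit y ≈ intercept + slope * x
y_hat = intercept + slope * x
e = y - y_hat                            # residuals

# Residuals of a least-squares fit with an intercept sum to (numerically)
# zero; the diagnostic question is whether they show a pattern.
```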

if a pattern exists in a residual plot:

do not use a linear model

correlation

quantifies the strength of a linear relationship between two variables
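The correlation coefficient r can be computed with numpy; the small dataset here is hypothetical.

```python
# Pearson correlation r between two variables. Data are hypothetical.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 6], dtype=float)

r = np.corrcoef(x, y)[0, 1]
# r lies in [-1, 1]; |r| near 1 indicates a strong linear relationship,
# r near 0 indicates little or no linear relationship.
```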

when fitting a least squares line, we require:

-linearity
-nearly normal residuals
-constant variability
-independent observations

Linearity

the data follows a linear trend

Nearly normal residuals

the residuals are nearly normal. If the residuals don't appear to be normal, this is possibly due to outliers or influential points (check using QQ plots)

Constant Variability

the variability of the points around the least squares line should remain roughly constant

Independent Observations

the data points should be independent of each other. One case where this might not apply is time series data, where observations are sequential and correlated; this structure cannot be captured by linear regression