STAT193 exam study

What is univariate data?

data about one variable

What is bivariate data?

data about two variables

What is stratified sampling?

- Stratified sampling is a probability sampling technique wherein the researcher divides the entire population into different subgroups or strata, then randomly selects the final subjects proportionally from the subgroups
- it is used instead of random sa
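
Worked example: a rough Python sketch of proportional stratified sampling as described above. The population, the stratum names and the sample size of 10 are all made up for illustration.

```python
import random

# Hypothetical population: each person is tagged with their stratum.
population = (
    [("first-year", i) for i in range(60)]
    + [("second-year", i) for i in range(30)]
    + [("third-year", i) for i in range(10)]
)
sample_size = 10  # assumed total sample size

random.seed(1)  # fixed seed so the sketch is repeatable

# Divide the population into strata.
strata = {}
for stratum, person in population:
    strata.setdefault(stratum, []).append((stratum, person))

# Randomly select from each stratum in proportion to its share of the population.
sample = []
for stratum, members in strata.items():
    n_from_stratum = round(sample_size * len(members) / len(population))
    sample.extend(random.sample(members, n_from_stratum))

counts = {s: sum(1 for t, _ in sample if t == s) for s in strata}
print(counts)  # {'first-year': 6, 'second-year': 3, 'third-year': 1}
```

With less tidy numbers the rounded stratum sizes may not add up to the intended total, so the allocation would need adjusting by hand.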

What are the four types of variable?

- numerical continuous
- numerical discrete
- categorical ordinal
- categorical nominal

What is a numerical continuous variable?

a numerical variable that can take any value within a range

What is a numerical discrete variable?

a numerical variable that is limited to certain values

What is a categorical ordinal variable?

a categorical variable with a natural order

What is a categorical nominal variable?

a categorical variable with no natural order

What is the correct data display for a numerical continuous variable?

dotplot/boxplot/histogram

What is the correct data display for a numerical discrete variable?

barchart or histogram

What is the correct data display for a categorical ordinal variable?

barchart

What is the correct data display for a categorical nominal variable?

barchart

What are the main features to look for in a histogram?

- symmetry
- skew
- range
- number and location of modes
- outliers

What is a percentile?

the xth percentile is the value below which x percent of the data lie

What are the 'special' percentiles?

- 25th percentile = lower quartile
- 50th percentile = median
- 75th percentile = upper quartile

What is an outlier?

an outlier is defined as a value more than 1.5 x IQR below the LQ or above the UQ

What is the IQR?

- interquartile range
- found by subtracting the LQ from the UQ
- IQR = UQ - LQ
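
Worked example: a small Python sketch of the quartiles, the IQR and the 1.5 x IQR outlier rule. The data are made up, and the quartile method used here (Python's "inclusive" method) is an assumption; the method taught for hand calculation may give slightly different quartiles.

```python
import statistics

data = [2, 4, 4, 5, 6, 7, 8, 9, 30]  # made-up values

# Quartile conventions differ between textbooks and software;
# "inclusive" is just one of them.
lq, median, uq = statistics.quantiles(data, n=4, method="inclusive")
iqr = uq - lq  # IQR = UQ - LQ

# A value is an outlier if it is more than 1.5 x IQR below the LQ or above the UQ.
lower_fence = lq - 1.5 * iqr
upper_fence = uq + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(lq, median, uq)  # 4.0 6.0 8.0
print(iqr)             # 4.0
print(outliers)        # [30]
```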

What is the mean and how do you calculate it?

- the mean tells us the typical or central location of our data
- add all values and divide by the number of values

What is standard deviation?

- the standard deviation measures the spread of the data
- it is the square root of the variance, i.e. the square root of the average squared deviation (distance) from the mean

What is variance?

- the sample variance is a measure of the spread (or amount of variation in our data)
- it is the square of the standard deviation
- variance = SD²
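
Worked example: a short Python sketch of the mean, the sample variance and the sample standard deviation on made-up data. It assumes the usual sample formulas, which divide by n - 1 rather than n.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample

# Mean: add all values and divide by the number of values.
mean = sum(data) / len(data)

# Sample variance: sum of squared deviations from the mean, divided by (n - 1).
squared_deviations = [(x - mean) ** 2 for x in data]
variance = sum(squared_deviations) / (len(data) - 1)

# Standard deviation: the square root of the variance.
sd = math.sqrt(variance)

# The standard library functions agree with the hand calculation.
assert math.isclose(variance, statistics.variance(data))
assert math.isclose(sd, statistics.stdev(data))

print(mean, variance, sd)
```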

What is μ?

- the population mean
- "mu"

What is σ?

- the population standard deviation
- "sigma"

What is σ²?

- the population variance
- "sigma squared"

What is X̄?

sample mean

What is S?

sample standard deviation

What is S²?

sample variance

What data display should be used for two categorical variables?

clustered bar chart

What data display should be used for one categorical variable with one numerical variable?

side by side boxplots

What data display should be used for two numerical variables?

scatterplot

What is a monotonic graph?

a graph that is either always increasing or always decreasing from left to right

What is a non-monotonic graph?

a graph that changes direction

What are mutually exclusive events?

events that cannot occur simultaneously

What are complementary events?

two mutually exclusive events that between them cover every possible outcome (so their probabilities add to 1)

What are independent events?

events that have no influence on each other

What does P(A∩B) mean?

- the probability of A intersection B
- meaning the probability of A and B both occurring

What does P(A∪B) mean?

- the probability of A union B
- meaning the probability of A or B (or both) occurring

What does P(A|B) mean?

- the probability of A given B
- meaning the conditional probability that A occurs, given that B has already occurred
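
Worked example: a Python sketch of intersection, union and conditional probability for one roll of a fair die. The events A and B are made up, and the conditional probability uses the standard formula P(A|B) = P(A∩B) / P(B).

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}  # one roll of a fair die
A = {2, 4, 6}                  # event A: the roll is even
B = {4, 5, 6}                  # event B: the roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))

p_a_and_b = prob(A & B)               # intersection: outcomes {4, 6}
p_a_or_b = prob(A | B)                # union: outcomes {2, 4, 5, 6}
p_a_given_b = prob(A & B) / prob(B)   # conditional: A given B

print(p_a_and_b, p_a_or_b, p_a_given_b)  # 1/3 2/3 2/3
```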

What is the probability rule when you have two independent events?

P(A∩B) = P(A) × P(B)

What is the probability rule when you have two mutually exclusive events?

P(A∪B) = P(A) + P(B)
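
Worked example: a Python sketch that checks both rules by listing every outcome of two rolls of a fair die. The particular events are made up for illustration.

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))  # 36 equally likely (first, second) pairs

def prob(condition):
    """Probability that a pair of rolls satisfies the condition."""
    return Fraction(sum(1 for r in rolls if condition(r)), len(rolls))

# Independent events: the first roll is even, the second roll is a six.
p_even = prob(lambda r: r[0] % 2 == 0)
p_six = prob(lambda r: r[1] == 6)
p_even_and_six = prob(lambda r: r[0] % 2 == 0 and r[1] == 6)
assert p_even_and_six == p_even * p_six  # multiplication rule holds

# Mutually exclusive events: the first roll is a 1, the first roll is a 2.
p_one = prob(lambda r: r[0] == 1)
p_two = prob(lambda r: r[0] == 2)
p_one_or_two = prob(lambda r: r[0] == 1 or r[0] == 2)
assert p_one_or_two == p_one + p_two  # addition rule holds

print(p_even_and_six, p_one_or_two)  # 1/12 1/3
```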

What is a Bernoulli trial?

- a trial where there are only two outcomes
- the two outcomes are usually called (often arbitrarily) 'success' and 'failure'
- Bernoulli trials form the basis of the binomial distribution

What are the necessary conditions for a binomial distribution?

1. fixed number of trials
2. fixed probability of success
3. two possible outcomes
4. trials are independent

What is the distribution of X for a binomial distribution?

X ~ Bin(n, p)
- where n is the number of trials
- and p is the probability of success on each trial
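
Worked example: a Python sketch of a binomial probability, assuming the standard formula P(X = k) = C(n, k) × p^k × (1 - p)^(n - k). The coin-toss numbers are made up, and `binomial_pmf` is a helper name invented for this sketch.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) when X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.5  # 10 tosses of a fair coin, 'success' = heads

p_three_heads = binomial_pmf(3, n, p)
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))

print(p_three_heads)  # 0.1171875
print(total)          # 1.0 (probabilities over all possible k add to 1)
```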

What is the distribution of X for a normal distribution?

X ~ N(μ, σ²)
- μ is the population mean
- σ² is the population variance

What is a Z distribution?

- a special kind of normal distribution where the mean is zero and the standard deviation is one
- Z~N(0,1)
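
Worked example: a Python sketch of converting a value from a normal distribution to a z-score, assuming the standard formula z = (x - μ) / σ. The values of μ, σ and x are made up.

```python
from math import isclose
from statistics import NormalDist

mu, sigma = 100, 15  # hypothetical population mean and standard deviation
x = 130

# Standardise: how many standard deviations is x above the mean?
z = (x - mu) / sigma

# P(Z < z) from the Z distribution, N(0, 1).
p_below = NormalDist(0, 1).cdf(z)

# The same probability straight from the original distribution.
# NormalDist takes the standard deviation, not the variance.
p_below_direct = NormalDist(mu, sigma).cdf(x)
assert isclose(p_below, p_below_direct)

print(z, round(p_below, 4))  # 2.0 0.9772
```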

What are the six steps of hypothesis testing?

1. form two opposing hypotheses H0 and H1
2. decide on a significance level α for the test
3. calculate the test statistic (x or t or z)
4. find the p-value using the test statistic
5. either 'reject' or 'fail to reject' the null hypothesis
6. make a conc
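
Worked example: the steps above sketched in Python as a two-sided z-test for a mean. This assumes the population standard deviation is known, and every number is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

# Step 1: H0: mu = 50, H1: mu != 50 (two-sided).
mu_0 = 50

# Step 2: significance level.
alpha = 0.05

# Step 3: test statistic from a hypothetical sample.
sample_mean, sigma, n = 52.5, 8, 64  # sigma assumed known
z = (sample_mean - mu_0) / (sigma / sqrt(n))

# Step 4: two-sided p-value from the Z distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Step 5: reject or fail to reject H0.
decision = "reject H0" if p_value < alpha else "fail to reject H0"

# Step 6: state the result in words.
print(f"z = {z:.2f}, p-value = {p_value:.4f}: {decision}")
```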

When should a sign test be used?

- when there is a small sample and we want to use the median
- it is suitable when the data is skewed
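
Worked example: a Python sketch of a two-sided sign test. The data and the hypothesised median of 10 are made up. It assumes the usual approach: values equal to the hypothesised median are dropped, and under H0 the number of values above the median follows Bin(n, 0.5).

```python
from math import comb

data = [12, 15, 9, 14, 18, 11, 13, 16, 20, 8]  # made-up small sample
hypothesised_median = 10                        # H0: median = 10

above = sum(1 for x in data if x > hypothesised_median)
below = sum(1 for x in data if x < hypothesised_median)
n = above + below  # ties with the hypothesised median are dropped

# P(X >= k) for X ~ Bin(n, 0.5), where k is the larger of the two counts.
k = max(above, below)
tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
p_value = min(1.0, 2 * tail)  # two-sided

print(above, below, p_value)  # 8 2 0.109375
```

Here the p-value is above 0.05, so at the 5% level we would fail to reject H0.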

When should a t distribution be used?

- based on a normal distribution - symmetric, very few outliers
- sample size is "small" and σ is unknown
- the mean of a t distribution is zero

What is a type 1 error?

- if we reject H0 when it is true then we are making a type 1 error
- concluding a change has occurred when it hasn't
- false positive
- the probability of making a type 1 error is α (alpha), the significance level

What is a type 2 error?

- if we fail to reject H0 when it is false we are making a type 2 error
- concluding there has been no change when there has been one
- false negative

What is the central limit theorem (CLT)?

- if we take many random samples, all of size n, then as long as n is 'large enough' (a common rule of thumb is n > 30), the distribution of the sample means will be approximately normal
- this is true even when the population we are sampling from is not itself normally distributed
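
Worked example: a small Python simulation of the CLT. The population here is an exponential distribution with mean 1 (strongly skewed, so clearly not normal); the sample size of 40 and the 2000 repeats are arbitrary choices for the sketch.

```python
import random
import statistics

random.seed(193)  # fixed seed so the sketch is repeatable

n = 40              # size of each sample (above the n > 30 rule of thumb)
num_samples = 2000  # number of samples taken

# Take many samples from a skewed population and record each sample mean.
sample_means = []
for _ in range(num_samples):
    sample = [random.expovariate(1.0) for _ in range(n)]
    sample_means.append(statistics.mean(sample))

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)

# The sample means should centre on the population mean (1.0),
# with a spread close to sigma / sqrt(n) = 1 / sqrt(40), about 0.158.
print(round(mean_of_means, 3), round(sd_of_means, 3))
```

A histogram of `sample_means` would look roughly symmetric and bell-shaped, even though the individual values are heavily right-skewed.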

What test should be used for two categorical variables?

chi-square test

How are degrees of freedom calculated in a chi-square test?

df = (no. of rows minus one) x (no. of columns minus one)
df = (R - 1) x (C - 1)

What are the necessary conditions for a chi-square test?

- observations must be independent
- each observation must only appear once
- it is also required that there are very few cells with expected values lower than 5
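
Worked example: a Python sketch of the chi-square calculations for a 2 x 2 table of made-up counts. It assumes the standard formulas: expected count = (row total × column total) / grand total, and chi-square = sum of (observed - expected)² / expected. Finding the p-value would still need a chi-square table or software.

```python
observed = [
    [10, 20],
    [30, 40],
]  # made-up counts: rows and columns are the two categorical variables

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell if the two variables were independent.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# df = (R - 1) x (C - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Check the condition on expected values.
all_expected_at_least_5 = all(e >= 5 for row in expected for e in row)

print(expected)              # [[12.0, 18.0], [28.0, 42.0]]
print(round(chi_square, 4))  # 0.7937
print(df, all_expected_at_least_5)  # 1 True
```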

What test should be used for categorical by numerical variables?

ANOVA test

How are the degrees of freedom for an ANOVA calculated?

k-1 (k is no. of groups) and n-k (n is total sample size)
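
Worked example: a Python sketch of the ANOVA degrees of freedom and F statistic for three made-up groups. It assumes the standard one-way ANOVA formula, F = (between-group sum of squares / (k - 1)) / (within-group sum of squares / (n - k)).

```python
groups = [
    [1, 2, 3],
    [2, 3, 4],
    [5, 6, 7],
]  # made-up numerical values for each level of the categorical variable

k = len(groups)                  # number of groups
n = sum(len(g) for g in groups)  # total sample size

grand_mean = sum(sum(g) for g in groups) / n
group_means = [sum(g) / len(g) for g in groups]

# Variation between the group means and within each group.
ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

df_between = k - 1
df_within = n - k

f_statistic = (ss_between / df_between) / (ss_within / df_within)

print(df_between, df_within)  # 2 6
print(round(f_statistic, 2))  # 13.0
```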

What are the necessary conditions for an ANOVA test?

- the groups have approximately equal variances
- the groups are approximately normally distributed
- sampling was random

What does it mean if a confidence interval does/does not include zero?

- if the confidence interval does include zero, we have no evidence of a difference between the means, so we do not reject H0
- if the confidence interval is purely positive or purely negative (does not include zero), we have evidence of a difference, so we reject H0

Why would a t-test for difference of 2 means be used instead of an ANOVA?

- if the categorical variable only has 2 levels, we use a t-test for the difference of 2 means instead of ANOVA
- we need the same assumptions as ANOVA
- the advantage of using a t-test is that it can be used for a one-sided test (ANOVA can only be used f

What is a residual?

a residual is the numerical difference between an observed value and a predicted value

How is a residual calculated?

residual = observed value - predicted value
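
Worked example: a tiny Python sketch of residuals. The fitted line y = 2 + 3x and the observations are hypothetical.

```python
def predict(x):
    """Predicted value from a hypothetical fitted line y = 2 + 3x."""
    return 2 + 3 * x

observations = [(1, 6), (2, 7), (3, 11)]  # made-up (x, observed y) pairs

# residual = observed value - predicted value
residuals = [y - predict(x) for x, y in observations]

print(residuals)  # [1, -1, 0]
```

A positive residual means the point lies above the line; a negative residual means it lies below.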

What type of test is used when we have two numerical variables?

linear regression

In linear regression which axis is the explanatory and which is the response?

the x-axis is the explanatory and the y-axis is the response

How do we interpret the correlation coefficient?

- between 0 and 0.3 (ignoring the sign) = weak
- between 0.3 and 0.7 = moderate
- between 0.7 and 1 = strong
- the sign shows whether the relationship is positive or negative

What is the coefficient of determination?

the proportion of the variation in Y that is explained by the variation in X
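
Worked example: a Python sketch that fits a least-squares line to made-up data and computes the correlation coefficient r and the coefficient of determination r². It assumes the standard formulas: slope = Sxy / Sxx and r = Sxy / sqrt(Sxx × Syy).

```python
from math import sqrt

x = [1, 2, 3, 4, 5]  # explanatory variable (made-up)
y = [2, 4, 5, 4, 5]  # response variable (made-up)

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross-products about the means.
s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
s_xx = sum((a - x_bar) ** 2 for a in x)
s_yy = sum((b - y_bar) ** 2 for b in y)

slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

r = s_xy / sqrt(s_xx * s_yy)  # correlation coefficient
r_squared = r**2              # coefficient of determination

print(round(slope, 2), round(intercept, 2))  # 0.6 2.2
print(round(r, 3), round(r_squared, 2))      # 0.775 0.6
```

Here r is about 0.77, a strong positive relationship, and about 60% of the variation in y is explained by the variation in x.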

What is interpolation?

when we predict using a value of x within the range of our data

What is extrapolation?

when we predict using a value of x outside the range of our data

What are the assumptions for linear regression?

- straight line relationship between X and Y (curvature?)
- constant variance (funnelling?)
- normally distributed residuals (outliers?)
- independent observations