Categorical or Quantitative
Categorical - a variable that places individuals into groups/categories (e.g., green eyes)
Quantitative - a variable that takes numerical values it makes sense to average (e.g., test scores)
describing and comparing distribution(s) - SOCS
One distribution - Shape, Outliers/unusual features, Center, Spread - in context.
If comparing two or more distributions - same as one, except use comparison words!
Z-Score
z= (observation - mean)/sd
Used to standardize a score
Used to compare different individuals relative to their own groups (e.g., a man's height relative to men versus a woman's height relative to women)
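The z-score formula above can be sketched in code; the heights, means, and SDs below are made-up numbers for illustration only.

```python
# Standardizing scores with z = (observation - mean) / sd.
def z_score(observation, mean, sd):
    """Number of standard deviations the observation lies from the mean."""
    return (observation - mean) / sd

# Hypothetical: a 74-inch man (group mean 70, SD 3) vs. a 69-inch woman
# (group mean 64.5, SD 2.5).
man_z = z_score(74, 70, 3)        # about 1.33 SDs above the mean
woman_z = z_score(69, 64.5, 2.5)  # 1.8 SDs above the mean
# The woman is taller relative to her group, even though the man is taller in inches.
```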
Mean/SD OR Median/IQR
Use mean/SD to describe when the data set is approximately Normal or roughly symmetric.
Use median/IQR to describe when the data set is skewed. (These values are less affected by outliers - resistant to outliers)
standard deviation and variance
SD - The typical distance of each observation from the mean.
Variance is the standard deviation squared.
Five Number Summary and Outlier Rule
Min, Q1, Median, Q3, Max
1.5×IQR Outlier Rule -
IQR = Q3 - Q1
An observation is an outlier if it is above Q3 + (1.5×IQR) or
below Q1 - (1.5×IQR)
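A sketch of the five-number summary and the 1.5×IQR rule. Quartile conventions vary between textbooks and software, so this uses the simple median-of-halves method rather than a library call.

```python
# Five-number summary (min, Q1, median, Q3, max) and 1.5*IQR outlier rule.
def five_number_summary(data):
    xs = sorted(data)
    n = len(xs)
    def median(v):
        m = len(v) // 2
        return v[m] if len(v) % 2 else (v[m - 1] + v[m]) / 2
    lower = xs[:n // 2]        # lower half (excludes median if n is odd)
    upper = xs[(n + 1) // 2:]  # upper half
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

def outliers(data):
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]

# Example: 48 is far above the rest, so the rule flags it.
print(outliers([1, 2, 3, 4, 5, 6, 7, 48]))  # [48]
```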
DOFS - describing a scatterplot
Direction - positive, negative
Outliers - including high leverage and influential points; also note unusual features
Form - linear or not
Strength - strong, moderately strong, weak, moderately weak
Parameter and statistics
Parameters come from populations
Statistics come from samples
Census versus Sample
Census - attempt to collect data from every individual in the population
Sample - a subset of the population
Good Sampling
SRS, stratified, cluster, systematic
Bad Sampling
Voluntary, convenience
Bias - explain bias and in which direction it would bias the study
Undercoverage, response, nonresponse, question wording
Empirical Rule
68-95-99.7
Used to estimate areas in a Normal distribution
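The 68-95-99.7 percentages are approximations to exact Normal areas; this sketch checks them against the standard Normal CDF computed with `math.erf` (no calculator or SciPy needed).

```python
# Verify the empirical rule against exact standard Normal areas.
import math

def normal_cdf(z):
    """P(Z <= z) for a standard Normal variable."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

for k in (1, 2, 3):
    area = normal_cdf(k) - normal_cdf(-k)  # area within k SDs of the mean
    print(f"within {k} SD: {area:.4f}")
# within 1 SD: 0.6827
# within 2 SD: 0.9545
# within 3 SD: 0.9973
```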
Describing random sampling and random assignment
Be specific!
Label your participants
Explain how you are using a chance process
What are you doing with repeats
Experimental Designs
Completely randomized
Randomized block
Matched pairs (type of block - blocks of 2)
Good Experimental Design Components
Random - random assignment
Control (not same as control group) - trying to limit other variables (confounding) that might affect outcome
Replication - using enough subjects or experimental units
Comparison - compare two or more treatments
Simulation - SPDC
A way to imitate chance behavior.
If asked to set one up - SPDC
State - question of interest
Plan - describe how to use chance to imitate one repetition - explain thoroughly and tell what you will record
Do - perform many repetitions
Conclude - use the simulation results to answer the question of interest
Law of large numbers
If we perform many, many repetitions of a chance process, the proportion of times an outcome occurs will approach a single number (its probability).
Example: if we roll a die many, many times, the probability of getting a 2 is 1/6. I may not get that in 10 rolls, 20 rolls, etc., but over many, many rolls the proportion of 2s will approach 1/6.
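The die example above can be run as a simulation: roll many times and watch the proportion of 2s settle near 1/6 (the seed is fixed here so the run is reproducible).

```python
# Simulate many die rolls; the proportion of 2s approaches 1/6.
import random

random.seed(42)  # fixed seed so the run is reproducible
rolls = 100_000
twos = sum(1 for _ in range(rolls) if random.randint(1, 6) == 2)
print(twos / rolls)  # close to 1/6 ~ 0.1667
```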
independent events
Two events that have no effect on each other.
P(N|W)=P(N)
If these two probabilities are equal, then the two events, N and W, are independent.
The fact that W occurred has no effect on the probability of N.
mutually exclusive events
Two events that cannot occur at the same time. (Male and Pregnant)
Binomial and Geometric RV
B - success/failure
I - independent observations
N - fixed number of trials (binomial); continue until you get one success (geometric)
S - same probability of success on each trial
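Under the BINS conditions above, the binomial and geometric probability formulas can be sketched directly; the die-roll numbers below are just an example.

```python
# Binomial: P(exactly k successes in n trials).
# Geometric: P(first success on trial k).
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def geom_pmf(k, p):
    return (1 - p)**(k - 1) * p

# Example: probability of exactly 2 twos in 5 die rolls,
# and of the first 2 appearing on roll 3.
print(round(binom_pmf(2, 5, 1/6), 4))  # 0.1608
print(round(geom_pmf(3, 1/6), 4))      # 0.1157
```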
Scope of Inference
If random sampling then we can make inferences about the population from which we sampled.
If random assignment occurs we can make cause and effect inferences, for those similar to the ones in the study.
Sample distribution vs Sampling distribution vs. Population Distribution
Sample - Data for one sample (10 red chips and 10 blue chips from 1 sample of 20 chips)
Sampling - one dot represents the proportion of red chips (successes), so one dot on the plot represents all 20 chips in one sample. You would need many, many samples to build the sampling distribution.
Sampling Distributions
If the scenario is a sampling distribution don't forget to use the formula to find the standard deviation/standard error before using normalcdf.
Don't use the standard deviation of the original distribution. Remember the sampling distribution has lower variability.
Large Counts for Normal condition
Only used for proportions
CLT - Central Limit Theorem for Normal condition
Only used for means
Use Confidence Interval
when estimating a value
Point estimator/Point estimate
estimator - A statistic that estimates a population parameter.
Sample proportions and sample means are unbiased estimators.
estimate - is the value from your sample.
The confidence interval is the point estimate +/- margin of error.
SPDC - confidence intervals
STATE- What are you estimating, define parameter and CI level
PLAN - name and conditions
DO - calculator name and CI
CONCLUDE - We are ____% confident.........CONTEXT
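The DO step for a one-proportion z-interval can be sketched as point estimate ± margin of error; the sample counts and confidence level below are hypothetical.

```python
# One-proportion z-interval: p-hat +/- z* * sqrt(p-hat(1 - p-hat)/n).
import math

def one_prop_z_interval(successes, n, z_star):
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of p-hat
    margin = z_star * se                     # margin of error
    return p_hat - margin, p_hat + margin

# Hypothetical sample: 60 successes out of 100, 95% confidence (z* = 1.96)
low, high = one_prop_z_interval(60, 100, 1.96)
print(round(low, 3), round(high, 3))  # 0.504 0.696
```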
Use Hypothesis/Significance Test
when testing a claim, evaluating evidence, asked if we have convincing evidence
Hypotheses and Conclusions are always talking about ______
parameters
SPDC - hypotheses(significance) test
STATE - Null and Alternative Hypotheses (in terms of parameter) define parameter, and alpha level
PLAN - name and conditions
DO- calculator name, test statistic, p-value and df(means)
CONCLUDE - Since our p-value......CONTEXT
test statistic
The number of standard deviations the sample statistic lies away from a hypothesized population parameter.
z for proportions, t for means
Pooled proportion
Use in a 2-sample z test (not a CI) for the difference of proportions, both in the Normal/Large Counts condition and in the test statistic (found on the calculator)
Never Pool for means
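A sketch of where the pooled proportion enters the 2-sample z test statistic; the success counts below are made up for illustration.

```python
# Two-proportion z test statistic using the pooled proportion in the SE.
import math

def two_prop_z(x1, n1, x2, n2):
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)  # pooled proportion (combine all successes)
    se = math.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
    return (p1 - p2) / se           # test statistic z

# Hypothetical: 45/100 successes vs. 30/100 successes
print(round(two_prop_z(45, 100, 30, 100), 2))  # 2.19
```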
Statistically significant
When the p-value is smaller than alpha we say the results are statistically significant at the alpha level we used.
Meaning we do have evidence for the alternative, so we reject the null.
power
The probability of rejecting the null given that the alternative is true (Power + Beta = 1)
Type 1 Error
Rejecting the null when null was true (alpha)
Type II Error
Failing to reject null when alternative was true (beta)
(Power + Beta = 1)
Residuals
R = A(actual) - P(predicted)
The difference between the actual data value and the prediction line.
If the data value is above the prediction line, the residual is positive.
If the data value is below the prediction line, the residual is negative.
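The residual rule above (actual minus predicted, positive above the line, negative below) can be sketched with a hypothetical prediction line y-hat = 2 + 0.5x (numbers made up).

```python
# Residual = actual - predicted, for a hypothetical line y-hat = 2 + 0.5x.
def predict(x):
    return 2 + 0.5 * x            # prediction line y-hat = a + bx

def residual(x, actual_y):
    return actual_y - predict(x)  # actual minus predicted

print(residual(10, 9))   # 2.0  -> point is above the line
print(residual(10, 5))   # -2.0 -> point is below the line
```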
Scatterplots - prediction line
y-hat (response) = intercept + slope(x, explanatory)
Always define variables - better yet, write them within the equation, such as (price of truck)-hat = intercept + slope(miles on truck)
Unusual points in scatterplots
Outlier (high residual)
High Leverage (x-value is outside of the bulk of the data)
Influential (changes slope, intercept, correlation values)
correlation coefficient(r)
The strength and direction of a linear association between two quantitative variables.
Does not tell whether a line is a good fit; it only gives strength and direction once you've determined a line is a good fit. Use the scatterplot and residual plot for that.
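One way to see what r measures: it is the average product of the paired z-scores (dividing by n - 1). This sketch computes it from scratch for a small made-up data set.

```python
# Correlation r as the average product of paired z-scores (divide by n - 1).
import math

def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    return sum((x - mx) / sx * (y - my) / sy for x, y in zip(xs, ys)) / (n - 1)

# A perfectly linear positive association gives r = 1 (up to rounding)
print(round(correlation([1, 2, 3, 4], [2, 4, 6, 8]), 4))  # 1.0
```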
Correlation/Association is NOT Causation
You need an experiment with random assignment to make inferences about cause and effect
coefficient of determination (r^2) - interpretation
Approximately ______% of the variation in the y (response) can be explained by the linear relationship, or LSRL, with x (explanatory).
confidence interval interpretation
We are ____% confident the interval between _____ and ____ captures the true ____________________________.
confidence level interpretation
If the study was repeated many times, about ____% of the resulting confidence intervals would contain the true population ______________________________________(Context)
p-value interpretation
Assuming the null is true (in context), there is a _______ probability of getting a sample value as extreme as, or more extreme than, the observed one, just by chance.
Y-Intercept (a) interpretation
The PREDICTED value of the y (response) variable when the x (explanatory) variable is 0.
Sometimes this value makes sense in context and sometimes it doesn't - we always interpret the same way though.
Slope (b) interpretation
The amount by which the y (response) is PREDICTED to change when x increases by 1 unit.
Z-Score Interpretation
The number of standard deviations an individual is above or below the mean
s - on computer output for prediction lines and Interpretation
Standard deviation of the residuals
Interpretation - measures the typical size of the residuals (prediction errors) when using the LSRL
Percentile and interpretation
Describe the location of an individual within a distribution.
Your percentile is the percent of observations that are below or equal to your value.