Statistics
The art of collecting, organizing, summarizing, and describing data as well as drawing inferences from the data
Probability
A quantification of our uncertainty about our conclusions; a number between 0 and 1, inclusive
Descriptive statistics
Describing, collecting, organizing, and summarizing what is observed from the sample
Inferential Statistics
drawing conclusions about the population from the sample data
Population
the complete set of units we are interested in studying, size = N
Sample
a subset of the population, size = n
Well Representative Sample
A sample in which the characteristics of the sample match the characteristics of the population
Parameter
A numerical value based on the population
Statistic
A numerical value based on the sample
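To make the population/sample and parameter/statistic distinction concrete, here is a minimal sketch. The population is made up (1000 simulated values), and the sample size of 40 is an arbitrary choice for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: N = 1000 made-up measurements.
population = [random.gauss(50, 10) for _ in range(1000)]
N = len(population)

# A simple random sample of size n = 40, drawn without replacement.
sample = random.sample(population, 40)
n = len(sample)

parameter = statistics.mean(population)  # population mean: a parameter
statistic = statistics.mean(sample)      # sample mean: a statistic

print(f"N = {N}, population mean = {parameter:.2f}")
print(f"n = {n}, sample mean = {statistic:.2f}")
```

The statistic changes from sample to sample, while the parameter stays fixed for a given population.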
Variable
A characteristic that takes on different values for different people, places, or things
Constant
A characteristic whose value does not vary across the units of the population
Quantitative variable
A variable whose values are numerical measurements or counts
Qualitative variable
A characteristic whose values are categories rather than numerical measurements; also called a categorical variable
Random variable
a variable whose exact value cannot be determined in advance because it results from chance; formally, a rule that assigns a numerical quantity to each outcome in the sample space; usually denoted X, Y, Z
Discrete random variable
the variable can assume only specific, separated values (a finite or countable set)
Continuous random variable
the variable can theoretically assume any value on a given interval
Measures of central tendency
convey a "typical" value of the data set: mean, median, mode
Mean
the arithmetic average of the data: μ (population), x bar (sample) - is not a resistant measure
Median
the middle value of the ordered data: M (population), x tilde (sample) - is a resistant measure
Mode
the data value that occurs most often - is a resistant measure
Outlier
a data value that is extreme relative to the rest of the data
Resistant measure
a measure that is not strongly affected by outliers
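The effect of an outlier on the three measures of central tendency can be checked directly. This is a small sketch using Python's standard `statistics` module; the data values are made up for illustration.

```python
import statistics

data = [2, 3, 3, 4, 5, 6, 7]   # made-up values
with_outlier = data + [100]    # same data plus one extreme value

for label, values in [("original", data), ("with outlier", with_outlier)]:
    print(
        label,
        "mean =", round(statistics.mean(values), 2),
        "median =", statistics.median(values),
        "mode =", statistics.mode(values),
    )
```

The mean moves a long way toward the outlier, the median shifts only slightly, and the mode does not change.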
Measures of Dispersion
Convey information about the amount of variability in the data: range, standard deviation, variance, and IQR
Range
the difference between the largest and smallest data values - is not a resistant measure
Variance
The average squared deviation of the data values from the mean: σ^2 (population), s^2 (sample) - is not a resistant measure
Standard Deviation
The square root of the variance; measures the variability in the original units of the data: σ (population), s (sample) - is not a resistant measure
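A short sketch of the range, variance, and standard deviation, again with made-up data. It treats the same eight values first as a whole population (divide by N) and then as a sample (divide by n - 1), which is the usual convention for σ^2 versus s^2.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values with mean 5

data_range = max(data) - min(data)

# Treating the data as a whole population: divide by N.
pop_var = statistics.pvariance(data)
pop_sd = statistics.pstdev(data)

# Treating the data as a sample: divide by n - 1.
samp_var = statistics.variance(data)
samp_sd = statistics.stdev(data)

print("range =", data_range)
print("population variance =", pop_var, "population sd =", pop_sd)
print("sample variance =", round(samp_var, 3), "sample sd =", round(samp_sd, 3))
```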
Measures of Location
Convey information about the location of a specific data value compared to the other data values: Percentiles, Quartiles, Standardized Values
Percentile
The mth percentile is the value x such that m% of the data values are less than x and (100-m)% are greater than x
Quartile
Specific percentiles dividing the data into quarters: Q1=P25, Q2=P50, Q3=P75
IQR
Interquartile Range; describes the variability of the middle 50% of the data: IQR = Q3 - Q1 - is a resistant measure
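Quartiles and the IQR can be sketched with `statistics.quantiles`. Textbooks and software use several slightly different rules for locating quartiles; the `method="inclusive"` choice below is an assumption made for this example, not necessarily the rule used in a given course.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up values, already sorted

# n=4 splits the data into quarters and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print("Q1 =", q1, "Q2 (median) =", q2, "Q3 =", q3)
print("IQR =", iqr)
```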
Experiment
A process in which the outcome cannot be predicted ahead of time
Sample space
a collection of all possible outcomes
Event
a subset of the sample space
Equally likely events
each outcome in the event has the same chance of occurring
Mutually exclusive events
cannot happen at the same time; there are no elements in common between the two events
Exhaustive events
Every element in S occurs in one of the events. There are no elements in S that are not in one of the events listed.
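The event vocabulary above maps neatly onto set operations. The sketch below uses one roll of a fair six-sided die as the experiment; the helper names `mutually_exclusive`, `exhaustive`, and `probability` are hypothetical, chosen only for this illustration.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # sample space for one roll of a die
evens = {2, 4, 6}        # an event is a subset of S
odds = {1, 3, 5}
low = {1, 2}

def mutually_exclusive(a, b):
    """True when the two events share no outcomes."""
    return a.isdisjoint(b)

def exhaustive(events, sample_space):
    """True when every outcome in the sample space is in some event."""
    return set().union(*events) == sample_space

def probability(event, sample_space):
    """Valid only when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

print(mutually_exclusive(evens, odds))   # evens and odds never overlap
print(mutually_exclusive(evens, low))    # both contain 2
print(exhaustive([evens, odds], S))
print(probability(low, S))
```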
Probability Distribution
Describes the population of data values - all the possible values that can be attained by the random variable
Probability Density Function
A continuous curve that describes the distribution of a continuous random variable
Probability Mass Function
A table, graph, formula, or any device used to specify all possible values of a discrete random variable along with their respective probabilities, gives probability at a point. Properties: all f(x) values are between 0 and 1, f(x)=0 for all x not in S, a
Probability Histogram
the graph of a PMF: the horizontal axis is the random variable and the vertical axis is the probability. Each probability is represented by a bar of width one, centered at the value of the random variable, with height equal to the probability.
CDF
Cumulative Distribution Function, gives cumulative probability
The Expected Value
the long-run average of a particular distribution; the expected value or mean value of a random variable
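As an illustration of a PMF, a CDF, and an expected value, here is a minimal sketch for a fair six-sided die (an example chosen here, not taken from the cards). Exact fractions are used so the probabilities sum to exactly 1.

```python
from fractions import Fraction

# PMF of a fair die: each value 1..6 has probability 1/6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def cdf(x):
    """Cumulative probability P(X <= x), summed from the PMF."""
    return sum(p for value, p in pmf.items() if value <= x)

# Expected value: each value weighted by its probability.
expected_value = sum(value * p for value, p in pmf.items())

print("total probability =", sum(pmf.values()))
print("P(X = 3) =", pmf[3])
print("P(X <= 3) =", cdf(3))
print("E(X) =", expected_value)
```

Between the possible values the CDF stays flat and it jumps at each value, which is why its graph is a step function.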
Bernoulli Trial
A random process or experiment that results in one of only two mutually exclusive outcomes
Step Function
The graph of the CDF of a discrete random variable
Uniform Probability Distribution
The distribution of a continuous random variable whose values are equally likely over its range of possible values
Normal Distribution
Bell-shaped curve; the area under the curve equals 1; determined by μ and σ; z-scores tell us how many standard deviations x is above or below the mean
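A z-score and the corresponding area under a normal curve can be computed with `statistics.NormalDist`. The values μ = 100, σ = 15, and x = 130 are made up for this sketch.

```python
from statistics import NormalDist

mu, sigma = 100, 15          # hypothetical mean and standard deviation
dist = NormalDist(mu, sigma)

x = 130
z = (x - mu) / sigma         # standard deviations above the mean
area_below = dist.cdf(x)     # P(X <= x), area under the curve left of x

print("z =", z)
print("P(X <= 130) =", round(area_below, 4))
```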
Sampling Distribution
The distribution of all possible values that can be assumed by some statistic
Central Limit Theorem
Suppose a random sample of size n is selected from any population. When n is sufficiently large, the sampling distribution of x bar has an approximate normal distribution. As n gets larger, this approximation becomes better. N is greater than or equal to
Theorem
If a random sample of n observations is selected from a normal population, the sampling distribution of x bar will also have a normal distribution
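The Central Limit Theorem can be seen in a quick simulation. The sketch below draws repeated samples from a strongly skewed population (exponential with mean 1 and standard deviation 1) and looks at the resulting sample means. The sample size n = 40 and the 2000 repetitions are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(1)   # fixed seed so the sketch is repeatable

n = 40               # size of each sample (arbitrary choice)
repetitions = 2000   # number of samples drawn

# Each entry is the mean (x bar) of one random sample of size n.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(repetitions)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)

print("mean of the sample means:", round(mean_of_means, 3))
print("sd of the sample means:  ", round(sd_of_means, 3))
print("sigma / sqrt(n):         ", round(1 / n ** 0.5, 3))
```

The sample means cluster around the population mean, with spread close to σ divided by the square root of n, even though the population itself is far from normal.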
Bernoulli Random Variable
1. Does the RV count the number of successes?
2. Is there a finite Bernoulli process: a finite number of trials, only two outcomes per trial, a constant probability of success (the probability is the parameter), and independent trials? When both checks pass, the count of successes follows a binomial distribution.
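When the checks above are satisfied, the number of successes in n trials follows the binomial distribution, and its PMF can be sketched as below. The function name `binomial_pmf` and the values n = 10 and p = 0.5 are assumptions made for the example.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, success prob. p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.5   # hypothetical: 10 flips of a fair coin

probabilities = [binomial_pmf(k, n, p) for k in range(n + 1)]
mean = sum(k * prob for k, prob in enumerate(probabilities))

print("P(X = 3) =", binomial_pmf(3, n, p))
print("total probability =", sum(probabilities))
print("E(X) =", mean)
```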
Point estimate
A single value of a statistic used to estimate the value of a parameter. The sample mean is an unbiased point estimate of the population mean, and the sample proportion is an unbiased point estimate of the population proportion.
Confidence Interval
An interval estimate of the parameter in which we include how "confident" we are that the interval contains the parameter we are estimating
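One common confidence interval is the z-interval for a population mean. The sketch below assumes the population standard deviation σ is known and the sampling distribution of x bar is normal; the data, σ = 3, and the 95% confidence level are all made up for illustration.

```python
from math import sqrt
from statistics import NormalDist, fmean

sample = [98, 102, 101, 97, 105, 99, 103, 100, 104, 101]  # made-up data
sigma = 3.0         # assumed known population standard deviation
confidence = 0.95   # chosen confidence level

n = len(sample)
x_bar = fmean(sample)                                     # point estimate
z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96
margin = z_star * sigma / sqrt(n)                         # margin of error

lower, upper = x_bar - margin, x_bar + margin
print(f"point estimate: {x_bar:.2f}")
print(f"{confidence:.0%} confidence interval: ({lower:.2f}, {upper:.2f})")
```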
Hypothesis
A statement about one or more populations, usually concerned with the value of a parameter
Research hypothesis
The conjecture that motivates research
Statistical hypothesis
hypotheses that are stated in such a way that they may be evaluated by appropriate statistical techniques
Null hypothesis
Ho, the hypothesis being tested; it is assumed to be true throughout the test, and the test ends by either rejecting it or failing to reject it
Alternative hypothesis
Ha, the hypothesis that corresponds to what the researcher is trying to support
Level of significance
alpha, the probability that the test statistic will fall in the rejection region if the null hypothesis is true; this is set by the researcher as a "small probability."
Test statistic
The formula used to find the observed value, OV, of the test statistic, where the OV is the quantity used to make a decision in a hypothesis test, telling us where on the distribution curve the point estimate falls. T.S.=(relevant statistic-hypothesized v
Assumptions of the test statistic
Properties that must be satisfied in order for your test statistic to be valid and have the assumed distribution
Rejection Region (RR)
Those values of the test statistic that provide strong evidence in favor of the alternative hypothesis; the "region" on the distribution curve that "unlikely" values of the point estimate will fall if the assumed value of the parameter is true. If the OV
P-value
the probability of getting a value of the point estimate that is favorable or more favorable to the alternative hypothesis, if the null hypothesis is true. The probability if the point estimate is an "unlikely" or "likely" value if the assumed parameter i
Decision of the Test
Reject Ho or fail to reject Ho
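The pieces of a hypothesis test can be put together in a short worked sketch. It assumes a one-sided z-test of Ho: μ = 100 against Ha: μ > 100 with a known σ, and it assumes the test statistic has the usual standardized form (point estimate minus hypothesized value, divided by the standard error). All numbers are made up.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 100      # hypothesized value under Ho
sigma = 15      # assumed known population standard deviation
n = 36          # sample size
x_bar = 105     # observed sample mean (point estimate)
alpha = 0.05    # level of significance

standard_error = sigma / sqrt(n)
observed_value = (x_bar - mu_0) / standard_error   # OV of the test statistic

# Rejection region for Ha: mu > mu_0 is OV > critical value.
critical_value = NormalDist().inv_cdf(1 - alpha)

# P-value: chance of an OV at least this large if Ho is true.
p_value = 1 - NormalDist().cdf(observed_value)

decision = "reject Ho" if p_value < alpha else "fail to reject Ho"
print("OV =", observed_value, "critical value =", round(critical_value, 3))
print("p-value =", round(p_value, 4), "decision:", decision)
```

Comparing the OV with the critical value and comparing the P-value with alpha always lead to the same decision.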
Type-I error
Rejecting a true null hypothesis; probability of making this error is alpha
Type-II error
Failing to reject a false null hypothesis; probability of making the error is beta
Power
The probability of rejecting a false null hypothesis, 1-beta
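Alpha, beta, and power can be computed for a specific alternative. The sketch below assumes a one-sided z-test of Ho: μ = 100 against Ha: μ > 100 with a known σ, and a hypothetical true mean of 105; every number is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 100      # value of the mean under Ho
mu_true = 105   # hypothetical true mean, so Ho is false
sigma = 15      # assumed known population standard deviation
n = 36
alpha = 0.05    # probability of a Type-I error

standard_error = sigma / sqrt(n)

# Reject Ho when x bar exceeds this cutoff.
critical_x_bar = mu_0 + NormalDist().inv_cdf(1 - alpha) * standard_error

# Type-II error: x bar falls below the cutoff although mu = mu_true.
beta = NormalDist(mu_true, standard_error).cdf(critical_x_bar)
power = 1 - beta

print("cutoff for x bar =", round(critical_x_bar, 3))
print("beta =", round(beta, 4), "power =", round(power, 4))
```

Raising n or moving the true mean farther from the hypothesized value shrinks beta and so increases the power.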