Statistics Term Definitions

Statistics

The art of collecting, organizing, summarizing, and describing data as well as drawing inferences from the data

Probability

Quantification of our uncertainty about our conclusions; between 0 and 1

Descriptive statistics

Describing, collecting, organizing, and summarizing what is observed from the sample

Inferential Statistics

drawing conclusions about the population from the sample data

Population

the complete set of units we are interested in studying, size = N

Sample

a subset of the population, size = n

Well Representative Sample

A sample in which the characteristics of the sample match the characteristics of the population

Parameter

A numerical value based on the population

Statistic

A numerical value based on the sample

Variable

A characteristic that takes on different values for different people, places, or things

Constant

Values in the population that do not vary

Quantitative variable

A variable whose values are numerical measurements or counts

Qualitative variable

A characteristic that cannot be measured numerically and is described by categories instead; also called a categorical variable

Random variable

a variable whose exact value cannot be determined in advance and is the result of a chance factor; a rule that assigns a numerical quantity to each outcome in the sample space: X,Y,Z

Discrete random variable

the variable can only assume specific values

Continuous random variable

the variable can theoretically assume any value on a given interval

Measures of central tendency

conveys a "typical" value of the data set: mean, median, mode

Mean

the arithmetic average of the data values (their sum divided by how many there are): μ (population), x̄ (sample) - is not a resistance measure

Median

the middle value of the ordered data: M (population), x̃ (x-tilde, sample) - is a resistance measure

Mode

the data value that occurs the most often - is a resistance measure

Outlier

a data value that is an extreme value

Resistance measure

a measure that is not affected by outliers
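
To see what "not affected by outliers" means in practice, here is a minimal sketch using Python's standard `statistics` module. The data values are made up for illustration; the point is only to compare each measure before and after one extreme value is added.

```python
import statistics

data = [2, 3, 3, 4, 5]          # made-up values, no outlier
with_outlier = data + [100]     # same values plus one extreme value

mean_before = statistics.mean(data)
mean_after = statistics.mean(with_outlier)

median_before = statistics.median(data)
median_after = statistics.median(with_outlier)

mode_before = statistics.mode(data)
mode_after = statistics.mode(with_outlier)

# The mean is pulled strongly toward the outlier; the median barely
# moves and the mode does not move at all.
print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
print("mode:  ", mode_before, "->", mode_after)
```

In this example the mean moves from 3.4 to 19.5, while the median only moves from 3 to 3.5 and the mode stays at 3.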

Measures of Dispersion

Convey information about the amount of variability in the data: range, standard deviation, variance, and IQR

Range

compares the largest and smallest data values - is not a resistance measure

Variance

The average squared deviation of the data values from the mean: σ^2 (population), s^2 (sample) - is not a resistance measure

Standard Deviation

The square root of the variance; measures the variability in the original units of the data: σ (population), s (sample) - is not a resistance measure
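
The population and sample versions differ only in the divisor (N versus n - 1). The sketch below works both out by hand on made-up data and checks them against Python's `statistics` module; the variable names are illustrative.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up values
mean = sum(data) / len(data)
squared_deviations = [(x - mean) ** 2 for x in data]

# Population versions divide by N; sample versions divide by n - 1.
pop_variance = sum(squared_deviations) / len(data)
sample_variance = sum(squared_deviations) / (len(data) - 1)

# The standard deviation is the square root of the variance, which
# puts it back in the original units of the data.
pop_sd = math.sqrt(pop_variance)
sample_sd = math.sqrt(sample_variance)

# The standard library should agree with the hand computation.
assert math.isclose(pop_variance, statistics.pvariance(data))
assert math.isclose(sample_variance, statistics.variance(data))
assert math.isclose(pop_sd, statistics.pstdev(data))
assert math.isclose(sample_sd, statistics.stdev(data))
```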

Measures of Location

Conveys information about the location of a specific data value compared to the other data values: Percentiles, Quartiles, Standardized Values

Percentile

The mth percentile is the value x such that m% of the data values are less than x and (100-m)% are greater than x

Quartile

Specific percentiles dividing the data into quarters: Q1=P25, Q2=P50, Q3=P75

IQR

Interquartile Range describes the variability of the middle 50% of the data: Q3-Q1 - is a resistance measure
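
A short sketch of quartiles and the IQR using `statistics.quantiles` from Python's standard library. The data are made up, and note that textbooks and software use several slightly different quartile conventions, so the values for a given data set may not match a hand method exactly.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]   # made-up values

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
data_range = max(data) - min(data)

# Replace the largest value with an extreme one: the range changes
# a lot, while the IQR (a resistance measure) stays the same here.
extreme = data[:-1] + [1000]
e1, e2, e3 = statistics.quantiles(extreme, n=4)
iqr_extreme = e3 - e1
range_extreme = max(extreme) - min(extreme)

print("IQR:  ", iqr, "->", iqr_extreme)
print("range:", data_range, "->", range_extreme)
```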

Experiment

A process in which the outcome cannot be predicted ahead of time

Sample space

a collection of all possible outcomes

Event

a subset of the sample space

Equally likely events

each outcome in the event has the same chance of occurring

Mutually exclusive events

cannot happen at the same time; there are no elements in common between the two events

Exhaustive events

Every element in S occurs in one of the events. There are no elements in S that are not in one of the events listed.
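
Sample spaces and events map naturally onto sets. The sketch below assumes a single roll of a fair six-sided die (an example chosen for illustration) and checks the "mutually exclusive" and "exhaustive" conditions with set operations.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}   # S for one roll of a die
even = {2, 4, 6}                    # event: roll is even
odd = {1, 3, 5}                     # event: roll is odd
low = {1, 2}                        # event: roll is 1 or 2


def probability(event):
    """P(event) when every outcome in S is equally likely."""
    return Fraction(len(event), len(sample_space))


# Mutually exclusive: the events have no elements in common.
even_odd_exclusive = even.isdisjoint(odd)
even_low_exclusive = even.isdisjoint(low)   # both contain 2

# Exhaustive: together the events cover every element of S.
even_odd_exhaustive = (even | odd) == sample_space
even_low_exhaustive = (even | low) == sample_space

print(probability(even), even_odd_exclusive, even_odd_exhaustive)
```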

Probability Distribution

Describes the population of data values - all the possible values that can be attained by the random variable

Probability Density Function

A continuous curve that describes the distribution of a continuous random variable

Probability Mass Function

A table, graph, formula, or any device used to specify all possible values of a discrete random variable along with their respective probabilities, gives probability at a point. Properties: all f(x) values are between 0 and 1, f(x)=0 for all x not in S, a

Probability Histogram

the graph of a PMF: the horizontal axis is the random variable and the vertical axis is the probability. Each probability is represented by a bar of width one, centered on the value of the random variable, with a height equal to the probability.

CDF

Cumulative Distribution Function, gives cumulative probability

The Expected Value

the long-run average of a particular distribution; the expected value or mean value of a random variable
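
For a discrete random variable the expected value is the sum of each value times its probability, and the CDF is the running total of the PMF. A minimal sketch, assuming a fair six-sided die as the example distribution and using exact fractions to avoid rounding:

```python
from fractions import Fraction
from itertools import accumulate

values = [1, 2, 3, 4, 5, 6]
pmf = {x: Fraction(1, 6) for x in values}   # f(x) for a fair die

# PMF properties: every f(x) is between 0 and 1, and they sum to 1.
total_probability = sum(pmf.values())

# Expected value: sum of x * f(x) over all possible values.
expected_value = sum(x * p for x, p in pmf.items())

# CDF: cumulative probability P(X <= x), the running sum of the PMF.
cdf = dict(zip(values, accumulate(pmf[x] for x in values)))

print(expected_value)   # 7/2
print(cdf[3])           # 1/2
```

No single roll can equal 7/2; it is the average that a long run of rolls settles toward.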

Bernoulli Trial

When a random process or experiment results in one of only two mutually exclusive outcomes

Step Function

Graph of the CDF of a discrete random variable

Uniform Probability Distribution

Continuous random variables that have equally likely outcomes over their range of possible values

Normal Distribution

Bell-shaped curve, area under curve equals 1, determined by μ and σ; z-scores tell us how many standard deviations x is above or below the mean
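
A z-score is computed as z = (x - mean) / standard deviation. The sketch below uses `statistics.NormalDist` from Python's standard library; the mean of 100 and standard deviation of 15 are assumed values chosen only for illustration.

```python
from statistics import NormalDist

mu, sigma = 100, 15          # assumed population mean and sd
x = 130                      # the value we want to locate

# z-score: how many standard deviations x is above the mean.
z = (x - mu) / sigma

# Area under the normal curve to the left of x, i.e. P(X <= x).
area_left = NormalDist(mu, sigma).cdf(x)

# The same area comes from the standard normal curve evaluated at z.
area_left_standard = NormalDist().cdf(z)

print(z, round(area_left, 4))
```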

Sampling Distribution

The distribution of all possible values that can be assumed by some statistic

Central Limit Theorem

Suppose a random sample of size n is selected from any population. When n is sufficiently large, the sampling distribution of x bar has an approximate normal distribution. As n gets larger, this approximation becomes better. N is greater than or equal to
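
The theorem can be checked informally by simulation. The sketch below draws many samples from a strongly skewed population (an exponential distribution with mean 1, chosen as an assumption for illustration) and looks at the resulting sample means. The sample size of 40 and the number of samples are arbitrary choices.

```python
import math
import random
import statistics

random.seed(0)               # fixed seed so the run is repeatable

n = 40                       # size of each sample
num_samples = 2000           # how many sample means to collect
pop_mean, pop_sd = 1.0, 1.0  # exponential(1) has mean 1 and sd 1

sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

center = statistics.fmean(sample_means)
spread = statistics.stdev(sample_means)
standard_error = pop_sd / math.sqrt(n)   # theoretical sd of x bar

# For a normal distribution about 95% of values fall within
# 1.96 standard deviations of the mean.
within = sum(
    abs(m - pop_mean) <= 1.96 * standard_error for m in sample_means
) / num_samples

print(round(center, 3), round(spread, 3), round(within, 3))
```

If the approximation is working, the center should be close to the population mean, the spread close to the population standard deviation divided by the square root of n, and the last figure close to 0.95.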

Theorem

If a random sample of n observations is selected from a normal population, the sampling distribution of x bar will also have a normal distribution

Bernoulli Random Variable

1. Is the RV counting the number of successes?
2. Do I have a finite Bernoulli process: are there a finite number of trials? only two outcomes? constant probability (probability describes the parameter)? trials are independent?
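
A variable that passes both checks is usually called a binomial random variable, and its probabilities follow the binomial formula P(X = k) = C(n, k) p^k (1 - p)^(n - k). A minimal sketch, assuming 10 independent trials with success probability 0.5 (values chosen for illustration):

```python
import math


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent Bernoulli trials)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 10, 0.5               # assumed number of trials and P(success)

prob_three = binomial_pmf(3, n, p)

# The probabilities over k = 0..n should sum to 1, and the expected
# number of successes should equal n * p.
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))
mean = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))

print(prob_three, total, mean)
```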

Point estimate

A single value of a statistic used to estimate the value of a parameter. The sample mean is an unbiased point estimate of the population mean and the sample proportion is an unbiased point estimate of the population proportion.

Confidence Interval

An interval estimate of the parameter in which we include how "confident" we are that the interval contains the parameter we are estimating
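
One common form of interval is point estimate ± (critical value × standard error). The sketch below builds a 95% interval for a population mean, assuming the population standard deviation is known and the sampling distribution of x bar is normal; all numbers are made up for illustration.

```python
import math
from statistics import NormalDist

x_bar = 50.0        # sample mean (the point estimate)
sigma = 10.0        # population sd, assumed known
n = 25              # sample size
confidence = 0.95

# Critical value: cuts off (1 - confidence) / 2 in each tail.
z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

standard_error = sigma / math.sqrt(n)
margin = z_star * standard_error

lower, upper = x_bar - margin, x_bar + margin
print(round(lower, 2), round(upper, 2))
```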

Hypothesis

A statement about one or more populations, usually concerned with the value of a parameter

Research hypothesis

The conjecture that motivates research

Statistical hypothesis

hypotheses that are stated in such a way that they may be evaluated by appropriate statistical techniques

Null hypothesis

Ho, the hypothesis being tested; it is assumed to be true throughout the test, and we either reject it or fail to reject it

Alternative hypothesis

Ha, corresponds to what the researcher is trying to prove

Level of significance

alpha, the probability that the test statistic will fall in the rejection region if the null hypothesis is true; this is set by the researcher as a "small" probability.

Test statistic

The formula used to find the observed value, OV, of the test statistic, where the OV is the quantity used to make a decision in a hypothesis test, telling us where on the distribution curve the point estimate falls. T.S.=(relevant statistic-hypothesized v

Assumptions of the test statistic

Properties that must be satisfied in order for your test statistic to be valid and have the assumed distribution

Rejection Region (RR)

Those values of the test statistic that provide strong evidence in favor of the alternative hypothesis; the "region" on the distribution curve that "unlikely" values of the point estimate will fall if the assumed value of the parameter is true. If the OV

P-value

the probability of getting a value of the point estimate that is favorable or more favorable to the alternative hypothesis, if the null hypothesis is true. The probability if the point estimate is an "unlikely" or "likely" value if the assumed parameter i

Decision of the Test

Reject Ho or fail to reject Ho
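
The pieces of a test (test statistic, rejection region, p-value, decision) fit together as in the sketch below. It assumes a one-sample, upper-tailed z test of Ho: mean = 100 against Ha: mean > 100 with a known population standard deviation; the test and all numbers are illustrative choices, not the only form a test can take.

```python
import math
from statistics import NormalDist

mu0 = 100.0      # hypothesized mean under Ho
sigma = 15.0     # population sd, assumed known
n = 36           # sample size
x_bar = 105.0    # observed sample mean
alpha = 0.05     # level of significance

# Observed value of the test statistic:
# (statistic - hypothesized value) / standard error.
standard_error = sigma / math.sqrt(n)
observed_value = (x_bar - mu0) / standard_error

# Rejection region for an upper-tailed test: z above the critical value.
critical_value = NormalDist().inv_cdf(1 - alpha)
in_rejection_region = observed_value > critical_value

# P-value: probability of a test statistic at least this large if Ho is true.
p_value = 1 - NormalDist().cdf(observed_value)

# Both approaches lead to the same decision.
decision = "reject Ho" if p_value < alpha else "fail to reject Ho"
print(observed_value, round(p_value, 4), decision)
```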

Type-I error

Rejecting a true null hypothesis; probability of making this error is alpha

Type-II error

Failing to reject a false null hypothesis; probability of making the error is beta

Power

The probability of rejecting a false null hypothesis, 1-beta
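
Power depends on how far the true parameter is from the hypothesized one. The sketch below assumes an upper-tailed z test of Ho: mean = 100 with known standard deviation 15, n = 36, and alpha = 0.05 (illustrative values), and computes the power when the true mean is 106.

```python
import math
from statistics import NormalDist

mu0, sigma, n, alpha = 100.0, 15.0, 36, 0.05   # assumed test setup
standard_error = sigma / math.sqrt(n)

# Ho is rejected when the sample mean exceeds this cutoff.
cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * standard_error


def rejection_probability(true_mean):
    """P(sample mean > cutoff) when the population mean is true_mean."""
    return 1 - NormalDist(true_mean, standard_error).cdf(cutoff)


# If Ho is true, rejecting is a Type-I error: probability alpha.
type_1 = rejection_probability(mu0)

# If the true mean is 106, rejecting is the correct decision.
power = rejection_probability(106.0)
beta = 1 - power             # probability of a Type-II error

print(round(type_1, 4), round(power, 4), round(beta, 4))
```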