AP Statistics Terms

Data analysis

Organizing, displaying, summarizing, and asking questions about data

Individuals

The objects described by a set of data. They may be people, animals, or things

Variables

Any characteristic of an individual. It can take different values for different individuals

Categorical variable

A variable that places an individual into one of several groups or categories

Quantitative variable

A variable that takes numerical values for which it makes sense to find an average.
Ex. It would be useful to find the average of individuals' GPAs
Ex. It would not be useful to find the average of individuals' zip codes or genders

Distribution

The _______ of a variable tells us what values the variable takes and how often it takes these values

Inference

Drawing conclusions that go beyond the data at hand

Frequency table

A table that displays the count of individuals in each category

Relative frequency table

A table that shows the percent of individuals in each category
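
As a minimal sketch of how a frequency table and a relative frequency table relate, the snippet below tallies a small made-up data set. The category names and the ten responses are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical data: one categorical value recorded for each of 10 individuals.
formats = ["rock", "news", "rock", "country", "rock",
           "news", "country", "rock", "news", "rock"]

counts = Counter(formats)                  # frequency table: category -> count
total = sum(counts.values())
relative = {category: 100 * count / total  # relative frequency table, in percent
            for category, count in counts.items()}

for category in counts:
    print(f"{category:8s} {counts[category]:3d} {relative[category]:6.1f}%")
```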

Pie chart

A circular chart whose slices show the percent of the whole made up by each category

Bar graph

A chart whose bar heights show the count or percent of individuals in each category, so that categories can be compared

Roundoff error

The result that occurs when, with percentages, the total should add to 100% but only comes close to it, around 99.9% or some other value near it

Marginal distribution

The ______ ______ of one of the categorical variables in a two-way table of counts is the distribution of values of that variable among all individuals described by the table. (This tells us nothing about the relationship between two variables, though)

Conditional distribution

A ______ ______ of a variable describes the values of that variable among individuals who have a specific value of another variable. There is a separate ______ ______ for each value of the other variable.
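
The difference between a marginal and a conditional distribution can be sketched with a small two-way table. The counts and the variable names (opinion, gender) below are hypothetical.

```python
# Hypothetical two-way table of counts: rows = opinion, columns = gender.
table = {
    "yes": {"female": 30, "male": 20},
    "no":  {"female": 10, "male": 40},
}

grand_total = sum(sum(row.values()) for row in table.values())

# Marginal distribution of opinion: row totals divided by the grand total.
marginal_opinion = {opinion: sum(row.values()) / grand_total
                    for opinion, row in table.items()}

# Conditional distribution of opinion among females:
# each female count divided by the female column total.
female_total = sum(row["female"] for row in table.values())
opinion_given_female = {opinion: row["female"] / female_total
                        for opinion, row in table.items()}

print(marginal_opinion)       # proportions over all individuals in the table
print(opinion_given_female)   # proportions over the females only
```

The marginal distribution uses every individual in the table, while the conditional distribution restricts attention to one column before computing proportions.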

Dotplot

A graph for quantitative data where each data value is shown as a dot above its location on a number line.

Shape, outliers, center, spread

Four terms needed to describe a dotplot (SOCS)

Stemplot

Simple graphical display of a distribution that includes the actual numerical values in the graph

Two-way table

Describes two categorical variables, organizing counts according to a row variable and a column variable

Side-by-side bar graph

A type of graph with two or more different variables being displayed and compared next to each other

Association

We say that there is ________ between two variables if specific values of one variable tend to occur in common with specific values of the other

Histogram

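The most common graph of the distribution of one quantitative variable. Values are grouped into classes of equal width, and the height of each bar shows the count or percent of observations in that class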

Outlier

An individual value that falls outside the overall pattern

Symmetric

A distribution is roughly _______ if the left and right sides of the graph are approximately mirror images of each other

Spread

The ______ of a distribution tells us how much variability there is in the data. One way to describe it is to give the largest and smallest numbers. Another way is to compute the range of the data

Variability

How much a set of data varies, described by the spread

Median

The center of a group of data, where half of the values lie above this point and half below

First quartile (Q1)

The median of the lower half of data (from the minimum to the median).

Third quartile (Q3)

The median of the upper half of data (from the median to the maximum)

Interquartile range (IQR)

Q3-Q1=_____, the range of the middle 50% of the data

1.5 x IQR

Rule for outliers: if a piece of data falls more than ________ above Q3 or below Q1, it is classified as an outlier

Five-number summary

The ____ ______ ______ of a distribution consists of the minimum, Q1, the median (M), Q3, and the maximum, written in order from smallest to largest
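
The quartiles, the 1.5 x IQR rule, and the five-number summary can be computed together. This is a sketch under one common convention (quartiles as medians of the lower and upper halves, leaving out the overall median when the count is odd); calculators and software may use slightly different rules. The data and function names are hypothetical.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max), taking Q1 and Q3 as the medians
    of the lower and upper halves (the overall median is excluded from
    both halves when the number of observations is odd)."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]
    upper = values[half + 1:] if n % 2 else values[half:]
    return values[0], median(lower), median(values), median(upper), values[-1]

def outliers(data):
    """Observations more than 1.5 x IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]

travel_times = [5, 10, 10, 15, 20, 20, 25, 30, 40, 85]   # hypothetical data
print(five_number_summary(travel_times))
print(outliers(travel_times))
```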

Boxplot

Box with edges from Q1 to Q3, line within box at median, whiskers extending from the box to the smallest and largest observations that are not outliers.

Variance

The "average" squared deviation of the observations from their mean: the sum of the squared deviations (x_i - x̄)^2, divided by n-1

Standard deviation

The square root of the "average" squared deviation

Mean

The ____, also called x̄ ("x-bar"), for a set of observations is the sum of the observations divided by the number of observations. Also seen in the formula x̄ = (Σ x_i)/n
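
The mean, variance, and standard deviation defined above can be worked through step by step. This sketch assumes the sample variance with divisor n - 1, the usual AP Statistics convention; the data values are hypothetical.

```python
from math import sqrt

observations = [2, 4, 4, 4, 5, 5, 7, 9]          # hypothetical data
n = len(observations)

mean = sum(observations) / n                      # x-bar = (sum of x_i) / n
squared_deviations = [(x - mean) ** 2 for x in observations]
variance = sum(squared_deviations) / (n - 1)      # assumed divisor: n - 1
std_dev = sqrt(variance)                          # square root of the variance

print(mean, variance, std_dev)
```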

Sigma Σ

Short for "add them all up." Greek symbol

Resistant

A statistic is a ________ measure of center or spread if it is not easily affected by extreme observations. For example, the median and IQR are _______ measures, while the mean and standard deviation are not

Percentile

The pth _______ of a distribution is the value with p percent of the observations less than that

Cumulative relative frequency graph

Graph with points corresponding to the cumulative relative frequency in each class at the smallest value of the next class

Standardized value or z-score

If x is an observation from a distribution that has known mean and standard deviation, the standardized value of x is z=(x-mean)/standard deviation
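
A short sketch of the standardization formula follows; the test scores, mean, and standard deviation are hypothetical.

```python
def z_score(x, mean, std_dev):
    """Standardized value: how many standard deviations x lies from the mean."""
    return (x - mean) / std_dev

# Hypothetical test: class mean 80, standard deviation 6.
print(z_score(86, 80, 6))   # a score above the mean gives a positive z
print(z_score(71, 80, 6))   # a score below the mean gives a negative z
```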

z=(x-mean)/standard deviation

z-score formula

Density curve

A curve that is always on or above the horizontal axis and has area exactly 1 underneath it. A _______ _________ describes the overall pattern of a distribution. The area under the curve and above any interval of values on the horizontal axis is the proportion of all observations that fall in that interval

Mean of a density curve

The balance point at which the density curve would balance if made of solid material

Median of a density curve

The equal-areas point, the point that divides the area under the density curve in half

Transforming data

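Adding or subtracting a constant will add (subtract) to measures of center and location (mean, median, quartiles, percentiles) but does not change the shape or measures of spread (range, IQR, standard deviation). Multiplying or dividing by a constant will multiply (divide) measures of center and location by that constant and measures of spread by its absolute value, but does not change the shape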
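
The effect of adding a constant, and of multiplying by one, can be checked numerically. This is a small sketch with hypothetical data, using the standard library's `statistics` module.

```python
from statistics import mean, stdev

data = [3, 5, 8, 10, 14]                  # hypothetical observations
shifted = [x + 10 for x in data]          # add a constant
scaled = [2 * x for x in data]            # multiply by a constant

# Adding 10 moves the center by 10 but leaves the spread unchanged.
print(mean(shifted) - mean(data), stdev(shifted), stdev(data))
# Multiplying by 2 doubles both the center and the spread.
print(mean(scaled) / mean(data), stdev(scaled) / stdev(data))
```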

Normal curves

The particularly important class of density curves that describe normal distributions. The mean for a normal distribution is at the center of this symmetric curve.

Normal distribution

A distribution described by a normal density curve. Any particular ______ ______ is completely specified by two numbers: its mean μ and its standard deviation σ

The 68-95-99.7 Rule

In the normal distribution with mean μ and standard deviation σ:
-Approximately 68% of observations fall within σ of the mean μ
-Approximately 95% of observations fall within 2σ of the mean μ
-Approximately 99.7% of observations fall within 3σ of the mean μ
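
The three percentages in the rule can be reproduced from the Normal cumulative distribution. This sketch uses `statistics.NormalDist` from the Python standard library; the mean of 100 and standard deviation of 15 are hypothetical, and any other choice gives the same proportions.

```python
from statistics import NormalDist

# Hypothetical Normal distribution: mean 100, standard deviation 15.
dist = NormalDist(mu=100, sigma=15)

for k in (1, 2, 3):
    # Proportion of observations within k standard deviations of the mean.
    within = dist.cdf(100 + k * 15) - dist.cdf(100 - k * 15)
    print(f"within {k} sd: {within:.4f}")
```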

Chebyshev's inequality

The _________ ________ is a rule similar to the 68-95-99.7 rule of normal distributions. It says that in any distribution, the proportion of observations falling within k standard deviations of the mean is at least 1-1/(k^2).
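
The bound 1-1/(k^2) is easy to tabulate. The function name below is hypothetical; the sketch only evaluates the formula stated above.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of observations within k standard deviations
    of the mean, for any distribution (informative only when k > 1)."""
    return 1 - 1 / k ** 2

for k in (2, 3):
    print(k, chebyshev_lower_bound(k))
```

For k = 2 and k = 3 the guaranteed proportions are lower than the 95% and 99.7% figures for Normal distributions, because the inequality must hold for every distribution.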

Standard Normal distribution

The normal distribution with mean 0 and standard deviation 1. If a variable x has any Normal distribution N(μ, σ) with mean μ and standard deviation σ, then the standardized variable z=(x-μ)/σ has the standard Normal distribution.

Standard Normal table

A table of areas under the standard Normal curve. The table entry for each value z is the area under the curve to the left of z.
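
A table lookup can be imitated in code. The sketch below uses `statistics.NormalDist` in place of a printed table; the N(64, 2.5) distribution and the value 66.5 are hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist()            # mean 0, standard deviation 1

# Area to the left of z = 1.00, as a standard Normal table would list it.
print(round(standard_normal.cdf(1.00), 4))

# Standardizing first gives the same area for any Normal distribution.
# Hypothetical N(64, 2.5): proportion of observations below 66.5.
z = (66.5 - 64) / 2.5
print(round(standard_normal.cdf(z), 4), round(NormalDist(64, 2.5).cdf(66.5), 4))
```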

Population

In a statistical study, the entire group of individuals about which we want information

Sample

The part of the population from which we actually collect information. We use the information from a ________ to draw conclusions about the entire population.

Sample survey

Drawing conclusions about a population based on questions asked of a representative sample group

Convenience sample

Choosing individuals who are easiest to reach in a sample survey

Bias

Using a method that will consistently overestimate or underestimate the value you want to know. A statistical study shows ______ if it systematically favors certain outcomes

Voluntary response sample

A ________ ______ _______ consists of people who choose themselves by responding to a general appeal. They show bias because people with strong opinions (often in the same direction) are most likely to respond

Simple random sample

A _______ _______ ______ (SRS) of size n consists of n individuals from the population chosen in such a way that every set of n individuals has an equal chance to be the sample actually selected
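
In software, an SRS is usually drawn with a random number generator rather than a table of digits. A minimal sketch, assuming a hypothetical roster of 100 students:

```python
import random

population = [f"student_{i}" for i in range(1, 101)]   # hypothetical roster of 100

# random.sample draws without replacement, and every set of 10 students
# is equally likely to be chosen, which is what defines an SRS.
srs = random.sample(population, k=10)
print(srs)
```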

Table of random digits

A long string of the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 with these properties:
-Each entry in the table is equally likely to be any of the 10 digits 0 through 9
-The entries are independent of each other. That is, knowledge of one part of the table gives no information about any other part

Stratified random sample

To select a ______ ______ ______, classify the population into groups of similar individuals called strata. Then choose a separate SRS in each stratum and combine these SRSs to form the full sample
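
The two steps (an SRS within each stratum, then combining) might look like this. The strata, their sizes, and the sample size of 5 per stratum are all hypothetical.

```python
import random

# Hypothetical population, classified into strata by grade level.
strata = {
    "freshmen":   [f"fr_{i}" for i in range(200)],
    "sophomores": [f"so_{i}" for i in range(150)],
    "juniors":    [f"jr_{i}" for i in range(100)],
}

# Choose a separate SRS of 5 within each stratum, then combine them.
sample = []
for members in strata.values():
    sample.extend(random.sample(members, k=5))

print(len(sample), sample)
```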

Strata

In a stratified random sample, these are the groups of similar individuals that a population is broken into

Cluster sample

To take a _________ ________, first divide the population into smaller groups. Ideally, these clusters should mirror the characteristics of the population. Then choose an SRS of the clusters. All individuals in the chosen clusters are included in the sample

Cluster

In a cluster sample, these are the smaller groups that mirror the characteristics of the population

Inference

Drawing conclusions about a population on the basis of sample data is an example of this

Undercoverage

Occurs when some groups in the population are left out of the process of choosing the sample

Margin of error

Sets bounds on the size of the likely error

Sampling frame

The list of individuals from which we draw a sample. Ideally, it should list every individual in the population, but that is rarely available, so most samples suffer from some degree of undercoverage

Nonresponse

Occurs when an individual chosen for a sample can't be contacted or refuses to participate

Response bias

A systematic pattern of incorrect responses in a sample survey leads to this. For example, people telling an interviewer that they voted in an election when they did not.

Wording of questions

The most important influence on the answers given to a sample survey

Observational study

Observes individuals and measures variables of interest but does not attempt to influence the responses

Experiment

An ________ deliberately imposes some treatment on individuals to measure their responses

Lurking variables

A variable that is not among the explanatory or response variables in a study but that may influence the response variables

Confounding

Occurs when two variables are associated in such a way that their effects on a response variable cannot be distinguished from each other

Treatment

The specific condition applied to the individuals in an experiment. If an experiment has several explanatory variables, a ________ is a combination of specific values of these variables

Experimental units

The _______ ________ are the smallest collection of individuals to which treatments are applied

Subjects

When the experimental units are human beings, they are called this.

Factors

The explanatory variables are also called ________

Levels

The specific value of each factor

Random assignment

In an experiment, this term means that experimental units are assigned to treatments at random, that is, using some sort of chance process

Completely randomized design

An experiment where the treatments are assigned to all the experimental units completely by chance
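
Random assignment in a completely randomized design can be sketched by shuffling the units and splitting the list. The 20 units and the two equal groups are hypothetical.

```python
import random

units = [f"plant_{i}" for i in range(1, 21)]   # 20 hypothetical experimental units

shuffled = units[:]            # copy so the original list is left untouched
random.shuffle(shuffled)       # chance alone decides the order

treatment_group = shuffled[:10]
control_group = shuffled[10:]
print(treatment_group)
print(control_group)
```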

Control group

A _______ ______ is a group that provides a baseline for comparing the effects of the other treatments

Principles of experimental design

1. Control for lurking variables
2. Random assignment to create roughly equivalent groups
3. Replication should yield the same results as the original experiment if enough experimental units are used to distinguish effects from chance

Placebo effect

The response to a "dummy" treatment or false treatment

Double-blind

Neither the subjects nor those who interact with them and measure the response variable know which treatment a subject received

Single-blind

Either the subjects or those who interact with them and measure the response, but not both, know which treatment a subject is receiving

Statistically significant

An observed effect so large that it would rarely occur by chance

Block

A group of experimental units that are known before the experiment to be similar in some way that is expected to affect the response to the treatments (ex. light-colored laundry and dark-colored laundry)

Randomized block design

The type of experimental design where random assignment of experimental units to treatments is carried out separately within each block

Matched pairs design

A common type of randomized block design for comparing two treatments. The idea is to create blocks by matching pairs of experimental units. Then chance is used to decide which member gets which of the two treatments, or which order the treatments are given in when each unit receives both

Law of large numbers

If we observe more and more repetitions of any chance process, the proportion of times that a specific outcome occurs approaches a single value.
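
A simulation makes the idea concrete. The sketch below flips a simulated fair coin and prints the running proportion of heads; the seed is fixed only so that a run can be repeated.

```python
import random

random.seed(1)   # fixed seed only so the run is repeatable

heads = 0
for flips in range(1, 100_001):
    heads += random.random() < 0.5          # True counts as 1
    if flips in (10, 100, 1_000, 10_000, 100_000):
        print(flips, heads / flips)         # running proportion of heads

proportion = heads / 100_000
```

The early proportions can wander noticeably, while the later ones settle close to 0.5.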

Probability

The study of chance behavior. The ______ of any outcome of a chance process is a number between 0 and 1 that describes the proportion of times the outcome would occur in a very long series of repetitions

Simulation

The imitation of chance behavior based on a model that accurately reflects the situation

Sample space (S)

The set of all possible outcomes for a chance process

Probability model

A description of some chance process that consists of two parts: a sample space S and a probability for each outcome

Event

An ______ is any collection of outcomes from some chance process. That is, it is a subset of the sample space. It is usually designated by capital letters, like A, B, C, and so on.

Mutually exclusive or disjoint

Two events are _____ _____ or _____ if they have no outcomes in common and can never occur together

Complement

The event that A does not occur is called the ______ of A and is denoted by A^c. P(event does not occur) = 1 - P(event does occur)

Complement rule

P(A^c) = 1 - P(A)

Addition rule for mutually exclusive events

P(A or B) = P(A) + P(B)

Venn diagram

Helps visualize two events that overlap (not disjoint) and suggests how to fix this "double-counting" problem

General addition rule

For events that are NOT mutually exclusive, P(A or B) = P(A) + P(B) - P(A and B)
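
The rule can be verified on a small sample space. In this sketch the events A and B are hypothetical events for one roll of a fair die, and exact fractions avoid rounding.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}          # one roll of a fair die
A = {2, 4, 6}                               # event: roll is even
B = {4, 5, 6}                               # event: roll is greater than 3

def P(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

print(P(A | B))                  # P(A or B), counted directly from the union
print(P(A) + P(B) - P(A & B))    # general addition rule gives the same value
```

Subtracting P(A and B) removes the outcomes 4 and 6, which would otherwise be counted twice.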

Intersection

The intersection of two events A and B is the area where A and B overlap, "A and B."

Union

The union of two events A and B is the area covered by event A, event B, or both events. It is phrased as "A or B."

Conditional probability

The probability that one event happens given that another event is already known to have happened is called ________ ________. Suppose we know that event A has happened. Then the probability that event B happens given that event A has happened is denoted P(B|A)

Independent events

Two events A and B are _______ ______ if the occurrence of one has no effect on the chance that the other event will happen. In other words, events A and B are _______ if P(A|B) = P(A) and P(B|A) = P(B)
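
The independence condition can be checked directly on a small sample space. This sketch computes conditional probabilities as P(A and B) / P(B), which is the general multiplication rule rearranged; the events chosen for a fair die are hypothetical.

```python
from fractions import Fraction

sample_space = set(range(1, 7))   # one roll of a fair die
A = {2, 4, 6}                      # roll is even
B = {1, 2, 3, 4}                   # roll is at most 4

def P(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

def P_given(event, condition):
    """Conditional probability P(event | condition) = P(both) / P(condition)."""
    return P(event & condition) / P(condition)

print(P_given(A, B), P(A))   # compare P(A|B) with P(A)
print(P_given(B, A), P(B))   # compare P(B|A) with P(B)
```

Both comparisons come out equal for these two events, so knowing that one occurred does not change the probability of the other.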

Tree diagram

A type of diagram that can display the sample space and model chance behavior that involves a sequence of outcomes.

General multiplication rule

P(A and B) = P(A) * P(B|A)
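
A worked instance of the rule, assuming a hypothetical example of drawing two cards without replacement from a standard 52-card deck:

```python
from fractions import Fraction

# Drawing two cards without replacement from a standard 52-card deck.
p_first_ace = Fraction(4, 52)                 # P(A): first card is an ace
p_second_ace_given_first = Fraction(3, 51)    # P(B|A): 3 aces left among 51 cards

p_both_aces = p_first_ace * p_second_ace_given_first   # P(A and B) = P(A) * P(B|A)
print(p_both_aces)
```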

P(A)

The probability of event A occurring is denoted as ______

Multiplication rule for independent events

If A and B are independent events, then the probability that A and B both occur is P(A and B) = P(A) * P(B)

Random variable

Takes numerical values that describe the outcomes of some chance process

Probability distribution

Gives the possible values of a random variable and the probability of each value

Discrete random variable

A ______ ______ ______ takes a fixed set of possible values with gaps between them. Each probability must be between 0 and 1, and the sum of the probabilities must be 1.

Expected value or mean μ_X of a discrete random variable X

To find this number, multiply each possible value by its probability, then add all of the products.
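
The "multiply, then add" recipe can be written out in a few lines. The probability distribution below is hypothetical.

```python
# Hypothetical probability distribution of a discrete random variable X.
distribution = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.3}   # value -> probability

# The probabilities of a valid distribution must add to 1.
total_probability = sum(distribution.values())

# Multiply each possible value by its probability, then add the products.
mean_x = sum(value * prob for value, prob in distribution.items())
print(total_probability, mean_x)
```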

Continuous random variable

A _______ _____ ______ X takes all values in an interval of numbers. The probability distribution is described by a density curve. The probability of any event is the area under the density curve and above the values of X that make up the event.

Effect on a random variable of multiplying/dividing by a constant

-Multiplies measures of center and location
-Multiplies measures of spread by the absolute value of the factor
-Does not change shape

Effect on a random variable of adding/subtracting a constant

-Adds/subtracts from measures of center and location
-Does not change measures of spread
-Does not change shape

Linear transformations with random variables

-If Y=a+bX is a linear transformation on the random variable X, then...
-->The probability distribution of Y is the same shape as that of X
-->Mean of Y=a+b(mean of X)
-->Standard deviation of Y=|b|(standard deviation of X), since b could be a negative number

Mean of the sum of random variables

In general, the mean of the sum of several random variables is the sum of their means.

Independent random variables

If knowing whether any event involving X alone has occurred tells us nothing about the occurrence of any event involving Y alone, and vice versa, then X and Y are independent. If variables are independent, write "assuming the variables are independent...

Variance of the sum of random variables

If T=X+Y and X and Y are independent, then the variance of T=the variance of X+the variance of Y. You can add variances, but NEVER add standard deviations!!

Mean of the difference of random variables

For two random variables X and Y, if D=X-Y then the expected value of D is E(D)=mean of D=mean of X-mean of Y. The ORDER matters!!

Variance of the difference of random variables

If D=X-Y and X and Y are independent, then the variance of D=the variance of X+the variance of Y.
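
That variances add for both sums and differences of independent random variables can be confirmed by exact enumeration. In this sketch X and Y are two independent rolls of a fair die (a hypothetical example), so all 36 pairs are equally likely.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)   # X and Y: two independent rolls of a fair die

def mean(values):
    values = list(values)
    return Fraction(sum(values), len(values))

def variance(values):
    """Variance of a random variable whose listed values are equally likely."""
    values = list(values)
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

var_x = variance(faces)                                       # same as Var(Y)
var_sum = variance(x + y for x, y in product(faces, faces))   # Var(X + Y)
var_diff = variance(x - y for x, y in product(faces, faces))  # Var(X - Y)

print(var_x, var_sum, var_diff)
```

The variance of the difference equals the variance of the sum, which is why subtracting variances is never correct.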

Binomial setting

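Arises when we perform several independent trials of the same chance process and record the number of times that a particular outcome occurs. The four conditions are
-Binary? Outcomes classified as "success" or "failure"
-Independent? Knowing the result of one trial must not have any effect on the result of any other trial
-Number? The number of trials n must be fixed in advance
-Success? The probability p of success must be the same on each trial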

Binomial random variable, binomial distribution

The count X of successes in a binomial setting is a ________. The probability distribution of X is a ________ with parameters n and p, where n is the number of trials and p is the probability of success on any one trial. Values of X are whole numbers 0 to n

Binomial coefficient

The number of ways of arranging k successes among n observations is given by the ______ _______, written with n stacked above k in parentheses, which equals n!/(k!(n-k)!). It can also be phrased as "n choose k" or nCk

Binomial probability

If X has the binomial distribution with n trials and probability p of success on each trial, the possible values of X are 0, 1, 2, ..., n. If k is any one of these values,
P(X=k) = nCk * p^k * (1-p)^(n-k)
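
The binomial probability formula translates directly into code using `math.comb` for nCk. The function name and the example values (5 trials, success probability 0.25) are hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial random variable with n trials and
    success probability p on each trial."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Hypothetical example: 5 trials, success probability 0.25 on each.
print(binomial_pmf(2, 5, 0.25))
print(sum(binomial_pmf(k, 5, 0.25) for k in range(6)))   # all values 0..5 together
```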

Mean and standard deviation of a binomial distribution

Mean = n*p
Standard deviation = square root of (n*p*(1-p))
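
As a sketch, the shortcut formulas for the mean and standard deviation of a binomial distribution can be evaluated and the mean cross-checked against the definition of expected value. The values n = 20 and p = 0.3 are hypothetical.

```python
from math import comb, sqrt

n, p = 20, 0.3   # hypothetical binomial setting

mean = n * p
std_dev = sqrt(n * p * (1 - p))

# Cross-check the mean against the definition of expected value:
# sum of k * P(X = k) over all possible values k = 0, 1, ..., n.
mean_from_pmf = sum(k * comb(n, k) * p ** k * (1 - p) ** (n - k)
                    for k in range(n + 1))
print(mean, std_dev, mean_from_pmf)
```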

Sampling without replacement condition

When taking an SRS of size n from a population of size N, we can use a binomial distribution to model the count of successes in the sample AS LONG AS
n is less than or equal to (1/10)*N

Normal approximation for binomial distributions

Suppose that a count X has the binomial distribution with n trials and success probability p. When n is large, the distribution of X is approximately Normal with mean = n*p and standard deviation = square root of (n*p*(1-p)). As a rule of thumb, we will use the Normal approximation when n*p and n*(1-p) are both at least 10
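
How close the Normal approximation comes can be seen by comparing it with the exact binomial probability. This sketch assumes hypothetical values n = 100 and p = 0.4, for which n*p and n*(1-p) are both at least 10 (a common rule of thumb), and it applies no continuity correction.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 100, 0.4   # hypothetical; n*p = 40 and n*(1 - p) = 60

# Exact binomial probability P(X <= 45).
exact = sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(46))

# Normal approximation with the same mean and standard deviation.
approx = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p))).cdf(45)

print(exact, approx)
```

The two values agree to within a few hundredths; the gap shrinks further as n grows.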

Geometric setting

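A ______ ______ arises when we perform independent trials of the same chance process and record the number of trials until a particular outcome occurs. The four conditions for a ______ _____ are
-Binary? Each outcome must be a "success" or "failure"
-Independent? Knowing the result of one trial must not have any effect on the result of any other trial
-Trials? The goal is to count the number of trials until the first success occurs
-Success? The probability p of success must be the same on each trial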

Geometric random variable and geometric distribution

The number of trials Y that it takes to get a success in a geometric setting is a __________. The probability distribution of Y is a __________ with parameter p, the probability of a success on any trial. The possible values of Y are 1, 2, 3, ...

Geometric probability

If Y has the geometric distribution with probability p of success on each trial, the possible values of Y are 1, 2, 3, ... . If k is any one of these values, then
P(Y=k) = (1-p)^(k-1) * p

Mean (expected value) of a geometric random variable

If Y is a geometric random variable with probability of success p on each trial, then its mean (expected value) is E(Y)=(1/p). That is, the expected number of trials required to get the first success is 1/p.
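
The geometric probability formula and the mean 1/p can both be checked numerically. The example (rolling a fair die until a six appears) and the function name are hypothetical.

```python
def geometric_pmf(k, p):
    """P(Y = k): the first success occurs on trial k (k = 1, 2, 3, ...)."""
    return (1 - p) ** (k - 1) * p

p = 1 / 6   # hypothetical: rolling a fair die until a six appears
print(geometric_pmf(3, p))    # two failures, then a success
print(1 / p)                  # expected number of rolls until the first six

# The mean 1/p can be checked by summing k * P(Y = k) over many terms.
approx_mean = sum(k * geometric_pmf(k, p) for k in range(1, 1000))
print(approx_mean)
```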

Parameter

A number that describes some characteristic of the population. In statistical practice, the value of a parameter is usually not known because we cannot examine the entire population.

Statistic

A number that describes some characteristic of a sample. The value of a ______ can be computed directly from the sample data. It is often used to estimate an unknown parameter.

Sampling variability

A basic fact: the value of a statistic varies in repeated random sampling

Sampling distribution

The _______ ________ of a statistic is the distribution of values taken by the statistic in all possible samples of the same size from the population. It does NOT show individual data values; each point in it is the value of the statistic from one sample

Population distribution

The _______ ________ gives the values of the variable for all individuals in the population (usually in the form of a bar graph)

Distribution of sample data

A _______ of ______ _______ shows the values of the variable for the individuals in one sample, such as a single SRS (usually in the form of a bar graph)

Biased estimator

A statistic used to estimate a parameter is a(n) ________ estimator if the mean of its sampling distribution is not equal to the true value of the parameter being estimated.

Unbiased estimator

A statistic used to estimate a parameter is a(n) ________ estimator if the mean of its sampling distribution is equal to the true value of the parameter being estimated.

Bias (chapter 6)

Could also be called measure of accuracy. In sampling distributions, means that the center (mean) of the sampling distribution is not equal to the true value of the parameter.

Variability (chapter 6)

Could also be called measure of precision

Variability of a statistic

The _______ of a ________ is described by the spread of its sampling distribution. Spread is determined primarily by the size of the random sample. Larger samples give smaller spread (more precision). The spread of the sampling distribution doesn't depend on the size of the population, as long as the population is at least 10 times larger than the sample

Sampling distribution of a sample proportion

Choose an SRS of size n from a population of size N with proportion p of successes. Let p-hat be the sample proportion of successes. Then...
-The mean of the sampling distribution of p-hat is μ_(p-hat) = p
-The standard deviation of the sampling distribution of p-hat is σ_(p-hat) = square root of (p(1-p)/n), as long as n is less than or equal to (1/10)*N

Mean and standard deviation of the sampling distribution of x-bar

Suppose that x-bar is the mean of an SRS of size n drawn from a large population with mean μ and standard deviation σ. Then...
-The mean of the sampling distribution of x-bar is μ_(x-bar) = μ
-The standard deviation of the sampling distribution of x-bar is σ_(x-bar) = σ/(square root of n), as long as the 10% condition is satisfied

Sampling distribution of a sample mean from a Normal population

Suppose that a population is Normally distributed with mean μ and standard deviation σ. Then the sampling distribution of x-bar has the Normal distribution with mean μ and standard deviation σ/(square root of n), provided that the 10% condition is met

Central limit theorem

Draw an SRS of size n from any population with mean μ and finite standard deviation σ. The central limit theorem (CLT) says that when n is large, the sampling distribution of the sample mean x-bar is approximately Normal.
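
A simulation gives a feel for the theorem. This sketch draws many samples from a strongly right-skewed population (an exponential distribution with mean 1 and standard deviation 1, chosen here purely as an assumption) and summarizes the resulting sample means. The seed is fixed only so that a run can be repeated.

```python
import random
from statistics import mean, stdev

random.seed(2)   # fixed seed only so the run is repeatable

def sample_mean(n):
    """Mean of n draws from a right-skewed population
    (exponential with mean 1 and standard deviation 1)."""
    return mean([random.expovariate(1.0) for _ in range(n)])

n = 40
sample_means = [sample_mean(n) for _ in range(2000)]

print(mean(sample_means))    # should be close to the population mean, 1
print(stdev(sample_means))   # should be close to 1 / sqrt(40), about 0.158
```

A histogram of `sample_means` would look roughly symmetric and bell-shaped even though the population itself is far from Normal.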

Normal condition for sample means

-If the population is Normal, then so is the sampling distribution of x-bar. This is true no matter what the sample size n is.
-If the population is not Normal, the central limit theorem tells us that the sampling distribution of x-bar will be approximately Normal when n is large (n of at least 30 is a common rule of thumb)