Data analysis
Organizing, displaying, summarizing, and asking questions about data
Individuals
The objects described by a set of data. They may be people, animals, or things
Variables
Any characteristic of an individual. It can take different values for different individuals
Categorical variable
A variable that places an individual into one of several groups or categories
Quantitative variable
A variable that takes numerical values for which it makes sense to find an average.
Ex. it would be useful to have the average of individuals' GPAs
Ex. it would not be useful to have the average of individuals' zip codes or genders
Distribution
The _______ of a variable tells us what values the variable takes and how often it takes these values
Inference
Drawing conclusions that go beyond the data at hand
Frequency table
A table that displays the count of individuals in each category
Relative frequency table
A table that shows the percent of individuals in each category
Pie chart
A circular chart whose slices show the percent of the whole that each category makes up
Bar graph
A chart whose bar heights compare the counts or percents of individuals in each category
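A minimal sketch of how a frequency table and a relative frequency table could be built from raw categorical data. The list `colors` is made-up illustration data, not part of the original deck.

```python
from collections import Counter

# Hypothetical categorical data: favorite color of 8 individuals.
colors = ["red", "blue", "blue", "green", "red", "blue", "green", "blue"]

# Frequency table: count of individuals in each category.
counts = Counter(colors)

# Relative frequency table: percent of individuals in each category.
total = sum(counts.values())
percents = {category: 100 * n / total for category, n in counts.items()}

for category in counts:
    print(f"{category:6s} {counts[category]:3d} {percents[category]:6.1f}%")
```

The percents here add to exactly 100, but with other data they may add to 99.9 or 100.1 after rounding, which is the roundoff error described a few cards later.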
Roundoff error
The result that occurs when, with percentages, the total should add to 100% but only comes close to it, around 99.9% or some other value near it
Marginal distribution
The ______ ______ of one of the categorical variables in a two-way table of counts is the distribution of values of that variable among all individuals described by the table. (This tells us nothing about the relationship between two variables, though)
Conditional distribution
A ______ ______ of a variable describes the values of that variable among individuals who have a specific value of another variable. There is a separate ______ ______ for each value of the other variable.
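The difference between a marginal and a conditional distribution can be sketched with a small two-way table. The table below (class year by opinion) is invented for illustration only.

```python
# Hypothetical two-way table of counts: rows = class year, columns = opinion.
table = {
    "juniors": {"yes": 30, "no": 20},
    "seniors": {"yes": 10, "no": 40},
}
columns = ["yes", "no"]
grand_total = sum(sum(row.values()) for row in table.values())

# Marginal distribution of opinion: column totals divided by the grand total.
marginal_opinion = {
    c: sum(table[r][c] for r in table) / grand_total for c in columns
}

# Conditional distribution of opinion for each class year:
# each row's counts divided by that row's total.
conditional_opinion = {
    r: {c: table[r][c] / sum(table[r].values()) for c in columns}
    for r in table
}

print(marginal_opinion)
print(conditional_opinion)
```

The marginal distribution ignores class year entirely, while the two conditional distributions differ from each other, which suggests an association between the variables in this made-up table.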
Dotplot
A graph for quantitative data where each data value is shown as a dot above its location on a number line.
Shape, outliers, center, spread
Four terms needed to describe a dotplot (SOCS)
Stemplot
Simple graphical display of a distribution that includes the actual numerical values in the graph
Two-way table
Describes two categorical variables, organizing counts according to a row variable and a column variable
Side-by-side bar graph
A type of graph with two or more different variables being displayed and compared next to each other
Association
We say that there is ________ between two variables if specific values of one variable tend to occur in common with specific values of the other
Histogram
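The most common graph of the distribution of one quantitative variable. The values are grouped into classes of equal width, and the height of each bar shows the count or percent of observations in that class.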
Outlier
An individual value that falls outside the overall pattern
Symmetric
A distribution is roughly _______ if the left and right sides of the graph are approximately mirror images of each other
Spread
The ______ of a distribution tells us how much variability there is in the data. One way to describe it is to give the largest and smallest numbers. Another way is to compute the range of the data
Variability
How much a set of data varies, described by the spread
Median
The center of a group of data, where half of the values lie above this point and half below
First quartile (Q1)
The median of the lower half of data (from the minimum to the median).
Third quartile (Q3)
The median of the upper half of data (from the median to the maximum)
Interquartile range (IQR)
Q3-Q1=_____, the range of the middle 50% of the data
1.5 x IQR
Rule for outliers: if a piece of data falls more than ________ above Q3 or below Q1, it is classified as an outlier
Five-number summary
The ____ ______ ______ of a distribution consists of the minimum, Q1, the median M, Q3, and the maximum, written in order from smallest to largest
Boxplot
Box with edges from Q1 to Q3, line within box at median, whiskers extending from the box to the minimum and maximum.
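The quartile, IQR, and outlier cards above can be tied together in a short sketch. It assumes the convention that the median itself is left out of both halves when the number of observations is odd; other conventions exist and give slightly different quartiles. The data values are made up.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using medians of the two halves."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]           # observations below the median's position
    upper = xs[half + n % 2:]   # observations above it (median excluded if n is odd)
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

def outliers(data):
    """Apply the 1.5 x IQR rule and return the flagged observations."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]

data = [5, 7, 8, 10, 12, 13, 14, 15, 40]  # hypothetical observations
print(five_number_summary(data))
print(outliers(data))
```

Here Q1 = 7.5 and Q3 = 14.5, so the IQR is 7 and any value above 14.5 + 10.5 = 25 is flagged, which catches 40.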
Variance
The "average" squared deviation of the observations from their mean: the sum of the squared quantities (x_i - x̄)^2, divided by n - 1 for a sample
Standard deviation
The square root of the "average" squared deviation
Mean
The ____, also called x̄, for a set of observations is the sum of the observations divided by the number of observations. Also seen in the formula x̄ = (Σ x_i)/n
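As a worked example of the mean, variance, and standard deviation cards, the sketch below computes each step by hand. It assumes the sample convention of dividing by n - 1, and the data values are invented.

```python
from math import sqrt

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical observations
n = len(data)

# Mean: add up the observations and divide by how many there are.
mean = sum(data) / n

# Squared deviations of each observation from the mean.
squared_deviations = [(x - mean) ** 2 for x in data]

# Sample variance: the "average" squared deviation, dividing by n - 1.
variance = sum(squared_deviations) / (n - 1)

# Standard deviation: the square root of the variance.
standard_deviation = sqrt(variance)

print(mean, variance, standard_deviation)
```

The squared deviations sum to 32, so the variance is 32/7 and the standard deviation is its square root.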
Sigma (Σ)
Short for "add them all up." Greek symbol
Resistant
A statistic is a ________ measure of center or spread if it is not easily affected by extreme observations. For example, the median and IQR are _______ measures, while the mean and standard deviation are not
Percentile
The pth _______ of a distribution is the value with p percent of the observations less than that
Cumulative relative frequency graph
Graph with points corresponding to the cumulative relative frequency in each class at the smallest value of the next class
Standardized value or z-score
If x is an observation from a distribution that has known mean and standard deviation, the standardized value of x is z=(x-mean)/standard deviation
z=(x-mean)/standard deviation
z-score formula
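The z-score formula translates directly into code. The test score, mean, and standard deviation below are illustrative values only.

```python
def z_score(x, mean, standard_deviation):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / standard_deviation

# Hypothetical example: a score of 86 on a test with mean 80 and standard deviation 6.
print(z_score(86, 80, 6))   # one standard deviation above the mean
print(z_score(74, 80, 6))   # one standard deviation below the mean
```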
Density curve
A curve that is always on or above the horizontal axis and has area exactly 1 underneath it. A _______ _________ describes the overall pattern of a distribution. The area under the curve and above any interval of values on the horizontal axis is the proportion of all observations that fall in that interval.
Mean of a density curve
The balance point at which the density curve would balance if made of solid material
Median of a density curve
the equal-areas point, the point that divides the area under the density curve in half
Transforming data
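Adding or subtracting a constant will add (subtract) to measures of center and location (mean, median, quartiles, percentiles) but does not change the shape or measures of spread (range, IQR, standard deviation). Multiplying or dividing by a constant will multiply (divide) measures of center and location by that constant and measures of spread by its absolute value, but does not change the shape.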
Normal curves
The particularly important class of density curves that describe normal distributions. The mean for a normal distribution is at the center of this symmetric curve.
Normal distribution
A distribution described by a normal density curve. Any particular ______ ______ is completely specified by two numbers: its mean μ and its standard deviation σ
The 68-95-99.7 Rule
In the normal distribution with mean μ and standard deviation σ:
-Approximately 68% of observations fall within σ of the mean μ
-Approximately 95% of observations fall within 2σ of the mean μ
-Approximately 99.7% of observations fall within 3σ of the mean μ
Chebyshev's inequality
The _________ ________ is a rule similar to the 68-95-99.7 rule of normal distributions. It says that in any distribution, the proportion of observations falling within k standard deviations of the mean is at least 1-1/(k^2).
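The 68-95-99.7 rule and Chebyshev's inequality can be compared numerically. This sketch uses `statistics.NormalDist` from the Python standard library for the Normal areas; the dictionary names are just illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist()   # mean 0, standard deviation 1

normal_share = {}      # exact proportion within k standard deviations (Normal only)
chebyshev_bound = {}   # guaranteed minimum proportion for ANY distribution

for k in (1, 2, 3):
    normal_share[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    chebyshev_bound[k] = 1 - 1 / k**2
    print(f"k={k}: Normal {normal_share[k]:.4f}, Chebyshev at least {chebyshev_bound[k]:.4f}")
```

Chebyshev's bound is much weaker (0, 0.75, and about 0.89) because it has to hold for every distribution, not only Normal ones.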
Standard Normal distribution
The normal distribution with mean 0 and standard deviation 1. If a variable x has any Normal distribution N(μ, σ) with mean μ and standard deviation σ, then the standardized variable z=(x-μ)/σ has the standard Normal distribution.
Standard Normal table
A table of areas under the standard Normal curve. The table entry for each value z is the area under the curve to the left of z.
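A table lookup can be imitated in code. The sketch below standardizes a value and then finds the area to its left, using `statistics.NormalDist` in place of a printed table. The distribution N(64, 3) and the value 70 are assumptions chosen for illustration.

```python
from statistics import NormalDist

mu, sigma = 64, 3      # hypothetical Normal distribution N(64, 3)
x = 70

# Step 1: standardize.
z = (x - mu) / sigma

# Step 2: "look up" the area to the left of z under the standard Normal curve.
area_left_of_z = NormalDist().cdf(z)

# The same area computed directly on the original scale, as a check.
area_left_of_x = NormalDist(mu, sigma).cdf(x)

print(z, round(area_left_of_z, 4), round(area_left_of_x, 4))
```

Both routes give the same area, which is why a single standard Normal table is enough for every Normal distribution.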
Population
In a statistical study, the entire group of individuals about which we want information
Sample
The part of the population from which we actually collect information. We use the information from a ________ to draw conclusions about the entire population.
Sample survey
Drawing conclusions about a population based on questions asked of a representative sample group
Convenience sample
Choosing individuals who are easiest to reach in a sample survey
Bias
Using a method that will consistently overestimate or underestimate the value you want to know. A statistical study shows ______ if it systematically favors certain outcomes
Voluntary response sample
A ________ ______ _______ consists of people who choose themselves by responding to a general appeal. They show bias because people with strong opinions (often in the same direction) are most likely to respond
Simple random sample
A _______ _______ ______ (SRS) of size n consists of n individuals from the population chosen in such a way that every set of n individuals has an equal chance to be the sample actually selected
Table of random digits
A long string of the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 with these properties:
-Each entry in the table is equally likely to be any of the 10 digits 0 through 9
-The entries are independent of each other. That is, knowledge of one part of the table gives no information about any other part
Stratified random sample
To select a ______ ______ ______, classify the population into groups of similar individuals called strata. Then choose a separate SRS in each stratum and combine these SRSs to form the full sample
Strata
In a stratified random sample, these are the groups of similar individuals that a population is broken into
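A simple random sample and a stratified random sample can both be sketched with the standard `random` module. The population of 100 labeled students and the two strata are hypothetical, and the seed is fixed only so the run is repeatable.

```python
import random

rng = random.Random(1)   # fixed seed for a repeatable illustration

# Hypothetical population: 60 juniors followed by 40 seniors.
population = [f"student{i}" for i in range(1, 101)]
strata = {"juniors": population[:60], "seniors": population[60:]}

# Simple random sample: every set of 10 individuals is equally likely.
srs = rng.sample(population, 10)

# Stratified random sample: a separate SRS of 5 from each stratum, then combined.
stratified = [person for group in strata.values() for person in rng.sample(group, 5)]

print(srs)
print(stratified)
```

The SRS may contain any mix of juniors and seniors, while the stratified sample is guaranteed to contain exactly 5 of each.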
Cluster sample
To take a _________ ________, first divide the population into smaller groups. Ideally, these clusters should mirror the characteristics of the population. Then choose an SRS of the clusters. All individuals in the chosen clusters are included in the sample.
Cluster
In a cluster sample, these are the smaller groups that mirror the characteristics of the population
Inference
Drawing conclusions about a population on the basis of sample data is an example of this
Undercoverage
Occurs when some groups in the population are left out of the process of choosing the sample
Margin of error
Sets bounds on the size of the likely error
Sampling frame
The list of individuals from which we draw a sample. Ideally, it should list every individual in the population, but that is rarely available, so most samples suffer from some degree of undercoverage
Nonresponse
Occurs when an individual chosen for a sample can't be contacted or refuses to participate
Response bias
A systematic pattern of incorrect responses in a sample survey leads to this. For example, people telling an interviewer that they voted in an election when they did not.
Wording of questions
The most important influence on the answers given to a sample survey
Observational study
Observes individuals and measures variables of interest but does not attempt to influence the responses
Experiment
An ________ deliberately imposes some treatment on individuals to measure their responses
Lurking variables
A variable that is not among the explanatory or response variables in a study but that may influence the response variables
Confounding
Occurs when two variables are associated in such a way that their effects on a response variable cannot be distinguished from each other
Treatment
The specific condition applied to the individuals in an experiment. If an experiment has several explanatory variables, a ________ is a combination of specific values of these variables
Experimental units
The _______ ________ are the smallest collection of individuals to which treatments are applied
Subjects
When the experimental units are human beings, they are called this.
Factors
The explanatory variables are also called ________
Levels
The specific value of each factor
Random assignment
In an experiment, this term means that experimental units are assigned to treatments at random, that is, using some sort of chance process
Completely randomized design
An experiment where the treatments are assigned to all the experimental units completely by chance
Control group
A _______ ______ is a group that provides a baseline for comparing the effects of the other treatments
Principles of experimental design
1. Control for lurking variables
2. Random assignment to create roughly equivalent groups
3. Replication: use enough experimental units in each group so that differences in the effects of the treatments can be distinguished from chance differences between the groups
Placebo effect
The response to a "dummy" treatment or false treatment
Double-blind
Neither the subjects nor those who interact with them and measure the response variable know which treatment a subject received
Single-blind
The individuals who are interacting with the subjects and measuring the response do not know which treatment the group is receiving though the subjects know, or vice versa
Statistically significant
An observed effect so large that it would rarely occur by chance
Block
A group of experimental units that are known before the experiment to be similar in some way that is expected to affect the response to the treatments (ex. light-colored laundry and dark-colored laundry)
Randomized block design
The type of experimental design where the random assignment of experimental units to treatments is carried out separately within each block
Matched pairs design
A common type of randomized block design for comparing two treatments. The idea is to create blocks by matching pairs of experimental units. Then chance is used to decide which member gets which of the two treatments, or which order the treatments are given.
Law of large numbers
If we observe more and more repetitions of any chance process, the proportion of times that a specific outcome occurs approaches a single value.
Probability
The study of chance behavior. The ______ of any outcome of a chance process is a number between 0 and 1 that describes the proportion of times the outcome would occur in a very long series of repetitions
Simulation
The imitation of chance behavior based on a model that accurately reflects the situation
Sample space (S)
The set of all possible outcomes for a chance process
Probability model
A description of some chance process that consists of two parts: a sample space S and a probability for each outcome
Event
An ______ is any collection of outcomes from some chance process. That is, it is a subset of the sample space. It is usually designated by capital letters, like A, B, C, and so on.
Mutually exclusive or disjoint
Two events are _____ _____ or _____ if they have no outcomes in common and can never occur together
Complement
The event that A does not occur is called the ______ of A and is denoted by A^c. Its probability is P(event does not occur) = 1 - P(event does occur)
Complement rule
P(A^c) = 1 - P(A)
Addition rule for mutually exclusive events
P(A or B) = P(A) + P(B)
Venn diagram
Helps visualize two events that overlap (not disjoint) and suggests how to fix this "double-counting" problem
General addition rule
For events that are NOT mutually exclusive, P(A or B) = P(A) + P(B) - P(A and B)
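The complement rule and the general addition rule can be checked on a small sample space. The sketch below assumes one roll of a fair six-sided die, with events A and B chosen only for illustration.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}   # one roll of a fair die (assumed equally likely)
A = {2, 4, 6}                       # event A: the roll is even
B = {4, 5, 6}                       # event B: the roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

p_a_or_b = prob(A | B)                       # union counted directly
by_rule = prob(A) + prob(B) - prob(A & B)    # general addition rule
p_not_a = prob(sample_space - A)             # complement counted directly

print(p_a_or_b, by_rule, p_not_a)
```

Adding P(A) and P(B) alone would give 1, because the outcomes 4 and 6 are counted twice. Subtracting P(A and B) fixes the double-counting.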
Intersection
The intersection of two events A and B is the area where A and B overlap. It is phrased as "A and B"
Union
The union of two events A and B is the area covered by event A, event B, or both events. It is phrased as "A or B"
Conditional probability
The probability that one event happens given that another event is already known to have happened is called ________ ________. Suppose we know that event A has happened. Then the probability that event B happens given that event A has happened is denoted P(B|A)
Independent events
Two events A and B are _______ ______ if the occurrence of one has no effect on the chance that the other event will happen. In other words, events A and B are _______ if P(A|B) = P(A) and P(B|A) = P(B)
Tree diagram
A type of diagram that can display the sample space and model chance behavior that involves a sequence of outcomes.
General multiplication rule
P (A and B) = P(A) * P(B|A)
P(A)
The probability of event A occurring is denoted as ______
Multiplication rule for independent events
If A and B are independent events, then the probability that A and B both occur is P (A and B) = P(A) * P(B)
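Conditional probability and independence can be checked by rearranging the general multiplication rule into P(B|A) = P(A and B) / P(A). The sketch assumes one roll of a fair die, and the events are illustrative choices.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}   # one roll of a fair die (assumed equally likely)

def prob(event):
    return Fraction(len(event & sample_space), len(sample_space))

def conditional(b, a):
    """P(B | A), from the general multiplication rule rearranged."""
    return prob(a & b) / prob(a)

A = {2, 4, 6}       # even roll
B = {4, 5, 6}       # roll greater than 3
C = {1, 2, 3, 4}    # roll of 4 or less

# A and B are NOT independent: knowing A changes the probability of B.
print(conditional(B, A), prob(B))

# A and C ARE independent: P(C | A) equals P(C), so P(A and C) = P(A) * P(C).
print(conditional(C, A), prob(C), prob(A & C) == prob(A) * prob(C))
```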
Random variable
Takes numerical values that describe the outcomes of some chance process
Probability distribution
Gives the possible values of a random variable and the probability of each value
Discrete random variable
A ______ ______ ______ X takes a fixed set of possible values with gaps between them. Each probability must be between 0 and 1, and the sum of the probabilities must be 1.
Expected value or mean μ_X of a discrete random variable X
To find this number, multiply each possible value by its probability, then add all of the products.
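A short sketch of the expected value calculation, using the number of heads in two tosses of a fair coin as an illustrative random variable (this example is not from the original deck).

```python
from fractions import Fraction

# Hypothetical probability distribution: X = number of heads in two fair coin tosses.
distribution = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

# A legitimate discrete distribution: each probability in [0, 1], total equal to 1.
is_valid = all(0 <= p <= 1 for p in distribution.values()) and sum(distribution.values()) == 1

# Expected value: multiply each value by its probability, then add the products.
expected_value = sum(x * p for x, p in distribution.items())

print(is_valid, expected_value)
```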
Continuous random variable
A _______ _____ ______ X takes all values in an interval of numbers. The probability distribution is described by a density curve. The probability of any event is the area under the density curve and above the values of X that make up the event.
Effect on a random variable of multiplying/dividing by a constant
-Multiplies measures of center and location
-Multiplies measures of spread by the absolute value of the factor
-Does not change shape
Effect on a random variable of adding/subtracting a constant
-Adds/subtracts from measures of center and location
-Does not change measures of spread
-Does not change shape
Linear transformations with random variables
-If Y=a+bX is a linear transformation on the random variable X, then...
-->The probability distribution of Y is the same shape as that of X
-->Mean of Y=a+b(mean of X)
-->Standard deviation of Y=|b|(standard deviation of X) since b could be a negative number
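The rules for a linear transformation Y = a + bX can be checked on a small data set. The values of a, b, and the data are made up, and b is deliberately negative to show why the absolute value matters.

```python
from statistics import mean, stdev

x_values = [10, 20, 30]   # hypothetical values of X
a, b = 100, -2            # hypothetical transformation Y = 100 - 2X

y_values = [a + b * x for x in x_values]

# Mean of Y = a + b * (mean of X)
print(mean(y_values), a + b * mean(x_values))

# Standard deviation of Y = |b| * (standard deviation of X); it can never be negative.
print(stdev(y_values), abs(b) * stdev(x_values))
```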
Mean of the sum of random variables
In general, the mean of the sum of several random variables is the sum of their means.
Independent random variables
If knowing whether any event involving X alone has occurred tells us nothing about the occurrence of any event involving Y alone, and vice versa, then X and Y are independent. If variables are independent, write "assuming the variables are independent...
Variance of the sum of random variables
If T=X+Y and X and Y are independent, then the variance of T=the variance of X+the variance of Y. You can add variances, but NEVER add standard deviations!!
Mean of the difference of random variables
For two random variables X and Y, if D=X-Y then the expected value of D is E(D)=mean of D=mean of X-mean of Y. The ORDER matters!!
Variance of the difference of random variables
If D=X-Y and X and Y are independent, then the variance of D=the variance of X+the variance of Y.
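A worked example of combining independent random variables. The means and standard deviations are illustrative numbers, and independence of X and Y is assumed.

```python
from math import sqrt

# Hypothetical independent random variables X and Y.
mean_x, sd_x = 10.0, 3.0
mean_y, sd_y = 4.0, 4.0

# Means add for a sum and subtract for a difference (order matters).
mean_sum = mean_x + mean_y
mean_diff = mean_x - mean_y

# Variances ADD for both the sum and the difference.
var_sum = sd_x**2 + sd_y**2
var_diff = sd_x**2 + sd_y**2

# Standard deviations come from the variances, never from adding sd_x and sd_y.
sd_sum = sqrt(var_sum)
sd_diff = sqrt(var_diff)

print(mean_sum, mean_diff, sd_sum, sd_diff)
```

The standard deviation of both the sum and the difference is 5, not 3 + 4 = 7, which is why standard deviations are never added directly.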
Binomial setting
Arises when we perform several independent trials of the same chance process and record the number of times that a particular outcome occurs. The four conditions are
-Binary? Outcomes classified as "success" or "failure"
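-Independent? Knowing the result of one trial must not tell us anything about the result of any other trial
-Number? The number of trials n must be fixed in advance
-Success? The probability p of success must be the same on each trial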
Binomial random variable, binomial distribution
The count X of successes in a binomial setting is a ________. The probability distribution of X is a ________ with parameters n and p, where n is the number of trials and p is the probability of success on any one trial. Values of X are whole numbers 0 to n.
Binomial coefficient
The number of ways of arranging k successes among n observations is given by the ______ _______, written as n over k in parentheses, which equals n!/(k!(n-k)!). It can also be phrased as "n choose k" or nCk
Binomial probability
If X has the binomial distribution with n trials and probability p of success on each trial, the possible values of X are 0, 1, 2, ..., n. If k is any one of these values,
P(X=k) = nCk * p^k * (1-p)^(n-k)
Mean and standard deviation of a binomial distribution
Mean=n*p
Standard deviation=square root of (n*p*(1-p))
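The binomial coefficient, probability, mean, and standard deviation cards can be combined in one sketch. It uses `math.comb` for "n choose k". The values n = 10 and p = 0.5 are illustrative.

```python
from math import comb, sqrt

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial random variable with n trials and success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.5   # hypothetical setting: 10 tosses of a fair coin

prob_3 = binomial_pmf(3, n, p)                               # P(exactly 3 successes)
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))     # should be 1

binomial_mean = n * p
binomial_sd = sqrt(n * p * (1 - p))

print(prob_3, total, binomial_mean, binomial_sd)
```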
Sampling without replacement condition
When taking an SRS of size n from a population of size N, we can use a binomial distribution to model the count of successes in the sample AS LONG AS n is less than or equal to (1/10)*N
Normal approximation for binomial distributions
Suppose that a count X has the binomial distribution with n trials and success probability p. When n is large, the distribution of X is approximately Normal with mean=n*p and standard deviation=square root of (n*p*(1-p)). As a rule of thumb, we will use the Normal approximation when n*p and n*(1-p) are both at least 10.
Geometric setting
A ______ ______ arises when we perform independent trials of the same chance process and record the number of trials until a particular outcome occurs. The four conditions for a ______ _____ are
-Binary? Each outcome must be a "success" or "failure"
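-Independent? Knowing the result of one trial must not tell us anything about the result of any other trial
-Trials? The goal is to count the number of trials until the first success occurs
-Success? The probability p of success must be the same on each trial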
Geometric random variable and geometric distribution
The number of trials Y that it takes to get a success in a geometric setting is a __________. The probability distribution of Y is a __________ with parameter p, the probability of a success on any trial. The possible values of Y are 1, 2, 3, ...
Geometric probability
If Y has the geometric distribution with probability p of success on each trial, the possible values of Y are 1, 2, 3, ... . If k is any one of these values, then
P(Y=k)=(1-p)^(k-1)*p
Mean (expected value) of a geometric random variable
If Y is a geometric random variable with probability of success p on each trial, then its mean (expected value) is E(Y)=(1/p). That is, the expected number of trials required to get the first success is 1/p.
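The geometric probability formula and the mean 1/p can be checked numerically. The success probability p = 0.2 is illustrative, and the infinite sum for the mean is cut off at 1000 terms as an approximation.

```python
def geometric_pmf(k, p):
    """P(Y = k): k - 1 failures followed by the first success on trial k."""
    return (1 - p) ** (k - 1) * p

p = 0.2   # hypothetical probability of success on each trial

prob_first_success_on_3 = geometric_pmf(3, p)   # 0.8 * 0.8 * 0.2

# Expected value from the formula, and from the definition (truncated sum).
mean_by_formula = 1 / p
mean_by_sum = sum(k * geometric_pmf(k, p) for k in range(1, 1001))

print(prob_first_success_on_3, mean_by_formula, mean_by_sum)
```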
Parameter
A number that describes some characteristic of the population. In stats, the value of a parameter is usually not known because we cannot examine the entire population.
Statistic
A number that describes some characteristic of a sample. The value of a ______ can be computed directly from the sample data. It is often used to estimate an unknown parameter.
Sampling variability
A basic fact: the value of a statistic varies in repeated random sampling
Sampling distribution
The _______ ________ of a statistic is the distribution of values taken by the statistic in all possible samples of the same size from the population. It does NOT show individual values; when graphed as a dotplot, each dot represents the value of the statistic from one sample
Population distribution
The _______ ________ gives the values of the variable for all individuals in the population (usually in the form of a bar graph)
Distribution of sample data
A _______ of ______ _______ shows the values of the variable for the individuals in a single sample, such as one SRS (usually in the form of a bar graph)
Biased estimator
A statistic used to estimate a parameter is a(n) ________ estimator if the mean of its sampling distribution is not equal to the true value of the parameter being estimated.
Unbiased estimator
A statistic used to estimate a parameter is a(n) ________ estimator if the mean of its sampling distribution is equal to the true value of the parameter being estimated.
Bias (chapter 6)
Could also be called measure of accuracy. In sampling distributions, means that the center (mean) of the sampling distribution is not equal to the true value of the parameter.
Variability (chapter 6)
Could also be called measure of precision
Variability of a statistic
The _______ of a ________ is described by the spread of its sampling distribution. Spread is determined primarily by the size of the random sample. Larger samples give smaller spread (more precision). The spread of the sampling distribution doesn't depend on the size of the population, as long as the population is at least 10 times larger than the sample.
Sampling distribution of a sample proportion
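Choose an SRS of size n from a population of size N with proportion p of successes. Let p-hat be the sample proportion of successes. Then...
-The mean of the sampling distribution of p-hat is mu p-hat=p
-The standard deviation of the sampling distribution of p-hat is sigma p-hat=square root of (p(1-p)/n), as long as the 10% condition is satisfied: n is less than or equal to (1/10)*N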
Mean and standard deviation of the sampling distribution of x-bar
Suppose that x-bar is the mean of an SRS of size n drawn from a large population with mean mu and standard deviation sigma. Then...
-The mean of the sampling distribution of x-bar is mu x-bar=mu
-The standard deviation of the sampling distribution of x-bar is sigma x-bar=sigma/square root of n, as long as the 10% condition is satisfied: n is less than or equal to (1/10)*N
Sampling distribution of a sample mean from a Normal population
Suppose that a population is Normally distributed with mean mu and standard deviation sigma. Then the sampling distribution of x-bar has the Normal distribution with mean mu and standard deviation (sigma/square root of n), provided that the 10% condition is met.
Central limit theorem
Draw an SRS of size n from any population with mean mu and finite standard deviation sigma. The central limit theorem (CLT) says that when n is large, the sampling distribution of the sample mean x-bar is approximately Normal.
Normal condition for sample means
-If the population is Normal, then so is the sampling distribution of x-bar. This is true no matter what the sample size n is.
-If the population is not Normal, the central limit theorem tells us that the sampling distribution of x-bar will be approximately Normal when n is large.
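The central limit theorem can be illustrated with a small simulation. This is only a sketch: the population is assumed to be exponential with mean 1 and standard deviation 1 (strongly right-skewed), and the sample size, number of samples, and seed are arbitrary choices.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 40                   # size of each sample (assumed "large")
num_samples = 2000       # number of simulated samples

# Draw many samples from the skewed population and record each sample mean.
sample_means = [
    mean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

# Center should be near the population mean (1); spread near sigma / sqrt(n).
print("mean of sample means:", round(mean(sample_means), 3))
print("sd of sample means:  ", round(stdev(sample_means), 3))
print("sigma / sqrt(n):     ", round(1 / sqrt(n), 3))
```

A histogram of `sample_means` would look roughly symmetric and bell-shaped even though the population itself is skewed, which is the point of the theorem.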