Data
recorded values whether numbers or labels, together with their context
Data table
an arrangement of data in which each row represents a case and each column represents a variable
Context
ideally tells who was measured, what was measured, how the data were collected, where the data were collected, and when and why the study was performed
Case
an individual about whom or which we have data
Respondent
someone who answers, or responds to, a survey
Subject/Participant
a human experimental unit
Experimental unit
an individual in a study for which or for whom data values are recorded
Record
information about an individual
Sample
a subset of a population, examined in hope of learning about the population
Population
the entire group of individuals or instances about whom we hope to learn
Variable
holds information about the same characteristic for many cases
Categorical (or qualitative) variable
a variable that names categories with words or numerals
Nominal variable
the term applied to a variable whose values are used only to name categories
Quantitative variable
a variable in which the numbers are values of measured quantities with units
Units
a quantity or amount adopted as a standard of measurement, such as dollars, hours, or grams
Identifier variable
a categorical variable that records a unique value for each case, used to name or identify it
Ordinal variable
the term applied to a variable whose categorical values possess some kind of order
Frequency table
lists the categories in a categorical variable and gives the count of observations for each category
Relative frequency table
lists the categories in a categorical variable and gives the percentage of observations for each category
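A frequency table and a relative frequency table can be built in a few lines of Python; the category values here are made up purely for illustration:

```python
from collections import Counter

# hypothetical survey responses (illustrative values only)
data = ["dog", "cat", "dog", "fish", "dog", "cat"]

counts = Counter(data)                                # frequency table: category -> count
n = len(data)
rel_freq = {k: v / n for k, v in counts.items()}      # relative frequency table: category -> proportion

print(counts["dog"])      # 3
print(rel_freq["dog"])    # 0.5
```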
Distribution
gives the possible values of the variable and the relative frequency of each value; in a quantitative variable, it slices up all of the possible values of the variable into equal-width bins and gives the number of values (or counts) falling into each bin
Area principle
In a statistical display, each data value should be represented by the same amount of area.
Bar chart
shows a bar whose area represents the count of observations for each category
Relative frequency bar chart
shows a bar whose area represents the percentage of observations for each category
Pie chart
shows how a "whole" divides into categories by showing a wedge of a circle whose area corresponds to the proportion in each category
Categorical data condition
The displays in this chapter are appropriate for categorical data; be careful not to use them with quantitative data.
Contingency table
displays counts and, sometimes, percentages of individuals falling into named categories on two or more variables; the table categorizes the individuals on all variables at once to reveal possible patterns in one variable that may be contingent on the category of the other
Marginal distribution
the distribution of either variable alone in a contingency table; the counts or percentages are the totals found in the margins (last row or column) of the table
Conditional distribution
the distribution of a variable restricting the "who" to consider only a smaller group of individuals
Independence
two variables are independent if the conditional distribution of one variable is the same for each category of the other
Segmented bar chart
displays the conditional distribution of a categorical variable within each category of another variable
Simpson's Paradox
When averages are taken across different groups, they can appear to contradict the overall averages.
Histogram
uses adjacent bars to show the distribution of a quantitative variable; each bar represents the frequency of values falling into each bin
Relative frequency histogram
uses adjacent bars to show the distribution of a quantitative variable; each bar represents the relative frequency of values falling into each bin
Gap
a region of the distribution where there are no values
Stem-and-leaf display
a display that shows the quantitative data values in a way that sketches the distribution of the data; it's best described in detail by example
Dotplot
graphs a dot for each case against a single axis
Shape
a description of a distribution in which we look for: single vs. multiple modes, symmetry vs. skewness, outliers and gaps
Mode
a hump or local high point in the shape of the distribution of a variable; the apparent location can change as the scale of a histogram is changed
Unimodal
a distribution having one mode
Bimodal
a distribution having two modes
Multimodal
a distribution having more than two modes
Uniform
a distribution that doesn't appear to have any mode and in which all the bars of its histogram are approximately the same height
Symmetric
a distribution in which the two halves on either side of the center look approximately like mirror images of one another
Tails
the parts of a distribution that typically trail off on either side
Skewed
a distribution that is not symmetric and one tail stretches out farther than the other
Skewed left
in which the distribution's longer tail stretches to the left
Skewed right
in which the distribution's longer tail stretches to the right
Outliers
extreme values that don't appear to belong with the rest of the data
Center
the place in the distribution of a variable that you'd point to if you wanted to attempt the impossible by summarizing the entire distribution with a single number; measures include the mean and median
Median
the middle value, with half of the data above and half below it; if "n" is even, it is the average of the two middle values; it is usually paired with the IQR
Spread
a numerical summary of how tightly the values are clustered around the center; include the IQR and the standard deviation
Range
the difference between the lowest and highest values in a data set: Range=max-min
Quartile
the lower quartile (Q1) is the value with a quarter of the data below it; the upper quartile (Q3) has three quarters of the data below it; the median and quartiles divide the data into four parts with equal numbers of data values
Percentile
the "i"th percentile is the number that falls above "i"% of the data
Interquartile range (IQR)
the difference between the first and third quartiles
IQR = Q3 − Q1; it is usually reported along with the median
5-number summary
reports the minimum value, Q1, the median, Q3, and the maximum value
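These summaries can be sketched in Python using made-up data. Note that quartile conventions differ slightly among textbooks and software; `method="inclusive"` is one common choice, so Q1 and Q3 may vary a little by method:

```python
import statistics

# hypothetical data set (illustrative values only)
data = [2, 4, 4, 5, 7, 9, 11, 12, 15]

# quartiles divide the sorted data into four equal-count parts
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                                        # IQR = Q3 - Q1
five_number = (min(data), q1, median, q3, max(data)) # min, Q1, median, Q3, max
```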
Boxplot
displays the 5-number summary as a central box with whiskers that extend to the non-outlying data values; particularly effective for comparing groups and for displaying possible outliers
Mean
found by summing all the data values and dividing by the count: ȳ = Total/n = Σy/n; usually paired with the standard deviation
Resistant
a calculated summary is said to be resistant if outliers have only a small effect on it
Variance
the sum of squared deviations from the mean, divided by the count minus 1: s² = Σ(y − ȳ)²/(n − 1); for a random variable, the expected value of the squared deviation from the mean; for discrete random variables, it can be calculated as σ² = Var(X) = Σ(x − μ)²·P(x).
Standard deviation
the square root of the variance: s = √[Σ(y − ȳ)²/(n − 1)]; usually reported along with the mean
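The mean, variance, and standard deviation formulas can be checked directly in Python (the data values are invented for illustration; the standard library's statistics module uses the same n − 1 divisor):

```python
import statistics

data = [4, 8, 6, 5, 3, 7, 9]   # hypothetical values
n = len(data)

mean = sum(data) / n                                  # ybar = sum(y) / n
var = sum((y - mean) ** 2 for y in data) / (n - 1)    # s^2 = sum((y - ybar)^2) / (n - 1)
sd = var ** 0.5                                       # s = sqrt(s^2)

# the statistics module computes the same sample quantities
assert abs(var - statistics.variance(data)) < 1e-9
assert abs(sd - statistics.stdev(data)) < 1e-9
```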
Comparing distributions
When using histograms or stem-and-leaf displays, consider their: shape, center, and spread
Comparing boxplots
Compare the shapes (Do the boxes look symmetric or skewed? Are there differences between groups?); Compare the medians (Which group has the higher center? Is there any pattern to the medians?); Compare the IQRs (Which group is more spread out? Is there any pattern to how the spreads change?); Check for possible outliers (identify them if you can)
Timeplot
displays data that change over time; often, successive values are connected with lines to show trends more clearly; sometimes a smooth curve is added to the plot to help show long-term patterns and trends
Standardizing
the method of eliminating units, in which values can be compared and combined even if the original variables had different units and magnitudes
Standardized value
a value found by subtracting the mean and dividing by the standard deviation
Shifting
adding a constant to each data value adds the same constant to the mean, the median, and the quartiles, but does not change the standard deviation or IQR
Rescaling
multiplying each data value by a constant multiplies both the measures of position (mean, median, and quartiles) and the measures of spread (standard deviation and IQR) by that constant
Normal model
a useful family of models for unimodal, symmetric distributions
Parameter
a numerically valued attribute of a model; e.g. the values of μ and σ in a N(μ, σ) model
Statistic
a value calculated from data to summarize aspects of the data; e.g. the mean, ȳ, and standard deviation, s
z-score
tells how many standard deviations a value is from the mean; have a mean of 0 and a standard deviation of 1
When working with data, use the statistics ȳ and s:
z = (y − ȳ)/s.
When working with models, use the parameters μ and σ:
z = (y − μ)/σ.
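Standardizing can be sketched in a few lines of Python with made-up data; the resulting z-scores always have mean 0 and standard deviation 1:

```python
data = [10, 20, 30, 40, 50]   # hypothetical values
n = len(data)
ybar = sum(data) / n
s = (sum((y - ybar) ** 2 for y in data) / (n - 1)) ** 0.5

z = [(y - ybar) / s for y in data]   # z = (y - ybar) / s

# check: standardized values have mean 0 and standard deviation 1
z_mean = sum(z) / n
z_sd = (sum((v - z_mean) ** 2 for v in z) / (n - 1)) ** 0.5
```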
Standard normal model
a Normal model, N(μ, σ), with mean μ = 0 and standard deviation σ = 1; also called the standard Normal distribution
Nearly normal condition
A distribution is nearly normal if it is unimodal and symmetric. We can check by looking at a histogram or a Normal probability plot.
68-95-99.7 Rule
In a Normal model, about 68% of values fall within 1 standard deviation of the mean, about 95% fall within 2 standard deviations of the mean, and about 99.7% fall within 3 standard deviations of the mean.
Normal percentile
(corresponding to a z-score) gives the percentage of values in a standard normal distribution found at that z-score or below
Normal probability plot
a display to help assess whether a distribution of data is approximately normal. If the plot is nearly straight, the data satisfy the Nearly Normal Condition.
Scatterplot
shows the relationship between two quantitative variables measured on the same cases
Association
1) Direction (a positive direction or association means that, in general, as one variable increases, so does the other; when increases in one variable generally correspond to decreases in the other, the association is negative) 2) Form (the form we care about most is straight, but other patterns should be described as well) 3) Strength (an association is strong if the points show little scatter around the underlying form)
Outlier
a point that does not fit the overall pattern seen in the scatterplot; any data point that stands away from the others; in regression, cases can be extraordinary in two ways: by having a large residual or by having high leverage
Response variable, Explanatory variable, x-variable, y-variable
in a scatterplot, you must choose a role for each variable; assign to the y-axis the response variable that you hope to predict or explain; assign to the x-axis the explanatory or predictor variable that accounts for, explains, predicts, or is otherwise responsible for the y-variable
Correlation Coefficient
a numerical measure of the direction and strength of a linear association
r = Σ(z_x · z_y)/(n − 1)
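The correlation coefficient can be computed straight from its definition as an average product of z-scores; a minimal sketch in Python (the helper function below is an illustration, not a library routine):

```python
def correlation(xs, ys):
    """r = sum(z_x * z_y) / (n - 1), computed from the definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = (sum((x - mx) ** 2 for x in xs) / (n - 1)) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / (n - 1)) ** 0.5
    return sum(((x - mx) / sx) * ((y - my) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

# a perfectly linear positive association gives r close to 1
r = correlation([1, 2, 3, 4], [2, 4, 6, 8])
```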
Lurking variable
a variable other than x and y that simultaneously affects both variables, accounting for the correlation between the two; a variable that is not explicitly part of a model but affects the way the variables in the model appear to be related
Re-expression
taking the logarithm, the square root, the reciprocal, or some other mathematical operation of all values of a variable
Ladder of Powers
places in order the effects that many re-expressions have on the data
Linear model
An equation of the form
ŷ = b₀ + b₁x.
To interpret this, we need to know the variables (along with their W's) and their units.
Model
An equation or formula that simplifies and represents reality
Predicted value
the value of ŷ found for a given x-value in the data; found by substituting the x-value into the regression equation; the points (x, ŷ) all lie exactly on the fitted line, which comes from the linear model that we fit:
ŷ = b₀ + b₁x.
Residuals
the differences between data values and the corresponding values predicted by the regression model-or, more generally, values predicted by any model
Regression line (line of best fit)
the particular line equation (ŷ = b₀ + b₁x) that satisfies the least squares criterion
Least squares
the criterion that specifies the unique line that minimizes the variance of the residuals or, equivalently, the sum of the squared residuals
Slope
b₁, gives a value in "y-units per x-unit"; changes of one unit in x are associated with changes of b₁ units in predicted values of y
Intercept
b₀, gives a starting value in y-units; it's the ŷ-value when x is 0; you can find it from b₀ = ȳ − b₁x̄
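The slope and intercept can be computed directly; the data below are invented for illustration, and the slope uses the equivalent formula b₁ = Σ(x − x̄)(y − ȳ)/Σ(x − x̄)²:

```python
# hypothetical paired data (illustrative values only)
xs = [1, 2, 3, 4, 5]
ys = [2, 5, 4, 8, 11]
n = len(xs)

xbar, ybar = sum(xs) / n, sum(ys) / n
# least squares slope: b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
      / sum((x - xbar) ** 2 for x in xs))
b0 = ybar - b1 * xbar        # intercept: b0 = ybar - b1 * xbar

def predict(x):
    return b0 + b1 * x       # yhat = b0 + b1 * x
```

One consequence worth noticing: the fitted line always passes through the point (x̄, ȳ).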
Regression to the mean
because the correlation is always less than 1.0 in magnitude, each predicted ŷ tends to be fewer standard deviations from its mean than its corresponding x was from its mean
Standard deviation of the residuals (sₑ)
The standard deviation of the residuals can be found by sₑ = √[Σe²/(n − 2)]; when the assumptions and conditions are met, the residuals can be well described by using this standard deviation and the 68-95-99.7 Rule
R²
the square of the correlation between y and x; gives the fraction of the variability of y accounted for by the least squares linear regression on x; an overall measure of how successful the regression is in linearly relating y to x
Extrapolation
although linear models provide an easy way to predict values of y for a given value of x, it is unsafe to predict for values of x far from the ones used to find the linear model equation; such extrapolation may pretend to see into the future, but the predictions are likely to be wrong
Leverage
Data points whose x-values are far from the mean of x are said to exert this on a linear model; high-leverage points pull the line close to them, and so they can have a large effect on the line, sometimes completely determining the slope and intercept; with high enough leverage, a point's residual can appear deceptively small
Influential point
a point that, if omitted from the data, results in a very different regression model
Random
An outcome in which we know the possible values it can have, but not which particular value it takes
Generating random numbers
truly random values are hard to generate; nevertheless, several Internet sites offer an unlimited supply of equally likely random values
Simulation
models a real world situation by using random-digit outcomes to mimic the uncertainty of a response variable of interest
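A minimal simulation sketch in Python, estimating the chance of rolling at least one 6 in four rolls of a fair die (each roll is a component, a set of four rolls is a trial, and the response records whether a 6 appeared):

```python
import random

random.seed(42)   # fixed seed so the run is reproducible

def trial():
    # one trial = four components, each an equally likely die roll
    rolls = [random.randint(1, 6) for _ in range(4)]
    return any(r == 6 for r in rolls)     # response: did at least one 6 appear?

n_trials = 100_000
estimate = sum(trial() for _ in range(n_trials)) / n_trials
# the long-run relative frequency approaches 1 - (5/6)**4, about 0.5177
```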
Trial
the sequence of several components representing events that we are pretending will take place; a single attempt or realization of a random phenomenon
Component
the basic building block of a simulation; a component uses equally likely random digits to model simple random occurrences whose outcomes may not be equally likely
Response variable
values that record the results of each trial with respect to what we were interested in; a variable whose values are compared across different treatments; in a randomized experiment, large response differences can be attributed to the effect of differences in treatment level
Sample
a representative subset of a population, examined in the hope of learning about the population
Sample survey
a study that asks questions of a sample drawn from some population in the hope of learning something about the entire population; a common example is polls taken to assess voter preferences
Bias
any systematic failure of a sampling method to represent its population; biased sampling methods tend to over- or underestimate parameters; common errors include relying on voluntary response, undercoverage of the population, nonresponse bias, and response bias
Randomization
the best defense against bias; in which each individual is given a fair, random chance of selection
Sample size
the number of individuals in a sample; determines how well the sample represents the population, not the fraction of the population sampled
Census
a sample that consists of the entire population
Population parameter
a numerically valued attribute of a model for a population; we rarely expect to know the true value of this parameter, but we do hope to estimate it from sampled data; example: the mean income of all employed people in the country
Sample statistic
statistics (values calculated for sampled data) that correspond to, and thus estimate, a population parameter are of particular interest; example: the mean income of all employed people in a representative sample can provide a good estimate of the corresponding mean income of the population
Representative
a sample in which the statistics computed from it accurately reflect the corresponding population parameters
Simple Random Sample (SRS)
sample size n, in which each set of n elements in the population has an equal chance of selection
Sampling frame
a list of individuals from whom the sample is drawn; individuals who may be in the population of interest, but who are not in this, cannot be included in any sample
Sampling variability
the natural tendency of randomly drawn samples to differ, one from another; sometimes, unfortunately, called sampling error, though it is no error at all, but just the natural result of random sampling
Cluster sample
a sampling design in which entire groups, or clusters, are chosen at random; usually selected as a matter of convenience, practicality, or cost; if each cluster is heterogeneous, resembling the population at large, a cluster sample should be representative of the population
Multistage sample
Sampling schemes that combine several sampling methods; example: a national polling service may stratify the country by geographical regions, select a random sample of cities from each region, and then interview a cluster of residents in each city
Systematic sample
a sample drawn by selecting individuals systematically from a sampling frame; when there is no relationship between the order of the sampling frame and the variables of interest, this can be representative
Pilot study
a small trial run of a survey to check whether questions are clear; it can reduce errors due to ambiguous questions
Voluntary response sample
a sample in which individuals can choose on their own whether to participate; this is always invalid and cannot be recovered, no matter how large the sample size
Voluntary response bias
bias introduced to a sample when individuals can choose on their own whether to participate in the sample
Convenience sample
consists of the individuals who are conveniently available; this often fails to be representative because every individual in the population is not equally convenient to sample
Undercoverage
a sampling scheme that biases the sample in a way that gives a part of the population less representation in the sample than it has in the population
Nonresponse bias
bias introduced when a large fraction of those sampled fails to respond; those who do not respond are likely to not represent the entire population; voluntary response bias is a form of this, but it may occur for other reasons; example: those who are at home to answer a daytime telephone survey may not represent the population of interest
Response bias
anything in a survey design that influences responses falls under this heading; a typical instance arises from the wording of questions, which may suggest a favored response; example: voters are more likely to express support of "the president" than of the particular person holding that office when asked by name
Observational study
a study based on data in which no manipulation of factors has been employed
Retrospective study
an observational study in which subjects are selected and then their previous conditions or behaviors are determined; this need not be based on random samples, and it usually focuses on estimating differences between groups or associations between variables
Prospective study
an observational study in which subjects are followed to observe future outcomes; because no treatments are deliberately applied, this is not an experiment; nevertheless, it typically focuses on estimating differences among groups that might appear as the groups are followed during the course of the study
Random assignment
to be valid, an experiment must assign experimental units to treatment groups using some form of randomization
Factor
a variable that the experimenter manipulates in order to create treatments; in a randomized experiment, the experimenter changes factor levels to discern the effects they may have on the responses of the experimental units
Experiment
manipulates factor levels to create treatments, randomly assigns subjects to these treatment levels, and then compares the responses of the subject groups across treatment levels
Subjects/participants
the individuals who participate in an experiment, especially when they are human; a more general term is experimental unit
Experimental units
individuals on whom an experiment is performed
Levels
the specific values that the experimenter chooses for a factor
Treatment
the process, intervention, or other controlled circumstance applied to randomly assigned experimental units; they are the different levels of a single factor or are made up of combinations of levels of two or more factors
Principles of experimental design
1) Control aspects of the experiment that we know may have an effect on the response, but that are not the factors being studied. 2) Randomize subjects to treatments to even out effects that we cannot control. 3) Replicate over as many subjects as possible. 4) Block to reduce the effects of identifiable attributes of the subjects that cannot be controlled.
Completely randomized design
in which all experimental units have an equal chance of receiving any treatment
Statistically significant
when an observed difference is too large for us to believe that it is likely to have occurred naturally, we consider the difference to be this. Subsequent chapters will show specific calculations and give rules, but the principle remains the same
Control group
the experimental units assigned to a baseline treatment level (called the "control treatment") typically either the default treatment, which is well understood, or a null, placebo treatment; their responses provide a basis for comparison
Blinding
the practice of concealing from individuals associated with an experiment how subjects have been allocated to treatment groups; any such individual who does not know the allocation is said to be blinded
Single-blind and double-blind
There are two main classes of individuals who can affect the outcome of an experiment: 1) those who could influence the results (the subjects, treatment administrators, or technicians) 2) those who evaluate the results (judges, treating physicians, etc.). When every individual in either one of these classes is blinded, an experiment is said to be single-blind; when everyone in both classes is blinded, it is double-blind.
Placebo
a treatment known to have no effect, administered so that all groups experience the same conditions; many subjects respond to such a treatment (a response known as the placebo effect); only by comparing with a placebo can we be sure that the observed effect of a treatment is not due simply to the placebo effect
Placebo effect
the tendency of many human subjects (often 20% or more of experiment subjects) to show a response even when administered a placebo
Block
when groups of experimental units are similar in a way that is not a factor under study, it is often a good idea to gather them together into these and then randomize the assignment of treatments within each one. By blocking, we isolate the variability attributable to the differences between the blocks so that we can see the differences caused by the treatments more clearly
Randomized block design
an experiment design in which participants are randomly assigned to treatments within each block
Matching
in a retrospective or prospective study, participants who are similar in ways not under study may be matched and then compared with each other on the variables of interest; this, like blocking, reduces unwanted variation
Confounding
when the levels of one factor are associated with the levels of another factor in such a way that their effects cannot be separated
Random phenomenon
a phenomenon in which we know what outcomes could happen, but not which particular values will happen
Outcome
the value measured, observed, or reported for an individual instance of a trial
Event
a collection of outcomes; usually identified so that we can attach probabilities to them; we denote these with bold capital letters such as A, B, or C
Sample space
the collection of all possible outcome values; the collection of values in this has a probability of 1; we denote this with a boldface capital S.
Law of Large Numbers
This law states that the long-run relative frequency of an event's occurrence gets closer and closer to the true relative frequency as the number of trials increases.
Independence (informally)
Two events are this if learning that one event occurs does not change the probability that the other event occurs
Probability
of an event, a number between 0 and 1 that reports the likelihood of that event's occurrence; we write P(A) for this of event A
Empirical probability
when the probability comes from the long-run relative frequency of the event's occurrence
Theoretical probability
when the probability comes from a model (such as equally likely outcomes)
Personal (or subjective) probability
when the probability is subjective and represents your personal degree of belief
Probability Assignment Rule
The probability of an entire sample space must be 1. P(S)=1
Complement Rule
The probability of an event not occurring is 1 minus the probability that it does occur: P(A^c)=1-P(A)
Addition Rule
If A and B are disjoint events, then the probability of A or B is P(A or B)=P(A) + P(B)
Disjoint (mutually exclusive)
Two events are disjoint if they share no outcomes in common; If A and B are this then knowing that A occurs tells us that B cannot occur
Legitimate assignment of probabilities
An assignment of probabilities to outcomes is legitimate if: 1) each probability is between 0 and 1 (inclusive) 2) the sum of the probabilities is 1.
Multiplication Rule
If A and B are independent events, then the probability of A and B is P(A and B) = P(A) × P(B)
General Addition Rule
For any two events, A and B, the probability of A or B is P(A or B)=P(A) + P(B)-P(A and B)
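The rule is easy to verify by enumerating an equally likely sample space; a sketch with two dice, using exact arithmetic via fractions:

```python
from fractions import Fraction
from itertools import product

# sample space: all 36 equally likely outcomes of rolling two dice
S = list(product(range(1, 7), repeat=2))

def P(event):
    # probability as (favorable outcomes) / (total outcomes)
    return Fraction(sum(1 for o in S if event(o)), len(S))

A = lambda o: o[0] == 6            # event A: first die shows 6
B = lambda o: o[0] + o[1] == 7     # event B: the sum is 7

lhs = P(lambda o: A(o) or B(o))
rhs = P(A) + P(B) - P(lambda o: A(o) and B(o))
# lhs equals rhs: the General Addition Rule holds
```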
Conditional probability
P(B|A) = P(A and B)/P(A);
P(B|A) is read "the probability of B given A"
General Multiplication Rule
For any two events, A and B, the probability of A and B is P(A and B) = P(A) × P(B|A)
Independence (used formally)
Events A and B are independent when P(B|A)=P(B)
Tree diagram
A display of conditional events or probabilities that is helpful in thinking through conditioning
Random variable
assumes any of several different values as a result of some random event; these are denoted by a capital letter, such as X
Discrete random variable
A random variable that can take one of a finite number of distinct outcomes
Continuous random variable
a random variable that can take on any of an (uncountably) infinite number of outcomes
Probability model
a function that associates a probability P with each value of a discrete random variable X; denoted P(X=x) or P(x), or with any interval of values of a continuous random variable
Expected value
theoretical long-run average value, the center of its model; denoted μ or E(X), it is found (if the random variable is discrete) by summing the products of variable values and probabilities:
μ = E(X) = Σ x·P(x)
Standard deviation of a random variable
Describes the spread in the model and is the square root of the variance, denoted SD(X) or σ
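Both the expected value and the standard deviation of a discrete random variable follow directly from the probability model; a sketch for a fair six-sided die:

```python
# discrete probability model: X = face showing on a fair die
model = {x: 1 / 6 for x in range(1, 7)}

mu = sum(x * p for x, p in model.items())               # E(X) = sum of x * P(x)
var = sum((x - mu) ** 2 * p for x, p in model.items())  # Var(X) = sum of (x - mu)^2 * P(x)
sd = var ** 0.5                                         # SD(X) = sqrt(Var(X))
```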
Changing a random variable by a constant
E(X ± c) = E(X) ± c, Var(X ± c) = Var(X), SD(X ± c) = SD(X), E(aX) = aE(X), Var(aX) = a²Var(X), SD(aX) = |a|SD(X)
Addition Rule for Expected Values of Random Variables
E(X ± Y) = E(X) ± E(Y)
Addition Rule for Variances of Random Variables
(Pythagorean Theorem of Statistics) If X and Y are independent: Var(X ± Y) = Var(X) + Var(Y), and SD(X ± Y) = √[Var(X) + Var(Y)]
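The variance addition rule can be checked exactly by building the probability model for the sum of two independent dice:

```python
from itertools import product

die = {x: 1 / 6 for x in range(1, 7)}   # model for one fair die

def mean_var(model):
    mu = sum(x * p for x, p in model.items())
    return mu, sum((x - mu) ** 2 * p for x, p in model.items())

# build the exact model for X + Y when X and Y are independent dice:
# P(X + Y = s) is the sum of P(X = x) * P(Y = y) over all pairs with x + y = s
sum_model = {}
for (x, px), (y, py) in product(die.items(), die.items()):
    sum_model[x + y] = sum_model.get(x + y, 0) + px * py

_, var_one = mean_var(die)
_, var_sum = mean_var(sum_model)
# var_sum equals 2 * var_one: Var(X + Y) = Var(X) + Var(Y)
```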
Bernoulli trials
A sequence of trials in which: 1) There are exactly two possible outcomes (usually denoted success and failure) 2) The probability of success is constant 3) The trials are independent
Binomial probability distribution
appropriate for a random variable that counts the number of successes in n Bernoulli trials
Binomial Model
P(X = x) = ₙCₓ p^x q^(n−x), where ₙCₓ = n!/[x!(n − x)!] and q = 1 − p
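The Binomial formula translates directly into code; a sketch using the standard library's math.comb for the binomial coefficient:

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) = nCx * p^x * (1 - p)^(n - x)"""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# probability of exactly 2 successes in 5 Bernoulli trials with p = 0.5
print(binom_pmf(2, 5, 0.5))   # 0.3125
```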
Poisson model
A discrete model often used to model the number of arrivals of events such as customers arriving in a queue or calls arriving into a call center