Statistics Midterm Exam (4th Ed. Ch. 1-14) Vocabulary

Data

recorded values whether numbers or labels, together with their context

Data table

an arrangement of data in which each row represents a case and each column represents a variable

Context

ideally tells who was measured, what was measured, how the data were collected, where the data were collected, and when and why the study was performed

Case

an individual about whom or which we have data

Respondent

someone who answers, or responds to, a survey

Subject/Participant

a human experimental unit

Experimental unit

an individual in a study for which or for whom data values are recorded

Record

information about an individual

Sample

a subset of a population, examined in hope of learning about the population

Population

the entire group of individuals or instances about whom we hope to learn

Variable

holds information about the same characteristic for many cases

Categorical (or qualitative) variable

a variable that names categories with words or numerals

Nominal variable

the term applied to a variable whose values are used only to name categories

Quantitative variable

a variable in which the numbers are values of measured quantities with units

Units

a quantity or amount adopted as a standard of measurement, such as dollars, hours, or grams

Identifier variable

a categorical variable that records a unique value for each case, used to name or identify it

Ordinal variable

the term applied to a variable whose categorical values possess some kind of order

Frequency table

lists the categories in a categorical variable and gives the count of observations for each category

Relative frequency table

lists the categories in a categorical variable and gives the percentage of observations for each category

Distribution

gives the possible values of the variable and the relative frequency of each value; in a quantitative variable, it slices up all of the possible values of the variable into equal-width bins and gives the number of values (or counts) falling into each bin

Area principle

In a statistical display, each data value should be represented by the same amount of area.

Bar chart

shows a bar whose area represents the count of observations for each category

Relative frequency bar chart

shows a bar whose area represents the percentage of observations for each category

Pie chart

shows how a "whole" divides into categories by showing a wedge of a circle whose area corresponds to the proportion in each category

Categorical data condition

It is important not to confuse displays for categorical data with quantitative data.

Contingency table

displays counts and, sometimes, percentages of individuals falling into named categories on two or more variables; the table categorizes the individuals on all variables at once to reveal possible patterns in one variable that may be contingent on the category of the other

Marginal distribution

the distribution of either variable alone in a contingency table; the counts or percentages are the totals found in the margins (last row or column) of the table

Conditional distribution

the distribution of a variable restricting the "who" to consider only a smaller group of individuals

Independence

Two variables are independent if the conditional distribution of one variable is the same for each category of the other.

Segmented bar chart

displays the conditional distribution of a categorical variable within each category of another variable

Simpson's Paradox

When averages are taken across different groups, they can appear to contradict the overall averages.

Histogram

uses adjacent bars to show the distribution of a quantitative variable; each bar represents the frequency of values falling into each bin

Relative frequency histogram

uses adjacent bars to show the distribution of a quantitative variable; each bar represents the relative frequency of values falling into each bin

Gap

a region of the distribution where there are no values

Stem-and-leaf display

a display that shows the quantitative data values in a way that sketches the distribution of the data; it's best described in detail by example

Dotplot

graphs a dot for each case against a single axis

Shape

a description of a distribution in which we look for: single vs. multiple modes, symmetry vs. skewness, outliers and gaps

Mode

a hump or local high point in the shape of the distribution of a variable; the apparent location can change as the scale of a histogram is changed

Unimodal

a distribution having one mode

Bimodal

a distribution having two modes

Multimodal

a distribution having more than two modes

Uniform

a distribution that doesn't appear to have any mode and in which all the bars of its histogram are approximately the same height

Symmetric

a distribution in which the two halves on either side of the center look approximately like mirror images of one another

Tails

the parts of a distribution that typically trail off on either side

Skewed

a distribution that is not symmetric and one tail stretches out farther than the other

Skewed left

in which the distribution's longer tail stretches to the left

Skewed right

in which the distribution's longer tail stretches to the right

Outliers

extreme values that don't appear to belong with the rest of the data

Center

the place in the distribution of a variable that you'd point to if you wanted to attempt the impossible by summarizing the entire distribution with a single number; measures include the mean and median

Median

the middle value, with half of the data above and half below it; if "n" is even, it is the average of the two middle values; it is usually paired with the IQR

Spread

a numerical summary of how tightly the values are clustered around the center; measures of spread include the IQR and the standard deviation

Range

the difference between the lowest and highest values in a data set: Range = max - min

Quartile

the lower quartile (Q1) is the value with a quarter of the data below it; the upper quartile (Q3) has three quarters of the data below it; the median and quartiles divide the data into four parts with equal numbers of data values

Percentile

the "i"th percentile is the number that falls above "i"% of the data

Interquartile range (IQR)

the difference between the first and third quartiles
IQR = Q3 - Q1; it is usually reported along with the median

5-number summary

reports the minimum value, Q1, the median, Q3, and the maximum value
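
A quick sketch in Python (not from the textbook; the data are made up, and note that software packages use slightly different quartile conventions, so results can differ a little from hand calculations):

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# statistics.quantiles with n=4 returns the three cut points Q1, median, Q3.
# The "inclusive" method interpolates the way most textbooks and numpy do;
# the default "exclusive" method gives slightly different quartiles.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")

five_number = (min(data), q1, median, q3, max(data))
print(five_number)  # (1, 3.5, 6.0, 8.5, 11)

iqr = q3 - q1       # interquartile range: Q3 - Q1
```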

Boxplot

displays the 5-number summary as a central box with whiskers that extend to the non-outlying data values; particularly effective for comparing groups and for displaying possible outliers

Mean

found by summing all the data values and dividing by the count: ȳ = Total/n = Σy/n; usually paired with the standard deviation

Resistant

a calculated summary is said to be resistant if outliers have only a small effect on it

Variance

the sum of squared deviations from the mean, divided by the count minus 1: s² = Σ(y - ȳ)²/(n - 1); the expected value of the squared deviations from the mean in a random variable; for discrete random variables, it can be calculated as σ² = Var(X) = Σ(x - μ)²·P(x).

Standard deviation

the square root of the variance: s = √[Σ(y - ȳ)²/(n - 1)]; usually reported along with the mean
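
A minimal Python sketch of these formulas, using hypothetical data; the hand calculation should match the standard library:

```python
import math
import statistics

y = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up data
n = len(y)
ybar = sum(y) / n               # mean

# Sample variance: sum of squared deviations from the mean, over (n - 1)
s2 = sum((v - ybar) ** 2 for v in y) / (n - 1)
s = math.sqrt(s2)               # standard deviation

# The standard library agrees with the hand formulas
assert math.isclose(s2, statistics.variance(y))
assert math.isclose(s, statistics.stdev(y))
```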

Comparing distributions

When using histograms or stem-and-leaf displays, consider their: shape, center, and spread

Comparing boxplots

Compare the shapes (Do the boxes look symmetric or skewed? Are there differences between groups?); Compare the medians (Which group has the higher center? Is there any pattern to the medians?); Compare the IQRs (Which group is more spread out? Is there any pattern to how the IQRs change?)

Timeplot

displays data that change over time; often, successive values are connected with lines to show trends more clearly; sometimes a smooth curve is added to the plot to help show long-term patterns and trends

Standardizing

the method of eliminating units, in which values can be compared and combined even if the original variables had different units and magnitudes

Standardized value

a value found by subtracting the mean and dividing by the standard deviation

Shifting

adding a constant to each data value adds the same constant to the mean, the median, and the quartiles, but does not change the standard deviation or IQR

Rescaling

multiplying each data value by a constant multiplies both the measures of position (mean, median, and quartiles) and the measures of spread (standard deviation and IQR) by that constant
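
The shifting and rescaling rules can be checked numerically; a small Python sketch with made-up data:

```python
import math
import statistics

y = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical data

shifted = [v + 10 for v in y]   # shifting: add a constant to every value
scaled = [3 * v for v in y]     # rescaling: multiply every value by a constant

# Shifting moves the measures of position but leaves the spread alone
assert statistics.mean(shifted) == statistics.mean(y) + 10
assert statistics.stdev(shifted) == statistics.stdev(y)

# Rescaling multiplies both position and spread by the constant
assert statistics.mean(scaled) == 3 * statistics.mean(y)
assert math.isclose(statistics.stdev(scaled), 3 * statistics.stdev(y))
```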

Normal model

a useful family of models for unimodal, symmetric distributions

Parameter

a numerically valued attribute of a model; e.g. the values of μ and σ in a N(μ, σ) model

Statistic

a value calculated from data to summarize aspects of the data; e.g. the mean, ȳ, and standard deviation, s

z-score

tells how many standard deviations a value is from the mean; z-scores have a mean of 0 and a standard deviation of 1.
When working with data, use the statistics ȳ and s:
z = (y - ȳ)/s.
When working with models, use the parameters μ and σ:
z = (y - μ)/σ.
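
A Python sketch of standardizing (hypothetical data), confirming that z-scores always end up with mean 0 and standard deviation 1:

```python
import math
import statistics

y = [2, 4, 4, 4, 5, 5, 7, 9]    # made-up data
ybar = statistics.mean(y)
s = statistics.stdev(y)

# Standardize each value: z = (y - ybar) / s
z = [(v - ybar) / s for v in y]

assert math.isclose(statistics.mean(z), 0, abs_tol=1e-12)  # mean 0
assert math.isclose(statistics.stdev(z), 1)                # sd 1
```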

Standard normal model

a Normal model, N(μ, σ), with mean μ = 0 and standard deviation σ = 1; also called the standard Normal distribution

Nearly normal condition

A distribution is nearly normal if it is unimodal and symmetric. We can check by looking at a histogram or a Normal probability plot.

68-95-99.7 Rule

In a Normal model, about 68% of values fall within 1 standard deviation of the mean, about 95% fall within 2 standard deviations of the mean, and about 99.7% fall within 3 standard deviations of the mean.
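
The rule can be verified from the Normal cdf in the Python standard library; a quick sketch:

```python
import statistics

nd = statistics.NormalDist()  # standard Normal: mean 0, sd 1

def within(k):
    # P(-k < Z < k): area within k standard deviations of the mean
    return nd.cdf(k) - nd.cdf(-k)

print(round(within(1), 4), round(within(2), 4), round(within(3), 4))
# 0.6827 0.9545 0.9973 - the source of the 68-95-99.7 Rule
```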

Normal percentile

(corresponding to a z-score) gives the percentage of values in a standard normal distribution found at that z-score or below

Normal probability plot

a display to help assess whether a distribution of data is approximately normal. If the plot is nearly straight, the data satisfy the Nearly Normal Condition.

Scatterplot

shows the relationship between two quantitative variables measured on the same cases

Association

1) Direction (a positive direction or association means that, in general, as one variable increases, so does the other; when increases in one variable generally correspond to decreases in the other, the association is negative) 2) Form (the form we care about most is straight, but you should describe other patterns you see) 3) Strength (how closely the points in the scatterplot fit the form of the association)

Outlier

a point that does not fit the overall pattern seen in the scatterplot; any data point that stands away from the others; in regression, cases can be extraordinary in two ways: by having a large residual or by having high leverage

Response variable, Explanatory variable, x-variable, y-variable

in a scatterplot, you must choose a role for each variable; assign to the y-axis the response variable that you hope to predict or explain; assign to the x-axis the explanatory or predictor variable that accounts for, explains, predicts, or is otherwise responsible for the y-variable

Correlation Coefficient

a numerical measure of the direction and strength of a linear association
r = Σ(zx·zy)/(n - 1)
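
A Python sketch of the z-score formula for r, using a hypothetical perfectly linear data set (so r should come out to 1):

```python
import math
import statistics

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]   # made-up data: y is exactly 2x

xbar, sx = statistics.mean(x), statistics.stdev(x)
ybar, sy = statistics.mean(y), statistics.stdev(y)

# r = sum of the products of the z-scores, divided by (n - 1)
n = len(x)
r = sum(((a - xbar) / sx) * ((b - ybar) / sy) for a, b in zip(x, y)) / (n - 1)

assert math.isclose(r, 1.0)   # perfectly linear, positive association
```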

Lurking variable

a variable other than x and y that simultaneously affects both variables, accounting for the correlation between the two; a variable that is not explicitly part of a model but affects the way the variables in the model appear to be related

Re-expression

taking the logarithm, the square root, the reciprocal, or some other mathematical operation of all values of a variable

Ladder of Powers

places in order the effects that many re-expressions have on the data

Linear model

An equation of the form
ŷ = b₀ + b₁x.
To interpret this, we need to know the variables (along with their W's) and their units.

Model

An equation or formula that simplifies and represents reality

Predicted value

the value of ŷ found for a given x-value in the data; found by substituting the x-value in the regression equation; values on the fitted line; the points (x, ŷ) all lie exactly on the fitted line; found from the linear model that we fit:
ŷ = b₀ + b₁x.

Residuals

the differences between data values and the corresponding values predicted by the regression model or, more generally, by any model

Regression line (line of best fit)

the particular line equation (ŷ = b₀ + b₁x) that satisfies the least squares criterion

Least squares

the criterion that specifies the unique line that minimizes the variance of the residuals or, equivalently, the sum of the squared residuals

Slope

b₁ gives a value in "y-units per x-unit"; changes of one unit in x are associated with changes of b₁ units in predicted values of y

Intercept

b₀ gives a starting value in y-units; it's the ŷ-value when x is 0; you can find it from b₀ = ȳ - b₁x̄
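
The least-squares slope and intercept can be computed directly from these formulas; a Python sketch with hypothetical data:

```python
import statistics

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]   # made-up data
xbar, ybar = statistics.mean(x), statistics.mean(y)

# Slope from the sums of cross-products and squared deviations
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
sxx = sum((a - xbar) ** 2 for a in x)
b1 = sxy / sxx            # slope: y-units per x-unit
b0 = ybar - b1 * xbar     # intercept: b0 = ybar - b1 * xbar

print(round(b1, 10), round(b0, 10))  # 0.6 2.2
```

So the fitted line for this data set is ŷ = 2.2 + 0.6x.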

Regression to the mean

because the correlation is always less than 1.0 in magnitude, each predicted ŷ tends to be fewer standard deviations from its mean than its corresponding x was from its mean

Standard deviation of the residuals (se)

The standard deviation of the residuals can be found by se = √[Σe²/(n - 2)]; when the assumptions and conditions are met, the residuals can be well described by using this standard deviation and the 68-95-99.7 Rule

R²

the square of the correlation between y and x; gives the fraction of the variability of y accounted for by the least squares linear regression on x; an overall measure of how successful the regression is in linearly relating y to x

Extrapolation

although linear models provide an easy way to predict values of y for a given value of x, it is unsafe to predict for values of x far from the ones used to find the linear model equation; such extrapolation may pretend to see into the future, but the predictions should not be trusted

Leverage

Data points whose x-values are far from the mean of x are said to exert this on a linear model; high-leverage points pull the line close to them, and so they can have a large effect on the line, sometimes completely determining the slope and intercept; with high enough leverage, a point's residual can appear deceptively small

Influential point

a point that, if omitted from the data, results in a very different regression model

Random

An outcome is random if we know the possible values it can have, but not which particular value it will take

Generating random numbers

truly random values are hard to generate; nevertheless, several Internet sites offer an unlimited supply of equally likely random values

Simulation

models a real world situation by using random-digit outcomes to mimic the uncertainty of a response variable of interest
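
A sketch of a simple simulation in Python (a hypothetical question, not from the text): estimate the chance of rolling at least one 6 in four rolls of a fair die. Each die roll is a component, four rolls make a trial, and the response records whether a 6 appeared.

```python
import random

random.seed(1)  # fix the seed so the simulation is reproducible

def trial():
    # One trial: four die rolls; response is whether any roll was a 6
    return any(random.randint(1, 6) == 6 for _ in range(4))

trials = 100_000
estimate = sum(trial() for _ in range(trials)) / trials

# The exact answer is 1 - (5/6)**4, about 0.5177; the estimate should be close
print(estimate)
```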

Trial

the sequence of several components representing events that we are pretending will take place; a single attempt or realization of a random phenomenon

Component

a component of a simulation uses equally likely random digits to model simple random occurrences whose outcomes may not be equally likely

Response variable

values that record the results of each trial with respect to what we were interested in; a variable whose values are compared across different treatments; in a randomized experiment, large response differences can be attributed to the effects of the different treatments

Sample

a representative subset of a population, examined in the hope of learning about the population

Sample survey

a study that asks questions of a sample drawn from some population in the hope of learning something about the entire population; a common example is polls taken to assess voter preferences

Bias

any systematic failure of a sampling method to represent its population; biased sampling methods tend to over- or underestimate parameters; common errors include relying on voluntary response, undercoverage of the population, nonresponse bias, and response bias

Randomization

the best defense against bias; in which each individual is given a fair, random chance of selection

Sample size

the number of individuals in a sample; determines how well the sample represents the population, not the fraction of the population sampled

Census

a sample that consists of the entire population

Population parameter

a numerically valued attribute of a model for a population; we rarely expect to know the true value of this parameter, but we do hope to estimate it from sampled data; example: the mean income of all employed people in the country

Sample statistic

statistics (values calculated for sampled data) that correspond to, and thus estimate, a population parameter, are of particular interest; example: the mean income of all employed people in a representative sample can provide a good estimate of the corresponding mean income in the population

Representative

a sample in which the statistics computed from it accurately reflect the corresponding population parameters

Simple Random Sample (SRS)

a sample of size n in which each set of n elements in the population has an equal chance of selection

Sampling frame

a list of individuals from whom the sample is drawn; individuals who may be in the population of interest but who are not in the sampling frame cannot be included in any sample

Sampling variability

the natural tendency of randomly drawn samples to differ, one from another; sometimes, unfortunately, called sampling error, though it is no error at all, but just the natural result of random sampling

Cluster sample

a sampling design in which entire groups, or clusters, are chosen at random; clusters are usually selected as a matter of convenience, practicality, or cost; because each cluster is heterogeneous, it should be representative of the population

Multistage sample

Sampling schemes that combine several sampling methods; example: a national polling service may stratify the country by geographical regions, select a random sample of cities from each region, and then interview a cluster of residents in each city

Systematic sample

a sample drawn by selecting individuals systematically from a sampling frame; when there is no relationship between the order of the sampling frame and the variables of interest, this can be representative

Pilot study

a small trial run of a survey to check whether questions are clear; it can reduce errors due to ambiguous questions

Voluntary response sample

a sample in which individuals can choose on their own whether to participate; this is always invalid and cannot be recovered, no matter how large the sample size

Voluntary response bias

bias introduced to a sample when individuals can choose on their own whether to participate in the sample

Convenience sample

consists of the individuals who are conveniently available; this often fails to be representative because every individual in the population is not equally convenient to sample

Undercoverage

a sampling scheme that biases the sample in a way that gives a part of the population less representation in the sample than it has in the population

Nonresponse bias

bias introduced when a large fraction of those sampled fails to respond; those who do not respond are likely to not represent the entire population; voluntary response bias is a form of this, but this may occur for other reasons; example: those who are at home to answer the phone during the day may not be representative of the population of interest

Response bias

anything in a survey design that influences responses falls under this heading; a typical instance of this arises from the wording of questions, which may suggest a favored response; example: voters are more likely to express support of "the president" than support of the specific person named

Observational study

a study based on data in which no manipulation of factors has been employed

Retrospective study

an observational study in which subjects are selected and then their previous conditions or behaviors are determined; it need not be based on random samples, and it usually focuses on estimating differences between groups or associations between variables

Prospective study

an observational study in which subjects are followed to observe future outcomes; because no treatments are deliberately applied, this is not an experiment; nevertheless, it typically focuses on estimating differences among groups that may appear as the groups are exposed to different circumstances

Random assignment

to be valid, an experiment must assign experimental units to treatment groups using some form of randomization

Factor

a variable that the experimenter manipulates; an experiment attempts to discover the effects that differences in factor levels may have on the responses of the experimental units

Experiment

manipulates factor levels to create treatments, randomly assigns subjects to these treatment levels, and then compares the responses of the subject groups across treatment levels

Subjects/participants

the individuals who participate in an experiment, especially when they are human; a more general term is experimental unit

Experimental units

individuals on whom an experiment is performed

Levels

the specific values that the experimenter chooses for a factor

Treatment

the process, intervention, or other controlled circumstance applied to randomly assigned experimental units; they are the different levels of a single factor or are made up of combinations of levels of two or more factors

Principles of experimental design

1) Control aspects of the experiment that we know may have an effect on the response, but that are not the factors being studied. 2) Randomize subjects to treatments to even out effects that we cannot control. 3) Replicate over as many subjects as possible. 4) Block to reduce the effects of identifiable attributes of the subjects that cannot be controlled.

Completely randomized design

in which all experimental units have an equal chance of receiving any treatment

Statistically significant

when an observed difference is too large for us to believe that it is likely to have occurred naturally, we consider the difference to be this. Subsequent chapters will show specific calculations and give rules, but the principle remains the same

Control group

the experimental units assigned to a baseline treatment level (called the "control treatment") typically either the default treatment, which is well understood, or a null, placebo treatment; their responses provide a basis for comparison

Blinding

the practice of keeping any individual associated with an experiment unaware of how subjects have been allocated to treatment groups

Single-blind and double-blind

Two main classes of individuals can affect the outcome of an experiment: 1) those who could influence the results (the subjects, treatment administrators, or technicians) 2) those who evaluate the results (judges, treating physicians, etc.). When every individual in one of these classes is blinded, the experiment is single-blind; when everyone in both classes is blinded, it is double-blind.

Placebo

a treatment known to have no effect, administered so that all groups experience the same conditions; many subjects respond to such a treatment (a response known as the placebo effect); only by comparing with a placebo can we be sure that the observed effect of a treatment is not due simply to the placebo effect

Placebo effect

the tendency of many human subjects (often 20% or more of experimental subjects) to show a response even when administered a placebo

Block

when groups of experimental units are similar in a way that is not a factor under study, it is often a good idea to gather them together into blocks and then randomize the assignment of treatments within each block; blocking isolates the variability due to differences between the blocks, so that we can see the differences caused by the treatments more clearly

Randomized block design

an experiment design in which participants are randomly assigned to treatments within each block

Matching

in a retrospective or prospective study, participants who are similar in ways not under study may be matched and then compared with each other on the variables of interest; this, like blocking, reduces unwanted variation

Confounding

when the levels of one factor are associated with the levels of another factor in such a way that their effects cannot be separated

Random phenomenon

a phenomenon in which we know what outcomes could happen, but not which particular values will happen

Outcome

the value measured, observed, or reported for an individual instance of a trial

Event

a collection of outcomes; usually identified so that we can attach probabilities to them; we denote these with bold capital letters such as A, B, or C

Sample space

the collection of all possible outcome values; the collection of values in this has a probability of 1; we denote this with a boldface capital S.

Law of Large Numbers

This law states that the long-run relative frequency of an event's occurrence gets closer and closer to the event's true probability as the number of independent trials increases.

Independence (informally)

Two events are this if learning that one event occurs does not change the probability that the other event occurs

Probability

of an event, a number between 0 and 1 that reports the likelihood of that event's occurrence; we write P(A) for this of event A

Empirical probability

when the probability comes from the long-run relative frequency of the event's occurrence

Theoretical probability

when the probability comes from a model (such as equally likely outcomes)

Personal (or subjective) probability

when the probability is subjective and represents your personal degree of belief

Probability Assignment Rule

The probability of an entire sample space must be 1. P(S)=1

Complement Rule

The probability of an event not occurring is 1 minus the probability that it does occur: P(A^c)=1-P(A)

Addition Rule

If A and B are disjoint events, then the probability of A or B is P(A or B)=P(A) + P(B)

Disjoint (mutually exclusive)

Two events are disjoint if they share no outcomes in common; If A and B are this then knowing that A occurs tells us that B cannot occur

Legitimate assignment of probabilities

An assignment of probabilities to outcomes is legitimate if: 1) each probability is between 0 and 1 (inclusive) 2) the sum of the probabilities is 1.

Multiplication Rule

If A and B are independent events, then the probability of A and B is P(A and B) = P(A) × P(B)

General Addition Rule

For any two events, A and B, the probability of A or B is P(A or B)=P(A) + P(B)-P(A and B)

Conditional probability

P(B|A) = P(A and B)/P(A)
P(B|A) is read "the probability of B given A."

General Multiplication Rule

For any two events, A and B, the probability of A and B is P(A and B) = P(A) × P(B|A)
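
A worked example of the rule in Python, using exact fractions: the classic probability of drawing two aces from a shuffled deck without replacement.

```python
from fractions import Fraction

# P(first card is an ace) = 4/52
p_first_ace = Fraction(4, 52)

# P(second is an ace | first was an ace) = 3/51: one ace and one card are gone
p_second_given_first = Fraction(3, 51)

# General Multiplication Rule: P(A and B) = P(A) * P(B|A)
p_both = p_first_ace * p_second_given_first
print(p_both)  # 1/221
```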

Independence (used formally)

Events A and B are independent when P(B|A)=P(B)

Tree diagram

A display of conditional events or probabilities that is helpful in thinking through conditioning

Random variable

assumes any of several different values as a result of some random event; these are denoted by a capital letter, such as X

Discrete random variable

A random variable that can take one of a finite number of distinct outcomes

Continuous random variable

a random variable that can take on any of an (uncountably) infinite number of outcomes

Probability model

a function that associates a probability P with each value of a discrete random variable X, denoted P(X = x) or P(x), or with any interval of values of a continuous random variable

Expected value

theoretical long-run average value, the center of its model; denoted μ or E(X), it is found (if the random variable is discrete) by summing the products of variable values and probabilities:
μ = E(X) = Σ x·P(x)
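
A small Python sketch of the formula, using the probability model for one fair die roll (exact fractions keep the arithmetic clean):

```python
from fractions import Fraction

# Probability model for one fair die: each face has probability 1/6
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# E(X) = sum of x * P(x)
ev = sum(x * p for x, p in pmf.items())
print(ev)  # 7/2, i.e. the familiar expected value of 3.5
```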

Standard deviation of a random variable

Describes the spread in the model and is the square root of the variance, denoted SD(X) or σ

Changing a random variable by a constant

E(X ± c) = E(X) ± c, Var(X ± c) = Var(X), SD(X ± c) = SD(X), E(aX) = aE(X), Var(aX) = a²Var(X), SD(aX) = |a|SD(X)

Addition Rule for Expected Values of Random Variables

E(X ± Y) = E(X) ± E(Y)

Addition Rule for Variances of Random Variables

(Pythagorean Theorem of Statistics) If X and Y are independent: Var(X ± Y) = Var(X) + Var(Y), and SD(X ± Y) = √[Var(X) + Var(Y)]
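
The rule can be checked exactly for two independent fair dice; a Python sketch with exact fractions:

```python
from fractions import Fraction
from itertools import product

# Probability model for one fair die
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def var(dist):
    mu = sum(x * p for x, p in dist.items())                # E(X)
    return sum((x - mu) ** 2 * p for x, p in dist.items())  # Var(X)

# Exact probability model for the sum of two independent dice
sum_pmf = {}
for (x, px), (y, py) in product(pmf.items(), pmf.items()):
    sum_pmf[x + y] = sum_pmf.get(x + y, Fraction(0)) + px * py

# Variances add for independent random variables
assert var(sum_pmf) == var(pmf) + var(pmf)
print(var(pmf), var(sum_pmf))  # 35/12 35/6
```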

Bernoulli trials

A sequence of trials in which: 1) There are exactly two possible outcomes (usually denoted success and failure) 2) The probability of success is constant 3) The trials are independent

Binomial probability distribution

appropriate for a random variable that counts the number of successes in n Bernoulli trials

Binomial Model

P(X = x) = nCx p^x q^(n-x), where nCx = n!/[x!(n - x)!]
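
A Python sketch of the formula (hypothetical example: the number of heads in 5 fair coin flips):

```python
from math import comb

def binom_pmf(x, n, p):
    # P(X = x) = nCx * p**x * (1 - p)**(n - x)
    return comb(n, x) * p**x * (1 - p)**(n - x)

print(binom_pmf(2, 5, 0.5))   # P(exactly 2 heads in 5 flips) = 0.3125

# The probabilities over all possible counts must sum to 1
assert abs(sum(binom_pmf(x, 5, 0.5) for x in range(6)) - 1) < 1e-12
```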

Poisson model

A discrete model often used to model the number of arrivals of events such as customers arriving in a queue or calls arriving into a call center