STATS

Raw data

numbers and category labels that have been collected but have not yet been processed in any way.
ex. What is your sex? (male = m, female = f); raw data = m

observation

an individual entity in a study

variable

a characteristic that may differ among individuals

Sample data

collected from a subset of a larger population

Population data

collected when all individuals in a population are measured

statistic

summary measure of sample data

parameter

a summary measure of population data

categorical variables

group or category names that don't necessarily have a logical ordering.
Examples: eye color, country of residence

ordinal variables

Categorical variables for which the categories have a logical ordering
Examples: highest educational degree earned, tee shirt size (S, M, L, XL)

quantitative variables

numerical values taken on each individual.
Examples: height, number of siblings

One Categorical Variable

Example: What percentage of college students favor the legalization of marijuana, and what percentage oppose it?
Ask: How many and what percentage of individuals fall into each category?

Two Categorical Variables

Example: In Case Study 1.6, we asked if the risk of having a heart attack was different for the physicians who took aspirin than for those who took a placebo.
Ask: Is there a relationship between the two variables? Does the chance of falling into a particular category for one variable depend on the category of the other variable?

One Quantitative Variable

Example: What is the average body temperature for adults, and how much variability is there in body temperature measurements?
Ask: What are the interesting summary measures, like the average or the range of values?

One Categorical and One Quantitative Variable

Example: Do men and women drive at the same "fastest speeds" on average?
Ask: Are the measurements similar across categories or do they differ? Could be asked regarding the averages or the ranges.

Two Quantitative Variables

Example: Does average body temperature change as people age?
Ask: Are these variables related so that when measurements are high (or low) on one variable the measurements for the other variable also tend to be high (or low)?

relationship between two variables

the value of the explanatory variable for an individual is thought to partially explain the value of the response variable for that individual

Numerical Summaries

Count how many fall into each category.
Calculate the percent in each category.
If two variables, have the categories of the explanatory variable define the rows and compute row percentages

frequency

distribution for a categorical variable is a listing of all categories along with their frequencies (counts)

relative frequency

a listing of all categories along with their relative frequencies (given as proportions or percentages, for example)
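The counts and relative frequencies above can be tabulated with a few lines of Python; the eye-color values here are made-up illustrative data, not from the notes:

```python
from collections import Counter

# Hypothetical sample of one categorical variable (eye color)
data = ["brown", "blue", "brown", "green", "brown", "blue"]

counts = Counter(data)                                 # frequency distribution
n = len(data)
rel_freq = {cat: c / n for cat, c in counts.items()}   # relative frequency distribution

print(counts["brown"])    # 3
print(rel_freq["brown"])  # 0.5
```

Relative frequencies always sum to 1, which is a quick sanity check on the table.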

Pie Charts

useful for summarizing a single categorical variable if not too many categories

Bar Graphs

useful for summarizing one or two categorical variables and particularly useful for making comparisons when there are two categorical variables

extremes

the highest and lowest data values

quartiles

medians of lower and upper halves of the values

Location

center or average. e.g. median

Spread

variability e.g. difference between two extremes or two quartiles

Shape

clumped in middle or on one end (more later)

Outliers

a data point that is not consistent with the bulk of the data

Histograms

similar to bar graphs, used for any number of data values

Stem-and-leaf plots and dotplots

present all individual values, useful for small to moderate sized data sets

Boxplot or box-and-whisker plot

useful summary for comparing two or more groups

Creating a Histogram

Step 1: Decide how many equally spaced intervals to use for the horizontal axis (usually between 6 and 15).
Step 2: Decide whether to use frequencies (counts) or relative frequencies (proportions) on the vertical axis.
Step 3: Draw the equally spaced intervals on the horizontal axis and, over each interval, draw a bar whose height equals the frequency (or relative frequency) for that interval.

Creating a Dotplot

Draw a number line (horizontal axis) to cover range from smallest to largest data value.
For each observation, place a dot above the number line located at the observation's data value.
When multiple observations have the same value, the dots are stacked vertically.

Creating a Stem-and-Leaf Plot

Step 1: Determine stem values. The "stem" contains all but the last of the displayed digits of a number. Stems should define equally spaced intervals.
Step 2: For each individual, attach a "leaf" to the appropriate stem. A "leaf" is the last of the displayed digits of the value.

Describing Shape

Symmetric, bell-shaped
Symmetric, not bell-shaped
Skewed Right: values trail off to right
Skewed Left: values trail off to left

Boxplots

Box covers the middle 50% of the data
Line within box marks the median value
Possible outliers are marked with an asterisk
Apart from outliers, lines extending from box reach to min and max values

To illustrate location and spread

any of the pictures work well

To illustrate shape

histograms and stem-and-leaf plots are best

To see individual values

use stem-and-leaf plots and dotplots

To sort values

use stem-and-leaf plots

To compare groups

use side-by-side boxplots

To identify outliers

using the standard definition, use a boxplot

Mean

the numerical average

Median

the middle value (if n odd) or the average of the middle two values (n even)

Shapes

Symmetric: mean = median
Skewed Left: mean < median
Skewed Right: mean > median

The Median

If n is odd: M = middle of ordered values. Count (n + 1)/2 down from top of ordered list.
If n is even: M = average of middle two ordered values. Average values that are (n/2) and (n/2) + 1 down from top of ordered list.
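The two rules above translate directly into code; a minimal sketch in Python:

```python
def median(values):
    """Middle value (n odd) or average of the two middle values (n even)."""
    s = sorted(values)
    n = len(s)
    if n % 2 == 1:
        return s[(n + 1) // 2 - 1]            # the (n+1)/2-th ordered value
    return (s[n // 2 - 1] + s[n // 2]) / 2    # average of values n/2 and n/2 + 1

print(median([3, 1, 2]))     # 2
print(median([1, 2, 3, 4]))  # 2.5
```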

Influence of Outliers on the Mean and Median

Larger influence on mean than median.
High outliers will increase the mean.
Low outliers will decrease the mean.
If ages at death are: 76, 78, 80, 82, and 84
then mean = median = 80 years.
If ages at death are: 46, 78, 80, 82, and 84
then median = 80 but mean = 74 years.
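The ages-at-death example can be checked with Python's standard library: the low outlier pulls the mean down while the median is unchanged.

```python
from statistics import mean, median

ages = [76, 78, 80, 82, 84]
print(mean(ages), median(ages))                   # both 80

ages_outlier = [46, 78, 80, 82, 84]               # one low outlier
print(mean(ages_outlier), median(ages_outlier))   # mean 74, median still 80
```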

Range

highest value - lowest value

Interquartile Range (IQR)

upper quartile - lower quartile

lower quartile

median of data values that are below the median

upper quartile

median of data values that are above the median

How to Draw a Boxplot and Identify Outliers

Step 1: Label either a vertical axis or a horizontal axis with numbers from min to max of the data.
Step 2: Draw box with lower end at Q1 and upper end at Q3.
Step 3: Draw a line through the box at the median M.
Step 4: Calculate IQR = Q3 - Q1.
Step 5: Draw whiskers from each end of the box out to the most extreme data values that are within 1.5 x IQR of the quartiles; mark any values beyond that range as outliers (with an asterisk).
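Steps 4 and 5 rest on the 1.5 x IQR rule; a minimal sketch (the quartile values here are made up for illustration):

```python
def outlier_fences(q1, q3):
    """Standard boxplot rule: values beyond 1.5 * IQR from the
    quartiles are flagged as possible outliers."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

low, high = outlier_fences(q1=70, q3=80)  # hypothetical quartiles
print(low, high)   # 55.0 95.0
```

Any data value below `low` or above `high` would be marked with an asterisk on the boxplot.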

Percentiles

The kth percentile is a number that has k% of the data values at or below it and (100 - k)% of the data values at or above it
Lower quartile = 25th percentile
Median = 50th percentile
Upper quartile = 75th percentile

Outlier

a data point that is not consistent with the bulk of the data
Look for them via graphs.
Can have big influence on conclusions.
Can cause complications in some statistical analyses.
Cannot discard without justification.

Possible Reasons for Outliers and Reasonable Actions

Outlier is a legitimate data value and represents natural variability for the group and variable(s) measured. Values may not be discarded; they provide important information about location and spread.
Mistake made while taking the measurement or entering it into the dataset. If the mistake can be verified, the value should be corrected or discarded.

Bell-Shaped Distributions of Numbers

Many measurements follow a predictable pattern:
Most individuals are clumped around the center
The greater the distance a value is from the center, the fewer individuals have that value.
Variables that follow such a pattern are said to be "bell-shaped".

Standard deviation

measures variability by summarizing how far individual data values are from the mean
Think of the standard deviation as roughly the average distance values fall from the mean

Calculating the Standard Deviation

Step 1: Calculate x-bar, the sample mean.
Step 2: For each observation, calculate the difference between the data value and the mean.
Step 3: Square each difference in step 2.
Step 4: Sum the squared differences in step 3, and then divide this sum by n - 1.
Step 5: Take the square root of the value computed in Step 4.
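The five steps can be sketched directly in Python; the data values reuse the ages-at-death example:

```python
from math import sqrt

def sample_sd(values):
    """Sample standard deviation, following Steps 1-5 above."""
    n = len(values)
    xbar = sum(values) / n                            # Step 1: sample mean
    squared_devs = [(x - xbar) ** 2 for x in values]  # Steps 2-3: squared differences
    variance = sum(squared_devs) / (n - 1)            # Step 4: divide by n - 1
    return sqrt(variance)                             # Step 5: square root

print(sample_sd([76, 78, 80, 82, 84]))  # ~3.1623 (square root of 10)
```

The result matches `statistics.stdev` from the standard library.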

population mean

represented by the symbol μ ("mu")
If the data set includes measurements for an entire population, the notations for the mean and standard deviation are different, and the formula for the standard deviation is also slightly different

The Empirical Rule

For any bell-shaped curve, approximately
68% of the values fall within 1 standard deviation of the mean in either direction
95% of the values fall within 2 standard deviations of the mean in either direction
99.7% of the values fall within 3 standard deviations of the mean in either direction

Empirical Rule 2

the range from the minimum to the maximum data values equals about 4 to 6 standard deviations for data with an approximate bell shape.
You can get a rough idea of the value of the standard deviation by dividing the range by 6.
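The range-over-6 shortcut is simple arithmetic; the min and max below are assumed values for a bell-shaped sample:

```python
# Hypothetical bell-shaped sample: rough sd estimate = range / 6
data_min, data_max = 58, 94        # assumed minimum and maximum
rough_sd = (data_max - data_min) / 6
print(rough_sd)   # 6.0
```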

Standardized score or z-score

z = (observed value - mean) / standard deviation
Example: Mean resting pulse rate for adult men is 70 beats per minute (bpm); the standard deviation is 8 bpm. The standardized score for a resting pulse rate of 80 is
(80 - 70) / 8 = 1.25
A pulse rate of 80 is 1.25 standard deviations above the mean.
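The pulse-rate example can be verified with a one-line function:

```python
def z_score(value, mean, sd):
    """Standardized score: how many standard deviations a value lies from the mean."""
    return (value - mean) / sd

# Pulse-rate example from the notes: mean 70 bpm, sd 8 bpm
print(z_score(80, 70, 8))   # 1.25
```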

For bell-shaped data

About 68% of values have z-scores between -1 and +1.
About 95% of values have z-scores between -2 and +2.
About 99.7% of values have z-scores between -3 and +3.

Scatterplot

a two-dimensional graph of data values

Correlation

a statistic that measures the strength and direction of a linear relationship between two quantitative variables

Regression equation

an equation that describes the average relationship between a quantitative response variable and an explanatory variable.

Questions to Ask about a Scatterplot

What is the average pattern? Does it look like a straight line, or is it curved?
What is the direction of the pattern?
How much do individual points vary from the average pattern?
Are there any unusual data points?

positive association

values of one variable tend to increase as the values of the other variable increase

negative association

values of one variable tend to decrease as the values of the other variable increase

linear relationship

the pattern of their relationship resembles a straight line

outliers

points that have an unusual combination of data values.

regression line

When the best equation for describing the relationship between x and y is a straight line

Two purposes of the regression line

to estimate the average value of y at any specified value of x
to predict the value of y for an individual, given that individual's x value

Equation for the Regression Line

The equation is y-hat = b0 + b1x. The symbol is read "y-hat," and it is also referred to either as predicted y or estimated y.
b0 is the intercept of the straight line.
The intercept is the value of y when x = 0.
b1 is the slope of the straight line.
The slope tells us how much of an increase (or decrease) there is in y for each one-unit increase in x.

Prediction Errors and Residuals

Prediction Error = difference between the observed value of y and the predicted value of y hat.
Residual = (y-y hat)
Positive residual-observed value higher than predicted.
Negative residual-observed value lower than predicted.

Least Squares Regression Line

minimizes the sum of squared prediction errors
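The least-squares slope and intercept have closed-form formulas (b1 = sum of (x - x-bar)(y - y-bar) over sum of (x - x-bar)^2, and b0 = y-bar - b1 * x-bar); a minimal sketch with a tiny made-up dataset:

```python
def least_squares(xs, ys):
    """Least-squares regression coefficients b0 (intercept) and b1 (slope)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
          / sum((x - xbar) ** 2 for x in xs))
    b0 = ybar - b1 * xbar
    return b0, b1

# Made-up data lying exactly on the line y = 2x + 1
b0, b1 = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)   # 1.0 2.0
```

Because these points fall exactly on a line, every residual (y - y-hat) is zero and SSE = 0; with real data the fitted line merely minimizes SSE.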

SSE

Sum of squared prediction errors

Measuring Strength and Direction with Correlation

Correlation r indicates the strength and the direction of a straight-line relationship
The strength of the relationship is determined by the closeness of the points to a straight line.
The direction is determined by whether one variable generally increases or decreases as the other variable increases.

Interpretation of the Correlation Coefficient

r is always between -1 and +1
magnitude indicates the strength
r = -1 or +1 indicates a perfect linear relationship
sign indicates the direction
r = 0 indicates a slope of 0 so knowing x does not change the predicted value of y

Correlation relationship

Correlation r = +0.74: a somewhat strong positive linear relationship
Correlation r = -0.8: a somewhat strong negative linear association
Correlation r = +0.95: a very strong positive linear relationship
Correlation r = 0.485: a moderately strong positive linear relationship

Squared correlation r2

between 0 and 1 and indicates the proportion of variation in the response explained by x.

SSTO

sum of squares total = sum of squared differences between observed y values and the sample mean of the y values (y-bar)

SSE

sum of squared errors (residuals) = sum of squared differences between observed y values and predicted values based on least squares line

Interpretation of r2

r2 = 0.55: height explains 55% of the variation among observed right handspans
r2 = 0.0185: only about 1.85%; knowing a person's age doesn't help much in predicting amount of daily TV viewing
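Given the two sums of squares defined above, r2 can be computed as 1 - SSE/SSTO; the values below are hypothetical:

```python
def r_squared(sse, ssto):
    """Proportion of variation in y explained by x: r^2 = 1 - SSE / SSTO."""
    return 1 - sse / ssto

# Hypothetical sums of squares
print(r_squared(sse=25.0, ssto=100.0))   # 0.75
```

When the regression line fits perfectly, SSE = 0 and r2 = 1; when the line is no better than y-bar, SSE = SSTO and r2 = 0.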

Regression and Correlation Difficulties and Disasters

Extrapolating too far beyond the observed range of x values
Allowing outliers to overly influence results
Combining groups inappropriately
Using correlation and a straight-line equation to describe curvilinear data

Extrapolation

Risky to use a regression equation to predict values far outside the range where the original data fell (called extrapolation).
No guarantee that the relationship will continue beyond the range for which we have observed data.

Correlation Does Not Prove Causation

Causation
Confounding Factors Present
Explanatory and Response are both affected by other variables
Response variable is causing a change in the explanatory variable

Observational Study

Researchers observe or question participants about opinions, behaviors, or outcomes.
Researchers do not assign any treatments or conditions.
Participants not asked to do anything differently

Experiment

Researchers manipulate something and measure the effect of the manipulation on some outcome of interest
Sometimes cannot conduct experiment due to practical/ethical issues

Randomized experiments

participants are randomly assigned to participate in one condition (called a treatment) or another

Unit

a single individual or object being measured.
If an experiment, then called an experimental unit.
When units are people, often called subjects or participants.

Explanatory variable

(or independent variable) is one that may explain or may cause differences in a response variable (or outcome or dependent variable)

confounding variable

a variable that both affects the response variable and also is related to the explanatory variable.
A potential confounding variable not measured in the study is called a lurking variable

Randomized experiments

often allow us to determine cause-and-effect

Random assignment

used to make the groups approximately equal in all respects except for the explanatory variable.
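A simple way to randomize is to shuffle the units and deal them into groups; a sketch (the treatment names echo the aspirin case study, the unit IDs are made up):

```python
import random

def random_assignment(units, treatments):
    """Shuffle the units, then deal them round-robin into treatment
    groups so the groups differ only by chance."""
    shuffled = units[:]
    random.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(unit)
    return groups

random.seed(0)  # fixed seed for a reproducible illustration
groups = random_assignment(list(range(12)), ["aspirin", "placebo"])
print(len(groups["aspirin"]), len(groups["placebo"]))   # 6 6
```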

Fundamental Rule

Available data can be used to make inferences about a much larger group if the data can be considered to be representative with regard to the question(s) of interest.
Participants in randomized experiments are often volunteers

Randomizing the Type of Treatment

Randomly assigning the treatments to the experimental units keeps the researchers from making assignments favorable to their hypotheses and also helps protect against hidden or unknown biases

Randomizing the Order of Treatments

If all treatments are applied to each unit, randomization should be used to determine the order in which they are applied.

Replication in an experiment

More than one experimental unit is assigned to each treatment condition.
Sample sizes should be large enough to provide suitably accurate estimates. If too small, it is difficult to rule out natural chance variation as the reason for any observed differences.

Replication in science

A single experiment rarely provides sufficient evidence for anything, so it is important to have independent researchers try to reproduce findings

Control Groups

Treated identically in all respects except they don't receive the active treatment.
Sometimes they receive a dummy treatment or a standard/existing treatment

Placebo

Looks like real drug but has no active ingredient. Placebo effect = people respond to placebos.

Blinding

Single-blind = participants do not know which treatment they have received.
Double-blind = neither participant nor researcher making measurements knows who had which treatment.

Double Dummy

Each group given two "treatments"...
Group 1 = real treatment 1 and placebo treatment 2
Group 2 = placebo treatment 1 and real treatment 2

Matched-Pair Designs

Use either two matched individuals or same individual receives each of two treatments.
Special case of a block design.
Important to randomize order of two treatments and use blinding if possible

Block Designs

Experimental units divided into homogeneous groups called blocks, each treatment randomly assigned to one or more units in each block.
If blocks = individuals and units = repeated time periods in which they receive varying treatments, it is called a repeated-measures design.

Designing a Good Observational Study

Disadvantage: more difficult to try to establish cause-and-effect links.
Advantage: more likely to measure participants in their natural setting.

Retrospective

Data are from the past. Participants are asked to recall past events.

Prospective

Participants are followed into the future and events are recorded.

Case-Control Studies

"Cases" who have a particular attribute or condition are compared to "controls" who do not, to see how they differ on an explanatory variable of interest.
Advantages: efficiency and reduction of potential confounding variables through careful choice of the controls.

Confounding Variables and the Implication of Causation in Observational Studies

Common media mistake = reporting cause-and-effect relationship based on an observational study. Difficult to separate role of confounding variables from role of explanatory variables in producing the outcome variable if randomization is not used.

Extending Results Inappropriately

Many studies use convenience samples or volunteers. Need to assess if the results can be extended to any larger group for the question(s) of interest

Interacting Variables

A second explanatory variable can interact with the principal explanatory variable in its relationship with the response variable.
Results should be reported taking the interaction into account

Hawthorne effect

participants in an experiment respond differently than they otherwise would, just because they are in the experiment.
Many treatments have higher success rate in clinical trials than in actual practice.

Experimenter effects

recording data to match desired outcome, treating subjects differently, etc.
Most overcome by blinding and control groups

Ecological Validity and Generalizability

When variables have been removed from their natural setting and are measured in the laboratory or in some other artificial setting, the results may not reflect the impact of the variable in the real world.

Using the Past as a Source of Data

Can be a problem in retrospective observational studies.
Try to use authoritative sources such as medical records rather than rely on memory.
If possible, use prospective observational studies.

observational unit

a single individual entity, a person for instance, in a study

sample size

total number of observational units

dataset

complete set of raw data, for all observational units and variables, in a survey or experiment

descriptive statistics

summary numbers for either a sample or a population

measurement variable and numerical variable

synonyms for a quantitative variable

continuous variable

can be used for quantitative data when every value within some interval is a possible response
ex., height is a continuous quantitative variable because any height within a particular range is possible

distribution

describes how often the possible responses occur

frequency distribution

for a categorical variable is a listing of all categories along with their frequencies (counts)

relative frequency distribution

listing of all categories along with their relative frequencies (given as proportions or percentages, for example).

outcome variable

another name for response variable

five-number summary

consists of the median, the quartiles, and the extremes

distribution

of quantitative variable is overall pattern of how often the possible values occur

median

middle value in the data, one estimate of location

mean

usual arithmetic average

variability

among the individual measurements is an important feature of any dataset

bell-shaped

another symmetric shape

mode

the most frequent value

unimodal

shape if there is a single prominent peak in a histogram, stemplot, or dotplot

bimodal

shape if there are two prominent peaks in the distribution

symmetric

similar on both sides of the center

skewed

values are more spread out on one side of the center than the other

skewed right

higher values (toward right on a number line) are more spread out than the lower values

skewed to the left

lower values (toward the left on a number line) are more spread out than the higher values

standardized score or z-score

measures how far a value is from the mean in terms of standard deviations

standard deviation

measures the variability among data values

explanatory variable

may explain or cause differences in the response variable

dependent variable

used as a synonym for the response variable because the value for the response variable depends on the value of the explanatory variable

y variable

in a scatterplot, the response variable is plotted on the vertical axis (the y axis), so it is called the y variable

x variable

explanatory variable is plotted along the horizontal axis (x axis) and is called the x variable

nonlinear or curvilinear

curve describes the pattern of a scatterplot better than a line

regression analysis

area of statistics used to examine relationship between a quantitative response variable and one or more explanatory variables

regression equation

describes how, on average, the response variable is related to the explanatory variables

linear relationships

straight line relationships

y-intercept

b with a subscript of 0
letter y represents vertical direction

slope

how much the y variable changes for each increase of one unit in the x variable
x represents the horizontal direction

predict

regression equation predicts values of a response variable when we only know the values for the explanatory variable

deterministic relationship

if we know the value of one variable, we can exactly determine the value of the other variable

statistical relationship

there is variation from the average pattern

proportion of variation explained by x

sometimes used in conjunction with the squared correlation, r2.
if the correlation has value r = 0.5, the squared correlation is r2 = (0.5)^2 = 0.25, or 25%
a researcher may write that the explanatory variable explains 25% of the variation among observed values of the response variable

interpolation

y values are estimated or predicted for new values of x that were not in the original dataset, but are in the range of values covered by the x's in the dataset

influential observations

outliers with extreme x values have the most influence on correlation and regression and are called influential observations

experimental unit

in experiments, most basic entity (person, plant, and so on) to which different treatments can be assigned

subjects

when experimental units are people

participants

in both experiments and observational studies, subjects may be called participants

lurking variable

used to describe potential confounding variable that is not measured and is not considered in the interpretation of a study

randomization

random assignment to treatments or conditions is the key to reducing the chance of confounding variables

repeated measures design

each experimental unit receives all treatments, ideally in a random order
easy and efficient way to control for variation among individuals

completely randomized design

when treatments are randomly assigned to experimental units without using matched pairs or blocks

matched-pair design

when matched pairs are used

randomized block design

when blocks are used