Raw data
numbers and category labels that have been collected but have not yet been processed in any way.
Example: What is your sex? (male = m, female = f); raw data = m
observation
an individual entity in a study
variable
a characteristic that may differ among individuals
Sample data
collected from a subset of a larger population
Population data
collected when all individuals in a population are measured
statistic
summary measure of sample data
parameter
a summary measure of population data
categorical variables
group or category names that don't necessarily have a logical ordering.
Examples: eye color, country of residence
ordinal variables
Categorical variables for which the categories have a logical ordering
Examples: highest educational degree earned, tee shirt size (S, M, L, XL)
quantitative variables
numerical values taken on each individual.
Examples: height, number of siblings
One Categorical Variable
Example: What percentage of college students favor the legalization of marijuana, and what percentage oppose it?
Ask: How many and what percentage of individuals fall into each category?
Two Categorical Variables
Example: In Case Study 1.6, we asked if the risk of having a heart attack was different for the physicians who took aspirin than for those who took a placebo.
Ask: Is there a relationship between the two variables? Does the chance of falling into a particular category for one variable depend on the category of the other variable?
One Quantitative Variable
Example: What is the average body temperature for adults, and how much variability is there in body temperature measurements?
Ask: What are the interesting summary measures, like the average or the range of values?
One Categorical and One Quantitative Variable
Example: Do men and women drive at the same "fastest speeds" on average?
Ask: Are the measurements similar across categories or do they differ? Could be asked regarding the averages or the ranges.
Two Quantitative Variables
Example: Does average body temperature change as people age?
Ask: Are these variables related so that when measurements are high (or low) on one variable the measurements for the other variable also tend to be high (or low)?
relationship between two variables
the value of the explanatory variable for an individual is thought to partially explain the value of the response variable for that individual
Numerical Summaries
Count how many fall into each category.
Calculate the percent in each category.
If there are two variables, have the categories of the explanatory variable define the rows and compute row percentages.
frequency
distribution for a categorical variable is a listing of all categories along with their frequencies (counts)
relative frequency
a listing of all categories along with their relative frequencies (given as proportions or percentages, for example)
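The counts and percentages described above can be sketched in a few lines of Python; the eye-color data below are made up for illustration.

```python
from collections import Counter

# Hypothetical sample of eye colors for 10 individuals (made-up data)
data = ["brown", "blue", "brown", "green", "blue",
        "brown", "brown", "blue", "green", "brown"]

counts = Counter(data)                                    # frequency distribution
n = len(data)
rel_freq = {cat: cnt / n for cat, cnt in counts.items()}  # relative frequency distribution

for cat in counts:
    print(f"{cat}: {counts[cat]} ({rel_freq[cat]:.0%})")
```

The relative frequencies always sum to 1 (100%), which is a quick sanity check on the computation.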
Pie Charts
useful for summarizing a single categorical variable if not too many categories
Bar Graphs
useful for summarizing one or two categorical variables and particularly useful for making comparisons when there are two categorical variables
extremes
the highest and lowest data values
quartiles
medians of lower and upper halves of the values
Location
center or average. e.g. median
Spread
variability e.g. difference between two extremes or two quartiles
Shape
clumped in middle or on one end (more later)
Outliers
a data point that is not consistent with the bulk of the data
Histograms
similar to bar graphs, used for any number of data values
Stem-and-leaf plots and dotplots
present all individual values, useful for small to moderate sized data sets
Boxplot or box-and-whisker plot
useful summary for comparing two or more groups
Creating a Histogram
Step 1: Decide how many equally spaced intervals to use for the horizontal axis; between 6 and 15 usually works well.
Step 2: Decide whether to use frequencies (counts) or relative frequencies (proportions) on the vertical axis.
Step 3: Draw the equally spaced intervals on the horizontal axis.
Step 4: Over each interval, draw a bar whose height equals the frequency (or relative frequency) of values in that interval.
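The bar heights in these steps can be computed without a plotting library. The sketch below counts how many values fall into each equally spaced interval; the data and interval count are made up.

```python
def histogram_counts(values, num_intervals):
    """Count how many values fall into each of num_intervals equally spaced bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_intervals
    counts = [0] * num_intervals
    for v in values:
        # The maximum value lands in the last interval rather than a new one
        i = min(int((v - lo) / width), num_intervals - 1)
        counts[i] += 1
    return counts

print(histogram_counts([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 2))  # [5, 5]
```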
Creating a Dotplot
Draw a number line (horizontal axis) to cover range from smallest to largest data value.
For each observation, place a dot above the number line located at the observation's data value.
When there are multiple observations with the same value, the dots are stacked vertically.
Creating a Stem-and-Leaf Plot
Step 1: Determine stem values. The "stem" contains all but the last of the displayed digits of a number. Stems should define equally spaced intervals.
Step 2: For each individual, attach a "leaf" to the appropriate stem. A "leaf" is the last of the displayed digits of the number.
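For two-digit data, the two steps above amount to grouping each value's last digit under its tens digit. A minimal sketch, with made-up values:

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group each two-digit value's last digit (leaf) under its tens digit (stem)."""
    stems = defaultdict(list)
    for v in sorted(values):        # sorting keeps each stem's leaves in order
        stems[v // 10].append(v % 10)
    return dict(stems)

print(stem_and_leaf([23, 25, 31, 31, 47]))  # {2: [3, 5], 3: [1, 1], 4: [7]}
```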
Describing Shape
Symmetric, bell-shaped
Symmetric, not bell-shaped
Skewed Right: values trail off to right
Skewed Left: values trail off to left
Boxplots
Box covers the middle 50% of the data
Line within box marks the median value
Possible outliers are marked with an asterisk
Apart from outliers, lines extending from box reach to min and max values
To illustrate location and spread
any of the pictures work well
To illustrate shape
histograms and stem-and-leaf plots are best
To see individual values
use stem-and-leaf plots and dotplots
To sort values
use stem-and-leaf plots
To compare groups
use side-by-side boxplots
To identify outliers
using the standard definition, use a boxplot
Mean
the numerical average
Median
the middle value (if n odd) or the average of the middle two values (n even)
Shapes
Symmetric: mean = median
Skewed Left: mean < median
Skewed Right: mean > median
The Median
If n is odd: M = middle of the ordered values. Count (n + 1)/2 down from the top of the ordered list.
If n is even: M = average of the middle two ordered values. Average the values that are (n/2) and (n/2) + 1 down from the top of the ordered list.
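The odd/even cases above translate directly to code; the example values are from the ages-at-death illustration later in this set.

```python
def median(values):
    """Middle value (n odd) or average of the two middle values (n even)."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                    # the (n + 1)/2-th ordered value
    return (s[mid - 1] + s[mid]) / 2     # average of the (n/2)-th and (n/2 + 1)-th values

print(median([76, 78, 80, 82, 84]))  # 80
print(median([3, 1, 4, 2]))          # 2.5
```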
Influence of Outliers on the Mean and Median
Larger influence on mean than median.
High outliers will increase the mean.
Low outliers will decrease the mean.
If ages at death are: 76, 78, 80, 82, and 84
then mean = median = 80 years.
If ages at death are: 46, 78, 80, 82, and 84
then median = 80 but mean = 74 years.
Range
highest value - lowest value
Interquartile Range (IQR)
upper quartile - lower quartile
lower quartile
median of data values that are below the median
upper quartile
median of data values that are above the median
How to Draw a Boxplot and Identify Outliers
Step 1: Label either a vertical axis or a horizontal axis with numbers from min to max of the data.
Step 2: Draw box with lower end at Q1 and upper end at Q3.
Step 3: Draw a line through the box at the median M.
Step 4: Calculate IQR = Q3 - Q1.
Step 5: Draw lines (whiskers) from each end of the box out to the most extreme data values within 1.5 x IQR of the box. Mark any data value more than 1.5 x IQR beyond the box as a possible outlier, often with an asterisk.
Percentiles
The kth percentile is a number that has k% of the data values at or below it and (100 - k)% of the data values at or above it
Lower quartile = 25th percentile
Median = 50th percentile
Upper quartile = 75th percentile
Outlier
a data point that is not consistent with the bulk of the data
Look for them via graphs.
Can have big influence on conclusions.
Can cause complications in some statistical analyses.
Cannot discard without justification.
Possible Reasons for Outliers and Reasonable Actions
Outlier is a legitimate data value and represents natural variability for the group and variable(s) measured. Values may not be discarded; they provide important information about location and spread.
Mistake made while taking the measurement or entering it into the computer. If the mistake can be verified, the value should be corrected or discarded.
Bell-Shaped Distributions of Numbers
Many measurements follow a predictable pattern:
Most individuals are clumped around the center
The greater the distance a value is from the center, the fewer individuals have that value.
Variables that follow such a pattern are said to be "bell-shaped".
Standard deviation
measures variability by summarizing how far individual data values are from the mean
Think of the standard deviation as roughly the average distance values fall from the mean
Calculating the Standard Deviation
Step 1: Calculate x-bar, the sample mean.
Step 2: For each observation, calculate the difference between the data value and the mean.
Step 3: Square each difference in step 2.
Step 4: Sum the squared differences in step 3, and then divide this sum by n - 1.
Step 5: Take the square root of the value found in step 4. The result is the sample standard deviation, s.
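The five steps above can be sketched directly; the example values are made up so that the arithmetic comes out evenly.

```python
import math

def sample_std(values):
    n = len(values)
    mean = sum(values) / n                        # Step 1: the sample mean
    sq_diffs = [(x - mean) ** 2 for x in values]  # Steps 2-3: squared differences from the mean
    variance = sum(sq_diffs) / (n - 1)            # Step 4: divide the sum by n - 1
    return math.sqrt(variance)                    # Step 5: square root gives s

print(sample_std([90, 90, 100, 110, 110]))  # 10.0
```

Dividing by n - 1 rather than n is what makes this the sample (not population) standard deviation, matching the note below about the population formula being slightly different.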
population mean
represented by the symbol μ ("mu")
If the data set includes measurements for an entire population, the notations for the mean and standard deviation are different, and the formula for the standard deviation is also slightly different
The Empirical Rule
For any bell-shaped curve, approximately
68% of the values fall within 1 standard deviation of the mean in either direction
95% of the values fall within 2 standard deviations of the mean in either direction
99.7% of the values fall within 3 standard deviations of the mean in either direction
Empirical Rule 2
the range from the minimum to the maximum data values equals about 4 to 6 standard deviations for data with an approximate bell shape.
You can get a rough idea of the value of the standard deviation by dividing the range by 6.
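As a quick sketch of this shortcut, with made-up, roughly bell-shaped values:

```python
# Made-up, roughly bell-shaped sample
values = [54, 60, 64, 68, 70, 70, 72, 76, 80, 86]

# Empirical Rule 2 shortcut: range divided by 6 gives a rough standard deviation
rough_sd = (max(values) - min(values)) / 6
print(round(rough_sd, 2))  # 5.33
```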
Standardized score or z-score
z = (observed value - mean) / standard deviation
Example: Mean resting pulse rate for adult men is 70 beats per minute (bpm), standard deviation is 8 bpm. The standardized score for a resting pulse rate of 80
z = (80 - 70)/8 = 1.25
A pulse rate of 80 is 1.25 standard deviations above the mean.
For bell-shaped data
About 68% of values have z-scores between -1 and +1.
About 95% of values have z-scores between -2 and +2.
About 99.7% of values have z-scores between -3 and +3.
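The z-score formula is a one-liner; the example below reuses the pulse-rate numbers from this set (mean 70 bpm, standard deviation 8 bpm).

```python
def z_score(observed, mean, std_dev):
    """Standardized score: how many standard deviations the value is from the mean."""
    return (observed - mean) / std_dev

print(z_score(80, 70, 8))  # 1.25 (above the mean)
print(z_score(62, 70, 8))  # -1.0 (below the mean)
```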
Scatterplot
a two-dimensional graph of data values
Correlation
a statistic that measures the strength and direction of a linear relationship between two quantitative variables
Regression equation
an equation that describes the average relationship between a quantitative response variable and one or more explanatory variables.
Questions to Ask about a Scatterplot
What is the average pattern? Does it look like a straight line, or is it curved?
What is the direction of the pattern?
How much do individual points vary from the average pattern?
Are there any unusual data points?
positive association
values of one variable tend to increase as the values of the other variable increase
negative association
values of one variable tend to decrease as the values of the other variable increase
linear relationship
the pattern of their relationship resembles a straight line
outliers
points that have an unusual combination of data values.
regression line
When the best equation for describing the relationship between x and y is a straight line
Two purposes of the regression line
to estimate the average value of y at any specified value of x
to predict the value of y for an individual, given that individual's x value
Equation for the Regression Line
y-hat = b0 + b1x. The predicted value is spoken as "y-hat," and it is also referred to either as predicted y or estimated y.
b0 is the intercept of the straight line.
The intercept is the value of y when x = 0.
b1 is the slope of the straight line.
The slope tells us how much of an increase (or decrease) there is in the average value of y for each one-unit increase in x.
Prediction Errors and Residuals
Prediction Error = difference between the observed value of y and the predicted value y-hat.
Residual = (y-y hat)
Positive residual-observed value higher than predicted.
Negative residual-observed value lower than predicted.
Least Squares Regression Line
minimizes the sum of squared prediction errors
SSE
Sum of squared prediction errors
Measuring Strength and Direction with Correlation
Correlation r indicates the strength and the direction of a straight-line relationship
The strength of the relationship is determined by the closeness of the points to a straight line.
The direction is determined by whether one variable generally increases or decreases as the other variable increases.
Interpretation of the Correlation Coefficient
r is always between -1 and +1
magnitude indicates the strength
r = -1 or +1 indicates a perfect linear relationship
sign indicates the direction
r = 0 indicates a slope of 0 so knowing x does not change the predicted value of y
Correlation relationship
Correlation r = +0.74: a somewhat strong positive linear relationship
Correlation r = -0.8: a somewhat strong negative linear association
Correlation r = +0.95: a very strong positive linear relationship
Correlation r = 0.485: a moderately strong positive linear relationship
Squared correlation r2
between 0 and 1 and indicates the proportion of variation in the response explained by x.
SSTO
sum of squares total = sum of squared differences between observed y values and the sample mean of y (y-bar)
SSE
sum of squared errors (residuals) = sum of squared differences between observed y values and predicted values based on least squares line
Interpretation of r2
r2 = 0.55: height explains 55% of the variation among observed right handspans
r2 = 0.0185: only about 1.85%; knowing a person's age doesn't help much in predicting amount of daily TV viewing
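The least squares line and r2 = 1 - SSE/SSTO can be sketched from the definitions above. The slope formula used here (deviation cross-products over squared x-deviations) is the standard least squares solution; the data are made up, using a perfect line so the expected output is exact.

```python
def least_squares(xs, ys):
    """Fit y-hat = b0 + b1*x by minimizing the sum of squared prediction errors."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))   # slope
    b0 = y_bar - b1 * x_bar                      # intercept
    return b0, b1

def r_squared(xs, ys):
    """r2 = 1 - SSE/SSTO: proportion of variation in y explained by x."""
    b0, b1 = least_squares(xs, ys)
    y_bar = sum(ys) / len(ys)
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))   # errors around the line
    ssto = sum((y - y_bar) ** 2 for y in ys)                      # variation around y-bar
    return 1 - sse / ssto

xs, ys = [1, 2, 3, 4, 5], [2, 4, 6, 8, 10]   # made-up perfect line: y = 2x
print(least_squares(xs, ys))  # (0.0, 2.0)
print(r_squared(xs, ys))      # 1.0
```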
Regression and Correlation Difficulties and Disasters
Extrapolating too far beyond the observed range of x values
Allowing outliers to overly influence results
Combining groups inappropriately
Using correlation and a straight-line equation to describe curvilinear data
Extrapolation
Risky to use a regression equation to predict values far outside the range where the original data fell (called extrapolation).
No guarantee that the relationship will continue beyond the range for which we have observed data.
Correlation Does Not Prove Causation
The explanatory variable really does cause a change in the response variable (causation)
Confounding factors are present: the explanatory and response variables are both affected by other variables
The response variable is causing a change in the explanatory variable
Observational Study
Researchers observe or question participants about opinions, behaviors, or outcomes.
Researchers do not assign any treatments or conditions.
Participants not asked to do anything differently
Experiment
Researchers manipulate something and measure the effect of the manipulation on some outcome of interest
Sometimes cannot conduct experiment due to practical/ethical issues
Randomized experiments
participants are randomly assigned to participate in one condition (called treatment) or another
Unit
a single individual or object being measured.
If an experiment, then called an experimental unit.
When units are people, often called subjects or participants.
Explanatory variable
(or independent variable) is one that may explain or may cause differences in a response variable (or outcome or dependent variable)
confounding variable
a variable that both affects the response variable and also is related to the explanatory variable.
A potential confounding variable not measured in the study is called a lurking variable
Randomized experiments
often allow us to determine cause-and-effect
Random assignment
to make the groups approximately equal in all respects except for the explanatory variable.
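One common way to carry out random assignment is to shuffle the participant list and deal it into equal groups. This is a sketch, not the only valid scheme; the aspirin/placebo labels echo the heart-attack example earlier in this set, and the participant IDs are made up.

```python
import random

def randomly_assign(participants, treatments, seed=None):
    """Shuffle participants, then deal them round-robin into treatment groups."""
    rng = random.Random(seed)      # fixed seed only for reproducibility of the demo
    pool = list(participants)
    rng.shuffle(pool)
    groups = {t: [] for t in treatments}
    for i, p in enumerate(pool):
        groups[treatments[i % len(treatments)]].append(p)
    return groups

# 20 hypothetical participants split into two chance-determined groups of 10
groups = randomly_assign(range(1, 21), ["aspirin", "placebo"], seed=1)
print({t: len(g) for t, g in groups.items()})  # {'aspirin': 10, 'placebo': 10}
```

Because the shuffle is random, any systematic difference between the groups (other than the treatment) is due to chance, which is what makes the groups comparable.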
Fundamental Rule
Available data can be used to make inferences about a much larger group if the data can be considered to be representative with regard to the question(s) of interest.
Participants in randomized experiments are often volunteers
Randomizing the Type of Treatment
Randomly assigning the treatments to the experimental units keeps the researchers from making assignments favorable to their hypotheses and also helps protect against hidden or unknown biases
Randomizing the Order of Treatments
If all treatments are applied to each unit, randomization should be used to determine the order in which they are applied.
Replication in an experiment
More than one experimental unit is assigned to each treatment condition.
Large enough to provide suitably accurate estimates. If the number of units is too small, it is difficult to rule out natural chance variation as the reason for any observed differences.
Replication in science
A single experiment rarely provides sufficient evidence for anything, so it is important to have independent researchers try to reproduce findings
Control Groups
Treated identically in all respects except they don't receive the active treatment.
Sometimes they receive a dummy treatment or a standard/existing treatment
Placebo
Looks like real drug but has no active ingredient. Placebo effect = people respond to placebos.
Blinding
Single-blind = participants do not know which treatment they have received.
Double-blind = neither participant nor researcher making measurements knows who had which treatment.
Double Dummy
Each group given two "treatments"...
Group 1 = real treatment 1 and placebo treatment 2
Group 2 = placebo treatment 1 and real treatment 2
Matched-Pair Designs
Use either two matched individuals or same individual receives each of two treatments.
Special case of a block design.
Important to randomize order of two treatments and use blinding if possible
Block Designs
Experimental units divided into homogeneous groups called blocks, each treatment randomly assigned to one or more units in each block.
If blocks = individuals and units = repeated time periods in which they receive varying treatments, it is called a repeated-measures design.
Designing a Good Observational Study
Disadvantage: more difficult to try to establish cause-and-effect links.
Advantage: more likely to measure participants in their natural setting.
Retrospective
Data are from the past.Participants are asked to recall past events.
Prospective
Participants are followed into the future and events are recorded.
Case-Control Studies
"Cases" who have a particular attribute or condition are compared to "controls" who do not, to see how they differ on an explanatory variable of interest.
Advantages: efficiency, and reduction of potential confounding variables through careful choice of controls.
Confounding Variables and the Implication of Causation in Observational Studies
Common media mistake = reporting cause-and-effect relationship based on an observational study. Difficult to separate role of confounding variables from role of explanatory variables in producing the outcome variable if randomization is not used.
Extending Results Inappropriately
Many studies use convenience samples or volunteers. Need to assess if the results can be extended to any larger group for the question(s) of interest
Interacting Variables
A second explanatory variable can interact with the principal explanatory variable in its relationship with the response variable.
Results should be reported taking the interaction into account
Hawthorne effect
participants in an experiment respond differently than they otherwise would, just because they are in the experiment.
Many treatments have higher success rate in clinical trials than in actual practice.
Experimenter effects
recording data to match desired outcome, treating subjects differently, etc.
Most overcome by blinding and control groups
Ecological Validity and Generalizability
When variables have been removed from their natural setting and are measured in the laboratory or in some other artificial setting, the results may not reflect the impact of the variable in the real world.
Using the Past as a Source of Data
Can be a problem in retrospective observational studies.
Try to use authoritative sources such as medical records rather than rely on memory.
If possible, use prospective observational studies.
observational unit
a single individual entity, a person for instance, in a study
sample size
total number of observational units
dataset
complete set of raw data, for all observational units and variables, in a survey or experiment
descriptive statistics
summary numbers for either a sample or a population
measurement variable and numerical variable
synonyms for a quantitative variable
continuous variable
can be used for quantitative data when every value within some interval is a possible response
ex., height is a continuous quantitative variable because any height within a particular range is possible
distribution
describes how often the possible responses occur
frequency distribution
for a categorical variable is a listing of all categories along with their frequencies (counts)
relative frequency distribution
listing of all categories along with their relative frequencies (given as proportions or percentages, for example).
outcome variable
another name for response variable
five-number summary
consists of the median, the quartiles, and the extremes
distribution
of quantitative variable is overall pattern of how often the possible values occur
median
middle value in the data, one estimate of location
mean
usual arithmetic average
variability
among the individual measurements is an important feature of any dataset
bell-shaped
another symmetric shape
mode
the most frequent value
unimodal
shape if there is a single prominent peak in a histogram, stemplot, or dotplot
bimodal
shape if there are two prominent peaks in the distribution
symmetric
similar on both sides of the center
skewed
values are more spread out on one side of the center than the other
skewed right
higher values (toward right on a number line) are more spread out than the lower values
skewed to the left
lower values (toward the left on a number line) are more spread out than the higher values
standardized score or z-score
measures how far a value is from the mean in terms of standard deviations
standard deviation
measures the variability among data values
explanatory variable
may explain or cause differences in the response variable
dependent variable
used as a synonym for the response variable because the value for the response variable depends on the value of the explanatory variable
y variable
in a scatterplot, the response variable is plotted on the vertical axis (the y axis), so it is called the y variable
x variable
explanatory variable is plotted along the horizontal axis (x axis) and is called the x variable
nonlinear or curvilinear
curve describes the pattern of a scatterplot better than a line
regression analysis
area of statistics used to examine relationship between a quantitative response variable and one or more explanatory variables
regression equation
describes how, on average, the response variable is related to the explanatory variables
linear relationships
straight line relationships
y-intercept
b with a subscript of 0
letter y represents vertical direction
slope
how much the y variable changes for each increase of one unit in the x variable
x represents the horizontal direction
predict
regression equation predicts values of a response variable when we only know the values for the explanatory variable
deterministic relationship
if we know the value of one variable, we can exactly determine the value of the other variable
statistical relationship
there is variation from the average pattern
proportion of variation explained by x
sometimes used in conjunction with the squared correlation, r2.
if correlation has value r=0.5, squared correlation is r2=(0.5)squared = .25, or 25%
researcher may write that the explanatory variable explains 25% of the variation among observed values of the response variable
interpolation
y values are estimated or predicted for new values of x that were not in the original dataset, but are in the range of values covered by the x's in the dataset
influential observations
outliers with extreme x values have the most influence on correlation and regression and are called influential observations
experimental unit
in experiments, most basic entity (person, plant, and so on) to which different treatments can be assigned
subjects
when experimental units are people
participants
in both experiments and observational studies, subjects may be called participants
lurking variable
used to describe potential confounding variable that is not measured and is not considered in the interpretation of a study
randomization
random assignment to treatments or conditions, is the key to reducing the chance of confounding variable
repeated measures design
each experimental unit receives all treatments, ideally in a random order
easy and efficient way to control for variation among individuals
completely randomized design
when treatments are randomly assigned to experimental units without using matched pairs or blocks
matched-pair design
when matched pairs are used
randomized block design
when blocks are used