Individuals
objects described by a set of data. may be people, but they may also be animals, plants, or things
variable
a characteristic of an individual. can take different values for different individuals
categorical variable
places an individual into one of several groups or categories
quantitative variable
takes numerical values for which arithmetic operations such as adding and averaging make sense. values of a quantitative variable are usually recorded in a unit of measurement such as seconds or kilograms
distribution
tells us what values a variable takes and how often it takes these values
distribution of a categorical variable
lists the categories and gives either the count or the percent of individuals that fall in each category
describe pattern of histogram
shape, center, spread
outlier
important kind of deviation, an individual value that falls outside the overall pattern
symmetrical distribution
if right and left sides of histogram are approximately mirror images of each other
skewed to the right
positively skewed, if the right side of the histogram (containing the half of the observations with larger values) extends much farther out than the left side
skewed to the left
negatively skewed, if left side extends much farther out than right side
make a stem plot
1. separate each observation into a stem, consisting of all but the final rightmost digit, and a leaf, the final digit. stems may have as many digits as needed, but each leaf shows only a single digit
2. write the stems in a vertical column with the small
dotplot
1. sort the data set and plot each observation according to its numerical value along a labeled scaled axis
2. identical observations are typically stacked
time plot
plots each observation against the time at which it was measured. time on horizontal scale and plot variable on vertical scale
mean
average
median
midpoint
mean and median: symmetrical distribution
if the distribution is exactly symmetric, mean = median
if skewed, the mean is farther out in the long tail than the median, and the mode is around the apex (peak)
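example: a minimal Python sketch of how a long right tail pulls the mean above the median. The data values are made up for illustration.

```python
from statistics import mean, median

# Hypothetical right-skewed data: most values are small, one is much larger.
values = [1, 2, 2, 3, 4, 5, 25]

skewed_mean = mean(values)      # 42 / 7 = 6, pulled toward the long right tail
skewed_median = median(values)  # middle of the 7 sorted values = 3
```

Removing the value 25 leaves the median nearly unchanged but lowers the mean noticeably, which is why the median is called resistant.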
quartile 1
median of the observations whose position in the ordered list is to the left of the overall median
quartile 3
median of the observations whose position in the ordered list is to the right of the overall median
five number summary
minimum, Q1, Median, Q3, Maximum
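example: a minimal Python sketch of the five number summary, assuming the quartiles are taken as the medians of the observations to the left and right of the overall median. The function name and data are hypothetical, and statistical software may use slightly different quartile rules.

```python
from statistics import median

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for at least two observations."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]          # observations to the left of the overall median
    upper = xs[half + n % 2:]  # observations to the right (skips the middle value when n is odd)
    return (xs[0], median(lower), median(xs), median(upper), xs[-1])

summary = five_number_summary([1, 3, 4, 6, 8, 9, 12])  # (1, 3, 6, 9, 12)
```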
boxplot
graph of five number summary
central box vertically spans Q1 and Q3
horizontal line is median
lines extended from box are min/max
interquartile range IQR
distance between first and third quartiles
IQR=Q3-Q1
The 1.5*IQR Rule for Outliers
call an observation a suspected outlier if it falls more than 1.5*IQR above the third quartile or below the first quartile
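example: a minimal Python sketch of the 1.5*IQR rule. The function name and data are hypothetical, and the quartiles are passed in rather than computed.

```python
def suspected_outliers(data, q1, q3):
    """Return the observations more than 1.5*IQR below Q1 or above Q3."""
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]

# Hypothetical data with Q1 = 3 and Q3 = 9, so IQR = 6 and the fences are -6 and 18.
flagged = suspected_outliers([1, 3, 4, 6, 8, 9, 30], q1=3, q3=9)  # [30]
```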
use mean and standard deviation
for symmetric distributions that are free of outliers
use five number summary
for skewed data
four step process
State: what is the practical question, in the context of the real-world setting?
Plan: what specific statistical operations does this problem call for?
Solve: make the graphs and carry out the calculations needed for this problem
Conclude: give your practi
response variable
measures an outcome of a study
explanatory variable
may explain or influence changes in a response variable
scatterplot
shows relationship between two quantitative variables measured on the same individuals. explanatory horizontal, response vertical
Examining a scatterplot
direction, form, strength
Positively associated (direction)
above average values of one tend to accompany above average values of the other, below average values also occur together
negatively associated (direction)
above average values of one tend to accompany below average values of other, vice versa
categorical variables in scatterplots
to add a categorical variable to a scatterplot, use a different plot color or symbol for each category
correlation
measures the direction and strength of the linear relationship between two quantitative variables
least squares regression line
line that makes the sum of all squares of the vertical distance of the data points from the line as small as possible; always passes through point (xmean,ymean)
slope: b=r(sy/sx)
intercept: a=ymean-b*xmean
predicted y=a+bx
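example: a minimal Python sketch that computes the correlation r and then the least squares line from the slope b = r * sy / sx and the intercept a = ymean - b * xmean. The function names and data are hypothetical.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r: average product of standardized x and y values (divisor n - 1)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

def least_squares_line(xs, ys):
    """Return (a, b) for the line y = a + b*x."""
    b = correlation(xs, ys) * stdev(ys) / stdev(xs)  # slope
    a = mean(ys) - b * mean(xs)                      # intercept
    return a, b

# Hypothetical data with xmean = 3 and ymean = 4.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares_line(xs, ys)  # a is about 2.2, b is about 0.6
```

Substituting x = xmean gives a + b * 3 = 4 = ymean, so the line passes through (xmean, ymean).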
extrapolation
use of regression line for prediction well outside the range of values of the explanatory variable x that you used to obtain the line. usually not accurate
lurking variable
variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among those variables
association does not always imply causation
that is all
marginal distributions
look at the distribution of each variable separately; tells us nothing about the relationship between two variables
conditional distributions
look at the distribution of one variable among only those individuals who have a given value of the other variable
simpson's paradox
an association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group
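example: a minimal Python sketch of the reversal, using made-up counts for two hypothetical treatments X and Y. X has the higher success rate within each group, yet Y has the higher rate once the groups are combined, because X was used mostly on severe cases.

```python
# Hypothetical counts, invented for illustration: (successes, trials) per group.
counts = {
    "X": {"mild": (9, 10), "severe": (30, 100)},
    "Y": {"mild": (80, 100), "severe": (2, 10)},
}

def rate(successes, trials):
    """Success rate within one group."""
    return successes / trials

def combined_rate(groups):
    """Success rate after pooling all groups into a single group."""
    total_successes = sum(s for s, _ in groups.values())
    total_trials = sum(n for _, n in groups.values())
    return total_successes / total_trials

# Within groups: X beats Y (0.90 vs 0.80 for mild, 0.30 vs 0.20 for severe).
# Combined: X is 39/110 (about 0.35) and Y is 82/110 (about 0.75).
x_overall = combined_rate(counts["X"])
y_overall = combined_rate(counts["Y"])
```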
observational study
observes individuals but does not attempt to influence the responses; purpose to describe and compare existing groups or situations
experiment
deliberately imposes some treatment on individuals in order to observe their responses
confounding
two variables (explanatory variables or lurking variables) are confounded when their effects on a response variable cannot be distinguished from each other
observational studies of the effect of one variable on another often fail to demonstrate causality because the explan
population
entire group of individuals we want info about
sample
part of the population from which we collect info
sampling design
describes exactly how to choose a sample from a population
bias
systematically favors certain outcomes
simple random sample
SRS of size n consists of n individuals from the population chosen in such a way that every set of n individuals has an equal chance to be the sample actually selected
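example: a minimal Python sketch of drawing an SRS with the standard library, where random.sample chooses n distinct items without replacement. The function name, the seed parameter, and the numbered population are assumptions for illustration.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Choose n distinct individuals so that every set of n is equally likely."""
    rng = random.Random(seed)  # a seed is only used to make the draw repeatable
    return rng.sample(population, n)

# Hypothetical population of 100 individuals labeled 0 to 99.
population = list(range(100))
sample = simple_random_sample(population, 10, seed=1)
```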
probability sampling
sample chosen by chance
ex: simple random sampling, stratified random sampling, multistage random sampling
cohort study
subjects sharing a common demographic characteristic are enrolled and observed at regular intervals over an extended period of time
factors
explanatory variables in an experiment
experimental group
group of individuals receiving a treatment whose effect we seek to understand
control
group that serves as a baseline with which the experimental group is compared
placebo
control treatment that is fake but otherwise indistinguishable from the treatment in the experimental group
principles of experimental design
control: the effects of lurking variables on the response, most simply by comparing two or more treatments
randomize: use impersonal chance to assign subjects to treatments
use enough subjects: in each group to reduce chance variation in the results
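example: a minimal Python sketch of the randomize principle, assuming two treatments and a shuffled list of subjects split in half. The function name and subject labels are hypothetical.

```python
import random

def randomize(subjects, seed=None):
    """Assign subjects to two groups by chance and return (group_1, group_2)."""
    rng = random.Random(seed)  # a seed is only used to make the assignment repeatable
    shuffled = list(subjects)  # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical subject labels.
subjects = ["s1", "s2", "s3", "s4", "s5", "s6"]
treatment_group, control_group = randomize(subjects, seed=42)
```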
matched pairs design
compares exactly two treatments, either by using a series of individuals that are closely matched two by two or by using each individual twice
treatments within each pair should be randomized
double-blind experiment
neither the subjects nor the people who interact with them know which treatment each subject is receiving
random
individual outcomes are uncertain but there is nonetheless a regular distribution of outcomes in a large number of repetitions
probability
the probability of any outcome of a random phenomenon is the proportion of times the outcome would occur in a very large series of repetitions
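example: a minimal Python sketch that simulates tosses of a fair coin to show the proportion of heads settling near 0.5 as the number of repetitions grows. The function name and seed are assumptions, and the exact proportions depend on the random number generator.

```python
import random

def proportion_of_heads(n_tosses, seed=0):
    """Simulate n_tosses of a fair coin and return the proportion of heads."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

few = proportion_of_heads(10)        # can be far from 0.5
many = proportion_of_heads(100_000)  # typically very close to 0.5
```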
disjoint probability
events A and B have no outcomes in common and cannot occur together
P(A or B)=P(A)+P(B), since P(A and B)=0
P(A does not occur)=1-P(A)
density curve
describes the overall pattern of a distribution
independent events
if one occurs it does not change the probability that the other occurs
P(A and B)=P(A)P(B)
conditional probability
probability of B, given A
P(B | A)=P(A and B)/P(A)
any two events
P(A or B)=P(A)+P(B)-P(A and B)
A and B happen together
P(A and B)=P(A)P(B | A)
independent probability
two events A and B that both have positive probability are independent if
P(B | A)=P(B)
or, equivalently,
P(A and B)=P(A)P(B)
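example: a minimal Python sketch that checks the addition rule, conditional probability, the multiplication rule, and independence on one roll of a fair die. The events A (even) and B (greater than 3) are assumptions chosen for illustration.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}  # equally likely outcomes of one fair die roll
A = {2, 4, 6}                  # hypothetical event: the roll is even
B = {4, 5, 6}                  # hypothetical event: the roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))

p_a_or_b = prob(A) + prob(B) - prob(A & B)  # 1/2 + 1/2 - 1/3 = 2/3
p_b_given_a = prob(A & B) / prob(A)         # (1/3) / (1/2) = 2/3
p_a_and_b = prob(A) * p_b_given_a           # 1/2 * 2/3 = 1/3
independent = p_b_given_a == prob(B)        # False, since 2/3 is not 1/2
```

Here knowing that the roll is even raises the probability that it is greater than 3, so A and B are not independent.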
Bayes' theorem
P(Ai | B)=P(B | Ai)P(Ai) / [P(B | A1)P(A1)+...+P(B | An)P(An)]
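example: a minimal Python sketch of Bayes' theorem for disjoint events A1, ..., An whose probabilities add to 1. The function name and the numbers are hypothetical.

```python
from fractions import Fraction

def bayes_posterior(priors, likelihoods, i):
    """Return P(A_i | B) from the P(A_k) values and the P(B | A_k) values."""
    # Denominator: P(B) = sum over k of P(B | A_k) * P(A_k)
    p_b = sum(prior * like for prior, like in zip(priors, likelihoods))
    return priors[i] * likelihoods[i] / p_b

# Made-up numbers: P(A1) = 0.01, P(A2) = 0.99, P(B | A1) = 0.95, P(B | A2) = 0.05.
priors = [Fraction(1, 100), Fraction(99, 100)]
likelihoods = [Fraction(95, 100), Fraction(5, 100)]
posterior = bayes_posterior(priors, likelihoods, 0)  # 19/118, about 0.161
```

Even though B is much more likely under A1 than under A2, the small prior P(A1) keeps P(A1 | B) well below one half.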