Statistics Ch 1-4

Individuals

objects described by a set of data. may be people, but they may also be animals, plants, or things

variable

a characteristic of an individual. can take different values for different individuals

categorical variable

places an individual into one of several groups or categories

quantitative variable

takes numerical values for which arithmetic operations such as adding and averaging make sense. values of a quantitative variable are usually recorded in a unit of measurement such as seconds or kilograms

distribution

tells us what values a variable takes and how often it takes these values

distribution of a categorical variable

lists the categories and gives either the count or the percent of individuals that fall in each category
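As a small illustration, the counts and percents for a categorical variable can be tabulated as below. This is only a sketch: the list `majors` and its values are made-up data, not taken from these notes.

```python
from collections import Counter

# Hypothetical data: one categorical value recorded per individual.
majors = ["biology", "math", "biology", "art", "math", "biology", "art", "biology"]

counts = Counter(majors)   # count of individuals in each category
total = len(majors)
percents = {category: 100 * count / total for category, count in counts.items()}

# Print the distribution: each category with its count and percent.
for category in sorted(counts):
    print(f"{category}: count={counts[category]}, percent={percents[category]:.1f}%")
```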

describe pattern of histogram

shape, center, spread

outlier

important kind of deviation, an individual value that falls outside the overall pattern

symmetrical distribution

if right and left sides of histogram are approximately mirror images of each other

skewed to the right

positively skewed, if the right side of the histogram (containing the half of the observations with larger values) extends much farther out than the left side

skewed to the left

negatively skewed, if left side extends much farther out than right side

make a stem plot

1. separate each observation into a stem, consisting of all but the final rightmost digit, and a leaf, the final digit. stems may have as many digits as needed, but each leaf shows only a single digit
2. write the stems in a vertical column with the small
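A rough sketch of the stem and leaf split described in step 1, assuming non-negative whole-number observations. The function name `stem_plot` is hypothetical. Listing the stems from smallest at the top and showing empty stems are assumptions, since the second step is cut off in these notes.

```python
from collections import defaultdict

def stem_plot(values):
    """Return the lines of a stem plot for non-negative integers."""
    leaves = defaultdict(list)
    for v in sorted(values):
        # stem: all but the final digit; leaf: the final digit
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    lines = []
    # Assumption: stems run from smallest to largest, including empty stems.
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem:>3} | {row}")
    return lines

for line in stem_plot([12, 15, 21, 30, 38, 38]):
    print(line)
```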

dotplot

1. sort the data set and plot each observation according to its numerical value along a labeled scaled axis
2. identical observations are typically stacked

time plot

plots each observation against the time at which it was measured. time on horizontal scale and plot variable on vertical scale

mean

average

median

midpoint

mean and median: symmetrical distribution

symmetric: mean = median (when the distribution is exactly symmetric)
skewed: mean is farther out in the long tail than the median; mode is near the peak
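A quick check of how the mean and median behave, using two made-up data sets (both lists are hypothetical). One large value pulls the mean toward the long tail while the median stays put.

```python
from statistics import mean, median

# Hypothetical data sets.
symmetric = [2, 4, 6, 8, 10]      # mirror image around 6
right_skewed = [2, 4, 6, 8, 50]   # one large value creates a long right tail

print(mean(symmetric), median(symmetric))        # mean and median agree
print(mean(right_skewed), median(right_skewed))  # mean is pulled toward the tail
```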

quartile 1

median of the observations whose position in the ordered list is to the left of the overall median

quartile 3

median of the observations whose position in the ordered list is to the right of the overall median

five number summary

minimum, Q1, Median, Q3, Maximum

boxplot

graph of five number summary
central box vertically spans Q1 and Q3
horizontal line inside the box marks the median
lines extend from the box out to the minimum and maximum

interquartile range IQR

distance between first and third quartiles
IQR=Q3-Q1

The 1.5*IQR Rule for Outliers

call an observation a suspected outlier if it falls more than 1.5*IQR above the third quartile or below the first quartile
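The five number summary, the IQR, and the 1.5*IQR rule can be sketched together as below. The quartiles are taken as medians of the observations to the left and right of the overall median, which leaves the overall median out of both halves when the count is odd. Software packages may use slightly different quartile rules. The function names and the data are hypothetical.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]          # observations left of the overall median
    upper = data[half + n % 2:]  # observations right of the overall median
    return data[0], median(lower), median(data), median(upper), data[-1]

def suspected_outliers(values):
    """Values more than 1.5*IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]

data = [5, 7, 8, 9, 10, 11, 12, 13, 40]  # hypothetical observations
print(five_number_summary(data))
print(suspected_outliers(data))
```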

use mean and standard deviation

for symmetric distributions that are free of outliers

use five number summary

for skewed data

four step process

State: what is the practical question, in the context of the real-world setting?
Plan: what specific statistical operations does this problem call for?
Solve: make the graphs and carry out the calculations needed for this problem
Conclude: give your practi

response variable

measures an outcome of a study

explanatory variable

may explain or influence changes in a response variable

scatterplot

shows relationship between two quantitative variables measured on the same individuals. explanatory horizontal, response vertical

Examining a scatterplot

direction, form, strength

Positively associated (direction)

above average values of one tend to accompany above average values of the other, below average values also occur together

negatively associated (direction)

above average values of one tend to accompany below average values of other, vice versa

categorical variables in scatterplots

to add a categorical variable to a scatterplot, use a different plot color or symbol for each category

correlation

measures the direction and strength of the linear relationship between two quantitative variables

least squares regression line

line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible; always passes through the point (xmean, ymean)
b=r(sy/sx)
a=ymean-b(xmean)
y=a+bx
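A minimal sketch of the slope and intercept calculation, using slope b = r(sy/sx) and intercept a = ymean - b(xmean). The correlation r is computed with the usual standardized-products formula, which these notes do not spell out. The function names and the data are hypothetical.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r: average product of standardized x and y values (n - 1 divisor)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

def least_squares_line(xs, ys):
    """Return (a, b) for the line y = a + b*x."""
    b = correlation(xs, ys) * stdev(ys) / stdev(xs)  # slope: b = r * (sy / sx)
    a = mean(ys) - b * mean(xs)                      # intercept: a = ymean - b * xmean
    return a, b

xs = [1, 2, 3, 4, 5]  # hypothetical explanatory values
ys = [2, 4, 5, 4, 5]  # hypothetical response values
a, b = least_squares_line(xs, ys)
print(a, b)
```

Plugging xmean into the fitted line gives ymean back, which is the "passes through (xmean, ymean)" property.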

extrapolation

use of regression line for prediction well outside the range of values of the explanatory variable x that you used to obtain the line. usually not accurate

lurking variable

variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among those variables

association does not always imply causation

that is all

marginal distributions

look at the distribution of each variable separately; tells us nothing about the relationship between two variables

conditional distributions

look at only individuals who have a given value of the variable
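The difference between marginal and conditional distributions can be seen in a small two-way table. The table below is made-up data (the row and column labels are hypothetical), and this is only a sketch of the arithmetic.

```python
# Hypothetical two-way table of counts: rows are one categorical variable,
# columns are another.
table = {
    "freshman": {"yes": 30, "no": 20},
    "senior": {"yes": 10, "no": 40},
}
grand_total = sum(sum(cols.values()) for cols in table.values())

# Marginal distribution of the row variable: row totals over the grand total.
row_marginal = {r: sum(cols.values()) / grand_total for r, cols in table.items()}

# Marginal distribution of the column variable: column totals over the grand total.
col_totals = {}
for cols in table.values():
    for c, count in cols.items():
        col_totals[c] = col_totals.get(c, 0) + count
col_marginal = {c: t / grand_total for c, t in col_totals.items()}

# Conditional distributions of the column variable, one for each row value:
# only individuals in that row are considered.
conditional = {
    r: {c: count / sum(cols.values()) for c, count in cols.items()}
    for r, cols in table.items()
}

print(row_marginal, col_marginal)
print(conditional)
```

The two conditional distributions differ from each other, which is what reveals a relationship; the marginal distributions alone cannot show it.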

simpson's paradox

an association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group
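A numerical sketch of the reversal, using made-up success counts for two treatments in two groups. All numbers are hypothetical and chosen only to show the effect.

```python
from fractions import Fraction

# Hypothetical (successes, trials) for treatments A and B within two groups.
results = {
    "group 1": {"A": (9, 10), "B": (80, 100)},
    "group 2": {"A": (30, 100), "B": (2, 10)},
}

# Success rate of each treatment within each group.
within = {
    g: {t: Fraction(s, n) for t, (s, n) in by_treatment.items()}
    for g, by_treatment in results.items()
}

# Success rate of each treatment after combining the groups.
combined = {}
for t in ("A", "B"):
    successes = sum(results[g][t][0] for g in results)
    trials = sum(results[g][t][1] for g in results)
    combined[t] = Fraction(successes, trials)

for g in within:
    print(g, float(within[g]["A"]), float(within[g]["B"]))  # A is higher in each group
print("combined", float(combined["A"]), float(combined["B"]))  # B is higher overall
```

The reversal happens here because most of A's trials fall in the group where both treatments do poorly.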

observational study

observes individuals but does not attempt to influence the responses; the purpose is to describe and compare existing groups or situations

experiment

deliberately imposes some treatment on individuals in order to observe their responses

confounding

two variables (explanatory variables or lurking variables) are confounded when their effects on a response variable cannot be distinguished from each other
observational studies of the effect of one variable on another often fail to demonstrate causality because the explan

population

entire group of individuals we want info about

sample

part of the population from which we collect info

sampling design

describes exactly how to choose a sample from a population

bias

systematically favors certain outcomes

simple random sample

SRS of size n consists of n individuals from the population chosen in such a way that every set of n individuals has an equal chance to be the sample actually selected
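One way to draw an SRS in software is sketched below, assuming the population can be listed. The helper name `simple_random_sample`, the population labels, and the seed are all hypothetical. `random.sample` draws without replacement, so every set of n individuals is equally likely.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct individuals so that every set of n is equally likely."""
    rng = random.Random(seed)  # a seed only makes the draw repeatable
    return rng.sample(population, n)

population = list(range(1, 101))  # hypothetical labels 1..100
sample = simple_random_sample(population, 10, seed=1)
print(sample)
```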

probability sampling

sample chosen by chance
ex: simple random sampling, stratified random sampling, multistage random sampling

cohort study

subjects sharing a common demographic characteristic are enrolled and observed at regular intervals over an extended period of time

factors

explanatory variable

experimental group

group of individuals receiving a treatment whose effect we seek to understand

control

group that serves as a baseline with which the experimental group is compared

placebo

control treatment that is fake but otherwise indistinguishable from the treatment in the experimental group

principles of experimental design

control: control the effects of lurking variables on the response, most simply by comparing two or more treatments
randomize: use impersonal chance to assign subjects to treatments
use enough subjects: in each group to reduce chance variation in the results
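The randomize principle can be sketched as a shuffle followed by dealing subjects into groups. The function name `randomize`, the subject labels, the treatment names, and the seed are hypothetical.

```python
import random

def randomize(subjects, treatments, seed=None):
    """Assign subjects to treatments by chance, in groups as equal as possible."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)  # impersonal chance decides the order
    groups = {t: [] for t in treatments}
    for i, subject in enumerate(shuffled):
        # Deal the shuffled subjects to the treatments in turn.
        groups[treatments[i % len(treatments)]].append(subject)
    return groups

subjects = [f"subject_{i}" for i in range(1, 21)]  # hypothetical labels
groups = randomize(subjects, ["treatment", "placebo"], seed=42)
for name, members in groups.items():
    print(name, len(members))
```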

matched pairs design

compares exactly two treatments, either by using a series of individuals that are closely matched two by two or by using each individual twice
treatments within each pair should be randomized

double-blind experiment

neither the subjects nor the people who interact with them know which treatment each subject is receiving

random

individual outcomes are uncertain but there is nonetheless a regular distribution of outcomes in a large number of repetitions

probability

the probability of any outcome of a random phenomenon is the proportion of times the outcome would occur in a very large series of repetitions
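The long-run proportion idea can be explored with a simulation. This sketch assumes a fair coin, so the probability of heads is 0.5. The function name and the seed are hypothetical.

```python
import random

def proportion_of_heads(n_tosses, seed=0):
    """Proportion of heads in n_tosses simulated tosses of a fair coin."""
    rng = random.Random(seed)
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

# The proportion tends to settle near 0.5 as the number of tosses grows.
for n in (10, 100, 10_000, 100_000):
    print(n, proportion_of_heads(n))
```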

disjoint probability

events that have no outcomes in common and so can never occur together
P(A and B)=0
P(A or B)=P(A)+P(B)
P(A does not occur)=1-P(A)
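The rules for disjoint events, P(A and B)=0 and P(A or B)=P(A)+P(B), and the complement rule, P(A does not occur)=1-P(A), can be checked by counting outcomes for one roll of a fair die. The events A and B here are hypothetical choices for illustration.

```python
from fractions import Fraction

outcomes = range(1, 7)  # one roll of a fair six-sided die, all equally likely

def prob(event):
    """Probability of an event, given as a function outcome -> True/False."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def A(o):
    return o in (1, 2)  # roll is 1 or 2

def B(o):
    return o == 6       # roll is 6; shares no outcome with A

p_and = prob(lambda o: A(o) and B(o))
p_or = prob(lambda o: A(o) or B(o))
p_not_a = prob(lambda o: not A(o))

print(p_and)                     # disjoint events never occur together
print(p_or, prob(A) + prob(B))   # addition rule for disjoint events
print(p_not_a, 1 - prob(A))      # complement rule
```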

density curve

describes overall pattern of a distribution

independent events

if one occurs it does not change the probability that the other occurs
P(A and B)=P(A)P(B)

conditional probability

probability of B, given A
P(B | A)=P(A and B)/P(A)

any two events

P(A or B)=P(A)+P(B)-P(A and B)

A and B happen together

P(A and B)=P(A)P(B | A)

independent probability

two events A and B that both have positive probability are independent if
P(B | A)=P(B)
or equivalently
P(A and B)=P(A)P(B)
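The conditional probability, general addition, multiplication, and independence rules can be checked by listing the 36 equally likely outcomes of two fair dice. The events A, B, and C are hypothetical choices for illustration.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # (first die, second die)

def prob(event):
    """Probability of an event, given as a function outcome -> True/False."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def A(o):
    return o[0] % 2 == 0     # first die is even

def B(o):
    return o[0] + o[1] == 7  # sum is 7

def C(o):
    return o[0] + o[1] >= 10  # sum is at least 10

def p_and(e1, e2):
    return prob(lambda o: e1(o) and e2(o))

def p_or(e1, e2):
    return prob(lambda o: e1(o) or e2(o))

def p_given(e2, e1):
    """Conditional probability P(e2 | e1) = P(e1 and e2) / P(e1)."""
    return p_and(e1, e2) / prob(e1)

# A and B are independent: knowing A does not change the probability of B.
print(p_given(B, A), prob(B), p_and(A, B), prob(A) * prob(B))

# A and C are not independent, but the general rules still hold.
print(p_or(A, C), prob(A) + prob(C) - p_and(A, C))  # general addition rule
print(p_and(A, C), prob(A) * p_given(C, A))         # multiplication rule
```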

Bayes's theorem

P(Ai | B) = P(B | Ai)P(Ai) / [P(B | A1)P(A1) + ... + P(B | An)P(An)]
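A small worked sketch of the formula: the denominator adds P(B | Ak)P(Ak) over all of the events A1, ..., An, which are assumed to be disjoint and to cover every possibility. The function name `bayes` and the numbers are hypothetical.

```python
from fractions import Fraction

def bayes(priors, likelihoods):
    """Return P(Ai | B) for each i, given the lists P(Ai) and P(B | Ai)."""
    # Denominator: P(B | A1)P(A1) + ... + P(B | An)P(An), which equals P(B).
    total = sum(p * l for p, l in zip(priors, likelihoods))
    return [p * l / total for p, l in zip(priors, likelihoods)]

# Hypothetical example: A1 = has a condition, A2 = does not; B = positive test.
priors = [Fraction(1, 100), Fraction(99, 100)]       # P(A1), P(A2)
likelihoods = [Fraction(95, 100), Fraction(5, 100)]  # P(B | A1), P(B | A2)

posteriors = bayes(priors, likelihoods)
print(posteriors[0], float(posteriors[0]))  # P(A1 | B)
```

With these made-up numbers, most positive tests come from the much larger group without the condition, so P(A1 | B) stays well below P(B | A1).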