Stats

Categorical Variable

Places an individual into one of several groups or categories.
Pie chart
Bar graph

Quantitative Variable

Takes numerical values for which arithmetic operations make sense.
Histogram
Stemplot

Exploratory data analysis

Is the process of using statistical tools and ideas to examine data in order to describe their main features.

Distribution

The distribution of a variable tells us what values it takes and how often it takes these values.

Categorical Data

The distribution of a categorical variable lists the categories and gives the count or percent of individuals who fall into each category.
Pie Charts
Bar Graphs
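
As a rough illustration of what such a distribution looks like, the sketch below tallies counts and percents for a categorical variable. The data and the function name `categorical_distribution` are made up for this example.

```python
from collections import Counter

def categorical_distribution(values):
    """Return {category: (count, percent)} for a list of category labels."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, 100 * n / total) for cat, n in counts.items()}

# Hypothetical data: preferred radio format for eight listeners.
formats = ["rock", "news", "rock", "country", "news", "rock", "country", "rock"]
dist = categorical_distribution(formats)

for cat, (count, pct) in dist.items():
    print(f"{cat}: {count} ({pct:.1f}%)")
```

The percents are the slice sizes a pie chart would use, and the counts are the bar heights a bar graph would use.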

Pie charts

show the distribution of a categorical variable as a "pie" whose slices are sized by the counts or percents for the categories. Pie charts are about percentages of a whole, so the slices must add up to 100%.

Bar graphs

represent each category as a bar whose height shows the category count or percent. Bar graphs are less about percentages of a whole and more about comparing the numerical differences between categories.

Quantitative Data

The distribution of a quantitative variable tells us what values the variable takes on and how often it takes those values.
Histograms
Stemplots

Histograms

show the distribution of a quantitative variable by using bars whose height represents the number of individuals who take on a value within a particular class.
For quantitative variables that take many values and/or large datasets.
Divide the possible values into classes of equal width.
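
A minimal sketch of the counting behind a histogram, assuming equal-width classes that include their left endpoint. The data, the class boundaries, and the name `histogram_counts` are hypothetical.

```python
def histogram_counts(data, start, width, n_classes):
    """Count observations in classes [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_classes
    for x in data:
        k = int((x - start) // width)
        if 0 <= k < n_classes:
            counts[k] += 1
    return counts

# Hypothetical test scores, grouped into classes 50-59, 60-69, ..., 90-99.
scores = [52, 58, 61, 64, 67, 70, 71, 73, 75, 78, 82, 85, 91]
counts = histogram_counts(scores, start=50, width=10, n_classes=5)

for k, c in enumerate(counts):
    low = 50 + 10 * k
    print(f"{low}-{low + 9}: " + "*" * c)
```

Each row of asterisks plays the role of one bar, with its length equal to the class count.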

Stemplots

separate each observation into a stem and a leaf that are then plotted to display the distribution while maintaining the original values of the variable.
For quantitative variables.
Separate each observation into a stem (first part of the number) and a leaf (the remaining part of the number).
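
The sketch below builds a text stemplot for non-negative whole numbers, assuming the common convention that the leaf is the final digit and the stem is everything before it. The data and the function name `stemplot` are hypothetical.

```python
from collections import defaultdict

def stemplot(data):
    """Return stemplot lines: stem = all but the last digit, leaf = last digit."""
    stems = defaultdict(list)
    for x in sorted(data):
        stems[x // 10].append(x % 10)
    lines = []
    # Include every stem between the smallest and largest, even empty ones,
    # so gaps in the distribution stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}")
    return lines

values = [23, 12, 38, 21, 15, 40, 23]
lines = stemplot(values)
print("\n".join(lines))
```

Because every leaf is an actual digit from the data, the original values can be read back from the plot.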

Describing Distributions

A distribution is symmetric if the right and left sides of the graph are approximately mirror images of each other.
A distribution is skewed to the right (right-skewed) if the right side of the graph (containing the half of the observations with larger values) is much longer than the left side.

Mean

To find the mean $\bar{x}$ (pronounced "x-bar") of a set of observations, add their values and divide by the number of observations. If the n observations are $x_1, x_2, x_3, \dots, x_n$, their mean is:
$\bar{x} = \frac{\text{sum of observations}}{n} = \frac{x_1 + x_2 + \dots + x_n}{n}$
or, in more compact notation:
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

Median

The median M is the midpoint of a distribution, the number such that half of the observations are smaller and the other half are larger.
To find the median of a distribution:
Arrange all observations from smallest to largest.
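If the number of observations n is odd, the median M is the center observation in the ordered list.
If the number of observations n is even, the median M is the average of the two center observations in the ordered list.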
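
A short sketch of the median procedure, assuming the usual convention: the center observation when n is odd, and the average of the two center observations when n is even. The function name `median` and the sample values are illustrative only.

```python
def median(values):
    """Median of a list of numbers using the sort-and-find-the-middle method."""
    ordered = sorted(values)          # arrange from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # odd n: the single center observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the two centers

print(median([5, 1, 3]))      # 3
print(median([4, 1, 3, 2]))   # 2.5
```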

Mean vs. Median

The mean and median of a roughly symmetric distribution are close together.
If the distribution is exactly symmetric, the mean and median are exactly the same.
In a skewed distribution, the mean is usually farther out in the long tail than is the median.
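
To see the effect of a long tail, the sketch below compares the mean and median of two small made-up data sets using Python's standard `statistics` module. The second list replaces the largest value with a much larger one to mimic right skew.

```python
from statistics import mean, median

symmetric = [2, 4, 6, 8, 10]      # hypothetical, exactly symmetric about 6
right_skewed = [2, 4, 6, 8, 60]   # same data with one value far out in the right tail

print(mean(symmetric), median(symmetric))        # mean and median agree
print(mean(right_skewed), median(right_skewed))  # mean is pulled toward the tail
```

The median stays at 6 in both cases, while the mean moves from 6 to 16.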

Measuring Spread: Quartiles

To calculate the quartiles:
Arrange the observations in increasing order and locate the median M.
The first quartile Q1 is the median of the observations located to the left of the median in the ordered list.
The third quartile Q3 is the median of the observations located to the right of the median in the ordered list.
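
The quartile steps can be sketched as below, assuming that when n is odd the median itself is left out of both halves. The function names and data are hypothetical, and library routines such as percentile functions often use different interpolation rules, so their answers may differ slightly.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-each-half method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]              # observations to the left of the median
    upper = ordered[half + (n % 2):]    # observations to the right of the median
    return median(lower), median(ordered), median(upper)

print(quartiles([6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]))  # (15, 40, 43)
print(quartiles([1, 2, 3, 4, 5, 6, 7, 8]))                    # (2.5, 4.5, 6.5)
```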

Five-Number Summary

The minimum and maximum values alone tell us little about the distribution as a whole. Likewise, the median and quartiles tell us little about the tails of a distribution.
To get a quick summary of both center and spread, combine all five numbers.
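The five-number summary of a distribution consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation, written in order from smallest to largest.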
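
A minimal sketch of a five-number summary (minimum, Q1, median, Q3, maximum), assuming quartiles are found as the medians of the lower and upper halves of the ordered data. The helper names and the data are made up for illustration.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return (min, Q1, M, Q3, max)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + (n % 2):])
    return ordered[0], q1, median(ordered), q3, ordered[-1]

data = [43, 6, 40, 15, 49, 7, 41, 36, 47, 39, 42]
print(five_number_summary(data))  # (6, 15, 40, 43, 49)
```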

Boxplots

The five-number summary divides the distribution roughly into quarters. This leads to a new way to display quantitative data, the boxplot.

Suspected Outliers: The 1.5 x IQR Rule

In addition to serving as a measure of spread, the interquartile range (IQR) is used as part of a rule of thumb for identifying outliers.
The 1.5 x IQR Rule for Outliers
Call an observation an outlier if it falls more than 1.5 x IQR above the third quartile or below the first quartile.
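
The rule of thumb can be sketched as follows, assuming IQR = Q3 - Q1, quartiles computed as medians of the two halves of the ordered data, and observations flagged on either side of the quartiles. The travel-time data and the function names are hypothetical.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def suspected_outliers(values):
    """Return observations more than 1.5 x IQR below Q1 or above Q3."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half + (n % 2):])
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in ordered if x < low_fence or x > high_fence]

# Hypothetical travel times to work, in minutes.
times = [5, 10, 10, 15, 15, 15, 15, 20, 20, 20,
         25, 30, 30, 40, 40, 45, 60, 60, 65, 85]
print(suspected_outliers(times))  # [85]
```

Here Q1 = 15 and Q3 = 42.5, so IQR = 27.5 and the upper cutoff is 42.5 + 41.25 = 83.75; only the 85-minute time exceeds it.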

Measuring Spread: Standard Deviation

The most common measure of spread looks at how far each observation is from the mean. This measure is called the ... The standard deviation $s_x$ measures the average distance of the observations from their mean. It is calculated by finding an average of the squared distances and then taking the square root.

Variance

$s_x^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}$

Standard deviation

The square root of the variance: $s_x = \sqrt{s_x^2}$
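
A small sketch of the sample variance (dividing by n - 1) and the standard deviation as its square root. The function names and the data are illustrative assumptions.

```python
import math

def variance(values):
    """Sample variance: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)

def std_dev(values):
    """Sample standard deviation: the square root of the variance."""
    return math.sqrt(variance(values))

data = [1, 3, 4, 4, 4, 5, 7, 8, 9]   # hypothetical observations, mean 5
print(variance(data))   # 6.5
print(std_dev(data))
```

For this data the squared deviations sum to 52, so the variance is 52 / 8 = 6.5 and the standard deviation is about 2.55.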

Center vs. Spread

The median and IQR are usually better than the mean and standard deviation for describing a skewed distribution or a distribution with outliers.
Use mean and standard deviation only for reasonably symmetric distributions that don't have outliers.

Density curve

If the scale is adjusted so the total area under the curve is exactly 1, then this curve is called a ...
is always on or above the horizontal axis
has an area of exactly 1 underneath it
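A density curve describes the overall pattern of a distribution. The area under the curve and above any range of values on the horizontal axis is the proportion of all observations that fall in that range.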
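
As a numerical check of the two defining properties, the sketch below uses the standard Normal curve as an example density (an assumption made for illustration; any density curve would do) and approximates areas with the trapezoidal rule. The function names are hypothetical.

```python
import math

def normal_density(x, mu=0.0, sigma=1.0):
    """Height of the Normal density curve at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def area_under(f, a, b, steps=10_000):
    """Trapezoidal approximation of the area under f between a and b."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h

total_area = area_under(normal_density, -8, 8)      # essentially the whole curve
middle_area = area_under(normal_density, -1, 1)     # proportion between -1 and 1
print(round(total_area, 6), round(middle_area, 4))
```

The total area comes out at about 1, and the area over a range of values can be read as the proportion of observations falling in that range.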

Normal Distributions

All Normal curves are symmetric, single-peaked, and bell-shaped.
A specific Normal curve is described by giving its mean μ and standard deviation σ.
The mean of a Normal distribution is the center of the symmetric Normal curve.
The standard deviation is the distance from the center to the change-of-curvature points on either side.

The 68-95-99.7 Rule

In the Normal distribution with mean μ and standard deviation σ:
Approximately 68% of the observations fall within σ of μ.
Approximately 95% of the observations fall within 2σ of μ.
Approximately 99.7% of the observations fall within 3σ of μ.
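
The three percentages can be checked numerically. The sketch below relies on the identity that, for a Normal distribution, the proportion within k standard deviations of the mean equals erf(k / sqrt(2)), using `math.erf` from the standard library. The function name `within_k_sd` is made up for this example.

```python
import math

def within_k_sd(k):
    """Proportion of a Normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(100 * within_k_sd(k), 2))
```

The results are roughly 68.27%, 95.45%, and 99.73%, which the rule rounds to 68, 95, and 99.7.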

The Standard Normal Distribution

is the Normal distribution with mean 0 and standard deviation 1.
If a variable x has any Normal distribution N(μ, σ) with mean μ and standard deviation σ, then the standardized variable
$z = \frac{x - \mu}{\sigma}$
has the standard Normal distribution, N(0,1).
z = how many standard deviations x lies from the mean μ.
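
A one-line standardization sketch. The height figures (mean 64.5 inches, standard deviation 2.5 inches) are invented for the example, as is the function name `z_score`.

```python
def z_score(x, mu, sigma):
    """Standardize x: subtract the mean, then divide by the standard deviation."""
    return (x - mu) / sigma

# Hypothetical: heights roughly N(64.5, 2.5) inches; standardize a height of 68.
z = z_score(68, mu=64.5, sigma=2.5)
print(round(z, 2))  # 1.4
```

A z of 1.4 says the height is 1.4 standard deviations above the mean; a negative z would mean below the mean.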

Table A

is a table of areas under the standard Normal curve. The table entry for each value z is the area under the curve to the left of z.
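
When a printed table is not at hand, the same left-tail areas can be approximated in code. The sketch below uses the standard identity area = 0.5 * (1 + erf(z / sqrt(2))) with `math.erf`; the function name `area_left_of` is hypothetical, and results may differ from a printed table in the last rounded digit.

```python
import math

def area_left_of(z):
    """Area under the standard Normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(round(area_left_of(0), 4))      # 0.5
print(round(area_left_of(1.4), 4))    # 0.9192
# Area between two z-values: subtract the two left-tail areas.
print(round(area_left_of(1) - area_left_of(-1), 4))
```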

Scatterplot

The most useful graph for displaying the relationship between two quantitative variables measured on the same individuals. The values of one variable appear on the horizontal axis, and the values of the other variable appear on the vertical axis.

Interpreting Scatterplots

As in any graph of data, look for the overall pattern and for striking departures from that pattern.
You can describe the overall pattern of a scatterplot by the direction, form, and strength of the relationship.
An important kind of departure is an outlier, an individual value that falls outside the overall pattern of the relationship.

Positive association

Two variables have a positive association when above-average values of one tend to accompany above-average values of the other, and when below-average values also tend to occur together.

Negative association

Two variables have a negative association when above-average values of one tend to accompany below-average values of the other.

Correlation r

The correlation r measures the strength of the linear relationship between two quantitative variables.
$r = \frac{1}{n-1} \sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$
r is always a number between -1 and 1.
r > 0 indicates a positive association.
r < 0 indicates a negative association.
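
The formula for r can be sketched directly: standardize each x and each y, multiply the pairs, add them up, and divide by n - 1. The function name `correlation` and the data are made up for illustration.

```python
import math

def correlation(xs, ys):
    """Correlation r: average product of standardized x and y values (dividing by n - 1)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sx = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
    return sum(((x - xbar) / sx) * ((y - ybar) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

xs = [1, 2, 3, 4, 5]
r_up = correlation(xs, [3, 5, 7, 9, 11])     # points exactly on a rising line
r_down = correlation(xs, [10, 8, 6, 4, 2])   # points exactly on a falling line
r_noisy = correlation(xs, [2, 1, 4, 3, 5])   # positive but weaker association
print(round(r_up, 4), round(r_down, 4), round(r_noisy, 4))
```

Points lying exactly on a line give r of 1 or -1 (up to rounding), and scattered points give a value strictly in between.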

Facts About Correlation

Correlation makes no distinction between explanatory and response variables.
r has no units and does not change when we change the units of measurement of x, y, or both.
Positive r indicates positive association between the variables, and negative r indicates negative association.
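
Two of these facts are easy to check numerically: swapping x and y leaves r unchanged, and so does changing units. The sketch below uses invented height and weight data and a hypothetical `correlation` helper; the conversion factors (2.54 cm per inch, 0.4536 kg per pound) are approximate.

```python
import math

def correlation(xs, ys):
    """Correlation r computed from standardized values, dividing by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sx = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
    return sum(((x - xbar) / sx) * ((y - ybar) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

heights_in = [60, 62, 65, 68, 71]        # hypothetical heights in inches
weights_lb = [110, 120, 140, 155, 180]   # hypothetical weights in pounds

r = correlation(heights_in, weights_lb)
r_swapped = correlation(weights_lb, heights_in)        # roles of x and y exchanged
r_metric = correlation([2.54 * h for h in heights_in],
                       [0.4536 * w for w in weights_lb])  # centimeters and kilograms
print(round(r, 4), round(r_swapped, 4), round(r_metric, 4))
```

All three printed values agree, because standardizing removes the units before the products are formed.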