Categorical Variable
Places an individual into one of several groups or categories.
Pie chart
Bar graph
Quantitative Variable
Takes numerical values for which arithmetic operations make sense.
Histogram
Stemplot
Exploratory data analysis
Is the process of using statistical tools and ideas to examine data in order to describe their main features.
Distribution
The distribution of a variable tells us what values it takes and how often it takes these values.
Categorical Data
The distribution of a categorical variable lists the categories and gives the count or percent of individuals who fall into each category.
Pie Charts
Bar Graphs
Pie charts
show the distribution of a categorical variable as a "pie" whose slices are sized by the counts or percents for the categories. Pie charts are about percentages: the slices must add up to 100%.
Bar graphs
represent each category as a bar whose height shows the category count or percent. With bar graphs, the focus is less on percentages of a whole and more on the numerical differences between categories.
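As a small illustration of what a pie chart or bar graph actually displays, the sketch below tallies counts and percents for a categorical variable. The function name `categorical_distribution` and the eye-color data are hypothetical, made up for this example.

```python
from collections import Counter

def categorical_distribution(values):
    """Return {category: (count, percent)} for a list of category labels."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, 100 * n / total) for cat, n in counts.items()}

# Hypothetical data: eye color recorded for 8 individuals.
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue", "brown", "blue"]
dist = categorical_distribution(eye_colors)

for category, (count, percent) in dist.items():
    print(f"{category}: count={count}, percent={percent:.1f}%")

# The percents are what a pie chart slices up, so they should total 100.
total_percent = sum(percent for _, percent in dist.values())
```

The counts are what the bar heights would show; the percents are what the pie slices would show.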
Quantitative Data
The distribution of a quantitative variable tells us what values the variable takes on and how often it takes those values.
Histograms
Stemplots
Histograms
show the distribution of a quantitative variable by using bars whose height represents the number of individuals who take on a value within a particular class.
For quantitative variables that take many values and/or large datasets.
Divide the possible val
Stemplots
separate each observation into a stem and a leaf that are then plotted to display the distribution while maintaining the original values of the variable.
For quantitative variables.
Separate each observation into a stem (first part of the number) and a le
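A rough sketch of building a stemplot in code, assuming two-digit whole-number observations where the stem is the tens digit and the leaf is the final (ones) digit. That split is an assumption for this example, and the function name `stemplot` and the data are hypothetical.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot lines for non-negative integers (stem = tens, leaf = ones digit)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    lines = []
    # Include stems with no leaves so gaps in the distribution stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        leaf_str = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem} | {leaf_str}".rstrip())
    return lines

# Hypothetical observations.
scores = [24, 12, 47, 21, 15, 40, 24]
plot = stemplot(scores)
print("\n".join(plot))
```

Because every leaf is an actual digit from the data, the original values can be read back off the plot.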
Describing Distributions
A distribution is symmetric if the right and left sides of the graph are approximately mirror images of each other.
A distribution is skewed to the right (right-skewed) if the right side of the graph (containing the half of the observations with larger va
Mean
To find the mean $\bar{x}$ (pronounced "x-bar") of a set of observations, add their values and divide by the number of observations. If the n observations are $x_1, x_2, x_3, \ldots, x_n$, their mean is:
$\bar{x} = \frac{\text{sum of observations}}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
or in more compact notation
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Median
The median M is the midpoint of a distribution, the number such that half of the observations are smaller and the other half are larger.
To find the median of a distribution:
Arrange all observations from smallest to largest.
If the number of observations
Mean vs. Median
The mean and median of a roughly symmetric distribution are close together.
If the distribution is exactly symmetric, the mean and median are exactly the same.
In a skewed distribution, the mean is usually farther out in the long tail than is the median.
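A quick check of these statements, using Python's standard `statistics` module on two small made-up data sets (the numbers are hypothetical, chosen only to show the effect):

```python
from statistics import mean, median

# Hypothetical data: one exactly symmetric set, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 50]

sym_mean, sym_median = mean(symmetric), median(symmetric)
skew_mean, skew_median = mean(right_skewed), median(right_skewed)

print("symmetric:    mean =", sym_mean, " median =", sym_median)
print("right-skewed: mean =", skew_mean, " median =", skew_median)
```

In the symmetric set the mean and median are both 3. In the right-skewed set the single large value pulls the mean out to 12 while the median stays at 3.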
Measuring Spread: Quartiles
To calculate the quartiles:
Arrange the observations in increasing order and locate the median M.
The first quartile Q1 is the median of the observations located to the left of the median in the ordered list.
The third quartile Q3 is the median of the obs
Five-Number Summary
The minimum and maximum values alone tell us little about the distribution as a whole. Likewise, the median and quartiles tell us little about the tails of a distribution.
To get a quick summary of both center and spread, combine all five numbers.
The fiv
Boxplots
The five-number summary divides the distribution roughly into quarters. This leads to a new way to display quantitative data, the boxplot.
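The numbers a boxplot is drawn from can be sketched as follows, assuming the summary consists of the minimum, Q1, the median, Q3, and the maximum, with the quartiles found as medians of the lower and upper halves of the ordered list. The function name `five_number_summary` and the data are hypothetical, and other software may use slightly different quartile conventions.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, M, Q3, max) using the median-of-halves method.

    Assumes at least two observations.
    """
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]             # observations to the left of the median
    upper = xs[half + (n % 2):]   # observations to the right of the median
    return (xs[0], median(lower), median(xs), median(upper), xs[-1])

# Hypothetical data set of 10 observations.
data = [5, 7, 10, 14, 18, 19, 25, 29, 31, 33]
summary = five_number_summary(data)
print(summary)  # (5, 10, 18.5, 29, 33)
```

When the number of observations is odd, the middle observation is the median and is left out of both halves.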
Suspected Outliers: The 1.5 x IQR Rule
In addition to serving as a measure of spread, the interquartile range (IQR) is used as part of a rule of thumb for identifying outliers.
The 1.5 x IQR Rule for Outliers
Call an observation an outlier if it falls more than 1.5 x IQR above the third quartil
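A minimal sketch of the rule of thumb in code. It assumes the usual two-sided form of the rule (more than 1.5 x IQR above Q3 or below Q1); the lower cutoff is an assumption here, and the function name `iqr_outliers` and the data are hypothetical.

```python
def iqr_outliers(data, q1, q3):
    """Return the observations flagged by the 1.5 x IQR rule, given Q1 and Q3."""
    iqr = q3 - q1
    low_cutoff = q1 - 1.5 * iqr    # assumed lower fence
    high_cutoff = q3 + 1.5 * iqr   # upper fence
    return [x for x in data if x < low_cutoff or x > high_cutoff]

# Hypothetical data with Q1 = 10 and Q3 = 29 (medians of the lower and upper halves).
data = [5, 7, 10, 14, 18, 19, 25, 29, 31, 80]
flagged = iqr_outliers(data, q1=10, q3=29)
print(flagged)  # [80]
```

Here IQR = 19, so the cutoffs are -18.5 and 57.5, and only 80 is flagged as a suspected outlier.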
Measuring Spread: Standard Deviation
The most common measure of spread looks at how far each observation is from the mean. This measure is called the standard deviation. The standard deviation s_x measures the average distance of the observations from their mean. It is calculated by finding an average of the s
Variance
$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$
Standard deviation
$s = \sqrt{s^2}$, the square root of the variance
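The variance and standard deviation can be computed step by step as in the sketch below, which divides the sum of squared deviations by n - 1. The function names and the data are hypothetical; Python's built-in `statistics.variance` and `statistics.stdev` use the same n - 1 divisor.

```python
from math import sqrt

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def sample_std(xs):
    """Standard deviation: the square root of the variance."""
    return sqrt(sample_variance(xs))

# Hypothetical observations: mean is 3, squared deviations sum to 10.
observations = [1, 2, 3, 4, 5]
var = sample_variance(observations)   # 10 / 4 = 2.5
sd = sample_std(observations)
print(var, round(sd, 4))
```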
Center vs. Spread
The median and IQR are usually better than the mean and standard deviation for describing a skewed distribution or a distribution with outliers.
Use mean and standard deviation only for reasonably symmetric distributions that don't have outliers.
Density curve
If the scale is adjusted so the total area under the curve is exactly 1, then this curve is called a ...
is always on or above the horizontal axis
has an area of exactly 1 underneath it
A density curve describes the overall pattern of a distribution. The
Normal Distributions
All Normal curves are symmetric, single-peaked, and bell-shaped.
A specific Normal curve is described by giving its mean μ and standard deviation σ.
The mean of a Normal distribution is the center of the symmetric Normal curve.
The standard deviation is th
The 68-95-99.7 Rule
In the Normal distribution with mean μ and standard deviation σ:
Approximately 68% of the observations fall within σ of μ.
Approximately 95% of the observations fall within 2σ of μ.
Approximately 99.7% of the observations fall within 3σ of μ.
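The three percentages can be checked numerically with `statistics.NormalDist` from Python's standard library. The helper name `within_k_sd` is hypothetical, and the mean of 100 and standard deviation of 15 are arbitrary illustration values; the result is the same for any Normal distribution.

```python
from statistics import NormalDist

def within_k_sd(k, mu=0.0, sigma=1.0):
    """Area under the Normal curve between mu - k*sigma and mu + k*sigma."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

# Arbitrary mean and standard deviation chosen for illustration.
areas = {k: within_k_sd(k, mu=100, sigma=15) for k in (1, 2, 3)}
for k, area in areas.items():
    print(f"within {k} standard deviation(s): {area:.4f}")
```

The computed areas are about 0.6827, 0.9545, and 0.9973, which the rule rounds to 68%, 95%, and 99.7%.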
The Standard Normal Distribution
is the Normal distribution with mean 0 and standard deviation 1.
If a variable x has any Normal distribution N(μ, σ) with mean μ and standard deviation σ, then the standardized variable
$z = \frac{x - \mu}{\sigma}$
has the standard Normal distribution, N(0,1).
Z=how many
Table A
is a table of areas under the standard Normal curve. The table entry for each value z is the area under the curve to the left of z.
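Standardizing a value and then looking up the area to its left can be sketched in code, with `statistics.NormalDist(0, 1).cdf` standing in for the printed table. The function names `z_score` and `area_left` and the example numbers (mean 100, standard deviation 15, observation 130) are hypothetical.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Standardize x: how many standard deviations x lies from the mean."""
    return (x - mu) / sigma

def area_left(z):
    """Area under the standard Normal curve to the left of z."""
    return NormalDist(0, 1).cdf(z)

# Hypothetical example: x = 130 from a Normal distribution with mean 100, sd 15.
z = z_score(130, mu=100, sigma=15)
area = area_left(z)
print(f"z = {z}, area to the left = {area:.4f}")
```

Here z = 2.0 and the area to the left is about 0.9772, so roughly 97.7% of observations fall below 130 under these assumed values.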
Scatterplot
The most useful graph for displaying the relationship between two quantitative variables. A scatterplot shows the relationship between two quantitative variables measured on the same individuals. The values of one variable appear on the horizontal axis, and the values
Interpreting Scatterplots
As in any graph of data, look for the overall pattern and for striking departures from that pattern.
You can describe the overall pattern of a scatterplot by the direction, form, and strength of the relationship.
An important kind of departure is an outli
Positive association
Two variables have a positive association when above-average values of one tend to accompany above-average values of the other, and when below-average values also tend to occur together.
Negative association
Two variables have a negative association when above-average values of one tend to accompany below-average values of the other.
Correlation r
The correlation r measures the strength of the linear relationship between two quantitative variables.
$r = \frac{1}{n-1} \sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$
r is always a number between -1 and 1.
r > 0 indicates a positive association.
r < 0 indicates a negative associati
Facts About Correlation
Correlation makes no distinction between explanatory and response variables.
r has no units and does not change when we change the units of measurement of x, y, or both.
Positive r indicates positive association between the variables, and negative r indic
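The sketch below computes r as the sum of products of standardized values divided by n - 1, then checks two of the facts above: swapping the roles of x and y does not change r, and neither does changing units. The function name `correlation` and the height and weight data are hypothetical.

```python
from math import sqrt

def correlation(xs, ys):
    """Correlation r: average product of standardized x and y values (divisor n - 1)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sx = sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    sy = sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
    products = (((x - xbar) / sx) * ((y - ybar) / sy) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

# Hypothetical data: heights in inches and weights in pounds for 5 individuals.
heights_in = [60, 62, 65, 68, 70]
weights_lb = [110, 120, 140, 155, 160]

r = correlation(heights_in, weights_lb)
r_swapped = correlation(weights_lb, heights_in)     # roles of x and y exchanged
heights_cm = [h * 2.54 for h in heights_in]         # change of units
r_rescaled = correlation(heights_cm, weights_lb)

print(round(r, 4), round(r_swapped, 4), round(r_rescaled, 4))
```

All three values agree (about 0.99 for this data), and r is positive because taller individuals in this made-up set tend to weigh more.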