Ch 3 Statistics: Describing, Exploring, and Comparing Data

Measure of Center

the value at the center or middle of a data set

Arithmetic Mean (Mean or average)

the measure of center obtained by adding the values and dividing the total by the number of values

Mean Advantages

Is relatively reliable: means of samples drawn from the same population don't vary as much as other measures of center. Takes every data value into account.

Mean Disadvantages

Is sensitive to every data value: one extreme value can affect it dramatically, so it is not a resistant measure of center

Median

the middle value when the original data values are arranged in order of increasing (or decreasing) magnitude. With an even number of values, it is the mean of the two middle values. It is not strongly affected by an extreme value, so it is a resistant measure of center
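A minimal sketch of the difference between the two measures, using Python's standard `statistics` module. The sample values are hypothetical and chosen only for illustration; adding one extreme value moves the mean a long way but barely moves the median.

```python
from statistics import mean, median

# Hypothetical sample values, chosen only for illustration.
values = [2, 3, 5, 7, 8]
with_outlier = values + [100]   # same data plus one extreme value

mean_before = mean(values)             # (2 + 3 + 5 + 7 + 8) / 5
median_before = median(values)         # middle value of the sorted list
mean_after = mean(with_outlier)        # pulled strongly toward 100
median_after = median(with_outlier)    # even count: mean of the two middle values

print("mean:  ", mean_before, "->", round(mean_after, 1))
print("median:", median_before, "->", median_after)
```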

Mode

the value that occurs with the greatest frequency
A data set can have one mode, more than one mode, or no mode. The mode is the only measure of central tendency that can be used with nominal data

Bimodal

two data values occur with the same greatest frequency

Multimodal

more than two data values occur with the same greatest frequency

No Mode

no data value is repeated
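The one-mode, bimodal, multimodal, and no-mode cases can be sketched with a small helper. The function name `modes` and the sample lists are assumptions made for this illustration; the helper follows the convention above that a data set has no mode when no value is repeated.

```python
from collections import Counter

def modes(data):
    """Return a list of the modes, or [] when no value is repeated."""
    if not data:
        return []
    counts = Counter(data)
    top = max(counts.values())      # greatest frequency in the data
    if top == 1:
        return []                   # no value repeated: no mode
    return [value for value, count in counts.items() if count == top]

print(modes([1, 2, 2, 3]))            # one mode
print(modes([1, 1, 2, 2, 3]))         # bimodal
print(modes([1, 2, 3]))               # no mode
print(modes(["red", "blue", "red"]))  # works for nominal data too
```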

Midrange

the value midway between the maximum and minimum values in the original data set. Sensitive to extremes because it uses only the maximum and minimum values, so rarely used
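As a rough sketch, the midrange is (maximum + minimum) / 2. The helper name `midrange` and the sample values are hypothetical; the second call shows how a single extreme value shifts the result.

```python
def midrange(data):
    """Value midway between the largest and smallest data values."""
    return (max(data) + min(data)) / 2

print(midrange([2, 3, 5, 7, 8]))        # uses only 2 and 8
print(midrange([2, 3, 5, 7, 8, 100]))   # one extreme value dominates
```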

Midrange Redeeming Features

(1) very easy to compute (2) reinforces that there are several ways to define the center (3) avoids confusion with the median

Round-off Rule for Measures of Center

Carry one more decimal place than is present in the original set of values.

standard deviation

of a set of sample values, denoted by s, is a measure of variation of values about the mean.

Standard Deviation - Important Properties

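The standard deviation is a measure of variation of all values from the mean. The value of the standard deviation s is usually positive; it is zero only when all of the data values are the same, and it is never negative. The value of the standard deviation s can increase dramatically with the inclusion of one or more outliers (data values that are very far away from all of the others).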

Variance

a measure of variation equal to the square of the standard deviation.
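A minimal sketch of the relationship between the two, assuming the sample versions (division by n - 1) that Python's `statistics.stdev` and `statistics.variance` compute. The three sample values are hypothetical.

```python
from statistics import stdev, variance

# Hypothetical sample values, chosen only for illustration.
values = [1, 3, 14]

# Mean is 6; squared deviations are 25, 9, and 64, which sum to 98.
s2 = variance(values)   # 98 / (3 - 1)
s = stdev(values)       # square root of the sample variance

print("sample variance s^2:", s2)
print("sample standard deviation s:", s)
```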

Empirical Rule

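or 68-95-99.7 rule: for data sets with an approximately bell-shaped distribution, about 68% of all values fall within 1 standard deviation of the mean, about 95% within 2 standard deviations, and about 99.7% within 3 standard deviations.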
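As a rough sketch, the intervals the rule refers to (the mean plus or minus 1, 2, and 3 standard deviations) can be computed as follows. The mean of 100 and standard deviation of 15 are hypothetical values, and the percentages apply only when the distribution is approximately bell-shaped.

```python
# Hypothetical summary statistics, chosen only for illustration.
sample_mean = 100
s = 15

# Approximate share of values within k standard deviations of the mean.
approx_percent = {1: 68, 2: 95, 3: 99.7}

intervals = {}
for k, pct in approx_percent.items():
    intervals[k] = (sample_mean - k * s, sample_mean + k * s)
    low, high = intervals[k]
    print(f"about {pct}% of values lie between {low} and {high}")
```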

Chebyshev's Theorem

The proportion (or fraction) of any set of data lying within K standard deviations of the mean is always at least 1 - 1/K^2, where K is any positive number greater than 1.
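The bound 1 - 1/K^2 can be evaluated directly. The function name `chebyshev_lower_bound` is an assumption made for this sketch; unlike the empirical rule, the result holds for any distribution but is only a minimum.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires K > 1")
    return 1 - 1 / k ** 2

for k in (2, 3):
    print(f"K = {k}: at least {chebyshev_lower_bound(k):.1%} of the data")
```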

Rationale for using n - 1 versus n

There are only n - 1 independent values. With a given mean, only n - 1 values can be freely assigned any number before the last value is determined.
Dividing by n - 1 yields better results than dividing by n. It causes the sample variance s^2 to target the population variance σ^2, whereas division by n causes s^2 to underestimate σ^2.
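The advantage of dividing by n - 1 can be checked exhaustively on a tiny case. This sketch assumes a hypothetical population of three values and lists every possible sample of size 2 drawn with replacement; exact fractions are used so the comparison is not blurred by rounding.

```python
from fractions import Fraction
from itertools import product
from statistics import mean, pvariance, variance

# Hypothetical population, chosen only for illustration.
population = [Fraction(1), Fraction(2), Fraction(5)]
sigma2 = pvariance(population)            # population variance

# Every possible sample of size 2, drawn with replacement (9 samples).
samples = [list(pair) for pair in product(population, repeat=2)]

# Average of the sample variances under each divisor.
avg_dividing_by_n_minus_1 = mean([variance(s) for s in samples])
avg_dividing_by_n = mean([pvariance(s) for s in samples])

print("population variance:       ", sigma2)
print("average with n - 1 divisor:", avg_dividing_by_n_minus_1)
print("average with n divisor:    ", avg_dividing_by_n)
```

In this example the n - 1 version averages out to the population variance, while the n version averages to a smaller value.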

coefficient of variation

for a set of nonnegative sample or population data, expressed as a percent, describes the standard deviation relative to the mean.
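For sample data this works out to CV = (s / mean) × 100%. A minimal sketch, assuming sample data and a nonzero mean; the function name and the sample values are hypothetical.

```python
from statistics import mean, stdev

def coefficient_of_variation(sample):
    """Sample standard deviation as a percent of the sample mean."""
    return stdev(sample) / mean(sample) * 100

# Hypothetical sample values: mean is 6 and s is 7.
cv = coefficient_of_variation([1, 3, 14])
print(f"CV = {cv:.1f}%")
```

Because the result has no units, it can be used to compare variation between data sets measured on different scales.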

z Score (or standardized value)

the number of standard deviations that a given value x is above or below the mean
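In symbols, z = (x - mean) / standard deviation. A short sketch, with a hypothetical function name and hypothetical values for the mean (100) and standard deviation (15):

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

print(z_score(130, 100, 15))   # above the mean: positive z score
print(z_score(85, 100, 15))    # below the mean: negative z score
```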

Percentiles

are measures of location. There are 99 percentiles denoted P1, P2, . . . P99, which divide a set of data into 100 groups with about 1% of the values in each group.

Finding the Percentile of a Data Value

percentile of value x = (number of values less than x / total number of values) × 100
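The calculation can be sketched as a small function. The name `percentile_of` and the data are hypothetical; in practice the result is often rounded to the nearest whole number.

```python
def percentile_of(x, data):
    """Percent of values in data that are less than x."""
    below = sum(1 for value in data if value < x)
    return 100 * below / len(data)

# Hypothetical data set of 10 values.
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(percentile_of(40, data))   # 3 of the 10 values are less than 40
```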

Quartiles

Are measures of location, denoted Q1, Q2, and Q3, which divide a set of data into four groups with about 25% of the values in each group.

Q1 (First Quartile)

separates the bottom 25% of sorted values from the top 75%.

Q2 (Second Quartile)

same as the median; separates the bottom 50% of sorted values from the top 50%.

Q3 (Third Quartile)

separates the bottom 75% of sorted values from the top 25%.

5-number summary

consists of the minimum value; the first quartile Q1; the median (or second quartile Q2); the third quartile, Q3; and the maximum value.
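A minimal sketch of a 5-number summary using Python's `statistics.quantiles`. Textbooks and software use slightly different conventions for locating quartiles, so the Q1 and Q3 values here reflect that function's default ("exclusive") method and may differ from a hand calculation; the function name and the data are hypothetical.

```python
from statistics import median, quantiles

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, _, q3 = quantiles(data, n=4)   # three cut points: Q1, Q2, Q3
    return min(data), q1, median(data), q3, max(data)

# Hypothetical data set, chosen only for illustration.
summary = five_number_summary([1, 2, 3, 4, 5, 6, 7])
print(summary)
```

These five values are also the ones a boxplot draws.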

boxplot (or box-and-whisker diagram)

is a graph of a data set that consists of a line extending from the minimum value to the maximum value, and a box with lines drawn at the first quartile, Q1; the median; and the third quartile, Q3.