Statistics chapter 1 & 2 & 3 | Statistics

Descriptive statistics

Consists of collecting, summarizing, and presenting sample data using numerical and graphical methods

Inferential statistics

Consists of making estimates, decisions, predictions, or other generalizations about a larger set of data based on sampling

Population

The complete collection of measurements, objects, or individuals under study

Sample

A portion or subset taken from a population

Parameter

A numerical characteristic of a population

Statistic

A numerical characteristic of a sample

Inferential statistics

Drawing conclusions that reach beyond the data actually observed, generalizing from a sample to the larger population.

Experiment

The process of subjecting experimental units to treatments and observing their responses

Simple random sample

A sample selected in such a way that every possible sample of a given size has the same chance of being picked, and every item in the population has an equal chance of being selected
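
The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function name `simple_random_sample`, the population of 100 numbered items, and the seed are all assumptions made for the example. It relies on the standard library's `random.sample`, which draws without replacement.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct items so that every size-n subset is equally likely."""
    rng = random.Random(seed)  # the seed is optional, used only for repeatable demos
    return rng.sample(list(population), n)


# Hypothetical population: items numbered 1 to 100.
population = list(range(1, 101))
sample = simple_random_sample(population, 10, seed=42)
print(sample)
```

Because the draw is without replacement, no item can appear twice in the sample.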

Arithmetic mean

The arithmetic mean of a group of values is a measure of central tendency found by adding all the values to get a total and then dividing the total by the number of values

Median

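The median of a group of values is the value that occupies the middle position after all the values are arranged in ascending or descending order. With an even number of values, it is the average of the two middle values (1 2 3 4 5 6: Median = 3.5)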

Mode

The mode of a group of values is the value that occurs most often. If no value occurs more than once, there is no mode. When two values tie for the greatest count, the data set is said to be bimodal (1 2 3 3 4 5 5 5 6: Mode = 5)
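
A short Python sketch of the three measures of central tendency, using the example values from the cards. `mean` and `median` come from the standard `statistics` module. The helper `modes` is a hypothetical name written for this example so that it returns an empty list when no value repeats, matching the "no mode" rule on the card.

```python
from collections import Counter
from statistics import mean, median


def modes(values):
    """Return the most frequent value(s); an empty list if nothing repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # no value occurs more than once, so there is no mode
    return [v for v, c in counts.items() if c == top]


values = [1, 2, 3, 4, 5, 6]
print(mean(values))    # 3.5
print(median(values))  # 3.5 (average of the two middle values, 3 and 4)

print(modes([1, 2, 3, 3, 4, 5, 5, 5, 6]))  # [5]
print(modes([1, 2, 2, 3, 3]))              # [2, 3], a bimodal data set
print(modes(values))                       # []
```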

Dispersion

The amount of spread or scatter that occurs in the data.

Pie Charts

Circles divided into sectors, usually to show the component parts of a whole.

Variable

A characteristic of interest that's possessed by each item under study. The value of this characteristic is likely to change or vary from one item in the data set to the next

Discrete variable

A variable that has a countable or finite number of distinct values.

Continuous variable

A variable that can assume any one of the countless number of values along a line interval.

Frequency distribution (frequency table)

Groups data items into classes and then records the number of items that appear in each class
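
As a rough illustration, a frequency table can be built by assigning each value to an equal-width class and counting. The function name `frequency_distribution`, its parameters, and the data values below are assumptions made for this sketch. Each class includes its lower limit and excludes its upper limit.

```python
def frequency_distribution(data, start, width, num_classes):
    """Count how many items fall in each class [low, high)."""
    counts = [0] * num_classes
    for x in data:
        index = int((x - start) // width)
        if 0 <= index < num_classes:  # values outside all classes are ignored
            counts[index] += 1
    return [
        ((start + i * width, start + (i + 1) * width), counts[i])
        for i in range(num_classes)
    ]


# Illustrative data only.
data = [12, 15, 21, 22, 22, 29, 31, 35, 38, 44]
table = frequency_distribution(data, start=10, width=10, num_classes=4)
for (low, high), count in table:
    # The row of asterisks gives a crude text histogram of the same table.
    print(f"{low}-{high}: {count} {'*' * count}")
```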

Histogram

A bar graph that portrays the data found in a frequency distribution

Frequency Polygon

A line chart that depicts the data found in a frequency distribution. It is thus a picture that may be used as an alternative to the histogram.

Exploratory data analysis (EDA)

A term that refers to several techniques that analysts can use to get a feel for the data being studied, such as stem-and-leaf displays, dotplots, and boxplots.

Stem-and-leaf display

Uses the actual data items in a data set to create a plot that looks like a histogram.
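
A minimal sketch of such a display, assuming non-negative whole numbers with the tens digit as the stem and the units digit as the leaf. The function name `stem_and_leaf` and the data are illustrative assumptions.

```python
def stem_and_leaf(data):
    """Group each value's last digit (leaf) under its leading digits (stem)."""
    stems = {}
    for x in sorted(data):  # sorting keeps stems and leaves in ascending order
        stems.setdefault(x // 10, []).append(x % 10)
    return stems


# Illustrative data only.
data = [12, 15, 21, 22, 22, 29, 31, 35, 38, 44]
display = stem_and_leaf(data)
for stem, leaves in display.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Each printed row is one class, and the length of the row plays the role of the bar height in a histogram while still showing every original value.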

Dotplot

A preliminary data analysis tool that groups the study data into many small classes or intervals and then shows each data item as a dot on a chart.

Skewed distribution

Occurs when a few values are much larger or smaller than the typical values found in the data set.

Trimmed Mean

A compromise between the mean and the median, calculated by dropping the smallest and largest values (or the two smallest and two largest, and so on) and then recalculating the mean.
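
A small sketch of the calculation, assuming `k` values are dropped from each end. The function name `trimmed_mean` and the data set, which contains one extreme value, are made up for the example.

```python
from statistics import mean, median


def trimmed_mean(values, k=1):
    """Mean after dropping the k smallest and k largest values."""
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k is too large for this data set")
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Illustrative data with one extreme value.
data = [2, 4, 5, 6, 8, 95]
print(mean(data))          # 20
print(median(data))        # 5.5
print(trimmed_mean(data))  # 5.75
```

The ordinary mean is pulled toward the extreme value, while the trimmed mean stays close to the median.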

Measure of dispersion

A measurement of the variability that exists in a data set.

Range (Dispersion)

The range is the simplest measure of dispersion: it is the difference between the highest and lowest values in an array. The range is used to report the movement of stock prices over a time period, or high and low temperature readings.

Mean absolute deviation (Dispersion, MAD)

$\text{MAD} = \dfrac{\sum |x - \bar{x}|}{n}$, where $\bar{x}$ is the mean ($\mu$ for a population)
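
The range and the mean absolute deviation can both be computed directly from their definitions. The function names `data_range` and `mean_absolute_deviation` and the data values are assumptions made for this sketch.

```python
def data_range(values):
    """Difference between the highest and lowest values."""
    return max(values) - min(values)


def mean_absolute_deviation(values):
    """Average absolute distance of the values from their mean."""
    center = sum(values) / len(values)
    return sum(abs(x - center) for x in values) / len(values)


# Illustrative data only: the mean is 6, the absolute deviations are 4, 2, 0, 2, 4.
data = [2, 4, 6, 8, 10]
print(data_range(data))               # 8
print(mean_absolute_deviation(data))  # 2.4
```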

Standard Deviation

The square root of the average of the squared deviation of the individual data items about their mean. Easier terms: The standard deviation is a measure of how far away items in a data set are from their mean.

Standard deviation population formula (σ)

$\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$

Standard deviation Sample formula (s)

$s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$, where $\bar{x}$ is the sample mean, used in place of $\mu$

Standard deviation Sample simplified (s)

$s = \sqrt{\dfrac{n \sum x^2 - (\sum x)^2}{n(n - 1)}}$
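
The three standard deviation formulas (population, sample, and the simplified sample form) can be checked against each other in code. This is a sketch under stated assumptions: the function names and the data values are invented for the example, and only the standard `math` module is used.

```python
import math


def population_sd(values):
    """Square root of the average squared deviation from the mean (divide by N)."""
    n = len(values)
    mu = sum(values) / n
    return math.sqrt(sum((x - mu) ** 2 for x in values) / n)


def sample_sd(values):
    """Same idea for a sample, dividing by n - 1 instead of n."""
    n = len(values)
    x_bar = sum(values) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))


def sample_sd_shortcut(values):
    """Simplified form that needs only the sum of x and the sum of x squared."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return math.sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))


# Illustrative data: mean 5, squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_sd(data))                 # 2.0
print(round(sample_sd(data), 4))           # 2.1381
print(round(sample_sd_shortcut(data), 4))  # 2.1381
```

The definitional and simplified sample formulas agree, as they should, since the simplified form is an algebraic rearrangement of the first.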

Chebyshev's theorem

States that the proportion of any data set that lies within k standard deviations of the mean (where k is any number greater than or equal to 1) is at least $1 - 1/k^2$

Chebyshev's examples

$k = 2$: $1 - 1/k^2 = 1 - 1/2^2 = 1 - 1/4 = 3/4$. This result means that at least 75% of the items in any data set, no matter how skewed it is, must lie within two standard deviations of the mean.
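
The bound can also be checked numerically. The sketch below computes the Chebyshev minimum and compares it with the actual proportion in a deliberately skewed data set. The function names and the data are assumptions made for illustration. The population standard deviation (`statistics.pstdev`) is used, since the guarantee is stated for the data set's own mean and standard deviation.

```python
from statistics import mean, pstdev


def chebyshev_bound(k):
    """Minimum proportion of any data set within k standard deviations of the mean."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1 - 1 / k ** 2


def proportion_within(values, k):
    """Actual proportion of values within k standard deviations of the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    inside = [x for x in values if abs(x - mu) <= k * sigma]
    return len(inside) / len(values)


# Illustrative, strongly skewed data.
skewed = [1, 1, 2, 2, 3, 3, 4, 5, 9, 40]
print(chebyshev_bound(2))                                  # 0.75
print(proportion_within(skewed, 2))                        # 0.9
print(proportion_within(skewed, 2) >= chebyshev_bound(2))  # True
```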

Empirical Rule

For bell-shaped (approximately normal) data:
About 68% of values lie within 1 standard deviation of the mean
About 95% lie within 2 standard deviations of the mean
About 99.7% lie within 3 standard deviations of the mean
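
These percentages come from the normal curve, which can be confirmed with the standard library's `statistics.NormalDist`. The helper name `normal_share_within` is an assumption made for this sketch.

```python
from statistics import NormalDist


def normal_share_within(k):
    """Share of a normal distribution within k standard deviations of its mean."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(normal_share_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```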

Standardization (Z score | standard score)

Takes a value from a data set and indicates how many standard deviations it is above or below the mean.

Z score formula

$z = \dfrac{x - \mu}{\sigma}$
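
A one-line function is enough to apply the z score formula. The name `z_score` and the numbers used (a mean of 70 and a standard deviation of 10) are hypothetical values chosen for the example.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


# Hypothetical values: mean 70, standard deviation 10.
print(z_score(85, 70, 10))  # 1.5
print(z_score(55, 70, 10))  # -1.5
```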

Percentile

The kth percentile, $P_k$, is a value such that at most k percent of the data are smaller in value than $P_k$, and at most (100 - k) percent are larger.

Percentile example

n = 50 | k = 40 percentile
50(40) / 100 = 20
The 20th position within the data set in order from lowest to highest
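
The position rule in this example can be sketched as follows. The function name `percentile_position` and the 50 data values are assumptions. Textbooks and software differ on how to handle whole-number versus fractional positions, so this follows only the simple rule shown on the card.

```python
def percentile_position(n, k):
    """Position of the kth percentile in n ordered values, using n * k / 100."""
    return n * k / 100


position = percentile_position(50, 40)
print(position)  # 20.0

# Hypothetical data set of 50 values, ordered from lowest to highest.
ordered = sorted(range(101, 151))
value = ordered[int(position) - 1]  # positions count from 1, list indexes from 0
print(value)  # 120
```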

Interquartile range (IQR)

The width of the interval containing the middle 50 percent of the values. The first quartile, Q1, is another name for the 25th percentile. The third quartile, Q3, is another name for the 75th percentile. The interquartile range is the distance or difference between Q3 and Q1.
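
A short sketch of the IQR calculation using the standard library's `statistics.quantiles`. Quartile conventions vary between textbooks and software, so the default method of that function is an assumption here and may not match a hand calculation exactly on other data. The data values are illustrative.

```python
from statistics import quantiles

# Illustrative data only.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# n=4 splits the data into quarters and returns the three cut points.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1
print(q1, q3, iqr)  # 3.0 9.0 6.0
```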

Box and whisker display

A box drawn around the IQR, with lines (whiskers) extending outward to cover the remaining 50% of the values.

Lower Hinge (Box Whisker Display)

The lower edge of the box, equal to the 25th percentile, Q1.

Upper Hinge (Box Whisker Display)

The upper edge of the box, equal to the 75th percentile, Q3.

Five number summary

Boxplot quick analysis:
Median | two hinges | smallest and largest values
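
The five values can be gathered with a small helper. This sketch treats the hinges as Q1 and Q3, as the cards do, and uses the default method of `statistics.quantiles`, which is only one of several quartile conventions. The function name `five_number_summary` and the data are assumptions.

```python
from statistics import median, quantiles


def five_number_summary(values):
    """Return (smallest, lower hinge, median, upper hinge, largest)."""
    ordered = sorted(values)
    q1, _, q3 = quantiles(ordered, n=4)
    return (ordered[0], q1, median(ordered), q3, ordered[-1])


# Illustrative data only.
summary = five_number_summary([7, 1, 11, 3, 9, 5, 2, 10, 4, 8, 6])
print(summary)  # (1, 3.0, 6, 9.0, 11)
```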

Skew coefficient (SK)

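$Sk = \dfrac{3(\bar{x} - Md)}{s}$, where $\bar{x}$ is the mean (estimating $\mu$), $Md$ is the median, and $s$ is the standard deviation (estimating $\sigma$)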
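
A minimal sketch of the skew coefficient, assuming the sample standard deviation is used for s. The function name `skew_coefficient` and both data sets are invented for the example.

```python
from statistics import mean, median, stdev


def skew_coefficient(values):
    """Sk = 3 * (mean - median) / s, using the sample standard deviation for s."""
    return 3 * (mean(values) - median(values)) / stdev(values)


symmetric = [1, 2, 3, 4, 5]
right_skewed = [2, 4, 5, 6, 8, 95]  # one extreme high value

print(skew_coefficient(symmetric))         # 0.0
print(skew_coefficient(right_skewed) > 0)  # True
```

A value of zero indicates that the mean and median coincide, a positive value indicates a tail toward larger values, and a negative value indicates a tail toward smaller values.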