Descriptive statistics
Consists of collecting, summarizing, and presenting sample data using numerical and graphical methods
Inferential statistics
Consists of making estimates, decisions, predictions, or other generalizations about a larger set of data based on sampling
Population
The complete collection of measurements, objects, or individuals under study
Sample
A portion or subset taken from a population
Parameter
A numerical characteristic of a population
Statistic
A numerical characteristic of a sample
Inferential statistics
Reaching conclusions that extend beyond the data actually observed, for example generalizing from a sample to the larger population.
Experiment
The process of subjecting experimental units to treatments and observing the effects
Simple random sample
A sample selected in such a way that every possible sample of a given size has the same chance of being picked, and every item in the population has an equal chance of being selected
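As a minimal sketch of the idea, Python's standard `random.sample` draws without replacement so that every subset of the requested size is equally likely. The population of 100 numbered items and the helper name `simple_random_sample` are assumptions made for illustration.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct items; every size-n subset is equally likely."""
    rng = random.Random(seed)          # seed only makes the draw repeatable
    return rng.sample(list(population), n)

population = list(range(1, 101))       # hypothetical population of 100 items
sample = simple_random_sample(population, 10, seed=42)
```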
Arithmetic mean
The arithmetic mean of a group of values is a measure of central tendency found by adding all the values to get a total and then dividing the total by the number of values
Median
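The median of a group of values is the value that occupies the middle position after all the values are arranged in ascending or descending order. With an even number of values, it is the average of the two middle values (1 2 3 4 5 6, Median = 3.5)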
Mode
The mode of a group of values is the value that occurs most often. If no value occurs more than once, there is no mode. When two values tie for the greatest count, the data set is said to be bimodal. (1 2 3 3 4 5 5 5 6, Mode = 5)
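The three measures of central tendency can be checked with a short sketch that reuses the example values from these cards. The helper `modes` is a hypothetical name; it follows the card's convention of reporting no mode when nothing repeats, which differs from `statistics.multimode`.

```python
import statistics
from collections import Counter

def modes(data):
    """Return the most frequent value(s); empty list if nothing repeats."""
    counts = Counter(data)
    top = max(counts.values())
    if top == 1:
        return []                      # no value occurs more than once
    return sorted(v for v, c in counts.items() if c == top)

values = [1, 2, 3, 4, 5, 6]
mean_value = statistics.mean(values)       # 21 / 6 = 3.5
median_value = statistics.median(values)   # average of 3 and 4 = 3.5

scores = [1, 2, 3, 3, 4, 5, 5, 5, 6]
mode_values = modes(scores)                # [5]
bimodal = modes([1, 1, 2, 3, 3])           # [1, 3], a two-way tie
```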
Dispersion
The amount of spread or scatter that occurs in the data.
Pie Charts
Circles divided into sectors, usually to show the component parts of a whole.
Variable
A characteristic of interest that's possessed by each item under study. The value of this characteristic is likely to change or vary from one item in the data set to the next
Discrete variable
A variable that has a countable or finite number of distinct values.
Continuous variable
A variable that can assume any one of the countless values along a line interval.
frequency distribution (Frequency Table)
This groups data items into classes and then records the number of items that appear in each class
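A rough sketch of building a frequency table follows. The data, the class width of 10, and the function name `frequency_table` are all assumptions chosen for illustration; each class is identified by its lower limit.

```python
from collections import Counter

def frequency_table(data, class_width, start=0):
    """Count the items falling in each class [lower, lower + class_width)."""
    counts = Counter()
    for x in data:
        lower = start + ((x - start) // class_width) * class_width
        counts[lower] += 1
    return dict(sorted(counts.items()))

data = [3, 7, 12, 15, 18, 21, 22, 29]      # hypothetical data items
table = frequency_table(data, class_width=10)
# table maps each lower class limit to a count: {0: 2, 10: 3, 20: 3}
```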
Histogram
A bar graph that portrays the data found in a frequency distribution
Frequency Polygon
A line chart that depicts the data found in a frequency distribution. It is thus a picture that may be used as an alternative to the histogram.
Exploratory data analysis (EDA)
A term that refers to several techniques that analysts can use to get a feel for the data being studied. Examples include stem-and-leaf displays, dotplots, and boxplots.
Stem and leaf display
Uses the actual data items in a data set to create a plot that looks like a histogram.
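One way to sketch a stem-and-leaf display, assuming non-negative whole numbers where the tens digit is the stem and the units digit is the leaf (the data and the function name `stem_and_leaf` are illustrative only):

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Group each value's units digit (leaf) under its tens digit (stem)."""
    stems = defaultdict(list)
    for x in sorted(data):
        stems[x // 10].append(x % 10)
    return dict(stems)

display = stem_and_leaf([12, 15, 21, 23, 23, 30, 38])
for stem, leaves in display.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Each printed row keeps the original digits, so the rows form a sideways histogram without losing the data items.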
Dotplot
A preliminary data analysis tool that groups the study data into many small classes or intervals and then shows each data item as a dot on a chart.
Skewed distribution
Occurs when a few values are much larger or smaller than the typical values found in the data set.
Trimmed Mean
A compromise between the mean and the median, calculated by dropping the smallest and largest values (or the two smallest and two largest, and so on) and then recalculating the mean
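A small sketch of the trimming step, assuming `k` values are dropped from each end; the function name `trimmed_mean` and the data are hypothetical.

```python
def trimmed_mean(data, k=1):
    """Mean after dropping the k smallest and k largest values."""
    ordered = sorted(data)
    if 2 * k >= len(ordered):
        raise ValueError("k removes every value")
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

values = [2, 4, 5, 6, 100]                 # one extreme value
plain_mean = sum(values) / len(values)     # 23.4, pulled up by 100
trimmed = trimmed_mean(values, k=1)        # (4 + 5 + 6) / 3 = 5.0
```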
Measure of dispersion
A measurement of the variability that exists in a data set.
Range (Dispersion)
The range is the simplest measure of dispersion: it is merely the difference between the highest and lowest values in an array. The range is used to report the movement of stock prices over a time period, or high and low temperatures.
Mean absolute deviation (Dispersion, MAD)
$\bar{x}$ = sample mean | $\text{MAD} = \frac{\sum |x - \bar{x}|}{n}$
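The mean absolute deviation can be sketched directly from its definition (average distance of the items from their mean). The data values and the function name are illustrative assumptions.

```python
def mean_absolute_deviation(data):
    """Average of the absolute deviations from the mean."""
    n = len(data)
    mean = sum(data) / n
    return sum(abs(x - mean) for x in data) / n

mad = mean_absolute_deviation([2, 4, 6, 8])
# mean is 5; absolute deviations are 3, 1, 1, 3; MAD = 8 / 4 = 2.0
```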
Standard Deviation
The square root of the average of the squared deviation of the individual data items about their mean. Easier terms: The standard deviation is a measure of how far away items in a data set are from their mean.
Standard deviation population formula (σ)
$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
Standard deviation Sample formula (s)
$\bar{x}$ = sample mean | $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
Standard deviation Sample simplified (s)
$s = \sqrt{\frac{n \sum x^2 - (\sum x)^2}{n(n - 1)}}$
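The three standard deviation formulas can be compared in a short sketch. The function names and the data set are assumptions for illustration; the simplified sample formula should agree with the definitional one up to rounding.

```python
import math

def population_sd(data):
    """Square root of the mean squared deviation, dividing by N."""
    n = len(data)
    mu = sum(data) / n
    return math.sqrt(sum((x - mu) ** 2 for x in data) / n)

def sample_sd(data):
    """Same idea for a sample, dividing by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

def sample_sd_shortcut(data):
    """Simplified form using only the sum of x and the sum of x squared."""
    n = len(data)
    sum_x = sum(data)
    sum_x2 = sum(x * x for x in data)
    return math.sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))

data = [2, 4, 4, 4, 5, 5, 7, 9]            # hypothetical data, mean 5
sigma = population_sd(data)                # sqrt(32 / 8) = 2.0
s = sample_sd(data)                        # sqrt(32 / 7), about 2.138
s_short = sample_sd_shortcut(data)         # sqrt(256 / 56), same value
```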
Chebyshev's theorem
States that the proportion of any data set that lies within k standard deviations of the mean (where k is any positive number greater than or equal to 1) is at least $1 - 1/k^2$
Chebyshev's examples
k = 2 | $1 - 1/k^2 = 1 - 1/2^2 = 1 - 1/4 = 3/4$ | This result means that at least 75% of the items in any data set, no matter how skewed it is, must lie within two standard deviations of the mean.
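A sketch of the bound, together with a check against a deliberately skewed data set. The data and both function names are assumptions; the check uses the population standard deviation of the data set itself.

```python
import math

def chebyshev_lower_bound(k):
    """Minimum proportion of items within k standard deviations of the mean."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1 - 1 / k ** 2

def proportion_within(data, k):
    """Observed proportion of items within k standard deviations of the mean."""
    n = len(data)
    mean = sum(data) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    return sum(1 for x in data if abs(x - mean) <= k * sd) / n

bound = chebyshev_lower_bound(2)           # 1 - 1/4 = 0.75
skewed = [1, 1, 1, 2, 2, 3, 50]            # hypothetical, heavily skewed
observed = proportion_within(skewed, 2)    # 6 of the 7 items
```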
Empirical Rule
For bell-shaped (approximately normal) distributions:
About 68% of values lie within 1 standard deviation of the mean
About 95% lie within 2 standard deviations of the mean
About 99.7% lie within 3 standard deviations of the mean
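These percentages describe a normal (bell-shaped) distribution, and can be reproduced with `statistics.NormalDist` from the Python standard library (3.8 or later). The helper name `share_within` is an assumption for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

within_1 = share_within(1)   # roughly 0.68
within_2 = share_within(2)   # roughly 0.95
within_3 = share_within(3)   # roughly 0.997
```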
Standardization (Z score | standard score)
Takes a value from a data set and indicates how many standard deviations it is above or below the mean.
Z score formula
$z = \frac{x - \mu}{\sigma}$
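A minimal sketch of standardization; the numbers (a value of 85 in a data set with mean 70 and standard deviation 10) are made up for illustration.

```python
def z_score(x, mean, sd):
    """Number of standard deviations x lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd

above = z_score(85, mean=70, sd=10)   # (85 - 70) / 10 = 1.5
below = z_score(60, mean=70, sd=10)   # (60 - 70) / 10 = -1.0
```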
Percentile
The kth percentile, $P_k$, is a value such that at most k percent of the data are smaller in value than $P_k$, and at most (100 - k) percent are larger.
Percentile example
n = 50 | k = 40 (40th percentile)
Position = n(k) / 100 = 50(40) / 100 = 20
The 40th percentile is found at the 20th position within the data set when ordered from lowest to highest
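The position calculation can be sketched as below. Textbooks differ on how to handle fractional positions; this sketch assumes they are rounded up, which is only one common convention, and the data (the integers 1 to 50) are hypothetical.

```python
import math

def percentile_position(n, k):
    """1-based position of the kth percentile among n ordered items."""
    return max(math.ceil(n * k / 100), 1)   # assumption: round fractions up

def percentile_value(data, k):
    """Value found at the kth percentile position of the ordered data."""
    ordered = sorted(data)
    return ordered[percentile_position(len(ordered), k) - 1]

position = percentile_position(50, 40)      # 50(40) / 100 = 20
data = list(range(1, 51))                   # hypothetical data: 1, 2, ..., 50
p40 = percentile_value(data, 40)            # the 20th ordered item, 20
```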
Interquartile range (IQR)
The width of the interval containing the middle 50 percent of the values. The first quartile, Q1, is another name for the 25th percentile. The third quartile, Q3, is another name for the 75th percentile. The interquartile range is the distance or difference between Q1 and Q3.
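As a sketch, the quartiles can be taken from `statistics.quantiles` in the Python standard library. Its default "exclusive" method is one of several quartile conventions, so results may differ slightly from a hand calculation; the data are hypothetical.

```python
import statistics

data = list(range(1, 12))                        # hypothetical data: 1 to 11
q1, q2, q3 = statistics.quantiles(data, n=4)     # quartile cut points
iqr = q3 - q1                                    # 9.0 - 3.0 = 6.0
```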
Box and whisker display
A box drawn around the IQR, with lines (whiskers) extending outward to cover the remaining 50% of the values.
Lower Hinge (Box Whisker Display)
The lower limit of the box, equal to the 25th percentile, Q1.
Upper Hinge (Box Whisker Display)
The upper limit of the box, equal to the 75th percentile, Q3.
Five number summary
Boxplot quick analysis:
Median | two hinges | smallest and largest values
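A sketch of the five number summary, following these cards in treating the hinges as Q1 and Q3. The function name and data are assumptions, and `statistics.quantiles` uses its default quartile convention.

```python
import statistics

def five_number_summary(data):
    """Return (smallest, lower hinge, median, upper hinge, largest)."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return (min(data), q1, statistics.median(data), q3, max(data))

summary = five_number_summary(list(range(1, 12)))   # hypothetical data: 1 to 11
# summary is (1, 3.0, 6, 9.0, 11)
```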
Skew coefficient (SK)
$\bar{x}$ = sample mean | s = standard deviation | Md = median | $Sk = \frac{3(\bar{x} - Md)}{s}$
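The skew coefficient can be sketched with the standard library, assuming s is the sample standard deviation (`statistics.stdev`). A positive result points to a tail of large values, a negative result to a tail of small values. The data sets are hypothetical.

```python
import statistics

def skew_coefficient(data):
    """Sk = 3 * (mean - median) / s, with s the sample standard deviation."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    s = statistics.stdev(data)
    return 3 * (mean - median) / s

right_skewed = skew_coefficient([1, 2, 3, 4, 10])   # mean 4 > median 3
symmetric = skew_coefficient([1, 2, 3, 4, 5])       # mean equals median
```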