STAT 2000 Test 1

Statistics

the science of designing studies and analyzing the data that those studies produce. Statistics is the science of learning from data.

Population

the entire set of subjects in which we are interested

Sample

subset of the population for whom we have data

Subject

an entity that we measure in a study

Parameter

a numerical value summarizing the population data

Statistic

a numerical value summarizing the sample data

Population Mean ( μ )

an average for a full population

Sample Mean ( x-bar )

an average calculated from a sample

Population Proportion ( p )

the proportion for an entire population

Sample Proportion ( p hat )

the proportion for just a sample

Design

how to obtain the data to answer questions of interest

Description

summarizing and describing the obtained data

Inference

making decisions and predictions based on the sample data

Random Sample

a sample in which every subject has some chance of being selected for the sample

simple random sampling

a sample in which every subject has an equal chance of being selected for the sample

Stratified Sampling

the population is divided into non-overlapping groups ( strata ) and a simple random sample is then obtained from each group

Cluster Sampling

the population is divided into non-overlapping groups and all individuals within the randomly selected group or groups are sampled

Systematic Sampling

selecting every kth subject from the population

Convenience Sampling

sampling where the individuals are easily obtained
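
The random sampling methods above can be sketched in a few lines of Python. This is only an illustration: the population of 20 subject IDs, the two groups "A" and "B", and all function names are made up for the example, and the same two groups stand in for both strata and clusters.

```python
import random

# Hypothetical population: subject IDs 1..20, split into two groups.
population = list(range(1, 21))
groups = {"A": population[:10], "B": population[10:]}

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def simple_random_sample(pop, n):
    """Every subject has the same chance of being selected."""
    return rng.sample(pop, n)

def stratified_sample(strata, n_per_stratum):
    """Take a simple random sample from every group (stratum)."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters):
    """Randomly pick whole groups and keep every subject in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [subject for name in chosen for subject in clusters[name]]

def systematic_sample(pop, k, start=0):
    """Take every kth subject, beginning at index `start`."""
    return pop[start::k]

print(simple_random_sample(population, 5))
print(stratified_sample(groups, 3))
print(cluster_sample(groups, 1))
print(systematic_sample(population, 5))
```

Stratified sampling draws a few subjects from every group, while cluster sampling keeps everyone from a few randomly chosen groups.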

Variable

a characteristic or property of an individual population unit

Categorical Data

classifies subjects based on some attribute or characteristic; each observation belongs to one of a set of categories

Quantitative Data

takes on numeric values

Discrete Variable

(Quantitative) there is a countable number of distinct possible values that the variable can equal

Continuous Variable

(Quantitative) for any two values of that variable there are an infinite number of other possible values in between

Frequency Table

lists the number of occurrences for each category in the data

Bar Graph

a graph constructed by putting the categories on the horizontal axis and the frequency/proportion on the vertical axis

Pareto Chart

a bar graph whose bars are drawn in decreasing order of frequency/proportion
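
A frequency table can be built directly from raw categorical data. The sketch below uses a made-up list of colors (not data from the course) and also prints the categories in decreasing order of frequency, which is the bar order a Pareto chart uses.

```python
from collections import Counter

# Hypothetical categorical data
colors = ["red", "blue", "red", "green", "blue", "red"]

counts = Counter(colors)                 # frequency of each category
total = sum(counts.values())
proportions = {c: n / total for c, n in counts.items()}

# most_common() lists categories from most to least frequent.
for category, n in counts.most_common():
    print(f"{category:<6} {n}  {proportions[category]:.2f}")
```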

Pie Chart

a circle divided into sectors. each sector represents a category of data with the size of each sector corresponding to the proportion of responses falling in that category

Histogram

a display that is similar to a bar graph, but shows quantitative data

Stem-and-leaf plot

looks like a bar graph on its side. the stem consists of all digits except for the final one, which is the leaf

Mode

the number in the data set that appears most often

Median

the central value of an ordered data set

Mean

the sum of all the numbers divided by the total number of numbers in the set

Mean < Median

graph is skewed left

Mean > Median

graph is skewed right
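
As a small check of the mean, median, and mode definitions and the skew rule, the sketch below uses Python's standard `statistics` module on a made-up data set with one large value. The "roughly symmetric" label for the case mean = median is an assumption added for completeness.

```python
import statistics

data = [2, 3, 3, 4, 5, 6, 19]     # hypothetical data with one large value

mean = statistics.mean(data)      # sum / count = 42 / 7
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequent value

if mean > median:
    shape = "skewed right"
elif mean < median:
    shape = "skewed left"
else:
    shape = "roughly symmetric"   # assumption: not stated in the notes

print(mean, median, mode, shape)
```

The single large value pulls the mean above the median, which is why the comparison points to a right skew.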

Variability

used to measure the spread or volatility contained in the data set

Range

the difference between the largest and the smallest values in the data

Variance

the average of the squared deviations from the mean, calculated using n-1 as the divisor

Standard Deviation

positive square root of the variance

Population Standard Deviation ( σ )

( SIGMA ) the standard deviation for an entire population

Sample Standard Deviation ( s )

the standard deviation for a sample
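
The variance and standard deviation definitions can be worked through step by step. The data below are made up. The population version, which divides by n instead of n-1, is included as the usual convention and is an assumption here, since the notes only give the n-1 divisor.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data

n = len(data)
mean = sum(data) / n
squared_deviations = [(x - mean) ** 2 for x in data]

# Treating the data as a sample: divisor n - 1
sample_variance = sum(squared_deviations) / (n - 1)
sample_sd = math.sqrt(sample_variance)            # s

# Treating the data as the whole population: divisor n (assumed convention)
population_variance = sum(squared_deviations) / n
population_sd = math.sqrt(population_variance)    # sigma

# The standard library offers the same calculations.
print(sample_sd, statistics.stdev(data))
print(population_sd, statistics.pstdev(data))
```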

Empirical Rule

if a distribution is bell-shaped, we can approximate the percentage of data that lie within 1, 2, or 3 standard deviations of the mean using this rule ( 1 = 68%, 2 = 95%, 3 = 99.7%, or almost all of the data )
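
To apply the rule, compute the interval mean ± k standard deviations for k = 1, 2, 3. A minimal sketch, assuming a bell-shaped distribution with a made-up mean of 100 and standard deviation of 15 (the function name is hypothetical):

```python
def empirical_rule_intervals(mean, sd):
    """Intervals within 1, 2, and 3 standard deviations of the mean."""
    coverage = {1: "about 68%", 2: "about 95%", 3: "almost all"}
    return {
        k: (mean - k * sd, mean + k * sd, label)
        for k, label in coverage.items()
    }

# Hypothetical bell-shaped data: mean 100, standard deviation 15
for k, (low, high, label) in empirical_rule_intervals(100, 15).items():
    print(f"within {k} SD: {low} to {high} ({label})")
```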

Quartiles

specific percentiles that split the data into quarters ( 3 of them)

First Quartile ( Q1)

a value such that 25% of the data values are smaller than Q1 and 75% are larger

Second Quartile ( Q2)

a value such that 50% of the data values are smaller than Q2 and 50% are larger ( median)

Third Quartile ( Q3)

a value such that 75% of the data values are smaller than Q3 and 25% are larger

Outliers

extreme observations in the data that often occur because of error in the measurement of the variable, during data entry, or from errors in sampling

Interquartile Range (IQR)

the difference between the third and first quartile; represents the range covered by the middle 50% of the data ( Q3 - Q1 ). values below ( Q1 - 1.5 × IQR ) or above ( Q3 + 1.5 × IQR ) are outliers

5-number summary

the five values that split the data into quarters ( minimum, Q1, Q2 (median), Q3, maximum)

Boxplot

graphical representation of the five number summary

Distribution for Boxplots

median left of center of box and/or the right line is much longer than the left line = skewed right
median right of center of box and/or the left line is much longer than the right line = skewed left

Z-Score

measures the position a value has in the data set, relative to the mean ( measured in standard deviations)
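
In formula form, a z-score is (value - mean) / standard deviation. A minimal sketch with made-up exam numbers (mean 70, standard deviation 10):

```python
def z_score(value, mean, sd):
    """Standard deviations that `value` lies above (+) or below (-) the mean."""
    return (value - mean) / sd

# Hypothetical exam scores: mean 70, standard deviation 10
print(z_score(85, 70, 10))   # 1.5 standard deviations above the mean
print(z_score(60, 70, 10))   # 1 standard deviation below the mean
```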

Response Variable

a variable that can be explained by, or is determined by, another variable (y-axis)

Explanatory Variable

explains, or affects, the response variable ( x-axis)

Association

exists between two variables if a particular value for one variable is more likely to occur with certain values of the other variable

Lurking Variable

a variable that is related to the response or explanatory variable ( or both), but is not the variable being studied

Contingency Table ( 2-way table)

a table that relates two categorical variables. each box inside the table is referred to as a cell

Conditional Proportion

the proportion for a value of a variable, given a specific value of the other variable

Relative Risk

conditional proportion for one group/ conditional proportion for another group
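
Conditional proportions and relative risk can be read off a contingency table. The table below is invented for illustration (two groups, two outcomes), and the function names are hypothetical.

```python
# Hypothetical 2-way table: rows are groups, columns are outcomes.
table = {
    "treatment": {"sick": 10, "healthy": 90},
    "placebo":   {"sick": 20, "healthy": 80},
}

def conditional_proportion(table, group, outcome):
    """Proportion of `outcome` among the subjects in `group`."""
    row = table[group]
    return row[outcome] / sum(row.values())

def relative_risk(table, group_a, group_b, outcome):
    """Ratio of the two groups' conditional proportions for `outcome`."""
    return (conditional_proportion(table, group_a, outcome)
            / conditional_proportion(table, group_b, outcome))

p_treatment = conditional_proportion(table, "treatment", "sick")   # 10 / 100
p_placebo = conditional_proportion(table, "placebo", "sick")       # 20 / 100
rr = relative_risk(table, "placebo", "treatment", "sick")
print(p_treatment, p_placebo, rr)
```

In this made-up table the placebo group is twice as likely to be sick as the treatment group, so the relative risk is 2.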

Scatterplot

a graphical display for two quantitative variables

Positive Association

exists between two variables if as x increases, y also increases

Negative Association

exists between two variables if as x increases, y actually decreases

No Association

as x increases, there is no definite shift in the values of y

Linear Correlation

exists when the data tend to follow a straight line path. if as x increases, y also increases it is a positive correlation; or if as x increases, y decreases it is a negative correlation

No Correlation

as x increases there is no definite shift in the values of y ( no linear relationship between x and y )

Regression Line

predicts the value for the response variable ( y) as a straight-line function of the value of the explanatory variable (x)

Residual

the difference between the actual value and the predicted value of y ( y - ŷ )
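
A regression line and its residuals can be computed by hand. The sketch below assumes the usual least-squares fit (the notes do not say how the line is chosen) and uses made-up x and y values; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y-hat = intercept + slope * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = fit_line(xs, ys)
predicted = [intercept + slope * x for x in xs]             # y-hat for each x
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]  # actual - predicted

print(slope, intercept)
print(residuals)
```

A positive residual means the point lies above the line; a negative one means it lies below. For a least-squares line the residuals sum to zero, up to rounding.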

Extrapolation

using the regression line to predict y for x values outside the range of the observed data. such predictions are unreliable; the line should only be used for observations whose x values are similar to those in our data