Statistics
the science of conducting studies to collect, organize, summarize, analyze, and draw conclusions from data.
Variable
a characteristic or attribute that can assume different values.
Data
the values (measurements or observations) that the variables can assume. Information.
Random Variables
variables whose values are determined by chance.
Data Set
a collection of data values.
Data Value (Datum)
each value in the data set.
Descriptive Statistics
consists of the collection, organization, summarization, and presentation of data. Describing a situation.
Inferential Statistics
consists of generalizing from samples to populations, performing estimations & hypothesis tests, determining relationships among variables, & making predictions.
Probability
the chance of an event occurring.
Population
consists of all subjects (human or otherwise) that are being studied.
Sample
a group of subjects selected from a population.
Hypothesis Testing
a decision-making process for evaluating claims about a population, based on information obtained from samples.
Placebo
substance with no medical benefit or harm.
Qualitative Variable
variables that can be placed into distinct categories according to some characteristic or attribute (e.g. gender: male/female; religion: Catholic, Muslim, Hindu, Mormon).
Quantitative Variable
variables that are numerical and can be ordered or ranked (e.g. age or height). Can be classified into 2 groups: Discrete & Continuous.
Discrete Variable
quantitative variables that assume values that can be counted.
Continuous Variable
quantitative variables that can assume an infinite number of values between any 2 specific values. They are obtained by measuring and often include fractions or decimals.
Nominal Level of Measurement
classifies data into mutually exclusive (non-overlapping), exhaustive categories in which no order or ranking can be imposed on the data (e.g. political parties).
Ordinal Level of Measurement
classifies data into categories that can be ranked; however, precise differences between the ranks do not exist (e.g. letter grades A, B, C, D, or 1st, 2nd, & 3rd place).
Interval Level of Measurement
ranks data, and precise differences between units of measurement do exist; however, there is no meaningful zero (e.g. temperature: 0 degrees Fahrenheit does not mean the absence of heat).
Ratio Level of Measurement
possesses all the characteristics of interval measurement, and there exists a true zero. In addition, true ratios exist when the same variable is measured on 2 different members of the population.
Random Sample
selected by using chance methods or random numbers.
Systematic Sample
obtained by numbering each subject of the population and then selecting every kth subject.
Stratified Sample
obtained by dividing the population into groups (called strata) according to some characteristic that is important to the study, then sampling from each group.
Cluster Sample
the population is divided into groups called clusters by some means. Some of the clusters are selected, and all of the members of the cluster are used.
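The four sampling methods above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population is a plain list, clusters are given as a dictionary, and the function names (`systematic_sample`, `stratified_sample`, `cluster_sample`) and their parameters are made up for this example.

```python
import random

def simple_random_sample(population, n, rng=random):
    """Random sample: n subjects chosen by chance, without replacement."""
    return rng.sample(population, n)

def systematic_sample(population, k, start=None, rng=random):
    """Systematic sample: every kth subject after a random starting point."""
    if start is None:
        start = rng.randrange(k)  # random start among the first k subjects
    return population[start::k]

def stratified_sample(population, stratum_of, per_stratum, rng=random):
    """Stratified sample: split into strata, then sample from EACH stratum."""
    strata = {}
    for subject in population:
        strata.setdefault(stratum_of(subject), []).append(subject)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample

def cluster_sample(clusters, n_clusters, rng=random):
    """Cluster sample: pick some clusters, then use ALL members of each."""
    chosen = rng.sample(list(clusters), n_clusters)
    return [subject for name in chosen for subject in clusters[name]]
```

Note the contrast the code makes visible: a stratified sample takes a few subjects from every group, while a cluster sample takes every subject from a few groups.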
Observational Study
the researcher merely observes what is happening or what has happened in the past and tries to draw conclusions based on these observations.
Experimental Study
the researcher manipulates one of the variables and tries to determine how the manipulation influences other variables.
Quasi-Experimental Study
an experimental study that uses already intact groups.
Independent Variable
a variable that is being manipulated by the researcher. Also called the Explanatory Variable.
Dependent Variable
the variable that is studied to see if it has changed significantly due to the manipulation of the independent variable. Also called the Outcome Variable.
Treatment Group
a group, in a study, that receives special instructions, or some type of special treatment.
Control Group
a group in an experimental study that is not given any specific instructions or special treatment.
Hawthorne Effect
occurs when subjects who know they are participating in an experimental study change their behavior in ways that affect the results of the study.
Confounding Variable
a variable that influences the outcome variable, but was not separated from the independent variable.
Detached Statistic
a claim in which no comparison is made.
Implied Connection
a claim that attempts to imply a connection between variables that may not actually exist.
Raw Data
data in its original form.
Frequency Distribution
the organization of raw data in table form, using classes and frequencies.
Class
a quantitative or qualitative category into which raw data values are placed.
Frequency
the number of data values contained in a specific class.
Categorical Frequency Distribution
used for data that can be placed into specific categories, such as nominal or ordinal level data.
Grouped Frequency Distribution
used when the range of the data is large, and must be grouped into classes that are more than one unit in width.
Lower Class Limit
represents the smallest data value that can be included in a class.
Upper Class Limit
represents the largest data value that can be included in a class.
Class Boundary
used to separate classes so that there are no gaps in the frequency distribution.
Class Width
found, for a grouped frequency distribution, by subtracting the lower (or upper) class limit of one class from the lower (or upper) class limit of the next class.
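As a rough illustration of how class limits, class width, and frequencies fit together, the sketch below tallies integer data into equal-width classes. The function name `grouped_frequency` and its parameters are hypothetical, and it assumes whole-number data so that each class's upper limit is one less than the next class's lower limit.

```python
def grouped_frequency(data, lowest_limit, width, n_classes):
    """Return a list of (lower_limit, upper_limit, frequency) tuples."""
    table = []
    for i in range(n_classes):
        lower = lowest_limit + i * width
        upper = lower + width - 1  # inclusive upper class limit (integer data)
        frequency = sum(1 for x in data if lower <= x <= upper)
        table.append((lower, upper, frequency))
    return table

scores = [12, 15, 21, 22, 22, 27, 30, 34, 38, 41]
for lower, upper, frequency in grouped_frequency(scores, 10, 10, 4):
    print(f"{lower}-{upper}: {frequency}")
# 10-19: 2
# 20-29: 4
# 30-39: 3
# 40-49: 1
```

Here the class width is 20 - 10 = 10, the difference between the lower limits of two consecutive classes.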
Class Limit Recommendations
A) classes must be equal in width. B) There should be between 5 & 20 classes. C) Preferably an odd quantity of classes (this makes finding the class median easier). D) classes must be mutually exclusive. E) classes must be continuous. F) classes must be e
Cumulative Frequency Distribution
a distribution that shows the number of data values less than or equal to a specific value, usually an upper boundary.
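A cumulative frequency distribution is a running total of the class frequencies. A minimal sketch, assuming the frequencies are already listed in class order (the function name is illustrative):

```python
from itertools import accumulate

def cumulative_frequencies(frequencies):
    """Running total of class frequencies, in class order."""
    return list(accumulate(frequencies))

print(cumulative_frequencies([2, 4, 3, 1]))  # [2, 6, 9, 10]
```

The last cumulative frequency always equals the total number of data values.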
Ungrouped Frequency Distribution
a frequency distribution that can be constructed using single data values for each class. This is used when the range of data values is relatively small.
Histogram
a graph that displays the data by using contiguous vertical bars of various heights to represent the frequencies of the class.
Frequency Polygon
a graph that displays the data by using lines that connect points plotted for the frequencies at the midpoints of the classes. The heights of the points represent the frequencies.
Cumulative Frequencies
the sum of the frequencies accumulated up to the upper boundary of a class in the distribution.
Ogive
a graph that represents the cumulative frequencies for the classes in a frequency distribution.
Relative Frequency Graph
a graph using proportions instead of raw data as frequencies.
Bar Graph
represents the data by using vertical or horizontal bars whose heights or lengths represent the frequency of the data.
Pareto Chart
used to represent a frequency distribution for a categorical variable; the frequencies are displayed by the heights of vertical bars, which are arranged in order from highest to lowest.
Time Series Graph
represents data that occur over a specific period of time.
Pie Graph
a circle that is divided into sections or wedges according to the percentage of frequencies in each category of the distribution.
Stem & Leaf Plots
a data plot that uses part of the data value as the stem and part of the data value as the leaf to form groups or classes.
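One way to see how the stem and leaf split works is the sketch below. It assumes non-negative whole numbers, with the ones digit as the leaf and everything to its left as the stem; the function name `stem_and_leaf` is made up for this example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem to its sorted leaves (non-negative integers only)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 32 -> stem 3, leaf 2
        plot[stem].append(leaf)
    return dict(plot)

data = [25, 31, 20, 32, 13, 14, 43, 2, 57, 23, 36, 32]
for stem, leaves in stem_and_leaf(data).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
# 0 | 2
# 1 | 3 4
# 2 | 0 3 5
# 3 | 1 2 2 6
# 4 | 3
# 5 | 7
```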
a Statistic
a characteristic or measure obtained by using the data values from a sample.
a Parameter
a characteristic or measure obtained by using all the data values from a specific population.
Mean
Also known as the arithmetic average, the mean is the sum of the values divided by the total number of values.
Median
the midpoint of the data array. Before you can find this point, the data must be arranged in numerical order from lowest to highest.
Mode
the value that occurs most often in the data value set.
Bimodal
data values consisting of 2 modes.
Multimodal
data values consisting of 3 or more modes.
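The mean, median, and mode can be checked quickly with Python's standard `statistics` module. The wrapper function `central_tendency` is only a convenience invented for this sketch; `statistics.multimode` requires Python 3.8 or later.

```python
import statistics

def central_tendency(data):
    """Return the mean, median, and list of modes for a data set."""
    return {
        "mean": statistics.mean(data),       # sum of values / number of values
        "median": statistics.median(data),   # midpoint of the ordered data
        "modes": statistics.multimode(data), # every most-frequent value
    }

print(central_tendency([2, 3, 3, 5, 7, 10]))
# mean is 5, median is 4.0, modes is [3]

print(central_tendency([1, 1, 2, 2, 3])["modes"])  # [1, 2] (bimodal)
```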
Modal Class
mode for grouped data. The class with the largest frequency.
Outliers
an extremely high or extremely low data value in the data set.
Identifying Outliers
1) arrange the data in order, and find Q1 & Q3. 2) Find the IQR. 3) Multiply the IQR by 1.5. 4) Subtract that value from Q1 & add that value to Q3. 5) Check the data set for any data value that is smaller than Q1-1.5(IQR), or larger than Q3+1.5(IQR).
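The five steps above can be sketched as follows. Quartiles can be computed by several conventions; this sketch assumes Q1 and Q3 are the medians of the lower and upper halves of the ordered data (excluding the middle value when the count is odd), so other textbooks or software may give slightly different quartiles. The function names are illustrative.

```python
import statistics

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]
    # Skip the middle value when the number of data values is odd.
    upper = ordered[half + 1:] if len(ordered) % 2 else ordered[half:]
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

def find_outliers(data):
    """Values below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR)."""
    q1, _, q3 = quartiles(data)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]

data = [5, 6, 12, 13, 15, 18, 22, 50]
print(quartiles(data))      # (9.0, 14.0, 20.0)
print(find_outliers(data))  # [50]
```

For this data set the IQR is 20 - 9 = 11, so the fences are 9 - 16.5 = -7.5 and 20 + 16.5 = 36.5, and only 50 falls outside them.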
Midrange
a rough estimate of the middle, found by adding the lowest & the highest data values in the data set and dividing by 2.
Weighted Mean
used when the values are not all equally represented. This is found by multiplying each value by its corresponding weight & dividing the sum of the products by the sum of the weights.
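A small sketch of the weighted mean, using a grade point average as the example. The function name and the sample grades and credit hours are invented for illustration.

```python
def weighted_mean(values, weights):
    """Sum of (value * weight) divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Grade points (A=4, B=3, C=2) weighted by credit hours.
grade_points = [4, 3, 2]
credit_hours = [3, 3, 4]
print(weighted_mean(grade_points, credit_hours))  # (12 + 9 + 8) / 10 = 2.9
```

When every weight is the same, the weighted mean reduces to the ordinary mean.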
Range
the highest value minus the lowest value.
Variance
the average of the squares of the distance each value is from the mean.
Standard Deviation
the square root of the variance.
Coefficient of Variation
the standard deviation divided by the mean, usually expressed as a percentage.
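The three measures of spread above can be computed with the standard `statistics` module. This sketch follows the definition given here (the average of the squared distances), which is the population variance; for a sample, `statistics.variance` and `statistics.stdev` divide by n - 1 instead. The function name `spread` is made up for this example.

```python
import statistics

def spread(data):
    """Population variance, standard deviation, and coefficient of variation."""
    mean = statistics.mean(data)
    variance = statistics.pvariance(data)  # average squared distance from mean
    std_dev = statistics.pstdev(data)      # square root of the variance
    return {
        "variance": variance,
        "std_dev": std_dev,
        "cv_percent": 100 * std_dev / mean,
    }

print(spread([2, 4, 4, 4, 5, 5, 7, 9]))
# variance is 4, std_dev is 2.0, cv_percent is 40.0
```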
The Empirical Rule
when applied to a bell-shaped distribution: A) approximately 68% of the data will fall within 1 standard deviation of the mean. B) approximately 95% of the data will fall within 2 standard deviations of the mean. C) approximately 99.7% of the data will fall within 3 standard deviations of the mean.
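To make the rule concrete, the sketch below turns a mean and standard deviation into the three intervals. The function name and the example values (mean 100, standard deviation 15) are illustrative, and the percentages are approximations that only apply to bell-shaped data.

```python
def empirical_intervals(mean, std_dev):
    """Map each approximate percentage to its (low, high) interval."""
    return {
        percent: (mean - k * std_dev, mean + k * std_dev)
        for k, percent in ((1, 68), (2, 95), (3, 99.7))
    }

for percent, (low, high) in empirical_intervals(100, 15).items():
    print(f"about {percent}% of values fall between {low} and {high}")
# about 68% of values fall between 85 and 115
# about 95% of values fall between 70 and 130
# about 99.7% of values fall between 55 and 145
```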
Standard or Z Score
a score for a value obtained by subtracting the mean from the value and dividing the result by the standard deviation. If Z = 0, then the data value = the mean. Z = (value - mean) / standard deviation.
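A one-line sketch of the standard score. The subtraction must happen before the division, which is why the parentheses matter. The function name and sample numbers are illustrative.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations a value lies above or below the mean."""
    return (value - mean) / std_dev

print(z_score(65, 50, 10))  # 1.5  (1.5 standard deviations above the mean)
print(z_score(50, 50, 10))  # 0.0  (the value equals the mean)
print(z_score(38, 50, 10))  # -1.2 (below the mean)
```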
Percentiles
divide the data set into 100 equal groups.
Quartiles
found by dividing the distribution into 4 groups, separated by Q1, Q2, & Q3. Can be used as a rough estimate of variability.
Interquartile Range (IQR)
defined as Q3 minus Q1; it is the range of the middle 50% of the data.
Deciles
Found by dividing the distribution into 10 groups.
Exploratory Data Analysis
the act of analyzing data to determine what information can be obtained from it. In EDA, data can be organized using a stem & leaf plot, and summarized using medians, IQRs, & boxplots.