Population
the entire group that is the target of interest
Data
pieces of information about individuals organized into variables
Individual
a particular person or object
Variable
a particular characteristic of the individual
Dataset
a set of data identified with particular circumstances. Datasets are typically displayed in tables, in which rows represent individuals and columns represent variables
Quantitative variable
takes a numerical value and represents some kind of measurement
Categorical variable
takes a category or label value and places an individual into one of several groups. Categorical variables are sometimes called qualitative variables
Exploratory Data Analysis (EDA)
how we make sense of the data by converting them from their raw form to a more informative one
EDA consists of:
• organizing and summarizing the raw data,
• discovering important features and patterns in the data and any striking deviations from those patterns, and then
• interpreting our findings in the context of the problem
Distribution
what values the variable takes and how often the variable takes those values.
Graphical display of categorical variables
the distribution of a categorical variable is summarized using a pie chart or bar chart, supplemented by numerical summaries (category counts and percentages).
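As a minimal sketch of the numerical summaries mentioned above, the following computes category counts and percentages. The blood-type labels are hypothetical illustration data, not taken from any real dataset.

```python
from collections import Counter

# Hypothetical categorical data: one label per individual.
blood_types = ["A", "O", "O", "B", "A", "O", "AB", "O"]

counts = Counter(blood_types)   # category counts
n = len(blood_types)            # number of individuals

# Percentage of individuals that fall in each category.
percents = {category: 100 * count / n for category, count in counts.items()}

for category in sorted(counts):
    print(f"{category}: count={counts[category]}, percent={percents[category]:.1f}%")
```

These counts or percentages are the bar heights in a bar chart, or the slice sizes in a pie chart.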
Histogram
a graphical display of the distribution of a quantitative variable. It plots the number (count) of observations that fall in intervals of values. The histogram is often the most useful graph for displaying the distribution of a quantitative variable
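A rough sketch of the counting step behind a histogram is shown here. The scores, the interval width of 10, and the starting edge of 50 are all assumptions chosen for illustration.

```python
# Hypothetical quantitative data (for example, exam scores).
scores = [52, 58, 61, 64, 67, 70, 71, 73, 75, 78, 82, 85, 88, 91, 97]

bin_width = 10   # assumed interval width
start = 50       # assumed left edge of the first interval

# Count the observations that fall in each interval [left, left + bin_width).
bin_counts = {}
for x in scores:
    left = start + ((x - start) // bin_width) * bin_width
    bin_counts[left] = bin_counts.get(left, 0) + 1

# Print a simple text histogram, one row per interval.
for left in sorted(bin_counts):
    print(f"[{left}, {left + bin_width}): {'*' * bin_counts[left]}")
```

A different interval width can change the apparent shape, so it is worth trying more than one.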
Stemplot
a graphical display of the distribution of a quantitative variable. It has additional unique features, such as preserving the original data and sorting the data
Four features of a distribution include:
1. Center
2. Spread
3. Shape
4. Outliers
Symmetrical/normal distribution
the left and right sides of the distribution mirror each other. The normal (bell-shaped) distribution is a symmetric distribution with one peak (mode).
Skewed right distribution
the right tail of the histogram (larger values) is much longer than the left tail (small values).
Skewed left distribution
the left tail of the histogram (smaller values) is much longer than the right tail (larger values).
Peakedness/modality
the number of peaks (modes) the distribution has
Unimodal distribution
one with one mode around which the observations are concentrated
Bimodal distribution
one with two modes around which the observations are concentrated
Uniform distribution
one with no clear mode, in which all values occur about equally often, so the histogram looks roughly flat
Midpoint
the center of the distribution, or the value that divides the distribution so that approximately half the observations take smaller values, and approximately half the observations take larger values.
Center
of the distribution is a typical value for the data. Three measures of center are: mode (the most commonly occurring value), mean, and median
Mean
describes the center as an average. The formula for computing the mean (x bar) is: $\bar{x} = \frac{\sum x}{n}$
Weighted average
the mean is computed by "weighting" each value by its frequency. Some values will have more weight than others.
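The sketch below illustrates that the ordinary mean and the frequency-weighted mean agree. The values and the frequency table are hypothetical.

```python
# Hypothetical raw data.
values = [2, 3, 3, 5, 5, 5]

# Ordinary mean: sum of the values divided by how many there are.
mean = sum(values) / len(values)

# The same data as a frequency table: value -> how often it occurs.
frequencies = {2: 1, 3: 2, 5: 3}

# Weighted mean: each value is weighted by its frequency.
weighted_mean = (
    sum(value * freq for value, freq in frequencies.items())
    / sum(frequencies.values())
)

print(mean, weighted_mean)
```

Both calculations divide the same total (23) by the same number of observations (6), which is why they match.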
Median
the middle value in a distribution (50th percentile), or the point above and below which half of the scores fall. Because the median is not affected by extreme scores, it is most appropriate for skewed distributions of quantitative data
To find the median
1. order values from smallest to largest
2. If N is odd, the median is the middle score
3. If N is even, the median is the average of the two middle scores
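The steps above can be sketched as a small function. For an even N it averages the two middle values, which is the usual convention. The function name is illustrative.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    if not values:
        raise ValueError("median requires at least one value")
    ordered = sorted(values)          # step 1: order from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # step 2: odd N, take the middle value
        return ordered[mid]
    # step 3: even N, average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([7, 1, 5]))      # odd N
print(median([4, 1, 3, 2]))   # even N
```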
Mode
the most commonly occurring value in a distribution
Spread
of the distribution can be described by the approximate range covered by the data. Three measures of spread are: range, interquartile range, and standard deviation
Range
the distance between the smallest data point (min) and the largest one (max). Range = max - min
Interquartile range (IQR)
measures the variability of a distribution by giving us the range covered by the middle 50% of the data. IQR = Q3 - Q1
Five number summary
the combination of all five numbers (min, Quartile 1, Median, Quartile 3, Max) that provides a quick numerical description of both the center and spread of a distribution
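The following is a sketch of the five-number summary and the IQR. Textbooks and software use several slightly different quartile conventions. This version assumes that Q1 and Q3 are the medians of the lower and upper halves of the sorted data, leaving out the overall median when the number of observations is odd. The data are hypothetical.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = ordered[:half]
    # For odd n, skip the middle value when forming the upper half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


data = [7, 1, 12, 4, 9, 3, 6]   # hypothetical data
minimum, q1, med, q3, maximum = five_number_summary(data)
iqr = q3 - q1                    # range covered by the middle 50% of the data
print(minimum, q1, med, q3, maximum, iqr)
```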
Boxplot
graphically represents the distribution of a quantitative variable by visually displaying the five-number summary and any observation that was classified as a suspected outlier using the 1.5(IQR) criterion. Boxplots are most useful when presented side-by-side.
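A minimal sketch of the 1.5(IQR) criterion in its usual form: an observation is a suspected outlier if it lies below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR). The function name and data are hypothetical, and the quartiles are passed in because quartile conventions vary.

```python
def suspected_outliers(values, q1, q3):
    """Return the values flagged by the 1.5(IQR) criterion."""
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr    # anything below this is suspect
    high_fence = q3 + 1.5 * iqr   # anything above this is suspect
    return [x for x in values if x < low_fence or x > high_fence]


data = [1, 3, 4, 6, 7, 9, 30]     # hypothetical data with one extreme value
print(suspected_outliers(data, q1=3, q3=9))
```

Here the IQR is 6, so the fences are -6 and 18, and only the value 30 falls outside them.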
Standard deviation
measures the spread by reporting a typical (average) distance between the data points and their average (mean)
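As a hedged sketch, the standard deviation can be computed by hand as follows. This version assumes the sample formula, which divides the sum of squared deviations by n - 1. Some texts divide by n instead when the data are an entire population. The data are hypothetical.

```python
import math


def sample_sd(values):
    """Sample standard deviation (assumes division by n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    # Sum of squared deviations of each point from the mean.
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))


print(sample_sd([2, 4, 4, 4, 5, 5, 7, 9]))   # hypothetical data
print(sample_sd([5, 5, 5, 5]))               # identical values give SD = 0
```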
Properties of the standard deviation
(1) It should be paired as a measure of spread with the mean as a measure of center; (2) mathematically, SD = 0 only when all the observations have the same value (Ex: 5, 5, 5, ... , 5), in which case the deviations from the mean are all 0
The standard deviation rule
• Approximately 68% of the observations fall within 1 standard deviation of the mean.
• Approximately 95% of the observations fall within 2 standard deviations of the mean.
• Approximately 99.7% (or virtually all) of the observations fall within 3 standard deviations of the mean.
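The rule describes normal (bell-shaped) distributions. As a quick check under that assumption, the sketch below uses Python's `statistics.NormalDist` to compute the exact share of a normal distribution within 1, 2, and 3 standard deviations of the mean.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a normal distribution within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD: {100 * share_within(k):.1f}%")
```

For data that are strongly skewed or have several modes, these percentages need not hold.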
Role-type classification
When we look at relationships between two variables, each variable can be described in terms of its proposed role in the relationship, and the type of information associated with that variable, which determines its categorical designation. While these d
Role
explanatory or response
Type
categorical or quantitative
Categorical
mutually exclusive categories exist (gender, treatment group)
Quantitative
something is measured or counted (height, family size)
Independent variable
Another way of describing the explanatory variable
Dependent variable
Another way of describing the response variable
Side-by-side boxplots
A figure with multiple boxplots, each of which corresponds to a single value (or "level") of a categorical variable. These are great for showing C -> Q relationships. Notice that a single boxplot does not show a data relationship by itself, but rather show
Two-way table
This is a table that can be used to display C -> C relationships. A row near the top shows the different levels of one categorical variable, while a column on the left side shows the different levels of another categorical variable. Each categorical varia
Conditional percent
The count data converted into percent values that are based on the total for each level of the explanatory variable. All "percents" are fractions of a total multiplied by 100. The question is, what "total" should be used, row totals or column totals? The a
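A small sketch of conditional percents, with each count divided by the total for its level of the explanatory variable. The two-way table is hypothetical, and it assumes the explanatory variable (group) is laid out in the rows.

```python
# Hypothetical two-way table of counts.
# Rows: levels of the explanatory variable. Columns: levels of the response.
table = {
    "treatment": {"improved": 30, "not improved": 20},
    "placebo": {"improved": 15, "not improved": 35},
}

conditional_percents = {}
for level, row in table.items():
    total = sum(row.values())   # total for this level of the explanatory variable
    conditional_percents[level] = {
        response: 100 * count / total for response, count in row.items()
    }

for level, row in conditional_percents.items():
    print(level, row)
```

Within each explanatory level the percents add up to 100, which makes groups of different sizes comparable.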
Double bar chart
A bar chart that shows conditional percents on the Y axis, and two categorical variables (one split by the other) on the X axis. Splitting by sex (male/female) and experimental group (treatment/placebo) are common ways to split a categorical variable.
Scatterplot
A graph on X,Y coordinate axes that shows plotted data. Each datum represents a "case" or "sample" and is described by two quantitative variables, one of which is designated as the explanatory variable (the x axis) and the other as the response variable (the
direction of relationship
A scatterplot can show a positive relationship, a negative relationship, or neither.
positive relationship
Plotted data tend to increase together in both the x and y directions.
negative relationship
Plotted data tend to decrease in the y direction, as they increase in the x direction.
form of a relationship
Describes the general shape of the plotted data for a scatterplot. Linear, curvilinear and nonlinear are all possible forms.
strength of a relationship
Describes how closely the data follow the form of a relationship. A "strong" linear relationship indicates the data are located very close to the best-fit line.
outliers
Data points that deviate markedly from the form of the relationship
correlation coefficient (r)
A number between -1 and 1 that indicates the strength and direction of a linear relationship between two quantitative variables. "1" indicates a perfect and positive linear relationship. All data fall exactly on the line. "-1" indicates a perfect, negativ
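One common way to compute r is sketched below: the sum of the products of deviations from the means, divided by the square root of the product of the two sums of squared deviations. The function name and data are hypothetical.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient r for paired quantitative data."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
print(correlation(x, [2, 4, 6, 8, 10]))   # points exactly on a rising line
print(correlation(x, [10, 8, 6, 4, 2]))   # points exactly on a falling line
```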
regression
A technique that specifies the dependence of the response variable on the explanatory variable.
linear regression
A method for finding the line of best fit for a linear relationship.
sum of squares
A measure of variation from a model, taken by first squaring all the differences between an observed value and a predicted value, and then summing those squared differences. Squaring the deviations eliminates the negative signs and allows the deviations t
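A minimal sketch of the calculation: square each difference between an observed value and its predicted value, then add up the squares. The observed and predicted values are hypothetical.

```python
def sum_of_squares(observed, predicted):
    """Sum of squared differences between observed and predicted values."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))


observed = [2, 4, 7]     # hypothetical observed Y values
predicted = [3, 4, 5]    # hypothetical values predicted by some model
print(sum_of_squares(observed, predicted))
```

The deviations here are -1, 0, and 2, so the squares are 1, 0, and 4.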
least-squares regression line
This is the line of best fit for a linear relationship. This line will have the smallest sum of squared vertical deviations (Y observed - Y predicted). Usually written as: Y = a + bX
slope
The change in predicted Y for every one-unit increase in X. The regression line symbol for slope is "b", and can be found using the correlation coefficient r, and the standard deviations for the Y (Sy) and X (Sx) values.
$b = r \frac{S_y}{S_x}$
intercept
The value of Y, when X is zero. The regression line symbol for intercept is "a". a is found after finding b, and then by inserting the mean values of X and Y, denoted $\bar{X}$ and $\bar{Y}$, into the regression equation.
$a = \bar{Y} - b\bar{X}$
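The slope and intercept formulas above (slope = r times Sy over Sx, intercept = mean of Y minus slope times mean of X) can be sketched as follows. The function name and data are hypothetical, and the standard deviations are assumed to be sample standard deviations.

```python
import math
from statistics import mean, stdev


def least_squares_line(xs, ys):
    """Return (a, b) for the line Y = a + bX, via b = r * Sy / Sx."""
    mean_x, mean_y = mean(xs), mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)      # correlation coefficient
    b = r * stdev(ys) / stdev(xs)       # slope
    a = mean_y - b * mean_x             # intercept
    return a, b


x = [1, 2, 3, 4, 5]    # hypothetical explanatory values
y = [2, 4, 5, 4, 5]    # hypothetical response values
a, b = least_squares_line(x, y)
print(f"Y = {a:.2f} + {b:.2f}X")
```

Predictions from this line are only trustworthy for X values within the range used to fit it (here, 1 to 5).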
extrapolation
Making predictions based on values of an explanatory variable that are outside those used to establish the relationship. Such predictions are generally considered unreliable
lurking variables
A variable that is not readily observable in the data as presented, but which is responsible for a mistaken relationship between two other variables. Also called a "third" variable or a "confounding" variable
Simpson's paradox
A trend that is reversed in direction, when the data are considered in either an aggregated form or a disaggregated form. The trend in the aggregated data is misleading, and is caused by a lurking variable that is only visible when examining the disaggreg
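The reversal can be illustrated with made-up numbers. In the hypothetical counts below, treatment A has the higher success rate within each subgroup, yet treatment B has the higher rate once the subgroups are combined, because the two treatments were applied to very different mixes of mild and severe cases.

```python
# Hypothetical (successes, total) counts for two treatments in two subgroups.
data = {
    "mild cases": {"A": (9, 10), "B": (80, 100)},
    "severe cases": {"A": (30, 100), "B": (2, 10)},
}

# Disaggregated: success rate of each treatment within each subgroup.
subgroup_rates = {
    group: {arm: s / t for arm, (s, t) in arms.items()}
    for group, arms in data.items()
}

# Aggregated: pool successes and totals across the subgroups.
pooled = {}
for arms in data.values():
    for arm, (s, t) in arms.items():
        prev_s, prev_t = pooled.get(arm, (0, 0))
        pooled[arm] = (prev_s + s, prev_t + t)
overall_rates = {arm: s / t for arm, (s, t) in pooled.items()}

print(subgroup_rates)
print(overall_rates)
```

Case severity plays the role of the lurking variable in this illustration.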
study design
The means by which data are generated, or collected
sampling
The process of choosing representatives from a population for investigation
simple random sample
A randomly selected sample where every possible grouping of subjects is equally likely. Random selection protects against selection bias, but a simple random sample is often difficult to obtain in practice
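A minimal sketch of drawing a simple random sample with Python's `random.sample`, which selects without replacement so that every group of the requested size is equally likely. The numbered population and the fixed seed are assumptions made only so the example is repeatable.

```python
import random

# Hypothetical population of 20 numbered individuals.
population = list(range(1, 21))

rng = random.Random(42)              # fixed seed, for repeatability only
sample = rng.sample(population, 5)   # 5 individuals, chosen without replacement
print(sample)
```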
volunteer sample
Just how it sounds. Volunteers select themselves to be studied. Biased by design towards inclusions of subjects that want to be a part of the study, and may be overtly low risk in some way, or that perhaps already believe in the virtue, or value of the st
volunteer response
The subjects for inclusion in a study are those that voluntarily responded to an invitation to participate. Even if the subjects invited to participate were chosen by a simple random sampling method, the study may be biased if there are many nonresponses.
convenience sample
A sample that is collected by a method that is primarily chosen because it is, in fact, convenient. These methods are likely to be biased, because convenient methods are rarely random
sampling frame
A list or grouping of potential individuals to be sampled. These lists are often constructed for a purpose unrelated to the study. Members on the list often share something in common, which is why they are on the list. This commonality renders the samplin
systematic sampling
The use of an interval, or ordered scheme for selecting individuals. E.g. Every 75th phone number in the phone book
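A sketch of systematic sampling: pick a random starting position within the first interval, then take every k-th individual after it. The function name, the list standing in for a phone book, and the interval of 10 are hypothetical.

```python
import random


def systematic_sample(frame, k, rng):
    """Select every k-th individual from an ordered list, from a random start."""
    if k < 1:
        raise ValueError("k must be at least 1")
    start = rng.randrange(k)     # random start in positions 0 .. k-1
    return frame[start::k]


frame = list(range(100))         # hypothetical ordered list of 100 entries
rng = random.Random(0)           # fixed seed, for repeatability only
print(systematic_sample(frame, 10, rng))
```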
cluster sampling
Choosing a random sample of the natural subgroups of a population, often geographical in nature (but not always), and then including all members of the chosen subgroup. E.g. Choosing 4 dorms at random, from all the dorms on campus, and then including all
stratified sampling
Choosing simple random samples from each of the natural subgroups, or "strata", identified in a population. E.g. Obtaining a simple random sample of 10 individuals from every major at CU. Note that the sample as a whole is not a simple random sample, even
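A sketch of stratified sampling: draw a separate simple random sample from each stratum. The strata names, their members, and the per-stratum sample size are hypothetical.

```python
import random


def stratified_sample(strata, n_per_stratum, rng):
    """Draw a simple random sample of the same size from every stratum."""
    return {
        name: rng.sample(members, n_per_stratum)
        for name, members in strata.items()
    }


# Hypothetical strata: students grouped by major.
strata = {
    "biology": [f"bio_{i}" for i in range(30)],
    "physics": [f"phy_{i}" for i in range(12)],
    "history": [f"his_{i}" for i in range(20)],
}

rng = random.Random(1)           # fixed seed, for repeatability only
chosen = stratified_sample(strata, 3, rng)
for major, students in chosen.items():
    print(major, students)
```

Every stratum is guaranteed to be represented, which a single simple random sample of the whole population does not guarantee.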
multistage sampling
Using multiple approaches in series to obtain a sample. E.g. Using cluster sampling to first select a simple random sample of 10 McDonald's in the state of Colorado and then on a single day systematically inviting every 10th customer to participate in a s
observational study
Data are collected without interference to the subjects. It is difficult to infer causation due to lack of control of lurking variables.
differentiate between prospective and retrospective observational studies
Prospective studies involve collecting data forward in time, while retrospective studies involve collecting data backward in time
controlling lurking/confounding variable in observational studies
Involves identifying them prior to data collection and collecting data on them separately (e.g., sex, race)
experimental studies
Researchers assign values of the explanatory variable to subjects. Researchers intervene, manipulate, or otherwise alter conditions associated with the subjects.
factor
An explanatory variable, categorical or quantitative, that is controlled by an experimenter
treatments
The different imposed values of the explanatory variable
treatment group
The group receiving the treatment
control group
A group of subjects in an experiment, that are denied a treatment applied to subjects in treatment groups
placebo effect
When subjects improve because they believe they are receiving treatment, even if they are not
randomized controlled experiment
Researchers control values of the explanatory variable with a randomization procedure that reduces the potential influence of lurking variables
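One simple randomization procedure, sketched under the assumption of two equal-sized groups: shuffle the subjects and split the shuffled list in half. The function name and subject labels are hypothetical.

```python
import random


def randomize(subjects, rng):
    """Randomly split subjects into a treatment group and a control group."""
    shuffled = list(subjects)    # copy so the original order is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


subjects = [f"subject_{i}" for i in range(1, 11)]   # hypothetical subjects
rng = random.Random(7)           # fixed seed, for repeatability only
treatment_group, control_group = randomize(subjects, rng)
print(treatment_group)
print(control_group)
```

Because chance alone decides who lands in each group, lurking variables tend to be balanced across the groups.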
blind
Subjects are not aware of what treatment is being administered to them. Researchers may also be blind to the treatments subjects are administered
double-blind
When neither the researcher nor the subjects know which treatment was assigned to the subject
randomized controlled double-blind experiment
The most reliable research design in determining whether the explanatory variable is actually causing changes in the response variable
Hawthorne effect
When subjects in an experiment behave differently from how they would normally behave due to their knowledge of being observed
lack of realism/lack of ecological validity
a tradeoff in well-controlled experimental studies such that the research setting is unrealistic (not natural) and thus, threatens the generalizability of results to real-life situations
noncompliance
Failure of the subject to submit to the assigned treatment
blocking
A modification to randomization that helps to ensure the effect of treatments, as well as background variables, are most accurately measured. In blocking, subjects are split into blocks based upon the different values of the background variable, and then
matched pairs
A modification to randomization that helps to pinpoint the effects of the explanatory variable by comparing responses for the same individual under two explanatory values, or for two individuals who are as similar as possible except that the first gets on
sample survey
Subjects report values themselves, which often are opinions
differentiate between open and closed questions
Open questions allow for almost unlimited responses, whereas closed questions force a choice from a fixed set of responses
Outliers
observations that fall outside the overall pattern