Stats #1

Population

the entire group that is the target of interest

Data

pieces of information about individuals organized into variables

Individual

a particular person or object

Variable

a particular characteristic of the individual

Dataset

a set of data identified with particular circumstances. Datasets are typically displayed in tables, in which rows represent individuals and columns represent variables

Quantitative variable

takes a numerical value and represents some kind of measurement

Categorical variable

takes a category or label value and places an individual into one of several groups. Categorical variables are sometimes called qualitative variables

Exploratory Data Analysis (EDA)

how we make sense of the data by converting them from their raw form to a more informative one

EDA consists of:

• organizing and summarizing the raw data,
• discovering important features and patterns in the data and any striking deviations from those patterns, and then
• interpreting our findings in the context of the problem

Distribution

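what values the variable takes and how often the variable takes those values.
The distribution of a categorical variable is summarized using a graphical display (pie chart or bar chart), supplemented by numerical summaries (category counts and percentages).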

Graphical display of categorical variables

pie chart or bar chart, supplemented by numerical summaries (category counts and percentages).
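The numerical summaries mentioned above (category counts and percentages) can be sketched in Python. The function name categorical_summary and the eye-color data are hypothetical, chosen only for illustration.

```python
from collections import Counter

def categorical_summary(values):
    """Return {category: (count, percent)} for a list of category labels."""
    counts = Counter(values)
    n = len(values)
    return {category: (count, 100 * count / n) for category, count in counts.items()}

# Hypothetical data: eye color recorded for 8 individuals.
eye_color = ["brown", "blue", "brown", "green", "brown", "blue", "brown", "blue"]

for category, (count, percent) in categorical_summary(eye_color).items():
    print(f"{category}: count={count}, percent={percent:.1f}%")
```

These counts or percentages are the bar heights in a bar chart, or the slice sizes in a pie chart.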

Histogram

a graphical display of the distribution of a quantitative variable. It plots the number (count) of observations that fall in intervals of values. The histogram is the best graph to use to display the distribution of a quantitative variable

Stemplot

a graphical display of the distribution of a quantitative variable. It has additional unique features, such as preserving the original data and sorting the data
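A minimal Python sketch of a stemplot, assuming non-negative whole-number data where the tens digit(s) form the stem and the ones digit is the leaf. The function name stemplot and the scores are hypothetical.

```python
from collections import defaultdict

def stemplot(data):
    """Return the lines of a stemplot (stem = tens, leaf = ones digit)."""
    stems = defaultdict(list)
    for x in sorted(data):               # sorting puts the leaves in order
        stems[x // 10].append(x % 10)
    lines = []
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}")
    return lines

# Hypothetical exam scores.
scores = [52, 55, 61, 64, 67, 70, 71, 73, 78, 85, 88, 93]
print("\n".join(stemplot(scores)))
```

Every original value can be read back from the plot (stem 7, leaf 8 is 78), which is what "preserving the original data" refers to.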

Four features of a distribution include:

1. Center
2. Spread
3. Shape
4. Outliers

Symmetrical/normal distribution

the left and right sides of the distribution mirror each other. A normal distribution is one example: symmetric and bell-shaped, with one peak (mode).

Skewed right distribution

the right tail of the histogram (larger values) is much longer than the left tail (smaller values).

Skewed left distribution

the left tail of the histogram (smaller values) is much longer than the right tail (larger values).

Peakedness/modality

Number of peaks (modes) the distribution has

Unimodal distribution

one with one mode around which the observations are concentrated

Bimodal distribution

one with two modes around which the observations are concentrated

Uniform distribution

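one with no clear mode: the observations are spread roughly evenly across the range of values, so the histogram looks approximately flat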

Midpoint

the center of the distribution, or the value that divides the distribution so that approximately half the observations take smaller values, and approximately half the observations take larger values.

Center

of the distribution is a typical value. It can be described by the mean, the median, or the mode (the most commonly occurring value in the distribution)

Mean

describes the center as an average. The formula for computing the mean (x̄) is: x̄ = Σx / n

Weighted average

the mean is computed by "weighting" each value by its frequency. Some values will have more weight than others.
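The two cards above can be checked with a short Python sketch. The function names and data are hypothetical. The weighted version should give the same answer as the plain mean when the weights are the frequencies of each value.

```python
def mean(values):
    """Plain mean: sum of the values divided by how many there are."""
    return sum(values) / len(values)

def weighted_mean(values, frequencies):
    """Mean computed by weighting each distinct value by its frequency."""
    total = sum(v * f for v, f in zip(values, frequencies))
    return total / sum(frequencies)

# Hypothetical data: the value 1 occurs 3 times, 2 occurs twice, 5 occurs once.
raw = [1, 1, 1, 2, 2, 5]
print(mean(raw))                            # 2.0
print(weighted_mean([1, 2, 5], [3, 2, 1]))  # 2.0
```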

Median

the middle value in a distribution (50th percentile) or the POINT above and below which 1/2 of the scores fall. Because the median is not affected by extreme scores, it is most appropriate for skewed distributions of quantitative data

To find the median

1. order values from smallest to largest
2. If N is odd, the median is the middle score
3. If N is even, the median falls between the two middle scores (usually taken as their average)
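The steps above can be sketched in Python. For even N this sketch averages the two middle scores, which is the usual convention. The function name is hypothetical.

```python
def median(scores):
    """Median following the three steps: sort, then pick or average the middle."""
    ordered = sorted(scores)            # step 1: order smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                      # step 2: odd N, take the middle score
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # step 3: even N, average the two middle scores

print(median([7, 1, 5]))      # 5
print(median([7, 1, 5, 3]))   # 4.0
```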

Mode

the most commonly occurring value in a distribution

Spread

of the distribution can be described by the approximate range covered by the data. Three measures of spread are: range, interquartile range, and standard deviation

Range

the distance between the smallest data point (min) and the largest one (Max). Range = Max - min

Interquartile range (IQR)

measures the variability of a distribution by giving us the range covered by the middle 50% of the data. IQR = Q3 - Q1

Five number summary

the combination of all five numbers (min, Quartile 1, Median, Quartile 3, Max) that provides a quick numerical description of both the center and spread of a distribution
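A minimal Python sketch of the five-number summary. Quartile conventions differ between textbooks and software; this sketch assumes the "median of each half" method, where the overall median is left out of both halves when N is odd. The function name and data are hypothetical.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, Max) using the median-of-halves convention."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]
    # For odd n, skip the middle value so it belongs to neither half.
    upper = values[half + 1:] if n % 2 else values[half:]
    return (values[0], median(lower), median(values), median(upper), values[-1])

data = [1, 3, 4, 6, 7, 8, 10, 12, 15]
low, q1, med, q3, high = five_number_summary(data)
print(low, q1, med, q3, high)   # 1 3.5 7 11.0 15
print("IQR =", q3 - q1)         # IQR = 7.5
```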

Boxplot

graphically represents the distribution of a quantitative variable by visually displaying the five-number summary and any observation that was classified as a suspected outlier using the 1.5(IQR) criterion. Boxplots are most useful when presented side-by-
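The 1.5(IQR) criterion can be sketched in Python: an observation is a suspected outlier if it lies below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR). The quartiles here assume the median-of-halves convention, and the function name and data are hypothetical.

```python
from statistics import median

def suspected_outliers(data):
    """Return observations outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    q1 = median(values[:half])
    q3 = median(values[half + 1:] if n % 2 else values[half:])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in values if x < lower_fence or x > upper_fence]

# Q1 = 4.5, Q3 = 8.5, IQR = 4, so the fences are -1.5 and 14.5.
print(suspected_outliers([2, 4, 5, 6, 7, 8, 9, 30]))   # [30]
```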

Standard deviation

measures the spread by reporting a typical (average) distance between the data points and their average (mean)
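A short Python sketch of the calculation. It assumes the sample version of the formula, which divides the sum of squared deviations by n - 1; the card itself does not say which divisor is intended. The function name and data are hypothetical.

```python
import math

def sample_sd(values):
    """Sample standard deviation: sqrt( sum of squared deviations / (n - 1) )."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))

data = [2, 4, 4, 4, 5, 5, 7, 9]    # mean is 5, squared deviations sum to 32
print(round(sample_sd(data), 3))   # 2.138
print(sample_sd([5, 5, 5, 5]))     # 0.0, since every observation equals the mean
```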

Properties of the standard deviation

(1) It should be paired as a measure of spread with the mean as a measure of center; (2) the only way, mathematically, in which the SD = 0, is when all the observations have the same value (Ex: 5, 5, 5, ... , 5), in which case, the deviations from the mea

The standard deviation rule

• Approximately 68% of the observations fall within 1 standard deviation of the mean.
• Approximately 95% of the observations fall within 2 standard deviations of the mean.
• Approximately 99.7% (or virtually all) of the observations fall within 3 standard deviations of the mean.
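The percentages in this rule come from the normal distribution, so the rule is a good approximation only when the data are roughly normal in shape. A minimal Python check using the standard library (the function name share_within is hypothetical):

```python
from statistics import NormalDist

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    standard_normal = NormalDist()   # mean 0, standard deviation 1
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {100 * share_within(k):.2f}%")
```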

Role-type classification

When we look at relationships between two variables, each variable can be described in terms of its proposed role in the relationship, and the type of information associated with that variable, which determines its categorical designation. While these d

Role

explanatory or response

Type

categorical or quantitative

Categorical

mutually exclusive categories exist (gender, treatment group)

Quantitative

something is measured or counted (height, family size)

Independent variable

Another way of describing the explanatory variable

Dependent variable

Another way of describing the response variable

Side-by-side boxplots

A figure with multiple boxplots, each of which corresponds to a single value (or "level") of a categorical variable. These are great for showing C -> Q relationships. Notice that a single boxplot does not show a data relationship by itself, but rather show

Two-way table

This is a table that can be used to display C -> C relationships. A row near the top shows the different levels of one categorical variable, while a column on the left side shows the different levels of another categorical variable. Each categorical varia

Conditional percent

The count data converted into percent values that are based on the total for each level of the explanatory variable. All "percents" are fractions of a total multiplied by 100. The question is, what "total" should be used, row totals or column totals? The a
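A minimal Python sketch of conditional percents, assuming the two-way table is stored with one entry per level of the explanatory variable. Each count is divided by the total for its explanatory level. The function name and counts are hypothetical.

```python
def conditional_percents(table):
    """Convert counts to percents within each level of the explanatory variable."""
    result = {}
    for level, counts in table.items():
        total = sum(counts.values())   # total for this explanatory level
        result[level] = {response: 100 * count / total
                         for response, count in counts.items()}
    return result

# Hypothetical counts: explanatory = group, response = outcome.
table = {
    "treatment": {"improved": 30, "not improved": 20},
    "placebo":   {"improved": 10, "not improved": 40},
}
for level, percents in conditional_percents(table).items():
    print(level, percents)
```

Here 60% of the treatment group improved compared with 20% of the placebo group, which is the comparison a double bar chart would display.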

Double bar chart

A bar chart that shows conditional percents on the Y axis, and two categorical variables (one split by the other) on the x axis. Splitting by sex (male/female) and experimental group (treatment/placebo) are common ways to split a categorical variable.

Scatterplot

A graph with X,Y coordinates that shows plotted data. Each datum represents a "case" or "sample" and is described by two quantitative variables, one of which is designated as the explanatory variable (the x axis) and the other as the response variable (the

direction of relationship

A scatterplot can show a positive relationship, a negative relationship, or neither.

positive relationship

Plotted data tend to increase together in both the x and y directions.

negative relationship

Plotted data tend to decrease in the y direction, as they increase in the x direction.

form of a relationship

Describes the general shape of the plotted data for a scatterplot. Linear, curvilinear and nonlinear are all possible forms.

strength of a relationship

Describes how closely the data follow the form of a relationship. A "strong" linear relationship indicates the data are located very close to the best-fit line.

outliers

Data that deviate largely from the form of a relationship

correlation coefficient (r)

A number between -1 and 1 that indicates the strength and direction of a linear relationship between two quantitative variables. "1" indicates a perfect and positive linear relationship. All data fall exactly on the line. "-1" indicates a perfect, negativ
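A short Python sketch of how r can be computed from paired data, using the standard formula (sum of products of deviations divided by the square root of the product of the sums of squared deviations). The function name and data are hypothetical.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r for paired quantitative data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))   # points on a rising line
print(correlation([1, 2, 3, 4], [8, 6, 4, 2]))   # points on a falling line
```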

regression

A technique that specifies the dependence of the response variable on the explanatory variable.

linear regression

A method for finding the line of best fit for a linear relationship.

sum of squares

A measure of variation from a model, taken by first squaring all the differences between an observed value and a predicted value, and then summing those squared differences. Squaring the deviations eliminates the negative signs and allows the deviations t

least-squares regression line

This is the line of best fit for a linear relationship. This line will have the smallest sum of squared vertical deviations (Y observed - Y predicted). Usually written as: Y = a + bX

slope

The change in predicted Y for every one-unit increase in X. The regression line symbol for slope is "b", and can be found using the correlation coefficient r, and the standard deviations for the Y (S_y) and X (S_x) values.
b = r(S_y / S_x)

intercept

The value of Y when X is zero. The regression line symbol for intercept is "a". a is found after finding b, and then by inserting the mean values of X and Y, denoted X̄ and Ȳ, into the regression equation.
a = Ȳ - bX̄
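The slope and intercept formulas can be sketched in Python. This sketch computes r, S_x, and S_y directly from the data (using n - 1 in the standard deviations, an assumption), then applies b = r(S_y / S_x) and a = Ȳ - bX̄. The function name and data are hypothetical.

```python
import math

def least_squares_line(xs, ys):
    """Return (a, b) for the line Y = a + bX, using b = r * S_y / S_x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x          # slope
    a = y_bar - b * x_bar      # intercept
    return a, b

a, b = least_squares_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"Y = {a:.1f} + {b:.1f}X")   # Y = 2.2 + 0.6X
```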

extrapolation

Making predictions based on values of an explanatory variable that are outside those used to establish the relationship. Generally considered unreliable, because the relationship may not hold outside the observed range

lurking variables

A variable that is not readily observable in the data as presented, but which is responsible for a mistaken relationship between two other variables. Also called a "third" variable or a "confounding" variable

Simpson's paradox

A trend that is reversed in direction, when the data are considered in either an aggregated form or a disaggregated form. The trend in the aggregated data is misleading, and is caused by a lurking variable that is only visible when examining the disaggreg
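A small made-up example in Python shows how the reversal can happen. All counts below are invented for illustration. Option A has the higher success rate within each severity group, yet option B looks better in the aggregated data, because severity (the lurking variable) is unevenly split between the two options.

```python
# Invented counts: (successes, total) for each option within each severity group.
data = {
    "mild":   {"A": (90, 100),   "B": (800, 1000)},
    "severe": {"A": (300, 1000), "B": (20, 100)},
}

def group_rate(group, option):
    """Success rate for one option within one severity group (disaggregated)."""
    successes, total = data[group][option]
    return successes / total

def aggregated_rate(option):
    """Success rate for one option with the severity groups combined."""
    successes = sum(data[group][option][0] for group in data)
    total = sum(data[group][option][1] for group in data)
    return successes / total

for group in data:
    print(group, "A:", group_rate(group, "A"), "B:", group_rate(group, "B"))
print("all  A:", round(aggregated_rate("A"), 3), "B:", round(aggregated_rate("B"), 3))
```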

study design

The means by which data are generated, or collected

sampling

The process of choosing representatives from a population for investigation

simple random sample

A randomly selected sample where every possible grouping of subjects is equally likely. Random selection protects against selection bias, but a true simple random sample is often the most difficult kind of sample to obtain
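A minimal Python sketch of drawing a simple random sample, assuming the population is available as a list. The function name, the seed parameter, and the ID numbers are hypothetical; the seed only makes the draw repeatable.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n individuals without replacement; every group of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)

# Hypothetical population: ID numbers 1 through 100.
population = list(range(1, 101))
print(simple_random_sample(population, 10, seed=1))
```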

volunteer sample

Just how it sounds. Volunteers select themselves to be studied. Biased by design towards inclusions of subjects that want to be a part of the study, and may be overtly low risk in some way, or that perhaps already believe in the virtue, or value of the st

volunteer response

The subjects for inclusion in a study are those that voluntarily responded to an invitation to participate. Even if the subjects invited to participate were chosen by a simple random sampling method, the study may be biased if there are many nonresponses.

convenience sample

A sample that is collected by a method chosen primarily because it is convenient. These methods are likely to be biased, because convenient methods are rarely random

sampling frame

A list or grouping of potential individuals to be sampled. These lists are often constructed for a purpose unrelated to the study. Members on the list often share something in common, which is why they are on the list. This commonality renders the samplin

systematic sampling

The use of an interval, or ordered scheme for selecting individuals. E.g. Every 75th phone number in the phone book

cluster sampling

Choosing a random sample of the natural subgroups of a population, often geographical in nature (but not always), and then including all members of the chosen subgroup. E.g. Choosing 4 dorms at random, from all the dorms on campus, and then including all

stratified sampling

Choosing simple random samples from each of the natural subgroups, or "strata", identified in a population. E.g. Obtaining a simple random sample of 10 individuals from every major at CU. Note that the sample as a whole is not a simple random sample, even
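Stratified sampling can be sketched in Python as a separate simple random sample from each stratum. The function name, the majors, and the student labels are hypothetical.

```python
import random

def stratified_sample(strata, n_per_stratum, seed=None):
    """Draw a simple random sample of n_per_stratum individuals from every stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

# Hypothetical strata: students grouped by major.
strata = {
    "biology": [f"bio_{i}" for i in range(50)],
    "history": [f"his_{i}" for i in range(30)],
    "physics": [f"phy_{i}" for i in range(20)],
}
for major, chosen in stratified_sample(strata, 5, seed=1).items():
    print(major, chosen)
```

Every stratum is guaranteed to be represented, which a single simple random sample of the whole population does not guarantee.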

multistage sampling

Using multiple approaches in series to obtain a sample. E.g. Using cluster sampling to first select a simple random sample of 10 McDonald's in the state of Colorado and then on a single day systematically inviting every 10th customer to participate in a s

observational study

Data are collected without interference to the subjects. It is difficult to infer causation due to lack of control of lurking variables.

differentiate between prospective and retrospective observational studies

Prospective studies involve collecting data forward in time, while retrospective studies involve collecting data backward in time

controlling lurking/confounding variable in observational studies

Involves identifying them prior to data collection and collecting data on them separately (e.g., sex, race)

experimental studies

Researchers assign values of the explanatory variable to subjects. Researchers intervene, manipulate, or otherwise alter conditions associated with the subjects.

factor

An explanatory variable, categorical or quantitative, that is controlled by an experimenter

treatments

The different imposed values of the explanatory variable

treatment group

The group receiving the treatment

control group

A group of subjects in an experiment that does not receive the treatment applied to subjects in the treatment groups

placebo effect

When subjects improve because they believe they are receiving treatment, even if they are not

randomized controlled experiment

Researchers control values of the explanatory variable with a randomization procedure that reduces the potential influence of lurking variables
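One possible randomization procedure, sketched in Python: shuffle the subjects, then deal them out to the groups in turn. The function name, group labels, and subject labels are hypothetical, and other procedures (such as a coin flip per subject) are also used.

```python
import random

def randomize(subjects, groups=("treatment", "control"), seed=None):
    """Randomly assign subjects to groups of (nearly) equal size."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)                 # chance, not the researcher, decides the order
    assignment = {group: [] for group in groups}
    for i, subject in enumerate(shuffled):
        assignment[groups[i % len(groups)]].append(subject)
    return assignment

subjects = [f"subject_{i}" for i in range(1, 21)]
for group, members in randomize(subjects, seed=1).items():
    print(group, members)
```

Because chance decides who lands in each group, lurking variables tend to be balanced across the groups on average.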

blind

Subjects are not aware of what treatment is being administered to them. Researchers may also be blind to the treatments subjects are administered

double-blind

When neither the researcher nor the subjects know which treatment was assigned to the subject

randomized controlled double-blind experiment

The most reliable research design in determining whether the explanatory variable is actually causing changes in the response variable

Hawthorne effect

When subjects in an experiment behave differently from how they would normally behave due to their knowledge of being observed

lack of realism/lack of ecological validity

a tradeoff in well-controlled experimental studies such that the research setting is unrealistic (not natural) and thus, threatens the generalizability of results to real-life situations

noncompliance

Failure of the subject to submit to the assigned treatment

blocking

A modification to randomization that helps to ensure the effect of treatments, as well as background variables, are most accurately measured. In blocking, subjects are split into blocks based upon the different values of the background variable, and then

matched pairs

A modification to randomization that helps to pinpoint the effects of the explanatory variable by comparing responses for the same individual under two explanatory values, or for two individuals who are as similar as possible except that the first gets on

sample survey

Subjects report values themselves, which often are opinions

differentiate between open and closed questions

Open questions allow for almost unlimited responses, whereas closed questions are forced responses

Outliers

observations that fall outside the overall pattern