data analysis definitions

Statistics

provide a way of understanding, illustrating, or otherwise making sense of quantitative data.
never prove anything; they only increase the confidence that a treatment resulted in an outcome.

Randomized control trials (RCTs)

are true experimental designs where the subjects are randomly assigned to control and treatment groups.

systematic reviews

are processes whereby published research from RCTs are pulled together on a specific topic using strict inclusion criteria, reviewed collectively, and presented in a meaningful way so the reader understands the topic in light of many studies viewed together.

independent variable

the variable the researcher manipulates; the change agent in the experiment

dependent variable

the characteristic of interest; the outcome expected to change in response to the independent variable

Simple random sampling

is the strongest method because it randomly selects a sample from a larger group. This theoretically reduces the introduction of any human bias into this part of the process.

population

the entire group of subjects the study is about; the sample is drawn from, and results are generalized to, this group

convenience sampling

This approach uses a group for the simple reason of accessibility to the researcher.

Descriptive statistics

are test results that describe or characterize the data.
For example: the study consisted of 30% males and 70% females, the mean age was 24, and the average number of courses taken is five.

Inferential statistics

are used to infer (or predict) something about a larger group based on the results from a sample.
For example: based on the results of the study with a sample of 24-year olds in the U.S., there is a statistically significant difference in GRE scores.

Nominal level data

is the lowest order. It is a naming level such as sex (male or female), race (African American, Caucasian, Hispanic, Pacific Islander, etc.), and blood type (A, B, AB, O).

Ordinal level data

is one step above nominal data. Ordinal level data is a ranking level; that is, the numbers indicate placing but do not have a significant value otherwise.
You cannot perform mathematical functions on the numbers. Examples are placing in a contest (1st, 2nd, 3rd).

Interval level data

is one of the two higher order levels. The numbers have a mathematical value and the intervals between two numbers have value, but there is no absolute zero.
For example: temperature is an interval level measure. Ninety degrees to 50 degrees is a range of temperatures, but zero degrees does not mean an absence of temperature.

Ratio level data

is the other higher order level. Ratio level data is the same as interval data except it has an absolute zero. Blood pressure is an example of ratio level data. Here, a zero blood pressure means an absence of blood pressure. The dollar amount in a checking account is another example: a zero balance means an absence of money.

presenting data

illustration of data

measures of central tendency

include the mean, median, and mode (descriptive measures of data)

mean

is the mathematical average

median

is the middle number in a set of numbers in ascending or descending order

mode

is the most frequently occurring number in a set of data
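The three measures of central tendency can be sketched with Python's standard library; the data set below is invented for illustration:

```python
from statistics import mean, median, mode

# Hypothetical data: number of courses taken by seven students
scores = [3, 5, 5, 4, 7, 5, 6]

print(mean(scores))    # the mathematical average
print(median(scores))  # the middle value of the sorted list
print(mode(scores))    # the most frequently occurring value
```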

measures of dispersion (variability)

include the range, variance, and standard deviation

range

is the representation of how wide the distribution of scores is

width of range

is expressed as a value and is found by subtracting the low score from the high score.

variance

is the amount of spread of the data set

standard deviation

indicates how far, on average, a score is from the mean.
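The measures of dispersion can be sketched the same way; the scores below are hypothetical:

```python
from statistics import pvariance, pstdev

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data set

# Width of the range: high score minus low score
range_width = max(scores) - min(scores)

print(range_width)
print(pvariance(scores))  # population variance: the mean squared deviation
print(pstdev(scores))     # population standard deviation: square root of the variance
```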

Probability

is the chance of something of interest occurring.
formula: P = number of nominated outcomes (outcomes of interest) / number of possible outcomes (all possible outcomes).
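The formula in a quick worked example (the card-deck scenario is an illustration, not from the text above):

```python
# P = outcomes of interest / all possible outcomes
# Example: probability of drawing an ace from a standard 52-card deck
nominated_outcomes = 4   # aces in the deck
possible_outcomes = 52   # total cards
p = nominated_outcomes / possible_outcomes

print(round(p, 4))  # 0.0769
```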

critical probability

the point at which the stated outcome is considered unlikely to be the result of chance alone.
It is usually represented as p < 0.01, meaning the outcome of the experiment would be expected to occur by chance less than one time in 100 (p = 1/100). This would be considered statistically significant.

normal distribution curve

is a way of illustrating a common outcome of statistical tests

statistical outliers

Observations that are more than ±3 standard deviations from the mean

Z scores

express how many standard deviations a raw score falls from the mean.
z = (raw score - mean score) / standard deviation.

T scores

simply use increments of 10. The mean is given a T score of 50 and the standard deviations increase/decrease by 10. A T score of 60 is one standard deviation above the mean.
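Both conversions in a small sketch; the exam numbers are hypothetical:

```python
def z_score(raw, mean, sd):
    # z = (raw score - mean score) / standard deviation
    return (raw - mean) / sd

def t_score(z):
    # T scores place the mean at 50, with 10 points per standard deviation
    return 50 + 10 * z

# Hypothetical exam: mean 75, standard deviation 10, raw score 85
z = z_score(85, 75, 10)
print(z)           # 1.0 (one standard deviation above the mean)
print(t_score(z))  # 60.0
```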

skew of a distribution

refers to how the curve leans. When extreme high scores pull the tail of the curve to the right, the distribution is said to be positively skewed. When extreme low scores pull the tail to the left, it is said to be negatively skewed.

Sampling error

is the error that results when using a sample mean to estimate a population characteristic.

sampling distribution

is the distribution formed when the means of repeated samples cluster around the population mean

The Central Limit Theorem

states that the means of a large number of samples drawn randomly from the same population will be normally distributed.
This theorem further states that if one calculated the mean of those sample means, it would equal the mean of the population in question.

the standard error of the mean

If one were to calculate a standard deviation of these sample means, the result would be the standard error of the mean.
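A quick simulation can illustrate the Central Limit Theorem and the standard error of the mean; the population, sample size, and number of samples below are arbitrary choices for the sketch:

```python
import random
from statistics import mean, stdev

random.seed(1)

# Hypothetical population: 10,000 values spread uniformly over 0-100
population = [random.uniform(0, 100) for _ in range(10_000)]

# Draw many random samples of n = 30 and record each sample mean
sample_means = [mean(random.sample(population, 30)) for _ in range(2_000)]

# Per the Central Limit Theorem, the mean of the sample means
# approximates the population mean
print(round(mean(population), 1), round(mean(sample_means), 1))

# The standard deviation of the sample means is the standard error
# of the mean (theory: population SD divided by the square root of n)
print(round(stdev(sample_means), 2))
```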

Confidence interval

indicates how accurate one believes the estimate to be.
the larger the sample size, the greater the confidence in the results.

directional hypothesis

(one-tailed hypothesis) predicts the direction of the relationship or difference

null hypothesis

(non-directional, two-tailed hypothesis) states that no relationship or difference exists

A type I error is

the rejection of a true null hypothesis

A type II error

is the failure to reject a false null hypothesis

Degrees of freedom (df)

are based on the t-distribution. The t-distribution is a way of reflecting our confidence in a sample mean and standard deviation while accurately reflecting a population. This confidence is based on sample size: the smaller the sample, the less confident we can be, and the flatter and wider the t-distribution becomes.

chi-square tests

This type of test is common with nominal level data and is considered a lower order statistical test. Basically, it tells whether the observed frequency counts differ from the expected frequency counts among the groups by more than chance alone.
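A minimal hand-rolled sketch of the chi-square computation on a hypothetical 2x2 table of counts (the observed frequencies are invented for illustration):

```python
# Chi-square statistic for a 2x2 table of counts (nominal data):
# rows are groups, columns are outcome categories.
observed = [[30, 20],   # group 1: outcome yes, outcome no
            [10, 40]]   # group 2: outcome yes, outcome no

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count if group and outcome were independent
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

print(round(chi2, 2))  # 16.67, well above the p < 0.01 critical value of 6.63 at df = 1
```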

correlation

a statistical relationship between two variables; correlation tests measure its strength and direction and whether it is statistically significant.

Spearman's rank order test

is used when one of the two variables is ordinal level data and the other is interval level data. For example, you would use a Spearman's if you were looking for a correlation between class rank (ordinal) and GPA (interval).
"rs" for Spearman.

The Pearson's product moment (product moment correlation coefficient)

is used when both variables are interval level data. For example, you would use a Pearson's to look for a correlation between GPA (interval) and SAT scores (interval).
"r" for Pearson

the coefficient of determination (r2)

What this is telling the reader is the percentage of one variable explained by the second variable. If, for example, you found a correlation of 0.7 between GPA and SAT (r = 0.7), the coefficient of determination would be 0.49 (r2 = 0.49). What this is saying is that 49% of the variance in GPA is explained by SAT scores.
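Pearson's r and r2 computed from scratch; the paired GPA/SAT values below are hypothetical:

```python
from math import sqrt

# Hypothetical paired interval-level data for five students
x = [2.8, 3.0, 3.2, 3.5, 3.9]       # GPA
y = [1050, 1100, 1150, 1300, 1400]  # SAT

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Pearson's r: co-variation scaled by the product of the spreads
num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
den = sqrt(sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y))
r = num / den

print(round(r, 2))      # Pearson's r
print(round(r * r, 2))  # coefficient of determination, r2
```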

regression analysis

basically evaluates how one set of data relates to another.

If you have ordinal level data and the two groups are unrelated (independent of each other), you would use a

Mann-Whitney U test.

If you have ordinal level data and the samples are related, you would use a

Kolmogorov-Smirnov test.

If you have interval level data, you would use a

t-test.

ANOVA

allows us to look at both the average amount of difference between groups (same as a t-test) as well as the average amount of difference within each group. The additional advantage of an ANOVA is it can look at differences between more than two groups (t-tests are limited to two groups).

Prevalence:

the proportion of the population that has a disease in question at a specific point in time.

Incidence:

the number of new cases identified during a particular time period.

Relative risk:

the ratio of the incidence rates among exposed to unexposed individuals in a population.

2x2 tables

are used to assess treatments with dichotomous outcomes (yes or no; did or did not; etc.).

Experimental event rate (EER):

a measure of how often a particular event (response or outcome) occurs within the experimental group during a study.

Control event rate (CER):

a measure of how often a particular event (response or outcome) occurs within the control group during a study.

Absolute risk reduction (ARR):

also known as attributable risk reduction; the difference in the risk of the outcome between patients who have undergone one therapy and those who have undergone another. Again, using the 2x2 table as an example, the formula for determining ARR is: [C/(C+D)] - [A/(A+B)], that is, CER - EER.

Relative risk reduction:

an estimate of the percentage of baseline risk that is removed as a result of the therapy. It is calculated as the ARR between the treatment and control groups divided by the absolute risk among patients in the control group (see ARR), and the formula is: ([C/(C+D)] - [A/(A+B)]) / [C/(C+D)], that is, ARR/CER.

Odds ratio:

the ratio of the odds of the event occurring in the experimental group to the odds of it occurring in the control group. The formula is:
(A/C)/(B/D).

Number Needed to Treat (NNT):

the number of patients who need to be treated to prevent one adverse event. It is the reciprocal of the ARR (1/ARR).
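The 2x2-table measures above can be computed together; the counts A through D below are hypothetical trial results, laid out with the experimental group in the top row:

```python
# Hypothetical trial counts in a 2x2 table:
#                outcome yes   outcome no
# experimental        A             B
# control             C             D
A, B, C, D = 15, 85, 30, 70

EER = A / (A + B)        # experimental event rate
CER = C / (C + D)        # control event rate
ARR = CER - EER          # absolute risk reduction
RRR = ARR / CER          # relative risk reduction
OR = (A / C) / (B / D)   # odds ratio (equivalent to AD/BC)
NNT = 1 / ARR            # number needed to treat

print(EER, CER)
print(round(ARR, 2), round(RRR, 2), round(OR, 2))
print(round(NNT, 1))
```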
