Statistics

Statistic

a numerical characteristic of a sample, obtained by using the data values from that sample

Parameter

a characteristic or measure obtained by using all the data values of a specific population; a numerical characteristic of a population, distinct from a statistic, which describes a sample and is used to estimate a parameter

Sample

a group of subjects selected from a population to represent the population

Population

the totality of all subjects possessing certain common characteristics that are being studied

Statistical Inference

the process of deducing properties of an underlying distribution by analysis of data; inferential statistical analysis infers properties about a population: this includes testing hypotheses and deriving estimates

Variable

a characteristic or attribute that can assume different values when observed in different persons, places, or things

Continuous Variable

a variable that can assume any value between any two specific values; a variable obtained by measuring

Discrete Variable

a variable that assumes values that can be counted; it cannot take on every possible value between any two of its values

Categorical Variable

a variable that can take on ONE of a limited, and usually fixed, number of possible values, thus assigning each individual to a particular group or "category"

Quantitative Variable

a variable that is numerical in nature and that can be ordered or ranked

Qualitative Variable

a variable that can be placed into distinct categories, according to some characteristic or attribute

Observational Study

a study in which the researcher merely observes what is happening or what has happened in the past and draws conclusions based on these observations

Experimental Study

a study in which the researcher MANIPULATES one of the variables and tries to determine how the manipulation influences other variables

Simple Random Sampling

samples obtained by using random or chance methods; a sample for which every member of the population has an equal chance of being selected (may not be perfectly representative of the population) (probabilistic)

Systematic Random Sampling

samples obtained by numbering each element in the population and then selecting every nth element from the population to be included in the sample

Stratified Random Sampling

samples obtained by dividing the population into subgroups, called strata, according to some homogeneous characteristic and then randomly selecting members from each stratum/group

Cluster Sampling

samples obtained by selecting a preexisting or natural group, called a cluster, and using the members in the cluster for the sample

Nonprobabilistic Sampling

nonrandom, cannot be used to infer from the sample to the general population, includes convenience/accidental sampling, snowball sampling, judgement sampling, deviant cases, case studies, and ad-hoc quotas

Convenience Sampling

choosing individuals who are easiest to reach

Judgement Sampling

samples in which the selection criteria are based on the researcher's personal judgment about the representativeness of the population under study; the researcher selects who should be in the study/who would be most appropriate for the study (nonprobabilistic)

Central Limit Theorem

as the sample size increases, the distribution of the sample mean of a randomly selected sample approaches the normal distribution; approximately 95% of Xbar values will fall within ±1.96 standard errors (SE) of the population mean (mu); typically applied when the sample size is larger than 30
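
A minimal Python sketch (illustrative only; the exponential population and all numbers are made up) of the behavior described above: the sample means cluster around mu, and roughly 95% of them fall within 1.96 standard errors.

```python
# Draw many samples from a skewed population and examine the distribution of Xbar.
import random
import statistics

random.seed(0)
population_mean = 2.0                      # mu of an assumed exponential population
n = 36                                     # sample size larger than 30
sample_means = [
    statistics.mean(random.expovariate(1 / population_mean) for _ in range(n))
    for _ in range(5000)
]
se = statistics.stdev(sample_means)        # empirical standard error of Xbar
within = sum(abs(m - population_mean) <= 1.96 * se for m in sample_means) / len(sample_means)
print(f"mean of Xbar ~ {statistics.mean(sample_means):.3f}, SE ~ {se:.3f}, "
      f"share within 1.96 SE ~ {within:.3f}")   # roughly 0.95
```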

Histogram

a graph that displays data by using vertical bars of various heights to represent the frequencies of a distribution; the columns are positioned over a label that represents a quantitative variable; the column label can be a single value or a range of values; the height of the column indicates the size of the group defined by the column label.

Pie Chart

a circle divided into sections according to the percentage of frequencies in each category of the distribution (% × 3.6 = degrees of the central angle for each section)
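
A quick worked example of the % × 3.6 rule, using hypothetical category counts:

```python
# Each category's percent of the total, times 3.6, gives its central angle in degrees.
counts = {"A": 25, "B": 15, "C": 10}          # hypothetical frequencies
total = sum(counts.values())
for category, count in counts.items():
    percent = 100 * count / total
    angle = percent * 3.6                     # degrees for this sector
    print(f"{category}: {percent:.1f}% -> {angle:.1f} degrees")
# The three angles add up to 360 degrees.
```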

Bar Graph

a graph that also displays data by using vertical bars; the columns are positioned over a label that represents a categorical variable; the height of the column indicates the size of the group defined by the column label.

Boxplot

a graph used to represent a data set when the data set contains a small number of values; a boxplot splits the data set into quartiles. The body of the boxplot consists of a "box" which goes from the first quartile (Q1) to the third quartile (Q3); within the box, a vertical line is drawn at Q2, the median of the data set. Two horizontal lines, called whiskers, extend from the front and back of the box. The front whisker goes from Q1 to the smallest non-outlier in the data set, and the back whisker goes from Q3 to the largest non-outlier. Used in Exploratory Data Analysis (EDA) based on the median.

Frequency Distribution

a table listing the possible values of the variable and their frequencies (counts of the number of times each value occurs)

Relative Frequency Distribution

tabular summary; a table listing the possible values of the variable along with their relative frequencies (proportions in fraction, percent, or ratio); shows the fraction of the total number of items in several classes

Tabular Summary

presented in rows and columns (Spreadsheet)

Mean

the sum of the values divided by the total number of values (population mean: mu); its magnitude can be affected by a single extremely large or small value; a nonresistant statistic, i.e., one affected by outliers

Median

the midpoint of a data array (corresponds to the 50th percentile); a resistant statistic, i.e., one less affected by outliers

Mode

the value that occurs most often in a data set; the peak of a curve; meaningful when data is qualitative

Variance

the sum of the squared distances of the data values from their mean, divided by the number of values minus 1 (for the sample variance)
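
A minimal sketch of the sample variance computation (divide by n − 1), using made-up data:

```python
# Sample variance from scratch, then the same result from the standard library.
data = [4, 8, 6, 5, 3]                        # hypothetical sample
n = len(data)
mean = sum(data) / n
sample_variance = sum((x - mean) ** 2 for x in data) / (n - 1)
print(mean, sample_variance)                  # 5.2 and 3.7

import statistics
print(statistics.variance(data))              # also 3.7
```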

Coefficient of Variation

CVar; a standardized measure of dispersion of a probability distribution or frequency distribution; the ratio of the standard deviation to the mean

Standard Deviation

the square root of the variance (sigma); a nonresistant statistic, i.e., one affected by outliers; as sample size decreases, the standard deviation of the sample mean (the standard error) increases

Bimodal Distribution

a distribution with two peaks/two modes

Positively Skewed/Right-Skewed Distribution

non-symmetric distribution; the tail in the positive direction extends further than the tail in the negative direction; mean and median are larger than the mode

Negatively Skewed/Left-Skewed Distribution

non-symmetric distribution; the tail in the negative direction extends further than the tail in the positive direction; mean and median are smaller than the mode

Leptokurtic Distribution

a distribution that has relatively more scores in its tails than a normal distribution does

Symmetric Distribution

the mean is the same as the median and mode

Mean Absolute Deviation

the mean of the absolute distances of each value from the mean of the values

Finding Percentiles

multiply the percent by the total number of values to get an index; if the index is not a whole number, round it up and count that many ordered values from left to right; if it is a whole number, use the value halfway between that position and the next
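
A sketch of this index procedure in Python, with an assumed data set; the handling of a whole-number index follows a common textbook convention:

```python
import math

def percentile_value(data, p):
    values = sorted(data)                     # count from smallest to largest
    c = p / 100 * len(values)                 # index = percent x number of values
    if c != int(c):
        return values[math.ceil(c) - 1]       # round the index up, then count in
    c = int(c)
    return (values[c - 1] + values[c]) / 2    # whole-number index: average two positions

data = [2, 3, 5, 6, 8, 10, 12, 15, 18, 20]    # hypothetical ordered data, n = 10
print(percentile_value(data, 25))             # index 2.5 -> round up to 3 -> value 5
print(percentile_value(data, 50))             # index 5 is whole -> (8 + 10) / 2 = 9.0
```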

Chebyshev's Theorem

for any distribution, at least 1 − 1/k² of the data values will fall within k standard deviations of the mean (k > 1); in particular, at least 75% will fall within 2 standard deviations
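
A small sketch of the general bound, 1 − 1/k², evaluated for a few values of k:

```python
# Chebyshev's theorem: at least 1 - 1/k**2 of the data lie within k standard deviations.
for k in (1.5, 2, 3):
    bound = 1 - 1 / k ** 2
    print(f"k = {k}: at least {bound:.1%} of values within {k} standard deviations")
# k = 2 gives the 75% figure quoted above; k = 3 gives at least about 88.9%.
```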

Interquartile Range

difference between the first and third quartiles (Q3-Q1)

Outlier

any data point more than 1.5 interquartile ranges (IQRs) below the first quartile or above the third quartile
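
A minimal sketch of the 1.5 × IQR rule with made-up data (quartile conventions vary slightly between textbooks and software):

```python
import statistics

data = [10, 12, 13, 14, 15, 16, 18, 45]       # 45 is intended to look like an outlier
q1, _, q3 = statistics.quantiles(data, n=4)   # first and third quartiles
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(q1, q3, iqr, outliers)                  # [45] falls outside the fences
```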

Empirical Rule

for normal distribution/curve the approximate % of observations within 1 standard deviation (68%), 2 standard deviations (95%), and 3 standard deviations (99.7%) of the mean

Normal Distribution

a unimodal, symmetric, bell-shaped distribution for a selected variable; the total area under a normal curve is always equal to one; the curve never quite reaches y = 0 and extends to positive and negative infinity; the mean = median = mode; 68% of values fall within ±1 SD, 95% within ±2 SD, and 99.7% within ±3 SD.

Standard Normal Curve

a normally distributed variable having mean 0 and standard deviation 1 is said to have the standard normal distribution; its associated normal curve is called the standard normal curve

Dispersion

variability, scatter, or spread

Linear Regression

the modeling of the relationship of a response or dependent variable (y) to a single independent feature or measurement variable (x) by a linear model

Predictor Variable

the variable in a correlational study that is used to predict the score on another variable

Independent/Predictor/Explanatory Variable

x; the variable that is varied or manipulated by the researcher; a variable whose values are independent of changes in the values of other variables; the variable that explains the response

Dependent/Response Variable

y; variables of interest in an experiment (those that are measured or observed)

Method of Least Squares

a statistical method to find a line that best fits a set of data; it is used to break out the fixed and variable components of a mixed cost; line that yields the smallest sum of squared residuals for all Y values
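
A short sketch of the least squares slope and intercept for a simple linear fit, using hypothetical (x, y) data and the usual formulas b1 = Sxy/Sxx and b0 = ybar − b1·xbar:

```python
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]                # hypothetical responses
n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
s_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
s_xx = sum((x - xbar) ** 2 for x in xs)
slope = s_xy / s_xx                           # b1 minimizes the sum of squared residuals
intercept = ybar - slope * xbar               # b0
print(f"y-hat = {intercept:.3f} + {slope:.3f} x")   # slope comes out close to 2
```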

Residual

the difference between the actual value/score and the predicted value

Correlation Coefficient

r; a measure of the linear correlation (dependence) between two variables X and Y, giving a value between +1 and −1, where 1 is total positive correlation, 0 is no correlation, and −1 is total negative correlation; positive will slope upward to the right; negative will slope downward to the right.
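
A minimal sketch of Pearson's r (and r squared) for hypothetical paired data:

```python
import math

xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]                          # hypothetical paired values
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
s_xy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
s_xx = sum((x - xbar) ** 2 for x in xs)
s_yy = sum((y - ybar) ** 2 for y in ys)
r = s_xy / math.sqrt(s_xx * s_yy)             # Pearson correlation coefficient
print(round(r, 3), round(r ** 2, 3))          # r = 0.8 and r squared = 0.64 here
```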

Proportion of the Total Variation

r squared; coefficient of determination; a number that indicates how well data fit a statistical model

Positive Relationship

when both variables (x & y) increase or decrease at the same time

Negative Relationship

when one variable increases and the other decreases

Coefficient of Nondetermination

1.00 - r squared; the percent of variation which is unexplained by the regression equation; the unexplained variation divided by the total variation

Probability

may range from 0.00 to 1.00, with 1.00 meaning the event is certain to happen and 0.00 meaning it cannot happen

Classical Probability

uses sample spaces to determine the numerical probability that an event will happen. It assumes all outcomes in the sample space are equally likely to occur.

Conditional Probability

the chance that a second event will happen, given that the first event has already happened
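
A tiny sketch of the conditional probability formula P(B | A) = P(A and B) / P(A), with assumed probabilities:

```python
p_a = 0.5                                     # P(A), hypothetical
p_a_and_b = 0.2                               # P(A and B), hypothetical
p_b_given_a = p_a_and_b / p_a                 # probability of B given that A occurred
print(p_b_given_a)                            # 0.4
```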

Mutually Exclusive/Disjoint

a statistical term used to describe events that cannot occur at the same time; the occurrence of one event rules out the occurrence of the other

Independent Events

events for which the outcome of one event does not affect the probability of the other

Complement of an Event

set of all outcomes in the sample space that are not in the event

Confidence Interval

a range of values for a variable of interest; a specific interval estimate of a parameter determined by using data obtained from a sample and by using the specific confidence level of the estimate; the specified probability is called the confidence level and the end points of the confidence interval are called the confidence limits; round to 3 decimal places

Standard Normal Distribution

a normal distribution with a mean of 0 and a standard deviation of 1

t Distribution

a distribution, specified by degrees of freedom, used to model test statistics for the one-sample t test, the two-sample t test, etc., when the population standard deviation(s) σ is (are) unknown; also used to obtain a confidence interval for estimating a population mean, or the difference between two population means, etc.

True Proportion

the actual proportion of a population that possesses a given characteristic (the population proportion p); a sample proportion (p-hat) is used to estimate it

Null Hypothesis

states that there is no difference between a parameter and a specific value, or that there is no difference between two parameters; hypothesis that predicts NO relationship between variables; the aim of research is to reject this hypothesis; referred to as the "status quo" or a statement of "no effect or no difference"

Type I Error

if a null hypothesis is true and it is rejected; the probability of a type I error is represented by alpha (α)

p-value

the probability, computed assuming the null hypothesis is true, of obtaining a result at least as extreme as the one observed; it forms the basis for deciding whether results are statistically significant (not due to chance)

Rejection Region

area of a sampling distribution that corresponds to test statistic values that lead to rejection of the null hypothesis

Two Tailed Test

used when we predict that there is a relationship but do not predict the direction; used to test a nondirectional research hypothesis

Level of Significance

the maximum probability of committing a type I error; represented by alpha (α); when a null hypothesis is rejected, the probability of a type I error will be .10, .05, or .01 depending on which level of significance is used

Z-value/Z-score

the number of standard deviations a given observation is from the population mean
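
A one-line illustration of the z-score formula z = (x − mu) / sigma, with assumed numbers:

```python
x, mu, sigma = 85, 70, 10                     # hypothetical observation, mean, and SD
z = (x - mu) / sigma                          # standard deviations above (or below) the mean
print(z)                                      # 1.5
```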

Confidence Level

the probability that the interval estimate will contain the parameter, assuming that a large number of samples are selected and that the estimation process on the same parameter is repeated

Estimator

a rule, method, or criterion for arriving at an estimate of the value of the parameter; a statistic based on sample observations that is used to estimate the numerical value of an unknown population parameter

Interval Estimate

an interval or range of values used to estimate the parameter

Interval Estimation

the use of sample data to calculate an interval of possible (or probable) values of an unknown population parameter, in contrast to point estimation, which is a single number

Degrees of Freedom (d.f.)

d.f. = n-1; number of values that are free to vary after a sample statistic has been computed

Least Squares Method

in linear regression, results in the values of the y-intercept and the slope that minimize the sum of the squared deviations between the observed values of the dependent variable and the estimated values of the dependent variable

Point Estimate of Population Parameter

a single value of a statistic used to estimate a parameter; e.g., the sample mean (Xbar) is a point estimate of the population mean (μ); similarly, the sample proportion (p-hat) is a point estimate of the population proportion (p); compare with an interval estimate.

t-distribution

used for SMALL SAMPLE SIZES (n < 30), with s (the standard deviation of the sample); a distribution in which sigma is replaced with s; the t distribution is symmetric around zero and bell-shaped, but its spread is greater because there is more variation

Sampling Error

the difference between a sample statistic used to estimate a population parameter and the actual value of the parameter

Scatter Plot

the visual representation of data in simple regression analysis

Simple Regression Analysis

analysis of the extent to which the relationship between two variables can be represented by a straight line; visualized with a scatter plot

Statistical Hypothesis

a conjecture about a population parameter; may or may not be true

Alternative/Research Hypothesis

(H1) is a statistical hypothesis that states the existence of a difference between a parameter and a specific value, or that there is a difference between two parameters

p-value

A measure of statistical significance. The lower the p-value, the more likely it is that the results of an experiment did not occur simply by chance.

Two-tailed Test

H0:μ = 82 and H1:μ ≠ 82; H1 says the mean will differ from 82 in either direction (larger or smaller)

Right-tailed Test

H0:μ = 36 and H1:μ > 36; H1 says the mean will be greater than 36

Left-tailed Test

H0:μ = 78 and H1:μ < 78; H1 says the mean will be less than 78

Type II Error

if the null hypothesis is false but not rejected; the probability of a type II error is represented by beta (β)

Critical Value (C.V.)

separates the critical region from the noncritical region

Critical/Rejection Region

the range of values of the test value that indicates that there is a significant difference and that the null hypothesis should be rejected

Noncritical/Nonrejection Region

the range of values of the test value that indicates that the difference was probably due to chance and that the null hypothesis SHOULD NOT be rejected

One-tailed Test

indicates that a null hypothesis should be rejected when the test value is in the critical region on one side of the mean; will be either right- or left-tailed test, depending on the direction of the inequality of the alternative hypothesis

Confidence Interval

When CI is 90%, Zα/2=1.65; 95%, Zα/2=1.96; 99%, Zα/2=2.58
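
A short sketch that applies these z-values in the large-sample interval Xbar ± z·σ/√n, using hypothetical sample figures and rounding to 3 decimal places:

```python
import math

xbar, sigma, n = 50.0, 8.0, 64                # hypothetical sample mean, population SD, size
z_values = {0.90: 1.65, 0.95: 1.96, 0.99: 2.58}
for level, z in z_values.items():
    margin = z * sigma / math.sqrt(n)         # margin of error
    print(f"{level:.0%} CI: ({xbar - margin:.3f}, {xbar + margin:.3f})")
```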

Dependent Events

when the occurrence of the first event affects the outcome or occurrence of the second event in such a way that the probability is changed

Discrete Probability Distribution

a distribution for which f(x) ≥ 0 for every value of x and the sum of f(x) over all values of x equals 1
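
A minimal check of these two conditions for an assumed distribution:

```python
# Every f(x) must be at least 0 and the probabilities must sum to 1.
f = {0: 0.1, 1: 0.3, 2: 0.4, 3: 0.2}          # hypothetical P(X = x) values
nonnegative = all(p >= 0 for p in f.values())
sums_to_one = abs(sum(f.values()) - 1) < 1e-9
print(nonnegative and sums_to_one)            # True: this is a valid distribution
```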

Binomial Experiment

An experiment in which there are exactly two possible outcomes for each trial, a fixed number of INDEPENDENT trials, and the probabilities for each trial are the same.

Probability Density Function

a function used to compute probabilities for a continuous random variable. The area under the graph of a probability density function over an interval represents probability