Stat final exam and make-up review

Most analysts focus on the cost of tuition as the way to measure the cost of a college education. But
incidentals, such as textbook costs, are rarely considered. A researcher at the University of Oklahoma
wishes to estimate the textbook costs of first-yea

All first-year University of Oklahoma Students

The evening host of a dinner reached into a bowl, mixed all tickets around, and selected the ticket to award
the grand door prize. What kind of sample will be generated?

Simple random sample

A telemarketer set the company's computerized dialing system to contact every 25th person listed in the
local telephone directory. What kind of sample will be generated?

Systematic sample

The Dean of Students mailed a survey to a total of 400 students. The sample included 100 students
randomly selected from each of the freshman, sophomore, junior, and senior classes on campus last term. What
kind of sample will be generated?

Stratified sample

The width of each bar in a histogram corresponds to the

Differences between the boundaries of the class

In a perfectly symmetrical bell-shaped "normal" distribution

The arithmetic mean equals the median, the median equals the mode, and the arithmetic mean equals the mode

In right-skewed distributions, which of the following is the correct statement?

The distance from Q1 to Q2 is less than the distance from Q2 to Q3

According to the empirical rule, if the data has a "bell-shaped" normal distribution, about _____________
percent of the observations will be contained within 2 standard deviations around the arithmetic mean.

95

Which of the following is NOT a measure of central tendency? (arithmetic mean, geometric mean, mode, interquartile range)

The interquartile range

You were told that the 1st, 2nd, and 3rd quartiles of female students' weight at the University of Oklahoma are 95
lbs, 125 lbs, and 138 lbs. What is the percentage of students who weigh more than 138 lbs?

25%

The rates of return for a stock over a three-year period are 0.527, 0.145, and 0.684. Which of the following
measures is the best measure of central tendency for these rates? (the arithmetic mean of return, the median return, the geometric mean rate of return)

The geometric mean rate of return

Which of the following descriptive measures can be used to identify the outliers in a data set?

The Z-score for each observation

Let's play a game. You may win $200 with the probability of about 33%, and you may lose $100 with
the probability of about 66%. What is the expected value of this game?

$0

The Central Limit Theorem implies that:

Regardless of the population distribution, the sampling distribution of the mean is
approximately normal when the sample size is large enough.

The Standardized Normal Distribution

is bell-shaped and symmetric, with its mean being equal to zero (0) and its standard deviation
being equal to one (1).

Let's assume that B1 and B2 are mutually exclusive and collectively exhaustive events. Also assume
that the joint probability of A and B1 and the joint probability of A and B2 are non-zero. Given these
assumptions, identify the wrong statement:

The probability of an event like A can be written as the product of the conditional probability
of A given B1 and the conditional probability of A given B2

Using ________________, one may make use of new information to update a conditional probability.

Bayes' Theorem

The historical data on the number of times that a given team in Major League Baseball has clinched its
division (i.e., has made it to the next round of the games) is available to almost everyone. You are
asked to report the probability that a given team c

Binomial Probability Distribution Function

The marketing department of a middle-sized manufacturer has 45 employees. 20 of them are female
and 25 of them are male. A group of 5 employees are randomly chosen to travel and meet with regional
sales departments. You are asked to compute the probabilit

Hypergeometric Probability Distribution Function

The historical data on the number of electric network outages per month are available for a local utility
provider. You are asked to compute the probability that more than one (1) outage occurs each month.
What kind of Probability Distribution Function wo

Poisson Probability Distribution Function

X is a continuous random variable (e.g. time required to download a music file), which is normally
distributed with the mean μ and standard deviation σ. Probability of X being less than XL is equal to PL.
Probability of X being more than XU is equal to PU

P(XL ≤ X ≤ XU) = 1 - (PU + PL)

Let Y be a discrete random variable with a Poisson distribution. And let X be a continuous random
variable with a Normal distribution. Also, let C be a constant. Identify the correct statement.

P(X=C) is always equal to zero

The mean of the Sampling Distribution of the Means is an unbiased estimator of the population
mean...

when the sample size is large enough

Identify the correct statement:

***a. We make use of sample statistics (e.g. the sample mean) to estimate population parameters
(e.g. the population mean)***
b. We make use of population parameters (e.g. population mean) to estimate sample statistics
(e.g. the sample mean)
c. The samplin

A and B are two independent events. P(A|B) is equal to:

P(A)

X is a continuous random variable, which is distributed normally. From the X continuum, we choose a
given value, called X*. The Z-value of X* is equal to Z*. Probability of Z<=Z* is equal to P*. What is
the probability of X>X*?

P(X>X*) = 1 - P*

X is a continuous random variable. Z is the Z-value associated with the observations in X. We can say
that X is normally distributed when:

X is a linear function of Z

A population of interest is not distributed normally. A group of researchers repeatedly choose a number
of random samples from this population. As they choose more samples, they increase the sample sizes.
Complete the following statement:
As sample sizes

decreases

X is a continuous random variable with a Normal Probability Distribution. Z is the Z-value associated
with X. The Cumulative Standardized Normal Distribution table/function includes:

P(Z<=Z*)

DCOVA

DEFINE the variables that you want to study in order to solve a problem or meet an objective
COLLECT the data for those variables from appropriate sources
ORGANIZE the data collected by developing tables
VISUALIZE the data collected by developing charts
ANALYZE the data collected to reach conclusions and present those results

Categorical Variables

Have values that can only be placed into categories such as yes and no

Numerical Variables

Have values that represent a counted or measured quantity

Discrete variables

have numerical values that arise from a counting process.
EX: Number of items purchased

Continuous variables

have numerical values that arise from a measuring process.
EX: Time spent waiting in a checkout line

Nominal scale

classifies data into distinct categories in which no ranking is implied

Ordinal scale

classifies values into distinct categories in which ranking is implied

Interval scale

an ordered scale in which the difference between measurements is a meaningful quantity but does not involve a true zero point

Ratio scale

ordered scale in which the difference between the measurements involves a true zero point, as in height, age, or salary measurements.

Primary Data Source

Collect your own data

Secondary data source

Someone else collected the data you are using

Population

Consists of all items or individuals about which you want to reach conclusions

Sample

portion of the population selected for analysis

Structured data

data that follows some organizing principle or plan, typically a repeating pattern

Unstructured data

follows no repeating pattern

Mutually exclusive

The category definitions cause each data value to be placed in one and only one category

Collectively exhaustive

The set of categories you create for the new, recoded variables include all the data values being recoded

Simple random sample

Every item from a frame has the same chance of selection as every other item, and every sample of a fixed size has the same chance of selection as every other sample of that size.

Systematic sample

You partition the N items in the frame into n groups of k items, where k=N/n
Round k to the nearest integer. To select a systematic sample, you choose the first item to be selected at random from the first k items in the frame. Then, you select the remain

Stratified sample

You first subdivide the N items in the frame into separate subpopulations, or strata. A stratum is defined by some common characteristic, such as gender or year in school. You select a simple random sample within each of the strata and combine the results.

Cluster sample

divide the N items in the frame into clusters that contain several items. Clusters are often naturally occurring groups, such as counties. You then take a random sample of one or more clusters and study all items in each selected cluster
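
To make the simple random, systematic, and stratified definitions above concrete, here is a minimal Python sketch. The function names and the example frame are invented for illustration. The systematic version assumes the usual reading of the rule (random start within the first k items, then every kth item after it).

```python
import random

def simple_random_sample(frame, n, rng):
    """Every item, and every sample of size n, is equally likely."""
    return rng.sample(frame, n)

def systematic_sample(frame, n, rng):
    """Random start among the first k items, then every kth item.

    Assumes k = N/n rounded to the nearest integer; if N is not a
    multiple of n, this may return fewer than n items.
    """
    k = round(len(frame) / n)
    start = rng.randrange(k)
    return frame[start::k][:n]

def stratified_sample(strata, n_per_stratum, rng):
    """Simple random sample inside each stratum, results combined."""
    chosen = []
    for items in strata.values():
        chosen.extend(rng.sample(items, n_per_stratum))
    return chosen

rng = random.Random(42)          # fixed seed so the run is repeatable
frame = list(range(100))         # hypothetical frame of N = 100 items
print(systematic_sample(frame, 4, rng))
```

With N = 100 and n = 4, k is 25, which matches the "every 25th person" telemarketer question earlier in this set.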

Summary table

Tallies the values as frequencies or percentages for each category. Helps you see the differences among the categories by displaying the frequency, amount, or percentage of items in a set of categories in a separate column

Contingency table

cross-tabulates, or tallies jointly, the values of two or more categorical variables, allowing you to study patterns that may exist between the variables. Tallies can be shown as a frequency, percentage of the overall total, percentage of the row total,

Frequency distribution

Tallies the values of a numerical variable into a set of numerically ordered classes. Each class groups a mutually exclusive range of values, called a class interval. Each value can be assigned to only one class, and every value must be contained in one o

What are class intervals identified by?

Their class midpoints

Relative frequency distribution

presents the relative frequency, or proportion, of the total for each group that each class represents

Percentage distribution

presents the percentage of the total for each group that each class represents
= proportion x 100%

Proportion (or relative frequency)

is equal to the number of values in each class divided by the total number of values.

Cumulative percentage distribution

provides a way of presenting information about the percentage of values that are less than a specific amount. You use a percentage distribution as the basis to construct a cumulative percentage distribution.

Bar chart

visualizes a categorical variable as a series of bars, with each bar representing the tallies for a single category. The length of each bar represents either the frequency or percentage of values for a category and each bar is separated by a gap

Pareto chart

The tallies for each category are plotted as vertical bars in descending order, according to their frequencies, and are combined with a cumulative percentage line on the same chart. They get their name from the pareto principle, the observation that in ma

Side-by-side bar chart

uses sets of bars to show the joint responses from 2 categorical variables

Histogram

visualizes data as a vertical bar chart in which each bar represents a class interval from a frequency or percentage distribution.

Percentage polygon

Used when using a categorical variable to divide the data of a numerical variable into 2 or more groups. This chart uses the midpoints of each class interval to represent the data of each class and then plots the midpoints, at their respective class perce

Cumulative percentage polygon (ogive)

uses the cumulative percentage distribution to plot the cumulative percentages along the Y axis. Unlike the percentage polygon, the lower boundaries of the class intervals for the numerical variable are plotted, at their respective class percentages, as poin

Multidimensional contingency table

used to tally the responses of 3 or more categorical variables.

Lurking variable

A variable that is affecting the results of the other variables

Central tendency

the extent to which the values of a numerical variable group around a typical, or central, value.

Variation

measures the amount of dispersion, or scattering, away from a central value that the values of a numerical variable show. The shape of a variable is the pattern of the distribution of values from the lowest value to the highest value

Arithmetic mean

typically referred to as the mean, is the most common measure of central tendency.

Median

Middle value in an ordered array of data that has been ranked from smallest to largest.
Position of the median in the ranked data = (n+1)/2
If you have an even number of values, average the 2 middle values.

Geometric mean

Used when you want to measure the rate of change over time
=the nth root of the product of n values
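
As a small illustration of the nth-root formula, the sketch below computes a geometric mean and, for returns, the geometric mean rate of return, taken here as the nth root of the product of the (1 + R) growth factors, minus 1. The function names are invented for this example, and the rates are the ones from the stock question earlier in this set, assumed to be decimal returns.

```python
import math

def geometric_mean(values):
    """nth root of the product of n positive values."""
    return math.prod(values) ** (1 / len(values))

def geometric_mean_rate_of_return(rates):
    """Average per-period growth rate for a list of decimal returns."""
    growth_factors = [1 + r for r in rates]
    return geometric_mean(growth_factors) - 1

rates = [0.527, 0.145, 0.684]    # assumed to be decimal yearly returns
print(round(geometric_mean_rate_of_return(rates), 4))
```

Working with growth factors rather than raw rates is what lets the measure handle compounding, which is why it suits rates of change over time better than the arithmetic mean.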

Variance and Standard deviation

2 commonly used measures of variation that account for how all the values are distributed

How to hand compute sample variance

1. Compute the difference between each value and the mean
2. square each difference
3. sum the squared differences
4. divide this total by n-1 to compute sample variance
5. take the square root of the sample variance to compute sample standard deviation
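
The five hand-computation steps above can be sketched in Python as follows. The function names and the data values are made up for illustration.

```python
import math

def sample_variance(values):
    mean = sum(values) / len(values)
    differences = [x - mean for x in values]    # step 1: value minus mean
    squared = [d ** 2 for d in differences]     # step 2: square each difference
    total = sum(squared)                        # step 3: sum the squares
    return total / (len(values) - 1)            # step 4: divide by n - 1

def sample_std_dev(values):
    return math.sqrt(sample_variance(values))   # step 5: square root

data = [2, 4, 4, 4, 5, 5, 7, 9]                 # hypothetical sample
print(sample_variance(data), sample_std_dev(data))
```

For a quick cross-check, Python's standard library offers `statistics.variance` and `statistics.stdev`, which also divide by n - 1.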

Coefficient of Variation

measures the scatter in the data relative to the mean.
= (standard deviation / mean) x 100%, expressed as a percentage

z score

the difference between that value and the mean, divided by the standard deviation. A z score of 0 indicates that the value is the same as the mean. If it is a positive or negative number, it indicates whether value is above or below the mean and by how ma
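
A minimal sketch of the z score calculation, with made-up numbers. The cutoff of 3 used to flag outliers is a common rule of thumb and is an assumption here, not something stated in this set.

```python
def z_score(value, mean, std_dev):
    """How many standard deviations a value lies from the mean."""
    return (value - mean) / std_dev

def is_outlier(value, mean, std_dev, cutoff=3.0):
    """Flag values whose z score is beyond +/- cutoff (assumed rule of thumb)."""
    return abs(z_score(value, mean, std_dev)) > cutoff

# Hypothetical weights: mean 120 lbs, standard deviation 12 lbs
print(z_score(138, 120, 12))    # 1.5 standard deviations above the mean
print(is_outlier(160, 120, 12))
```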

Left skewed

Most values are in upper portion

Right skewed

Most values are in the lower portion

Mean<Median

negative, or left-skewed distribution

Mean=Median

Symmetrical distribution with 0 skewness

Mean>Median

Positive, or right-skewed distribution

Kurtosis

Measures the extent to which values that are very different from the mean affect the shape of the distribution of a set of data. It affects the peakedness of the curve of the distribution- that is, how sharply the curve rises approaching the center of the

leptokurtic

a kurtosis value that is greater than 0

platykurtic

a kurtosis value that is less than 0

Quartiles

split the values into 4 equal parts

First quartile

divides the smallest 25% of the values from the other 75% that are larger

Second quartile

the median; 50% of the values are smaller than or equal to the median, and 50% are larger than or equal to the median.

Third quartile

divides the smallest 75% of the values from the largest 25%

Percentiles

Split a variable into 100 equal parts

Interquartile range

measures the difference in the center of a distribution between the third and first quartiles

Resistant measures

Descriptive statistics such as the median, Q1, Q3, and the interquartile range, which are not influenced by extreme values

Population mean

Sum of the values in the population divided by the population size, N.

Empirical Rule

States that for population data that form a normal distribution, the following are true:
1. Approximately 68% of the values are within +- 1 standard deviation from the mean
2. Approx. 95% of the values are within +-2 standard deviations from the mean
3. Approx. 99.7% of the values are within +-3 standard deviations from the mean

The Chebyshev Rule

States for any data set, regardless of shape, the percentage of values that are found within distances of k standard deviations from the mean must be at least (1-1/k^2)x100%. You can use this rule for any value of k greater than 1. Use this rule for heavi
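
To compare the Chebyshev bound with the empirical rule, the sketch below evaluates (1-1/k^2)x100% next to the exact percentage for a normal distribution. The function names are invented for this example. The normal percentages come from Python's `statistics.NormalDist`.

```python
from statistics import NormalDist

def chebyshev_lower_bound(k):
    """Minimum percentage of values within k standard deviations (any shape)."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return (1 - 1 / k**2) * 100

def normal_within(k):
    """Percentage of a normal distribution within k standard deviations."""
    z = NormalDist()                 # standardized normal: mean 0, sd 1
    return (z.cdf(k) - z.cdf(-k)) * 100

for k in (2, 3):
    print(k, round(chebyshev_lower_bound(k), 2), round(normal_within(k), 2))
```

The Chebyshev figure is a guaranteed minimum, so it is always at or below the bell-shaped figure for the same k.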

Covariance

measures the strength of the linear relationship between 2 numerical variables

Coefficient of Correlation

Measures the relative strength of a linear relationship between 2 numerical variables. Range from -1 for a perfect negative correlation to +1 for a perfect positive relationship
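
A short sketch of how the sample covariance and the coefficient of correlation relate. It assumes the usual sample formulas (dividing by n - 1). The function names and data are hypothetical.

```python
import math

def sample_covariance(x, y):
    """Sum of paired deviations from the means, divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Covariance divided by the product of the two standard deviations."""
    std_x = math.sqrt(sample_covariance(x, x))   # cov(x, x) is the variance
    std_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (std_x * std_y)

x = [1, 2, 3, 4]
y = [2, 4, 6, 8]                                 # perfectly linear in x
print(sample_covariance(x, y), correlation(x, y))
```

The covariance depends on the units of the two variables, while the correlation is unit-free and always falls between -1 and +1.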

Probability

Numerical value representing the chance, likelihood, or possibility that a particular event will occur.

A priori probability

Probability of an occurrence is based on prior knowledge of the process involved.

Empirical probability

Probabilities are based on observed data, not on prior knowledge of a process

Subjective probability

differs from person to person; usually based on a person's past experience, personal opinion, and analysis of a particular situation

Event

Each possible outcome of a variable

Simple event

Described by a single characteristic

Joint event

An event that has 2 or more characteristics

Sample space

collection of all possible events

Simple probability

probability of the occurrence of a simple event

Joint probability

probability of occurrence involving 2 or more events

Marginal probability

Consists of a set of joint probabilities (Add them all together)

General addition rule

P(A or B)= P(A)+P(B)-P(A and B)

Conditional Probability

refers to the probability of event A, given information about the occurrence of another event, B.
P(A|B) = P(A and B) / P(B)

Decision tree

alternative to a contingency table

Independence

When the outcome of one event does not affect the probability of occurrence of another event. 2 events are independent if P(A|B) = P(A)

General multiplication rule

P(A and B) = P(A|B)P(B)

Multiplication rule for independent events

P(A and B)= P(A)P(B)
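
The addition, conditional probability, and multiplication rules can be checked against a small contingency table. The counts below are invented for illustration, and exact fractions are used so the independence test is not affected by rounding.

```python
from fractions import Fraction

# Hypothetical 2x2 contingency table of counts
counts = {
    ("A", "B"): 20, ("A", "not B"): 30,
    ("not A", "B"): 20, ("not A", "not B"): 30,
}
total = sum(counts.values())

p_a_and_b = Fraction(counts[("A", "B")], total)                       # joint
p_a = Fraction(counts[("A", "B")] + counts[("A", "not B")], total)    # marginal
p_b = Fraction(counts[("A", "B")] + counts[("not A", "B")], total)    # marginal

p_a_or_b = p_a + p_b - p_a_and_b      # general addition rule
p_a_given_b = p_a_and_b / p_b         # conditional probability
independent = p_a_given_b == p_a      # independence check

print(p_a_or_b, p_a_given_b, independent)
```

In this made-up table P(A|B) equals P(A), so the events are independent and P(A and B) also equals P(A)P(B).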

Bayes' theorem

used to revise previously calculated probabilities based on new information
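
A minimal sketch of the revision step, assuming the events B1, B2, ... are mutually exclusive and collectively exhaustive, so that P(Bi|A) = P(A|Bi)P(Bi) divided by the sum of P(A|Bj)P(Bj) over all j. The function name and the probabilities are hypothetical.

```python
def bayes_posterior(priors, likelihoods):
    """Revise prior probabilities P(Bi) after observing event A.

    priors      -- P(B1), P(B2), ... (must sum to 1)
    likelihoods -- P(A|B1), P(A|B2), ...
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]   # P(A and Bi)
    p_a = sum(joint)                                       # marginal P(A)
    return [j / p_a for j in joint]                        # P(Bi|A)

# Hypothetical inputs: P(B1)=0.4, P(B2)=0.6, P(A|B1)=0.8, P(A|B2)=0.3
print(bayes_posterior([0.4, 0.6], [0.8, 0.3]))
```

The denominator is the marginal probability of A written as a sum of joint probabilities, not a product of conditional probabilities.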

Probability distribution for a discrete variable

mutually exclusive list of all the possible numerical outcomes along with the probability of occurrence of each outcome

Expected value

the mean of the probability distribution
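
As a worked example, the sketch below computes the expected value of the $200/$100 game from earlier in this set. It assumes the "about 33%" and "about 66%" probabilities are meant as 1/3 and 2/3. The function name is invented.

```python
from fractions import Fraction

def expected_value(outcomes):
    """Sum of each outcome multiplied by its probability."""
    return sum(value * prob for value, prob in outcomes)

# Win $200 with probability 1/3, lose $100 with probability 2/3 (assumed exact)
game = [(200, Fraction(1, 3)), (-100, Fraction(2, 3))]
print(expected_value(game))   # prints 0
```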

Covariance of a probability distribution

measures the strength of the relationship between 2 variables

Mathematical model

mathematical expression that represents a variable of interest.

Probability distribution function

math model for discrete random variables

Binomial distribution

Used when the discrete variable is the number of events of interest in a sample of n observations; has 4 important properties:
1. The sample consists of a fixed number of observations, n.
2. Each observation is classified into one of 2 mutually exclusive
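
A minimal sketch of the binomial probability calculation, assuming the standard formula with a constant probability of the event of interest and independent observations. The function and parameter names are hypothetical.

```python
import math

def binomial_pmf(x, n, p_event):
    """P(X = x): exactly x events of interest in n observations."""
    return math.comb(n, x) * p_event**x * (1 - p_event) ** (n - x)

# Hypothetical case: n = 4 observations, event probability 0.5
for x in range(5):
    print(x, binomial_pmf(x, 4, 0.5))
```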

Poisson Distribution

Used to calculate probabilities in situations such as these if the following properties hold:
1. You are interested in counting the number of times a particular event occurs in a given area of opportunity. The area of opportunity is defined by time, lengt
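
For a question like the network outage one earlier in this set, a Poisson calculation can be sketched as follows. The mean of 2 outages per month is a made-up value, and the function names are invented.

```python
import math

def poisson_pmf(x, mean_rate):
    """P(X = x) events in one area of opportunity."""
    return math.exp(-mean_rate) * mean_rate**x / math.factorial(x)

def prob_more_than(x, mean_rate):
    """P(X > x), computed as 1 minus P(X <= x)."""
    return 1 - sum(poisson_pmf(i, mean_rate) for i in range(x + 1))

mean_outages = 2.0    # hypothetical mean number of outages per month
print(round(prob_more_than(1, mean_outages), 4))
```

Using the complement avoids summing the infinitely many outcomes above 1.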

Hypergeometric Distribution

The sample data are selected without replacement from a finite population, thus the result of one observation is dependent on the results of the previous observations.
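
A sketch of the hypergeometric calculation using the 45-employee example from earlier in this set (20 female, 25 male, 5 chosen). The event of exactly 2 females is an assumption, since the original question is cut off. The function and parameter names are hypothetical.

```python
import math

def hypergeometric_pmf(x, sample_size, population_size, events_in_population):
    """P(X = x) events of interest when sampling without replacement."""
    ways_events = math.comb(events_in_population, x)
    ways_others = math.comb(population_size - events_in_population, sample_size - x)
    return ways_events * ways_others / math.comb(population_size, sample_size)

# Assumed question: probability that exactly 2 of the 5 chosen are female
print(round(hypergeometric_pmf(2, 5, 45, 20), 4))
```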

Normal distribution

the most common continuous distribution used in statistics. It is vitally important in statistics for 3 main reasons:
1. Numerous continuous variables common in business have distributions that closely resemble the normal distribution
2. The normal distri

Important theoretical properties of the normal distribution

1. It is symmetrical, and its mean and median are therefore equal
2. It is bell-shaped in appearance
3. Its interquartile range is equal to 1.33 standard deviations. Thus, the middle 50% of the values are contained within an interval of two-thirds of a sta

Normal probability plot

a visual display that helps you evaluate whether the data are normally distributed

Sampling distribution of the mean

The distribution of all possible sample means if you select all possible samples of a given size

Central Limit theorem

As the sample size gets large enough, the sampling distribution of the mean is approx. normally distributed. This is true regardless of the shape of the distribution of the individual values in the population
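
The theorem can be seen in a small simulation. The sketch below draws repeated samples from a right-skewed exponential population (an arbitrary choice with mean 1 and standard deviation 1) and summarizes the sample means. The function names, seed, and sample counts are all assumptions made for illustration.

```python
import random
import statistics

def draw_exponential(rng):
    """One value from a right-skewed population with mean 1."""
    return rng.expovariate(1.0)

def sample_means(draw, sample_size, num_samples, rng):
    """Means of num_samples independent samples of the given size."""
    return [
        statistics.fmean(draw(rng) for _ in range(sample_size))
        for _ in range(num_samples)
    ]

rng = random.Random(0)    # fixed seed so the run is repeatable
for n in (2, 30):
    means = sample_means(draw_exponential, n, 2000, rng)
    print(n, round(statistics.fmean(means), 3), round(statistics.stdev(means), 3))
```

The means should center near the population mean for both sample sizes, while their spread shrinks toward the population standard deviation divided by the square root of n as n grows.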

Conclusions of the central limit theorem

1. For most distributions, regardless of the shape of the population, the sampling distribution of the mean is approx. normally distributed if samples of at least size 30 are selected
2. If the distribution of the population is fairly symmetrical, the sam

Sampling Error

the variation that occurs due to selecting a single sample from the population. The size of the sampling error is primarily based on the amount of variation in the population and on the sample size. Large samples have less sampling error than small sample

t distribution

very similar in appearance to the standardized normal distribution. The t distribution has more area in the tails and less in the center than does the standardized normal distribution.

What 3 quantities do you need to compute the sample size

1. The desired confidence level, which determines the value of the critical value from the standardized normal distribution
2. The acceptable sampling error
3. The standard deviation
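
These three quantities combine in the usual sample size formula for estimating a mean, n = Z^2 x sigma^2 / e^2, rounded up. The sketch below assumes that formula. The function name and the example numbers are hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(confidence_level, sampling_error, std_dev):
    """Smallest n meeting the sampling error at the given confidence level."""
    # Critical value from the standardized normal distribution
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    return math.ceil((z * std_dev / sampling_error) ** 2)

# Hypothetical: 95% confidence, sampling error of 5, standard deviation of 25
print(sample_size_for_mean(0.95, 5, 25))   # prints 97
```

Rounding up rather than to the nearest integer keeps the sampling error at or below the acceptable level.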

Null Hypothesis

The hypothesis that the population parameter is equal to the company specification.

Alternative hypothesis

the conclusion reached by rejecting the null hypothesis

Summary of null and alternative hypothesis

1. The null hypothesis represents the current belief in the situation
2. The alternative hypothesis is the opposite of the null hypothesis and represents a research claim or specific inference you would like to prove
3. If you reject the null hypothesis,

Critical Value

The first thing you determine to make a decision concerning the null hypothesis; it divides the nonrejection region from the rejection region. The size of the rejection region is directly related to the risks involved in using only sample evidence to make

Type 1 Error

occurs if you reject the null hypothesis when it is true and should not be rejected; known as a "false alarm"

Type 2 error

occurs if you do not reject the null hypothesis when it is false and should be rejected; known as a "missed opportunity" to take some corrective action

Level of significance

probability of committing a type 1 error

β risk

probability of committing a type 2 error

Confidence coefficient

the complement of the probability of a type 1 error; the probability that you will not reject the null hypothesis when it is true and should not be rejected.

Power of a statistical test

The complement of the probability of a type 2 error; the probability that you will reject the null hypothesis when it is false and should be rejected

p-value

the probability of getting a test statistic equal to or more extreme than the sample result, given that the null hypothesis is true; known as the observed level of significance. Using the p-value to determine rejection and nonrejection is another approach

The decision rules for rejecting the null hypothesis in the p-value approach are

1. If the p-value is greater than or equal to α, do not reject the null hypothesis
2. If the p-value is less than α, reject the null hypothesis
IF THE P-VALUE IS LOW, THE NULL HYPOTHESIS MUST GO
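
The decision rule can be sketched as below. The two-tail p-value helper assumes a Z test statistic, which is only one possible test. The function names are invented for this example.

```python
from statistics import NormalDist

def two_tail_p_value(z_stat):
    """Probability of a Z statistic at least this extreme in either tail."""
    return 2 * (1 - NormalDist().cdf(abs(z_stat)))

def decide(p_value, alpha):
    """Apply the p-value decision rule at level of significance alpha."""
    return "reject H0" if p_value < alpha else "do not reject H0"

z_stat = 2.5                        # hypothetical test statistic
p = two_tail_p_value(z_stat)
print(round(p, 4), decide(p, 0.05))
```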

Robust test

t test is an example; it does not lose power if the shape of the population departs somewhat from a normal distribution, particularly when the sample size is large enough to enable the test statistic to follow the t distribution.

Summary of the null and alternative hypotheses for one-tail tests

1. The null hypothesis represents the status quo or the current belief in a situation
2. The alternative hypothesis is the opposite of the null hypothesis and represents a research claim or specific inference you would like to prove
3. If you reject the n

pooled-variance t test

Can be used if you assume that the random samples are independently selected from 2 populations and that the populations are normally distributed and have equal variances to determine whether there is a significant difference between the means
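
A sketch of the test statistic, assuming the standard pooled-variance formula with a hypothesized difference of zero between the population means. The function name and sample values are made up, and the resulting t statistic would still need to be compared with a critical value from the t distribution with n1 + n2 - 2 degrees of freedom.

```python
import math
import statistics

def pooled_variance_t(sample1, sample2):
    """Return (t statistic, degrees of freedom) for H0: equal means."""
    n1, n2 = len(sample1), len(sample2)
    var1 = statistics.variance(sample1)          # sample variances (n - 1)
    var2 = statistics.variance(sample2)
    df = n1 + n2 - 2
    pooled = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    mean_diff = statistics.fmean(sample1) - statistics.fmean(sample2)
    t_stat = mean_diff / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return t_stat, df

print(pooled_variance_t([1, 2, 3], [4, 5, 6]))   # hypothetical samples
```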

When do you reject the null hypothesis in a two tail test?

if the computed test statistic is greater than the upper-tail critical value from the t distribution or if the computed test statistic is less than the lower tail critical value from the t distribution

Separate-variance t test

Used if you can assume that the 2 independent populations are normally distributed but cannot assume that they have equal variances. In that case, you cannot pool the two sample variances into a common estimate and therefore cannot use the pooled-variance t test.

paired t test

Can be used if you assume that the difference scores are randomly and independently selected from a population that is normally distributed in order to determine whether there is a significant population mean difference