ISDS 1102 Ch. 1-3

statistics

the methodology of extracting useful information from a data set

find the right data
use appropriate statistical tools
clearly communicate the numerical information in written language

to do good statistics...
-
-
-

descriptive statistics & inferential statistics

two branches of statistics

descriptive statistics

collecting, organizing, and presenting the data

inferential statistics

drawing conclusions about a population based on sample data from that population

population

consists of all items of interest

sample

a subset of the population

sample statistic

calculated from the data and is used to make inferences about the population parameter

too expensive to gather information on the entire population
often impossible to gather information on the entire population

reasons for sampling from the population

cross-sectional data

data collected by recording a characteristic of many subjects at the same point in time, or without regard to differences in time

individuals
households
firms
industries
regions
countries

subjects of cross-sectional data

time series data

data collected by recording a characteristic of a subject over several time periods

daily
weekly
monthly
quarterly
annual
(observations)

time series data can include:
-
-
-
-
-

35 is likely the estimated average age; it would be practically impossible to survey every video game player

Many people regard video games as an obsession for youngsters, but in fact, the average age of video game players is 35 years. Is the value likely the actual or estimated average age of the population? Explain.

All marketing managers
No! The average salary was likely computed from a sample in order to save time and money.

Business graduates in the U.S. with a marketing concentration earn high salaries. According to the Bureau of Labor Statistics, the average annual salary for marketing managers was $104,400 in 2007.
What is the relevant population?
Do you think the average

variable

general characteristic being observed on an object of interest

qualitative & quantitative

types of variables

gender
race
political affiliation
(categorical)

qualitative variables examples

test scores
age
weights

quantitative variables examples

discrete

assumes a countable number of distinct values

discrete & continuous

types of quantitative variables

continuous

can assume an infinite number of values within some interval

discrete

The number of children in a family, or the number of points scored in a basketball game is an example of what quantitative variable?

continuous

weight, height, and investment return are examples of what quantitative variable?

nominal & ordinal

qualitative scales of measurement

interval & ratio

quantitative scales of measurement

nominal scale

least sophisticated level of measurement; data are simply categories for grouping data

ordinal scale

data may be categorized AND ranked with respect to some characteristic or trait
differences between categories are meaningless because the actual numbers used may be arbitrary

interval scale

data may be categorized AND ranked with respect to some characteristic or trait
differences between the values are equal and meaningful. Thus arithmetic operations of addition and subtraction are meaningful.
No "absolute 0" or starting point defined. Mean

ratio scale

strongest level of measurement
data may be categorized AND ranked with respect to some characteristic or trait
differences between interval values are equal and meaningful.
there is an "absolute 0" or defined starting point. "0" does mean "the absence of."

weight...sales
time...profits
distance...inventory levels

these variables are measured on a ratio scale
-...-
-...-
-...-

qualitative

is the following variable qualitative or quantitative?
if quantitative, is it discrete or continuous?
Colors of cars in a mall parking lot:

quantitative and continuous

is the following variable qualitative or quantitative?
if quantitative, is it discrete or continuous?
Time it takes each student to complete a final exam:

quantitative and discrete

is the following variable qualitative or quantitative?
if quantitative, is it discrete or continuous?
The number of patrons who frequent a restaurant

quantitative and ratio

define the type of measurement scale:
an investor collects data on the weekly closing price of gold throughout a year

qualitative and ordinal

define the type of measurement scale:
analyst assigns a sample of bond issues to one of the following credit ratings: AAA, AA, BBB, BB, CC, DD

qualitative and nominal

define the type of measurement scale:
the dean of the business school categorizes the students by majors (i.e. accounting, marketing, management, isds, economics, finance)

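false... a statistic is computed from sample data, so it is observable; a parameter is usually unobservable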

true or false?
a statistic is usually unobservable while a parameter is usually observable.

true

true or false?
a sample is the portion of the universe that is selected for analysis

inferential

Mediterranean fruit flies were discovered in CA a few years ago and badly damaged the oranges grown in that state. Suppose the manager of a large farm wanted to study the impact of the fruit flies on the orange crops on a daily basis over a 6-week period.

descriptive

The Human Resources Director of a large corporation wishes to develop an employee benefits package and decides to select 500 employees from a list of all (N=40,000) workers in order to study their preferences for the various components of a potential pack

2,100 business college students at the time of the study
statistics
96 students involved in the study

Three professors at NKU compared two different approaches to teaching courses in the school of business. At the time of the study, there were 2,100 students in the business school and 96 students were involved in the study. Demographic data collected on t

parameter

Mediterranean fruit flies were discovered in CA a few years ago and badly damaged the oranges grown in that state. Suppose the manager of a large farm wanted to study the impact of the fruit flies on the orange crops on a daily basis over a 6-week period.

inferential statistics

The Human Resources Director of a large corporation wishes to develop an employee benefits package and decides to select 500 employees from a list of all (N=40,000) workers in order to study their preferences for the various components of a potential pack

true

true or false?
Compiling the number of registered voters who turned out to vote for the primary in Iowa is an example of descriptive statistics.

descriptive

The Commissioner of Health in NY State wanted to study malpractice litigation in NY. A sample of 31 thousand medical records was drawn from a population of 2.7 million patients who were discharged during the year 2009. The collection, presentation, and ch

statistics

Mediterranean fruit flies were discovered in CA a few years ago and badly damaged the oranges grown in that state. Suppose the manager of a large farm wanted to study the impact of the fruit flies on the orange crops on a daily basis over a 6-week period.

parameters

The Quality Assurance Department of a large urban hospital is attempting to monitor and evaluate patient satisfaction with hospital services. Prior to discharge, a random sample of patients is asked to fill out a questionnaire to rate such services as med

variable

a characteristic of an item or individual

operational definitions

universally accepted meanings that are clear to all associated with an analysis

statistic

a measure that describes a characteristic of a sample

population

all the items or individuals about which you want to draw a conclusion

inferential statistics

methods that use the data collected from a small group to draw conclusions about a larger group

parameter

a measure that describes a characteristic of a population

data

different values associated with a variable

sample

the portion of a population selected for analysis

descriptive statistics

methods that help collect, summarize, present, and analyze a set of data

true

true or false?
A professor computing the sample average exam score of 20 students and using it to estimate the average exam score of the 1,500 students taking the exam is an example of inferential statistics.

frequency distribution

for qualitative data, groups data into categories and records how many observations fall into each category

relative frequency

divide each category's frequency by the sample size

100

to express relative frequencies in terms of percentages, multiply each proportion by ______

1.00 & 100%

Note that with relative frequency, the total of the proportions must add to be ___ and the total of the percentages must add to be _____
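
A minimal Python sketch of a frequency and relative frequency distribution for qualitative data. The car colors are made-up sample values, not data from the course.

```python
from collections import Counter

# Hypothetical sample of a qualitative variable (car colors).
colors = ["red", "blue", "red", "white", "blue", "red", "black", "white"]

frequency = Counter(colors)          # category -> number of observations
n = len(colors)                      # sample size

# Relative frequency: each category's frequency divided by the sample size.
relative = {c: f / n for c, f in frequency.items()}
# Percentages: multiply each proportion by 100.
percent = {c: 100 * p for c, p in relative.items()}

for c in frequency:
    print(f"{c:6} {frequency[c]:3d} {relative[c]:.3f} {percent[c]:5.1f}%")

# The proportions total 1.00 and the percentages total 100%.
print(sum(relative.values()), sum(percent.values()))
```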

pie chart

is a segmented circle whose segments portray the relative frequencies of the categories of some qualitative variable

bar chart

depicts the frequency or the relative frequency for each category of the qualitative data as a bar rising vertically from the horizontal axis

frequency distribution

for quantitative data, groups data into intervals called classes, and records the number of observations that fall into each class

classes are mutually exclusive
classes are exhaustive

guidelines when constructing frequency distribution

5-20

the number of classes usually ranges from __ to __

(largest value - smallest value) / number of classes

approximating the class width:

cumulative frequency distribution

specifies how many observations fall below the upper limit of a particular class

relative frequency distribution

identifies the proportion or fraction of values that fall into each class

class frequency / total number of observations

class relative frequency =

cumulative relative frequency distribution

gives the proportion or fraction of values that fall below the upper limit of each class
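
The frequency, relative frequency, and cumulative distributions for quantitative data can be sketched in Python as follows. The data, the choice of 5 classes, the rounded width of 10, and the starting limit of 10 are all assumptions made for illustration.

```python
from itertools import accumulate

# Hypothetical quantitative sample (15 observations).
data = [12, 15, 21, 22, 25, 28, 31, 33, 34, 38, 41, 45, 47, 52, 58]

num_classes = 5
# Approximate class width: (largest value - smallest value) / number of classes
width_estimate = (max(data) - min(data)) / num_classes   # 46 / 5 = 9.2
width = 10    # assumption: round the estimate up to a convenient width
start = 10    # assumption: first lower limit, at or below the smallest value

# Classes are [10, 20), [20, 30), ...: mutually exclusive and exhaustive.
freq = [0] * num_classes
for x in data:
    freq[(x - start) // width] += 1

n = len(data)
rel_freq = [f / n for f in freq]            # class frequency / total observations
cum_freq = list(accumulate(freq))           # observations below each upper limit
cum_rel_freq = [c / n for c in cum_freq]    # proportion below each upper limit

for i in range(num_classes):
    lower = start + i * width
    print(f"{lower} up to {lower + width}: {freq[i]:2d} {rel_freq[i]:.3f} "
          f"{cum_freq[i]:2d} {cum_rel_freq[i]:.3f}")
```

Plotting `cum_freq` or `cum_rel_freq` against the upper class limits would give the points of an ogive.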

histogram

a visual representation of a frequency or a relative frequency distribution

respective class frequency (relative frequency)

bar height in a histogram represents

class width

bar width in a histogram represents

y-axis scale

the only difference between frequency and relative frequency histograms is the

shape of distribution

typically symmetric or skewed

symmetric

mirror image on both sides of its center

positively skewed

data form a long, narrow tail to the right

negatively skewed

data form a long, narrow tail to the left

polygon

is a visual representation of a frequency or a relative frequency distribution

x-axis, y-axis

In a polygon:
plot the class midpoints on the ______ and the associated frequency (or relative frequency) on the ______

straight line

neighboring points in a polygon are connected with a _______ ______

ogive

is a visual representation of cumulative frequency or a cumulative relative frequency distribution

stem-and-leaf diagram

provides a visual display of quantitative data; gives an overall picture of the data's center and variability

stem...leaf

each value in a stem-and-leaf data set is separated into two parts: the _____ consists of the leftmost digits, and the ____ is the last digit
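
A short Python sketch of building a stem-and-leaf diagram. It reuses the 11 cereal carbohydrate values that appear in a later card of these notes; splitting on the tens digit assumes two-digit whole numbers.

```python
from collections import defaultdict

# Carbohydrate values (grams) from the cereal card in these notes.
data = [11, 15, 23, 29, 19, 22, 21, 20, 15, 25, 17]

stems = defaultdict(list)
for x in sorted(data):
    stem, leaf = divmod(x, 10)   # stem = leftmost digit(s), leaf = last digit
    stems[stem].append(leaf)

for stem in sorted(stems):
    print(stem, "|", " ".join(str(leaf) for leaf in stems[stem]))
```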

scatterplot

is used to determine if two variables are related

linear relationship

upward or downward sloping trend of data

positive

linear relationship in which as x increases, so does y

negative

linear relationship in which as x increases, y decreases

curvilinear relationship

relationship in which, as x increases, y increases at an increasing (or decreasing) rate,
or, as x increases, y decreases at an increasing (or decreasing) rate

no relationship

data are randomly scattered with no discernible pattern

false... the categories can be ranked, so the scale is ordinal

true or false?
the quality ("terrible", "poor", "fair", "acceptable", "very good", and "excellent") of a day care center is an example of a nominal scaled variable.

false... an ogive graphs a cumulative distribution; use a polygon

If you wish to construct a graph of a relative frequency distribution, you would most likely construct an ogive.

false... it is the relative frequencies, not the cumulative frequencies, that sum to 1

true or false?
the sum of cumulative frequencies in a distribution always equals 1

false... calories are measured, not counted, so the variable is continuous

true or false?
the amount of calories contained in a pack of 12-ounce cheese is an example of a discrete variable

false

true or false?
faculty rank (professor or lecturer) is an example of discrete numerical data

false

true or false?
the grade level (K-12) of a student is an example of a numerical variable

discrete

An insurance company evaluates many numerical variables about a person before deciding on an appropriate rate for automobile insurance. The number of claims a person has made in the last 3 years is an example of a ________ numerical variable

discrete

A personal computer user survey was conducted. Number of personal computers owned is an example of a ________ numerical variable

true

true or false?
the relative frequency is the frequency in each class divided by the total number of observations

continuous

a personal computer user survey was conducted, hours of personal computer use per week is an example of a ___________ numerical variable

categorical

in purchasing an automobile, there are a number of variables to consider... the color of the car is an example of a ___________ variable

continuous

An insurance company evaluates many numerical variables about a person before deciding on an appropriate rate for automobile insurance. The distance a person drives in a year is an example of a __________ numerical variable

true

true or false?
the possible responses to the question "how long have you been living at your current residence?" are values from a continuous variable

true

true or false?
the amount of calories contained in a pack of 12-ounce cheese will be measured on a ratio scale.

class midpoint

the point halfway between the boundaries of each class interval in a grouped frequency distribution is called the _______________

false... usually between 5 and 15

true or false?
In general, a frequency distribution should have at least 8 class groups but no more than 20

true

true or false?
the possible responses to the question "how many times in the past three months have you visited a city park?" are values from a discrete variable.

true

true or false?
the larger the number of observations in a numerical data set, the larger the number of class intervals needed for a grouped frequency distribution

categorical

in purchasing an automobile, there are a number of variables to consider; the body style of the car (sedan, coupe, wagon, etc.) is an example of a ___________ variable

true... a percentage polygon shows the distribution of a numerical variable such as age

true or false?
Apple Computer Inc. collected information on the age of their customers; the youngest customer was 12 and the oldest was 72; to study the distribution of age among its customers, it can use a percentage polygon

false

true or false?
the number of defective apples in a single box is an example of a continuous variable

false... in a histogram there are no gaps between adjacent bars

true or false?
a histogram can have gaps between the bars, whereas bar charts cannot have gaps

central tendency

the extent to which all the data values group around a typical or central value

variation

the amount of dispersion or scattering of values

shape

the pattern of the distribution of values from the lowest value to the highest value

mean

most common measure of central tendency; sum of values divided by the number of values; affected by extreme values (outliers)

median

in an ordered array, the ______ is the "middle" number (50% above, 50% below); not affected by extreme values

(n+1)/2

median position=
position in the ordered data, not the value of the median

rules of the median position

if the number of values is odd, the median is the middle number
if the number of values is even, the median is the average of the two middle numbers
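
The median position and the odd/even rules can be sketched in Python as below. The function name and both sample lists are hypothetical.

```python
def median(values):
    """Median via the (n+1)/2 position in the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2            # 1-based position, not the median itself
    if n % 2 == 1:
        # Odd count: the position is a whole number, take the middle value.
        return ordered[int(position) - 1]
    # Even count: the position ends in .5, average the two middle values.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2

print(median([7, 3, 9, 1, 5]))    # odd number of values
print(median([4, 8, 1, 10]))      # even number of values
```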

mode

value that occurs most often; not affected by extreme values; used for either numerical or categorical (nominal) data; there may be none, or there may be several

mean

which measure to choose?
is generally used, unless extreme values (outliers) exist

median

which measure to choose?
is often used, since it is not sensitive to extreme values

mean and median

which measure to choose?
in some situations it makes sense to report on ____ and ______

measures of variation

give information on the spread, variability, or dispersion of the data values

range

simplest measure of variation; difference between the largest and the smallest values; sensitive to outliers

ignores the way in which data are distributed

why can the range be misleading?

sample variance

average (approximately) of squared deviations of values from the mean

sample standard deviation

most commonly used measure of variation; shows variation about the mean; is the square root of the variance; has the same units as the original data

steps for computing sample standard deviation

1. compute the difference between each value and the mean
2. square each difference
3. add the squared differences
4. divide this total by n-1 to get the sample variance
5. take the square root of the sample variance to get the sample standard deviation
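
The five steps above, written out as a Python sketch. The data values are hypothetical.

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]     # hypothetical sample
n = len(data)
mean = sum(data) / n

differences = [x - mean for x in data]      # step 1: value minus the mean
squared = [d ** 2 for d in differences]     # step 2: square each difference
total = sum(squared)                        # step 3: add the squared differences
sample_variance = total / (n - 1)           # step 4: divide by n-1
sample_sd = math.sqrt(sample_variance)      # step 5: square root of the variance

print(sample_variance, sample_sd)
```

Dividing by n-1 rather than n is what makes this the sample (not population) variance.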

greater

the more the data are spread out, the _______ the range, variance, and standard deviation

smaller

the more the data are concentrated, the _______ the range, variance, and standard deviation

zero

if all the values are the same (no variation), all these measures will be ____

negative

range, variance, and standard deviation are never ________

z-score

to compute the _______ of a data value, subtract the mean and divide by the standard deviation

z-score

is the number of standard deviations a data value is from the mean

extreme outlier

a data value is considered an _______ _______ if its Z-score is less than -3.0 or greater than +3.0

larger...farther

the ______ the absolute value of the Z-score, the _______ the data value is from the mean

(X - X̄) / S = (data value - sample mean) / sample standard deviation

Z=
z-score formula
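
A minimal Python sketch of the Z-score and the extreme-outlier rule. The function names and the numbers (mean 100, standard deviation 15) are hypothetical.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations a value lies from the mean."""
    return (value - mean) / std_dev

def is_extreme_outlier(z):
    """Rule from the notes: Z less than -3.0 or greater than +3.0."""
    return z < -3.0 or z > 3.0

z = z_score(150, 100, 15)            # a value well above the mean
print(z, is_extreme_outlier(z))

z_low = z_score(85, 100, 15)         # one standard deviation below the mean
print(z_low, is_extreme_outlier(z_low))
```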

shape of a distribution

describes how data are distributed

skewness and kurtosis

two useful shape related statistics are:

skewness

measures the amount of asymmetry in a distribution

kurtosis

measures the relative concentration of values in the center of a distribution as compared with the tails

left skewed

long tail (distortion) to the left caused by extremely low values, pulls down the mean so it is less than the median
mean<median

symmetric

right and left tails are equal so mean=median

right skewed

long tail (distortion) to the right caused by extremely high values which pull the mean upward so mean is greater than median
mean > median
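
A quick Python check of how the mean and median compare for each shape. All three data sets are made up to show the pattern.

```python
from statistics import mean, median

# Hypothetical data sets, one per shape.
left_skewed = [1, 47, 48, 48, 48, 49, 49, 50]    # one extremely low value
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 50]         # one extremely high value

for name, values in [("left skewed", left_skewed),
                     ("symmetric", symmetric),
                     ("right skewed", right_skewed)]:
    print(f"{name:12} mean={mean(values)} median={median(values)}")
```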

quartiles
five-number summary
boxplot

another way to describe numerical data
-
-
-

quartile measures

split the ranked data into 4 segments with an equal number of values per segment

Q1=(n+1)/4
Q2=(n+1)/2
Q3=3(n+1)/4
where n is the number of observed values (positions)

find a quartile by determining the value in the appropriate position in the ranked data
3 formulas:

quartile measure: calculation rules

1. if the result is a whole number then it is the ranked position to use
2. if the result is a fractional half (2.5,7.5,8.5, etc.) then average the two corresponding data values
3. if the result is neither (not whole or fractional half) then round the res

median

Q2 is a measure of central tendency or ______

IQR- interquartile range

is Q3-Q1 and measures the spread in the middle 50% of the data
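
The quartile positions, the calculation rules, and the IQR can be sketched in Python as below. Rule 3 is cut off in these notes, so rounding to the nearest whole position is an assumption here; the function names and the data are hypothetical.

```python
def value_at(ordered, position):
    """Value at a 1-based ranked position, following the calculation rules."""
    whole = int(position)
    frac = position - whole
    if frac == 0:
        # Rule 1: whole number, use that ranked position.
        return ordered[whole - 1]
    if frac == 0.5:
        # Rule 2: fractional half, average the two neighbouring values.
        return (ordered[whole - 1] + ordered[whole]) / 2
    # Rule 3 (ASSUMED, truncated in the notes): round to the nearest position.
    return ordered[round(position) - 1]

def quartiles(values):
    ordered = sorted(values)
    n = len(ordered)
    q1 = value_at(ordered, (n + 1) / 4)
    q2 = value_at(ordered, (n + 1) / 2)
    q3 = value_at(ordered, 3 * (n + 1) / 4)
    return q1, q2, q3

data = [11, 12, 13, 16, 16, 17, 18, 21, 22]   # hypothetical sample, n = 9
q1, q2, q3 = quartiles(data)
iqr = q3 - q1                                  # spread of the middle 50%
print(q1, q2, q3, iqr)
```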

midspread

IQR is also called the _________ because it covers the middle 50% of the data

influenced by outliers or extreme values

IQR is a measure of variability that is not __________ __ ________ __ _______ ______

resistant measures

measures like the median, Q1, Q3, and IQR that are not influenced by outliers are called _________ ________

the five-number summary

the five numbers that help describe the center, spread, and shape of the data

boxplot

a graphical display of the data based on the five-number summary; a longer right side implies skewness to the right

median

in a boxplot:
vertical line inside the box

location of Q1

in a boxplot:
vertical line at the left side of box is the

location of Q3

in a boxplot:
vertical line at right side of box is the

smallest value

the vertical line at the far left is the

25% of the data

area between the left side of the box and the smallest value represents

middle fifty

the area inside the box is the

largest value

vertical line to the far right is the

the top 25% of the data

area between the right side of the box and the largest value represents

symmetrical data

the box plot implies

median...endpoints

if data are symmetric around the ______ then the box and central line are centered between the _________

vertical or horizontal

a boxplot can be shown in either a ________ or __________ orientation

sample...population

descriptive statistics discussed previously described a ______ not the __________

parameters

summary measures describing a population are called __________, and are denoted with Greek letters

population mean
population variance
population standard deviation

important population parameters are
-
-
-

population mean

is the sum of the values in the population divided by the population size

population variance

average of squared deviations of values from the mean

population standard deviation

most commonly used measure of variation; shows variation about the mean; is the square root of the population variance; has the same units as the original data

empirical rule

approximates the variation of data in a bell-shaped distribution

68%

about ___ of the data in a bell shaped distribution is within 1 standard deviation of the mean

95%

about ___ of the data in a bell-shaped distribution lies within two standard deviations of the mean

99.7%

about _____ of the data in a bell-shaped distribution lies within three standard deviations of the mean
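
The three percentages can be checked in Python, assuming "bell-shaped" means a normal distribution (the empirical rule is only approximate for other bell-shaped data).

```python
from statistics import NormalDist

standard_normal = NormalDist()    # mean 0, standard deviation 1

within = {}
for k in (1, 2, 3):
    # Share of values within k standard deviations of the mean.
    within[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within {k} standard deviation(s): {within[k]:.4f}")
```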

empirical rule

helps measure how values distribute above and below the mean and helps identify outliers

data analysis is objective

should report the summary measures that best describe and communicate the important aspects of the data set

data interpretation is subjective

should be done in fair, neutral and clear manner

ethical considerations

numerical descriptive measures:
-should document both good and bad results
-should be presented in a fair, objective and neutral manner
-should not use inappropriate summary measures to distort facts

20

The data below represent the amount of grams of carbohydrates in a serving of breakfast cereal in a sample of 11 different servings
11 15 23 29 19 22 21 20 15 25 17
The median carbohydrate amount in the cereal is __ grams.

895.5
the number of values is even, so the median is the average of the two middle numbers
(869 + 922)/2 = 1791/2 = 895.5

what is the median of the set of data
308 423 593 708 869 922 1223 1425 1589 1720

true

true or false?
a boxplot is a graphical representation of a five-number summary

false... put the array in order: 1.9, 3.4, 4.7, 6.5, 7.6; the median is the middle value, 4.7

true or false?
the median of the values 3.4, 4.7, 1.9, 7.6, and 6.5 is 4.05

true

true or false?
as a general rule, an observation is considered an extreme value if its Z-score is greater than 3

-3.0...+3.0

a data value is considered an extreme outlier if its z-score is less than ____ or greater than ____

25%
only 25% of the observations are greater than the third quartile

you were told that the 1st, 2nd, and 3rd quartiles of students' weight at a major university are 95 lbs, 125 lbs, and 138 lbs; what percentage of the student weigh more than 138 lbs?

false
Q1 and Q3 are measures of non-central location
Q2= the median, a measure of central tendency

true or false?
the interquartile range is a measure of central tendency in a set of data

true
if the number of values is even, the median is the average of the two middle numbers

true or false?
the median of a data set with 20 items would be the average of the 10th and the 11th items in the ordered array

false

true or false?
the larger the Z-score, the farther is the distance from the observation to the median

true

true or false?
the z-scores can be used to identify outliers

true

true or false?
in exploratory data analysis, a box plot can be used to illustrate the median, quartiles, and extreme values

true

true or false?
the line drawn within the box of the boxplot always represents the median

mean absolute deviation

average of the absolute differences between the values of the data set and the mean
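
A small Python sketch of the mean absolute deviation, using hypothetical values.

```python
data = [2, 4, 6, 8, 10]            # hypothetical data set
mean = sum(data) / len(data)

# Average of the absolute differences between each value and the mean.
mad = sum(abs(x - mean) for x in data) / len(data)
print(mean, mad)
```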

arithmetic mean

the additive average of the values in the data set

geometric mean

the multiplicative average of the values in a data set

geometric

the _________ mean is the appropriate measure to use when evaluating growth rates
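
A Python sketch of why the geometric mean suits growth rates. The three yearly returns are hypothetical, and `statistics.geometric_mean` requires Python 3.8 or later.

```python
from statistics import geometric_mean

returns = [0.10, -0.05, 0.20]                  # hypothetical yearly growth rates
growth_factors = [1 + r for r in returns]      # 1.10, 0.95, 1.20

avg_factor = geometric_mean(growth_factors)    # multiplicative average
avg_growth = avg_factor - 1                    # average growth rate per year
arithmetic = sum(returns) / len(returns)       # additive average, for comparison

# Compounding the geometric average for 3 years matches the actual total growth.
total_growth = 1.10 * 0.95 * 1.20
print(avg_growth, arithmetic, avg_factor ** 3, total_growth)
```

The arithmetic mean of the returns overstates the compounded growth, which is why it is not used here.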

A
the greater the coefficient, the greater the dispersion

if fund A has a coefficient of variation of 1.1, and fund B has a coefficient of variation of 0.9, fund _ has the greater relative dispersion because...________________

weighted mean

where a mean is calculated and some observations are given greater importance or value, the mean is known as a ________ mean
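
A minimal Python sketch of a weighted mean. The scores and weights are hypothetical.

```python
scores = [80, 90, 70]          # hypothetical observations
weights = [0.2, 0.3, 0.5]      # hypothetical importance of each observation

# Multiply each value by its weight, then divide by the total weight.
weighted_mean = sum(w * x for w, x in zip(weights, scores)) / sum(weights)
simple_mean = sum(scores) / len(scores)
print(weighted_mean, simple_mean)
```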

outliers

________ may distort the mean as a measure of central location

median

when outliers are present the ______ is the best measure of central location

mode

when summarizing a qualitative data set the ____ is the best measure of central location

dividing the standard deviation by the mean
σ/μ

the coefficient of variation can be found by:
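
A Python sketch of comparing relative dispersion with the coefficient of variation. Both data sets are hypothetical, and the sample standard deviation is used in place of σ.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Standard deviation divided by the mean (sample version)."""
    return stdev(values) / mean(values)

fund_a = [10, 20, 30]          # hypothetical values
fund_b = [100, 110, 120]       # same standard deviation, larger mean

cv_a = coefficient_of_variation(fund_a)
cv_b = coefficient_of_variation(fund_b)
print(cv_a, cv_b)              # the larger coefficient has greater relative dispersion
```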