Logistic Regression

odds

the ratio of events to non-events: odds = #yes/#no = p/(1-p)

probability

the ratio of events to the total number of outcomes: p = #yes/(#yes + #no) = odds/(1 + odds)
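
A quick numeric check of the two formulas above (the value 0.8 is just an example):

```python
# Converting between probability and odds (values are illustrative).
p = 0.8
odds = p / (1 - p)          # 0.8 / 0.2 = 4.0, i.e. "4 to 1"
p_back = odds / (1 + odds)  # recovers 0.8
print(odds, p_back)
```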

odds and probability are _______

related: each can be computed from the other (odds = p/(1-p); p = odds/(1+odds))

odds ratio

indicates how likely (in terms of odds) an event is for one group relative to another: OR = odds of A / odds of B

OR>1/RR>1

Event more likely for A than for B

OR<1/RR<1

Event more likely for B than for A

OR=1/RR=1

Event equally likely in each group

Relative Risk

indicates how likely (in terms of probability) an event is for one group relative to another: RR = probability of A / probability of B
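
Both measures computed from one (made-up) 2x2 table:

```python
# Odds ratio and relative risk from a 2x2 table (counts are illustrative).
#             event   non-event
# group A       30        70
# group B       10        90
p_a, p_b = 30 / 100, 10 / 100
odds_a, odds_b = p_a / (1 - p_a), p_b / (1 - p_b)
rr = p_a / p_b          # 3.0: the event is 3x as probable in A
or_ = odds_a / odds_b   # ~3.86: the odds are ~3.9x higher in A
print(rr, or_)
```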

Assumptions for OLS Regression (5)

1. The random error term has a normal distribution with a mean of 0
2. The random error term has constant variance
3. The error terms are independent
4. Linearity of the mean
5. No perfect collinearity

What is regression actually doing?

modeling the expected (mean/average) response conditional on the predictors

The expected value for a binary (0/1) response y_i

the probability of the event: E(y_i) = P(y_i = 1) = p_i

Problems with modeling the linear probability model (2)

1. Probabilities are bounded, but linear functions can take on any value (demonstrated in the sketch below)
2. The relationship between probabilities and X is usually nonlinear (e.g., a 1-unit change in X has a different effect when the probability is near 1 than when it is near 0.5)
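
A small demonstration of problem 1 on synthetic data:

```python
# Minimal sketch: a linear probability model can predict outside [0, 1],
# while logistic regression cannot. Data are synthetic.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = np.linspace(-4, 4, 200)
p_true = 1 / (1 + np.exp(-2 * x))
y = rng.binomial(1, p_true)

X = sm.add_constant(x)
lpm = sm.OLS(y, X).fit()             # linear probability model
logit = sm.Logit(y, X).fit(disp=0)   # logistic regression

print(lpm.predict(X).min(), lpm.predict(X).max())      # falls outside [0, 1]
print(logit.predict(X).min(), logit.predict(X).max())  # always within (0, 1)
```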

logistic regression model

the solution to the problems with the linear model:
- predicted probabilities are always between 0 and 1
- parameter estimates do not enter the model equation linearly
- the rate of change of the probability varies as the X's vary

logit link transformation

- creates a linear model by applying a link function to the probabilities: logit(p) = ln(p/(1-p)) = ln(odds)
- the relationship between the parameters and the logits is linear
- logits are unbounded
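
A minimal sketch of the link and its inverse:

```python
# The logit link maps (0, 1) onto the whole real line, so the model
# log(odds) = B_0 + B_1*x is linear in the parameters.
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def inv_logit(z):  # a.k.a. the logistic / sigmoid function
    return 1 / (1 + np.exp(-z))

print(logit(0.5))              # 0.0
print(logit(0.99))             # ~4.6, grows without bound as p -> 1
print(inv_logit(logit(0.73)))  # round-trips back to 0.73
```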

Box-Tidwell Transformation

- commonly used as a test for linearity of the X's relative to the logit in logistic regression models
- it is a power transformation on the X's
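
A minimal sketch of the usual Box-Tidwell-style check, adding an x*ln(x) term (requires x > 0; data are synthetic; this is not a full implementation):

```python
# Add x * ln(x) to the model; a significant coefficient on that term
# suggests the logit is NOT linear in x.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0.5, 5.0, 500)
p = 1 / (1 + np.exp(-(-1 + 0.8 * x)))  # logit truly linear in x here
y = rng.binomial(1, p)

X = sm.add_constant(np.column_stack([x, x * np.log(x)]))
fit = sm.Logit(y, X).fit(disp=0)
print(fit.pvalues[2])  # p-value on x*ln(x); large -> linearity plausible
```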

Generalized Additive Model (GAM)

log(odds) = B_0 + f_1(x_1) + ... + f_k(x_k)
- use spline functions to estimate each f_j(x_j)
- if the fitted splines are close to straight lines, the linearity assumption is met
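
A hedged sketch of this check using statsmodels' GLMGam with a B-spline basis (pygam's LogisticGAM is an alternative). The data are synthetic, the df/degree choices are arbitrary, and the call signature may differ slightly across statsmodels versions:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.gam.api import GLMGam, BSplines

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, 800)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.2 * x))))  # logit linear in x

bs = BSplines(x[:, None], df=[8], degree=[3])  # cubic B-spline basis
gam = GLMGam(y, smoother=bs, family=sm.families.Binomial())
res = gam.fit()
res.plot_partial(0, cv=False)  # roughly straight line => linearity is met
```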

What assumptions of OLS Regression does Logistic regression violate?

- the random error term has a normal distribution
- the random error term has constant variance
Thus, OLS is not the best method for parameter estimation.

How are estimates obtained in logistic regression?

Maximum Likelihood Estimation (MLE)

The likelihood function measures...

how plausible a specific set of B values is to have produced your data (we want to maximize this)
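
A sketch of what MLE does under the hood (synthetic data; a generic optimizer stands in for the dedicated routines real packages use):

```python
# Write down the (negative) log-likelihood of logistic regression and
# let a numerical optimizer maximize the likelihood.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([-0.5, 1.5])))))

def neg_log_lik(beta):
    p = 1 / (1 + np.exp(-(X @ beta)))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

beta_hat = minimize(neg_log_lik, x0=np.zeros(2)).x
print(beta_hat)  # close to the true values (-0.5, 1.5)
```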

Likelihood estimation provides a basis for...

hypothesis testing

Likelihood Ratio Test (LRT)

compares nested full and reduced models: LRT = 2(logL_full - logL_reduced), which follows a chi-square distribution with degrees of freedom equal to the number of parameters dropped
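
A minimal sketch of the LRT, reusing the synthetic X and y from the MLE sketch above:

```python
# Fit nested models and compare their log-likelihoods.
import statsmodels.api as sm
from scipy.stats import chi2

full = sm.Logit(y, X).fit(disp=0)             # intercept + slope
reduced = sm.Logit(y, X[:, [0]]).fit(disp=0)  # intercept only

lrt = 2 * (full.llf - reduced.llf)
df = full.df_model - reduced.df_model  # number of parameters dropped
print(lrt, chi2.sf(lrt, df))           # small p-value: keep the slope
```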

Oversampling

- duplicate current event cases in the training set to better balance with the non-event cases
- keep the test set at the original population proportion

Undersampling

- randomly sample the current non-event cases to keep in the training set to balance with the event cases
- keep the test set at the original population proportion
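
Both schemes sketched with pandas; the column name "event" and the exact target balance are assumptions for illustration:

```python
import pandas as pd

def oversample(train: pd.DataFrame) -> pd.DataFrame:
    events = train[train["event"] == 1]
    non_events = train[train["event"] == 0]
    # duplicate event rows (sampling with replacement) up to the non-event count
    boosted = events.sample(n=len(non_events), replace=True, random_state=0)
    return pd.concat([boosted, non_events])

def undersample(train: pd.DataFrame) -> pd.DataFrame:
    events = train[train["event"] == 1]
    non_events = train[train["event"] == 0]
    # randomly keep only as many non-events as there are events
    kept = non_events.sample(n=len(events), random_state=0)
    return pd.concat([events, kept])
```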

when the sample proportion is out of line with the population proportion...

adjustments need to be made to correct the bias

methods for adjusting for oversampling (2)

1. adjusting the intercept
2. weighting observations

adjusting the intercept vs weighting observations

adjusting the intercept is done after the model is built, while weighting observations adjusts during model fitting (both are sketched below)
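
Hedged sketches of both corrections, where pi1 is the population event rate and rho1 is the (oversampled) training event rate; the numbers are illustrative:

```python
import numpy as np

pi1, rho1 = 0.05, 0.50  # e.g., 5% events in population, 50% after oversampling

# 1. Adjust the intercept after fitting (prior-correction / offset form):
b0_hat = -0.2  # fitted intercept from the oversampled training data
b0_adj = b0_hat - np.log((rho1 / (1 - rho1)) * ((1 - pi1) / pi1))

# 2. Weight observations while fitting:
w_event = pi1 / rho1                 # down-weight the duplicated events
w_nonevent = (1 - pi1) / (1 - rho1)  # up-weight the non-events
# e.g., sm.GLM(y, X, family=sm.families.Binomial(), freq_weights=w).fit()
```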

When would we use the adjusting the intercept method?

If model is correct and the sample is small (n<=1000)

When would we use the weighted observations method?

If model is misspecified

When would we use either method?

If large sample (n > 1000) and model is correct

When might missing values in the data set signal a problem?

when the missingness depends on the value of the target variable

Assumptions of logistic regression

1. Independence across observations
2. Linearity of the logit for continuous variables only

What are potential issues with logistic regression?

- convergence not satisfied (complete or quasi-complete separation of the data)
- multicollinearity
- sparse tables

What are possible solutions for complete and quasi-complete separation?

- remove the variable from the model (for complete separation; avoid this for quasi-complete separation)
- recode the variable
- penalized likelihood methods (Firth), exact methods

What is complete separation? Quasi-complete?

Complete: one or more variables perfectly predict the response variable
Quasi-complete: one or more variables almost perfectly predict the response variable (one level of the variable perfectly predicts the response)
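
A tiny illustration of complete separation (how statsmodels reacts, by raising or only warning, varies by version):

```python
# x >= 3 perfectly predicts y = 1, so the MLE does not exist.
import numpy as np
import statsmodels.api as sm

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([0, 0, 1, 1, 1, 1])  # x >= 3  <=>  y == 1

try:
    res = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
    print(res.params)   # if it runs at all, the slope blows up toward infinity
except Exception as e:  # older statsmodels raises PerfectSeparationError
    print(type(e).__name__)
```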

Purpose of modeling

estimation and prediction

estimation

quantifying the expected change in the response associated with the predictors (relationships)

prediction

using the model to predict new responses

What is logistic regression?

a model for the probability of an event, NOT the occurrence of an event; it can be used as a classification model as well

Discrimination

the ability to separate the events from the non-events: how good the model is at distinguishing the 1's from the 0's

Calibration

how well predicted probabilities agree with the actual frequency of the outcomes. Are predicted probabilities systematically too low/high?
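
Both evaluations can be sketched with scikit-learn; y_true and y_prob below are synthetic stand-ins for your labels and predicted probabilities:

```python
# ROC AUC for discrimination, a reliability table for calibration.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(4)
y_prob = rng.uniform(0, 1, 2000)
y_true = rng.binomial(1, y_prob)  # perfectly calibrated by construction

print(roc_auc_score(y_true, y_prob))  # discrimination: 1.0 perfect, 0.5 coin flip
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
print(np.c_[mean_pred, frac_pos])     # calibration: the two columns should match
```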