Statistical Thinking for Machine Learning: Lecture 4

Reda Mastouri
UChicago MastersTrack: Coursera
Thank you to Gregory Bernstein for parts of these slides

1

What have we covered?

  • Distributions, CDF, PDF, Method of Moments

  • ANOVA, Simple Regression

  • Hypothesis testing, Multiple Regression

2

Agenda

3

Agenda

  • Logistic Regression
    • Why use Logistic Regression?
    • Forming the Logistic Regression
    • The Link Function
    • Interpreting coefficients
    • Determining the effect size
4

Why use Logistic Regression?

5

Why use Logistic Regression?

id   hours   pass
1    0       0
2    0       0
3    1       0
4    1       0
5    1       0
6    2       1
7    2       0
8    3       0
9    3       0
10   4       1
11   4       0
12   4       1
13   5       1
14   5       0
15   6       0

Why don't we use Linear Regression (i.e., the Linear Probability Model [LPM])?
Pass = β0 + β1 × hours studied

  • Model output is unbounded: (−∞, ∞)
  • Constant change in predicted probability for a 1-unit increase in X
  • Residual variance is not constant for different values of X
  • Residuals can be large (outliers)
6

Large outliers, Non-constant variance

##
## Call:
## lm(formula = pass ~ hours, data = study_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.79696 -0.31379 -0.02389 0.29967 0.78284
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.02389 0.15988 0.149 0.88254
## hours 0.09663 0.02866 3.372 0.00263 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4261 on 23 degrees of freedom
## Multiple R-squared: 0.3308, Adjusted R-squared: 0.3017
## F-statistic: 11.37 on 1 and 23 DF, p-value: 0.002633

LPM: If we study 500 hours: 4834.0933768% probability of passing.
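The absurd prediction above can be reproduced by plugging 500 hours into the fitted LPM. A minimal sketch, using the rounded coefficients from the lm() summary (the slide's exact figure uses the unrounded estimates):

```python
# LPM fitted on the study data: pass ≈ 0.02389 + 0.09663 * hours
# (coefficients copied from the lm() summary above, rounded to 5 decimals)
intercept = 0.02389
slope = 0.09663

hours = 500
predicted_prob = intercept + slope * hours  # straight-line extrapolation

print(f"Predicted 'probability' of passing: {predicted_prob:.2%}")
# The prediction far exceeds 100%, illustrating that the LPM is unbounded.
```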

7

Logit more interpretable

##
## Call:
## glm(formula = pass ~ hours, family = "binomial", data = study_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8852 -0.7913 -0.3866 0.7670 1.8532
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.5563 1.1255 -2.271 0.0231 *
## hours 0.5185 0.2099 2.470 0.0135 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 34.617 on 24 degrees of freedom
## Residual deviance: 25.161 on 23 degrees of freedom
## AIC: 29.161
##
## Number of Fisher Scoring iterations: 4

Logit: If we study 500 hours: 100% probability of passing.
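The logit prediction stays bounded: the same 500 hours pushed through the fitted logistic model gives a probability just below 1. A sketch using the coefficients from the glm() summary above:

```python
import math

# Logistic model from the glm() summary above:
# log odds = -2.5563 + 0.5185 * hours
intercept = -2.5563
slope = 0.5185

hours = 500
log_odds = intercept + slope * hours
prob = 1 / (1 + math.exp(-log_odds))  # inverse logit (sigmoid)

print(f"Predicted probability of passing: {prob:.10f}")
# The sigmoid squashes the huge linear predictor to essentially 1,
# rather than letting the prediction run off to 4834%.
```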

8

Why use Logistic Regression?

9

Reasons to use Logistic Regression

  • Model is bounded between [0,1]
  • Each incremental unit increase does not necessarily increase probability by the same weight

Logistic formula:

  • Logistic regression is a linear classifier
  • We need a smooth, continuous function bounded between 0 and 1
  • We will use the standard logistic sigmoid function: y = 1 / (1 + e^(−x))
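The sigmoid above can be sketched in a few lines (the function name is my own):

```python
import math

def sigmoid(x: float) -> float:
    """Standard logistic sigmoid: maps any real x into (0, 1)."""
    return 1 / (1 + math.exp(-x))

print(sigmoid(0))    # 0.5 -- the midpoint
print(sigmoid(10))   # close to 1: large inputs saturate toward 1
print(sigmoid(-10))  # close to 0: large negative inputs saturate toward 0
```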

10

Forming the Logistic Regression

11

Forming the Logistic Regression

  • In a linear model, both X and Y have a range of (−∞, ∞)
  • If we have a categorical dependent variable, Y now has a range of [0, 1] while X still has a range of (−∞, ∞)
  • We must convert Y so that it has the same range as X to create a linear predictor

Convert probability p(Y) to odds:

p(Y) / (1 − p(Y)), with range [0, ∞)

Convert odds to log odds:

log odds = log( p(Y) / (1 − p(Y)) ), with range (−∞, ∞)
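The probability → odds → log odds chain can be sketched directly (function name is my own):

```python
import math

def log_odds(p: float) -> float:
    """Convert a probability in (0, 1) to log odds in (-inf, inf)."""
    odds = p / (1 - p)     # odds live in [0, inf)
    return math.log(odds)  # log odds live in (-inf, inf)

print(log_odds(0.5))  # 0.0 -- even odds
print(log_odds(0.9))  # positive log odds
print(log_odds(0.1))  # negative log odds, symmetric about p = 0.5
```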

12

Forming the Logistic Regression

Linear model after conversion: log( p(Y) / (1 − p(Y)) ) = βXi

Calculating probability:

p(Y) / (1 − p(Y)) = e^(βXi)
p(Y) = (1 − p(Y)) · e^(βXi)
p(Y) = e^(βXi) − p(Y) · e^(βXi)
p(Y) + p(Y) · e^(βXi) = e^(βXi)
p(Y) · (1 + e^(βXi)) = e^(βXi)
p(Y) = e^(βXi) / (1 + e^(βXi))
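The algebra can be checked numerically: pushing any linear predictor value through the final expression and then recomputing the log odds should recover the original value. A sketch with illustrative inputs:

```python
import math

def prob_from_linear_predictor(bx: float) -> float:
    """p(Y) = e^(bx) / (1 + e^(bx)), the last line of the derivation."""
    return math.exp(bx) / (1 + math.exp(bx))

for bx in (-2.0, 0.0, 1.5):
    p = prob_from_linear_predictor(bx)
    recovered = math.log(p / (1 - p))  # inverting back to the log odds
    print(f"bx = {bx}: p = {p:.4f}, recovered log odds = {recovered:.4f}")
```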

13

Link Function

  • Here is our logistic regression model: p(Y = 1 | Xi) = 1 / (1 + e^(−βXi))
  • Let's compare to linear regression: Y = βXi

  • For logistic regression, our desired output is the probability of success

  • There is always a link function between the predictors and the output. For linear regression, it's just the identity function. For logistic regression, we use the logit link function
  • Linear regression is linear between X and Y. Logistic regression is linear between the log odds and X.
  • We apply the inverse of the link function to transform log odds back into probabilities, which are easier to interpret.
14

Estimating Coefficients

  • We will not use the sum of squares to evaluate the accuracy of this model, since that objective has multiple local minima when combined with the sigmoid
  • Instead, we'll use the logistic loss function: −[ y·log(p) + (1 − y)·log(1 − p) ]
  • Betas will be estimated using maximum likelihood estimation
  • Maximum likelihood: given the sample we observed, which parameter values make observing that sample most probable?
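The logistic loss for a single observation can be sketched as follows (the probabilities are illustrative, not fitted values):

```python
import math

def log_loss(y: int, p: float) -> float:
    """Logistic (cross-entropy) loss for one observation:
    -[y*log(p) + (1-y)*log(1-p)]. Lower is better."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# A confident, correct prediction incurs a small loss...
print(log_loss(1, 0.95))
# ...while a confident, wrong prediction is penalized heavily.
print(log_loss(1, 0.05))
```

Minimizing this loss over the sample is equivalent to maximizing the likelihood, which is how the betas are estimated.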
15

Interpretation of Coefficients and Output

  • A 1-unit increase in Xi increases the log odds by βi
  • If the log odds increase, the odds increase, and the probability increases
  • If we just want to quickly classify observations, we can call any positive output from the linear predictor a success and any negative output a failure
  • Why? log( 0.5 / (1 − 0.5) ) = log(1) = 0, so a linear predictor of 0 corresponds to a probability of exactly 0.5
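The equivalence of the two classification rules (sign of the linear predictor vs. thresholding the probability at 0.5) can be sketched and checked on a few illustrative values:

```python
import math

def classify_by_sign(linear_predictor: float) -> int:
    """Positive model output -> success (1), negative -> failure (0)."""
    return 1 if linear_predictor > 0 else 0

def classify_by_prob(linear_predictor: float) -> int:
    """Equivalent rule: threshold the sigmoid probability at 0.5."""
    p = 1 / (1 + math.exp(-linear_predictor))
    return 1 if p > 0.5 else 0

for z in (-3.0, -0.1, 0.2, 4.0):
    assert classify_by_sign(z) == classify_by_prob(z)
print("Sign rule and 0.5-threshold rule agree")
```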
16

Partial Effects

17

An alternative link — Probit

  • Apply the inverse CDF of the standard normal distribution to the probability: Φ⁻¹(p(Y)) = β0 + β1X1

  • The inverse link function is the normal CDF: p(Y) = Φ(β0 + β1X1)
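A sketch of the probit inverse link, writing the standard normal CDF in terms of the error function so only the standard library is needed (the coefficients are hypothetical, not fitted to the slide's data):

```python
import math

def normal_cdf(x: float) -> float:
    """Standard normal CDF, expressed via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Probit model: p(Y = 1 | X) = Phi(b0 + b1 * x)
b0, b1 = -1.0, 0.5   # illustrative values only
x = 3.0
print(normal_cdf(b0 + b1 * x))  # a probability in (0, 1)
```

Like the logit, the normal CDF maps the unbounded linear predictor into (0, 1); the two links give very similar fits in practice.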

18
