Statistical Thinking for Machine Learning: Lecture 4

Reda Mastouri
UChicago MastersTrack: Coursera
Thank you to Gregory Bernstein for parts of these slides

1

What have we covered?

  • Distributions, CDF, PDF, Method of Moments

  • ANOVA, Simple Regression

  • Hypothesis testing, Multiple Regression

2

Agenda

3

Agenda

  • Logistic Regression
    • Why use Logistic Regression?
    • Forming the Logistic Regression
    • The Link Function
    • Interpreting coefficients
    • Determining the effect size
4

Why use Logistic Regression?

5

Why use Logistic Regression?

id   hours   pass
1    0       0
2    0       0
3    1       0
4    1       0
5    1       0
6    2       1
7    2       0
8    3       0
9    3       0
10   4       1
11   4       0
12   4       1
13   5       1
14   5       0
15   6       0

Why don't we use Linear Regression (i.e., the Linear Probability Model [LPM])?
Pass = β0 + β1 × hours studied

  • Model output is unbounded: (−∞, ∞)
  • Constant change in predicted probability for a 1-unit increase in X
  • Residual variance is not constant for different values of X
  • Residuals can be large (outliers)
6

Large outliers, Non-constant variance

##
## Call:
## lm(formula = pass ~ hours, data = study_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.79696 -0.31379 -0.02389 0.29967 0.78284
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.02389 0.15988 0.149 0.88254
## hours 0.09663 0.02866 3.372 0.00263 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4261 on 23 degrees of freedom
## Multiple R-squared: 0.3308, Adjusted R-squared: 0.3017
## F-statistic: 11.37 on 1 and 23 DF, p-value: 0.002633

LPM: If we study 500 hours: 4834.0933768% probability of passing.
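The absurd prediction above can be reproduced by plugging 500 hours into the fitted LPM. A minimal sketch, using the rounded coefficients from the lm() summary (the slide's exact figure uses the unrounded estimates):

```python
# LPM fitted on the study data: pass ≈ 0.02389 + 0.09663 * hours
# (coefficients copied from the lm() summary above, rounded to 5 decimals)
intercept = 0.02389
slope = 0.09663

hours = 500
predicted_prob = intercept + slope * hours  # straight-line extrapolation

print(f"Predicted 'probability' of passing: {predicted_prob:.2%}")
# The prediction far exceeds 100%, illustrating that the LPM is unbounded.
```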

7

Logit more interpretable

##
## Call:
## glm(formula = pass ~ hours, family = "binomial", data = study_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8852 -0.7913 -0.3866 0.7670 1.8532
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.5563 1.1255 -2.271 0.0231 *
## hours 0.5185 0.2099 2.470 0.0135 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 34.617 on 24 degrees of freedom
## Residual deviance: 25.161 on 23 degrees of freedom
## AIC: 29.161
##
## Number of Fisher Scoring iterations: 4

Logit: If we study 500 hours: 100% probability of passing.
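The logit prediction stays bounded: the same 500 hours pushed through the fitted logistic model gives a probability just below 1. A sketch using the coefficients from the glm() summary above:

```python
import math

# Logistic model from the glm() summary above:
# log odds = -2.5563 + 0.5185 * hours
intercept = -2.5563
slope = 0.5185

hours = 500
log_odds = intercept + slope * hours
prob = 1 / (1 + math.exp(-log_odds))  # inverse logit (sigmoid)

print(f"Predicted probability of passing: {prob:.10f}")
# The sigmoid squashes the huge linear predictor to essentially 1,
# rather than letting the prediction run off to 4834%.
```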

8

Why use Logistic Regression?

9

Reasons to use Logistic Regression

  • Model is bounded between [0,1]
  • Each incremental unit increase does not necessarily increase probability by the same weight

Logistic formula:

  • Logistic regression is a linear classifier
  • We need a smooth, continuous function bounded between 0 and 1
  • We will use the standard logistic sigmoid function: y = 1 / (1 + e^(−x))
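The sigmoid above can be sketched in a few lines (the function name is my own):

```python
import math

def sigmoid(x: float) -> float:
    """Standard logistic sigmoid: maps any real x into (0, 1)."""
    return 1 / (1 + math.exp(-x))

print(sigmoid(0))    # 0.5 -- the midpoint
print(sigmoid(10))   # close to 1: large inputs saturate toward 1
print(sigmoid(-10))  # close to 0: large negative inputs saturate toward 0
```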

10

Forming the Logistic Regression

11

Forming the Logistic Regression

  • In a linear model, both X and Y have a range of (−∞, ∞)
  • If we have a categorical dependent variable, Y now has a range of [0, 1] while X still has a range of (−∞, ∞)
  • We must convert Y so that it has the same range as X to create a linear predictor

Convert probability p(Y) to odds:

p(Y) / (1 − p(Y)), with range [0, ∞)

Convert odds to log odds:

log odds = log( p(Y) / (1 − p(Y)) ), with range (−∞, ∞)
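The probability → odds → log odds chain can be sketched directly (function name is my own):

```python
import math

def log_odds(p: float) -> float:
    """Convert a probability in (0, 1) to log odds in (-inf, inf)."""
    odds = p / (1 - p)     # odds live in [0, inf)
    return math.log(odds)  # log odds live in (-inf, inf)

print(log_odds(0.5))  # 0.0 -- even odds
print(log_odds(0.9))  # positive log odds
print(log_odds(0.1))  # negative log odds, symmetric about p = 0.5
```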

12

Forming the Logistic Regression

Linear model after conversion: log( p(Y) / (1 − p(Y)) ) = βXi

Calculating probability:

p(Y) / (1 − p(Y)) = e^(βXi)
p(Y) = (1 − p(Y)) · e^(βXi)
p(Y) = e^(βXi) − p(Y) · e^(βXi)
p(Y) + p(Y) · e^(βXi) = e^(βXi)
p(Y) · (1 + e^(βXi)) = e^(βXi)
p(Y) = e^(βXi) / (1 + e^(βXi))
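The algebra can be checked numerically: pushing any linear predictor value through the final expression and then recomputing the log odds should recover the original value. A sketch with illustrative inputs:

```python
import math

def prob_from_linear_predictor(bx: float) -> float:
    """p(Y) = e^(bx) / (1 + e^(bx)), the last line of the derivation."""
    return math.exp(bx) / (1 + math.exp(bx))

for bx in (-2.0, 0.0, 1.5):
    p = prob_from_linear_predictor(bx)
    recovered = math.log(p / (1 - p))  # inverting back to the log odds
    print(f"bx = {bx}: p = {p:.4f}, recovered log odds = {recovered:.4f}")
```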

13

Link Function

  • Here is our logistic regression model: p(Y = 1 | Xi) = 1 / (1 + e^(−βXi))
  • Let's compare to linear regression: Y = βXi

  • For logistic regression, our desired output is the probability of success

  • There is always a link function between the predictors and the output. For linear regression, it's just the identity function. For logistic regression, we use the logit link function
  • Linear regression is linear between X and Y. Logistic regression is linear between the log odds and X.
  • We apply the inverse of the link function to transform log odds back into probabilities, which are easier to interpret.
14

Estimating Coefficients

  • We will not use the sum of squares to evaluate the accuracy of this model, since that objective has multiple local minima when combined with the sigmoid
  • Instead, we'll use the logistic loss function: −[ y·log(p) + (1 − y)·log(1 − p) ]
  • Betas will be estimated using maximum likelihood estimation
  • Maximum likelihood: given the sample we observed, which parameter values make observing that sample most probable?
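The logistic loss for a single observation can be sketched as follows (the probabilities are illustrative, not fitted values):

```python
import math

def log_loss(y: int, p: float) -> float:
    """Logistic (cross-entropy) loss for one observation:
    -[y*log(p) + (1-y)*log(1-p)]. Lower is better."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# A confident, correct prediction incurs a small loss...
print(log_loss(1, 0.95))
# ...while a confident, wrong prediction is penalized heavily.
print(log_loss(1, 0.05))
```

Minimizing this loss over the sample is equivalent to maximizing the likelihood, which is how the betas are estimated.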
15

Interpretation of Coefficients and Output

  • A 1-unit increase in Xi increases the log odds by βi
  • If the log odds increase, the odds increase, and the probability increases
  • If we just want to quickly classify observations, we can call any positive output from the linear predictor a success and any negative output a failure
  • Why? log( 0.5 / (1 − 0.5) ) = log(1) = 0, so a linear predictor of 0 corresponds to a probability of exactly 0.5
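The equivalence of the two classification rules (sign of the linear predictor vs. thresholding the probability at 0.5) can be sketched and checked on a few illustrative values:

```python
import math

def classify_by_sign(linear_predictor: float) -> int:
    """Positive model output -> success (1), negative -> failure (0)."""
    return 1 if linear_predictor > 0 else 0

def classify_by_prob(linear_predictor: float) -> int:
    """Equivalent rule: threshold the sigmoid probability at 0.5."""
    p = 1 / (1 + math.exp(-linear_predictor))
    return 1 if p > 0.5 else 0

for z in (-3.0, -0.1, 0.2, 4.0):
    assert classify_by_sign(z) == classify_by_prob(z)
print("Sign rule and 0.5-threshold rule agree")
```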
16

Partial Effects

17

An alternative link — Probit

  • Apply the inverse CDF of the standard normal distribution to the probability: Φ⁻¹(p(Y)) = β0 + β1X1

  • The inverse link function is the normal CDF: p(Y) = Φ(β0 + β1X1)
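A sketch of the probit inverse link, writing the standard normal CDF in terms of the error function so only the standard library is needed (the coefficients are hypothetical, not fitted to the slide's data):

```python
import math

def normal_cdf(x: float) -> float:
    """Standard normal CDF, expressed via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Probit model: p(Y = 1 | X) = Phi(b0 + b1 * x)
b0, b1 = -1.0, 0.5   # illustrative values only
x = 3.0
print(normal_cdf(b0 + b1 * x))  # a probability in (0, 1)
```

Like the logit, the normal CDF maps the unbounded linear predictor into (0, 1); the two links give very similar fits in practice.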

18
