regression grind

October 9, 2024


a dump of my plan + notes for studying for my finals for a class that i should be doing well but is not because i'm just not good at math and stats apparently. might be the most i've studied for a class ever in my life

go through

  • notes
  • hw
  • quizzes
  • pq1, q1
  • pq2, q2
  • pfinals

prediction

  • distribution of SSE (sigma_hat)
  • e(sse)
  • show y_hat independent to residual
  • distribution of beta_hat
  • log reg why use logit? issues with linear model
  • explain what hii is?
  • why 0 < hii < 1
  • what is stud(ei)?
  • press

Outliers

  • outlier in x: leverage (hii) > 3p/n
  • outlier in y: discrepancy (studentized e) > t (n-1-p), 1-a/2 (outlier in
  • both: influence (cooks distance) >4/n

multicollinearity

problems: inflated SE checks:

  1. swing/change sign coefficients in f-test
  2. correlation matrix
  3. VIF

solution

  1. drop
  2. feature engineer
  3. regularized regression
  4. dimensionality reduction
  5. partial least square

heteroskedasticity

  • unbiased

detection

  • residual plot

problem: no longer BLEU -> wrong SE(beta) and CI/PI widths solution

  1. log / square
  2. boxcox
  3. robust SE
  4. WLS

if ei is non linear, use nonparametric regression (knn, moving average)

non-normal

  • still BLEU
  • no inference,

detection

  • histogram
  • qq plot
  • test for normality: shapiro

for normal: skewness: 0 (third moment) kurtosis: 3 (fourth moment)

  • omnibus k2 test (want high p-value to reject)
  • JB test

problems: unreliable t.test, wrong CI/PI

false assumption of linearity

  • transform y -> may introduce hetero if homo
  • transform x -> nice when only prob is non-linearity
  • transform both

Model selection under: biased coefs + predictions (under/overstimate), overestimate sigma2

extra vars: unbiased, MSE has fewer degrees, wider CI and lower power

over(multicol): inflated SE for coefs, rank deficient

adjusted R 1 - MSE / SST = 1 - SSE / n-p / SST / n-1 takes into account the "cost" of losing DF

Mallows Cp

  • identify subset where Cp is near k+1 where k is no of preds
  • this means bias is small
  • if all not near, missing predictor
  • if a number of them, choose model with smallest

AIC, BIC

  • estimates infomation lost in a model
  • trade-off goodness in fit vs simplicity, penalized by no. of model params (p)
  • larger penalty term in BIC than AIC : ln(n)p vs 2p

PRESS

  • modified SEE, uses predicted value for ith obs from model fit on data excluding that point