Things to consider while doing a linear regression
Main assumptions
- Validity
- Additivity
- Linearity
- Independence and equal variance of errors (we're assuming OLS)
- Normality of errors (normality might not hold, but the error distribution has to be understood)
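A minimal sketch of checking two of these assumptions (zero-mean residuals and equal variance) on synthetic data; the data-generating process here is illustrative, not from any real dataset:

```python
import numpy as np

# Synthetic data whose true errors ARE i.i.d. normal, so the checks should pass
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2.0 + 1.5 * x + rng.normal(0, 1, 200)

# Fit a simple OLS line and compute residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Residual mean should be near zero
print(residuals.mean())

# Equal variance: residual spread should be similar on the low vs high half of x
low, high = residuals[x < 5], residuals[x >= 5]
print(low.std() / high.std())  # a ratio far from 1 suggests heteroscedasticity
```

On real data the same split-and-compare idea (or a residuals-vs-fitted plot) is a quick first look at heteroscedasticity.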
Checklist before modeling
- Distribution of the target variable
- Distributions of the exogenous variables and correlations between them
- Missing values and outliers
- Interpretation of values (also applies to outlier identification) and data leakage
- Would an intercept make sense from an interpretability point of view?
- Should we expect any variable interactions?
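The first few checks above can be sketched with pandas on a toy frame (column names and values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "target": [10.0, 12.5, 9.0, np.nan, 30.0],
    "price": [1.0, 1.2, 0.9, 1.1, 5.0],
    "region": ["a", "b", "a", "b", "a"],
})

print(df["target"].describe())        # distribution of the target variable
print(df.isna().sum())                # missing values per column
print(df.corr(numeric_only=True))     # correlations between numeric columns

# Crude outlier flag: values more than 3 standard deviations from the mean
z = (df["price"] - df["price"].mean()) / df["price"].std()
print(df.loc[z.abs() > 3, "price"])
```

Interpretation of values and data leakage can't be automated the same way; they require knowing what each column means and when it becomes available.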
Checklist for transformations
- Where do we need normalization?
- Where do we need one-hot encoding?
- Do we really need one-hot encoding, or can we split some variable values into ordinal bins?
- Would any non-linear transformations make sense here?
- Check correlation and distribution of interactions
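The transformation checklist can be sketched as follows (toy data; the skew in `income` is deliberate):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [20_000, 45_000, 1_200_000, 60_000],
    "city": ["rome", "milan", "rome", "turin"],
    "age": [18, 34, 52, 71],
})

# Non-linear transform: log compresses the heavy right tail of income
df["log_income"] = np.log(df["income"])

# One-hot encoding for a nominal variable (drop_first avoids perfect collinearity
# with the intercept)
df = pd.get_dummies(df, columns=["city"], drop_first=True)

# Ordinal bins instead of one-hot when the values have a natural order
df["age_bin"] = pd.cut(df["age"], bins=[0, 30, 50, 100], labels=[0, 1, 2])
print(df.columns.tolist())
```

Interaction terms can then be built as products of (transformed) columns and checked for correlation with the target like any other variable.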
Checklist for fit and validation
- Does the sign of the coefficient make sense? Is it statistically significant?
- Does R² make sense? Would you expect it to be big or small, and why?
- Is R² on the validation set lower than on the training set?
- Distribution and Q-Q plot of residuals: do they make sense? If not, is there a transformation that would make things better?
- Shapiro–Wilk test for normality of residuals, Cook's distance for influential points?
- Is VIF less than 10 for each variable?
- Partial regression plots: do they make sense? Should some variables be kicked out?
Things for production
- Do we get enough data, is it clean enough?
- How do we detect that the data or the process has changed enough to make the model invalid?
- Do we get the data quickly enough for the answer to make sense?
- Is the model interpretable enough and accepted by the business?
- How would the model errors affect the business process? What feedback loops should we expect?
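Detecting that the data has changed can start as simply as comparing the live distribution of a feature against the training distribution; a sketch using a two-sample Kolmogorov–Smirnov test (synthetic data; the 0.05 threshold is an illustrative choice):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
train_feature = rng.normal(0, 1, 1000)   # distribution seen at training time
live_same = rng.normal(0, 1, 1000)       # live data from the same distribution
live_shifted = rng.normal(1, 1, 1000)    # live data whose mean has drifted

def has_drifted(train, live, alpha=0.05):
    """Two-sample KS test: a small p-value means the distributions differ."""
    return stats.ks_2samp(train, live).pvalue < alpha

print(has_drifted(train_feature, live_same))     # no drift injected here
print(has_drifted(train_feature, live_shifted))  # drift injected here
```

In production this would run per feature on a schedule, with alerts feeding the decision to retrain or retire the model.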
Written on March 5, 2020