Things to consider while doing a linear regression

Main assumptions

Validity
Additivity
Linearity
Independence and equal variance of errors (we’re talking about OLS)
Normality of errors (might not be present, but the error distribution has to be understood)

Distribution of the target variable
Distributions and correlations between exogenous variables
Missing values and outliers
Interpretation of values (also applies to outlier identification) and data leakage
Would an intercept make sense from an interpretability point of view?
Should we expect any variable interactions?
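
A few of the data checks above (missing values, outliers, correlated exogenous variables) can be sketched with NumPy. This is only an illustration: the helper names `summarize_column` and `high_correlations`, the 1.5 * IQR outlier rule, and the 0.8 correlation threshold are assumptions made for the example, not rules from this checklist.

```python
import numpy as np

def summarize_column(values):
    """Count missing values and flag IQR-based outliers in one numeric column."""
    x = np.asarray(values, dtype=float)
    missing = int(np.isnan(x).sum())
    clean = x[~np.isnan(x)]
    q1, q3 = np.percentile(clean, [25, 75])
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # assumed rule of thumb
    # Comparisons with NaN are False, so missing values are never flagged.
    with np.errstate(invalid="ignore"):
        outliers = np.flatnonzero((x < low) | (x > high))
    return {"missing": missing, "outlier_idx": outliers.tolist()}

def high_correlations(X, names, threshold=0.8):
    """List column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Toy data: column "b" is exactly twice column "a".
summary = summarize_column([1, 2, 3, 4, 5, 100, float("nan")])
pairs = high_correlations(
    [[1, 2, 5], [2, 4, 3], [3, 6, 8], [4, 8, 1]], ["a", "b", "c"]
)
```

A flagged outlier still needs the interpretation step: whether 100 is an error or a real value depends on what the variable means.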

Where do we need normalization?
Where do we need one-hot encoding?
- Do we really need it, or can we split some variable values into ordinal bins?
Would any non-linear transformations make sense here?
Check correlation and distribution of interactions
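
The preprocessing options above can be sketched as follows. The helpers `standardize`, `one_hot`, and `ordinal_bins` and the bin edges are hypothetical names and values chosen for the example. Dropping the first level in `one_hot` assumes the model has an intercept, so that the dummy columns are not perfectly collinear with it.

```python
import numpy as np

def standardize(x):
    """Z-score a numeric column (zero mean, unit standard deviation)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def one_hot(labels, drop_first=True):
    """One-hot encode labels; returns the 0/1 matrix and the kept levels."""
    levels = sorted(set(labels))
    kept = levels[1:] if drop_first else levels
    matrix = np.array(
        [[1.0 if lab == lev else 0.0 for lev in kept] for lab in labels]
    )
    return matrix, kept

def ordinal_bins(x, edges):
    """Map numeric values to integer bin codes 0..len(edges)."""
    return np.digitize(np.asarray(x, dtype=float), edges)

z = standardize([1.0, 2.0, 3.0])
dummies, kept = one_hot(["b", "a", "c", "a"])  # level "a" becomes the baseline
bins = ordinal_bins([5, 15, 25], edges=[10, 20])
```

Ordinal bins use one coefficient instead of one per level, but they assume the effect changes by the same amount from each bin to the next, which is worth checking before choosing them over one-hot columns.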

Does the sign of the coefficient make sense? Is it statistically significant?
Does Rsq make sense? Would you expect it to be big or small and why?
Is Rsq on the validation set lower than on the training set?
Distribution and Q-Q plot of residuals: do they make sense? If not, is there a transformation that would improve things?
- Shapiro-Wilk, Cook's distance?
- Is VIF less than 10 for each variable?
- partial regression plots, do they make sense? Should some variables be kicked out?
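
Some of the diagnostics above can be computed directly with NumPy. The sketch below uses synthetic data and hypothetical helper names (`design`, `fit_ols`, `predict`, `r_squared`, `vif`, `cooks_distance`); it covers train versus validation Rsq, VIF, and Cook's distance only. It assumes the model includes an intercept and that `X` is a 2-D array.

```python
import numpy as np

def design(X):
    """Prepend an intercept column to a 2-D feature array."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([np.ones(X.shape[0]), X])

def fit_ols(X, y):
    """Least-squares coefficients, intercept first."""
    beta, *_ = np.linalg.lstsq(design(X), np.asarray(y, dtype=float), rcond=None)
    return beta

def predict(beta, X):
    return design(X) @ beta

def r_squared(y, y_hat):
    y = np.asarray(y, dtype=float)
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def vif(X):
    """VIF_j = 1 / (1 - Rsq_j), where Rsq_j regresses column j on the others."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = r_squared(X[:, j], predict(fit_ols(others, X[:, j]), others))
        factors.append(float(1.0 / (1.0 - r2)))
    return factors

def cooks_distance(X, y):
    """D_i = e_i^2 * h_ii / (p * s^2 * (1 - h_ii)^2), p counts the intercept."""
    A = design(X)
    y = np.asarray(y, dtype=float)
    n, p = A.shape
    resid = y - A @ fit_ols(X, y)
    leverage = np.sum((A @ np.linalg.pinv(A.T @ A)) * A, axis=1)  # hat diagonal
    s2 = np.sum(resid ** 2) / (n - p)
    return resid ** 2 * leverage / (p * s2 * (1.0 - leverage) ** 2)

# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=300)
X_train, X_val, y_train, y_val = X[:200], X[200:], y[:200], y[200:]

beta = fit_ols(X_train, y_train)
r2_train = r_squared(y_train, predict(beta, X_train))
r2_val = r_squared(y_val, predict(beta, X_val))
vifs = vif(X_train)
cooks = cooks_distance(X_train, y_train)
```

A validation Rsq far below the training Rsq, a VIF above 10, or a handful of observations with much larger Cook's distance than the rest are each a prompt to go back to the earlier questions rather than an automatic verdict.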

Do we get enough data, is it clean enough?
How do we see whether the data or the process has changed enough to make the model invalid?
Do we get the data quickly enough for the answer to make sense?
Is the model interpretable enough and accepted by the business?
How would model errors affect the business process? Are there any expected feedback loops?
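
One simple way to watch for a model going stale is to compare the errors on recent data with the errors seen at training time. The sketch below is one possible approach, not a standard procedure: the function name `error_drift` and the tolerance of 1.5 are assumptions for the example, and the right threshold depends on how much extra error the business process can absorb.

```python
import numpy as np

def error_drift(train_residuals, recent_residuals, tolerance=1.5):
    """Compare recent RMSE and bias against the training residuals."""
    train = np.asarray(train_residuals, dtype=float)
    recent = np.asarray(recent_residuals, dtype=float)
    train_rmse = float(np.sqrt(np.mean(train ** 2)))
    recent_rmse = float(np.sqrt(np.mean(recent ** 2)))
    ratio = recent_rmse / train_rmse
    return {
        "rmse_ratio": ratio,
        "recent_bias": float(recent.mean()),  # systematic over/under-prediction
        "drifted": ratio > tolerance,         # assumed threshold, tune per use case
    }

stable = error_drift([1, -1, 1, -1], [1, -1])
shifted = error_drift([1, -1, 1, -1], [2, 2, 2, 2])
```

A non-zero `recent_bias` with a modest `rmse_ratio` suggests a level shift in the process, while a large ratio with little bias points to noisier or structurally different data.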

Written on March 5, 2020