Building a regression model is an iterative process driven by evaluation. We build a model, check it against metrics, make improvements, and repeat until we reach the desired accuracy. Evaluation metrics describe the performance of a model.
Model evaluation is used with all types of algorithms. For Linear Regression, the metrics covered here are:
- R-Squared
- Adjusted R-Squared
- P-Value and VIF
- RMSE
- Durbin-Watson
R-squared is an evaluation metric that measures how well the model fits the data: the higher the R-squared, the better the fit.
Let's say that after evaluation we got R-squared = 0.81. This means the model explains 81% of the variance in the data. Informally, this is sometimes read as the model being 81% accurate, although R-squared measures explained variance rather than accuracy in the classification sense.
We compute the RSS (residual sum of squares) as the sum of the squares of (actual - predicted).
For the TSS (total sum of squares) we take the sum of the squares of (actual - mean value). R-squared is then 1 - RSS / TSS.
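As a minimal sketch of how the two sums combine, the snippet below computes R-squared as 1 - RSS / TSS, taking TSS over the actual values around their mean (the standard definition). The function name `r_squared` and the numbers are made up for illustration; they are not taken from the car-price model.

```python
def r_squared(actual, predicted):
    """Return R-squared = 1 - RSS / TSS for two equal-length sequences."""
    mean_actual = sum(actual) / len(actual)
    # RSS: squared differences between actual and predicted values
    rss = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # TSS: squared differences between actual values and their mean
    tss = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - rss / tss


# Made-up numbers, for illustration only
actual = [3.0, 5.0, 7.0, 9.0]
perfect = [3.0, 5.0, 7.0, 9.0]      # residuals are all 0
mean_only = [6.0, 6.0, 6.0, 6.0]    # always predicts the mean

print(r_squared(actual, perfect))    # 1.0
print(r_squared(actual, mean_only))  # 0.0
```

A model that reproduces every point scores 1, while a model that does no better than predicting the mean scores 0.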
When the residuals are all 0, RSS is 0 and R-Squared = 1. That is the best fit. As the fit line gets poorer, the R-Squared value drops and gets close to zero. For a linear regression with an intercept, evaluated on the data it was fitted to, R-Squared lies between 0 and 1:
0 <= R-Squared <= 1
In the final model we got R-Squared = 0.839.
This means we can explain 83.9% of the variance in the data; informally, we can say the model is about 84% accurate.
Adjusted R-Squared penalizes a model that has too many features or variables.
Let's say we have 2 models, each with 3 features, where R-squared is 83.9% and adjusted R-squared is 83.2%.
Now if we add 3 more features to the first model, R-squared will increase (it never decreases when features are added). Adjusted R-squared, however, penalizes the model: unless the new features improve the fit enough, it reports a lower value, signalling that the added features may be a problem for the model.
Adjusted R-squared increases only if the new term improves the model more than would be expected by chance. It gives the percentage of variation explained by only those independent variables that actually affect the dependent variable. Like R-squared, adjusted R-squared measures the goodness of fit of the model.
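To make the penalty concrete, here is a small sketch using the usual formula, adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (n - p - 1), where n is the number of observations and p the number of features. The function name and the values of R-squared and n are hypothetical, chosen only to show the effect.

```python
def adjusted_r_squared(r2, n_samples, n_features):
    """Adjusted R-squared = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Hypothetical values: neither R-squared nor the sample size comes from the article
r2 = 0.80
n = 50

adj_3 = adjusted_r_squared(r2, n, 3)   # model with 3 features
adj_6 = adjusted_r_squared(r2, n, 6)   # same R-squared, but 6 features

print(round(adj_3, 3))  # 0.787
print(round(adj_6, 3))  # 0.772
```

With the same R-squared, the model that uses more features gets the lower adjusted value, so extra features have to earn their place by raising R-squared enough.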
As we can see, in our final Model-1, R-Squared and adjusted R-Squared are very close:
R-Squared = 0.839
Adjusted R-Squared = 0.832
With this, we assume that none of the other variables needs to be added to the model as a predictor.
P-Value and VIF
The P-value tells us how significant a variable is, and VIF (variance inflation factor) measures multicollinearity among the independent variables. A VIF > 5 indicates a high correlation.
The P-value is the probability of seeing a result at least as extreme as the one observed if the Null Hypothesis were true. For a regression coefficient, the Null Hypothesis is that the variable has no effect.
The higher the P-value, the weaker the evidence against the Null Hypothesis, so we fail to reject it.
The lower the P-value, the stronger the evidence against the Null Hypothesis, so we reject it.
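The decision rule can be sketched in a few lines. This example assumes a large sample, so that a coefficient's test statistic (coefficient divided by its standard error) can be compared against a standard normal distribution; regression packages typically use the t distribution instead. The function name and the statistics below are hypothetical.

```python
from statistics import NormalDist


def two_sided_p_value(test_statistic):
    """Two-sided p-value under a standard normal approximation (an assumption)."""
    return 2 * (1 - NormalDist().cdf(abs(test_statistic)))


ALPHA = 0.05  # significance threshold

# Hypothetical test statistics, for illustration only
for stat in (0.5, 2.5):
    p = two_sided_p_value(stat)
    decision = "reject" if p < ALPHA else "fail to reject"
    print(f"statistic={stat}: p={p:.4f} -> {decision} the Null Hypothesis")
```

The small statistic gives a large p-value and the Null Hypothesis stands; the large statistic gives a p-value below 0.05 and the variable is treated as significant.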
Our observation from the final model above: every variable has a P-value of less than 0.05, which means they are all statistically significant.
The VIF values are less than 5 in our final model, so we can say there is no problematic multicollinearity among the independent variables.
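For intuition about the VIF threshold, the sketch below covers only the special case of two predictors. In general, VIF for a variable is 1 / (1 - R-squared), where R-squared comes from regressing that variable on all the other predictors; with just two predictors, that R-squared equals their squared Pearson correlation. The helper names and data are made up for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - r^2), valid as written only for exactly two predictors."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)


# Made-up predictor columns, for illustration only
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
nearly_copy = [1.1, 1.9, 3.2, 3.9, 5.1]   # almost identical to x1
unrelated = [2.0, -1.0, -2.0, -1.0, 2.0]  # uncorrelated with x1

print(vif_two_predictors(x1, nearly_copy) > 5)  # True
print(vif_two_predictors(x1, unrelated))        # 1.0
```

A VIF of 1 means the predictor shares nothing with the other one, while a near-duplicate column pushes the VIF far above 5.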
RMSE (Root Mean Square Error)
RMSE is one of the most widely used metrics. It measures the differences between the values predicted by the model and the values observed, and it is a standard way to measure the error of a model in predicting quantitative data.
The RMSE is the standard deviation of the residuals (prediction errors). Residuals measure how far the data points are from the regression line.
It is computed as the square root of the sum of squared errors divided by the total number of sample observations.
In Model-1 we calculated an RMSE of 0.08586514469292264. The lower the RMSE, the better the model performance.
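The calculation itself is short. Here is a minimal sketch with made-up numbers (not the Model-1 data), where every prediction misses by exactly 1 so the result is easy to check by hand.

```python
import math


def rmse(actual, predicted):
    """Root mean square error: sqrt(mean((actual - predicted)^2))."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values, for illustration only
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [1.0, 5.0, 5.0, 9.0]   # every prediction is off by exactly 1

print(rmse(actual, predicted))  # 1.0
```

RMSE is expressed in the same units as the target variable, which is why it is easy to interpret.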
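Durbin-Watson: The Durbin-Watson test evaluates autocorrelation in the residuals. Its statistic lies between 0 and 4; a value close to 2 indicates no autocorrelation, values toward 0 indicate positive autocorrelation, and values toward 4 indicate negative autocorrelation.
In the final model above we got Durbin-Watson: 2.132, which is close to 2.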
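As a rough sketch of what the statistic measures, it is the sum of squared differences between successive residuals divided by the sum of squared residuals. The residual series below are made up to show the two extremes; the function name is only illustrative.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by sum of squared residuals."""
    diff_sq = sum((residuals[t] - residuals[t - 1]) ** 2
                  for t in range(1, len(residuals)))
    total_sq = sum(e ** 2 for e in residuals)
    return diff_sq / total_sq


# Made-up residual series, for illustration only
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # sign flips every step
trending = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]      # long runs of the same sign

print(round(durbin_watson(alternating), 2))  # 3.33, toward 4
print(round(durbin_watson(trending), 2))     # 0.67, toward 0
```

Residuals that keep flipping sign push the statistic toward 4, and residuals that move in long runs push it toward 0. Values near 2 sit between these two patterns.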
Each evaluation metric has disadvantages, and ignoring them can distort how we measure performance. All of these metrics (R-Squared, Adjusted R-Squared, RMSE, VIF, P-Value, Residual, Durbin-Watson) tell us how good the model is (the best-fit model), and each one is important for model evaluation, as described above.
Linear Regression also rests on some assumptions, which we need to check while measuring the metrics.
- RMSE has a disadvantage: it is affected by outliers in the data sample.
- If the relationship in the sample is not linear, RMSE suffers as well: there is no linear trend to fit, so the standard deviation of the residuals will vary.
- Residuals are also affected by outliers, which increase the residual sum of squares.
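The outlier effect is easy to demonstrate. In this sketch (made-up numbers, not the car-price data), a single badly missed observation is enough to inflate RMSE several times over.

```python
import math


def rmse(actual, predicted):
    """Root mean square error of two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Made-up values, for illustration only
actual = [10.0, 12.0, 14.0, 16.0, 18.0]
predicted = [11.0, 11.0, 15.0, 15.0, 19.0]   # each prediction is off by 1

clean_rmse = rmse(actual, predicted)

# Replace the last observation with an outlier that the model misses by 21
actual_with_outlier = actual[:-1] + [40.0]
outlier_rmse = rmse(actual_with_outlier, predicted)

print(clean_rmse)               # 1.0
print(round(outlier_rmse, 2))   # 9.43
```

Because the errors are squared before averaging, one large residual dominates the total, which is why it is worth checking for outliers before trusting RMSE.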
stroke, engine size, fuelsystem_idi, car_company_bmw, cylindernumber_two, car_company_subaru
According to Model-1, these are the driving factors on which the pricing of cars depends.
- The model is good, with R-Squared = 0.839.
- The adjusted R-Squared is also good, at 0.832.
- As R-Squared and adjusted R-Squared are very close, we can assume that none of the other variables needs to be added to the model as a predictor.
- The model has Durbin-Watson: 2.132, which is close to 2, so we can assume there is no autocorrelation.
- VIF < 5 for every variable, which means no problematic multicollinearity.
- Every P-value is less than 0.05, which means the variables are statistically significant.
OK, that’s it, we are done now. If you have any questions or suggestions, please feel free to reach out to me. I’ll come up with more Machine Learning topics soon.