Introduction to Regression#
Regression analysis is one of the most powerful tools in a data analyst’s toolkit. It allows us to understand relationships between variables — how one variable changes when another changes. In business and economics, this helps us answer questions like:
- How does distance from the city centre affect hotel prices?
- Does advertising spending increase sales?
- What factors predict employee performance?
In this chapter, we’ll learn how to run and interpret regression models using Python’s statsmodels library.
What is Regression?#
At its core, regression finds the best-fitting line through your data. The equation for a simple linear regression is:

$$y = \alpha + \beta x + \varepsilon$$

Where:

- y is the dependent variable (what we’re trying to predict)
- x is the independent variable (what we use to make predictions)
- α (alpha) is the intercept (the value of y when x = 0)
- β (beta) is the slope (how much y changes for each unit increase in x)
- ε (epsilon) is the error term (the difference between predicted and actual values)
The goal of regression is to estimate α and β so we can make predictions.
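Before turning to real data, here is a tiny numerical sketch of how the equation is used for prediction. The α and β values below are made up purely for illustration:
# Illustrative (made-up) estimates: alpha = 100, beta = -10
alpha = 100   # intercept: predicted y when x = 0
beta = -10    # slope: change in y for each one-unit increase in x
x = 2.5       # a hypothetical value of the independent variable
y_hat = alpha + beta * x
print(y_hat)  # 100 + (-10) * 2.5 = 75.0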
Setting Up#
Let’s load our libraries and data. We’ll use the Vienna Hotels dataset, which contains information about hotel prices and characteristics.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
# Load the Vienna hotels data
hotels = pd.read_csv("https://raw.githubusercontent.com/sakibanwar/python-notes/main/data/vienna_hotels.csv")
Tip
You can also download the dataset and load it locally:
hotels = pd.read_csv("vienna_hotels.csv")
hotels.shape
(428, 24)
We have 428 hotels with 24 variables. Let’s see what we’re working with:
hotels.head()
country city_actual rating_count center1label center2label \
0 Austria Vienna 36.0 City centre Donauturm
1 Austria Vienna 189.0 City centre Donauturm
2 Austria Vienna 53.0 City centre Donauturm
neighbourhood price city stars ratingta ... distance rating
0 17. Hernals 81 Vienna 4.0 4.5 ... 2.7 4.4
1 17. Hernals 81 Vienna 4.0 3.5 ... 1.7 3.9
2 Alsergrund 85 Vienna 4.0 3.5 ... 1.4 3.7
Filtering the Data#
For our analysis, let’s focus on 3-4 star hotels with reasonable prices:
# Filter the data
hotels = hotels[hotels["accommodation_type"] == "Hotel"]
hotels = hotels[hotels["city_actual"] == "Vienna"]
hotels = hotels[(hotels["stars"] >= 3) & (hotels["stars"] <= 4)]
hotels = hotels[hotels["stars"].notnull()]
hotels = hotels[hotels["price"] <= 600]
print(f"Hotels remaining: {len(hotels)}")
Hotels remaining: 207
We’ve filtered down to 207 hotels that are 3-4 stars in Vienna with prices under $600.
Exploring the Data First#
Before running a regression, we should always explore our data. Let’s look at our two key variables: price and distance from the city centre.
hotels.filter(["price", "distance"]).describe().T
count mean std min 25% 50% 75% max
price 207.0 109.975845 42.221381 50.0 82.0 100.0 129.5 383.0
distance 207.0 1.529952 1.161507 0.0 0.8 1.3 1.9 6.6
The average hotel price is about $110, and the average distance from the city centre is about 1.5 miles.
Visualising the Relationship#
A scatter plot shows us the relationship between distance and price:
sns.scatterplot(data=hotels, x="distance", y="price", color="red", alpha=0.5)
plt.xlabel("Distance to city center (miles)")
plt.ylabel("Price (US dollars)")
plt.title("Hotel Prices vs Distance from City Centre")
plt.show()
[Scatter plot showing a negative relationship - hotels closer to the
city centre tend to be more expensive]
There appears to be a negative relationship — as distance increases, price tends to decrease. But how can we quantify this?
Simple Linear Regression#
Let’s fit a regression line to our data. In Python, we use statsmodels:
import statsmodels.formula.api as smf
# Fit the regression model
model = smf.ols("price ~ distance", data=hotels).fit()
# Print the results
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: price R-squared: 0.157
Model: OLS Adj. R-squared: 0.153
Method: Least Squares F-statistic: 38.20
Date: Fri, 03 May 2024 Prob (F-statistic): 3.39e-09
Time: 16:24:25 Log-Likelihood: -1050.3
No. Observations: 207 AIC: 2105.
Df Residuals: 205 BIC: 2111.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 132.0170 4.474 29.511 0.000 123.197 140.837
distance -14.4064 2.331 -6.181 0.000 -19.002 -9.811
==============================================================================
That’s a lot of output! Let’s break down the key parts.
Interpreting Regression Output#
The Coefficients#
The most important numbers are in the coefficient table:
| Variable | Coefficient | Interpretation |
|---|---|---|
| Intercept | 132.02 | When distance = 0, predicted price is $132.02 |
| distance | -14.41 | For each additional mile from centre, price decreases by $14.41 |
So our regression equation is:

$$\widehat{\text{price}} = 132.02 - 14.41 \times \text{distance}$$

This tells us:

- A hotel right in the city centre (distance = 0) would be predicted to cost $132.02
- Each mile further from the centre reduces the price by about $14.41
Statistical Significance#
The p-value (P>|t|) tells us whether each coefficient is statistically significant:
Intercept p-value: 0.000 (highly significant)
Distance p-value: 0.000 (highly significant)
A p-value less than 0.05 is typically considered statistically significant. Both our coefficients are highly significant, meaning we can be confident the relationship is real, not just random noise.
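If you just want the p-values without reading the full summary table, the fitted model exposes them directly as a pandas Series:
# Extract p-values from the fitted model
print(model.pvalues)
# Flag which coefficients are significant at the 5% level
print(model.pvalues < 0.05)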
Confidence Intervals#
The [0.025, 0.975] columns show the 95% confidence interval for each coefficient:
Distance coefficient: [-19.00, -9.81]
We’re 95% confident the true effect of distance is between -$19.00 and -$9.81 per mile.
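You can also pull the confidence intervals straight from the fitted model. conf_int() returns one row per coefficient, with the lower and upper bounds as columns (95% by default):
# 95% confidence intervals for the coefficients
print(model.conf_int())
# A 90% interval instead
print(model.conf_int(alpha=0.10))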
R-squared: How Good is Our Model?#
R-squared = 0.157 means our model explains about 15.7% of the variation in hotel prices.
Is that good? Well, it depends:
It means distance alone doesn’t fully explain price differences
Other factors (star rating, amenities, reviews) also matter
But distance does have a statistically significant effect
Note
R-squared ranges from 0 to 1. Higher values mean the model explains more of the variation in the dependent variable. An R-squared of 0.157 means 15.7% of price variation is explained by distance alone.
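Like the coefficients and p-values, R-squared is available directly as an attribute of the fitted model:
print(f"R-squared: {model.rsquared:.3f}")
print(f"Adjusted R-squared: {model.rsquared_adj:.3f}")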
Visualising the Regression Line#
Seaborn makes it easy to plot the regression line with the data:
sns.lmplot(
x="distance", y="price", data=hotels,
scatter_kws={"color": "red", "alpha": 0.5, "s": 20},
line_kws={"color": "blue"},
ci=None # Don't show confidence interval
)
plt.xlabel("Distance to city center (miles)")
plt.ylabel("Price (US dollars)")
plt.show()
[Scatter plot with a downward-sloping blue regression line through the
red data points]
The blue line shows our predicted prices at each distance. Notice how the actual prices (red dots) scatter around this line — that scatter is what R-squared measures.
Comparing Groups: Close vs Far#
Before diving into regression, a simpler approach is to compare group means. Let’s create a binary variable for “close” vs “far” hotels:
# Create a binary variable: Far if distance >= 2 miles, Close otherwise
hotels["dist_category"] = np.where(hotels["distance"] >= 2, "Far", "Close")
# Calculate mean price for each group
hotels.groupby("dist_category")["price"].agg(["mean", "count"])
mean count
dist_category
Close 116.43 157
Far 89.72 50
Hotels close to the centre (< 2 miles) cost on average $116.43, while those further away average $89.72 — a difference of about $27!
Visualising Group Differences#
A box plot shows the full distribution for each group:
sns.boxplot(data=hotels, x="dist_category", y="price")
plt.xlabel("Distance Category")
plt.ylabel("Price (US dollars)")
plt.title("Hotel Prices: Close vs Far from City Centre")
plt.show()
[Box plot showing "Close" hotels with higher median and wider spread
than "Far" hotels]
The “Close” box is higher (higher median price) and has more spread (more price variation).
Making Predictions#
One key use of regression is making predictions. We can extract the predicted values:
# Add predicted prices to our dataframe
hotels["predicted_price"] = model.fittedvalues
hotels["residual"] = model.resid # Actual - Predicted
# Look at a few examples
hotels[["distance", "price", "predicted_price", "residual"]].head()
distance price predicted_price residual
1 1.7 81 107.526056 -26.526056
2 1.4 85 111.847984 -26.847984
3 1.7 83 107.526056 -24.526056
4 1.2 82 114.729269 -32.729269
6 0.9 103 119.051194 -16.051194
The residual is the prediction error: the actual price minus the predicted price.
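We can also predict prices for distances that are not in the data. With the formula interface, model.predict() accepts a DataFrame whose columns match the variable names in the formula; the distances below are hypothetical:
# Predict prices for hypothetical hotels at chosen distances
new_hotels = pd.DataFrame({"distance": [0.5, 2.0, 5.0]})
print(model.predict(new_hotels))
# e.g. at 0.5 miles: 132.02 - 14.41 * 0.5, roughly $125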
Finding Underpriced Hotels#
Which hotels are the best deals? Those with the most negative residuals (priced below what the model predicts):
hotels.nsmallest(5, "residual")[["distance", "price", "predicted_price", "residual"]]
distance price predicted_price residual
153 1.1 54 116.169910 -62.169910
10 1.1 60 116.169910 -56.169910
211 1.0 63 117.610552 -54.610552
426 1.4 58 111.847984 -53.847984
163 0.9 68 119.051194 -51.051194
These hotels are priced $50-62 less than the model would predict based on their location!
Finding Overpriced Hotels#
Conversely, which hotels are priced above what we’d expect?
hotels.nlargest(5, "residual")[["distance", "price", "predicted_price", "residual"]]
distance price predicted_price residual
247 1.9 383 104.644774 278.355226
26 2.9 208 90.238353 117.761647
128 0.0 242 132.016973 109.983027
110 0.1 231 130.576331 100.423669
129 0.5 223 124.813762 98.186238
One hotel costs $383 but the model predicts only $105 — it’s $278 more expensive than expected. Perhaps it has luxury amenities we haven’t accounted for!
Checking Model Fit: Residual Distribution#
A good regression model should have residuals (errors) that are roughly normally distributed around zero. Let’s check:
sns.histplot(data=hotels, x="residual", bins=20, color="blue", alpha=0.7)
plt.xlabel("Residual (Actual - Predicted)")
plt.ylabel("Count")
plt.title("Distribution of Prediction Errors")
plt.axvline(x=0, color="red", linestyle="--", label="Zero")
plt.legend()
plt.show()
[Histogram showing most residuals clustered around 0, with a slight
right skew due to some overpriced hotels]
Most residuals are near zero (good!), but there’s a right skew due to some expensive outliers.
Robust Standard Errors#
In real-world data, the spread of residuals often varies at different values of x (this is called heteroskedasticity). We can use robust standard errors to account for this:
# Fit with robust standard errors
model_robust = smf.ols("price ~ distance", data=hotels).fit(cov_type="HC3")
print(model_robust.summary())
OLS Regression Results
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 132.0170 4.876 27.072 0.000 122.459 141.575
distance -14.4064 2.718 -5.301 0.000 -19.733 -9.080
==============================================================================
Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)
Notice the coefficients are the same, but the standard errors are slightly different. The conclusions don’t change here, but robust standard errors are generally safer to use.
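To compare the two sets of standard errors side by side rather than reading two summaries, both fitted results expose them through .bse (a quick sketch using the models above):
# Conventional vs robust (HC3) standard errors
se_compare = pd.DataFrame({
    "nonrobust": model.bse,
    "robust_HC3": model_robust.bse,
})
print(se_compare)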
Log Transformations#
Sometimes relationships aren’t linear. Log transformations can help capture non-linear patterns and have useful interpretations.
Creating Log Variables#
# Create log-transformed variables
hotels["ln_price"] = np.log(hotels["price"])
hotels["ln_distance"] = np.log(hotels["distance"].clip(lower=0.05)) # Avoid log(0)
Warning
You can’t take the log of zero or negative numbers. If your variable has zeros, you can either:
- Add a small constant (e.g., log(x + 1))
- Use a minimum value (e.g., clip(lower=0.05))
Four Types of Regression Models#
| Model Type | Equation | Interpretation of β |
|---|---|---|
| Level-Level | price ~ distance | 1-unit increase in x → β change in y |
| Level-Log | price ~ ln(distance) | 1% increase in x → β/100 change in y |
| Log-Level | ln(price) ~ distance | 1-unit increase in x → β×100% change in y |
| Log-Log | ln(price) ~ ln(distance) | 1% increase in x → β% change in y |
Let’s run all four:
1. Level-Level (Our Original Model)#
reg1 = smf.ols("price ~ distance", data=hotels).fit()
print(f"Coefficient: {reg1.params['distance']:.4f}")
print(f"R-squared: {reg1.rsquared:.3f}")
Coefficient: -14.4064
R-squared: 0.157
Interpretation: Each additional mile from the centre reduces price by $14.41.
2. Level-Log#
reg2 = smf.ols("price ~ ln_distance", data=hotels).fit()
print(f"Coefficient: {reg2.params['ln_distance']:.4f}")
print(f"R-squared: {reg2.rsquared:.3f}")
Coefficient: -24.7683
R-squared: 0.280
Interpretation: A 1% increase in distance reduces price by about $0.25 (coefficient ÷ 100).
Notice the R-squared is higher (0.280 vs 0.157) — the level-log model fits better!
3. Log-Level#
reg3 = smf.ols("ln_price ~ distance", data=hotels).fit()
print(f"Coefficient: {reg3.params['distance']:.4f}")
print(f"R-squared: {reg3.rsquared:.3f}")
Coefficient: -0.1313
R-squared: 0.205
Interpretation: Each additional mile from the centre reduces price by about 13.1%.
4. Log-Log#
reg4 = smf.ols("ln_price ~ ln_distance", data=hotels).fit()
print(f"Coefficient: {reg4.params['ln_distance']:.4f}")
print(f"R-squared: {reg4.rsquared:.3f}")
Coefficient: -0.2158
R-squared: 0.334
Interpretation: A 1% increase in distance reduces price by about 0.22%.
The log-log model has the highest R-squared (0.334), explaining 33.4% of price variation!
Comparing Models#
| Model | R-squared | Best for… |
|---|---|---|
| Level-Level | 0.157 | Simple interpretation |
| Level-Log | 0.280 | When the effect diminishes at higher x |
| Log-Level | 0.205 | When y changes by a percentage |
| Log-Log | 0.334 | When both change by percentages (elasticity) |
Tip
In economics, the log-log model gives you the elasticity — how responsive one variable is to percentage changes in another. An elasticity of -0.22 means a 10% increase in distance leads to a 2.2% decrease in price.
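The arithmetic behind that statement is just the elasticity multiplied by the percentage change in x. A quick check using the log-log coefficient estimated above:
# Elasticity from the log-log model (about -0.22)
elasticity = reg4.params["ln_distance"]
# Predicted percentage change in price for a 10% increase in distance
print(f"{elasticity * 10:.1f}%")  # about -2.2%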
Polynomial Regression#
What if the relationship is curved rather than straight? We can add polynomial terms (squared, cubed) to capture curvature.
Adding Squared Term#
# Create squared distance
hotels["distance_sq"] = hotels["distance"] ** 2
# Fit quadratic model
reg_quad = smf.ols("price ~ distance + distance_sq", data=hotels).fit()
print(reg_quad.summary())
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 154.8580 6.045 25.617 0.000 142.939 166.777
distance -46.0140 6.394 -7.197 0.000 -58.620 -33.408
distance_sq 6.9277 1.316 5.263 0.000 4.332 9.523
==============================================================================
R-squared: 0.258
The squared term is statistically significant (p < 0.05), suggesting a curved relationship!
Interpretation: The negative coefficient on distance and positive coefficient on distance_sq means prices fall quickly at first, then level off further from the centre.
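We can make “level off” precise: a quadratic a·x² + b·x + c flattens out at x = −b / (2a). A quick calculation with the coefficients above:
# Distance at which the fitted curve stops falling
b = reg_quad.params["distance"]      # about -46.0
a = reg_quad.params["distance_sq"]   # about 6.9
turning_point = -b / (2 * a)
print(f"Predicted price stops falling at about {turning_point:.1f} miles")  # roughly 3.3 miles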
Visualising the Curve#
sns.lmplot(
x="distance", y="price", data=hotels,
order=2, # Polynomial order
scatter_kws={"color": "red", "alpha": 0.6},
line_kws={"color": "blue"}
)
plt.xlabel("Distance to city center (miles)")
plt.ylabel("Price (US dollars)")
plt.title("Quadratic Fit: Price vs Distance")
plt.show()
[Scatter plot with a curved blue line that drops steeply at first
then flattens out at greater distances]
The curve captures how the price premium for central locations is strongest for hotels very close to the centre.
Adding Cubic Term#
We can add a cubic term for even more flexibility:
hotels["distance_cb"] = hotels["distance"] ** 3
reg_cubic = smf.ols("price ~ distance + distance_sq + distance_cb", data=hotels).fit()
print(f"R-squared: {reg_cubic.rsquared:.3f}")
R-squared: 0.276
The R-squared improves slightly, but be careful — adding too many polynomial terms can lead to overfitting, where the model captures noise rather than true patterns.
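One quick check on whether the cubic term earns its keep is to compare adjusted R-squared (covered in more detail later), which penalises the extra term:
# Does the cubic term improve the fit once we penalise for complexity?
print(f"Quadratic adjusted R-squared: {reg_quad.rsquared_adj:.3f}")
print(f"Cubic adjusted R-squared:     {reg_cubic.rsquared_adj:.3f}")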
Comparing Multiple Models#
The stargazer package creates publication-quality regression tables:
# Install if needed: pip install stargazer
from stargazer.stargazer import Stargazer
# Create comparison table
table = Stargazer([reg1, reg2, reg_quad])
table.custom_columns(["Level-Level", "Level-Log", "Quadratic"], [1, 1, 1])
table.rename_covariates({"Intercept": "Constant"})
This creates a side-by-side comparison showing coefficients, standard errors, and fit statistics for each model.
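In a Jupyter notebook, evaluating table as the last expression in a cell renders it as an HTML table. If you need the table in a report, stargazer can also export it (a sketch, assuming the render methods are available in your installed version):
# Export the comparison table (assuming these methods exist in your stargazer version)
html_code = table.render_html()
latex_code = table.render_latex()
print(latex_code[:300])  # preview the LaTeX output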
Key Regression Concepts Summary#
| Concept | What It Tells You |
|---|---|
| Coefficient (β) | The effect of x on y |
| Standard Error | Uncertainty in the coefficient estimate |
| t-statistic | Coefficient ÷ Standard Error |
| p-value | Probability of seeing this result if the true effect is zero |
| R-squared | Proportion of y variation explained by the model |
| Adjusted R-squared | R-squared penalised for adding more variables |
| Residuals | Actual - Predicted values |
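Every item in this table can be read straight off a fitted statsmodels results object. Using the simple model fitted earlier:
print(model.params)        # coefficients (alpha and beta)
print(model.bse)           # standard errors
print(model.tvalues)       # t-statistics
print(model.pvalues)       # p-values
print(model.rsquared)      # R-squared
print(model.rsquared_adj)  # adjusted R-squared
print(model.resid.head())  # residuals (actual - predicted)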
When to Use Which Model#
| Situation | Recommended Model |
|---|---|
| Simple relationship, easy interpretation | Level-Level |
| Diminishing returns | Level-Log |
| Percentage changes in outcome | Log-Level |
| Elasticity / percentage relationships | Log-Log |
| Curved relationship | Polynomial |
Warning
Correlation is not causation! Our regression shows that distance and price are associated, but it doesn’t prove that distance causes price differences. There could be confounding factors — perhaps cheaper hotels locate further out because land is cheaper, not because guests demand lower prices.
Putting It All Together#
Here’s a complete workflow for running a regression analysis:
# 1. Load and prepare data
hotels = pd.read_csv("https://raw.githubusercontent.com/sakibanwar/python-notes/main/data/vienna_hotels.csv")
hotels = hotels[hotels["accommodation_type"] == "Hotel"]
hotels = hotels[hotels["price"] <= 600]
# 2. Explore the relationship
sns.scatterplot(data=hotels, x="distance", y="price")
plt.show()
# 3. Fit the model
model = smf.ols("price ~ distance", data=hotels).fit()
print(model.summary())
# 4. Check significance
# - Look at p-values (< 0.05 is significant)
# - Look at confidence intervals (shouldn't include 0)
# 5. Evaluate fit
# - R-squared tells you how much variation is explained
# - Check residual distribution
# 6. Make predictions
hotels["predicted"] = model.fittedvalues
hotels["residual"] = model.resid
# 7. Try alternative specifications
# - Log transformations
# - Polynomial terms
# - Additional control variables
Quick Regression Plots with sns.regplot()#
We used sns.lmplot() earlier to visualise the regression line. But there’s a simpler option when you just want a quick scatter-plus-line: sns.regplot().
What’s the difference? lmplot creates a new figure and supports faceting (splitting by categories), while regplot draws onto an existing axes — making it easier to combine with other matplotlib customisations. For everyday use, regplot is often all you need:
# Simple regression plot
sns.regplot(data=hotels, x="distance", y="price", ci=None,
scatter_kws={"alpha": 0.5, "color": "red"},
line_kws={"color": "blue"})
plt.xlabel("Distance to City Centre (miles)")
plt.ylabel("Price (US dollars)")
plt.title("Hotel Prices vs Distance")
plt.show()
[Scatter plot with red dots and a blue regression line]
Notice the ci parameter — this controls whether seaborn shows a confidence band around the line. Set it to None to hide the band, or 95 for a 95% confidence band. Try both and see which you prefer!
Tip
sns.regplot() is great for seeing the relationship, but it doesn’t give you any numbers. For the actual coefficients, p-values, and R-squared, you still need smf.ols(). Think of regplot as the picture and ols as the statistics.
Multiple Regression#
So far, we’ve used a single predictor (distance) to explain hotel prices. But think about it — does distance alone really determine price? What about star rating? Amenities? Location quality?
Our simple model had an R-squared of 0.157 — meaning distance explains only about 16% of the variation in price. That leaves 84% unexplained! Multiple regression lets us add more predictors to explain more of that variation.
Why Multiple Regression?#
There are three big reasons to include multiple predictors:
1. Better predictions — more variables usually explain more variation
2. Control for confounding — isolate the effect of one variable while holding others constant
3. Compare importance — see which factors matter most
Fitting a Multiple Regression Model#
The syntax in statsmodels extends naturally — just add more variables to the formula with +:
# Multiple regression with distance and star rating
model_multi = smf.ols("price ~ distance + stars", data=hotels).fit()
print(model_multi.summary())
OLS Regression Results
==============================================================================
Dep. Variable: price R-squared: 0.194
Model: OLS Adj. R-squared: 0.186
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 42.3841 26.181 1.619 0.107 -9.213 93.981
distance -13.7826 2.300 -5.993 0.000 -18.317 -9.248
stars 24.7361 7.064 3.502 0.001 10.820 38.652
==============================================================================
Notice how the interpretation changes slightly compared to simple regression:
distance: Holding star rating constant, each additional mile from the centre reduces price by $13.78
stars: Holding distance constant, each additional star increases price by $24.74
That phrase “holding constant” (sometimes called “controlling for”) is the key difference. In simple regression, the distance coefficient mixed together the effects of distance and anything correlated with distance (like star rating). Now we’ve separated them.
The R-squared increased to 0.194 (from 0.157) — not a huge jump, but we’re explaining more of the picture.
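One way to see “holding constant” in action is to predict prices for two hypothetical hotels that differ only in star rating (the distance and star values below are made up for illustration):
# Two hypothetical hotels at the same distance, differing only in stars
scenarios = pd.DataFrame({"distance": [1.0, 1.0], "stars": [3, 4]})
predictions = model_multi.predict(scenarios)
print(predictions)
# The gap between the two predictions equals the stars coefficient, about $24.74
print(f"Difference: {predictions.iloc[1] - predictions.iloc[0]:.2f}")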
Using Categorical Variables with C()#
What happens if one of your predictors isn’t a number but a category — like “new” vs “used” condition, or “yes” vs “no” for stock photo? You tell statsmodels to treat it as a category by wrapping it in C() in the formula. For string columns, the formula engine (patsy) usually detects the category automatically, but being explicit avoids surprises, and C() is essential when a category is stored as numbers:
# A string column such as cond usually works even without C(), because patsy treats it as categorical:
# model = smf.ols("total_pr ~ cond", data=auctions).fit()
# But a numerically coded category (say, a region coded 1, 2, 3) would silently be
# fitted as a single numeric slope. Wrapping it in C() removes the ambiguity.
So make a habit of wrapping every categorical predictor in C():
# Example: using categorical predictors
# Suppose we have auction data with condition (new/used) and a stock photo (yes/no)
import pandas as pd
import statsmodels.formula.api as smf
# Create sample auction data
auctions = pd.DataFrame({
'total_pr': [52, 44, 38, 62, 48, 55, 42, 58, 46, 65,
35, 50, 40, 57, 43, 60, 39, 54, 41, 63],
'cond': ['new', 'used', 'used', 'new', 'used', 'new', 'used', 'new', 'used', 'new',
'used', 'new', 'used', 'new', 'used', 'new', 'used', 'new', 'used', 'new'],
'stock_photo': ['yes', 'yes', 'no', 'yes', 'no', 'yes', 'no', 'yes', 'no', 'yes',
'no', 'no', 'yes', 'yes', 'no', 'yes', 'yes', 'no', 'no', 'yes'],
'duration': [7, 3, 5, 7, 1, 5, 3, 7, 1, 5,
3, 7, 5, 3, 1, 7, 5, 3, 1, 7],
'wheels': [1, 0, 0, 2, 0, 1, 0, 2, 1, 2,
0, 1, 0, 1, 0, 2, 1, 1, 0, 2]
})
# Fit model with categorical and numerical predictors
model = smf.ols("total_pr ~ C(cond) + C(stock_photo) + duration + wheels", data=auctions).fit()
print(model.summary())
==============================================================================
coef std err t P>|t|
------------------------------------------------------------------------------
Intercept 33.7895 3.290 10.270 0.000
C(cond)[T.used] -9.2356 2.279 -4.052 0.001
C(stock_photo)[T.yes] 2.7413 2.261 1.213 0.244
duration 0.9127 0.380 2.402 0.030
wheels 5.6830 1.330 4.273 0.001
==============================================================================
Let’s break down how statsmodels handles the categorical variables:
- C(cond)[T.used] — This is the coefficient for “used” compared to “new”. “New” is the reference category (chosen alphabetically by default). The coefficient of -9.24 means used items sell for about $9.24 less than new items, all else equal.
- C(stock_photo)[T.yes] — The coefficient for having a stock photo compared to not having one. The p-value of 0.244 tells us this difference is not statistically significant.
Warning
If a categorical variable is stored as numbers (for example, a region coded 1, 2, 3), you must wrap it in C() in the formula; otherwise statsmodels will treat it as numerical and fit a meaningless slope. String columns (like “new”/“used”) are usually detected as categorical automatically, but wrapping them in C() as well makes your intent explicit. Treating a coded category as a number is one of the most common mistakes when building regression models!
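By default the reference category is the first level alphabetically. If you would rather compare against a different baseline, you can set it explicitly with Treatment inside C() (a sketch reusing the auction data above):
# Make "used" the reference category instead of "new"
model_ref = smf.ols(
    "total_pr ~ C(cond, Treatment(reference='used')) + duration + wheels",
    data=auctions
).fit()
print(model_ref.params)
# The condition coefficient now shows the premium for "new" relative to "used"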
R-squared vs Adjusted R-squared#
Here’s a trap that catches many beginners: when you add more predictors, R-squared always increases (or stays the same) — even if the new variable is completely useless! Why? Because even random noise will explain a tiny bit of variation just by chance.
This makes R-squared misleading when comparing models with different numbers of predictors. Enter adjusted R-squared, which penalises for adding variables:
| Metric | What it does | When to use it |
|---|---|---|
| R² | Proportion of variance explained | Comparing models with the same number of predictors |
| Adjusted R² | R² with a penalty for extra predictors | Comparing models with different numbers of predictors |
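For reference, the adjustment is a simple formula. With n observations and k predictors,

$$R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}$$

so the penalty grows as k rises relative to n, and adding a predictor only raises adjusted R² if it improves the fit by more than chance alone would.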
Let’s see this in action:
# Compare R-squared and Adjusted R-squared
model_simple = smf.ols("total_pr ~ wheels", data=auctions).fit()
model_full = smf.ols("total_pr ~ C(cond) + C(stock_photo) + duration + wheels", data=auctions).fit()
print(f"Simple model (wheels only):")
print(f" R²: {model_simple.rsquared:.4f}")
print(f" Adjusted R²: {model_simple.rsquared_adj:.4f}")
print(f"\nFull model (4 predictors):")
print(f" R²: {model_full.rsquared:.4f}")
print(f" Adjusted R²: {model_full.rsquared_adj:.4f}")
Simple model (wheels only):
R²: 0.4123
Adjusted R²: 0.3796
Full model (4 predictors):
R²: 0.7891
Adjusted R²: 0.7328
Notice that adjusted R² is always lower than R² (that’s the penalty at work). The full model has a much higher adjusted R² (0.733 vs 0.380), confirming that the additional predictors are genuinely useful — not just adding noise.
Tip
Rule of thumb: Always report adjusted R² when comparing models with different numbers of predictors. If adding a variable increases R² but decreases adjusted R², that variable probably isn’t helping.
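You can watch this rule of thumb at work by adding a column of pure random noise as a “predictor” and refitting (a sketch; the noise variable is made up and has no relationship to the outcome):
# Add a useless predictor: pure random noise
import numpy as np
rng = np.random.default_rng(42)
auctions["noise"] = rng.normal(size=len(auctions))
model_noise = smf.ols(
    "total_pr ~ C(cond) + C(stock_photo) + duration + wheels + noise",
    data=auctions
).fit()
print(f"R²:          {model_full.rsquared:.4f} -> {model_noise.rsquared:.4f}")
print(f"Adjusted R²: {model_full.rsquared_adj:.4f} -> {model_noise.rsquared_adj:.4f}")
# R² creeps up (it always does); adjusted R² typically falls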
Model Diagnostics#
We’ve been fitting regression models and interpreting coefficients. But here’s a question we haven’t asked yet: How do we know our model is any good?
A regression model makes several assumptions. If these assumptions are badly violated, our p-values, confidence intervals, and predictions may be unreliable. The key assumptions are:
1. Linearity — the relationship between x and y is linear
2. Independence — observations are independent of each other
3. Normality of residuals — the prediction errors follow a normal distribution
4. Constant variance (homoscedasticity) — the spread of residuals is similar at all values of x
How do we check these? With diagnostic plots. Let’s walk through each one.
Residual Plot#
The most important diagnostic is the residual plot — it shows residuals (prediction errors) against fitted values. If the model is behaving well, you should see a random cloud of points scattered around zero with no pattern:
# Fit model
model = smf.ols("price ~ distance", data=hotels).fit()
# Create residual plot
plt.figure(figsize=(10, 5))
plt.scatter(model.fittedvalues, model.resid, alpha=0.5, color='steelblue')
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()
[Scatter plot with residuals randomly scattered around the red zero line.
Some points far from zero indicate outliers.]
What should you look for? Here’s a quick guide:
| Pattern you see | What it means | What to do |
|---|---|---|
| Random scatter around zero | Assumptions are met ✓ | Nothing — your model is fine |
| Fan shape (wider on one side) | Non-constant variance (heteroscedasticity) | Use robust standard errors |
| Curved pattern | Non-linearity | Try a polynomial or log model |
| Distinct clusters | Possible missing variable | Add more predictors |
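The visual check can be backed up with a formal test for heteroscedasticity such as Breusch–Pagan, available in statsmodels’ diagnostic module (a sketch using the model fitted above):
from statsmodels.stats.diagnostic import het_breuschpagan
# Breusch-Pagan test: the null hypothesis is constant variance (homoscedasticity)
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print(f"Breusch-Pagan p-value: {lm_pvalue:.4f}")
# A small p-value (< 0.05) suggests heteroscedasticity, so consider robust standard errors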
Q-Q Plot (Quantile-Quantile Plot)#
The Q-Q plot checks whether residuals follow a normal distribution. The idea is simple: if the residuals are truly normal, they should line up along the diagonal. Any deviation from the diagonal tells you something is off:
import scipy.stats as st
# Q-Q plot of residuals
fig, ax = plt.subplots(figsize=(8, 6))
st.probplot(model.resid, dist="norm", plot=ax)
ax.set_title('Q-Q Plot of Residuals')
plt.show()
[Q-Q plot where most points follow the diagonal line, with some
deviation at the upper tail (due to expensive outlier hotels)]
Notice how most points sit close to the diagonal, but a few at the upper right curve away. Those are the expensive outlier hotels pulling the tail upward. In practice, moderate deviations like this are usually fine — regression is fairly robust to non-normality, especially with larger samples.
Tip
Reading a Q-Q plot: Points close to the diagonal = good. An S-shaped curve = heavier tails than normal. Points curving away at the ends = outliers or skewness. Don’t panic about minor deviations — focus on the overall pattern.
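If you want a number to go with the picture, a formal normality test on the residuals is one option (a sketch using scipy’s Shapiro–Wilk test):
import scipy.stats as st
# Shapiro-Wilk test: the null hypothesis is that the residuals are normally distributed
stat, pvalue = st.shapiro(model.resid)
print(f"Shapiro-Wilk p-value: {pvalue:.4f}")
# With large samples even tiny deviations get flagged, so read this alongside the Q-Q plot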
Leverage and Influential Points#
Here’s something that might surprise you: not all data points are created equal. Some points have more influence on the regression line than others. Consider: if you remove one data point and the regression line barely moves, that point isn’t very influential. But if removing one point completely changes the slope, that’s an influential point — and you should investigate it.
There are three related concepts:
- High leverage points — observations with unusual x-values (far from the mean of x)
- Influential points — observations that significantly change the regression results when removed
- Cook’s distance — a single number that combines leverage and residual size to measure overall influence
# Calculate influence measures
influence = model.get_influence()
cooks_d = influence.cooks_distance[0]
# Plot Cook's distance
plt.figure(figsize=(10, 5))
plt.stem(range(len(cooks_d)), cooks_d, markerfmt=",")
plt.xlabel('Observation Index')
plt.ylabel("Cook's Distance")
plt.title("Cook's Distance Plot")
# Common threshold: 4/n
threshold = 4 / len(hotels)
plt.axhline(y=threshold, color='red', linestyle='--', label=f'Threshold (4/n = {threshold:.4f})')
plt.legend()
plt.show()
[Stem plot showing Cook's distance for each observation.
Most values are small; a few stick up above the red threshold line.]
The observations poking above the red line are potentially influential. But which ones are they? Let’s find out:
# Identify influential observations
influential = np.where(cooks_d > threshold)[0]
print(f"Number of influential points: {len(influential)}")
print(f"Threshold (4/n): {threshold:.4f}")
# Look at the most influential observations
if len(influential) > 0:
print(f"\nMost influential observations:")
top_influential = np.argsort(cooks_d)[-5:] # Top 5
for idx in top_influential:
print(f" Index {idx}: Cook's D = {cooks_d[idx]:.4f}, "
f"Distance = {hotels.iloc[idx]['distance']:.1f}, "
f"Price = ${hotels.iloc[idx]['price']:.0f}")
Warning
A common rule of thumb is that observations with Cook’s distance greater than 4/n (where n is the sample size) may be influential. But this is a guideline, not a strict rule. Don’t automatically delete influential points! Instead, examine them to understand why they’re unusual. Sometimes an influential point is a data entry error; other times it’s a genuinely unusual observation that belongs in the data.
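A sensible next step is to refit the model without the flagged observations and see how much the coefficients move (a sketch reusing the influential index positions computed above):
# Refit without the flagged observations as a robustness check
hotels_trimmed = hotels.drop(hotels.index[influential])
model_trimmed = smf.ols("price ~ distance", data=hotels_trimmed).fit()
print(f"Distance coefficient, all data:       {model.params['distance']:.2f}")
print(f"Distance coefficient, trimmed sample: {model_trimmed.params['distance']:.2f}")
# If the two are close, the influential points are not driving the conclusions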
Putting It All Together: Complete Diagnostic Workflow#
Rather than running these plots one at a time, here’s a handy function that creates all three diagnostic plots side by side:
def regression_diagnostics(model, data):
"""
Run standard regression diagnostics.
Creates three plots: residual plot, Q-Q plot, and Cook's distance.
"""
import scipy.stats as st
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# 1. Residual plot
axes[0].scatter(model.fittedvalues, model.resid, alpha=0.5, color='steelblue')
axes[0].axhline(y=0, color='red', linestyle='--')
axes[0].set_xlabel('Fitted Values')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs Fitted')
# 2. Q-Q plot
st.probplot(model.resid, dist="norm", plot=axes[1])
axes[1].set_title('Q-Q Plot')
# 3. Cook's distance
influence = model.get_influence()
cooks_d = influence.cooks_distance[0]
axes[2].stem(range(len(cooks_d)), cooks_d, markerfmt=",")
threshold = 4 / len(data)
axes[2].axhline(y=threshold, color='red', linestyle='--')
axes[2].set_xlabel('Observation')
axes[2].set_ylabel("Cook's Distance")
axes[2].set_title("Cook's Distance")
plt.tight_layout()
plt.show()
# Summary
n_influential = np.sum(cooks_d > threshold)
print(f"R-squared: {model.rsquared:.4f}")
print(f"Adjusted R-squared: {model.rsquared_adj:.4f}")
print(f"Influential points (Cook's D > {threshold:.4f}): {n_influential}")
# Usage
model = smf.ols("price ~ distance", data=hotels).fit()
regression_diagnostics(model, hotels)
You can reuse this function on any model — just pass in the fitted model and the data. It’s a good habit to run diagnostics every time you fit a regression.
Exercises#
Exercise 40
Exercise 1: Simple Regression
Using the tips dataset from seaborn:
1. Run a regression predicting tip from total_bill
2. What is the coefficient on total_bill? Interpret it in words.
3. What is the R-squared? What does it mean?
4. If someone has a bill of $30, what tip does the model predict?
Solution to Exercise 40
import seaborn as sns
import statsmodels.formula.api as smf
# Load data
tips = sns.load_dataset("tips")
# 1. Fit the regression
model = smf.ols("tip ~ total_bill", data=tips).fit()
print(model.summary())
# 2. Coefficient interpretation
print(f"\nCoefficient on total_bill: {model.params['total_bill']:.4f}")
print("Interpretation: For each additional dollar on the bill,")
print("the tip increases by about $0.105 (about 10.5 cents).")
# 3. R-squared interpretation
print(f"\nR-squared: {model.rsquared:.3f}")
print("This means total_bill explains about 45.7% of the variation in tips.")
# 4. Prediction for $30 bill
predicted_tip = model.params['Intercept'] + model.params['total_bill'] * 30
print(f"\nPredicted tip for $30 bill: ${predicted_tip:.2f}")
OLS Regression Results
==============================================================================
Dep. Variable: tip R-squared: 0.457
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.9203 0.160 5.761 0.000 0.606 1.235
total_bill 0.1050 0.007 14.260 0.000 0.091 0.120
==============================================================================
Coefficient on total_bill: 0.1050
Interpretation: For each additional dollar on the bill,
the tip increases by about $0.105 (about 10.5 cents).
R-squared: 0.457
This means total_bill explains about 45.7% of the variation in tips.
Predicted tip for $30 bill: $4.07
Exercise 41
Exercise 2: Log Transformations
Using the tips dataset:
1. Create log-transformed versions of tip and total_bill
2. Run a log-log regression: ln(tip) ~ ln(total_bill)
3. Interpret the coefficient as an elasticity
4. Compare the R-squared to the level-level model
Solution to Exercise 41
import numpy as np
import seaborn as sns
import statsmodels.formula.api as smf
tips = sns.load_dataset("tips")
# 1. Create log variables
tips["ln_tip"] = np.log(tips["tip"])
tips["ln_bill"] = np.log(tips["total_bill"])
# 2. Run log-log regression
model_loglog = smf.ols("ln_tip ~ ln_bill", data=tips).fit()
print(model_loglog.summary())
# 3. Interpret elasticity
elasticity = model_loglog.params['ln_bill']
print(f"\nElasticity: {elasticity:.4f}")
print(f"Interpretation: A 1% increase in total bill is associated")
print(f"with a {elasticity:.2f}% increase in tip.")
print(f"Since elasticity < 1, tips are 'inelastic' - they increase")
print(f"by a smaller percentage than the bill increases.")
# 4. Compare R-squared
model_level = smf.ols("tip ~ total_bill", data=tips).fit()
print(f"\nR-squared comparison:")
print(f"Level-Level: {model_level.rsquared:.3f}")
print(f"Log-Log: {model_loglog.rsquared:.3f}")
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -0.3642 0.138 -2.645 0.009 -0.635 -0.093
ln_bill 0.6024 0.048 12.616 0.000 0.508 0.697
==============================================================================
R-squared: 0.396
Elasticity: 0.6024
Interpretation: A 1% increase in total bill is associated
with a 0.60% increase in tip.
Since elasticity < 1, tips are 'inelastic' - they increase
by a smaller percentage than the bill increases.
R-squared comparison:
Level-Level: 0.457
Log-Log: 0.396
The level-level model actually has a higher R-squared in this case, suggesting the linear relationship fits better than the log-log specification.
Exercise 42
Exercise 3: Analyzing Residuals
Using the tips dataset and your level-level regression:
1. Add predicted values and residuals to the dataframe
2. Find the 5 most “generous” tips (highest positive residuals)
3. Find the 5 “stingiest” tips (lowest negative residuals)
4. Create a histogram of the residuals
Solution to Exercise 42
import seaborn as sns
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
tips = sns.load_dataset("tips")
# Fit model
model = smf.ols("tip ~ total_bill", data=tips).fit()
# 1. Add predictions and residuals
tips["predicted_tip"] = model.fittedvalues
tips["residual"] = model.resid
# 2. Most generous (highest positive residuals)
print("5 Most Generous Tips:")
print(tips.nlargest(5, "residual")[["total_bill", "tip", "predicted_tip", "residual"]])
# 3. Stingiest (lowest negative residuals)
print("\n5 Stingiest Tips:")
print(tips.nsmallest(5, "residual")[["total_bill", "tip", "predicted_tip", "residual"]])
# 4. Histogram of residuals
plt.figure(figsize=(8, 5))
sns.histplot(tips["residual"], bins=20, color="steelblue", alpha=0.7)
plt.axvline(x=0, color="red", linestyle="--")
plt.xlabel("Residual (Actual - Predicted Tip)")
plt.ylabel("Count")
plt.title("Distribution of Tip Prediction Errors")
plt.show()
5 Most Generous Tips:
total_bill tip predicted_tip residual
170 50.81 10.00 6.253 3.747
212 48.33 9.00 5.993 3.007
59 48.27 6.73 5.987 0.743
23 39.42 7.58 5.058 2.522
183 23.17 6.50 3.352 3.148
5 Stingiest Tips:
total_bill tip predicted_tip residual
67 27.28 2.00 3.784 -1.784
0 1.01 1.00 1.026 -0.026
111 32.68 1.17 4.350 -3.180
172 7.25 1.00 1.681 -0.681
92 13.75 2.00 2.364 -0.364
Exercise 43
Exercise 4: Polynomial Regression
Using the tips dataset:
1. Create a squared term for total_bill
2. Run a quadratic regression: tip ~ total_bill + total_bill_sq
3. Is the squared term statistically significant?
4. Plot both the linear and quadratic fits on the same scatter plot
5. Which model would you prefer and why?
Solution to Exercise 43
import seaborn as sns
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import numpy as np
tips = sns.load_dataset("tips")
# 1. Create squared term
tips["total_bill_sq"] = tips["total_bill"] ** 2
# 2. Run quadratic regression
model_quad = smf.ols("tip ~ total_bill + total_bill_sq", data=tips).fit()
print(model_quad.summary())
# 3. Check significance
print(f"\nSquared term p-value: {model_quad.pvalues['total_bill_sq']:.4f}")
if model_quad.pvalues['total_bill_sq'] < 0.05:
print("The squared term IS statistically significant (p < 0.05)")
else:
print("The squared term is NOT statistically significant (p >= 0.05)")
# 4. Plot both fits
fig, ax = plt.subplots(figsize=(10, 6))
# Scatter plot
ax.scatter(tips["total_bill"], tips["tip"], alpha=0.5, label="Actual tips")
# Linear fit
x_range = np.linspace(tips["total_bill"].min(), tips["total_bill"].max(), 100)
model_linear = smf.ols("tip ~ total_bill", data=tips).fit()
y_linear = model_linear.params['Intercept'] + model_linear.params['total_bill'] * x_range
ax.plot(x_range, y_linear, color="blue", label=f"Linear (R²={model_linear.rsquared:.3f})")
# Quadratic fit
y_quad = (model_quad.params['Intercept'] +
model_quad.params['total_bill'] * x_range +
model_quad.params['total_bill_sq'] * x_range**2)
ax.plot(x_range, y_quad, color="red", linestyle="--", label=f"Quadratic (R²={model_quad.rsquared:.3f})")
ax.set_xlabel("Total Bill ($)")
ax.set_ylabel("Tip ($)")
ax.set_title("Linear vs Quadratic Fit")
ax.legend()
plt.show()
# 5. Model preference
print(f"\nR-squared Linear: {model_linear.rsquared:.4f}")
print(f"R-squared Quadratic: {model_quad.rsquared:.4f}")
print("\nConclusion: The quadratic term is not significant, and R-squared")
print("barely improves. The simpler linear model is preferred (parsimony).")
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 0.5991 0.348 1.723 0.086 -0.086 1.284
total_bill 0.1341 0.038 3.561 0.000 0.060 0.208
total_bill_sq -0.0004 0.001 -0.777 0.438 -0.002 0.001
==============================================================================
Squared term p-value: 0.4378
The squared term is NOT statistically significant (p >= 0.05)
R-squared Linear: 0.4566
R-squared Quadratic: 0.4586
Conclusion: The quadratic term is not significant, and R-squared
barely improves. The simpler linear model is preferred (parsimony).
Exercise 44
Exercise 5: Multiple Regression
Using the tips dataset from seaborn:
1. Fit a multiple regression predicting tip from total_bill, C(sex), C(smoker), and size
2. Which predictors are statistically significant at α = 0.05?
3. Compare the R-squared and adjusted R-squared to the simple model (tip ~ total_bill only)
4. Interpret the coefficient on C(smoker)[T.Yes]
Solution to Exercise 44
import seaborn as sns
import statsmodels.formula.api as smf
tips = sns.load_dataset("tips")
# Simple model
model_simple = smf.ols("tip ~ total_bill", data=tips).fit()
# 1. Multiple regression
model_multi = smf.ols("tip ~ total_bill + C(sex) + C(smoker) + size", data=tips).fit()
print(model_multi.summary())
# 2. Significant predictors
print("\n--- SIGNIFICANT PREDICTORS (p < 0.05) ---")
for var, pval in model_multi.pvalues.items():
status = "SIGNIFICANT" if pval < 0.05 else "not significant"
print(f" {var}: p = {pval:.4f} ({status})")
# 3. Compare R-squared
print(f"\n--- MODEL COMPARISON ---")
print(f"Simple model: R² = {model_simple.rsquared:.4f}, Adj R² = {model_simple.rsquared_adj:.4f}")
print(f"Multiple model: R² = {model_multi.rsquared:.4f}, Adj R² = {model_multi.rsquared_adj:.4f}")
# 4. Interpret smoker coefficient
smoker_coef = model_multi.params['C(smoker)[T.Yes]']
print(f"\n--- INTERPRETATION ---")
print(f"Coefficient on C(smoker)[T.Yes]: {smoker_coef:.4f}")
print(f"Holding all other variables constant, smokers tip about")
print(f"${abs(smoker_coef):.2f} {'more' if smoker_coef > 0 else 'less'} than nonsmokers.")
Exercise 45
Exercise 6: Model Diagnostics
Using the tips dataset:
1. Fit the model: tip ~ total_bill
2. Create a residual plot (residuals vs fitted values)
3. Create a Q-Q plot of the residuals
4. Calculate Cook’s distance and identify how many influential points there are (using the 4/n threshold)
5. What do the diagnostic plots tell you about the model assumptions?
Solution to Exercise 45
import seaborn as sns
import statsmodels.formula.api as smf
import scipy.stats as st
import matplotlib.pyplot as plt
import numpy as np
tips = sns.load_dataset("tips")
# 1. Fit model
model = smf.ols("tip ~ total_bill", data=tips).fit()
# Create diagnostic plots
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# 2. Residual plot
axes[0].scatter(model.fittedvalues, model.resid, alpha=0.5, color='steelblue')
axes[0].axhline(y=0, color='red', linestyle='--')
axes[0].set_xlabel('Fitted Values')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs Fitted')
# 3. Q-Q plot
st.probplot(model.resid, dist="norm", plot=axes[1])
axes[1].set_title('Q-Q Plot of Residuals')
# 4. Cook's distance
influence = model.get_influence()
cooks_d = influence.cooks_distance[0]
threshold = 4 / len(tips)
axes[2].stem(range(len(cooks_d)), cooks_d, markerfmt=",")
axes[2].axhline(y=threshold, color='red', linestyle='--',
label=f'Threshold = {threshold:.4f}')
axes[2].set_xlabel('Observation')
axes[2].set_ylabel("Cook's Distance")
axes[2].set_title("Cook's Distance")
axes[2].legend()
plt.tight_layout()
plt.show()
# Count influential points
n_influential = np.sum(cooks_d > threshold)
print(f"\nInfluential points (Cook's D > {threshold:.4f}): {n_influential}")
# 5. Interpretation
print("\n--- DIAGNOSTIC INTERPRETATION ---")
print("Residual plot: Some fan-shape visible (variance increases with")
print(" fitted values), suggesting mild heteroscedasticity.")
print("Q-Q plot: Points deviate at the upper tail, indicating")
print(" the residuals have heavier right tail than normal.")
print("Cook's distance: A few influential observations exist,")
print(" likely high-tip outliers on expensive bills.")