Data Visualisation with Seaborn#

In this chapter, we’ll learn how to create beautiful and informative visualisations using seaborn. Data visualisation is one of the most important skills in data analysis — a good chart can reveal patterns that are invisible in raw numbers.

Why Visualise Data?#

Consider this: you have a dataset with 10,000 rows of sales data. You could:

  • Scroll through thousands of numbers (tedious and ineffective)

  • Calculate summary statistics (helpful but limited)

  • Create a chart that shows the pattern instantly (this is what we want!)

Visualisations help you:

  • Explore your data to find patterns

  • Communicate findings to others

  • Identify outliers and anomalies

  • Compare groups or categories

Python Visualisation Libraries#

There are several libraries for data visualisation in Python:

| Library | Description |
| --- | --- |
| Matplotlib | The most versatile and customisable, but verbose |
| Seaborn | Built on matplotlib, easier for beginners, beautiful defaults |
| Plotnine | A library that closely follows R’s ggplot2 |
| Lets-plot | Another library that closely follows ggplot2 (maintained regularly) |

We’ll focus on seaborn because it’s easy to learn and produces publication-quality plots with minimal code. Matplotlib is still worth learning later, because seaborn is built on it and finer customisation often relies on it.

Setting Up#

Let’s import the libraries we need:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Apply seaborn's default theme for nicer-looking plots
sns.set_theme()

Note

We import matplotlib.pyplot as plt because seaborn is built on top of matplotlib. We’ll use plt.show() to display our plots.

Loading Example Data#

Seaborn comes with several built-in datasets that are perfect for learning. Let’s load the famous “tips” dataset:

df_tips = sns.load_dataset('tips')
df_tips.head()
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

This dataset contains information about restaurant bills and tips. Let’s explore it:

df_tips.describe()
       total_bill         tip        size
count  244.000000  244.000000  244.000000
mean    19.785943    2.998279    2.569672
std      8.902412    1.383638    0.951100
min      3.070000    1.000000    1.000000
25%     13.347500    2.000000    2.000000
50%     17.795000    2.900000    2.000000
75%     24.127500    3.562500    3.000000
max     50.810000   10.000000    6.000000

Scatter Plots#

A scatter plot shows the relationship between two numerical variables. Each point represents one observation.

Basic Scatter Plot#

Let’s see if there’s a relationship between the total bill and the tip:

sns.scatterplot(data=df_tips, x=df_tips['total_bill'], y=df_tips['tip'])
plt.show()

Notice the syntax:

  • data=df_tips — the DataFrame to use

  • x=df_tips['total_bill'] — the column for the x-axis

  • y=df_tips['tip'] — the column for the y-axis
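When a DataFrame is passed through data, seaborn also accepts column names as plain strings, which is the form used for the flights examples later in this chapter. The sketch below assumes a small hand-typed sample (the first five rows of the tips table shown earlier) stored under the hypothetical name df_sample, so that it runs without downloading anything.

```python
import pandas as pd
import seaborn as sns

# First five rows of the tips dataset, typed in by hand (df_sample is a
# hypothetical name used only for this illustration)
df_sample = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01, 23.68, 24.59],
    'tip': [1.01, 1.66, 3.50, 3.31, 3.61],
})

# Column names given as strings are looked up in the DataFrame passed to `data`
ax = sns.scatterplot(data=df_sample, x='total_bill', y='tip')

# The axis labels are taken from the column names
print(ax.get_xlabel(), ax.get_ylabel())
```

Call plt.show() afterwards, as in the chapter's own examples, to display the figure.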

The plot shows a positive relationship: higher bills tend to have higher tips. This makes intuitive sense!
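One way to put a number on a relationship like this is the Pearson correlation coefficient, which pandas computes with Series.corr. The sketch below is only an illustration on five hand-typed rows (the hypothetical df_sample), so the value it prints will differ from the one for the full dataset.

```python
import pandas as pd

# First five rows of the tips dataset, typed in by hand for illustration
df_sample = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01, 23.68, 24.59],
    'tip': [1.01, 1.66, 3.50, 3.31, 3.61],
})

# Pearson correlation: +1 is a perfect positive linear relationship,
# 0 is no linear relationship, -1 is a perfect negative one
r = df_sample['total_bill'].corr(df_sample['tip'])
print(round(r, 2))
```

A positive value agrees with what the scatter plot suggests: larger bills go with larger tips.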

Adding Colour (Hue)#

We can add a third variable using colour. Let’s see if the pattern differs by gender:

sns.scatterplot(data=df_tips, x=df_tips['total_bill'], y=df_tips['tip'], hue=df_tips['sex'])
plt.show()

The hue parameter colours the points by the ‘sex’ column. Seaborn automatically:

  • Assigns different colours to each category

  • Adds a legend

Now we can compare the tipping patterns of male and female customers at a glance.

Tip

The hue parameter works with most seaborn plots. It’s a powerful way to add a categorical dimension to your visualisation.

Histograms#

A histogram shows the distribution of a single numerical variable. It divides the data into bins and counts how many values fall into each bin.
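The binning step can be sketched with NumPy's histogram function, which returns the counts and the bin edges without drawing anything. The five values are the total_bill entries from the first rows of the tips table; the choice of three bins is arbitrary and only for illustration.

```python
import numpy as np

# total_bill values from the first five rows of the tips table
bills = np.array([16.99, 10.34, 21.01, 23.68, 24.59])

# Split the range from the minimum to the maximum into 3 equal-width bins
counts, edges = np.histogram(bills, bins=3)

print(edges)   # 4 edges define 3 bins
print(counts)  # number of values that fall into each bin
```

Each value lands in exactly one bin, so the counts always add up to the number of observations.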

Basic Histogram#

Let’s see how total bills are distributed:

sns.histplot(data=df_tips, x=df_tips['total_bill'])
plt.show()

Most bills are between $10 and $25, with fewer very low or very high bills.

Adjusting the Number of Bins#

The bins parameter controls how many bars the histogram has:

sns.histplot(data=df_tips, x=df_tips['total_bill'], bins=40)
plt.show()

More bins show more detail, but can look noisy. Fewer bins show the overall shape but hide detail. Experiment to find the right balance.

Different Statistics#

By default, histograms show counts. You can change this with the stat parameter:

# Show percentages instead of counts
sns.histplot(data=df_tips, x=df_tips['total_bill'], bins=40, stat='percent')
plt.show()
# Show frequency (count divided by bin width)
sns.histplot(data=df_tips, x=df_tips['total_bill'], bins=30, stat='frequency')
plt.show()
# Show density (area under curve = 1)
sns.histplot(data=df_tips, x=df_tips['total_bill'], bins=30, stat='density')
plt.show()

| Stat | Description |
| --- | --- |
| 'count' | Number of observations in each bin (default) |
| 'frequency' | Number of observations divided by the bin width |
| 'percent' | Percentage of total observations |
| 'density' | Normalised so the total area of the bars equals 1 |
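To see how these options relate, the arithmetic can be done by hand with NumPy. This is a sketch of the normalisations as seaborn's documentation describes them, not seaborn's own code; in particular it takes 'frequency' to mean the count divided by the bin width.

```python
import numpy as np

# total_bill values from the first five rows of the tips table
bills = np.array([16.99, 10.34, 21.01, 23.68, 24.59])

counts, edges = np.histogram(bills, bins=3)
widths = np.diff(edges)   # width of each bin
n = counts.sum()          # total number of observations

frequency = counts / widths        # observations per unit of the x-axis
percent = 100 * counts / n         # bar heights sum to 100
density = counts / (n * widths)    # bar areas sum to 1

print(percent.sum())
print((density * widths).sum())
```

Frequency equals the raw count only when every bin has width 1, which is why the two plots usually have different y-axis scales.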

Adding a Density Curve (KDE)#

You can overlay a smooth density curve using kde=True:

sns.histplot(data=df_tips, x=df_tips['total_bill'], kde=True)
plt.show()

The KDE (Kernel Density Estimate) curve shows a smoothed version of the distribution.
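The idea behind a KDE can be sketched in a few lines of NumPy: place a small Gaussian bump on every observation and average the bumps. The bandwidth of 3.0 below is an assumption picked by hand for illustration; seaborn chooses its bandwidth automatically, so its curve will not match this one exactly.

```python
import numpy as np

# total_bill values from the first five rows of the tips table
bills = np.array([16.99, 10.34, 21.01, 23.68, 24.59])

bandwidth = 3.0  # assumed smoothing width, chosen by hand for this sketch

# Points on the x-axis where the curve is evaluated
grid = np.linspace(bills.min() - 4 * bandwidth, bills.max() + 4 * bandwidth, 500)

# One Gaussian bump per observation, evaluated at every grid point
z = (grid[:, None] - bills[None, :]) / bandwidth
bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))

# The density estimate is the average of the bumps
density = bumps.mean(axis=1)

# Approximate the area under the curve; it should be close to 1
dx = grid[1] - grid[0]
area = (density * dx).sum()
print(round(area, 3))
```

A larger bandwidth gives a smoother curve, and a smaller one follows the individual observations more closely.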

Bar Plots#

A bar plot shows the relationship between a categorical variable and a numerical variable. By default, it shows the mean of the numerical variable for each category.

Basic Bar Plot#

Let’s compare average total bills for lunch vs dinner:

sns.barplot(data=df_tips, x=df_tips['time'], y=df_tips['total_bill'])
plt.show()

Dinner bills are higher on average than lunch bills.
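The bar heights correspond to a group-wise mean, which can be checked directly in pandas. The values below are made up for illustration (df_toy is a hypothetical name) and are not taken from the tips dataset.

```python
import pandas as pd

# Made-up toy values, NOT taken from the tips dataset
df_toy = pd.DataFrame({
    'time': ['Lunch', 'Lunch', 'Dinner', 'Dinner', 'Dinner'],
    'total_bill': [12.0, 16.0, 18.0, 22.0, 26.0],
})

# Mean of total_bill within each category of `time`
means = df_toy.groupby('time')['total_bill'].mean()
print(means)
```

barplot also accepts an estimator argument (for example estimator=np.median) if a statistic other than the mean is wanted.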

Grouped Bar Plot#

Add hue to compare across another category:

sns.barplot(data=df_tips, x=df_tips['time'], y=df_tips['total_bill'], hue=df_tips['sex'], errorbar=None)
plt.show()

Now we can see that male customers have slightly higher bills on average, for both lunch and dinner.

Note

Passing errorbar=None removes the error bars. By default, seaborn draws 95% confidence intervals, which can be useful but can also clutter the plot.

Box Plots#

A box plot (or box-and-whisker plot) shows the distribution of a numerical variable. It displays:

  • The median (middle line)

  • The interquartile range or IQR (the box — middle 50% of data)

  • Whiskers extending to the most extreme values within 1.5 × IQR of the box edges

  • Outliers as individual points beyond the whiskers
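The components listed above can be computed by hand, which may help when reading a box plot. The sketch below uses made-up values and the common 1.5 × IQR rule; it illustrates the arithmetic and is not seaborn's internal code.

```python
import numpy as np

# Made-up values for illustration, with one unusually large observation
values = np.array([3, 10, 13, 15, 17, 18, 20, 24, 26, 50])

# Quartiles: the box spans q1 to q3, with a line at the median
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Limits beyond which a point is drawn as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme values that are still inside the limits
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```

Note that the whiskers end at actual data points, so they are often shorter than 1.5 × IQR.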

Basic Box Plot#

sns.boxplot(data=df_tips, x=df_tips['total_bill'])
plt.show()

This shows the distribution of total bills. The box shows that 50% of bills are roughly between $13 and $24, with a few high outliers.

Comparing Groups#

Box plots are excellent for comparing distributions across categories:

sns.boxplot(data=df_tips, x=df_tips['total_bill'], y=df_tips['sex'])
plt.show()

This shows the distribution of total bills separately for males and females. We can see that:

  • Male customers have a slightly higher median bill

  • Both groups have outliers at the high end

Adding a Title and Labels#

Plotting functions such as boxplot return a matplotlib Axes object, so you can customise the plot by calling its .set() method:

sns.boxplot(data=df_tips, x=df_tips['total_bill'], y=df_tips['sex']).set(
    title='Distribution of Bills by Gender',
    xlabel='Total Bill ($)',
    ylabel='Gender'
)
plt.show()

Line Plots#

A line plot shows how a numerical variable changes over time or another ordered variable. It’s particularly useful for time series data.

Loading Time Series Data#

Let’s use the flights dataset, which shows monthly airline passenger numbers:

df_flights = sns.load_dataset('flights')
df_flights.head()
   year month  passengers
0  1949   Jan         112
1  1949   Feb         118
2  1949   Mar         132
3  1949   Apr         129
4  1949   May         121

Basic Line Plot#

sns.lineplot(data=df_flights, x='year', y='passengers', errorbar=None)
plt.show()

This shows the average number of passengers per year. The clear upward trend shows that air travel grew significantly from 1949 to 1960.

Multiple Lines#

Use hue to show separate lines for each category:

sns.lineplot(data=df_flights, x='year', y='passengers', hue='month')
plt.show()

Now we can see the trend for each month. Notice that:

  • All months show an upward trend

  • Summer months (especially July and August) consistently have more passengers

  • The seasonal pattern is clear

Faceted Plots with displot#

Sometimes you want to create separate plots for different categories. The displot function (distribution plot) makes this easy:

sns.displot(data=df_tips, x='total_bill', col='time', kde=True)
plt.show()

This creates two histograms side by side — one for Lunch and one for Dinner. The col='time' parameter creates a separate column for each value of ‘time’.

Choosing the Right Chart#

Here’s a quick guide to choosing the right visualisation:

| Data Type | Question | Chart Type |
| --- | --- | --- |
| 1 numerical | What’s the distribution? | Histogram, Box plot |
| 2 numerical | What’s the relationship? | Scatter plot |
| 1 categorical + 1 numerical | How do groups compare? | Bar plot, Box plot |
| Time series | How does it change over time? | Line plot |
| 1 categorical | What are the frequencies? | Count plot (bar chart) |
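The count plot in the last row is not demonstrated elsewhere in this chapter, so here is a minimal sketch using seaborn's countplot function on made-up values (the hypothetical df_toy). It draws one bar per category, with the bar height equal to the number of rows in that category.

```python
import pandas as pd
import seaborn as sns

# Made-up toy values, NOT taken from the tips dataset
df_toy = pd.DataFrame({
    'day': ['Thur', 'Fri', 'Sat', 'Sat', 'Sun', 'Sun', 'Sun'],
})

# One bar per category; the height is the number of rows in that category
ax = sns.countplot(data=df_toy, x='day')

# The same numbers, computed directly with pandas
counts = df_toy['day'].value_counts()
print(counts)
```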

Saving Your Plots#

To save a plot as an image file, use matplotlib’s savefig:

sns.histplot(data=df_tips, x=df_tips['total_bill'], bins=30)
plt.savefig('my_histogram.png')
plt.show()

You can save in various formats, such as .png, .jpg, .pdf and .svg; matplotlib infers the format from the file extension.

Tip

Call plt.savefig() before plt.show(). Once show() returns, the figure is usually closed, so a later savefig() call would save an empty image.

Common Customisations#

Changing Figure Size#

plt.figure(figsize=(10, 6))  # Width, Height in inches
sns.histplot(data=df_tips, x=df_tips['total_bill'])
plt.show()

Adding Titles and Labels#

sns.scatterplot(data=df_tips, x=df_tips['total_bill'], y=df_tips['tip'])
plt.title('Relationship Between Bill and Tip')
plt.xlabel('Total Bill ($)')
plt.ylabel('Tip ($)')
plt.show()

Changing Colours#

Seaborn has many built-in colour palettes:

sns.scatterplot(data=df_tips, x=df_tips['total_bill'], y=df_tips['tip'], hue=df_tips['day'], palette='Set2')
plt.show()

Popular palettes include: 'Set1', 'Set2', 'deep', 'muted', 'pastel', 'bright'.
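To check which colours a palette contains before using it, seaborn's color_palette function returns them as a list of RGB tuples. The sketch below asks for four colours from 'Set2'; the number four is an arbitrary choice matching the four days in the tips dataset.

```python
import seaborn as sns

# Request 4 colours from the 'Set2' palette
palette = sns.color_palette('Set2', n_colors=4)

# Each colour is an (R, G, B) tuple with values between 0 and 1
for colour in palette:
    print(colour)

# The same colours as hex strings, convenient for reuse elsewhere
print(palette.as_hex())
```

The returned list can itself be passed to the palette parameter of a plotting function.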


Exercises#

Exercise 16

Exercise 1: Create a Scatter Plot

Using the tips dataset:

  1. Create a scatter plot with total_bill on the x-axis and tip on the y-axis

  2. Colour the points by the day column

  3. Add a title: “Tips by Total Bill and Day”

Exercise 17

Exercise 2: Create Histograms

Using the tips dataset:

  1. Create a histogram of the tip column

  2. Use 20 bins

  3. Add a KDE curve

  4. Change the stat to show percentages

Exercise 18

Exercise 3: Compare Distributions

Using the tips dataset:

  1. Create a box plot comparing total_bill across different days (day column)

  2. Add appropriate title and labels

  3. Which day has the highest median bill?

Exercise 19

Exercise 4: Complete Visualisation

Load the ‘penguins’ dataset from seaborn:

penguins = sns.load_dataset('penguins')

Create the following visualisations:

  1. A scatter plot of flipper_length_mm vs body_mass_g, coloured by species

  2. A histogram of bill_length_mm with separate colours for each species

  3. A box plot comparing body_mass_g across species

What patterns do you notice?