Introduction to Pandas#

In this chapter, we’ll learn about pandas — a powerful library that makes working with data in Python much easier. If you’ve ever used Excel or Google Sheets, pandas will feel somewhat familiar. It gives you tables (called DataFrames) that you can sort, filter, and analyse with just a few lines of code.

What is Pandas?#

Pandas is a package built on top of NumPy, and provides an efficient implementation of a DataFrame. DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data.

Think of it this way:

  • A list stores a sequence of items

  • A dictionary stores key-value pairs

  • A DataFrame stores a whole table — rows, columns, and all!

Installing and Importing Pandas#

If you’re using Google Colab or Anaconda, pandas is already installed. Otherwise, you can install it with:

pip install pandas

Once installed, we import pandas using a conventional alias:

import pandas as pd
pd.__version__
'2.3.3'

Notice we import pandas as pd. This is a widely-used convention — almost every pandas tutorial and documentation uses this alias. When you see pd.something(), you’ll know it’s a pandas function.

The Two Core Data Structures#

There are two core objects in pandas:

Structure

Description

Analogy

Series

A single column of data

Like a list with labels

DataFrame

A table with rows and columns

Like an Excel spreadsheet

Let’s explore both.

DataFrame: Your Data Table#

A DataFrame is a table. It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.

Creating a DataFrame from a Dictionary#

The simplest way to create a DataFrame is from a dictionary:

pd.DataFrame({'Yup': [50, 21, 32], 'Nope': [131, 2, 200]})
Yup Nope
0 50 131
1 21 2
2 32 200

Notice what happened here:

  • The dictionary keys became the column names (Yup and Nope)

  • The dictionary values (lists) became the data in each column

  • Pandas automatically added row labels (0, 1, 2) — these are called the index

DataFrame entries are not limited to integers. Here’s a DataFrame with strings:

pd.DataFrame({
    'Bob': ['I liked it.', 'It was awful.'],
    'Sue': ['Pretty good.', 'Bland.']
})
Bob Sue
0 I liked it. Pretty good.
1 It was awful. Bland.

Custom Row Labels (Index)#

The automatic numbering (0, 1, 2, …) is fine for many purposes, but sometimes you want meaningful row labels. You can specify these using the index parameter:

pd.DataFrame({
    'Bob': ['I liked it.', 'It was awful.'],
    'Sue': ['Pretty good.', 'Bland.']
}, index=['Product A', 'Product B'])
Bob Sue
Product A I liked it. Pretty good.
Product B It was awful. Bland.

Now instead of 0 and 1, our rows are labelled ‘Product A’ and ‘Product B’. This makes the data more readable and easier to reference.

Series: A Single Column#

A Series is a sequence of data values. If a DataFrame is a table, a Series is a single column (or row) of that table.

You can create a Series from a list:

pd.Series([1, 2, 3, 4])
0    1
1    2
2    3
3    4
dtype: int64

Notice that a Series has:

  • An index (0, 1, 2, 3) — just like DataFrame rows

  • Values (1, 2, 3, 4)

  • A dtype (data type) — here it’s int64 (64-bit integer)

You can also give a Series a name and custom index:

pd.Series([30, 35, 40], index=['2020', '2021', '2022'], name='Sales')
2020    30
2021    35
2022    40
Name: Sales, dtype: int64

Tip

A Series is essentially a single column of a DataFrame. When you extract one column from a DataFrame, you get a Series!

Loading Data from CSV Files#

In the real world, you won’t type your data manually. Instead, you’ll load it from files. The most common format is CSV (Comma-Separated Values).

Reading a CSV File#

Use pd.read_csv() to load a CSV file:

employee = pd.read_csv('../data/employee_dataset.csv')

The argument can be:

  • A file path on your computer: 'C:/Users/data/employees.csv'

  • A URL to a file online: 'https://example.com/data.csv'

Let’s look at the data:

employee
Employee_ID Department Years_of_Experience Full_Time Performance_Score Salary First_Name Last_Name
0 1 Finance 1.820000 False Good 56550.0 Michael Davis
1 2 IT 2.090000 False Poor 55225.0 Karen Brown
2 3 IT 7.047228 False Excellent 87600.0 Joseph Johnson
3 4 Finance 1.237509 False Excellent 53875.0 David Garcia
4 5 Marketing 4.160000 True Average 61400.0 Linda Martinez
5 6 Finance 2.475017 True Excellent 70550.0 Michael Brown
6 7 Marketing 10.420000 True Good 82050.0 Charles Moore
7 8 Marketing 6.102669 True Good 84750.0 David Lopez
8 9 Marketing 10.000000 True Poor 80500.0 David Johnson
9 10 HR 7.479808 False Average 101350.0 Patricia Martinez
10 11 Finance 5.281314 False Poor 67925.0 Linda Brown
11 12 IT 5.226557 True Average 83650.0 John Wilson
12 13 HR 11.126626 True Good 105750.0 Mary Thomas
13 14 IT 6.460000 True Excellent 69650.0 Barbara Anderson
14 15 Finance 11.783710 True Good 91175.0 Jennifer Davis
15 16 IT 5.798768 True Average 74000.0 Mary Wilson
16 17 IT 2.261465 False Poor 63400.0 James Davis
17 18 HR 12.480000 True Poor 88200.0 Barbara Thomas
18 19 HR 3.770000 True Excellent 60425.0 Barbara Gonzalez
19 20 HR 1.806982 True Excellent 65150.0 Thomas Moore
20 21 IT 2.757016 False Average 89775.0 Elizabeth Anderson
21 22 IT 7.436003 True Poor 94425.0 Jessica Gonzalez
22 23 Marketing 8.060000 False Excellent 75650.0 Joseph Rodriguez
23 24 Finance 13.880903 True Average 105075.0 Joseph Anderson
24 25 Marketing 4.566735 True Excellent 79800.0 Charles Gonzalez
25 26 Finance 8.550000 True Excellent 75875.0 Barbara Smith
26 27 HR 9.437372 True Good 81350.0 Karen Miller
27 28 HR 5.130000 False Average 65325.0 John Rodriguez
28 29 Finance 5.051335 False Good 74725.0 Robert Smith
29 30 Finance 6.685832 True Excellent 74550.0 Charles Lopez
30 31 IT 11.937029 True Good 104700.0 Michael Davis
31 32 Finance 2.080000 False Poor 56200.0 William Hernandez
32 33 HR 3.650000 False Poor 61125.0 Michael Jackson
33 34 HR 1.527721 True Poor 89475.0 Sarah Taylor
34 35 Finance 1.894593 True Poor 87825.0 Patricia Davis
35 36 IT 7.270000 False Average 71675.0 Susan Williams
36 37 IT 6.266940 False Poor 87650.0 Sarah Williams
37 38 IT 10.740589 False Good 105550.0 William Smith
38 39 Finance 13.160849 False Good 96000.0 Mary Jones
39 40 HR 11.640000 True Excellent 86100.0 Karen Martinez
40 41 Finance 1.120000 True Average 54800.0 Joseph Miller
41 42 Marketing 2.083504 False Good 86200.0 Michael Rodriguez
42 43 Marketing 10.269678 True Poor 96025.0 Barbara Miller
43 44 HR 2.280000 False Average 57700.0 Linda Rodriguez
44 45 Finance 11.189596 True Excellent 84375.0 Joseph Davis
45 46 Finance 12.555784 True Good 92625.0 John Lopez
46 47 IT 7.380000 True Average 73950.0 Susan Johnson
47 48 IT 10.132786 True Excellent 95400.0 Thomas Smith
48 49 IT 5.720000 True Average 68800.0 Patricia Thomas
49 50 IT 4.320000 True Good 62800.0 Sarah Jones

Notice that pandas shows the first and last few rows, with ... in between. This is helpful when your dataset is large.

Exploring Your Data#

Once you’ve loaded data, you’ll want to explore it. Pandas provides several useful methods for this.

Viewing the First/Last Rows#

Use .head() to see the first few rows:

employee.head()
Employee_ID Department Years_of_Experience Full_Time Performance_Score Salary First_Name Last_Name
0 1 Finance 1.820000 False Good 56550.0 Michael Davis
1 2 IT 2.090000 False Poor 55225.0 Karen Brown
2 3 IT 7.047228 False Excellent 87600.0 Joseph Johnson
3 4 Finance 1.237509 False Excellent 53875.0 David Garcia
4 5 Marketing 4.160000 True Average 61400.0 Linda Martinez

By default, .head() shows 5 rows. You can specify a different number:

employee.head(10)  # First 10 rows
Employee_ID Department Years_of_Experience Full_Time Performance_Score Salary First_Name Last_Name
0 1 Finance 1.820000 False Good 56550.0 Michael Davis
1 2 IT 2.090000 False Poor 55225.0 Karen Brown
2 3 IT 7.047228 False Excellent 87600.0 Joseph Johnson
3 4 Finance 1.237509 False Excellent 53875.0 David Garcia
4 5 Marketing 4.160000 True Average 61400.0 Linda Martinez
5 6 Finance 2.475017 True Excellent 70550.0 Michael Brown
6 7 Marketing 10.420000 True Good 82050.0 Charles Moore
7 8 Marketing 6.102669 True Good 84750.0 David Lopez
8 9 Marketing 10.000000 True Poor 80500.0 David Johnson
9 10 HR 7.479808 False Average 101350.0 Patricia Martinez

Similarly, .tail() shows the last rows:

employee.tail(10)  # Last 10 rows
Employee_ID Department Years_of_Experience Full_Time Performance_Score Salary First_Name Last_Name
40 41 Finance 1.120000 True Average 54800.0 Joseph Miller
41 42 Marketing 2.083504 False Good 86200.0 Michael Rodriguez
42 43 Marketing 10.269678 True Poor 96025.0 Barbara Miller
43 44 HR 2.280000 False Average 57700.0 Linda Rodriguez
44 45 Finance 11.189596 True Excellent 84375.0 Joseph Davis
45 46 Finance 12.555784 True Good 92625.0 John Lopez
46 47 IT 7.380000 True Average 73950.0 Susan Johnson
47 48 IT 10.132786 True Excellent 95400.0 Thomas Smith
48 49 IT 5.720000 True Average 68800.0 Patricia Thomas
49 50 IT 4.320000 True Good 62800.0 Sarah Jones

Checking the Shape#

How big is your dataset? Use .shape:

employee.shape
(50, 8)

This tells us we have 50 rows and 8 columns.

Viewing Column Names#

To see all column names:

employee.columns
Index(['Employee_ID', 'Department', 'Years_of_Experience', 'Full_Time',
       'Performance_Score', 'Salary', 'First_Name', 'Last_Name'],
      dtype='object')

Getting Summary Statistics#

The .describe() method gives you summary statistics for numerical columns:

employee.describe()
Employee_ID Years_of_Experience Salary
count 50.00000 50.000000 50.000000
mean 25.50000 6.552038 78774.000000
std 14.57738 3.777367 15041.086553
min 1.00000 1.120000 53875.000000
25% 13.25000 2.980262 65975.000000
50% 25.50000 6.184805 80150.000000
75% 37.75000 10.099589 89156.250000
max 50.00000 13.880903 105750.000000

This shows you the count, mean, standard deviation, min, max, and quartiles (25%, 50%, 75%) for each numerical column. It’s a quick way to understand the distribution of your data.

Selecting Columns#

To select a single column, use square brackets with the column name:

employee['Salary']
0      56550.0
1      55225.0
2      87600.0
3      53875.0
4      61400.0
5      70550.0
6      82050.0
7      84750.0
8      80500.0
9     101350.0
10     67925.0
11     83650.0
12    105750.0
13     69650.0
14     91175.0
15     74000.0
16     63400.0
17     88200.0
18     60425.0
19     65150.0
20     89775.0
21     94425.0
22     75650.0
23    105075.0
24     79800.0
25     75875.0
26     81350.0
27     65325.0
28     74725.0
29     74550.0
30    104700.0
31     56200.0
32     61125.0
33     89475.0
34     87825.0
35     71675.0
36     87650.0
37    105550.0
38     96000.0
39     86100.0
40     54800.0
41     86200.0
42     96025.0
43     57700.0
44     84375.0
45     92625.0
46     73950.0
47     95400.0
48     68800.0
49     62800.0
Name: Salary, dtype: float64

Notice this returns a Series (a single column).

To select multiple columns, pass a list of column names:

employee[['Performance_Score', 'Salary']]
Performance_Score Salary
0 Good 56550.0
1 Poor 55225.0
2 Excellent 87600.0
3 Excellent 53875.0
4 Average 61400.0
5 Excellent 70550.0
6 Good 82050.0
7 Good 84750.0
8 Poor 80500.0
9 Average 101350.0
10 Poor 67925.0
11 Average 83650.0
12 Good 105750.0
13 Excellent 69650.0
14 Good 91175.0
15 Average 74000.0
16 Poor 63400.0
17 Poor 88200.0
18 Excellent 60425.0
19 Excellent 65150.0
20 Average 89775.0
21 Poor 94425.0
22 Excellent 75650.0
23 Average 105075.0
24 Excellent 79800.0
25 Excellent 75875.0
26 Good 81350.0
27 Average 65325.0
28 Good 74725.0
29 Excellent 74550.0
30 Good 104700.0
31 Poor 56200.0
32 Poor 61125.0
33 Poor 89475.0
34 Poor 87825.0
35 Average 71675.0
36 Poor 87650.0
37 Good 105550.0
38 Good 96000.0
39 Excellent 86100.0
40 Average 54800.0
41 Good 86200.0
42 Poor 96025.0
43 Average 57700.0
44 Excellent 84375.0
45 Good 92625.0
46 Average 73950.0
47 Excellent 95400.0
48 Average 68800.0
49 Good 62800.0

Notice the double brackets [[...]]. The outer brackets are for indexing the DataFrame; the inner brackets create a list of column names. This returns a DataFrame (not a Series).

Indexing in Pandas#

Pandas provides two main ways to select data:

Method

Type

Description

.iloc[]

Index-based

Select by position (0, 1, 2, …)

.loc[]

Label-based

Select by label (row/column names)

This is one of the most important concepts in pandas, so let’s explore both carefully.

Index-Based Selection with .iloc#

.iloc stands for “integer location”. It selects data based on numerical position, just like indexing a list.

Selecting Rows#

To select the first row (index 0):

employee.iloc[0]
Employee_ID                  1
Department             Finance
Years_of_Experience       1.82
Full_Time                False
Performance_Score         Good
Salary                 56550.0
First_Name             Michael
Last_Name                Davis
Name: 0, dtype: object

This returns a Series containing all columns for that row.

The second row (index 1):

employee.iloc[1]
Employee_ID                  2
Department                  IT
Years_of_Experience       2.09
Full_Time                False
Performance_Score         Poor
Salary                 55225.0
First_Name               Karen
Last_Name                Brown
Name: 1, dtype: object

Slicing Rows#

You can select multiple rows using slicing:

employee.iloc[0:3]  # Rows 0, 1, and 2
Employee_ID Department Years_of_Experience Full_Time Performance_Score Salary First_Name Last_Name
0 1 Finance 1.820000 False Good 56550.0 Michael Davis
1 2 IT 2.090000 False Poor 55225.0 Karen Brown
2 3 IT 7.047228 False Excellent 87600.0 Joseph Johnson

Remember: slicing in Python is exclusive of the end index. 0:3 gives you indices 0, 1, 2 (not 3).

Selecting Non-Consecutive Rows#

What if you want specific rows that aren’t in sequence? Pass a list of indices:

employee.iloc[[1, 2, 39]]
Employee_ID Department Years_of_Experience Full_Time Performance_Score Salary First_Name Last_Name
1 2 IT 2.090000 False Poor 55225.0 Karen Brown
2 3 IT 7.047228 False Excellent 87600.0 Joseph Johnson
39 40 HR 11.640000 True Excellent 86100.0 Karen Martinez

Selecting Rows and Columns#

.iloc can select both rows and columns using [row, column]:

employee.iloc[0, 0]  # First row, first column
np.int64(1)
employee.iloc[0, 1]  # First row, second column
'Finance'

You can also use lists and slices for both:

employee.iloc[[0, 1, 2], [0, 1, 2]]  # First 3 rows, first 3 columns
Employee_ID Department Years_of_Experience
0 1 Finance 1.820000
1 2 IT 2.090000
2 3 IT 7.047228

Or with slicing:

employee.iloc[0:3, 0:4]  # Rows 0-2, columns 0-3
Employee_ID Department Years_of_Experience Full_Time
0 1 Finance 1.820000 False
1 2 IT 2.090000 False
2 3 IT 7.047228 False

Label-Based Selection with .loc#

.loc selects data by labels (names) rather than positions. This is often more intuitive when you know your column names.

To see .loc in action clearly, let’s create a DataFrame with meaningful row labels:

df1 = pd.DataFrame(
    [[1, 2, 2], [4, 5, 2], [7, 5, 2], [232, 21, 24]],
    index=['cobra', 'viper', 'sidewinder', 'rattle_snake'],
    columns=['max_speed', 'shield', 'windy']
)
df1
max_speed shield windy
cobra 1 2 2
viper 4 5 2
sidewinder 7 5 2
rattle_snake 232 21 24

Selecting Rows by Label#

df1.loc['viper']
max_speed    4
shield       5
windy        2
Name: viper, dtype: int64

Selecting Multiple Rows#

Pass a list of labels:

df1.loc[['viper', 'rattle_snake']]
max_speed shield windy
viper 4 5 2
rattle_snake 232 21 24

Slicing with Labels#

Unlike .iloc, .loc slicing is inclusive of the end label:

df1.loc['viper':'rattle_snake']
max_speed shield windy
viper 4 5 2
sidewinder 7 5 2
rattle_snake 232 21 24

Notice this includes rattle_snake!

Selecting Rows and Columns by Label#

df1.loc['viper':'rattle_snake', 'shield':'windy']
shield windy
viper 5 2
sidewinder 5 2
rattle_snake 21 24

The Difference Between .loc and .iloc#

This is a crucial distinction:

Feature

.iloc

.loc

Selection type

By position (integer)

By label (name)

Slicing

Exclusive of end

Inclusive of end

Example

df.iloc[0:3] → rows 0, 1, 2

df.loc['a':'c'] → rows a, b, c

When to use

When you know the position

When you know the label

Warning

Be careful! df.iloc[0:3] and df.loc[0:3] can give different results if your index labels are integers. .iloc always uses position; .loc always uses labels.

Practical Example: Working with Real Data#

Let’s put it all together with our employee dataset:

# Check the shape
print(f"Dataset has {employee.shape[0]} rows and {employee.shape[1]} columns")
Dataset has 50 rows and 8 columns
# View first few rows
employee.head()
Employee_ID Department Years_of_Experience Full_Time Performance_Score Salary First_Name Last_Name
0 1 Finance 1.820000 False Good 56550.0 Michael Davis
1 2 IT 2.090000 False Poor 55225.0 Karen Brown
2 3 IT 7.047228 False Excellent 87600.0 Joseph Johnson
3 4 Finance 1.237509 False Excellent 53875.0 David Garcia
4 5 Marketing 4.160000 True Average 61400.0 Linda Martinez
# Get summary statistics
employee.describe()
Employee_ID Years_of_Experience Salary
count 50.00000 50.000000 50.000000
mean 25.50000 6.552038 78774.000000
std 14.57738 3.777367 15041.086553
min 1.00000 1.120000 53875.000000
25% 13.25000 2.980262 65975.000000
50% 25.50000 6.184805 80150.000000
75% 37.75000 10.099589 89156.250000
max 50.00000 13.880903 105750.000000
# Select specific columns
salaries = employee[['First_Name', 'Last_Name', 'Salary']]
salaries.head()
First_Name Last_Name Salary
0 Michael Davis 56550.0
1 Karen Brown 55225.0
2 Joseph Johnson 87600.0
3 David Garcia 53875.0
4 Linda Martinez 61400.0
# Select specific rows and columns
employee.iloc[0:5, [1, 5]]  # First 5 rows, columns 1 and 5
Department Salary
0 Finance 56550.0
1 IT 55225.0
2 IT 87600.0
3 Finance 53875.0
4 Marketing 61400.0

Exercises#

Exercise 12

Exercise 1: Create a DataFrame

Create a DataFrame called movies with the following data:

Title

Year

Rating

The Matrix

1999

8.7

Inception

2010

8.8

Interstellar

2014

8.6

Then:

  1. Print the DataFrame

  2. Print only the ‘Title’ column

  3. Print the shape of the DataFrame

Exercise 13

Exercise 2: Index-Based Selection

Using the movies DataFrame from Exercise 1:

  1. Select the first row using .iloc

  2. Select the last two rows using .iloc

  3. Select the ‘Title’ and ‘Rating’ columns for all rows using .iloc

Exercise 14

Exercise 3: Exploring a Dataset

Load the built-in seaborn tips dataset and explore it:

import seaborn as sns
tips = sns.load_dataset('tips')

Answer these questions:

  1. How many rows and columns does the dataset have?

  2. What are the column names?

  3. What are the summary statistics for the numerical columns?

  4. What do the first 10 rows look like?

Exercise 15

Exercise 4: .loc vs .iloc

Create this DataFrame:

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [5, 6, 7, 8],
    'C': [9, 10, 11, 12]
}, index=['w', 'x', 'y', 'z'])

Then use both .loc and .iloc to:

  1. Select the row with label ‘x’

  2. Select rows ‘x’ through ‘z’ and columns ‘A’ and ‘B’

  3. Select the value at row ‘y’, column ‘C’