Introduction to Pandas

Introduction to Pandas#

In this chapter, we’ll learn about pandas — a powerful library that makes working with data in Python much easier. If you’ve ever used Excel or Google Sheets, pandas will feel somewhat familiar. It gives you tables (called DataFrames) that you can sort, filter, and analyse with just a few lines of code.

What is Pandas?#

Pandas is a package built on top of NumPy, and provides an efficient implementation of a DataFrame. DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data.

Think of it this way:

A list stores a sequence of items
A dictionary stores key-value pairs
A DataFrame stores a whole table — rows, columns, and all!

Installing and Importing Pandas#

If you’re using Google Colab or Anaconda, pandas is already installed. Otherwise, you can install it with:

pip install pandas

Once installed, we import pandas using a conventional alias:

import pandas as pd

pd.__version__

'2.3.3'

Notice we import pandas as pd. This is a widely-used convention — almost every pandas tutorial and documentation uses this alias. When you see pd.something(), you’ll know it’s a pandas function.

The Two Core Data Structures#

There are two core objects in pandas:

Structure	Description	Analogy
Series	A single column of data	Like a list with labels
DataFrame	A table with rows and columns	Like an Excel spreadsheet

Let’s explore both.

DataFrame: Your Data Table#

A DataFrame is a table. It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.

Creating a DataFrame from a Dictionary#

The simplest way to create a DataFrame is from a dictionary:

pd.DataFrame({'Yup': [50, 21, 32], 'Nope': [131, 2, 200]})

	Yup	Nope
0	50	131
1	21	2
2	32	200

Notice what happened here:

The dictionary keys became the column names (Yup and Nope)
The dictionary values (lists) became the data in each column
Pandas automatically added row labels (0, 1, 2) — these are called the index

DataFrame entries are not limited to integers. Here’s a DataFrame with strings:

pd.DataFrame({
    'Bob': ['I liked it.', 'It was awful.'],
    'Sue': ['Pretty good.', 'Bland.']
})

	Bob	Sue
0	I liked it.	Pretty good.
1	It was awful.	Bland.

Custom Row Labels (Index)#

The automatic numbering (0, 1, 2, …) is fine for many purposes, but sometimes you want meaningful row labels. You can specify these using the index parameter:

pd.DataFrame({
    'Bob': ['I liked it.', 'It was awful.'],
    'Sue': ['Pretty good.', 'Bland.']
}, index=['Product A', 'Product B'])

	Bob	Sue
Product A	I liked it.	Pretty good.
Product B	It was awful.	Bland.

Now instead of 0 and 1, our rows are labelled ‘Product A’ and ‘Product B’. This makes the data more readable and easier to reference.

Series: A Single Column#

A Series is a sequence of data values. If a DataFrame is a table, a Series is a single column (or row) of that table.

You can create a Series from a list:

pd.Series([1, 2, 3, 4])

  1
  2
  3
  4
dtype: int64

Notice that a Series has:

An index (0, 1, 2, 3) — just like DataFrame rows
Values (1, 2, 3, 4)
A dtype (data type) — here it’s int64 (64-bit integer)

You can also give a Series a name and custom index:

pd.Series([30, 35, 40], index=['2020', '2021', '2022'], name='Sales')

  30
  35
  40
Name: Sales, dtype: int64

Tip

A Series is essentially a single column of a DataFrame. When you extract one column from a DataFrame, you get a Series!

Loading Data from CSV Files#

In the real world, you won’t type your data manually. Instead, you’ll load it from files. The most common format is CSV (Comma-Separated Values).

Reading a CSV File#

Use pd.read_csv() to load a CSV file:

employee = pd.read_csv('../data/employee_dataset.csv')

The argument can be:

A file path on your computer: 'C:/Users/data/employees.csv'
A URL to a file online: 'https://example.com/data.csv'

Let’s look at the data:

employee

	Employee_ID	Department	Years_of_Experience	Full_Time	Performance_Score	Salary	First_Name	Last_Name
0	1	Finance	1.820000	False	Good	56550.0	Michael	Davis
1	2	IT	2.090000	False	Poor	55225.0	Karen	Brown
2	3	IT	7.047228	False	Excellent	87600.0	Joseph	Johnson
3	4	Finance	1.237509	False	Excellent	53875.0	David	Garcia
4	5	Marketing	4.160000	True	Average	61400.0	Linda	Martinez
5	6	Finance	2.475017	True	Excellent	70550.0	Michael	Brown
6	7	Marketing	10.420000	True	Good	82050.0	Charles	Moore
7	8	Marketing	6.102669	True	Good	84750.0	David	Lopez
8	9	Marketing	10.000000	True	Poor	80500.0	David	Johnson
9	10	HR	7.479808	False	Average	101350.0	Patricia	Martinez
10	11	Finance	5.281314	False	Poor	67925.0	Linda	Brown
11	12	IT	5.226557	True	Average	83650.0	John	Wilson
12	13	HR	11.126626	True	Good	105750.0	Mary	Thomas
13	14	IT	6.460000	True	Excellent	69650.0	Barbara	Anderson
14	15	Finance	11.783710	True	Good	91175.0	Jennifer	Davis
15	16	IT	5.798768	True	Average	74000.0	Mary	Wilson
16	17	IT	2.261465	False	Poor	63400.0	James	Davis
17	18	HR	12.480000	True	Poor	88200.0	Barbara	Thomas
18	19	HR	3.770000	True	Excellent	60425.0	Barbara	Gonzalez
19	20	HR	1.806982	True	Excellent	65150.0	Thomas	Moore
20	21	IT	2.757016	False	Average	89775.0	Elizabeth	Anderson
21	22	IT	7.436003	True	Poor	94425.0	Jessica	Gonzalez
22	23	Marketing	8.060000	False	Excellent	75650.0	Joseph	Rodriguez
23	24	Finance	13.880903	True	Average	105075.0	Joseph	Anderson
24	25	Marketing	4.566735	True	Excellent	79800.0	Charles	Gonzalez
25	26	Finance	8.550000	True	Excellent	75875.0	Barbara	Smith
26	27	HR	9.437372	True	Good	81350.0	Karen	Miller
27	28	HR	5.130000	False	Average	65325.0	John	Rodriguez
28	29	Finance	5.051335	False	Good	74725.0	Robert	Smith
29	30	Finance	6.685832	True	Excellent	74550.0	Charles	Lopez
30	31	IT	11.937029	True	Good	104700.0	Michael	Davis
31	32	Finance	2.080000	False	Poor	56200.0	William	Hernandez
32	33	HR	3.650000	False	Poor	61125.0	Michael	Jackson
33	34	HR	1.527721	True	Poor	89475.0	Sarah	Taylor
34	35	Finance	1.894593	True	Poor	87825.0	Patricia	Davis
35	36	IT	7.270000	False	Average	71675.0	Susan	Williams
36	37	IT	6.266940	False	Poor	87650.0	Sarah	Williams
37	38	IT	10.740589	False	Good	105550.0	William	Smith
38	39	Finance	13.160849	False	Good	96000.0	Mary	Jones
39	40	HR	11.640000	True	Excellent	86100.0	Karen	Martinez
40	41	Finance	1.120000	True	Average	54800.0	Joseph	Miller
41	42	Marketing	2.083504	False	Good	86200.0	Michael	Rodriguez
42	43	Marketing	10.269678	True	Poor	96025.0	Barbara	Miller
43	44	HR	2.280000	False	Average	57700.0	Linda	Rodriguez
44	45	Finance	11.189596	True	Excellent	84375.0	Joseph	Davis
45	46	Finance	12.555784	True	Good	92625.0	John	Lopez
46	47	IT	7.380000	True	Average	73950.0	Susan	Johnson
47	48	IT	10.132786	True	Excellent	95400.0	Thomas	Smith
48	49	IT	5.720000	True	Average	68800.0	Patricia	Thomas
49	50	IT	4.320000	True	Good	62800.0	Sarah	Jones

Notice that pandas shows the first and last few rows, with ... in between. This is helpful when your dataset is large.

Exploring Your Data#

Once you’ve loaded data, you’ll want to explore it. Pandas provides several useful methods for this.

Viewing the First/Last Rows#

Use .head() to see the first few rows:

employee.head()

	Employee_ID	Department	Years_of_Experience	Full_Time	Performance_Score	Salary	First_Name	Last_Name
0	1	Finance	1.820000	False	Good	56550.0	Michael	Davis
1	2	IT	2.090000	False	Poor	55225.0	Karen	Brown
2	3	IT	7.047228	False	Excellent	87600.0	Joseph	Johnson
3	4	Finance	1.237509	False	Excellent	53875.0	David	Garcia
4	5	Marketing	4.160000	True	Average	61400.0	Linda	Martinez

By default, .head() shows 5 rows. You can specify a different number:

employee.head(10)  # First 10 rows

	Employee_ID	Department	Years_of_Experience	Full_Time	Performance_Score	Salary	First_Name	Last_Name
0	1	Finance	1.820000	False	Good	56550.0	Michael	Davis
1	2	IT	2.090000	False	Poor	55225.0	Karen	Brown
2	3	IT	7.047228	False	Excellent	87600.0	Joseph	Johnson
3	4	Finance	1.237509	False	Excellent	53875.0	David	Garcia
4	5	Marketing	4.160000	True	Average	61400.0	Linda	Martinez
5	6	Finance	2.475017	True	Excellent	70550.0	Michael	Brown
6	7	Marketing	10.420000	True	Good	82050.0	Charles	Moore
7	8	Marketing	6.102669	True	Good	84750.0	David	Lopez
8	9	Marketing	10.000000	True	Poor	80500.0	David	Johnson
9	10	HR	7.479808	False	Average	101350.0	Patricia	Martinez

Similarly, .tail() shows the last rows:

employee.tail(10)  # Last 10 rows

	Employee_ID	Department	Years_of_Experience	Full_Time	Performance_Score	Salary	First_Name	Last_Name
40	41	Finance	1.120000	True	Average	54800.0	Joseph	Miller
41	42	Marketing	2.083504	False	Good	86200.0	Michael	Rodriguez
42	43	Marketing	10.269678	True	Poor	96025.0	Barbara	Miller
43	44	HR	2.280000	False	Average	57700.0	Linda	Rodriguez
44	45	Finance	11.189596	True	Excellent	84375.0	Joseph	Davis
45	46	Finance	12.555784	True	Good	92625.0	John	Lopez
46	47	IT	7.380000	True	Average	73950.0	Susan	Johnson
47	48	IT	10.132786	True	Excellent	95400.0	Thomas	Smith
48	49	IT	5.720000	True	Average	68800.0	Patricia	Thomas
49	50	IT	4.320000	True	Good	62800.0	Sarah	Jones

Checking the Shape#

How big is your dataset? Use .shape:

employee.shape

(50, 8)

This tells us we have 50 rows and 8 columns.

Viewing Column Names#

To see all column names:

employee.columns

Index(['Employee_ID', 'Department', 'Years_of_Experience', 'Full_Time',
       'Performance_Score', 'Salary', 'First_Name', 'Last_Name'],
      dtype='object')

Getting Summary Statistics#

The .describe() method gives you summary statistics for numerical columns:

employee.describe()

	Employee_ID	Years_of_Experience	Salary
count	50.00000	50.000000	50.000000
mean	25.50000	6.552038	78774.000000
std	14.57738	3.777367	15041.086553
min	1.00000	1.120000	53875.000000
25%	13.25000	2.980262	65975.000000
50%	25.50000	6.184805	80150.000000
75%	37.75000	10.099589	89156.250000
max	50.00000	13.880903	105750.000000

This shows you the count, mean, standard deviation, min, max, and quartiles (25%, 50%, 75%) for each numerical column. It’s a quick way to understand the distribution of your data.

Selecting Columns#

To select a single column, use square brackets with the column name:

employee['Salary']

    56550.0
    55225.0
    87600.0
    53875.0
    61400.0
    70550.0
    82050.0
    84750.0
    80500.0
   101350.0
   67925.0
   83650.0
  105750.0
   69650.0
   91175.0
   74000.0
   63400.0
   88200.0
   60425.0
   65150.0
   89775.0
   94425.0
   75650.0
  105075.0
   79800.0
   75875.0
   81350.0
   65325.0
   74725.0
   74550.0
  104700.0
   56200.0
   61125.0
   89475.0
   87825.0
   71675.0
   87650.0
  105550.0
   96000.0
   86100.0
   54800.0
   86200.0
   96025.0
   57700.0
   84375.0
   92625.0
   73950.0
   95400.0
   68800.0
   62800.0
Name: Salary, dtype: float64

Notice this returns a Series (a single column).

To select multiple columns, pass a list of column names:

employee[['Performance_Score', 'Salary']]

	Performance_Score	Salary
0	Good	56550.0
1	Poor	55225.0
2	Excellent	87600.0
3	Excellent	53875.0
4	Average	61400.0
5	Excellent	70550.0
6	Good	82050.0
7	Good	84750.0
8	Poor	80500.0
9	Average	101350.0
10	Poor	67925.0
11	Average	83650.0
12	Good	105750.0
13	Excellent	69650.0
14	Good	91175.0
15	Average	74000.0
16	Poor	63400.0
17	Poor	88200.0
18	Excellent	60425.0
19	Excellent	65150.0
20	Average	89775.0
21	Poor	94425.0
22	Excellent	75650.0
23	Average	105075.0
24	Excellent	79800.0
25	Excellent	75875.0
26	Good	81350.0
27	Average	65325.0
28	Good	74725.0
29	Excellent	74550.0
30	Good	104700.0
31	Poor	56200.0
32	Poor	61125.0
33	Poor	89475.0
34	Poor	87825.0
35	Average	71675.0
36	Poor	87650.0
37	Good	105550.0
38	Good	96000.0
39	Excellent	86100.0
40	Average	54800.0
41	Good	86200.0
42	Poor	96025.0
43	Average	57700.0
44	Excellent	84375.0
45	Good	92625.0
46	Average	73950.0
47	Excellent	95400.0
48	Average	68800.0
49	Good	62800.0

Notice the double brackets [[...]]. The outer brackets are for indexing the DataFrame; the inner brackets create a list of column names. This returns a DataFrame (not a Series).

Indexing in Pandas#

Pandas provides two main ways to select data:

Method	Type	Description
`.iloc[]`	Index-based	Select by position (0, 1, 2, …)
`.loc[]`	Label-based	Select by label (row/column names)

This is one of the most important concepts in pandas, so let’s explore both carefully.

Index-Based Selection with `.iloc`#

.iloc stands for “integer location”. It selects data based on numerical position, just like indexing a list.

Selecting Rows#

To select the first row (index 0):

employee.iloc[0]

Employee_ID                  1
Department             Finance
Years_of_Experience       1.82
Full_Time                False
Performance_Score         Good
Salary                 56550.0
First_Name             Michael
Last_Name                Davis
Name: 0, dtype: object

This returns a Series containing all columns for that row.

The second row (index 1):

employee.iloc[1]

Employee_ID                  2
Department                  IT
Years_of_Experience       2.09
Full_Time                False
Performance_Score         Poor
Salary                 55225.0
First_Name               Karen
Last_Name                Brown
Name: 1, dtype: object

Slicing Rows#

You can select multiple rows using slicing:

employee.iloc[0:3]  # Rows 0, 1, and 2

	Employee_ID	Department	Years_of_Experience	Full_Time	Performance_Score	Salary	First_Name	Last_Name
0	1	Finance	1.820000	False	Good	56550.0	Michael	Davis
1	2	IT	2.090000	False	Poor	55225.0	Karen	Brown
2	3	IT	7.047228	False	Excellent	87600.0	Joseph	Johnson

Remember: slicing in Python is exclusive of the end index. 0:3 gives you indices 0, 1, 2 (not 3).

Selecting Non-Consecutive Rows#

What if you want specific rows that aren’t in sequence? Pass a list of indices:

employee.iloc[[1, 2, 39]]

	Employee_ID	Department	Years_of_Experience	Full_Time	Performance_Score	Salary	First_Name	Last_Name
1	2	IT	2.090000	False	Poor	55225.0	Karen	Brown
2	3	IT	7.047228	False	Excellent	87600.0	Joseph	Johnson
39	40	HR	11.640000	True	Excellent	86100.0	Karen	Martinez

Selecting Rows and Columns#

.iloc can select both rows and columns using [row, column]:

employee.iloc[0, 0]  # First row, first column

np.int64(1)

employee.iloc[0, 1]  # First row, second column

'Finance'

You can also use lists and slices for both:

employee.iloc[[0, 1, 2], [0, 1, 2]]  # First 3 rows, first 3 columns

	Employee_ID	Department	Years_of_Experience
0	1	Finance	1.820000
1	2	IT	2.090000
2	3	IT	7.047228

Or with slicing:

employee.iloc[0:3, 0:4]  # Rows 0-2, columns 0-3

	Employee_ID	Department	Years_of_Experience	Full_Time
0	1	Finance	1.820000	False
1	2	IT	2.090000	False
2	3	IT	7.047228	False

Label-Based Selection with `.loc`#

.loc selects data by labels (names) rather than positions. This is often more intuitive when you know your column names.

To see .loc in action clearly, let’s create a DataFrame with meaningful row labels:

df1 = pd.DataFrame(
    [[1, 2, 2], [4, 5, 2], [7, 5, 2], [232, 21, 24]],
    index=['cobra', 'viper', 'sidewinder', 'rattle_snake'],
    columns=['max_speed', 'shield', 'windy']
)
df1

	max_speed	shield	windy
cobra	1	2	2
viper	4	5	2
sidewinder	7	5	2
rattle_snake	232	21	24

Selecting Rows by Label#

df1.loc['viper']

max_speed    4
shield       5
windy        2
Name: viper, dtype: int64

Selecting Multiple Rows#

Pass a list of labels:

df1.loc[['viper', 'rattle_snake']]

	max_speed	shield	windy
viper	4	5	2
rattle_snake	232	21	24

Slicing with Labels#

Unlike .iloc, .loc slicing is inclusive of the end label:

df1.loc['viper':'rattle_snake']

	max_speed	shield	windy
viper	4	5	2
sidewinder	7	5	2
rattle_snake	232	21	24

Notice this includes rattle_snake!

Selecting Rows and Columns by Label#

df1.loc['viper':'rattle_snake', 'shield':'windy']

	shield	windy
viper	5	2
sidewinder	5	2
rattle_snake	21	24

The Difference Between `.loc` and `.iloc`#

This is a crucial distinction:

Feature	`.iloc`	`.loc`
Selection type	By position (integer)	By label (name)
Slicing	Exclusive of end	Inclusive of end
Example	`df.iloc[0:3]` → rows 0, 1, 2	`df.loc['a':'c']` → rows a, b, c
When to use	When you know the position	When you know the label

Warning

Be careful! df.iloc[0:3] and df.loc[0:3] can give different results if your index labels are integers. .iloc always uses position; .loc always uses labels.

Practical Example: Working with Real Data#

Let’s put it all together with our employee dataset:

# Check the shape
print(f"Dataset has {employee.shape[0]} rows and {employee.shape[1]} columns")

Dataset has 50 rows and 8 columns

# View first few rows
employee.head()

	Employee_ID	Department	Years_of_Experience	Full_Time	Performance_Score	Salary	First_Name	Last_Name
0	1	Finance	1.820000	False	Good	56550.0	Michael	Davis
1	2	IT	2.090000	False	Poor	55225.0	Karen	Brown
2	3	IT	7.047228	False	Excellent	87600.0	Joseph	Johnson
3	4	Finance	1.237509	False	Excellent	53875.0	David	Garcia
4	5	Marketing	4.160000	True	Average	61400.0	Linda	Martinez

# Get summary statistics
employee.describe()

	Employee_ID	Years_of_Experience	Salary
count	50.00000	50.000000	50.000000
mean	25.50000	6.552038	78774.000000
std	14.57738	3.777367	15041.086553
min	1.00000	1.120000	53875.000000
25%	13.25000	2.980262	65975.000000
50%	25.50000	6.184805	80150.000000
75%	37.75000	10.099589	89156.250000
max	50.00000	13.880903	105750.000000

# Select specific columns
salaries = employee[['First_Name', 'Last_Name', 'Salary']]
salaries.head()

	First_Name	Last_Name	Salary
0	Michael	Davis	56550.0
1	Karen	Brown	55225.0
2	Joseph	Johnson	87600.0
3	David	Garcia	53875.0
4	Linda	Martinez	61400.0

# Select specific rows and columns
employee.iloc[0:5, [1, 5]]  # First 5 rows, columns 1 and 5

	Department	Salary
0	Finance	56550.0
1	IT	55225.0
2	IT	87600.0
3	Finance	53875.0
4	Marketing	61400.0

Exercises#

Exercise 12

Exercise 1: Create a DataFrame

Create a DataFrame called movies with the following data:

Title	Year	Rating
The Matrix	1999	8.7
Inception	2010	8.8
Interstellar	2014	8.6

Then:

Print the DataFrame
Print only the ‘Title’ column
Print the shape of the DataFrame

Solution to Exercise 12

import pandas as pd

movies = pd.DataFrame({
    'Title': ['The Matrix', 'Inception', 'Interstellar'],
    'Year': [1999, 2010, 2014],
    'Rating': [8.7, 8.8, 8.6]
})

# Print the DataFrame
print(movies)

# Print only the 'Title' column
print(movies['Title'])

# Print the shape
print(movies.shape)  # (3, 3)

Exercise 13

Exercise 2: Index-Based Selection

Using the movies DataFrame from Exercise 1:

Select the first row using .iloc
Select the last two rows using .iloc
Select the ‘Title’ and ‘Rating’ columns for all rows using .iloc

Solution to Exercise 13

# 1. First row
print(movies.iloc[0])

# 2. Last two rows
print(movies.iloc[-2:])
# Or: movies.iloc[1:3]

# 3. Title and Rating columns (columns 0 and 2)
print(movies.iloc[:, [0, 2]])

Exercise 14

Exercise 3: Exploring a Dataset

Load the built-in seaborn tips dataset and explore it:

import seaborn as sns
tips = sns.load_dataset('tips')

Answer these questions:

How many rows and columns does the dataset have?
What are the column names?
What are the summary statistics for the numerical columns?
What do the first 10 rows look like?

Solution to Exercise 14

import seaborn as sns
tips = sns.load_dataset('tips')

# 1. Shape
print(f"Rows: {tips.shape[0]}, Columns: {tips.shape[1]}")
# Output: Rows: 244, Columns: 7

# 2. Column names
print(tips.columns)
# Output: Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size'], dtype='object')

# 3. Summary statistics
print(tips.describe())

# 4. First 10 rows
print(tips.head(10))

Exercise 15

Exercise 4: .loc vs .iloc

Create this DataFrame:

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [5, 6, 7, 8],
    'C': [9, 10, 11, 12]
}, index=['w', 'x', 'y', 'z'])

Then use both .loc and .iloc to:

Select the row with label ‘x’
Select rows ‘x’ through ‘z’ and columns ‘A’ and ‘B’
Select the value at row ‘y’, column ‘C’

Solution to Exercise 15

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [5, 6, 7, 8],
    'C': [9, 10, 11, 12]
}, index=['w', 'x', 'y', 'z'])

# 1. Select row 'x'
print(df.loc['x'])       # Using label
print(df.iloc[1])        # Using position

# 2. Rows 'x' through 'z', columns 'A' and 'B'
print(df.loc['x':'z', ['A', 'B']])     # Using labels
print(df.iloc[1:4, [0, 1]])            # Using positions

# 3. Value at row 'y', column 'C'
print(df.loc['y', 'C'])   # Output: 11
print(df.iloc[2, 2])      # Output: 11