Introduction to Pandas#
In this chapter, we’ll learn about pandas — a powerful library that makes working with data in Python much easier. If you’ve ever used Excel or Google Sheets, pandas will feel somewhat familiar. It gives you tables (called DataFrames) that you can sort, filter, and analyse with just a few lines of code.
What is Pandas?#
Pandas is a package built on top of NumPy, and provides an efficient implementation of a DataFrame. DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data.
Think of it this way:
A list stores a sequence of items
A dictionary stores key-value pairs
A DataFrame stores a whole table — rows, columns, and all!
Installing and Importing Pandas#
If you’re using Google Colab or Anaconda, pandas is already installed. Otherwise, you can install it with:
pip install pandas
Once installed, we import pandas using a conventional alias:
import pandas as pd
pd.__version__
'2.3.3'
Notice we import pandas as pd. This is a widely-used convention — almost every pandas tutorial and documentation uses this alias. When you see pd.something(), you’ll know it’s a pandas function.
The Two Core Data Structures#
There are two core objects in pandas:
Structure |
Description |
Analogy |
|---|---|---|
Series |
A single column of data |
Like a list with labels |
DataFrame |
A table with rows and columns |
Like an Excel spreadsheet |
Let’s explore both.
DataFrame: Your Data Table#
A DataFrame is a table. It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.
Creating a DataFrame from a Dictionary#
The simplest way to create a DataFrame is from a dictionary:
pd.DataFrame({'Yup': [50, 21, 32], 'Nope': [131, 2, 200]})
| Yup | Nope | |
|---|---|---|
| 0 | 50 | 131 |
| 1 | 21 | 2 |
| 2 | 32 | 200 |
Notice what happened here:
The dictionary keys became the column names (
YupandNope)The dictionary values (lists) became the data in each column
Pandas automatically added row labels (0, 1, 2) — these are called the index
DataFrame entries are not limited to integers. Here’s a DataFrame with strings:
pd.DataFrame({
'Bob': ['I liked it.', 'It was awful.'],
'Sue': ['Pretty good.', 'Bland.']
})
| Bob | Sue | |
|---|---|---|
| 0 | I liked it. | Pretty good. |
| 1 | It was awful. | Bland. |
Custom Row Labels (Index)#
The automatic numbering (0, 1, 2, …) is fine for many purposes, but sometimes you want meaningful row labels. You can specify these using the index parameter:
pd.DataFrame({
'Bob': ['I liked it.', 'It was awful.'],
'Sue': ['Pretty good.', 'Bland.']
}, index=['Product A', 'Product B'])
| Bob | Sue | |
|---|---|---|
| Product A | I liked it. | Pretty good. |
| Product B | It was awful. | Bland. |
Now instead of 0 and 1, our rows are labelled ‘Product A’ and ‘Product B’. This makes the data more readable and easier to reference.
Series: A Single Column#
A Series is a sequence of data values. If a DataFrame is a table, a Series is a single column (or row) of that table.
You can create a Series from a list:
pd.Series([1, 2, 3, 4])
0 1
1 2
2 3
3 4
dtype: int64
Notice that a Series has:
An index (0, 1, 2, 3) — just like DataFrame rows
Values (1, 2, 3, 4)
A dtype (data type) — here it’s
int64(64-bit integer)
You can also give a Series a name and custom index:
pd.Series([30, 35, 40], index=['2020', '2021', '2022'], name='Sales')
2020 30
2021 35
2022 40
Name: Sales, dtype: int64
Tip
A Series is essentially a single column of a DataFrame. When you extract one column from a DataFrame, you get a Series!
Loading Data from CSV Files#
In the real world, you won’t type your data manually. Instead, you’ll load it from files. The most common format is CSV (Comma-Separated Values).
Reading a CSV File#
Use pd.read_csv() to load a CSV file:
employee = pd.read_csv('../data/employee_dataset.csv')
The argument can be:
A file path on your computer:
'C:/Users/data/employees.csv'A URL to a file online:
'https://example.com/data.csv'
Let’s look at the data:
employee
| Employee_ID | Department | Years_of_Experience | Full_Time | Performance_Score | Salary | First_Name | Last_Name | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Finance | 1.820000 | False | Good | 56550.0 | Michael | Davis |
| 1 | 2 | IT | 2.090000 | False | Poor | 55225.0 | Karen | Brown |
| 2 | 3 | IT | 7.047228 | False | Excellent | 87600.0 | Joseph | Johnson |
| 3 | 4 | Finance | 1.237509 | False | Excellent | 53875.0 | David | Garcia |
| 4 | 5 | Marketing | 4.160000 | True | Average | 61400.0 | Linda | Martinez |
| 5 | 6 | Finance | 2.475017 | True | Excellent | 70550.0 | Michael | Brown |
| 6 | 7 | Marketing | 10.420000 | True | Good | 82050.0 | Charles | Moore |
| 7 | 8 | Marketing | 6.102669 | True | Good | 84750.0 | David | Lopez |
| 8 | 9 | Marketing | 10.000000 | True | Poor | 80500.0 | David | Johnson |
| 9 | 10 | HR | 7.479808 | False | Average | 101350.0 | Patricia | Martinez |
| 10 | 11 | Finance | 5.281314 | False | Poor | 67925.0 | Linda | Brown |
| 11 | 12 | IT | 5.226557 | True | Average | 83650.0 | John | Wilson |
| 12 | 13 | HR | 11.126626 | True | Good | 105750.0 | Mary | Thomas |
| 13 | 14 | IT | 6.460000 | True | Excellent | 69650.0 | Barbara | Anderson |
| 14 | 15 | Finance | 11.783710 | True | Good | 91175.0 | Jennifer | Davis |
| 15 | 16 | IT | 5.798768 | True | Average | 74000.0 | Mary | Wilson |
| 16 | 17 | IT | 2.261465 | False | Poor | 63400.0 | James | Davis |
| 17 | 18 | HR | 12.480000 | True | Poor | 88200.0 | Barbara | Thomas |
| 18 | 19 | HR | 3.770000 | True | Excellent | 60425.0 | Barbara | Gonzalez |
| 19 | 20 | HR | 1.806982 | True | Excellent | 65150.0 | Thomas | Moore |
| 20 | 21 | IT | 2.757016 | False | Average | 89775.0 | Elizabeth | Anderson |
| 21 | 22 | IT | 7.436003 | True | Poor | 94425.0 | Jessica | Gonzalez |
| 22 | 23 | Marketing | 8.060000 | False | Excellent | 75650.0 | Joseph | Rodriguez |
| 23 | 24 | Finance | 13.880903 | True | Average | 105075.0 | Joseph | Anderson |
| 24 | 25 | Marketing | 4.566735 | True | Excellent | 79800.0 | Charles | Gonzalez |
| 25 | 26 | Finance | 8.550000 | True | Excellent | 75875.0 | Barbara | Smith |
| 26 | 27 | HR | 9.437372 | True | Good | 81350.0 | Karen | Miller |
| 27 | 28 | HR | 5.130000 | False | Average | 65325.0 | John | Rodriguez |
| 28 | 29 | Finance | 5.051335 | False | Good | 74725.0 | Robert | Smith |
| 29 | 30 | Finance | 6.685832 | True | Excellent | 74550.0 | Charles | Lopez |
| 30 | 31 | IT | 11.937029 | True | Good | 104700.0 | Michael | Davis |
| 31 | 32 | Finance | 2.080000 | False | Poor | 56200.0 | William | Hernandez |
| 32 | 33 | HR | 3.650000 | False | Poor | 61125.0 | Michael | Jackson |
| 33 | 34 | HR | 1.527721 | True | Poor | 89475.0 | Sarah | Taylor |
| 34 | 35 | Finance | 1.894593 | True | Poor | 87825.0 | Patricia | Davis |
| 35 | 36 | IT | 7.270000 | False | Average | 71675.0 | Susan | Williams |
| 36 | 37 | IT | 6.266940 | False | Poor | 87650.0 | Sarah | Williams |
| 37 | 38 | IT | 10.740589 | False | Good | 105550.0 | William | Smith |
| 38 | 39 | Finance | 13.160849 | False | Good | 96000.0 | Mary | Jones |
| 39 | 40 | HR | 11.640000 | True | Excellent | 86100.0 | Karen | Martinez |
| 40 | 41 | Finance | 1.120000 | True | Average | 54800.0 | Joseph | Miller |
| 41 | 42 | Marketing | 2.083504 | False | Good | 86200.0 | Michael | Rodriguez |
| 42 | 43 | Marketing | 10.269678 | True | Poor | 96025.0 | Barbara | Miller |
| 43 | 44 | HR | 2.280000 | False | Average | 57700.0 | Linda | Rodriguez |
| 44 | 45 | Finance | 11.189596 | True | Excellent | 84375.0 | Joseph | Davis |
| 45 | 46 | Finance | 12.555784 | True | Good | 92625.0 | John | Lopez |
| 46 | 47 | IT | 7.380000 | True | Average | 73950.0 | Susan | Johnson |
| 47 | 48 | IT | 10.132786 | True | Excellent | 95400.0 | Thomas | Smith |
| 48 | 49 | IT | 5.720000 | True | Average | 68800.0 | Patricia | Thomas |
| 49 | 50 | IT | 4.320000 | True | Good | 62800.0 | Sarah | Jones |
Notice that pandas shows the first and last few rows, with ... in between. This is helpful when your dataset is large.
Exploring Your Data#
Once you’ve loaded data, you’ll want to explore it. Pandas provides several useful methods for this.
Viewing the First/Last Rows#
Use .head() to see the first few rows:
employee.head()
| Employee_ID | Department | Years_of_Experience | Full_Time | Performance_Score | Salary | First_Name | Last_Name | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Finance | 1.820000 | False | Good | 56550.0 | Michael | Davis |
| 1 | 2 | IT | 2.090000 | False | Poor | 55225.0 | Karen | Brown |
| 2 | 3 | IT | 7.047228 | False | Excellent | 87600.0 | Joseph | Johnson |
| 3 | 4 | Finance | 1.237509 | False | Excellent | 53875.0 | David | Garcia |
| 4 | 5 | Marketing | 4.160000 | True | Average | 61400.0 | Linda | Martinez |
By default, .head() shows 5 rows. You can specify a different number:
employee.head(10) # First 10 rows
| Employee_ID | Department | Years_of_Experience | Full_Time | Performance_Score | Salary | First_Name | Last_Name | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Finance | 1.820000 | False | Good | 56550.0 | Michael | Davis |
| 1 | 2 | IT | 2.090000 | False | Poor | 55225.0 | Karen | Brown |
| 2 | 3 | IT | 7.047228 | False | Excellent | 87600.0 | Joseph | Johnson |
| 3 | 4 | Finance | 1.237509 | False | Excellent | 53875.0 | David | Garcia |
| 4 | 5 | Marketing | 4.160000 | True | Average | 61400.0 | Linda | Martinez |
| 5 | 6 | Finance | 2.475017 | True | Excellent | 70550.0 | Michael | Brown |
| 6 | 7 | Marketing | 10.420000 | True | Good | 82050.0 | Charles | Moore |
| 7 | 8 | Marketing | 6.102669 | True | Good | 84750.0 | David | Lopez |
| 8 | 9 | Marketing | 10.000000 | True | Poor | 80500.0 | David | Johnson |
| 9 | 10 | HR | 7.479808 | False | Average | 101350.0 | Patricia | Martinez |
Similarly, .tail() shows the last rows:
employee.tail(10) # Last 10 rows
| Employee_ID | Department | Years_of_Experience | Full_Time | Performance_Score | Salary | First_Name | Last_Name | |
|---|---|---|---|---|---|---|---|---|
| 40 | 41 | Finance | 1.120000 | True | Average | 54800.0 | Joseph | Miller |
| 41 | 42 | Marketing | 2.083504 | False | Good | 86200.0 | Michael | Rodriguez |
| 42 | 43 | Marketing | 10.269678 | True | Poor | 96025.0 | Barbara | Miller |
| 43 | 44 | HR | 2.280000 | False | Average | 57700.0 | Linda | Rodriguez |
| 44 | 45 | Finance | 11.189596 | True | Excellent | 84375.0 | Joseph | Davis |
| 45 | 46 | Finance | 12.555784 | True | Good | 92625.0 | John | Lopez |
| 46 | 47 | IT | 7.380000 | True | Average | 73950.0 | Susan | Johnson |
| 47 | 48 | IT | 10.132786 | True | Excellent | 95400.0 | Thomas | Smith |
| 48 | 49 | IT | 5.720000 | True | Average | 68800.0 | Patricia | Thomas |
| 49 | 50 | IT | 4.320000 | True | Good | 62800.0 | Sarah | Jones |
Checking the Shape#
How big is your dataset? Use .shape:
employee.shape
(50, 8)
This tells us we have 50 rows and 8 columns.
Viewing Column Names#
To see all column names:
employee.columns
Index(['Employee_ID', 'Department', 'Years_of_Experience', 'Full_Time',
'Performance_Score', 'Salary', 'First_Name', 'Last_Name'],
dtype='object')
Getting Summary Statistics#
The .describe() method gives you summary statistics for numerical columns:
employee.describe()
| Employee_ID | Years_of_Experience | Salary | |
|---|---|---|---|
| count | 50.00000 | 50.000000 | 50.000000 |
| mean | 25.50000 | 6.552038 | 78774.000000 |
| std | 14.57738 | 3.777367 | 15041.086553 |
| min | 1.00000 | 1.120000 | 53875.000000 |
| 25% | 13.25000 | 2.980262 | 65975.000000 |
| 50% | 25.50000 | 6.184805 | 80150.000000 |
| 75% | 37.75000 | 10.099589 | 89156.250000 |
| max | 50.00000 | 13.880903 | 105750.000000 |
This shows you the count, mean, standard deviation, min, max, and quartiles (25%, 50%, 75%) for each numerical column. It’s a quick way to understand the distribution of your data.
Selecting Columns#
To select a single column, use square brackets with the column name:
employee['Salary']
0 56550.0
1 55225.0
2 87600.0
3 53875.0
4 61400.0
5 70550.0
6 82050.0
7 84750.0
8 80500.0
9 101350.0
10 67925.0
11 83650.0
12 105750.0
13 69650.0
14 91175.0
15 74000.0
16 63400.0
17 88200.0
18 60425.0
19 65150.0
20 89775.0
21 94425.0
22 75650.0
23 105075.0
24 79800.0
25 75875.0
26 81350.0
27 65325.0
28 74725.0
29 74550.0
30 104700.0
31 56200.0
32 61125.0
33 89475.0
34 87825.0
35 71675.0
36 87650.0
37 105550.0
38 96000.0
39 86100.0
40 54800.0
41 86200.0
42 96025.0
43 57700.0
44 84375.0
45 92625.0
46 73950.0
47 95400.0
48 68800.0
49 62800.0
Name: Salary, dtype: float64
Notice this returns a Series (a single column).
To select multiple columns, pass a list of column names:
employee[['Performance_Score', 'Salary']]
| Performance_Score | Salary | |
|---|---|---|
| 0 | Good | 56550.0 |
| 1 | Poor | 55225.0 |
| 2 | Excellent | 87600.0 |
| 3 | Excellent | 53875.0 |
| 4 | Average | 61400.0 |
| 5 | Excellent | 70550.0 |
| 6 | Good | 82050.0 |
| 7 | Good | 84750.0 |
| 8 | Poor | 80500.0 |
| 9 | Average | 101350.0 |
| 10 | Poor | 67925.0 |
| 11 | Average | 83650.0 |
| 12 | Good | 105750.0 |
| 13 | Excellent | 69650.0 |
| 14 | Good | 91175.0 |
| 15 | Average | 74000.0 |
| 16 | Poor | 63400.0 |
| 17 | Poor | 88200.0 |
| 18 | Excellent | 60425.0 |
| 19 | Excellent | 65150.0 |
| 20 | Average | 89775.0 |
| 21 | Poor | 94425.0 |
| 22 | Excellent | 75650.0 |
| 23 | Average | 105075.0 |
| 24 | Excellent | 79800.0 |
| 25 | Excellent | 75875.0 |
| 26 | Good | 81350.0 |
| 27 | Average | 65325.0 |
| 28 | Good | 74725.0 |
| 29 | Excellent | 74550.0 |
| 30 | Good | 104700.0 |
| 31 | Poor | 56200.0 |
| 32 | Poor | 61125.0 |
| 33 | Poor | 89475.0 |
| 34 | Poor | 87825.0 |
| 35 | Average | 71675.0 |
| 36 | Poor | 87650.0 |
| 37 | Good | 105550.0 |
| 38 | Good | 96000.0 |
| 39 | Excellent | 86100.0 |
| 40 | Average | 54800.0 |
| 41 | Good | 86200.0 |
| 42 | Poor | 96025.0 |
| 43 | Average | 57700.0 |
| 44 | Excellent | 84375.0 |
| 45 | Good | 92625.0 |
| 46 | Average | 73950.0 |
| 47 | Excellent | 95400.0 |
| 48 | Average | 68800.0 |
| 49 | Good | 62800.0 |
Notice the double brackets [[...]]. The outer brackets are for indexing the DataFrame; the inner brackets create a list of column names. This returns a DataFrame (not a Series).
Indexing in Pandas#
Pandas provides two main ways to select data:
Method |
Type |
Description |
|---|---|---|
|
Index-based |
Select by position (0, 1, 2, …) |
|
Label-based |
Select by label (row/column names) |
This is one of the most important concepts in pandas, so let’s explore both carefully.
Index-Based Selection with .iloc#
.iloc stands for “integer location”. It selects data based on numerical position, just like indexing a list.
Selecting Rows#
To select the first row (index 0):
employee.iloc[0]
Employee_ID 1
Department Finance
Years_of_Experience 1.82
Full_Time False
Performance_Score Good
Salary 56550.0
First_Name Michael
Last_Name Davis
Name: 0, dtype: object
This returns a Series containing all columns for that row.
The second row (index 1):
employee.iloc[1]
Employee_ID 2
Department IT
Years_of_Experience 2.09
Full_Time False
Performance_Score Poor
Salary 55225.0
First_Name Karen
Last_Name Brown
Name: 1, dtype: object
Slicing Rows#
You can select multiple rows using slicing:
employee.iloc[0:3] # Rows 0, 1, and 2
| Employee_ID | Department | Years_of_Experience | Full_Time | Performance_Score | Salary | First_Name | Last_Name | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Finance | 1.820000 | False | Good | 56550.0 | Michael | Davis |
| 1 | 2 | IT | 2.090000 | False | Poor | 55225.0 | Karen | Brown |
| 2 | 3 | IT | 7.047228 | False | Excellent | 87600.0 | Joseph | Johnson |
Remember: slicing in Python is exclusive of the end index. 0:3 gives you indices 0, 1, 2 (not 3).
Selecting Non-Consecutive Rows#
What if you want specific rows that aren’t in sequence? Pass a list of indices:
employee.iloc[[1, 2, 39]]
| Employee_ID | Department | Years_of_Experience | Full_Time | Performance_Score | Salary | First_Name | Last_Name | |
|---|---|---|---|---|---|---|---|---|
| 1 | 2 | IT | 2.090000 | False | Poor | 55225.0 | Karen | Brown |
| 2 | 3 | IT | 7.047228 | False | Excellent | 87600.0 | Joseph | Johnson |
| 39 | 40 | HR | 11.640000 | True | Excellent | 86100.0 | Karen | Martinez |
Selecting Rows and Columns#
.iloc can select both rows and columns using [row, column]:
employee.iloc[0, 0] # First row, first column
np.int64(1)
employee.iloc[0, 1] # First row, second column
'Finance'
You can also use lists and slices for both:
employee.iloc[[0, 1, 2], [0, 1, 2]] # First 3 rows, first 3 columns
| Employee_ID | Department | Years_of_Experience | |
|---|---|---|---|
| 0 | 1 | Finance | 1.820000 |
| 1 | 2 | IT | 2.090000 |
| 2 | 3 | IT | 7.047228 |
Or with slicing:
employee.iloc[0:3, 0:4] # Rows 0-2, columns 0-3
| Employee_ID | Department | Years_of_Experience | Full_Time | |
|---|---|---|---|---|
| 0 | 1 | Finance | 1.820000 | False |
| 1 | 2 | IT | 2.090000 | False |
| 2 | 3 | IT | 7.047228 | False |
Label-Based Selection with .loc#
.loc selects data by labels (names) rather than positions. This is often more intuitive when you know your column names.
To see .loc in action clearly, let’s create a DataFrame with meaningful row labels:
df1 = pd.DataFrame(
[[1, 2, 2], [4, 5, 2], [7, 5, 2], [232, 21, 24]],
index=['cobra', 'viper', 'sidewinder', 'rattle_snake'],
columns=['max_speed', 'shield', 'windy']
)
df1
| max_speed | shield | windy | |
|---|---|---|---|
| cobra | 1 | 2 | 2 |
| viper | 4 | 5 | 2 |
| sidewinder | 7 | 5 | 2 |
| rattle_snake | 232 | 21 | 24 |
Selecting Rows by Label#
df1.loc['viper']
max_speed 4
shield 5
windy 2
Name: viper, dtype: int64
Selecting Multiple Rows#
Pass a list of labels:
df1.loc[['viper', 'rattle_snake']]
| max_speed | shield | windy | |
|---|---|---|---|
| viper | 4 | 5 | 2 |
| rattle_snake | 232 | 21 | 24 |
Slicing with Labels#
Unlike .iloc, .loc slicing is inclusive of the end label:
df1.loc['viper':'rattle_snake']
| max_speed | shield | windy | |
|---|---|---|---|
| viper | 4 | 5 | 2 |
| sidewinder | 7 | 5 | 2 |
| rattle_snake | 232 | 21 | 24 |
Notice this includes rattle_snake!
Selecting Rows and Columns by Label#
df1.loc['viper':'rattle_snake', 'shield':'windy']
| shield | windy | |
|---|---|---|
| viper | 5 | 2 |
| sidewinder | 5 | 2 |
| rattle_snake | 21 | 24 |
The Difference Between .loc and .iloc#
This is a crucial distinction:
Feature |
|
|
|---|---|---|
Selection type |
By position (integer) |
By label (name) |
Slicing |
Exclusive of end |
Inclusive of end |
Example |
|
|
When to use |
When you know the position |
When you know the label |
Warning
Be careful! df.iloc[0:3] and df.loc[0:3] can give different results if your index labels are integers. .iloc always uses position; .loc always uses labels.
Practical Example: Working with Real Data#
Let’s put it all together with our employee dataset:
# Check the shape
print(f"Dataset has {employee.shape[0]} rows and {employee.shape[1]} columns")
Dataset has 50 rows and 8 columns
# View first few rows
employee.head()
| Employee_ID | Department | Years_of_Experience | Full_Time | Performance_Score | Salary | First_Name | Last_Name | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Finance | 1.820000 | False | Good | 56550.0 | Michael | Davis |
| 1 | 2 | IT | 2.090000 | False | Poor | 55225.0 | Karen | Brown |
| 2 | 3 | IT | 7.047228 | False | Excellent | 87600.0 | Joseph | Johnson |
| 3 | 4 | Finance | 1.237509 | False | Excellent | 53875.0 | David | Garcia |
| 4 | 5 | Marketing | 4.160000 | True | Average | 61400.0 | Linda | Martinez |
# Get summary statistics
employee.describe()
| Employee_ID | Years_of_Experience | Salary | |
|---|---|---|---|
| count | 50.00000 | 50.000000 | 50.000000 |
| mean | 25.50000 | 6.552038 | 78774.000000 |
| std | 14.57738 | 3.777367 | 15041.086553 |
| min | 1.00000 | 1.120000 | 53875.000000 |
| 25% | 13.25000 | 2.980262 | 65975.000000 |
| 50% | 25.50000 | 6.184805 | 80150.000000 |
| 75% | 37.75000 | 10.099589 | 89156.250000 |
| max | 50.00000 | 13.880903 | 105750.000000 |
# Select specific columns
salaries = employee[['First_Name', 'Last_Name', 'Salary']]
salaries.head()
| First_Name | Last_Name | Salary | |
|---|---|---|---|
| 0 | Michael | Davis | 56550.0 |
| 1 | Karen | Brown | 55225.0 |
| 2 | Joseph | Johnson | 87600.0 |
| 3 | David | Garcia | 53875.0 |
| 4 | Linda | Martinez | 61400.0 |
# Select specific rows and columns
employee.iloc[0:5, [1, 5]] # First 5 rows, columns 1 and 5
| Department | Salary | |
|---|---|---|
| 0 | Finance | 56550.0 |
| 1 | IT | 55225.0 |
| 2 | IT | 87600.0 |
| 3 | Finance | 53875.0 |
| 4 | Marketing | 61400.0 |
Exercises#
Exercise 12
Exercise 1: Create a DataFrame
Create a DataFrame called movies with the following data:
Title |
Year |
Rating |
|---|---|---|
The Matrix |
1999 |
8.7 |
Inception |
2010 |
8.8 |
Interstellar |
2014 |
8.6 |
Then:
Print the DataFrame
Print only the ‘Title’ column
Print the shape of the DataFrame
Solution to Exercise 12
import pandas as pd
movies = pd.DataFrame({
'Title': ['The Matrix', 'Inception', 'Interstellar'],
'Year': [1999, 2010, 2014],
'Rating': [8.7, 8.8, 8.6]
})
# Print the DataFrame
print(movies)
# Print only the 'Title' column
print(movies['Title'])
# Print the shape
print(movies.shape) # (3, 3)
Exercise 13
Exercise 2: Index-Based Selection
Using the movies DataFrame from Exercise 1:
Select the first row using
.ilocSelect the last two rows using
.ilocSelect the ‘Title’ and ‘Rating’ columns for all rows using
.iloc
Solution to Exercise 13
# 1. First row
print(movies.iloc[0])
# 2. Last two rows
print(movies.iloc[-2:])
# Or: movies.iloc[1:3]
# 3. Title and Rating columns (columns 0 and 2)
print(movies.iloc[:, [0, 2]])
Exercise 14
Exercise 3: Exploring a Dataset
Load the built-in seaborn tips dataset and explore it:
import seaborn as sns
tips = sns.load_dataset('tips')
Answer these questions:
How many rows and columns does the dataset have?
What are the column names?
What are the summary statistics for the numerical columns?
What do the first 10 rows look like?
Solution to Exercise 14
import seaborn as sns
tips = sns.load_dataset('tips')
# 1. Shape
print(f"Rows: {tips.shape[0]}, Columns: {tips.shape[1]}")
# Output: Rows: 244, Columns: 7
# 2. Column names
print(tips.columns)
# Output: Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size'], dtype='object')
# 3. Summary statistics
print(tips.describe())
# 4. First 10 rows
print(tips.head(10))
Exercise 15
Exercise 4: .loc vs .iloc
Create this DataFrame:
df = pd.DataFrame({
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8],
'C': [9, 10, 11, 12]
}, index=['w', 'x', 'y', 'z'])
Then use both .loc and .iloc to:
Select the row with label ‘x’
Select rows ‘x’ through ‘z’ and columns ‘A’ and ‘B’
Select the value at row ‘y’, column ‘C’
Solution to Exercise 15
df = pd.DataFrame({
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8],
'C': [9, 10, 11, 12]
}, index=['w', 'x', 'y', 'z'])
# 1. Select row 'x'
print(df.loc['x']) # Using label
print(df.iloc[1]) # Using position
# 2. Rows 'x' through 'z', columns 'A' and 'B'
print(df.loc['x':'z', ['A', 'B']]) # Using labels
print(df.iloc[1:4, [0, 1]]) # Using positions
# 3. Value at row 'y', column 'C'
print(df.loc['y', 'C']) # Output: 11
print(df.iloc[2, 2]) # Output: 11