```mermaid
flowchart TB
A["Pandas DataFrame / Series"] --> B["I/O<br/>read_csv, read_excel<br/>to_csv, to_excel"]
A --> C["Inspect<br/>head, tail, info, describe"]
A --> D["Select<br/>[], .loc, .iloc"]
A --> E["Filter<br/>boolean masks"]
A --> F["Clean<br/>dropna, fillna, astype"]
A --> G["Transform<br/>new columns, apply, map"]
A --> H["Aggregate<br/>groupby, agg, pivot_table"]
A --> I["Combine<br/>concat, merge, join"]
```
48 Pandas
What This Chapter Covers
Pandas (Wes McKinney, 2010) is the library you will spend the most time in as an analyst. By the end of this chapter you will be able to:
- Create Series (1-D labelled) and DataFrame (2-D labelled) objects from scratch, from a dictionary, or from a file.
- Inspect a DataFrame with `head`, `tail`, `info`, `describe`, `shape` and `dtypes`.
- Select rows and columns using bracket notation, `.loc` (label-based) and `.iloc` (position-based).
- Filter with Boolean conditions and combine them using `&`, `|` and `~`.
- Handle missing data with `dropna`, `fillna` and `isna`.
- Create new columns from existing ones using vectorised expressions and `apply`.
- Group and aggregate with `groupby` (split-apply-combine).
- Combine DataFrames with `concat` and SQL-style `merge`.
- Read and write CSV and Excel files.
48.1 A Map of Pandas
Pandas revolves around two data structures — the Series and the DataFrame — and a set of verbs that operate on them. The diagram at the start of this chapter sorts the most common operations by purpose.
Pandas is an open-source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It’s built on top of NumPy, another library offering support for multi-dimensional arrays, and integrates well with other libraries in the Python Data Science stack like Matplotlib for plotting, SciPy for scientific computing, and scikit-learn for machine learning.
48.2 Core Features
- Data Structures: Pandas introduces two primary data structures: `DataFrame` and `Series`. A `DataFrame` is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). A `Series`, on the other hand, is a one-dimensional array with axis labels.
- Data Handling: Pandas excels in the handling of missing data, data alignment, merging, reshaping, and selecting, as well as data slicing and indexing.
- File Import/Export: It provides extensive capabilities to read and write data in a wide variety of formats, including CSV, Excel, SQL databases, JSON, HTML, and more.
- Time Series: Pandas offers comprehensive support for working with time series data, including date range generation, frequency conversion, moving window statistics, and more.
48.3 Installation
Pandas comes pre-installed in this book’s pyodide runtime, so there is no need to run pip install pandas here. In your own environment, install it with pip install pandas (or conda install pandas inside Anaconda).
48.3.1 Basic Usage
Importing Pandas is typically done using the pd alias, alongside NumPy:
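For example:

```python
import numpy as np
import pandas as pd

print(pd.__version__)
```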
Creating Data Structures
A Series is a one-dimensional labelled array and a DataFrame is a two-dimensional labelled table. The example below builds both from in-memory data.
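A minimal sketch (the product names and numbers are invented for illustration):

```python
import pandas as pd

# A Series: a one-dimensional labelled array
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A DataFrame: a two-dimensional labelled table, built from a dict of columns
df = pd.DataFrame({
    "Product": ["Widget", "Gadget", "Gizmo"],
    "Revenue": [120, 85, 143],
})

print(s)
print(df)
```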
Viewing Data
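The inspection methods from the chapter overview in action, on a small invented DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"A": range(10), "B": [x * 2 for x in range(10)]})

print(df.head())        # first 5 rows (default)
print(df.tail(3))       # last 3 rows
df.info()               # column dtypes and memory usage
print(df.describe())    # count, mean, std, min, quartiles, max
print(df.shape)         # (rows, columns)
```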
Data Selection
You can select a single column, a positional slice of rows, a specific row by label with .loc, or a specific row by integer position with .iloc.
Rule of thumb: .loc uses labels (both endpoints inclusive); .iloc uses integer positions (end exclusive, like a Python slice). When in doubt, be explicit — avoid relying on df[...] for row selection beyond quick peeks.
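A short sketch of each selection style, using an invented product table with string row labels:

```python
import pandas as pd

df = pd.DataFrame(
    {"Product": ["Widget", "Gadget", "Gizmo"], "Revenue": [120, 85, 143]},
    index=["w", "g", "z"],
)

print(df["Revenue"])              # single column -> Series
print(df[0:2])                    # positional row slice (quick peeks only)
print(df.loc["g"])                # row by label
print(df.iloc[1])                 # row by integer position
print(df.loc["w", "Revenue"])     # single cell, explicit row and column labels
```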
Boolean Filtering
Most real analytics questions are of the form “show me rows where X”. Pandas answers those with Boolean masks — the same idea as NumPy, lifted to DataFrames.
- `df[df["Revenue"] > 100]` keeps rows meeting the condition.
- Combine conditions with `&` (and), `|` (or), `~` (not). Always parenthesise each condition.
- `.isin([...])` tests for membership in a list.
- `.between(lo, hi)` tests for inclusive range membership.
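The four patterns above, applied to an invented sales table:

```python
import pandas as pd

df = pd.DataFrame({
    "Product": ["Widget", "Gadget", "Gizmo", "Doohickey"],
    "Region": ["North", "South", "North", "East"],
    "Revenue": [120, 85, 143, 60],
})

print(df[df["Revenue"] > 100])                                # single condition
print(df[(df["Revenue"] > 100) & (df["Region"] == "North")])  # parenthesised combo
print(df[df["Region"].isin(["North", "East"])])               # membership
print(df[df["Revenue"].between(60, 120)])                     # inclusive range
```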
Sorting
sort_values() sorts by the values in one or more columns; sort_index() sorts by the row index. Pass ascending=False to reverse, and a list to break ties across multiple columns.
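For example, with invented data:

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["North", "South", "North"],
    "Revenue": [120, 85, 143],
})

print(df.sort_values("Revenue", ascending=False))   # largest first
print(df.sort_values(["Region", "Revenue"],         # sort by Region,
                     ascending=[True, False]))      # break ties by Revenue
print(df.sort_index())                              # back to index order
```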
Adding and Transforming Columns
New columns are created by assignment — the right-hand side is any vectorised expression or an apply() call. Pandas lines up values by index automatically.
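A sketch with invented revenue and cost figures:

```python
import pandas as pd

df = pd.DataFrame({"Revenue": [120, 85, 143], "Cost": [70, 60, 90]})

df["Profit"] = df["Revenue"] - df["Cost"]            # vectorised expression
df["Margin"] = df["Profit"] / df["Revenue"]          # element-wise arithmetic
df["Tier"] = df["Revenue"].apply(                    # apply() for custom logic
    lambda r: "high" if r > 100 else "low"
)
print(df)
```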
Missing Data
Pandas uses np.nan to mark missing values. Most default computations ignore them, but you can remove or replace them with dropna() and fillna().
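For example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 5.0, 6.0]})

print(df.isna())          # Boolean mask of missing values
print(df.dropna())        # drop rows containing any NaN
print(df.fillna(0))       # replace NaN with a value
print(df["A"].mean())     # NaN skipped by default: (1 + 3) / 2 = 2.0
```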
Operations
Pandas DataFrames offer column-wise summary statistics and row-wise or column-wise function application via apply().
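For example, a column-wise range and a row-wise sum via `apply()`:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

print(df.mean())                                            # column-wise means
print(df.sum())                                             # column-wise sums
print(df.apply(lambda col: col.max() - col.min()))          # per column (default)
print(df.apply(lambda row: row["A"] + row["B"], axis=1))    # per row (axis=1)
```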
Grouping
groupby() splits a DataFrame by the values of one or more columns, applies an aggregation to each group, then combines the results back together — the canonical split-apply-combine pattern.
Single aggregations (sum, mean, count) are one-liners. For multiple aggregations on different columns, use .agg({...}).
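Both forms, on an invented regional sales table:

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["North", "South", "North", "South"],
    "Revenue": [120, 85, 143, 60],
    "Units": [12, 9, 15, 7],
})

# Single aggregation: total revenue per region
print(df.groupby("Region")["Revenue"].sum())

# Multiple aggregations on different columns
print(df.groupby("Region").agg({"Revenue": "mean", "Units": "sum"}))
```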
Merging
Pandas offers two main ways to combine DataFrames: concat() stacks them along an axis, and merge() joins them on one or more key columns, SQL-style.
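A sketch of both, with invented quarterly tables and a small category lookup:

```python
import pandas as pd

q1 = pd.DataFrame({"Product": ["Widget", "Gadget"], "Revenue": [120, 85]})
q2 = pd.DataFrame({"Product": ["Gizmo"], "Revenue": [143]})

# concat stacks rows; ignore_index renumbers the combined index
sales = pd.concat([q1, q2], ignore_index=True)

lookup = pd.DataFrame({"Product": ["Widget", "Gizmo"], "Category": ["A", "B"]})

# merge joins on a key column, SQL-style; unmatched rows get NaN
enriched = pd.merge(sales, lookup, on="Product", how="left")
print(enriched)
```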
Choosing a Join Type
| `how=` | Returns |
|---|---|
| `inner` | Only rows with a match in both DataFrames (the default) |
| `left` | All rows from the left DataFrame, plus any matches from the right |
| `right` | All rows from the right DataFrame, plus any matches from the left |
| `outer` | All rows from both, with NaN filled in where a match is missing |
Use how="left" for the common “enrich my main table with a lookup” pattern.
File I/O
In a normal Python environment you would read and write files directly on disk:
```python
pd.read_csv("filename.csv")
df.to_csv("my_dataframe.csv")
pd.read_excel("filename.xlsx", sheet_name="Sheet1")
df.to_excel("my_dataframe.xlsx", sheet_name="Sheet1")
```

Inside the in-browser pyodide runtime we cannot read an arbitrary file from your computer, but we can demonstrate the same round-trip using an in-memory CSV with io.StringIO:
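A minimal round-trip sketch (the CSV content is invented):

```python
import io
import pandas as pd

csv_text = "Product,Revenue\nWidget,120\nGadget,85\n"

# Read from an in-memory "file" exactly as read_csv would read from disk
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# Write back to an in-memory buffer and inspect the resulting text
buf = io.StringIO()
df.to_csv(buf, index=False)
print(buf.getvalue())
```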
48.4 Common Pitfalls with Pandas
- SettingWithCopyWarning → Assigning to a slice (e.g. `df[df.A > 0]["B"] = 1`) may silently fail. Use `.loc` on both sides: `df.loc[df.A > 0, "B"] = 1`.
- Forgetting parentheses in boolean filters → `df[df.A > 0 & df.B < 5]` raises an error because `&` binds tighter than `>`. Write `df[(df.A > 0) & (df.B < 5)]`.
- `and`/`or` vs `&`/`|` → `and` and `or` work on scalar booleans. For element-wise logic on a Series, use `&` and `|`.
- `.loc` is inclusive on both ends → `df.loc[0:3]` returns four rows (0, 1, 2, 3), unlike Python slicing. `.iloc[0:3]` returns three rows.
- Chained indexing → `df["A"]["x"]` sometimes works and sometimes doesn’t. Prefer `df.loc["x", "A"]`.
- Silent NaN in aggregations → `sum()` skips NaN by default. If NaN means “unknown, can’t compute”, pass `skipna=False` to force NaN in the result.
- Mutating while iterating → Don’t modify a DataFrame inside a `for` loop over its rows. Build a new column with a vectorised expression or `apply()` instead.
- Dtype surprises after reading CSV → A column with one blank cell may arrive as `object` or `float64` rather than `int64`. Check `df.dtypes` and cast with `.astype(int)` if needed.
Pandas is a foundational tool for data analysis in Python, offering comprehensive functions and methods for efficient data manipulation and analysis. Its robust features for handling complex data operations make it indispensable for analysts and data scientists.
Summary
| Concept | Description |
|---|---|
| Foundations | |
| Pandas | A high-performance data analysis library built on NumPy, centred on labelled tabular data structures |
| Series | A one-dimensional labelled array that behaves like a column of a spreadsheet with an index |
| DataFrame | A two-dimensional, size-mutable, heterogeneous tabular structure with row and column labels |
| Labelled Axes | Rows and columns carry names, so code can refer to data by label rather than by integer position |
| Built on NumPy | Pandas operations ultimately run on NumPy arrays, inheriting vectorised speed for numerical work |
| Inspecting a DataFrame | |
| head() and tail() | df.head(n) returns the first n rows and df.tail(n) returns the last n, for quick inspection |
| info() and describe() | df.info() prints dtypes and memory; df.describe() prints mean, std, min, max and quartiles for numeric columns |
| .index, .columns, .to_numpy() | df.index and df.columns expose the axis labels, and df.to_numpy() returns the underlying NumPy array |
| Selecting and Filtering | |
| Selection with [] | df['A'] selects a column as a Series; df[0:3] slices rows by position using bracket notation |
| .loc vs .iloc | df.loc[label] selects by row or column labels and is inclusive; df.iloc[position] selects by integer position and is exclusive |
| Boolean Filtering | Filter rows by passing a Boolean Series to [] — combine with &, |, ~ and always parenthesise each condition |
| .isin() and .between() | .isin([...]) tests membership in a list; .between(lo, hi) tests inclusive range membership |
| sort_values() | Sort by one or more columns with ascending flags, the most common way to produce top-N or bottom-N views |
| Transforming and Cleaning | |
| Creating New Columns | Assign to df['new'] with any vectorised expression; Pandas aligns values by the row index automatically |
| Missing Data np.nan | Pandas represents missing values with np.nan and excludes them from most default computations |
| dropna() and fillna() | Use dropna() to remove rows with missing values and fillna(value) to substitute a replacement value |
| Summary Statistics | Built-in reductions such as df.mean(), df.sum() and df.describe() summarise numeric columns |
| apply() | Apply an arbitrary function column-wise or row-wise with df.apply(func) for custom transformations |
| Grouping and Aggregating | |
| groupby() | df.groupby(col).agg(...) splits the DataFrame into groups, applies an aggregation and recombines the results |
| .agg() with Multiple Aggregations | Pass a dict or named tuples to .agg() to compute several different aggregations across several columns in one call |
| Combining DataFrames | |
| concat() | pd.concat stacks DataFrames along an axis, typically for appending new rows below existing ones |
| merge() | pd.merge performs SQL-style joins on one or more key columns, connecting two DataFrames by matching values |
| Join Types inner, left, right, outer | how='inner' keeps only matched rows; 'left' keeps all rows on the left; 'right' on the right; 'outer' keeps all rows from both |
| File I/O | |
| read_csv() and to_csv() | pd.read_csv(path) and df.to_csv(path) load and save comma-separated files with a single call |
| read_excel() and to_excel() | pd.read_excel(path, sheet_name) and df.to_excel(path, sheet_name) do the same for Excel workbooks |
| Gotchas | |
| Common Pitfalls | Beware SettingWithCopyWarning, missing parentheses in filters, and .loc being inclusive at both ends |