48  Pandas

What This Chapter Covers

Pandas (Wes McKinney, 2010) is the library you will spend the most time in as an analyst. By the end of this chapter you will be able to:

  • Create Series (1-D labelled) and DataFrame (2-D labelled) objects from scratch, from a dictionary, or from a file.
  • Inspect a DataFrame with head, tail, info, describe, shape and dtypes.
  • Select rows and columns using bracket notation, .loc (label-based) and .iloc (position-based).
  • Filter with Boolean conditions and combine them using &, | and ~.
  • Handle missing data with dropna, fillna and isna.
  • Create new columns from existing ones using vectorised expressions and apply.
  • Group and aggregate with groupby (split-apply-combine).
  • Combine DataFrames with concat and SQL-style merge.
  • Read and write CSV and Excel files.

48.1 A Map of Pandas

Pandas revolves around two data structures — the Series and the DataFrame — and a set of verbs that operate on them. The diagram below sorts the most common operations by purpose.

flowchart TB
    A["Pandas DataFrame / Series"] --> B["I/O<br/>read_csv, read_excel<br/>to_csv, to_excel"]
    A --> C["Inspect<br/>head, tail, info, describe"]
    A --> D["Select<br/>[], .loc, .iloc"]
    A --> E["Filter<br/>boolean masks"]
    A --> F["Clean<br/>dropna, fillna, astype"]
    A --> G["Transform<br/>new columns, apply, map"]
    A --> H["Aggregate<br/>groupby, agg, pivot_table"]
    A --> I["Combine<br/>concat, merge, join"]


Pandas is an open-source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It’s built on top of NumPy, another library offering support for multi-dimensional arrays, and integrates well with other libraries in the Python Data Science stack like Matplotlib for plotting, SciPy for scientific computing, and scikit-learn for machine learning.

48.2 Core Features

  • Data Structures: Pandas introduces two primary data structures: DataFrame and Series. A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). A Series, on the other hand, is a one-dimensional array with axis labels.

  • Handling of Data: Pandas excels in the handling of missing data, data alignment, and merging, reshaping, selecting, as well as data slicing and indexing.

  • File Import/Export: It provides extensive capabilities to read and write data in a wide variety of formats, including CSV, Excel, SQL databases, JSON, HTML, and more.

  • Time Series: Pandas offers comprehensive support for working with time series data, including date range generation, frequency conversion, moving window statistics, and more.

48.3 Installation

Pandas comes pre-installed in this book’s pyodide runtime, so there is no need to run pip install pandas here. In your own environment, install it with pip install pandas (or conda install pandas inside Anaconda).

48.3.1 Basic Usage

Importing Pandas is typically done using the pd alias, alongside NumPy:
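A minimal sketch of the conventional imports (np is used later for numeric helpers such as np.nan):

```python
import numpy as np
import pandas as pd

# Confirm the library loaded and check which version is installed
print(pd.__version__)
```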

Creating Data Structures

A Series is a one-dimensional labelled array and a DataFrame is a two-dimensional labelled table. The example below builds both from in-memory data.
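For instance (the names and ages here are purely illustrative):

```python
import pandas as pd

# A Series: a one-dimensional labelled array
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A DataFrame built from a dictionary of columns
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol"],
    "Age": [34, 28, 41],
})

print(s)
print(df)
```

Note that the Series index ("a", "b", "c") was supplied explicitly, while the DataFrame received a default integer index.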

Viewing Data

head() and tail() are the two fastest ways to look at a DataFrame, while index, columns and to_numpy() expose its underlying structure. For a one-line summary, info() shows dtypes and memory, and describe() prints numeric summary statistics.
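A quick tour of these inspection methods, using a small synthetic table:

```python
import pandas as pd

df = pd.DataFrame({"A": range(10), "B": [x * 2 for x in range(10)]})

print(df.head(3))      # first three rows
print(df.tail(2))      # last two rows
print(df.shape)        # (rows, columns) tuple
print(df.dtypes)       # dtype of each column
df.info()              # dtypes plus memory usage, printed directly
print(df.describe())   # count, mean, std, min, quartiles, max
```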

Data Selection

You can select a single column, a positional slice of rows, a specific row by label with .loc, or a specific row by integer position with .iloc.

Rule of thumb: .loc uses labels (both endpoints inclusive); .iloc uses integer positions (end exclusive, like a Python slice). When in doubt, be explicit — avoid relying on df[...] for row selection beyond quick peeks.
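The selection styles above can be sketched as follows (the quarterly revenue figures are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {"Revenue": [120, 95, 180], "Units": [10, 8, 15]},
    index=["Q1", "Q2", "Q3"],
)

col = df["Revenue"]           # single column -> Series
rows = df[0:2]                # positional row slice (quick peeks only)
by_label = df.loc["Q2"]       # row by label
by_pos = df.iloc[1]           # the same row by integer position
cell = df.loc["Q3", "Units"]  # single cell, explicit row and column labels
print(cell)
```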

Boolean Filtering

Most real analytics questions are of the form “show me rows where X”. Pandas answers those with Boolean masks — the same idea as NumPy, lifted to DataFrames.

  • df[df["Revenue"] > 100] keeps rows meeting the condition.
  • Combine conditions with & (and), | (or), ~ (not). Always parenthesise each condition.
  • .isin([...]) tests for membership in a list.
  • .between(lo, hi) tests for inclusive range membership.
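
The masks above in action, on an illustrative regional-sales table:

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["North", "South", "North", "West"],
    "Revenue": [120, 80, 150, 95],
})

high = df[df["Revenue"] > 100]                  # single condition
north_and_high = df[(df["Region"] == "North")   # combined conditions:
                    & (df["Revenue"] > 100)]    # each one parenthesised
in_list = df[df["Region"].isin(["North", "West"])]
mid_range = df[df["Revenue"].between(80, 120)]  # inclusive on both ends
print(high)
```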

Sorting

sort_values() sorts by the values in one or more columns; sort_index() sorts by the row index. Pass ascending=False to reverse, and a list to break ties across multiple columns.
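For example:

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["South", "North", "North"],
    "Revenue": [80, 150, 120],
})

top = df.sort_values("Revenue", ascending=False)   # biggest first
tiered = df.sort_values(["Region", "Revenue"],     # sort by Region,
                        ascending=[True, False])   # break ties by Revenue
print(top)
```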

Adding and Transforming Columns

New columns are created by assignment — the right-hand side is any vectorised expression or an apply() call. Pandas lines up values by index automatically.
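Both styles side by side, on a made-up sales table (the column names and the 100-revenue threshold are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Revenue": [120, 80, 150], "Units": [10, 8, 15]})

# Vectorised expression: computed element-wise across aligned rows
df["PricePerUnit"] = df["Revenue"] / df["Units"]

# apply() with a per-value function, for logic that isn't vectorisable
df["Tier"] = df["Revenue"].apply(lambda r: "high" if r > 100 else "low")
print(df)
```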

Missing Data

Pandas uses np.nan to mark missing values. Most default computations ignore them, but you can remove or replace them with dropna() and fillna().
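A small sketch of the three tools together:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 5.0, 6.0]})

print(df.isna())        # Boolean mask marking missing cells
dropped = df.dropna()   # remove every row containing a NaN
filled = df.fillna(0)   # replace NaN with a chosen value
print(df["A"].mean())   # NaN is skipped by default: (1 + 3) / 2 = 2.0
```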

Operations

Pandas DataFrames offer column-wise summary statistics and row-wise or column-wise function application via apply().
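For example, using axis=1 to switch apply() from column-wise to row-wise:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

print(df.mean())                                      # column-wise means
col_range = df.apply(lambda c: c.max() - c.min())     # one value per column
row_sum = df.apply(lambda r: r["A"] + r["B"], axis=1) # one value per row
print(row_sum)
```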

Grouping

groupby() splits a DataFrame by the values of one or more columns, applies an aggregation to each group, then combines the results back together — the canonical split-apply-combine pattern.

Single aggregations (sum, mean, count) are one-liners. For multiple aggregations on different columns, use .agg({...}).
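Both patterns on an illustrative regional-sales table:

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["North", "South", "North", "South"],
    "Revenue": [120, 80, 150, 95],
    "Units": [10, 8, 15, 9],
})

# Single aggregation: total revenue per region
totals = df.groupby("Region")["Revenue"].sum()

# Multiple aggregations on different columns via .agg({...})
multi = df.groupby("Region").agg({"Revenue": "mean", "Units": "sum"})
print(totals)
print(multi)
```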

Merging

Pandas offers two main ways to combine DataFrames: concat() stacks them along an axis, and merge() joins them on one or more key columns, SQL-style.

Choosing a Join Type

how=    Returns
inner   Only rows with a match in both DataFrames (the default)
left    All rows from the left DataFrame, plus any matches from the right
right   All rows from the right DataFrame, plus any matches from the left
outer   All rows from both, with NaN filled in where a match is missing

Use how="left" for the common “enrich my main table with a lookup” pattern.
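A sketch of both operations, including the lookup pattern just described (the region and manager data are invented for illustration):

```python
import pandas as pd

sales = pd.DataFrame({"Region": ["North", "South", "East"],
                      "Revenue": [120, 80, 60]})
managers = pd.DataFrame({"Region": ["North", "South"],
                         "Manager": ["Ada", "Grace"]})

# concat: stack rows from one DataFrame below another
stacked = pd.concat([sales, sales], ignore_index=True)

# merge with how="left": keep every sales row, attach managers where matched;
# East has no manager, so its Manager cell becomes NaN
enriched = sales.merge(managers, on="Region", how="left")
print(enriched)
```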

File I/O

In a normal Python environment you would read and write files directly on disk:

pd.read_csv("filename.csv")
df.to_csv("my_dataframe.csv")

pd.read_excel("filename.xlsx", sheet_name="Sheet1")
df.to_excel("my_dataframe.xlsx", sheet_name="Sheet1")

Inside the in-browser pyodide runtime we cannot read an arbitrary file from your computer, but we can demonstrate the same round-trip using an in-memory CSV with io.StringIO:
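A minimal round-trip sketch (the Name/Age data is illustrative):

```python
import io
import pandas as pd

csv_text = "Name,Age\nAlice,34\nBob,28\n"

# read_csv accepts any file-like object, so StringIO stands in for a file
df = pd.read_csv(io.StringIO(csv_text))

# to_csv can likewise write into an in-memory buffer instead of a path
buffer = io.StringIO()
df.to_csv(buffer, index=False)
round_trip = buffer.getvalue()
print(round_trip)
```

In a normal environment you would simply pass a file path instead of the StringIO object; everything else is identical.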


48.4 Common Pitfalls with Pandas

  • SettingWithCopyWarning → Assigning to a slice (e.g. df[df.A > 0]["B"] = 1) may silently fail. Use .loc on both sides: df.loc[df.A > 0, "B"] = 1.
  • Forgetting parentheses in boolean filters → df[df.A > 0 & df.B < 5] raises an error because & binds tighter than >. Write df[(df.A > 0) & (df.B < 5)].
  • and/or vs &/| → and and or work on scalar booleans. For element-wise logic on a Series, use & and |.
  • .loc is inclusive on both ends → df.loc[0:3] returns four rows (0, 1, 2, 3), unlike Python slicing. .iloc[0:3] returns three rows.
  • Chained indexing → df["A"]["x"] sometimes works and sometimes doesn’t. Prefer df.loc["x", "A"].
  • Silent NaN in aggregations → sum() skips NaN by default. If NaN means “unknown, can’t compute”, pass skipna=False to force NaN in the result.
  • Mutating while iterating → Don’t modify a DataFrame inside a for loop over its rows. Build a new column with a vectorised expression or apply() instead.
  • Dtype surprises after reading CSV → A column with one blank cell may arrive as object or float64 rather than int64. Check df.dtypes and cast with .astype(int) if needed.

Pandas is a foundational tool for data analysis in Python, offering comprehensive functions and methods for efficient data manipulation and analysis. Its robust features for handling complex data operations make it indispensable for analysts and data scientists.


Summary

Concept Description
Foundations
Pandas A high-performance data analysis library built on NumPy, centred on labelled tabular data structures
Series A one-dimensional labelled array that behaves like a column of a spreadsheet with an index
DataFrame A two-dimensional, size-mutable, heterogeneous tabular structure with row and column labels
Labelled Axes Rows and columns carry names, so code can refer to data by label rather than by integer position
Built on NumPy Pandas operations ultimately run on NumPy arrays, inheriting vectorised speed for numerical work
Inspecting a DataFrame
head() and tail() df.head(n) returns the first n rows and df.tail(n) returns the last n, for quick inspection
info() and describe() df.info() prints dtypes and memory; df.describe() prints mean, std, min, max and quartiles for numeric columns
.index, .columns, .to_numpy() df.index and df.columns expose the axis labels, and df.to_numpy() returns the underlying NumPy array
Selecting and Filtering
Selection with [] df['A'] selects a column as a Series; df[0:3] slices rows by position using bracket notation
.loc vs .iloc df.loc[label] selects by row or column labels and is inclusive; df.iloc[position] selects by integer position and is exclusive
Boolean Filtering Filter rows by passing a Boolean Series to [] — combine with &, |, ~ and always parenthesise each condition
.isin() and .between() .isin([...]) tests membership in a list; .between(lo, hi) tests inclusive range membership
sort_values() Sort by one or more columns with ascending flags, the most common way to produce top-N or bottom-N views
Transforming and Cleaning
Creating New Columns Assign to df['new'] with any vectorised expression; Pandas aligns values by the row index automatically
Missing Data (np.nan) Pandas represents missing values with np.nan and excludes them from most default computations
dropna() and fillna() Use dropna() to remove rows with missing values and fillna(value) to substitute a replacement value
Summary Statistics Built-in reductions such as df.mean(), df.sum() and df.describe() summarise numeric columns
apply() Apply an arbitrary function column-wise or row-wise with df.apply(func) for custom transformations
Grouping and Aggregating
groupby() df.groupby(col).agg(...) splits the DataFrame into groups, applies an aggregation and recombines the results
.agg() with Multiple Aggregations Pass a dict or named tuples to .agg() to compute several different aggregations across several columns in one call
Combining DataFrames
concat() pd.concat stacks DataFrames along an axis, typically for appending new rows below existing ones
merge() pd.merge performs SQL-style joins on one or more key columns, connecting two DataFrames by matching values
Join Types inner, left, right, outer how='inner' keeps only matched rows; 'left' keeps all rows on the left; 'right' on the right; 'outer' keeps all rows from both
File I/O
read_csv() and to_csv() pd.read_csv(path) and df.to_csv(path) load and save comma-separated files with a single call
read_excel() and to_excel() pd.read_excel(path, sheet_name) and df.to_excel(path, sheet_name) do the same for Excel workbooks
Gotchas
Common Pitfalls Beware SettingWithCopyWarning, missing parentheses in filters, and .loc being inclusive at both ends