47  NumPy

What This Chapter Covers

NumPy (Charles R. Harris et al., 2020) is the numerical backbone of almost every Python analytics library. By the end of this chapter you will be able to:

  • Create arrays from lists, ranges and special constructors (zeros, ones, linspace).
  • Inspect an array’s shape, dtype and dimensions.
  • Use indexing, slicing and boolean masking to pull out the values you need.
  • Apply element-wise arithmetic and mathematical functions across whole arrays at once.
  • Reshape, join and split arrays.
  • Compute summary statistics (mean, median, std, var, min, max) along one or more axes.
  • Understand broadcasting — how NumPy aligns arrays of different shapes.
  • Generate reproducible random numbers for simulation and sampling.

47.1 A Map of NumPy

Everything in NumPy revolves around the ndarray — an n-dimensional array of homogeneous, typed values. The diagram below groups the functionality you will actually use day-to-day.

flowchart TB
    A["NumPy ndarray"] --> B["Creation<br/>array, arange, zeros, ones, linspace"]
    A --> C["Attributes<br/>shape, dtype, ndim, size"]
    A --> D["Access<br/>indexing, slicing, boolean masks"]
    A --> E["Math<br/>+, -, *, /, sqrt, exp, log"]
    A --> F["Statistics<br/>mean, median, std, var, min, max"]
    A --> G["Shape ops<br/>reshape, concatenate, split"]
    A --> H["Random<br/>rng.random, normal, choice"]


NumPy (Numerical Python) is the fundamental package for numerical computing in Python. It provides multi-dimensional arrays, mathematical functions, linear algebra operations and random number generation. Because its array computations are fast and its arrays are accepted by most other data libraries, it underpins scientific computing, data analysis and machine learning, and is well worth learning for anyone working in data science.

47.1.1 Key Features of NumPy

  1. Efficient Array Handling: Provides ndarray, a multi-dimensional array object that stores typed values in a contiguous block and is more efficient than a Python list for numerical data.
  2. Vectorized Operations: Eliminates the need for explicit loops by applying operations element-wise.
  3. Broadcasting: Allows arithmetic operations on arrays of different shapes without explicit looping.
  4. Mathematical Functions: Provides a wide range of functions for algebra, statistics, trigonometry, and more.
  5. Random Number Generation: Generates pseudo-random numbers for simulations and machine learning.
  6. Integration with Other Libraries: Works seamlessly with pandas, matplotlib, scikit-learn, and TensorFlow.

47.1.2 Installing NumPy

To install NumPy with pip:

pip install numpy

Or, if you are using Anaconda:

conda install numpy

The near-universal convention is to import NumPy under the alias np:

import numpy as np

47.2 Array Creation

  • np.array([1, 2, 3]): Create a NumPy array from a list or tuple.
  • np.arange(10): Create an array of the integers 0 through 9 (the stop value is excluded).
  • np.zeros((3, 4)): Create a 3-row, 4-column array of zeros.
  • np.ones((2, 3)): Create a 2-row, 3-column array of ones.
  • np.linspace(0, 1, 5): Five evenly spaced numbers from 0 to 1 (inclusive on both ends).
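
The constructors listed above can be tried side by side. The following is a minimal sketch; the variable names are illustrative and not part of NumPy.

```python
import numpy as np

from_list = np.array([1, 2, 3])   # 1-D array built from a Python list
counting = np.arange(10)          # integers 0 through 9 (stop value excluded)
zeros = np.zeros((3, 4))          # 3 rows, 4 columns, every element 0.0
ones = np.ones((2, 3))            # 2 rows, 3 columns, every element 1.0
grid = np.linspace(0, 1, 5)       # 0.0, 0.25, 0.5, 0.75, 1.0 (both ends included)

print(from_list)
print(counting)
print(zeros.shape, ones.shape)
print(grid)
```

Note that np.zeros and np.ones take the shape as a single tuple, which is why the parentheses are doubled.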

47.3 Array Attributes

Every ndarray carries metadata that describes it. Knowing these attributes makes debugging much easier:

  • a.shape — the size along each axis, returned as a tuple.
  • a.ndim — number of dimensions (axes).
  • a.size — total number of elements.
  • a.dtype — the element type (e.g. int64, float64, bool).
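
As a quick illustration of these attributes, the sketch below inspects a small 2-D array. The array contents are arbitrary example values.

```python
import numpy as np

# A 2-row, 3-column array of floats (example values only)
a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

print(a.shape)   # (2, 3): 2 rows, 3 columns
print(a.ndim)    # 2: two axes
print(a.size)    # 6: total number of elements
print(a.dtype)   # float64
```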

47.4 Indexing and Slicing

Indexing pulls out individual elements; slicing pulls out sub-arrays. For 2-D arrays, use a comma between row and column indexes.

  • a[0] — first row (for a 2-D array) or first element (for a 1-D array).
  • a[0, 1] — element at row 0, column 1.
  • a[:, 0] — every row, column 0 → a single column.
  • a[1:3] — rows 1 and 2 (slice is exclusive at the end).
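
The four access patterns above can be checked on a small array. This is a minimal sketch with made-up values chosen so that each element shows its row and column.

```python
import numpy as np

# Example values: the tens digit is the row, the units digit is the column
a = np.array([[10, 11, 12, 13],
              [20, 21, 22, 23],
              [30, 31, 32, 33]])

first_row = a[0]      # [10 11 12 13]
element = a[0, 1]     # 11 (row 0, column 1)
first_col = a[:, 0]   # [10 20 30] (every row, column 0)
rows_1_2 = a[1:3]     # rows 1 and 2; row 3 would be excluded even if it existed

print(first_row)
print(element)
print(first_col)
print(rows_1_2)
```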

47.4.1 Boolean Masking

Comparing a NumPy array against a value returns a Boolean array of the same shape. Using that Boolean array as an index keeps only the elements where the condition is True. This is the standard way to filter data in NumPy and pandas.
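
A short sketch of masking in practice follows. The temperature readings are invented for illustration.

```python
import numpy as np

temps = np.array([18.5, 22.1, 30.4, 15.0, 27.8])  # made-up readings

mask = temps > 20          # Boolean array, one True/False per element
warm = temps[mask]         # keeps only the elements where mask is True

# Combine conditions with & (and) or | (or), parenthesising each side
mild = temps[(temps > 20) & (temps < 30)]

print(mask)   # [False  True  True False  True]
print(warm)   # [22.1 30.4 27.8]
print(mild)   # [22.1 27.8]
```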


47.5 Array Manipulation

  • np.concatenate((a1, a2), axis=0): Join a sequence of arrays along an existing axis.
  • np.split(array, indices_or_sections): Split an array into multiple sub-arrays.
  • a.reshape(rows, cols): Return an array with a new shape (the total size must match). The result is a view of a when possible, otherwise a copy.
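
The three operations can be sketched as follows. The arrays and variable names are illustrative only.

```python
import numpy as np

a1 = np.array([[1, 2],
               [3, 4]])
a2 = np.array([[5, 6]])

# Join along axis 0 (rows): shapes (2, 2) and (1, 2) give (3, 2)
stacked = np.concatenate((a1, a2), axis=0)

flat = np.arange(6)                  # [0 1 2 3 4 5]
parts = np.split(flat, 3)            # three equal pieces of length 2
head, tail = np.split(flat, [2])     # split at index 2: [0 1] and [2 3 4 5]

grid = flat.reshape(2, 3)            # 2 rows, 3 columns
auto = flat.reshape(-1, 2)           # -1 asks NumPy to infer the row count (3)

print(stacked)
print(parts)
print(head, tail)
print(grid.shape, auto.shape)        # (2, 3) (3, 2)
```

Passing an integer to np.split requires the array to divide evenly; passing a list of indices does not.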

47.6 Mathematical Operations

  • np.add(a, b), np.subtract(a, b), np.multiply(a, b), np.divide(a, b): Perform element-wise addition, subtraction, multiplication, and division.
  • np.sqrt(a): Square root of each element in the array.
  • np.exp(a): Calculate the exponential of all elements in the array.
  • np.log(a): Natural logarithm of each element in the array.
  • np.power(a, b): Elements of a raised to the powers from b, element-wise.

All of these are vectorised — they run a compiled inner loop in C, not a Python loop — which is why NumPy is typically 10–100× faster than plain Python for numerical work.
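
A minimal sketch of these functions on two small arrays is given below. The arithmetic functions are equivalent to the usual operators (np.add(a, b) is a + b, and so on); the input values are chosen only to give round results.

```python
import numpy as np

a = np.array([1.0, 4.0, 9.0])
b = np.array([2.0, 2.0, 2.0])

total = np.add(a, b)          # same as a + b: 3, 6, 11
ratio = np.divide(a, b)       # same as a / b: 0.5, 2, 4.5
roots = np.sqrt(a)            # 1, 2, 3
powers = np.power(a, b)       # same as a ** b: 1, 16, 81
round_trip = np.log(np.exp(a))  # recovers a, up to floating-point error

print(total, ratio, roots, powers, round_trip)
```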

47.6.1 Broadcasting

Broadcasting is how NumPy handles arithmetic between arrays whose shapes don’t match exactly. Instead of requiring you to replicate the smaller array by hand, NumPy virtually stretches it to fit — without actually copying memory.

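The simplest case: array + scalar applies the scalar to every element. More generally, two arrays are compatible if, comparing their shapes right-to-left, each pair of dimensions is either equal or one of them is 1; dimensions missing from the shorter shape are treated as 1. For example, shapes (3, 4) and (4,) are compatible, while (3, 4) and (3,) are not.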
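
The compatibility rule can be checked by looking at the shapes that result from a few additions. This is a sketch with arbitrary example arrays; the names matrix, row and col are illustrative.

```python
import numpy as np

matrix = np.ones((3, 4))            # shape (3, 4)
row = np.array([0, 10, 20, 30])     # shape (4,)
col = np.array([[1], [2], [3]])     # shape (3, 1)

plus_scalar = matrix + 5            # scalar applied to every element
plus_row = matrix + row             # (3, 4) with (4,)   -> (3, 4)
plus_col = matrix + col             # (3, 4) with (3, 1) -> (3, 4)
outer = row + col                   # (4,)   with (3, 1) -> (3, 4)

print(plus_row.shape, plus_col.shape, outer.shape)

# Shapes (3, 4) and (3,) fail: reading right-to-left, 4 and 3 differ
# and neither is 1.
incompatible = False
try:
    matrix + np.array([1, 2, 3])
except ValueError as err:
    incompatible = True
    print("incompatible:", err)
```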


47.7 Statistical Functions

  • np.mean(a): Arithmetic mean of all elements, or along one axis when axis is given.
  • np.median(a): Median of all elements, or along one axis when axis is given.
  • np.std(a): Standard deviation. The default is the population form (ddof=0); pass ddof=1 for the sample standard deviation.
  • np.var(a): Variance, with the same ddof convention as np.std.
  • np.min(a), np.max(a): Find the minimum or maximum values.
  • np.argmin(a), np.argmax(a): Find the indices of the minimum or maximum values.
  • Passing axis=0 collapses rows (giving one value per column); axis=1 collapses columns (one value per row).
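
The effect of the axis argument is easiest to see on a small 2-D array. The sketch below uses made-up scores with 2 rows and 3 columns.

```python
import numpy as np

scores = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])   # example values only

overall_mean = np.mean(scores)           # 3.5, over all six elements
col_means = np.mean(scores, axis=0)      # [2.5 3.5 4.5], one value per column
row_means = np.mean(scores, axis=1)      # [2. 5.], one value per row

overall_median = np.median(scores)       # 3.5
col_max = np.max(scores, axis=0)         # [4. 5. 6.]
row_min = np.min(scores, axis=1)         # [1. 4.]

# Without axis, argmax returns an index into the flattened array
flat_argmax = np.argmax(scores)          # 5
row_argmax = np.argmax(scores, axis=1)   # [2 2]

spread = np.std(scores)                  # population standard deviation (ddof=0)

print(overall_mean, col_means, row_means)
print(overall_median, col_max, row_min)
print(flat_argmax, row_argmax, spread)
```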

47.8 Random Number Generation

NumPy’s modern random API uses a generator object. Create one once with a seed for reproducibility, then draw samples from it.

  • rng.random(size): Uniform samples in [0, 1).
  • rng.integers(low, high, size): Random integers from low (inclusive) up to high (exclusive).
  • rng.normal(loc, scale, size): Samples from a normal distribution with mean loc and standard deviation scale.
  • rng.choice(a, size): Random selection from an array.
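
A minimal sketch of the generator workflow follows, where rng is a generator created with np.random.default_rng. The seed value 42 and the distribution parameters are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)    # one seeded generator, created once

uniform = rng.random(3)                             # 3 floats in [0, 1)
dice = rng.integers(1, 7, size=5)                   # integers 1..6 (7 is excluded)
heights = rng.normal(loc=170, scale=10, size=4)     # mean 170, std dev 10
picked = rng.choice(["red", "green", "blue"], size=2)

# A second generator with the same seed reproduces the same first draws
rng_again = np.random.default_rng(seed=42)
same = np.array_equal(uniform, rng_again.random(3))

print(uniform, dice, heights, picked)
print(same)   # True
```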

47.9 Common Pitfalls with NumPy

  • Mixing dtypes unintentionally → np.array([1, 2, 3.0]) is promoted to float64. Check a.dtype if results look off.
  • Integer overflow → Fixed-width integer dtypes can wrap around silently in array arithmetic. Use a wider dtype such as np.int64, or a float, if the values could get large.
  • Slices are views, not copies → Modifying a slice modifies the original array. Use .copy() when you need an independent array.
  • Shape mismatches in broadcasting → If you get a ValueError: operands could not be broadcast, print .shape on both operands first.
  • axis=0 vs axis=1 → axis=0 collapses down the rows (per column); axis=1 collapses across the columns (per row). The axis you name is the one that disappears.
  • Using and / or on arrays → Raises ValueError. Use &, | (and parenthesise each side) for element-wise logic.
  • np.random.seed(42) (the old API) → Prefer np.random.default_rng(seed=42). The generator API avoids hidden global state, so each generator can be seeded and passed around explicitly (for example, one per thread or worker).
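
Several of these pitfalls can be reproduced in a few lines. The sketch below uses small made-up arrays; int8 is chosen only because its narrow range (-128 to 127) makes the wrap-around easy to see.

```python
import numpy as np

# Views vs copies: a basic slice shares memory with the original
a = np.arange(5)                 # [0 1 2 3 4]
view = a[1:3]
view[0] = 99                     # also changes a[1]

b = np.arange(5)
independent = b[1:3].copy()
independent[0] = 99              # b is left untouched

# dtype promotion: one float makes the whole array float64
mixed = np.array([1, 2, 3.0])

# Integer overflow: 100 + 100 = 200 does not fit in int8 and wraps to -56
small = np.array([100], dtype=np.int8)
wrapped = small + small

# 'and' on arrays raises ValueError; use & with parentheses instead
vals = np.array([1, 5, 10])
and_failed = False
try:
    vals > 2 and vals < 8
except ValueError:
    and_failed = True
in_range = (vals > 2) & (vals < 8)   # [False  True False]

print(a, b, mixed.dtype, wrapped, and_failed, in_range)
```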

Summary

| Group | Concept | Description |
| --- | --- | --- |
| Foundations | NumPy | The core scientific-computing library for Python, providing efficient multi-dimensional arrays and numerical functions |
| Foundations | ndarray | NumPy's n-dimensional array, a contiguous block of typed data that is far more efficient than a Python list for numerical work |
| Why NumPy is Fast | Vectorized Operations | Operations are applied element-wise across whole arrays, eliminating explicit Python loops and running at compiled speed |
| Why NumPy is Fast | Broadcasting | Arithmetic between arrays of different shapes is aligned automatically using broadcasting rules, avoiding manual reshaping |
| Why NumPy is Fast | Integration with Other Libraries | NumPy interoperates seamlessly with pandas, matplotlib, scikit-learn and TensorFlow as the common numerical foundation |
| Creating and Inspecting Arrays | np.array() | Construct an array from a list or tuple, such as np.array([1, 2, 3]) |
| Creating and Inspecting Arrays | np.arange() | Create an array of evenly spaced integers or floats across a range, such as np.arange(10) |
| Creating and Inspecting Arrays | np.zeros / np.ones / np.linspace | Build pre-filled arrays of zeros, ones, or evenly spaced floats for initialisation and plotting |
| Creating and Inspecting Arrays | shape, ndim, size, dtype | Attributes that describe an array's size in each direction, its number of dimensions, total elements and element type |
| Accessing Elements | Indexing and Slicing | Access elements with a[i, j] and sub-arrays with slices like a[:, 0] or a[1:3] |
| Accessing Elements | Boolean Masking | Filter an array by passing a Boolean array of the same shape as the index, keeping only True elements |
| Reshaping and Joining | np.concatenate() | Join two or more arrays along an existing axis into a single larger array |
| Reshaping and Joining | np.split() | Cut a single array into a list of sub-arrays at specified indices or equal sections |
| Reshaping and Joining | reshape() | Return an array with a new shape (a view when possible, otherwise a copy); -1 lets NumPy infer one dimension from the total size |
| Mathematical Operations | Element-wise Arithmetic | np.add, np.subtract, np.multiply and np.divide apply their operations element by element across arrays |
| Mathematical Operations | np.sqrt(), np.exp(), np.log() | Apply square root, exponential or natural log to every element, enabling fast transformations of large datasets |
| Mathematical Operations | np.power() | Raise each element of one array to the corresponding power in another, element-wise |
| Mathematical Operations | Broadcasting Rules | Arrays with compatible shapes are aligned automatically; dimensions of 1 are stretched virtually without copying memory |
| Statistical Functions | np.mean() and np.median() | Compute the arithmetic mean and median of array values along a given axis |
| Statistical Functions | np.std() and np.var() | Compute standard deviation and variance as measures of spread across the array |
| Statistical Functions | np.min(), np.max(), argmin, argmax | Locate the smallest or largest value in an array, or the index at which each occurs |
| Statistical Functions | axis=0 vs axis=1 | axis=0 collapses down the rows (one value per column); axis=1 collapses across columns (one value per row) |
| Random Numbers and Gotchas | default_rng() | The modern NumPy random API; create one seeded generator for reproducible simulations and sampling |
| Random Numbers and Gotchas | Common Pitfalls | Watch out for dtype promotion, views vs copies, shape mismatches, axis confusion, and using and/or on arrays |