49  SciPy

Note: What This Chapter Covers

SciPy extends NumPy with high-level scientific algorithms. By the end of this chapter you should be able to:

  • Describe SciPy’s modular structure and how it complements NumPy and Pandas.
  • Perform common statistical tests and work with probability distributions via scipy.stats.
  • Fit models and minimise objective functions with scipy.optimize.
  • Interpolate between known data points with scipy.interpolate.
  • Compute definite integrals with scipy.integrate.quad.
  • Solve linear systems with scipy.linalg.
  • Recognise common pitfalls around result objects, two-sided tests, and convergence.

SciPy (Pauli Virtanen et al., 2020) is a fundamental library for scientific computing in Python, providing optimized algorithms that are widely used in data analytics and statistics. It builds upon NumPy, adding higher-level functions and classes for scientific and mathematical computation; visualization is left to libraries such as Matplotlib. With modules for optimization, integration, interpolation, eigenvalue problems, algebraic equations, differential equations, and others, SciPy is a core tool for researchers, scientists, and analysts working in data science.

49.1 Core Features in Data Analytics and Statistics

  1. Statistical Functions: The scipy.stats module contains a large number of probability distributions as well as a growing library of statistical functions such as summary and frequency statistics, correlation functions, tests for statistical hypotheses, and more. This makes it invaluable for statistical testing and analysis, which are core components of data analytics.

  2. Optimization and Fit: SciPy provides tools for finding minima and maxima of functions, fitting curves to data, and finding roots of equations. These are useful in modeling data and understanding the underlying trends or patterns.

  3. Interpolation: With SciPy, you can interpolate data points to estimate intermediate values, enhancing the analysis of datasets by making them denser or fitting them to a specific function.

  4. Numerical Integration: The library supports multiple integration techniques, including single, double, and triple integrals. This is particularly useful in areas of physics and engineering where these calculations are common.

  5. Linear Algebra: SciPy extends NumPy’s linear algebra capabilities by adding more advanced functions, which are essential in solving systems of linear equations, finding eigenvalues/eigenvectors, and more.

49.2 The SciPy Landscape

SciPy is organised into sub-packages, each dedicated to a specific scientific domain. Most analytics workflows touch only a handful of them.

flowchart TD
    SP[SciPy] --> ST[scipy.stats<br/>distributions, tests]
    SP --> OP[scipy.optimize<br/>minimise, curve_fit, root]
    SP --> IN[scipy.interpolate<br/>interp1d, splines]
    SP --> IG[scipy.integrate<br/>quad, dblquad, odeint]
    SP --> LA[scipy.linalg<br/>solve, eig, decompositions]
    SP --> SG[scipy.signal<br/>filtering, spectra]
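In practice, each sub-package in the diagram is imported on its own. The sketch below assumes a standard SciPy installation and simply confirms that one representative function from each sub-package is available; the choice of functions is illustrative.

```python
# Import only the sub-packages a workflow needs; each is its own namespace.
import scipy
from scipy import stats, optimize, interpolate, integrate, linalg

print("SciPy version:", scipy.__version__)

# One representative callable per sub-package (illustrative selection).
representatives = {
    "scipy.stats": stats.ttest_ind,
    "scipy.optimize": optimize.curve_fit,
    "scipy.interpolate": interpolate.interp1d,
    "scipy.integrate": integrate.quad,
    "scipy.linalg": linalg.solve,
}
for package, func in representatives.items():
    print(f"{package:18s} -> {func.__name__}")
```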

49.3 Examples of Using SciPy in Data Analytics and Statistics

Example 1: Statistical Testing

Suppose you’re analyzing two independent samples and want to know whether their population means differ. An independent two-sample t-test, available as scipy.stats.ttest_ind, addresses this question.
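A minimal sketch of such a test follows. The two samples are simulated with made-up means and sizes purely for illustration; they are not data from this chapter.

```python
import numpy as np
from scipy import stats

# Two simulated samples (illustrative values, not real data).
rng = np.random.default_rng(42)
group_a = rng.normal(loc=5.0, scale=1.0, size=50)
group_b = rng.normal(loc=5.5, scale=1.0, size=50)

# Independent two-sample t-test; by default it assumes equal variances
# and is two-sided.
result = stats.ttest_ind(group_a, group_b)
t_stat, p_value = result.statistic, result.pvalue

print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

If the equal-variance assumption is doubtful, passing equal_var=False requests Welch's t-test instead.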

This will give you a t-statistic and a p-value, helping you judge whether the two sample means differ significantly. A p-value below your chosen significance level (commonly 0.05) suggests evidence that the two means differ.

Working with Probability Distributions

Each distribution in scipy.stats exposes a consistent interface: pdf() for the density of a continuous distribution (pmf() for the probability mass of a discrete one), cdf() for cumulative probability, ppf() for quantiles, and rvs() for random sampling. The same pattern applies across norm, t, chi2, binom, poisson, and many more.
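The interface can be sketched with a standard normal distribution and a binomial distribution. The parameter values below are arbitrary choices for illustration.

```python
from scipy import stats

# A "frozen" normal distribution with mean 0 and standard deviation 1.
dist = stats.norm(loc=0.0, scale=1.0)

density_at_zero = dist.pdf(0.0)             # height of the density at x = 0
prob_below = dist.cdf(1.96)                 # P(X <= 1.96)
q975 = dist.ppf(0.975)                      # 97.5% quantile (inverse of cdf)
sample = dist.rvs(size=5, random_state=0)   # five reproducible random draws

# Discrete distributions expose pmf() where continuous ones expose pdf().
binom_prob = stats.binom(n=10, p=0.5).pmf(5)   # P(exactly 5 successes in 10)

print(f"pdf(0) = {density_at_zero:.4f}")
print(f"cdf(1.96) = {prob_below:.4f}")
print(f"ppf(0.975) = {q975:.4f}")
print(f"binomial pmf(5) = {binom_prob:.4f}")
```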

Example 2: Curve Fitting

If you have a dataset and you want to fit a specific model to it, you can use the curve_fit function from scipy.optimize:
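As a hedged sketch, the example below fits a two-parameter exponential model to synthetic noisy data. The model function exp_model, the "true" parameter values, and the noise level are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_model(x, a, b):
    """Hypothetical model for illustration: y = a * exp(b * x)."""
    return a * np.exp(b * x)

# Synthetic data: known parameters plus Gaussian noise (made-up values).
rng = np.random.default_rng(0)
x_data = np.linspace(0.0, 4.0, 50)
y_data = exp_model(x_data, 2.5, -1.3) + rng.normal(scale=0.05, size=x_data.size)

# p0 is the initial guess; curve_fit returns estimates and their covariance.
params, covariance = curve_fit(exp_model, x_data, y_data, p0=[1.0, -1.0])
a_hat, b_hat = params
std_errors = np.sqrt(np.diag(covariance))   # approximate standard errors

print(f"a = {a_hat:.3f} +/- {std_errors[0]:.3f}")
print(f"b = {b_hat:.3f} +/- {std_errors[1]:.3f}")
```

Evaluating exp_model(x_data, *params) gives the fitted curve, which can be drawn over the raw points with a plotting library such as Matplotlib.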

Fitting an exponential model to noisy data in this way yields the estimated parameters together with their covariance matrix. Plotting the original data alongside the fitted curve then shows how well the model describes the data.

Minimising an Objective Function

scipy.optimize.minimize locates the input that minimises a user-supplied function, starting from an initial guess. Use it when the minimum cannot be derived in closed form.
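A minimal sketch, assuming a simple quadratic objective whose minimum is known in advance so the result can be checked. The function objective and the starting point are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def objective(v):
    """Hypothetical objective: a quadratic bowl with its minimum at (1, -2)."""
    x, y = v
    return (x - 1.0) ** 2 + (y + 2.0) ** 2 + 3.0

# x0 is the starting point; the default method is chosen by SciPy.
result = minimize(objective, x0=np.array([0.0, 0.0]))

# Inspect the result object before trusting the solution.
if result.success:
    print("minimum at", result.x, "with value", result.fun)
else:
    print("solver did not converge:", result.message)
```

Because minimize performs a local search, a different x0 can lead to a different local minimum on objectives with several basins.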

Interpolation

scipy.interpolate fills in values between known data points — useful when resampling sparse series or smoothing a curve:
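The sketch below uses interp1d on a handful of made-up points sampled from y = x². Recent SciPy documentation describes interp1d as legacy, so treat this as one option rather than the only one; numpy.interp and scipy.interpolate.CubicSpline are commonly suggested alternatives.

```python
import numpy as np
from scipy.interpolate import interp1d

# Made-up samples of y = x**2 at five known points.
x_known = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_known = x_known ** 2

# Each call builds a callable that can be evaluated at new x values.
linear = interp1d(x_known, y_known, kind="linear")
cubic = interp1d(x_known, y_known, kind="cubic")

x_new = 2.5
y_linear = float(linear(x_new))   # straight line between neighbouring points
y_cubic = float(cubic(x_new))     # smooth cubic spline through all points

print(f"linear: {y_linear:.3f}, cubic: {y_cubic:.3f}, exact: {x_new ** 2:.3f}")
```

By default, evaluating outside the range of x_known raises a ValueError; passing fill_value="extrapolate" allows extrapolation.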

Numerical Integration

scipy.integrate.quad computes definite integrals for functions without closed-form antiderivatives. It returns the integral and an estimated absolute error:
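As an illustration, the Gaussian integrand exp(-x²) has no elementary antiderivative, yet quad handles it directly. The function name integrand and the chosen limits are assumptions for this sketch.

```python
import numpy as np
from scipy.integrate import quad

def integrand(x):
    """exp(-x^2), which has no elementary antiderivative."""
    return np.exp(-x ** 2)

# quad returns (value, estimated absolute error).
value, abs_err = quad(integrand, 0.0, 1.0)

# Infinite limits are supported; this integral equals sqrt(pi).
total, total_err = quad(integrand, -np.inf, np.inf)

print(f"integral on [0, 1]: {value:.6f} (error estimate {abs_err:.1e})")
print(f"integral on the real line: {total:.6f}, sqrt(pi) = {np.sqrt(np.pi):.6f}")
```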

Linear Algebra

scipy.linalg extends NumPy’s linear algebra with extra solvers and decompositions. A common task is solving a system A x = b without explicitly inverting A:
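A minimal sketch with a made-up 2×2 system: 3x + 2y = 18 and x + 2y = 2.

```python
import numpy as np
from scipy import linalg

# Coefficient matrix and right-hand side (illustrative values).
A = np.array([[3.0, 2.0],
              [1.0, 2.0]])
b = np.array([18.0, 2.0])

# Solve A x = b via a factorisation rather than computing inv(A) @ b.
x = linalg.solve(A, b)

print("solution:", x)
print("residual:", A @ x - b)   # should be close to zero
```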

49.4 Common Pitfalls

Warning: Things that trip up new SciPy users
  • Test results are result objects. stats.ttest_ind(a, b) returns an object; unpack it with t, p = result or access result.statistic and result.pvalue rather than indexing by position.
  • Two-sided vs one-sided tests. Most scipy.stats tests are two-sided by default. Pass alternative="less" or alternative="greater" when the hypothesis is directional.
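  • Distributions use loc and scale, not mean and std. For the normal distribution these coincide, as in stats.norm(loc=mu, scale=sigma), but in general loc shifts and scale stretches the standard form of a distribution, so they are not always the mean and standard deviation.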
  • curve_fit needs a reasonable initial guess. If the optimiser fails to converge or returns nonsense, supply p0=[...] close to the expected parameters.
  • minimize returns a result object. Always check result.success before trusting result.x; False means the solver did not converge.
  • quad returns a tuple. The second element is the estimated absolute error, not another integral value — unpack as value, err = quad(...).
  • Prefer scipy.linalg over numpy.linalg for heavy work. SciPy offers more algorithms and more options, such as additional decompositions and solvers that can exploit matrix structure, which helps with large or ill-conditioned problems.
  • SciPy builds on NumPy; it does not replace it. Keep your arrays as NumPy arrays and import only the SciPy sub-modules you actually need.
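Two of the pitfalls above can be checked directly. The sketch below uses small made-up samples to compare a two-sided and a one-sided t-test, then shows what loc and scale mean for two different distributions.

```python
import numpy as np
from scipy import stats

# Made-up samples; the mean of `a` is visibly larger than the mean of `b`.
a = np.array([5.1, 5.8, 6.2, 5.9, 6.4, 6.0])
b = np.array([4.9, 5.2, 5.0, 5.5, 4.8, 5.1])

two_sided = stats.ttest_ind(a, b)                        # default: two-sided
greater = stats.ttest_ind(a, b, alternative="greater")   # H1: mean(a) > mean(b)

# When the statistic is positive, the one-sided p-value is half the two-sided one.
print(f"two-sided p = {two_sided.pvalue:.5f}")
print(f"one-sided p = {greater.pvalue:.5f}")

# For the normal distribution, loc and scale are the mean and standard deviation.
normal = stats.norm(loc=10.0, scale=2.0)
print("normal mean, std:", normal.mean(), normal.std())

# For the exponential distribution, scale alone sets the mean (loc stays 0).
exponential = stats.expon(scale=2.0)
print("exponential mean:", exponential.mean())
```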

Summary

| Concept | Description |
|---|---|
| Foundations | |
| SciPy | A scientific-computing library that extends NumPy with high-level algorithms for statistics, optimisation, integration and linear algebra |
| Built on NumPy | SciPy operates directly on NumPy arrays, inheriting their speed and memory layout |
| Modular Structure | SciPy is organised into sub-packages such as stats, optimize, interpolate, integrate and linalg, each focused on a specific domain |
| Statistics (scipy.stats) | |
| scipy.stats | The statistics sub-package providing distributions, summary statistics, correlation routines and hypothesis tests |
| Probability Distributions | Continuous and discrete distributions that expose a consistent interface for density or mass, cumulative probability, quantiles and sampling |
| pdf / pmf / cdf / ppf / rvs | Continuous distributions support pdf() for density and discrete ones pmf() for probability mass; both support cdf() for cumulative probability, ppf() for quantiles and rvs() for random sampling |
| Hypothesis Testing | Classic inferential tests including t-tests, ANOVA, chi-square and non-parametric tests are a single function call away |
| ttest_ind | stats.ttest_ind(a, b) performs an independent-samples t-test and returns a result object with statistic and pvalue fields |
| alternative argument | Pass alternative='less' or 'greater' to request a one-sided test instead of the default two-sided |
| Optimisation (scipy.optimize) | |
| scipy.optimize | The optimisation sub-package for finding minima, roots and best-fit parameters |
| curve_fit | curve_fit(model, xdata, ydata) fits a user-defined function to data by non-linear least squares |
| minimize | minimize(fun, x0) locates a local minimum of an objective function starting from the initial guess x0 |
| Result Objects | Optimisation functions return a result object with fields such as x, fun and success that should be inspected before use |
| Interpolation and Integration | |
| scipy.interpolate | Interpolation routines that estimate values between known data points for smoother, denser datasets |
| interp1d | interp1d(x, y, kind='linear' or 'cubic') builds a callable interpolator that can be evaluated at new x values |
| scipy.integrate | The numerical integration sub-package for problems where no closed-form antiderivative exists |
| quad | quad(func, a, b) returns the definite integral and an estimated absolute error as a tuple |
| Linear Algebra | |
| scipy.linalg | Linear algebra extensions including solvers, eigenvalue routines and matrix decompositions for large or ill-conditioned systems |
| linalg.solve | linalg.solve(A, b) solves the linear system A x = b without explicitly inverting A |
| Installation and Ecosystem | |
| Installing SciPy | Install with pip install scipy or with conda install scipy inside an Anaconda environment |
| Python Analytics Stack | NumPy, Pandas, Matplotlib and SciPy together form the core Python stack for applied analytics |