flowchart TD
SP[SciPy] --> ST[scipy.stats<br/>distributions, tests]
SP --> OP[scipy.optimize<br/>minimize, curve_fit, root]
SP --> IN[scipy.interpolate<br/>interp1d, splines]
SP --> IG[scipy.integrate<br/>quad, dblquad, odeint]
SP --> LA[scipy.linalg<br/>solve, eig, decompositions]
SP --> SG[scipy.signal<br/>filtering, spectra]
49 SciPy
SciPy extends NumPy with high-level scientific algorithms. By the end of this chapter you should be able to:
- Describe SciPy’s modular structure and how it complements NumPy and Pandas.
- Perform common statistical tests and work with probability distributions via scipy.stats.
- Fit models and minimise objective functions with scipy.optimize.
- Interpolate between known data points with scipy.interpolate.
- Compute definite integrals with scipy.integrate.quad.
- Solve linear systems with scipy.linalg.
- Recognise common pitfalls around result objects, two-sided tests, and convergence.
SciPy (Pauli Virtanen et al., 2020) is a fundamental library for scientific computing in Python, providing optimized algorithms that are central to data analytics and statistics. It builds on NumPy, adding high-level functions and classes for scientific and mathematical computation; it does not itself provide plotting, which is usually handled by Matplotlib. With modules for optimization, integration, interpolation, eigenvalue problems, algebraic equations, differential equations, and others, SciPy is widely used by researchers, scientists, and analysts working in data science.
49.1 Core Features in Data Analytics and Statistics
Statistical Functions: The scipy.stats module contains a large number of probability distributions as well as a growing library of statistical functions such as summary and frequency statistics, correlation functions, tests for statistical hypotheses, and more. This makes it invaluable for statistical testing and analysis, which are core components of data analytics.
Optimization and Fit: SciPy provides tools for finding minima and maxima of functions, curve fitting, and seeking root values. These are useful in modeling data and understanding the underlying trends or patterns.
Interpolation: With SciPy, you can interpolate data points to estimate intermediate values, enhancing the analysis of datasets by making them denser or fitting them to a specific function.
Numerical Integration: The library supports multiple integration techniques, including single, double, and triple integrals. This is particularly useful in areas of physics and engineering where these calculations are common.
Linear Algebra: SciPy extends NumPy’s linear algebra capabilities by adding more advanced functions, which are essential in solving systems of linear equations, finding eigenvalues/eigenvectors, and more.
49.2 The SciPy Landscape
SciPy is organised into sub-packages, each dedicated to a specific scientific domain. Most analytics workflows touch only a handful of them.
49.3 Examples of Using SciPy in Data Analytics and Statistics
Example 1: Statistical Testing
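Suppose you’re analyzing two independent sets of data and want to know whether their means differ. An independent-samples t-test, available as stats.ttest_ind, addresses this question. Note that it compares means only; it does not test whether the two samples share the same distribution in every respect.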
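A minimal sketch of such a test follows. The two samples are synthetic: the names group_a and group_b, the means, the spread, and the seed are all illustrative assumptions, not data from this chapter.

```python
import numpy as np
from scipy import stats

# Hypothetical data: two groups drawn from normal distributions
# whose true means differ (50 vs 55).
rng = np.random.default_rng(42)
group_a = rng.normal(loc=50.0, scale=5.0, size=40)
group_b = rng.normal(loc=55.0, scale=5.0, size=40)

# Independent-samples t-test: two-sided and equal variances assumed by default.
result = stats.ttest_ind(group_a, group_b)
t_stat, p_value = result.statistic, result.pvalue

print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

If equal variances cannot be assumed, passing equal_var=False requests Welch's t-test instead.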
This will give you a t-statistic and a p-value, helping you judge whether there is a significant difference between the two datasets. A p-value below your chosen significance level (commonly 0.05) is evidence that the two means differ.
Working with Probability Distributions
Each distribution in scipy.stats exposes a consistent interface: pdf() for the density, cdf() for cumulative probability, ppf() for quantiles, and rvs() for random sampling. The same pattern applies across norm, t, chi2, binom, poisson, and many more, with one difference: discrete distributions such as binom and poisson provide pmf() (probability mass) in place of pdf().
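The interface can be sketched as follows. The parameter values (mean 100, standard deviation 15, ten coin flips) are illustrative assumptions.

```python
from scipy import stats

# A "frozen" normal distribution with mean 100 and standard deviation 15.
dist = stats.norm(loc=100, scale=15)

density = dist.pdf(100)                     # density at the mean
prob_below = dist.cdf(115)                  # P(X <= 115)
q95 = dist.ppf(0.95)                        # 95th percentile
sample = dist.rvs(size=5, random_state=0)   # five reproducible random draws

# Discrete distributions use pmf() instead of pdf().
p_three_heads = stats.binom(n=10, p=0.5).pmf(3)

print(density, prob_below, q95, p_three_heads)
```

Freezing the distribution once with its parameters avoids repeating loc and scale in every call.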
Example 2: Curve Fitting
If you have a dataset and you want to fit a specific model to it, you can use the curve_fit function from scipy.optimize:
This script fits an exponential model to the noisy data and plots both the original data and the fitted curve, showcasing how SciPy can be used to understand and model your data.
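A minimal sketch of such a fit is shown here on synthetic data, with the plotting step left out so the block depends only on NumPy and SciPy. The model name exp_model, the true parameters (2.5 and 1.3), the noise level, and the initial guess are assumptions chosen for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_model(x, a, b):
    """Exponential model y = a * exp(b * x)."""
    return a * np.exp(b * x)

# Synthetic noisy data generated from known parameters a=2.5, b=1.3.
rng = np.random.default_rng(0)
x_data = np.linspace(0, 2, 50)
y_data = exp_model(x_data, 2.5, 1.3) + rng.normal(scale=0.2, size=x_data.size)

# p0 is the initial guess; popt holds the fitted parameters and
# pcov their estimated covariance matrix.
popt, pcov = curve_fit(exp_model, x_data, y_data, p0=[1.0, 1.0])
perr = np.sqrt(np.diag(pcov))  # one-standard-deviation uncertainties

print("fitted a, b:", popt, "+/-", perr)
```

The fitted values in popt should land close to the parameters used to generate the data; perr gives a rough sense of how well each one is determined.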
Minimising an Objective Function
scipy.optimize.minimize locates an input that minimises a user-supplied function. Use it when the minimum cannot be derived in closed form. The solver searches for a local minimum, so the result can depend on the starting point.
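As a sketch, consider the illustrative objective x² + 10 sin(x), which has more than one local minimum. The function, the starting point, and the variable names are assumptions made for this example.

```python
import numpy as np
from scipy.optimize import minimize

def objective(v):
    """Illustrative one-dimensional objective with several local minima."""
    x = v[0]
    return x ** 2 + 10.0 * np.sin(x)

# Start the search at x = 0; the default solver for an unconstrained
# problem like this one is BFGS.
result = minimize(objective, x0=np.array([0.0]))

if result.success:
    print("local minimum at x =", result.x[0], "with value", result.fun)
else:
    print("solver did not converge:", result.message)
```

Starting from a different x0 may lead to a different local minimum, which is why trying several starting points is a common precaution.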
Interpolation
scipy.interpolate fills in values between known data points — useful when resampling sparse series or smoothing a curve:
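The sketch below uses interp1d, the function named in this chapter, on illustrative samples of y = x². Recent SciPy documentation labels interp1d as legacy, so newer code may prefer other routines in scipy.interpolate; the data and variable names here are assumptions.

```python
import numpy as np
from scipy.interpolate import interp1d

# Known samples of y = x**2 at integer x values.
x_known = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_known = x_known ** 2

# Each call builds a callable interpolator.
linear = interp1d(x_known, y_known, kind="linear")
cubic = interp1d(x_known, y_known, kind="cubic")

# Evaluate both at points between the known samples.
x_new = np.array([0.5, 1.5, 2.5])
y_linear = linear(x_new)  # straight-line segments between neighbours
y_cubic = cubic(x_new)    # cubic spline through the points

print(y_linear, y_cubic)
```

By default the interpolator raises an error for inputs outside the range of x_known, which guards against accidental extrapolation.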
Numerical Integration
scipy.integrate.quad computes definite integrals for functions without closed-form antiderivatives. It returns the integral and an estimated absolute error:
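A short sketch, using the Gaussian integrand exp(-x²) as an illustrative choice: its antiderivative has no elementary form, but the integral from 0 to infinity is known to equal √π / 2, which gives something to compare against.

```python
import numpy as np
from scipy.integrate import quad

def gaussian(x):
    """Integrand with no elementary antiderivative."""
    return np.exp(-x ** 2)

# quad accepts infinite limits; it returns (value, estimated absolute error).
value, abs_err = quad(gaussian, 0, np.inf)
exact = np.sqrt(np.pi) / 2

print(f"numerical = {value:.10f}, exact = {exact:.10f}, est. error = {abs_err:.1e}")
```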
Linear Algebra
scipy.linalg extends NumPy’s linear algebra with extra solvers and decompositions. A common task is solving a system A x = b without explicitly inverting A:
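A minimal sketch with a small, made-up 2×2 system (the matrix A and vector b are illustrative assumptions):

```python
import numpy as np
from scipy import linalg

# Solve  3x + 1y = 9
#        1x + 2y = 8
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = linalg.solve(A, b)   # solves A x = b without forming the inverse of A
residual = A @ x - b     # should be close to zero

print("x =", x, "residual =", residual)
```

Solving the system directly is generally cheaper and more accurate than computing linalg.inv(A) @ b.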
49.4 Common Pitfalls
- Test results are result objects. stats.ttest_ind(a, b) returns an object; unpack it with t, p = result or access result.statistic and result.pvalue rather than indexing by position.
- Two-sided vs one-sided tests. Most scipy.stats tests are two-sided by default. Pass alternative="less" or alternative="greater" when the hypothesis is directional.
- Distributions use loc and scale, not mean and std. Write stats.norm(loc=mu, scale=sigma); the same parameter names appear across distributions for consistency. For norm they coincide with the mean and standard deviation, but for other distributions they are a general shift and scale.
- curve_fit needs a reasonable initial guess. If the optimiser fails to converge or returns nonsense, supply p0=[...] close to the expected parameters.
- minimize returns a result object. Always check result.success before trusting result.x; False means the solver did not converge.
- quad returns a tuple. The second element is the estimated absolute error, not another integral value, so unpack as value, err = quad(...).
- Prefer scipy.linalg over numpy.linalg for heavy work. SciPy offers more algorithms and more options, which helps with large or ill-conditioned problems.
- SciPy builds on NumPy; it does not replace it. Keep your arrays as NumPy arrays and import only the SciPy sub-modules you actually need.
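Two of these pitfalls can be sketched in a few lines. The data, group names, and seed are illustrative assumptions, and the iteration limit is set artificially low to force a non-converged result.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize, rosen

# Hypothetical samples: the treated group has the higher true mean.
rng = np.random.default_rng(1)
control = rng.normal(loc=10.0, scale=2.0, size=30)
treated = rng.normal(loc=12.0, scale=2.0, size=30)

# Two-sided by default; alternative="greater" tests mean(treated) > mean(control).
two_sided = stats.ttest_ind(treated, control)
one_sided = stats.ttest_ind(treated, control, alternative="greater")
print(two_sided.pvalue, one_sided.pvalue)

# A solver stopped after only 5 iterations on the Rosenbrock test function:
# result.x is still populated, but result.success reports the failure.
result = minimize(rosen, x0=np.array([-1.2, 1.0]),
                  method="Nelder-Mead", options={"maxiter": 5})
if not result.success:
    print("do not trust result.x:", result.message)
```

When the observed t-statistic points in the hypothesised direction, the one-sided p-value is half the two-sided one.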
Summary
| Concept | Description |
|---|---|
| Foundations | |
| SciPy | A scientific-computing library that extends NumPy with high-level algorithms for statistics, optimisation, integration and linear algebra |
| Built on NumPy | SciPy operates directly on NumPy arrays, inheriting their speed and memory layout |
| Modular Structure | SciPy is organised into sub-packages such as stats, optimize, interpolate, integrate and linalg, each focused on a specific domain |
| Statistics (scipy.stats) | |
| scipy.stats | The statistics sub-package providing distributions, summary statistics, correlation routines and hypothesis tests |
| Probability Distributions | Continuous and discrete distributions that expose a consistent interface for density, cumulative probability, quantiles and sampling |
| pdf / cdf / ppf / rvs | Continuous distributions support pdf() for density (discrete ones use pmf() for probability mass), along with cdf() for cumulative probability, ppf() for quantiles and rvs() for random sampling |
| Hypothesis Testing | Classic inferential tests including t-tests, ANOVA, chi-square and non-parametric tests are a single function call away |
| ttest_ind | stats.ttest_ind(a, b) performs an independent-samples t-test and returns a result object with statistic and pvalue fields |
| alternative argument | Pass alternative='less' or 'greater' to request a one-sided test instead of the default two-sided |
| Optimisation (scipy.optimize) | |
| scipy.optimize | The optimisation sub-package for finding minima, roots and best-fit parameters |
| curve_fit | curve_fit(model, xdata, ydata) fits a user-defined function to data by non-linear least squares |
| minimize | minimize(fun, x0) locates a local minimum of an objective function starting from the initial guess x0 |
| Result Objects | Optimisation functions return a result object with fields such as x, fun and success that should be inspected before use |
| Interpolation and Integration | |
| scipy.interpolate | Interpolation routines that estimate values between known data points for smoother, denser datasets |
| interp1d | interp1d(x, y, kind='linear') or interp1d(x, y, kind='cubic') builds a callable interpolator that can be evaluated at new x values |
| scipy.integrate | The numerical integration sub-package for problems where no closed-form antiderivative exists |
| quad | quad(func, a, b) returns the definite integral and an estimated absolute error as a tuple |
| Linear Algebra | |
| scipy.linalg | Linear algebra extensions including solvers, eigenvalue routines and matrix decompositions for large or ill-conditioned systems |
| linalg.solve | linalg.solve(A, b) solves the linear system A x = b without explicitly inverting A |
| Installation and Ecosystem | |
| Installing SciPy | Install with pip install scipy or with conda install scipy inside an Anaconda environment |
| Python Analytics Stack | NumPy, Pandas, Matplotlib and SciPy together form the core Python stack for applied analytics |