42 Box plots
Box plots (John W. Tukey, 1977), also known as box-and-whisker plots, are a type of statistical graph that is used to display the distribution of data based on a five-number summary: minimum, first quartile (Q1), median (second quartile, Q2), third quartile (Q3), and maximum. They are particularly useful for identifying outliers and understanding the spread and skewness of the data.
Purpose:
- A box plot visualizes data distribution in a compact manner. The central box represents the middle 50% of the data (from Q1 to Q3). Inside the box, a line indicates the median of the data.
- Whiskers extend from each side of the box to the smallest and largest values that lie within 1.5 times the interquartile range (IQR = Q3 - Q1) of the box, that is, no lower than Q1 - 1.5 × IQR and no higher than Q3 + 1.5 × IQR. Data points outside this range are considered outliers and are often plotted as individual points.
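The quantities described above can be computed directly. The following is a minimal sketch using only the Python standard library; the function name `box_plot_stats` and the sample values are hypothetical, and the `"inclusive"` quartile method is one of several conventions, so other tools may report slightly different quartiles for the same data.

```python
import statistics


def box_plot_stats(values):
    """Return the quantities a box plot draws, using the 1.5 * IQR rule."""
    data = sorted(values)
    # "inclusive" interpolates linearly between data points; this is an
    # assumption, since quartile conventions differ between tools.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme data points still inside the fences.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "iqr": iqr,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Hypothetical sample with one unusually large value.
sample = [2, 4, 4, 5, 6, 7, 8, 9, 30]
stats = box_plot_stats(sample)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Note that the upper whisker ends at the largest value inside the fence, not at the maximum: the maximum of this sample is flagged as an outlier and would be drawn as a separate point.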
How Data Analysts Use Box Plots:
Identify Outliers: The whiskers extend from the hinges to the highest and lowest values that are within 1.5 * IQR (interquartile range, which is Q3 - Q1) from the quartiles, providing a quick visual cue about the presence of outliers beyond these bounds.
Compare Distributions: Analysts use box plots to compare distributions across different categories or groups within a dataset, making it easy to see differences in medians, the range of the data, and overall variability.
Spot Asymmetry and Spread: Box plots allow analysts to easily see if the data is symmetrically distributed, skewed, or if one tail is longer than the other.
42.1 Creating a Box Plot
We use a sample of 20 exam scores with a couple of unusually high and low values so the whiskers and outlier dots are clearly visible in the plot.
42.2 Box Plot using R and Python
The central box shows where the middle 50% of scores sit, and the line inside the box marks the class median. Dots beyond the whiskers are scores that sit unusually far from the rest of the class.
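A plot of this kind could be produced in Python roughly as follows. This is a sketch rather than the chapter's original code: the 20 scores are hypothetical stand-ins chosen to include two unusually low and two unusually high values, and the colour and labels are arbitrary.

```python
import matplotlib.pyplot as plt

# Hypothetical exam scores (not the chapter's original data): 20 values,
# most between 62 and 78, plus two unusually low and two unusually high scores.
scores = [28, 35, 62, 64, 65, 66, 67, 68, 69, 70,
          71, 72, 73, 74, 75, 76, 77, 78, 98, 100]

fig, ax = plt.subplots(figsize=(4, 5))

# whis=1.5 is Matplotlib's default; it is spelled out to make the rule visible.
# patch_artist=True draws the box as a filled patch so it can be coloured.
parts = ax.boxplot(scores, whis=1.5, patch_artist=True)
parts["boxes"][0].set_facecolor("lightblue")

ax.set_title("Exam scores")
ax.set_ylabel("Score")
ax.set_xticks([])  # a single box needs no x-axis tick

# The "fliers" artist holds the points drawn beyond the whiskers.
outlier_scores = sorted(float(v) for v in parts["fliers"][0].get_ydata())
print("Scores drawn as outliers:", outlier_scores)
```

Call `plt.show()` to display the figure interactively, or `fig.savefig(...)` with a file name of your choice to write it to disk.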
42.3 Grouped Box Plot
Box plots earn their keep when you compare several groups side by side. Each group gets its own box, and differences in medians, spreads and outlier counts become visible at a glance.
42.4 Grouped Box Plot using R and Python
At a glance, Class A has the highest median and Class C the lowest, while Class B has one student well above the rest - the dot sitting above its upper whisker.
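A grouped plot with this pattern could be sketched in Python as below. The three lists of scores are hypothetical, constructed only so that they show the same qualitative picture (Class A highest, Class C lowest, one high outlier in Class B); they are not the chapter's data.

```python
import matplotlib.pyplot as plt

# Hypothetical scores for three classes, ten students each.
classes = {
    "Class A": [78, 80, 82, 84, 85, 86, 88, 90, 92, 94],
    "Class B": [65, 68, 70, 72, 73, 74, 75, 76, 78, 99],
    "Class C": [52, 55, 58, 60, 62, 63, 65, 67, 70, 72],
}

fig, ax = plt.subplots(figsize=(6, 4))

# Passing a list of sequences draws one box per sequence, side by side.
parts = ax.boxplot(list(classes.values()), patch_artist=True)

# Boxes are drawn at x = 1, 2, 3; label those positions with the class names.
ax.set_xticks(list(range(1, len(classes) + 1)))
ax.set_xticklabels(list(classes.keys()))

for box, colour in zip(parts["boxes"], ["lightblue", "lightgreen", "salmon"]):
    box.set_facecolor(colour)

ax.set_title("Exam scores by class")
ax.set_ylabel("Score")

# Read the medians and outliers back from the drawn artists.
medians = {name: float(line.get_ydata()[0])
           for name, line in zip(classes, parts["medians"])}
outliers = {name: [float(v) for v in line.get_ydata()]
            for name, line in zip(classes, parts["fliers"])}
print("Medians:", medians)
print("Outliers:", outliers)
```

Reading the values back from the returned artists is optional; it is included here so the numbers behind the picture can be checked without inspecting the figure.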
42.5 Horizontal Notched Box Plot
A notched box plot adds a notch, a narrowing of the box around the median, that marks an approximate 95% confidence interval for the median. When the notches of two groups do not overlap, their medians are probably different. Drawing the boxes horizontally also makes long group labels easier to read.
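To make the notch concrete, the sketch below computes it with a commonly used approximation, median ± 1.57 × IQR / √n. The constant 1.57 is an assumption here: it varies slightly between tools (some use about 1.58). The function names and the three groups of scores are hypothetical.

```python
import math
import statistics


def notch_interval(values, factor=1.57):
    """Approximate 95% interval for the median: median +/- factor * IQR / sqrt(n)."""
    # "inclusive" quartiles are an assumption; conventions differ between tools.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    half_width = factor * (q3 - q1) / math.sqrt(len(values))
    return median - half_width, median + half_width


def notches_overlap(a, b):
    """Return True when the notch intervals of two groups share any value."""
    low_a, high_a = notch_interval(a)
    low_b, high_b = notch_interval(b)
    return low_a <= high_b and low_b <= high_a


# Hypothetical groups of ten scores each.
group_high = [78, 80, 82, 84, 85, 86, 88, 90, 92, 94]
group_similar = [76, 79, 81, 83, 84, 85, 87, 89, 91, 93]
group_low = [52, 55, 58, 60, 62, 63, 65, 67, 70, 72]

print(notch_interval(group_high))
print(notches_overlap(group_high, group_similar))
print(notches_overlap(group_high, group_low))
```

Because the half-width shrinks with √n, larger groups produce narrower notches, so the same difference in medians is easier to see with more data.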
42.6 Horizontal Notched Box Plot using R and Python
If the notches of two boxes do not overlap, you have informal visual evidence that their medians differ. When the notches do overlap, treat median differences as unconfirmed and follow up with a formal test such as a Mann-Whitney or Kruskal-Wallis comparison.
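A horizontal notched plot could be drawn in Python along these lines. This is a sketch with simulated data: the group names, means, spread and seed are all hypothetical, and the groups are given 40 values each so that the notches stay inside the boxes.

```python
import matplotlib.pyplot as plt
import numpy as np

# Simulated scores for three hypothetical groups with long labels.
rng = np.random.default_rng(seed=42)
groups = {
    "Morning section, in person": rng.normal(loc=82, scale=8, size=40),
    "Afternoon section, hybrid": rng.normal(loc=75, scale=8, size=40),
    "Evening section, online": rng.normal(loc=68, scale=8, size=40),
}

fig, ax = plt.subplots(figsize=(7, 4))

# notch=True pinches each box around its median;
# vert=False lays the boxes out horizontally.
parts = ax.boxplot(list(groups.values()), notch=True, vert=False)

# Boxes sit at y = 1, 2, 3; label those positions with the group names.
ax.set_yticks(list(range(1, len(groups) + 1)))
ax.set_yticklabels(list(groups.keys()))

ax.set_title("Scores by section (notched)")
ax.set_xlabel("Score")
fig.tight_layout()  # leave room for the long labels
```

Call `plt.show()` to display the figure. With the horizontal layout, the long labels sit along the y-axis and do not need to be rotated or abbreviated.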
Summary
| Concept | Description |
|---|---|
| Foundations | |
| Box Plot | A statistical chart that compresses a variable into a five-number summary |
| Five-Number Summary | Minimum, Q1, median, Q3 and maximum together summarise centre, spread and tails |
| Anatomy of a Box Plot | |
| Box Represents IQR | The central rectangle spans Q1 to Q3 and contains the middle 50 percent of the data |
| Median Line | A line inside the box marks the median, the 50th percentile of the data |
| Whiskers | Lines extending from the box to the most extreme non-outlier values on each side |
| 1.5 x IQR Rule | Whiskers usually extend to values within 1.5 times the interquartile range from each quartile |
| Outliers | Points outside the whiskers are flagged as potential outliers and drawn individually |
| What Box Plots Reveal | |
| Detect Outliers | Box plots make extreme values immediately visible because they sit beyond the whiskers |
| Compare Distributions | Side-by-side box plots make medians, spreads and outlier patterns easy to compare across groups |
| Assess Skewness | Position of the median within the box and the relative whisker lengths reveal symmetry or skew |
| Compact Distribution Display | An entire distribution is summarised with a small footprint, ideal for many groups in one panel |
| In R and Python | |
| R via boxplot() | Use boxplot(data, main, ylab, col, border) for a base R box plot |
| Python via matplotlib boxplot | Use matplotlib.pyplot.boxplot(data) when working without higher-level plotting libraries |
| Python via seaborn boxplot | Use seaborn.boxplot(x=data) for a cleaner default style and easy grouping by category |