39  Graphical Presentation – Scatter plot, Histogram

39.1 Scatter Plot

A scatter plot (or scatter diagram) is a type of plot or mathematical diagram that uses Cartesian coordinates to display the values of, typically, two variables for a set of data. The data are displayed as a collection of points: for each point, the value of one variable determines its position on the horizontal axis and the value of the other variable determines its position on the vertical axis.

Importance:

Relationships: Scatter plots are particularly useful for assessing the relationship or correlation between two variables, and for spotting trends, clusters, and outliers.

Correlation detection: They make it easy to see whether an increase in one variable is accompanied by an increase in the other (positive correlation), a decrease in the other (negative correlation), or no consistent change (no correlation).

Example of a Scatter Plot: Imagine you have a dataset with the ages of a group of people and their corresponding systolic blood pressure readings. A scatter plot can be used to visualize any correlation between age and blood pressure.

  • X-axis (horizontal): Age
  • Y-axis (vertical): Systolic Blood Pressure

The scatter plot might show that as age increases, blood pressure also tends to increase, indicating a positive correlation.

Let’s take an example dataset that includes hours studied and scores obtained by students to demonstrate how to create a scatter plot using both R and Python.

39.1.1 Dataset Example:

Hours Studied    Score Obtained
1                20
2                40
3                60
4                80
5                100

We’ll visualize this data to see if there’s a correlation between the number of hours studied and the scores obtained.
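Before plotting, the strength of the linear relationship can also be checked numerically. The following is a minimal sketch in plain Python that computes the Pearson correlation coefficient for the dataset above. The function name pearson_r is a hypothetical helper introduced here for illustration; it is not part of any library.

```python
import math

# Dataset from the table above
hours = [1, 2, 3, 4, 5]
scores = [20, 40, 60, 80, 100]


def pearson_r(x, y):
    """Return the Pearson correlation coefficient of two equal-length sequences.

    Hypothetical helper for illustration only.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of the deviations from each mean
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cross / math.sqrt(ss_x * ss_y)


r = pearson_r(hours, scores)
print(f"Pearson r = {r:.3f}")  # prints "Pearson r = 1.000"
```

Because every score in this small dataset is exactly 20 times the hours studied, the coefficient is 1, a perfect positive linear relationship. Real data would normally give a value strictly between -1 and 1.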

39.2 Scatter Plot using R and Python

In R, you can use the base plot() function; in Python, matplotlib.pyplot.scatter() from the Matplotlib library.

In both languages the steps are the same: define two lists or vectors, one for the hours studied and one for the scores obtained, then call the plotting function so that each point's position corresponds to one pair of values. The title and axis labels (main, xlab, and ylab in R; title, xlabel, and ylabel in Python) make the plot easier to read. For this dataset the points fall on a straight, upward-sloping line, showing a clear positive linear relationship: more hours studied are associated with higher scores.
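The Python steps described above can be sketched as follows, assuming Matplotlib is installed. The wrapper function make_scatter_plot and the title text are illustrative choices made here, not taken from the original listing; scatter(), title(), xlabel(), and ylabel() are standard matplotlib.pyplot functions.

```python
import matplotlib.pyplot as plt

# Dataset from section 39.1.1
hours = [1, 2, 3, 4, 5]
scores = [20, 40, 60, 80, 100]


def make_scatter_plot(x, y):
    """Draw one point per (x, y) pair and return the Axes object.

    Hypothetical wrapper for illustration; the title text is an assumption.
    """
    plt.figure()
    plt.scatter(x, y)  # one marker per (hours, score) pair
    plt.title("Hours Studied vs. Score Obtained")
    plt.xlabel("Hours Studied")
    plt.ylabel("Score Obtained")
    return plt.gca()


ax = make_scatter_plot(hours, scores)
# In an interactive session, call plt.show() to display the figure,
# or plt.savefig("scatter.png") to write it to disk.
```

Returning the Axes object is a design choice that makes the result easy to inspect or adjust afterwards. In a short script, calling the pyplot functions directly without a wrapper works just as well.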

39.3 Histogram

A histogram is an approximate representation of the distribution of numerical data. It estimates the probability distribution of a continuous (quantitative) variable; the term was introduced by Karl Pearson. A histogram consists of contiguous (adjacent) boxes: the range of values is divided into intervals (bins), and the height of each box shows the number of data points that fall within that bin.

Importance:

Distribution: Histograms provide a visual interpretation of numerical data by indicating the frequency of data points within certain ranges of values. This helps in understanding the distribution (e.g., normal distribution, skewed, bimodal) of the data.

Outliers and shape: They help identify outliers and the overall shape of the data distribution, which are critical in statistical analyses and assumptions required for applying various statistical tests and models.

Example of a Histogram: Suppose you have data on students' test scores in a particular exam. The histogram can show how many students achieved scores within each score range (e.g., 0–10, 11–20, etc.).

  • X-axis (horizontal): Score ranges
  • Y-axis (vertical): Number of students

From the histogram, you might observe most students scoring between 50 and 70, which could indicate the test’s difficulty level or the average student’s preparedness.

These visual tools help researchers, analysts, and businesses analyze large amounts of data quickly and effectively and make informed decisions based on what the plots reveal.

Let's continue with the theme of students' scores, this time with a larger dataset representing the distribution of scores on a test. Here is the example dataset we'll use for creating histograms:

39.3.1 Dataset Example:

Scores
55
70
65
85
90
75
60
95
80
70
65
50
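To make the idea of binning concrete, the following plain-Python sketch counts how many of the scores above fall into each of 5 equal-width bins spanning the minimum to the maximum score. The function bin_counts is a hypothetical helper written for illustration. It assumes the common convention that every bin includes its left edge and only the last bin also includes its right edge.

```python
# Scores from the dataset above
scores = [55, 70, 65, 85, 90, 75, 60, 95, 80, 70, 65, 50]


def bin_counts(values, n_bins):
    """Count values in n_bins equal-width bins spanning min(values)..max(values).

    Hypothetical helper for illustration only.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("all values are equal; bin width would be zero")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp the index so the maximum value lands in the last bin
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return counts, edges


counts, edges = bin_counts(scores, 5)
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count}")
# prints:
# 50 to 59: 2
# 59 to 68: 3
# 68 to 77: 3
# 77 to 86: 2
# 86 to 95: 2
```

These counts are the bar heights a histogram with 5 bins would show for this dataset. Changing the number of bins changes the bin width and therefore the level of detail.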

39.4 Histogram using R and Python

In R, the base hist() function creates a histogram; in Python, use matplotlib.pyplot.hist().

The Python version mirrors the R version: it defines a list of scores and calls plt.hist() to create a histogram with 5 bins. The bars are colored blue with black edges so that adjacent bins are easy to tell apart, and axis labels and a title are added to make the plot easier to interpret.
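A minimal sketch of the Python version described above, assuming Matplotlib is installed. The wrapper function make_histogram and the title and label text are illustrative assumptions; the bins, color, and edgecolor arguments are standard parameters of matplotlib.pyplot.hist().

```python
import matplotlib.pyplot as plt

# Scores from section 39.3.1
scores = [55, 70, 65, 85, 90, 75, 60, 95, 80, 70, 65, 50]


def make_histogram(values, bins=5):
    """Draw a histogram and return the bin counts and bin edges.

    Hypothetical wrapper for illustration; title and label text are assumptions.
    """
    plt.figure()
    # hist() returns the counts per bin, the bin edges, and the bar patches
    counts, edges, _ = plt.hist(values, bins=bins, color="blue", edgecolor="black")
    plt.title("Distribution of Test Scores")
    plt.xlabel("Score")
    plt.ylabel("Number of Students")
    return counts, edges


counts, edges = make_histogram(scores)
# In an interactive session, call plt.show() to display the figure,
# or plt.savefig("histogram.png") to write it to disk.
```

The returned counts and edges are the same values that are drawn as bars, so they can be printed or checked alongside the figure.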

Summary

Scatter Plot Foundations

  • Scatter Plot: A chart that places one variable on the x-axis and another on the y-axis, plotting one point per observation
  • Cartesian Coordinates: Each point's position encodes a paired (x, y) value drawn from the dataset

What Scatter Plots Reveal

  • Trends, Clusters and Outliers: Patterns visible at a glance, helping to spot relationships, groupings and unusual values
  • Positive vs Negative Correlation: Upward-sloping points suggest positive correlation; downward-sloping points suggest negative correlation
  • When to Use a Scatter Plot: Use when the question is about the relationship between two continuous variables

Scatter Plot in R and Python

  • R via plot(): Use plot(x, y, main, xlab, ylab, pch, col) for the base R scatter plot
  • Python via plt.scatter(): Use matplotlib.pyplot.scatter(x, y) followed by title/xlabel/ylabel/show

Histogram Foundations

  • Histogram: A bar-style chart of a single continuous variable showing how observations are distributed across ranges
  • Bins: Continuous values are grouped into adjacent intervals; bin width controls the level of detail
  • Frequency on the Y-axis: The height of each bar is the count or frequency of observations in that bin

What Histograms Reveal

  • Distribution Shape: Reveals whether data is roughly normal, skewed, bimodal, uniform or otherwise patterned
  • Skewness and Modality: Long tails point to skew; multiple peaks point to mixtures of subpopulations
  • Outlier Detection: Bars far from the bulk of the distribution suggest extreme observations worth investigating

Histogram in R and Python

  • R via hist(): Use hist(x, breaks, col, main, xlab, ylab) for the base R histogram
  • Python via plt.hist(): Use matplotlib.pyplot.hist(x, bins, color, edgecolor) for the Python histogram