31  Karl Pearson’s Coefficient of Correlation

Karl Pearson’s correlation coefficient (Pearson, 1895), also known as the Pearson product-moment correlation coefficient (PPMCC) or simply Pearson’s correlation, is a statistic that measures the degree of linear relationship between two variables. It is widely used across the sciences to quantify the linear correlation between datasets.

31.1 Assumptions

Pearson’s correlation requires certain assumptions about the data it is used to analyze:

  1. Linearity: The relationship between the two variables should be linear.
  2. Homoscedasticity: The spread of the data around the line of best fit should remain roughly constant across the range of the predictor variable.
  3. Normally Distributed Variables: Both variables being tested should follow a normal distribution.
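
In practice, the normality assumption can be checked before computing the correlation. One common option is the Shapiro-Wilk test; the sketch below (using simulated data for illustration, with scipy assumed available) would be applied to each observed variable in turn:

```python
import numpy as np
from scipy import stats

# Simulated sample for illustration; in practice, test each observed variable.
rng = np.random.default_rng(42)
sample = rng.normal(loc=50, scale=10, size=100)

# Shapiro-Wilk test: a large p-value means we fail to reject normality.
stat, p = stats.shapiro(sample)
print(f"W = {stat:.3f}, p = {p:.3f}")
```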

31.2 Formula

The Pearson correlation coefficient (\(r\)) is calculated using the following formula:

\[ r = \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}} \]

Where:

  • \(n\) is the number of data points.
  • \(x\) and \(y\) are the variables for which the correlation is being calculated.
  • \(\sum\) represents the summation symbol, aggregating all values of \(x\), \(y\), \(xy\), \(x^2\), and \(y^2\).
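
This computational formula maps directly onto code. A minimal Python sketch (the function name `pearson_r` is ours):

```python
import math

def pearson_r(x, y):
    """Pearson's r via the computational (sums) formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator
```

For the worked example later in this chapter, `pearson_r([2, 4, 6, 8, 10], [20, 40, 60, 80, 100])` returns 1.0.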

31.3 Interpretation

The value of \(r\) ranges from -1 to +1:

  • +1 indicates a perfect positive linear relationship,
  • -1 indicates a perfect negative linear relationship,
  • 0 means no linear relationship exists.

Values close to +1 or -1 indicate a strong relationship, while values close to 0 indicate a weak relationship.
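
These cases are easy to demonstrate numerically; a short sketch using numpy.corrcoef (any linear function of \(x\) with positive slope gives \(r \approx +1\), negative slope gives \(r \approx -1\)):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

print(np.corrcoef(x, 2 * x + 1)[0, 1])        # perfect positive: approx. 1.0
print(np.corrcoef(x, -x + 10)[0, 1])          # perfect negative: approx. -1.0
print(np.corrcoef(x, [1, -1, 1, -1, 1])[0, 1])  # no linear trend: approx. 0.0
```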

31.4 Example Problem

Suppose we want to determine the relationship between hours studied and scores obtained in an exam. Here are the data for 5 students:

  • Hours Studied: 2, 4, 6, 8, 10
  • Scores: 20, 40, 60, 80, 100

Hypotheses:

  • Null Hypothesis (H₀): There is no linear correlation between hours studied and scores (\(r = 0\)).
  • Alternative Hypothesis (H₁): There is a linear correlation between hours studied and scores (\(r \neq 0\)).

Calculate Pearson’s r:

Using the data points provided, we first compute the required sums and cross-products:

  1. Sum of Hours Studied (\(\sum x\)): \(2 + 4 + 6 + 8 + 10 = 30\)
  2. Sum of Scores (\(\sum y\)): \(20 + 40 + 60 + 80 + 100 = 300\)
  3. Sum of the product of hours and scores (\(\sum xy\)): \(2 \times 20 + 4 \times 40 + 6 \times 60 + 8 \times 80 + 10 \times 100 = 2200\)
  4. Sum of the squares of hours (\(\sum x^2\)): \(2^2 + 4^2 + 6^2 + 8^2 + 10^2 = 220\)
  5. Sum of the squares of scores (\(\sum y^2\)): \(20^2 + 40^2 + 60^2 + 80^2 + 100^2 = 22000\)
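
These intermediate sums are easy to verify in Python:

```python
hours = [2, 4, 6, 8, 10]
scores = [20, 40, 60, 80, 100]

print(sum(hours))                                 # sum of x:   30
print(sum(scores))                                # sum of y:   300
print(sum(h * s for h, s in zip(hours, scores)))  # sum of xy:  2200
print(sum(h ** 2 for h in hours))                 # sum of x^2: 220
print(sum(s ** 2 for s in scores))                # sum of y^2: 22000
```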

Plugging these values into the formula gives:

\[ r = \frac{5(2200) - (30)(300)}{\sqrt{[5(220) - (30)^2][5(22000) - (300)^2]}} = \frac{2000}{\sqrt{(200)(20000)}} = \frac{2000}{2000} = 1 \]

Conclusion:

Since \(r = 1\), there is a perfect positive linear relationship between the hours studied and the scores obtained, and we reject the null hypothesis in favour of the alternative that a linear relationship exists.
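
The arithmetic can be double-checked by substituting the sums directly into the formula:

```python
import math

n = 5
sx, sy, sxy, sx2, sy2 = 30, 300, 2200, 220, 22000

numerator = n * sxy - sx * sy  # 5(2200) - (30)(300) = 2000
denominator = math.sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))  # sqrt(200 * 20000) = 2000
print(numerator / denominator)  # 1.0
```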

Pearson’s Correlation using Excel:

📥 Stats Basics (Excel)

31.5 Pearson’s Correlation using R and Python
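
In Python, scipy.stats.pearsonr returns both the coefficient and the p-value for the test of zero correlation (the R equivalent is cor.test(x, y), with cor(x, y) for the coefficient alone). A minimal sketch using the study-hours data from the example above:

```python
from scipy import stats

hours = [2, 4, 6, 8, 10]
scores = [20, 40, 60, 80, 100]

# pearsonr returns the correlation coefficient and the two-sided p-value.
r, p = stats.pearsonr(hours, scores)
print(f"r = {r:.3f}, p = {p:.4f}")
```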

31.5.1 Example Research Articles on Correlation:

  1. Bridging the Gap: Exploring the Impact of Human Capital Management on Employee Performance through Work Engagement — Administrative Sciences, 2024. 👉 Download Article
  2. Developing Employee Productivity and Performance through Work Engagement and Organizational Factors in an Educational Society — Societies, 2023. 👉 Download Article

Summary

Foundations

  • Pearson’s r: A measure of the strength and direction of the linear relationship between two scale variables.
  • Linear Relationship: Quantifies how well a straight line summarises the joint behaviour of the two variables.
  • Range of r: Always lies between -1 and +1, inclusive.
  • Direction and Strength: The sign indicates the direction of association; the magnitude indicates how tightly the points cluster around a line.

Assumptions

  • Linearity: The relationship between the two variables should be approximately linear.
  • Homoscedasticity: The spread of one variable should be roughly constant across values of the other.
  • Bivariate Normality: Both variables should be approximately normally distributed for inference on r.

Computation

  • Sums and Cross-Products: The numerator and denominator are built from sums of \(x\), \(y\), \(xy\), \(x^2\), and \(y^2\).
  • Computational Formula: A convenient form using \(n\) times the sum of products minus the product of sums, suited to manual calculation.
  • Standardised Form: Equivalently, r is the average product of the two variables once both have been standardised to z-scores.

Hypotheses

  • Null Hypothesis: States that the population correlation is zero.
  • Alternative Hypothesis: States that the population correlation is non-zero, or has a specified direction.

Cautions

  • Correlation vs Causation: A non-zero r indicates association, not that one variable causes changes in the other.
  • Sensitivity to Outliers: A single extreme observation can pull r toward +1 or -1, or wash out a real relationship.

In R and Python

  • R via cor(): Use cor(x, y) for the coefficient and cor.test(x, y) for the significance test.
  • Python via numpy.corrcoef(): Use numpy.corrcoef(x, y) for the correlation matrix, or scipy.stats.pearsonr(x, y) for the coefficient and p-value.