31 Karl Pearson’s Coefficient of Correlation
Karl Pearson’s correlation coefficient (Pearson, 1895), also known as the Pearson product-moment correlation coefficient (PPMCC) or simply Pearson’s correlation, is a statistic that measures the degree of linear relationship between two variables. It is widely used across the sciences to quantify the linear association between paired datasets.
31.1 Assumptions
Pearson’s correlation requires certain assumptions about the data it is used to analyze:
- Linearity: The relationship between the two variables should be linear.
- Homoscedasticity: The spread of the data around the line of best fit should remain roughly constant across values of the predictor variable.
- Normally Distributed Variables: Both variables being tested should follow a normal distribution.
31.2 Formula
The Pearson correlation coefficient (\(r\)) is calculated using the following formula:
\[ r = \frac{n(\sum xy) - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}} \]
Where:
- \(n\) is the number of data points.
- \(x\) and \(y\) are the variables for which the correlation is being calculated.
- \(\sum\) represents the summation symbol, aggregating all values of \(x\), \(y\), \(xy\), \(x^2\), and \(y^2\).
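For readers who want to check the algebra, the computational formula translates directly into a few lines of Python (the function name `pearson_r` is our own, chosen for illustration; it is not part of any library):

```python
def pearson_r(x, y):
    """Pearson's r via the computational formula: sums and cross-products."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))   # sum of x*y
    sum_x2 = sum(a * a for a in x)              # sum of x^2
    sum_y2 = sum(b * b for b in y)              # sum of y^2
    numerator = n * sum_xy - sum_x * sum_y
    denominator = ((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)) ** 0.5
    return numerator / denominator
```

Each line mirrors one term of the formula, so the function doubles as a reading aid for the notation above.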
31.3 Interpretation
The value of \(r\) ranges from -1 to +1:
- +1 indicates a perfect positive linear relationship,
- -1 indicates a perfect negative linear relationship,
- 0 means no linear relationship exists.
Values close to +1 or -1 indicate a strong relationship, while values close to 0 indicate a weak relationship.
31.4 Example Problem
Suppose we want to determine the relationship between hours studied and scores obtained in an exam. Here are the data for 5 students:
- Hours Studied: 2, 4, 6, 8, 10
- Scores: 20, 40, 60, 80, 100
Hypotheses:
- Null Hypothesis (H₀): There is no linear correlation between hours studied and scores (\(r = 0\)).
- Alternative Hypothesis (H₁): There is a linear correlation between hours studied and scores (\(r \neq 0\)).
Calculate Pearson’s r:
Using the data points provided, first compute the sums and cross-products required by the formula:
- Sum of Hours Studied (\(\sum x\)): \(2 + 4 + 6 + 8 + 10 = 30\)
- Sum of Scores (\(\sum y\)): \(20 + 40 + 60 + 80 + 100 = 300\)
- Sum of the product of hours and scores (\(\sum xy\)): \(2 \times 20 + 4 \times 40 + 6 \times 60 + 8 \times 80 + 10 \times 100 = 2200\)
- Sum of the squares of hours (\(\sum x^2\)): \(2^2 + 4^2 + 6^2 + 8^2 + 10^2 = 220\)
- Sum of the squares of scores (\(\sum y^2\)): \(20^2 + 40^2 + 60^2 + 80^2 + 100^2 = 22000\)
Plugging these values into the formula gives:
\[ r = \frac{5(2200) - (30)(300)}{\sqrt{[5(220) - (30)^2][5(22000) - (300)^2]}} = \frac{2000}{\sqrt{(200)(20000)}} = \frac{2000}{2000} = 1 \]
Conclusion:
Since \(r = 1\), there is a perfect positive linear relationship between the hours studied and the scores obtained, so we reject the null hypothesis in favour of the alternative: a linear relationship exists between the two variables.
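The hand calculation can be double-checked in plain Python by rebuilding each sum from the raw data:

```python
x = [2, 4, 6, 8, 10]       # hours studied
y = [20, 40, 60, 80, 100]  # exam scores
n = len(x)

sum_x, sum_y = sum(x), sum(y)              # 30, 300
sum_xy = sum(a * b for a, b in zip(x, y))  # 2200
sum_x2 = sum(a * a for a in x)             # 220
sum_y2 = sum(b * b for b in y)             # 22000

r = (n * sum_xy - sum_x * sum_y) / (
    ((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)) ** 0.5
)
print(r)  # 1.0
```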
31.5 Pearson’s Correlation using R and Python
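In practice the coefficient is rarely computed by hand. A minimal Python sketch uses `scipy.stats.pearsonr`, which returns both the coefficient and the p-value for the two-sided test of \(H_0: r = 0\); the corresponding R calls are `cor(x, y)` and `cor.test(x, y)`.

```python
import numpy as np
from scipy import stats

hours = np.array([2, 4, 6, 8, 10])
scores = np.array([20, 40, 60, 80, 100])

# pearsonr returns (coefficient, p-value); the data here are perfectly linear.
r, p_value = stats.pearsonr(hours, scores)
print(r)  # 1.0
```

`numpy.corrcoef(hours, scores)` gives the same coefficient as the off-diagonal entry of a 2×2 correlation matrix, without a p-value.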
31.5.1 Example Research Articles on Correlation
- Bridging the Gap: Exploring the Impact of Human Capital Management on Employee Performance through Work Engagement — Administrative Sciences, 2024.
- Developing Employee Productivity and Performance through Work Engagement and Organizational Factors in an Educational Society — Societies, 2023.
31.6 Summary
| Concept | Description |
|---|---|
| Foundations | |
| Pearson's r | A measure of the strength and direction of the linear relationship between two scale variables |
| Linear Relationship | Quantifies how well a straight line summarises the joint behaviour of the two variables |
| Range of r | Always lies between minus one and plus one inclusive |
| Direction and Strength | Sign indicates direction of association, magnitude indicates how tightly the points cluster around a line |
| Assumptions | |
| Linearity | The relationship between the two variables should be approximately linear |
| Homoscedasticity | The spread of one variable should be roughly constant across values of the other |
| Bivariate Normality | Both variables should be approximately normally distributed for inference on r |
| Computation | |
| Sums and Cross-Products | The numerator and denominator are built from sums of x, y, x times y, x squared and y squared |
| Computational Formula | Convenient form using n times sum-of-products minus product-of-sums, suited to manual calculation |
| Standardised Form | Equivalently, r is the average product of the two variables once both have been standardised to z-scores |
| Hypotheses | |
| Null Hypothesis | States that the population correlation is zero |
| Alternative Hypothesis | States that the population correlation is non-zero, or has a specified direction |
| Cautions | |
| Correlation vs Causation | A non-zero r indicates association, not that one variable causes changes in the other |
| Sensitivity to Outliers | A single extreme observation can pull r toward plus or minus one or wash out a real relationship |
| In R and Python | |
| R via cor() | Use cor(x, y) for the coefficient and cor.test(x, y) for the significance test in R |
| Python via numpy.corrcoef() | Use numpy.corrcoef(x, y) for the matrix or scipy.stats.pearsonr(x, y) for the coefficient and p-value |
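The "Standardised Form" row of the table can be illustrated with a short sketch: convert both variables to z-scores (using the population standard deviation, i.e. dividing by \(n\)) and average the products. The helper `zscores` below is our own, written for illustration.

```python
def zscores(v):
    """Standardise a list of values using the population standard deviation."""
    n = len(v)
    mean = sum(v) / n
    sd = (sum((a - mean) ** 2 for a in v) / n) ** 0.5
    return [(a - mean) / sd for a in v]

x = [2, 4, 6, 8, 10]
y = [20, 40, 60, 80, 100]

# r is the mean of the products of paired z-scores.
r = sum(zx * zy for zx, zy in zip(zscores(x), zscores(y))) / len(x)
print(r)  # ~1.0, matching the computational formula
```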