20  Phi-Coefficient of Correlation

The Phi-Coefficient of Correlation (G. Udny Yule, 1912) measures the degree of association between two binary (dichotomous) variables. It is a special case of the Pearson correlation coefficient: applying Pearson's formula to two variables coded 0 and 1 yields phi.
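The link to Pearson's correlation can be checked numerically. The sketch below is illustrative only: the cell counts are hypothetical, and `pearson_r` is a small helper written here rather than a library function. It expands a 2x2 table into 0/1 observations and compares Pearson's r with phi computed from the cell counts.

```python
from math import sqrt

def pearson_r(x, y):
    """Plain Pearson correlation for two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var_x = sum((xi - mx) ** 2 for xi in x)
    var_y = sum((yi - my) ** 2 for yi in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical 2x2 cell counts, chosen only for illustration:
# a: x=1,y=1   b: x=1,y=0   c: x=0,y=1   d: x=0,y=0
a, b, c, d = 3, 1, 1, 3

# Expand the counts into one 0/1 observation per individual.
x = [1] * (a + b) + [0] * (c + d)
y = [1] * a + [0] * b + [1] * c + [0] * d

r = pearson_r(x, y)
phi = (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

print(r, phi)  # the two values agree up to floating-point rounding
```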

The phi coefficient ranges from -1 to 1, where:

  • 1 or -1 indicates a perfect positive or negative association, respectively,
  • 0 indicates no association between variables.

The phi coefficient is often used in conjunction with the Chi-square test for 2x2 contingency tables to quantify the strength of the association between the variables. It provides a numeric measure of the relationship’s strength, whereas the chi-square test assesses the significance of that relationship.
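The link between the two statistics can be written out explicitly. As a sketch, assume a 2x2 table with cell counts $a$ and $b$ in the first row, $c$ and $d$ in the second row, and total $n = a + b + c + d$ (notation introduced here for illustration). The uncorrected Pearson chi-square statistic is then

\[ \chi^2 = \frac{n\,(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)} \]

so that

\[ \phi^2 = \frac{\chi^2}{n} \]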

Application Contexts:

  • Phi-Coefficient serves as a straightforward measure of association strength in studies with dichotomous variables, offering insights into the relationship’s intensity in medical, psychological, and social sciences research.

Like other tests and measures for categorical data, the phi coefficient has conditions that must be met for valid and reliable results: both variables must be dichotomous, and the observations must be independent. When these conditions hold, it is a useful tool for describing patterns and associations between groups or variables.

20.1 Example Problem:

Imagine a study looking at the relationship between having a gym membership (Yes or No) and being classified as physically active (Active or Not Active). Here’s the data collected from 200 individuals:

| | Active | Not Active | Total |
|---|---|---|---|
| Gym Member | 80 | 20 | 100 |
| No Gym Member | 30 | 70 | 100 |
| Total | 110 | 90 | 200 |

Step-by-Step Calculation:

First, let’s label the counts in our contingency table:

  • $a = 80$ (Active and Gym Member)
  • $b = 20$ (Not Active and Gym Member)
  • $c = 30$ (Active and No Gym Member)
  • $d = 70$ (Not Active and No Gym Member)

Phi Coefficient Formula: \[ \phi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}} \]

Plugging in the values, we get:

\[ \phi = \frac{(80 \times 70) - (20 \times 30)}{\sqrt{(80+20)(30+70)(80+30)(20+70)}} \]

\[ \phi = \frac{5600 - 600}{\sqrt{100 \times 100 \times 110 \times 90}} \]

\[ \phi = \frac{5000}{\sqrt{10000 \times 9900}} \]

\[ \phi = \frac{5000}{\sqrt{99000000}} \]

\[ \phi \approx \frac{5000}{9949.87} \]

\[ \phi \approx 0.5025 \]
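The same arithmetic can be reproduced in a few lines of Python. This is a minimal sketch of the direct formula; the function name `phi_coefficient` and the zero-margin check are choices made here for illustration, not part of any library.

```python
import math

def phi_coefficient(a: int, b: int, c: int, d: int) -> float:
    """Phi for the 2x2 table [[a, b], [c, d]] using the direct formula."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0:
        # A zero row or column total makes the denominator vanish.
        raise ValueError("phi is undefined when a row or column total is zero")
    return (a * d - b * c) / denom

# Gym membership example: a=80, b=20, c=30, d=70
phi = phi_coefficient(80, 20, 30, 70)
print(round(phi, 4))  # 0.5025
```

Swapping the two columns (or the two rows) flips the sign of the result but not its magnitude, which is why the sign only has meaning relative to how the categories are ordered.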

Interpretation:

The calculated phi coefficient of approximately 0.50 indicates a positive association between having a gym membership and being physically active. By the common rule of thumb (absolute values near 0.1 are weak, 0.3 moderate, and 0.5 or above strong), it sits at the threshold of a strong association. Individuals with gym memberships are more likely to be classified as active than those without: 80% of members are active, compared with 30% of non-members. The positive sign shows that the association is in the expected direction, while the magnitude is still well short of a perfect association.

20.2 Phi-Coefficient of Correlation calculation using R and Python:

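By default, chisq.test() in R and chi2_contingency() in SciPy apply Yates’ continuity correction to 2×2 tables. The correction reduces the chi-square statistic slightly, so a phi value derived from it is a bit smaller than the direct formula gives (about 0.4925 instead of 0.5025 for the gym example). To reproduce the direct-formula value, turn the correction off with correct = FALSE in R or correction=False in Python.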

In Python, the chi2_contingency() function from SciPy’s stats module computes the chi-square statistic, and the phi coefficient is then calculated as the square root of the chi-square statistic divided by the total sample size. Because a square root is never negative, this route returns the magnitude of phi only; the sign has to be taken from $ad - bc$.
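The chi-square route can be sketched as follows, using the gym example. The variable names are illustrative; `chi2_contingency` and its `correction` parameter come from `scipy.stats`, and restoring the sign from the cell counts is an extra step added here rather than something SciPy does.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: Gym Member, No Gym Member; columns: Active, Not Active
table = np.array([[80, 20],
                  [30, 70]])
n = table.sum()

# Yates' continuity correction is applied by default for 2x2 tables.
chi2_yates = chi2_contingency(table)[0]
chi2_plain = chi2_contingency(table, correction=False)[0]

phi_yates = np.sqrt(chi2_yates / n)
phi_plain = np.sqrt(chi2_plain / n)

# The square root drops the sign, so recover it from ad - bc.
(a, b), (c, d) = table
phi_signed = np.sign(a * d - b * c) * phi_plain

print(round(float(phi_yates), 4))   # 0.4925
print(round(float(phi_signed), 4))  # 0.5025
```

With the correction disabled, the result matches the direct formula; with it enabled, the value is slightly smaller.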


Summary

| Concept | Description |
|---|---|
| **Foundations** | |
| Phi Coefficient | A measure of the strength of association between two binary variables, ranging from minus one to one |
| Pearson Correlation Special Case | Phi is what Pearson correlation reduces to when both variables are dichotomous |
| Range of Phi | Values of plus or minus one indicate a perfect association, zero indicates no association |
| Interpretation of Phi | Absolute values near 0.1 are weak, 0.3 moderate, and 0.5 or above strong |
| **Computation** | |
| Direct Formula | (a times d minus b times c) divided by the square root of the product of the four marginal totals |
| Chi-square Derivation | The absolute value of phi equals the square root of the chi-square statistic divided by the total sample size |
| 2x2 Contingency Table | Phi is defined on two-by-two tables only and becomes ambiguous for larger tables |
| Continuity Correction Effect | Yates' correction reduces the chi-square slightly, so phi via chi-square is smaller than the direct formula |
| **Applications** | |
| Medical and Psychological Research | Used to quantify association between binary risk factors and binary outcomes |
| Social Sciences Research | Used to measure association between yes/no predictors and binary responses |
| **In R and Python** | |
| R Direct Formula | Compute phi directly from cell counts a, b, c, d without calling any test function |
| R via chisq.test() | Apply chisq.test() then take the square root of the statistic divided by sample size |
| Python Direct Formula | Compute phi directly using the same algebra in NumPy |
| Python via chi2_contingency | Apply chi2_contingency then derive phi from the returned chi-square statistic |