Linear Correlation
Introduction
- The news is filled with examples of correlations and associations:
- Drinking a glass of red wine per day may decrease your chances of a heart attack.
- Taking one aspirin per day may decrease your chances of stroke or of a heart attack.
- Eating lots of certain kinds of fish may improve your health and make you smarter.
- Driving more slowly reduces your chances of getting killed in a traffic accident.
- Taller people tend to weigh more.
- Pregnant women who smoke tend to have low-birthweight babies.
- Animals with large brains tend to be more intelligent.
- The more you study for an exam, the higher the score you are likely to receive.
- The correlation, denoted by r, measures the amount of linear
association between two variables.
- r is always between -1 and 1 inclusive.
- The R-squared value, denoted by $R^2$, is the square of the correlation. It measures the proportion of
variation in the dependent variable that can be attributed to
the independent variable.
- The R-squared value $R^2$ is always between 0 and 1 inclusive.
- Perfect positive linear association. The points are exactly on the trend line. Correlation r = 1; R-squared = 1.00.
- Large positive linear association. The points are close to the linear trend line. Correlation r = 0.9; R-squared = 0.81.
- Small positive linear association. The points are far from the trend line. Correlation r = 0.45; R-squared = 0.2025.
- No association between the variables. Correlation r = 0.0; R-squared = 0.0.
- Small negative association. Correlation r = -0.3; R-squared = 0.09.
- Large negative association. Correlation r = -0.95; R-squared = 0.9025.
- Perfect negative association. Correlation r = -1; R-squared = 1.00.
- How high must a correlation be to be considered meaningful?
It depends on the discipline. Here are some rough guidelines:
Discipline | r meaningful if | $R^2$ meaningful if |
Physics | r < -0.95 or 0.95 < r | 0.9 < $R^2$ |
Chemistry | r < -0.9 or 0.9 < r | 0.8 < $R^2$ |
Biology | r < -0.7 or 0.7 < r | 0.5 < $R^2$ |
Social Sciences | r < -0.5 or 0.5 < r | 0.25 < $R^2$ |
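Because R-squared is just the square of r, each R-squared threshold in the table is roughly the square of the matching r threshold (for example, 0.95 squared is 0.9025, close to 0.9). A minimal Python sketch of that relationship follows; the function name `r_squared` is illustrative and not part of these notes.

```python
def r_squared(r):
    """Return R-squared for a correlation r, which must lie in [-1, 1]."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must be between -1 and 1 inclusive")
    return r * r

# Correlations quoted in the scatterplot examples above.
for r in (1.0, 0.9, 0.45, 0.0, -0.3, -0.95, -1.0):
    print(f"r = {r:+.2f}  ->  R-squared = {r_squared(r):.4f}")
```

Note that r and -r give the same R-squared, so R-squared alone does not tell you the direction of the association.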
Calculating the Correlation
- To calculate the correlation, first standardize both the x and y
variables:
$z_{x_i} = (x_i - \bar{x}) / SD_x$
$z_{y_i} = (y_i - \bar{y}) / SD_y$
- Then compute r = the average of the products $z_{x_i} z_{y_i}$:
$r = \frac{1}{n} \sum_{i=1}^{n} z_{x_i} z_{y_i}$
- Example: Compute the correlation r of this dataset:
x | 1 | 3 | 4 | 5 | 7 |
y | 5 | 9 | 7 | 1 | 13 |
- We calculate:
$\bar{x} = (1 + 3 + 4 + 5 + 7) / 5 = 4$
$\bar{y} = (5 + 9 + 7 + 1 + 13) / 5 = 7$
$SD_x^2 = [(1-4)^2 + (3-4)^2 + (4-4)^2 + (5-4)^2 + (7-4)^2] / 5 = 4$
$SD_x = \sqrt{4} = 2$
$SD_y^2 = [(5-7)^2 + (9-7)^2 + (7-7)^2 + (1-7)^2 + (13-7)^2] / 5 = 16$
$SD_y = \sqrt{16} = 4$
- Now compute the average of the products of the z-scores of the x- and y-variables:
x | y | $z_x$ | $z_y$ | $z_x z_y$ |
1 | 5 | -1.5 | -0.5 | 0.75 |
3 | 9 | -0.5 | 0.5 | -0.25 |
4 | 7 | 0.0 | 0.0 | 0.00 |
5 | 1 | 0.5 | -1.5 | -0.75 |
7 | 13 | 1.5 | 1.5 | 2.25 |
Average of $z_x z_y$: 0.40
- Thus the correlation r is 0.4.
- Remember: the correlation is always between -1 and 1, inclusive.
- Why does this work? Here are three possibilities:
- In diagram (a), the x- and y-variables have a positive relationship.
Most of the (x,y) points lie in quadrants I and III, where the
$z_x z_y$ product is positive.
Therefore r > 0.
- In diagram (b), the x- and y-variables have a negative relationship.
Most of the (x,y) points lie in quadrants II and IV, where the
$z_x z_y$ product is negative.
Therefore r < 0.
- In diagram (c), the x- and y-variables have no relationship.
The positive products in quadrants I and III cancel out the
negative products in quadrants II and IV, so the average of the
products is close to 0; r is also close to 0.
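The standardize-then-average recipe in this section can be sketched in a few lines of Python. This is only an illustration under the section's own convention (the SD divides by n); the function names are made up for the example.

```python
from math import sqrt

def population_sd(values):
    """Standard deviation that divides by n (the SD used in this section)."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((v - mean) ** 2 for v in values) / n)

def z_scores(values):
    """Standardize each value: subtract the mean, divide by the SD."""
    mean = sum(values) / len(values)
    sd = population_sd(values)
    return [(v - mean) / sd for v in values]

def z_products(xs, ys):
    """Products of the paired z-scores, one per data point."""
    return [zx * zy for zx, zy in zip(z_scores(xs), z_scores(ys))]

def correlation(xs, ys):
    """r = the average of the products of the paired z-scores."""
    products = z_products(xs, ys)
    return sum(products) / len(products)

x = [1, 3, 4, 5, 7]
y = [5, 9, 7, 1, 13]
print(z_products(x, y))             # [0.75, -0.25, 0.0, -0.75, 2.25]
print(round(correlation(x, y), 4))  # 0.4
```

The positive products come from points in quadrants I and III of the standardized plot, the negative ones from quadrants II and IV; here the positive products outweigh the negative ones, so r > 0.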
Calculating the Correlation with SD+
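- Compute the correlation r of the same dataset as in the previous example (x = 1, 3, 4, 5, 7; y = 5, 9, 7, 1, 13), this time using SD+, the standard deviation that divides by n - 1: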
- Use SPSS to calculate descriptive statistics and z-scores:
$\bar{x} = 4.00$
$SD_x^+ = 2.236$
$\bar{y} = 7.00$
$SD_y^+ = 4.472$
x | y | $z_x$ | $z_y$ | $z_x z_y$ |
1 | 5 | -1.34164 | -0.44721 | 0.60 |
3 | 9 | -0.44721 | 0.44721 | -0.20 |
4 | 7 | 0.0 | 0.0 | 0.00 |
5 | 1 | 0.44721 | -1.34164 | -0.60 |
7 | 13 | 1.34164 | 1.34164 | 1.80 |
Average of $z_x z_y$: 0.32
- Multiply by the correction factor n / (n - 1):
(average of $z_x z_y$) × n / (n - 1) = 0.32 × 5 / (5 - 1) = 0.32 × 5 / 4 = 0.4.
This is the same answer obtained previously using $SD_x$ and $SD_y$.
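The SD+ calculation can be checked with Python's standard `statistics` module, whose `stdev` function divides by n - 1 and so plays the role of SD+. This is a rough sketch that assumes the five-point dataset from the worked example; it is not a replacement for the SPSS output.

```python
from statistics import mean, stdev  # stdev divides by n - 1, i.e. SD+

x = [1, 3, 4, 5, 7]
y = [5, 9, 7, 1, 13]
n = len(x)

# z-scores computed with SD+ instead of SD
zx = [(v - mean(x)) / stdev(x) for v in x]
zy = [(v - mean(y)) / stdev(y) for v in y]

# Plain average of the products, then the n / (n - 1) correction
avg_product = sum(a * b for a, b in zip(zx, zy)) / n
r = avg_product * n / (n - 1)

print(round(avg_product, 2))  # 0.32
print(round(r, 2))            # 0.4
```

Multiplying the average by n / (n - 1) is the same as dividing the sum of the products by n - 1 instead of n, which is why the correction recovers the earlier result.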
Correlation with SPSS
- With the Laundry Dataset, use SPSS to display the scatterplot of price (y-axis) vs.
rating (x-axis):
- Import the Laundry Dataset into SPSS.
- Verify that for the variables Rating and Price, Type is set to Numeric and Measure to Scale.
- Set the labels for Rating and Price to "Consumer Union Rating" and "Price (Cents) per Load", respectively.
- Select Graphs >> Chart Builder. In the Chart Builder dialog, click OK
because you have already set the Measure of both variables to Scale.
In the next Chart Builder dialog, select Scatter/Dot, and drag the Simple Scatter from
the Gallery into the Chart Preview area.
Drag the Rating variable into the
X-axis box and the Price variable into the Y-axis box.
Click OK.
- In the Laundry Dataset, use SPSS to compute the correlation of rating and price.
- Select Analyze >> Correlate >> Bivariate...
Drag "Consumer Union Rating" and "Price (Cents) per Load" into the Variables box.
Click OK.
- The output shows that the correlation is 0.671.
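For readers without SPSS, the same correlation could be computed in Python. The sketch below assumes the Laundry Dataset has been exported to a CSV file with columns named Rating and Price; the file name `laundry.csv` and the function names are hypothetical and not part of these notes.

```python
import csv
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)

def correlation_from_csv(path, x_col="Rating", y_col="Price"):
    """Read two numeric columns from a CSV file and return their correlation."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    xs = [float(row[x_col]) for row in rows]
    ys = [float(row[y_col]) for row in rows]
    return pearson_r(xs, ys)

# Example call (hypothetical file name, assumed column names):
# print(correlation_from_csv("laundry.csv"))
```

The formula in `pearson_r` is algebraically the same as averaging the products of the z-scores; the n in the average and in the two SDs cancels.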
Cautions
- Caution: Correlation does not necessarily imply
causation.
- If X is correlated with Y, there could be five explanations:
- X causes Y
- Y causes X
- X causes Y and Y causes X
- Some third variable Z causes X and Y
- The correlation is a coincidence; there is no causal relationship between X and Y.
- Here are some examples of correlations with implied causations that
have various explanations:
- The more firemen that are fighting a fire, the bigger the fire
is going to be.
The actual causation is Y → X: The
bigger the fire is, the more firemen are necessary to fight it.
- For a gas, an increase in pressure causes an increase in temperature.
This is Gay-Lussac's Law for an ideal gas held at constant volume. In fact X → Y and
Y → X. The causation works in both directions: an increase in
either temperature or pressure causes an increase in the other.
- Children who sleep with the light on are likely to develop
nearsightedness later in life.
This result was published in a study on May 13, 1999, in the journal
Nature. In fact, a follow-up study showed that Z → X and Z → Y. There is
a strong link between parental nearsightedness and
child nearsightedness. Also, nearsighted parents were more likely to
leave the light on in a child's room.
- Women who take hormone replacement therapy (HRT) are less likely
to have coronary heart disease.
At first glance X → Y, but after controlling for the third variable,
socio-economic group, the opposite effect was found: women who take HRT
were more likely to develop heart disease.
- As ice cream sales increase, the rate of drowning deaths increases.
This is also a case of Z → X and Z → Y. Both events depend on
the season of the year. In the summer months, ice cream sales increase;
drowning deaths also increase because more people go swimming.
- Piracy causes global warming.
It is true that both piracy and global warming have increased over the
past several decades, but this is just a coincidence. There is no
causal relationship. Another explanation is that both result from a common
third factor: population increase.
- For a correlation between X and Y to imply causation,
- X must precede Y in time,
- the causation must be plausible,
- common causes from other variables are controlled for.
- Question: Does smoking cause lung cancer?
- Caution: The correlation is misleading if there
is a nonlinear relationship between the variables.
Example 1:
There is a perfect quadratic relationship between
x and y, but the correlation is -0.368. A quadratic relationship between
x and y means that there is an equation $y = ax^2 + bx + c$ that
allows us to compute y from x. The coefficients a, b, and c must be determined from the
dataset.
- Caution: Outliers can distort the correlation:
Example 2:
Without the outlier, the correlation is 1;
with the outlier the correlation is 0.514.
Example 3:
Without the outlier, the correlation is 1;
with the outlier the correlation is 0.522.
- Use SPSS to continue the above analysis of
datasets/bears-1985.xls.
- Compute the correlation between meter and kilo.
- Create a scatterplot with a linear regression line
(linear trend line) of meter (x-variable) and kilo (y-variable).
- Repeat steps 1 and 2 after omitting the point that represents
William Perry.
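The nonlinearity and outlier cautions in this section can be checked numerically. The sketch below uses small made-up datasets chosen only for illustration; they are not the datasets behind Examples 1 to 3, so the correlations differ from the values quoted there.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)

# Nonlinear case (illustrative data): y = x^2 exactly, yet no linear trend.
x_quad = [-3, -2, -1, 0, 1, 2, 3]
y_quad = [x ** 2 for x in x_quad]
r_quad = pearson_r(x_quad, y_quad)

# Outlier case (illustrative data): five points on the line y = 2x,
# then the same points plus one stray point at (10, 0).
x_line = [1, 2, 3, 4, 5]
y_line = [2 * x for x in x_line]
r_line = pearson_r(x_line, y_line)
r_outlier = pearson_r(x_line + [10], y_line + [0])

print(f"quadratic data:      r = {r_quad:.3f}")
print(f"line, no outlier:    r = {r_line:.3f}")
print(f"line, with outlier:  r = {r_outlier:.3f}")
```

In this sketch the quadratic data give r = 0 even though y is completely determined by x, and the single stray point drags a perfect r = 1 down to a negative value (about -0.25). This is why a scatterplot should always accompany a correlation.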