Write the names of the students at your table:
Suppose you’ve opened up a dataset, made a jointplot,
and you find that two columns A and B
appear highly correlated. Let’s explore some of the ways this
correlation may have arisen. The first one is done for you:
A is liquid precipitation per minute, B is number of people wearing raincoats. What happened?
When it rains more, more people put on raincoats. A (rain) caused B (raincoats).
A is a measure of lung health; B is number of cigarettes smoked per day. What happened?
A is the ranking of the college a person graduated from; B is their score on the GRE (an SAT-like exam sometimes used for graduate school admissions). What happened?
A is height in centimeters; B is height in feet. What happened?
A is the cost to send a letter via the USPS; B is the number of Google searches for ‘i am dizzy’. What happened?
Recall that the formula for Pearson’s correlation coefficient is: \[ r(X, Y) = \frac{\frac{1}{n} \sum_{i=1}^n (X_i - \bar{X}) (Y_i - \bar{Y})} {\sigma(X) \sigma(Y)} \] where \(\sigma\) denotes the population standard deviation (dividing by \(n\)). Suppose one sample from \(Y\) is bad (e.g., it’s 1000x larger than it should be because of a unit error, such as a value recorded in meters where kilometers were expected). How will this affect \(r\)?
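Once you have written down a prediction, one way to check it is to try the experiment on made-up data. The sketch below is only an illustration: the helper name pearson_r, the synthetic data (Y roughly 2X plus noise), the sample size of 50, and the choice of which sample to corrupt are all assumptions, not part of the worksheet. It computes r with population standard deviations (dividing by n), then multiplies a single Y value by 1000 and recomputes.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson's r, using population standard deviations (divide by n)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Average product of deviations (the numerator of r).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)


random.seed(0)  # fixed seed so the run is repeatable

# Synthetic data, invented for illustration: Y is roughly 2X plus noise.
xs = [random.uniform(0, 10) for _ in range(50)]
ys = [2 * x + random.gauss(0, 1) for x in xs]

r_clean = pearson_r(xs, ys)

# Corrupt one sample of Y, as if it had been recorded in the wrong unit.
ys_bad = list(ys)
ys_bad[0] = ys_bad[0] * 1000

r_bad = pearson_r(xs, ys_bad)

print(f"r without the bad sample: {r_clean:.3f}")
print(f"r with the bad sample:    {r_bad:.3f}")
```

Try changing which sample is corrupted (for example, one whose X value is near the mean of X versus one near the extremes) and compare the two printed values against your prediction.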