Correlations in R
1 Introduction
We will create Tables for Correlations, and graphs for Correlations
in R. As always, we will consistently use the Project Mosaic
ecosystem of packages in R (mosaic
, mosaicData
and ggformula
).
Note the standard method for all commands from the
mosaic
package:
goal( y ~ x | z, data = mydata, …)
1.1 Case Study -1:
Dataset from mosaicData
Let us inspect what datasets are available in the package
mosaicData
. Run this command in your Console:
# Run in Console
data(package = "mosaicData")
The popup tab shows a lot of datasets we could use. Let us continue
to use the famous Galton
dataset and inspect
it:
data("Galton")
inspect(Galton)
#>
#> categorical variables:
#> name class levels n missing
#> 1 family factor 197 898 0
#> 2 sex factor 2 898 0
#> distribution
#> 1 185 (1.7%), 166 (1.2%), 66 (1.2%) ...
#> 2 M (51.8%), F (48.2%)
#>
#> quantitative variables:
#> name class min Q1 median Q3 max mean sd n missing
#> 1 father numeric 62 68 69.0 71.0 78.5 69.232851 2.470256 898 0
#> 2 mother numeric 58 63 64.0 65.5 70.5 64.084410 2.307025 898 0
#> 3 height numeric 56 64 66.5 69.7 79.0 66.760690 3.582918 898 0
#> 4 nkids integer 1 4 6.0 8.0 15.0 6.135857 2.685156 898 0
The inspect
command already gives us a series of
statistical measures of different variables of interest. As discussed
previously, we can retain the output of inspect
and use it
in our reports: (there are ways of dressing up these tables too)
<- inspect(Galton)
galton_describe
$categorical galton_describe
$quantitative galton_describe
Try help("Galton")
in your Console. The dataset is
described as:
A data frame with 898 observations on the following variables.
-family
a factor with levels for each family
-father
the father’s height (in inches)
-mother
the mother’s height (in inches)
-sex
the child’s sex: F or M
-height
the child’s height as an adult (in inches)
-nkids
the number of adult children in the family, or, at least, the number whose heights Galton recorded.
There is a lot of Description generated by the
mosaic::inspect()
command ! What can we say about the
dataset and its variables? How big is the dataset? How many variables?
What types are they, Quant or Qual? If they are Qual, what are the
levels? Are they ordered levels? Discuss!
1.2 Correlations and Plots
What Questions might we have, that we could answer with a Statistical Measure, or Correlation chart?
Q.1 Which are the variables that have significant pair-wise correlations? What polarity are these correlations?
# Pulling out the list of Quant variables from NHANES
<- galton_describe$quantitative
galton_quant $name
galton_quant#> [1] "father" "mother" "height" "nkids"
::ggpairs(
GGally
Galton,columns = c("father", "mother", "height", "nkids"),
diag = list("densityDiag"),
title = "Galton Data Correlations Plot"
)
Can we plot a Correlogram for this dataset?
#library(corrplot)
<- Galton %>% select(father, mother, height, nkids)
galton_num_var <- cor(galton_num_var)
galton_cor <- galton_cor %>%
galton_correlogram corrplot(method = "ellipse",
type = "lower",
main = "Correlogram for Galton dataset")
galton_correlogram#> $corr
#> father mother height nkids
#> father 1.00000000 0.07366461 0.2753548 -0.16002294
#> mother 0.07366461 1.00000000 0.2016549 -0.02002964
#> height 0.27535483 0.20165489 1.0000000 -0.12691012
#> nkids -0.16002294 -0.02002964 -0.1269101 1.00000000
#>
#> $corrPos
#> xName yName x y corr
#> 1 father father 1 4 1.00000000
#> 2 father mother 1 3 0.07366461
#> 3 father height 1 2 0.27535483
#> 4 father nkids 1 1 -0.16002294
#> 5 mother mother 2 3 1.00000000
#> 6 mother height 2 2 0.20165489
#> 7 mother nkids 2 1 -0.02002964
#> 8 height height 3 2 1.00000000
#> 9 height nkids 3 1 -0.12691012
#> 10 nkids nkids 4 1 1.00000000
#>
#> $arg
#> $arg$type
#> [1] "lower"
Clearly height
is positively correlated to
father
and mother
; interestingly,
height
is negatively correlated ( slightly) with
nkids
.
Q.2: Let us confirm with a test:
height ~ father
:
::cor_test(height ~ father, data = Galton)
mosaic#>
#> Pearson's product-moment correlation
#>
#> data: height and father
#> t = 8.5737, df = 896, p-value < 2.2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.2137851 0.3347455
#> sample estimates:
#> cor
#> 0.2753548
height ~ mother
:
::cor_test(height ~ mother, data = Galton)
mosaic#>
#> Pearson's product-moment correlation
#>
#> data: height and mother
#> t = 6.1628, df = 896, p-value = 1.079e-09
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.1380554 0.2635982
#> sample estimates:
#> cor
#> 0.2016549
Q.3 What does this correlation look when split by sex
of
Child?
Sons ~ Parents
# For the sons
cor_test(height ~ father, data = Galton %>% filter(sex == "M"))
#>
#> Pearson's product-moment correlation
#>
#> data: height and father
#> t = 9.1498, df = 463, p-value < 2.2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.3114667 0.4656805
#> sample estimates:
#> cor
#> 0.3913174
cor_test(height ~ mother, data = Galton %>% filter(sex == "M"))
#>
#> Pearson's product-moment correlation
#>
#> data: height and mother
#> t = 7.628, df = 463, p-value = 1.367e-13
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.2508178 0.4125305
#> sample estimates:
#> cor
#> 0.3341309
Daughters ~ Parents
# For the daughters
cor_test(height ~ father, data = Galton %>% filter(sex == "F"))
#>
#> Pearson's product-moment correlation
#>
#> data: height and father
#> t = 10.719, df = 431, p-value < 2.2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.3809944 0.5300812
#> sample estimates:
#> cor
#> 0.4587605
cor_test(height ~ mother, data = Galton %>% filter(sex == "F"))
#>
#> Pearson's product-moment correlation
#>
#> data: height and mother
#> t = 6.8588, df = 431, p-value = 2.421e-11
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.2261463 0.3962226
#> sample estimates:
#> cor
#> 0.3136984
Q.4. How can we show this correlation in a set of Scatter Plots + Regression Lines? Can we recreate Galton’s famous diagram?
# For the sons
gf_point(height ~ father,
data = Galton %>% filter(sex == "M")) %>%
gf_smooth(method = "lm")
gf_point(height ~ mother,
data = Galton %>% filter(sex == "M")) %>%
gf_smooth(method = "lm")
# For the daughters
gf_point(height ~ father,
data = Galton %>% filter(sex == "F")) %>%
gf_smooth(method = "lm")
gf_point(height ~ mother,
data = Galton %>% filter(sex == "F")) %>%
gf_smooth(method = "lm")
An approximation to Galton’s famous plot:
gf_point(height ~ (father + mother)/2, data = Galton) %>%
gf_smooth(method = "lm") %>%
gf_density_2d(n = 8) %>%
gf_abline(slope = 1)
How would you interpret this plot?
1.3 Case Study -2:
Dataset from NHANES
We will “live code” this in class!
2 Conclusion
We have a decent Correlations related workflow in R:
- Load the dataset
inspect
the dataset, identify Quant and Qual variables- Develop Pair-Wise plots + Correlations using
GGally::ggpairs()
- Develop Correlogram
corrplot::corrplot
- Check everything with a
cor_test
- Plot scatter plots using
gf_point
if needed. - Add extra lines using
gf_abline()
to compare hypotheses that you may have.
2.1 Reference
The Farnam Street Blog. Regression Toward the Mean: An Introduction with Examples. https://fs.blog/regression-to-the-mean/