Correlations in R
1 Introduction
We will create Tables for Correlations, and graphs for Correlations
in R. As always, we will consistently use the Project Mosaic
ecosystem of packages in R (mosaic, mosaicData
and ggformula).
Note the standard method for all commands from the
mosaic package:
goal( y ~ x | z, data = mydata, …)
1.1 Case Study -1:
Dataset from mosaicData
Let us inspect what datasets are available in the package
mosaicData. Run this command in your Console:
# Run in Console
data(package = "mosaicData")The popup tab shows a lot of datasets we could use. Let us continue
to use the famous Galton dataset and inspect
it:
data("Galton")
inspect(Galton)
#>
#> categorical variables:
#> name class levels n missing
#> 1 family factor 197 898 0
#> 2 sex factor 2 898 0
#> distribution
#> 1 185 (1.7%), 166 (1.2%), 66 (1.2%) ...
#> 2 M (51.8%), F (48.2%)
#>
#> quantitative variables:
#> name class min Q1 median Q3 max mean sd n missing
#> 1 father numeric 62 68 69.0 71.0 78.5 69.232851 2.470256 898 0
#> 2 mother numeric 58 63 64.0 65.5 70.5 64.084410 2.307025 898 0
#> 3 height numeric 56 64 66.5 69.7 79.0 66.760690 3.582918 898 0
#> 4 nkids integer 1 4 6.0 8.0 15.0 6.135857 2.685156 898 0The inspect command already gives us a series of
statistical measures of different variables of interest. As discussed
previously, we can retain the output of inspect and use it
in our reports: (there are ways of dressing up these tables too)
galton_describe <- inspect(Galton)
galton_describe$categoricalname <chr> | class <chr> | levels <int> | n <int> | missing <int> | distribution <chr> |
|---|---|---|---|---|---|
| family | factor | 197 | 898 | 0 | 185 (1.7%), 166 (1.2%), 66 (1.2%) ... |
| sex | factor | 2 | 898 | 0 | M (51.8%), F (48.2%) |
galton_describe$quantitativename <chr> | class <chr> | min <dbl> | Q1 <dbl> | median <dbl> | Q3 <dbl> | max <dbl> | mean <dbl> | sd <dbl> | ||
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | father | numeric | 62 | 68 | 69.0 | 71.0 | 78.5 | 69.232851 | 2.470256 | |
| 2 | mother | numeric | 58 | 63 | 64.0 | 65.5 | 70.5 | 64.084410 | 2.307025 | |
| 3 | height | numeric | 56 | 64 | 66.5 | 69.7 | 79.0 | 66.760690 | 3.582918 | |
| 4 | nkids | integer | 1 | 4 | 6.0 | 8.0 | 15.0 | 6.135857 | 2.685156 |
Try help("Galton") in your Console. The dataset is
described as:
A data frame with 898 observations on the following variables.
-familya factor with levels for each family
-fatherthe father’s height (in inches)
-motherthe mother’s height (in inches)
-sexthe child’s sex: F or M
-heightthe child’s height as an adult (in inches)
-nkidsthe number of adult children in the family, or, at least, the number whose heights Galton recorded.
There is a lot of Description generated by the
mosaic::inspect() command ! What can we say about the
dataset and its variables? How big is the dataset? How many variables?
What types are they, Quant or Qual? If they are Qual, what are the
levels? Are they ordered levels? Discuss!
1.2 Correlations and Plots
What Questions might we have, that we could answer with a Statistical Measure, or Correlation chart?
Q.1 Which are the variables that have significant pair-wise correlations? What polarity are these correlations?
# Pulling out the list of Quant variables from NHANES
galton_quant <- galton_describe$quantitative
galton_quant$name
#> [1] "father" "mother" "height" "nkids"
GGally::ggpairs(
Galton,
columns = c("father", "mother", "height", "nkids"),
diag = list("densityDiag"),
title = "Galton Data Correlations Plot"
)Can we plot a Correlogram for this dataset?
#library(corrplot)
galton_num_var <- Galton %>% select(father, mother, height, nkids)
galton_cor <- cor(galton_num_var)
galton_correlogram <- galton_cor %>%
corrplot(method = "ellipse",
type = "lower",
main = "Correlogram for Galton dataset")galton_correlogram
#> $corr
#> father mother height nkids
#> father 1.00000000 0.07366461 0.2753548 -0.16002294
#> mother 0.07366461 1.00000000 0.2016549 -0.02002964
#> height 0.27535483 0.20165489 1.0000000 -0.12691012
#> nkids -0.16002294 -0.02002964 -0.1269101 1.00000000
#>
#> $corrPos
#> xName yName x y corr
#> 1 father father 1 4 1.00000000
#> 2 father mother 1 3 0.07366461
#> 3 father height 1 2 0.27535483
#> 4 father nkids 1 1 -0.16002294
#> 5 mother mother 2 3 1.00000000
#> 6 mother height 2 2 0.20165489
#> 7 mother nkids 2 1 -0.02002964
#> 8 height height 3 2 1.00000000
#> 9 height nkids 3 1 -0.12691012
#> 10 nkids nkids 4 1 1.00000000
#>
#> $arg
#> $arg$type
#> [1] "lower"Clearly height is positively correlated to
father and mother; interestingly,
height is negatively correlated ( slightly) with
nkids.
Q.2: Let us confirm with a test:
height ~ father:
mosaic::cor_test(height ~ father, data = Galton)
#>
#> Pearson's product-moment correlation
#>
#> data: height and father
#> t = 8.5737, df = 896, p-value < 2.2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.2137851 0.3347455
#> sample estimates:
#> cor
#> 0.2753548height ~ mother:
mosaic::cor_test(height ~ mother, data = Galton)
#>
#> Pearson's product-moment correlation
#>
#> data: height and mother
#> t = 6.1628, df = 896, p-value = 1.079e-09
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.1380554 0.2635982
#> sample estimates:
#> cor
#> 0.2016549Q.3 What does this correlation look when split by sex of
Child?
Sons ~ Parents
# For the sons
cor_test(height ~ father, data = Galton %>% filter(sex == "M"))
#>
#> Pearson's product-moment correlation
#>
#> data: height and father
#> t = 9.1498, df = 463, p-value < 2.2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.3114667 0.4656805
#> sample estimates:
#> cor
#> 0.3913174
cor_test(height ~ mother, data = Galton %>% filter(sex == "M"))
#>
#> Pearson's product-moment correlation
#>
#> data: height and mother
#> t = 7.628, df = 463, p-value = 1.367e-13
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.2508178 0.4125305
#> sample estimates:
#> cor
#> 0.3341309Daughters ~ Parents
# For the daughters
cor_test(height ~ father, data = Galton %>% filter(sex == "F"))
#>
#> Pearson's product-moment correlation
#>
#> data: height and father
#> t = 10.719, df = 431, p-value < 2.2e-16
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.3809944 0.5300812
#> sample estimates:
#> cor
#> 0.4587605
cor_test(height ~ mother, data = Galton %>% filter(sex == "F"))
#>
#> Pearson's product-moment correlation
#>
#> data: height and mother
#> t = 6.8588, df = 431, p-value = 2.421e-11
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#> 0.2261463 0.3962226
#> sample estimates:
#> cor
#> 0.3136984Q.4. How can we show this correlation in a set of Scatter Plots + Regression Lines? Can we recreate Galton’s famous diagram?
# For the sons
gf_point(height ~ father,
data = Galton %>% filter(sex == "M")) %>%
gf_smooth(method = "lm")
gf_point(height ~ mother,
data = Galton %>% filter(sex == "M")) %>%
gf_smooth(method = "lm")
# For the daughters
gf_point(height ~ father,
data = Galton %>% filter(sex == "F")) %>%
gf_smooth(method = "lm")
gf_point(height ~ mother,
data = Galton %>% filter(sex == "F")) %>%
gf_smooth(method = "lm")An approximation to Galton’s famous plot:
gf_point(height ~ (father + mother)/2, data = Galton) %>%
gf_smooth(method = "lm") %>%
gf_density_2d(n = 8) %>%
gf_abline(slope = 1)How would you interpret this plot?
1.3 Case Study -2:
Dataset from NHANES
We will “live code” this in class!
2 Conclusion
We have a decent Correlations related workflow in R:
- Load the dataset
inspectthe dataset, identify Quant and Qual variables- Develop Pair-Wise plots + Correlations using
GGally::ggpairs() - Develop Correlogram
corrplot::corrplot - Check everything with a
cor_test - Plot scatter plots using
gf_pointif needed. - Add extra lines using
gf_abline()to compare hypotheses that you may have.
2.1 Reference
The Farnam Street Blog. Regression Toward the Mean: An Introduction with Examples. https://fs.blog/regression-to-the-mean/