Categorical Data in R
1 Introduction
Let us explore some categorical (or predominantly categorical) datasets in R and see how we can analyze and plot them.
First we will learn how to make Contingency Tables from data in any of the three forms. This gives us a common table format to use for plotting.
Then we will use vcd, mosaic, ggplot and ggpubr to make several plots for Categorical Datasets.
knitr::opts_chunk$set(message = FALSE, echo = TRUE)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(vcd) # Contingency Tables and Plots
## Loading required package: grid
library(vcdExtra) # Datasets
## Loading required package: gnm
##
## Attaching package: 'vcdExtra'
##
## The following object is masked from 'package:dplyr':
##
## summarise
library(sjPlot) # Likert Plots
## Install package "strengejacke" from GitHub (`devtools::install_github("strengejacke/strengejacke")`) to load all sj-packages at once!
library(mosaic) # Data Analysis and Plots
## Registered S3 method overwritten by 'mosaic':
## method from
## fortify.SpatialPolygonsDataFrame ggplot2
##
## The 'mosaic' package masks several functions from core packages in order to add
## additional features. The original behavior of these functions should not be affected by this.
##
## Attaching package: 'mosaic'
##
## The following object is masked from 'package:Matrix':
##
## mean
##
## The following object is masked from 'package:vcd':
##
## mplot
##
## The following objects are masked from 'package:dplyr':
##
## count, do, tally
##
## The following object is masked from 'package:purrr':
##
## cross
##
## The following object is masked from 'package:ggplot2':
##
## stat
##
## The following objects are masked from 'package:stats':
##
## binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
## quantile, sd, t.test, var
##
## The following objects are masked from 'package:base':
##
## max, mean, min, prod, range, sample, sum
library(ggmosaic) # Mosaic Plots
##
## Attaching package: 'ggmosaic'
##
## The following objects are masked from 'package:vcd':
##
## mosaic, spine
library(ggpubr) # Balloon Plots
#install.packages("openintro")
library(openintro) # More Datasets
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
##
## Attaching package: 'openintro'
##
## The following object is masked from 'package:mosaic':
##
## dotPlot
##
## The following objects are masked from 'package:lattice':
##
## ethanol, lsegments
2 Creating Contingency Tables
Most plots for Categorical Data (as we shall see) require that the data be converted into a Contingency Table; even statistical tests for proportions (the $\chi^2$ test) need Contingency Tables. The Frequency Table we encountered earlier is very close to being a full-fledged Contingency Table; it only needs the row and column margin counts added.
In this section we learn how to make Contingency Tables from each of the three forms.
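As a minimal sketch of why the table form matters, base R's chisq.test() accepts a two-way contingency table directly. The example assumes the Arthritis data frame from the vcd package (the same data used in the next subsection); the object names tab and res are illustrative only.

```r
library(vcd)  # assumed source of the Arthritis data frame

# Cross-tabulate treatment against improvement (a 2 x 3 contingency table)
tab <- table(Arthritis$Treatment, Arthritis$Improved)

# Pearson chi-squared test of independence, run on the table itself
res <- stats::chisq.test(tab)
res$expected   # expected counts under independence
res$parameter  # degrees of freedom: (rows - 1) * (cols - 1)
```

The explicit stats:: prefix is a precaution, since several packages loaded in this document mask base and stats functions.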
2.1 Using base R
Arthritis
# One-Way Table (one variable)
table(Arthritis$Treatment)
##
## Placebo Treated
## 43 41
table(Arthritis$Treatment) %>% prop.table()
##
## Placebo Treated
## 0.5119048 0.4880952
table(Arthritis$Treatment) %>% addmargins()
##
## Placebo Treated Sum
## 43 41 84
# Two-Way Table (two variables)
table(Arthritis$Treatment, Arthritis$Improved) %>% prop.table()
##
## None Some Marked
## Placebo 0.34523810 0.08333333 0.08333333
## Treated 0.15476190 0.08333333 0.25000000
table(Arthritis$Treatment, Arthritis$Improved) %>% addmargins() # Contingency Table!!
##
## None Some Marked Sum
## Placebo 29 7 7 43
## Treated 13 7 21 41
## Sum 42 14 28 84
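The prop.table() call above divides every cell by the grand total. When we want proportions within each treatment group instead, the margin argument of prop.table() can be used. This is a small sketch on the same Arthritis data; the names tab, row_props and col_props are illustrative.

```r
library(vcd)  # provides the Arthritis data frame

tab <- table(Arthritis$Treatment, Arthritis$Improved)

# margin = 1: divide each cell by its row total (each row sums to 1)
row_props <- prop.table(tab, margin = 1)

# margin = 2: divide each cell by its column total (each column sums to 1)
col_props <- prop.table(tab, margin = 2)

round(row_props, 3)
```

Row proportions answer "what share of each treatment group improved?", which is usually the comparison of interest here.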
We can use table() and xtabs() to generate multi-dimensional tables too (more than 2D). These will be printed out as a series of 2D tables, one for each value of the “third” variable.
We can also use ftable() to print these out in an attractive manner.
mytable <- table(Arthritis$Treatment, Arthritis$Sex, Arthritis$Improved)
mytable
## , , = None
##
##
## Female Male
## Placebo 19 10
## Treated 6 7
##
## , , = Some
##
##
## Female Male
## Placebo 7 0
## Treated 5 2
##
## , , = Marked
##
##
## Female Male
## Placebo 6 1
## Treated 16 5
ftable(mytable)
## None Some Marked
##
## Placebo Female 19 7 6
## Male 10 0 1
## Treated Female 6 5 16
## Male 7 2 5
ftable(mytable) %>% addmargins()
## Sum
## 19 7 6 32
## 10 0 1 11
## 6 5 16 27
## 7 2 5 14
## Sum 42 14 28 84
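The text above mentions xtabs() without showing it. A possible sketch follows: xtabs() takes a formula and a data argument, and it keeps the variable names as dimension names, so the printed tables are labelled (unlike the blank headers produced by table(Arthritis$...) above). The object name xt is illustrative.

```r
library(vcd)  # provides the Arthritis data frame

# Variables to cross-classify go on the right-hand side of ~
xt <- xtabs(~ Treatment + Sex + Improved, data = Arthritis)

names(dimnames(xt))  # the variable names are retained
ftable(xt)           # flat printout, now with labelled headers

# Two-way table with margins, from the same interface
addmargins(xtabs(~ Treatment + Improved, data = Arthritis))
```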
2.2 Using the vcd package
The vcd (Visualizing Categorical Data) package, inspired by Michael Friendly's book of the same name, has a convenient function to create Contingency Tables: structable(). This function produces a ‘flat’ representation of a high-dimensional contingency table constructed by recursive splits (similar to the construction of mosaic charts/graphs).
The arguments of structable are:
- a formula ($y + p \sim x + z$) in which the variables on the left-hand side become the columns of the table and the variables on the right-hand side become the rows
- a data argument, which can indicate a data frame
vcd::structable(data = Arthritis, Treatment ~ Improved)
## Treatment Placebo Treated
## Improved
## None 29 13
## Some 7 7
## Marked 7 21
vcd::structable(data = Arthritis, Treatment ~ Improved + Sex)
## Treatment Placebo Treated
## Improved Sex
## None Female 19 6
## Male 10 7
## Some Female 7 5
## Male 0 2
## Marked Female 6 16
## Male 1 5
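To see the effect of the two sides of the formula, here is a small sketch that moves Improved to the left-hand side, so that it should appear as the column variable while Treatment and Sex are nested in the rows. The object name st is illustrative.

```r
library(vcd)  # provides structable() and the Arthritis data frame

# Left-hand side -> columns, right-hand side -> rows
st <- vcd::structable(Improved ~ Treatment + Sex, data = Arthritis)
st
```

The cell counts are the same as in the tables above; only the layout changes.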
# HairEyeColor is in multiple table form
# structable flattens these into one, as for a mosaic chart
HairEyeColor
## , , Sex = Male
##
## Eye
## Hair Brown Blue Hazel Green
## Black 32 11 10 3
## Brown 53 50 25 15
## Red 10 10 7 7
## Blond 3 30 5 8
##
## , , Sex = Female
##
## Eye
## Hair Brown Blue Hazel Green
## Black 36 9 5 2
## Brown 66 34 29 14
## Red 16 7 7 7
## Blond 4 64 5 8
vcd::structable(HairEyeColor)
## Eye Brown Blue Hazel Green
## Hair Sex
## Black Male 32 11 10 3
## Female 36 9 5 2
## Brown Male 53 50 25 15
## Female 66 34 29 14
## Red Male 10 10 7 7
## Female 16 7 7 7
## Blond Male 3 30 5 8
## Female 4 64 5 8
# UCBAdmissions is already in Frequency Form, i.e. a Contingency Table
# `structable` tends to render flat tables, of the kind that can be thought of as a "text representation" of the `vcd::mosaic` plot
UCBAdmissions
## , , Dept = A
##
## Gender
## Admit Male Female
## Admitted 512 89
## Rejected 313 19
##
## , , Dept = B
##
## Gender
## Admit Male Female
## Admitted 353 17
## Rejected 207 8
##
## , , Dept = C
##
## Gender
## Admit Male Female
## Admitted 120 202
## Rejected 205 391
##
## , , Dept = D
##
## Gender
## Admit Male Female
## Admitted 138 131
## Rejected 279 244
##
## , , Dept = E
##
## Gender
## Admit Male Female
## Admitted 53 94
## Rejected 138 299
##
## , , Dept = F
##
## Gender
## Admit Male Female
## Admitted 22 24
## Rejected 351 317
vcd::structable(UCBAdmissions)
## Gender Male Female
## Admit Dept
## Admitted A 512 89
## B 353 17
## C 120 202
## D 138 131
## E 53 94
## F 22 24
## Rejected A 313 19
## B 207 8
## C 205 391
## D 279 244
## E 138 299
## F 351 317
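Since tables like UCBAdmissions often need to be moved between forms, the following sketch shows one way to do it in base R. It assumes the usual vcd terminology: a multi-way table object is in "table form", and a data frame with one row per cell plus a count column is in "frequency form". The names ucb_freq and ucb_tab are illustrative.

```r
# Table form -> frequency form: one row per cell, counts in the Freq column
ucb_freq <- as.data.frame(UCBAdmissions)
head(ucb_freq)

# Frequency form -> table form: put the count column on the left of ~
ucb_tab <- xtabs(Freq ~ Admit + Gender + Dept, data = ucb_freq)
ftable(ucb_tab)
```

The frequency-form data frame is also what the tidyverse tools generally expect as input.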
2.3 Using the mosaic package
I think this is the simplest and most elegant way of obtaining Contingency Tables:
# One Way Table
tally( ~ substance, data = HELPrct)
## substance
## alcohol cocaine heroin
## 177 152 124
# Two Way Tables
tally( ~ substance + sex , data = HELPrct)
## sex
## substance female male
## alcohol 36 141
## cocaine 41 111
## heroin 30 94
tally( ~ substance | sex , data = HELPrct)
## sex
## substance female male
## alcohol 36 141
## cocaine 41 111
## heroin 30 94
tally( sex ~ substance, data = HELPrct)
## substance
## sex alcohol cocaine heroin
## female 36 41 30
## male 141 111 94
tally(~ sex |substance, data = HELPrct)
## substance
## sex alcohol cocaine heroin
## female 36 41 30
## male 141 111 94
# Adding Margins
tally( ~ substance + sex , data = HELPrct, format = 'count', margins = TRUE) # Ta Da!
## sex
## substance female male Total
## alcohol 36 141 177
## cocaine 41 111 152
## heroin 30 94 124
## Total 107 346 453
tally( ~ substance + sex , data = HELPrct, format = 'percent', margins = TRUE)
## sex
## substance female male Total
## alcohol 7.947020 31.125828 39.072848
## cocaine 9.050773 24.503311 33.554084
## heroin 6.622517 20.750552 27.373068
## Total 23.620309 76.379691 100.000000
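The percentages above are shares of the grand total. If we want the distribution of substance within each sex instead, one option (a sketch, not the only way) is to pass the tally() result to base R's prop.table(), which works on any table or array. The names tab and cond are illustrative.

```r
library(mosaic)  # provides tally(); HELPrct comes from mosaicData

# Two-way table of counts, exactly as above
tab <- mosaic::tally(~ substance + sex, data = HELPrct)

# margin = 2: divide by the column (sex) totals, so each column sums to 1
cond <- prop.table(tab, margin = 2)
round(cond, 3)
```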
2.3.1 Using the tidyverse
diamonds %>% group_by(cut, clarity) %>% dplyr::summarise(count = n())
# We need to pivot this "wide" to obtain a Contingency Table
diamonds %>%
  group_by(cut, clarity) %>%
  dplyr::summarise(count = n()) %>%
  pivot_wider(id_cols = cut, names_from = clarity, values_from = count) %>%
  # Now add the row and column totals using the `janitor` package
  janitor::adorn_totals(where = c("row", "col")) %>%
  # Reconvert to tibble since janitor gives a "tabyl" format (which can be useful)
  as_tibble()
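A slightly shorter variant of the same idea is sketched below, under two assumptions: dplyr::count() replaces the group_by() plus summarise() pair, and values_fill = 0 is added so that any combination with no rows would show 0 rather than NA. The dplyr:: prefix matters in this document because mosaic masks count(). The object name wide is illustrative.

```r
library(dplyr)
library(tidyr)
library(ggplot2)  # provides the diamonds data

wide <- diamonds %>%
  # One row per (cut, clarity) pair, with the number of diamonds in `count`
  dplyr::count(cut, clarity, name = "count") %>%
  # Spread clarity across the columns; fill absent combinations with 0
  pivot_wider(names_from = clarity, values_from = count, values_fill = 0)

wide
```

The counts can be cross-checked against table(diamonds$cut, diamonds$clarity) from base R.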
Now that we have Contingency Tables, we can plot these: