Categorical Data in R

1 Introduction

Let us play with some categorical ( or predominantly categorical ) datasets in R and see how we can analyze and plot them.

First we will learn how to make Contingency Tables from data in any of its three forms: case form, frequency form, and table form. This gives us a common form of Table to use for plotting.

Then we will use vcd, mosaic, ggplot2 and ggpubr to make several plots for Categorical Datasets.

knitr::opts_chunk$set(message = FALSE, echo = TRUE)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(vcd) # Contingency Tables and Plots
## Loading required package: grid
library(vcdExtra) # Datasets
## Loading required package: gnm
## 
## Attaching package: 'vcdExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     summarise
library(sjPlot) # Likert Plots
## Install package "strengejacke" from GitHub (`devtools::install_github("strengejacke/strengejacke")`) to load all sj-packages at once!
library(mosaic) # Data Analysis and Plots
## Registered S3 method overwritten by 'mosaic':
##   method                           from   
##   fortify.SpatialPolygonsDataFrame ggplot2
## 
## The 'mosaic' package masks several functions from core packages in order to add 
## additional features.  The original behavior of these functions should not be affected by this.
## 
## Attaching package: 'mosaic'
## 
## The following object is masked from 'package:Matrix':
## 
##     mean
## 
## The following object is masked from 'package:vcd':
## 
##     mplot
## 
## The following objects are masked from 'package:dplyr':
## 
##     count, do, tally
## 
## The following object is masked from 'package:purrr':
## 
##     cross
## 
## The following object is masked from 'package:ggplot2':
## 
##     stat
## 
## The following objects are masked from 'package:stats':
## 
##     binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
##     quantile, sd, t.test, var
## 
## The following objects are masked from 'package:base':
## 
##     max, mean, min, prod, range, sample, sum
library(ggmosaic) # Mosaic Plots
## 
## Attaching package: 'ggmosaic'
## 
## The following objects are masked from 'package:vcd':
## 
##     mosaic, spine
library(ggpubr) # Balloon Plots

#install.packages("openintro")
library(openintro) # More Datasets
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
## 
## Attaching package: 'openintro'
## 
## The following object is masked from 'package:mosaic':
## 
##     dotPlot
## 
## The following objects are masked from 'package:lattice':
## 
##     ethanol, lsegments

2 Creating Contingency Tables

Most plots for Categorical Data ( as we shall see ) require that the data be converted into a Contingency Table; even Statistical tests for Proportions ( the $\chi^2$ test ) need Contingency Tables. The Frequency Table we encountered earlier is very close to being a full-fledged Contingency Table; it only needs the row and column margin counts added.

In this section we learn how to make Contingency Tables from data in each of the three forms.

2.1 Using base R

Arthritis
#One Way Table ( one variable )
table(Arthritis$Treatment)
## 
## Placebo Treated 
##      43      41
table(Arthritis$Treatment) %>% prop.table()
## 
##   Placebo   Treated 
## 0.5119048 0.4880952
table(Arthritis$Treatment) %>% addmargins()
## 
## Placebo Treated     Sum 
##      43      41      84
# Two-Way Table ( two variables )
table(Arthritis$Treatment, Arthritis$Improved) %>% prop.table() 
##          
##                 None       Some     Marked
##   Placebo 0.34523810 0.08333333 0.08333333
##   Treated 0.15476190 0.08333333 0.25000000
table(Arthritis$Treatment, Arthritis$Improved) %>% addmargins() # Contingency Table!!
##          
##           None Some Marked Sum
##   Placebo   29    7      7  43
##   Treated   13    7     21  41
##   Sum       42   14     28  84

We can use table() and xtabs() to generate multi-dimensional tables too ( more than 2D ). These are printed as a series of 2D tables, one for each level of the “third” variable.

We can also use ftable() to print these in a more compact, flat layout.

mytable <- table(Arthritis$Treatment, Arthritis$Sex, Arthritis$Improved)
mytable
## , ,  = None
## 
##          
##           Female Male
##   Placebo     19   10
##   Treated      6    7
## 
## , ,  = Some
## 
##          
##           Female Male
##   Placebo      7    0
##   Treated      5    2
## 
## , ,  = Marked
## 
##          
##           Female Male
##   Placebo      6    1
##   Treated     16    5
ftable(mytable)
##                 None Some Marked
##                                 
## Placebo Female    19    7      6
##         Male      10    0      1
## Treated Female     6    5     16
##         Male       7    2      5
ftable(mytable) %>% addmargins()
##              Sum
##     19  7  6  32
##     10  0  1  11
##      6  5 16  27
##      7  2  5  14
## Sum 42 14 28  84

2.2 Using the vcd package

The vcd ( Visualizing Categorical Data ) package, inspired by Michael Friendly's book of the same name, has a convenient function to create Contingency Tables: structable(). This function produces a ‘flat’ representation of a high-dimensional contingency table constructed by recursive splits ( similar to the construction of mosaic charts/graphs ).

The arguments of structable are:

  • a formula ( $y + p \sim x + z$ ) which shows which variables are to be included as columns and rows respectively on a table
  • a data argument, which can indicate a data frame
vcd::structable(data = Arthritis, Treatment ~ Improved)
##          Treatment Placebo Treated
## Improved                          
## None                    29      13
## Some                     7       7
## Marked                   7      21
vcd::structable(data = Arthritis, Treatment ~ Improved + Sex)
##                 Treatment Placebo Treated
## Improved Sex                             
## None     Female                19       6
##          Male                  10       7
## Some     Female                 7       5
##          Male                   0       2
## Marked   Female                 6      16
##          Male                   1       5
# HairEyeColor is in multiple table form
# structable flattens these into one, as for a mosaic chart
HairEyeColor
## , , Sex = Male
## 
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    32   11    10     3
##   Brown    53   50    25    15
##   Red      10   10     7     7
##   Blond     3   30     5     8
## 
## , , Sex = Female
## 
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    36    9     5     2
##   Brown    66   34    29    14
##   Red      16    7     7     7
##   Blond     4   64     5     8
vcd::structable(HairEyeColor)
##              Eye Brown Blue Hazel Green
## Hair  Sex                              
## Black Male          32   11    10     3
##       Female        36    9     5     2
## Brown Male          53   50    25    15
##       Female        66   34    29    14
## Red   Male          10   10     7     7
##       Female        16    7     7     7
## Blond Male           3   30     5     8
##       Female         4   64     5     8
# UCBAdmissions is already in table form, i.e. a Contingency Table
# `structable` renders flat tables, of the kind that can be thought of as a "text representation" of the `vcd::mosaic` plot
UCBAdmissions
## , , Dept = A
## 
##           Gender
## Admit      Male Female
##   Admitted  512     89
##   Rejected  313     19
## 
## , , Dept = B
## 
##           Gender
## Admit      Male Female
##   Admitted  353     17
##   Rejected  207      8
## 
## , , Dept = C
## 
##           Gender
## Admit      Male Female
##   Admitted  120    202
##   Rejected  205    391
## 
## , , Dept = D
## 
##           Gender
## Admit      Male Female
##   Admitted  138    131
##   Rejected  279    244
## 
## , , Dept = E
## 
##           Gender
## Admit      Male Female
##   Admitted   53     94
##   Rejected  138    299
## 
## , , Dept = F
## 
##           Gender
## Admit      Male Female
##   Admitted   22     24
##   Rejected  351    317
vcd::structable(UCBAdmissions)
##               Gender Male Female
## Admit    Dept                   
## Admitted A            512     89
##          B            353     17
##          C            120    202
##          D            138    131
##          E             53     94
##          F             22     24
## Rejected A            313     19
##          B            207      8
##          C            205    391
##          D            279    244
##          E            138    299
##          F            351    317

2.3 Using the mosaic package

I think this is the simplest and most elegant way of obtaining Contingency Tables:

# One Way Table
tally( ~ substance, data = HELPrct)
## substance
## alcohol cocaine  heroin 
##     177     152     124
# Two Way Tables
tally( ~ substance + sex , data = HELPrct)
##          sex
## substance female male
##   alcohol     36  141
##   cocaine     41  111
##   heroin      30   94
tally( ~ substance | sex , data = HELPrct)
##          sex
## substance female male
##   alcohol     36  141
##   cocaine     41  111
##   heroin      30   94
tally( sex ~ substance, data = HELPrct)
##         substance
## sex      alcohol cocaine heroin
##   female      36      41     30
##   male       141     111     94
tally(~ sex |substance, data = HELPrct)
##         substance
## sex      alcohol cocaine heroin
##   female      36      41     30
##   male       141     111     94
# Adding Margins
tally( ~ substance + sex , data = HELPrct, format = 'count', margins = TRUE) # Ta Da!
##          sex
## substance female male Total
##   alcohol     36  141   177
##   cocaine     41  111   152
##   heroin      30   94   124
##   Total      107  346   453
tally( ~ substance + sex , data = HELPrct, format = 'percent', margins = TRUE)
##          sex
## substance     female       male      Total
##   alcohol   7.947020  31.125828  39.072848
##   cocaine   9.050773  24.503311  33.554084
##   heroin    6.622517  20.750552  27.373068
##   Total    23.620309  76.379691 100.000000

2.4 Using the tidyverse

diamonds %>% group_by(cut, clarity) %>% dplyr::summarise( count = n())
# We need to pivot this "wide" to obtain a Contingency Table

diamonds %>% 
  group_by(cut, clarity) %>% 
  dplyr::summarise( count = n()) %>% 
  pivot_wider(id_cols = cut, names_from = clarity, values_from = count) %>% 
  
  # Now add the row and column totals using the `janitor` package
  janitor::adorn_totals(where = c("row", "col")) %>%
  
  # Reconvert to tibble since janitor gives a "tabyl" format ( which can be useful )
  as_tibble()

Now that we have Contingency Tables, we can plot these:

3 Plotting Categorical Data

3.1 The titanic dataset

data("titanic")
titanic

3.1.1 titanic Bar Plots

# Use dplyr and ggplot

3.1.2 titanic Mosaic Plot

# Try the mosaic package and the ggmosaic package

3.1.3 titanic Balloon Plot

# use ggpubr

3.2 The hippocorpus dataset from Kaggle

This is a dataset from Kaggle and is based on Reference 2.

Hippocorpus is dataset of 6854 English diary like short stories about recalled and imagined events. Using a crowdsourcing framework the respective owners of this datasets collected recalled stories and summaries from workers, then provided these collected summaries to other workers who write imagined stories. Months later dataset creators collected a retold version of the recalled stories from the subset of recalled authors. Dataset contains author demographics (age, gender, race), their openness to experience, as well as some variables regarding the author’s relationship to the event (e.g., how personal the event is, how often they tell its story, etc.)

Apart from metadata pertaining to each respondent, there are 4 Likert Scale variables:

  • distracted: How distracted were you while writing your story? (5-point Likert)
  • draining: How taxing/draining was writing for you emotionally? (5-point Likert)
  • frequency: How often do you think about or talk about this event? (5-point Likert)
  • importance: How impactful, important, or personal is this story/this event to you? (5-point Likert)

Plot these using the package sjPlot. Can you also try a ggplot?

3.3 A dataset from the vcdExtra package

Pick one of the fairly large Categorical datasets that are built into vcdExtra: type data(package = "vcdExtra") in your Console.

Create:

  • A Contingency Table
  • A Bar Plot
  • A Mosaic Plot
  • A Balloon Plot

4 Conclusion

Write a few comments on the data and visualizations. Did they convey a story of sorts?

5 References

  1. A detailed analysis of the NHANES dataset, https://awagaman.people.amherst.edu/stat230/Stat230CodeCompilationExampleCodeUsingNHANES.pdf

  2. Maarten Sap, Eric Horvitz, Yejin Choi, Noah A. Smith, and James Pennebaker (2020) Recollection versus Imagination: Exploring Human Memory and Cognition via Neural Language Models. ACL.

---
title: "Categorical Data in R"
author: "Arvind Venkatadri"
date: 2023-01-16
lastmod: "`r Sys.Date()`"
output:
  rmdformats::readthedown:
    highlight: tango
    toc_float: TRUE
    toc_depth: 3
    df_print: paged
    number_sections: TRUE
    code_folding: show
    code_download: TRUE
editor_options: 
  markdown: 
    wrap: 72
---

# Introduction

Let us play with some categorical ( or predominantly categorical )
datasets in R and see how we can analyze and plot them.

First we will learn how to make Contingency Tables from data in any of its three forms: case form, frequency form, and table form. This gives us a common form of Table to use for plotting.

Then we will use `vcd`, `mosaic`, `ggplot2` and `ggpubr` to make several plots for Categorical Datasets.


```{r setup, include=TRUE}
knitr::opts_chunk$set(message = FALSE, echo = TRUE)
library(tidyverse)
library(vcd) # Contingency Tables and Plots
library(vcdExtra) # Datasets
library(sjPlot) # Likert Plots
library(mosaic) # Data Analysis and Plots
library(ggmosaic) # Mosaic Plots
library(ggpubr) # Balloon Plots

#install.packages("openintro")
library(openintro) # More Datasets

```


# Creating Contingency Tables {.tabset .tabset-pills}

Most plots for Categorical Data ( as we shall see ) require that the data be converted into a *Contingency Table*; even Statistical tests for Proportions ( the $\chi^2$ test ) need Contingency Tables. The *Frequency Table* we encountered earlier is very close to being a full-fledged Contingency Table; it only needs the row and column margin counts added.

In this section we learn how to make Contingency Tables from data in each of the three forms.

## Using base R

```{r}
Arthritis

#One Way Table ( one variable )
table(Arthritis$Treatment)
table(Arthritis$Treatment) %>% prop.table()
table(Arthritis$Treatment) %>% addmargins()
```


```{r}

# Two-Way Table ( two variables )
table(Arthritis$Treatment, Arthritis$Improved) %>% prop.table() 
table(Arthritis$Treatment, Arthritis$Improved) %>% addmargins() # Contingency Table!!

```
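
Since the $\chi^2$ test was mentioned above, here is a minimal sketch of how a two-way table feeds into it. To keep the chunk self-contained, the counts are typed in by hand from the `Treatment` by `Improved` table that the chunk above produces ( without the margins ); the names `arthritis_tab` and `chi_result` are just illustrative. In practice you would pass the `table()` object directly.

```{r}
# Counts copied by hand from the Treatment x Improved table (margins left out)
arthritis_tab <- matrix(
  c(29, 7,  7,
    13, 7, 21),
  nrow = 2, byrow = TRUE,
  dimnames = list(Treatment = c("Placebo", "Treated"),
                  Improved  = c("None", "Some", "Marked"))
)

# Pearson chi-squared test of independence between Treatment and Improved
chi_result <- chisq.test(arthritis_tab)
chi_result

# Expected counts under independence, for comparison with the observed table
chi_result$expected
```

Note that the test takes the plain counts: a table that already has margins added by `addmargins()` should not be passed to `chisq.test()`.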

We can use **table()** and **xtabs()** to generate multi-dimensional tables too ( more than 2D ). These are printed as a series of 2D tables, one for each level of the "third" variable.

We can also use **ftable()** to print these in a more compact, flat layout.

```{r}

mytable <- table(Arthritis$Treatment, Arthritis$Sex, Arthritis$Improved)
mytable
ftable(mytable)
ftable(mytable) %>% addmargins()

```
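
The chunk above uses only `table()`. As a rough sketch of the `xtabs()` route, the chunk below starts from data in frequency form ( one row per combination of levels, with a `Freq` column ), obtained here by converting the built-in `UCBAdmissions` table with `as.data.frame()`. The object names `ucb_freq` and `gender_by_admit` are illustrative only.

```{r}
# Frequency form: one row per combination of factor levels, plus a Freq column
ucb_freq <- as.data.frame(UCBAdmissions)
head(ucb_freq)

# xtabs() sums Freq over any variable left out of the formula (here, Dept)
gender_by_admit <- xtabs(Freq ~ Gender + Admit, data = ucb_freq)
addmargins(gender_by_admit)

# All three variables give a 3D table, which ftable() prints flat
ftable(xtabs(Freq ~ Gender + Admit + Dept, data = ucb_freq))
```

The left-hand side of the formula names the count column; with case-form data ( one row per observation ) the left-hand side is simply left empty, as in `xtabs(~ Treatment + Improved, data = Arthritis)`.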


## Using the `vcd` package

The `vcd` ( Visualizing Categorical Data ) package, inspired by Michael Friendly's book of the same name, has a convenient function to create Contingency Tables: `structable()`. This function produces a ‘flat’ representation of a high-dimensional contingency table constructed by recursive splits ( similar to the construction of mosaic charts/graphs ).

The arguments of `structable` are:  

- a formula ( $y + p \sim x + z$ ) which shows which variables are to be included as *columns* and *rows* respectively on a table  
- a `data` argument, which can indicate a `data frame`


```{r}
vcd::structable(data = Arthritis, Treatment ~ Improved)
vcd::structable(data = Arthritis, Treatment ~ Improved + Sex)


# HairEyeColor is in multiple table form
# structable flattens these into one, as for a mosaic chart
HairEyeColor
vcd::structable(HairEyeColor)

# UCBAdmissions is already in table form, i.e. a Contingency Table
# `structable` renders flat tables, of the kind that can be thought of as a "text representation" of the `vcd::mosaic` plot
UCBAdmissions
vcd::structable(UCBAdmissions)

```


## Using the `mosaic` package

I think this is the simplest and most elegant way of obtaining Contingency Tables:

```{r}
# One Way Table
tally( ~ substance, data = HELPrct)

# Two Way Tables
tally( ~ substance + sex , data = HELPrct)
tally( ~ substance | sex , data = HELPrct)

tally( sex ~ substance, data = HELPrct)
tally(~ sex |substance, data = HELPrct)

```


```{r,highlight=TRUE}

# Adding Margins
tally( ~ substance + sex , data = HELPrct, format = 'count', margins = TRUE) # Ta Da!
tally( ~ substance + sex , data = HELPrct, format = 'percent', margins = TRUE)

```

## Using the `tidyverse`


```{r}
diamonds %>% group_by(cut, clarity) %>% dplyr::summarise( count = n())

# We need to pivot this "wide" to obtain a Contingency Table

diamonds %>% 
  group_by(cut, clarity) %>% 
  dplyr::summarise( count = n()) %>% 
  pivot_wider(id_cols = cut, names_from = clarity, values_from = count) %>% 
  
  # Now add the row and column totals using the `janitor` package
  janitor::adorn_totals(where = c("row", "col")) %>%
  
  # Reconvert to tibble since janitor gives a "tabyl" format ( which can be useful )
  as_tibble()

```



Now that we have Contingency Tables, we can plot these:

# Plotting Categorical Data

## The `titanic` dataset

```{r}
data("titanic")
titanic

```


### `titanic` Bar Plots

```{r titanic-bar-plot}
# Use dplyr and ggplot




```
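
If you need a starting point, here is a minimal sketch of one possible bar plot. It is an assumption-laden stand-in: it uses base R's built-in `Titanic` table ( converted to frequency form ) rather than the `titanic` data frame loaded above, so the column names may differ from yours. `titanic_counts` and `p_bar` are illustrative names.

```{r}
library(ggplot2)

# Assumption: base R's `Titanic` table stands in for the `titanic` data frame
titanic_counts <- as.data.frame(Titanic)  # columns: Class, Sex, Age, Survived, Freq

# Stacked bars rescaled to proportions: share of survivors within each class
p_bar <- ggplot(titanic_counts, aes(x = Class, y = Freq, fill = Survived)) +
  geom_col(position = "fill") +
  labs(y = "Proportion", title = "Survival by passenger class")
p_bar
```

With case-form data ( one row per passenger ) you would use `geom_bar()` instead of `geom_col()` and drop the `y = Freq` mapping, since `geom_bar()` counts the rows itself.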




### `titanic` Mosaic Plot

```{r titanic-mosaic-plot}
# Try the mosaic package and the ggmosaic package




```
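
One way to begin, sketched here with base R's built-in `Titanic` table instead of the `titanic` data frame ( an assumption made to keep the chunk self-contained ), is to collapse the table to two variables and hand it to `vcd::mosaic()`. The name `class_by_survival` is illustrative.

```{r}
# Assumption: base R's `Titanic` table, collapsed to Class x Survived
class_by_survival <- margin.table(Titanic, margin = c(1, 4))
class_by_survival

# Namespace the call: `ggmosaic` masks `mosaic()` from `vcd` once both are loaded
vcd::mosaic(class_by_survival, shade = TRUE)
```

The tile areas are proportional to the cell counts, and `shade = TRUE` colours each tile by its residual from the independence model.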

### `titanic` Balloon Plot

```{r titanic-balloon-plot}
# use ggpubr




```
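
A possible sketch with `ggpubr::ggballoonplot()` follows. Again this uses base R's `Titanic` table as a stand-in for the `titanic` data frame ( an assumption ), collapsed to two variables and converted to frequency form so that each balloon corresponds to exactly one row. `class_survival_df` and `p_balloon` are illustrative names.

```{r}
library(ggpubr)

# Assumption: base R's `Titanic` table, collapsed to Class x Survived,
# then converted to a data frame with columns Class, Survived, Freq
class_survival_df <- as.data.frame(margin.table(Titanic, margin = c(1, 4)))

# One balloon per cell; both size and fill colour encode the count
p_balloon <- ggballoonplot(class_survival_df,
                           x = "Class", y = "Survived",
                           size = "Freq", fill = "Freq")
p_balloon
```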


## The `hippocorpus` dataset from Kaggle

```{r, echo=FALSE,message=FALSE}
library(downloadthis)
hippo <- read.csv("data/hippoCorpusV2.csv")
download_this(hippo,
    #path = "data/hippoCorpusV2.csv",
    output_name = "hippocorpus",
    output_extension = ".csv",
    button_label = "Download data as csv",
    button_type = "info",
    has_icon = TRUE,
    icon = "fa fa-save"
  )

```

This is a dataset from
[Kaggle](https://www.kaggle.com/datasets/saurabhshahane/hippocorpus?select=hippoCorpusV2.csv)
and is based on Reference 2.

> Hippocorpus is dataset of 6854 English diary like short stories about
> recalled and imagined events. Using a crowdsourcing framework the
> respective owners of this datasets collected recalled stories and
> summaries from workers, then provided these collected summaries to
> other workers who write imagined stories. Months later dataset
> creators collected a retold version of the recalled stories from the
> subset of recalled authors. Dataset contains author demographics (age,
> gender, race), their openness to experience, as well as some variables
> regarding the author's relationship to the event (e.g., how personal
> the event is, how often they tell its story, etc.)

Apart from metadata pertaining to each respondent, there are 4 *Likert
Scale* variables:

-   `distracted`: How distracted were you while writing your story?
    (5-point Likert)
-   `draining`: How taxing/draining was writing for you emotionally?
    (5-point Likert)
-   `frequency`: How often do you think about or talk about this event?
    (5-point Likert)
-   `importance`: How impactful, important, or personal is this
    story/this event to you? (5-point Likert)

Plot these using the package `sjPlot`. Can you also try a `ggplot`?

```{r hippocorpus-likert}


```


## A dataset from the `vcdExtra` package

Pick one of the fairly large Categorical datasets that are built into `vcdExtra`: type `data(package = "vcdExtra")` in your Console.

Create:

- A Contingency Table
- A Bar Plot
- A Mosaic Plot
- A Balloon Plot


# Conclusion

Write a few comments on the data and visualizations. Did they convey a story of sorts?


# References

1.  A detailed analysis of the NHANES dataset,
    <https://awagaman.people.amherst.edu/stat230/Stat230CodeCompilationExampleCodeUsingNHANES.pdf>

2.  Maarten Sap, Eric Horvitz, Yejin Choi, Noah A. Smith, and James
    Pennebaker (2020) *Recollection versus Imagination: Exploring Human
    Memory and Cognition via Neural Language Models.* ACL.
