Introduction
This RMarkdown document is part of the Generic Skills Component (GSK)
of the Course of the Foundation Studies Programme at Srishti Manipal
Institute of Art, Design, and Technology, Bangalore, India. The material
is based on A Layered Grammar of Graphics by Hadley Wickham.
The course is meant for First Year students pursuing a Degree in Art and
Design.
The intent of this GSK part is to build skill in coding in R, and
also to appreciate R as a way to metaphorically visualize information of
various kinds, using predominantly geometric figures and structures.
All RMarkdown files combine code, text, web-images, and figures
developed using code. Everything is text; code chunks are enclosed in
fences (```)
Goals
- Understand different kinds of data variables
- Appreciate how they can be identified based on the Interrogative
Pronouns they answer to
- Understand how each kind of variable lends itself to a specific
geometric aspect in the data visualization.
- Understand how to ask Questions of Data to develop Visualizations
Pedagogical Note
The method followed will be based on PRIMM:
- PREDICT: Inspect the code and guess at what the code might do; write
your predictions.
- RUN the code provided and check what happens.
- INFER what the parameters of the code do and write comments to
explain. What bells and whistles can you see?
- MODIFY the parameters of the code provided to understand the options
available. Write comments to show what you have aimed for and achieved.
- MAKE: take an idea/concept of your own, and graph it.
Set Up
The setup code chunk below brings into our coding session R packages
that provide specific computational abilities and also datasets which we
can use.
To reiterate: Packages and datasets are not the same thing!! Packages
are (small) collections of programs. Datasets are just... information.
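To see the difference for yourself, here is a minimal sketch (assuming the tidyverse and palmerpenguins packages are installed on your machine). It simply asks R what kind of thing each name refers to: `count` is a program supplied by a package, while `penguins` is a table of information.

```r
library(tidyverse)        # a package (really, a family of packages) of programs
library(palmerpenguins)   # a package that also ships a dataset

# count() is a function, i.e. a program we can run on data
is.function(count)        # TRUE

# penguins is a dataset, i.e. a table of information
is.data.frame(penguins)   # TRUE
class(penguins)           # shows what kind of table it is
```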
Packages needed
knitr::opts_chunk$set(echo = TRUE,warning = TRUE)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(palmerpenguins)
Introduction
In this RMarkdown document, we try to connect story-making
questions with two ideas:
- a Variable in a dataset
- A computed Quantity / Descriptive Statistic or a
Visual, based on one or more Variables
So: a question identifies a variable, and a question also leads to a
Computation or a Data Visualization. The idea is to
build intuition about the data by iteratively asking questions,
forming hypotheses, and performing Exploratory Data Analysis (EDA)
using graphs and charts in R.
At some point we may find that the data is not adequate to
prove/disprove a particular hypothesis and need to get into further
research / experimental design. It is possible to design the research
experiments also in R, but we may cover that much later.
In the following, wherever you see YOUR TURN, please respond with
explanations, more questions and, if you are already confident, code
chunks to create new calculations and graphs. This will be one of your
submissions for this module, on Teams!
Interrogative Pronouns for Data Variables
So how do we ask questions? Usually with interrogative
pronouns in English: What? Who? Where? Which? What Kind? How? and
so on.
The penguins dataset
names(penguins) # Column, i.e. Variable names
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year"
head(penguins) # first six rows
tail(penguins) # Last six rows
dim(penguins) # Size of dataset
## [1] 344 8
# Check for missing data
any(is.na(penguins)) # TRUE if at least one entry is missing
## [1] TRUE
- What are the variable names()?
- What would be the Question you might have asked to obtain each of
the variables?
- What further questions/meta questions would you ask to “process”
that variable? (Hint: Add another word after any of the Interrogative
Pronouns, e.g. How…MANY?)
- Where might the answers take your story?
YOUR TURN-1
State a few questions after discussion with your friend and state
possible variables, or what you could DO with the variables, as an
answer.
E.g. Q. How many penguins? A. We need to count…rows?
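The example question above can be sketched in code. This is only one possible way to answer it, assuming the palmerpenguins package is loaded; your own questions may need different commands.

```r
library(palmerpenguins)

# "How many penguins?" -> each row describes one penguin, so count the rows
nrow(penguins)

# "How many of each species?" -> tally the rows for each value of species
table(penguins$species)
```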
Pronouns and Variables
In the Table below, we have a rough mapping of interrogative pronouns
to the kinds of variables in the data:
| Pronoun | Answer | Kind of variable | Examples | |
|---|---|---|---|---|
| What, Who, Where, Whom, Which | Name, Place, Animal, Thing | Qualitative / Nominal | Name | |
| How, What Kind, What Sort | A Manner / Method, Type or Attribute from a list, with list items in some “order” (e.g. good, better, improved, best..) | Qualitative / Ordinal | Socioeconomic status (“low-income”, “middle-income”, “high-income”); Education level (“highschool”, “BS”, “MS”, “PhD”); Income level (“less than 50K”, “50K-100K”, “over 100K”); Satisfaction rating (“extremely dislike”, “dislike”, “neutral”, “like”, “extremely like”) | |
| How Many / Much / Heavy? Few? Seldom? Often? When? | Quantities with Scale. Differences are meaningful, but not products or ratios | Quantitative / Interval | pH; SAT score (200-800); Credit score (300-850); Year of Starting in College | Deviation |
| How Many / Much / Heavy? Few? Seldom? Often? When? | Quantities, with Scale and a Zero Value. Differences and Ratios / Products are meaningful (e.g. Weight) | Quantitative / Ratio | Weight, Length, Height; Temperature in Kelvin; activity, dose amount, reaction rate, flow rate, concentration | Variation |
As you go from Qualitative to Quantitative data types in the table, I
hope you can detect a movement from fuzzy groups/categories to more and
more crystallized numbers. Each variable/scale can be subjected to the
operations of the previous group. In the words of S.S.
Stevens (https://psychology.okstate.edu/faculty/jgrice/psyc3214/Stevens_FourScales_1946.pdf)
the basic operations needed to create each type of scale is
cumulative: to an operation listed opposite a particular scale must be
added all those operations preceding it.
Do think about this as you work with data.
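To make the four scales a little more concrete, here is a small sketch in base R. All the values are made up for illustration (they are not taken from any dataset), and the variable names are hypothetical. It shows that nominal data can only be counted, ordinal data can also be compared, and that ratios only make sense when the scale has a true zero.

```r
# Nominal: labels only; we can count cases, but there is no order
island <- factor(c("Biscoe", "Dream", "Torgersen", "Dream"))
table(island)

# Ordinal: labels with an order, so comparisons and min/max make sense
education <- factor(c("BS", "PhD", "highschool", "MS"),
                    levels = c("highschool", "BS", "MS", "PhD"),
                    ordered = TRUE)
education > "BS"    # which entries rank above "BS"?
max(education)      # the highest level present

# Interval vs Ratio: the same two temperatures on two scales
celsius <- c(10, 20)          # interval scale: zero is arbitrary
kelvin  <- celsius + 273.15   # ratio scale: zero is a true zero

diff(celsius)                 # differences agree on both scales
diff(kelvin)
celsius[2] / celsius[1]       # 2, yet 20 C is not "twice as hot" as 10 C
kelvin[2] / kelvin[1]         # about 1.035
```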
Do take a look at these references:
- https://stats.idre.ucla.edu/other/mult-pkg/whatstat/what-is-the-difference-between-categorical-ordinal-and-interval-variables/
- https://www.freecodecamp.org/news/types-of-data-in-statistics-nominal-ordinal-interval-and-ratio-data-types-explained-with-examples/
The mpg dataset
names(mpg) # Column, i.e. Variable names
## [1] "manufacturer" "model" "displ" "year" "cyl"
## [6] "trans" "drv" "cty" "hwy" "fl"
## [11] "class"
head(mpg) # first six rows
tail(mpg) # Last six rows
dim(mpg) # Size of dataset
## [1] 234 11
# Check for missing data
any(is.na(mpg)) # TRUE if at least one entry is missing
## [1] FALSE
YOUR TURN-2
Look carefully at the variables here. How would you interpret, say, the
cyl variable? Is it a number and therefore Quantitative, or
could it be something else?
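One way to start investigating (a sketch, not the answer) is to look at how R stores the variable and which values it actually takes. The name `cyl_as_category` is hypothetical, used here only to show one possible reading of the variable.

```r
library(ggplot2)   # provides the mpg dataset

class(mpg$cyl)          # how R stores the variable
sort(unique(mpg$cyl))   # the distinct values it takes
table(mpg$cyl)          # how many cars have each value

# One possible reading: treat cyl as a category by converting it to a factor
cyl_as_category <- factor(mpg$cyl)
levels(cyl_as_category)
```

Does a variable with only a handful of distinct values behave more like a measurement or more like a label? That is for you to argue.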
Interrogations and Graphs
We can also respond to (more complex) Questions with not just a
variable but with one of two things:
- A calculation, shown in a table
- A data visualization. This visualization can even involve
more than one variable, as we will see.
What sort of calculations and visual charts can we create with
different kinds of variables, taken singly or together? Let us write
some simple English descriptions of measures and visuals and see what
commands they use in R.
Here we will use the Grammar of a package called ggplot2,
which we will encounter in Lab:04. Let us go with our intuition with the
code in the following sections.
Note: since we saw a couple of missing entries in the penguins
dataset, let us remove them for now.
penguins <- penguins %>% drop_na()
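Before moving on, it can help to check what `drop_na()` actually removes. The sketch below works on a fresh copy of the data taken straight from the package; `original` and `penguins_complete` are hypothetical names used only for this illustration.

```r
library(tidyverse)
library(palmerpenguins)

# Take the untouched dataset directly from the package
original <- palmerpenguins::penguins

# Missing entries in each variable
colSums(is.na(original))

# drop_na() keeps only the rows that have no missing entries at all
penguins_complete <- original %>% drop_na()

nrow(original)            # rows before
nrow(penguins_complete)   # rows after
```

Note that a whole row is dropped even if only one of its variables is missing, so it is worth knowing how many penguins we lose.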
Single Qualitative/Categorical/Nominal Variable
- Questions: Which? What Kind? How? How many of each Kind?
  - Island (Which island?)
  - Species (Which species?)
- Calculations: No. of levels / Counts for each level
  - count / tally of no. of penguins on each island or in each species
  - sort and order by island or species
- Charts: Bar Chart / Pie Chart / Tree Map
  - geom_bar / geom_bar + coord_polar() / Find out!!
penguins %>% count(species)
ggplot(penguins) + geom_bar(aes(x = island))
ggplot(penguins) + geom_bar(aes(x = sex))
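The calculations and the pie chart mentioned in the list above can be sketched as follows. This is one possible approach, assuming the tidyverse and palmerpenguins packages are loaded; the names `species_counts` and `pie` are hypothetical.

```r
library(tidyverse)
library(palmerpenguins)

# Counts and proportions for each level of species
species_counts <- penguins %>%
  count(species, sort = TRUE) %>%     # sort = TRUE puts the largest count first
  mutate(proportion = n / sum(n))
species_counts

# A pie chart is a single stacked bar drawn in polar coordinates
pie <- ggplot(penguins) +
  geom_bar(aes(x = "", fill = species), width = 1) +
  coord_polar(theta = "y")
pie
```

Try removing the `coord_polar()` line to see the stacked bar that the pie chart is made from.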
YOUR TURN-3
Single Quantitative Variable
- Questions: How many? How few? How often? How much?
- Calculations: max / min / mean / mode / (units)
  - max(), min(), range(), mean(), summary()
  - Note: base R's mode() reports how a value is stored (e.g.
    "numeric"), not its most frequent value
- Charts: Bar Chart / Histogram / Density
  - geom_histogram() / geom_density()
max(penguins$bill_length_mm)
## [1] 59.6
range(penguins$bill_length_mm, na.rm = TRUE)
## [1] 32.1 59.6
summary(penguins$flipper_length_mm)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 172 190 197 201 213 231
ggplot(penguins) + geom_density(aes(bill_length_mm))
ggplot(penguins) + geom_histogram(aes(x = bill_length_mm))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
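Two small follow-ups, sketched under stated assumptions. First, base R has no built-in function for the most frequent value, so `stat_mode()` below is a hypothetical helper written just for this illustration. Second, the message above asks us to pick a `binwidth`; the value of 2 (mm) used here is an arbitrary choice, so do try others.

```r
library(tidyverse)
library(palmerpenguins)

# Hypothetical helper: the most frequent value (the statistical mode)
stat_mode <- function(x) {
  x <- x[!is.na(x)]                  # ignore missing entries
  counts <- table(x)                 # tally each distinct value
  names(counts)[which.max(counts)]   # returned as text; ties go to the first value
}

mode(penguins$flipper_length_mm)        # storage mode: "numeric"
stat_mode(penguins$flipper_length_mm)   # most frequent flipper length

# Choosing binwidth ourselves removes the stat_bin() message
hist_plot <- ggplot(penguins) +
  geom_histogram(aes(x = bill_length_mm), binwidth = 2, na.rm = TRUE)
hist_plot
```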
YOUR TURN-4
Are all the above Quantitative variables ratio variables?
Justify.
Two Variables: Quantitative vs Quantitative
We can easily extend our intuition about one quantitative variable
to a pair of them. What Questions can we ask?
- Questions: How many of this vs How many of that? Does this depend
upon that? How are they related? (Remember \(y = mx + c\) and friends?)
- Calculations: Correlation / Covariance / t-test for Two Means /
Chi-Square Test etc. We won’t go into this here!
- Charts: Scatter Plot / Line Plot / Regression, i.e. best-fit lines
cor(penguins$bill_length_mm, penguins$bill_depth_mm)
## [1] -0.2286256
ggplot(penguins) +
geom_point(aes(x = flipper_length_mm,
y = body_mass_g))
ggplot(penguins) +
geom_point(aes(x = flipper_length_mm,
y = bill_length_mm))
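The "best fit line" idea can be sketched like this. It is a minimal illustration, assuming we choose body mass as y and flipper length as x; `fit` and `scatter` are hypothetical names, and a straight line is only one of many possible models.

```r
library(tidyverse)
library(palmerpenguins)

# Fit y = m * x + c, with body mass as y and flipper length as x
fit <- lm(body_mass_g ~ flipper_length_mm, data = penguins)
coef(fit)   # "(Intercept)" is c, "flipper_length_mm" is m

# Scatter plot with the best-fit line drawn on top
scatter <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(na.rm = TRUE) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, na.rm = TRUE)
scatter
```

What does the sign of m tell you about the relationship between the two variables?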
YOUR TURN-5
Two Variables: Categorical vs Categorical
What sort of question could we ask that involves two
categorical variables?
- Questions: How Many of this Kind (~x) are How Many of that Kind (~y)?
- Calculations: Counts and Tallies sliced by Category
- Charts: Stacked Bar Charts / Grouped Bar Charts / Segmented Bar
Chart / Mosaic Chart
  - geom_bar()
  - Use the second Categorical variable to modify fill, color.
  - Also try to vary the parameter position of the bars.
ggplot(penguins) + geom_bar(aes(x = island,
fill = species),
position = "stack")
Storyline: Three penguins. And you too, three (Oh never mind!)
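The "Counts and Tallies sliced by Category" and the `position` parameter can be sketched as below. This is one possible approach; `island_species` and `dodged` are hypothetical names.

```r
library(tidyverse)
library(palmerpenguins)

# Tally of penguins for each island/species combination
island_species <- penguins %>% count(island, species)
island_species

# position = "dodge" places the bars side by side;
# position = "fill" would show proportions within each island instead
dodged <- ggplot(penguins) +
  geom_bar(aes(x = island, fill = species), position = "dodge")
dodged
```

Notice which island/species combinations do not appear in the tally at all. That absence is part of the story too.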
YOUR TURN-6
Two Variables: Quantitative vs Qualitative
Finally, what if we want to look at Quant variables and Qual
variables together? What questions could we ask?
- Questions: How much of this is Which Kind of that? How many vs
Which? How many vs How?
- Calculations: Counts, Means, Ranges etc., grouped
by Categorical variable.
- Charts: Bar Chart using group / density plots by group / violin
plots by group / box plots by group
  - geom_bar / geom_density / geom_violin / geom_boxplot, using a
    Categorical Variable for grouping
ggplot(penguins) +
geom_density(aes(x = body_mass_g,
color = island,
fill = island),
alpha = 0.3)
ggplot(penguins) +
geom_histogram(aes(x = flipper_length_mm,
fill = sex))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
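The grouped calculations and the box plot mentioned above can be sketched as follows. This is a minimal illustration, assuming we group by species; `mass_by_species` and `boxes` are hypothetical names, and you could just as well group by island or sex.

```r
library(tidyverse)
library(palmerpenguins)

# Count, mean and range of body mass, grouped by species
mass_by_species <- penguins %>%
  group_by(species) %>%
  summarise(
    n         = n(),
    mean_mass = mean(body_mass_g, na.rm = TRUE),
    min_mass  = min(body_mass_g, na.rm = TRUE),
    max_mass  = max(body_mass_g, na.rm = TRUE)
  )
mass_by_species

# Box plots of the same variable, one box per species
boxes <- ggplot(penguins) +
  geom_boxplot(aes(x = species, y = body_mass_g), na.rm = TRUE)
boxes
```

Swap `geom_boxplot` for `geom_violin` and compare what each chart lets you see.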
YOUR TURN-7
Time to Play
- Create a fresh RMarkdown document and similarly analyse two of the
following datasets
