Abstract
Part of the R for Artists and Designers course at the School of Foundation Studies, Srishti Manipal Institute of Art, Design, and Technology, Bangalore.
This RMarkdown document is part of the Generic Skills Component (GSK) of the Course of the Foundation Studies Programme at Srishti Manipal Institute of Art, Design, and Technology, Bangalore, India. The material is based on A Layered Grammar of Graphics by Hadley Wickham. The course is meant for First Year students pursuing a Degree in Art and Design.
The intent of this GSK part is to build Skill in coding in R, and also appreciate R as a way to metaphorically visualize information of various kinds, using predominantly geometric figures and structures.
All RMarkdown files combine code, text, web-images, and figures developed using code. Everything is text; code chunks are enclosed in fences (```)
At the end of this Lab session, we should:
- know the types and structures of tidy data and be able to work with them
- be able to create data visualizations using `ggplot`
- understand aesthetics and scales in `ggplot`
The method followed will be based on PRIMM (Predict, Run, Investigate, Modify, Make):
- Investigate what the parameters of the code do and write comments to explain. What bells and whistles can you see?
- Modify the parameters of the code provided to understand the options available. Write comments to show what you have aimed for and achieved.
The setup code chunk below brings into our coding session R packages that provide specific computational abilities and also datasets which we can use.
To reiterate: packages and datasets are not the same thing!! Packages are (small) collections of programs. Datasets are just... information.
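The setup chunk itself is not reproduced in this text. A minimal sketch of what it might contain follows, assuming the session needs only ggplot2, dplyr and palmerpenguins; the exact package list is an assumption, not the original chunk.

```r
# Assumed setup: the original chunk may load other packages (e.g. tidyverse)
library(ggplot2)         # plotting functions: ggplot(), geom_*(), scale_*()
library(dplyr)           # data verbs such as count(), and the %>% pipe
library(palmerpenguins)  # a dataset package: provides `penguins`

# A package supplies programs, a dataset supplies information:
is.function(ggplot)      # ggplot is a program we can call
is.data.frame(penguins)  # penguins is a table of measurements
```

Both checks should report `TRUE` once the packages are installed and loaded.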
“Tidy Data” is an important way of thinking about what data typically look like in R.
The three features described in the figure above define the nature of tidy data:
Data are imagined to be resulting from an experiment. Each variable represents a parameter/aspect in the experiment. Each row represents an additional datum of measurement. A cell is a single measurement on a single parameter(column) in a single observation(row).
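As a rough illustration of the idea, the sketch below reshapes a small untidy table into tidy form with `tidyr::pivot_longer()`. The numbers and the names `untidy` and `tidy` are made up for this example and are not taken from the penguins data.

```r
library(tidyr)

# Made-up numbers, for illustration only (not taken from the penguins dataset)
untidy <- data.frame(
  species   = c("Adelie", "Gentoo"),
  mass_2007 = c(3700, 5000),
  mass_2008 = c(3750, 5050)
)

# Tidy form: one column per variable (species, year, body_mass_g),
# one row per observation, one value per cell
tidy <- pivot_longer(
  untidy,
  cols         = c(mass_2007, mass_2008),
  names_to     = "year",
  names_prefix = "mass_",
  values_to    = "body_mass_g"
)
tidy
```

In the untidy table the year is hidden inside the column names; in the tidy one it becomes a variable of its own that can be mapped to a plot.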
Kinds of Variable are defined by the kind of questions they answer to:
Creating graphs from data is an act of asking questions and viewing answers in a geometric way. Let us write some simple English descriptions of measures and visuals and see what commands they use in R.
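Layers are used to create the objects on a plot. They are defined by five basic parts: data, aesthetic mapping, geometric object (geom), statistical transformation (stat), and position adjustment.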
We will use “real world” data. Let’s use the `penguins` dataset in the `palmerpenguins` package. Run `?penguins` in the console to get more information about this dataset.
head(penguins)
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex     year
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int> <fct>  <int>
## 1 Adelie  Torgersen           39.1          18.7               181        3750 male    2007
## 2 Adelie  Torgersen           39.5          17.4               186        3800 female  2007
## 3 Adelie  Torgersen           40.3          18                 195        3250 female  2007
## 4 Adelie  Torgersen           NA            NA                  NA          NA <NA>    2007
## 5 Adelie  Torgersen           36.7          19.3               193        3450 female  2007
## 6 Adelie  Torgersen           39.3          20.6               190        3650 male    2007
tail(penguins)
## # A tibble: 6 × 8
##   species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex     year
##   <fct>     <fct>           <dbl>         <dbl>             <int>       <int> <fct>  <int>
## 1 Chinstrap Dream            45.7          17                 195        3650 female  2009
## 2 Chinstrap Dream            55.8          19.8               207        4000 male    2009
## 3 Chinstrap Dream            43.5          18.1               202        3400 female  2009
## 4 Chinstrap Dream            49.6          18.2               193        3775 male    2009
## 5 Chinstrap Dream            50.8          19                 210        4100 male    2009
## 6 Chinstrap Dream            50.2          18.7               198        3775 female  2009
dim(penguins)
## [1] 344 8
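Beyond `head()`, `tail()` and `dim()`, it can help to check what kind of variable each column holds and where values are missing. The sketch below uses only base R functions; the names `column_types` and `missing_per_column` are introduced here for illustration.

```r
library(palmerpenguins)

# Type of each column: factors are categorical, numeric/integer are quantitative
column_types <- sapply(penguins, class)
column_types

# Number of missing values (NA) in each column
missing_per_column <- colSums(is.na(penguins))
missing_per_column
```

The missing body mass measurements counted here are the rows that the plotting warnings in this document report as removed.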
So we know what our data looks like. We pass this data to `ggplot` to plot as follows. In R this creates an empty graph sheet, because we have not (yet) declared the geometric shapes we want to use to plot our information.
ggplot(data = penguins) # Creates an empty graphsheet, ready for plotting!!
Now that we have told R what data to use, we need to state what variables to plot and how.
Aesthetic Mapping defines how the variables are applied to the plot, i.e. we take a variable from the data and “metaphorize” it into a geometric feature. We can map variables metaphorically to a variety of geometric things: coordinate, length, height, size, shape, colour, alpha (how transparent?)….
The syntax uses:
aes(some_geometric_thing = some_variable)
Remember variable = column.
So if we were graphing information from `penguins`, we might map a penguin’s `flipper_length_mm` column to the \(x\) position, and the `body_mass_g` column to the \(y\) position.
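The mapping just described could be written as below. This is a minimal sketch; the object name `p_scatter` is only for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Map flipper length to the x position and body mass to the y position
p_scatter <- ggplot(data = penguins) +
  geom_point(aes(x = flipper_length_mm, y = body_mass_g))
p_scatter
```

Each penguin (row) becomes one point, and the two columns named inside `aes()` decide where that point sits.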
We can try another example of aesthetic mapping with the same dataset:
ggplot(data = penguins)
ggplot(penguins) +
  # Plot geom = histogram. So we need a quantity on the x
  geom_histogram(
    aes(x = body_mass_g))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (stat_bin).
ggplot(penguins) +
  # Plot geom = histogram. So we need a quantity on the x
  geom_histogram(
    aes(x = body_mass_g,
        fill = island) # fill aesthetic = another variable
  )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing non-finite values (stat_bin).
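The message above suggests choosing a `binwidth` instead of relying on the default 30 bins. A sketch of one way to do that follows; the value 200 (grams) is an arbitrary choice for illustration, not a recommendation from the course.

```r
library(ggplot2)
library(palmerpenguins)

# binwidth is in the units of the x variable (grams here); 200 is an arbitrary choice
p_hist <- ggplot(penguins) +
  geom_histogram(aes(x = body_mass_g, fill = island), binwidth = 200)
p_hist
```

Trying a few different values is a quick way to see how much the shape of a histogram depends on the bins.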
We can try another example of aesthetic mapping with the same dataset:
ggplot(data = penguins)
ggplot(penguins) +
  # Plot geom = boxplot. So we need a quantity on the x
  geom_boxplot(
    aes(x = body_mass_g))
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
ggplot(penguins) +
  # Plot geom = boxplot. So we need a quantity on the x
  geom_boxplot(
    aes(x = body_mass_g,
        fill = island) # fill aesthetic = another variable
  )
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
We can try another example of aesthetic mapping with the same dataset:
ggplot(data = penguins)
ggplot(penguins) +
  # Plot geom = density. So we need a quantity on the x
  geom_density(
    aes(x = body_mass_g))
## Warning: Removed 2 rows containing non-finite values (stat_density).
ggplot(penguins) +
  # Plot geom = density. So we need a quantity on the x
  geom_density(
    aes(x = body_mass_g,
        fill = island) # fill aesthetic = another variable
  )
## Warning: Removed 2 rows containing non-finite values (stat_density).
Sometimes with dense data we need to adjust the position of elements on the plot, otherwise data points might obscure one another. Bar plots frequently stack or dodge the bars to avoid overlap:
count(x = penguins, species, island) %>%
  ggplot(mapping = aes(x = species, y = n, fill = island)) +
  geom_bar(stat = "identity") +
  ggtitle(label = "A stacked bar chart")
count(x = penguins, species, island) %>%
  ggplot(mapping = aes(x = species, y = n, fill = island)) +
  geom_bar(stat = "identity", position = "dodge") +
  ggtitle(label = "A dodged bar chart")
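Stacking and dodging are not the only position adjustments. As a further sketch, `position = "fill"` stacks the bars and rescales each one to the same height, so that proportions can be compared; the object name `p_fill` is only for illustration.

```r
library(ggplot2)
library(dplyr)
library(palmerpenguins)

p_fill <- count(x = penguins, species, island) %>%
  ggplot(mapping = aes(x = species, y = n, fill = island)) +
  geom_col(position = "fill") +   # geom_col() is geom_bar(stat = "identity")
  ggtitle(label = "A filled bar chart (proportions)")
p_fill
```

The y axis now runs from 0 to 1, so the chart shows the share of each island within a species rather than the raw counts.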
Sometimes scatterplots with few unique \(x\) and \(y\) values are jittered (random noise is added) to reduce overplotting.
ggplot(data = penguins,
       mapping = aes(x = species,
                     y = body_mass_g)) +
  geom_point() +
  ggtitle("A point geom with obscured data points")
## Warning: Removed 2 rows containing missing values (geom_point).
ggplot(data = penguins,
       mapping = aes(x = species,
                     y = body_mass_g)) +
  geom_jitter() +
  ggtitle("A point geom with jittered data points")
## Warning: Removed 2 rows containing missing values (geom_point).
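By default `geom_jitter()` adds noise in both directions, which also nudges the measured body mass. A possible refinement, sketched below, jitters only along the species axis; the values `width = 0.2` and `alpha = 0.5` are arbitrary choices for illustration.

```r
library(ggplot2)
library(palmerpenguins)

p_jitter <- ggplot(data = penguins,
                   mapping = aes(x = species, y = body_mass_g)) +
  # width: horizontal spread; height = 0 keeps the measured body mass unchanged
  geom_jitter(width = 0.2, height = 0, alpha = 0.5) +
  ggtitle("Jitter only along the species axis")
p_jitter
```

Because the noise is random, the horizontal positions change slightly every time the code is run.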
A statistical transformation (stat) pre-transforms the data, before plotting. For instance, in a bar graph you might summarize the data by `count`ing the total number of observations within a set of categories, and then plotting the count.
count(x = penguins, island)
## # A tibble: 3 × 2
## island n
## <fct> <int>
## 1 Biscoe 168
## 2 Dream 124
## 3 Torgersen 52
mydat <- count(penguins, island)
ggplot(data = mydat) +
  geom_col(aes(x = island, y = n))
penguins %>% # This IS a pipe operator!!
  count(., island) %>% # "." represents what is passed from the preceding command
  ggplot(.) +
  geom_col(aes(x = island, y = n))
penguins %>% # Our pipe operator
  ggplot(.) + # "." becomes the penguins dataset
  geom_bar(aes(x = island)) # Note: y = count, and is computed internally!!
Sometimes you don’t need to make a statistical transformation. For example, in a scatterplot you use the raw values for the \(x\) and \(y\) variables to map onto the graph. In these situations, the statistical transformation is an identity transformation - the stat simply passes in the original dataset and exports the exact same dataset.
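To see that counting by hand and letting the stat count give the same numbers, one option is to inspect what ggplot2 computed for the layer with `layer_data()`. This is a sketch; the names `manual_counts`, `p_bar` and `stat_counts` are introduced here for illustration.

```r
library(ggplot2)
library(dplyr)
library(palmerpenguins)

# Counting by hand, as geom_col() needs
manual_counts <- count(penguins, island)

# Letting geom_bar() count: layer_data() returns what the stat computed
p_bar <- ggplot(penguins) + geom_bar(aes(x = island))
stat_counts <- layer_data(p_bar)$count

manual_counts$n
stat_counts
```

Both vectors hold the island counts shown in the `count()` output earlier, which is why the `geom_col()` and `geom_bar()` versions draw the same chart.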
A scale controls how data is mapped to aesthetic attributes, so we need one scale for every aesthetic property employed in a layer. For example, this graph defines a scale for color:
ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm,
                     y = bill_length_mm,
                     color = species)) +
  geom_point()
## Warning: Removed 2 rows containing missing values (geom_point).
The scale can be changed to use a different color palette:
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm,
                     y = body_mass_g,
                     color = species)) +
  geom_point() +
  scale_color_brewer(palette = "Dark2", direction = -1)
## Warning: Removed 2 rows containing missing values (geom_point).
Now we are using a different palette, but the scale is still consistent: all Adelie penguins utilize the same color, whereas Chinstrap use a new color, and so do all the Gentoos.
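A scale can also be given explicit values instead of a ready-made palette. The sketch below uses `scale_color_manual()` with a named vector; the three colours and the name `species_colours` are hypothetical choices for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Hypothetical colour choices; any valid R colour names or hex codes would do
species_colours <- c(Adelie = "darkorange", Chinstrap = "purple", Gentoo = "cyan4")

p_manual <- ggplot(data = penguins,
                   mapping = aes(x = bill_length_mm,
                                 y = body_mass_g,
                                 color = species)) +
  geom_point() +
  scale_color_manual(values = species_colours)
p_manual
```

Naming the vector ties each colour to a species, so the assignment does not depend on the order of the factor levels.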
A coordinate system (coord) maps the position of objects onto the plane of the plot, and controls how the axes and grid lines are drawn. Plots typically use two coordinates (\(x, y\)), but could use any number of coordinates. Most plots are drawn using the Cartesian coordinate system:
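x1 <- c(1, 10)
y1 <- c(1, 5)
# qplot() is deprecated in recent ggplot2 releases, so the same plot is built with ggplot()
p <- ggplot(data.frame(x = x1, y = y1), aes(x = x, y = y)) +
  geom_point() +
  labs(x = NULL, y = NULL) +
  theme_bw()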
p +
  ggtitle(label = "Cartesian coordinate system")
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point() +
  coord_polar()
## Warning: Removed 2 rows containing missing values (geom_point).
The Cartesian system requires a fixed and equal spacing between values on the axes. That is, the graph draws the same distance between 1 and 2 as it does between 5 and 6. The graph could instead be drawn using a semi-log coordinate system, which logarithmically compresses the distance on an axis:
p +
  coord_trans(y = "log10") +
  ggtitle(label = "Semi-log coordinate system")
Or could even be drawn using polar coordinates:
p +
  coord_polar() +
  ggtitle(label = "Polar coordinate system")
Faceting can be used to split the data up into subsets of the entire dataset. This is a powerful tool when investigating whether patterns are the same or different across conditions, and allows the subsets to be visualized on the same plot (known as conditioned or trellis plots). The faceting specification describes which variables should be used to split up the data, and how they should be arranged.
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm,
                     y = body_mass_g)) +
  geom_point() +
  facet_wrap(~ island)
## Warning: Removed 2 rows containing missing values (geom_point).
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm,
                     y = body_mass_g,
                     color = sex)) +
  geom_point() +
  facet_grid(species ~ island, scales = "free_y")
## Warning: Removed 2 rows containing missing values (geom_point).
# Ria's explanation: This code did not work because....
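The arrangement of the panels can be controlled as well. As a small sketch, `facet_wrap()` accepts `ncol` to set the number of columns; splitting by species and stacking the panels vertically is just one illustrative choice.

```r
library(ggplot2)
library(palmerpenguins)

p_facet <- ggplot(data = penguins,
                  mapping = aes(x = bill_length_mm,
                                y = body_mass_g)) +
  geom_point() +
  facet_wrap(~ species, ncol = 1)   # one panel per species, stacked vertically
p_facet

# Each row of the built data records which panel it is drawn in
table(layer_data(p_facet)$PANEL)
```

Stacking the panels keeps the \(x\) axis aligned, which makes bill lengths easy to compare across species.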
Rather than explicitly declaring each component of a layered graphic (which uses more code and introduces opportunities for errors), we can establish intelligent defaults for specific geoms and scales. For instance, whenever we want to use a bar geom, we can default to using a stat that counts the number of observations in each group of our variable in the \(x\) position.
Consider the following scenario: you wish to generate a scatterplot visualizing the relationship between penguins’ bill_length and their body_mass. With no defaults, the code to generate this graph is:
ggplot() +
  layer(
    data = penguins,
    mapping = aes(x = bill_length_mm,
                  y = body_mass_g),
    geom = "point",
    stat = "identity",
    position = "identity"
  ) +
  scale_x_continuous() +
  scale_y_continuous() +
  coord_cartesian()
## Warning: Removed 2 rows containing missing values (geom_point).
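The above code:
- Creates a new plot object (`ggplot`)
- Adds a layer (`layer`)
  - Specifies the data (`penguins`)
  - Defines the aesthetic mapping (`mapping`)
  - Uses a point geometric object (`geom = "point"`)
  - Applies no statistical transformation or position adjustment (`stat = "identity"` and `position = "identity"`)
- Establishes two continuous position scales (`scale_x_continuous` and `scale_y_continuous`)
- Declares a cartesian coordinate system (`coord_cartesian`)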
How can we simplify this using intelligent defaults?
- We only need to specify one of geom or stat, since each geom has a default stat.
- Cartesian coordinate systems are the most commonly used, so they should be the default.
- Default scales can be added based on the aesthetic and type of variables.
Using these defaults, we can rewrite the above code as:
ggplot() +
  geom_point(data = penguins,
             mapping = aes(x = bill_length_mm,
                           y = body_mass_g))
## Warning: Removed 2 rows containing missing values (geom_point).
This generates the exact same plot, but uses fewer lines of code. Because multiple layers can use the same components (data, mapping, etc.), we can also specify that information in the `ggplot()` function rather than in the `layer()` function:
ggplot(data = penguins,
       mapping = aes(x = bill_length_mm,
                     y = body_mass_g)) +
  geom_point()
## Warning: Removed 2 rows containing missing values (geom_point).
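The benefit of declaring data and mapping in `ggplot()` shows up once a plot has more than one layer. In the sketch below a second layer, `geom_smooth(method = "lm")`, is added purely as an illustration; both layers inherit the same data and mapping without repeating them.

```r
library(ggplot2)
library(palmerpenguins)

p_layers <- ggplot(data = penguins,
                   mapping = aes(x = bill_length_mm,
                                 y = body_mass_g)) +
  geom_point() +                 # layer 1 inherits data and mapping
  geom_smooth(method = "lm")     # layer 2 inherits the same data and mapping
p_layers
```

If one layer needs something different, its own `data` or `mapping` argument can still override what was set in `ggplot()`.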
And as we will learn, function arguments in R have a specific order, so we can omit the explicit argument names `data` and `mapping`:
ggplot(penguins, aes(bill_length_mm, body_mass_g)) +
  geom_point()
## Warning: Removed 2 rows containing missing values (geom_point).
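One way to convince ourselves that the named and the positional versions are the same plot is to compare the data ggplot2 builds for each. This is a sketch; the names `p_named` and `p_positional` are introduced here for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Arguments matched by name
p_named <- ggplot(data = penguins,
                  mapping = aes(x = bill_length_mm, y = body_mass_g)) +
  geom_point()

# Arguments matched by position: data first, then mapping; x first, then y
p_positional <- ggplot(penguins, aes(bill_length_mm, body_mass_g)) +
  geom_point()

# Compare the x and y positions computed for the point layer
all.equal(layer_data(p_named)[c("x", "y")],
          layer_data(p_positional)[c("x", "y")])
```

Positional matching is shorter but depends on the order: writing `aes(body_mass_g, bill_length_mm)` would silently swap the axes.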