R Markdown & life advice to past self analyzing datasets

Trust me when I say it will make your life much easier if you keep a running document with your analyses and explanations of what you did and why you were doing it, especially if you are juggling multiple projects and need to put something down for a month or so.

I in the past had the “organisational” strategy of having a giant .R file with all my R code for a project, and I would occaisionally update a separate word document with some of the better plots and try to write them up.

This is a bad strategy, don’t be past me

Having a single document where you mix code, figures, and explanations will make things much easier. Karl Broman has a good talk on reproducible research, with pointers lying outside of just the report stage.

library(ghostr)
library(nycflights13)
library(ggplot2)
library(readr)
library(dplyr)
library(viridis)
data('ghost_sightings')
data('flights')

visualize with ggplot2

ggplot2 is a really powerful visualization, assuming that your data is in a tidy format (one observation per row, each variables has its own column). If your data is not yet “tidy”, tidyr and dplyr can greatly smooth that process.

The grammar of graphics philosophy breaks down the mapping data to plot variables like x and y position separately from the actual type of plot being created, such as a dot plot or histogram. Let’s start by looking at the relationship between the size of the engine (displ) and city mileage (cty). We first call ggplot and define the aesthetics for the x and y variables in the eventual plot, regardless of what shapes are drawn (points or lines or whatever). Then we add geom_ calls to draw the actual shapes for the plot we want. Let’s begin with a dot plot.

basic plots

glimpse(mpg)
## Observations: 234
## Variables: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "...
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 qua...
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0,...
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1...
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6...
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)...
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4",...
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 1...
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 2...
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p",...
## $ class        <chr> "compact", "compact", "compact", "compact", "comp...
ggplot(mpg,aes(displ,cty))+geom_point()

ggplot(mpg,aes(displ,cty,color=drv))+geom_point()

faceting

Faceting is a great way plot parts of your data conditioned on a variable. In this example, we will facet on “color” to see how the relationship between carat and price is affected by this third variable.

ggplot(diamonds,aes(carat,price))+geom_point()+facet_wrap(~color)

color choice matters

Be mindful of the colors that you use when plotting. A substantial fraction of the population is colorblind to varying degrees, and if people print out your figures in black and white points might be indistinguishable without other features to help.

The viridis package provides an aesthetically non-gross colormap that is easier for colorblind folk to read and is more legible when printed in black and white. See their vignette for more details.

library(viridis)
d <- ggplot(diamonds, aes(carat, price))
d + geom_hex() +ggtitle("default ggplot2")

d + geom_hex() + scale_fill_viridis() + ggtitle("viridis default")

There are 5 dplyr verbs

The package “dplyr” makes a lot of basic dataframe manipulation and transformation easier than the equivalent base code, or doing selections just based on column comparisons. It also often ends up being faster as well, especially for larger datasets. Check out this handy cheatsheet on the various manipulations; the dplyr introductary vignette is also very helpful.

filter

filter lets you subset the data using conditional expressions. The following filter is equivalent to flights[flights$delay>0,] . As the conditions become more complex, filter becomes easier and easier to use compared to using the base subsetting options.

glimpse(flights) #glimpse is a good way to get info on all the column variables
## Observations: 336,776
## Variables: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ time_hour      <time> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...
flights2 <- filter(flights, dep_delay>0)
glimpse(flights2)
## Observations: 128,432
## Variables: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time       <int> 517, 533, 542, 601, 608, 611, 613, 623, 632, 64...
## $ sched_dep_time <int> 515, 529, 540, 600, 600, 600, 610, 610, 608, 63...
## $ dep_delay      <dbl> 2, 4, 2, 1, 8, 11, 3, 13, 24, 8, 1, 1, 1, 2, 9,...
## $ arr_time       <int> 830, 850, 923, 844, 807, 945, 925, 920, 740, 93...
## $ sched_arr_time <int> 819, 830, 850, 850, 735, 931, 921, 915, 728, 94...
## $ arr_delay      <dbl> 11, 20, 33, -6, 32, 14, 4, 5, 12, -9, -6, -7, -...
## $ carrier        <chr> "UA", "UA", "AA", "B6", "MQ", "UA", "B6", "AA",...
## $ flight         <int> 1545, 1714, 1141, 343, 3768, 303, 135, 1837, 41...
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N644JB", "N9EAMQ...
## $ origin         <chr> "EWR", "LGA", "JFK", "EWR", "EWR", "JFK", "JFK"...
## $ dest           <chr> "IAH", "IAH", "MIA", "PBI", "ORD", "SFO", "RSW"...
## $ air_time       <dbl> 227, 227, 160, 147, 139, 366, 175, 153, 52, 151...
## $ distance       <dbl> 1400, 1416, 1089, 1023, 719, 2586, 1074, 1096, ...
## $ hour           <dbl> 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7,...
## $ minute         <dbl> 15, 29, 40, 0, 0, 0, 10, 10, 8, 36, 45, 45, 0, ...
## $ time_hour      <time> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...

select()

select cuts out just the variables that you want to work with at the moment. The nycflights13 dataset is a good example where the full dataset of 19 variables can be a bit unwieldy if you don’t care about some or most of the variables for the given analysis at hand.

flights2 <- select(flights, dep_delay,arr_delay,carrier,air_time,time_hour)

glimpse(flights2)
## Observations: 336,776
## Variables: 5
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2,...
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7,...
## $ carrier   <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6"...
## $ air_time  <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149...
## $ time_hour <time> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-0...
glimpse(select(flights,dep_delay:air_time)) #subset by range of named columns
## Observations: 336,776
## Variables: 10
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2,