R Markdown & life advice to past self analyzing datasets

Trust me when I say it will make your life much easier if you keep a running document with your analyses and explanations of what you did and why you were doing it, especially if you are juggling multiple projects and need to put something down for a month or so.

I in the past had the “organisational” strategy of having a giant .R file with all my R code for a project, and I would occaisionally update a separate word document with some of the better plots and try to write them up.

This is a bad strategy, don’t be past me

Having a single document where you mix code, figures, and explanations will make things much easier. Karl Broman has a good talk on reproducible research, with pointers lying outside of just the report stage.

library(ghostr)
library(nycflights13)
library(ggplot2)
library(readr)
library(dplyr)
library(viridis)
data('ghost_sightings')
data('flights')

visualize with ggplot2

ggplot2 is a really powerful visualization, assuming that your data is in a tidy format (one observation per row, each variables has its own column). If your data is not yet “tidy”, tidyr and dplyr can greatly smooth that process.

The grammar of graphics philosophy breaks down the mapping data to plot variables like x and y position separately from the actual type of plot being created, such as a dot plot or histogram. Let’s start by looking at the relationship between the size of the engine (displ) and city mileage (cty). We first call ggplot and define the aesthetics for the x and y variables in the eventual plot, regardless of what shapes are drawn (points or lines or whatever). Then we add geom_ calls to draw the actual shapes for the plot we want. Let’s begin with a dot plot.

basic plots

glimpse(mpg)

## Observations: 234
## Variables: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "...
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 qua...
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0,...
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1...
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6...
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)...
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4",...
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 1...
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 2...
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p",...
## $ class        <chr> "compact", "compact", "compact", "compact", "comp...

ggplot(mpg,aes(displ,cty))+geom_point()

ggplot(mpg,aes(displ,cty,color=drv))+geom_point()

faceting

Faceting is a great way plot parts of your data conditioned on a variable. In this example, we will facet on “color” to see how the relationship between carat and price is affected by this third variable.

ggplot(diamonds,aes(carat,price))+geom_point()+facet_wrap(~color)

color choice matters

Be mindful of the colors that you use when plotting. A substantial fraction of the population is colorblind to varying degrees, and if people print out your figures in black and white points might be indistinguishable without other features to help.

The viridis package provides an aesthetically non-gross colormap that is easier for colorblind folk to read and is more legible when printed in black and white. See their vignette for more details.

library(viridis)
d <- ggplot(diamonds, aes(carat, price))
d + geom_hex() +ggtitle("default ggplot2")

d + geom_hex() + scale_fill_viridis() + ggtitle("viridis default")

There are 5 dplyr verbs

The package “dplyr” makes a lot of basic dataframe manipulation and transformation easier than the equivalent base code, or doing selections just based on column comparisons. It also often ends up being faster as well, especially for larger datasets. Check out this handy cheatsheet on the various manipulations; the dplyr introductary vignette is also very helpful.

filter

filter lets you subset the data using conditional expressions. The following filter is equivalent to flights[flights$delay>0,] . As the conditions become more complex, filter becomes easier and easier to use compared to using the base subsetting options.

glimpse(flights) #glimpse is a good way to get info on all the column variables

## Observations: 336,776
## Variables: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ time_hour      <time> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...

flights2 <- filter(flights, dep_delay>0)
glimpse(flights2)

## Observations: 128,432
## Variables: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time       <int> 517, 533, 542, 601, 608, 611, 613, 623, 632, 64...
## $ sched_dep_time <int> 515, 529, 540, 600, 600, 600, 610, 610, 608, 63...
## $ dep_delay      <dbl> 2, 4, 2, 1, 8, 11, 3, 13, 24, 8, 1, 1, 1, 2, 9,...
## $ arr_time       <int> 830, 850, 923, 844, 807, 945, 925, 920, 740, 93...
## $ sched_arr_time <int> 819, 830, 850, 850, 735, 931, 921, 915, 728, 94...
## $ arr_delay      <dbl> 11, 20, 33, -6, 32, 14, 4, 5, 12, -9, -6, -7, -...
## $ carrier        <chr> "UA", "UA", "AA", "B6", "MQ", "UA", "B6", "AA",...
## $ flight         <int> 1545, 1714, 1141, 343, 3768, 303, 135, 1837, 41...
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N644JB", "N9EAMQ...
## $ origin         <chr> "EWR", "LGA", "JFK", "EWR", "EWR", "JFK", "JFK"...
## $ dest           <chr> "IAH", "IAH", "MIA", "PBI", "ORD", "SFO", "RSW"...
## $ air_time       <dbl> 227, 227, 160, 147, 139, 366, 175, 153, 52, 151...
## $ distance       <dbl> 1400, 1416, 1089, 1023, 719, 2586, 1074, 1096, ...
## $ hour           <dbl> 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7,...
## $ minute         <dbl> 15, 29, 40, 0, 0, 0, 10, 10, 8, 36, 45, 45, 0, ...
## $ time_hour      <time> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...

select()

select cuts out just the variables that you want to work with at the moment. The nycflights13 dataset is a good example where the full dataset of 19 variables can be a bit unwieldy if you don’t care about some or most of the variables for the given analysis at hand.

flights2 <- select(flights, dep_delay,arr_delay,carrier,air_time,time_hour)

glimpse(flights2)

## Observations: 336,776
## Variables: 5
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2,...
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7,...
## $ carrier   <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6"...
## $ air_time  <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149...
## $ time_hour <time> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-0...

glimpse(select(flights,dep_delay:air_time)) #subset by range of named columns

## Observations: 336,776
## Variables: 10
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...

mutate()

Create new columns using mutate based on transformations or combinations of existing columns.

flights2 <- mutate(flights, 
                   gain=arr_delay- dep_delay,
                   speed=distance/air_time * 60 )

glimpse(flights2)

## Observations: 336,776
## Variables: 21
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ time_hour      <time> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...
## $ gain           <dbl> 9, 16, 31, -17, -19, 16, 24, -11, -5, 10, 0, -1...
## $ speed          <dbl> 370.0441, 374.2731, 408.3750, 516.7213, 394.137...

arrange()

Sort dataframes according to different variables with arrange(). Defaults to ascending order, use desc() to reverse to descending order.

flights2 = arrange(flights, dep_delay)
glimpse(flights2)

## Observations: 336,776
## Variables: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month          <int> 12, 2, 11, 1, 1, 8, 10, 3, 3, 5, 5, 9, 10, 11, ...
## $ day            <int> 7, 3, 10, 11, 29, 9, 23, 30, 2, 5, 14, 18, 29, ...
## $ dep_time       <int> 2040, 2022, 1408, 1900, 1703, 729, 1907, 2030, ...
## $ sched_dep_time <int> 2123, 2055, 1440, 1930, 1730, 755, 1932, 2055, ...
## $ dep_delay      <dbl> -43, -33, -32, -30, -27, -26, -25, -25, -24, -2...
## $ arr_time       <int> 40, 2240, 1549, 2233, 1947, 1002, 2143, 2213, 1...
## $ sched_arr_time <int> 2352, 2338, 1559, 2243, 1957, 955, 2143, 2250, ...
## $ arr_delay      <dbl> 48, -58, -10, -10, -10, 7, 0, -37, -30, -44, -2...
## $ carrier        <chr> "B6", "DL", "EV", "DL", "F9", "MQ", "EV", "MQ",...
## $ flight         <int> 97, 1715, 5713, 1435, 837, 3478, 4361, 4573, 33...
## $ tailnum        <chr> "N592JB", "N612DL", "N825AS", "N934DL", "N208FR...
## $ origin         <chr> "JFK", "LGA", "LGA", "LGA", "LGA", "LGA", "EWR"...
## $ dest           <chr> "DEN", "MSY", "IAD", "TPA", "DEN", "DTW", "TYS"...
## $ air_time       <dbl> 265, 162, 52, 139, 250, 88, 111, 87, 55, 150, 1...
## $ distance       <dbl> 1626, 1183, 229, 1010, 1620, 502, 631, 502, 301...
## $ hour           <dbl> 21, 20, 14, 19, 17, 7, 19, 20, 14, 9, 9, 16, 21...
## $ minute         <dbl> 23, 55, 40, 30, 30, 55, 32, 55, 55, 58, 38, 55,...
## $ time_hour      <time> 2013-12-07 21:00:00, 2013-02-03 20:00:00, 2013...

flights2 = arrange(flights, desc(dep_delay))
glimpse(flights2)

## Observations: 336,776
## Variables: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month          <int> 1, 6, 1, 9, 7, 4, 3, 6, 7, 12, 5, 1, 2, 5, 12, ...
## $ day            <int> 9, 15, 10, 20, 22, 10, 17, 27, 22, 5, 3, 1, 10,...
## $ dep_time       <int> 641, 1432, 1121, 1139, 845, 1100, 2321, 959, 22...
## $ sched_dep_time <int> 900, 1935, 1635, 1845, 1600, 1900, 810, 1900, 7...
## $ dep_delay      <dbl> 1301, 1137, 1126, 1014, 1005, 960, 911, 899, 89...
## $ arr_time       <int> 1242, 1607, 1239, 1457, 1044, 1342, 135, 1236, ...
## $ sched_arr_time <int> 1530, 2120, 1810, 2210, 1815, 2211, 1020, 2226,...
## $ arr_delay      <dbl> 1272, 1127, 1109, 1007, 989, 931, 915, 850, 895...
## $ carrier        <chr> "HA", "MQ", "MQ", "AA", "MQ", "DL", "DL", "DL",...
## $ flight         <int> 51, 3535, 3695, 177, 3075, 2391, 2119, 2007, 20...
## $ tailnum        <chr> "N384HA", "N504MQ", "N517MQ", "N338AA", "N665MQ...
## $ origin         <chr> "JFK", "JFK", "EWR", "JFK", "JFK", "JFK", "LGA"...
## $ dest           <chr> "HNL", "CMH", "ORD", "SFO", "CVG", "TPA", "MSP"...
## $ air_time       <dbl> 640, 74, 111, 354, 96, 139, 167, 313, 109, 149,...
## $ distance       <dbl> 4983, 483, 719, 2586, 589, 1005, 1020, 2454, 76...
## $ hour           <dbl> 9, 19, 16, 18, 16, 19, 8, 19, 7, 17, 20, 18, 8,...
## $ minute         <dbl> 0, 35, 35, 45, 0, 0, 10, 0, 59, 0, 55, 35, 30, ...
## $ time_hour      <time> 2013-01-09 09:00:00, 2013-06-15 19:00:00, 2013...

summarize()

Collapse data to a single row. Especially useful if you want to summarise subsets of the data based on a variable with group_by().

summarize(flights, avg_depdelay = mean(dep_delay, na.rm = TRUE) )

## # A tibble: 1 x 1
##   avg_depdelay
##          <dbl>
## 1     12.63907

Chaining dplyr verbs with %>% (pipe)

The pipe operater let’s you compactly perform operations in a left to right order. The unpiped version goes from inside out as nested function calls.

flights2 = flights %>%
                              filter(dep_delay > 0) %>%
                              select(dep_delay, arr_delay, carrier) %>%
                              mutate(gain = arr_delay - dep_delay)
glimpse(flights2)

## Observations: 128,432
## Variables: 4
## $ dep_delay <dbl> 2, 4, 2, 1, 8, 11, 3, 13, 24, 8, 1, 1, 1, 2, 9, 2, 3...
## $ arr_delay <dbl> 11, 20, 33, -6, 32, 14, 4, 5, 12, -9, -6, -7, -31, 4...
## $ carrier   <chr> "UA", "UA", "AA", "B6", "MQ", "UA", "B6", "AA", "EV"...
## $ gain      <dbl> 9, 16, 31, -7, 24, 3, 1, -8, -12, -17, -7, -8, -32, ...

#equivalent but much harder to read code. 
#Even breaking onto multiple lines doesn't really help
flights2 = mutate( select( filter(flights,dep_delay > 0), 
                         dep_delay, arr_delay, carrier), 
                  gain = arr_delay - dep_delay)
glimpse(flights2)

## Observations: 128,432
## Variables: 4
## $ dep_delay <dbl> 2, 4, 2, 1, 8, 11, 3, 13, 24, 8, 1, 1, 1, 2, 9, 2, 3...
## $ arr_delay <dbl> 11, 20, 33, -6, 32, 14, 4, 5, 12, -9, -6, -7, -31, 4...
## $ carrier   <chr> "UA", "UA", "AA", "B6", "MQ", "UA", "B6", "AA", "EV"...
## $ gain      <dbl> 9, 16, 31, -7, 24, 3, 1, -8, -12, -17, -7, -8, -32, ...

rapid plotting

flights %>%                                        # Start with the 'flights' dataset
  filter(dep_delay> 0) %>%                        # Then, filter down to rows where departure was delayed
  ggplot(aes(x=carrier,y=dep_delay)) +             # Then, plot using ggplot
  geom_boxplot()  +                                #  and create a boxplot
  coord_flip()                                    #flip axes to better use space

Grouping operations with group_by()

group_by defines a variable to cut the original dataset into mini-datasets for further operations. This example is lifted verbatim from the dplyr vignette. Here, the grouping variable is unique plains, thereby averaging all the flights an airplane did together. Therefore, each point in the figure represents one airplane’s data.

by_tailnum <- group_by(flights, tailnum)
delay <- summarise(by_tailnum,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE))
delay <- filter(delay, count > 20, dist < 2000)

# Interestingly, the average delay is only slightly related to the
# average distance flown by a plane.
ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area()

## Warning: Removed 1 rows containing non-finite values (stat_smooth).

## Warning: Removed 1 rows containing missing values (geom_point).

Mapping and geoVis stuff

A good tool if you are interested in this type of data is “ggmap” . We will use it now to plot ghost sightings and their relative density in Kentucky.

library(ggmap)
library(ghostr)
data(ghost_sightings)
ky_bb=make_bbox(lon,lat,data=ghost_sightings,f=.1)
map <- get_stamenmap(ky_bb, zoom = 7, maptype = "toner-lite")
ggmap(map) + 
  geom_point(aes(x=lon,y=lat,size=sightings),data=ghost_sightings) +
  xlab("Longitude")+ylab("Latitude")

#can use data from multiple datasets by specifying data= in geoms

Twitter mining and social network analysis

Here I will introduce some code to at people talking around hashtags (mostly scientific conferences). You can get the code at https://github.com/thomas-keller/tweet-conf . A more extended explanation of these analyses is at my website http://thomas-keller.github.io/articles/I-analyzed-evolution-2016-twitter-and-you-can-too-for-other-conferences/ .

In this example I’m just using a csv of parsed tweets rather than downloading something to introduce one less thing than can go wrong.

library(twitteR)
library(ROAuth)
library(tidytext)
library(ggplot2)
library(wordcloud)
library(dplyr)

#formatTwDate from SmappR https://github.com/SMAPPNYU/smappR/
#By Pablo Barbera http://pablobarbera.com/ and others

formatTwDate <- function(datestring, format="datetime"){
  if (format=="datetime"){
    date <- as.POSIXct(datestring, format="%a %b %d %H:%M:%S %z %Y")
  }
  if (format=="date"){
    date <- as.Date(datestring, format="%a %b %d %H:%M:%S %z %Y")
  }   
  return(date)
}

#search along the hashtag (can be have multiple hashtags if you want/need)
#convert to dataframe
hashtag<-'#SciPy2016'
confname<-substr(hashtag,2,nchar(hashtag))
#tw_list <- searchTwitter(hashtag, n = 1e4, since = '2016-08-3')# , until='2016-07-14') #~5k tweets
tw_df<-read_csv('scipy2016.csv') 
#tw_df<-twListToDF(tw_list)
tw_df<-unique(tw_df)
filename<-paste0(confname,".csv")
#write.csv(tw_df,file=filename,row.names=F)

Here is a simple wordcloud that demonstrates using tidytext (unnest_tokens) and dplyr chaining. The pipe %>% basically feeds the object on the left to the function on the right.

users<-data.frame(word=tolower(tw_df$screenName),lexicon=rep('whatevs',nrow(tw_df)))
#breaks down tweets into words for tidy (word) level analyses
tidy_tw<-tw_df %>% unnest_tokens(word,text)

#removes uninformatives words / ones that oversaturate wordcloud
tw_stop<-data.frame(word=c(confname,tolower(confname),'htt','25','http','amp','gt','t.c','rt','https','t.co','___','1','2','3','4','5','6','7','8','9',"i\'m",'15','30','45','00','10'),lexicon='whatevs')
data("stop_words")
tidy_cloud <- tidy_tw %>%
 anti_join(tw_stop) %>%
  anti_join(stop_words) %>%
  anti_join(users)

print(tidy_cloud %>% count(word, sort = TRUE))

## # A tibble: 2,532 x 2
##          word     n
##         <chr> <int>
## 1      python   543
## 2        talk   518
## 3  jupyterlab   292
## 4        tool   220
## 5       scipy   196
## 6     jupyter   180
## 7        data   171
## 8       alpha   168
## 9        blog   167
## 10       post   167
## # ... with 2,522 more rows

tidy_cloud %>%
 count(word) %>%
 with(wordcloud(word, n,max.words = 100,colors=brewer.pal(8,'Dark2')))

Now we want to get a sense of who are the most active users of this hashtag. Here I’m introducing some ways to modify the default ggplot output

user.tweets <- as.data.frame(table(tw_df$screenName))
names(user.tweets) <- c("User", "Tweets")

# Order the table by number of tweets per user & do some culling
user.tweets <- user.tweets[with(user.tweets, order(-Tweets)), ]
user.tweets_fig<-user.tweets[user.tweets$Tweets>2,]
user.tweets_fig<-user.tweets_fig[1:40,]

#make the plot for the top 40 or so
#I normally hate the x and y guide lines, but they serve a purpose with the extreme skew and names
ggplot(data=user.tweets_fig, aes(x=reorder(User, Tweets), y=Tweets)) +
  geom_bar(stat='identity') +
  coord_flip() +
  scale_y_continuous("Tweets") +
  scale_x_discrete("User") +
  labs(title = paste(hashtag, " tweets per user")) +
  theme_bw() +
  theme(axis.title = element_text(face="bold"), axis.text.y = element_text(size=6))

Twitter sentiment

This is a simple “bag of words” sentiment analysis

tidy_tw$created<-formatTwDate(tidy_tw$created)
tw_df$created<-formatTwDate(tw_df$created)


library(tidyr)
bing <- sentiments %>%
 filter(lexicon == "bing") %>%
 select(-score)

conf_sent <- tidy_tw %>%
 inner_join(bing) %>%
 count(id, sentiment) %>% 
 spread(sentiment, n, fill = 0) %>%
 mutate(sentiment = positive - negative) %>%
 inner_join(tw_df[,c(5,8)]) #join on id and created

library(cowplot)
library(scales)
library(lubridate)

#adjust time zone of tweets with lubridate
conf_sent$created<-ymd_hms(conf_sent$created,tz='EST')

#Example could include label, but don't have time to figure out what is driving
#inflection points of moods during these other conferences
df_labels<-data.frame(times=strptime(c("2016-07-13 12:00:00","2016-07-15 0:00:00","2016-07-16 16:30:00","2016-07-18 6:30:00"),"%Y-%m-%d %H:%M:%S"),
                      labels=c("it begins!\nmixers for all","science cafe\nfunny-man",'final day\nmixer stuff','that was pretty\ngood reflection'),
                      y=c(1.5,1.0,1.0,1.0))
ggplot(conf_sent, aes(created, sentiment)) +
 geom_smooth() + xlab("tweet time") + ylab("tweet sentiment")+
 scale_x_datetime(breaks = date_breaks("day")) + background_grid(major = "xy", minor = "none") +
 theme(axis.text.x=element_text(angle=315,vjust=.6))+
  #geom_text(data=df_labels,aes(x=times,y=y,label=labels),size=4)+
  ggtitle(paste(hashtag,"positive or negative emotions (think first order ~vibe of conf.)"))

 #coord_cartesian(ylim=c(-.5,1.2)) #+geom_text(data=df_labels,aes(x=times,y=y,label=labels),size=4)

tweet retweet network

library(twitteR)
library(tidytext)
library(ggplot2)
library(dplyr)
library(igraph)
library(stringr)

hm<-sort(table(tw_df$screenName))
outdf<-data.frame(screen_name=names(hm),num_tweets=hm)[,c(1,3)]
outdf<-outdf[order(outdf[,2],decreasing=T),]

#OK, start of new code to develop RT network
#code (regex especially!!!) used liberally from
# https://sites.google.com/site/miningtwitter/questions/user-tweets/who-retweet

#TODO:
#replace retweet network construction (not plotting)
#with https://github.com/nfahlgren/conference_twitter_stats/blob/master/retweet_network_generic.R
#it's cleaner and doesn't rely on regex horrors I don't understand

rt_net<-grep("(RT|via)((?:\\b\\W*@\\w+)+)", tw_df$text, 
             ignore.case=TRUE,value=TRUE)
rt_neti<-grep("(RT|via)((?:\\b\\W*@\\w+)+)", tw_df$text, 
              ignore.case=TRUE)

#next, create list to store user names
who_retweet <- as.list(1:length(rt_net))
who_post <- as.list(1:length(rt_net))

# for loop
for (i in 1:length(rt_net))
{ 
  # get tweet with retweet entity
  #nrow= ???
  twit <- tw_df[rt_neti[i],]
  # get retweet source 
  poster<-str_extract_all(twit$text,"(RT|via)((?:\\b\\W*@\\w+)+)")  
  #remove ':'
  poster <- gsub(":", "", unlist(poster)) 
  # name of retweeted user
  who_post[[i]] <- gsub("(RT @|via @)", "", poster, ignore.case=TRUE) 
  # name of retweeting user 
  who_retweet[[i]] <- rep(twit$screenName, length(poster)) 
}

# unlist
who_post <- unlist(who_post)
who_retweet <- unlist(who_retweet)

####
#Preprocessing the dataframes as as contacts to something
#igraph likes

#I guess I need an edge aesthetic for ggraph to paint with
retweeter_poster <- data.frame(from=who_retweet, to=who_post,retweets=1)

#filters out some bad parsing and users who arent in the node graph
#node_df has the screen_name and number of tweets per user, which will serve as the vertex/node dataframe
#in igraph speak
node_df<-outdf
names(node_df)<-c("id","num_tweets")
#This step #REALLLY IMPORTANT# for plotting purposes, determines how dense the network is
#need to tune based on how big you want your input network is
node_df2<-droplevels(node_df[1:50,]) #selecting only the top 50 posting from #evol2016 for plotting purposes
filt_rt_post<-retweeter_poster[retweeter_poster$from %in% node_df2$id & retweeter_poster$to %in% node_df2$id,]
filt_rt_post<-droplevels(filt_rt_post) #ditch all those fleshbags that had to talk to people instead of tweeting
head(filt_rt_post)

##              from           to retweets
## 43        scopatz       dotsdl        1
## 48      SciPyConf  JackieKazil        1
## 51      SciPyConf      ericmjl        1
## 56    chendaniely     jhamrick        1
## 60     geo_leeman dopplershift        1
## 61 jnuneziglesias    SciPyConf        1

#this creates a directed graph with vertex/node info on num_tweets, and edge info on retweets
rt_graph<-graph_from_data_frame(d=droplevels(filt_rt_post),vertices=droplevels(node_df2),directed=T)

#simplify the graph to remove any possible self retweets since now twitter is dumb any allows that
#and any multiple edges
#have to wait a couple seconds to let graph be generated before simplify call
#merge all the multiple rts a person has into one edge to simplify visualization
rt_graph<-simplify(rt_graph,remove.multiple=T,remove.loops=TRUE,edge.attr.comb='sum')

ggraph network

library(ggraph)
library(ggplot2)
#jpeg('evol2016_top50_twitter_network.jpg',width=960,height=960,pointsize=12)
ggraph(rt_graph,'igraph',algorithm='kk')+
  geom_edge_fan(aes(alpha=retweet),edge_alpha=0.075)+
  geom_node_point(aes(size=num_tweets))+
  geom_node_text(aes(label=name,vjust=-1.5))+
  ggforce::theme_no_axes()+
  theme(legend.position=c(.08,.88))

network degree distribution

deg.dist <-degree_distribution(rt_graph, cumulative=T, mode="all")
deg_df<-data.frame(deg=0:max(degree(rt_graph)),cum_freq=1-deg.dist) 
qplot(deg,cum_freq,data=deg_df,xlab="Degree",ylab="Cumulative Frequency")

intro-r-tidyverse

Thomas E. Keller

August 19, 2016