Trust me when I say it will make your life much easier if you keep a running document with your analyses and explanations of what you did and why you did it, especially if you are juggling multiple projects and need to put one down for a month or so.
In the past, my “organisational” strategy was a giant .R file with all my R code for a project, plus a separate Word document that I would occasionally update with some of the better plots and try to write up.
This is a bad strategy; don’t be past me.
Having a single document where you mix code, figures, and explanations will make things much easier. Karl Broman has a good talk on reproducible research, with pointers that reach beyond just the report stage.
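One way to get there is R Markdown; below is a minimal sketch of a .Rmd skeleton (the title, chunk name, and contents are placeholders, not a prescription) where prose, code, and the resulting figures all knit into one report.

---
title: "My analysis"
output: html_document
---

Why I am doing this analysis, explained in actual sentences.

```{r example-chunk}
library(ggplot2)
# the code, and the figure it makes, live right next to the explanation
ggplot(mpg, aes(displ, cty)) + geom_point()
```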
library(ghostr)
library(nycflights13)
library(ggplot2)
library(readr)
library(dplyr)
library(viridis)
data('ghost_sightings')
data('flights')
ggplot2 is a really powerful visualization package, assuming that your data is in a tidy format (one observation per row, each variable has its own column). If your data is not yet “tidy”, tidyr and dplyr can greatly smooth that process.
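As a quick illustration of tidying, here is a sketch using tidyr’s gather() on a made-up “wide” dataframe (the tree/height data is invented for the example):
library(tidyr)
# one row per tree, one column per year: not tidy
untidy <- data.frame(tree = c("a", "b"),
                     height_2015 = c(1.2, 1.5),
                     height_2016 = c(1.4, 1.7))
# gather the year columns into key/value pairs: now one observation per row
tidy <- gather(untidy, key = "year", value = "height", height_2015:height_2016)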
The grammar of graphics philosophy separates the mapping of data to plot variables, like x and y position, from the actual type of plot being created, such as a dot plot or histogram. Let’s start by looking at the relationship between the size of the engine (displ) and city mileage (cty). We first call ggplot and define the aesthetics for the x and y variables in the eventual plot, regardless of what shapes are drawn (points or lines or whatever). Then we add geom_ calls to draw the actual shapes for the plot we want. Let’s begin with a dot plot.
glimpse(mpg)
## Observations: 234
## Variables: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "...
## $ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 qua...
## $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0,...
## $ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1...
## $ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6...
## $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)...
## $ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4",...
## $ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 1...
## $ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 2...
## $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p",...
## $ class <chr> "compact", "compact", "compact", "compact", "comp...
ggplot(mpg, aes(displ, cty)) + geom_point()
ggplot(mpg, aes(displ, cty, color = drv)) + geom_point()
Faceting is a great way to plot parts of your data conditioned on a variable. In this example, we will facet on “color” to see how the relationship between carat and price is affected by this third variable.
ggplot(diamonds, aes(carat, price)) + geom_point() + facet_wrap(~color)
Be mindful of the colors that you use when plotting. A substantial fraction of the population is colorblind to varying degrees, and if people print out your figures in black and white, points might be indistinguishable without other features to help tell them apart.
The viridis package provides an aesthetically non-gross colormap that is easier for colorblind folk to read and is more legible when printed in black and white. See their vignette for more details.
library(viridis)
d <- ggplot(diamonds, aes(carat, price))
d + geom_hex() + ggtitle("default ggplot2")
d + geom_hex() + scale_fill_viridis() + ggtitle("viridis default")
The dplyr package makes a lot of basic dataframe manipulation and transformation easier than the equivalent base R code, such as subsetting based on column comparisons. It also often ends up being faster, especially for larger datasets. Check out this handy cheatsheet on the various manipulations; the dplyr introductory vignette is also very helpful.
filter lets you subset the data using conditional expressions. The following filter is equivalent to flights[flights$dep_delay > 0, ]. As the conditions become more complex, filter becomes easier and easier to use compared to the base subsetting options (a more involved comparison follows the output below).
glimpse(flights) #glimpse is a good way to get info on all the column variables
## Observations: 336,776
## Variables: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ time_hour <time> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...
flights2 <- filter(flights, dep_delay > 0)
glimpse(flights2)
## Observations: 128,432
## Variables: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time <int> 517, 533, 542, 601, 608, 611, 613, 623, 632, 64...
## $ sched_dep_time <int> 515, 529, 540, 600, 600, 600, 610, 610, 608, 63...
## $ dep_delay <dbl> 2, 4, 2, 1, 8, 11, 3, 13, 24, 8, 1, 1, 1, 2, 9,...
## $ arr_time <int> 830, 850, 923, 844, 807, 945, 925, 920, 740, 93...
## $ sched_arr_time <int> 819, 830, 850, 850, 735, 931, 921, 915, 728, 94...
## $ arr_delay <dbl> 11, 20, 33, -6, 32, 14, 4, 5, 12, -9, -6, -7, -...
## $ carrier <chr> "UA", "UA", "AA", "B6", "MQ", "UA", "B6", "AA",...
## $ flight <int> 1545, 1714, 1141, 343, 3768, 303, 135, 1837, 41...
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N644JB", "N9EAMQ...
## $ origin <chr> "EWR", "LGA", "JFK", "EWR", "EWR", "JFK", "JFK"...
## $ dest <chr> "IAH", "IAH", "MIA", "PBI", "ORD", "SFO", "RSW"...
## $ air_time <dbl> 227, 227, 160, 147, 139, 366, 175, 153, 52, 151...
## $ distance <dbl> 1400, 1416, 1089, 1023, 719, 2586, 1074, 1096, ...
## $ hour <dbl> 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7,...
## $ minute <dbl> 15, 29, 40, 0, 0, 0, 10, 10, 8, 36, 45, 45, 0, ...
## $ time_hour <time> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...
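To see the readability gap grow, here is a sketch with a compound condition (the cutoffs are arbitrary, chosen just for illustration):
# dplyr: summer flights that left late but made up time in the air
filter(flights, month %in% 6:8, dep_delay > 0, arr_delay < dep_delay)
# base equivalent: flights$ repeats for every column, and `[` keeps rows where
# the condition evaluates to NA, so a strict match also needs is.na() bookkeeping
flights[flights$month %in% 6:8 & flights$dep_delay > 0 &
          flights$arr_delay < flights$dep_delay, ]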
select cuts out just the variables that you want to work with at the moment. The nycflights13 dataset is a good example: the full 19 variables can be a bit unwieldy if you only care about a few of them for the analysis at hand.
flights2 <- select(flights, dep_delay, arr_delay, carrier, air_time, time_hour)
glimpse(flights2)
## Observations: 336,776
## Variables: 5
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2,...
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7,...
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6"...
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149...
## $ time_hour <time> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-0...
glimpse(select(flights, dep_delay:air_time)) # subset by a range of named columns
## Observations: 336,776
## Variables: 10
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
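select also understands helper functions and negative selection, which scale better than typing every column name; two standard dplyr patterns:
glimpse(select(flights, starts_with("dep"))) # dep_time and dep_delay
glimpse(select(flights, -(year:day)))        # drop the date columns, keep the rest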
Create new columns using mutate based on transformations or combinations of existing columns.
flights2 <- mutate(flights,
                   gain = arr_delay - dep_delay,
                   speed = distance / air_time * 60)
glimpse(flights2)
## Observations: 336,776
## Variables: 21
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ time_hour <time> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...
## $ gain <dbl> 9, 16, 31, -17, -19, 16, 24, -11, -5, 10, 0, -1...
## $ speed <dbl> 370.0441, 374.2731, 408.3750, 516.7213, 394.137...
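Relatedly, if you only want to keep the newly created columns, transmute() works like mutate() but drops the originals:
glimpse(transmute(flights,
                  gain = arr_delay - dep_delay,
                  speed = distance / air_time * 60))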
Sort dataframes according to different variables with arrange(). It defaults to ascending order; use desc() to reverse to descending order.
flights2 <- arrange(flights, dep_delay)
glimpse(flights2)
## Observations: 336,776
## Variables: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month <int> 12, 2, 11, 1, 1, 8, 10, 3, 3, 5, 5, 9, 10, 11, ...
## $ day <int> 7, 3, 10, 11, 29, 9, 23, 30, 2, 5, 14, 18, 29, ...
## $ dep_time <int> 2040, 2022, 1408, 1900, 1703, 729, 1907, 2030, ...
## $ sched_dep_time <int> 2123, 2055, 1440, 1930, 1730, 755, 1932, 2055, ...
## $ dep_delay <dbl> -43, -33, -32, -30, -27, -26, -25, -25, -24, -2...
## $ arr_time <int> 40, 2240, 1549, 2233, 1947, 1002, 2143, 2213, 1...
## $ sched_arr_time <int> 2352, 2338, 1559, 2243, 1957, 955, 2143, 2250, ...
## $ arr_delay <dbl> 48, -58, -10, -10, -10, 7, 0, -37, -30, -44, -2...
## $ carrier <chr> "B6", "DL", "EV", "DL", "F9", "MQ", "EV", "MQ",...
## $ flight <int> 97, 1715, 5713, 1435, 837, 3478, 4361, 4573, 33...
## $ tailnum <chr> "N592JB", "N612DL", "N825AS", "N934DL", "N208FR...
## $ origin <chr> "JFK", "LGA", "LGA", "LGA", "LGA", "LGA", "EWR"...
## $ dest <chr> "DEN", "MSY", "IAD", "TPA", "DEN", "DTW", "TYS"...
## $ air_time <dbl> 265, 162, 52, 139, 250, 88, 111, 87, 55, 150, 1...
## $ distance <dbl> 1626, 1183, 229, 1010, 1620, 502, 631, 502, 301...
## $ hour <dbl> 21, 20, 14, 19, 17, 7, 19, 20, 14, 9, 9, 16, 21...
## $ minute <dbl> 23, 55, 40, 30, 30, 55, 32, 55, 55, 58, 38, 55,...
## $ time_hour <time> 2013-12-07 21:00:00, 2013-02-03 20:00:00, 2013...
flights2 <- arrange(flights, desc(dep_delay))
glimpse(flights2)
## Observations: 336,776
## Variables: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month <int> 1, 6, 1, 9, 7, 4, 3, 6, 7, 12, 5, 1, 2, 5, 12, ...
## $ day <int> 9, 15, 10, 20, 22, 10, 17, 27, 22, 5, 3, 1, 10,...
## $ dep_time <int> 641, 1432, 1121, 1139, 845, 1100, 2321, 959, 22...
## $ sched_dep_time <int> 900, 1935, 1635, 1845, 1600, 1900, 810, 1900, 7...
## $ dep_delay <dbl> 1301, 1137, 1126, 1014, 1005, 960, 911, 899, 89...
## $ arr_time <int> 1242, 1607, 1239, 1457, 1044, 1342, 135, 1236, ...
## $ sched_arr_time <int> 1530, 2120, 1810, 2210, 1815, 2211, 1020, 2226,...
## $ arr_delay <dbl> 1272, 1127, 1109, 1007, 989, 931, 915, 850, 895...
## $ carrier <chr> "HA", "MQ", "MQ", "AA", "MQ", "DL", "DL", "DL",...
## $ flight <int> 51, 3535, 3695, 177, 3075, 2391, 2119, 2007, 20...
## $ tailnum <chr> "N384HA", "N504MQ", "N517MQ", "N338AA", "N665MQ...
## $ origin <chr> "JFK", "JFK", "EWR", "JFK", "JFK", "JFK", "LGA"...
## $ dest <chr> "HNL", "CMH", "ORD", "SFO", "CVG", "TPA", "MSP"...
## $ air_time <dbl> 640, 74, 111, 354, 96, 139, 167, 313, 109, 149,...
## $ distance <dbl> 4983, 483, 719, 2586, 589, 1005, 1020, 2454, 76...
## $ hour <dbl> 9, 19, 16, 18, 16, 19, 8, 19, 7, 17, 20, 18, 8,...
## $ minute <dbl> 0, 35, 35, 45, 0, 0, 10, 0, 59, 0, 55, 35, 30, ...
## $ time_hour <time> 2013-01-09 09:00:00, 2013-06-15 19:00:00, 2013...
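arrange also accepts multiple variables, with later ones breaking ties in earlier ones:
# sort by carrier, and within each carrier put the worst departure delays first
glimpse(arrange(flights, carrier, desc(dep_delay)))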
summarize collapses the data to a single row. It is especially useful when you want to summarise subsets of the data defined by a grouping variable with group_by().
summarize(flights, avg_depdelay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 x 1
## avg_depdelay
## <dbl>
## 1 12.63907
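The same summary gets much more interesting computed per group, which previews the group_by() section below:
summarize(group_by(flights, carrier), avg_depdelay = mean(dep_delay, na.rm = TRUE))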
The pipe operator %>% lets you compactly perform operations in left-to-right order. The unpiped version reads inside-out as nested function calls.
flights2 <- flights %>%
filter(dep_delay > 0) %>%
select(dep_delay, arr_delay, carrier) %>%
mutate(gain = arr_delay - dep_delay)
glimpse(flights2)
## Observations: 128,432
## Variables: 4
## $ dep_delay <dbl> 2, 4, 2, 1, 8, 11, 3, 13, 24, 8, 1, 1, 1, 2, 9, 2, 3...
## $ arr_delay <dbl> 11, 20, 33, -6, 32, 14, 4, 5, 12, -9, -6, -7, -31, 4...
## $ carrier <chr> "UA", "UA", "AA", "B6", "MQ", "UA", "B6", "AA", "EV"...
## $ gain <dbl> 9, 16, 31, -7, 24, 3, 1, -8, -12, -17, -7, -8, -32, ...
#equivalent but much harder to read code.
#Even breaking onto multiple lines doesn't really help
flights2 <- mutate(select(filter(flights, dep_delay > 0),
                          dep_delay, arr_delay, carrier),
                   gain = arr_delay - dep_delay)
glimpse(flights2)
## Observations: 128,432
## Variables: 4
## $ dep_delay <dbl> 2, 4, 2, 1, 8, 11, 3, 13, 24, 8, 1, 1, 1, 2, 9, 2, 3...
## $ arr_delay <dbl> 11, 20, 33, -6, 32, 14, 4, 5, 12, -9, -6, -7, -31, 4...
## $ carrier <chr> "UA", "UA", "AA", "B6", "MQ", "UA", "B6", "AA", "EV"...
## $ gain <dbl> 9, 16, 31, -7, 24, 3, 1, -8, -12, -17, -7, -8, -32, ...
flights %>%                                 # start with the 'flights' dataset
  filter(dep_delay > 0) %>%                 # then filter down to rows where departure was delayed
  ggplot(aes(x = carrier, y = dep_delay)) + # then plot using ggplot
  geom_boxplot() +                          # as a boxplot
  coord_flip()                              # flip the axes to better use the space
group_by defines a variable that cuts the original dataset into mini-datasets for further operations. This example is lifted verbatim from the dplyr vignette. Here, the grouping variable is the unique planes (tailnum), so all the flights a given airplane made are averaged together. Each point in the figure therefore represents one airplane’s data.
by_tailnum <- group_by(flights, tailnum)
delay <- summarise(by_tailnum,
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE))
delay <- filter(delay, count > 20, dist < 2000)
# Interestingly, the average delay is only slightly related to the
# average distance flown by a plane.
ggplot(delay, aes(dist, delay)) +
geom_point(aes(size = count), alpha = 1/2) +
geom_smooth() +
scale_size_area()
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
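For what it's worth, the same mini-dataset construction reads nicely as a single pipe (equivalent to the three statements above):
delay <- flights %>%
  group_by(tailnum) %>%
  summarise(count = n(),
            dist = mean(distance, na.rm = TRUE),
            delay = mean(arr_delay, na.rm = TRUE)) %>%
  filter(count > 20, dist < 2000)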
A good tool if you are interested in this type of spatial data is ggmap. We will use it now to plot ghost sightings and their relative density in Kentucky.
library(ggmap)
library(ghostr)
data(ghost_sightings)
ky_bb <- make_bbox(lon, lat, data = ghost_sightings, f = .1)
map <- get_stamenmap(ky_bb, zoom = 7, maptype = "toner-lite")
ggmap(map) +
  geom_point(aes(x = lon, y = lat, size = sightings), data = ghost_sightings) +
  xlab("Longitude") + ylab("Latitude")
# you can pull from multiple datasets by specifying data= in individual geoms
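For the “relative density” half of that claim, one option (a sketch; it contours the density of sighting locations and ignores the per-location sightings counts) is to add a 2D density layer on the same map:
ggmap(map) +
  geom_density2d(aes(x = lon, y = lat), data = ghost_sightings) +
  xlab("Longitude") + ylab("Latitude")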
This is a simple “bag of words” sentiment analysis: each word in a tweet is scored as positive or negative using a lexicon, and a tweet’s sentiment is the count of positive words minus the count of negative words.
# normalize the tweet timestamps (formatTwDate is a helper from earlier in the post)
tidy_tw$created <- formatTwDate(tidy_tw$created)
tw_df$created <- formatTwDate(tw_df$created)
library(tidyr)
bing <- sentiments %>%
filter(lexicon == "bing") %>%
select(-score)
conf_sent <- tidy_tw %>%
inner_join(bing) %>%
count(id, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative) %>%
inner_join(tw_df[,c(5,8)]) #join on id and created
library(cowplot)
library(scales)
library(lubridate)
#adjust time zone of tweets with lubridate
conf_sent$created <- ymd_hms(conf_sent$created, tz = 'EST')
# The plot could include labels, but I don't have time to figure out what is driving
# the inflection points of mood during these other conferences
df_labels <- data.frame(
  times = strptime(c("2016-07-13 12:00:00", "2016-07-15 0:00:00",
                     "2016-07-16 16:30:00", "2016-07-18 6:30:00"),
                   "%Y-%m-%d %H:%M:%S"),
  labels = c("it begins!\nmixers for all", "science cafe\nfunny-man",
             "final day\nmixer stuff", "that was pretty\ngood reflection"),
  y = c(1.5, 1.0, 1.0, 1.0))
ggplot(conf_sent, aes(created, sentiment)) +
  geom_smooth() + xlab("tweet time") + ylab("tweet sentiment") +
  scale_x_datetime(breaks = date_breaks("day")) +
  background_grid(major = "xy", minor = "none") +
  theme(axis.text.x = element_text(angle = 315, vjust = .6)) +
  #geom_text(data = df_labels, aes(x = times, y = y, label = labels), size = 4) +
  ggtitle(paste(hashtag, "positive or negative emotions (think first order ~vibe of conf.)"))
#coord_cartesian(ylim = c(-.5, 1.2)) #+ geom_text(data = df_labels, aes(x = times, y = y, label = labels), size = 4)
library(twitteR)
library(tidytext)
library(ggplot2)
library(dplyr)
library(igraph)
library(stringr)
# count tweets per user
hm <- sort(table(tw_df$screenName))
# data.frame() expands the table into name/frequency columns; keep the screen name and the count
outdf <- data.frame(screen_name = names(hm), num_tweets = hm)[, c(1, 3)]
outdf <- outdf[order(outdf[, 2], decreasing = TRUE), ]
#OK, start of new code to develop RT network
#code (regex especially!!!) used liberally from
# https://sites.google.com/site/miningtwitter/questions/user-tweets/who-retweet
#TODO:
#replace retweet network construction (not plotting)
#with https://github.com/nfahlgren/conference_twitter_stats/blob/master/retweet_network_generic.R
#it's cleaner and doesn't rely on regex horrors I don't understand
# grab the retweet-looking tweets: the text itself and the row indices
rt_net <- grep("(RT|via)((?:\\b\\W*@\\w+)+)", tw_df$text,
               ignore.case = TRUE, value = TRUE)
rt_neti <- grep("(RT|via)((?:\\b\\W*@\\w+)+)", tw_df$text,
                ignore.case = TRUE)
#next, create list to store user names
who_retweet <- as.list(1:length(rt_net))
who_post <- as.list(1:length(rt_net))
# loop over the retweets, extracting who retweeted whom
for (i in 1:length(rt_net))
{
  # get the tweet with a retweet entity
  twit <- tw_df[rt_neti[i], ]
  # get the retweet source(s)
  poster <- str_extract_all(twit$text, "(RT|via)((?:\\b\\W*@\\w+)+)")
  # remove ':'
  poster <- gsub(":", "", unlist(poster))
  # name(s) of the retweeted user(s)
  who_post[[i]] <- gsub("(RT @|via @)", "", poster, ignore.case = TRUE)
  # name of the retweeting user
  who_retweet[[i]] <- rep(twit$screenName, length(poster))
}
# unlist
who_post <- unlist(who_post)
who_retweet <- unlist(who_retweet)
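As an aside, the loop can be vectorized; here is a sketch that should build the same two vectors (the _v names are just to avoid clobbering the loop's output, and stringr's regex() keeps the matching case-insensitive like the grep calls above):
rt_pattern <- regex("(RT|via)((?:\\b\\W*@\\w+)+)", ignore_case = TRUE)
posters <- str_extract_all(tw_df$text[rt_neti], rt_pattern)
who_post_v <- gsub("(RT @|via @)", "",
                   gsub(":", "", unlist(posters)), ignore.case = TRUE)
# repeat each retweeter once per user they retweeted
who_retweet_v <- rep(tw_df$screenName[rt_neti], lengths(posters))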
####
# Preprocess the dataframes into an edge list, which is something igraph likes;
# I guess I need an edge aesthetic for ggraph to paint with
retweeter_poster <- data.frame(from = who_retweet, to = who_post, retweets = 1)
# filter out some bad parsing and users who aren't in the node graph
# node_df has the screen_name and number of tweets per user, which will serve
# as the vertex/node dataframe in igraph speak
node_df <- outdf
names(node_df) <- c("id", "num_tweets")
# This step is REALLY IMPORTANT for plotting purposes: it determines how dense the network is.
# Tune it based on how big you want the plotted network to be.
node_df2 <- droplevels(node_df[1:50, ]) # keep only the top 50 posters for plotting purposes
filt_rt_post <- retweeter_poster[retweeter_poster$from %in% node_df2$id &
                                   retweeter_poster$to %in% node_df2$id, ]
filt_rt_post <- droplevels(filt_rt_post) # ditch all those fleshbags that had to talk to people instead of tweeting
head(filt_rt_post)
## from to retweets
## 43 scopatz dotsdl 1
## 48 SciPyConf JackieKazil 1
## 51 SciPyConf ericmjl 1
## 56 chendaniely jhamrick 1
## 60 geo_leeman dopplershift 1
## 61 jnuneziglesias SciPyConf 1
# this creates a directed graph with vertex/node info on num_tweets and edge info on retweets
rt_graph <- graph_from_data_frame(d = droplevels(filt_rt_post), vertices = droplevels(node_df2), directed = TRUE)
# simplify the graph to remove any self-retweets (twitter is dumb and allows those now)
# and merge all the multiple RTs between a pair of users into a single summed edge
# to simplify the visualization
rt_graph <- simplify(rt_graph, remove.multiple = TRUE, remove.loops = TRUE, edge.attr.comb = 'sum')
library(ggraph)
library(ggplot2)
#jpeg('evol2016_top50_twitter_network.jpg',width=960,height=960,pointsize=12)
ggraph(rt_graph, 'igraph', algorithm = 'kk') +
  geom_edge_fan(aes(edge_alpha = retweets)) + # the edge attribute is 'retweets' after the simplify() sum
  geom_node_point(aes(size = num_tweets)) +
  geom_node_text(aes(label = name), vjust = -1.5) +
  ggforce::theme_no_axes() +
  theme(legend.position = c(.08, .88))
# cumulative degree distribution of the retweet network
deg.dist <- degree_distribution(rt_graph, cumulative = TRUE, mode = "all")
deg_df <- data.frame(deg = 0:max(degree(rt_graph)), cum_freq = 1 - deg.dist)
qplot(deg, cum_freq, data = deg_df, xlab = "Degree", ylab = "Cumulative Frequency")