class: center, middle, inverse, title-slide

# Data Visualization, activity

### Natalia da Silva
Instituto de Estadística-FCEA-UDELAR

---

## <span style="color:#88398A">Activity 1, `tips` </span>

```r
# load the data
library(tidyverse)
tips <- read_csv("http://www.ggobi.org/book/data/tips.csv")
```

---

## <span style="color:#88398A">Your turn! </span>

- Create a scatterplot with the aesthetic mappings `x` totbill, `y` tip, `color` smoker
- Change the axis titles: `x = "Total bill in dollars"`, `y = "Tip in dollars"`
- Change the legend title to "Smoker"
- Move the legend to the bottom of the plot using `theme`
- Change the color palette to Dark2 using `scale_color_brewer` (`library(RColorBrewer)`)

---

## <span style="color:#88398A">Your turn: solution </span>

```r
ggplot(data = tips, aes(x = totbill, y = tip, color = smoker)) +
  geom_point() +
  scale_color_brewer(palette = "Dark2") +
  labs(x = "Total bill in dollars", y = "Tip in dollars", color = "Smoker") +
  theme(aspect.ratio = 1, legend.position = "bottom")
```

<img src="activities_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

## <span style="color:#88398A">Your turn </span>

- Include a linear smooth using `geom_smooth`
- Change the axis text size to 20
- Change the axis title size to 15

---

## <span style="color:#88398A"> Your turn: solution </span>

```r
ggplot(data = tips, aes(x = totbill, y = tip)) +
  geom_point(aes(color = smoker)) +
  scale_color_brewer(palette = "Dark2") +
  labs(x = "Total bill in dollars", y = "Tip in dollars", color = "Smoker") +
  theme(aspect.ratio = 1,
        legend.position = "bottom",
        axis.text.x = element_text(size = 20),
        axis.text.y = element_text(size = 20),
        axis.title = element_text(size = 15)) +
  geom_smooth(method = "lm")
```

---

## <span style="color:#88398A"> Your turn: solution </span>

<img src="activities_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

## <span style="color:#88398A">Your turn </span>

- Create a 100% stacked bar graph for `day` by `sex`
- Change the color palette to Dark2
using `scale_fill_brewer`
- Change the labels of `day` using `scale_x_discrete` to Thursday, Friday, Saturday and Sunday.
- Include axis labels
- Order the x axis labels from Thursday to Sunday using `fct_relevel`

---

## <span style="color:#88398A">Your turn: solution </span>

```r
ggplot(data = tips, aes(x = fct_relevel(day, "Thu", "Fri", "Sat", "Sun"), fill = sex)) +
  geom_bar(position = "fill") +
  scale_x_discrete(labels = c("Thu" = "Thursday", "Fri" = "Friday",
                              "Sat" = "Saturday", "Sun" = "Sunday")) +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "Days", y = "Frequency")
```

---

## <span style="color:#88398A">Your turn: solution </span>

<img src="activities_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

## <span style = "color:#88398A">Activity 2, `ChickWeight` </span>

- Data set `ChickWeight`. You need to load the data in R using `data(ChickWeight)`.
- The `ChickWeight` data frame has 578 rows and 4 columns from an experiment on the effect of diet on early growth of chicks. Use `?ChickWeight` to get more information on each of the variables.

---

## <span style = "color:#88398A">Your Turn </span>

- Each chick should have twelve weight measurements.
- Use the `dplyr` package to identify how many chicks have a complete set of weight measurements and how many measurements there are in the incomplete cases.
- Extract a subset of the data for all chicks with complete information and name the data set `complete`. (Hint: you might want to use `mutate` to introduce a helper variable holding the number of observations per chick.)

---

## <span style = "color:#88398A">Your Turn, solution </span>

There are 45 chicks with 12 observations each.
```r
data(ChickWeight)
complete <- ChickWeight %>%
  group_by(Chick) %>%
  mutate(obschick = n()) %>%
  filter(obschick == 12)
complete %>% head()
```

```
## # A tibble: 6 x 5
## # Groups:   Chick [1]
##   weight  Time Chick Diet  obschick
##    <dbl> <dbl> <ord> <fct>    <int>
## 1     42     0 1     1           12
## 2     51     2 1     1           12
## 3     59     4 1     1           12
## 4     64     6 1     1           12
## 5     76     8 1     1           12
## 6     93    10 1     1           12
```

---

## <span style = "color:#88398A">Your Turn, solution </span>

There are 5 chicks with fewer than 12 observations. Using `summarise` and `filter` you can get the number of observations for the incomplete cases.

```r
ChickWeight %>%
  group_by(Chick) %>%
  summarise(obschick = n()) %>%
  filter(obschick != 12)
```

```
## # A tibble: 5 x 2
##   Chick obschick
##   <ord>    <int>
## 1 18           2
## 2 16           7
## 3 15           8
## 4 8           11
## 5 44          10
```

---

## <span style = "color:#88398A">Your Turn</span>

In the complete data set, introduce a new variable that measures the current weight difference compared to day 0. Name this variable `weightgain`.

---

## <span style = "color:#88398A"> Your Turn, solution </span>

```r
complete <- complete %>%
  group_by(Chick) %>%
  mutate(weightgain = weight - weight[Time == 0])
head(complete)
```

```
## # A tibble: 6 x 6
## # Groups:   Chick [1]
##   weight  Time Chick Diet  obschick weightgain
##    <dbl> <dbl> <ord> <fct>    <int>      <dbl>
## 1     42     0 1     1           12          0
## 2     51     2 1     1           12          9
## 3     59     4 1     1           12         17
## 4     64     6 1     1           12         22
## 5     76     8 1     1           12         34
## 6     93    10 1     1           12         51
```

---

## <span style = "color:#88398A">Your Turn </span>

Create side-by-side boxplots of `weightgain` by `Diet` for day 21. Change the order of the categories in the `Diet` variable such that the boxplots are ordered by median `weightgain` (use `fct_reorder`).
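
As a minimal sketch of how `fct_reorder` behaves, the toy example below reorders the levels of a factor by the median of a second variable. The data frame `toy` and its columns `group` and `value` are hypothetical and are not part of `ChickWeight`.

```r
library(forcats)

# Hypothetical toy data, for illustration only
toy <- data.frame(
  group = factor(c("a", "a", "b", "b", "c", "c")),
  value = c(10, 12, 3, 5, 7, 9)
)

# Reorder the levels of `group` by the median of `value` within each level
# (medians: a = 11, b = 4, c = 8)
reordered <- fct_reorder(toy$group, toy$value, .fun = median)
levels(reordered)
```

The levels come back in increasing order of the medians, which is the order a boxplot would use along the x axis.
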
---

## <span style = "color:#88398A">Your Turn, solution </span>

```r
complete %>%
  filter(Time == 21) %>%
  ggplot(aes(x = fct_reorder(Diet, weightgain, median), y = weightgain)) +
  geom_boxplot() +
  xlab("Diet")
```

<img src="activities_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

Diet 3 is the most successful, but Diet 4 has the least spread. The median `weightgain` is about 240 g for Diet 3 and around 125 g for Diet 1.

---

## <span style = "color:#88398A">Your Turn </span>

Create a plot with `Time` along the x axis and `weight` on the y axis. Facet by `Diet`. Use a point layer and also draw one line for each `Chick`. Color by `Diet`. Place the legend at the bottom (check `theme`).

---

## <span style = "color:#88398A">Your Turn, solution </span>

```r
complete %>%
  ggplot(aes(x = Time, y = weight, color = Diet)) +
  geom_point(size = 0.5) +
  geom_line(aes(group = Chick)) +
  facet_wrap(~Diet, ncol = 4) +
  theme(legend.position = "bottom", aspect.ratio = 1)
```

<img src="activities_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

Chicks on `Diet` 4 are more similar to each other in terms of weight than chicks in the other diet groups. The biggest chick is on diet 3 while the smallest one is on diet 2. In all the diets at least one chick lost weight at the end.

---

## <span style = "color:#88398A">Your Turn</span>

Select the `Chick` with the maximum weight at `Time` 21 for each of the diets. Redraw the previous plot with only these 4 chicks (and don't facet).
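
One possible way to pick the row with the largest value within each group is `slice_max` from `dplyr` (available from version 1.0.0). The sketch below uses a hypothetical toy data frame `toy` with made-up columns `diet`, `id` and `weight`; it is not the `complete` data set and is only one of several ways to approach the task.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- data.frame(
  diet   = c("A", "A", "B", "B"),
  id     = c(1, 2, 3, 4),
  weight = c(100, 150, 120, 90)
)

# Keep the heaviest row within each diet
heaviest <- toy %>%
  group_by(diet) %>%
  slice_max(weight, n = 1) %>%
  ungroup()

heaviest$id
```

The resulting identifiers can then be passed to `filter(id %in% heaviest$id)` to subset the full data before plotting.
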
---

## <span style = "color:#88398A">Your Turn, solution</span>

```r
# chick with the maximum weight at day 21 within each diet
ids <- complete %>%
  filter(Time == 21) %>%
  group_by(Diet) %>%
  mutate(maxw = max(weight)) %>%
  filter(weight == maxw) %>%
  select(Diet, Chick)

complete %>%
  filter(Chick %in% ids$Chick) %>%
  ggplot(aes(x = Time, y = weight, color = Diet)) +
  geom_point() +
  geom_line() +
  theme(aspect.ratio = 1)
```

---

## <span style = "color:#88398A">Activity 3, Flu </span>

- Use the Google Flu Trends data set. Each week begins on the Sunday (Pacific Time) indicated for the row. Data for the current week will be updated each day until Saturday (Pacific Time).
- The data are available at http://www.google.org/flutrends. There is information for different countries; for this exercise we will use the flu data for the US.

---

## <span style = "color:#88398A"> Flu</span>

```r
flu.us <- read_delim("https://www.google.org/flutrends/about/data/flu/us/data.txt",
                     delim = ",", col_names = TRUE, skip = 11)
```

These data contain weekly flu information for each US state. Each row of the dataset `flu.us` holds the number of flu cases in one week.

---

## <span style = "color:#88398A"> Your turn</span>

Introduce a new object called `flu.states`, with the following changes:

- select only the state level information, `flu.us[, 1:53]`
- remove the column `United States`
- reshape the dataset such that you have one column for Date, one column with State names and one column named Value with the flu cases.
(hint: you can use `gather` to reshape the data)

---

## <span style = "color:#88398A"> Your turn, solution</span>

```r
library(lubridate)
flu.states <- flu.us[, 1:53] %>%
  dplyr::select(-matches("United States")) %>%
  gather(State, Value, -Date)
head(flu.states)
```

```
## # A tibble: 6 x 3
##   Date       State   Value
##   <date>     <chr>   <int>
## 1 2003-09-28 Alabama   477
## 2 2003-10-05 Alabama   501
## 3 2003-10-12 Alabama   492
## 4 2003-10-19 Alabama   533
## 5 2003-10-26 Alabama   594
## 6 2003-11-02 Alabama   715
```

---

## <span style = "color:#88398A"> Flu</span>

- Introduce a variable `Year.month` using the `Date` variable such that `Year.month` rounds `Date` down to the nearest boundary of the specified time unit. `Year.month` should also be a date-time object.
- Draw a time series of monthly flu cases for Iowa.
- Define the x labels with year and month information (check `scale_x_date`) and rotate the x axis text by 90 degrees. Define the x label, y label and title in an informative way.

---

## <span style = "color:#88398A">Your turn, solution</span>

```r
flu.states <- flu.states %>%
  mutate(Year.month = round_date(Date, unit = "month"))
head(flu.states)
```

```
## # A tibble: 6 x 4
##   Date       State   Value Year.month
##   <date>     <chr>   <int> <date>
## 1 2003-09-28 Alabama   477 2003-10-01
## 2 2003-10-05 Alabama   501 2003-10-01
## 3 2003-10-12 Alabama   492 2003-10-01
## 4 2003-10-19 Alabama   533 2003-11-01
## 5 2003-10-26 Alabama   594 2003-11-01
## 6 2003-11-02 Alabama   715 2003-11-01
```

Note that `round_date` rounds to the nearest month boundary, which is why 2003-10-19 maps to 2003-11-01 above; `floor_date(Date, unit = "month")` would round down instead.

---

## <span style = "color:#88398A">Your turn</span>

- Find the number of flu cases in each month for each state for all months throughout the time frame.
- For that, introduce a variable `Year.month` derived from the `Date` variable such that `Year.month` rounds `Date` down to the nearest boundary of the specified time unit. `Year.month` should also be a date-time object.
- Create a time series plot of monthly flu cases in Iowa on the y axis and `Year.month` along the x axis.
Define the x labels with year and month information (check `scale_x_date`, using 12-week breaks). To keep the x axis labels readable, rotate the text by 90 degrees. Define the `xlab`, `ylab` and the `title` in an informative way for this problem.

---

## <span style = "color:#88398A">Your turn, solution </span>

```r
monthlies <- flu.states %>%
  group_by(State, Year.month) %>%
  summarize(cases = sum(Value))
head(monthlies)
```

```
## # A tibble: 6 x 3
## # Groups:   State [1]
##   State   Year.month cases
##   <chr>   <date>     <int>
## 1 Alabama 2003-10-01  1470
## 2 Alabama 2003-11-01  2682
## 3 Alabama 2003-12-01 28614
## 4 Alabama 2004-01-01 26486
## 5 Alabama 2004-02-01  8351
## 6 Alabama 2004-03-01  3659
```

---

## <span style = "color:#88398A">Your turn, solution</span>

```r
monthlies %>%
  filter(State == "Iowa") %>%
  ggplot(aes(x = Year.month, y = cases)) +
  scale_x_date(date_labels = "%Y %m", date_breaks = "12 week") +
  geom_line() +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(x = "Date", y = "Estimated number of Flu Cases",
       title = "Estimated number of Flu Cases in Iowa between 2003 and 2015")
```

<img src="activities_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />

---

## <span style = "color:#88398A">Your turn</span>

- Create a seasonal plot of monthly flu cases in Iowa by mapping the number of monthly flu cases to the y axis and using `Month` on the x axis.
- Map year to colour. Connect data from the same year by lines. Label these lines (you can use `geom_text`) with their year on the left (before the January time point) and on the right (after the December time point). Define the x label, y label and the title in an informative way and place the legend at the bottom.
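
A minimal sketch of the date helpers this task relies on, assuming a hypothetical vector `dates` rather than the flu data: `year()` and `month()` from `lubridate` extract the components, and `%in%` selects the January and December points that would carry the labels.

```r
library(lubridate)

# Hypothetical dates, for illustration only
dates <- as.Date(c("2013-01-01", "2013-06-01", "2013-12-01", "2014-01-01"))

yr  <- year(dates)   # calendar year of each date
mon <- month(dates)  # month number, 1 to 12

# Keep only the end points of each yearly line (January and December).
# %in% tests membership; == against a length-2 vector would instead
# recycle it and compare element by element.
ends <- dates[mon %in% c(1, 12)]
ends
```

With `month(dates, label = TRUE)` the months come back as an ordered factor of abbreviated names, which keeps them in calendar order on a discrete x axis.
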
---

## <span style = "color:#88398A">Your turn, Solution </span>

```r
dat_month <- monthlies %>%
  mutate(Month = month(Year.month, label = TRUE),
         Year = year(Year.month)) %>%
  filter(Year %in% 2010:2014, State == "Iowa")

dat_month %>%
  ggplot(aes(x = Month, y = cases, color = as.factor(Year))) +
  geom_line(aes(group = Year)) +
  geom_point() +
  labs(x = "Months", y = "Estimated number of Flu Cases",
       title = "Seasonal Plot of Estimated number of Flu Cases in Iowa between 2010-2014",
       color = "Year") +
  theme(legend.position = "bottom") +
  # %in% keeps every January and December row; == would recycle the vector
  geom_text(aes(x = Month, y = cases, label = Year),
            data = filter(dat_month, Month %in% c("Jan", "Dec")))
```

---

## <span style = "color:#88398A">Your turn</span>

Using a polygon layer with `geom_polygon`, plot a choropleth map of the total number of flu cases for all US states in 2014. (hint: you need to work on the State names to be able to merge the data for this plot; use the `gsub` function for that. Be sure to have 49 states in the result.)

---

## <span style = "color:#88398A">Your turn, solution</span>

```r
library(ggmap)
library(ggthemes)

states <- map_data("state")

# make the state names match the lower-case `region` names in the map data
flu.states$State <- gsub("\\.", " ", flu.states$State)
flu.total <- flu.states %>%
  mutate(State = tolower(State), Year = year(Date)) %>%
  group_by(State, Year) %>%
  summarise(Total = sum(Value))

dat_merge <- merge(flu.total, states, by.x = "State", by.y = "region")

dat_merge %>%
  filter(Year == 2014) %>%
  arrange(group, order) %>%  # merge() may reorder rows; restore the polygon drawing order
  ggplot(aes(x = long, y = lat)) +
  geom_polygon(aes(fill = Total, group = group)) +
  ggthemes::theme_map()
```

---

## <span style = "color:#88398A">Your turn, solution</span>

<img src="activities_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" />

---
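
## <span style = "color:#88398A">Checking the state names</span>

The hint about state names can be checked before merging. The sketch below uses two hypothetical character vectors, `flu_names` and `map_regions`, standing in for the state names in the flu data and the `region` names in the map data; it only illustrates the `gsub` and `tolower` clean-up and a `setdiff` check for names that would be dropped by the merge.

```r
# Hypothetical names, for illustration only
flu_names   <- c("New.York", "Iowa", "District.of.Columbia")
map_regions <- c("new york", "iowa", "district of columbia", "ohio")

# Replace dots with spaces and convert to lower case
clean_names <- tolower(gsub("\\.", " ", flu_names))

# Regions in the map data that have no match in the flu data
unmatched <- setdiff(map_regions, clean_names)
unmatched
```

An empty result from `setdiff` would indicate that every map region has a matching state name.
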