Data Visualization, activity

class: center, middle, inverse, title-slide

# Data Visualization, activity
### Natalia da Silva Instituto de Estadística-FCEA-UDELAR

---

## Activity 1, `tips`

```r
# load the data
library(tidyverse)
tips <- read_csv("http://www.ggobi.org/book/data/tips.csv")
```
---

## Your turn!

- Create a scatterplot: aesthetic components `x` totbill, `y` tip, `color` smoker

- Change the axis titles: `x = "Total bill in dollars"`, `y = "Tip in dollars"`

- Change the legend title to "Smoker"

- Move the legend to the bottom of the plot using `theme`

- Change the color palette to Dark2 using `scale_color_brewer` (palettes from `library(RColorBrewer)`)

---

## Your turn: solution

```r
ggplot(data = tips, aes(x = totbill, y = tip, color = smoker)) + 
  geom_point( )+ scale_color_brewer(palette = "Dark2") +
  labs(x = "Total bill in dollars", y = "Tips in dollars", 
       color =  "Smoker") +
  theme(aspect.ratio = 1, legend.position = "bottom") 
```

<img src="activities_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />
---

## Your turn

- Include a linear smooth using `geom_smooth`

- Change the axis text size to 20

- Change the axis title size to 15

---

## Your turn: solution

```r
ggplot(data = tips, aes(x = totbill, y = tip) ) +
  geom_point( aes(color = smoker)) +
  scale_color_brewer(palette = "Dark2") +
  labs(x = "Total bill in dollars", y = "Tips in dollars", 
       color = "Smoker") +
  theme(aspect.ratio = 1, legend.position = "bottom", 
        axis.text.x = element_text(size = 20),
        axis.text.y = element_text(size = 20),
        axis.title = element_text(size = 15))  + 
  geom_smooth(method = "lm")
```
---

## Your turn: solution

<img src="activities_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />
---

## Your turn 
- Create a 100% stacked bar graph for `day` by `sex`
 
- Change the color palette to Dark2 using `scale_fill_brewer`

- Change the labels of `day` using `scale_x_discrete` to Thursday, Friday, Saturday and Sunday.

- Add axis labels

- Order the x axis labels from Thursday to Sunday, use `fct_relevel`

---

## Your turn: solution

```r
ggplot(data = tips,
       aes(x = fct_relevel(day, "Thu", "Fri", "Sat", "Sun"), fill = sex)) +
  geom_bar(position = "fill") +
  scale_x_discrete(labels = c("Thu" = "Thursday", "Fri" = "Friday",
                              "Sat" = "Saturday", "Sun" = "Sunday")) +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "Days", y = "Frequency")
```
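
How `fct_relevel` controls the bar order can be checked on a toy factor. This is a minimal sketch, assuming the forcats package (part of the tidyverse) is installed; the vector `day` below is a hypothetical stand-in for `tips$day`.

```r
library(forcats)

# Hypothetical stand-in for tips$day; factor() sorts levels alphabetically
day <- factor(c("Sun", "Thu", "Fri", "Sat", "Sun"))
levels(day)

# Move the levels into weekday order, as in the solution above
day_ordered <- fct_relevel(day, "Thu", "Fri", "Sat", "Sun")
levels(day_ordered)
```

ggplot2 draws discrete axes in level order, so releveling inside `aes()` is enough to reorder the bars.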
---

## Your turn: solution

<img src="activities_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />
---

## Activity 2, `ChickWeight` 
- Data set `ChickWeight`. You need to load the data in R using `data(ChickWeight)`.

- The `ChickWeight` data frame has 578 rows and 4 columns from an experiment on the effect of diet on early growth of chicks. Use `?ChickWeight` to get more information on every one of the variables.
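
A quick way to confirm the dimensions quoted above and to see the variable types is sketched here, using only base R:

```r
# ChickWeight ships with R in the datasets package
data(ChickWeight)

dim(ChickWeight)                   # rows and columns
str(ChickWeight)                   # type of each variable
length(unique(ChickWeight$Chick))  # number of distinct chicks
```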

---

## Your Turn

- Each chick should have twelve weight measurements.

- Use the `dplyr` package to identify how many chicks have a complete set of weight measurements and how many measurements there are in the incomplete cases.

- Extract a subset of the data for all chicks with complete information and name the data set `complete`. (Hint: you might want to use `mutate` to introduce a helper variable consisting of the number of observations)

---
## Your Turn, solution

There are 45 chicks with 12 observations each.

```r
data(ChickWeight)
complete <- ChickWeight %>%
  group_by(Chick) %>%
  mutate(obschick = n()) %>%   # number of measurements per chick
  filter(obschick == 12)       # keep chicks with all 12 measurements

complete %>%
  head()
```

```
## # A tibble: 6 x 5
## # Groups: Chick [1]
## weight Time Chick Diet obschick
## <dbl> <dbl> <ord> <fct> <int>
## 1 42 0 1 1 12
## 2 51 2 1 1 12
## 3 59 4 1 1 12
## 4 64 6 1 1 12
## 5 76 8 1 1 12
## 6 93 10 1 1 12
```
---

## Your Turn, solution

There are 5 chicks with fewer than 12 observations. Using `summarise` and `filter`, you can get the number of observations for the incomplete cases.

```r
ChickWeight %>%
  group_by(Chick) %>%
  summarise(obschick = n() ) %>%
  filter(obschick != 12)
```

```
## # A tibble: 5 x 2
## Chick obschick
## <ord> <int>
## 1 18 2
## 2 16 7
## 3 15 8
## 4 8 11
## 5 44 10
```
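
Both counts (complete and incomplete chicks) can also be read from a single table. The following is one possible sketch, assuming dplyr is installed; the names `tally`, `obschick` and `chicks` are chosen here for illustration.

```r
library(dplyr)

data(ChickWeight)

# Count measurements per chick, then count chicks per number of measurements
tally <- as_tibble(ChickWeight) %>%
  count(Chick, name = "obschick") %>%
  count(obschick, name = "chicks")
tally
```

The row with `obschick == 12` gives the number of complete chicks; the remaining rows describe the incomplete cases.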
---

## Your Turn

In the complete data set introduce a new variable that measures the current weight difference compared to day 0. Name this variable `weightgain`.   
---

## Your Turn, solution

```r
complete <- complete %>% 
 group_by(Chick) %>% 
 mutate(weightgain = weight - weight[Time==0]) 
head(complete)
```

```
## # A tibble: 6 x 6
## # Groups: Chick [1]
## weight Time Chick Diet obschick weightgain
## <dbl> <dbl> <ord> <fct> <int> <dbl>
## 1 42 0 1 1 12 0
## 2 51 2 1 1 12 9
## 3 59 4 1 1 12 17
## 4 64 6 1 1 12 22
## 5 76 8 1 1 12 34
## 6 93 10 1 1 12 51
```
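
`weight[Time == 0]` works when each chick has exactly one row with `Time == 0`. A variant that does not rely on that exact row takes the earliest measurement with `dplyr::first()`. This is a sketch under that assumption, not the course solution; the object name `gain` is made up here.

```r
library(dplyr)

data(ChickWeight)

gain <- as_tibble(ChickWeight) %>%
  group_by(Chick) %>%
  # Baseline = weight at the earliest recorded time for this chick
  mutate(weightgain = weight - first(weight, order_by = Time)) %>%
  ungroup()

# Chick 1 starts at 42 g, so its first gains are 0, 9 and 17
gain %>% filter(Chick == "1") %>% head(3)
```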

---

## Your Turn

Create  side-by-side boxplots of `weightgain` by `Diet` for day 21.  
Change the order of the categories in the `Diet` variable such that the boxplots are ordered by median `weightgain` (use `fct_reorder`).
---

## Your Turn, solution

```r
complete %>% filter(Time == 21) %>%
  ggplot(aes(x = fct_reorder(Diet, weightgain, median), y = weightgain) ) + 
  geom_boxplot() + xlab("Diet")
```

Diet 3 gives the highest median weight gain at day 21, while Diet 4 shows the least spread.
The median `weightgain` for Diet 3 is about 240 g, and for Diet 1 about 125 g.
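
Values read off a boxplot are approximate. One way to check them numerically is sketched below, assuming dplyr is installed; `medians`, `median_gain` and `iqr_gain` are illustrative names.

```r
library(dplyr)

data(ChickWeight)

medians <- as_tibble(ChickWeight) %>%
  group_by(Chick) %>%
  filter(n() == 12) %>%                               # complete chicks only
  mutate(weightgain = weight - weight[Time == 0]) %>%
  ungroup() %>%
  filter(Time == 21) %>%
  group_by(Diet) %>%
  summarise(median_gain = median(weightgain),
            iqr_gain = IQR(weightgain)) %>%           # IQR as a measure of spread
  arrange(median_gain)
medians
```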

---

## Your Turn 
Create a plot with `Time` on the x axis and `weight` on the y axis. Facet by `Diet`. Use a point layer and also draw one line for each `Chick`. Color by `Diet`. Place the legend at the bottom (check `theme`).
---

## Your Turn, solution

```r
complete %>% 
    ggplot( aes(x = Time, y = weight, color = Diet)) + 
      geom_point(size = I(1/2)) + geom_line(aes(group = Chick)) + 
      facet_wrap( ~Diet, ncol = 4) + theme(legend.position = "bottom", aspect.ratio = 1)
```

<img src="activities_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

Chicks on Diet 4 are more similar to each other in weight than chicks in the other diet groups.
The heaviest chick is on Diet 3, while the lightest is on Diet 2.
In every diet, at least one chick lost weight toward the end.

---
## Your Turn

Select the  `Chick` with the maximum weight at `Time` 21 for each of the diets. Redraw the previous plot with only these 4 chicks (and don't facet). 
---

## Your Turn, solution

```r
ids <- complete %>%
 filter(Time == 21)%>% 
 group_by(Diet) %>%
 mutate(maxw = max(weight)) %>%
 filter(weight == maxw) %>% select(Diet,Chick)

complete %>% filter(Chick %in% ids$Chick) %>% 
    ggplot( aes(x = Time, y = weight ,color = Diet)) +
    geom_point() + geom_line() + theme(aspect.ratio = 1) 
```
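
An alternative to the `mutate` plus `filter` pattern is `slice_max()`, available in dplyr 1.0.0 and later. The sketch below assumes that version and uses all chicks measured at day 21; `heaviest` is an illustrative name.

```r
library(dplyr)

data(ChickWeight)

heaviest <- as_tibble(ChickWeight) %>%
  filter(Time == 21) %>%
  group_by(Diet) %>%
  # with_ties = FALSE keeps exactly one chick per diet even if weights tie
  slice_max(weight, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(Diet, Chick, weight)
heaviest
```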
  
---

## Your Turn, solution

![](activities_files/figure-html/unnamed-chunk-14-1.png)
 
---

## Activity 3, Flu

- Use the Google Flu Trends data set. Each week begins on the Sunday (Pacific Time) indicated for the row. Data for the current week will be updated each day until Saturday (Pacific Time).

- The data are available at http://www.google.org/flutrends. They cover several countries; for this exercise we will use the flu data for the US.
---

## Flu

```r
flu.us <-read_delim("https://www.google.org/flutrends/about/data/flu/us/data.txt", delim = ",", col_names = TRUE, skip = 11)
```
These data contain weekly flu estimates for each US state.
Each row of `flu.us` holds the estimated number of flu cases for one week.

---
## Your turn

Introduce a new object called `flu.states`, with the following changes:

- select only the state-level information, `flu.us[, 1:53]`
  - remove the `United States` column,
  - reshape the data set so that you have one column for `Date`, one column with state names (`State`) and one column named `Value` with the flu cases (hint: you can use `gather` to reshape the data)

---

## Your turn, solution

```r
library(lubridate)

flu.states <- flu.us[, 1:53] %>% 
 dplyr::select(-matches("United States"))%>%
 gather(State, Value, -Date) 
head(flu.states)
```

```
## # A tibble: 6 x 3
## Date State Value
## <date> <chr> <int>
## 1 2003-09-28 Alabama 477
## 2 2003-10-05 Alabama 501
## 3 2003-10-12 Alabama 492
## 4 2003-10-19 Alabama 533
## 5 2003-10-26 Alabama 594
## 6 2003-11-02 Alabama 715
```
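
In tidyr 1.0.0 and later, `pivot_longer()` is the recommended replacement for `gather()`. The sketch below shows the equivalent call on a hypothetical two-state excerpt; the Alaska values are made up for illustration and the object names `wide` and `long` are not part of the exercise.

```r
library(tidyr)

# Hypothetical excerpt in the wide layout of flu.us
# (Alabama values taken from the output above, Alaska values invented)
wide <- data.frame(
  Date    = as.Date(c("2003-09-28", "2003-10-05")),
  Alabama = c(477, 501),
  Alaska  = c(100, 120)
)

# One row per Date and State, with the counts in Value
long <- pivot_longer(wide, cols = -Date,
                     names_to = "State", values_to = "Value")
long
```

Note that `pivot_longer()` returns the rows date by date, whereas `gather()` stacks them state by state; the content is the same.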
---

## Flu

- Introduce a variable `Year.month` derived from `Date` such that `Year.month` rounds `Date` to the nearest month boundary (see `round_date`). `Year.month` should also be a date object.

- Draw a time series of monthly flu cases for Iowa.

- Define the x labels with year and month information (check `scale_x_date`) and rotate the x axis text by 90 degrees. Give the plot an informative x label, y label and title.

---

## Your turn, solution

```r
flu.states <- flu.states %>% mutate(
 Year.month = round_date(Date, unit = "month")
)
head(flu.states)
```

```
## # A tibble: 6 x 4
## Date State Value Year.month
## <date> <chr> <int> <date> 
## 1 2003-09-28 Alabama 477 2003-10-01
## 2 2003-10-05 Alabama 501 2003-10-01
## 3 2003-10-12 Alabama 492 2003-10-01
## 4 2003-10-19 Alabama 533 2003-11-01
## 5 2003-10-26 Alabama 594 2003-11-01
## 6 2003-11-02 Alabama 715 2003-11-01
```
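
In the output above, 2003-09-28 is assigned to 2003-10-01 because `round_date` picks the nearest month boundary. If weeks should stay in the calendar month in which they start, `floor_date` is the alternative. A small comparison, assuming lubridate is installed:

```r
library(lubridate)

d <- as.Date("2003-09-28")

round_date(d, unit = "month")  # nearest month boundary: 2003-10-01
floor_date(d, unit = "month")  # start of the same month: 2003-09-01
```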
---

## Your turn

- Find the number of flu cases in each month for each state for all months throughout the time frame.

- For that, introduce a variable `Year.month` derived from the `Date` variable such that `Year.month` rounds `Date` to the nearest month boundary. `Year.month` should also be a date object.

- Create a time series plot with monthly flu cases in Iowa on the y axis and `Year.month` on the x axis. Define the x labels with year and month information (check `scale_x_date`, using breaks every 12 weeks). Rotate the x axis text by 90 degrees so the labels are readable. Define the `xlab`, `ylab` and `title` in an informative way for this problem.
---

## Your turn, solution

```r
monthlies <- flu.states %>% group_by(State, Year.month) %>%
 summarize(
 cases = sum(Value)
 )
head(monthlies)
```

```
## # A tibble: 6 x 3
## # Groups: State [1]
## State Year.month cases
## <chr> <date> <int>
## 1 Alabama 2003-10-01 1470
## 2 Alabama 2003-11-01 2682
## 3 Alabama 2003-12-01 28614
## 4 Alabama 2004-01-01 26486
## 5 Alabama 2004-02-01 8351
## 6 Alabama 2004-03-01 3659
```
---

## Your turn, solution

```r
monthlies %>%
  filter(State == "Iowa") %>%
  ggplot(aes(x = Year.month, y = cases)) +
  geom_line() +
  scale_x_date(date_labels = "%Y %m", date_breaks = "12 week") +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(x = "Date", y = "Estimated number of Flu Cases",
       title = "Estimated number of Flu Cases in Iowa between 2003 and 2015")
```

<img src="activities_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />
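
The `date_labels` argument takes `strftime`-style format codes, so a label format can be previewed with base R's `format()` before plotting. A small sketch:

```r
d <- as.Date("2003-10-01")

format(d, "%Y %m")  # "2003 10", the label style used above
format(d, "%b %Y")  # abbreviated month name and year (locale dependent)
```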
---

## Your turn

- Create a  seasonal plot of monthly flu cases in Iowa by mapping the number of monthly flu cases to the y axis and using `Month` on the x axis.

- Map year to colour. Connect data from the same year by lines. 
Label these lines (you can use `geom_text`) with their year information on the left (before January time point) and right (after the December point).  Define the x label, y label and the title in an informative way and place the legend at the bottom.

---

## Your turn, Solution

```r
dat_month <- monthlies %>%
  mutate(Month = month(Year.month, label = TRUE),
         Year = year(Year.month)) %>%
  filter(Year %in% 2010:2014, State == "Iowa")

dat_month %>%
  ggplot(aes(x = Month, y = cases, color = as.factor(Year))) +
  geom_line(aes(group = Year)) + geom_point() +
  # %in% tests membership; == would recycle c("Jan", "Dec") row by row
  geom_text(aes(label = Year),
            data = filter(dat_month, Month %in% c("Jan", "Dec"))) +
  labs(x = "Months", y = "Estimated number of Flu Cases", color = "Year",
       title = "Seasonal Plot of Estimated number of Flu Cases in Iowa between 2010 and 2014") +
  theme(legend.position = "bottom")
```
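
When filtering for several values, the choice between `==` and `%in%` matters: `==` recycles the right-hand vector position by position, while `%in%` tests membership. The difference is illustrated below on a hypothetical vector of month labels (base R only).

```r
# Hypothetical vector of month labels
month_labels <- c("Dec", "Jan", "Feb", "Mar")

# `==` compares against Jan, Dec, Jan, Dec in turn, so nothing matches here
month_labels == c("Jan", "Dec")

# `%in%` asks, for each element, whether it is one of the targets
month_labels %in% c("Jan", "Dec")
```

With `==`, whether a row is kept depends on its position in the data, so a filter can appear to work on one data set and silently drop rows on another.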
---

## Your turn, Solution 
![](activities_files/figure-html/unnamed-chunk-21-1.png)
---
## Your turn

Using a polygon layer with `geom_polygon`, plot a choropleth map of the total number of flu cases for all US states in 2014.
(Hint: you need to clean up the state names to be able to merge the data for this plot; use the `gsub` function for that. Be sure to have 49 states in the result.)

---

## Your turn, solution

```r
library(ggmap)
library(ggthemes)

# map_data() comes from ggplot2 and needs the maps package installed
states <- map_data("state")

# Make state names match the lower-case `region` column of the map data
flu.states$State <- gsub("\\.", " ", flu.states$State)
flu.total <- flu.states %>%
  mutate(State = tolower(State), Year = year(Date)) %>%
  group_by(State, Year) %>%
  summarise(Total = sum(Value))

dat_merge <- merge(flu.total, states, by.x = "State", by.y = "region")

dat_merge %>%
  filter(Year == 2014) %>%
  arrange(order) %>%   # restore the polygon point order after merge()
  ggplot(aes(x = long, y = lat)) +
  geom_polygon(aes(fill = Total, group = group)) +
  ggthemes::theme_map()
```
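
Before merging, it can help to check that the cleaned state names will find a match. The sketch below uses hypothetical raw names and, as a stand-in for `map_data("state")$region`, the lower-cased built-in `state.name` vector; both are assumptions made for illustration.

```r
# Hypothetical column names as they might appear after import
raw_names <- c("New.York", "North Dakota", "Iowa")

# Replace dots with spaces and lower-case the result
clean_names <- tolower(gsub("\\.", " ", raw_names))
clean_names

# Any name listed here has no match and would be dropped by merge()
unmatched <- setdiff(clean_names, tolower(state.name))
unmatched
```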

---

## Your turn, solution

---