Data Visualization

class: center, middle, inverse, title-slide

# Data Visualization
### Natalia da Silva Instituto de Estadística-FCEA-UDELAR

---

## About me
.pull-left[ 
<img src = "portrait.png" width = 400 class = "center">

]

.pull-right[

- Assistant professor, Instituto de Estadística-Universidad de la República (IESTA-UDELAR), Montevideo, Uruguay.

- PhD and MSc in Statistics from Iowa State University, USA

- Interests: supervised learning methods, computational statistics, visualization and meta-analysis

- Co-founder of R-Ladies Ames,  R-Ladies Montevideo and GURU::MVD

- Contact info:  natalia@iesta.edu.uy, @pacocuak, http://natydasilva.com
]

---
## Link to the presentation

https://natydasilva.github.io/CODATA/#1

---

## About this workshop

- Why do we use visualization?

- EDA, types of variables and viz examples

- Ideas for an effective visualization

- Why use ggplot2?

- Grammar of graphics

- Examples

---

# Importance of data visualization

"The greatest value of a picture is when it forces us to notice what we never expected to see." <a name=cite-tukey77></a>[Tukey (1977)](#bib-tukey77)

"Graphs provide powerful tools both for analyzing scientific data and for communicating quantitative information" <a name=cite-cleveland></a>[Cleveland and McGill (1985)](#bib-cleveland)
---

## Few people escape these ideas

1. Numerical calculations are exact, but graphs are rough

2. For any particular kind of statistical data there is just one set of calculations constituting a correct statistical analysis

3. Performing intricate calculations is virtuous, whereas actually looking at the data is cheating <a name=cite-anscombe></a>[Anscombe (1973)](#bib-anscombe)

---

## Statistical Visualization

Visualization plays an important role in every stage of a statistical analysis.

- **Initial exploration:** find general and specific patterns in the data.

- **Models:** check the assumptions about the data before fitting a model.

- **Diagnostics:** visualize the model in the data space, or the data in the model space.
---

## EDA
 
Exploratory data analysis (EDA) is an iterative process for broadly exploring different aspects of the data.

Origins: John Tukey encouraged statisticians to explore the data and possibly formulate hypotheses that could lead to new data collection and experiments.

- It helps us understand our data

- We need to define questions to guide our research

- Each question focuses on something specific about the data, and this determines the type of data, the model and possible transformations

- Key point: define interesting questions for our problem

---

## What is the relationship between X and Y?
<table class="table table-bordered" style="font-size: 15px; width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
 <tr>
 <th style="text-align:right;"> X </th>
 <th style="text-align:right;"> Y </th>
 </tr>
 </thead>
<tbody>
 <tr>
 <td style="text-align:right;"> 1.972 </td>
 <td style="text-align:right;"> 1.236 </td>
 </tr>
 <tr>
 <td style="text-align:right;"> 1.112 </td>
 <td style="text-align:right;"> 1.994 </td>
 </tr>
 <tr>
 <td style="text-align:right;"> 0.000 </td>
 <td style="text-align:right;"> 1.009 </td>
 </tr>
 <tr>
 <td style="text-align:right;"> 0.665 </td>
 <td style="text-align:right;"> 1.942 </td>
 </tr>
 <tr>
 <td style="text-align:right;"> 0.235 </td>
 <td style="text-align:right;"> 0.356 </td>
 </tr>
 <tr>
 <td style="text-align:right;"> 0.247 </td>
 <td style="text-align:right;"> 1.658 </td>
 </tr>
 <tr>
 <td style="text-align:right;"> 1.275 </td>
 <td style="text-align:right;"> 1.961 </td>
 </tr>
 <tr>
 <td style="text-align:right;"> 0.702 </td>
 <td style="text-align:right;"> 0.045 </td>
 </tr>
 <tr>
 <td style="text-align:right;"> 1.760 </td>
 <td style="text-align:right;"> 0.350 </td>
 </tr>
 <tr>
 <td style="text-align:right;"> 1.691 </td>
 <td style="text-align:right;"> 0.277 </td>
 </tr>
 <tr>
 <td style="text-align:right;"> 1.628 </td>
 <td style="text-align:right;"> 1.778 </td>
 </tr>
 <tr>
 <td style="text-align:right;"> 1.957 </td>
 <td style="text-align:right;"> 1.290 </td>
 </tr>
</tbody>
</table>
---

## Why use visualization?

---
 
<img src="gelman.png" width="700" class="center">

- Statisticians recommend graphical displays but often do not follow this recommendation in presenting their own research.

- They analyze some papers in JASA and show examples of tables turned into plots

- They say there is a good reason to be lazy: it takes a lot of work to make a good visualization

- Nice graphs are possible, especially when we think hard about why we want to display these numbers in the first place

- If the world’s leading statistical journal doesn’t do it right, there is obviously still room for progress

<a name=cite-gelman2002let></a>[Gelman, Pasarica, and Dodhia (2002)](#bib-gelman2002let)

---

## What is the sample mean value for x and y?

![](Part_I_ggplot2_files/figure-html/unnamed-chunk-3-1.png)
--

`$\bar x =$` 54.27  and     `$\bar y =$` 47.83

---

## What is the sample mean value for x and y?

![](Part_I_ggplot2_files/figure-html/unnamed-chunk-5-1.png)
--

`$\bar x =$` 54.27 and     `$\bar y =$` 47.84

---

## What is the sample mean value for x and y?

![](Part_I_ggplot2_files/figure-html/unnamed-chunk-7-1.png)
--

`$\bar x =$` 54.26 and     `$\bar y =$` 47.83

---

## Why use visualization?
.pull-left[

https://github.com/stephlocke/datasauRus

- <a name=cite-matejka2017same></a>[Matejka and Fitzmaurice (2017)](#bib-matejka2017same)
]

.pull-right[

- Same statistical summaries

- Very different distributions
 
- [Link to the algorithm](https://www.autodeskresearch.com/publications/samestats).

]
---

## Why use visualization?
.pull-left[ 
- Graphs provide more information than numerical summaries
- Anscombe’s quartet [Anscombe (1973)](#bib-anscombe)
- `$n = 11$` 
- `$\bar x= 9.0$` 
- `$\bar y = 7.5$` 
- `$\hat \beta_1= 0.5$` 
- `$y = 3 + 0.5 x$` 
- `$R^2 = 0.667$` 
- `$....$`
]
.pull-right[
<img src = "Quartet.png" width = "400" class = "center">
]
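
The shared summaries can be checked directly in R. The sketch below assumes the `anscombe` data frame that ships with base R (columns `x1` to `x4` and `y1` to `y4`); the name `quartet_summary` is only illustrative.

```r
# Summary statistics and least-squares fit for each of the four data sets
quartet_summary <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)                       # simple linear regression y = b0 + b1 * x
  data.frame(
    set       = i,
    n         = length(x),
    mean_x    = mean(x),
    mean_y    = mean(y),
    intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]),
    r_squared = summary(fit)$r.squared
  )
}))

# One row per data set; the rows are nearly identical
print(round(quartet_summary, 2))
```

The numbers agree across the four sets, so only a plot of each `x`/`y` pair reveals how different they are.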

---

## Visualization is not new 
Charles Minard, Napoleon’s Russian Campaign in 1812

---

## Visualization is not new 
Florence Nightingale, coxcomb diagram of the causes of mortality in the Army in the East (1858)

---

## Data Visualization

- Data visualizations are based on data

- Summarize information

- Different types of visualization emphasize different aspects of the data

---

## Information, data

- Data contain cases and variables.

- In tabular form, the rows are usually the cases and the columns are the variables.

Variables can take different forms:

- Continuous, categorical, temporal, spatial

---

## Types of data
Key point for visualization: understand the types of variables you have

- **Continuous variable:** a variable that can take infinitely many values within a range, like “time” or “weight”.

- **Discrete variable:** a numerical variable that can only take certain distinct values, for example “number of students in the class”.

- **Categorical variable:** a variable whose values can be put into groups or categories, for example hair color or dog breed.

- and more...
---

## What are the most important types of data displays?

Impossible to answer, but some important types are:

- dotplots, histograms, density plots, time series, bar charts (all one-dimensional)

- scatterplots, side-by-side boxplots (two-dimensional)

- parallel coordinate plots, mosaic plots (higher dimensions)

- maps (geographic information)

- and more

---
## Why so many different types?

- The number and type of the variables determine which displays are possible

- Emphasis: every chart shows a different aspect of the data

---

## EDA rules?
 
There are no clear rules for EDA, but we can begin with some basic questions:

- What is the variability in my data?

- What is the distribution of each variable?

- Is there any relationship between some selected variables?

- How do these two variables covary?

---

## Distribution

To decide which tools to use in EDA, we should first identify the types of the variables to analyze

Example:
- Categorical variables: we can analyze the distribution using a bar chart

- Continuous variables: we can analyze the distribution using a histogram
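
A minimal sketch of both cases with `ggplot2`. It assumes the `mpg` data set that ships with the package; the data set, the variables `class` and `hwy`, and the bin width of 2 are illustrative choices, not taken from the slides.

```r
library(ggplot2)

# Categorical variable: one bar per category, bar height = number of cases
p_bar <- ggplot(mpg, aes(x = class)) +
  geom_bar()

# Continuous variable: adjacent bins; the bin width is a parameter we must choose
p_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2)

p_bar
p_hist
```

Trying a few values of `binwidth` is worthwhile, since the apparent shape of the distribution can change with it.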

---

## Barchart versus Histogram, starwars
.pull-left[ 
 
![](Part_I_ggplot2_files/figure-html/unnamed-chunk-9-1.png)

- Variable: categorical/qualitative

- Bars separated by equal spaces

- Bars can be re-ordered, and should be, to improve the viz

- Parameter: none

]

.pull-right[

![](Part_I_ggplot2_files/figure-html/unnamed-chunk-10-1.png)
- Variable: continuous/quantitative

- Bins adjacent

- Bins can't be re-ordered

- Parameter: bin width
]

---

## All the same...?
.pull-left[

- Histogram, density plot, jittered dot plot and boxplot

- Every plot emphasizes a different aspect of the data:
skewness, modality, symmetry, gaps, outliers

- There is no single right plot; all are useful

]

.pull-right[
![](Part_I_ggplot2_files/figure-html/unnamed-chunk-11-1.png)
]
---

## Covariation
 
- Covariation is the joint variation of two variables

- The best way to detect covariation is to visualize the relationship between the two variables

- The types of the variables determine how to visualize covariation

---
 
## Categorical vs Continuous

- It is common to explore the distribution of a continuous variable according to a categorical variable

- Histograms or densities colored by a categorical variable let us compare the distributions across categories

---

## Categorical vs Continuous, boxplot and violin

Boxplot:
- Shows the distribution of a continuous variable

- Based on the five-number summary (min, Q1, Q2, Q3, max)

- Outliers: observations below Q1 - 1.5 IQR or above Q3 + 1.5 IQR

Violin: a combination of a boxplot and a density plot
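
The outlier rule above can be sketched in base R. The example assumes the built-in `mtcars` data and its `hp` column, chosen only for illustration. Note that `quantile()` uses one of several quartile conventions, so the fences may differ slightly from those drawn by a particular boxplot function.

```r
# Horsepower from the built-in mtcars data (an illustrative choice)
x <- mtcars$hp

q   <- quantile(x, probs = c(0.25, 0.5, 0.75))  # Q1, median (Q2), Q3
iqr <- unname(q[3] - q[1])                      # interquartile range

lower_fence <- unname(q[1]) - 1.5 * iqr
upper_fence <- unname(q[3]) + 1.5 * iqr

# Observations outside the fences are drawn as individual points
outliers <- x[x < lower_fence | x > upper_fence]

c(min = min(x), q, max = max(x))  # five-number summary
outliers
```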

---

## Categorical vs Continuous, boxplot and violin

---

## Categorical vs categorical, bars

- Bar charts are useful to describe categorical variables: for each category, the bar length is proportional to the number of observations in that category

- If we want to represent more than one categorical variable we can use a stacked bar graph.

- Simple stacked bar graph: each segment is placed after the previous one, so the total length of the bar is the sum of all its segments. Ideal for comparing the totals across groups.

- 100% stacked bar graph: each segment shows the percentage of its group's total. This makes it easier to see the relative differences between quantities in each group.

---

## Categorical vs categorical, bars

![](Part_I_ggplot2_files/figure-html/unnamed-chunk-14-1.png)

---

## Static versus interactive
.pull-left[
Two key components of an interactive visualization:

- Interaction within each individual visualization (mouse-over, zoom, labels, etc.)

- Links between different graphics

Additionally, the ability to control dynamic rotations in higher dimensions.
]
.pull-right[
Interactive scatterplot (plotly html-widget) of `cty` vs `hwy`; hovering over a point shows its `cty` and `hwy` values.
]
---

## Some Examples
Percentage of students who drop out of the first year of high school in Uruguay, 2016
<img src="Part_I_ggplot2_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

---

## Some Examples
<img src = "pairs_hex.png" width = "500" height = "500" class = "center">

---

## Some Examples

---

## Effective Visualization

- Not all visualizations are equally effective

- There are different criteria to evaluate graphs (Cleveland, Tufte, Carr, Wainer, etc.)

- Let's look at a set of criteria to evaluate graphics

---

## Tufte's rules

Tufte developed some guidelines for constructing graphics, and we can use them as criteria for evaluating graphics

---

## Tufte's rules

1. Show the data

2. Induce the viewer to think about the data

3. Avoid distorting what the data have to say

4. Present many numbers in a small space

5. Make large data sets coherent

6. Reveal the data at several levels of detail

7. Serve a reasonably clear purpose

8. Be closely integrated with the statistical and verbal descriptions of the data
---

## 1. Show the data

.pull-left[ 
<img src = "showdata.png" width = 400 class = "center">

]

.pull-right[

What is the data in this picture?
- Data: addresses of deaths from cholera, locations of water pumps

- Supporting structure: street map

- Improvement: de-emphasize the supporting structure (e.g. by using a lighter shade of grey)
]
---

## De-emphasize grids

![](Part_I_ggplot2_files/figure-html/unnamed-chunk-17-1.png)![](Part_I_ggplot2_files/figure-html/unnamed-chunk-17-2.png)
Dan Carr: background + pale grid + dark data marks. This sets the plot off from the page and makes its grey level equivalent to a text block of the same size.

---

## 3. Avoid distorting what the data have to say

.pull-left[  
What's the data?

]

.pull-right[
- Year and Fuel economy standard
- Represented by a timeline and line segment
]

---

## Lie Factor (Tufte)
.pull-left[
 `$\frac{\text{Size of effect shown in graphic}}{ \text{Size of effect in data}}$`

Should be close to 1
]

.pull-right[
Fuel economy example:

- Data: `$\frac{27.5-18.0}{ 18.0}*100 = 53\%$`

- Graphic: `$\frac{5.3-0.6}{0.6}*100 = 783\%$`

- Lie factor: `$\frac{783}{53} = 14.8 \gg 1$` Huge! 
]
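
The arithmetic of the fuel economy example can be reproduced with a few lines of R. This is only a sketch: the helper `pct_change` is a hypothetical name, and the four input values are the ones quoted in the formulas above.

```r
# Relative change, in percent, between a first and a last value
pct_change <- function(first, last) (last - first) / first * 100

effect_data    <- pct_change(18.0, 27.5)  # change in the data
effect_graphic <- pct_change(0.6, 5.3)    # change in the size drawn in the graphic

# Lie factor: effect shown in the graphic divided by the effect in the data
lie_factor <- effect_graphic / effect_data

round(c(data = effect_data, graphic = effect_graphic, lie_factor = lie_factor), 1)
```

The lie factor is a plain ratio, not a percentage, so a value near 14.8 means the graphic exaggerates the change almost fifteenfold.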
---

## 4. Present many numbers in a small space

- Data-ink ratio (Tufte)

- Divide the total ink used to draw the data by the total ink used to draw the graphic.

- How do you calculate this? Not easily! Generally not possible in practice 
---

## 2. Induce the viewer to think about the data

How do you do this?

- 5 Make large data sets coherent
- 6 Reveal the data at several levels of detail
- 7 Serve a reasonably clear purpose
- 8 Be closely integrated with the statistical and verbal descriptions of the data
---

## Cleveland's principles of graphical construction

Cleveland’s principles of graphical construction mainly concern statistical plots of data for a scientific audience.

The two overarching principles are:
 
- Make the data stand out

- Avoid superfluity

---

## Cleveland's principles

- Clear vision/understanding

- Use of guides: scales, axes, tick marks, grid lines, legends

- Extensive captions

- Aspect ratio: scale of horizontal to vertical
---

## Effective Visualization, Cleveland

- Based on graphical perception studies

- The best visualizations are those that rely on "pre-attentive" vision (instant, without apparent effort) [Cleveland and McGill (1985)](#bib-cleveland).

---

## Cleveland, graphical perception

<img src = "cleveland.png" width = 600 class = "center">
---

## How to decode a graphic?

- When a person looks at a graph, the information is decoded by the person's visual system.

- A graphical method is successful only if the decoding is effective.

- Good visualizations are those that play to the strengths of the human visual system

- If this is true, we should know how the human visual system decodes a graph

---

## Comparing quantitative variables
 
Cleveland provides an ordering of the elementary tasks for the graphical perception of quantitative information, from most to least accurate:

1. Position along a common scale
2. Position on identical but nonaligned scales
3. Length
4. Angle/slope
5. Area
6. Volume, density or color saturation 
7. Color Hue
---

## How to use this?
- We should identify the most important comparison we want to make among our quantitative variables

- We should encode the most important comparison with the highest-ranked graphical element in the list (1. position along a common scale)
---

## Color hue

![](Part_I_ggplot2_files/figure-html/unnamed-chunk-18-1.png)

---

## Length

![](Part_I_ggplot2_files/figure-html/unnamed-chunk-19-1.png)

---

## Position along a common scale

![](Part_I_ggplot2_files/figure-html/unnamed-chunk-20-1.png)

---

## Miles per gallon, comparing elementary tasks

![](Part_I_ggplot2_files/figure-html/unnamed-chunk-21-1.png)
---

## Miles per gallon, comparing elementary tasks

![](Part_I_ggplot2_files/figure-html/unnamed-chunk-22-1.png)
---

## Angle or slope 
- Eye color of Star Wars characters
![](Part_I_ggplot2_files/figure-html/unnamed-chunk-23-1.png)
---

## Angle or slope
- Pie charts are always an error!!!

- But now we know why

- Decoding quantitative variables from angles is perceptually more difficult than decoding them from other graphical elements

- Always use a bar graph instead of a pie chart

---

## Visualization to communicate information...

---

## Criteria for Evaluating Graphics

- Context (Rhetoric):
    - What is the main message? Sub-messages? Story.
    - Why/when was it produced?
    - Who’s the audience?

- Content (Aesthetic):
    - What are the pieces of information?
    - How is the information coded into the graphic?
    - What conventions are used? What is unconventional?
    - Is the data accurately represented? Lie factor, trustworthiness.
    - What is the ratio of data to ink in the plot? High, medium, low.
    - What’s missing?

- Perception (Perceptual):
    - How clearly is the information represented? What is emphasized, de-emphasized?
    - How is the viewer drawn in?
    - What is your overall impression, opinion?
---
## Material to read about viz

For more ideas about which plots to produce to answer the questions you are interested in, see <a name=cite-robbins2012creating></a>[Robbins (2012)](#bib-robbins2012creating), <a name=cite-cleveland1993visualizing></a>[Cleveland (1993)](#bib-cleveland1993visualizing), <a name=cite-chambers2017graphical></a>[Chambers (2017)](#bib-chambers2017graphical) and [Tukey (1977)](#bib-tukey77)
---

class: inverse, center, middle

## ggplot2 and the Grammar of Graphics
---

## Why use ggplot2?

- **ggplot2** is an R package for producing statistical, or data, graphics, developed by <a name=cite-wickham2016ggplot2></a>[Wickham (2016)](#bib-wickham2016ggplot2).

- It differs from most other graphics packages because it has a deep underlying grammar.

- This grammar is based on the Grammar of Graphics theory <a name=cite-wilkinson2006grammar></a>[Wilkinson (2006)](#bib-wilkinson2006grammar).

- It is the most widely used R package for visualization and is more than 10 years old

- [ggplot2: Elegant Graphics for Data Analysis](https://github.com/hadley/ggplot2-book)

---
 
## Grammar of Graphics

The **Grammar of Graphics** answers the following questions:

- What is a statistical graphic?

- How to describe a graph?

- How to create a graph?

"A statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars)"
---

## Grammar of Graphics

- A statistic is a function of data, like the sample mean `$\bar x = \sum_{i=1}^n \frac{x_i}{n}$` or the sample variance `$S^2 = \sum_{i=1}^n \frac{(x_i - \bar x)^2}{n-1}$`

- The grammar of graphics provides a tight connection between data and statistics.

- Key point about `ggplot2`: it makes plots another type of statistic.

- A plot is a function of the data, a mapping from **data** to aesthetic attributes of geometric objects.

---

## Why use the grammar of graphics?

.pull-left[ 
- To make visualizations we already know

- To create new visualizations

- To identify better graphs to visualize our data
]
.pull-right[

The limit is your imagination!

<img src = "imaginacion.png" width =200>
]
---

## `ggplot2`

- A set of independent components, which gives flexibility

- Not limited to predetermined plots: you can create what you want

- Based on a set of principles, so it is easy to learn

- The graphs are easily reproducible

- You can make publication-quality graphs in a short time

- Designed to work iteratively, based on layers

---

## Grammar of graphics

- **data**: with a set of aesthetic mappings (**aes**) describing how variables in the data are mapped to aesthetic attributes that you can perceive.

- **layers**: geometric elements (**geoms**: points, lines, polygons, text, ...) and statistical transformations that summarize the data in many useful ways (**stats**: identities, counts, bins, ...).

- **scales**: map values in the data space to values in an aesthetic space (e.g. color, size, shape or position). Scales also draw the legend or axes.

- **coord**: describes how data coordinates are mapped to the plane of the graphic. Normally Cartesian, but pie charts, for example, use polar coordinates.

- **facetting**: how to break up the data into subsets and how to display those subsets as small multiples.

- **theme**: controls the finer points of display, like the font size and background colour.
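
The components listed above can be seen together in one plot. The sketch below assumes the `mpg` data set that ships with `ggplot2`; the variables, palette and theme are illustrative choices rather than part of the original slides.

```r
library(ggplot2)

p <- ggplot(data = mpg,                                    # data
            mapping = aes(x = cty, y = hwy, colour = drv)) +  # aes mappings
  geom_point(alpha = 1/3) +                 # layer: point geom, identity stat
  scale_colour_brewer(palette = "Dark2") +  # scale: data values to colours, draws the legend
  coord_cartesian() +                       # coord: Cartesian plane (the default)
  facet_wrap(~ year) +                      # facetting: one panel per model year
  theme_bw(base_size = 14)                  # theme: fonts, background, grid

p
```

Each component is added with `+`, so any one of them can be swapped without touching the others.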
---

## Installing ggplot2

- Install the released version from CRAN

```r
install.packages("ggplot2") 
```
- Load the package with `library(ggplot2)`

Like all contributed R packages, `ggplot2` can change in future versions.

Development version available on GitHub: https://github.com/tidyverse/ggplot2

```r
install.packages("devtools")
library(devtools)
install_github("tidyverse/ggplot2")
```
- `ggplot2` is part of `tidyverse`

---

## `ggplot2` help

Mailing list: http://groups.google.com/group/ggplot2

Stack Overflow: http://stackoverflow.com

---

## Tip Example

```r
library(readr)  # provides read_csv()

# load the data
tips <- read_csv("http://www.ggobi.org/book/data/tips.csv")

head(tips)
```

```
## # A tibble: 6 x 8
## obs totbill tip sex smoker day time size
## <int> <dbl> <dbl> <chr> <chr> <chr> <chr> <int>
## 1 1 17.0 1.01 F No Sun Night 2
## 2 2 10.3 1.66 M No Sun Night 3
## 3 3 21.0 3.5 M No Sun Night 3
## 4 4 23.7 3.31 M No Sun Night 2
## 5 5 24.6 3.61 F No Sun Night 4
## 6 6 25.3 4.71 M No Sun Night 4
```

---

## Three components of every plot

- **data**: data to visualize

- **aes**: a set of aesthetic mappings between variables in the data and visual properties (e.g. color, size, etc.)

- **layer**: at least one layer describing how to render each observation. Each layer is created with a **geom** function.

---

## Three components of every plot

- **data**: tips

- **aes**: `totbill` mapped to the `x` position, `tip` to the `y` position.

- **layer**: points with `geom_point`.

Let's make our first plot!
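
The three components above can be put together as in the sketch below. `tips_demo` is a stand-in that only holds the six rows printed by `head(tips)` (an assumption, so the example runs without the download). With the full data, `tips` would be used instead.

```r
library(ggplot2)

# Stand-in for `tips`: the six rows shown by head(tips), values as printed
tips_demo <- data.frame(
  totbill = c(17.0, 10.3, 21.0, 23.7, 24.6, 25.3),
  tip     = c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71)
)

# data + aesthetic mapping + one layer of points
p_first <- ggplot(data = tips_demo, aes(x = totbill, y = tip)) +
  geom_point()

p_first
```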
---
## Three components of every plot

Aspect ratio: ratio between the width and the height of a rectangle

---

## Three components of every plot
Overplotting

<img src="Part_I_ggplot2_files/figure-html/unnamed-chunk-27-1.png" style="display: block; margin: auto;" />
---

## Tip example
 
 What do you see?

- Weak linear relationship between tip and total bill

- A lot of variability

- Horizontal lines indicate a preference for whole-dollar tips

---

## Color, size, shape and other aes

To include other variables, we can use other **aes** (color, shape, size)

```r
aes(x = totbill, y = tip, colour = sex)

aes(x = totbill, y = tip, shape = sex)

aes(x = totbill, y = tip, size = size)
```
---

## Color

<img src="Part_I_ggplot2_files/figure-html/unnamed-chunk-28-1.png" style="display: block; margin: auto;" />
---

## Fixed color 
To fix the color to a constant value, set it outside `aes()` (as an argument of the layer) or use `I('blue')` inside `aes()`

<img src="Part_I_ggplot2_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" />
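
A possible way to see the difference between fixing and mapping a color is sketched below. `tips_demo` is a stand-in with the six rows printed by `head(tips)` (an assumption, not the full data).

```r
library(ggplot2)

# Stand-in for `tips`: the six rows shown by head(tips), values as printed
tips_demo <- data.frame(
  totbill = c(17.0, 10.3, 21.0, 23.7, 24.6, 25.3),
  tip     = c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71)
)

# Constant colour: set outside aes(), as an argument of the layer
p_fixed <- ggplot(tips_demo, aes(x = totbill, y = tip)) +
  geom_point(colour = "blue")

# Inside aes(), "blue" is treated as data: a one-level variable that
# gets the default discrete colour scale and a legend
p_mapped <- ggplot(tips_demo, aes(x = totbill, y = tip, colour = "blue")) +
  geom_point()

unique(layer_data(p_fixed)$colour)   # the literal colour "blue"
unique(layer_data(p_mapped)$colour)  # a colour chosen by the default scale
```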
---

## shape

<img src="Part_I_ggplot2_files/figure-html/unnamed-chunk-30-1.png" style="display: block; margin: auto;" />
---

## size

<img src="Part_I_ggplot2_files/figure-html/unnamed-chunk-31-1.png" style="display: block; margin: auto;" />
---


## Facets

- We can display additional categorical variables in a graph by subsetting the data.

- Facets create a table of plots by subsetting the data and partitioning the graphical display.

- Two types: `facet_grid` and `facet_wrap`

---

## `facet_wrap`

<img src="Part_I_ggplot2_files/figure-html/unnamed-chunk-33-1.png" style="display: block; margin: auto;" />
---

## `facet_grid`

<img src="Part_I_ggplot2_files/figure-html/unnamed-chunk-34-1.png" style="display: block; margin: auto;" />
---

## Facetting

With `facet_wrap()` and `facet_grid()` you can control whether the position scales are the same in all panels (fixed) or allowed to vary between panels (free) with the `scales` parameter:

- scales = "fixed": x and y scales are fixed across all panels.
- scales = "free_x": the x scale is free, and the y scale is fixed.
- scales = "free_y": the y scale is free, and the x scale is fixed.
- scales = "free": x and y scales vary across panels.

Fixed scales make it easier to see patterns across panels; free scales make it easier to see patterns within panels.
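
A small sketch of the `scales` options follows. `tips_toy` uses the same column names as `tips`, but its values are partly invented for illustration (an assumption).

```r
library(ggplot2)

# Toy data with the column names of `tips` (values invented for illustration)
tips_toy <- data.frame(
  totbill = c(17.0, 10.3, 21.0, 23.7, 24.6, 25.3, 8.8, 26.9),
  tip     = c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12),
  sex     = c("F", "M", "M", "M", "F", "M", "F", "M"),
  smoker  = c("No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes")
)

p <- ggplot(tips_toy, aes(x = totbill, y = tip)) + geom_point()

# One panel per level of sex; every panel shares the same axes (default)
p_fixed <- p + facet_wrap(~ sex, scales = "fixed")

# Same panels, but each one gets its own x and y ranges
p_free <- p + facet_wrap(~ sex, scales = "free")

# Rows by smoker, columns by sex; only the y scale varies (between rows)
p_grid <- p + facet_grid(smoker ~ sex, scales = "free_y")
```

Printing `p_fixed` and `p_free` side by side shows the trade-off described above.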
---
 
## Other geoms
 
If we replace `geom_point()` with another `geom` we get a different visualization.
Most common `geoms`:
 
- `geom_smooth()` 
- `geom_boxplot()`
- `geom_histogram()` 
- `geom_bar()`
- `geom_path()` and `geom_line()`

Each `geom` is associated with a particular geometric element.
---

## More geoms

http://ggplot2.tidyverse.org/reference/
---

## Extensions

More than 40 `ggplot2` extensions

http://www.ggplot2-exts.org/gallery/

---

## Include labs

<img src="Part_I_ggplot2_files/figure-html/unnamed-chunk-35-1.png" style="display: block; margin: auto;" />
---

## `geom` Examples

```r
p <- ggplot(tips, 
 aes(x = day, y = tip))
```
- `p `
- `p + geom_point()` 
- `p + geom_boxplot()`
- `p + geom_violin()`

![](Part_I_ggplot2_files/figure-html/unnamed-chunk-37-1.png)
---

## `geom`, Example

```r
p <- ggplot(tips, aes(x = day, 
 fill = smoker)) 
```
- `p + geom_bar()`
- `p + geom_bar(position="stack")` 
- `p + geom_bar(position="dodge")` 
- `p + geom_bar(position="fill")`

![](Part_I_ggplot2_files/figure-html/unnamed-chunk-39-1.png)
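
The effect of each `position` can be checked numerically, as in this sketch. `tips_toy` is invented toy data with the column names of `tips` (an assumption).

```r
library(ggplot2)

# Toy data (invented): Sat has 2 "No" and 2 "Yes", Sun has 3 "No" and 1 "Yes"
tips_toy <- data.frame(
  day    = c("Sat", "Sat", "Sat", "Sun", "Sun", "Sun", "Sun", "Sat"),
  smoker = c("No", "Yes", "Yes", "No", "No", "Yes", "No", "No")
)

p <- ggplot(tips_toy, aes(x = day, fill = smoker))

p_stack <- p + geom_bar(position = "stack")  # default: segments on top of each other
p_dodge <- p + geom_bar(position = "dodge")  # segments side by side
p_fill  <- p + geom_bar(position = "fill")   # stacked, each bar rescaled to height 1

# Highest bar top under each position adjustment
max(layer_data(p_stack)$ymax)  # total count of the largest day
max(layer_data(p_dodge)$ymax)  # largest single day/smoker count
max(layer_data(p_fill)$ymax)   # 1, since the bars show proportions
```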
---

## scales

- A scale controls the mapping from data to aesthetic attributes, and we need a scale for every aesthetic used on a plot.

- Values must be converted from data units (totbill, sex, etc.) to graphical units (color, shape, etc.)

- This conversion process is called scaling and is performed by `scales`

- Each scale operates across all the data in the plot, ensuring a consistent mapping from data to aesthetics.

- `colors` are represented by a six-letter hexadecimal string, `sizes` by a number and `shapes` by an integer

---

## scales

You can generate many plots without knowing how scales work, but understanding scales and learning how to manipulate them will give you much more control
---

## scales

- The aesthetic mapping only says that a variable is mapped to an aesthetic element; it does not say how the mapping is done.

- Mapping a variable to `shape` with `aes(shape = x)` does not specify which shapes are used.

- Likewise, `aes(color = z)` does not specify which colors are used.

- Choosing the specific colors, shapes, sizes, etc. is done with transformations in `scale`
---

## scales
- `color` and `fill`
- `size`
- `shape`
- `linetype`

Scales are modified with a family of functions named `scale_<aesthetic>_<type>`. Type `scale_<tab>` to list the available `scale` functions.
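
As a sketch of the naming pattern, the example below uses `scale_colour_manual()` and `scale_shape_manual()`. The data frame `tips_toy` and the chosen colors and shapes are assumptions for illustration only.

```r
library(ggplot2)

# Stand-in for `tips`: the six rows shown by head(tips), values as printed
tips_toy <- data.frame(
  totbill = c(17.0, 10.3, 21.0, 23.7, 24.6, 25.3),
  tip     = c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71),
  sex     = c("F", "M", "M", "M", "F", "M")
)

# aes() says *which* variable drives colour and shape ...
p <- ggplot(tips_toy, aes(x = totbill, y = tip, colour = sex, shape = sex)) +
  geom_point(size = 3)

# ... and scale_<aesthetic>_<type>() says *how* values become colours and shapes
p_scaled <- p +
  scale_colour_manual(values = c(F = "darkorange", M = "steelblue")) +
  scale_shape_manual(values = c(F = 16, M = 17))

# The colour and shape assigned to each level of sex
unique(layer_data(p_scaled)[, c("colour", "shape")])
```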
---

## Available `scales`

<img src="summary.png" width="700">
---

## Color Scales

- Colors are controlled through scales

- `scale_colour_discrete` (`scale_colour_hue`) and `scale_colour_continuous` (`scale_colour_gradient`) are the default choices for factor variables and numeric variables, respectively

- We can change the parameters of the default scale, or we can change the scale function

---

`scale_colour_gradient(..., low = "#132B43", high = "#56B1F7", space = "Lab", na.value = "grey50", guide = "colourbar")`

<img src = "colorgradient.png" width = 200 class = "center">
 
- colors can be specified by hex code, name or through rgb()

- Gradient goes from low to high - that should match the interpretation of the mapped variable

---

`scale_colour_gradient2(..., low = muted("red"), mid = "white", 
high = muted("blue"), midpoint = 0, space = "Lab", na.value = "grey50", guide = "colourbar")`

<img src = "colorgradient2.png" width = 200 class = "center">
 
- `midpoint` is the value of the ‘neutral’ color

- gradient2 is a diverging color scheme

- best matches a variable that goes from large negative to zero to large positive (or below mean, above mean)
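
A minimal sketch of a diverging scale is shown below. `tip_dev` (tip minus the mean tip) is a hypothetical derived variable, and `tips_demo` holds only the six rows printed by `head(tips)`. Plain `"red"` and `"blue"` are used instead of the muted defaults.

```r
library(ggplot2)

# Stand-in for `tips`: the six rows shown by head(tips), values as printed
tips_demo <- data.frame(
  totbill = c(17.0, 10.3, 21.0, 23.7, 24.6, 25.3),
  tip     = c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71)
)

# Hypothetical derived variable: negative = below-average tip
tips_demo$tip_dev <- tips_demo$tip - mean(tips_demo$tip)

# Below-average tips shade towards red, above-average towards blue,
# and values near the midpoint (0) are close to white
p_div <- ggplot(tips_demo, aes(x = totbill, y = tip, colour = tip_dev)) +
  geom_point(size = 3) +
  scale_colour_gradient2(low = "red", mid = "white", high = "blue", midpoint = 0)

p_div
```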

---

`scale_color_hue(..., h = c(0, 360) + 15, c = 100, l = 65, h.start = 0, direction = 1, na.value = "grey50")`

- uses hue, chroma and luminance (=value)

- each level of a variable is assigned a different level of `h`
---

`scale_colour_brewer(..., type = "seq", palette = 1, direction = 1)`

- brewer schemes are defined in RColorBrewer (Neuwirth, 2014)

- palettes can be specified by name or index

- see also http://colorbrewer2.org/ (Brewer et al 2002)

---

## Color Brewer Schemes 
There are 3 types of palettes, sequential, diverging, and qualitative.

1. **Sequential** schemes are suited to ordered data that progress from low to high, with light colors for low data values and dark colors for high data values.

2. **Qualitative** schemes do not imply magnitude differences between legend classes; hues are used to create the primary visual differences between classes. They are best suited to representing nominal or categorical data.

3. **Diverging** schemes put equal emphasis on mid-range critical values and extremes at both ends of the data range. The middle of the legend is emphasized with light colors, and the low and high extremes are emphasized with dark colors.

---

## Color Brewer Schemes

`RColorBrewer` provides color schemes for discrete variables
.left-column[

`display.brewer.all()`

- Sequential

- Qualitative

- Divergent
]

.right-column[

![](Part_I_ggplot2_files/figure-html/unnamed-chunk-40-1.png)
]
---

## Color & Fill

- Brewer palettes have a limited number of colors; the palette `Set2`, for example, has 8

- When a variable has more levels than the palette has colors, ggplot2 issues warnings and the plot is invalid

- Larger palettes can be produced by interpolating an existing one with the constructor function `colorRampPalette()` (from `grDevices`)

- It returns a function that builds a palette with an arbitrary number of colors by interpolating the existing palette

---

## Color & Fill
Select `Set1` palette

![](Part_I_ggplot2_files/figure-html/unnamed-chunk-41-1.png)
---

## Color & Fill
Select `Set1` palette

```r
ggplot(data = tips) + 
  geom_bar(aes(factor(round(tip)),  
               fill = factor(round(tip)) )) + 
  scale_fill_brewer( palette = "Set1") +
  labs(x = "Tip in USD", fill = "Tip")
```
---

## Color & Fill

Use `colorRampPalette(brewer.pal(9, "Set1"))`

![](Part_I_ggplot2_files/figure-html/unnamed-chunk-43-1.png)
---

## Color & Fill

Use `colorRampPalette(brewer.pal(9, "Set1"))`

```r
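library(RColorBrewer)  # provides brewer.pal()

# interpolate the 9 colors of Set1 into a palette-generating function
getPalette <- colorRampPalette(brewer.pal(9, "Set1"))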
 
ggplot(data = tips) + 
  geom_bar(aes(factor(round(tip)),
               fill = factor(round(tip)) )) + 
  scale_fill_manual( values = getPalette(10)) + 
  labs(x = "Tip in USD", fill = "Tip")
```
---

## Color & Fill

- Area plots use fill to map values to the fill color

- for categorical variables, discrete color scales are used: `scale_fill_hue`, `scale_fill_brewer`, `scale_fill_grey`, ...

- `scale_fill_manual(..., values)`: `values` is a vector of colors. At least as many colors as there are levels in the variable must be listed
---

## Shape

- Point shape

- `scale_shape_continuous()`, `scale_shape_discrete()`,
`scale_shape_manual()`

<img src = "shape2.png" width = 300 class = "center">
 
---

## Shape

`scale_shape_manual(values = c(8, 15))`
<img src="Part_I_ggplot2_files/figure-html/unnamed-chunk-45-1.png" style="display: block; margin: auto;" />
---

## Include labs 
Previously we used `labs()`, but the long way is...

<img src="Part_I_ggplot2_files/figure-html/unnamed-chunk-46-1.png" style="display: block; margin: auto;" />
---

## Output

- `ggsave()` selects the graphics device based on the file extension
- `ggplot(tips, aes(totbill, tip, colour = smoker)) +
  geom_point() + theme(aspect.ratio = 1)`
  
- `ggsave("ppt123.png")` # png (pixelated raster image) 
- `ggsave("ppt123.pdf")` # pdf (scalable vector image)
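
A sketch of saving the same plot to both formats is given below. The temporary file names, the size and the `dpi` are assumptions chosen for illustration, and `tips_demo` holds only the six rows printed by `head(tips)`.

```r
library(ggplot2)

# Stand-in for `tips`: the six rows shown by head(tips), values as printed
tips_demo <- data.frame(
  totbill = c(17.0, 10.3, 21.0, 23.7, 24.6, 25.3),
  tip     = c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71)
)

p <- ggplot(tips_demo, aes(totbill, tip)) +
  geom_point() + theme(aspect.ratio = 1)

# Hypothetical output paths; any path ending in .png or .pdf works
png_file <- tempfile(fileext = ".png")
pdf_file <- tempfile(fileext = ".pdf")

# The extension picks the device; width and height are in inches by default
ggsave(png_file, plot = p, width = 5, height = 5, dpi = 150)
ggsave(pdf_file, plot = p, width = 5, height = 5)

file.exists(c(png_file, pdf_file))
```

Passing `plot = p` explicitly avoids relying on the last plot displayed.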
---

## Grammar of Graphics

```r
ggplot(tips, aes(totbill, tip, colour = smoker)) +
  geom_point() + theme(aspect.ratio = 1)
```

<img src="Part_I_ggplot2_files/figure-html/unnamed-chunk-47-1.png" style="display: block; margin: auto;" />
---

## Behind this plot

- Each observation is represented as a point whose position is associated with two variables (horizontal and vertical position)

- Each point has a size, color and shape, called aesthetic elements (`aes`)

- `aes` are properties that can be seen in the plot. Each `aes` can be mapped to a variable or fixed to a constant value

- `totbill` is mapped to the horizontal position, `tip` to the vertical position and `smoker` to color. Size and shape are not mapped to variables (default values)
---

## Behind this plot

```
## # A tibble: 3 x 3
## totbill tip smoker
## <dbl> <dbl> <chr> 
## 1 17.0 1.01 No 
## 2 10.3 1.66 No 
## 3 21.0 3.5 No
```
- New data: the result of mapping the original data to the aesthetic elements

| x    | y    | colour |
|------|------|--------|
| 17.0 | 1.01 | No     |
| 10.3 | 1.66 | No     |
| 21.0 | 3.5  | No     |
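
One way to inspect this internal data is `layer_data()`, sketched here on the three rows above (`tips3` is a stand-in, not the full data). Note that it returns the values after scaling, so `colour` holds the color assigned to `"No"` rather than the label itself.

```r
library(ggplot2)

# The three rows shown above, values as printed
tips3 <- data.frame(
  totbill = c(17.0, 10.3, 21.0),
  tip     = c(1.01, 1.66, 3.5),
  smoker  = c("No", "No", "No")
)

p <- ggplot(tips3, aes(totbill, tip, colour = smoker)) + geom_point()

# layer_data() returns the data ggplot2 actually draws: x and y positions
# plus the colour the scale assigned to each value of smoker
layer_data(p)[, c("x", "y", "colour")]
```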
---

## Layers in a plot

- The data, aesthetic mapping, geometric objects and statistical transformations define a **layer**

- We can define a graphic with multiple layers
---

## layers in a graphic
The layered grammar defines the components of a graphic:

- a default dataset and a set of mappings from variables to aesthetic elements

- one or more layers; each layer has a geometric element, a statistical transformation, a position adjustment and, optionally, its own data and `aes`
---

## Layers in a graphic

```r
ggplot() +
  layer(
    data = tips, mapping = aes(x = totbill, y = tip), 
    geom = "point", stat = "identity", position = "identity"
  ) +
  scale_x_continuous() +
  scale_y_continuous() +
  coord_cartesian()
```
Equivalent to :

```r
ggplot(data = tips, aes(x = totbill, y = tip)) +
  geom_point() 
```
---

## Layers in a graphic

More than one data set:

```r
ggplot() +
  geom_point(data = tips, aes(x = totbill, y = tip)) +
  geom_point(data = data.frame(x = 30, y = 6), aes(x, y), 
             color = "red", size = 10)
```

<img src="Part_I_ggplot2_files/figure-html/unnamed-chunk-51-1.png" style="display: block; margin: auto;" />
---

## Layers in a graphic

Layers in a bar graph

```r
p1 <- ggplot() +
 layer(
 data = tips, mapping = aes(x = day , y = ..prop.., group = 1), 
 geom = "bar", stat = "count", position = "identity"
 ) +
 scale_x_discrete() +
 scale_y_continuous() +
 coord_cartesian()
```
Equivalent to :

```r
p1 <- ggplot(data = tips, aes(x = day, y =..prop.., group = 1)) +
 geom_bar() 
```
---

## Bars in proportion

```r
ggplot_build(p1)$data[[1]]
```

```
##            y count       prop x group PANEL ymin       ymax xmin xmax
## 1 0.07786885    19 0.07786885 1     1     1    0 0.07786885 0.55 1.45
## 2 0.35655738    87 0.35655738 2     1     1    0 0.35655738 1.55 2.45
## 3 0.31147541    76 0.31147541 3     1     1    0 0.31147541 2.55 3.45
## 4 0.25409836    62 0.25409836 4     1     1    0 0.25409836 3.55 4.45
##   colour   fill size linetype alpha
## 1     NA grey35  0.5        1    NA
## 2     NA grey35  0.5        1    NA
## 3     NA grey35  0.5        1    NA
## 4     NA grey35  0.5        1    NA
```
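
The `prop` column can be reproduced by hand from the `count` column. This is a sketch of the arithmetic only; the counts are copied from the output above.

```r
# Counts per bar, copied from the `count` column of the output above
counts <- c(19, 87, 76, 62)

# With group = 1 all bars belong to one group, so each proportion is
# the bar's count divided by the total count
props <- counts / sum(counts)

round(props, 8)
sum(props)  # proportions within a group add up to 1
```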
---

## histogram

```r
ggplot(data = tips, aes(x = totbill , y =..density..)) +
  geom_histogram() 
```

![](Part_I_ggplot2_files/figure-html/unnamed-chunk-56-1.png)
---

## themes 
- Themes let you control every non-data aspect of a plot

- `theme` gives you control of fonts, background, tick marks, etc.

- Two pre-defined themes: `theme_grey` (default), `theme_bw`

- Use `theme_set` to apply a theme to every future plot, e.g. `theme_set(theme_bw())`

- The `ggthemes` package defines additional themes; `library(help = "ggthemes")` lists all of them

---

## Element

- You can also make your own theme, or modify an existing one.

- Themes are made up of elements which can be one of:

- `element_line`,  `element_text`, `element_rect`,
`element_blank`

- Gives you a lot of control over plot appearance.
---

## Element

- Axis:
`axis.line`, `axis.text.x`, `axis.text.y`, `axis.ticks`, `axis.title.x`, `axis.title.y`

- Legend: 
`legend.background`, `legend.key`, `legend.text`, `legend.title`

- Panel: 
`panel.background`, `panel.border`, `panel.grid.major`, `panel.grid.minor`

- Strip (facetting): 
`strip.background`, `strip.text.x`, `strip.text.y`

For a complete overview, see `?theme`.
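
The sketch below shows one element of each kind applied to a plot. The data frame `tips_demo` (six rows as printed by `head(tips)`) and the specific sizes and colors are assumptions for illustration.

```r
library(ggplot2)

# Stand-in for `tips`: the six rows shown by head(tips), values as printed
tips_demo <- data.frame(
  totbill = c(17.0, 10.3, 21.0, 23.7, 24.6, 25.3),
  tip     = c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71),
  sex     = c("F", "M", "M", "M", "F", "M")
)

p_theme <- ggplot(tips_demo, aes(totbill, tip, colour = sex)) +
  geom_point() +
  theme_bw() +  # start from a complete pre-defined theme
  theme(
    axis.title       = element_text(size = 14),          # text element
    axis.line        = element_line(colour = "grey20"),  # line element
    legend.key       = element_rect(fill = "white"),     # rectangle element
    panel.grid.minor = element_blank()                   # draw nothing
  )

p_theme
```

Calling `theme()` after `theme_bw()` only overrides the listed elements and keeps the rest of the complete theme.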
---

## theme
Let's do this plot!
![](Part_I_ggplot2_files/figure-html/unnamed-chunk-57-1.png)
---

## theme
Let's do this plot!
![](Part_I_ggplot2_files/figure-html/unnamed-chunk-59-1.png)
---

## theme
Let's do this plot!

```r
ggplot(data = tips, aes(x = totbill, y = tip, colour = sex)) +
  geom_point() +
  theme(
    aspect.ratio = 1,
    legend.position = "bottom",
    panel.background = element_rect(fill = "white"),
    panel.grid = element_line(colour = "grey92"),
    panel.border = element_rect(colour = "grey20", fill = NA),
    legend.key = element_rect(fill = "white"),
    axis.text.x = element_text(size = 20),
    axis.text.y = element_text(size = 20),
    axis.title = element_text(size = 30),
    legend.text = element_text(size = 20),
    legend.title = element_text(size = 20)
  )
```
---

## Activity

https://natydasilva.github.io/CODATA_Activity/#1
---

## CC licence

<img src="cc.png" width="200">
---

# References

<a name=bib-anscombe></a>[Anscombe, F. J.](#cite-anscombe) (1973).
"Graphs in statistical analysis". In: _The American Statistician_.

<a name=bib-chambers2017graphical></a>[Chambers, J.
M.](#cite-chambers2017graphical) (2017). _Graphical Methods for
Data Analysis: 0_. Chapman and Hall/CRC.

<a name=bib-cleveland></a>[Cleveland, W. S. and R.
McGill](#cite-cleveland) (1985). "Graphical perception and
graphical methods for analyzing scientific data".

<a name=bib-cleveland1993visualizing></a>[Cleveland, W.
S.](#cite-cleveland1993visualizing) (1993). _Visualizing data_.
Vol. 2. Hobart Press Summit, NJ.

<a name=bib-gelman2002let></a>[Gelman, A, C. Pasarica and R.
Dodhia](#cite-gelman2002let) (2002). "Let's practice what we
preach: turning tables into graphs". In: _The American
Statistician_ 56.2, pp. 121-130.

<a name=bib-matejka2017same></a>[Matejka, J. and G.
Fitzmaurice](#cite-matejka2017same) (2017). "Same stats, different
graphs: generating datasets with varied appearance and identical
statistics through simulated annealing". In: _Proceedings of the
2017 CHI Conference on Human Factors in Computing Systems_. ACM,
pp. 1290-1294.

<a name=bib-robbins2012creating></a>[Robbins, N.
B.](#cite-robbins2012creating) (2012). _Creating more effective
graphs_. Wiley.

<a name=bib-tukey77></a>[Tukey, J. W.](#cite-tukey77) (1977).
"Exploratory Data Analysis".

<a name=bib-wickham2016ggplot2></a>[Wickham,
H.](#cite-wickham2016ggplot2) (2016). _ggplot2: elegant graphics
for data analysis_. Springer.

<a name=bib-wilkinson2006grammar></a>[Wilkinson,
L.](#cite-wilkinson2006grammar) (2006). _The grammar of graphics_.
Springer Science & Business Media.