class: center, middle, inverse, title-slide

# Data Visualization
### Natalia da Silva
Instituto de Estadística-FCEA-UDELAR

---

## <span style = "color:#883984">About me</span>

.pull-left[
<img src = "portrait.png" width = 400 class = "center">
]

.pull-right[
- Assistant professor, Instituto de Estadística-Universidad de la República (IESTA-UDELAR), Montevideo, Uruguay
- PhD and MSc in Statistics from Iowa State University, USA
- Interests: supervised learning methods, computational statistics, visualization and meta-analysis
- Co-founder of R-Ladies Ames, R-Ladies Montevideo and GURU::MVD
- Contact info: natalia@iesta.edu.uy, @pacocuak, http://natydasilva.com
]

---

## <span style="color:#88398A"> Link to the presentation </span>

https://natydasilva.github.io/CODATA/#1

---

## <span style="color:#88398A"> About this workshop </span>

- Why do we use visualization?
- EDA, types of variables and viz examples
- Ideas for an effective visualization
- Why use ggplot2?
- Grammar of graphics
- Examples

---

# <span style="color:#88398A">Importance of data visualization </span>

"The greatest value of a picture is when it forces us to notice what we never expected to see." <a name=cite-tukey77></a>[Tukey (1977)](#bib-tukey77)

"Graphs provide powerful tools both for analyzing scientific data and for communicating quantitative information" <a name=cite-cleveland></a>[Cleveland and McGill (1985)](#bib-cleveland)

---

## <span style="color:#88398A"> Few people escape these ideas</span>

1. Numerical calculations are exact, but graphs are rough
2. For any particular kind of statistical data there is just one set of calculations constituting a correct statistical analysis
3. Performing intricate calculations is virtuous, whereas actually looking at the data is cheating

<a name=cite-anscombe></a>[Anscombe F (1973)](#bib-anscombe)

---

## <span style="color:#88398A"> Statistical Visualization</span>

Visualization plays an important role in all the stages of a statistical analysis.

- **Initial Exploration:** Find general and specific patterns in the data.
- **Models:** Check the data assumptions before running a model.
- **Diagnostics:** Visualize the model in the data space or the data in the model space.

---

## <span style="color:#88398A"> EDA</span>

Exploratory data analysis (EDA) is an iterative process for broadly exploring different aspects of the data.

Origins: John Tukey encouraged statisticians to explore the data and possibly formulate hypotheses that could lead to new data collection and experiments.

- EDA helps us understand our data
- We need to define questions to guide our research
- Each question focuses on something specific about the data, which determines the type of data, the model and possible transformations
- Key point: define interesting questions for our problem

---

## <span style="color:#88398A"> What is the relationship between X and Y?</span>

<table class="table table-bordered" style="font-size: 15px; width: auto !important; margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:right;"> X </th> <th style="text-align:right;"> Y </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 1.972 </td> <td style="text-align:right;"> 1.236 </td> </tr>
<tr> <td style="text-align:right;"> 1.112 </td> <td style="text-align:right;"> 1.994 </td> </tr>
<tr> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> 1.009 </td> </tr>
<tr> <td style="text-align:right;"> 0.665 </td> <td style="text-align:right;"> 1.942 </td> </tr>
<tr> <td style="text-align:right;"> 0.235 </td> <td style="text-align:right;"> 0.356 </td> </tr>
<tr> <td style="text-align:right;"> 0.247 </td> <td style="text-align:right;"> 1.658 </td> </tr>
<tr> <td style="text-align:right;"> 1.275 </td> <td style="text-align:right;"> 1.961 </td> </tr>
<tr> <td style="text-align:right;"> 0.702 </td> <td style="text-align:right;"> 0.045 </td> </tr>
<tr> <td style="text-align:right;"> 1.760 </td> <td 
style="text-align:right;"> 0.350 </td> </tr>
<tr> <td style="text-align:right;"> 1.691 </td> <td style="text-align:right;"> 0.277 </td> </tr>
<tr> <td style="text-align:right;"> 1.628 </td> <td style="text-align:right;"> 1.778 </td> </tr>
<tr> <td style="text-align:right;"> 1.957 </td> <td style="text-align:right;"> 1.290 </td> </tr>
</tbody>
</table>

---

## <span style="color:#88398A"> Why use visualization?</span>

<img src="Part_I_ggplot2_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

<img src="gelman.png" width="700" class="center">

- Statisticians recommend graphical displays but often do not follow this recommendation when presenting their own research.
- The authors analyze some papers in JASA and show examples of tables turned into plots
- They say there is a good reason to be lazy: it takes a lot of work to make a good visualization
- Nice graphs are possible, especially when we think hard about why we want to display these numbers in the first place
- If the world’s leading statistical journal doesn’t do it right, there is obviously still room for progress

<a name=cite-gelman2002let></a>[Gelman, Pasarica, and Dodhia (2002)](#bib-gelman2002let)

---

## <span style = "color:#883984">What is the sample mean value for x and y? </span>

![](Part_I_ggplot2_files/figure-html/unnamed-chunk-3-1.png)<!-- -->

--

`\(\bar x =\)` 54.27 and `\(\bar y =\)` 47.83

---

## <span style = "color:#883984">What is the sample mean value for x and y? </span>

![](Part_I_ggplot2_files/figure-html/unnamed-chunk-5-1.png)<!-- -->

--

`\(\bar x =\)` 54.27 and `\(\bar y =\)` 47.84

---

## <span style = "color:#883984">What is the sample mean value for x and y? </span>

![](Part_I_ggplot2_files/figure-html/unnamed-chunk-7-1.png)<!-- -->

--

`\(\bar x =\)` 54.26 and `\(\bar y =\)` 47.83

---

## <span style = "color:#883984">Why use visualization? 
</span>

.pull-left[
<img src="DinoSequential.gif" height="200" width="500"/>

https://github.com/stephlocke/datasauRus

- <a name=cite-matejka2017same></a>[Matejka and Fitzmaurice (2017)](#bib-matejka2017same)
]

.pull-right[
- Same statistical summaries
- Very different distributions
- [Link to the algorithm](https://www.autodeskresearch.com/publications/samestats).
]

---

## <span style = "color:#883984">Why use visualization? </span>

.pull-left[
- Graphs provide more information than numerical summaries
- Anscombe’s quartet [Anscombe F (1973)](#bib-anscombe)
- `\(n = 11\)`
- `\(\bar x= 9.0\)`
- `\(\bar y = 7.5\)`
- `\(\hat \beta_1= 0.5\)`
- `\(y = 3 + 0.5 x\)`
- `\(R^2 = 0.667\)`
- `\(....\)`
]

.pull-right[
<img src = "Quartet.png" width = "400" class = "center">
]

<!-- % \item Sum of squares of x - 110.0 -->
<!-- % \item Regression sum of squares = 27.50 (1 d.f.) -->
<!-- % \item Residual sum of squares of y = 13.75 (9 d.f.) -->
<!-- % \item Estimated standard error of bi = 0.118 -->
<!-- % \item Multiple R2 = 0.667 -->

---

## <span style="color:#88398A"> Visualization is not new</span>

Charles Minard, Napoleon’s Russian Campaign in 1812

<img src = "napo.png" width = 700 class = "center">

---

## <span style="color:#88398A"> Visualization is not new</span>

Florence Nightingale, coxcomb diagram of the causes of mortality in the Army in the East (1858)

<img src = "nighty.png" width = 700 class = "center">

---

## <span style="color:#88398A"> Data Visualization </span>

- Data visualizations are based on data
- They summarize information
- Different types of visualization emphasize different aspects of the data

---

## <span style="color:#88398A"> Information, data </span>

- Data contains cases and variables.
- In tabular form, cases are the rows and variables are usually the columns.

Variables can take different forms:

- Continuous, Categorical, Temporal, Spatial

---

## <span style="color:#88398A"> Type of data</span>

Key point for visualization: understand the types of variables you have

- **Continuous variable:** a variable with an infinite number of possible values, like “time” or “weight”.
- **Discrete variable:** a numerical variable that can only take a countable number of values. Example: “number of students in the class”
- **Categorical variable:** a variable whose values can be put into groups or categories. Example: hair color, dog breeds
- and more...

---

## <span style="color:#88398A"> What are the most important types of data displays?</span>

Impossible to answer, but ... some important types are:

- dotplots, histograms, density plots, time series, bar charts (all one-dimensional)
- scatterplots, side-by-side boxplots (two-dimensional)
- parallel coordinate plots, mosaic plots (higher dimensions)
- maps (geographic information)
- and more

---

## <span style="color:#88398A"> Why so many different types?</span>

- Different data displays suit different numbers and types of variables
- Emphasis: every chart shows a different aspect of the data

---

## <span style="color:#88398A">EDA rules?</span>

There are no clear rules for doing EDA, but we can begin with some basic questions:

- What is the variability in my data?
- What is the distribution of each variable?
- Is there any relationship between selected variables?
- How do these two variables covary?
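
These basic questions can be made concrete with a few lines of R. The following is only a minimal sketch: it assumes the built-in `mtcars` data set (not used in the original slides) and the `ggplot2` package introduced later in the workshop. The variables (`mpg`, `wt`) and the bin width are illustrative choices.

```r
library(ggplot2)

# Variability and distribution of a single variable (fuel economy, mpg).
mpg_sd <- sd(mtcars$mpg)
p_dist <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2.5)   # bin width is an arbitrary choice

# Covariation between two variables (car weight and fuel economy).
mpg_wt_cor <- cor(mtcars$wt, mtcars$mpg)
p_cov <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()
```

Printing `p_dist` or `p_cov` draws the plot. The numerical summaries (`mpg_sd`, `mpg_wt_cor`) answer part of each question, while the plots show the shape that those numbers can hide.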

---

## <span style="color:#88398A">Distribution</span>

To choose which tools to use in EDA, we should first identify the types of the variables to analyze.

Example:

- Categorical variables: we can analyze the distribution using a bar chart
- Continuous variables: we can analyze the distribution using a histogram

---

## <span style="color:#883984"> Barchart versus Histogram, starwars</span>

.pull-left[
![](Part_I_ggplot2_files/figure-html/unnamed-chunk-9-1.png)<!-- -->

- Variable: categorical/qualitative
- Bars separated by equal space
- Bars can be re-ordered, and should be, to improve the viz
- Parameter: none
]

.pull-right[
![](Part_I_ggplot2_files/figure-html/unnamed-chunk-10-1.png)<!-- -->

- Variable: continuous/quantitative
- Bins adjacent
- Bins can't be re-ordered
- Parameter: bin width
]

---

## <span style="color:#883984"> All the same...?</span>

.pull-left[
- Histogram, density plot, jittered dot plot and boxplot
- Every plot emphasizes a different aspect of the data: skewness, modality, symmetry, gaps, outliers
- There is no single right plot; all are useful
]

.pull-right[
![](Part_I_ggplot2_files/figure-html/unnamed-chunk-11-1.png)<!-- -->
]

---

## <span style="color:#88398A"> Covariation</span>

- Covariation is the joint variation of two variables
- The best way to detect covariation is to visualize the relationship between two variables
- The types of the variables determine how to visualize covariation

---

## <span style="color:#88398A"> Categorical vs Continuous</span>

- It is common to explore the distribution of a continuous variable according to a categorical variable
- Histograms or densities colored by a categorical variable let us compare the distributions for each category

<img src="Part_I_ggplot2_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

---

## <span style="color:#88398A"> Categorical vs Continuous, boxplot and violin</span>

Boxplot:

- We see the data distribution for continuous variables
- based on the five-number 
summaries (min, Q1, Q2, Q3, max).
- Outliers: points below Q1 - 1.5 IQR or above Q3 + 1.5 IQR

Violin: a combination of a boxplot and a density plot

---

## <span style="color:#88398A"> Categorical vs Continuous, boxplot and violin</span>

<img src="Part_I_ggplot2_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

---

## <span style="color:#88398A"> Categorical vs categorical, bars</span>

- Bar charts are useful for describing categorical variables: for each category, the bar length is proportional to the number of observations in that category
- If we want to represent more than one categorical variable we can use a stacked bar graph.
- Simple stacked bar graph: each segment is placed after the previous one. The total value of the bar is the sum of all the segment values. Ideal for comparing the total amounts across groups.
- 100% stacked bar graphs show the percentage of the whole for each group: each segment is plotted as the percentage of its value relative to the group total. This makes it easier to see the relative differences between quantities in each group.

---

## <span style="color:#88398A"> Categorical vs categorical, bars</span>

![](Part_I_ggplot2_files/figure-html/unnamed-chunk-14-1.png)<!-- -->

---

## <span style = "color:#883984">Static versus interactive</span>

.pull-left[
Two key components of an interactive visualization:

- Interaction within every individual visualization (mouse over, zoom, labels, etc.)
- Links between different graphics

Additionally, it should be possible to control dynamic rotations in higher dimensions.
]

.pull-right[
]

---

## <span style="color:#88398A"> Some Examples</span>

Percentage of students who dropped out of the first year of high school in Uruguay, 2016

<img src="Part_I_ggplot2_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

---

## <span style="color:#88398A"> Some Examples</span>

<img src = "pairs_hex.png" width = "500" height = "500" class = "center">

---

## <span style="color:#88398A"> Some Examples</span>

<iframe class="vimeo-embed" src="https://player.vimeo.com/video/222028803" width="600" height="600" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen> </iframe>

---

## <span style = "color:#883984">Effective Visualization</span>

- Not all visualizations are equally effective
- There are different criteria for evaluating graphs (Cleveland, Tufte, Carr, Wainer, etc.)
- Let's see a set of criteria for evaluating graphics

---

## <span style = "color:#883984">Tufte's rules</span>

Tufte developed guidelines for constructing graphics, and we can use them as criteria for evaluating graphics

---

## <span style = "color:#883984">Tufte's rules</span>

1. Show the data
2. Induce the viewer to think about the data
3. Avoid distorting what the data have to say
4. Present many numbers in a small space
5. Make large data sets coherent
6. Reveal the data at several levels of detail
7. Serve a reasonably clear purpose
8. Be closely integrated with the statistical and verbal descriptions of the data

---

## <span style = "color:#883984">1. Show the data</span>

.pull-left[
<img src = "showdata.png" width = 400 class = "center">
]

.pull-right[
What is the data in this picture?

- Data: addresses of cholera deaths, locations of water pumps
- Supporting structure: street map
- Improvement: de-emphasize supporting structure (e.g. 
by using a lighter shade of grey)
]

---

## <span style = "color:#883984">De-emphasize grids</span>

![](Part_I_ggplot2_files/figure-html/unnamed-chunk-17-1.png)<!-- -->![](Part_I_ggplot2_files/figure-html/unnamed-chunk-17-2.png)<!-- -->

Dan Carr: background + pale grid + dark data marks

This sets the plot off from the page and makes the grey scale equivalent to a text block of the same size

---

## <span style = "color: #883984"> 3. Avoid distorting what the data have to say</span>

.pull-left[
What's the data?

<img src = "distortdata.png" width = 600 class = "center">
]

.pull-right[
- Year and fuel economy standard
- Represented by a timeline and line segments
]

---

## <span style = "color: #883984"> Lie Factor (Tufte)</span>

.pull-left[
`\(\frac{\text{Size of effect shown in graphic}}{ \text{Size of effect in data}}\)`

Should be close to 1
]

.pull-right[
Fuel economy example:

- Data: `\(\frac{27.5-18.0}{ 18.0}*100 = 53\%\)`
- Graphic: `\(\frac{5.3-0.6}{0.6}*100 = 783\%\)`
- Lie factor: `\(\frac{783}{53} = 14.8 \gg 1\)`

Huge!
]

---

## <span style = "color:#883984">4. Present many numbers in a small space</span>

- Data-ink ratio (Tufte)
- Divide the total ink used to draw the data by the total ink used to draw the graphic.
- How do you calculate this? Not easily! Generally not possible in practice

---

## <span style = "color:#883984">Induce the viewer to think about the data</span>

How do you do this?

- 5 Make large data sets coherent
- 6 Reveal the data at several levels of detail
- 7 Serve a reasonably clear purpose
- 8 Be closely integrated with the statistical and verbal descriptions of the data

---

## <span style = "color:#883984">Cleveland's principles of graphical construction </span>

Cleveland’s principles of graphical construction primarily concern statistical plots of data for a scientific audience.

The two overarching principles are:

- Make the data stand out
- Avoid superfluity

---

## <span style = "color:#883984"> Cleveland's principles</span>

- Clear vision/understanding
- Use of guides: scales, axes, tick marks, grid lines, legends
- Extensive captions
- Aspect ratio: scale of horizontal to vertical

---

## <span style = "color:#883984">Effective Visualization, Cleveland</span>

- Based on graphical perception studies
- The best visualizations are the ones that rely on "pre-attentive" vision (instant, without apparent effort) [Cleveland and McGill (1985)](#bib-cleveland).

---

## <span style = "color:#883984">Cleveland, graphical perception</span>

<img src = "cleveland.png" width = 600 class = "center">

---

## <span style = "color:#883984"> How to decode a graphic?</span>

- When a person looks at a graph, the information is visually decoded by the person's visual system.
- A graphical method is successful only if the decoding is effective.
- Good visualizations are the ones that make the best use of the human visual system
- If this is true, we should know how the human visual system decodes a graph

<!-- % Three visual operations of pattern perception according to Cleveland: -->
<!-- % \begin{itemize} -->
<!-- % \item \textbf{Detection:} visual recognition of a geometric object that encodes a physical value -->
<!-- % \item \textbf{Assembly:} the grouping of detected graphical elements -->
<!-- % \item \textbf{Estimation:} the visual assessment of the relative values of 2 or more quantitative values -->
<!-- % %(greater, smaller, different, ratios, etc.) -->
<!-- % \end{itemize} -->

---

## <span style="color:#88398A"> Comparing quantitative variables</span>

Cleveland provides an ordering of the elementary tasks for the graphical perception of quantitative information

1. Position along a common scale
2. Position on identical but nonaligned scales
3. Length
4. Angle/slope
5. Area
6. Volume, density or color saturation
7. 
Color Hue

---

## <span style="color:#88398A"> How to use this?</span>

- We should identify the most important comparison we want to make among our quantitative variables
- We should encode the most important comparison using the graphical elements at the top of the list (1. position along a common scale)

---

## <span style="color:#88398A">Color hue</span>

![](Part_I_ggplot2_files/figure-html/unnamed-chunk-18-1.png)<!-- -->

<!-- <img src = "clplot.png" width = 800> -->

---

## <span style="color:#88398A"> Length</span>

![](Part_I_ggplot2_files/figure-html/unnamed-chunk-19-1.png)<!-- -->

---

## <span style="color:#88398A"> Position along a common scale</span>

![](Part_I_ggplot2_files/figure-html/unnamed-chunk-20-1.png)<!-- -->

---

## <span style="color:#88398A"> Miles per gallon: comparing elementary tasks</span>

![](Part_I_ggplot2_files/figure-html/unnamed-chunk-21-1.png)<!-- -->

---

## <span style="color:#88398A"> Miles per gallon: comparing elementary tasks</span>

![](Part_I_ggplot2_files/figure-html/unnamed-chunk-22-1.png)<!-- -->

---

## <span style="color:#88398A"> Angle or slope </span>

- Eye color in Star Wars characters

![](Part_I_ggplot2_files/figure-html/unnamed-chunk-23-1.png)<!-- -->

---

## <span style="color:#88398A">Angle or slope</span>

- Pie charts are always an error!!!
- But now we know why
- Decoding quantitative variables from angles is perceptually more difficult than decoding them from other graphical elements
- Always use a bar graph instead of a pie chart

---

## <span style="color:#88398A">Visualization to communicate information...</span>

<img src = "ej1.png" width =700>

---

## <span style="color:#88398A">Criteria for Evaluating Graphics</span>

- Context (Rhetoric):
    - What is the main message? Sub-messages? Story.
    - Why/when was it produced?
    - Who’s the audience?
- Content (Aesthetic):
    - What are the pieces of information?
    - How is the information coded into the graphic?
    - What conventions are used? What is unconventional?
    - Is the data accurately represented? Lie factor, trustworthiness.
    - What is the ratio of data to ink in the plot? High, medium, low.
    - What’s missing?
- Perception (Perceptual)
    - How clearly is the information represented? What is emphasized, de-emphasized?
    - How is the viewer drawn in?
    - What is your overall impression, opinion?

---

## <span style="color:#88398A">Material to read about viz</span>

For more ideas about which plots to produce to answer the questions you are interested in: <a name=cite-robbins2012creating></a>[Robbins (2012)](#bib-robbins2012creating), <a name=cite-cleveland1993visualizing></a>[Cleveland (1993)](#bib-cleveland1993visualizing), <a name=cite-chambers2017graphical></a>[Chambers (2017)](#bib-chambers2017graphical) and [Tukey (1977)](#bib-tukey77)

---

class: inverse, center, middle

## ggplot2 and the Grammar of Graphics

---

## <span style="color:#88398A"> Why use ggplot2?</span>

- **ggplot2** is an R package for producing statistical, or data, graphics, developed by <a name=cite-wickham2016ggplot2></a>[Wickham (2016)](#bib-wickham2016ggplot2).
- It differs from most other graphics packages because it has a deep underlying grammar.
- This grammar is based on the Grammar of Graphics theory <a name=cite-wilkinson2006grammar></a>[Wilkinson (2006)](#bib-wilkinson2006grammar).
- It is the most used R package for visualization and is 10 years old
- [ggplot2: Elegant Graphics for Data Analysis](https://github.com/hadley/ggplot2-book)

---

## <span style="color:#88398A"> Grammar of Graphics </span>

The **Grammar of graphics** answers the following questions:

- What is a statistical graphic?
- How to describe a graph?
- How to create a graph?

"A statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars)"

---

## <span style="color:#88398A"> Grammar of Graphics </span>

- A statistic is a function of data, like the sample mean `\(\bar x = \sum_{i=1}^n \frac{x_i}{n}\)` or sample variance `\(S^2 = \sum_{i=1}^n \frac{(x_i- \bar x)^2}{n-1}\)`
- The grammar of graphics provides a tight connection between data and statistics.
- Key point about `ggplot2`: it makes plots another type of statistic.
- A plot is a function of the data, a mapping from **data** to aesthetic attributes of geometric objects.

---

## <span style="color:#88398A"> Why use the grammar of graphics? </span>

<!-- is made up of a set of independent components that can be composed in many different ways. This makes ggplot2 very powerful because you are not limited to a set of pre-specified graphics, but you can create new graphics that are precisely tailored for your problem. -->

.pull-left[
- To make visualizations we already know
- To create new visualizations
- To identify better graphs to visualize our data
]

.pull-right[
The limit is your imagination!

<img src = "imaginacion.png" width =200>
]

---

## <span style="color:#88398A">`ggplot2` </span>

- A set of independent components, which gives flexibility
- Not limited to predetermined plots; you can create what you want
- Based on a set of principles, so it is easy to learn
- The graphs are easily reproducible
- You can make publication-quality graphs in a short time
- Designed to work iteratively, based on layers

---

## <span style="color:#88398A">Grammar of graphics </span>

- **data**: with a set of aesthetic mappings (aes) describing how variables in the data are mapped to aesthetic attributes that you can perceive.
- **layers**: geometric elements (**geoms**: points, lines, polygons, text, ...) and statistical transformations that summarize data in many useful ways (**stats**: identities, counts, bins, ...)
- **scales**: map values in the data space to values in an aesthetic space (e.g. color, size, shape or position). Scales draw a legend or axes.
- **coord**: describes how data coordinates are mapped to the plane of the graphic. Normally Cartesian, but pie charts, for example, use polar coordinates
- **facetting**: how to break up the data into subsets and how to display those subsets as small multiples.
- **theme**: controls the finer points of display, like the font size and background colour

---

## <span style="color:#88398A">Install ggplot2</span>

- Installing ggplot2

```r
install.packages("ggplot2")
```

- Load the package with `library(ggplot2)`. Note that R packages are community-contributed and can change in future versions.

Development version available on GitHub: https://github.com/tidyverse/ggplot2

```r
install.packages("devtools")
library(devtools)
install_github("tidyverse/ggplot2")
```

- `ggplot2` is part of `tidyverse`

---

## <span style="color:#88398A">`ggplot2` help</span>

mailing list: http://groups.google.com/group/ggplot2

stackoverflow: http://stackoverflow.com

---

<!-- tips <- read_csv("http://www.ggobi.org/book/data/tips.csv") -->
<!-- tips <- read_csv("tips.csv") -->

## <span style = "color:#88398A"> Tip Example </span>

```r
# load the data
library(readr)  # provides read_csv()
tips <- read_csv("http://www.ggobi.org/book/data/tips.csv")
head(tips)
```

```
## # A tibble: 6 x 8
##     obs totbill   tip sex   smoker day   time   size
##   <int>   <dbl> <dbl> <chr> <chr>  <chr> <chr> <int>
## 1     1    17.0  1.01 F     No     Sun   Night     2
## 2     2    10.3  1.66 M     No     Sun   Night     3
## 3     3    21.0  3.5  M     No     Sun   Night     3
## 4     4    23.7  3.31 M     No     Sun   Night     2
## 5     5    24.6  3.61 F     No     Sun   Night     4
## 6     6    25.3  4.71 M     No     Sun   Night     4
```

---

## <span style="color:#88398A"> Three components of every plot </span>

- **data**: the data to visualize
- **aes**: a set of aesthetic mappings between variables in the data and visual properties (e.g. color, size, etc.)
- **layer**: at least one layer describing how to render each observation. Each layer is created with a **geom** function.
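
The three components can be sketched in a few lines. This is only a minimal illustration: `tips_small` is a hypothetical stand-in built from the six rows printed by `head(tips)` above (values as rounded in that printout), so that the example runs without downloading the full data set.

```r
library(ggplot2)

# Hypothetical six-row stand-in for the tips data (values as printed above).
tips_small <- data.frame(
  totbill = c(17.0, 10.3, 21.0, 23.7, 24.6, 25.3),
  tip     = c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71),
  sex     = c("F", "M", "M", "M", "F", "M"),
  size    = c(2L, 3L, 3L, 2L, 4L, 4L)
)

p <- ggplot(data = tips_small,                      # data
            mapping = aes(x = totbill, y = tip)) +  # aesthetic mappings
  geom_point()                                      # one layer: points
```

Printing `p` draws the scatterplot. Replacing `tips_small` with the full `tips` data would give the plot discussed on the following slides.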

---

## <span style="color:#88398A">Three components of every plot</span>

- **data**: tips
- **aes**: totbill mapped to the `x` position, tip to the `y` position.
- **layer**: points with `geom_point`.

Let's make our first plot!

---

## <span style="color:#88398A">Three components of every plot</span>

<img src="Part_I_ggplot2_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" />

Aspect ratio: ratio between the width and the height of a rectangle

---

## <span style="color:#88398A">Three components of every plot</span>

Overplotting

<img src="Part_I_ggplot2_files/figure-html/unnamed-chunk-27-1.png" style="display: block; margin: auto;" />

---

## <span style="color:#88398A"> Tip example</span>

What do you see?

- A weak, linear relationship between tip and total bill
- A lot of variability
- Horizontal lines indicate a preference for tips in whole-dollar amounts

---

## <span style="color:#88398A">Color, size, shape and other aes</span>

To include other variables, we can use other **aes** (color, shape, size)

```r
aes(x = totbill, y = tip, colour = sex)
aes(x = totbill, y = tip, shape = sex)
aes(x = totbill, y = tip, size = size)
```

---

## <span style="color:#88398A">Color </span>

<img src="Part_I_ggplot2_files/figure-html/unnamed-chunk-28-1.png" style="display: block; margin: auto;" />

---

## <span style="color:#88398A">Fixed color </span>

To set a fixed color, specify it outside `aes()` (as an argument of the layer), or use `I('blue')` inside `aes()`

<img src="Part_I_ggplot2_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" />

---

## <span style="color:#88398A"> shape </span>

<img src="Part_I_ggplot2_files/figure-html/unnamed-chunk-30-1.png" style="display: block; margin: auto;" />

---

## <span style="color:#88398A">size</span>

<img src="Part_I_ggplot2_files/figure-html/unnamed-chunk-31-1.png" style="display: block; margin: auto;" />

---

## <span style="color:#88398A">size</span>

<img src="Part_I_ggplot2_files/figure-html/unnamed-chunk-32-1.png" 
style="display: block; margin: auto;" />

---

## <span style="color:#88398A"> Facets </span>

- We can display additional categorical variables in a graph by subsetting the graphical display.
- Facets create a table of plots by subsetting the data and partitioning the graphical display.
- Two types: `facet_grid` and `facet_wrap`

---

## <span style="color:#88398A"> `facet_wrap` </span>

<img src="Part_I_ggplot2_files/figure-html/unnamed-chunk-33-1.png" style="display: block; margin: auto;" />

---

## <span style="color:#88398A">`facet_grid`</span>

<img src="Part_I_ggplot2_files/figure-html/unnamed-chunk-34-1.png" style="display: block; margin: auto;" />

---

## <span style="color:#88398A">Facetting</span>

With `facet_wrap()` and `facet_grid()` you can control whether the position scales are the same in all panels (fixed) or allowed to vary between panels (free) with the `scales` parameter:

- `scales = "fixed"`: x and y scales are fixed across all panels.
- `scales = "free_x"`: the x scale is free, and the y scale is fixed.
- `scales = "free_y"`: the y scale is free, and the x scale is fixed.
- `scales = "free"`: x and y scales vary across panels.

Fixed scales make it easier to see patterns across panels; free scales make it easier to see patterns within panels.

---

## <span style="color:#88398A">Other geoms</span>

If we substitute `geom_point()` with another `geom` we get a different visualization.

Most common `geoms`:

- `geom_smooth()`
- `geom_boxplot()`
- `geom_histogram()`
- `geom_bar()`
- `geom_path()` and `geom_line()`

Each `geom` is associated with particular geometric elements

---

## <span style="color:#88398A"> More geoms</span>

http://ggplot2.tidyverse.org/reference/

---

## <span style="color:#88398A"> Extensions</span>

More than 40 `ggplot2` extensions

http://www.ggplot2-exts.org/gallery/

---

## <span style="color:#88398A">Include labs</span>

<img src="Part_I_ggplot2_files/figure-html/unnamed-chunk-35-1.png" style="display: block; margin: auto;" />

---

## <span style="color:#88398A"> `geom` Examples</span>

```r
p <- ggplot(tips, aes(x = day, y = tip))
```

- `p`
- `p + geom_point()`
- `p + geom_boxplot()`
- `p + geom_violin()`

![](Part_I_ggplot2_files/figure-html/unnamed-chunk-37-1.png)<!-- -->

---

## <span style="color:#88398A"> `geom`, Example </span>

```r
p <- ggplot(tips, aes(x = day, fill = smoker))
```

- `p + geom_bar()`
- `p + geom_bar(position="stack")`
- `p + geom_bar(position="dodge")`
- `p + geom_bar(position="fill")`

![](Part_I_ggplot2_files/figure-html/unnamed-chunk-39-1.png)<!-- -->

---

## <span style="color:#88398A">scales </span>

- A scale controls the mapping from data to aesthetic attributes, and we need a scale for every aesthetic used in a plot.
- We need to convert values from data units (totbill, sex, etc.) to graphical units (color, shape, etc.)
- This conversion process is called scaling and is performed by `scales`
- Each scale operates across all the data in the plot, ensuring a consistent mapping from data to aesthetics.
- `colors` are represented by a six-character hexadecimal string, `sizes` by a number and `shapes` by an integer

---

## <span style="color:#88398A">scales </span>

You can generate many plots without knowing how scales work, but understanding scales and learning how to manipulate them will give you much more control

---

## <span style="color:#88398A">scales </span>

- The aesthetic mapping only says that a variable is mapped to an aesthetic element; it doesn't say how the mapping is done.
- Mapping a variable to `shape` with `aes(shape = x)` doesn't specify which shapes are used.
- Likewise, `aes(color = z)` doesn't specify which colors are used
- Choosing the specific colors, shapes, sizes, etc. is done with `scale` functions

---

## <span style="color:#88398A">scales</span>

- `color` and `fill`
- `size`
- `shape`
- `linetype`

Scales are modified with a family of functions named `scale_<aesthetic>_<type>`. Type `scale_<tab>` to see the list of `scale` functions.

---

## <span style="color:#88398A">Available `scales`</span>

<img src="summary.png" width="700">

---

## <span style="color:#88398A">Color Scales</span>

- Colors are controlled through scales
- `scale_colour_discrete` (`scale_colour_hue`) and `scale_colour_continuous` (`scale_colour_gradient`) are the default choices for factor variables and numeric variables, respectively
- We can change the parameters of the default scale, or we can change the scale function

---

`scale_colour_gradient(..., low = "#132B43", high = "#56B1F7", space = "Lab", na.value = "grey50", guide = "colourbar")`

<img src = "colorgradient.png" width = 200 class = "center">

- Colors can be specified by hex code, by name or through `rgb()`
- The gradient goes from low to high, which should match the interpretation of the mapped variable

---

`scale_colour_gradient2(..., low = muted("red"), mid = "white", high = muted("blue"), midpoint = 0, space = "Lab", na.value = "grey50", guide = "colourbar")`

<img src = "colorgradient2.png" width = 200 class = "center">

- `midpoint` is the value of the ‘neutral’ color
- gradient2 is a diverging color scheme
- It best matches a variable that goes from large negative to zero to large positive (or below mean, above mean)

---

`scale_color_hue(..., h = c(0, 360) + 15, c = 100, l = 65, h.start = 0, direction = 1, na.value = "grey50")`

<img src = "colorhue.png" width = 200 class = "center">

- Uses hue, chroma and luminance (=value)
- Each level of a variable is assigned a different level of `h`

---

`scale_colour_brewer(..., type = "seq", palette = 1, direction = 1)`

<img src = "colorbrewer.png" width = 200 class = "center">

- Brewer schemes are defined in RColorBrewer (Neuwirth, 2014)
- Palettes can be specified by name or index
- See also http://colorbrewer2.org/ (Brewer et al 2002)

---

## <span style="color:#88398A">Color Brewer Schemes</span>

There are 3 types of palettes: sequential, diverging, and qualitative.

1. **Sequential** schemes are suited to ordered data that progress from low to high, with light colors for low data values and dark colors for high data values
2. **Qualitative** schemes do not imply magnitude differences between legend classes, and hues are used to create the primary visual differences between classes. Qualitative schemes are best suited to representing nominal or categorical data.
3. **Diverging** schemes put equal emphasis on mid-range critical values and extremes at both ends of the data range.
The middle of the legend is emphasized with light colors, and the low and high extremes are emphasized with dark colors.

---
## <span style="color:#88398A"> Color Brewer Schemes</span>

`RColorBrewer` provides color schemes for discrete variables

.left-column[
`display.brewer.all()`

- Sequential
- Qualitative
- Diverging
]

.right-column[
![](Part_I_ggplot2_files/figure-html/unnamed-chunk-40-1.png)<!-- -->
]

---
## <span style="color:#88398A"> Color & Fill</span>

- Each Brewer palette has a fixed maximum number of colors; `Set2`, for example, has 8
- Asking for more colors than the palette provides triggers ggplot warnings and invalidates the plot
- Larger palettes can be produced by interpolating an existing one with the constructor function `colorRampPalette()` (from base R's `grDevices`), applied to the colors returned by `RColorBrewer::brewer.pal()`
- The function it returns builds a palette with an arbitrary number of colors by interpolating the existing palette

---
## <span style="color:#88398A"> Color & Fill</span>

Select `Set1` palette

![](Part_I_ggplot2_files/figure-html/unnamed-chunk-41-1.png)<!-- -->

---
## <span style="color:#88398A"> Color & Fill</span>

Select `Set1` palette

```r
ggplot(data = tips) +
  geom_bar(aes(factor(round(tip)), fill = factor(round(tip)))) +
  scale_fill_brewer(palette = "Set1") +
  labs(x = "Tip in USD", fill = "Tip")
```

---
## <span style="color:#88398A"> Color & Fill</span>

Use `colorRampPalette(brewer.pal(9, "Set1"))`

![](Part_I_ggplot2_files/figure-html/unnamed-chunk-43-1.png)<!-- -->

---
## <span style="color:#88398A"> Color & Fill</span>

Use `colorRampPalette(brewer.pal(9, "Set1"))`

```r
library(RColorBrewer)  # provides brewer.pal()

# getPalette(n) interpolates the 9 colors of Set1 to n colors
getPalette <- colorRampPalette(brewer.pal(9, "Set1"))

ggplot(data = tips) +
  geom_bar(aes(factor(round(tip)), fill = factor(round(tip)))) +
  scale_fill_manual(values = getPalette(10)) +
  labs(x = "Tip in USD", fill = "Tip")
```

---
## <span style="color:#88398A"> Color & Fill</span>

- Area plots use `fill` to map values to the fill color
- only discrete color scales can be used: `scale_fill_hue`, `scale_fill_brewer`, `scale_fill_grey`, ...
- `scale_fill_manual(..., values)`: `values` is a vector of colors, and it must list at least as many colors as the variable has levels

---
## <span style="color:#88398A"> Shape </span>

- Point shape
- `scale_shape_continuous()`, `scale_shape_discrete()`, `scale_shape_manual()`

<img src = "shape2.png" width = 300 class = "center">

---
## <span style="color:#88398A"> Shape </span>

`scale_shape_manual(values = c(8, 15))`

<img src="Part_I_ggplot2_files/figure-html/unnamed-chunk-45-1.png" style="display: block; margin: auto;" />

---
## <span style="color:#88398A">Include labs </span>

Previously we used `labs`, but the long way is...

<img src="Part_I_ggplot2_files/figure-html/unnamed-chunk-46-1.png" style="display: block; margin: auto;" />

---
## <span style="color:#88398A"> Output</span>

- `ggsave` selects the graphics device based on the file extension
- `ggplot(tips, aes(totbill, tip, colour = smoker)) + geom_point() + theme(aspect.ratio = 1)`
- `ggsave("ppt123.png")` # png (pixelated raster image)
- `ggsave("ppt123.pdf")` # pdf (scalable vector image)

---
## <span style="color:#88398A"> Grammar of Graphics</span>

```r
ggplot(tips, aes(totbill, tip, colour = smoker)) +
  geom_point() +
  theme(aspect.ratio = 1)
```

<img src="Part_I_ggplot2_files/figure-html/unnamed-chunk-47-1.png" style="display: block; margin: auto;" />

---
## <span style="color:#88398A"> Behind this plot</span>

- Each observation is represented as a point whose position is determined by two variables (horizontal and vertical position)
- Each point has a size, a color and a shape, called aesthetic elements (`aes`)
- `aes` are properties that can be seen in the plot. Each `aes` can be mapped to a variable or fixed to a constant value
- `totbill` is mapped to the horizontal position, `tip` to the vertical position and `smoker` to color.
Size and shape are not mapped to variables, so they take their default values.

---
## <span style="color:#88398A">Behind this plot</span>

```
## # A tibble: 3 x 3
##   totbill   tip smoker
##     <dbl> <dbl> <chr>
## 1    17.0  1.01 No
## 2    10.3  1.66 No
## 3    21.0  3.5  No
```

- New data: the result of mapping the original data to the aesthetic elements

|x | y | colour|
|---|---|-------|
|17.0| 1.01| No|
|10.3| 1.66| No|
|21.0| 3.5| No|

---
## <span style="color:#88398A"> Layers in a plot</span>

- The data, aesthetic mapping, geometric objects and statistical transformations define a **layer**
- We can define a graphic with multiple layers

---
## <span style="color:#88398A">Layers in a graphic</span>

The layered grammar defines the components of a graphic:

- a dataset and a set of mappings from variables to aesthetic elements
- one or more layers; each layer has a geometric element, a statistical transformation, a position adjustment and, optionally, its own data and `aes`

---
## <span style="color:#88398A">Layers in a graphic</span>

```r
ggplot() +
  layer(
    data = tips, mapping = aes(x = totbill, y = tip),
    geom = "point", stat = "identity", position = "identity"
  ) +
  scale_x_continuous() +
  scale_y_continuous() +
  coord_cartesian()
```

Equivalent to:

```r
ggplot(data = tips, aes(x = totbill, y = tip)) +
  geom_point()
```

---
## <span style="color:#88398A">Layers in a graphic</span>

More than one data set:

```r
ggplot() +
  geom_point(data = tips, aes(x = totbill, y = tip)) +
  geom_point(data = data.frame(x = 30, y = 6), aes(x, y),
             color = "red", size = 10)
```

<img src="Part_I_ggplot2_files/figure-html/unnamed-chunk-51-1.png" style="display: block; margin: auto;" />

---
## <span style="color:#88398A">Layers in a graphic</span>

Layers in a bar graph

```r
p1 <- ggplot() +
  layer(
    data = tips, mapping = aes(x = day, y = ..prop.., group = 1),
    geom = "bar", stat = "count", position = "identity"
  ) +
  scale_x_discrete() +
  scale_y_continuous() +  # the `+` is needed so coord_cartesian() joins the plot
  coord_cartesian()
```

Equivalent to:

```r
p1 <- ggplot(data = tips, aes(x = day, y = ..prop.., group = 1)) +
  geom_bar()
```

---
## <span style="color:#88398A">Bars in proportion</span>

```r
ggplot_build(p1)$data[[1]]
```

```
##            y count       prop x group PANEL ymin       ymax xmin xmax
## 1 0.07786885    19 0.07786885 1     1     1    0 0.07786885 0.55 1.45
## 2 0.35655738    87 0.35655738 2     1     1    0 0.35655738 1.55 2.45
## 3 0.31147541    76 0.31147541 3     1     1    0 0.31147541 2.55 3.45
## 4 0.25409836    62 0.25409836 4     1     1    0 0.25409836 3.55 4.45
##   colour   fill size linetype alpha
## 1     NA grey35  0.5        1    NA
## 2     NA grey35  0.5        1    NA
## 3     NA grey35  0.5        1    NA
## 4     NA grey35  0.5        1    NA
```

---
## <span style="color:#88398A">histogram </span>

```r
ggplot(data = tips, aes(x = totbill, y = ..density..)) +
  geom_histogram()
```

![](Part_I_ggplot2_files/figure-html/unnamed-chunk-56-1.png)<!-- -->

---
## <span style="color:#88398A">themes </span>

- Themes control every non-data aspect of a plot
- `theme` gives you control of fonts, background, tick marks, etc.
- Two of the pre-defined themes: `theme_grey` (default) and `theme_bw`
- Use `theme_set` to apply a theme to every future plot, e.g. `theme_set(theme_bw())`
- The `ggthemes` package defines additional themes; `library(help = "ggthemes")` lists them all

---
## <span style="color:#88398A"> Element </span>

- You can also make your own theme, or modify an existing one.
- Themes are made up of elements, which can be one of:
  - `element_line`, `element_text`, `element_rect`, `element_blank`
- This gives you a lot of control over the plot's appearance.

---
## <span style="color:#88398A"> Element </span>

- Axis: `axis.line`, `axis.text.x`, `axis.text.y`, `axis.ticks`, `axis.title.x`, `axis.title.y`
- Legend: `legend.background`, `legend.key`, `legend.text`, `legend.title`
- Panel: `panel.background`, `panel.border`, `panel.grid.major`, `panel.grid.minor`
- Strip (facetting): `strip.background`, `strip.text.x`, `strip.text.y`

For a complete overview see `?theme`.

---
## <span style="color:#88398A"> theme</span>

Let's do this plot!
![](Part_I_ggplot2_files/figure-html/unnamed-chunk-57-1.png)<!-- -->

---
## <span style="color:#88398A"> theme</span>

Let's do this plot!

```r
ggplot(data = tips, aes(x = totbill, y = tip, colour = sex)) +
  geom_point() +
  theme(aspect.ratio = 1,
        legend.position = "bottom",
        panel.background = element_rect(fill = "white"),
        panel.grid = element_line(colour = "grey92"),
        panel.border = element_rect(colour = "grey20", fill = NA),
        legend.key = element_rect(fill = "white"))
```

---
## <span style="color:#88398A"> theme</span>

Let's do this plot!

![](Part_I_ggplot2_files/figure-html/unnamed-chunk-59-1.png)<!-- -->

---
## <span style="color:#88398A"> theme</span>

Let's do this plot!

```r
ggplot(data = tips, aes(x = totbill, y = tip, colour = sex)) +
  geom_point() +
  theme(aspect.ratio = 1,
        legend.position = "bottom",
        panel.background = element_rect(fill = "white"),
        panel.grid = element_line(colour = "grey92"),
        panel.border = element_rect(colour = "grey20", fill = NA),
        legend.key = element_rect(fill = "white"),
        # larger text for axes and legend
        axis.text.x = element_text(size = 20),
        axis.text.y = element_text(size = 20),
        axis.title = element_text(size = 30),
        legend.text = element_text(size = 20),
        legend.title = element_text(size = 20))
```

---
## <span style="color:#88398A"> Activity</span>

https://natydasilva.github.io/CODATA_Activity/#1

---
## <span style="color:#88398A"> CC licence</span>

<img src="cc.png" width="200">

---
# References

<a name=bib-anscombe></a>[Anscombe, F. J.](#cite-anscombe) (1973). "Graphs in statistical analysis". In: _The American Statistician_.

<a name=bib-chambers2017graphical></a>[Chambers, J. M.](#cite-chambers2017graphical) (2017). _Graphical Methods for Data Analysis_. Chapman and Hall/CRC.

<a name=bib-cleveland></a>[Cleveland, W. S. and R. McGill](#cite-cleveland) (1985). "Graphical perception and graphical methods for analyzing scientific data".

<a name=bib-cleveland1993visualizing></a>[Cleveland, W. S.](#cite-cleveland1993visualizing) (1993). _Visualizing data_. Vol.
2. Hobart Press Summit, NJ.

<a name=bib-gelman2002let></a>[Gelman, A, C. Pasarica and R. Dodhia](#cite-gelman2002let) (2002). "Let's practice what we preach: turning tables into graphs". In: _The American Statistician_ 56.2, pp. 121-130.

<a name=bib-matejka2017same></a>[Matejka, J. and G. Fitzmaurice](#cite-matejka2017same) (2017). "Same stats, different graphs: generating datasets with varied appearance and identical statistics through simulated annealing". In: _Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems_. ACM, pp. 1290-1294.

<a name=bib-robbins2012creating></a>[Robbins, N. B.](#cite-robbins2012creating) (2012). _Creating more effective graphs_. Wiley.

<a name=bib-tukey77></a>[Tukey, J. W.](#cite-tukey77) (1977). "Exploratory Data Analysis".

<a name=bib-wickham2016ggplot2></a>[Wickham, H.](#cite-wickham2016ggplot2) (2016). _ggplot2: elegant graphics for data analysis_. Springer.

<a name=bib-wilkinson2006grammar></a>[Wilkinson, L.](#cite-wilkinson2006grammar) (2006). _The grammar of graphics_. Springer Science & Business Media.