Data Visualization

Natalia da Silva


Instituto de Estadística-FCEA-UDELAR

1 / 136

About me

  • Assistant professor, Instituto de Estadística-Universidad de la República (IESTA-UDELAR), Montevideo, Uruguay.

  • PhD and MSc in Statistics from Iowa State University, USA

  • Interests: supervised learning methods, computational statistics, visualization and meta-analysis

  • Co-founder of R-Ladies Ames, R-Ladies Montevideo and GURU::MVD

  • Contact info: natalia@iesta.edu.uy, @pacocuak, http://natydasilva.com

2 / 136

https://natydasilva.github.io/CODATA/#1

3 / 136

About this workshop

  • Why do we use visualization?

  • EDA, types of variables and visualization examples

  • Ideas for an effective visualization

  • Why use ggplot2?

  • Grammar of graphics

  • Examples

4 / 136

Importance of data visualization

"The greatest value of a picture is when it forces us to notice what we never expected to see." Tukey (1977)

"Graphs provide powerful tools both for analyzing scientific data and for communicating quantitative information" Cleveland and McGill (1985)

5 / 136

Few people escape these ideas

  1. Numerical calculations are exact, but graphs are rough

  2. For any particular kind of statistical data there is just one set of calculations constituting a correct statistical analysis

  3. Performing intricate calculations is virtuous, whereas actually looking at the data is cheating. Anscombe (1973)

6 / 136

Statistical Visualization

Visualization plays an important role in all the stages of the statistical analysis.

  • Initial Exploration: To find general and specific patterns in the data.

  • Models: Check the assumptions about the data before fitting a model.

  • Diagnostics: Visualize the model in the data space or the data in the model space.
7 / 136

EDA

Exploratory data analysis (EDA) is an iterative process for exploring different aspects of the data broadly.

Origins: John Tukey encouraged statisticians to explore the data and possibly formulate hypotheses that could lead to new data collection and experiments.

  • It helps us understand our data

  • We need to define questions to guide our research

  • Each question focuses on something specific about the data, and this determines the type of data, the model and possible transformations

  • Key point: define interesting questions for our problem

8 / 136

What is the relationship between X and Y?

X Y
1.972 1.236
1.112 1.994
0.000 1.009
0.665 1.942
0.235 0.356
0.247 1.658
1.275 1.961
0.702 0.045
1.760 0.350
1.691 0.277
1.628 1.778
1.957 1.290
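
Plotting the twelve pairs answers the question much faster than reading the table. The following is a minimal base R sketch with the values copied from the table; the centre (1, 1) used in the last step is an assumption made after looking at the plot, not something stated in the slides.

```r
# The twelve (X, Y) pairs from the table
x <- c(1.972, 1.112, 0.000, 0.665, 0.235, 0.247,
       1.275, 0.702, 1.760, 1.691, 1.628, 1.957)
y <- c(1.236, 1.994, 1.009, 1.942, 0.356, 1.658,
       1.961, 0.045, 0.350, 0.277, 1.778, 1.290)

# asp = 1 draws one unit with the same length on both axes
plot(x, y, asp = 1, pch = 19)

# Distance of every point from the assumed centre (1, 1)
radius <- sqrt((x - 1)^2 + (y - 1)^2)
round(range(radius), 2)
```

Every distance comes out very close to 1, so the points sit on a circle: a clear relationship that is hard to see in the table and that a linear summary would not describe.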
9 / 136

Why use visualization?

10 / 136

  • Statisticians recommend graphical displays but often do not follow this recommendation when presenting their own research.

  • The authors analyze some papers in JASA and show examples of tables turned into plots

  • They note there is a good reason to be lazy: it takes a lot of work to make a good visualization

  • Nice graphs are possible, especially when we think hard about why we want to display these numbers in the first place

  • If the world’s leading statistical journal doesn’t do it right, there is obviously still room for progress

Gelman, Pasarica, and Dodhia (2002)

11 / 136

What is the sample mean value for x and y?

\(\bar x =\) 54.27 and \(\bar y =\) 47.83

12 / 136

What is the sample mean value for x and y?

\(\bar x =\) 54.27 and \(\bar y =\) 47.84

13 / 136

What is the sample mean value for x and y?

\(\bar x =\) 54.26 and \(\bar y =\) 47.83

14 / 136

Why use visualization?

  • Same statistical summaries

  • Very different distributions

  • Link to the algorithm: Matejka and Fitzmaurice (2017).

15 / 136

Why use visualization?

  • Graphs provide more information than numerical summaries
  • Anscombe’s quartet, Anscombe (1973)
  • \(n = 11\)
  • \(\bar x= 9.0\)
  • \(\bar y = 7.5\)
  • \(\hat \beta_1= 0.5\)
  • \(y = 3 + 0.5 x\)
  • \(R^2 = 0.667\)
  • \(....\)
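
The shared summaries can be checked directly, since the quartet ships with base R as the `anscombe` data frame (columns `x1` to `x4` and `y1` to `y4`). The sketch below is one possible way to compute them; the helper names `stats`, `fit` and the `set1` to `set4` labels are illustrative choices.

```r
# Anscombe's quartet is part of base R (package datasets)
data(anscombe)

# One column of summaries per data set
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)  # simple linear regression of y on x
  c(n         = length(x),
    mean_x    = mean(x),
    mean_y    = mean(y),
    intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]),
    r_squared = summary(fit)$r.squared)
})
colnames(stats) <- paste0("set", 1:4)
round(stats, 2)
```

The four columns agree to two decimals, yet a scatterplot of each pair, for example `plot(anscombe$x2, anscombe$y2)`, looks completely different.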

16 / 136

Visualization is not new

Charles Minard, Napoleon’s Russian Campaign in 1812

17 / 136

Visualization is not new

Florence Nightingale, coxcomb diagram of the causes of mortality in the army in the East (1858)

18 / 136

Data Visualization

  • Data visualizations are based on data

  • Summarize information

  • Different types of visualization emphasize different aspects of the data

19 / 136

Information, data

  • Data contain cases and variables.

  • In tabular form, the rows are usually the cases and the columns the variables.

Variables can take different forms:

  • Continuous, Categorical, Temporal, Spatial
20 / 136

Types of data

Key point for visualization: understand the type of variables you have

  • Continuous variable: a variable that can take infinitely many values, like “time” or “weight”.

  • Discrete variable: a numerical variable that can only take a countable set of values. Example: “number of students in the class”

  • Categorical variable: a variable whose values can be put into groups or categories. Examples: hair color, dog breed

  • and more...

21 / 136

What are the most important types of data displays?

Impossible to answer, but ... some important types are:

  • dotplots, histograms, density plots, time series, bar charts (all one-dimensional)

  • scatterplots, side-by-side boxplots (two-dimensional), parallel coordinate plots, mosaic plots (higher dimensions)

  • maps (geographic information)

  • and more

22 / 136

Why so many different types?

  • Number and type of variables: different displays suit different numbers and types of variables

  • Emphasis: every chart shows a different aspect of the data

23 / 136

EDA rules?

There are no clear rules for EDA, but we can begin with some basic questions:

  • What is the variability in my data?

  • What is the distribution of each variable?

  • Is there any relationship between some selected variables?

  • How do these two variables covary?

24 / 136

Distribution

To decide which tools to use in EDA, we should first identify the type of the variables to analyze

Example:

  • Categorical variables: we can analyze the distribution using a bar chart

  • Continuous variables: we can analyze the distribution using a histogram

25 / 136

Bar chart versus histogram, starwars

Bar chart:

  • Variable: categorical/qualitative

  • Bars separated by equal space

  • Bars can be re-ordered, and should be, to improve the visualization

  • Parameter: none

Histogram:

  • Variable: continuous/quantitative

  • Bins adjacent

  • Bins can't be re-ordered

  • Parameter: bin width

26 / 136

All the same...?

  • Histogram, density plot, jittered dot plot and boxplot

  • Every plot emphasizes a different aspect of the data: skewness, modality, symmetry, gaps, outliers

  • There is no single right plot; all are useful

27 / 136

Covariation

  • Covariation is the joint variation for two variables

  • The best way to detect the covariation is to visualize the relationship between two variables

  • The type of variable is what defines the way to visualize covariation

28 / 136

Categorical vs continuous

  • It is common to explore the distribution of a continuous variable according to a categorical variable

  • Histograms or densities colored by a categorical variable let us compare the distributions for each category

29 / 136

Categorical vs continuous, boxplot and violin

Boxplot:

  • Shows the distribution of a continuous variable

  • Based on the five-number summary (min, Q1, Q2, Q3, max).

  • Outliers: observations below Q1 - 1.5 IQR or above Q3 + 1.5 IQR

Violin: a combination of a boxplot and a density plot
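
The quantities behind a boxplot can be computed by hand. The sketch below uses `mtcars$mpg`, a data set that ships with base R, as a stand-in continuous variable (it is not used in the slides), and `quantile()` with its default settings; base R's `boxplot()` computes the quartiles as hinges, so its fences can differ slightly.

```r
# mtcars ships with base R; mpg is a stand-in continuous variable
x <- mtcars$mpg

# Five-number summary: min, Q1, median (Q2), Q3, max
five <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))

# Fences at 1.5 IQR beyond the quartiles
iqr <- IQR(x)
fences <- c(lower = five[["25%"]] - 1.5 * iqr,
            upper = five[["75%"]] + 1.5 * iqr)

# Observations outside the fences are the ones drawn as individual points
outliers <- x[x < fences[["lower"]] | x > fences[["upper"]]]

five
fences
outliers
```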

30 / 136

Categorical vs continuous, boxplot and violin

31 / 136

Categorical vs categorical, bars

  • Bars are useful for describing categorical variables: for each category, the bar length is proportional to the number of observations in that category

  • To represent more than one categorical variable we can use a stacked bar graph.

  • Simple stacked bar graph: each segment is placed after the previous one, so the total height of the bar is the sum of all its segments. Ideal for comparing the total amounts across groups.

  • 100% stacked bar graph: shows each segment as a percentage of its group total. This makes it easier to see the relative differences between quantities within each group.

32 / 136

Categorical vs categorical, bars

33 / 136

Static versus interactive

An interactive visualization should provide two key components:

  • Interaction within every individual visualization (mouse over, zoom, labels, etc.)

  • Links between different graphics

Additionally, it should allow control of dynamic rotations in higher dimensions.

[Figure: interactive scatterplot of hwy against cty]
34 / 136

Some Examples

Percentage of students who drop out of the first year of high school in Uruguay, 2016

35 / 136

Some Examples

36 / 136

Some Examples

37 / 136

Effective Visualization

  • Not all the visualizations are equally effective

  • There are different criteria for evaluating graphs (Cleveland, Tufte, Carr, Wainer, etc.)

  • Let's look at a set of criteria for evaluating graphics

38 / 136

Tufte's rules

Tufte proposed a set of guidelines for constructing graphics, which we can also use as criteria for evaluating them

39 / 136

Tufte's rules

  1. Show the data

  2. Induce the viewer to think about the data

  3. Avoid distorting what the data have to say

  4. Present many numbers in a small space

  5. Make large data sets coherent

  6. Reveal the data at several levels of detail

  7. Serve a reasonably clear purpose

  8. Be closely integrated with the statistical and verbal descriptions of the data

40 / 136

1. Show the data

What is the data in this picture?

  • Data: addresses of deaths from cholera and locations of water pumps

  • Supporting structure: street map

  • Improvement: de-emphasize the supporting structure (e.g. by using a lighter shade of grey)
41 / 136

De-emphasize grids

Dan Carr: background + pale grid + dark data marks. This sets the plot off from the page and gives it a grey level equivalent to a text block of the same size

42 / 136

3. Avoid distorting what the data have to say

What's the data?

  • Year and Fuel economy standard
  • Represented by a timeline and line segment
43 / 136

Lie Factor (Tufte)

\(\frac{\text{Size of effect shown in graphic}}{ \text{Size of effect in data}}\)

Should be close to 1

Fuel economy example:

  • Data: \(\frac{27.5-18.0}{18.0} \times 100 = 53\%\)

  • Graphic: \(\frac{5.3-0.6}{0.6} \times 100 = 783\%\)

  • Lie factor: \(\frac{783}{53} = 14.8 \gg 1\). Huge!
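
The same arithmetic as a short R sketch. The helper `pct_change` is a hypothetical name introduced here; the four input values are the ones quoted on the slide.

```r
# Relative change between two values, in percent
pct_change <- function(from, to) (to - from) / from * 100

effect_data    <- pct_change(18.0, 27.5)  # change in the fuel economy standard
effect_graphic <- pct_change(0.6, 5.3)    # change in the size drawn in the graphic

# The lie factor is a plain ratio of the two effects, not a percentage
lie_factor <- effect_graphic / effect_data

round(c(data = effect_data, graphic = effect_graphic, lie_factor = lie_factor), 1)
```

A faithful graphic would give a ratio close to 1; here the graphic exaggerates the change by a factor of almost 15.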

44 / 136

4. Present many numbers in a small space

  • Data-ink ratio (Tufte)

  • Divide the total ink used to draw the data by the total ink used to draw the graphic.

  • How do you calculate this? Not easily! Generally not possible in practice

45 / 136

2. Induce the viewer to think about the data

How do you do this?

  • 5 Make large data sets coherent
  • 6 Reveal the data at several levels of detail
  • 7 Serve a reasonably clear purpose
  • 8 Be closely integrated with the statistical and verbal descriptions of the data
46 / 136

Cleveland's principles of graphical construction

Cleveland’s graphical construction concerns primarily statistical plots of data, for a scientific audience.

The two overarching principles are:

  • Make the data stand out

  • Avoid superfluity

47 / 136

Cleveland's principles

  • Clear vision/understanding

  • Use of guides: scales, axes, tick marks, grid lines, legends

  • Extensive captions

  • Aspect ratio: scale of horizontal to vertical

48 / 136

Effective Visualization, Cleveland

  • Based on graphical perception studies

  • The best visualizations are the ones that rely on "pre-attentive" vision (instant, without apparent effort), Cleveland and McGill (1985).

49 / 136

Cleveland, graphical perception

50 / 136

How to decode a graphic?

  • When a person looks at a graph, the information is decoded by the person's visual system.

  • A graphical method is successful only if the decoding is effective.

  • Good visualizations are the ones best suited to the human visual system

  • If so, we need to know how the human visual system decodes a graph

51 / 136

Comparing quantitative variables

Cleveland provides an order of the elementary tasks for the graphical perception of quantitative information

  1. Position along a common scale
  2. Position on identical but nonaligned scales
  3. Length
  4. Angle/slope
  5. Area
  6. Volume, density or color saturation
  7. Color Hue
52 / 136

How to use this?

  • Identify the most important comparison you want to make among your quantitative variables

  • Encode that comparison with the highest-ranked graphical element in the list (1. position along a common scale)

53 / 136

Color hue

54 / 136

Length

55 / 136

Position along a common scale

56 / 136

Miles per gallon, comparing elementary tasks

57 / 136

Miles per gallon, comparing elementary tasks

58 / 136

Angle or slope

  • Eye color of Star Wars characters
59 / 136

Angle or slope

  • Pie charts are always an error!!!

  • But now we know why

  • Because decoding quantities from angles is perceptually harder than decoding them from other graphical elements

  • Always use a bar graph instead of a pie chart
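
A side-by-side sketch of the two encodings in ggplot2. It uses the `mpg` data set that ships with ggplot2, with `class` standing in for any categorical variable (the slides use eye color from starwars); the object names `pie` and `bars` are illustrative.

```r
library(ggplot2)

# Pie chart: one stacked bar drawn in polar coordinates,
# so the counts have to be read from angles
pie <- ggplot(mpg, aes(x = "", fill = class)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y")

# Bar chart: the same counts read from position along a common scale
bars <- ggplot(mpg, aes(x = class)) +
  geom_bar()

pie
bars
```

Both plots are built from the same counts; only the coordinate system and the aesthetic that carries the count differ.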

60 / 136

Visualization to communicate information...

61 / 136

Criteria for Evaluating Graphics

  • Context (Rhetoric) :

    • What is the main message? Sub-messages? Story.
    • Why/when was it produced?
    • Who’s the audience?
  • Content (Aesthetic):

    • What are the pieces of information?
    • How is the information coded into the graphic?
    • What conventions are used? What is unconventional?
    • Is the data accurately represented? Lie factor, trustworthiness.
    • What is the ratio of data to ink in the plot? High, medium, low.
    • What’s missing?
  • Perception (Perceptual)

    • How clearly is the information represented? What is emphasized, de-emphasized?
    • How is the viewer drawn in?
    • What is your overall impression, opinion?
62 / 136

Material to read about visualization

To have more ideas about what plots to produce to answer the questions you are interested in: Robbins (2012), Cleveland (1993), Chambers (2017) and Tukey (1977)

63 / 136

ggplot2 and the Grammar of Graphics

64 / 136

Why use ggplot2?

  • ggplot2 is an R package for producing statistical, or data, graphics, developed by Wickham (2016).

  • It differs from most other graphics packages because it has a deep underlying grammar.

  • This grammar is based on the Grammar of Graphics, Wilkinson (2006).

  • It is among the most widely used R packages for visualization and is 10 years old

  • ggplot2: Elegant Graphics for Data Analysis

65 / 136

Grammar of Graphics

The Grammar of Graphics answers the following questions:

  • What is a statistical graphic?

  • How to describe a graph?

  • How to create a graph?

"A statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars)"

66 / 136

Grammar of Graphics

  • A statistic is a function of data, like the sample mean \(\bar x = \sum_{i=1}^n \frac{x_i}{n}\) or the sample variance \(S^2 = \sum_{i=1}^n \frac{(x_i - \bar x)^2}{n-1}\)

  • The grammar of graphics provides a tight connection between data and statistics.

  • Key point about ggplot2: it makes plots another type of statistic.

  • It is a function of the data, a mapping from data to aesthetic attributes of geometric objects.

67 / 136

Why to use the grammar of graphics?

  • To do visualizations we already know

  • To create new visualization

  • To identify better graphs to visualize our data

The limit is your imagination!

68 / 136

ggplot2

  • A set of independent components, which gives flexibility

  • Not limited to predetermined plots: you can create what you want

  • Based on a set of principles, so it is easy to learn

  • The graphs are easily reproducible

  • You can make publication-quality graphs in a short time

  • Designed to work iteratively, layer by layer

69 / 136

Grammar of graphics

  • data: with a set of aesthetic mappings (aes) describing how variables in the data are mapped to aesthetic attributes that you can perceive.

  • layers: geometric elements (geoms: points, lines, polygons, text, ...) and statistical transformations (stats: identities, counts, bins, ...) that summarize data in many useful ways

  • scales: map values in the data space to values in an aesthetic space (e.g. color, size, shape or position). Scales also draw a legend or axes.

  • coord: describes how data coordinates are mapped to the plane of the graphic. Normally Cartesian, but pie charts, for example, use polar coordinates

  • facetting: how to break up the data into subsets and how to display those subsets as small multiples.

  • theme: controls the finer points of display, like the font size and background colour

70 / 136

Installing ggplot2

  • Install ggplot2
install.packages("ggplot2")
  • Load the package
library(ggplot2)

R packages are contributed and can change in future versions.

Development version available on GitHub: https://github.com/tidyverse/ggplot2

install.packages("devtools")
library(devtools)
install_github("tidyverse/ggplot2")
  • ggplot2 is part of the tidyverse
71 / 136

ggplot2 help

Mailing list: http://groups.google.com/group/ggplot2

Stack Overflow: http://stackoverflow.com

72 / 136

Tip Example

# load the data (read_csv comes from the readr package)
library(readr)
tips <- read_csv("http://www.ggobi.org/book/data/tips.csv")
head(tips)
## # A tibble: 6 x 8
## obs totbill tip sex smoker day time size
## <int> <dbl> <dbl> <chr> <chr> <chr> <chr> <int>
## 1 1 17.0 1.01 F No Sun Night 2
## 2 2 10.3 1.66 M No Sun Night 3
## 3 3 21.0 3.5 M No Sun Night 3
## 4 4 23.7 3.31 M No Sun Night 2
## 5 5 24.6 3.61 F No Sun Night 4
## 6 6 25.3 4.71 M No Sun Night 4
73 / 136

Three components of every plot

  • data: data to visualize

  • aes: a set of aesthetic mappings between variables in the data and visual properties (e.g. color, size, etc.)

  • layer: at least one layer describing how to render each observation. Each layer is created with a geom function.

74 / 136

Three components of every plot

  • data: tips

  • aes: totbill mapped to x position, tip to y position.

  • layer: points with geom_point.

Let's make our first plot!
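
A minimal sketch of that first plot. To stay runnable without a network connection it uses only the six rows printed by `head(tips)` above, typed in by hand as `tips_head` (a name introduced here); with the full data set loaded, `tips` can be used in its place.

```r
library(ggplot2)

# First six rows of tips, as printed by head(tips)
tips_head <- data.frame(
  totbill = c(17.0, 10.3, 21.0, 23.7, 24.6, 25.3),
  tip     = c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71)
)

p <- ggplot(data = tips_head,             # data
            aes(x = totbill, y = tip)) +  # aesthetic mapping
  geom_point()                            # layer: one point per observation
p
```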

75 / 136

Three components of every plot

Aspect ratio: ratio between the width and the height of a rectangle

76 / 136

Three components of every plot

Overplotting

77 / 136

Tip example

What do you see?

  • A weak linear relationship between tip and total bill

  • A lot of variability

  • Horizontal lines indicate a preference for whole-dollar tips
78 / 136

Color, size, shape and other aes

To include other variables, we can use other aes (color, shape, size)

aes(x = totbill, y = tip, colour = sex)
aes(x = totbill, y = tip, shape = sex)
aes(x = totbill, y = tip, size = size)
79 / 136

Color

80 / 136

Fixed color

To set a fixed color, specify it outside aes() (as an argument of the layer) or use I('blue') inside aes()
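
The difference between mapping and setting a color can be sketched as follows. The data frame `tips_head` holds the six rows shown earlier by `head(tips)`, typed in by hand so the example runs on its own; the object names `mapped` and `fixed` are illustrative.

```r
library(ggplot2)

# First six rows of tips, as printed by head(tips)
tips_head <- data.frame(
  totbill = c(17.0, 10.3, 21.0, 23.7, 24.6, 25.3),
  tip     = c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71),
  sex     = c("F", "M", "M", "M", "F", "M")
)

# Mapped: colour inside aes() is driven by a variable,
# so it goes through a scale and gets a legend
mapped <- ggplot(tips_head, aes(x = totbill, y = tip, colour = sex)) +
  geom_point()

# Fixed: colour outside aes() is a constant for the whole layer,
# with no scale and no legend
fixed <- ggplot(tips_head, aes(x = totbill, y = tip)) +
  geom_point(colour = "blue")
```

Writing `aes(colour = "blue")` without `I()` would instead treat "blue" as a data value with a single level and let the colour scale pick the color.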

81 / 136

shape

82 / 136

size

83 / 136

size

84 / 136

Facets

  • We can display additional categorical variables in a graph by subsetting the graphical display.

  • This creates a table of plots by subsetting the data and partitioning the graphical display.

  • Two types: facet_grid and facet_wrap

85 / 136

facet_wrap

86 / 136

facet_grid

87 / 136

Facetting

With facet_wrap() and facet_grid() you can control whether the position scales are the same in all panels (fixed) or allowed to vary between panels (free) with the scales parameter:

  • scales = "fixed": x and y scales are fixed across all panels.
  • scales = "free_x": the x scale is free, and the y scale is fixed.
  • scales = "free_y": the y scale is free, and the x scale is fixed.
  • scales = "free": x and y scales vary across panels.

Fixed scales make it easier to see patterns across panels; free scales make it easier to see patterns within panels.
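
A small sketch of the `scales` argument, using the `mpg` data set that ships with ggplot2 rather than the tips data; the variable choice (`displ`, `hwy`, `drv`) and the object names are illustrative.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Same x and y scales in every panel (the default)
fixed_panels <- base + facet_wrap(~ drv, scales = "fixed")

# Each panel gets its own y scale; the x scale stays shared
free_y_panels <- base + facet_wrap(~ drv, scales = "free_y")

fixed_panels
free_y_panels
```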

88 / 136

Other geoms

If we substitute geom_point() with another geom we get a different visualization. Most common geoms:

  • geom_smooth()
  • geom_boxplot()
  • geom_histogram()
  • geom_bar()
  • geom_path() and geom_line()

Each geom is associated with particular geometric elements

89 / 136

Extensions

More than 40 ggplot2 extensions

http://www.ggplot2-exts.org/gallery/

91 / 136

Include labs

92 / 136

geom Examples

p <- ggplot(tips,
aes(x = day, y = tip))
  • p
  • p + geom_point()
  • p + geom_boxplot()
  • p + geom_violin()

93 / 136

geom, Example

p <- ggplot(tips, aes(x = day,
fill = smoker))
  • p + geom_bar()
  • p + geom_bar(position="stack")
  • p + geom_bar(position="dodge")
  • p + geom_bar(position="fill")

94 / 136

scales

  • A scale controls the mapping from data to aesthetic attributes, and we need a scale for every aesthetic used on a plot.

  • Values must be converted from data units (totbill, sex, etc.) to graphical units (color, shape, etc.)

  • This conversion process is called scaling and is performed by scales

  • Each scale operates across all the data in the plot, ensuring a consistent mapping from data to aesthetics.

  • Colors are represented by a six-digit hexadecimal string, sizes by a number and shapes by an integer

95 / 136

scales

You can generate many plots without knowing how scales work, but understanding scales and learning how to manipulate them will give you much more control

96 / 136

scales

  • The aesthetic mapping only says that a variable is mapped to an aesthetic element; it doesn't say how.

  • Mapping a variable to shape with aes(shape = x) doesn't specify which shapes are used.

  • Likewise, aes(color = z) doesn't say which colors are used

  • Choosing the specific colors, shapes, sizes, etc. is done by the scales

97 / 136

scales

  • color and fill
  • size
  • shape
  • linetype

Scales are modified with a family of functions named scale_<aesthetic>_<type>. Type scale_ and press <tab> to see the list of scale functions.
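
As an illustration of the naming pattern, the sketch below overrides the default colour and shape scales with their manual variants. It uses the `mpg` data set from ggplot2; the particular hex colors and shape codes are arbitrary choices made for this example.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv, shape = drv)) +
  geom_point() +
  # scale_<aesthetic>_<type>: aesthetic = colour, type = manual
  scale_colour_manual(values = c("4" = "#1B9E77", f = "#D95F02", r = "#7570B3")) +
  # aesthetic = shape, type = manual; shapes are given as integer codes
  scale_shape_manual(values = c("4" = 16, f = 17, r = 15))
p
```

The aesthetic mapping is unchanged; only the scales decide which color and which shape each level of `drv` receives.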

98 / 136

Available scales

99 / 136

Color Scales

  • Colors are controlled through scales

  • scale_colour_discrete (scale_colour_hue) and scale_colour_continuous (scale_colour_gradient) are the default choices for factor variables and numeric variables, respectively

  • We can change the parameters of the default scale, or we can change the scale function

100 / 136

scale_colour_gradient(..., low = "#132B43", high = "#56B1F7", space = "Lab", na.value = "grey50", guide = "colourbar")

  • colors can be specified by hex code, name or through rgb()

  • Gradient goes from low to high - that should match the interpretation of the mapped variable

101 / 136

scale_colour_gradient2(..., low = muted("red"), mid = "white", high = muted("blue"), midpoint = 0, space = "Lab", na.value = "grey50", guide = "colourbar")

  • midpoint is value of the ‘neutral’ color

  • gradient2 is a divergent color scheme

  • best matches a variable that goes from large negative to zero to large positive (or below mean, above mean)

102 / 136

scale_color_hue (..., h = c(0, 360) + 15, c = 100, l = 65, h.start = 0, direction = 1, na.value = "grey50")

  • uses hue, chroma and luminance (=value)

  • each level of a variable is assigned a different level of h

103 / 136

scale_colour_brewer(..., type = "seq", palette = 1, direction = 1)

  • brewer schemes are defined in RColorBrewer (Neuwirth, 2014)

  • palettes can be specified by name or index

  • see also http://colorbrewer2.org/ (Brewer et al 2002)

104 / 136

Color Brewer Schemes

There are 3 types of palettes: sequential, diverging and qualitative.

  1. Sequential palettes are suited to ordered data that progress from low to high, with light colors for low data values and dark colors for high data values
  2. Qualitative palettes do not imply magnitude differences between legend classes; hues are used to create the primary visual differences between classes. Qualitative schemes are best suited to representing nominal or categorical data.

  3. Diverging palettes put equal emphasis on mid-range critical values and extremes at both ends of the data range. The middle of the legend is emphasized with light colors and low and high extremes are emphasized with dark colors

105 / 136

Color Brewer Schemes

RColorBrewer provides color schemes for discrete variables

display.brewer.all()

  • Sequential
  • Qualitative
  • Divergent

106 / 136

Color & Fill

  • The specified palette, Set2, has only 8 colors

  • When the variable has more levels than the palette has colors, ggplot2 issues a warning and the plot is no longer valid (as seen above)

  • Larger palettes can be produced by interpolating an existing one with the constructor function colorRampPalette (from grDevices, which ships with R)

  • It returns a function that builds a palette with an arbitrary number of colors by interpolating the existing palette.

107 / 136

Color & Fill

Select Set1 palette

ggplot(data = tips) +
geom_bar(aes(factor(round(tip)),
fill = factor(round(tip)) )) +
scale_fill_brewer( palette = "Set1") +
labs(x = "Tip in USD", fill = "Tip")
109 / 136

Color & Fill

Use colorRampPalette(brewer.pal(9, "Set1"))

library(RColorBrewer)
# build a palette function by interpolating the 9 colors of Set1
getPalette <- colorRampPalette(brewer.pal(9, "Set1"))
ggplot(data = tips) +
geom_bar(aes(factor(round(tip)),
fill = factor(round(tip)) )) +
scale_fill_manual( values = getPalette(10)) +
labs(x = "Tip in USD", fill = "Tip")
111 / 136

Color & Fill

  • Area plots use fill to map values to the fill color

  • only discrete color scales can be used: scale_fill_hue, scale_fill_brewer, scale_fill_grey, ...

  • scale_fill_manual (..., values) values is a vector of color values. At least as many colors as levels in the variable have to be listed

112 / 136

Shape

  • Point shape

  • scale_shape_continuous(), scale_shape_discrete(), scale_shape_manual()

113 / 136

Shape

scale_shape_manual(values=c(8, 15))

114 / 136

Include labs

Previously we used labs(), but the long way is...

115 / 136

Output

  • ggsave selects the graphics device based on the file extension
  • ggplot(tips, aes(totbill, tip, colour = smoker)) + geom_point() + theme(aspect.ratio = 1)

  • ggsave("ppt123.png") # png (raster image, pixelated when enlarged)

  • ggsave("ppt123.pdf") # pdf (scalable vector image)
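
A self-contained sketch of saving a plot. It uses the `mpg` data set from ggplot2 and writes to a temporary file so nothing is left in the working directory; the file name, the plot and the 5 by 5 inch size are arbitrary choices for this example.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  theme(aspect.ratio = 1)

# tempfile() avoids writing into the working directory
out_pdf <- tempfile(fileext = ".pdf")

# The .pdf extension selects a vector device; width and height are in inches
ggsave(out_pdf, plot = p, width = 5, height = 5)

file.exists(out_pdf)
```

Passing the plot explicitly through `plot = p` avoids relying on the last plot displayed, which is what `ggsave()` uses by default.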
116 / 136

Grammar of Graphics

ggplot(tips, aes(totbill, tip, colour = smoker)) +
geom_point() + theme(aspect.ratio = 1)

117 / 136

Behind this plot

  • Each observation is represented as a point whose position is determined by two variables (horizontal and vertical position)

  • Each point has size, color and shape, called aesthetic elements (aes)

  • aes are properties that can be seen in the plot. Each aes can be mapped to a variable or fixed to a constant value

  • totbill is mapped to horizontal position, tip to vertical position and smoker to color. size and shape are not mapped to variables (default values)

118 / 136

Behind this plot

## # A tibble: 3 x 3
## totbill tip smoker
## <dbl> <dbl> <chr>
## 1 17.0 1.01 No
## 2 10.3 1.66 No
## 3 21.0 3.5 No
  • New data: the original data after mapping the variables to aesthetic elements
x y colour
17.0 1.01 No
10.3 1.66 No
21.0 3.5 No
119 / 136

Layers in a plot

  • The data, aesthetic mapping, geometric objects and statistical transformations define a layer

  • We can define a graphic with multiple layers

120 / 136

layers in a graphic

The layered grammar defines the components of a graphic:

  • Data and a set of mappings from variables to aesthetic elements

  • One or more layers; each layer has a geometric element, a statistical transformation, a position and, optionally, its own data and aes

121 / 136

Layers in a graphic

ggplot() +
layer(
data = tips, mapping = aes(x = totbill, y = tip),
geom = "point", stat = "identity", position = "identity"
) +
scale_x_continuous() +
scale_y_continuous() +
coord_cartesian()

Equivalent to :

ggplot(data = tips, aes(x = totbill, y = tip)) +
geom_point()
122 / 136

Layers in a graphic

More than one data set:

ggplot() +
geom_point(data = tips, aes(x = totbill, y = tip)) +
geom_point(data = data.frame(x = 30, y = 6), aes(x, y),
color = "red", size = 10)

123 / 136

Layers in a graphic

Layers in a bar graph

# ..prop.. is the proportion computed by stat_count
# (newer ggplot2 versions spell it after_stat(prop))
p1 <- ggplot() +
  layer(
    data = tips, mapping = aes(x = day, y = ..prop.., group = 1),
    geom = "bar", stat = "count", position = "identity"
  ) +
  scale_x_discrete() +
  scale_y_continuous() +
  coord_cartesian()

Equivalent to :

p1 <- ggplot(data = tips, aes(x = day, y =..prop.., group = 1)) +
geom_bar()
124 / 136

Bars in proportion

ggplot_build(p1)$data[[1]]
## y count prop x group PANEL ymin ymax xmin xmax
## 1 0.07786885 19 0.07786885 1 1 1 0 0.07786885 0.55 1.45
## 2 0.35655738 87 0.35655738 2 1 1 0 0.35655738 1.55 2.45
## 3 0.31147541 76 0.31147541 3 1 1 0 0.31147541 2.55 3.45
## 4 0.25409836 62 0.25409836 4 1 1 0 0.25409836 3.55 4.45
## colour fill size linetype alpha
## 1 NA grey35 0.5 1 NA
## 2 NA grey35 0.5 1 NA
## 3 NA grey35 0.5 1 NA
## 4 NA grey35 0.5 1 NA
125 / 136

histogram

ggplot(data = tips, aes(x = totbill , y =..density..)) +
geom_histogram()

126 / 136

themes

  • Themes let you control every non-data aspect of a plot

  • theme gives you control of fonts, background, tick marks, etc.

  • Two pre-defined themes: theme_grey (default) and theme_bw

  • Use theme_set to apply a theme to every future plot, e.g. theme_set(theme_bw())

  • The ggthemes package defines additional themes; library(help = "ggthemes") lists them all
127 / 136

Element

  • You can also make your own theme, or modify an existing one.

  • Themes are made up of elements which can be one of:

  • element_line, element_text, element_rect, element_blank

  • Gives you a lot of control over plot appearance.

128 / 136

Element

  • Axis: axis.line, axis.text.x, axis.text.y, axis.ticks, axis.title.x, axis.title.y

  • Legend: legend.background, legend.key, legend.text, legend.title

  • Panel: panel.background, panel.border, panel.grid.major, panel.grid.minor

  • Strip (facetting): strip.background, strip.text.x, strip.text.y

    for a complete overview see ?theme

129 / 136

theme

Let's do this plot!

ggplot(data = tips, aes(x = totbill, y = tip, colour = sex)) +
geom_point() + theme(aspect.ratio = 1, legend.position = "bottom",
panel.background = element_rect(fill = "white"),
panel.grid = element_line(colour = "grey92"),
panel.border = element_rect(colour = "grey20", fill = NA),
legend.key = element_rect(fill = "white"))
131 / 136

theme

Let's do this plot!

ggplot(data = tips, aes(x = totbill, y = tip, colour = sex)) +
geom_point() + theme(aspect.ratio = 1, legend.position = "bottom",
panel.background = element_rect(fill = "white"),
panel.grid = element_line(colour = "grey92"),
panel.border = element_rect(colour = "grey20", fill = NA),
legend.key = element_rect(fill = "white"),
axis.text.x = element_text(size =20),
axis.text.y = element_text(size = 20),
axis.title = element_text(size = 30),
legend.text = element_text(size = 20),
legend.title = element_text(size = 20)
)
133 / 136

CC licence

135 / 136

References

Anscombe, F. J. (1973). "Graphs in statistical analysis". In: The American Statistician.

Chambers, J. M. (2017). Graphical Methods for Data Analysis. Chapman and Hall/CRC.

Cleveland, W. S. and R. McGill (1985). "Graphical perception and graphical methods for analyzing scientific data".

Cleveland, W. S. (1993). Visualizing data. Vol. 2. Hobart Press Summit, NJ.

Gelman, A, C. Pasarica and R. Dodhia (2002). "Let's practice what we preach: turning tables into graphs". In: The American Statistician 56.2, pp. 121-130.

Matejka, J. and G. Fitzmaurice (2017). "Same stats, different graphs: generating datasets with varied appearance and identical statistics through simulated annealing". In: Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. ACM. , pp. 1290-1294.

Robbins, N. B. (2012). Creating more effective graphs. Wiley.

Tukey, J. W. (1977). "Exploratory Data Analysis".

Wickham, H. (2016). ggplot2: elegant graphics for data analysis. Springer.

Wilkinson, L. (2006). The grammar of graphics. Springer Science & Business Media.

136 / 136
