Assistant professor, Instituto de Estadística, Universidad de la República (IESTA-UDELAR), Montevideo, Uruguay.
PhD and MSc in Statistics from Iowa State University, USA
Interests: supervised learning methods, computational statistics, visualization and meta-analysis
Co-founder of R-Ladies Ames, R-Ladies Montevideo and GURU::MVD
Contact info: natalia@iesta.edu.uy, @pacocuak, http://natydasilva.com
Why do we use visualization?
EDA, type of variables and viz examples
Ideas for an effective visualization
Why use ggplot2?
Grammar of graphics
Examples
"The greatest value of a picture is when it forces us to notice what we never expected to see." Tukey (1977)
"Graphs provide powerful tools both for analyzing scientific data and for communicating quantitative information." Cleveland and McGill (1985)
Anscombe (1973) argues against three notions that statisticians are often taught:
The numerical calculations are exact, but graphs are rough
For any particular kind of statistical data there is just one set of calculations constituting a correct statistical analysis
Performing intricate calculations is virtuous, whereas actually looking at the data is cheating
Visualization plays an important role in all the stages of the statistical analysis.
Initial Exploration: To find general and specific patterns in the data.
Models: Check the data assumptions before fitting a model.
Exploratory data analysis (EDA) is an iterative process for broadly exploring different aspects of the data.
Origins: John Tukey encouraged statisticians to explore the data and possibly formulate hypotheses that could lead to new data collection and experiments.
EDA helps us understand our data
We need to define questions to guide our research
Each question focuses on something specific about the data, and this determines the type of data, the model and possible transformations
Key point: define interesting questions for our problem
X | Y |
---|---|
1.972 | 1.236 |
1.112 | 1.994 |
0.000 | 1.009 |
0.665 | 1.942 |
0.235 | 0.356 |
0.247 | 1.658 |
1.275 | 1.961 |
0.702 | 0.045 |
1.760 | 0.350 |
1.691 | 0.277 |
1.628 | 1.778 |
1.957 | 1.290 |
Statisticians recommend graphical displays but often do not follow this recommendation in presenting their own research.
The authors analyze papers published in JASA and show examples of tables turned into plots
They note there is a good reason to be lazy: it takes a lot of work to make a good visualization
Nice graphs are possible, especially when we think hard about why we want to display these numbers in the first place
If the world’s leading statistical journal doesn’t do it right, there is obviously still room for progress
Gelman, Pasarica, and Dodhia (2002)
\(\bar x =\) 54.27 and \(\bar y =\) 47.83
\(\bar x =\) 54.27 and \(\bar y =\) 47.84
\(\bar x =\) 54.26 and \(\bar y =\) 47.83
Same statistical summaries
Very different distributions
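The same point can be checked numerically with Anscombe's quartet, which ships with base R as the `anscombe` data frame. This is only a sketch: it swaps in that built-in dataset (an assumption on our part, since the summaries above come from a different collection of datasets) to show four x-y pairs whose means and correlations nearly coincide while their scatterplots look very different.

```r
# Anscombe's quartet is bundled with base R (datasets package)
data(anscombe)

# Means of the four x columns and the four y columns
x_means <- sapply(anscombe[, paste0("x", 1:4)], mean)
y_means <- sapply(anscombe[, paste0("y", 1:4)], mean)

# Correlation within each (x_i, y_i) pair
cors <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})

print(round(x_means, 2))
print(round(y_means, 2))
print(round(cors, 2))

# Plotting the four pairs side by side reveals the different shapes
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
par(op)
```

The numerical summaries agree to about two decimals, so only the plots tell the four datasets apart.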
Charles Minard, Napoleon’s Russian Campaign in 1812
Florence Nightingale, Coxcombs Causes of Mortality for the Army in the East (1858)
Data visualizations are based on data
Summarize information
Different types of visualization emphasize different aspects of the data
Data contains cases and variables.
In tabular form, the rows are the cases and the columns are usually the variables.
Variables can take different forms:
Key point to do visualization: understand the type of variables you have
Continuous variable: a variable that can take infinitely many values within a range, like “time” or “weight”.
Discrete variable: a numerical variable that can only take a countable number of values. Example: “number of students in the class”
Categorical variable: a variable whose values can be put into groups or categories. Example: hair color, dog breeds
and more...
Impossible to answer, but ... some important types are:
dotplots, histogram, density plot, time series, barcharts (all one dimensional)
scatterplot, side-by-side boxplots (two-dimensional); parallel coordinate plots, mosaic plots (higher dimensions)
maps (geographic information)
and more
Different data displays arise from the number and type of the variables
Emphasis: every chart shows a different aspect of the data
There are no clear rules for doing EDA, but we can begin with some basic questions:
What is the variability in my data?
What is the distribution of each variable?
Is there any relationship between some selected variables?
How do these two variables covary?
To identify which tools to use in EDA we should identify the type of variables to analyze
Example:
Categorical variables: we can analyze the distribution using a bar chart
Continuous variables: we can analyze the distribution using a histogram
Bar chart
Variable: categorical/qualitative
Bins separated by equal space
Bins can be re-ordered, and should be to improve the viz
Parameter: none
Histogram
Variable: continuous/quantitative
Bins adjacent
Bins can't be re-ordered
Parameter: bin width
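The contrast between the two displays can be sketched in ggplot2. The data frame below is hand-typed for illustration only (the name `tips_small` and its six rows of restaurant bills are assumptions, not a dataset shipped with any package), and the bin width of 5 is an arbitrary choice.

```r
library(ggplot2)

# Small illustrative data frame (hypothetical name, hand-typed values)
tips_small <- data.frame(
  totbill = c(17.0, 10.3, 21.0, 23.7, 24.6, 25.3),
  sex     = c("F", "M", "M", "M", "F", "M")
)

# Categorical variable: bar chart, one bar per category, no parameter
p_bar <- ggplot(tips_small, aes(x = sex)) +
  geom_bar()

# Continuous variable: histogram, adjacent bins, bin width is the parameter
p_hist <- ggplot(tips_small, aes(x = totbill)) +
  geom_histogram(binwidth = 5)

p_bar
p_hist
```

Changing `binwidth` can change the apparent shape of the histogram, which is why it is worth trying several values.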
Histogram, density plot, jittered dot plot and boxplot
Every plot emphasizes a different aspect of the data: skewness, modality, symmetry, gaps, outliers
There is no single right plot; all are useful
Covariation is the joint variation for two variables
The best way to detect the covariation is to visualize the relationship between two variables
The type of variable is what defines the way to visualize covariation
It is common to explore the distribution of a continuous variable according to a categorical variable
Histograms or densities colored by a categorical variable let us compare the distributions for each category
Boxplot:
It shows the distribution of a continuous variable
It is based on the five-number summary (min, Q1, Q2, Q3, max)
Outliers: points below Q1 − 1.5 IQR or above Q3 + 1.5 IQR
Violin: a combination of a boxplot and a density plot
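The outlier rule for the boxplot can be worked through by hand in base R. This is a sketch on made-up numbers; note that `boxplot()` itself computes its hinges with `fivenum()`, which can differ slightly from `quantile()` on small samples.

```r
# Made-up values; 50 is an obvious outlier
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

# Quartiles and interquartile range
q1  <- quantile(x, 0.25, names = FALSE)
q3  <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences: anything outside them is flagged as an outlier
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- x[x < lower_fence | x > upper_fence]
outliers

# Five-number summary (min, lower hinge, median, upper hinge, max)
fivenum(x)
```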
Bar charts are useful for describing categorical variables: for each category, the bar length is proportional to the number of observations in that category
If we want to represent more than one categorical variable we can use a stacked bar graph.
Simple stacked bar graph: each segment is placed after the previous one, so the total value of the bar is the sum of all its segments. Ideal for comparing total amounts across groups.
100% stacked bar graphs show the percentage of the whole for each group: each value is plotted as a percentage of the total amount in its group. This makes it easier to see the relative differences between quantities in each group.
Percentage of students that abandon first year of high school in Uruguay, 2016
Not all the visualizations are equally effective
There are different criteria to evaluate graphs (Cleveland, Tufte, Carr, Wainer, etc.)
Let's see a set of criteria to evaluate graphics
Tufte developed guidelines for constructing graphics, and we can use them as criteria for evaluating graphics
Show the data
Induce the viewer to think about the data
Avoid distorting what the data have to say
Present many numbers in a small space
Make large data sets coherent
Reveal the data at several levels of detail
Serve a reasonably clear purpose
Be closely integrated with the statistical and verbal descriptions of the data
What is the data in this picture?
Supporting structure: street map
Dan Carr: background + pale grid + dark data marks. This sets the plot off from the page, and makes its grey scale equivalent to a text block of the same size
What's the data?
\(\frac{\text{Size of effect shown in graphic}}{ \text{Size of effect in data}}\)
Should be close to 1
Fuel economy example:
Data: \(\frac{27.5-18.0}{18.0} \times 100 = 53\%\)
Graphic: \(\frac{5.3-0.6}{0.6} \times 100 = 783\%\)
Lie factor: \(\frac{783}{53} = 14.8 \gg 1\). Huge!
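The arithmetic of the fuel economy example can be reproduced in a few lines of R. This is a sketch; the helper name `pct_change` is ours. The lie factor is the plain ratio of the two effect sizes, so it is a unitless number rather than a percentage.

```r
# Relative change, in percent, between a starting and an ending value
pct_change <- function(from, to) (to - from) / from * 100

effect_data    <- pct_change(18.0, 27.5)  # change in the data values
effect_graphic <- pct_change(0.6, 5.3)    # change in the drawn lengths

# Lie factor: size of effect shown in graphic / size of effect in data
lie_factor <- effect_graphic / effect_data

round(effect_data)
round(effect_graphic)
round(lie_factor, 1)
```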
Data-ink ratio (Tufte)
Divide the total ink used to draw the data by the total ink used to draw the graphic.
How do you calculate this? Not easily! Generally not possible in practice
How do you do this?
Cleveland’s principles of graph construction primarily concern statistical plots of data for a scientific audience.
The two overarching principles are:
Make the data stand out
Avoid superfluity
Clear vision/understanding
Use of guides: scales, axes, tick marks, grid lines, legends
Extensive captions
Aspect ratio: scale of horizontal to vertical
Based on graphical perception studies
The best visualizations are the ones that rely on "pre-attentive" vision (instant, without apparent effort), Cleveland and McGill (1985).
When a person looks at a graph, the information is visually decoded by the person's visual system.
A graphical method is successful only if the decoding is effective.
Good visualizations are the ones that make the best use of the human visual system
If this is true, we should know how the human visual system decodes a graph
Cleveland provides an order of the elementary tasks for the graphical perception of quantitative information
We should identify the most important comparison we want to make among our quantitative variables
We should encode the most important comparison with the highest-ranked graphical element in Cleveland's ordering (1. position along a common scale)
Pie charts are always an error!!!
But now we know why
Because decoding quantitative variables based on angles is perceptually more difficult than using other graphical elements
Always use a bar graph instead of a pie chart
Context (Rhetoric) :
Content (Aesthetic):
Perception (Perceptual)
To have more ideas about what plots to produce to answer the questions you are interested in: Robbins (2012), Cleveland (1993), Chambers (2017) and Tukey (1977)
ggplot2 is an R package for producing statistical, or data, graphics, developed by Wickham (2016).
It differs from most other graphics packages because it has a deep underlying grammar.
This grammar is based on the Grammar of Graphics theory of Wilkinson (2006).
It is the most widely used R package for visualization and is 10 years old
The Grammar of Graphics answers the following questions:
What is a statistical graphic?
How to describe a graph?
How to create a graph?
"A statistical graphic is a mapping from data to aesthetic attributes (colour, shape, size) of geometric objects (points, lines, bars)"
A statistic is a function of data, like the sample mean \(\bar x = \sum_{i=1}^n \frac{x_i}{n}\) or the sample variance \(S^2 = \sum_{i=1}^n \frac{(x_i - \bar x)^2}{n-1}\)
The grammar of graphics provides a tight connection between data and statistics.
Key point about ggplot2: it makes plots another type of statistic.
A plot is a function of the data: a mapping from data to aesthetic attributes of geometric objects.
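To make "a statistic is a function of data" concrete, the two formulas for the sample mean and sample variance can be written out and compared with R's built-in `mean()` and `var()`. This is a sketch on a handful of illustrative values.

```r
# Illustrative values only
x <- c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71)
n <- length(x)

# Sample mean: sum of x_i divided by n
x_bar <- sum(x) / n

# Sample variance: sum of squared deviations divided by n - 1
s2 <- sum((x - x_bar)^2) / (n - 1)

# Both agree with the built-in functions
all.equal(x_bar, mean(x))
all.equal(s2, var(x))
```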
To do visualizations we already know
To create new visualization
To identify better graphs to visualize our data
The limit is your imagination!
ggplot2
Set of independent components, which gives flexibility
Not limited to predetermined plots, you can create what you want
Defined by a set of principles, so it is easy to learn
The graphs are easily reproducible
You can make publication quality graphs in a short time
Designed to work iteratively, based on layers
data: with a set of aesthetic mappings (aes) describing how variables in the data are mapped to aesthetic attributes that you can perceive.
layers: geometric elements (geoms: points, lines, polygons, text, ...) and statistical transformations that summarize data in many useful ways (stats: identities, counts, bins, ...)
scales: map values in the data space to values in an aesthetic space (e.g. color, size, shape or position). Scales also draw a legend or axes.
coord: describes how data coordinates are mapped to the plane of the graphic. Normally Cartesian, but pie charts, for example, use polar coordinates
facetting: how to break up the data into subsets and how to display those subsets as small multiples.
theme: controls the finer points of display, like the font size and background colour
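The components listed above can be seen together in a single plot specification. This is a sketch, assuming a small hand-typed data frame (the name `tips_small` and its rows are for illustration only) and arbitrary colour choices; each line is annotated with the component it supplies.

```r
library(ggplot2)

# Small illustrative data frame (hypothetical name, hand-typed values)
tips_small <- data.frame(
  totbill = c(17.0, 10.3, 21.0, 23.7, 24.6, 25.3),
  tip     = c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71),
  sex     = c("F", "M", "M", "M", "F", "M")
)

p <- ggplot(tips_small, aes(x = totbill, y = tip, colour = sex)) +  # data + aes
  geom_point() +                                                    # layer (geom + stat)
  scale_colour_manual(values = c("F" = "purple", "M" = "orange")) + # scale
  coord_cartesian() +                                               # coord
  facet_wrap(~ sex) +                                               # facetting
  theme_bw()                                                        # theme
p
```

Because the components are independent, any one of them can be swapped without touching the others.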
install.packages("ggplot2")
ggplot2
but R packages are contributed and can change in future iterations
Development version available on GitHub: https://github.com/tidyverse/ggplot2
install.packages("devtools")
library(devtools)
install_github("tidyverse/ggplot2")
ggplot2 is part of the tidyverse
ggplot2 help
Mailing list: http://groups.google.com/group/ggplot2
Stack Overflow: http://stackoverflow.com
# load the data
library(readr)
tips <- read_csv("http://www.ggobi.org/book/data/tips.csv")
head(tips)
## # A tibble: 6 x 8
##     obs totbill   tip sex   smoker day   time   size
##   <int>   <dbl> <dbl> <chr> <chr>  <chr> <chr> <int>
## 1     1    17.0  1.01 F     No     Sun   Night     2
## 2     2    10.3  1.66 M     No     Sun   Night     3
## 3     3    21.0  3.5  M     No     Sun   Night     3
## 4     4    23.7  3.31 M     No     Sun   Night     2
## 5     5    24.6  3.61 F     No     Sun   Night     4
## 6     6    25.3  4.71 M     No     Sun   Night     4
data: data to visualize
aes: a set of aesthetic mappings between variables in the data and visual properties (e.g. color, size)
layer: at least one layer describing how to render each observation. Each layer is created with a geom function.
data: tips
aes: totbill mapped to the x position, tip to the y position.
layer: points, with geom_point.
Let's make our first plot!
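A minimal sketch of that first plot follows. To keep it runnable without a download, it uses a hand-typed data frame with the six rows printed by `head(tips)` (rounded as printed; the name `tips_small` is ours). With the full data, replace `tips_small` with `tips`.

```r
library(ggplot2)

# First six rows of tips as printed above (rounded values, hypothetical name)
tips_small <- data.frame(
  totbill = c(17.0, 10.3, 21.0, 23.7, 24.6, 25.3),
  tip     = c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71)
)

# data: tips_small; aes: totbill -> x, tip -> y; layer: points
p <- ggplot(data = tips_small, aes(x = totbill, y = tip)) +
  geom_point()
p
```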
Aspect ratio: ratio between the width and the height of a rectangle
Overplotting
What do you see?
Weak, linear relationship between tip and total bill
A lot of variability
To include other variables, we can use other aes (color, shape, size)
aes(x = totbill, y = tip, colour = sex)
aes(x = totbill, y = tip, shape = sex)
aes(x = totbill, y = tip, size = size)
To set the colour to a fixed value, put it outside aes() (in the layer), or use I('blue') inside aes()
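The difference between mapping a variable to colour and setting a fixed colour can be sketched as follows, assuming a small hand-typed data frame (the name `tips_small` and its rows are for illustration only).

```r
library(ggplot2)

# Small illustrative data frame (hypothetical name, hand-typed values)
tips_small <- data.frame(
  totbill = c(17.0, 10.3, 21.0, 23.7, 24.6, 25.3),
  tip     = c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71),
  sex     = c("F", "M", "M", "M", "F", "M")
)

# Mapping: colour depends on a variable, a scale picks the colours
p_mapped <- ggplot(tips_small, aes(totbill, tip, colour = sex)) +
  geom_point()

# Setting: a constant colour, given in the layer, outside aes()
p_fixed <- ggplot(tips_small, aes(totbill, tip)) +
  geom_point(colour = "blue")

# Alternative: I() inside aes() asks for the value to be used as is
p_asis <- ggplot(tips_small, aes(totbill, tip, colour = I("blue"))) +
  geom_point()

p_mapped
p_fixed
p_asis
```

Writing `aes(colour = "blue")` without `I()` would instead treat "blue" as a one-level variable and let the default scale choose its colour.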
We can display additional categorical variables in a graph by subsetting the graphical display.
Create a table of plots by subsetting the data and partitioning the graphical display.
Two types: facet_grid and facet_wrap
facet_wrap
facet_grid
With facet_wrap() and facet_grid() you can control whether the position scales are the same in all panels (fixed) or allowed to vary between panels (free) with the scales parameter:
Fixed scales make it easier to see patterns across panels; free scales make it easier to see patterns within panels.
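A short sketch of both facetting functions and of the `scales` argument, assuming a small hand-typed data frame (the name `tips_small` and its rows are for illustration only):

```r
library(ggplot2)

# Small illustrative data frame (hypothetical name, hand-typed values)
tips_small <- data.frame(
  totbill = c(17.0, 10.3, 21.0, 23.7, 24.6, 25.3),
  tip     = c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71),
  sex     = c("F", "M", "M", "M", "F", "M")
)

p <- ggplot(tips_small, aes(totbill, tip)) + geom_point()

# One panel per level of sex, same axes in every panel (the default)
p_fixed <- p + facet_wrap(~ sex, scales = "fixed")

# Each panel gets its own x and y ranges
p_free <- p + facet_wrap(~ sex, scales = "free")

# facet_grid lays panels out in rows (left of ~) and columns (right of ~)
p_grid <- p + facet_grid(sex ~ .)

p_fixed
p_free
p_grid
```

`scales` also accepts "free_x" and "free_y" to free only one of the two axes.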
If we substitute geom_point() with another geom we get a different visualization.
Most common geoms:
geom_smooth()
geom_boxplot()
geom_histogram()
geom_bar()
geom_path() and geom_line()
Each geom is associated with particular geometric elements
geom examples
p <- ggplot(tips, aes(x = day, y = tip))
p
p + geom_point()
p + geom_boxplot()
p + geom_violin()
geom examples
p <- ggplot(tips, aes(x = day, fill = smoker))
p + geom_bar()
p + geom_bar(position="stack")
p + geom_bar(position="dodge")
p + geom_bar(position="fill")
A scale controls the mapping from data to aesthetic attributes, and we need a scale for every aesthetic used on a plot.
We need to convert the data from data units (totbill, sex, etc.) to graphical units (color, shape, etc.)
This conversion process is called scaling and is performed by scales
Each scale operates across all the data in the plot, ensuring a consistent mapping from data to aesthetics.
Colors are represented by a six-character hexadecimal string, sizes by a number and shapes by an integer
You can generate many plots without knowing how scales work, but understanding scales and learning how to manipulate them will give you much more control
The aesthetic mapping only says that a variable is mapped to an aesthetic element; it doesn't say how the mapping is done.
When a variable is mapped to shape using aes(shape = x), we don't specify which shape each value should take.
When we use aes(color = z) we don't say which specific colors to use.
Specifying the color, shape, size, etc. is done through scales
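The split between "what is mapped" (aes) and "how it is rendered" (scales) can be sketched like this. The data frame is hand-typed for illustration (the name `tips_small` is ours), and the colours and shape codes are arbitrary choices.

```r
library(ggplot2)

# Small illustrative data frame (hypothetical name, hand-typed values)
tips_small <- data.frame(
  totbill = c(17.0, 10.3, 21.0, 23.7, 24.6, 25.3),
  tip     = c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71),
  sex     = c("F", "M", "M", "M", "F", "M")
)

# aes() only says that sex drives colour and shape; defaults decide how
p_default <- ggplot(tips_small, aes(totbill, tip, colour = sex, shape = sex)) +
  geom_point(size = 3)

# Scales say how: which colour and which shape each level gets
p_custom <- p_default +
  scale_colour_manual(values = c("F" = "darkgreen", "M" = "orange")) +
  scale_shape_manual(values = c("F" = 8, "M" = 15))

p_default
p_custom
```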
color and fill
size
shape
linetype
Scales are modified through a series of functions with this structure: scale_<aesthetic>_<type>. Type scale_<tab> to see the list of scale functions.
Available scales
Colors are controlled through scales
scale_colour_discrete (scale_colour_hue) and scale_colour_continuous (scale_colour_gradient) are the default choices for factor variables and numeric variables, respectively
We can change the parameters of the default scale, or we can change the scale function
scale_colour_gradient(..., low = "#132B43", high = "#56B1F7", space = "Lab", na.value = "grey50", guide = "colourbar")
colors can be specified by hex code, name or through rgb()
Gradient goes from low to high - that should match the interpretation of the mapped variable
scale_colour_gradient2(..., low = muted("red"), mid = "white",
high = muted("blue"), midpoint = 0, space = "Lab", na.value = "grey50", guide = "colourbar")
midpoint is value of the ‘neutral’ color
gradient2 is a divergent color scheme
best matches a variable that goes from large negative to zero to large positive (or below mean, above mean)
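A diverging scale can be sketched on a variable centred at zero. The data below are made up for illustration, and plain colour names are used instead of `muted()` so that only ggplot2 needs to be loaded.

```r
library(ggplot2)

# Made-up deviations around zero
dev <- data.frame(x = 1:5, deviation = c(-2, -1, 0, 1, 2))

# Negative values shade towards red, positive towards blue,
# and the midpoint (0) gets the neutral colour
p <- ggplot(dev, aes(x = x, y = deviation, colour = deviation)) +
  geom_point(size = 4) +
  scale_colour_gradient2(low = "red", mid = "grey90", high = "blue",
                         midpoint = 0)
p
```

For a variable measured as "below mean / above mean", setting `midpoint` to the mean plays the same role.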
scale_color_hue (..., h = c(0, 360) + 15, c = 100, l = 65, h.start = 0, direction = 1, na.value = "grey50")
uses hue, chroma and luminance (=value)
each level of a variable is assigned a different level of h
scale_colour_brewer(..., type = "seq", palette = 1, direction = 1)
brewer schemes are defined in RColorBrewer (Neuwirth, 2014)
palettes can be specified by name or index
see also http://colorbrewer2.org/ (Brewer et al 2002)
There are 3 types of palettes, sequential, diverging, and qualitative.
Qualitative schemes do not imply magnitude differences between legend classes, and hues are used to create the primary visual differences between classes. Qualitative schemes are best suited to representing nominal or categorical data.
Diverging schemes put equal emphasis on mid-range critical values and extremes at both ends of the data range. The middle of the legend is emphasized with light colors and low and high extremes are emphasized with dark colors
RColorBrewer provides color schemes for discrete variables
display.brewer.all()
The palette Set2 has only 8 colors
Too few colors in the palette triggers a ggplot2 warning (and invalidates the plot, as seen above):
The constructor function colorRampPalette() (from grDevices) gives us a way to produce larger palettes: it builds a palette with an arbitrary number of colors by interpolating an existing palette, such as one from RColorBrewer.
Select the Set1 palette
ggplot(data = tips) + geom_bar(aes(factor(round(tip)), fill = factor(round(tip)) )) + scale_fill_brewer( palette = "Set1") + labs(x = "Tip in USD", fill = "Tip")
Use colorRampPalette(brewer.pal(9, "Set1"))
library(RColorBrewer)
# build a function that interpolates the 9 colors of Set1
getPalette <- colorRampPalette(brewer.pal(9, "Set1"))
ggplot(data = tips) + geom_bar(aes(factor(round(tip)), fill = factor(round(tip)))) + scale_fill_manual(values = getPalette(10)) + labs(x = "Tip in USD", fill = "Tip")
Area plots use fill to map values to the fill color
only discrete color scales can be used: scale_fill_hue, scale_fill_brewer, scale_fill_grey, ...
scale_fill_manual(..., values)
values is a vector of color values. At least as many colors as levels in the variable have to be listed
Point shape
scale_shape_continuous(), scale_shape_discrete(), scale_shape_manual()
scale_shape_manual(values=c(8, 15))
Previously we used labs(), but the long way is...
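One way to spell out the long way is to pass `name` to each scale function. This is a sketch, assuming a small hand-typed data frame (the name `tips_small` and its rows are for illustration only) and illustrative label text.

```r
library(ggplot2)

# Small illustrative data frame (hypothetical name, hand-typed values)
tips_small <- data.frame(
  totbill = c(17.0, 10.3, 21.0, 23.7, 24.6, 25.3),
  tip     = c(1.01, 1.66, 3.50, 3.31, 3.61, 4.71),
  sex     = c("F", "M", "M", "M", "F", "M")
)

p <- ggplot(tips_small, aes(totbill, tip, colour = sex)) + geom_point()

# Short way: labs() sets axis and legend titles in one call
p_short <- p + labs(x = "Total bill", y = "Tip", colour = "Sex")

# Long way: every title is the name argument of the matching scale
p_long <- p +
  scale_x_continuous(name = "Total bill") +
  scale_y_continuous(name = "Tip") +
  scale_colour_discrete(name = "Sex")

p_short
p_long
```

The long form is useful when the same scale call also needs other arguments such as `breaks` or `limits`.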
ggplot(tips, aes(totbill, tip, colour = smoker)) +
geom_point() + theme(aspect.ratio = 1)
ggsave("ppt123.png")
# png (pixelated raster image)
ggsave("ppt123.pdf")
# pdf (scalable vector image)
ggplot(tips, aes(totbill, tip, colour = smoker)) + geom_point() + theme(aspect.ratio = 1)
Each observation is represented as a point whose position is associated with two variables (horizontal and vertical position)
Each point has a size, a color and a shape, called aesthetic elements (aes)
aes are properties that can be seen in the plot. Each aes can be mapped to a variable or fixed to a constant value
totbill is mapped to the horizontal position, tip to the vertical position and smoker to color. Size and shape are not mapped to variables (default values)
## # A tibble: 3 x 3
##   totbill   tip smoker
##     <dbl> <dbl> <chr>
## 1    17.0  1.01 No
## 2    10.3  1.66 No
## 3    21.0  3.5  No
x | y | colour |
---|---|---|
17.0 | 1.01 | No |
10.3 | 1.66 | No |
21.0 | 3.5 | No |
The data, aesthetic mapping, geometric objects and statistical transformations define a layer
We can define a graphic with multiple layers
The grammar of layers defines the components of a graphic:
data and a set of mappings from variables to aesthetic elements
one or more layers, each layer has a geometric element, a statistical transformation, a position and optional data and aes
ggplot() + layer( data = tips, mapping = aes(x = totbill, y = tip), geom = "point", stat = "identity", position = "identity" ) + scale_x_continuous() + scale_y_continuous() + coord_cartesian()
Equivalent to :
ggplot(data = tips, aes(x = totbill, y = tip)) + geom_point()
More than one data set:
ggplot() + geom_point(data = tips, aes(x = totbill, y = tip)) + geom_point(data = data.frame(x = 30, y = 6), aes(x, y), color = "red", size = 10)
Layers in a bar graph
p1 <- ggplot() + layer(data = tips, mapping = aes(x = day, y = ..prop.., group = 1), geom = "bar", stat = "count", position = "identity") + scale_x_discrete() + scale_y_continuous() + coord_cartesian()
Equivalent to :
p1 <- ggplot(data = tips, aes(x = day, y =..prop.., group = 1)) + geom_bar()
ggplot_build(p1)$data[[1]]
##            y count       prop x group PANEL ymin       ymax xmin xmax
## 1 0.07786885    19 0.07786885 1     1     1    0 0.07786885 0.55 1.45
## 2 0.35655738    87 0.35655738 2     1     1    0 0.35655738 1.55 2.45
## 3 0.31147541    76 0.31147541 3     1     1    0 0.31147541 2.55 3.45
## 4 0.25409836    62 0.25409836 4     1     1    0 0.25409836 3.55 4.45
##   colour   fill size linetype alpha
## 1     NA grey35  0.5        1    NA
## 2     NA grey35  0.5        1    NA
## 3     NA grey35  0.5        1    NA
## 4     NA grey35  0.5        1    NA
ggplot(data = tips, aes(x = totbill , y =..density..)) + geom_histogram()
Themes give control over every non-data aspect of a plot
theme gives you control of fonts, background, tick marks, etc.
Two pre-defined themes: theme_grey (default), theme_bw
Use theme_set if you want to apply a theme to every future plot, e.g. theme_set(theme_bw())
The ggthemes package defines additional themes
library(help = "ggthemes") lists all themes
You can also make your own theme, or modify an existing one.
Themes are made up of elements, which can be one of: element_line, element_text, element_rect, element_blank
This gives you a lot of control over plot appearance.
Axis: axis.line, axis.text.x, axis.text.y, axis.ticks, axis.title.x, axis.title.y
Legend: legend.background, legend.key, legend.text, legend.title
Panel: panel.background, panel.border, panel.grid.major, panel.grid.minor
Strip (facetting): strip.background, strip.text.x, strip.text.y
For a complete overview see ?theme
Let's do this plot!
ggplot(data = tips, aes(x = totbill, y = tip, colour = sex)) + geom_point() + theme(aspect.ratio = 1, legend.position = "bottom",panel.background = element_rect(fill = "white"),panel.grid = element_line(colour = "grey92"),panel.border = element_rect(colour = "grey20", fill = NA),legend.key = element_rect(fill = "white"))
Let's do this plot!
ggplot(data = tips, aes(x = totbill, y = tip, colour = sex)) + geom_point() + theme(aspect.ratio = 1, legend.position = "bottom",panel.background = element_rect(fill = "white"),panel.grid = element_line(colour = "grey92"),panel.border = element_rect(colour = "grey20", fill = NA),legend.key = element_rect(fill = "white"),axis.text.x = element_text(size =20),axis.text.y = element_text(size = 20),axis.title = element_text(size = 30),legend.text = element_text(size = 20), legend.title = element_text(size = 20))
Anscombe, F. J. (1973). "Graphs in statistical analysis". In: The American Statistician.
Chambers, J. M. (2017). Graphical Methods for Data Analysis. Chapman and Hall/CRC.
Cleveland, W. S. and R. McGill (1985). "Graphical perception and graphical methods for analyzing scientific data".
Cleveland, W. S. (1993). Visualizing data. Vol. 2. Hobart Press Summit, NJ.
Gelman, A., C. Pasarica and R. Dodhia (2002). "Let's practice what we preach: turning tables into graphs". In: The American Statistician 56.2, pp. 121-130.
Matejka, J. and G. Fitzmaurice (2017). "Same stats, different graphs: generating datasets with varied appearance and identical statistics through simulated annealing". In: Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. ACM, pp. 1290-1294.
Robbins, N. B. (2012). Creating more effective graphs. Wiley.
Tukey, J. W. (1977). "Exploratory Data Analysis".
Wickham, H. (2016). ggplot2: elegant graphics for data analysis. Springer.
Wilkinson, L. (2006). The grammar of graphics. Springer Science & Business Media.