For all examples the movies data set contained in the package will be used.
library(UpSetR)
movies <- read.csv(system.file("extdata", "movies.csv", package = "UpSetR"),
    header = TRUE, sep = ";")
The set.metadata parameter is broken up into 3 fields: data, ncols, and plots.
data: a data frame whose first column holds the set names and whose remaining columns hold attributes of the sets.
plots: a list of lists, one per plot, where each inner list holds the parameters used to generate that plot. These parameters include column, type, assign, and colors.
column: the column of the data frame that should be used for the specified plot.
type: the type of plot used to display the data from the specified column. If the data in the column is numeric, the plot type can be either a bar plot ("hist") or a heat map ("heat"). If the data is boolean, the plot type can be a "bool" heat map. If the data is categorical (character), the plot type can be either a heat map ("heat") or text ("text"). If the data is ordinal (factor), the plot type can likewise be either a heat map or text. There is also a type called "matrix_rows", which applies colors to the matrix background using categorical data. This type is useful for identifying characteristics of sets using the matrix.
assign: the number of columns that should be assigned to the specific plot. For instance, if you’re plotting 2 set metadata plots, you may choose one plot to take up 20 columns and the other 10 columns. Since the UpSet plot is typically plotted on a 100 by 100 grid, the grid will now be 100 by 130, where roughly 1/4 of the plot is assigned to the metadata plots.
colors: specifies the colors used in the metadata plots. If the plot type is a bar plot, the parameter takes only one color for the whole plot. If the plot type is "heat" or "bool", a vector of colors can be provided with one color for each unique category (character). However, if the data is ordinal (factor), there is no colors input, and the heat map uses a color gradient rather than a different color for each level. Lastly, if the plot type is "text", a vector of colors can be provided with one color for each unique string. If no colors are provided, a default color palette is used.
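The type rules above can be summarized in a small helper. This is only a sketch: allowed_types() is a hypothetical function written for this illustration, not part of UpSetR, and it simply encodes the rules as the text describes them, assuming boolean data is coded as 0's and 1's.

```r
# Hypothetical helper (not part of UpSetR): list the set.metadata plot types
# that the rules described above allow for a given metadata column.
allowed_types <- function(x) {
  if (is.numeric(x)) {
    types <- c("hist", "heat")
    # Data coded strictly as 0/1 can also be drawn as a "bool" heat map.
    if (all(x %in% c(0, 1))) types <- c(types, "bool")
    types
  } else if (is.character(x)) {
    # Categorical data; "matrix_rows" colors the matrix background.
    c("heat", "text", "matrix_rows")
  } else if (is.factor(x)) {
    # Ordinal data.
    c("heat", "text")
  } else {
    character(0)
  }
}

allowed_types(c(13, 50, 84))       # numeric scores
allowed_types(c(1, 0, 0, 1))       # binary data coded as 0/1
allowed_types(c("NYC", "Boston"))  # categorical strings
```

Checking a column this way before building the plots list makes it easier to spot a column that was silently stored as the wrong class.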
In this example, the average Rotten Tomatoes movie ratings for each set (simulated here with random values) will be used as the set metadata. This may help us draw more conclusions from the visualization by knowing how professional movie reviewers typically rate movies in these categories.
sets <- names(movies[3:19])
avgRottenTomatoesScore <- round(runif(17, min = 0, max = 90))
metadata <- as.data.frame(cbind(sets, avgRottenTomatoesScore))
names(metadata) <- c("sets", "avgRottenTomatoesScore")
When generating a bar plot using set metadata information it is important to make sure the specified column is numeric.
is.numeric(metadata$avgRottenTomatoesScore)
## [1] FALSE
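The column is not numeric! cbind() combined the set names and the scores into a character matrix, so the column holds characters (or a factor in R versions before 4.0.0, where as.data.frame() converted strings to factors by default). Coercing to characters and then to numbers works in either case.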
metadata$avgRottenTomatoesScore <- as.numeric(as.character(metadata$avgRottenTomatoesScore))
upset(movies, set.metadata = list(data = metadata, plots = list(list(type = "hist",
column = "avgRottenTomatoesScore", assign = 20))))
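To see why the intermediate as.character() step matters, here is a minimal base R sketch with made-up scores. The toy data frame and its values are assumptions for illustration only; the last part shows one possible way to avoid the coercion altogether.

```r
# Made-up scores stored as a factor, as happens in R versions before 4.0.0.
scores <- factor(c("84", "13", "50"))

# Direct coercion returns the internal level codes, not the scores.
as.numeric(scores)
## [1] 3 1 2

# Going through character first recovers the actual values.
as.numeric(as.character(scores))
## [1] 84 13 50

# Building the data frame with data.frame() instead of cbind() keeps the
# column numeric from the start, so no coercion is needed.
toy <- data.frame(sets = c("Drama", "Action", "Comedy"),
                  avgRottenTomatoesScore = c(84, 13, 50))
is.numeric(toy$avgRottenTomatoesScore)
## [1] TRUE
```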
In this example we will make up our own data on which major cities these genres were most popular in. Since this is categorical and not ordinal, we must make sure the column is stored as characters (in R versions before 4.0.0, cbind() turns it into a factor again). To assign a specific color to each category, you can name each category in the color vector, as shown below. If you don’t care which color is assigned to each category, you don’t have to specify the category names in the color vector; the colors are applied to the categories in the order they occur. Additionally, if you don’t supply anything for the colors parameter, a default color palette will be provided for you.
Cities <- sample(c("Boston", "NYC", "LA"), 17, replace = TRUE)
metadata <- cbind(metadata, Cities)
metadata$Cities <- as.character(metadata$Cities)
metadata[which(metadata$sets %in% c("Drama", "Comedy", "Action", "Thriller",
"Romance")), ]
## sets avgRottenTomatoesScore Cities
## 1 Action 13 NYC
## 4 Comedy 50 Boston
## 7 Drama 84 NYC
## 13 Romance 50 Boston
## 15 Thriller 65 NYC
upset(movies, set.metadata = list(data = metadata, plots = list(list(type = "heat",
column = "Cities", assign = 10, colors = c(Boston = "green", NYC = "navy",
LA = "purple")))))
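The difference between a named and an unnamed color vector can be illustrated with plain base R. This sketch uses made-up city labels and shows only how R's named-vector lookup behaves; the order-of-appearance pairing for unnamed colors is an assumption that mirrors the description above, not a statement about UpSetR's internals.

```r
# Made-up city labels for four sets.
cities <- c("NYC", "Boston", "NYC", "LA")

# Named vector: each category is looked up by name, so the order in which
# the colors are listed does not matter.
named <- c(Boston = "green", NYC = "navy", LA = "purple")
unname(named[cities])
## [1] "navy"   "green"  "navy"   "purple"

# Unnamed vector: colors are paired with categories in order of appearance
# (assumed behavior), so NYC gets "green", Boston "navy", and LA "purple".
unnamed <- c("green", "navy", "purple")
by_order <- setNames(unnamed, unique(cities))
unname(by_order[cities])
## [1] "green"  "navy"   "green"  "purple"
```

Naming the colors is the safer choice whenever a category should keep the same color across several plots.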
Now let’s also use our numeric critic values!
upset(movies, set.metadata = list(data = metadata, plots = list(list(type = "heat",
column = "Cities", assign = 10, colors = c(Boston = "green", NYC = "navy",
LA = "purple")), list(type = "heat", column = "avgRottenTomatoesScore",
assign = 10))))
As a side note, the way the numerical heat map is handled is similar to how the ordinal heat maps are handled.
Now suppose we have metadata that tells us whether or not these genres are well accepted overseas. This could be used as a categorical column where there are only two categories, but for this example we will assume that your data is coded in 1’s and 0’s. It is important to keep in mind that if you run a “heat” with 0’s and 1’s instead of a “bool” the binary data will be treated as numerical values, and a color gradient will be used to show the relative differences.
accepted <- round(runif(17, min = 0, max = 1))
metadata <- cbind(metadata, accepted)
metadata[which(metadata$sets %in% c("Drama", "Comedy", "Action", "Thriller",
"Romance")), ]
## sets avgRottenTomatoesScore Cities accepted
## 1 Action 13 NYC 1
## 4 Comedy 50 Boston 0
## 7 Drama 84 NYC 0
## 13 Romance 50 Boston 0
## 15 Thriller 65 NYC 1
upset(movies, set.metadata = list(data = metadata, plots = list(list(type = "bool",
column = "accepted", assign = 5, colors = c("#FF3333", "#006400")))))
Let’s see what happens when we choose a “heat” instead of a “bool” for our binary data column.
upset(movies, set.metadata = list(data = metadata, plots = list(list(type = "heat",
column = "accepted", assign = 5, colors = c("red", "green")))))
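Before choosing between "bool" and "heat", it can help to check whether a column really contains only 0's and 1's. The following is a minimal base R sketch with made-up values; the variable names are chosen for illustration only.

```r
# Made-up acceptance flags, coded as in the example above.
accepted_flags <- c(1, 0, 0, 0, 1)

# TRUE only if every value is exactly 0 or 1, which suits type = "bool".
all(accepted_flags %in% c(0, 1))
## [1] TRUE

# A column with other values is better shown as a numeric "heat" gradient.
ratings <- c(0, 0.5, 1)
all(ratings %in% c(0, 1))
## [1] FALSE

# Logical data can be recoded to 0/1 to match the coding used here.
as.integer(c(TRUE, FALSE, TRUE))
## [1] 1 0 1
```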
Let’s say we prefer to show text instead of a heat map for the cities these genres were most popular in.
upset(movies, set.metadata = list(data = metadata, plots = list(list(type = "text",
column = "Cities", assign = 10, colors = c(Boston = "green", NYC = "navy",
LA = "purple")))))
In some cases we may just want to incorporate categorical set metadata directly into the UpSet plot to easily identify characteristics of the sets via the matrix. To do this we need to specify the type as "matrix_rows", what column we’re using to categorize the sets, and the colors to apply to each category. There is also an option to change the opacity of the matrix background using alpha. To change the opacity of the matrix background without applying set metadata, see the shade.alpha parameter in the upset() function documentation.
upset(movies, set.metadata = list(data = metadata, plots = list(list(type = "hist",
column = "avgRottenTomatoesScore", assign = 20), list(type = "matrix_rows",
column = "Cities", colors = c(Boston = "green", NYC = "navy", LA = "purple"),
alpha = 0.5))))
Now let’s sum up all of our metadata information together on one plot!
upset(movies, set.metadata = list(data = metadata, plots = list(list(type = "hist",
column = "avgRottenTomatoesScore", assign = 20), list(type = "bool", column = "accepted",
assign = 5, colors = c("#FF3333", "#006400")), list(type = "text", column = "Cities",
assign = 5, colors = c(Boston = "green", NYC = "navy", LA = "purple")))))
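Because each plot specification is an ordinary R list, the nested call above can also be assembled from named pieces, which may be easier to read and reuse. This is a sketch of one possible layout; the variable names (score_plot, accepted_plot, city_plot, plots, extra_cols) are invented for this illustration. It also adds up the assign values to show how many columns the metadata plots add to the default 100-column grid.

```r
# Each plot specification is an ordinary list, so it can be named and reused.
score_plot    <- list(type = "hist", column = "avgRottenTomatoesScore",
                      assign = 20)
accepted_plot <- list(type = "bool", column = "accepted", assign = 5,
                      colors = c("#FF3333", "#006400"))
city_plot     <- list(type = "text", column = "Cities", assign = 5,
                      colors = c(Boston = "green", NYC = "navy", LA = "purple"))

plots <- list(score_plot, accepted_plot, city_plot)

# Columns added to the default 100-column grid by the metadata plots.
extra_cols <- sum(vapply(plots, function(p) p$assign, numeric(1)))
extra_cols
## [1] 30

# With movies and metadata defined as above, the equivalent call would be:
# upset(movies, set.metadata = list(data = metadata, plots = plots))
```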
Finally, let’s include functionalities discussed in all of the other UpSetR vignettes! This gives us a very in-depth look at information about our sets, intersections, and specific elements.
upset(movies, set.metadata = list(data = metadata, plots = list(list(type = "hist",
column = "avgRottenTomatoesScore", assign = 20), list(type = "bool", column = "accepted",
assign = 5, colors = c("#FF3333", "#006400")), list(type = "text", column = "Cities",
assign = 5, colors = c(Boston = "green", NYC = "navy", LA = "purple")),
list(type = "matrix_rows", column = "Cities", colors = c(Boston = "green",
NYC = "navy", LA = "purple"), alpha = 0.5))), queries = list(list(query = intersects,
params = list("Drama"), color = "red", active = FALSE), list(query = intersects,
params = list("Action", "Drama"), active = TRUE), list(query = intersects,
params = list("Drama", "Comedy", "Action"), color = "orange", active = TRUE)),
attribute.plots = list(gridrows = 45, plots = list(list(plot = scatter_plot,
x = "ReleaseDate", y = "AvgRating", queries = TRUE), list(plot = scatter_plot,
x = "AvgRating", y = "Watches", queries = FALSE)), ncols = 2), query.legend = "bottom")