Visualizing comparisons between datasets is a topic in all my data viz classes. Current solutions for comparing two, three, four or more datasets are diverse, and some are controversial. A one-size-fits-all solution does not exist, but some solutions work well and others should be avoided.
Comparing two or three datasets works well with Venn diagrams. Most people learn them in school, and if not, they are intuitive*. Each dataset is shown as a circle, and the circles are arranged such that all overlaps are shown. Done.
Things get problematic when comparing more than three datasets. Mathematically, it is not possible to show all overlaps of four or more datasets with circles. One possibility is to leave out some overlaps, as is often done in Euler diagrams. In the example below, for instance, the overlap between “oocyte stage 2-7” and “oocyte stage 9” is not visualized (RNAs localized in the oocytes across development, see publication). However, I find it confusing when data is left out, and sometimes “no overlap” is itself important information.
Venn himself devised a diagram for comparing four or more datasets by switching from circles to ellipses. Branko Grünbaum developed the ellipse representation further for the comparison of five datasets. Their strategies are used by the online tool Draw Venn (Yves Van de Peer, Ghent University), where you can make Venn plots by simply uploading your data. A variation is used by Heberle et al. (publication). There is also an R package by Victor Quesada.
I find there are two problems with Venn diagrams of more than three datasets. First, it takes a long time to read them and extract all the information: comparing four datasets gives a diagram with 15 regions/11 overlaps, and five datasets gives 31 regions/26 overlaps! I invariably end up writing the numbers down in my own table. Second, the areas can’t possibly be proportional to the overlap sizes, so that information is lost.
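The region counts follow from simple counting: each dataset is either part of a region or not, which gives 2^n combinations for n datasets, minus the empty one. Regions that belong to a single dataset are not overlaps, so n of them are subtracted. A quick Python check of that arithmetic (the function name is my own and purely illustrative):

```python
def venn_region_counts(n_sets):
    """Return (regions, overlaps) for a complete Venn diagram of n_sets datasets."""
    # Every non-empty combination of datasets is one region: 2^n - 1.
    regions = 2 ** n_sets - 1
    # Regions covered by exactly one dataset are not overlaps, so subtract n.
    overlaps = regions - n_sets
    return regions, overlaps


for n in range(2, 7):
    regions, overlaps = venn_region_counts(n)
    print(f"{n} datasets: {regions} regions, {overlaps} overlaps")
```

The count roughly doubles with every added dataset, which is why reading the diagram gets slow so quickly.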
New: upset plots
An alternative solution, the upset plot, was developed by Nils Gehlenborg and Jake Conway. The datasets that contribute to a given intersection are marked with dots in a simple table, and the size of the intersection is shown in a bar chart. Both are simple visuals that are easy to consume. Their package is available in R and simple to use.
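To make the idea concrete, here is a minimal sketch in plain Python of the data behind such a plot. It is not the R package's implementation; the toy datasets, element IDs and the function name are all made up for illustration. Each element is assigned to the exact combination of datasets it occurs in, and the counts are printed as a dot table plus a text bar.

```python
from collections import Counter

# Hypothetical toy data: dataset name -> set of element IDs.
datasets = {
    "A": {"g1", "g2", "g3", "g4"},
    "B": {"g3", "g4", "g5"},
    "C": {"g4", "g5", "g6", "g7"},
}


def exclusive_intersections(datasets):
    """Count elements per exact membership pattern (one upset-plot bar each)."""
    names = list(datasets)
    all_elements = set().union(*datasets.values())
    counts = Counter()
    for element in all_elements:
        # The tuple of dataset names containing this element is its pattern.
        pattern = tuple(name for name in names if element in datasets[name])
        counts[pattern] += 1
    return counts


counts = exclusive_intersections(datasets)
names = list(datasets)
print(" ".join(names))
# Largest intersections first; ties are broken by pattern for a stable order.
for pattern, size in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    dots = " ".join("x" if name in pattern else "." for name in names)
    print(f"{dots}  {'#' * size} {size}")
```

Every element is counted exactly once, so the bars add up to the total number of distinct elements, something a Venn diagram with distorted areas cannot convey.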
Customising upset plots
While upset plots are simple, I think they can be improved. In upset plots the intersection sizes are shown above the actual datasets, which serve as the legend. Basically, one is forced to read the plot from the bottom up. Flipping the plot to a horizontal layout overcomes this drawback: now the datasets are on the left, where we typically start reading, and each bar is shown on the right, next to its respective set. Another improvement is to clearly label the intersections, e.g. “present in one set”, “two sets”, and to group them visually. Additionally, I have color-coded the datasets to help the reader orient quickly.
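The grouping and labelling step can be sketched in a few lines of Python. This is only an illustration of the idea, not how I built the figure: the intersection sizes, the set names and the helper function are hypothetical, and the output is a rough text rendering with the sets on the left and the bars on the right.

```python
from itertools import groupby

# Hypothetical intersection sizes: membership pattern -> number of elements.
intersection_sizes = {
    ("A",): 120,
    ("B",): 30,
    ("C",): 45,
    ("A", "B"): 25,
    ("B", "C"): 60,
    ("A", "B", "C"): 15,
}

DEGREE_LABELS = {1: "present in one set", 2: "present in two sets"}


def group_by_degree(sizes):
    """Return {degree: [(pattern, size), ...]}, largest first within each group."""
    # Sort by number of sets in the pattern, then by descending size.
    ordered = sorted(sizes.items(), key=lambda kv: (len(kv[0]), -kv[1]))
    return {
        degree: list(rows)
        for degree, rows in groupby(ordered, key=lambda kv: len(kv[0]))
    }


grouped = group_by_degree(intersection_sizes)
for degree, rows in grouped.items():
    print(DEGREE_LABELS.get(degree, f"present in {degree} sets"))
    for pattern, size in rows:
        # Sets on the left, bar on the right (one '#' per five elements).
        print(f"  {' + '.join(pattern):<12} {'#' * (size // 5)} {size}")
```

Sorting by size within each group is just one choice; swapping the sort key is all it takes to try a different ordering.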
Depending on your message, you will have to find the optimal ordering strategy. I visualized the subcellular enrichments of RNAs and how their localization changes during the development of the fruit fly oocyte. I wanted to learn, for example, what happens to the hundreds of specific RNAs that are enriched at early stages. Do they remain localized at all stages? It turns out that the majority give up their specific subcellular enrichment and instead become distributed inside the cell, while other RNAs (not visualized here) take their place (more information on the biology).
Note: I did all the fine-tuning of the upset plot in Illustrator, but most likely it is also possible directly in R.
* Be aware that more people than you expect do not know Venn diagrams and require an introduction!