Structuring data according to the ‘tidy data‘ standard simplifies data analysis and data visualisation. But understanding the structure of tidy data does not come naturally (in my experience), since it is quite different from the structure of data in spreadsheets or tables. Here, I explain how to convert typical spreadsheet data to tidy data to facilitate data visualisation.
When I started to use ggplot2 for data visualisation, I was impressed by its power and elegance. With ggplot2, a free, open source package for R, you can make high-quality graphs. For instance, the graphs that I made for a previous blog advocating the importance of showing the actual data instead of summaries were made with ggplot2 (scripts available here). Before I could make those graphs, however, I was struggling with the tidy data structure, which is a requirement for using ggplot2.
The reason that I was struggling is that as a researcher I am used to collect, process and summarise my data in spreadsheets. In addition, I am used to read and present data in a tabular format. Tidy data contains exactly the same information as non-tidy, spreadsheet data, but it is organised in a different way. The tidy data standard defines a consistent way to structure datasets, facilitating data manipulation and visualization (Wickham, 2014). So, once the data is organised according to the tidy data principles, making the graphs is relatively straightforward.
Below, I will explain how data in the typical spreadsheet format (and I’m not naming names here) can be converted to tidy data using R. I will use two examples of data in a spreadsheet format that I often encounter during my research. The first example is a dataset in which a single parameter is compared across different conditions. The second example is a dataset from a time-series experiment with different conditions.
Before we can perform the conversion to the tidy format, we need to define what types of information are stored in a dataset. I will use the nomenclature that is used in Tidy Data by Hadley Wickham. Let’s say that you have measured cell lengths under a number of different experimental conditions (A,X,Y,Z). An example dataset in the typical non-tidy spreadsheet format would look like this:
A X Y Z
0.8 7.1 9.9 5.1
1.9 6.7 9.5 4.5
2.6 6.3 8.5 4.4
3.6 6.2 8.7 4.3
4.4 4.5 7.3 3.3
5.7 4.3 6.2 3.1
6.3 4.1 5.5 2.5
7.1 4.8 6.8 3.4
8.2 5.1 7.3 3.6
9.2 5.5 7.9 3.9
The dataset consists of values and these are usually numbers or text. In this dataset, all the measured values represent the same characteristic (cell length) and therefore belong to a single variable. Here, we indicate that variable as “Length”. Since the values that represent cell length are experimentally determined, “Length” is a measured variable. The different conditions (A,X,Y,Z) are also assigned to a variable; “Condition”. Since this variable was known when the experiment was designed it is a fixed variable.
The most important requirement for tidy data is that each variable occupies only a single column. This rule is violated in the example above, because values that belong to the same variable (“Length”) are distributed over different columns. The tidy version of the dataset would have two columns, one with the fixed variable “Condition’ and one with the measured variable “Length”. Below, I will explain how to use R to convert the spreadsheet data into tidy data.
In the next part, which will be a tutorial on tidying data in R, I assume some basic knowledge of R, including setting the working directory and installing packages. The packages that need to be installed are tidyr and ggplot2. The example dataset is available here.
Example 1: Tidying a spreadsheet with different conditions
> data_spread <- read.csv('test-data.csv')
To verify this step, you can print the first six lines of the dataframe by using the function head() with the name of the dataframe as the argument:
Which will return:
A X Y Z 1 0.8 7.1 9.9 5.1 2 1.9 6.7 9.5 4.5 3 2.6 6.3 8.5 4.4 4 3.6 6.2 8.7 4.3 5 4.4 4.5 7.3 3.3 6 5.7 4.3 6.2 3.1
The first row of the dataframe is the header, which specifies the experimental condition of each column (A,X,Y,Z). All other rows contain the observed cell length.
To convert the dataframe to a tidy format I use the function gather(), which is part of the tidyr package. Using gather(), I specify that “Condition” is taken from the header and should go in the first column and that all the values (which are all cell lengths) should be gathered in the second columns and that this variable should have the name “Length”. The result is assigned to the dataframe ‘data_tidy’:
> data_tidy <- gather(data_spread, Condition, Length)
To show the contents of the entire dataframe, enter the name of the dataframe at the prompt (the result is not shown here):
To show the first six lines of the tidy dataframe ‘data_tidy’, use the function head():
> head(data_tidy) Condition Length 1 A 0.8 2 A 1.9 3 A 2.6 4 A 3.6 5 A 4.4 6 A 5.7
To show the last six lines of the tidy dataframe, use the function tail():
> tail(data_tidy) Condition Length 35 Z 3.3 36 Z 3.1 37 Z 2.5 38 Z 3.4 39 Z 3.6 40 Z 3.9
With the new, tidy dataframe ‘data_tidy’ it is straightforward to plot the data with ggplot2:
> ggplot(data_tidy, aes(x = Condition, y = Length)) + geom_jitter(position=position_jitter(0.3), cex=1, color="grey40")
In this example we have converted a spreadsheet with measurements, taken under different conditions, into tidy data. In the tidy data structure, only two columns are used, one for the fixed variable “Condition” and another one for the values that belong to the measured variable “Length”. This format, that looks like a list, may seem odd. It also uses more storage space (or memory), since the “Condition” is listed for every value. Still, this format makes perfect sense in R and it can be used to plot the data with ggplot2.
Example 2: Tidying a spreadsheet with time-dependent data and different conditions
Let’s take this a step further with the same non-tidy spreadsheet data “data_spread”. Now, suppose that the data is from a time series and column A represents time. To indicate that column A is actually “Time” you can change the name of the first column using the function colnames():
> colnames(data_spread) <- "Time"
You can verify that the name of the first column has been changed by using the function head():
> head(data_spread) Time X Y Z 1 0.8 7.1 9.9 5.1 2 1.9 6.7 9.5 4.5 3 2.6 6.3 8.5 4.4 4 3.6 6.2 8.7 4.3 5 4.4 4.5 7.3 3.3 6 5.7 4.3 6.2 3.1
In this dataset cell length has been measured at different times and for different conditions (X,Y,Z). In addition to the variables “Length” and “Condition” a third variable “Time” is present in this dataset. Again, this dataset is not tidy since values that belong to the same variable (cell length) are spread over different columns. Note that the first column contains only values that belong to the variable “Time”, which should remain like that. So, for this dataset we want to gather all the values that represent cell lengths in one column, but these should not be mixed with the variable “Time”. To achieve this, we will use the same function gather() and add that “Time” should be excluded from gathering:
> data_tidy_time <- gather(data_spread, Condition, Length, -Time)
Let’s check the structure of the new dataframe:
> head(data_tidy_time) Time Condition Length 1 0.8 X 7.1 2 1.9 X 6.7 3 2.6 X 6.3 4 3.6 X 6.2 5 4.4 X 4.5 6 5.7 X 4.3
The new dataframe ‘data_time_tidy’ can be used as input for ggplot2 to plot a “Length” versus “Time” plot for three conditions:
> ggplot(data_tidy_time, aes(x=Time)) + geom_line(aes(x= Time, y=Length, color=Condition), size=0.5, alpha=1)
In this example we have converted a spreadsheet with time-series data in tidy data. The tidy data structure consists of three columns for the three variables “Time’, “Condition” and “Length”. Each column contains only values that belong to the indicated variable. This long format with the values for “Time” repeated several times look atypical. But this format offers flexibility, for instance when the measurements for the conditions are taken at different times. In addition, it is the only structure for packages that use tidy data as input (tidy tools). Finally, the tidy data structure enables plotting the data with ggplot2 and grouping the data or coloring the data according to the variable “Condition”.
If you do not comprehend the concept of tidy data right away, don’t worry. It took me actually quite a while to grasp it and I’m still not confident that I fully understand it. I can highly recommend reading the paper Tidy Data by Hadley Wickham, which offers a thorough and clear explanation. In my experience, the best way to learn how to tidy your data is by doing it. Start out easy, with the example dataset, or with some ‘simple’ data of your own. From there on, you can work on more complicated datasets. Many examples of tidying more complex data are out there, including a very nice tutorial on Data Tidying by Garrett Grolemund. If you are stuck, try to find solutions online, use a community site, ask a colleague (this is the right moment for me to thank Katrin Wiese) or post a tweet. I hope that this tutorial will help people that are used to working with data in spreadsheets and tables to take full advantage of the power of ggplot2, tidy tools and R.
Acknowledgments: A shout-out to the twitter community, to anyone sharing code and Jakobus van Unen, Katrin Wiese and all other colleagues for their input.