Visualizing data with graphs, charts, and other tools is crucial for understanding the full story the data tells. A common way to demonstrate this is with datasets that share nearly identical statistical properties yet look completely different when plotted.
The Complete Tableau Bootcamp for Data Visualization course presents three well-known examples that illustrate the importance of data visualization:
- Anscombe’s Quartet
- Datasaurus
- Datasaurus Dozen
These examples highlight the importance of not relying solely on summary statistics and emphasize the critical role of data visualization tools in accurate data analysis.
Anscombe’s Quartet
The first set, Anscombe’s Quartet, was developed by the statistician Francis Anscombe and comprises four distinct datasets. Each dataset contains eleven (x, y) points, and the first three datasets share the same x values.
Before creating scatter plots, let’s look at the summary statistics for each dataset.
Judging by basic summary statistics alone, the four datasets appear interchangeable: the X/Y means, X/Y standard deviations, and Pearson’s correlation agree to two decimal places.
On that basis, we would expect the datasets themselves to look much alike.
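As a minimal sketch of how these statistics can be checked, the snippet below computes them with the Python standard library only. The data values are the commonly reproduced ones from Anscombe’s 1973 paper; the names `QUARTET`, `pearson`, `summarize`, and `SUMMARY` are illustrative, and the sample (n - 1) standard deviation is an assumption about which variant is meant.

```python
import math
import statistics

# Anscombe's Quartet, as commonly reproduced from Anscombe (1973).
X_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by datasets I-III
QUARTET = {
    "AQ-I": (X_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "AQ-II": (X_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "AQ-III": (X_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "AQ-IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
              [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson's correlation coefficient, computed from its definition."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def summarize(xs, ys):
    """Return the five summary statistics discussed in the text."""
    return {
        "x_mean": statistics.fmean(xs),
        "y_mean": statistics.fmean(ys),
        "x_std": statistics.stdev(xs),  # sample standard deviation (assumption)
        "y_std": statistics.stdev(ys),
        "r": pearson(xs, ys),
    }


SUMMARY = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}

# Print each dataset's statistics rounded to two decimals.
for name, stats in SUMMARY.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Rounded to two decimals, the four printed rows should be indistinguishable, which is exactly why the numbers alone are misleading.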
Let’s explore their behavior through graphical representation. Along with a scatter plot for each dataset, we include a linear regression line and its equation.
The datasets share the same summary statistics and were designed to produce nearly identical linear regression lines, matching to 2 decimal places for the intercept and 3 decimal places for the slope. Even so, their graphical representations reveal clear and distinct differences.
When plotted, the datasets exhibit the following characteristics:
- AQ-I: Shows a simple linear relationship with a strong upward trend, indicating two correlated variables.
- AQ-II: Displays a different narrative, showing a non-linear relationship between the variables.
- AQ-III: Exhibits a perfect linear relationship, except for one outlier that greatly impacts the results, reducing the correlation coefficient from 1 to 0.816.
- AQ-IV: Illustrates a scenario where a single high-leverage point can lead to a high correlation coefficient, even though the other data points do not suggest any relationship between the variables.
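The regression and outlier claims above can be checked numerically. The following is a minimal sketch using ordinary least squares written out by hand; `fit_line`, `pearson`, and the upper-case result names are illustrative, and the data values are again the commonly reproduced ones from Anscombe’s paper.

```python
import math

X_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by datasets I-III
QUARTET = {
    "AQ-I": (X_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "AQ-II": (X_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "AQ-III": (X_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "AQ-IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
              [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def pearson(xs, ys):
    """Pearson's correlation coefficient, computed from its definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Fit a regression line to each of the four datasets.
FITS = {name: fit_line(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, (slope, intercept) in FITS.items():
    print(f"{name}: y = {intercept:.2f} + {slope:.3f}x")

# AQ-III: drop the single outlier (the point at x = 13) and recompute r.
xs3, ys3 = QUARTET["AQ-III"]
kept = [(x, y) for x, y in zip(xs3, ys3) if x != 13]
R3_WITHOUT_OUTLIER = pearson([x for x, _ in kept], [y for _, y in kept])
print("AQ-III r without outlier:", R3_WITHOUT_OUTLIER)

# AQ-IV: without the high-leverage point (x = 19), x takes a single value,
# so the remaining points carry no information about a relationship.
xs4, _ = QUARTET["AQ-IV"]
X4_WITHOUT_LEVERAGE = {x for x in xs4 if x != 19}
print("AQ-IV distinct x values without x = 19:", X4_WITHOUT_LEVERAGE)
```

Comparing the printed equations side by side with the four scatter plots makes the point concrete: the fitted line is effectively the same in every case, while only the first dataset is actually well described by it.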
This example highlights that descriptive statistics and numerical information alone may not provide a comprehensive understanding of the data.
Datasaurus
Data visualization expert Alberto Cairo created the Datasaurus dataset in 2016. In a tweet, he urged readers to
“never trust summary statistics alone; always visualize your data.”
This dataset comprises 142 data points with seemingly unremarkable summary statistics.
However, when plotted, these points reveal the distinctive shape of a dinosaur.
Datasaurus Dozen
For decades, Anscombe’s Quartet has been a widely recognized and effective tool for emphasizing the importance of visualizing data. However, it is not known how Anscombe constructed his datasets.
Justin Matejka and George Fitzmaurice introduced a general method for transforming any dataset into a target shape of choice while preserving specified summary statistics (to two decimal places). Inspired by the Datasaurus dataset (Dino), they used it as the starting point to generate a set of diverse datasets, resulting in The Datasaurus Dozen.
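The idea can be sketched roughly as follows; this is not the authors’ implementation. The loop nudges one random point at a time and keeps the move only if the point ends up no farther from a target shape and the five summary statistics are unchanged when rounded to two decimals. The published method accepts moves via simulated annealing, whereas this sketch swaps in a simpler greedy rule. The circle target, the step size, and all function names are illustrative assumptions.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson's correlation coefficient, computed from its definition."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def rounded_stats(points, digits=2):
    """X/Y mean, X/Y sample standard deviation, and correlation, rounded."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    values = (
        statistics.fmean(xs), statistics.fmean(ys),
        statistics.stdev(xs), statistics.stdev(ys),
        pearson(xs, ys),
    )
    return tuple(round(v, digits) for v in values)


def distance_to_circle(point, center=(50.0, 50.0), radius=30.0):
    """Distance from a point to a hypothetical target shape (a circle)."""
    return abs(math.hypot(point[0] - center[0], point[1] - center[1]) - radius)


def morph(points, steps=20000, step_size=0.1, digits=2, seed=0):
    """Greedily pull points toward the circle while keeping rounded stats fixed."""
    rng = random.Random(seed)
    points = list(points)
    target = rounded_stats(points, digits)
    for _ in range(steps):
        i = rng.randrange(len(points))
        old = points[i]
        new = (old[0] + rng.gauss(0, step_size), old[1] + rng.gauss(0, step_size))
        if distance_to_circle(new) > distance_to_circle(old):
            continue  # the move drifts away from the target shape
        points[i] = new
        if rounded_stats(points, digits) != target:
            points[i] = old  # the statistics changed: undo the move
    return points
```

Calling `morph(points)` on a list of (x, y) tuples returns a new list whose rounded statistics match the input. A greedy rule like this can stall before the shape is fully formed, which is one reason a temperature-based acceptance scheme is a better fit for the real task.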
The Datasaurus Dozen consists of 13 datasets: the original Datasaurus dataset along with 12 others. While each dataset looks drastically different from the rest, they all have the same summary statistics (X/Y mean, X/Y standard deviation, and Pearson’s correlation) to the second decimal point.
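One way to verify this claim is to group the points by dataset and compute the statistics per group. The sketch below assumes the data is available locally as a tab-separated file with a header row and the columns `dataset`, `x`, and `y`; the file layout, the column names, and the function names are assumptions, so adjust them to match your copy of the data.

```python
import csv
import math
import statistics
from collections import defaultdict


def pearson(xs, ys):
    """Pearson's correlation coefficient, computed from its definition."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def summarize_by_dataset(path, digits=2):
    """Map each dataset name to (x mean, y mean, x std, y std, r), rounded."""
    groups = defaultdict(lambda: ([], []))
    with open(path, newline="") as handle:
        # Assumed layout: tab-separated with columns "dataset", "x", "y".
        for row in csv.DictReader(handle, delimiter="\t"):
            xs, ys = groups[row["dataset"]]
            xs.append(float(row["x"]))
            ys.append(float(row["y"]))
    summary = {}
    for name, (xs, ys) in groups.items():
        values = (
            statistics.fmean(xs), statistics.fmean(ys),
            statistics.stdev(xs), statistics.stdev(ys),
            pearson(xs, ys),
        )
        summary[name] = tuple(round(v, digits) for v in values)
    return summary
```

For example, `summarize_by_dataset("DatasaurusDozen.tsv")` (a hypothetical local file name) would return one tuple per dataset, and `len(set(result.values())) == 1` would then confirm that all of them agree to two decimals.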
Instead of plotting all datasets in a single figure, I wanted to plot each dataset with its own statistics in a separate figure. To avoid creating 13 separate cells, I implemented two ways of displaying the Datasaurus Dozen plots in the notebook Importance of Data Visualization.ipynb in this GitHub repository:
- Interactive controls: Utilizing a dropdown menu to select any of the 13 datasets for plotting, without the need to rewrite or rerun the code.
- Image animation: Creating the illusion of plot animation.
To display and explore both, download the notebook and run it on your local machine.
As the final result, the notebook produces this GIF animation:
In addition to the visualizations displayed in the notebook, you can also view previously created “datasaurus” plots in Tableau using the same data:
- Datasaurus
- Datasaurus Dozen (use the slider to switch among 13 datasets)
Conclusion
Statistics focuses on employing objective, quantitative measures to understand data. However, these examples demonstrate that relying solely on summary statistics is insufficient, and data visualization tools are critical for proper data analysis. Through data visualization, similarities and differences among datasets become apparent.
What better way to conclude the blog than with a quote:
“Make both calculations and graphs. Both sorts of output should be studied; each will contribute to understanding.” – F.J. Anscombe, 1973