Whether it was Otto von Bismarck, John Godfrey Saxe, or Mark Twain who said it first, you really don’t want to see how these things are made.
“Laws, like sausages, cease to inspire respect in proportion as we know how they are made.”
The same goes for data visualizations, or just about any processing of large sets of data from sometimes messy, unreliable sources.
While everyone likes the final product, most people who don’t work directly with data have no clue how much effort goes into turning a set of dirty data into a beautiful and reliable visualization. Beautiful is pretty easy, actually, compared to what it takes to scrub and prepare data. Once a dataset is made into a graphic, the data tends to be given more weight, which is why it’s so crucial to make sure you’re working with data you can trust.
We’ve been speaking to some of our active users, and you’ve all mentioned data quality and the time it takes to scrub the stuff as key problems in your workflows. One of you even said you’re all “overpaid information janitors” (although we wouldn’t agree you’re overpaid -- this isn’t an easy job), and the janitor part is entirely true. By some estimates, data science professionals spend between 50 and 80 per cent of their time cleaning and checking data.
If you’re working with data, you already know that much of the value comes from the people who clean the data so that it can be useful to the algorithms whose builders get all the credit. Of course, we automate as much as possible, but there’s always manual work to be done, which anyone outside the task may not understand. The business value comes from the people, not the numbers.
So we thought we’d make a little checklist you can use with people who’ve asked you to just whip something up. We can help you with the last-mile problem of visualizing what you’ve cleaned, but we know that’s only one of the hard parts of working with data.
Should you make a data visualization?
- Is the data from a trusted source?
- Can the datasets you’re using be aligned with the ones you already have?
- Are the errors relatively consistent (meaning your cleaning tasks will at least be a little more predictable)?
- Do you have access to the original raw data?
- Once you’ve scrubbed a sample of the data, is there enough reliable data left to make an interesting visualization?
- Has the data been collected with a purpose that is compatible with the reason you’re using it?
- Are you allowed to use this data?
- Is there a compelling reason to visualize this data? (e.g., is it for an external audience you need to impress?)
If you answered no to too many of these, then consider making some quick visualizations in Excel (or GetBulb) to help you learn the shape of the data and what’s in it. Then you’ll either be ready to make a visualization, or you’ll realize you don’t need one.
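If you’d rather get a first look at the shape of a sample in code before charting anything, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the `profile` and `is_missing` helpers, the list of placeholder values treated as missing, and the sample rows are all made up for this example, not part of any particular tool.

```python
from collections import Counter

# Values treated as "missing"; this set is an assumption, adjust it for your data.
MISSING = {"", "na", "n/a", "null", "none", "-"}


def is_missing(value):
    """Return True if a cell looks empty or holds a common placeholder."""
    return value is None or str(value).strip().lower() in MISSING


def profile(rows):
    """Summarize each column of a list of dict rows.

    Returns {column: {"filled": share of non-missing cells,
                      "distinct": number of distinct non-missing values,
                      "top": most common non-missing value, or None}}.
    """
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [str(v).strip() for v in values if not is_missing(v)]
        counts = Counter(present)
        report[col] = {
            "filled": len(present) / len(rows) if rows else 0.0,
            "distinct": len(counts),
            "top": counts.most_common(1)[0][0] if counts else None,
        }
    return report


# Hypothetical sample rows, invented purely for illustration.
sample = [
    {"city": "Oslo", "sales": "120"},
    {"city": "oslo", "sales": "N/A"},
    {"city": "Bergen", "sales": ""},
    {"city": "", "sales": "95"},
]

for column, stats in profile(sample).items():
    print(column, stats)
```

In this made-up sample, `city` reports three distinct values because “Oslo” and “oslo” are counted separately, and only half of the `sales` cells are filled. Those are the kinds of signals that help answer the questions about consistent errors and whether enough reliable data is left.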