The hunt for data
Most people involved in data visualisation can relate to the challenges faced when on the hunt for data on the internet. In order to create a meaningful data visualisation or carry out a piece of analysis you often have to turn to multiple sources. As anyone who has gone through this process knows, the data you find will come in different shapes and formats, and you have to scroll through websites looking for relevant data and comparing different results. You sometimes find relevant open-source data nicely stored in a .csv file. There will usually be some type of data preparation required such as removing blank rows or pivoting the data. Reports are often designed to be easily consumed by humans but this isn’t usually the best format for loading into Tableau. When data is found in this format, you can be fairly confident you will be able to work with it and analyse it after spending some time transforming it. On several occasions I’ve come across really interesting articles or publications which lead me to others as I try to trace back to the original source. Often the data that is being referred to is not conveniently stored in a table such as a .csv or Excel file but rather in table in a PDF report. Of course, you could type these into a spreadsheet, but this would be both tedious and introduce the chance of manual error. Enter Tableau’s PDF connector (available since version 10.3) which enables you to access the data in these tables in a flash.
I saw this demoed by the one and only Andrew Beers at the Tableau Conference on Tour in London and was impressed at the ease at which he was able to access multiples tables within a lengthy PDF document. The sceptic in me wondered if it’s always that easy or if this was a carefully chosen document for the purposes of the demo in front of 1000+ people (quite likely!). A few weeks ago I trialled this myself and confirmed that it really was that easy. I recalled one particular document that I had come across containing statistics about births and feeding trends in Ireland. There had been several walls of figures spanning pages which I struggled to glean insights from so I used this as a test case. I recently published the visualisation I subsequently created for World Breastfeeding Week which contained data from this and other sources. I really saw the value of the PDF connector while creating this as it allowed me access various tables across different reports with ease. The visualisation can be seen above and the interactive version is available on Tableau Public here.
Firstly, here are some useful links if you want to read
Three different cases I encountered
Case 1 – Table on one page
When you’re connecting to data you just need to select ‘PDF file’ and then you have an option to scan the whole document or specify which pages you’re interested in (1 below). One thing to note is that Tableau counts the actual pages in the PDF document, not the numbers printed in the document itself. I then selected ‘Use Data Interpreter’ so that the column headings i.e. the years would be recognised as such (2 below). I subsequently pivoted the columns so that years were not going across but were stored as one field. This should be it in terms of connecting if your table is printed on one page in the document.
Case 2 – Table spanning multiple pages with headers repeated
This particular table spanned two pages and contained column headers in the first part of the table which were repeated in subsequent pages. In order to append the two parts together I needed to do a union. Again this was quite straightforward and the full table was available to analyse once I had created the union.
Case 3 – Table spanning multiple pages with no headers repeated
There is an extra step required if the headers are not repeated. You need to first do a union as in the above case.
In the absence of headers, Tableau doesn’t know that the fields in the second part of the table are the same as as in the first part, so it appends the rows at the end but in new columns. To match them up, simply highlight the columns that belong together, right click and select to merge mismatched fields (see below where the first field with the county stopped after Louth and Mayo in the first column and then the additional counties Meath and Monaghan were in a new column).
By merging you see they then continue along in the first column. This needs to be repeated for every pair of columns.
As with any data source you need to make sure you validate the data that has been brought into Tableau by cross-checking figures and totals. I found I had to pay particular attention to blank rows in the data which produced additional rows of NULLs in Tableau. I simply filtered these out. I also had to make sure that I was capturing complete text values as in some cases these were spread over two rows when the values were too long to fit. While these aspects were easily overcome, they are areas to be aware of when connecting to PDF files.
That’s my two cents this Tuesday!
Thanks for reading.