I’ve been having a lovely time playing with a dataset containing 60,000 UFO sightings. They come with locations (city/state), and a free-text description field that’s pretty fascinating to read.
To get an idea of the distribution of sightings, I plotted the frequency of UFO sightings by state, normalized by the number of residents. There’s a high number of sightings in northern New England, but the west coast seems to be the biggest epicenter of UFO sightings. Washington state in particular has vastly more UFO sightings per capita than any other state.
Clearly this indicates that when aliens visit Earth, they want to visit the Space Needle. To get a better sense of what the people reporting the sightings were experiencing, I made a word cloud of the free-text descriptions accompanying the reports from Washington.
Lots of bright lights moving around in the sky!
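For anyone curious about the per-capita normalization behind the state comparison, here is a minimal sketch in base R. The data frames, column names, and every number in them are invented for illustration; they are not taken from the real UFO dataset or from census figures.

```r
# Toy data only: one row per reported sighting (invented, not real reports)
sightings <- data.frame(
  state = c("WA", "WA", "WA", "VT", "VT", "TX"),
  stringsAsFactors = FALSE
)

# Toy resident counts per state (invented numbers, not census data)
population <- data.frame(
  state = c("WA", "VT", "TX"),
  residents = c(500, 500, 4000),
  stringsAsFactors = FALSE
)

# Count sightings per state
counts <- as.data.frame(
  table(state = sightings$state),
  responseName = "n_sightings",
  stringsAsFactors = FALSE
)

# Join counts to population and compute sightings per 1,000 residents
per_capita <- merge(counts, population, by = "state")
per_capita$rate <- per_capita$n_sightings / per_capita$residents * 1000

# Sort so the state with the highest rate comes first
per_capita <- per_capita[order(-per_capita$rate), ]
print(per_capita)
```

Dividing by the number of residents keeps populous states from dominating the map just because more people live there to look up.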
I’ve been using the amazing knitr package for the past year or so, to generate HTML reports of almost everything I do in R. It’s a great way to keep all the parts of an analysis together, in a convenient format that I can easily share with coworkers and collaborators. Maybe this makes me old-fashioned, but I also print these reports and keep hard copies on file (in addition to the copy on my hard drive, the copy on the Time Machine drive, and the copy on the off-site network drive. I like backups.)
Using knitr is ridiculously easy; knitr support is built into RStudio, and I can write the reports in Markdown, the most delightfully simple and readable language for creating HTML. I usually keep the Rmd file I’m working on visible in one pane of RStudio, while I hack things together at the prompt in another pane. I copy lines of code up into the Rmd file as soon as I have them the way I want them. Rinse and repeat until science.
The key items I include in an analysis report are:
- The data. If it’s a small amount of data, I include print statements in the code to print the numbers directly into the file. In a more typical case where the data set is much too large to include in the document, I add comments describing the data source: a URL if it’s public data from the web, unambiguous file names and paths to help find the associated files on the network drive, and notes about data generation (“You’ll find the original data in my notebook #9, pg. 35.”, or “Dr. Pineapples generated this data set in 2007 by dropping milkshakes out of the window and measuring the radius of the splat.”).
- The code. All the code that generated the analysis. If some parts take a long time to run, knitr makes it easy to cache parts of the analysis. That way the slow parts of the code don’t have to run every time the file is knitted to HTML.
- Test/validation code. I like to prove that the analysis is working as expected by checking values and data types along the way. The assertthat package is great for dropping in some automated testing so that the outcome of the tests is baked right into the report.
- The results, usually numbers and plots. knitr can embed images in HTML as base64-encoded data URIs, so it’s easy to include graphs in the report without having multiple files to keep track of. I like to insert the images into the file as URIs, and also include statements to print the images as separate files (in a format/resolution suitable for publication).
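As a rough illustration of the last two items, here is the kind of R code that might sit inside a report chunk. It is a sketch under assumptions: the `splat` data frame, its values, the `draw_splat_plot` helper, and the output file name are all made up for the example. In an actual Rmd file, a slow chunk could also carry knitr's `cache=TRUE` chunk option in its header (for example `{r slow-step, cache=TRUE}`, where the label is hypothetical) so it is not re-run on every knit.

```r
library(assertthat)

# Hypothetical stand-in data (invented values, for illustration only)
splat <- data.frame(
  drop_height_m = c(1, 2, 3, 4),
  radius_cm = c(10.2, 14.8, 18.1, 21.5)
)

# Validation: each check stops with an error if it fails,
# so a report knitted to completion has passed all of them
assert_that(is.data.frame(splat))
assert_that(is.numeric(splat$radius_cm))
assert_that(noNA(splat$radius_cm))
assert_that(all(splat$radius_cm > 0))

# Hypothetical helper so the same plot can be drawn twice
draw_splat_plot <- function(d) {
  plot(d$drop_height_m, d$radius_cm,
       xlab = "Drop height (m)", ylab = "Splat radius (cm)")
}

# Drawn inside a knitr chunk, this plot is embedded in the HTML report
draw_splat_plot(splat)

# Also write a standalone copy; PDF is one publication-friendly choice.
# tempdir() is used here only so the sketch leaves no files behind.
out_file <- file.path(tempdir(), "splat_radius.pdf")
pdf(out_file, width = 6, height = 4)
draw_splat_plot(splat)
dev.off()
```

In a real report the output path would point somewhere permanent rather than a temporary directory.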
When it’s this painless to make research reproducible, there’s no excuse not to do it. For more on using R Markdown and knitr in RStudio, check out the documentation.