As a long-time fan of the package knitr for reproducible, human-readable research in R, I’m surprised it took me as long as it did to discover the wonder that is IPython Notebook, which is an equivalent way of logging methods, data, code, and output in a single human-readable file (but, obviously, using Python instead of R). Oh yes, it’s good. Very good.
The notebooks can be saved as JSON files that are perfect for sharing as Github gists, and the JSON can be converted to HTML using the IPython Notebook viewer. Here’s a quick demo notebook that I put together using scikit-learn to do K-means clustering on the iris data set and plot the results, and the accompanying JSON-formatted gist.
This example works with one of my favorite historic data sets, known as either Fisher’s or Anderson’s iris data set depending on who you want to give credit to: Ronald Fisher, the statistician who analyzed the data, or Edgar Anderson, who actually collected and measured all those flowers. (Personally, I’d argue that Fisher already gets Fisher’s linear discriminant named after him, so we ought to let Anderson take credit for the data set!)
The data set contains measurements of the length and width of petals and sepals for three different iris species found in Canada: Iris setosa, Iris virginica, and Iris versicolor. For this example, I’ve used K-means clustering, since we know a priori that there are 3 species to identify so it’s easy to choose 3 clusters. Results are shown in a 3D plot with the color indicating the label given to the sample by the clustering algorithm.
This very basic analysis is able to achieve pretty good discrimination of the groups in this data set, giving 134/150 (89%) correctly identified samples.