The most irritating differences between R and Python

Lately I’ve been switching back and forth between R and Python for data analysis, so I’ve been getting used to keeping track of the many differences in syntax and behavior between the two languages. I don’t think I’d argue that either way is right or wrong most of the time; they’re just different due to their different histories and conventions.

In most cases where the languages differ, a valid R expression will produce an error in Python or vice versa, so it’s easy to tell that something is wrong. What makes these three differences irritating is that the syntax looks the same but the meaning is different, so the code fails silently: you get an incorrect result with no indication that anything is amiss.


1. R’s arrays are 1-indexed, while Python’s arrays are 0-indexed.

Take a list whose first two items are bubbles and turtles. In R, element 1 is bubbles.

In Python, element 1 is turtles.
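
Here is a minimal Python sketch of the difference. The list is made up for illustration: only its first two items, bubbles and turtles, come from the text above, and the comments describe what the equivalent R vector would return.

```python
# Hypothetical list; only the first two items come from the post.
things = ["bubbles", "turtles", "sparkles", "unicorns"]

# Python counts from 0, so index 1 is the SECOND item.
second = things[1]
print(second)  # turtles

# Index 0 is the first item. In R, c("bubbles", "turtles", ...)[1]
# would return "bubbles" instead, because R counts from 1.
first = things[0]
print(first)  # bubbles
```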


There’s a fascinating history behind the origin of 0-indexed arrays, by the way.

2. In Python, assignment between two names binds both names to the same object, while in R, assignment between two names creates a new object.

In R, if I assign “b = a” and then change b, it doesn’t affect a: R copies the object as soon as b is modified (copy-on-modify), so the two names behave as independent values.

In Python, if I assign “b = a” and then change b in place, the value of a changes too, because both names point to the same object. (The same is true of numpy arrays and of other mutable types, such as lists and dicts.)
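
A minimal sketch of the Python side. It uses a plain list rather than a numpy array so that it runs without any third-party packages; the values are made up.

```python
# Plain list used here so the sketch needs no third-party packages;
# a numpy array shows the same aliasing behavior.
a = [1, 2, 3]
b = a          # b is a second name for the SAME list, not a copy

b[0] = 99      # modify the list in place through b

print(a)       # [99, 2, 3]
print(b is a)  # True
```

Note that only in-place changes are shared. Rebinding b to a brand new object (for example, b = [4, 5, 6]) would leave a untouched.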




To achieve the R-style behavior in Python, call the object’s copy() method (available on numpy arrays, lists, and dicts) to create an independent copy.
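
A minimal sketch of copying, again with a made-up plain list standing in for the array:

```python
import copy

a = [1, 2, 3]
b = a.copy()   # new list holding the same items

b[0] = 99      # changes b only
print(a)       # [1, 2, 3]
print(b)       # [99, 2, 3]

# copy() is shallow: nested objects would still be shared.
# copy.deepcopy() copies the nested objects as well.
nested = [[1, 2], [3, 4]]
deep = copy.deepcopy(nested)
deep[0][0] = 99
print(nested)  # [[1, 2], [3, 4]]
```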


If you want to achieve the Python behavior in R, you’re mostly out of luck. Ordinary R values such as vectors, lists, and data frames can’t be shared between two names this way; the exception is environments (and the reference classes built on them), which are modified in place. I can’t say I’ve ever desired that behavior in R anyway.


3. The colon operator’s end location is included in the results in R, but not included in the results in Python.

In R, when you subset a vector using the colon operator, the result includes both the start and end values. In Python, when you slice a list or array using the colon operator, the result includes the start value but not the end value.

Thus, a[1:3] returns 3 elements in R (elements 1, 2, and 3), but only 2 elements in Python (the elements at indices 1 and 2, counting from 0).
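
A minimal Python sketch of the slicing behavior, with made-up values. The last step shows one way to reproduce what R’s a[1:3] would select.

```python
a = [10, 20, 30, 40, 50]   # made-up values

# Python slice: start index 1 is included, end index 3 is excluded.
py_slice = a[1:3]
print(py_slice)        # [20, 30]

# To mimic R's a[1:3] (1-based, end included), shift the start down by one
# and leave the end as it is.
r_style = a[1 - 1:3]
print(r_style)         # [10, 20, 30]
```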



This makes it easy to end up with off-by-one errors (which are, of course, one of the two hard things in computer science) if you’re not careful.


Martian vacation hotspots: why aliens love Seattle.

I’ve been having a lovely time playing with a dataset containing 60,000 UFO sightings. They come with locations (city/state), and a free-text description field that’s pretty fascinating to read.

To get an idea of the distribution of sightings, I plotted the frequency of UFO sightings by state, normalized by the number of residents. There’s a high number of sightings in northern New England, but the west coast seems to be the biggest epicenter of UFO sightings. Washington state in particular has vastly more UFO sightings per capita than any other state.


Clearly this indicates that when aliens visit Earth, they want to visit the Space Needle. To get a better sense of what the people reporting the sightings were experiencing, I made a word cloud of the free-text descriptions accompanying the reports from Washington.


Lots of bright lights moving around in the sky!


Using knitr for reproducible research in R

I’ve been using the amazing knitr package for the past year or so to generate HTML reports of almost everything I do in R. It’s a great way to keep all the parts of an analysis together, in a convenient format that I can easily share with coworkers and collaborators. Maybe this makes me old-fashioned, but I also print these reports and keep hard copies on file (in addition to the copy on my hard drive, the copy on the Time Machine drive, and the copy on the off-site network drive; I like backups).

Using knitr is ridiculously easy: knitr support is built into RStudio, and I can write the reports in Markdown, the most delightfully simple and readable language for creating HTML. I usually keep the Rmd file I’m working on visible in one pane of RStudio while I hack things together at the prompt in another pane. I copy lines of code up into the Rmd file as soon as I have them the way I want them. Rinse and repeat until science.

The key items I include in an analysis report are:
  • The data. If it’s a small amount of data, I include print statements in the code to print the numbers directly into the file. In a more typical case where the data set is much too large to include in the document, I add comments describing the data source: a URL if it’s public data from the web, unambiguous file names and paths to help find the associated files on the network drive, and notes about data generation (“You’ll find the original data in my notebook #9, pg. 35.”, or “Dr. Pineapples generated this data set in 2007 by dropping milkshakes out of the window and measuring the radius of the splat.”).  
  • The code. All the code that generated the analysis. If some parts take a long time to run, knitr makes it easy to cache parts of the analysis. That way the slow parts of the code don’t have to run every time the file is knitted to HTML. 
  • Test/validation code. I like to prove that the analysis is working as expected by checking values and data types along the way. The assertthat package is great for dropping in some automated testing so that the outcome of the tests is baked right into the report.
  • The results, usually numbers and plots. Knitr can embed images in HTML as base64-encoded data URIs, so it’s easy to include graphs in the report without having multiple files to keep track of. I like to insert the images into the file as URIs, and also include statements to print the images as separate files (in a format/resolution suitable for publication). 
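
To make the data URI idea concrete, here is a rough Python sketch of what embedding an image that way amounts to. It is not knitr’s implementation; the helper name to_data_uri and the placeholder bytes are assumptions for illustration.

```python
import base64

def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Return a data URI that carries the image inline (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"

# Placeholder bytes: the 8-byte PNG signature, not a complete image.
fake_png = b"\x89PNG\r\n\x1a\n"
uri = to_data_uri(fake_png)

# The URI goes straight into the src attribute, so the HTML report
# carries the image itself and needs no separate image file.
img_tag = f'<img src="{uri}" alt="plot">'
print(img_tag)
```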
When it’s this painless to make research reproducible, there’s no excuse not to do it. For more on using R Markdown and knitr in RStudio, check out the documentation.