The most irritating differences between R and Python

Lately I’ve been switching back and forth between R and Python for data analysis, so I’ve been getting used to keeping track of the many differences in syntax and behavior between the two languages. I don’t think I’d argue that either way is right or wrong most of the time; they’re just different due to their different histories and conventions.

In most cases where the languages differ, a valid R expression will produce an error in Python or vice versa, making it easy to tell that something is wrong. What makes these three differences irritating is that they’re all instances where the syntax looks the same but the meaning is different, so the code runs without complaint and returns an incorrect result with no indication that anything is amiss.

 

1. R’s arrays are 1-indexed, while Python’s arrays are 0-indexed.

In R, element 1 is bubbles:

 

Image

 

In Python, element 1 is turtles:

Image
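Since the screenshots don’t survive as text, here is a minimal Python sketch of the same point. Only "bubbles" and "turtles" come from the examples above; the variable name and the third entry are made up for illustration.

```python
# Hypothetical list; only "bubbles" and "turtles" come from the original example.
pets = ["bubbles", "turtles", "snails"]

first = pets[0]    # Python's first element lives at index 0
second = pets[1]   # index 1 is the *second* element

print(first)   # bubbles
print(second)  # turtles
```

The equivalent R vector would give "bubbles" for pets[1], which is exactly the silent mismatch described above.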

There’s a fascinating history behind the origin of 0-indexed arrays, by the way.

2. In Python, assignment between two names binds both names to the same object, while in R, assignment between two names creates a new object.

In R, if I assign “b = a” and then change b, it doesn’t affect a because R creates a copy:

Image

 

 

In Python, if I assign “b = a” and then change b in place, the value of a changes too because both names point to the same object. (I used a numpy array here, but the same is true of other mutable data types, such as lists and dictionaries.)

Image

 

 

To achieve the R-style behavior in Python, use copy() to create a copy:

Image
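Here is a rough text version of the two Python screenshots. It uses a plain list instead of a numpy array so that it runs without any extra packages; the variable names and values are illustrative only.

```python
a = [1, 2, 3]

b = a            # b is just another name for the same list object
b[0] = 99
print(a)         # [99, 2, 3]  (a changed too)

c = a.copy()     # c is a new, independent list
c[1] = -1
print(a)         # [99, 2, 3]  (a is unaffected this time)
print(c)         # [99, -1, 3]

print(b is a, c is a)  # True False
```

One caveat: for a list, copy() is a shallow copy, so nested lists inside it are still shared. The standard library’s copy.deepcopy() handles that case.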

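If you want to achieve the Python behavior in R, you’re mostly out of luck. As far as I know there’s no way to make two different names share one ordinary vector or data frame; environments (and the reference classes built on them) are the exception, since they are modified in place (although I can’t say I’ve ever desired that behavior in R anyway).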

 

3. The colon operator’s end location is included in the results in R, but not included in the results in Python.

In R, when you subset a vector or data frame using the colon operator, the result includes both the start and end values. In Python, when you slice a list or array using the colon operator, the result includes the start value but not the end value.

Thus, a[1:3] in R returns 3 elements:

Image

but in Python it only returns 2 elements:

Image

 

This makes it easy to end up with off-by-one errors if you’re not careful (which are, of course, one of the two hard things in computer science).
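To make the difference concrete, here is a small Python sketch. The list values and the helper name r_slice are made up for illustration; the helper simply translates an R-style range (1-indexed, end included) into a Python slice.

```python
a = [10, 20, 30, 40, 50]

print(a[1:3])   # [20, 30]      -> indices 1 and 2; the stop index 3 is excluded
print(a[0:3])   # [10, 20, 30]  -> the elements R returns for a[1:3]

def r_slice(seq, start, end):
    """Return the elements R's seq[start:end] would select (for 1 <= start <= end)."""
    # Shift the start down by one for 0-indexing; the end stays as-is because
    # Python's exclusive stop and R's inclusive end cancel out.
    return seq[start - 1:end]

print(r_slice(a, 1, 3))  # [10, 20, 30]
```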

 

IPython Notebook, where have you been all my life?

 

 

 

 

As a long-time fan of the package knitr for reproducible, human-readable research in R, I’m surprised it took me as long as it did to discover the wonder that is IPython Notebook, which is an equivalent way of logging methods, data, code, and output in a single human-readable file (but, obviously, using Python instead of R). Oh yes, it’s good. Very good. 

The notebooks can be saved as JSON files that are perfect for sharing as GitHub gists, and the JSON can be converted to HTML using the IPython Notebook viewer. Here’s a quick demo notebook that I put together using scikit-learn to do K-means clustering on the iris data set and plot the results, and the accompanying JSON-formatted gist.

This example works with one of my favorite historic data sets, known as either Fisher’s or Anderson’s iris data set depending on who you want to give credit to: Ronald Fisher, the statistician who analyzed the data, or Edgar Anderson, who actually collected and measured all those flowers. (Personally, I’d argue that Fisher already gets Fisher’s linear discriminant named after him, so we ought to let Anderson take credit for the data set!) 

The data set contains measurements of the length and width of petals and sepals for three different iris species found in Canada: Iris setosa, Iris virginica, and Iris versicolor. For this example, I’ve used K-means clustering: we know a priori that there are 3 species to identify, so 3 is the obvious number of clusters to choose. Results are shown in a 3D plot with the color indicating the label given to the sample by the clustering algorithm.

This very basic analysis is able to achieve pretty good discrimination of the groups in this data set, giving 134/150 (89%) correctly identified samples. 
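A note on how a count like 134/150 can be obtained: K-means cluster numbers are arbitrary, so they have to be matched to species before anything can be called "correct". The sketch below shows one simple way to do that, by naming each cluster after its most common species. This is an assumption about the scoring, not necessarily what the notebook does, and the toy labels are made up rather than taken from the iris results.

```python
from collections import Counter

def cluster_accuracy(true_labels, cluster_ids):
    """Count samples whose cluster's majority species matches their own species."""
    # Tally the species seen inside each cluster.
    by_cluster = {}
    for species, cluster in zip(true_labels, cluster_ids):
        by_cluster.setdefault(cluster, Counter())[species] += 1
    # Each cluster is "named" after its most common species; those samples count as correct.
    correct = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return correct, len(true_labels)

# Toy data (made up, not the iris results): 8 samples, 3 clusters.
species = ["setosa", "setosa", "versicolor", "versicolor",
           "versicolor", "virginica", "virginica", "virginica"]
clusters = [2, 2, 0, 0, 1, 1, 1, 1]

correct, total = cluster_accuracy(species, clusters)
print(f"{correct}/{total} correctly identified")  # 7/8 correctly identified
```

Majority voting can assign the same species to two clusters, so a stricter one-to-one matching would be needed if that matters.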

 

Image

Freeing my Fitbit data

I love my Fitbit, but I want to play with the data myself, not just look at the graphs the fine folks at Fitbit have decided I should see. Fortunately, there’s a great step-by-step guide with a script to download Fitbit data into a Google Drive spreadsheet, which handles interfacing with the Fitbit API. 

I downloaded my data beginning with April 2012 when I started using Fitbit. There were 18 days without any steps recorded (days when I forgot to wear the Fitbit or it ran out of batteries, plus a few days when it broke and I had to have it replaced), which I eliminated from the analysis. Other than those few days, I have a complete record of my activity for the past 21 months, which is pretty cool. 

I used Python to plot the daily step counts along with a time-smoothed rolling average to make it easier to see long-term trends. 

Image
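For a sense of what the smoothing involves, here is a minimal sketch of a trailing rolling mean in plain Python. The step counts and the 3-day window are made up for illustration; the actual script (linked further down) may well use a different window or a library routine.

```python
def rolling_mean(values, window):
    """Trailing rolling mean; positions with fewer than `window` values give None."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        out.append(running / window if i >= window - 1 else None)
    return out

# Hypothetical daily step counts; 0 marks a day with nothing recorded.
steps = [9000, 12000, 0, 11000, 15000, 8000, 10000]
recorded = [s for s in steps if s > 0]   # drop days with no steps recorded
smoothed = rolling_mean(recorded, window=3)
print(smoothed)
```

The first window - 1 entries are left as None rather than averaged over a shorter span, so the smoothed line simply starts a little later than the raw data.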

The most obvious features to me are the high-step-count spikes during the summer, when I’m most likely to go adventuring outdoors on the weekends. To get a better sense of the distribution of my steps, I plotted a histogram:

Image

I average about 10,000 – 11,000 steps per day, but there’s a lot of variation. I usually end up recording at least 12,000 steps on days when I bike to work, because I find that each pedal rotation registers as a step. I suspect that the days in the tail on the left side of the histogram (<5000 steps) are mostly days when I was sick or otherwise feeling down, because I generally find it downright difficult to walk fewer than 5000 steps in a day unless I have a cold or similar. 

Finally, I broke the histogram down by days of the week. It got a little messy to look at, so I used Gaussian kernel smoothing to turn the histogram into a density plot that’s easier on the eyes: 

Image
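Gaussian kernel smoothing places a small bell curve on every observation and adds them up. Here is a minimal plain-Python sketch of the idea; the step counts and the bandwidth of 2000 steps are assumptions for illustration, and a library routine may pick the bandwidth automatically instead.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` at a point x."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum one Gaussian bump per observation, centred on that observation.
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

# Hypothetical Saturday step counts.
saturday_steps = [7000, 9500, 11000, 16000, 18500]
density = gaussian_kde(saturday_steps, bandwidth=2000)

grid = range(0, 30001, 500)
curve = [density(x) for x in grid]   # y-values for a density plot over the grid
```

A larger bandwidth gives a smoother but blurrier curve; a smaller one follows the individual days more closely.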

Most of the days are similar to each other, but Saturday clearly has the most variation, and the most days with very high step counts. In fact, I walk 15,000+ steps 17% of the time on Saturdays, but less than 10% of the time on other days of the week. And on that note, the dog and I are going for a walk. 
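A figure like "15,000+ steps 17% of the time on Saturdays" comes from grouping the days by weekday and counting how many clear a threshold. A minimal sketch follows, with made-up records and a hypothetical function name share_above.

```python
from collections import defaultdict
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def share_above(records, threshold):
    """Map weekday name -> fraction of that weekday's records at or above threshold."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for day, steps in records:
        name = DAY_NAMES[day.weekday()]   # weekday() returns 0 for Monday
        totals[name] += 1
        if steps >= threshold:
            hits[name] += 1
    return {name: hits[name] / totals[name] for name in totals}

# Hypothetical (date, steps) records: two Saturdays and two Mondays.
records = [
    (date(2013, 12, 7), 16200),
    (date(2013, 12, 14), 9800),
    (date(2013, 12, 9), 10400),
    (date(2013, 12, 16), 11900),
]

shares = share_above(records, 15000)
print(shares)  # {'Saturday': 0.5, 'Monday': 0.0}
```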

Here’s the code I used to generate the graphs: 

https://gist.github.com/8090081

Dental year-in-review

Earlier this year in May, I started tracking my dental hygiene habits by giving myself a sticker on a calendar every time I flossed my teeth. This started because I really hate flossing — there’s nothing inherently fun or rewarding about it, so I wondered if there was anything I could do to incentivize it or make it more fun.

I just kept going with it, and now I have eight months of flossing history to look at. I entered the data into Excel (giving myself a ‘1’ on days where I had a sticker on the calendar and a ‘0’ otherwise), exported it as a CSV, and used Python to graph the results broken down by days of the week.
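The day-of-week breakdown can be sketched in a few lines of plain Python. The column names ("date", "flossed"), the date format, and the sample rows below are all assumptions, since the real spreadsheet layout isn’t shown here.

```python
import csv
import io
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Stand-in for the exported CSV file (hypothetical columns and values).
csv_text = """date,flossed
2013-12-02,1
2013-12-03,1
2013-12-07,0
2013-12-08,0
2013-12-09,1
2013-12-14,1
"""

def floss_rate_by_weekday(handle):
    """Return weekday name -> fraction of recorded days with a sticker."""
    counts = {name: [0, 0] for name in DAY_NAMES}   # [flossed days, total days]
    for row in csv.DictReader(handle):
        day = datetime.strptime(row["date"], "%Y-%m-%d").date()
        entry = counts[DAY_NAMES[day.weekday()]]
        entry[0] += int(row["flossed"])
        entry[1] += 1
    # Skip weekdays that never appear, to avoid dividing by zero.
    return {name: f / t for name, (f, t) in counts.items() if t > 0}

rates = floss_rate_by_weekday(io.StringIO(csv_text))
print(rates)
```

With a real export, io.StringIO(csv_text) would be replaced by an open file handle; the resulting dictionary is what a bar chart of success rate per weekday would be drawn from.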

I used Python’s matplotlib to make the graph, then I also used RPy to make the graph using R since I’ve never really directly compared graphs made using the two methods. Here’s the matplotlib plot:

Image

and here’s the RPy plot:

Image

I prefer the bars of the RPy plot because they line up nicely without any effort, but I do enjoy how matplotlib turned the y-axis labels horizontal for me. Of course for any more complex plotting I think the advantages of RPy would become much more apparent.

But more importantly: what can I learn about my oral health from this analysis? My flossing habits during the week are quite good, with >80% success rate regardless of the day, but on the weekends I tend to sleep in, fail to follow a standardized morning routine, and consequently end up promising myself I’ll floss at night (hint: it never happens). Don’t tell my dentist.

Here’s the code I used to generate these images: