Python vs (and) R for Data Science

Brian Ray
9 min read · Jun 10, 2018

As requested, I’m publishing this guide for those wishing to choose between the Python and R programming languages for Data Science. Whether you are new to Data Science or need to pick one language for a project, this guide will help you.

Not a disclaimer: I am a manager of Data Scientists for one of the largest employers of Data Scientists (Deloitte). These are my opinions. I’ve also consulted with R and Python for several decades. I’m language agnostic, but have been heavily involved with the Python community for 15 years or so.

There may be a third choice

Hadley Wickham (https://twitter.com/hadleywickham), Chief Scientist at RStudio, replied: “Replace ‘vs’ with ‘and’.” Prompted by this, I will also cover a third choice: using Python and R together. This option intrigues me, and I cover it toward the end of this article.

How we compare R and Python

Not an exhaustive list by any means, here are some factors worth comparing between the two languages:

  1. History: R and Python have distinctly different histories that sometimes crossed paths.
  2. Community: many complex sociological and anthropological factors observed through field work.
  3. Performance: a careful comparison and why it is so hard to compare.
  4. Third Party Support: modules, code bases, visualizations, repositories, organizations, and development environments.
  5. Use Case: some tasks and types of work may lend themselves to one or the other.
  6. Can’t we all just get along? Using Python with R and R with Python.
  7. Predicting R vs Python: a telling exercise in eating our own dog food.
  8. Preference: the ultimate answer.

History

A brief history:

  • ABC -> Python invented (1989, Guido van Rossum) -> Python 2 (2000) -> Python 3 (2008)
  • Fortran -> S (Bell Labs) -> R invented (1991, Ross Ihaka and Robert Gentleman) -> R 1.0.0 (2000) -> R 3.0.2 (2013)

Community

The first thing to keep in mind when comparing the users of Python vs R, is that:

Only 50% of the users of Python overlap with R

That is assuming that all R programmers would call their use “Scientific and Numeric”. We also determined that this distribution holds regardless of the level of the programmer.

To dive further into the Python “Hype”, read my article on my Python Hype Survey Results.

If we look only at the scientific and numeric community, that brings us to a second question: which community? There are several sub-communities within the overall scientific and numeric community. Although there may be some overlap, as you would suspect, they behave quite differently in how they interact with the larger R and Python communities.

Some examples of sub-communities using Python/R:

  • Deep Learning
  • Machine Learning
  • Advanced Analytics
  • Predictive Analytics
  • Statistics
  • Exploration and Data Analysis
  • Academic Scientific Research
  • An almost endless list of Computation Fields of Study

While each domain seems to serve a specific community, you will find R more prevalent in places like Statistics and Exploration. Not so long ago, you could be up and running and doing some fairly meaningful exploration with R in far less time than it would take to install Python and do similar exploration.

All that has been changed by the disruptive technologies called Jupyter Notebooks and Anaconda.

Note: Jupyter Notebooks add the ability to code Python/R in the browser; Anaconda allows easy installation and package management for Python and R.

Now that you can get up and running in an environment that is friendly to reporting and analysis out of the box, a barrier has been removed that sat between those who wish to do the task and the language they love. Python can now come packaged in a platform-independent way and provide quick-and-dirty analysis quicker than ever before.

Another distinction in community that affects language choice is the idea of “open source”: not just open-source libraries, but the impact of collaborative communities contributing to open source. Ironically, open-source software from TensorFlow to the GNU Scientific Library (Apache- and GPL-licensed, respectively) seems to have both Python and R bindings. Despite the copyleft licensing of R, there still seems to be more support from purists for the Python community. On the flip side, there seems to be more enterprise support for R, especially among those with a history in statistics.

Lastly, regarding community and collaboration, there is far more support on GitHub for Python. If I look at the latest trending packages for Python, I see projects like TensorFlow with over 35K stars. In turn, if I look at the latest trending packages for R, packages like Shiny, Stan, … all have fewer than 2K stars.

Performance

This never goes well. The reasons are that there are too many metrics and situations to test. It’s hard to test on any one particular hardware. Some operations are optimized in one language and not the other. Surely, you will miss something, someone will complain, friends will be lost, and the whole analysis will be tossed away with gusto! Regardless of that, here we go…

Looping ~ Silly

Before we go there, let’s think about how Python is used vs R. Do you really want to do a lot of looping over things in R? My guess is that the intent of the language may be slightly different.

0.000037 sec for Python, 0.00158 sec for R

As a sanity check, including the load time and just running on the command line: R was real 0m0.238s, Python real 0m0.147s. Again, this is not a scientific test.

A quick test shows Python is significantly faster. Usually, it just does not matter.
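The loop code behind the timings above is not reproduced in this article. As a rough illustration only, here is a minimal sketch of how such a micro-benchmark could be run on the Python side with the standard library's timeit module. The workload (loop_sum) and the helper name (seconds_per_call) are hypothetical, not the original test, and the numbers will differ on every machine.

```python
import timeit


def loop_sum(n):
    """Add up 0..n-1 with an explicit for loop (a hypothetical workload)."""
    total = 0
    for i in range(n):
        total += i
    return total


def seconds_per_call(func, arg, repeats=5, calls=1000):
    """Return the best observed average wall-clock seconds per call."""
    timings = timeit.repeat(lambda: func(arg), repeat=repeats, number=calls)
    return min(timings) / calls


if __name__ == "__main__":
    elapsed = seconds_per_call(loop_sum, 1000)
    print(f"loop_sum(1000): {elapsed:.6f} sec per call")
```

Taking the minimum over several repeats reduces noise from other processes, but this is still a toy measurement and says little about real Data Science workloads.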

What does matter to a Data Scientist regarding speed? The emerging trend found in both languages is their ability to be used as a command language. For example, most of those programming in Python rely heavily on Pandas for their work. This moves the topic to what modules and libraries exist in each language and how they perform. That is a more meaningful comparison.

Third Party Support

Package Managers

Python has PyPI, R has CRAN, both have Anaconda.

CRAN packages are installed with the `install.packages` command built into the R distribution. As of this writing, there are around 12K packages available on CRAN. Scrolling through the list, it appears that half or more of all packages have something to do with Data Science: roughly 6K or more.

PyPI has over 10X the number of packages: 141K. There are 3.7K packages labeled as Scientific/Engineering specific. Many others are indeed scientific and are just not labeled as such.

In both cases, neither seems to suffer from gross duplication of effort. Sure, I get 170 projects on PyPI when I search for “Random Forest”; however, the packages seemingly are different.

Although Python has 10X the number of packages, the number of scientific Data Science packages is about the same, if not slightly fewer, for Python.

Availability of third-party packages is a very big deal. Having to write something from scratch just so it will run in your language of choice is a bummer. Likewise, if you do end up doing that, I hope you contribute that work back to the Open Source community.

The Speed on Stuff that actually matters

R DataFrames vs Pandas is probably a much more meaningful comparison and one that really matters.

We conducted an experiment: compare the execution times on a complex exploratory effort while mirroring each part. Here are the results:

Python was quicker at most tasks.

Source code: http://nbviewer.jupyter.org/gist/brianray/4ce15234e6ac2975b335c8d90a4b6882

As we see, Python+Pandas was largely quicker than native R DataFrames. Please note this does not mean Python is a quicker runtime: Pandas is built mostly on NumPy, which is written in C.
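The point that a library's speed is not the runtime's speed can be illustrated without Pandas at all. The sketch below is a stand-in, not the experiment linked above: it compares an explicit Python loop with the built-in sum, which CPython implements in C. The function names and input sizes are hypothetical.

```python
import timeit


def python_loop_total(values):
    """Sum the values with an interpreted, bytecode-level loop."""
    total = 0
    for value in values:
        total += value
    return total


def builtin_total(values):
    """Sum the values with the built-in sum (implemented in C in CPython)."""
    return sum(values)


def compare(n=100_000, calls=20):
    """Time both approaches on the same list; return (loop_secs, builtin_secs)."""
    values = list(range(n))
    loop_secs = timeit.timeit(lambda: python_loop_total(values), number=calls)
    builtin_secs = timeit.timeit(lambda: builtin_total(values), number=calls)
    return loop_secs, builtin_secs


if __name__ == "__main__":
    loop_secs, builtin_secs = compare()
    print(f"explicit loop: {loop_secs:.4f} sec, built-in sum: {builtin_secs:.4f} sec")
```

Both functions return the same answer; the difference is where the loop runs. Pandas and NumPy apply the same idea at a much larger scale by pushing whole-column operations into compiled code.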

Visualize this!

What I am really saying is ggplot2 vs matplotlib. Disclaimer: matplotlib was written by one of the people I valued most in the Python community and one who taught me Python, John D. Hunter.

Matplotlib is an 800 lb gorilla. Customizing it is possible and it is very extensible, but it is not easily learned. Customization in ggplot2 is not easy either, and some would say it is even more difficult.

If you like pretty plots and you don’t need to customize at all, R is my pick. If you need to do a lot more, then Matplotlib and possibly even the interactive Bokeh would be helpful. Similarly, Shiny for R would add the interactivity you may be seeking.

Can’t we all just get along?

One might ask: why can’t you just use both at the same time?

There are times you can use the two together. Times when:

  • your group or organization allows you.
  • you can get both set up and maintained easily in your environment.
  • your code does not need to go into another system.
  • you aren’t creating a confusing mess for someone else.

Some ways to use the two together are:

Then we can actually pass the pandas data frame, and it is automatically converted (by rpy2) into an R data frame, passed with the “-i df” switch.

sources: http://nbviewer.jupyter.org/gist/brianray/734bd54f468d9a6db9171b2cfc98405a
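Besides the rpy2 notebook route, a cruder way to combine the two languages is to shell out to R from a Python script. The sketch below is not the method used in the notebook linked above. It assumes the Rscript executable that ships with R is on the PATH, and the helper name run_r_expression is hypothetical. It returns None when R is not installed, so the script still runs on a Python-only machine.

```python
import shutil
import subprocess


def run_r_expression(expression, timeout=30):
    """Evaluate an R expression with Rscript and return what it printed.

    Returns None when the Rscript executable is not found on PATH.
    """
    rscript = shutil.which("Rscript")
    if rscript is None:
        return None
    completed = subprocess.run(
        [rscript, "--vanilla", "-e", expression],
        capture_output=True,  # collect stdout/stderr instead of echoing them
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError if R exits with an error
        timeout=timeout,
    )
    return completed.stdout


if __name__ == "__main__":
    output = run_r_expression("cat(mean(c(1, 2, 3)))")
    print("Rscript not found" if output is None else output)
```

This trades convenience for isolation: data has to cross the boundary as text or files, but neither language needs bindings to the other, which can help when the environment is hard to set up and maintain.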

Predicting R vs Python

Someone on Kaggle wrote a kernel on predicting whether a developer uses R or Python. He came up with some interesting observations based on the data:

  • If you’re looking to move towards Linux next year, you’re more likely a Python user
  • If you studied statistics, you’re more likely an R user, and if computer science, a Python user
  • If you’re young (18–24 years old), you’re more likely a Python user
  • If you do code competitions, you’re more likely a Python user
  • If you want an Android next year, you’re more likely a Python user
  • If you want to learn SQL next year, you’re more likely an R user
  • If you use MS Office, you’re more likely an R user
  • If you want a Raspberry Pi next year, you’re more likely a Python user
  • If you’re a full time student, you’re more likely to be a Python user
  • If you’re using Agile methodology, you’re more likely to be a Python user
  • If you’re more worried than excited about AI, then you’re more likely to be an R user

Preference

When I corresponded with Alex Martelli, Googler and Stack Overflow lord, he explained to me why Google had started with a few languages they officially supported. Even in a free-spirited, innovative space like Google, there seem to be some restrictions. That is a preference that comes into play here as well: corporate preference.

Aside from corporate preference, someone in an organization is usually the first. I know who the first was at Deloitte to use R. He’s still with the firm and is now the Lead Data Scientist. Point being, and my general advice in all things: follow what you love, love what you follow, lead the pack, and love what you do.

One qualifying statement: although I’ve never been a tool-first thinker, if you are working on something important, it may not be the best time to experiment. Mistakes are possible. However, every well-designed Data Science project leaves some headroom for the Data Scientists. Use a portion of that to learn and experiment. Keep an open mind and embrace diversity.

In closing, I’m sticking mostly with Python but am looking forward to learning more R, with and without Python.


Brian Ray

Long time Python-isto, Inquisitor, Solver, Data Science in Cognitive/AI/Machine Learning Frequent Flyer