Skip to content

Latest commit

 

History

History
183 lines (149 loc) · 13.9 KB

coming-from-r.md

File metadata and controls

183 lines (149 loc) · 13.9 KB
jupytext kernelspec
cell_metadata_filter formats text_representation
-all
md:myst
extension format_name format_version jupytext_version
.md
myst
0.8
1.5.0
display_name language name
Python 3.10.12 64-bit ('codeforecon': conda)
python
python3

Coming from R

Python and R have strong similarities when it comes to economics and data science. If you're coming from R, then you're probably already familiar with coding and with integrated developer environments. And because there are Python packages that replicate the major R packages, there is probably no easier switch in high-level programming languages than the one from R to Python.

There are, however, some fundamental differences between the two languages. The biggest difference between Python and R is that Python is a general purpose programming language, while R began as a statistical language. But you won't really notice this unless you're writing something that looks more like production-grade software. (And, despite being a general language, Python has really fantastic support for statistics.) A second difference is that Python is more object-oriented while R is more functional. These are two different programming paradigms. Actually, both languages do a bit of both and, again, you're unlikely to notice any difference most of the time.

Actually, the biggest practical difference in the context of packages for economics and data science is that Python has more of a flavour of the bazaar—there are lots of people, and you can find everything under the Sun, but it can be a little bit chaotic—while R has the feel of a curated garden—there is a chief gardener (Posit, formerly RStudio, but renamed to focus more on Python) tending a smaller number of more beautiful things, but the garden has boundaries. As an example of this dynamic, there are around 20,000 packages on CRAN, R's official and strict package server; there are 350,000+ on PyPI, the Python equivalent (but it's easier to get a package accepted).

For those coming from the 'tidyverse' set of packages produced by Posit/RStudio, there are very direct Python equivalents. For example, Python has lets-plot which has almost exactly the same syntax to R's ggplot2. There's also plydata, which has the same syntax as R's dplyr package, and polars, which is our recommended substitute for dplyr as it has a consistent syntax and, wow, is it fast. In Python, matplotlib and pandas are more popular packages for plotting and data analysis, respectively, but those R-style packages are absolutely there if you prefer them. More on other similar packages below.

There's a list of more fundamental differences between R and Python as programming languages at the end of this chapter, but a couple of important gotchas to be aware of up front: first, R uses vectors, arrays, etc., that are indexed from 1. Like C++, Python is numbered from zero, with, eg, a[0] as the first element. Second, <- is the preferred assignment operator in R but in Python it's = (and <- isn't used). In fact, in R a<-5 assigns a the value 5, while a<-5 in Python would return True or False based on whether a was less than -5 or not!

Tools with similar syntax to those found in an R workflow

If you are coming from R, you're likely familiar with dplyr for data analysis and ggplot2 for plotting. There are Python equivalents that have very similar syntax to these that you can use to help you to become productive quickly in Python--though these libraries are not so popular in Python. Here are the Python equivalents to those R libraries:

  • dplyr: This book recommends learning at least some pandas for data analysis in Python because it is a package that is comprehensive and ubiquitous, and it's the most popular Python library that performs similar functions to dplyr. pandas also has unrivaled documentation. However, if you want something closer to dplyr in terms of philosophy and syntax, there are a host of options available to you:

    • the package we recommend is the blisteringly fast polars. It has different syntax to dplyr but has a very consistent API that will be familiar to those coming from the R package.
    • tidypolars combines the syntax of dplyr with the best-in-class speed of Python package polars
    • another library that gets close to dplylr is datar. Because it uses a consistent framework for data piping under the hood called pipda, datar seems highly extensible too. It also integrates with plotnine for visualisation.
    • siuba is a port of dplyr, tidyr, and other R libraries for data analysis. It has the nice property that it can use SQLite or DuckDB as the back-end.
    • pyjanitor builds a range of extra features on top of pandas. Two of the things it adds are likely to make anyone coming from dplyr feel a bit more at home: better support for method chaining and some functions with the same names as the ones in dplyr and which do the same things.
  • ggplot2: the most similar Python version of this library is lets-plot, and it is indeed extremely similar. So much so that you'll be able to write in it in about five seconds if you're already familiar with ggplot2. This isn't the only option for you: plotnine is another choice for declarative plotting, as is seaborn's object API. R doesn't have a core imperative plotting package, but in Python this role is fulfilled by the astonishingly versatile matplolib.

  • data.table: if you use this library instead of dplyr, have no fear as there's an almost identical library in Python called datatable. It's not nearly as popular in Python as data.table is in R, but it's a very high quality library.

  • here: Lots of people switching from R to Python ask what the equivalent of the here() function is. The "best practice" answer is that you shouldn't need one! It's good practice to have your Visual Studio Code (or other IDE) console and interactive Python window automatically start within the directory of your project; that is, you should always be "here" automatically. In Visual Studio Code, you can ensure that the interactive window starts in the root directory of your project by setting "Jupyter: Notebook File Root" to "${workspaceFolder}" in the Settings menu. For the integrated command line, change "Terminal › Integrated: Cwd" to "${workspaceFolder}" too. If you still need a replacement for here, then the pyprojroot package has you covered.

What is the Python package equivalent to...?

In this section we show a list of some of the most popular packages in R along with the Python package(s) with the most similar functionality.

R Python
dplyr pandas
purrr pandas
readr/vroom pandas
lubridate pandas
stringr pandas
sf geopandas
ggplot2 (declarative)1 matplotlib (imperative) or lets-plot (declarative)
mlr3 / caret scitkit-learn
tidymodels scitkit-learn / statsmodels
knitr and r markdown quarto and Jupyter Notebooks or quarto markdown
rvest beautifulsoup
testhat pytest
shiny streamlit or Shiny for Python
skimr skimpy
gt great-tables

Need a specific library that's in R but not in Python?

You can run a full R session from Python (if you already have R installed). Here's an example:

import rpy2.robjects as ro
from rpy2.robjects.packages import importr

base = importr('base')

fit_full = ro.r("lm('mpg ~ wt + cyl', data=mtcars)")
print(base.summary(fit_full))

To install R packages, use this:

from rpy2.robjects.packages import importr
utils = importr('utils')
utils.install_packages('packagename')

R <==> Python

Here are some tables of translations between base R and Python code. For more, see hyperpolyglot.

General

R Python
new_function <- function(a, b=5) {
return (a+b)
}
def new_function(a, b=5):
    return a+b
for (val in c(1,3,5)){
print(val)
}
for val in [1,3,5]:
    print(val)
a <- c(1,3,5,7) a = [1,3,5,7]
a <- c(3:9) a = list(range(3,9))
class(a) type(a)
a <- 5 a = 5
a^2 a**2
a%%5 a%5
a & b a and b
`a b`
rev(a) a[::-1]
a %*% b a @ b
paste("one", "two", "three", sep="") 'one' + 'two' + 'three'
substr("hello", 1, 4) 'hello'[:4]
strsplit('foo,bar,baz', ',') 'foo,bar,baz'.split(',')
paste(c('foo', 'bar', 'baz'), collapse=',') ','.join(['foo', 'bar', 'baz'])
`gsub("(^[\n\t ]+ [\n\t ]+$)", "", " foo ")`
sprintf("%10s", "lorem") 'lorem'.rjust(10)
paste("value: ", toString("8")) 'value: ' + str(8)
toupper("foo") 'foo'.upper()
nchar("hello") len('hello')
substr("hello", 1, 1) 'hello'[0]
a = rbind(c(1, 2, 3), c('a', 'b', 'c')) a = zip([1, 2, 3], ['a', 'b', 'c'])
d = list(n=10, avg=3.7, sd=0.4) d = {'n': 10, 'avg': 3.7, 'sd': 0.4}
quit() exit()

Dataframes

Assuming the use of pandas in Python, and the dplyr and tidyr packages in R.

R Python
head(df) df.head()
tail(df) df.tail()
nrow(df) df.shape[0] or len(df)
ncol(df) df.shape[1] or len(df.columns)
df$col_name df['col_name'] or df.col_name
None df.info()
summary(df) df.describe() (not exactly the same)
df %>% arrange(c1, desc(c2)) df.sort_values(by=['c1','c2'], ascending=[True, False])
df %>% rename(new_col = old_col) df.rename(columns={'old_col': 'new_col'})
df$smoker <- mapvalues(df$smoker,
 from=c('yes', 'no'),
 to=c(0,1))
df['smoker'] = df['smoker'].map({'yes':0, 'no':1})
df$c1 <- as.character(df$c1) df['c1'] = df['c1'].astype(str)
unique(df$c1) df['c1'].unique()
length(unique(df$c1)) len(df['c1'].unique())
max(df$c1, na.rm = TRUE) df['c1'].max()
df$c1[is.na(df$c1)] <- 0 df['c1'] = df['c1'].fillna(0)
col_a <- c('a','b','c')
col_b <- c(1,2,3)
df <- data.frame(col_a, col_b)
df = pd.DataFrame(dict(col_a=['a', 'b', 'c'], col_b=[1, 2, 3]))
df <- read.csv("input.csv",
  header = TRUE,
  na.strings=c("","NA"), sep = ",")
df = pd.read_csv("input.csv")
write.csv(df, "output.csv", row.names = FALSE) df.to_csv("output.csv", index = False)
df[c(4:6)] df.iloc[:, 3:6]
mutate(df, c=a-b) df.assign(c=df['a']-df['b'])
distinct(select(df, col1)) df[['col1']].drop_duplicates()

Object types

R Python
character string, aka str
integer integer, aka int
logical boolean, aka bool
numeric float or double
complex complex
Single-element vector Scalar
Multi-element vector List
List of multiple types Tuple
Named list Dict
Matrix/Array numpy ndarray
NULL, TRUE, FALSE None, True, False
Inf inf
NaN nan

Other important differences

R Python
<- works as an assignment operator = is the assignment operator
Dots are valid in variable names, eg var.iable Dots precede methods, eg ' strip whitespace '.strip()
use of $, eg df$col_name Equivalent is usually ., eg df.col_name
Does not have compound assignments +=, -=, *=, etc. are compound assignment operators
FALSE, F, 0, and 0.0 are false False, None, 0, 0.0, '', [], and {} are false
Tends to fail silently, eg a = c(), a[10] evaluates as NA Python tends to fail loudly, eg a=[], a[10] throws an error
No built-in decorator operator, but see decorator Function decorator, @
Walrus operator, :=, used in quasiquations Walrus operator, =:, combines an expression with an assignment (Python 3.8+)
Pipe operator, %>% No built-in pipe operator. Method chaining and .pipe used as partial replacements for dataframes, and there are pipe extensions like pipetools and sspipe, but they're not widely used.

Footnotes

  1. Imperative programming is saying how to do something, and as a result what you want to happen will happen. Declarative programming is saying what you would like to happen, and letting the computer figure out how to do it.