-
Notifications
You must be signed in to change notification settings - Fork 0
/
preface.qmd
79 lines (70 loc) · 5.03 KB
/
preface.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
# Preface {.unnumbered}
I don't like Python. Or rather, don't like using it to analyze data. I believe
that certain design choices were made that make Python a subpar language for
analyzing data. That being said, Python is currently the most popular general
purpose programming language, and also the most popular language to analyze
data, especially in machine learning and AI. As such, it is a good idea to at
least know your way around it even if, like me, you prefer using the R
programming language.
I state that I don't like Python because I want to make something very clear,
right from the start. I am not an expert in Python. I know enough to get things
done, but it is not a language that I know well. I consider myself an expert in
R, but definitely not in Python. So why write a book on Python in that case?
Well, in truth, this is not a book about Python.
This book is about building reproducible pipelines. And in order to build
reproducible pipelines, you need to apply certain general ideas. This book will
discuss these ideas, and illustrate them using the Python programming language.
I wrote such a book already: you can read it
[here](https://www.raps-with-r.dev)^[https://www.raps-with-r.dev]. The book was well received:
lots of people contacted me to thank me for having written it, I was invited to
give talks several times on the book already, give some workshops and some
people are already asking me for a second edition. I thought it would be an
interesting challenge to write a Python edition. It would be a good excuse to
dive back into Python after having barely touched it since my Phd days, more
than 10 years ago, where I used it alongside R for a project. It is also a
good opportunity to see what is currently available in the Python programming
language, and if it would be possible to reproduce the way I work with R.
There are currently some interesting packages available for Python that are
quite close to some available for R. But R and Python have some very fundamental
differences in their design that make using Python feel awkward if you're used
to R, especially if take full advantage of R's functional programming features.
But it's not just these design choices that differentiate both languages, there
are also differences in culture, in how people use them. For example, notebooks
are very popular in the Python world, but it's quite rare to see someone share R
code through a notebook. In general, R users tend to write code as plain-text
scripts or in Markdown using Rmarkdown or Quarto. Throughout this book, I will
be writing Python with a heavy R accent, mainly for two reasons: first, because
I'm fluent in R, but not in Python, as I've stated above. Second, because I also
think that some of R's idioms actually make it more readable than equivalent
Python code, and thus we can make Python more readable if we make it look more
like R.
If the previous sentence didn't scare you into continuing and you are actually
curious to see what I mean, then read on, you're in the perfect state of mind to
approach this book!
Let me finish this section by talking a little bit about the history of these
books (the original R edition and this one). It all started when a former
colleague of mine contacted me in the summer of 2022 to ask me if I was
interested in teaching a course for the Data Science Master degree at the
University of Luxembourg. Long story short, I've decide to teach students how to
set up data science projects in a way that they're reproducible. I wrote my
course notes into a [freely available bookdown](https://rap4mads.eu/)^[https://rap4mads.eu/] that I
used for teaching. When I started compiling my notes, I discovered the concept of
*Reproducible Analytical Pipelines* as developed by the [Office for National
Statistics](https://analysisfunction.civilservice.gov.uk/support/reproducible-analytical-pipelines/)^[https://analysisfunction.civilservice.gov.uk/support/reproducible-analytical-pipelines/]
(henceforth ONS). I found the name "Reproducible Analytical Pipeline"
(henceforth RAP) really perfect for what I was aiming at. The ONS team
responsible for evangelising RAPs also published a free
[ebook](https://ukgovdatascience.github.io/rap_companion/)^[https://ukgovdatascience.github.io/rap_companion/]
in 2019 already. Another big source of inspiration is [Software
Carpentry](https://software-carpentry.org/)^[https://software-carpentry.org/] to
which I was exposed during my PhD years, around 2014-ish if memory serves. While
working on a project with some German colleagues from the University of Bonn,
the principal investigator made us work using these concepts to manage the
project. I was really impressed by it, and these ideas and techniques stayed
with me since then. This was also the project where I've used Python.
The bottom line is: the ideas I’m presenting here are nothing new. It’s just
that I took some time to compile them and make them accessible and interesting
(at least I hope so) to users of the Python programming language, even if I'm
not one.
If you have feedback, drop me an email at bruno [at] brodrigues [dot] co.
Enjoy!