adds new post on regex vs split
jontsai committed Jul 13, 2023
1 parent 32082c5 commit a97d856
Showing 1 changed file with 108 additions and 0 deletions.
108 changes: 108 additions & 0 deletions _posts/2023-07-13-embrace-the-power-of-regex.md
@@ -0,0 +1,108 @@
---
layout: post
title: "Embrace the power of Regex"
description: "In with regex, out with split"
category: "programming"
tags: [programming]
---
{% include JB/setup %}

Too often, while reviewing code, I'll see examples like:

```
def extract_id_and_env(key: str) -> dict:
    """Extracts the object ID from `key`

    `key` is a string like 'namespace_prefix_12345'
    In some cases, `key` could also look like `namespace_prefix_12345_environment`

    Returns a dict with the object ID, an integer
    """
    parts = key.split('_')
    parsed = {
        'id': int(parts[2]),
        'environment': parts[3] if len(parts) == 4 else None
    }
    return parsed
```

When I see this, I ask, "Why?"

Instead, my preferred way of handling this is to use a regex
with named capture groups:

```
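import re

# Python spells named groups `(?P<name>...)`; the trailing group is optional.
KEY_PATTERN = re.compile(
    r'(?P<namespace>[a-z]+)_(?P<prefix>[a-z]+)_(?P<object_id>\d+)(?:_(?P<environment>[a-z]+))?'
)


def extract_key_components(key: str):
    m = KEY_PATTERN.match(key)
    if m is None:
        # Fail loudly on a key that does not follow the expected format.
        raise ValueError(f'Unrecognized key format: {key!r}')
    parts = ['namespace', 'prefix', 'object_id', 'environment']
    values = [m.group(part) for part in parts]
    return values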
```
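
As a minimal sketch of how the named groups can feed back into the dict-returning shape of the first example, `groupdict()` hands back every component keyed by name. The helper name `parse_key` and the use of `fullmatch` (to reject trailing junk) are assumptions for illustration, not part of the original code.

```python
import re

KEY_PATTERN = re.compile(
    r'(?P<namespace>[a-z]+)_(?P<prefix>[a-z]+)_(?P<object_id>\d+)'
    r'(?:_(?P<environment>[a-z]+))?'
)


def parse_key(key: str) -> dict:
    """Hypothetical helper: return all key components as a dict."""
    # fullmatch requires the whole string to fit the pattern.
    m = KEY_PATTERN.fullmatch(key)
    if m is None:
        raise ValueError(f'Unrecognized key format: {key!r}')
    components = m.groupdict()  # unmatched optional groups come back as None
    components['object_id'] = int(components['object_id'])
    return components


print(parse_key('namespace_prefix_12345'))
print(parse_key('namespace_prefix_12345_staging'))
```

The caller gets the namespace, prefix, ID, and environment in one pass, so asking for another component later does not require touching the parsing logic.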

Here is another example (contrived, but modified from a real-world
application), from a Django app that serves both students and educators
and displays two different landing pages depending on the intent:

```
from django.shortcuts import render


def login_view(request):
    url = request.GET.get('next')
    last_word = url.split("/")[-1]
    is_student = True if last_word == 'scholarship' else False
    template = 'login/student.html' if is_student else 'login/educator.html'
    # `render` takes the request first; `render_to_response` did not.
    response = render(request, template)
    return response
```

The problem with this code is not immediately apparent. It works
for the expected input. However, it lacks robustness: it inspects
only the last path segment and assumes `next` is always present.
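
To make that concrete, here is a small sketch that isolates the split-based check and runs it against a few inputs. The function name `is_student_by_split` and the sample URLs are hypothetical, chosen only to illustrate the failure modes.

```python
def is_student_by_split(url: str) -> bool:
    """Mirrors the split-based check from the view above."""
    return url.split('/')[-1] == 'scholarship'


# The intended URL is recognized.
print(is_student_by_split('/path/to/42/scholarship'))            # True
# An unrelated URL that merely ends in 'scholarship' is also accepted.
print(is_student_by_split('/blog/scholarship'))                  # True
# A trailing slash or a query string silently flips the result.
print(is_student_by_split('/path/to/42/scholarship/'))           # False
print(is_student_by_split('/path/to/42/scholarship?ref=email'))  # False
```

None of these cases raises an error; the wrong landing page is just served quietly.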

An arguably better approach:

```
import re

from django.shortcuts import render

# Each pattern describes a `next` URL that signals student intent.
STUDENT_LOGIN_INTENT_PATTERNS = [
    re.compile(r'^/path/to/(?P<some_id>\d+)/scholarship$'),
]


def is_login_intent_student(request):
    is_student = False
    # Default to '' so a missing `next` parameter simply matches nothing.
    next_url = request.GET.get('next', '')
    for pattern in STUDENT_LOGIN_INTENT_PATTERNS:
        if pattern.match(next_url):
            is_student = True
            break
    return is_student


def login_view(request):
    is_student = is_login_intent_student(request)
    template = 'login/student.html' if is_student else 'login/educator.html'
    response = render(request, template)
    return response
```

Beyond readability and maintainability, the regex approach is more
robust overall, and it lets the programmer extract multiple components
from the string all at once! This reduces the need to update the
function in the future if other parts of the string are needed later
on (which is quite often the case!).
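
A rough sketch of that "all at once" benefit, without any Django dependency: the same pattern list can return the captured components instead of a bare boolean. The helper name `match_student_intent` and the sample URLs are assumptions made for this illustration.

```python
import re
from typing import Optional

STUDENT_LOGIN_INTENT_PATTERNS = [
    re.compile(r'^/path/to/(?P<some_id>\d+)/scholarship$'),
]


def match_student_intent(next_url: Optional[str]) -> Optional[dict]:
    """Hypothetical helper: captured components, or None if nothing matches."""
    for pattern in STUDENT_LOGIN_INTENT_PATTERNS:
        # Treat a missing URL as an empty string so it matches nothing.
        m = pattern.match(next_url or '')
        if m:
            return m.groupdict()
    return None


print(match_student_intent('/path/to/42/scholarship'))  # {'some_id': '42'}
print(match_student_intent('/blog/scholarship'))        # None
print(match_student_intent(None))                       # None
```

If the view later needs `some_id`, it is already available from the match; no index arithmetic has to be revisited.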

My preference for Regex over Split stems from:

- It is somewhat related to the principle of making wrong code look wrong: https://www.joelonsoftware.com/2005/05/11/making-wrong-code-look-wrong/
- If code is wrong, it should fail catastrophically and loudly, not subtly or obscurely
- It's hard to make a regex that looks _maybe right_: a regex is either right or obviously wrong. (It could also be that I have lots of experience using regexes, and can write them without looking up references)
- OTOH, while `split` is conceptually easier to learn, for me, it's hard or nearly impossible to see **at a glance** whether the code is right or wrong. For example, if you look at a block of code using `split` and various indexes, how would you instantly detect a possible OB1 (aka off-by-one error; https://en.wikipedia.org/wiki/Off-by-one_error)? Not possible. OB1 bugs are prevalent in software because the learning curve of index-based code, and therefore its barrier to entry, is low, so bugs are more likely to be introduced.
- Regexes, OTOH, have a slightly higher learning curve and a slightly higher barrier to entry, so those who use them tend not to make trivial mistakes
- **If** the code never has to be updated again, then, great! `split` is sufficient. But if the next engineer has to update it, they would not necessarily benefit from the existing code, and would have to re-evaluate all of it in their head to make sure the indexes are right.
- Maintaining a list of patterns, or regexes, encourages a **Solve for N** mentality, whereas using `split` encourages a "solve it quick and dirty" mindset
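
The "fail loudly" point can be sketched with a key that drifts from the expected format. The sample key and both helper names (`env_by_split`, `env_by_regex`) are hypothetical, and the use of `fullmatch` is an assumption made so that unexpected trailing segments are rejected.

```python
import re

KEY_PATTERN = re.compile(
    r'(?P<namespace>[a-z]+)_(?P<prefix>[a-z]+)_(?P<object_id>\d+)'
    r'(?:_(?P<environment>[a-z]+))?'
)


def env_by_split(key: str):
    """Index-based lookup, as in the first example."""
    parts = key.split('_')
    return parts[3] if len(parts) == 4 else None


def env_by_regex(key: str):
    """Pattern-based lookup that refuses keys it does not understand."""
    m = KEY_PATTERN.fullmatch(key)
    if m is None:
        raise ValueError(f'Unrecognized key format: {key!r}')
    return m.group('environment')


key = 'namespace_prefix_12345_prod_eu'  # unexpected fifth segment

print(env_by_split(key))  # None: the environment is silently dropped
try:
    env_by_regex(key)
except ValueError as exc:
    print(exc)  # the malformed key is reported instead
```

With `split`, the extra segment changes the length check and the environment quietly disappears; with the regex, the mismatch surfaces immediately at the call site.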
