Skip to content

Regexfinder is a set of tools to evaluate and manipulate regular expressions. In particular, to optimize the tradeoff between complexity and size.

License

Notifications You must be signed in to change notification settings

clockworkmicro/regexfinder

Repository files navigation

regexfinder

Introduction

Regexfinder is a set of tools to evaluate and manipulate regular expressions. In particular, the package can:

  • optimize a size/complexity tradeoff for a regular expression
  • visualize a regular expression as a graph
  • compute the number of string paths allowed by a regular expression
  • generate random strings satisfying a regular expression

Size/complexity tradeoff

A regular expression defines a set of strings. Our general intuition is that we want the regex to include as few strings as possible (i.e. be as small as possible) but also not be too long or complex. Regexfinder formalizes this tradeoff by using two ideas from information theory. The size of the regular expression is the information theoretic entropy of a distribution on the regular expression (we start with the uniform distribution). And the complexity of the regular expression is the length of the regular expression which is motivated by Kolmogorov Complexity.

Using a weight parameter α, ent for entropy and K for complexity, regexfinder minimizes

over regular expressions that contain a provided set of strings.

For example, if the set of strings all have length 1 and contain the digits 0,1,2,4,5,7,8,9, then an admissible regular expression is RE1 = [01245789]. If α = 1, then

φ(RE1) = log2(8) + 10 = 13.

Another regular expression that contains all of these strings is RE2 = \d, and

φ(RE2) = log2(10) + 2 = 5.52.

So, in this case, \d would be chosen over [01245789].

Graph Representation

The tool creates a graph representation of a regular expression. For example, the regular expression

r'\d{2}([a-m]{2}[n-z]{2}|[A-Z]{4})\d'

is represented as:

About

Regexfinder is a set of tools to evaluate and manipulate regular expressions. In particular, to optimize the tradeoff between complexity and size.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published