-
Notifications
You must be signed in to change notification settings - Fork 0
Args Invariant
This is based off of this comment in PR 2095.
What should the invariant for SymPy functions regarding args
be? The basic invariant is expr == expr.func(*expr.args)
. However, there are some questions about leaf nodes.
Basically, it boils down to, what is a leaf?
- A non-Basic
- An instance of Atom
- A Basic with empty .args
I used to say 2, but I now think that it is wrong. Atom is a convenient core class, but there is really no reason to use it instead of 3. At any rate, Atom is definitely a special case of 3. But requiring every class with empty args to be an Atom would be difficult, lead to unexpected corner cases, and wouldn't even make sense (a good example here is Tuple()
, the empty Tuple).
Now, here would be the exact invariants in each case
expr == expr.func(*args)
expr.args == () or expr == expr.func(*expr.args)
Additionally, each instance of Basic would also satisfy
- All elements of
expr.args
are instances of Basic. In code,all(isinstance(arg, Basic) for arg in expr.args)
.
Some examples of where this comes up (all leaf objects, or near leaf objects if we choose 1)
Object | example | args
-------------|-------------------------|-------------------------------
Symbol | Symbol('x') | ()
Integer | Integer(2) | ()
Rational | Rational(2, 3) | ()
NumberSymbol | pi | ()
MatrixSymbol | MatrixSymbol('x', 2, 3) | ('x', 2, 3) where 2 and 3
| | are Integers and x is a string
TODO: The below lists might be useful to have in a table, because many of the bullet points in on section correspond to bullet points in the other. But to do tables in Markdown, you have to use html.
- Easier to conceptualize, in the sense that all parts of an expression that define that expression are in .args
- Easier to write objects. You don't have to write custom
__eq__
or_hashable_content
if you want to make non-Basic objects part of the object. - Objects like
MatrixSymbol
might have part non-Basic and part Basic args, but 3 would require all or nothing (or a #2093 type workaround that uses a nesting the intuitive feeling of the object). - The invariant looks simpler.
- non-Basics are treated as part of the expression tree. Any expression tree manipulation will also affect them.
- People might take it too far. If there is a Basic version of an object, you really should be using that object (e.g., use
Integer(1)
instead of1
orTuple
instead oftuple
). Otherwise you miss the whole point of having those objects in the first place. - Code that walks expression trees always has to be special-cased to look out for non-Basics.
- It's a nice assumption to make (all elements of
.args
are Basic). It means you can writefor arg in expr.args
and know that you can call any method of Basic onarg
. - Existing code assumes this.
- Treating non-Basics as part of the expression tree can be confusing. For example,
Symbol('x').xreplace('x', 'y')
will returnSymbol('y')
, but not because it replaced'x'
with'y'
, but because'x'
is sympified intoSymbol('x')
and'y'
is sympified intoSymbol('y')
and those are replaced (and there is nothing special about the non-Basicstr
in this case). Doing 1 consistently would require changing the whole way that sympification is handled throughout SymPy. - With 1, there can be non-leaf nodes without any leaves, because there will
still be objects with empty args (
Tuple()
is again a classic example). In some sense, these objects will still have to be leaves, so perhaps 1 should really beobj.args == () or not isinstance(obj, Basic)
. However, to be sure, the invariant does not need to include this because ifobj.args == ()
, then it must be the case thatobj == obj.func()
.
But I think an important thing to consider is what expression traversal functions would look like with each of the invariants.
To be fair, there is a difference between 1 algorithms and 3 algorithms, in that the 1 algorithms treat the non-Basics
Note: to make these generators in their current form, you need Python 3.3's yield from
. However, as we see below, the usual use case of the pre-order traversal is to use it to actually recurse through an expression, rebuilding at some part, and returning the result. Thus, for simplicity, this version just prints each node.
def preorder_traversal(expr):
print expr
if isinstance(expr, Basic):
for arg in expr.args:
preorder_traversal(arg)
def preorder_traversal(expr):
print expr
for arg in expr.args:
preorder_traversal(arg)
def xreplace(expr, old, new):
if expr == old:
return new
if isinstance(expr, Basic):
return expr.func(*[xreplace(arg, old, new) for arg in expr.args])
return expr
def xreplace(expr, old, new):
if expr == old:
return new
if expr.args:
return expr.func(*[xreplace(arg, old, new) for arg in expr.args])
return expr
def atoms(expr, classes):
ret = set()
if isinstance(expr, classes):
ret |= set([expr])
if isinstance(expr, Basic):
for arg in expr.args:
ret |= atoms(arg, classes)
return ret
def atoms(expr, classes):
ret = set()
if isinstance(expr, classes):
ret |= set([expr])
for arg in expr.args:
ret |= atoms(arg, classes)
return ret
I prefer 3. Here are my issues with 1:
-
It appears simpler, because the invariant is simpler (no check for empty args), but it's deceptive. As the above algorithms show, 3 is actually the simpler version. With 1, every algorithm has to nest its expression recursion under an
if isinstance(expr, Basic)
block. With 3, empty args being leaves falls out naturally as an empty loop.There are cases where you do have to check this condition with 3, such as in
xreplace
, but you also have to check theBasic
condition in the same place with 1, so nothing is gained or lost in that case. -
I think the bullet point under the cons for 1 about sympification is not to be considered lightly. Going with 1 would require a huge refactoring of the way sympification is handled in SymPy. The usual way is to automatically coerce non-SymPy types into SymPy types, but with 1, it is not clear when this should be done. By default, in the Basic constructor, we would have to not sympify arguments. The burden for a class to specify that all its arguments should be Basic lies on the individual classes (or at least super classes). Since this is by far the common case, it would make the current code much more complicated, and lead to many bugs because people don't expect to need to write
args = sympify(args)
at the top of their constructors.Actually, this is the current behavior (
Basic
does not call sympify in its constructor). That's why classes exist that have non-Basic args. But we have this exact problem. Code is littered with seemingly unnecessaryargs = sympify(args)
calls, and code that forgets to include it has subtle bugs when someone calls the class with0
instead ofx
.Alternately, we could keep sympification as the default, and add a flag to Basic that subclasses could use to signify that they allow non-Basic args. But this leads to two issues. First, this becomes an inheritable property of the class, meaning it will be easier to obtain Liskov violations (this may actually be a non-issue, I'm not sure). Second, this gives the feeling that objects with non-Basic args are somehow second-class citizens. In fact, that's my next point
-
No matter how we support 1, objects with non-Basic args will become like second-class citizens, whether it's de jure from the implementation or de facto from people forgetting to add the
if not isinstance(expr, Basic)
to their algorithms. -
I should point out that leaf nodes are free to define equality in a natural way. That is, even if something like
MatrixSymbol('x', 1, 2).args
is()
, we can still haveMatrixSymbol('x', 1, 2) != MatrixSymbol('y', 1, 2) != MatrixSymbol('y', 2, 2)
.
I mostly prefer 1
-
Small point but while it's fresh in your mind,
MatrixSymbol('x', 1, 2).args
can't be () easily. This is because the expression could beMatrixSymbol('x', n, k+m)
. The second two arguments are the shape and in general areExpr
s. We'll need to traverse down these so they'll either need to be args or we'll need to do a lot of special casing withinMatrixSymbol
. A pull request exists to replace'x'
withSymbol('x')
. This seems wrong to me because'x'
here is a name, not a mathematical scalar. The right way to do this under 3 would be to create aString(Basic)
. This also feels wrong to me. -
In general, I think that having all identifying information within args will simplify the code in the long term. It also opens various performance improvements. See the following timings from my laptop
In [2]: timeit Add(x, y, z)
100000 loops, best of 3: 2.15 us per loop
In [3]: timeit Add(x, y, z, evaluate=False) # no idea why this is slower
100000 loops, best of 3: 4.14 us per loop
In [4]: timeit Basic.__new__(Add, x, y, z)
1000000 loops, best of 3: 616 ns per loop
In [5]: timeit (Add, x, y, z)
10000000 loops, best of 3: 96.7 ns per loop
Having all information in args turns the SymPy Basic into something like an s-expression (a glorified tuple). Certainly we don't want to go this far (we like lots of things that the Basic object gives to us), but having that simplicity enables more advanced manipulations of SymPy expressions with greater trust in their accuracy. If someone wrote code that required really intense expression manipulation they could switch into tuple mode (or list mode for mutability), do lots of manipulations quickly, then switch back and re-evaluate. If all information is in args then they could perform these transformations with high fidelity. I'm not proposing that this is the reason we should switch but rather that this is an example of something that is easy with an invariant like "All identifying information is in .args
"
- Personally I manipulate SymPy expressions quite a bit. I often do this from outside SymPy (Theano, LogPy, Strategies). Having the "all identifying information is in
.args
" invariant makes SymPy significantly more interoperable if you don't have access to the codebase (or don't want to fuss with the codebase). Classes like Symbol, Rational, AppliedPredicate all end up requiring special attention in each of these endeavors. I'm mostly willing to put up with it because I'm familiar with them but potential collaborators are often turned off. I believe that interoperability is more important than we realize for long term growth.
I believe that interoperability is the most important direction for SymPy right now (personal opinion). I believe that the "all identifying information is in .args
" invariant is essential for that goal.
It seems to me that this is a discussion about either
-
- Simplifying our data structure
-
- Restricting the things that our data structure can hold to simplify our traversal code.
I think that 3 is short-term more convenient but that 1 is long-term more convenient and much more convenient for interoperation and new developers. To me 3 is insular.