-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Python: Modelling of the Standard Library #16840
base: yoff-python-stop-extracting-std-lib
Are you sure you want to change the base?
Python: Modelling of the Standard Library #16840
Conversation
Two of the generated summaries have been excluded: - ["re", "Member[split]", "Argument[0,pattern:]", "ReturnValue", "taint"] From the documentation, it is not clear why pattern should figure in the return value, as that is the part denoting split point and thus all those instances are filtered out. From the implementation Spit function: https://github.com/python/cpython/blob/3.12/Lib/re/__init__.py#L199 _compile function being called by split: https://github.com/python/cpython/blob/3.12/Lib/re/__init__.py#L280 We see that in case the pattern is already a compiled `Pattern`, it is returned directly from _compile and could thus be part of the return value from split. This is probably not possible to arrange for an attacker, and so an FP in practice. - ["urllib2", "Member[unquote]", "Argument[0,string:]", "ReturnValue", "taint"] urllib2 seems to be only in Python2 (e.g. https://docs.python.org/2.7/library/urllib2.html) and I cannot locate the function unquote.
Sorry, I somehow pushed to the wrong branch. The force-push should just reset the world without any change.. |
I have set no change note. I suppose a change note will be added when this branch is merged into main. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks OK to me, but it would have been nice to have some tests highlighting why flow through some of the more non-obvious methods are good. Value-flow for copy.copy
makes 100% sense, but things like taint-flow for subprocess.Popen
simply didn't make sense to me 🤷
I assume bc55117 was caused by some code using keywords which we didn't model previously?
@@ -7,19 +7,107 @@ extensions: | |||
- addsTo: | |||
pack: codeql/python-all | |||
extensible: sinkModel | |||
data: [] | |||
data: | |||
- ["subprocess.Popen!","Subclass.Call.Argument[0,args:]", "log-injection"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this 😕 can you the reasoning behind this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only that a path was observed where calling the constructor led to logging. I have trouble justifying that from the implementation, though, so perhaps we just leave it out...
# See | ||
# - https://docs.python.org/3/library/html.html#html.escape | ||
# - https://docs.python.org/3/library/html.html#html.unescape | ||
- ["html", "Member[escape,unescape]", "Argument[0,s:]", "ReturnValue", "taint"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good thing we have this modeled as a sanitizer already 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And that sanitizers win in case there is both a summary and a sanitizer :-)
# See https://docs.python.org/3/library/multiprocessing.html#multiprocessing.connection.Listener | ||
- ["multiprocessing.connection.Listener!", "Subclass.Call", "Argument[3,authkey:]", "ReturnValue", "taint"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't look right to me. please explain :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"If authkey is given and not None, it should be a byte string and will be used as the secret key for an HMAC-based authentication challenge." So the authkey
is stored on the constructed object somehow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There will have been a path where it has reached a sink..
# See | ||
# - https://docs.python.org/3/library/asyncio-queue.html#asyncio.Queue.put | ||
# - https://docs.python.org/3/library/asyncio-queue.html#asyncio.Queue.put_nowait | ||
- ["queue.Queue", "Member[put,put_nowait]", "Argument[0,item:]", "Argument[self]", "taint"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this also have value-flow with ListElement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we could add that, and then add models for get
and get_nowait
.
# See https://docs.python.org/3/library/subprocess.html#subprocess.Popen | ||
- ["subprocess.Popen!", "Subclass.Call", "Argument[0,args:]", "ReturnValue", "taint"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this one doesn't make any sense to me, please explain
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The constructor somehow stores the argument on the object. A common use is as a context manager, meaning the object will be accessed in various ways.
# See https://docs.python.org/3/library/shutil.html#shutil.which | ||
- ["shutil", "Member[which]", "Argument[0,cmd:,2,path:]", "ReturnValue", "taint"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While it's technically true that using shutil.which("codeql")
will return a string that also contains "codeql"
, I don't really feel this is a useful taint-step.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somewhere it has resulted in an alert, though..?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose such technically true -- but innocuous looking -- things are what hackers exploit...
# - https://docs.python.org/3/library/tarfile.html#tarfile.open | ||
# - https://docs.python.org/3/library/tarfile.html#tarfile.TarFile.open | ||
- ["tarfile", "Member[open]", "Argument[0,name:,2,fileobj:]", "ReturnValue", "taint"] | ||
- ["tarfile.TarFile", "Member[open]", "Argument[0,name:,2,fileobj:]", "ReturnValue", "taint"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite know what to think of these steps. I guess it does make sense to say that opening a tarfile at a user-controlled path means the tarfile is tainted as well 🤔 I think it would overall be better if we had a content type representing that the contents of an open file is tainted 🤷
Overall, this just turned out to be me thinking out loud 😅
this = | ||
DD::selfTracker(subclassRef() | ||
.getAValueReachableFromSource() | ||
.asExpr() | ||
.(ClassExpr) | ||
.getInnerScope()) | ||
or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you add a test indicating we can now do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah good point...this is one bit I am somewhat unsure of, do we want this? Also, it should trigger a performance check.
Co-authored-by: Rasmus Wriedt Larsen <[email protected]>
It is not clear from the code how this could happen and I do not remember the path I saw, perhaps it was unreasonable.
It would be nice to simplify to a single sequence content type..
MaD row numbers in provenance column
These are the models gathered so far during experiments not extracting the standard library by default.
A good part of the models are generated by special automation. Where possible, I have made manual adjustments to make them align with their documentation and not be overly specific.
Commit-by-commit review should be beneficiel.