Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python: Modelling of the Standard Library #16840

Open
wants to merge 13 commits into
base: yoff-python-stop-extracting-std-lib
Choose a base branch
from

Conversation

yoff
Copy link
Contributor

@yoff yoff commented Jun 25, 2024

These are the models gathered so far during experiments not extracting the standard library by default.
A good part of the models are generated by special automation. Where possible, I have made manual adjustments to make them align with their documentation and not be overly specific.
Commit-by-commit review should be beneficiel.

yoff added 6 commits June 25, 2024 14:40
Two of the generated summaries have been excluded:
 - ["re", "Member[split]", "Argument[0,pattern:]", "ReturnValue", "taint"]
   From the documentation, it is not clear why pattern should figure in the return value, as that is the part denoting split point and thus all those instances are filtered out.
   From the implementation
     Spit function: https://github.com/python/cpython/blob/3.12/Lib/re/__init__.py#L199
     _compile function being called by split: https://github.com/python/cpython/blob/3.12/Lib/re/__init__.py#L280
   We see that in case the pattern is already a compiled `Pattern`, it is returned directly from _compile and could thus be part of the return value from split. This is probably not possible to arrange for an attacker, and so an FP in practice.

 - ["urllib2", "Member[unquote]", "Argument[0,string:]", "ReturnValue", "taint"]
   urllib2 seems to be only in Python2 (e.g. https://docs.python.org/2.7/library/urllib2.html) and I cannot locate the function unquote.
@yoff yoff requested a review from a team as a code owner June 25, 2024 23:06
@yoff
Copy link
Contributor Author

yoff commented Jun 26, 2024

Sorry, I somehow pushed to the wrong branch. The force-push should just reset the world without any change..

@yoff yoff mentioned this pull request Jun 26, 2024
@yoff yoff added the no-change-note-required This PR does not need a change note label Jun 26, 2024
@yoff
Copy link
Contributor Author

yoff commented Jun 26, 2024

I have set no change note. I suppose a change note will be added when this branch is merged into main.

Copy link
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks OK to me, but it would have been nice to have some tests highlighting why flow through some of the more non-obvious methods are good. Value-flow for copy.copy makes 100% sense, but things like taint-flow for subprocess.Popen simply didn't make sense to me 🤷


I assume bc55117 was caused by some code using keywords which we didn't model previously?

@@ -7,19 +7,107 @@ extensions:
- addsTo:
pack: codeql/python-all
extensible: sinkModel
data: []
data:
- ["subprocess.Popen!","Subclass.Call.Argument[0,args:]", "log-injection"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this 😕 can you the reasoning behind this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only that a path was observed where calling the constructor led to logging. I have trouble justifying that from the implementation, though, so perhaps we just leave it out...

# See
# - https://docs.python.org/3/library/html.html#html.escape
# - https://docs.python.org/3/library/html.html#html.unescape
- ["html", "Member[escape,unescape]", "Argument[0,s:]", "ReturnValue", "taint"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good thing we have this modeled as a sanitizer already 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And that sanitizers win in case there is both a summary and a sanitizer :-)

Comment on lines +67 to +68
# See https://docs.python.org/3/library/multiprocessing.html#multiprocessing.connection.Listener
- ["multiprocessing.connection.Listener!", "Subclass.Call", "Argument[3,authkey:]", "ReturnValue", "taint"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't look right to me. please explain :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"If authkey is given and not None, it should be a byte string and will be used as the secret key for an HMAC-based authentication challenge." So the authkey is stored on the constructed object somehow.

Copy link
Contributor Author

@yoff yoff Jun 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There will have been a path where it has reached a sink..

# See
# - https://docs.python.org/3/library/asyncio-queue.html#asyncio.Queue.put
# - https://docs.python.org/3/library/asyncio-queue.html#asyncio.Queue.put_nowait
- ["queue.Queue", "Member[put,put_nowait]", "Argument[0,item:]", "Argument[self]", "taint"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this also have value-flow with ListElement?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we could add that, and then add models for get and get_nowait.

Comment on lines +94 to +95
# See https://docs.python.org/3/library/subprocess.html#subprocess.Popen
- ["subprocess.Popen!", "Subclass.Call", "Argument[0,args:]", "ReturnValue", "taint"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one doesn't make any sense to me, please explain

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The constructor somehow stores the argument on the object. A common use is as a context manager, meaning the object will be accessed in various ways.

Comment on lines +92 to +93
# See https://docs.python.org/3/library/shutil.html#shutil.which
- ["shutil", "Member[which]", "Argument[0,cmd:,2,path:]", "ReturnValue", "taint"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While it's technically true that using shutil.which("codeql") will return a string that also contains "codeql", I don't really feel this is a useful taint-step.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhere it has resulted in an alert, though..?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose such technically true -- but innocuous looking -- things are what hackers exploit...

Comment on lines +97 to +100
# - https://docs.python.org/3/library/tarfile.html#tarfile.open
# - https://docs.python.org/3/library/tarfile.html#tarfile.TarFile.open
- ["tarfile", "Member[open]", "Argument[0,name:,2,fileobj:]", "ReturnValue", "taint"]
- ["tarfile.TarFile", "Member[open]", "Argument[0,name:,2,fileobj:]", "ReturnValue", "taint"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite know what to think of these steps. I guess it does make sense to say that opening a tarfile at a user-controlled path means the tarfile is tainted as well 🤔 I think it would overall be better if we had a content type representing that the contents of an open file is tainted 🤷

Overall, this just turned out to be me thinking out loud 😅

Comment on lines +284 to +290
this =
DD::selfTracker(subclassRef()
.getAValueReachableFromSource()
.asExpr()
.(ClassExpr)
.getInnerScope())
or
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you add a test indicating we can now do this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah good point...this is one bit I am somewhat unsure of, do we want this? Also, it should trigger a performance check.

yoff and others added 4 commits June 28, 2024 14:39
Co-authored-by: Rasmus Wriedt Larsen <[email protected]>
It is not clear from the code how this could happen and
I do not remember the path I saw, perhaps it was unreasonable.
It would be nice to simplify to a single sequence content type..
@yoff yoff requested a review from RasmusWL June 28, 2024 13:25
MaD row numbers in provenance column
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no-change-note-required This PR does not need a change note Python
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants