Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extracts web domains and IP address and implements tests #2031

Open
wants to merge 23 commits into
base: master
Choose a base branch
from

Conversation

aaronatp
Copy link
Contributor

@aaronatp aaronatp commented Mar 19, 2024

This PR partially resolves #1907. It extracts web domains and IP addresses, and implements rendering functions and tests.

These changes likely don't require updates to the documentation, but if some users want to, they should be able to repurpose many of the extraction functions fairly easily.

In (-d) mode, this pull request extracts web domains and IP addresses from files and sandbox traces and presents them to the user. In (-v) and (-vv) modes, this pull request also tells the user how many times each web domain and IP address occur, and tries to identify a WinAPI networking function acting on a web domain and IP address for every time they occur. (-v) and (-vv) modes are currently the same.

This PR also implements tests for the part of the code that checks valid web domains, valid IP addresses, and potential WinAPI networking functions.

Example output:

Default (-d) output

Screenshot 2024-03-19 at 10 38 21 AM Screenshot 2024-03-19 at 10 46 57 AM

Default (-d) output when there are no domains found

Screenshot 2024-03-19 at 10 41 49 AM

Verbose (-v) output

Screenshot 2024-03-19 at 10 37 28 AM Screenshot 2024-03-19 at 10 48 59 AM

Very verbose (-vv) output

Screenshot 2024-03-19 at 10 39 20 AM

Checklist

  • No CHANGELOG update needed
  • No new tests needed
  • No documentation update needed

…d sandbox traces and presents them to the user.

In (-v) and (-vv) modes, this pull request also tries to identify how web domains and IP addresses are used.
It checks whether a WinAPI networking function is called soon after the web domain or IP address appears.
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add bug fixes, new features, breaking changes and anything else you think is worthwhile mentioning to the master (unreleased) section of CHANGELOG.md. If no CHANGELOG update is needed add the following to the PR description: [x] No CHANGELOG update needed

@github-actions github-actions bot dismissed their stale review March 19, 2024 18:47

CHANGELOG updated or no update needed, thanks! 😄

raise NotImplementedError("Confirm that argv is an attribute of doc.meta")

else:
print("in 'get_sigpaths_from_doc', run in debug (-d) mode")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was for debugging? Why are you not using logger here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @VascoSch92 yes trying to figure out why it's failing to build with PyInstaller 3.11

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah ok easy... I just saw it and I wanted to check :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha yes thank you for the question though, I appreciate the comments and feedback you're giving me

capa/capabilities/extract_domain_and_ip.py Outdated Show resolved Hide resolved
"""same as 'formatted_domain_verbose' but without 'ip_address_statement'"""
return (
f"{ip_addr}\n"
+ f" |---- {networking_functions_statement(doc, ip_addr)}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
+ f" |---- {networking_functions_statement(doc, ip_addr)}"
+ f" |---- {networking_functions_statement(doc, ip_addr)}\n"

or?

return statement

else:
raise LengthError("'api_functions' contains unexpected data!")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can never have this error or?

Because the three cases above are already covering all the output of the len function.

If api_functions doesn't implement the length functions, an error will occur on the first if or?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Yes I think you are right, I have just taken that out!

capa/capabilities/extract_domain_and_ip.py Outdated Show resolved Hide resolved


def ceil(num):
if isinstance(num, float):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if num is integer? The return is None. Do you want this behaviour?

but there is a function in the math default package of python which do that not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good question - it shouldn't be an issue because the input to ceil will always be some multiple of 1/3, so the input's type should always be float, but you're right that it's not explicit - I will add a type hint here. I thought it might be a bit unnecessary to import the math package just for a simple function lol but what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just updated, please let me know what you think!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now the method is better :)

I have a problem with the fact that you implement the same method in two different scripts. If you have a bug in one method you have the same bug in the other, i.e, two bugs.

However, it is improbable that you have a bug in this method ;)

Using math.ceil is not a problem, math is a standard library in python.

But up to you ;)

@aaronatp
Copy link
Contributor Author

Hey @VascoSch92 looks like most of these tests are passing but GitHub is saying it's missing a code license agreement from you - do you think you could check about that? Would love to commit this code together :)

@VascoSch92
Copy link
Contributor

VascoSch92 commented Mar 20, 2024

Hey @VascoSch92 looks like most of these tests are passing but GitHub is saying it's missing a code license agreement from you - do you think you could check about that? Would love to commit this code together :)

Done!

Let me know if there are still problems ;)

You should probably trigger the cla/google tests. I can not do that

Copy link
Collaborator

@mr-tz mr-tz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work with many good parts, let's see how we can incorporate this as part of the main extraction phase and extend the result document with NBI/HBI information initially

yield feature, addr


def default_extract_domain_names(doc: ResultDocument) -> Generator[str, None, None]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to avoid reprocessing the file here (in generate_insns_from_doc etc.), let's discuss how to move this into the core extraction phase and how we can extend the ResultDocument with HBI/NBI data

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mr-tz Thank you for your feedback! Yes, this is something I thought about too. I'll look into restructuring this program to move the domain/IP extraction to the core extraction phase and then extend ResultDocument with HBI/NBI data

(str): either the formatted IP address, or an error message
"""
try:
ip_address = socket.gethostbyname(domain)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't want to do this and in most offline analysis systems it won't work anyway

return f"Could not get IP address for {domain.split('/')[0]}\n"


def networking_functions_statement(doc: ResultDocument, domain_or_ip: str):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this can be simplified to called networking APIs in "close proximity" to the found string

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mr-tz I think this suggestion could be a good idea. However, when I previously tried simple ways of extracting networking APIs in "close proximity" (like skipping over "push" statements until we get to APIs), I actually got reduced performance. I'll have another look at this and see if I can find a way to make the "close proximity" searching work

yield "Not able to identify the calling function"


def potential_winapi_function(string: str) -> bool:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a finite list we could generate and embed instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mr-tz thank you for this! Yes, I will do that

capa/render/verbose.py Outdated Show resolved Hide resolved
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these tests look great

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mr-tz thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Extract indicators (HBI/NBI) around capability detections
3 participants