Add a Feature Extractor for the Drakvuf Sandbox #2143

yelhamer · 2024-06-11T00:34:10Z

Hello! This PR tries to add a dynamic feature extractor for the Drakvuf sandbox as part of a GSoC project I am working on.

As of now, the code still runs slowly on actual Drakvuf output, because Drakvuf captures events from every process running on the system, not just the submitted sample. This results in analysis files (in JSON Lines format) that are 2 GB in size.

To reduce this overhead, I have added support for only the apimon and syscall modules, which capture WinAPI calls and Windows system calls respectively. Additionally, I have kept the Pydantic models light and concise, since otherwise they would consume a lot of memory.

Despite this, running capa on an actual analysis still consumes a lot of memory and time. A sample's 2 GB report took up around 6 GB of memory before feature extraction and matching began, and another 6 GB once feature extraction was under way. To fix this, I could think of the following two possibilities:

1. use a faster alternative to Pydantic (such as msgspec maybe?) at the cost of fewer features.
2. add an option to match only against a single process (or its children), which would allow us to easily pick which process to analyze; in this case, the malware sample. This could also be extrapolated to static capa, so maybe something like capa --faddr=0xffffffff sample.exe or capa --pid=3584 drakmon.log

(note: I didn't implement 1. because Drakvuf returns syscall arguments in the same JSON object, at the same level as other important keywords like the syscall's name and timestamp)
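
Option 2 above amounts to picking a root PID and keeping only the events from that process and its descendants. Here is a minimal sketch of that selection, assuming each decoded trace event is a dict with integer `PID` and `PPID` keys. The key names are an assumption about the trace layout, and the helper names are hypothetical and not part of capa:

```python
from typing import Dict, Iterable, Iterator, List, Set


def collect_process_tree(events: List[dict], root_pid: int) -> Set[int]:
    """Return root_pid plus every PID observed as a descendant of it.

    Assumes each event carries integer "PID" and "PPID" keys.
    """
    # Map each parent PID to the set of child PIDs seen in the trace.
    children: Dict[int, Set[int]] = {}
    for event in events:
        pid, ppid = event.get("PID"), event.get("PPID")
        if pid is not None and ppid is not None:
            children.setdefault(ppid, set()).add(pid)

    # Walk the parent -> child map iteratively, starting at the root.
    tree: Set[int] = set()
    stack = [root_pid]
    while stack:
        pid = stack.pop()
        if pid in tree:
            continue
        tree.add(pid)
        stack.extend(children.get(pid, ()))
    return tree


def filter_events(events: Iterable[dict], pids: Set[int]) -> Iterator[dict]:
    """Yield only the events that belong to one of the selected PIDs."""
    for event in events:
        if event.get("PID") in pids:
            yield event
```

This sketch ignores PID reuse, so a long-running trace in which the operating system recycles a PID could pull in an unrelated process.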

Also, the general report file (drakmon.log), which I envision being passed to capa, unfortunately does not provide the sample's hashes, while another file the sandbox returns does include a sha256 hash. Because of this, this feature extractor does not fetch or display the sample's hash.

Updates:

I have opened a PR for (2): Add the ability to select which functions or processes you which to extract capabilities from #2156
As for (1), I am unsure whether slow Pydantic validation/initialization is the direct issue. I ran some tests with py-spy, and it seems that most of the slowdown happens well after the Pydantic models have been validated/initialized. This slowdown might be ignored for now (imo) if we agree to get the PR above pushed, since the processes that take long to analyze (just from observation) are the system ones, and most users could/would skip those and analyze only the malware ones. Here's the profile for a sample (A), as well as the profile for that sample's associated report (B):

(A):

(B):

Checklist

No CHANGELOG update needed

No new tests needed

No documentation update needed

github-actions

Please add bug fixes, new features, breaking changes and anything else you think is worthwhile mentioning to the master (unreleased) section of CHANGELOG.md. If no CHANGELOG update is needed add the following to the PR description: [x] No CHANGELOG update needed

CHANGELOG updated or no update needed, thanks! 😄

williballenthin · 2024-06-11T07:31:06Z

So cool!

I think both (1) and (2) are reasonable.

For (1), I think it's important to use a profiler to drive the optimization. I have a few ideas about how to make things faster (and I'm sure you do too), but I recommend starting with a base case and collecting some profiles before further development. I like py-spy.

I wonder if you could use raw string matching to filter out the lines that don't have relevant messages, and only afterwards decode each relevant line via pydantic. If pydantic is still the bottleneck, then maybe raw json module, or perhaps msgspec. We've used msgspec elsewhere (FLOSS) and it works well; I only have a minor hesitation about introducing another dependency.

For (2), I think specifying the process (tree) via argument makes a lot of sense. Otherwise I can imagine there's just too much noise. Does it make sense for this to be part of capa, or for Drakvuf to provide such a utility, since other tools might use this too?

Would you mind keeping the top post updated with a list of pending tasks/ideas? And maybe using draft/ready states to indicate when this needs feedback? Excited to help land this new feature!
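
The raw string matching idea above could look roughly like the following sketch, which uses only the standard `json` module. It assumes every JSON line names its producing plugin under a `"Plugin"` key, which is an assumption about the trace layout. The function name is hypothetical:

```python
import json
from typing import Iterable, Iterator, Tuple

# Assumption: each JSON line names its producing plugin under a "Plugin" key.
WANTED_PLUGINS: Tuple[str, ...] = ("apimon", "syscall")


def iter_relevant_events(
    lines: Iterable[str], plugins: Tuple[str, ...] = WANTED_PLUGINS
) -> Iterator[dict]:
    """Cheaply discard irrelevant lines before paying for JSON decoding."""
    # Quoted plugin names, searched as plain substrings of the raw line.
    needles = tuple(f'"{plugin}"' for plugin in plugins)
    for line in lines:
        # Fast path: skip lines that cannot mention a wanted plugin.
        if not any(needle in line for needle in needles):
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            # Truncated or corrupt line; skip it rather than abort.
            continue
        # The substring test can match unrelated values, so confirm here.
        if isinstance(event, dict) and event.get("Plugin") in plugins:
            yield event
```

The substring test is only a prefilter: a line from another plugin that happens to contain `"apimon"` as a value still gets decoded, and is then rejected by the final check.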

yelhamer · 2024-06-13T00:44:48Z

Sorry for the late reply.

Currently I am decoding the JSON lines into dictionaries using msgspec, and filtering on those before storing them as Pydantic models. If adding an extra dependency is not desired, however, I assume I can filter the text directly without decoding.

As for (2), I think I can do the filtering from within Drakvuf as well. I just assumed that it might be a feature that capa might want, and also because it would make the Drakvuf code a bit neater.

williballenthin · 2024-06-13T04:36:07Z

Sure, both those work 😄

williballenthin · 2024-06-19T06:59:44Z

@yelhamer is this PR ready for a full review and potential merge? or are there other pending changes or design decisions to be made?

yelhamer · 2024-06-19T14:09:21Z

I am done with all the code I want to add. Please feel free to go through it and review it.

mr-tz

Great work!! See my comments inline.

I'd also like to see some documentation updates with this (so at least let's create an issue to track this - or better yet do it as part of this PR).

mr-tz · 2024-06-20T08:35:33Z

capa/features/extractors/drakvuf/file.py

+    Extract imported function names.
+    """
+    if report.loaded_dlls is None:


loaded DLLs means something else to me than imports - do they mean the same thing here?

My understanding from reading the comments in the relevant Drakvuf source code is that the output of this plugin includes functions imported from DLLs loaded by the PE loader, as well as ones that a process might load dynamically. I think this because the comments say that they hook some Windows system calls in order to do this (I believe?). If so, this plugin provides an extensive list of imports that includes static ones as well as dynamic ones that malware might try to load discreetly, which is why I added this here.

can you please add this documentation to the code?

Hmm, I just noticed that Drakvuf reports the imported functions for each process. Should I extract the imported functions in the process scope instead? That way, if a user is analyzing only a specific process, they wouldn't get false results from an import originating from another process.

For the file scope extractors we're only interested in the imports of the target file.

This is an artifact of the static analysis module and likely differs in dynamic analysis and across sandboxes - so maybe we need a new way to handle these?

This thread needs to be resolved.

At the very least, I think we should only yield the imports for the input file.

Optionally, if we can come up with some good motivation and test cases, then we could also extend the sandbox extractor API to cover the recursively imported DLLs/names.

I can confirm that DRAKVUF outputs only execution trace (including loaded DLLs and imported functions) and doesn't concern itself with static analysis.

Can I help with resolving it somehow?

@yelhamer please emit only the import names from the target DLL or none at all. I understand that there's maybe another way to interpret these imports (such as all imports seen in the address space), but this would be inconsistent with other feature extractors, and will be difficult to keep straight and reason about.

I suspect that these import features won't be commonly used, so emitting none at all is usually going to be fine. If we can come up with some specific problematic cases, then we can reassess.

mr-tz · 2024-06-20T08:38:19Z

capa/features/extractors/drakvuf/global_.py

+    # drakvuf sandbox currently supports only windows as the guest: https://drakvuf-sandbox.readthedocs.io/en/latest/usage/getting_started.html
+    yield Format(FORMAT_PE), NO_ADDRESS


is there still a way to check (if this changes in the future)?
what about shellcode or other formats?

Drakvuf on its own supports both Linux and Windows, which makes determining the OS difficult since it does not give any specific information about the format/OS, and we'd need to make some heuristics to determine whether the analysis is Windows or Linux, as well as determine the format.

However, I have written this extractor with the Drakvuf Sandbox in mind, and that one supports only Windows. I suppose I can bring this to the attention of the devs so that if they add support for Linux they'd notify capa.

msm-cert

Hi!

As a DRAKVUF-Sandbox maintainer I decided to pop in and maybe help with the review. I hope you don't mind.

I avoid nitpicking code style - it's already very good - and I focus strictly on DRAKVUF-related things.

Maybe a word of explanation, since I'm not sure how clear it is to Capa maintainers: there are two DRAKVUFs at play:

DRAKVUF (https://github.com/tklengyel/drakvuf), a blackbox VM analysis tool
DRAKVUF Sandbox (https://github.com/CERT-Polska/drakvuf-sandbox/), an open-source sandbox that utilises DRAKVUF for malware analysis. This project is also GSOC host for @yelhamer this year, and I'm its maintainer.

Having said that, there's - in principle - nothing DRAKVUF Sandbox specific here. drakvuf.log is purely a standard output of DRAKVUF (the vm monitor), which is parsed both by DRAKVUF Sandbox and - hopefully - soon by Capa.

It would be slightly easier to make this PR DRAKVUF Sandbox specific (for example, there would be no problem with getting the sha256 of a file, since it's a well-defined artifact of DRAKVUF Sandbox). But I think it's nice to make this more generic. And it is: everything here works in the general case for any DRAKVUF trace (with some nitpicks, like fragments of the extract_* functions assuming the binary is an x64 Windows PE, which is only true for DRAKVUF Sandbox).

msm-cert · 2024-07-19T18:02:46Z

capa/features/extractors/drakvuf/extractor.py

+    def __init__(self, report: DrakvufReport):
+        super().__init__(
+            # DRAKVUF currently does not yield hash information about the sample in its output
+            hashes=SampleHashes(md5="", sha1="", sha256="")


Yeah, unfortunately DRAKVUF is primarily a full VM monitor. In DRAKVUF sandbox it's (ab)used to function as a malware sandbox, but drakmon.log is the output directly from DRAKVUF.

Which is good! It makes this integration more generic (works with DRAKVUF, not just with DRAKVUF sandbox). But that purpose mismatch causes glitches like this.

I think it's possible to send a PR to DRAKVUF that adds logging of sample hashes to DRAKVUF's injector output. If this is valuable I can take a look at this (I can't promise it gets merged, though). But we can't have this in the GSOC timeline, so I hope the PR can progress without it.

msm-cert · 2024-07-19T18:22:41Z

capa/features/extractors/drakvuf/call.py

+            yield Number(str_to_number(arg_value)), ch.address
+        except ValueError:
+            # yield argument as a string
+            yield String(arg_value), ch.address


Again, if I understand the code correctly, and this iterates over arguments from apimon, arg_value won't be a string. Instead, parsed values look like "0xc6f217efe0:\"ntdll.dll\"" in the JSON. Is that OK?

@yelhamer please comment or address

I'm now yielding the "ntdll.dll" part of the argument in addition to the entire string (we yield the entire string just in case of unexpected argument formats).

@yelhamer can you show some examples from show-features.py? I'm not quite following what you mean by this formatting.

@williballenthin I meant that for "0xc6f217efe0:\"ntdll.dll\"" for example it would yield String("ntdll.dll") and String("0xc6f217efe0:\"ntdll.dll\""), but looking at show-features.py it does give misleading results (displays same argument twice):

yacine@y:~/src/capa/scripts$ python3 show-features.py small_drakmon.log -d
DEBUG:capa:skipping library code matching: only supported by the vivisect backend
global: global: format(pe)
global: global: os(windows)
global: global: arch(amd64)
proc: \Device\HarddiskVolume2\Windows\System32\conhost.exe (ppid=4852, pid=3564)
proc: \Device\HarddiskVolume2\Windows\System32\conhost.exe: string(\\Device\\HarddiskVolume2\\Windows\\System32\\conhost.exe)
thread: 6592
call 0: LdrLoadDll(440203471832, "api-ms-win-core-fibers-l1-1-1", 0x667e2beb90:"api-ms-win-core-fibers-l1-1-1", 0, 2049)

With this in mind, I think I might revert to yielding only "0xc6f217efe0:\"ntdll.dll\"", as we originally planned. It would show up in show-features.py and might give analysts more insight, and it wouldn't be misleading in the way that yielding just "ntdll.dll" is. I also don't imagine we would miss any rule matches by yielding "0xc6f217efe0:\"ntdll.dll\"": the relevant API function expects a memory address there, so I wouldn't expect any rules to base their logic on that argument. Thoughts?
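
For readers following this thread, the two argument-handling options discussed can be sketched as below. This is an illustration only, not the merged code: `parse_argument` mirrors the `int(..., 0)` approach named in the commit history, and `split_pointer_string` is a hypothetical helper for the alternative that was considered and then dropped. It assumes pointer-annotated values always look like `0x<hex>:"<text>"`:

```python
import re
from typing import Optional, Tuple, Union

# Assumption: pointer-annotated arguments look like 0x<hex>:"<text>".
_POINTER_STRING = re.compile(r'^(0x[0-9a-fA-F]+):"(.*)"$', re.DOTALL)


def parse_argument(arg_value: str) -> Union[int, str]:
    """Return the argument as an int when it parses as one, else unchanged."""
    try:
        # Base 0 lets int() accept prefixed literals such as 0x10 or 0o17.
        return int(arg_value, 0)
    except ValueError:
        # Not a plain number (e.g. a pointer-annotated string): keep it whole.
        return arg_value


def split_pointer_string(arg_value: str) -> Optional[Tuple[int, str]]:
    """Split 0x<hex>:"<text>" into (address, text), or None if it differs."""
    match = _POINTER_STRING.match(arg_value)
    if match is None:
        return None
    return int(match.group(1), 16), match.group(2)
```

With these, `parse_argument("2049")` gives the integer 2049, while a value such as `0xc6f217efe0:"ntdll.dll"` falls through to the string branch and stays intact.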

msm-cert · 2024-07-19T18:26:42Z

capa/features/extractors/drakvuf/global_.py

+
+
+def extract_format(report: DrakvufReport) -> Iterator[Tuple[Feature, Address]]:
+    # drakvuf sandbox currently supports only Windows as the guest: https://drakvuf-sandbox.readthedocs.io/en/latest/usage/getting_started.html


Suggested change

# drakvuf sandbox currently supports only Windows as the guest: https://drakvuf-sandbox.readthedocs.io/en/latest/usage/getting_started.html

# DRAKVUF sandbox currently supports only Windows as the guest: https://drakvuf-sandbox.readthedocs.io/en/latest/usage/getting_started.html

For consistency, as suggested somewhere else

Though actually, this is a bit mixed up (for lack of a better word). This comment is technically true - DRAKVUF Sandbox (https://github.com/CERT-Polska/drakvuf-sandbox/) only supports x64 Windows and PE files.

But - in general - this PR should work for DRAKVUF-the-vm-monitor too. In this case, 32-bit Windows and ELF files are supported too:

https://github.com/tklengyel/drakvuf/blob/main/README.md?plain=1#L25

In case of DRAKVUF Sandbox (as a maintainer), we don't need Linux or 32bit binary support here. But I'm just pointing it out to Capa maintainers, as a future extension point.

One thing to consider is that capa tries to determine the sample's format and target OS (and architecture):

capa/capa/helpers.py

Line 120 in da6c6cf

def get_format(sample: Path) -> str:

For architecture I assume we can look at the addresses (32-bit or 64-bit), but for format and target OS I am not really sure how to do that. That's why I restricted this PR to DRAKVUF Sandbox only, but perhaps I should have asked whether there are any suggestions for how to do that (maybe ask for it explicitly via the -f option?). Thoughts?
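
The address-width idea mentioned above could be sketched as follows. This is a heuristic under stated assumptions, not the extractor's actual logic: the function name is hypothetical, and the labels `"amd64"` and `"i386"` are chosen here to match the `arch(amd64)` style seen in the show-features.py output earlier in this thread:

```python
from typing import Iterable, Optional


def guess_arch(addresses: Iterable[int]) -> Optional[str]:
    """Guess the guest architecture from the addresses seen in a trace.

    Any address above the 32-bit range implies a 64-bit guest. The reverse
    does not hold, so "i386" is only a best guess.
    """
    seen_any = False
    for address in addresses:
        seen_any = True
        if address > 0xFFFFFFFF:
            return "amd64"
    # No wide address observed: either a 32-bit guest or too little data.
    return "i386" if seen_any else None
```

Because a 64-bit trace could in principle contain only low addresses, the result is more trustworthy when it is computed over many events.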

msm-cert · 2024-07-19T18:52:46Z

capa/features/extractors/drakvuf/file.py

+    Extract imported function names.
+    """
+    if report.loaded_dlls is None:


I can confirm that DRAKVUF outputs only execution trace (including loaded DLLs and imported functions) and doesn't concern itself with static analysis.

Can I help with resolving it somehow?

williballenthin

see a few inline requests, then I think we're ready to merge!

yelhamer · 2024-07-23T13:53:30Z

@williballenthin I believe I've now addressed the requested inline comments. Mind taking another look?

williballenthin · 2024-07-24T07:18:39Z

@yelhamer (or @msm-cert) is there a public place to acquire DRAKVUF sandbox traces? I'd like to run the extractor against a few examples to see how it works (so far I've relied on your tests).

williballenthin · 2024-07-24T07:21:50Z

finally, would you update the readme to explain the DRAKVUF support, like we do for CAPE? don't have to duplicate all the screenshots, just explain what/how to invoke it.

let's merge this PR today regardless, and continue to add tweaks as necessary.

yelhamer · 2024-07-24T12:05:26Z

Should be good now!

williballenthin · 2024-07-24T12:21:48Z

README.md

+Additionally, capa also supports analyzing sandbox reports for dynamic capability extraction.
+In order to use this, you first submit your sample to one of supported sandboxes for analysis, and then run capa against the generated report file.
+
+Currently, capa supports the [CAPE sandbox](https://github.com/kevoreilly/CAPEv2) and the [DRAKVUF sandbox](https://github.com/CERT-Polska/drakvuf-sandbox/). In order to use either, simply run capa against the generated file (JSON for CAPE or LOG for DRAKVUF sandbox) and it will automatically detect the sandbox and extract capabilities from it.


initial commit

a408629

github-actions bot previously requested changes Jun 11, 2024

update changelog

603d623

yelhamer mentioned this pull request Jun 11, 2024

Add a Test Sample for the Drakvuf Feature Extractor mandiant/capa-testfiles#240

Merged

yelhamer added 3 commits June 11, 2024 01:40

Merge branch 'master' into drakvuf-extractor

90ef348

Update CHANGELOG.md

1e8735a

Update pyproject.toml

d2cdccf

VascoSch92 reviewed Jun 16, 2024

capa/features/extractors/drakvuf/call.py (outdated, resolved)

VascoSch92 reviewed Jun 16, 2024

capa/features/extractors/drakvuf/extractor.py (resolved)

VascoSch92 reviewed Jun 16, 2024

capa/features/extractors/drakvuf/call.py (outdated, resolved)

VascoSch92 reviewed Jun 16, 2024

capa/features/extractors/drakvuf/file.py (outdated, resolved)

capa/features/extractors/drakvuf/file.py (outdated, resolved)

capa/features/extractors/drakvuf/process.py (outdated, resolved)

capa/helpers.py (outdated, resolved)

yelhamer and others added 2 commits June 17, 2024 01:46

Apply suggestions from code review: Typos

840f59f

Co-authored-by: Vasco Schiavo <[email protected]>

capa/helpers.py: update if/else statement

9e13362

Co-authored-by: Vasco Schiavo <[email protected]>

yelhamer mentioned this pull request Jun 19, 2024

Add the ability to select which functions or processes you which to extract capabilities from #2156

Merged

3 tasks

loader.py: replace print() statement with log.info()

2e408d8

williballenthin requested review from mr-tz, williballenthin and mike-hunhoff June 19, 2024 14:19

Merge branch 'master' into drakvuf-extractor

a73d16f

mr-tz reviewed Jun 20, 2024

yelhamer and others added 4 commits June 20, 2024 18:30

Update capa/features/extractors/drakvuf/models.py

b28e0d0

Co-authored-by: Moritz <[email protected]>

extractors/drakvuf/call.py: yield arguments right to left

c05b973

extractors/drakvuf/file.py: add a TODO comment for extracting more file features

70d03eb

extractors/drakvuf/global_.py: add arch extraction

8d4f3c7

yelhamer added 2 commits July 17, 2024 10:42

drakvuf/models.py: remove need to empty report checking

93240f5

tests: add drakvuf models test

c08c5bf

yelhamer requested a review from williballenthin July 17, 2024 11:20

msm-cert reviewed Jul 19, 2024

CHANGELOG.md (outdated, resolved)

yelhamer and others added 3 commits July 19, 2024 22:20

Update capa/features/extractors/drakvuf/global_.py

6e0a9eb

Co-authored-by: msm-cert <[email protected]>

Update tests/test_cape_features.py

2bb7f3c

Co-authored-by: msm-cert <[email protected]>

Update capa/features/extractors/drakvuf/models.py

c0e9150

Co-authored-by: msm-cert <[email protected]>

Apply suggestions from code review: rename Drakvuf to DRAKVUF

897e98b

Co-authored-by: msm-cert <[email protected]>

williballenthin reviewed Jul 23, 2024

capa/helpers.py (outdated, resolved)

williballenthin approved these changes Jul 23, 2024

yelhamer added 3 commits July 23, 2024 12:20

drakvuf/call.py: use int(..., 0) instead of str_to_number()

e786552

remove str_to_number

4cab975

drakvuf/call.py: yield argument memory address value as well

2576aa1

yelhamer added 2 commits July 23, 2024 15:12

Update call.py: remove verbosity in yield statement

b5047a2

Update call.py: yield missing address as well

e26072e

yelhamer requested a review from williballenthin July 23, 2024 20:45

williballenthin added this to the v7.2 milestone Jul 24, 2024

yelhamer and others added 3 commits July 24, 2024 12:31

drakvuf/call.py: yield entire argument string only

d9e3ca1

update readme.md

3e3be41

Update README.md: typo

729679d

Update CHANGELOG.md

3fb0eaf

Co-authored-by: msm-cert <[email protected]>

williballenthin reviewed Jul 24, 2024

williballenthin approved these changes Jul 24, 2024

williballenthin merged commit cf3494d into mandiant:master Jul 24, 2024
8 of 9 checks passed
