[Misdetection] zip file misdetected as <any type you want> #765

Open
kazet opened this issue Oct 24, 2024 · 0 comments
Labels
misdetection: This issue is about a misdetection on a content type currently supported
needs triage: This issue still needs triage by one of the maintainers

kazet commented Oct 24, 2024

Hello,

This Python script modifies any zip file, by prepending a few bytes, so that it is misdetected as some other file type:

import binascii
import random
from magika import Magika

with open("test.zip", "rb") as f:
    x = f.read()

m = Magika()
prefix = b""

ZIP_FILE_FORMATS = ["zip", "jar", "rpm", "epub", "ods"]

for i in range(10000):
    old_res = m.identify_bytes(prefix + x)
    new_prefix = prefix

    operation = random.choice(
        ["REMOVE_FIRST", "REMOVE_LAST", "ADD_FIRST", "ADD_LAST"] + ["REPLACE"] * 2
    )
    if operation == "REMOVE_FIRST" and len(new_prefix) > 1:
        new_prefix = new_prefix[1:]
    elif operation == "REMOVE_LAST" and len(new_prefix) > 1:
        new_prefix = new_prefix[:-1]
    elif operation == "ADD_FIRST":
        new_prefix = bytes([random.randint(0, 255)]) + new_prefix
    elif operation == "ADD_LAST":
        new_prefix = new_prefix + bytes([random.randint(0, 255)])
    elif operation == "REPLACE" and len(new_prefix) >= 1:
        # Use a separate name so the loop counter `i` is not shadowed.
        pos = random.randint(0, len(new_prefix) - 1)
        new_prefix = (
            new_prefix[:pos] + bytes([random.randint(0, 255)]) + new_prefix[pos + 1 :]
        )
        assert len(new_prefix) == len(prefix)

    new_res = m.identify_bytes(new_prefix + x)
    if (
        new_res.output.ct_label not in ZIP_FILE_FORMATS
        and new_res.dl.ct_label not in ZIP_FILE_FORMATS
    ):
        print("success: prefix=", binascii.hexlify(new_prefix), "result=", new_res)
        break

    if new_res.output.score < old_res.output.score:
        prefix = new_prefix

with open("out.zip", "wb") as f:
    f.write(new_prefix + x)

The above script produces zip files that are misdetected as JPEG, pcap, etc., even though the prepended bytes are not valid JPEG or pcap magic numbers.

Example:

success: prefix= b'bed801' result= MagikaResult(path='-', dl=ModelOutputFields(ct_label='jpeg', score=0.8492175936698914, group='image', mime_type='image/jpeg', magic='JPEG image data', description='JPEG image data'), output=MagikaOutputFields(ct_label='unknown', score=0.8492175936698914, group='unknown', mime_type='application/octet-stream', magic='data', description='Unknown binary data'))

success: prefix= b'04b224' result= MagikaResult(path='-', dl=ModelOutputFields(ct_label='pcap', score=0.5953978300094604, group='application', mime_type='application/vnd.tcpdump.pcap', magic='pcap capture file', description='pcap capture file'), output=MagikaOutputFields(ct_label='unknown', score=0.5953978300094604, group='unknown', mime_type='application/octet-stream', magic='data', description='Unknown binary data'))
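
For anyone triaging this, here is a minimal sketch (standard library only, it does not call Magika) that checks two things about the prefixes above: that they do not begin with the usual JPEG or classic pcap magic numbers, and that Python's zipfile module still reads the prefixed archive. The helper names `make_zip` and `check_prefixed` and the tiny in-memory archive are my own stand-ins for test.zip, not part of the original report.

```python
import io
import zipfile

# Well-known leading magic numbers of the formats (not taken from Magika).
JPEG_MAGIC = b"\xff\xd8\xff"
PCAP_MAGICS = (b"\xd4\xc3\xb2\xa1", b"\xa1\xb2\xc3\xd4")  # little/big endian
ZIP_LOCAL_HEADER = b"PK\x03\x04"


def make_zip() -> bytes:
    """Build a tiny in-memory archive (hypothetical stand-in for test.zip)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("hello.txt", "hello")
    return buf.getvalue()


def check_prefixed(prefix: bytes, archive: bytes) -> dict:
    """Prepend `prefix` to `archive` and report what the leading bytes look like."""
    data = prefix + archive
    # zipfile locates the central directory from the end of the file,
    # so leading junk bytes do not prevent the archive from being read.
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = zf.namelist()
        first_bad = zf.testzip()  # None when every member passes its CRC check
    return {
        "starts_with_zip_magic": data.startswith(ZIP_LOCAL_HEADER),
        "looks_like_jpeg": data.startswith(JPEG_MAGIC),
        "looks_like_pcap": data.startswith(PCAP_MAGICS),
        "names": names,
        "first_bad": first_bad,
    }


archive = make_zip()
# The two prefixes reported in the example output above.
results = {p: check_prefixed(bytes.fromhex(p), archive) for p in ("bed801", "04b224")}
for p, r in results.items():
    print(p, r)
```

If the checks behave as I expect, the prefixed data no longer starts with the zip local-header signature and matches neither the JPEG nor the pcap signature, yet the archive members are still listed and pass their CRC check. That would suggest the model is reacting to the first few bytes rather than to any real format signature.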
kazet added the misdetection and needs triage labels Oct 24, 2024