Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace TCC Header Parser by Tree-Sitter based one #275

Closed
ghost opened this issue Dec 27, 2020 · 10 comments · Fixed by #1047
Closed

Replace TCC Header Parser by Tree-Sitter based one #275

ghost opened this issue Dec 27, 2020 · 10 comments · Fixed by #1047
Assignees
Labels
refactor Refactoring requests RzType
Milestone

Comments

@ghost
Copy link

ghost commented Dec 27, 2020

Current State

There are currently two parsers for different parts of C syntax in rizin:

Parsing of C headers for struct definitions, enums, etc. using tcc into base types for the database:

/* c */
// why we have analysis scoped things in rparse
RZ_API char *rz_parse_c_string(RzAnalysis *analysis, const char *code, char **error_msg);
RZ_API char *rz_parse_c_file(RzAnalysis *analysis, const char *path, const char *dir, char **error_msg);
RZ_API void rz_parse_c_reset(RzParse *p);

Parsing C Type expressions like char *[42] using a custom mpc-based parser into ctypes to be used on top of base types in variables, etc:

/* ctype */
// Parses strings like "const char * [0x42] const * [23]" to RzParseCTypeType
typedef struct rz_parse_ctype_t RzParseCType;
typedef enum {
RZ_PARSE_CTYPE_TYPE_KIND_IDENTIFIER,
RZ_PARSE_CTYPE_TYPE_KIND_POINTER,
RZ_PARSE_CTYPE_TYPE_KIND_ARRAY
} RzParseCTypeTypeKind;
typedef enum {
RZ_PARSE_CTYPE_IDENTIFIER_KIND_UNSPECIFIED,
RZ_PARSE_CTYPE_IDENTIFIER_KIND_STRUCT,
RZ_PARSE_CTYPE_IDENTIFIER_KIND_UNION,
RZ_PARSE_CTYPE_IDENTIFIER_KIND_ENUM
} RzParseCTypeTypeIdentifierKind;
typedef struct rz_parse_ctype_type_t RzParseCTypeType;
struct rz_parse_ctype_type_t {
RzParseCTypeTypeKind kind;
union {
struct {
RzParseCTypeTypeIdentifierKind kind;
char *name;
bool is_const;
} identifier;
struct {
RzParseCTypeType *type;
bool is_const;
} pointer;
struct {
RzParseCTypeType *type;
ut64 count;
} array;
};
};

The tcc one for base types has several limitations such as usage of global state and it also writes out directly the sdb records, which is super unflexible and hard to inspect. The mpc one for ctypes works ok, but mpc also has some aspects that make it annoying to use and extend (see the amount of strcmp in https://github.com/rizinorg/rizin/blob/a3a339b11b1e241ee95201d7b666ff291c8e2100/librz/parse/ctype.c).

Solution

tree-sitter-cpp should be used to replace tcc. The parser for base types should be written in such a way that it emits RzAnalysisBaseType structs and not raw sdb. Thus, the results will be able to be used more flexible or just stored into the database through the api.

Because a complete C(++) parser and thus presumably also tree-sitter-cpp must also parse cast expressions such as (char *[42])<something>, the parser for the char *[42] in this example could also be re-used directly for RzParseCTypeType, replacing mpc there.

Conflicts/Dependencies

This has a little bit of a conflict with #371 if the parser is done first because the new parser will generate RzAnalysisBaseType objects, but #371 will modify the structure of it to use RzParseCTypeType instead of strings, forcing the newly written parser to be adapted too.
If however the parser is done second and #371 first, then perhaps the old mpc parser will have to be integrated in many more places, making refactoring to the tree sitter parser a bit harder.
I think the best solution to solve this will be to implement the parser first, and in the parser creating a RzParseCTypeType whenever such a type is encountered, but only in the very last step when constructing the RzAnalysisBaseType, converting that to a string. As such, this issue depends on #370.

While writing the new parser, one will probably encounter the case of inline definitions of struct, enums and functions. For functions, this will depend on #373. For other inline types, I think with the current ctypes/base types separation it's something we cannot support and should postpone to later when all the refactoring is done and we might consider merging these two types of types.


Original issue contents:

Is your feature request related to a problem? Please describe.
Do I really need to explain why a custom parser is bad idea?

Describe the solution you'd like
Remove the custom parser and use some alternatives maybe libclang?

Describe alternatives you've considered
Put the custom parser in another repository and build a large (very large) CI.

Additional context
https://stackoverflow.com/questions/6673148/parsing-c-header-files-to-extract-information-about-data-types-functions-and-fu

@XVilka
Copy link
Member

XVilka commented Dec 28, 2020

We plan to employ tree-sitter-cpp for this, libclang is too heavy. Some customness might be required, something like __usercall func(int a<eax>, ...) would be a nice feature to have.
Since we use tree-sitter for command parsing already, it's an obvious choice: https://github.com/rizinorg/rizin/tree/dev/shlr/rizin-shell-parser

@XVilka XVilka added the refactor Refactoring requests label Dec 28, 2020
@ghost
Copy link
Author

ghost commented Dec 29, 2020

I create a small POC with libclang, and I identify 3 major advantages against the tree-sitter.

  1. run preprocessor instructions (needed if you want to parse real world header)
  2. detect type size and field offset (architecture dependent and hard to deduce)
    input
struct padding {
    char a;
    int b;
    float c;
};

struct packed {
    char a;
    long int b;
    float c;
} __attribute__((__packed__));

output:

struct padding {
        char a; # size = 1 byte, offset = 0 byte
        int b; # size = 4 byte, offset = 4 byte
        float c; # size = 4 byte, offset = 8 byte
}
struct packed {
        char a; # size = 1 byte, offset = 0 byte
        long b; # size = 8 byte, offset = 1 byte
        float c; # size = 4 byte, offset = 9 byte
}
  1. change target architecture (can replace the use of stdint library)

target i386

struct packed {
        char a; # size = 1 byte, offset = 0 byte
        long b; # size = 4 byte, offset = 1 byte
        float c; # size = 4 byte, offset = 5 byte
}

target x86_64

struct packed {
        char a; # size = 1 byte, offset = 0 byte
        long b; # size = 8 byte, offset = 1 byte
        float c; # size = 4 byte, offset = 9 byte
}

Perhaps the type definition system could depend on the architecture of the loaded binary?

source code:

#include "clang-c/Index.h"

#include <err.h>
#include <stdio.h>
#include <string.h>

#define PROJECT_NAME "test_libclang"

enum CXChildVisitResult display_struct_decl(
    CXCursor cursor,
    CXCursor parent,
    __attribute((unused)) CXClientData client_data) {
    CXString field_name = clang_getCursorSpelling(cursor);
    CXType struct_type = clang_getCursorType(cursor);
    CXType parent_type = clang_getCursorType(parent);

    CXString kind_name = clang_getTypeSpelling(struct_type);
    const char *kind_name_c = clang_getCString(kind_name);

    if (strlen(kind_name_c) == 0) {
        return CXChildVisit_Continue;
    }

    int field_size = clang_Type_getSizeOf(struct_type);

    const char *field_name_c = clang_getCString(field_name);
    int field_offset = clang_Type_getOffsetOf(parent_type, field_name_c) / 8;

    printf(
        "\t%s %s; # size = %d byte, offset = %d byte\n",
        clang_getCString(kind_name),
        field_name_c,
        field_size,
        field_offset);

    clang_disposeString(field_name);
    clang_disposeString(kind_name);

    return CXChildVisit_Continue;
}

void display_struct(CXCursor struct_cursor) {
    CXString struct_name = clang_getCursorSpelling(struct_cursor);

    printf("struct %s {\n", clang_getCString(struct_name));

    clang_visitChildren(struct_cursor, display_struct_decl, NULL);

    printf("}\n");

    clang_disposeString(struct_name);
}

enum CXChildVisitResult display_cursor_info(
    CXCursor cursor,
    __attribute((unused)) CXCursor parent,
    __attribute((unused)) CXClientData client_data) {
    enum CXCursorKind cursor_type = clang_getCursorKind(cursor);

    if (cursor_type == CXCursor_StructDecl) {
        display_struct(cursor);
    }

    return CXChildVisit_Continue;
}

void display_info(CXTranslationUnit unit) {
    CXCursor root = clang_getTranslationUnitCursor(unit);
    clang_visitChildren(root, display_cursor_info, NULL);
}

void parse_header(
    const char *filename,
    const char *const *arguments,
    int argc) {
    CXIndex index = clang_createIndex(0, 0);
    CXTranslationUnit unit = clang_parseTranslationUnit(
        index,
        filename,
        arguments,
        argc,
        NULL,
        0,
        CXTranslationUnit_None);
    if (unit == NULL) {
        errx(1, "Invalid header file.");
    }

    display_info(unit);

    clang_disposeTranslationUnit(unit);
    clang_disposeIndex(index);
}

int main(int argc, char **argv) {
    if (argc < 2) {
        printf("%s takes 1 arguments.\n", argv[0]);
        return 1;
    }

    parse_header(argv[1], argv + 2, argc - 2);

    return 0;
}

@XVilka
Copy link
Member

XVilka commented Dec 30, 2020

Perhaps the type definition system could depend on the architecture of the loaded binary?

It should depend on the architecture of the binary and OS, indeed. It
already depends in shlr/tcc after some of my modifications (remember, the used tcc code is diverged a lot from the vanilla one):

} else if (bt == VT_DOUBLE || bt == VT_INT64) {
		if (!strncmp (tcc_state->arch, "x86", 3) && tcc_state->bits == 32) {
			if (!strncmp (tcc_state->os, "windows", 7)) {
				*a = 8;
			} else {
				*a = 4;
			}

shlr/tcc/tccgen.c:539

Regarding the using of libclang as a dependency of such a core feature of Rizin - I am strongly against is. From my experience libclang has no stable API and it's a huge pain to maintain anything built upon it, migrating from version to version. Even worse if you need to support multiple versions of it or some differences on how different distributions packaged it. Using Clang just for a simple data types parsing is an overkill. It will make our CI cry as well.

I think using tree-sitter-cpp is a way to go. If their preprocessor support is insufficient we just could use old tccpp from shlr/tcc/tccpp.c. It is self-contained and doesn't depend much on other parts of tcc. It has proven itself on parsing real world code like Linux kernel or Glibc headers. Moreover, with other existing languages parsers for tree-sitter we could add support for data types of different languages like Go, Rust, etc.

But your libclang-based PoC might be useful for generating tests for the new C data types parser.

@ghost
Copy link
Author

ghost commented Dec 30, 2020

Does Rizin really need to be able to parse the c syntax (and header file)?

If C parsing is not the priority, we can use a simpler solution. We could use another language to represent the type system.
I was thinking using s-exp.

example:

(struct (name "test") (field (u8) "a") (field (u64) (offset 4) "b"))
(enum (name "hello") (field "a") (field "b" 1))

offset is optional, by default the structure must packed each field.

struct test {
    u8 a;
    u64 b;
};

enum hello {
    a,
    b = 1,
}

S-exp are easy to parse (we could use tree-sitter) and to serialize (to save configuration).
If the internal representation is good enough, i believe that SDB could be removed.
We need to use stdint type (u8, u16, ...) as default type.

And add a command to load arch/os type dependent.

> `cmd` x86_64.type

x86_64.type

(typedef (u8) (char))
(typedef (u64) (long))
...

What do you think about it?

@karliss
Copy link
Member

karliss commented Dec 30, 2020

For the purpose of parsing real life headers using clang it might be better have it as separate helper executable or script instead of linking libclang directly with into rizin. Linking to libclang has multiple potential problems:

  • packing issues, wanting to include or not include such a huge library (concerns devs, users and packagers), varying versions available on linux Distros
  • conflict with retdec decompiler
  • conflict with clang used by Linux graphic drivers, would cause problems when linking into Cutter libclang_a <-> gpu driver/libs <-> cutter <-> rizin <-> libclang_b for header parsing (RetDec already plugin already has this problem)

@karliss
Copy link
Member

karliss commented Dec 30, 2020

Another note about using libclang for parsing real life headers unless you are on Linux parsing Linux library headers is a challenging problem. Standard library and system library headers include a lot of compiler specific junk that often can't be easily parsed by other compiler parsers. And the standard library headers can't be simply replaced with standard library used by Clang because that could potentially change the meaning in main headers you are parsing, especially if you care about exact byte layout.

So a robust solution which can work with Windows and macOS headers will likely require combination careful flag tuning/maching with target platform, replacements for features that can't be dealt with, some header replacements. It is not impossible but it takes some effort and maintenance to keep it working.

Latest MSVC versions are somewhat easier to handle, but they still occasionally break things as demonstrated by Cutter's current need to use older MSVC version so that shiboken (which uses libclang) can deal with MSVC headers.

Even on macOS with XCode which also uses Clang things aren't perfect. It isn't quite the same as open source clang, they have their own private fork which is neither newer than older than open source Clang. It is also not the same LLVM as in their public swift repos. They have their own changes which may take months until they get open sourced or reimplemented. The opposite is also possible where latest open source clang have incompatible changes which aren't included in the latest XCode version.

At the same time if you manage clang to get through with parsing, having differences in layout of few structures is probably not too bad for reverse engineering purposes. That's still 10x more correct type definitions than if you had to manually define them. Not having to understand content of function bodies (almost because C++ can be complicated) also makes things a bit easier.

@XVilka
Copy link
Member

XVilka commented Dec 31, 2020

Does Rizin really need to be able to parse the c syntax (and header file)?

Yes. There is no way to avoid this.

S-exp are easy to parse (we could use tree-sitter)

If we use tree-sitter we can parse both C and C++ definitions too ;)

@XVilka XVilka added this to To do in Types via automation Jan 7, 2021
@thestr4ng3r thestr4ng3r changed the title Remove the custom C parser Replace TCC Header Parser by Tree-Sitter based one Jan 16, 2021
@thestr4ng3r thestr4ng3r moved this from To do and ready to start to Blocked in Types Jan 16, 2021
@thestr4ng3r
Copy link
Member

I've updated the description with a concrete plan.

@XVilka XVilka added the RzType label Jan 22, 2021
@XVilka XVilka self-assigned this Feb 13, 2021
@XVilka XVilka added this to To do in Tree Sitter Mar 12, 2021
@XVilka XVilka added this to the 0.3.0 milestone Apr 2, 2021
@XVilka XVilka moved this from Blocked to In progress in Types Apr 7, 2021
@XVilka
Copy link
Member

XVilka commented Apr 27, 2021

@ret2libc why removed, it's already WIP in #1047

Tree Sitter automation moved this from To do to Done Jul 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
refactor Refactoring requests RzType
Projects
Status: Done
Types
  
Done
Development

Successfully merging a pull request may close this issue.

4 participants