Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Instruction addrs, block labels & llvm disassembly #6

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

ailrst
Copy link
Contributor

@ailrst ailrst commented Feb 15, 2024

Adds more information and structure to the json inserted into the gtirb output.
This sacrifices some compactness for readability and convenience.

This also adds llvm-16 as a dependency of this package.

For each instruction we now have an address (just block_address + i * opcode_size), the llvm disassembly, and the opcode itself.

For each block if a label can be found in any of the Module.symbols that refers to the block uuid then it is added.

The output now looks as below, so we will hold off on merging this until the basil gtirb loader is updated accordingly.

  "JgXUTTeoR9qjEjJwNU/5+w==": {
    "label": "main",
    "addr": 1812,
    "instructions": [
      {
        "assembly": "adrp x0, #131072",
        "addr": 1812,
        "opcode_le": "00 01 00 90",
        "opcode_be": "0x90000100",
        "semantics": [
          "Stmt_Assign(LExpr_Array(LExpr_Var(_R),Expr_LitInt(\"0\")),Expr_LitBits(\"0000000000000000000000000000000000000000000000100000000000000000\"))"
        ]
      },
      {
        "assembly": "add x0, x0, #28",
        "addr": 1816,
        "opcode_le": "00 70 00 91",
        "opcode_be": "0x91007000",
        "semantics": [
          "Stmt_Assign(LExpr_Array(LExpr_Var(_R),Expr_LitInt(\"0\")),Expr_TApply(add_bits.0,[(Expr_LitInt(\"64\"))],[(Expr_Array(Expr_Var(_R),Expr_LitInt(\"0\")));(Expr_LitBits(\"000000000000000000
0000000000000000000000000000000000000000011100\"))]))"
        ]
      },

@ailrst ailrst mentioned this pull request Feb 15, 2024
@l-kent
Copy link

l-kent commented Feb 15, 2024

I'm not sure any of this really serves much purpose. Readability of the internal JSON does not matter - no one needs to directly read the internal JSON, and we have debug-gts.py to create more human-readable outputs.

What does the LLVM disassembly look like? I don't see it in your example output. I also don't see the point in adding that and the opcode to the internal JSON for the semantics - those aren't going to be any use to BASIL so they will just take up space. If they are necessary for something else, it would make more sense to add those to debug-gts.py output.

The labels you add will in practice just be the function name for function entry blocks, so adding it to the internal semantics representation doesn't really serve much purpose when we already have access to that information.

We can already derive the address of blocks and instructions from the GTIRB data structures, so there is not much benefit to duplicating that information. It would be useful to have the addresses of instructions displayed in debug-gts.py though (the addresses of blocks are already displayed).

It would be easier to add support for changes like this to BASIL after the changes have been merged in - it is not necessary to wait for BASIL to be updated.

@ailrst
Copy link
Contributor Author

ailrst commented Feb 16, 2024

I think there is benefit to having a readable/debuggable format in the internal json.

What does the LLVM disassembly look like? I don't see it in your example output

This line is the LLVM disassembly

"assembly": "adrp x0, #131072",

@l-kent
Copy link

l-kent commented Feb 16, 2024

What is that benefit supposed to be? As I said, no one needs to directly read the internal JSON. The debug-gts.py output already contains most of this information, and would be a more suitable location for the rest of this.

@katrinafyi
Copy link
Member

As I see it, the goal of this PR is to consolidate the gtirb processing into one tool. debug-gts.py was a bit of a quick-fix solution to the unreadable gtirb problem.

Regarding space, increasing the gtirb size could lead to a faster increase in the git repository size. But I think a better solution here is to (eventually) move the gtirb files outside of git. Is this done for reproducibility of the results (i.e. avoiding compiler and ddisasm differences)? If so, this might be solvable by making the compiler->ddisasm pipeline deterministsic. Otherwise, the binary files could also be stored elsewhere with less space constraints.

One more advantage of Ocaml is we can use Aslp's pp_stmt method to obtain pretty-printed semantics. This can provide slightly nicer output such as, for adds x1, x2, x3,

constant bits ( 64 ) Cse0__5 = add_bits.0 {{ 64 }} ( __array _R [ 2 ],__array _R [ 3 ] ) ;
PSTATE . V = not_bits.0 {{ 1 }} ( cvt_bool_bv.0 {{  }} ( eq_bits.0 {{ 65 }} ( SignExtend.0 {{ 64,65 }} ( Cse0__5,65 ),add_bits.0 {{ 65 }} ( SignExtend.0 {{ 64,65 }} ( __array _R [ 2 ],65 ),SignExtend.0 {{ 64,65 }} ( __array _R [ 3 ],65 ) ) ) ) ) ;
PSTATE . C = not_bits.0 {{ 1 }} ( cvt_bool_bv.0 {{  }} ( eq_bits.0 {{ 65 }} ( ZeroExtend.0 {{ 64,65 }} ( Cse0__5,65 ),add_bits.0 {{ 65 }} ( ZeroExtend.0 {{ 64,65 }} ( __array _R [ 2 ],65 ),ZeroExtend.0 {{ 64,65 }} ( __array _R [ 3 ],65 ) ) ) ) ) ;
PSTATE . Z = cvt_bool_bv.0 {{  }} ( eq_bits.0 {{ 64 }} ( Cse0__5,'0000000000000000000000000000000000000000000000000000000000000000' ) ) ;
PSTATE . N = Cse0__5 [ 63 +: 1 ] ;
__array _R [ 1 ] = Cse0__5 ;

This is not trivial to replicate in other languages.

@l-kent
Copy link

l-kent commented Feb 16, 2024

The .gtirb and executable files are not included in the BASIL repo, but the .gts and .json (from debug-gts.py) files are. Reproducibility of the results is necessary - we want to be testing BASIL for specific .gts files, not testing the entire pipeline all at once. Making the pipeline deterministic and only storing the test source files is not a solution to this, because changes to the lifting/disassembly pipeline would then propagate to BASIL and it would be more difficult to identify the source of issues. It would also significantly slow the execution of tests.

Storing all this information directly in the .gts does nothing for convenience, as it will still be necessary to run a script to extract the json from the .gts file in order to actually read it, meaning we end up in exactly the same place except with extra redundancy.

Since the .json files are directly derivable from the .gts at present, it would be possible to not store the .json files and only create them locally (from the .gts files which are in the repo) in order to save space.

There is no point to adding further information to the .gts files unless it is necessary for BASIL, or not directly derivable from the information that is already there.

@katrinafyi
Copy link
Member

katrinafyi commented Feb 16, 2024

Well, the point of making the compile/ddisasm pipeline reproducible is to also pin the pipeline, thus avoiding changes, and producing byte-for-byte identical gts files from only the c source files - addressing the need for reproducible results. However, this approach is looking like a lot of work and might not be practical.

My second point was that storing binary files (of any kind) in git can quickly bloat the repository size. At the moment, the .gts files are 57MB and it will increase by this much every time the files are updated. Is this something on your mind? If size is a concern, it might be a good idea to avoid committing .gts files at all? The reproducible pipeline is only one such solution for this.

As for this PR, I would like to see the pretty-printed semantics included but I am otherwise neutral.

@l-kent
Copy link

l-kent commented Feb 16, 2024

We can't avoid committing the .gts files at present, and ultimately the reproducible pipeline is a separate issue to the question of what should be contained within the .gts files right now. I don't have a problem if you want to pursue a stable, reproducible pipeline, but we also aren't at a stage where the pipeline is stable.

I'm not necessarily opposed to including the pretty-printed semantics (though they aren't in this PR), they're something that may not be directly derivable from what's already there. I suppose it may be possible to replace debug-gts.py with something written in OCaml that serves the same purpose and can make use of the ASLp pretty printer but that's probably a lot of hassle - maybe worth pursuing though?

katrinafyi added a commit to katrinafyi/pac-nix that referenced this pull request Feb 25, 2024
katrinafyi added a commit to katrinafyi/pac-nix that referenced this pull request Mar 1, 2024
katrinafyi added a commit that referenced this pull request Sep 18, 2024
schema for semantics and successors are tweaked to avoid
storing data within JSON keys.

addresses are also now formatted as hex.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants