
Consider using simd_masked_load for the Read Beyond of Death trick #82

Open
matthieu-m opened this issue Jun 2, 2024 · 6 comments

@matthieu-m

Reading past an array's bounds is unsound

While you are correct that at the machine code level one can read past an array's bounds without invoking UB -- because at that level there are no array bounds, only linear memory -- this is not the case in higher-level languages such as Rust, or even in LLVM IR.

It appears this does not trigger any harmful optimization yet, but it may at any moment (especially when upgrading the compiler), so it would be best to look into replacing the current implementation.

That's what intrinsics are for

It's so common for SIMD algorithms to wish to read beyond array bounds that the Rust programming language has included explicit support for it, in the form of the simd_masked_load intrinsic (and its counterpart, simd_masked_store).

Using the intrinsic guarantees that a Rust compiler, and its various backends, will correctly understand the developer's intent.
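
For illustration, usage could look roughly like the sketch below. This is only a sketch: load_partial is just an illustrative helper, the intrinsic lives in core::intrinsics::simd on current nightlies, and its exact path and signature should be double-checked against the toolchain in use.

#![feature(core_intrinsics, portable_simd)]

use std::intrinsics::simd::simd_masked_load;
use std::simd::{cmp::SimdPartialOrd, i8x16, u8x16};

/// Load up to 16 bytes from `data`, zero-filling the lanes at index >= len.
unsafe fn load_partial(data: *const u8, len: usize) -> u8x16 {
    // Lane i is enabled when i < len; the intrinsic expects a signed-integer
    // mask vector where an all-ones lane means "enabled".
    let indices = i8x16::from_array(core::array::from_fn(|i| i as i8));
    let mask = indices.simd_lt(i8x16::splat(len as i8)).to_int();
    // Disabled lanes take their value from the third argument (zeros here),
    // and their memory is never touched -- which is the whole point.
    simd_masked_load(mask, data, u8x16::splat(0))
}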

Unfortunately, documentation is sparse, so further testing (and discussion with developers) may be necessary to ascertain the precise safety guarantees of the intrinsic -- for example, whether or not it automatically handles reads beyond the end of an OS page -- and to double-check how good the generated code is.

Also unfortunately, it is an unstable intrinsic (requires nightly), which may make it unsuitable for use at the moment, though it could be enabled with a feature flag for nightly users' peace of mind.
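
For instance, the nightly-only path could be gated behind an opt-in Cargo feature; in the sketch below, the feature name and the two helper functions are purely hypothetical:

// Hypothetical sketch: `unstable-masked-load`, `get_partial_masked` and
// `get_partial_urbd` are illustrative names, not existing gxhash items.
#[cfg(feature = "unstable-masked-load")]
#[inline(always)]
pub unsafe fn get_partial(p: *const State, len: usize) -> State {
    // Nightly-only path built on simd_masked_load.
    get_partial_masked(p, len)
}

#[cfg(not(feature = "unstable-masked-load"))]
#[inline(always)]
pub unsafe fn get_partial(p: *const State, len: usize) -> State {
    // Default path, relying on the current read-beyond implementation.
    get_partial_urbd(p, len)
}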

Or use inline assembly.

A last-resort option would be to use inline assembly directly. The code in question is small enough, and there are only two target instruction sets, so it may not prove unduly complex nor too much of a maintenance burden.

@mert-kurttutan

mert-kurttutan commented Jun 2, 2024

An alternative approach is to dispatch to loads of constant length.

// Assumes x86_64 with SSE2; `State` is the crate's 128-bit vector type (`__m128i` here).
use core::arch::x86_64::*;

type State = __m128i;

#[inline(always)]
pub unsafe fn get_partial_safe(data: *const State, len: usize) -> State {
    if len == 4 {
        // Load exactly 4 bytes, zero-filling the rest of the vector.
        let partial_vector = _mm_castps_si128(_mm_load_ss(data as *const f32));
        return _mm_add_epi8(partial_vector, _mm_set1_epi8(4));
    }
    if len == 8 {
        // Load exactly 8 bytes, zero-filling the rest of the vector.
        let partial_vector = _mm_castpd_si128(_mm_load_sd(data as *const f64));
        return _mm_add_epi8(partial_vector, _mm_set1_epi8(8));
    }
    if len == 16 {
        // Full 16-byte load.
        let partial_vector = _mm_loadu_si128(data);
        return _mm_add_epi8(partial_vector, _mm_set1_epi8(16));
    }

    // Quick solution so this compiles; it is incorrect for len other than 4, 8, 16.
    let partial_vector = _mm_loadu_si128(data);
    _mm_add_epi8(partial_vector, _mm_set1_epi8(16))
}

The code above works only for len 4, 8, and 16 (also for 12, if you combine the 8-byte and 4-byte loads in another branch; a rough sketch of that branch follows).
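
For illustration, the len == 12 branch could look like this (an untested sketch, to be added alongside the branches above, using the same intrinsics):

    if len == 12 {
        // Bytes 0..8 into the low half of the vector.
        let lo = _mm_loadl_epi64(data);
        // Bytes 8..12 into a second vector, shifted up into byte positions 8..12.
        let tail = ((data as *const u8).add(8) as *const i32).read_unaligned();
        let hi = _mm_slli_si128::<8>(_mm_cvtsi32_si128(tail));
        let partial_vector = _mm_or_si128(lo, hi);
        return _mm_add_epi8(partial_vector, _mm_set1_epi8(12));
    }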

  • It gives the same performance as the current implementation (for input sizes 4, 8, and 16).
  • This is done using only vector intrinsics and nothing else.
  • With this code you can also get rid of the page-related safety check:
#[inline(always)]
pub unsafe fn get_partial(p: *const State, len: usize) -> State {
    // Old version, with the page-based safety check, commented out:
    // if check_same_page(p) {
    //     get_partial_unsafe(p, len)
    // } else {
    //     get_partial_safe(p, len)
    // }
    get_partial_safe(p, len)
}

This obviously has some cons.

  • The implementations for non-trivial input sizes might be cumbersome. For instance, size 5 would be a combination of 4+1. I haven't done the work to make sure these can be done (efficiently).
  • Another con is that the code does not seem optimal if you don't want to write and maintain branches for each input size (1, 2, ..., 16).
  • Branching might come at a performance cost, but I did not do any benchmark study with different branch-prediction conditions; I only used your benchmark to measure performance.

So I am throwing this out as an idea, in case it is valuable to you based on the pros and cons I gave.

Edit: I find it appropriate to write this here rather than as an additional issue; tell me if you'd prefer I move it to a separate one.

@ogxd
Owner

ogxd commented Jun 12, 2024

Hello!

Reading past an array's bounds is unsound

It may be unsafe, but I don't think it is unsound, because of the check performed. I understand your concern, though, that it might stop working in the future or fail under some very specific conditions. However, it currently works well (unless proven otherwise), and it is one of the reasons this algorithm is an order of magnitude faster than other non-cryptographic algorithms. For most applications, any other widespread algorithm should do the job just fine and won't be a bottleneck. For other applications, it is an informed choice to rely on GxHash.
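
For context, the check in question is a same-page test: the extra bytes read by the full-vector load can only fault if the load crosses into an unmapped page, so the unchecked path is only taken when the read provably stays within the current page. A rough sketch of such a check (not necessarily the exact gxhash code; 4 KiB pages assumed):

#[inline(always)]
fn check_same_page(ptr: *const State) -> bool {
    const PAGE_SIZE: usize = 0x1000;
    const VECTOR_SIZE: usize = 16;
    // Offset of the pointer within its 4 KiB page.
    let offset_within_page = ptr as usize & (PAGE_SIZE - 1);
    // The 16-byte read stays within the current page only if it starts at
    // least VECTOR_SIZE bytes before the page end.
    offset_within_page <= PAGE_SIZE - VECTOR_SIZE
}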

That said, if there is a way to be safer without compromising on performance, we'd gladly take it. Here I tried using simd_masked_load as you suggested; however, performance is much worse and the generated code is heavier (this was done on ARM, which maybe is not helping).

With read-beyond:

Lloh3:
        ldr q1, [x8, lCPI1_0@PAGEOFF]
        cmgt.16b v1, v0, v1
        ldr q2, [x0]
        and.16b v1, v2, v1

With simd_masked_load:

Lloh5:
        ldr q1, [x8, lCPI1_0@PAGEOFF]
        cmgt.16b v1, v0, v1
Lloh6:
        adrp x8, lCPI1_1@PAGE
Lloh7:
        ldr q2, [x8, lCPI1_1@PAGEOFF]
        and.16b v1, v1, v2
        ext.16b v2, v1, v1, #8
        zip1.16b v1, v1, v2
        addv.8h h1, v1
        fmov w8, s1
        movi.2d v1, #0000000000000000
        tbnz w8, #0, LBB1_65
        tbnz w8, #1, LBB1_66
        tbnz w8, #2, LBB1_67
        tbnz w8, #3, LBB1_68
        tbnz w8, #4, LBB1_69
        tbnz w8, #5, LBB1_70
        tbnz w8, #6, LBB1_71
        tbnz w8, #7, LBB1_72
        tbnz w8, #8, LBB1_73
        tbnz w8, #9, LBB1_74
        tbnz w8, #10, LBB1_75
        tbnz w8, #11, LBB1_76
        tbnz w8, #12, LBB1_77
        tbnz w8, #13, LBB1_78
        tbnz w8, #14, LBB1_79
        tbz w8, #15, LBB1_40
        add x8, x0, #15
        ld1.b { v1 }[15], [x8]
LBB1_40:

@mert-kurttutan Unfortunately, this approach will inevitably lead to much larger generated code than the URBD and a lot more branching. I doubt it can be nearly as performant once finalized to handle all sizes.

@phip1611

phip1611 commented Jun 13, 2024

It may be unsafe, but I don't think it is unsound, because of the check performed

I also believe it is unsafe but not unsound. Otherwise, if the original claim in the issue is correct, it would make every "external ptr + length = type/slice" construction unsound, right? However, this is a typical pattern used in many crates, such as multiboot2, so it must be known to rustc/LLVM.

I think this is valid memory management.

EDIT/UPDATE: The longer I think about this, the more unsure I get -- matthieu-m might be right about the unsoundness 🤔

@RalfJung

RalfJung commented Jun 27, 2024

It is definitely unsound to load more bytes than what has been allocated / than the reference points to. Miri may even support enough Intel intrinsics to be able to tell you that. :)
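
For instance, something along the following lines could be run under Miri (cargo +nightly miri run, on an x86_64 target); this is an illustrative snippet, assuming Miri supports the _mm_loadu_si128 intrinsic used here:

use core::arch::x86_64::*;

fn main() {
    // Only 4 bytes are allocated; the 16-byte load below reads past the end of
    // the allocation, even though it may never cross a page boundary.
    let data = [0u8; 4];
    unsafe {
        let v = _mm_loadu_si128(data.as_ptr() as *const __m128i);
        std::hint::black_box(v);
    }
}

Miri should report an out-of-bounds read for such a program, regardless of whether the read stays within one OS page.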

@ogxd
Owner

ogxd commented Jun 27, 2024

It would be interesting to try! However regarding soundness, from the Miri readme itself:

Moreover, Miri fundamentally cannot tell you whether your code is sound. Soundness is the property of never causing undefined behavior when invoked from arbitrary safe code, even in combination with other sound code.

@RalfJung

RalfJung commented Jun 28, 2024 via email
