Releases: simd-everywhere/simde-no-tests
v0.8.2
SIMDe 0.8.2
Summary
- Start of RISCV64 optimized implementation using the RVV1.0 vector extension! Thank you @eric900115 @howjmay @zengdage
- 62 of the ARM Neon intrinsics added in SIMDe 0.8.0 had to be removed for not exactly matching the specs and real hardware
(from the FCVTZS/FCVTMS/FCVTPS/FCVTNS families). This brings us down from 100% coverage of the NEON functions to 99.07%.
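The four instruction families named above differ only in rounding direction: toward zero (FCVTZS), toward minus infinity (FCVTMS), toward plus infinity (FCVTPS), and to nearest with ties to even (FCVTNS). As a rough scalar illustration of the behavior an emulation has to match, here is a minimal sketch for one `float` to `int32_t` conversion. It assumes the saturating semantics of these conversions (NaN becomes 0, out-of-range inputs clamp). The names `round_mode` and `cvt_f32_to_s32` are hypothetical and are not SIMDe API.

```c
#include <stdint.h>

/* Hypothetical names for illustration only; not part of SIMDe. */
typedef enum {
  ROUND_ZERO,         /* FCVTZS-style: toward zero */
  ROUND_MINUS_INF,    /* FCVTMS-style: toward minus infinity */
  ROUND_PLUS_INF,     /* FCVTPS-style: toward plus infinity */
  ROUND_NEAREST_EVEN  /* FCVTNS-style: to nearest, ties to even */
} round_mode;

static int32_t cvt_f32_to_s32(float x, round_mode mode) {
  if (x != x) return 0;                       /* NaN converts to 0 */
  if (x >= 2147483648.0f) return INT32_MAX;   /* saturate on the high side */
  if (x <= -2147483648.0f) return INT32_MIN;  /* saturate on the low side */

  int32_t t = (int32_t) x;     /* in range, so the cast is defined; truncates */
  float frac = x - (float) t;  /* exact remainder, carries the sign of x */

  switch (mode) {
    case ROUND_MINUS_INF:
      return (frac < 0.0f) ? t - 1 : t;
    case ROUND_PLUS_INF:
      return (frac > 0.0f) ? t + 1 : t;
    case ROUND_NEAREST_EVEN:
      if (frac > 0.5f || (frac == 0.5f && (t & 1))) return t + 1;
      if (frac < -0.5f || (frac == -0.5f && (t & 1))) return t - 1;
      return t;
    case ROUND_ZERO:
    default:
      return t;
  }
}
```

The edge cases (ties, NaN, saturation) are where an emulation can easily diverge from hardware, so those are the inputs worth checking first.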
Details
Implementation of Arm intrinsics
NEON
- arm neon: disable some FCVTZS/FCVTMS/FCVTPS/FCVTNS family intrinsics 339ffe4 @mr-c
- arm neon sm3: check constant range 3d34fcd @mr-c
- arm 32 bits: native def fixes; workarounds for gcc 22900e6 @Cuda-Chen
- x86 implementations: allow _m128 access from SSE 114c3cd @mr-c
WASM intrinsics
- wasm x86 impl: some were incorrectly marked SSE instead of SSE2 fee149a @mr-c
x86 intrinsics
SVML
- SSE is good enough for native m128i and m128d types & functions 9982b27 @mr-c
XOP
- fix some native functions 608200b @mr-c
Arch support
arm / arm64
- arm platform: cleanup feature detection. 08c21f3 @mr-c
- arm: enable more intrinsic function for armv7 416091e @zengdage
RISCV64
- Initial Support for the RISC-V Vector Extension (RVV1.0) in ARM NEON (#1130) b4e805a @eric900115
- arm: fix some neon2rvv intrinsic function error 2a548e5 @zengdage
- arm: Add neon2rvv support in vand series intrinsics dac67f3 @howjmay
- arm: improve performance in vabd_xxx for risc-v b63ba04 @zengdage
- arm: improve performance in vhadd_xxx for risc-v a68fa90 @zengdage
Compiler Specific
Clang
- detect clang versions 18 & 19 ed4a5cd @mr-c
- arm neon clang: skip vrnd native before clang v18 e647f10 @mr-c
- apple clang arm64: ignore SHA2 be48ef8 @mr-c
Emscripten
- use `__builtin_roundeven{f,}` from version 3.1.43 onwards 4379740 @mr-c
MSVC
- x86 test msvc: really disable warning 4799,4730 487507d @mr-c
- sse2 MSVC `_mm_pause` implementation for x86 8d95f83 @mr-c
- SSE is good enough for native m128i and m128d types & functions 9982b27 @mr-c
Testing with Docker/Podman & CI
- CI: don't run twice on dependabot branches 70748cd @mr-c
Cirrus CI
- upgrade to clang-17 7ab3240 @mr-c
GitHub Actions
- test Mac arm64 0080b28 @mr-c
- macos: report log if there is a configuration failure. df3e930 @mr-c
- build(deps): bump actions/checkout from 3 to 4 (#1149) 9605608 @dependabot[bot]
- build(deps): bump codecov/codecov-action from 3 to 4 25382c1 @dependabot[bot]
- codecov: use token 2c45dd4 @mr-c
- Add gcc arm 32bit armv8-a test in CI 72bde75 @Cuda-Chen
- build for AMD Buildozer version 2 9746537 @mr-c
Packit CI
- Drop i386 (i686) support. (#1155) cf68aaf @junaruga
Semaphore CI
- stop testing on GCC 5 & 6, clang 3.9 & 4 due to forced upgrade to Ubuntu 20.04 9982f10 @mr-c
Misc
- update list of fully implemented instruction sets (#1152) b568fcd @mr-c
- typo fixes from codespell 8639fef @mr-c
- README.md - move CLMUL to partial, list more of the CI.yml architectures 285b50d @Torinde
- Update README.md - link to VPCLMULQDQ; mention MSA (#1157) 517da84 @Torinde
- Update README.md (#1156) b88a66d @mr-c
- README: two more related projects 7429dff @mr-c
New Contributors
- @eric900115 made their first contribution in simd-everywhere/simde#1130
- @Cuda-Chen made their first contribution in simd-everywhere/simde#1116
- @Torinde made their first contribution in simd-everywhere/simde#1157
- @zengdage made their first contribution in simd-everywhere/simde#1172
- @howjmay made their first contribution in simd-everywhere/simde#1174
Full Changelog: v0.8.0...v0.8.2
v0.8.0
SIMDe 0.8.0
Summary
- A complete set of implementations for all NEON intrinsics has been finished, up from 56.46% in the previous release! (@yyctw @wewe5215)
- SIMDe PRs are tested using Fedora Rawhide (@junaruga)
For the entire project: 656 files changed, 202635 insertions(+), 1724 deletions(-)
For just the `simde` folder: 295 files changed, 47053 insertions(+), 896 deletions(-)
X86
There are a total of 6876 SIMD functions on x86, 2930 (43.17%) of which have been implemented in SIMDe so far. Specifically for AVX-512, of the 5160 functions currently in AVX-512, SIMDe implements 1510 (29.26%).
Note: Intel has removed the intrinsics that were unique to Intel Xeon Phi (`ER`, `PF`, `4MAPS`, and `4VNNIW`) from their intrinsic list. SIMDe will retain those few implementations we already had, but this changes how our completeness statistics are calculated.
Newly added function families
- AES: 5 of 6 (83.33%)
Newly added AVX512 function families
- castph: 1 of 9 (11.11%) implemented.
- cvtus_storeu: 1 of 18 (5.56%) implemented.
- fpclass: 3 of 24 (12.50%) implemented.
- i32gather: 1 of 8 (12.50%) implemented.
- i64gather: 8 of 8 💯
- permutex: 3 of 12 (25.00%) implemented.
- rcp14: 1 of 24 (4.17%) implemented.
- reduce_max: 7 of 31 (22.58%) implemented.
- reduce_min: 7 of 31 (22.58%) implemented.
- shufflehi: 1 of 7 (14.29%) implemented.
- shufflelo: 1 of 7 (14.29%) implemented.
Additions to existing families
- AVX512BW: 7 additional, 337 total of 790 (42.66%)
- AVX512DQ: 5 additional, 112 total of 376 (29.79%)
- AVX512F: 48 additional, 1087 total of 2812 (38.66%)
- AVX512_FP16: 15 additional, 17 total of 1105 (1.54%)
Neon
SIMDe currently implements 6670 out of 6670 (100.00%) NEON functions; up from 56.46% in the previous release!
Newly added families
- abal
- abal_high
- abd
- abdh
- abdl_high
- addhn_high
- aes
- bfdot
- bfdot_lane
- cadd_rot
- cale
- calt
- cmla_lane
- cmla_rot_lane
- copy_lane
- cvt_high
- cvt_n
- cvta
- cvtn
- cvtp
- cvtx
- cvtx_high
- div
- dupb_lane
- duph_lane
- eor3
- fmlal
- fms
- fms_lane
- fms_n
- ld2_dup
- ld2_lane
- ld3_dup
- ld3_lane
- ld4_dup
- maxnmv
- minnmv
- mla_lane
- mla_high_lane
- mls_lane
- mlsl_high_lane
- mmla
- mull_high_lane
- mull_high_n
- mulx
- mulx_lane
- pmaxnm
- pminnm
- qdmlal
- qdmlal_high
- qdmlal_high_lane
- qdmlal_high_n
- qdmlal_lane
- qdmlal_n
- qdmlsl
- qdmlsl_high
- qdmlsl_high_lane
- qdmlsl_high_n
- qdmlsl_lane
- qdmlsl_n
- qdmlslh
- qdmlslh_lane
- qdmulhh
- qdmulhh_lane
- qdmull_high
- qdmull_high_lane
- qdmull_high_n
- qdmull_lane
- qdmull_n
- qdmullh_lane
- qmovun_high
- qrdmlah
- qrdmlah_lane
- qrdmlahh
- qrdmlahh_lane
- qrdmlsh
- qrdmlsh_lane
- qrdmlshh
- qrdmlshh_lane
- qrdmulhh_lane
- qrshl
- qrshlh
- qrshrn_high_n
- qrshrnh_n
- qrshrun_high_n
- qrshrunh_n
- qshl_n
- qshlh_n
- qshluh_n
- qshrn_high_n
- qshrnh_n
- qshrun_high_n
- qshrunh_n
- raddhn
- raddhn_high
- rax
- recp
- rnd32x
- rnd32z
- rnd64x
- rnd64z
- rnda
- rndx
- rshrn_high_n
- rsubhn
- rsubhn_high
- set_lane
- sha1
- sha1h
- sha256
- sha512
- shll_high_n
- shrn_high_n
- sli_n
- sm3
- sm4
- sqrt
- st1_x2
- st1_x3
- st1_x4
- st1q_x2
- st1q_x3
- st1q_x4
- subhn_high
- sudot_lane
- usdot
- usdot_lane
Finally complete families
- cvtn
- mla_lane
Details
- simde-f16: improve `_Float16` usage; better INFHF/NANHF defs 8910057 @mr-c
- simde_float16: prefer `__fp16` if available aba26f6 @mr-c
Implementation of Arm intrinsics
NEON
- cvtn: `vcvtnq_{s32_f32,s64_f64}`: add SSE & AVX512 optimized implementations e134cc7 @mr-c
- cvtn: `vcvtnq_u32_f32` is a V8 function 8432c70 @mr-c
- min: Remove non-working MMX specialization from `simde_vmin_s16` 6858b92 @M-HT
- shll: Extend constant range in `simde_vshll_n_XXX` intrinsics (#1064) beb1c61 @M-HT
- various: Implement some f16XN types and f16 related intrinsics. (#1071) aae2245 @yyctw
- qtbl/qtbx polyfills for A32V7 a2fef9e @easyaspi314
- arm: use `SIMDE_ARCH_ARM_FMA` 7198d6d @mr-c
- arm neon: Complex operations from Armv8.3-a (#1077) d08d67c @wewe5215
- more fp16 using intrinsics supported by architecture v7 (skip version) (#1081) 5e7c4d4 @yyctw
- `st1{,q}_*_x{2,3,4}`: initial implementation (#1082) 879d1a0 @yyctw
- part 1 of implement all intrinsics supported by architecture A64 (#1090) 2eedece @yyctw
- Add AES instructions. 23adcd2 805ccd2 @yyctw
- Modified `simde_float16` to `simde_float16_t` (#1100) 8a05dc6 @yyctw
- implement all intrinsics supported by architecture A64-remaining part (#1093) 018ba24 @yyctw
- add enable `vmlaq_laneq_f32` and `vcvtq_n_f64_u64` c7d314b @yyctw
- implement all bf16-related intrinsics (#1110) c59db7c @yyctw
- arm/neon abs: negating `INT_MIN` is undefined behavior in C/C++ c200c16 @mr-c
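The last NEON entry above concerns a classic C pitfall: `-INT_MIN` overflows a signed integer, which is undefined behavior. One common way to avoid it, shown here as a minimal sketch and not necessarily what the commit does, is to negate in unsigned arithmetic, where wraparound is well defined. The function name `abs_s32_wrapping` is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper, not SIMDe API: absolute value that wraps
 * instead of invoking undefined behavior on INT32_MIN. */
static int32_t abs_s32_wrapping(int32_t v) {
  uint32_t u = (uint32_t) v;          /* well-defined modular conversion */
  uint32_t r = (v < 0) ? (0u - u) : u; /* unsigned negation cannot overflow */
  /* Converting 0x80000000u back to int32_t is implementation-defined
   * (not undefined) before C23; mainstream compilers wrap to INT32_MIN. */
  return (int32_t) r;
}
```

With this formulation `abs_s32_wrapping(INT32_MIN)` stays `INT32_MIN`, which is the wrapping result a non-saturating vector absolute value produces for that input.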
SVE Intrinsics
- Improve performance of `simde_mm512_add_epi32` (#1126) 6cde31c @AymenQ
WASM intrinsics
- simd128: fix altivec_p7 version of `wasm_f64x2_pmin` 96d6e53 @mr-c
- simd128: add missing unsigned functions ea5e283 @mr-c
- simd128 `f{32x4,64x2}_min`: add workaround for a gcc<6 issue d5d6d10 @mr-c
- detect support for Relaxed SIMD mode 2e66dd4 @mr-c
- simd128/relaxed: begin MIPS implementations db8ad84 @mr-c
- relaxed: add `f{32x4,64x2}_relaxed_{min,max}` 9d1a34e @mr-c
- relaxed: updated names; reordered FMA operations 8cc8874 @mr-c
x86 intrinsics
- sse{,2,4.1}, avx{,2} `*_stream_{,load}`: use `__builtin_nontemporal_{load,store}` 6ce6030 @mr-c
SSE*
- sse: Fix issues related to MXCSR register (#1060) 653aba8 @M-HT
- sse: implement `_mm_movelh_ps` for Arm64 514564e @mr-c
- sse `_mm_movemask_ps`: remove unused code fba97e4 @mr-c
- sse2 mm_pause: more archs, add a basic test 692a2e8 @mr-
- sse4.1: use logical OR instead of bitwise OR in neon impl of `_mm_testnzc_si128` edd4678 @mr-c
- sse4.1 `_mm_testz_si128`: fix backwards short circuit logic f132275 @mr-c
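For readers unfamiliar with the two SSE4.1 test intrinsics touched by the last two fixes, the scalar model below sketches what they compute: `_mm_testz_si128(a, b)` returns 1 when `a & b` is all zero, and `_mm_testnzc_si128(a, b)` returns 1 only when both `a & b` and `~a & b` are non-zero. This is a reference model for the semantics, not SIMDe's implementation; the `vec128` type and the `ref_*` and `make_vec128` names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit vector: two 64-bit lanes. */
typedef struct { uint64_t lane[2]; } vec128;

static vec128 make_vec128(uint64_t lo, uint64_t hi) {
  vec128 v;
  v.lane[0] = lo;
  v.lane[1] = hi;
  return v;
}

/* Model of _mm_testz_si128: 1 if (a AND b) has no bits set. */
static int ref_testz(vec128 a, vec128 b) {
  return ((a.lane[0] & b.lane[0]) | (a.lane[1] & b.lane[1])) == 0;
}

/* Model of _mm_testnzc_si128: 1 if (a AND b) and (NOT a AND b)
 * are both non-zero, i.e. neither ZF nor CF would be set. */
static int ref_testnzc(vec128 a, vec128 b) {
  int zf = ((a.lane[0] & b.lane[0]) | (a.lane[1] & b.lane[1])) == 0;
  int cf = ((~a.lane[0] & b.lane[0]) | (~a.lane[1] & b.lane[1])) == 0;
  return !zf && !cf;
}
```

Both lanes are combined with a bitwise OR before the comparison with zero, so a set bit in either half is enough to change the result.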
AVX
- run test from #926 ce9708c @mr-c
- `simde_mm256_shuffle_pd` fix for natural vector size < 128 1594d7c @mr-c
AVX2
- correction of `simde_mm256_sign_epi{8,16,32}` (#1123) c376610 @Proudsalsa
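As background for the sign fix above, the per-element rule of the `sign` family is: keep `a` when `b` is positive, produce zero when `b` is zero, and negate `a` when `b` is negative. A minimal scalar sketch for the 8-bit case follows; it only illustrates the semantics and is not the corrected SIMDe code. The name `ref_sign_epi8` is hypothetical.

```c
#include <stdint.h>

/* Hypothetical scalar model of one 8-bit element of the sign operation. */
static int8_t ref_sign_epi8(int8_t a, int8_t b) {
  if (b == 0) return 0;  /* zero selector clears the element */
  if (b > 0) return a;   /* positive selector passes a through */
  /* Negative selector: negate in unsigned arithmetic so that -128
   * wraps to -128 instead of overflowing. The final conversion to
   * int8_t is implementation-defined before C23; mainstream
   * compilers wrap. */
  return (int8_t) (uint8_t) (0u - (uint8_t) a);
}
```

The 16-bit and 32-bit variants follow the same three-way rule with wider element types.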
AVX512
- fpclass: naive implementation 353bf5f @mr-c
- loadu: fix native detection 305f434 @mr-c
- set: add `simde_x_mm512_set_m256{,d}` 67e0c50 @mr-c
- gather: add MSVC native fallbacks 7b7e3f6 @mr-c
- AVX512FP16 / m512h initial support e97691c @mr-c
- fix many native aliases 75014b9 @mr-c
CLMUL
- fix natives, some require VPCLMULQDQ f819c52 @mr-c
SVML
- enable SIMDE_X86_SVML_NATIVE for MSVC 2019+ 593af95 @mr-c
AES
- aes: initial implementation of most aes instructions (#1072) 8632391 @Vineg
MIPS MSA intrinsics
- msa neon impl: `float64x2_t` is not avail in A32V7 ae4c4ab @mr-c
Arch support
x86(-64)
- fix `SIMDE_ARCH_X86_SSE4_2` define 5e4b308 @cbielow
arm64
- x86 aes: add neon implementation using the crypto extension fb3554f @mr-
Altivec
- neon/st1: disable last remaining AltiVec implementation 0521245 @mr-c
Power
- sse2,wasm simd128: skip `SIMDE_CONVERT_VECTOR_` implementations on PowerPC 4de999a @mr-c
- wasm simd128: more powerpc fixes 7cb5691 @mr-c
Compiler Specific
GCC
- GCC AVX512F: `SIMDE_BUG_GCC_95399` was fixed in GCC 9.5, 10.4, 11.4, 12+ 3fa89c5 @mr-c
- GCC x86/x64: `SIMDE_BUG_GCC_98521` was fixed in 10.3 edde42e @mr-c
- GCC x86: `SIMDE_BUG_GCC_94482` was fixed in 8.5, 9.4, 10+ 43d86a3 @mr-c
- Add workaround for GCC bug 111609 fdafd8e @M-HT
- arm neon ld2: silence warnings at -O3 on gcc risc-v 8f56628 @mr-c
- avx512 abs: refine GCC compiler checks for `_mm512{,_mask}_abs_pd` (#1118) 5405bbd @thomas-schlichter
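Several entries above narrow a compiler-bug workaround to the releases that still need it. Because GCC backports fixes to each maintained series, the check is per series and not a single "older than X" comparison. The sketch below expresses the fix points listed for bug 95399 (9.5, 10.4, 11.4, 12+) as a plain function so the logic is easy to follow. In real headers this would be a preprocessor test on `__GNUC__` and `__GNUC_MINOR__`; the function name is hypothetical, and treating every series older than 9 as affected is an assumption.

```c
/* Hypothetical helper: returns 1 if the given GCC version would still
 * need the workaround for bug 95399, using the fix points quoted in
 * the release notes (9.5, 10.4, 11.4, 12+). */
static int gcc_95399_unfixed(int major, int minor) {
  switch (major) {
    case 9:  return minor < 5;  /* fixed in 9.5 */
    case 10: return minor < 4;  /* fixed in 10.4 */
    case 11: return minor < 4;  /* fixed in 11.4 */
    default:
      /* Assumption: all series before 9 are affected; 12 and later are fixed. */
      return major < 9;
  }
}
```

For example, `gcc_95399_unfixed(10, 3)` reports that the workaround is still needed, while `gcc_95399_unfixed(9, 5)` does not, even though 9.5 is the lower version number.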
Clang
- clang powerpc: `vec_bperm` bug was fixed in clang-14 6feb28a @mr-c
- clmul: aarch64 clang has difficulties with poly64x1_t 1e1bd76 @mr-c
- aarch64: optimization bug 45541 was fixed in clang-15 7ca5712 @mr-c
- A32V7: Don't trust clang for load multiple on A32V7 927f141 @easyaspi314
- wasm: `SIMDE_BUG_CLANG_60655` is fixed in the upcoming 17.0 release 25cebbe @mr-c
- `simde-detect-clang.h`: add clang 17 detection 923f8ac 684baa1 50d98c1 @Coeur
ClangCL
- fp16: don't use `_Float16` on ClangCL if not supported 8a6b8c5 @mr-c
- svml: don't...
v0.8.0-rc1
See draft release notes at https://github.com/simd-everywhere/simde/wiki/Release-Notes
v0.7.6
Summary
See, I knew we should release more often!
Details
Implementation of Arm intrinsics
NEON
neon/abd,ext,cmla{,_rot{180,270,90}}: additional wasm128 implementations 3a18dff @mr-c
neon/cvtn: basic implementation of a few functions fefc785 @mr-c
neon/mla_lane: initial implementation using mla+dup 554ab18 @ngzhian
neon/shl,rshl: fix avx include to unbreak amalgamated headers 3748a9f @mr-c
neon/shll_n: make vshll_n_u32 test operational 356db0c @mr-c
neon/qabs: restore SSE2 impl for vqabsq_s8 f614843 @mr-c
x86 intrinsics
mmx: loongson impl promotions over SIMDE_SHUFFLE_VECTOR_ 51bf6f2 @mr-c
x86/sse*,avx: add additional SIMD128 implementations e28a87e @mr-c
SSE*
sse{,2,3,4.1},avx: more WASM shuffle implementations 097dd12 @mr-c
sse*,avx: add additional SIMD128 implementations e28a87e @mr-c
sse: allow native _mm_loadh_pi on MSVC x64 314452b @mr-c
AVX512
avx512: typo fix for typedef of __mmask64 e8390a3 4a9f01a @mr-c
avx512/madd: fix native alias arguments for _mm512_madd_epi16 bcf4adb @mr-c
Arch support
simde-arch: #include Hedley for setting F16C for MSVC 2022+ with AVX2 f9cf467 @mr-c
Misc
meson install: arm/neon/ld1 & x86/avx512.h 27836b1 @mr-c
Update clang version detection for 14..16 and add link 4957a9e @jan-wassenberg
v0.7.4
SIMDe 0.7.4
Summary
- Minimum meson version is now 0.54
- 40 new NEON families implemented, SVE API implementation started (14 families)
- Initial support for x86 F16C API
- Initial support for MIPS MSA API
- Initial support for Arm Scalable Vector Extensions (SVE) API
- Initial support for WASM SIMD128 API
- Initial support for the E2K (Elbrus) architecture
- MSVC has many fixes, now compiled in CI using `/ARCH:AVX`, `/ARCH:AVX2`, and `/ARCH:AVX512`
X86
There are a total of 7470 SIMD functions on x86, 2971 (39.77%) of which have been implemented in SIMDe so far.
Specifically for AVX-512, of the 5270 functions currently in AVX-512, SIMDe implements 1439 (27.31%)
Newly added function families
- AVX512CD: 21 of 42 (50.00%)
- AVX512VPOPCNTDQ: 18 of 18 💯
- AVX512_4VNNIW: 6 of 6 (100.00%)
- AVX512_BF16: 9 of 38 (23.68%)
- AVX512_BITALG: 24 of 24 💯
- AVX512_FP16: 2 of 1105 (0.18%)
- AVX512_VBMI2: 3 of 150 (2.00%)
- AVX512_VNNI: 36 of 36 💯
- AVX_VNNI: 8 of 16 (50.00%)
Additions to existing families
- AVX512F: 579 additional, 856 total of 2660 (31.80%)
- AVX512BW: 178 additional, 335 total of 828 (40.46%)
- AVX512DQ: 77 additional, 111 total of 399 (27.82%)
- AVX512_VBMI: 9 additional, 30 total of 30 💯
- KNCNI: 113 additional, 114 total of 595 (19.16%)
- VPCLMULQDQ: 1 additional, 2 total of 2 💯
Neon
SIMDe currently implements 3745 out of 6670 (56.15%) NEON functions. If you don't count 16-bit floats and poly types, it's 3745 / 4969 (75.37%).
Newly added families
- addhn
- bcax
- cage
- cmla
- cmla_rot90
- cmla_rot180
- cmla_rot270
- fma
- fma_lane
- fma_n
- ld2
- ld4_lane
- mlal_high_n
- mlal_lane
- mls_n
- mlsl_high_n
- mlsl_lane
- mull_lane
- qdmulh_lane
- qdmulh_n
- qrdmulh_lane
- qrshrn_n
- qrshrun_n
- qshlu_n
- qshrn_n
- qshrun_n
- recpe
- recps
- rshrn_n
- rsqrte
- rsqrts
- shll_n
- shrn_n
- sqadd
- sri_n
- st2
- st2_lane
- st3_lane
- st4_lane
- subhn
- subl_high
- xar
MSA
Overall, SIMDe implements 40 of 533 (7.50%) functions from MSA.
Details
Implementation of Arm intrinsics
NEON
- aarch64 + clang-1[345] fix for "implicit conversion changes signedness" a22c3cc @mr-c
- neon: Implement f16 types 21496f6 @Glitch18
- neon: port additional code to new style 1c744fd @nemequ
- neon: replace some more abs/labs/llabs usage with simde_math_* versions c59853a @nemequ
- neon: refactor to use different types on all targets c17957a @nemequ
- neon: test for MMX/SSE instead of x86 when choosing implementation 0366dab @nemequ
- neon/abd: add much better implementations c3ddbbe @nemequ 220db33 @ngzhian
- neon/abs: add SSE2 integer abs implementations 6396dc8 @aqrit
- neon/addhn: initial implementation e9ee066 @nemequ
- neon/add: Implement f16 functions e69239c @Glitch18
- neon/add{l,}v: SSE2/SSSE3 opts `_vadd{lvq_s8, lvq_s16, lvq_u8, vq_u8}` 8b4e375 dfffdde @mr-c
- neon/{add,sub}w_high: use vmovl_high instead of vmovl + get_high b897331 @nemequ
- neon/bcax: initial implementation 96ce481 0ed3dea @Glitch18
- neon/bsl: Implement f16 functions edb75b5 @Glitch18
- neon/cage: Initial f16 implementations 20df81d @Glitch18
- neon/cagt: Implement f16 functions 452a6d3 @Glitch18
- neon/ceq: Implement f16 functions f24ab3d @Glitch18
- neon/ceqz: Implement f16 functions dd2ebf2 de301cd @Glitch18
- neon/cge: Implement f16 functions a512986 f3ad0d4 647dc12 @Glitch18
- neon/cgez: complete implementation of CGEZ family 6d86a20 @Glitch18
- neon/cgt: Add implementation of remaining functions 9930c43 @Glitch18
- neon/cgt, simd128: improve some unsigned comparisons on x86 ae6702a @nemequ
- neon/cgtz: Add implementations of remaining functions 4d749b5 @Glitch18
- neon/cle: add some x86 implementations 5906cc9 d81c7e7 @nemequ 7894c7d @Glitch18
- neon/clez: Add implementations of scalar functions bc72880 @Glitch18
- neon/clt: Add implementations of scalar functions & SSE/AVX512 fallbacks bc636e1 6a19637 @Glitch18
- neon/cltz: Add scalar functions and natural vector fallbacks 2960ef0 @Glitch18
- neon/cmla, neon/cmla_rot{90,180,270}: check compiler versions e98152f @nemequ
- neon/cmla, neon/cmla_rot{90,180,270}: CMLA requires armv8.3+ 280faae @nemequ
- neon/cmla, neon/cmla_rot{90,180,270}, neon/fma: initial implementation 2aff4f9 @Glitch18
- neon/cnt: add x86 implementations of vcntq_s8 a558d6d @nemequ
- neon/cvt: add `__builtin_convertvector` implementations d06ea5b @nemequ
- neon/cvt: add out-of-range and NaN tests 7d0e2ac @nemequ
- neon/cvt: add some faster x86 float->int/uint conversions ceaaf13 @nemequ
- neon/cvt: Add vcvt_f32_f64 and vcvt_f64_f32 implementations 8398f73 @Glitch18
- neon/cvt: cast result of float/double comparison dc215cd @ngzhian
- neon/cvt: disable some code on 32-bit x86 which uses `_mm_cvttsd_si64` 48edfa9 @nemequ
- neon/cvt: don't use vec_ctsl on POWER 8f9582a @nemequ
- neon/cvt: fix a couple of s390x implementations' NaN handling a8bd33d @nemequ
- neon/cvt: fix compilation with -ffast-math d1d070d @nemequ
- neon/cvt: Implement f16 functions b6a9882 @Glitch18
- neon/cvt, relaxed-simd: add work-around for GCC bug #101614 11aa006 @nemequ
- neon/cvt, simd128: fix compiler errors on PPC 965e68e @nemequ
- neon/cvt: clang bug 46844 was fixed in clang 12.0 71e03a6 @mr-c
- neon/dot_lane: add remaining implementation 3f1c1fa 4a9ca8a @Glitch18
- neon/dup_lane: Complete implementation of function family 12fb731 df320d1 @Glitch18 014ee00 9461557 @nemequ
- neon/dup_lane: use dup_n 2b4a009 @ngzhian
- neon/dup_n: Implement f16 functions 14fdf88 @Glitch18
- neon/dup_n: replace remaining functions with dup_n implementations 27a13b0 @nemequ
- neon/dupq_lane: native and portable 893db57 @ngzhian
- neon/ext: add `__builtin_shufflevector` implementation de8fe89 @ngzhian
- neon/ext: add `_mm_alignr_{,e}pi8` implementations 6d28f04 @nemequ
- neon/ext: clean up shuffle-based implementation f1de709 @nemequ
- neon/ext: simde_*{to,from}_m64 reqs MMX_NATIVE 13ee902 @mr-c
- neon/ext: unroll SIMDE_CONSTIFY for testing macro implemented functions 62834fa @mr-c
- neon/fma: add a couple x86 and PPC implementations 7a2860b @nemequ
- neon/fma: add more extensive feature checking e541dd1 @nemequ
- neon/fma_lane: Implement fmaq_lane functions a77e6ad 555ef3e @Glitch18
- neon/fma_n: initial implementation 06d5a62 @nemequ dab4342 @nemequ
- neon/get_high: add `__builtin_shufflevector` optimizations 4003afa @ngzhian
- neon/get_low: use `__builtin_shufflevector` if available ea3f75e @ngzhian
- neon/hadd,hsub: optimization for Wasm ebe09d8 @ngzhian
- neon/ld1: add Wasm SIMD implementation a79bc15 @ngzhian
- neon/ld1_dup: native and portable (64-bit vectors), f64 debb3c8 @ngzhian 6c71aac @Glitch18
- neon/ld1_dup: split from ld1, dup_n fallbacks, WASM implementations 4c586e0 @nemequ
- neon/ld1: Implement f16 functions 6e89a9c f26f775 @Glitch18
- neon/ld1_lane: Implement remaining functions de2de8d @Glitch18 9051a51 @ngzhian
- neon/ld1q: u8_x2, u8_x3, u8_x4 341006c @ngzhian
- neon/ld1[q]_*_x2: initial implementation cd14634 @dgazzoni
- neon/ld{2,3,4}: disable -Wmaybe-uninitialized on all recent GCC e142a59 @nemequ
- neon/ld{2,3,4}: silence false positive diagnostic on GCC 7 3f737a3 @nemequ
- neon/ld2: Implement remaining functions e68f728 @Glitch18 3b3014f @ngzhian 078bb00 @nemequ 041b1bd @mr-c
- neon/ld4_lane: native and portable implementations a973cab @ngzhian 179fb79 @Glitch18 0d1ab79 @nemequ
- neon/ld4: use conformant array parameters 723a8a8 @nemequ
- neon/ld4: work around spurious warning on clang < 10 64e9db0 @nemequ
- neon/min: add SSE2 vminq_u32 & vqsubq_u32 implementation 2cf165e 117de35 @nemequ
- neon/{min,max}nm: add some headers for -ffast-math ebe5c7d @nemequ
- neon/{min,max}nm: use simde_math_* prefixed min/max functions c1607d2 @nemequ
- neon/mlal_high_n: initial implementation d6f75fa @dgazzoni
- neon/mlal_lane: initial implementation 82e36ed 2168ca0 @nemequ
- neon/mls: add `_mm_fnmadd_*` implementations of vmls*_f* 70e0c20 @nemequ
- neon/mlsl_high_n: initial implementation ca1a4c3 @dgazzoni
- neon/mlsl_lane: initial implementation de78ae9 @nemequ
- neon/mls_n: initial implementation 042c6eb @nemequ
- neon/movl: improve WASM...
v0.7.2
Summary
Post v0.7.0 fixes; more portable implementations of neon intrinsics
Details
- common: fix SIMDE_FLOAT64_C macro when SIMDE_FLOAT64_TYPE is defined 1d28a5d @rosbif
- complex: split complex math out into separate header 0678336 @nemequ
- diagnostic: silence a few -Weverything diagnostics on clang < 5 6f8d285 @nemequ
Implementation of NEON intrinsics:
- neon/ceq: implement vceq{s_f32,d_f64} f4f42dc @nemequ
- neon/abd: trivial formatting fix 0b8c8ca @nemequ
- neon/abd: add missing scalar functions 517a613 @nemequ
- neon/abs: add vabsd_s64 4091e3e @nemequ
- neon/abs: vabsd_s64 wasn't added to GCC until 9.1.0 52051cb @nemequ
- neon/add: implement vaddd_s64 and vaddd_u64 03d4d1b @nemequ
- neon/cagt: implement vcagt{s_f32,d_f64} 731cf71 @nemequ
- neon/c{ge,gt,le,lt}: some improved 64-bit comparisons 97f4dfb @nemequ
- neon/ext: work around bug in GCC prior to 9.0 0c29a5f @nemequ
- neon/padd: vpadd_f32 was buggy in older clang versions 623cbf7 @nemequ
- neon/rnd: add NaN and ties to test suite fa950a2 @nemequ
- neon/rndm: initial implementation 5bf93ad @nemequ
- neon/rndn: initial implementation 2c624b5 @nemequ
- neon/rndp: initial implementation 7f1f499 @nemequ
- neon/uqadd: clang prior to 9 used incorrect types for the scalar funcs fa0eca0 @nemequ
- neon/uzp1,neon/uzp2: change some dependencies from SSE to SSE2 c00a0e5 @rosbif
x86 intrinsics
SSE*
- sse: fix overflow handling for simde_mm_cvt_ss2si a4658d8 @mr-c
- sse: add SIMDE_MM_{GET,SET}_FLUSH_ZERO_MODE 340bf13 @nemequ
- sse, sse2: add range checks to several conversion functions c3d7abf @nemequ
- sse2: update test for simde_mm_set1_epi32 8854ede @nemequ
- sse2: fix armv7 NEON implementation for simde_mm_shufflehi_epi16 338dac0 @nemequ
- sse2: change some dependencies from SSE to SSE2 c00a0e5 @rosbif
- sse2: fix potentially unused variable in loadu functions f43bfed @nemequ
- sse2: use void* for destinations of loadu functions 98c63ae @nemequ
- sse4.1: check for SHUFFLE_VECTOR before using it in _mm_cvtepu32_epi64 cb73aec @nemequ
- sse4.2: some improved 64-bit comparisons 97f4dfb @nemequ
AVX
- avx: use void* for destinations of loadu functions 98c63ae @nemequ
AVX512
- permutex2var: fix some signed/unsigned mismatch warnings 951caa1 @nemequ
- avx512/s{r,l}li: the imm8 parameters should be unsigned ecc388d @nemequ
XOP
- xop: initial implementation 6cc0cef @nemequ
- xop: add a bunch of NEON implementations b602fbc @nemequ
- xop: fix NEON implementation of simde_mm_maccsd_epi16 8d499b5 @nemequ
Testing with Docker/Podman & CI
- docker: add gdb and valgrind to installed packages 4500040 @nemequ
- ci: move icc build from Travis to GitHub Actions 712f01a @nemequ
- gh-actions: run on pull requests 43e7053 @mr-c
- drone: re-organize drone builds 73fe36a @nemequ
- drone: adjust branch triggers 9eba966 @nemequ
- README: update CI information ca440ae @nemequ
- circleci: add Circle CI 5d5350c @nemequ
- circleci: actually build in 32-bit mode 4267926 @nemequ
- cirrus: add Cirrus CI support 0212a07 @nemequ
- cirrus: run asan/ubsan instead of just another GCC build a1c9f1d @nemequ
- docker: allow for an optional persistent build directory 610fa3d @nemequ
- gh-actions, semaphore: move GCC and clang builds to Semaphore 49d0d82 @nemequ
- ci: disable ci/* builds for various providers 28f8775 @nemequ
- travis: disable all builds 687851b @nemequ
Misc
SIMDe 0.7.0
Version 0.7.0 Summary
- Portable implementation of the NEON intrinsics: 57% finished
- Some more WASM implementations of x86 intrinsics
- Various SSE*, AVX*, and SVML enhancements
- Various new and improved implementations for AltiVec, Neon, POWER architectures.
- The "new" SSE2 `_mm_{load,store}u_si{16,32,64}` intrinsics are now implemented along with the SSE `_MM_HINT_*` defines.
- All of the CLMUL intrinsics have been implemented. "CLMUL_instruction_set" Wikipedia; CLMUL @ Intel Intrinsics Guide.
Please see the 0.7-rc-1 and 0.7.0-rc2 release notes for more details.
Changes since 0.7.0-rc2
Implementation of NEON intrinsics:
neon/orn: add AVX-512VL (ternarylogic) implementations d667aa8 @nemequ
neon/ld3, neon/ld4: disable -Wmaybe-uninitialized on GCC eaaa71f @nemequ
x86 intrinsics
SSE*
sse: cast MM_HINT* values to enum _mm_hint on GCC 3f7e6f7 @nemequ
AVX512
avx512/permutex2var: add remaining intrinsics and translations 5d8d9d2
Misc
math: add modf 580e401 @nemequ
Cleanups of SIMDE_BUG_* definitions e090746 @mr-c
SIMDe v0.6.0
379 commits from 9 contributors, changing 273 files!
0.5.0
I’m pleased to announce the availability of the first release of SIMD
Everywhere (SIMDe), version 0.5.0, representing more than three years
of work by over a dozen developers.
SIMDe is a permissively-licensed (MIT) header-only library which
provides fast, portable implementations of SIMD intrinsics for
platforms which aren’t natively supported by the API in question.
For example, with SIMDe you can use SSE on ARM, POWER, WebAssembly, or
almost any platform with a C compiler. That includes, of course, x86
CPUs which don't support the ISA extension in question (e.g., calling
AVX-512F functions on a CPU which doesn't natively support them).
If the target natively supports the SIMD extension in question there
is no performance penalty for using SIMDe. Otherwise, accelerated
implementations, such as NEON on ARM, AltiVec on POWER, WASM SIMD on
WebAssembly, etc., are used when available to provide good
performance.
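The selection order described above (native instructions first, then another accelerated instruction set, then plain C) can be pictured with a small compile-time dispatch. The sketch below is a simplified illustration of that pattern for a four-float addition and is not SIMDe's actual code; the `my_f32x4` type and `my_add_f32x4` function are hypothetical. It uses only the standard `_mm_*` SSE intrinsics, the standard NEON intrinsics, or a scalar loop, depending on what the compiler targets.

```c
#if defined(__SSE__)
  #include <xmmintrin.h>  /* native SSE available */
#elif defined(__ARM_NEON)
  #include <arm_neon.h>   /* no SSE, but NEON can accelerate the same operation */
#endif

/* Hypothetical portable 4 x float vector type. */
typedef struct { float f[4]; } my_f32x4;

/* Hypothetical portable add: same result on every target,
 * different instructions underneath. */
static my_f32x4 my_add_f32x4(my_f32x4 a, my_f32x4 b) {
  my_f32x4 r;
#if defined(__SSE__)
  /* Native path: maps directly onto the SSE instruction. */
  _mm_storeu_ps(r.f, _mm_add_ps(_mm_loadu_ps(a.f), _mm_loadu_ps(b.f)));
#elif defined(__ARM_NEON)
  /* Accelerated path: implement the SSE operation with NEON. */
  vst1q_f32(r.f, vaddq_f32(vld1q_f32(a.f), vld1q_f32(b.f)));
#else
  /* Portable fallback: plain C that any compiler can build. */
  for (int i = 0; i < 4; i++) {
    r.f[i] = a.f[i] + b.f[i];
  }
#endif
  return r;
}
```

With SIMDe itself none of this is written by hand: code includes a header such as `simde/x86/sse.h` and calls the prefixed function (for example `simde_mm_add_ps`), and the library picks the path.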
SIMDe has already been used to port several packages to additional
architectures through either upstream support or distribution
packages, particularly on Debian.
If you'd like to play with SIMDe online, you can do so on Compiler
Explorer.
What is in 0.5.0
The 0.5.0 release is SIMDe’s first release. It includes complete
implementations of:
- MMX
- SSE
- SSE2
- SSE3
- SSSE3
- SSE4.1
- AVX
- FMA
- GFNI
We also have rapidly progressing implementations of many other
extensions including NEON, AVX2, SVML, and several AVX-512 extensions
(AVX-512F, AVX-512BW, AVX-512VL, etc.).
Additionally, we have an extensive test suite to verify our
implementations.
What is coming next
Work on SIMDe is proceeding rapidly, but there are a lot of functions
to implement… x86 alone has about 6,000 SIMD functions, and we’ve
implemented about 2,000 of them. We will keep adding more functions
and improving the implementations we already have.
Our NEON implementation is being worked on very actively right now
by Sean Maher and Christopher Moore, and is expected to continue
progressing rapidly.
We currently have two Google Summer of Code students working on the
project as well; Hidayat Khan is working on finishing up AVX2, and
Himanshi Mathur is focused on SVML.
If you're interested in using SIMDe but need some specific functions
to be implemented first, please file an issue and we may be able to
prioritize those functions.
Getting Involved
If you're interested in helping out please get in touch. We have a
chat room on Gitter which is fairly active if you have questions, or
of course you can just dive right in on the issue tracker.