From cd13223e7760d55601f440f4605db33bafd1a6d2 Mon Sep 17 00:00:00 2001
From: Tamar Christina
Date: Fri, 27 Feb 2026 07:41:20 +0000
Subject: [PATCH] AArch64: Don't enable ptest elimination for partial vectors
 [PR124162]

The following loop

char b = 41;
int main() {
  signed char a[31];
#pragma GCC novector
  for (int c = 0; c < 31; ++c)
    a[c] = c * c + c % 5;
  {
    signed char *d = a;
#pragma GCC novector
    for (int c = 0; c < 31; ++c, b += -16)
      d[c] += b;
  }
  for (int c = 0; c < 31; ++c) {
    signed char e = c * c + c % 5 + 41 + c * -16;
    if (a[c] != e)
      __builtin_abort();
  }
}

compiled with -O2 -ftree-vectorize -msve-vector-bits=256
-march=armv8.2-a+sve generates

	ptrue	p6.b, vl32
	add	x2, x2, :lo12:.LC0
	add	w5, w5, 16
	ld1rw	z25.s, p6/z, [x2]
	strb	w5, [x6, #:lo12:.LANCHOR0]
	mov	w0, 0
	mov	p7.b, p6.b
	mov	w2, 31
	index	z30.s, #0, #1
	mov	z26.s, #5
	mov	z27.b, #41
.L6:
	mov	z29.d, z30.d
	movprfx	z28, z30
	add	z28.b, z28.b, #240
	mad	z29.b, p6/m, z28.b, z27.b
	mov	w3, w0
	movprfx	z31, z30
	smulh	z31.s, p6/m, z31.s, z25.s
	add	w0, w0, 8
	asr	z31.s, z31.s, #1
	msb	z31.s, p6/m, z26.s, z30.s
	add	z31.b, z31.b, z29.b
	ld1b	z29.s, p7/z, [x1]
	cmpne	p7.b, p7/z, z31.b, z29.b
	b.any	.L15
	add	x1, x1, 8
	add	z30.s, z30.s, #8
	whilelo	p7.s, w0, w2
	b.any	.L6

which uses a predicate for the first iteration in which all bits are 1,
i.e. all lanes active.  This causes the result of the cmpne to set the
wrong CC flags.  The second iteration uses

	whilelo	p7.s, w0, w2

which gives the correct mask layout going forward.

This is due to the CSE'ing code that tries to share predicates as much
as possible.  During predicate generation in
aarch64_expand_mov_immediate we do

  /* Only the low bit of each .H, .S and .D element is defined,
     so we can set the upper bits to whatever we like.  If the
     predicate is all-true in MODE, prefer to set all the undefined
     bits as well, so that we can share a single .B predicate
     for all modes.
     */
  if (imm == CONSTM1_RTX (mode))
    imm = CONSTM1_RTX (VNx16BImode);

which essentially maps all predicates to .b unless the predicate is
created outside the immediate expansion code.

It creates the sparse predicate for the VNx4QI data lanes from a VNx16QI
one and then has a "conversion" operation.  The conversion operation
results in a simple copy:

	mov	p7.b, p6.b

because in the data model for partial vectors the upper lanes are
*don't care*.  So computations using this vector are fine.  However for
comparisons, or any operation setting flags, the predicate value does
matter, otherwise we get the wrong flags as above.  Additionally we
don't have a way to distinguish based on the predicate alone whether
the operation is partial or not, i.e. we have no "partial predicate"
modes.

There are two ways to solve this:

1. Restore the ptest for partial vector compares.  This would slow down
   the loop though and introduce a second ptrue .s, VL8 predicate.

2. Disable the sharing of partial vector predicates.  This allows us to
   remove the ptest.

Since the ptest would introduce a second predicate here anyway I was
leaning towards disabling sharing between partial and full predicates.
For the patch, however, I ended up going with 1.  The reason is that
unsharing the predicate ends up pessimizing loops that operate on full
vectors only (which are the majority of cases).  I also don't fully
understand all the places we depend on this sharing (and about 3600
ACLE tests fail assembler scans).

I suspect one way to deal with this is to perform the sharing on GIMPLE
using the new isel hook and make RTL constant expansion not manually
force it, since in GIMPLE it's easy to follow compares from a back-edge
to figure out whether it's safe to share the predicate.

I also tried changing it so that during cond_vec_cbranch_any we perform
an AND with the proper partial predicate.  But unfortunately folding
doesn't realize that the AND on the latch edge is useless (e.g.
whilelo p7.s, w0, w2 makes it a no-op) and that the AND should be moved
outside the loop.  This is something that, again, isel should be able
to do.

I also tried changing the mov p7.b, p6.b into an AND.  While this
worked, folding didn't quite get that the AND can be eliminated, and it
also pessimizes actual register copies.

So for now I just undo ptest elimination for partial vectors for GCC 16
and will revisit it for GCC 17 when we extend ptest elimination.

gcc/ChangeLog:

	PR target/124162
	* config/aarch64/aarch64-sve.md (cond_vec_cbranch_any,
	cond_vec_cbranch_all): Drop partial vectors support.

gcc/testsuite/ChangeLog:

	PR target/124162
	* gcc.target/aarch64/sve/vect-early-break-cbranch_16.c: New test.
---
 gcc/config/aarch64/aarch64-sve.md             |  6 ++---
 .../aarch64/sve/vect-early-break-cbranch_16.c | 22 +++++++++++++++++++
 2 files changed, 25 insertions(+), 3 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/vect-early-break-cbranch_16.c

diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
index 97fd9516959..019630eb8d2 100644
--- a/gcc/config/aarch64/aarch64-sve.md
+++ b/gcc/config/aarch64/aarch64-sve.md
@@ -9877,12 +9877,12 @@
 ;; instead and about the ptest.
 (define_expand ""
   [(set (pc)
-	(unspec:SVE_I
+	(unspec:SVE_FULL_I
	  [(if_then_else
	     (match_operator 0 "aarch64_comparison_operator"
	       [(match_operand: 1 "register_operand")
-		(match_operand:SVE_I 2 "register_operand")
-		(match_operand:SVE_I 3 "aarch64_simd_reg_or_zero")])
+		(match_operand:SVE_FULL_I 2 "register_operand")
+		(match_operand:SVE_FULL_I 3 "aarch64_simd_reg_or_zero")])
	     (label_ref (match_operand 4 ""))
	     (pc))]
	  COND_CBRANCH_CMP))]
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/vect-early-break-cbranch_16.c b/gcc/testsuite/gcc.target/aarch64/sve/vect-early-break-cbranch_16.c
new file mode 100644
index 00000000000..9bb251f5f47
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/vect-early-break-cbranch_16.c
@@ -0,0 +1,22 @@
+/* { dg-do run { target aarch64_sve_hw } } */
+/* { dg-options "-Ofast --param aarch64-autovec-preference=sve-only" } */
+/* { dg-require-effective-target lp64 } */
+
+char b = 41;
+int main() {
+  signed char a[31];
+#pragma GCC novector
+  for (int c = 0; c < 31; ++c)
+    a[c] = c * c + c % 5;
+  {
+    signed char *d = a;
+#pragma GCC novector
+    for (int c = 0; c < 31; ++c, b += -16)
+      d[c] += b;
+  }
+  for (int c = 0; c < 31; ++c) {
+    signed char e = c * c + c % 5 + 41 + c * -16;
+    if (a[c] != e)
+      __builtin_abort();
+  }
+}