...you know, while I personally think that the RISC approach was an honest mistake, stuff like this makes me see why some people wanted to get rid of complex instructions.
Well, supposedly RISC-V implementations will have none of this malarkey while still rivaling x64/ARM64 in processing speed at comparable technology/clock rates/prices, just with plain old loads-and-xors-and-stores?
Complicated vector instructions like these are not really antithetical to RISC.
The core of modern RISC thought is basically: "The laws of physics mean that no matter how much hardware you throw at it, only some kinds of instructions can be implemented in a performant way. We should only include those kinds of instructions in the instruction set." You then build more complex operations out of these simple building blocks, but because every instruction provided can reasonably be implemented to run really fast, the CPU itself can be fast.
Masked vector adds belong in the set of instructions that can be implemented to be fast, and that's why they are included in the RVV RISC-V extension. An example of an instruction that cannot be implemented to be fast would be the humble x86 load+add, where you first look up a value in memory, and then add it to a register. The only reasonable way to implement this to be fast is to just split it into two separate operations which are also dispatched separately, and that is precisely what modern x86 does.
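For concreteness, here's a plain-C model of what a masked vector add computes per lane. This is a sketch of the semantics only, not real RVV code; the function name and the byte-per-lane mask representation are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of a masked vector add: for each lane whose mask bit is set,
   dst = a + b; inactive lanes keep their old destination value
   (RVV's "undisturbed" masking policy). Because each lane is
   independent, hardware can do all of them in parallel - which is
   exactly why this qualifies as a "can be made fast" instruction. */
static void masked_vadd(int32_t *dst, const int32_t *a, const int32_t *b,
                        const uint8_t *mask, size_t vl) {
    for (size_t i = 0; i < vl; i++)
        if (mask[i])
            dst[i] = a[i] + b[i];
}
```
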
RISC-V has a packed-SIMD extension (the 'P' extension) as well. But even without SIMD, prefetching and instruction selection/scheduling have a big impact on performance, so it is unlikely one can just write a few lines of assembly and get to a similar level of performance.
I don't think RISC-V's SIMD extension is very popular. At least I can't think of any available core implementing it. The vector extension is much more common.
RVV has: masking for everything (though for things like loop tail handling, or even the main body, using VL is better and much nicer); all the usual int & FP widths; indexed (gather & scatter), strided, and segmented loads & stores (all potentially masked); all operations support all types where at all possible - including integer division of all widths, and three sign variations for the high half of the 128-bit result of a 64-bit int multiply; and (of course) 8-bit shifts, which AVX-512 somehow doesn't have.
All while being scalable: the minimum vector register width (VLEN) is 128 bits for the 'v' extension, but hardware can implement up to 65536-bit vectors (and software can either pretend they're 128-bit, or be written so that it portably scales automatically); and if you want more than 128 bits portably there's LMUL, which groups up to 8 registers together, giving at-least-1024-bit operands.
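The VL mechanism is what makes loop tails disappear. Here's a plain-C sketch of the strip-mining pattern; model_vsetvl is a made-up stand-in for the real vsetvli instruction, and VLMAX is chosen arbitrarily (real hardware derives it from VLEN, SEW, and LMUL).

```c
#include <stddef.h>
#include <stdint.h>

enum { VLMAX = 4 }; /* stand-in for the hardware maximum, e.g. VLEN=128, SEW=32, LMUL=1 */

/* Model of vsetvl: the hardware tells you how many elements it will
   process this iteration, capped at VLMAX. */
static size_t model_vsetvl(size_t avl) {
    return avl < VLMAX ? avl : VLMAX;
}

/* Strip-mined vector add: the final short iteration simply runs with
   a smaller VL, so there is no scalar cleanup loop and no tail mask.
   The inner loop stands in for a single vadd.vv's worth of work. */
static void strip_mined_add(int32_t *dst, const int32_t *a,
                            const int32_t *b, size_t n) {
    while (n > 0) {
        size_t vl = model_vsetvl(n);
        for (size_t i = 0; i < vl; i++)
            dst[i] = a[i] + b[i];
        dst += vl; a += vl; b += vl; n -= vl;
    }
}
```

The same source works unchanged whether VLMAX is 4 or 2048, which is the portability point: the binary scales with whatever vector width the hardware provides.
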
For shuffles it has vrgather, which supports all element width lookups and can move any element to any other element (yes, including at LMUL=8, though as you can imagine it can be expected to slow down quadratically with LMUL; and could even become a problem at LMUL=1 for hardware with large VLEN, whenever that becomes a thing).
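Semantically, vrgather is just an indexed permutation, which a few lines of plain C can model. Using vl as the out-of-range bound below is a simplification of the real rule (the spec zeroes elements whose index is >= VLMAX), and the function name is made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of vrgather.vv: dst[i] = src[idx[i]], where out-of-range
   indices produce 0. Every destination element can read from any
   source element, which is what makes large-VLEN/high-LMUL hardware
   implementations expensive: it's an all-to-all crossbar. */
static void model_vrgather(int32_t *dst, const int32_t *src,
                           const uint32_t *idx, size_t vl) {
    for (size_t i = 0; i < vl; i++)
        dst[i] = (idx[i] < vl) ? src[idx[i]] : 0;
}
```
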