...you know, while I personally think that the RISC approach was an honest mistake, stuff like this makes me see why some people wanted to get rid of complex instructions.
Well, supposedly RISC-V implementations will have none of this malarkey while still rivaling x64/ARM64 in processing speed at comparable technology/clock rates/prices, just with plain old loads-and-xors-and-stores?
Complicated vector instructions like these are not really antithetical to RISC.
The core of modern RISC thought is basically: "The laws of physics mean that no matter how much hardware you throw at it, only some kinds of instructions can be implemented in a performant way. We should only include those kinds of instructions in the instruction set." You then build more complex operations out of these simple building blocks, but because every instruction provided can reasonably be implemented to run really fast, the CPU itself can be fast.
Masked vector adds belong in the set of instructions that can be implemented to be fast, and that's why they are included in the RVV RISC-V extension. An example of an instruction that cannot be implemented to be fast would be the humble x86 load+add, where you first look up a value in memory, and then add it to a register. The only reasonable way to implement this to be fast is to just split it into two separate operations which are also dispatched separately, and that is precisely what modern x86 does.
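For concreteness, here's a plain-C model of what a masked vector add computes per lane. This is a sketch of the semantics only, not real RVV code; the function name and the byte-per-lane mask representation are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of a masked vector add: for each lane whose mask bit is set,
   dst = a + b; inactive lanes keep their old destination value
   (RVV's "undisturbed" masking policy). Because each lane is
   independent, hardware can do all of them in parallel - which is
   exactly why this qualifies as a "can be made fast" instruction. */
static void masked_vadd(int32_t *dst, const int32_t *a, const int32_t *b,
                        const uint8_t *mask, size_t vl) {
    for (size_t i = 0; i < vl; i++)
        if (mask[i])
            dst[i] = a[i] + b[i];
}
```
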
RISC-V has a packed-SIMD extension (the 'P' extension) as well. But even without SIMD, prefetching and instruction selection/scheduling have a big impact on performance, so it is unlikely one can just write a few lines of assembly and get to a similar level of performance.
I don't think RISC-V's SIMD extension is very popular. At least I can't think of any available core implementing it. The vector extension is much more common.
RVV has: masking for everything (though for things like loop tail handling, or even the main body, using VL is better and much nicer); all the usual int & FP widths; indexed (gather & scatter), strided, and segmented loads & stores (all potentially masked); all operations support all types where at all possible - including integer division of all widths, and three sign variations for the high half of the 128-bit result of a 64-bit int multiply; and (of course) 8-bit shifts, which AVX-512 somehow doesn't have.
All while being scalable: the minimum vector register width (VLEN) is 128 bits for the 'v' extension, but hardware can implement up to 65536-bit vectors (and software can either pretend they're 128-bit, or be written so that it portably scales automatically); and if you want more than 128 bits portably there's LMUL, which groups up to 8 registers together, giving at-least-1024-bit operands.
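The VL mechanism is what makes loop tails disappear. Here's a plain-C sketch of the strip-mining pattern; model_vsetvl is a made-up stand-in for the real vsetvli instruction, and VLMAX is chosen arbitrarily (real hardware derives it from VLEN, SEW, and LMUL).

```c
#include <stddef.h>
#include <stdint.h>

enum { VLMAX = 4 }; /* stand-in for the hardware maximum, e.g. VLEN=128, SEW=32, LMUL=1 */

/* Model of vsetvl: the hardware tells you how many elements it will
   process this iteration, capped at VLMAX. */
static size_t model_vsetvl(size_t avl) {
    return avl < VLMAX ? avl : VLMAX;
}

/* Strip-mined vector add: the final short iteration simply runs with
   a smaller VL, so there is no scalar cleanup loop and no tail mask.
   The inner loop stands in for a single vadd.vv's worth of work. */
static void strip_mined_add(int32_t *dst, const int32_t *a,
                            const int32_t *b, size_t n) {
    while (n > 0) {
        size_t vl = model_vsetvl(n);
        for (size_t i = 0; i < vl; i++)
            dst[i] = a[i] + b[i];
        dst += vl; a += vl; b += vl; n -= vl;
    }
}
```

The same source works unchanged whether VLMAX is 4 or 2048, which is the portability point: the binary scales with whatever vector width the hardware provides.
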
For shuffles it has vrgather, which supports all element width lookups and can move any element to any other element (yes, including at LMUL=8, though as you can imagine it can be expected to slow down quadratically with LMUL; and could even become a problem at LMUL=1 for hardware with large VLEN, whenever that becomes a thing).
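Semantically, vrgather is just an indexed permutation, which a few lines of plain C can model. Using vl as the out-of-range bound below is a simplification of the real rule (the spec zeroes elements whose index is >= VLMAX), and the function name is made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of vrgather.vv: dst[i] = src[idx[i]], where out-of-range
   indices produce 0. Every destination element can read from any
   source element, which is what makes large-VLEN/high-LMUL hardware
   implementations expensive: it's an all-to-all crossbar. */
static void model_vrgather(int32_t *dst, const int32_t *src,
                           const uint32_t *idx, size_t vl) {
    for (size_t i = 0; i < vl; i++)
        dst[i] = (idx[i] < vl) ? src[idx[i]] : 0;
}
```
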