
...you know, while I personally think that the RISC approach was an honest mistake, stuff like this makes me see why some people wanted to get rid of complex instructions.

Well, supposedly RISC-V implementations will have none of this malarkey while still rivaling x64/ARM64 in processing speed at comparable technology/clock rates/prices, just with plain old loads-and-xors-and-stores?



Complicated vector instructions like these are not really antithetical to RISC.

The core of modern RISC thought is basically: "The laws of physics mean that no matter how much hardware you throw at it, only some kinds of instructions can be implemented in a performant way. We should only include these kinds of instructions in the instruction set." Then you build more complex operations out of these simple building blocks, and because every instruction provided can be reasonably implemented to run really fast, the CPU itself can be fast.

Masked vector adds belong in the set of instructions that can be implemented to be fast, and that's why they are included in the RVV RISC-V extension. An example of an instruction that cannot be implemented to be fast would be the humble x86 load+add, where you first look up a value in memory, and then add it to a register. The only reasonable way to implement this to be fast is to just split it into two separate operations which are also dispatched separately, and that is precisely what modern x86 does.


RISC-V does have RVV, which similarly can do SIMD, has masking, but also has a vector length separate from masks: https://godbolt.org/z/rrEW85snh. Complete with its own set of ~40000 C intrinsics (https://dzaima.github.io/intrinsics-viewer).

Though, granted, RVV is significantly more uniform than AVX-512 (albeit at the cost of not having some useful goodies).
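To make the "vector length separate from masks" point concrete, here is a rough plain-C model of a VL-driven (strip-mined) ASCII tolower loop. It is only a sketch: `MODEL_VLMAX` and `model_setvl` are made-up stand-ins for the hardware vector length and for what `vsetvli` reports, not RVV intrinsics, and real hardware is allowed to choose VL somewhat differently near the tail.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the hardware's VLMAX (elements per vector). */
enum { MODEL_VLMAX = 16 };

/* Simplified model of vsetvli: how many elements this iteration handles. */
static size_t model_setvl(size_t remaining) {
    return remaining < MODEL_VLMAX ? remaining : MODEL_VLMAX;
}

/* ASCII tolower over a buffer, strip-mined by vector length. */
void tolower_vl(unsigned char *buf, size_t n) {
    assert(buf != NULL || n == 0);
    while (n > 0) {
        size_t vl = model_setvl(n);        /* VL bounds the chunk, tail included */
        for (size_t i = 0; i < vl; i++) {  /* models one vector op over vl lanes */
            /* The mask is data-dependent: which lanes hold 'A'..'Z'. */
            int is_upper = buf[i] >= 'A' && buf[i] <= 'Z';
            if (is_upper) {
                buf[i] = (unsigned char)(buf[i] + ('a' - 'A'));
            }
        }
        buf += vl;
        n -= vl;
    }
}
```

In this model VL only answers "how many lanes are live this iteration", so the last short chunk needs no special code, while the mask answers "which of those lanes get modified".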


RISC-V has a SIMD extension as well. Even when there is no SIMD, prefetching or instruction selection/scheduling will have a big impact on the performance, so it is unlikely one can easily write a few lines of assembly and get to a similar level of performance.


I don't think RISC-V's SIMD extension is very popular. At least I can't think of any available core implementing it. The vector extension is much more common.


The "P" (Packed SIMD) extension is still under development. It uses GPRs and is intended for smaller cores for which V would be too heavyweight.

The proposal originates with Andes, and one of their own ISAs. They have several RISC-V cores with an early draft of it.


There seem to be a few Raspberry Pi style boards available with it. Bruce Hoult wrote about his RISC-V SIMD tolower() at https://lobste.rs/s/bfgsh6/tolower_with_avx_512#c_8wmpce


That's the vector extension (V) rather than packed SIMD (P).


Do the RISC-V vector instructions cover the whole gamut that x86 does? (or at least the modern AVX-512 / AVX-10 coding style)


RVV has: masking for everything (though for things like loop tail handling, or even the main body, using VL is better and much nicer); all the usual int & FP widths; indexed (gather & scatter), strided, and segmented loads & stores (all potentially masked); all operations support all types where at all possible, including integer division of all widths and three sign variations for the high half of the 128-bit result of a 64-bit int multiply; and (of course) 8-bit shifts, which AVX-512 somehow doesn't have.

All while being scalable, i.e. minimum vector register width (VLEN) is 128-bit for the 'v' extension, but hardware can implement up to 65536-bit vectors (and software can choose to either pretend they're 128-bit, or can be written such that it portably scales automatically); and if you want more than 128 bits portably there's LMUL, allowing grouping registers up to 8 in a group, giving up to at-least-1024-bit registers.
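The arithmetic behind those numbers can be sketched in a few lines of C. This assumes the usual relation VLMAX = LMUL * VLEN / SEW and only models integer LMUL (fractional LMUL is left out); the function names are made up for illustration.

```c
#include <assert.h>

/* Elements in one register group: VLMAX = LMUL * VLEN / SEW.
   Only integer LMUL values are modelled here. */
unsigned vlmax(unsigned vlen_bits, unsigned sew_bits, unsigned lmul) {
    assert(sew_bits == 8 || sew_bits == 16 || sew_bits == 32 || sew_bits == 64);
    assert(lmul == 1 || lmul == 2 || lmul == 4 || lmul == 8);
    return lmul * (vlen_bits / sew_bits);
}

/* Total bits covered by one register group of LMUL registers. */
unsigned group_bits(unsigned vlen_bits, unsigned lmul) {
    assert(lmul == 1 || lmul == 2 || lmul == 4 || lmul == 8);
    return vlen_bits * lmul;
}
```

With the minimum VLEN of 128, `group_bits(128, 8)` gives 1024, which is the "at-least-1024-bit" figure, and `vlmax(128, 8, 8)` gives 128 byte elements per group.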

For shuffles it has vrgather, which supports all element width lookups and can move any element to any other element (yes, including at LMUL=8, though as you can imagine it can be expected to slow down quadratically with LMUL; and could even become a problem at LMUL=1 for hardware with large VLEN, whenever that becomes a thing).
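As a scalar model of what an unmasked vrgather.vv does on 8-bit elements (a sketch, not intrinsics; the function and parameter names are made up, and `src` is assumed to hold `vlmax` elements):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* dst[i] = src[idx[i]], with out-of-range indices producing 0. */
void model_vrgather_u8(uint8_t *dst, const uint8_t *src, const uint8_t *idx,
                       size_t vl, size_t vlmax) {
    assert(vl <= vlmax);
    assert(dst != src && dst != idx); /* destination must not alias a source */
    for (size_t i = 0; i < vl; i++) {
        /* Any output lane may read any input lane. */
        dst[i] = (idx[i] >= vlmax) ? 0 : src[idx[i]];
    }
}
```

Because every output lane can select from every input lane, the hardware cost grows with the product of the two, which is where the expected quadratic slowdown at large LMUL or VLEN comes from.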


Thanks for those details, it sounds like it should be very nice for short strings, and more like SVE than AVX


Considering all x86 processors I know about use a RISC architecture internally, I am not sure what actual benefits you get from CISC.


toLower() with RVV[0] has been implemented (by brucehoult).

0. https://lobste.rs/s/bfgsh6/tolower_with_avx_512#c_wqhwtp



