A trick to avoid reading beyond the end of the buffer is to make sure the end of the buffer lies on the same page. Typically, the OS will allocate memory in pages of 4KB, thus we can make a function that checks whether it is okay to read beyond or if we should fallback to the copy version.
Masking is nice, but not available everywhere (i.e. intel is still making new generations of CPUs without AVX-512, and apple silicon doesn't have any masked loads/stores either).
It might not be the nicest thing to assume to be the case on all hardware, but it shouldn't be too unreasonable to put it under an "if (arch_has_a_minimum_page_size)". So many things already assume at least 4KB pages, Intel/AMD aren't gonna break like half the world. If anything, they'd want to make larger pages to make larger L1 caches more feasible.
A trick to avoid reading beyond the end of the buffer is to make sure the end of the buffer lies on the same page. Typically, the OS will allocate memory in pages of 4KB, thus we can make a function that checks whether it is okay to read beyond or if we should fallback to the copy version.
-- https://ogxd.github.io/articles/unsafe-read-beyond-of-death/