C++ Concurrency Model on x86 for Dummies (2020) (databasearchitects.blogspot.com)
129 points by greghn on Aug 20, 2022 | hide | past | favorite | 71 comments


It's worth noting that for a lot of people for a long time, Intel TSO was the memory model in practice (which IIRC makes `std::memory_order_seq_cst` pretty much free on Intel).

Many clothes have been torn and many teeth gnashed porting C++ from Intel to more relaxed architectures. Back in the day that usually meant like SPARC, today that usually means ARM.

"Acquire" and "Release" are relatively comprehensible in isolated examples, but the intuitions don't compose well. It's realllly easy to end up with an ABA type issue or an architecture dependency even with the simplest lock-free primitives like CAS. Combine that with how cheap `std::mutex` is on most platforms and it's just really a last resort, at least on server-class gear.

Google engineers have more experience with this than probably anyone, and they've summed it up nicely here: https://abseil.io/docs/cpp/atomic_danger.

Doing raw atomics isn't quite "roll your own cryptography", but it's clearly in "if you have to ask, don't use it" territory. I've lost count of how many times I've fuzzed something to Mars and back to show it "correct" and not been able to justify it compared to `pthread` when the profiles were in.

With that disclaimer, people are going to do it, and so I should probably leave an example of it done well: https://www.youtube.com/watch?v=HJ-719EGIts


> which IIRC makes `std::memory_order_seq_cst` pretty much free on Intel

The article explains how that is only true for loads, not stores.


Thanks for correcting my hand-wavy-ness on that. The thrust of my comment was that blog posts like this need to be handled with some care: this is just enough detail to be dangerous without enough context to be truly helpful.

Getting a bunch of people excited to go start relaxing their loads and stores with mostly-truths like "The CPU's store buffer must be flushed to L1 cache to make the write visible to other threads through cache coherency." is just leaving loaded guns lying around. I'm only mostly sure that some uarchs will read from the store buffer without engaging the cache-coherency stuff at least in an HT/ILP world, but that is kind of the point: playing intrinsics jazz is a very asymmetrical bet. You can win big if you absolutely nail it for your particular chip, but you usually lose it all when it's not correct.

For a slightly more detailed treatment I like this SO answer: https://stackoverflow.com/a/62480523/19734375. It sketches out a better intuition about how the store buffer and speculative stores play out in practice and is very well footnoted.

For those that want to go a level deeper I can't recommend Chips and Cheese highly enough. This treatment of Golden Cove is typical of their rigor in understanding what really pushes architecture-specific performance on cutting-edge gear: https://chipsandcheese.com/2021/12/02/popping-the-hood-on-go....

Edit: Even this comment is self-contradictory. I pulled that SO answer out of my `x86_64 arcana` bookmark folder, and on re-reading it cites a primary source that store buffers are partitioned amongst logical cores by the spec. This stuff is tricky!


If they weren't partitioned, it would be hard to preserve TSO: a core would be able to see not only its own stores out of order, but the sibling hyperthread's as well.

This is the case on POWER for example but it is allowed by its more relaxed memory model.


> Combine that with how cheap `std::mutex` is on most platforms

This is suspiciously hand-wavy. How "cheap" do you think std::mutex is, and on what platforms?

The thrust of your argument makes sense, but one reason it's so tempting is that many C++ programmers are (or at least want to believe they are) writing performance sensitive code and yet the C++ language and standard library prioritize compatibility over any other consideration.

So there's a good chance even though the obvious way to design a mutex on your platform today is a small integer plus OS support, writing std::mutex gets you some huge unwieldy mess because that's compatible with programs written when the "First Black President" was still a white guy named Bill...


I mean it's not hard to read the source for your platform. On Linux/x86_64/libc++ it's roughly:

- https://github.com/llvm-mirror/libcxx/blob/master/include/__...

- https://sourceware.org/git/?p=glibc.git;a=blob_plain;f=nptl/...

I don't particularly care to comb through it to see if anything has changed, but historically it was a little spin-CAS to make the non-contended path fast and then dropping into a https://en.wikipedia.org/wiki/Futex, which is about as good as it gets for staying mostly in userspace but still letting it be scheduler aware so you're not burning up a core busy-polling, which is what often happens when people try to roll their own shit.

Google wants a bit more latitude on the heuristics and degrees of freedom around read/write ownership, so they did it like this: https://github.com/abseil/abseil-cpp/blob/master/absl/synchr... which is quite a bit better commented/legible.

If anyone reading this can do better than the `abseil-cpp` folks, not only would Google take their PR, they'd probably offer them a job.


I am always disappointed when someone talks about std::mutex poorly. On Linux it is as good as it can possibly be for a generic catch-all lock, and by that I mean it is really really good for most use cases. If you want to use a spinlock to outperform std::mutex you will at least have to do the legwork of using real time scheduling and guaranteeing that any spinlock you are locking will be unlocked within a finite amount of time with a known upper bound. Any less and your spinlock will cause problems when your thread locks it and then gets interrupted by the OS scheduler.


> On Linux it is as good as it can possibly be for a generic catch all lock

A futex is a 32-bit aligned value, thus it needs 4 bytes. But std::mutex on Linux is 40 bytes, ten times larger. Now, maybe where you come from "ten times larger than it needs to be" is "as good as it can possibly be" but where I come from that's not very good.


The only issue with std::mutex on libstdc++ is that it is larger than it needs to be for ABI reasons. Otherwise it is perfectly adequate for many use cases.


Great point. The `libstdc++` ABI is very low on my list of favorite things. Who doesn't love spilling cache lines because of 1990s layouts and taking L3 misses for `auto s = std::string{"hello world"};` because modern SSO breaks that ABI and... /s


It's adequate if you're happy yielding to the kernel and needing some other thread to issue a system call to wake you up.

Some uses of C++ have stricter latency and real-time requirements which aren't compatible with this.


If you're in a serious multicore setting and you don't take waiting threads off the contended cache line while the other 50 threads take their turn with the underlying contended thing it's very easy to end up way worse off than if you had just let the scheduler drop those threads into e.g. a futex and wake them up when it's their turn.

I'm not sure how deep you'd need to go into the low-power or embedded or hard-realtime worlds before `std::mutex` doesn't do a modest number of spin-CAS rounds before dropping into the futex, I'm sure it exists.

And maybe you work at Optiver and you've got an FPGA interacting with the link-layer and you literally never leave userspace after startup and you've got your own hand-crafted DMA busy-poll situation going because you build your machines with the exact number of cores for the number of threads you need and throw them away when the software changes. There are domains like that. </modest-hyperbole>

The number of "aggressively intermediate" people working at serious companies who roll their own concurrency shit because it's Just Fucking Metal Man is terrifyingly high. And it's dangerous, because if you are someone who needs custom concurrency you know, but if you aren't someone who needs custom concurrency, you often still think you know.


As soon as you yield, things become non-deterministic, so aren't compatible with real-time requirements.

If you have 50 threads trying to access the same thing then what you have is a software design problem. Most synchronization should be done through spsc queues, which are easy to make lock-free (or even wait-free) and efficient, so long as you're aware of how to deal with backoff on the producing side and idle work on the consuming side.

The Optiver model you're describing is pretty much how I'd build any low-latency application. It doesn't really require special hardware to do these things (you can use io_uring to bypass the kernel context switches for anything). It's also much simpler than what you hint at.


I'm willing to accept that a London-based crypto startup (i.e. LMAX-integrated) could have a use for extreme low-latency, extreme low-variance soft-realtime C++. In the sub-mike p99 regime you probably want to keep multithreading out of it entirely in fact.

Hopefully you can accept that nitpicking the use of well-tested concurrency primitives on a forum full of impressionable up-and-coming hackers is almost certainly going to create downward pressure on sensible engineering choices amongst readers of your comments.


Have you been googling me? A few inaccuracies there ;).


> As soon as you yield, things become non-deterministic, so aren't compatible with real-time requirements.

SCHED_FIFO and SCHED_RR.

Although linux is not really an RT OS.


Once again, exactly right.

I deeply appreciate you helping to steer this little tire fire of a comment thread I seem to have created off the rocks, I meant well but it seems to have ended up as an advertisement for insane defaults.

With that said, the parent seems pretty committed. And for all I know, is actually doing sub-microsecond software HFT or hard-realtime signal processing.

Thread: listen to this person ^.


Threads can be derailed, that's usually how interesting things come up.


Fast implementations of std::mutex spin for a bounded amount of time before yielding, if you unlock the mutex fast enough it will have roughly the performance of a spinlock and if your locking thread gets interrupted by the scheduler, other threads waiting for the mutex will stop spinning and yield.

I already mentioned real time scheduling which is well outside of the scope of the vast majority of userspace applications.


I think this is pretty much exactly right with the caveat that "fast implementations" is probably "the overwhelming majority of implementations". `libpthread` on GNU has been doing this since smart phones were quite the novelty. I'm sure it exists, but I'm having a hard time imagining who wrote a C++11 compliant standard library and didn't do this optimization.


RISC-V is pretty interesting in this regard; it uses weak memory ordering by default, but a CPU that implements TSO can simply flag Ztso, and I believe that would be enough to get x86-like memory semantics.

IIRC, Apple M1 uses a similar technique to make sure Rosetta 2 can run translated x86 code without worrying about the memory model. However, M1's implementation can be switched at runtime, while Ztso is not. On the other hand, Ztso is standardized while M1's TSO is not.

Also, the text for Ztso mentions that SPARC uses TSO, just like x86. ARM however is indeed more relaxed.


That’s fascinating about the M1. In retrospect it seems like kind of a no-brainer but I doubt I would have thought of it.

SPARC had different memory models at different ISA revs IIRC: it’s been like 20 years since I was dealing with SPARC so I might be misremembering the details. Alpha would have been a better example.

RISC-V is really interesting. I’ve been slowly working through this: https://github.com/standardsemiconductor/lion, highly recommend!


> Combine that with how cheap `std::mutex` is on most platforms and it's just really a last resort, at least on server-class gear.

Mere 10-20M std::mutex operations per second can bring a server to a total halt. CPU interconnects and CPU internal core-to-core buses are quickly saturated by required synchronization traffic.

So I wouldn't call it a particularly cheap operation. In fact, I can't think of many more expensive operations a CPU could perform. I guess just page walks and interrupts might qualify.


The dominant term in the type of scenario you're describing is contention, not mutexes. I forget how big they're making Zen3 EPYCs right now, I think you can get like 64-128 physical cores. I can walk all over 128 different mutexes on each of those 128 cores for less than the cost of an L3 hit on that same EPYC box [1].

If I fuck up the affinity and they're all banging on one data structure? Yeah, it can jam the whole machine. This is why people who write good logging libraries are always playing games with buffering in thread-local storage and stuff. And sometimes you do need a bunch of producer threads dumping into a queue that simply can't be striped by anything (it's somewhat rare but it happens): in which case you want `folly::MPMCQueue` (or whatever Intel TBB calls their version of that) which Nathan Bronson was kind enough to get right so that other people don't have to end up with something both slow and wrong.

A truly novel lock-free data structure that advances the SOTA in production software is a masters thesis, not a blog post.

[1] This is a bit dated but it's good enough for eyeball work: https://gist.github.com/jboner/2841832.


Uncontended mutex operations are cheap and do not require core-core communication.

Contended operations are expensive, whether they are mutex acquisitions, atomic RMW or plain writes.


Ah I missed your comment when writing my sibling. Your comment is much more succinct while making the point at least as well if not better.


This is grossly incorrect. Using sequential consistency on x86 does require extra, expensive fence instructions.

What's free on x86 is acquire/release.


You're going to eat some cost on any speculative/out-of-order chip to get loads and stores ordered bidirectionally relative to getting it down to the `MFENCE` or whatever by hand, though a great deal less than not having a speculative execution pipeline.

The statement you're calling "grossly incorrect" was double-qualified by "IIRC" and "pretty much", which I thought was enough to make clear that I was oversimplifying a bit.

The Google doc I linked starts with this:

"Most engineers reach for atomic operations in an attempt to produce some lock-free mechanism. Furthermore, programmers enjoy the intellectual puzzle of using atomic operations. Both of these lead to clever implementations which are almost always ill-advised and often incorrect."

They're being generous.

I'm well aware that it gets up the ass of every self-styled Ghemawat to point out that most people shouldn't be playing jazz on this stuff, but if even one hacker doesn't fuck up something important because I urged some caution, I'll live.


CAS is pretty much tailored for read-modify-write (RMW) of atomically-sized data. ABA is an issue if it can result in the same bit pattern having a different meaning (thus, CAS is sometimes less intuitive than e.g. LL/SC), but there are plenty of cases where it is enough, even in a straightforward "composable" way.


There's a constant tension between bit-packing and rich semantics older than me! As you likely know C.A.R. Hoare calls `null` his "billion-dollar mistake": https://www.youtube.com/watch?v=ybrQvs4x0Ps.

I linked the Cliff Click talk because it's the best example that I know about of someone using some semi-formal rigor around valid states and the transitions between them in the context of raw atomic operations to get a quite clearly correct result with a legitimate performance imperative: he had a machine with 768 cores, customers who bought that machine to have 768 threads jump on a hash table all at once and wanted ~700x more throughput having paid for the machine.

But even Cliff Click made sure that both his conceptual states and their physical representations were going A -> B -> C w.r.t. a given conceptual "register".

I'm not trying to knock the serious concurrency pros who need atomic and/or relaxed operations in some super hot path: there are unambiguous use cases for this stuff. I've legitimately needed these things once or twice myself.

Lock-free is a bad default. Even the elite concurrency pros work very hard to get something that's both robust and still a win after the constant factors. I'm saying "proceed with high caution, and on the basis of measurement", not "never ever do this".


Wait, wait, there can be even more confusion! IIRC nvidia’s Denver and Carmel Aarch64 cores, and one of Fujitsu’s, also use TSO...

...so your tested Intel-to-ARM port might work just fine until run on a different vendor’s core!


POWER/PowerPC has a relaxed memory model, too.

Anyway this is another case where (hopefully) Rust shields the programmer from complexity. This stuff is very difficult to get correct and efficient.


Rust does have some very cool mechanisms for safety, including in the presence of concurrency.

But this thread is generating blowback from someone saying: “slow down there with the hand-rolled atomic operations, you can hand-roll your multi-threaded locking strategy and it’ll be way safer at a modest cost!”

So, probably not the target audience for Rust ;)

I use a lot of C++ still because there are libraries I want and I have a significant investment in existing code, but I’d love to get to something more modern.

Hand-rolled atomics and load/store relaxation in application code make even seasoned C++ hackers a bit nervous: we saw this shit from business logic hackers at FB all the time and my colleague coined the term “aggressively intermediate” for the style.

I don’t mean to pick on the author of a quite good library (and it is quite good), but I ran across this the other day:

https://github.com/jupp0r/prometheus-cpp/blob/master/core/sr...

It’s correct (I think, very easy to be wrong about this sort of thing), but what are we measuring here where we can’t delegate that CAS into pthread? Branch mispredictions?

Either threads are fighting over whatever cache line that’s on (exclusive -> invalid -> exclusive -> invalid), or not. If they are, I’ve just deprived the scheduler of the opportunity to wake me up when the other 59 threads are done. If they’re not, I’ve maybe saved like one line in my L1.

And in something like a metrics library, you could be wrong for a very long time before someone pinned it down.


For a somewhat less "for dummies" treatment of the C++ memory model, see Herb Sutter's classic talk "atomic<> Weapons" [1].

[1] https://www.youtube.com/watch?v=A8eCGOqgvH4


This is a good talk.

A good set of blogposts is Preshing as well: https://preshing.com/20120710/memory-barriers-are-like-sourc...

https://preshing.com/20120913/acquire-and-release-semantics/

(etc. etc. Preshing wrote a lot on this subject, browse the blog)

Not quite on C++ atomics or the C++11 concurrency model, but more of a generic discussion of this extremely complex subject.


Turns out the C++ memory model has some issues that came to light after that talk. See https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p06...


I've always hated the modern C++ atomic stuff. Atomicity is not a property of the thing being accessed, it is a property of the instructions performing the access: trying to shoehorn that reality into an object-based model was wrong from the beginning.

A library of fence primitives is much easier to work with and understand, IMHO. [1] is the best example I know of. If you've never tried to write code that way, have a crack at it: you'll definitely learn something.

I think lots of programmers struggle to understand memory ordering because the C++ model makes it so much more difficult to understand than it really is. When you actually deal with the fences directly, everything makes a lot more sense.

Also, the word "atomic" is unfortunately really overloaded. Nearly all X86 instructions are "atomic" with aligned pointers, in the sense that the result of two racing MOV instructions will be one or the other value, not some corrupt combination of the two.

[1] https://www.kernel.org/doc/Documentation/memory-barriers.txt


I for one vastly prefer fenced operations as opposed to separate fences.

The synchronizes-with model is much easier to understand than anything reordering based.

Also it is nice that the typesystem helps making sure that atomic things are handled atomically.

When it is not enough now there is atomic_ref.


> The synchronizes-with model is much easier to understand than anything reordering based.

Why do you think that?


Empirically it is much easier to reason about and be confident of the correctness of the result, at least to me and to many people I have discussed the matter with.

I find it easier to reason in terms of DAGs instead of all possible interleaved executions. The DAGs are still dynamic and execution dependent, but you have to consider fewer combinations.

Once you know that some algorithm violates the happens-before relation, it is easy to go back and find the reordering that would cause it.


What a great article. I have been trying to compile some examples here: https://github.com/blasrodri/atomic-story. Will try to extend it a bit more based on what I've learned from this text.


> C++

And using pointers instead of references.


Which of the following calls potentially modifies foobar?

    int foobar;
    f(foobar);
    g(&foobar);
Well, in C, "g" is the only function that might modify foobar. In C++, however, if "f" is declared as f(int& x), then "f" can also modify foobar.

For some coders, it is preferable to have this explicit & reminder on their code, to see cases like "g" rather than possible-modifications that occur with "f"-style code.

Personally speaking, I kinda switch between the two styles depending on mood / the situation. I'd think that the dereferences on "atomics" however need to be incredibly explicit however, so "atomic" code will almost certainly be "g" style, rather than "f".


The problem with this is that once you get any deeper than the first call, that helpful "&arg" goes away and it's just "arg".

I would argue this sometimes-there single character is incredibly brittle for the claimed benefits, and illustrative of the unwillingness of large parts of the C++ community to use any reasonable tools that would do a better job instead (e.g. very basic syntax highlighting would be far more consistent in indicating mutability).


> The problem with this is that once you get any deeper than the first call, that helpful "&arg" goes away and it's just "arg".

There's a HUGE difference between an "int& foo" and an "int* foo". ESPECIALLY if I'm doing incredibly sophisticated atomic and memory-barrier operations (IE: atomic<int>& vs atomic<int>*).

Take a guess which one I'm more comfortable passing around.

This isn't a case of "param->Operation1(1)" like you're talking about. This is a simple integer being used in some of the most complex, sophisticated, low-level optimizations known to the modern programmer. Knowing when, where, why, and how these things are dereferenced is the entire damn point of memory-barriers and atomics. I want _NO_ surprises, at all.

At least, within the context of "C++ Concurrency Models". To be honest, this is the stuff that causes me to run away screaming. But if I'm forced to write code involving the words "compare-and-swap", "atomic", or "barrier", you damn well bet that I'm going to be doing it with "int*" rather than "int&".

--------

Every style of code has its place. When dealing with atomics, I definitely prefer pointers over references. If you're running around in object-oriented land, you probably don't care and have modest benefits to the reference. But the minute you start studying "A-B-A" problems and the difference between "load-acquire" vs "load-consume" and the possible memory-orderings that can happen, you suddenly want to become extremely explicit over your dereferencing.

--------

"When" did the dereference occur? Is it compatible between Thread#1 and Thread#2? Are there any weird "gotchas" that could occur? Has the dependencies you need been ordered correctly in all threads that are active?

If you're unaware of what's going on here, give this a readover. https://en.cppreference.com/w/cpp/atomic/memory_order

I promise you, you'll have some headaches. This is by far the hardest subject in all of low-level programming I've come across. And being more explicit about when the dereferences happen is very useful.


So, to be clear, you appear to either (a) have a case where there are significant behavior differences between passing by mutable reference versus pointer (and thus _need_ to use a pointer), or (b) are relying on operations to certain types being "atomic" and on the compiler and processor not mangling the code you write without using any explicit concurrency-protecting constructs (e.g. std::atomic<T>) which you believe they would do to a reference argument.


Have you read the post in question? It's a good introduction to C++ Concurrency / memory models.

    atomic<int>* blah; 

    a = *blah; // This is relatively fast.
    *blah = b; // This is slow: store-release.
    blah = c; // This is fast, no store-release operation going on. Just a pointer-assignment
Anyone dipping down to this level is interested in the performance characteristics of memory barriers. Indeed, the entire point of this exercise is to get faster than std::mutex lock / unlock after all, and squeezing the last ounce of performance out of your code.

Indeed, one can argue that when using atomics, the more explicit "store" and "load" functions should be used instead.

    a = blah->load(); // These two examples are
    blah->store(b); // arguably the most explicit, preferred choice


> Have you read the post in question? It's a good introduction to C++ Concurrency / memory models.

Yes, I have. And yes, I am familiar with the C++ concurrency/memory models discussed here. And, to be clear, nothing in the article is doing anything that couldn't be done just as clearly with std::atomic<bool>& as with std::atomic<bool>*. Obviously, if you actually want to be able to assign to the pointer, then you need the pointer form.

> Indeed, one can argue that when using atomics, the more explicit "store" and "load" functions should be used instead.

Yes, this is the style I strongly prefer. Of course, here it doesn't matter between

  a = blah->load();
  blah->store(b);
or

  a = blah.load();
  blah.store(b);
and choosing to use the pointer form is often just introducing nullability where you don't need it.


> choosing to use the pointer form is often just introducing nullability where you don't need it.

int *_Nonnull whatever;

Tada! Nullability avoided. https://releases.llvm.org/3.8.0/tools/clang/docs/AttributeRe...


That's not a store-release, it's a store-release followed by a full memory barrier. A store-release is fast but does not order the store against subsequent loads.


> that helpful "&arg" goes away and it's just "arg".

You still get a reminder it's a pointer by the * and -> notation.


That only goes so far - without reasonable method names you still don't know if the operation actually mutates the parameter, e.g. in

  void func(ParamType* param) {
    ...
    auto res_1 = param->Operation1(1);
    auto res_2 = param->Operation2(2);
    ...
  }
the presence of -> doesn't tell you which method mutates. The only thing that would tell you that would be knowing which method is const.


It depends. What if "f" is macro?


Depends. What if someone wrote "#define if while" and the word "if" is a macro?

    // In #include<deep/deep/deep/include.h>
    #define if while

    // somewhere else

    if(1){
        puts("Help, I'm stuck in an infinite loop!");
    }

There's a reason why macros should be all-caps when used, and IMO, C++ programmers should prefer templates. Macros are too powerful, and need to be used extremely cautiously in my experience.

Pointing out that macros can totally mess up your expectations isn't exactly news to anybody who does C/C++ programming.


Really if I’m programming C++ I avoid macros (except include guards). Though I do have the advantage that my code doesn’t need to be portable.


Completely off-topic...


yes? what am I missing here? I've always preferred pointers in my c++ code.


Code following this style (e.g. lots of code following GSG or GSG-derived styles), tends to have large amounts of

  void func(ParamType* param) {
    assert(param != nullptr);
    ...
  }
because passing via pointer introduced undesired nullability. And if that null pointer check isn't present, or is done wrong, you've easily introduced the potential for undefined behavior into the codebase. C++ specifically gives you mutable references to express the desired behavior in these cases, and using a nullable (and memory-unsafe nullable at that) pointer is actively less clear (you now need to document the non-null precondition separately, rather than in the signature itself).


> introduced the potential for undefined behavior

Small quibble, C++ has the capability to be low level enough that the potential for UB is always there. Even with a reference, you could be dealing with a completely random chunk of memory that someone told the compiler to interpret as the data type in question.

The quibble is the presentation of this as introducing the possibility of UB (the word introduced implies it wasn't already there), instead of used as a technique for minimizing the already present potential of UB.


You are just wrong here. References were not added to C++ for that reason. Pointers can be presumed in some code bases not to be null with nullability being the unusual and documented scenario.


I didn't claim that's why they were added, I said that's what they allow you to express.

> Pointers can be presumed in some code bases not to be null with nullability being the unusual and documented scenario.

Just because you can train your users to assume/document when nullability is allowed, doesn't make it a better option than using a language construct that would make it explicit.


You literally did say that; it’s the plain meaning of your words.

Pointers are a better option because they make the mutability apparent at the call site. Codebases with that convention work great. There is no special training or documentation overhead involved.


Specifically, what I said was

> C++ specifically gives you mutable references to express the desired behavior in these cases

where "these cases" refers to the common case of (a) having a mutable argument, (b) which you want passed by reference, and (c) do not want to be nullable, that is exactly the semantics achieved by passing via mutable reference. I am not claiming this is why Stroustrup added them (his own explanation for why references were added is "References are useful for several things, but the direct reason I introduced them in C++ was to support operator overloading."), but that the language provides a construct which provides exactly for the desired behavior.

> Pointers are a better option because they make the mutability apparent at the call site.

At the first call site, yes (sometimes, see below). Not at any further calls.

> Codebases with that convention work great. There is no special training or documentation overhead involved.

As someone who works with a large codebase following this style, I beg to differ. Codebases with this convention are littered with

  ReturnType func(ParamType* param) {
    assert(param != nullptr);
    ...
  }
because mutable-arguments-via-pointer has introduced undesired nullability, where the intended behavior could have been expressed more directly as

  ReturnType func(ParamType& param) {
    ...
  }
and, it inevitably combines all three types of methods

  ReturnType func1(const ParamType* param) {
    if(param != nullptr) {
      ...
  }

  ReturnType func2(ParamType* param) {
    assert(param != nullptr);
    ...
  }

  ReturnType func3(ParamType* param) {
    if (param != nullptr) {
      ...
  }
mingling nullability and mutability, none of which is obvious from

  func1(&foo);
  func2(&bar);
  func3(&baz);
and you still end up needing documentation to figure out what's happening.


> Pointers can be presumed in some code bases not to be null with nullability being the unusual and documented scenario.

So document it regardless instead https://releases.llvm.org/3.8.0/tools/clang/docs/AttributeRe...


It shouldn't be a matter of preference. Use a pointer if you need it to be relocatable or if you need it to be optional (it may be nullptr). Otherwise use a reference.


The other major use-case is where other code constrains you, e.g. it's being passed as a callback to a C library.

There are also a number of cases where passing a pointer into a template function can be more concise. E.g. since pointers are iterators, if you have a generic function that iterates over things and you just want it to look at one thing (or zero) for testing, you can usually do this with pointer arithmetic.


I am not familiar with the capabilities of the compiler but is there any optimization reason to prefer references as well? Like the compiler may be able to figure out an efficient way to keep a referenced object ‘nearby’ for subsequent function calls that also refer to the object, maybe in some register or reordering which code gets run when. Whereas sometimes the compiler might not be able to do something like that with a pointer, incurring mem access costs.

Hm slightly tangential but could you install some kind of page-fault handler on a reference such that the control flow jumps on write back to the last place it was accessed? Not that you’d want to do that in the real world, but it would be an interesting “pub-sub” type control flow..


In C++, there are no differences between pointer and reference semantics except that the former are a) nullable and b) reseatable (can be changed to point to something else).


nobody says you should always use reference instead of pointer in C++


I do prefer references, because they can't be null. It is less state to worry about.

I agree that not having explicit out-parameters makes this a trade-off. It is possible to use an out-parameter wrapper type to get the best of both worlds, but nobody does this.


Lots of people say this... they’re wrong but the day this



