Tesla Packs 50B Transistors onto D1 Dojo Chip

ksec · on Aug 28, 2021

Big Numbers are good for headline. But it doesn't put anything in context.

Die Size is 645mm^2 on a 7nm. This is important because we know the reticle limit which is around ~800mm^2.

The Nvidia AI Chip has 54 billion transistors with a die size of 826 mm2 on 7nm.

I recently saw a Ted Talk, If Content is King, then Context is God. I think it capture everything that is wrong in today's society.

zamadatix · on Aug 28, 2021

The maximum die size is interesting but not really the point. The context is more the complexity and capability of the chip, for which transistor count is about as good a measure as you're going to fit in the headline. The immediate subheading jumps to telling you FLOPs which is another attempt at summarizing the capabilities of the chip quickly. Once you have the info that it's large and fast the body serves to provide the detailed context. From that view the title serves to identify the primary context well - a very complex chip, come read more about it.

One basic thing I didn't see in the body was power consumption though, anyone know more details on that?

adrian_b · on Aug 28, 2021

400 W

See e.g. https://semiwiki.com/artificial-intelligence/302502-tesla-do...

abc_lisper · on Aug 28, 2021

Where is this Ted talk? I couldn’t find it

chacham15 · on Aug 28, 2021

I think theyre referencing this: https://www.youtube.com/watch?v=ZJ4GmZflpPI

TekMol · on Aug 28, 2021

The article starts with this statement:

    Artificial intelligence (AI) has
    seen a broad adoption over the past
    couple of years.

And continues:

    At Tesla, who as many know is a
    company that works on electric
    and autonomous vehicles, AI has
    a massive value to every aspect
    of the company's work.

Who is writing like this? And why?

What would Tom's Hardware lose if they left out this type of cheap fillwords?

Should I also start writing like this?

Is this type of "reader hostile writing" a new thing or have newspapers always written like this?

These are not rhetorical questions. I am honestly confused.

Dunedan · on Aug 28, 2021

> […] AI has a massive value to every aspect of the company's work.

That's also just wrong. During the recent "Tesla AI Day", when asked during Q/A, Elon Musk specifically mentioned that they intentionally use machine learning only for very few cases:

    Q: "Is Tesla using machine learning within its manufacturing, design
    or any other engineering processes?"
    
    Elon: "I discourage use of machine learning, because it's really
    difficult. Unless you have to use machine learning, don't do it. It's
    usually a red flag when somebody is saying 'We wanna use machine
    learning to solve this task'. I'm like: That sounds like bullshit.
    99.9% of the time you don't need it."

https://www.youtube.com/watch?v=j0z4FweCy4M&t=9307s

CharlesW · on Aug 28, 2021

IMO that first paragraph is great, especially for readers who may not have your level of industry knowledge and technical acumen. It efficiently contextualizes the article and addresses a common complaint that I often see even on HN — the failure to clearly answer "What is this and why does this matter?"

TekMol · on Aug 28, 2021

So you tell me the readers of a hardware site who click on a title "Tesla Packs 50 Billion Transistors Onto D1 Dojo Chip" hear for the first time about the term Artificial Intelligence?

CharlesW · on Aug 28, 2021

The article doesn't explain or even define "AI", so I'm going to respectfully disagree with your premise.

I understand that you consider the writing objectively "hostile", but the simpler explanation is that you're just not the audience.

SmellTheGlove · on Aug 28, 2021

It reads like Bart Simpson’s report on Libya.

jliptzin · on Aug 28, 2021

Definitely not written by a native English speaker, and if it was, that person needs to get a new career.

__warlord__ · on Aug 29, 2021

Honestly, why? Not everyone start as an expert.

Lio · on Aug 28, 2021

This is made using TSMC's 7nm fab process so surly the number of transistors in this chip is either enabled or limited by that process, isn't it?

Honest question, how much is chip design a factor separate to fab process?

tlb · on Aug 28, 2021

Density is partly a function of the type of circuit. Memory is denser than random logic, for instance. Interconnect eats a lot of area and reduces density.

This chip is largely memory and multipliers, both of which are pretty dense.

Fab processes improve over time to have higher density and lower defect rate (which allows bigger chips while getting acceptable yield). So it's not surprising to see a chip on the same node but shipping a year or 2 later (than Ampere) having more transistors.

Lio · on Aug 28, 2021

Thank you, that's really interesting. So good design will reduces things like interconnect to improve density.

greesil · on Aug 28, 2021

I'm always curious about the decision-making progress when someone decides to make their own ASIC when there are somewhat reasonable commercial alternatives. What was the advantage here for Tesla?

ttul · on Aug 28, 2021

Rolling your own ASIC makes sense if you need to churn out enormous quantities for your own use. The actual cost of fabrication is largely weighted toward non-recurring engineering costs. Once the printing press fires up, chips are very inexpensive.

Does Tesla need tens of thousands of these things?

minhazm · on Aug 28, 2021

In the demo they said their ExaPOD is 3000 of these chips. They have 10 cabinets with 6 tiles each, each tile has 25 of these D1 chips in it. If they’re successful with this they’ll likely build out multiple of these clusters.

greesil · on Aug 28, 2021

Only if they're going to put it in their cars, right?

mchusma · on Aug 28, 2021

I think I'm this case, the main advantage is controlling their own destiny when it comes to building the types of models they need.

I think in 25% of cases it will not get them significantly more performance vs Nvidia.

There is a 50% chance that they can outperform off the shelf chips by a significant amount to make it maybe worth it. (This is pretty likely because dedicated hardware tends to outperform general hardware).

However, there is maybe a 25% risk buying Nvidia doesn't get them there soon.

So building their own chips de-risks the worst case, and it's probably not that much more expensive (at Tesla scale). So seems like a pretty good bet to me.

dragontamer · on Aug 28, 2021

There are plenty of other, innovative companies specializing on FP16 matrix multiplication systolic arrays.

For one, Google TPU. Another: Cerebras wafer scale AI. AMD MI100. Etc etc.

Even if they screwed the pooch with Nvidia, there are plenty of competitors in this space.

Now Tesla has to build its own software stack for large scale distributed learning, which might be harder than the chip design.

Is Tesla really the kind of company that wants to carry the expensive loadstone of training and inference software + hardware?

It's not like PyTorch is gonna run on this thing unless they create a fork. And a huge advantage of things like NVidia are NVlink / NVswitch. Both hardware, and software, that efficiently distributes data at 600GBps across your GPU clusters.

panick21 · on Aug 28, 2021

> Is Tesla really the kind of company that wants to carry the expensive loadstone of training and inference software + hardware?

Yes. They are very much that kind of company. Tesla has been pushing vertical integration and that is very much Elon Musk whole approach for most of his companies.

Doing your own battery manufacturing and even supply chain is considerably more expensive and complex compared to making a chip and getting some software developers.

dragontamer · on Aug 28, 2021

Battery manufacturing isn't vertically integrated at all. They use Panasonic cells and are dependent on Panasonic's half of the gigafactory.

panick21 · on Aug 28, 2021

Yes, so far they are doing that. In fact they use a mix of Panasonic, CATL and LG.

However they are working on having their own battery and battery factory design. They are even working on their own cathode manufacturing plants.

They have a very large 'pilot' plant in California to test a completely redesigned battery manufacturing system. They are actively building additionally battery factories in Austin and Berlin and have a equipment on order for these factories. For the Battery factory in Berlin they have received funding from the European Commission for this.

Tesla will not be fully vertically integrated but they will provide a large parts of their own cells and likely a increasing share over the coming years.

dragontamer · on Aug 29, 2021

> However they are working on having their own battery and battery factory design.

There's no realistic plan for them to get off of Panasonic's cells. They have a purchase agreement, and Panasonic owns half the Nevada Gigafactory.

They have to wait for Panasonic to sell their half of the Gigafactory back to them, during which they'd not have any cell production going on. They basically need to build another US-factory before they even have a chance to become vertically integrated in the USA.

panick21 · on Aug 29, 2021

They don't want to stop working with suppliers. Batteries are massively constraint, they will buy everything they can from LG, CATL and Panasonic. That is still not enough, that is why they are also building their own.

Panasonic in-fact just added another line in Nevada.

These cells are required for Model Y/3. The new cells that they produce themselves will be for some Model Y, Cybertruck and Semi.

I'm not sure why you insist on disagreeing with a simple fact, Tesla is building its own battery plants. They have one battery factory in the US and are already building a second one. This is literally a fact.

londons_explore · on Aug 28, 2021

I see only two reasons to do it yourself with multiple very capable third parties out there which pretty much exactly match the requirements:

1) You hope to make it available to external customers and turn it into a scale business. Your internal use is just the first customer. You need your own hardware so you can keep costs down at large scales.

2). Someone's pet project was to design their own ML ASIC and they thought it would look very good on their CV, and the CEO took the bait.

Hopefully it's the first case!

mschuster91 · on Aug 28, 2021

There's a third and fourth reason: the need to keep the secret sauce secret and keep others from replicating it, and that existing vendors aren't flexible enough to offer what they want.

For all that Elon Musk values openness in some areas (e.g. putting Hyperloop into public domain), he prefers keeping stuff as vertically integrated as possible for everything he deems to be essential to the business, for maximum control.

Tesla has stakes in lithium mining operations, SpaceX has their own metallurgy team and IIRC also a foundry, and while they are using a Liebherr crane at the moment they are thinking about building their own. And for that, it makes sense - SpaceX is only one of Liebherr's customers while SpaceX depends on a crane that fits their needs - so either they get Liebherr to customize their crane or they build their own.

taylorportman · on Aug 28, 2021

With the recent buzz about AI semiconductor design and drive for domestic semiconductor manufacture they are likely positioning themselves to be capable of investing in the next generation of fabrication. Leverage with the government and potential funding.

option · on Aug 28, 2021

they aren’t manufactured domestically. TSMC is in Taiwan. I doubt very much that Tesla will want to own fabs.

tantony · on Aug 28, 2021

Tesla has previously used Samsung's fab in Texas for their HW3 chip. Is there any source saying D1 is built by TSMC instead?

systemvoltage · on Aug 28, 2021

I wonder if Elon has plans to tackle semiconductor manufacturing and re-thinking the entire industry from bottom-up? I've worked in semiconductor industry and it is super old school stuck in the cold-war era management, procedures and culture.

minhazm · on Aug 28, 2021

Tesla actually has a lot of expertise in chip design in Pete Bannon and formerly Jim Keller. I think most people know who Jim Keller is, but if not you can read his wikipedia[1]. Pete Bannon is also an industry giant and worked with Jim Keller at PA Semi and subsequently Apple on their A series chips. These two have decades of experience designing chips that went into tens of millions of devices. Tesla’s FSD computer is in hundreds of thousands of cars. They know what they’re doing.

https://en.wikipedia.org/wiki/Jim_Keller_(engineer)

mupuff1234 · on Aug 28, 2021

The same Wikipedia page also states that Jim Keller left Tesla awhile ago.

millerm · on Aug 28, 2021

FSD computer is in over a million cars, btw. It’s still hundreds of thousands, but more like one thousand thousand.

gautamcgoel · on Aug 28, 2021

Question: how many chips does Tesla need to buy in order to get a reasonable unit price per chip? Obviously <10k is too small, but is 100k reasonable? 1M?

ehsankia · on Aug 28, 2021

Is that only considering the price per cheap deal they get from TSMC, or also including the cost of d&d?

gautamcgoel · on Aug 28, 2021

I'd be interested in both numbers. D&D = design and ?

adventured · on Aug 28, 2021

Design and development.

m3kw9 · on Aug 28, 2021

Black box numbers would be better in terms of physical size, power usage and comparable training/inference times. Everything else is hype.

sonium · on Aug 28, 2021

The whole point of the die on silicone seems to be that this maximizes the interface bandwidth and minimize latency between the dies. If this true the next step would be to bring the multi die modules as close as possible in three dimensions to ultimately build a borg-cube like structure in zero-g with a power source at its core.

mrtnmcc · on Aug 28, 2021

I wonder how their neural network structures informed the hardware design, such as the dimensions of tensor products. Or is Dojo trying for as general purpose ML as possible? I imagine there is a tension between software and hardware teams where Karpathy's team is always changing things while the hardware team wants specs/reqs.

The "tiles of tiles" chip architecture seems like an Elon-obvious, let's just scale what we have approach. Do their neural networks map to that multiscale tiling well?

jstandard · on Aug 28, 2021

Non-hardware person here. How does the D1 compare to Cerebras WSE-2 wafer chip with 2.6 trillion transistors?

The WSE2 is much larger obviously, but I would also think it can result in a large performance boost given everything is on a single chip.

throwaway4good · on Aug 28, 2021

What process are these chips made with?

It says TSMC 7nm - is that DUV or EUVL?

ttul · on Aug 28, 2021

I don’t believe 7nm used EUVL, which would keep the cost down, relatively speaking.

Dunedan · on Aug 28, 2021

TSMC offers both DUV (N7, N7P) and EUV (N7+) for 7nm [1].

[1]: https://en.wikichip.org/wiki/7_nm_lithography_process#TSMC

throwaway4good · on Aug 28, 2021

I am curious if euvl (and associated high transistor density) makes sense for this type of processor or it simply would run too hot?

tromp · on Aug 28, 2021

An extremely deep question...

jeffbee · on Aug 28, 2021

Is that a lot?

Someone · on Aug 28, 2021

When comparing it to other large designs, I think it’s not exceptional, but also not in the back of the pack. This die is 645mm², or a square inch. We could create a wafer that size in the 1960s (https://en.wikipedia.org/wiki/Wafer_(electronics)#Standard_w.... Note these are for circular wafers, so a 1 inch wafer is about ¾ square inch), so in that sense, it isn’t a surprise that we can make such a chip.

We couldn’t put 50B transistors on a square inch in the 1960s, though. We can now. https://en.wikipedia.org/wiki/Transistor_count lists several larger designs.

So, the engineering is impressive, but not spectacular.

Also, this being a grid of interconnected CPUs means the design is simpler than a single design filling the entire die would be. It’s ‘just’ repeating the same design over and over (possibly with some small variations near the edge)

Of course looking at it without knowledge of the state of the art it is astounding that we can even think of constructing machines with 50 billion working parts

dasudasu · on Aug 28, 2021

It's not wafer size, but reticle size which is the limit. 300 mm (diameter) is standard now, yet reticle size is only a fraction of that.

throwaway4good · on Aug 28, 2021

The Apple M1 chip, which is much smaller and has lower power consumption, has 16B.

senectus1 · on Aug 28, 2021

worth pointing out that Dojo is meant to do one thing and one thing only... ML

jeffbee · on Aug 28, 2021

It can multiply and add!

shadilay · on Aug 28, 2021

M1 is on 5nm.

coronadisaster · on Aug 28, 2021

Is that a lot?

throwaway4good · on Aug 28, 2021

jeffbee · on Aug 28, 2021

The nvidia A100 is larger and has 56B.