Some of what I do... AI Clusters
It’s not often that I get to talk about what I do, but luckily for me, this is one of those times. I run a team of tooling and systems engineers. What do we do? We work on future tooling and hardware platforms for our Site Operations teams at Meta. It’s a challenging and rewarding space, and I don’t say that sarcastically. At times, some of the problems we are trying to solve seem, quite simply, silly to even attempt. Yet we attempt them anyway, and we are often very successful.
Here is one of those instances: we were part of a large effort to get the hardware, and some of the support tooling around it, fixed up to support the servers. Running large AI clusters is breaking some of the ways that we have operated for as long as I have been at the company. This is a good thing, as the only constant at Meta is change.
You can read about some of the cool hardware and networks that were deployed to support these large AI clusters at Meta here:
https://www.datacenterdynamics.com/en/news/meta-reveals-details-of-two-new-24k-gpu-ai-clusters/
https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/