Virtual Event
August 17–August 20, 2020

The Sched app allows you to build your schedule but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon Europe 2020 - Virtual to participate in the sessions. If you have not registered but would like to join us, please register here.

Wednesday, August 19 • 16:55 - 17:30
Production Multi-node Jobs with Gang Scheduling, K8s, GPUs and RDMA - Madhukar Korupolu & Sanjay Chatterjee, NVIDIA


As DL and ML applications grow in scale, distributed execution of jobs across multiple nodes becomes increasingly critical for solving bigger problems faster, as illustrated by the recent MLPerf results. However, running such workloads in a production K8s cluster shared by multiple jobs and users poses several challenges.

In this talk, we’ll give an overview of this area, including distributed TensorFlow, PyTorch, Horovod, and MPI, and the use of GPU nodes with NCCL and RDMA for accelerated performance. We’ll describe our end-to-end flow for multi-node jobs in K8s, including gang scheduling, quotas, fairness, and backfilling implemented in our custom scheduler for GPUs. Our cluster includes high-speed networking through RoCE and SR-IOV / Multus CNI. We’ll share our design choices, lessons learned, and operational experience, including failure handling, performance, and telemetry.
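
For readers unfamiliar with the terms, the sketch below illustrates gang scheduling (a multi-node job starts only if all of its workers can be placed at once) and a naive form of backfilling (smaller jobs run in capacity that a blocked larger job cannot use). It is not the speakers' scheduler, whose design is not described in this abstract. The `Job` type, the `gang_schedule` and `schedule_queue` functions, and the node names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    workers: int          # pods that must all start together
    gpus_per_worker: int  # GPUs each worker needs on a single node (> 0)


def gang_schedule(job, free_gpus):
    """Return {node: workers_placed} if every worker fits, else None.

    free_gpus maps node name -> free GPU count. It is updated only when
    the whole gang fits, so a failed attempt reserves nothing.
    """
    remaining = job.workers
    plan = {}
    for node, free in free_gpus.items():
        if remaining == 0:
            break
        fit = min(free // job.gpus_per_worker, remaining)
        if fit > 0:
            plan[node] = fit
            remaining -= fit
    if remaining > 0:
        return None  # all-or-nothing: no partial placement
    for node, count in plan.items():
        free_gpus[node] -= count * job.gpus_per_worker
    return plan


def schedule_queue(queue, free_gpus):
    """Try jobs in queue order; jobs that do not fit are skipped,
    which lets later, smaller jobs backfill the idle GPUs."""
    placed = {}
    for job in queue:
        plan = gang_schedule(job, free_gpus)
        if plan is not None:
            placed[job.name] = plan
    return placed


if __name__ == "__main__":
    free = {"node-a": 8, "node-b": 8}
    queue = [Job("big", workers=3, gpus_per_worker=8),
             Job("small", workers=2, gpus_per_worker=4)]
    print(schedule_queue(queue, free))
```

With two 8-GPU nodes, the three-worker job cannot be placed as a gang and holds no GPUs, so the smaller job runs instead. This naive version can starve large jobs indefinitely; production schedulers typically pair backfilling with reservations, quotas, or fairness policies to prevent that.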

Speakers

Madhukar Korupolu

Distinguished Engineer, NVIDIA
Madhukar is an architect at NVIDIA working on GPU clusters for AI and ML workloads. His areas of interest and experience include AI/ML infrastructure, GPU acceleration, cloud computing, distributed systems, Kubernetes, HPC, and CDNs, with previous stints at Google, IBM, and Akamai. He holds a...

Sanjay Chatterjee

Senior Engineer, NVIDIA
Sanjay Chatterjee is a senior engineer at NVIDIA. He works on runtime system infrastructure and core Kubernetes components to support highly scalable HPC and DL/AI workloads. Previously, he worked on DoE/DARPA-funded research and advanced technology projects for exascale systems. His...



Wednesday August 19, 2020 16:55 - 17:30 CEST
InXpo https://onlinexperiences.com/Launch/Event.htm?ShowKey99259