Syllabus

Day 1 (AUG-22): Introduction to High Performance Networking
  • Opening and welcome message
  • Trends in computer architecture and network-based computing
  • CPUs, GPUs, Accelerators, FPGAs

Day 2 (AUG-23): Introduction to InfiniBand Network Technology and RDMA
  • OSI model and typical SW/HW implementations of various layers of the model
  • Typical bandwidth and latency of TCP vs. offloaded transports (RDMA)
  • Number of CPU cycles available per packet at 100 Gb/s line rate with 64-byte packets
  • The road to User Level Networking
  • Zero copy networking
  • InfiniBand Fundamentals
  • InfiniBand and OSI – the different layers
  • InfiniBand Advanced Capabilities (at a glance)
  • InfiniBand Software Interfaces
  • OS bypass fundamentals
  • HW Context per app: Isolation, scalability
  • Transport offload
  • Memory translation
  • Sync. POSIX copy semantics vs. Async. zero-copy
  • Implications for memory management
  • Implications for memory overcommit
  • The Verbs channel provider model (see the first sketch after this day's list)
  • QPs and WQEs
  • CQs and CQEs
  • Memory registration
  • Shared receive queues
  • Arming and signaling
  • EXERCISE 1: COMPARATIVE ANALYSIS BETWEEN INFINIBAND AND IP (see the sketches after this list)
  • Using the TCP socket interface to write a p2p benchmark application
  • Using IB verbs to write a p2p benchmark application
  • Measure performance (ops/sec) and compare different aspects
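
    The sketch below is a minimal illustration of the verbs objects listed above (a protection
    domain, a registered memory region, a completion queue, and a reliable-connected queue pair),
    and can serve as the starting skeleton for the IB verbs half of Exercise 1. It is a sketch,
    not a complete benchmark: the out-of-band exchange of QP numbers and the INIT->RTR->RTS
    transitions are only noted in comments, and error handling is abbreviated.

      /* Minimal libibverbs setup sketch.
         Compile: gcc verbs_setup.c -o verbs_setup -libverbs */
      #include <stdio.h>
      #include <stdlib.h>
      #include <infiniband/verbs.h>

      int main(void)
      {
          int num = 0;
          struct ibv_device **devs = ibv_get_device_list(&num);
          if (!devs || num == 0) { fprintf(stderr, "no RDMA devices\n"); return 1; }

          struct ibv_context *ctx = ibv_open_device(devs[0]);
          if (!ctx) { fprintf(stderr, "open failed\n"); return 1; }
          struct ibv_pd *pd = ibv_alloc_pd(ctx);        /* protection domain */

          /* Memory registration: pins the buffer and yields an lkey/rkey
             the HCA uses for local and remote access */
          size_t len = 4096;
          void *buf = malloc(len);
          struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                         IBV_ACCESS_LOCAL_WRITE |
                                         IBV_ACCESS_REMOTE_READ |
                                         IBV_ACCESS_REMOTE_WRITE);

          /* Completion queue: work completions (CQEs) are reaped from
             here with ibv_poll_cq() */
          struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);

          /* Queue pair: work requests (WQEs) are posted to its send and
             receive queues */
          struct ibv_qp_init_attr attr = {
              .send_cq = cq, .recv_cq = cq,
              .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                       .max_send_sge = 1, .max_recv_sge = 1 },
              .qp_type = IBV_QPT_RC,                    /* reliable connected */
          };
          struct ibv_qp *qp = ibv_create_qp(pd, &attr);
          if (!pd || !mr || !cq || !qp) { fprintf(stderr, "setup failed\n"); return 1; }
          printf("created RC QP 0x%x, MR rkey 0x%x\n", qp->qp_num, mr->rkey);

          /* A real p2p benchmark would now exchange QP number/LID/GID out
             of band (e.g., over TCP), move the QP through INIT->RTR->RTS
             with ibv_modify_qp(), post WQEs with ibv_post_send() and
             ibv_post_recv(), and poll the CQ in a timed loop. */

          ibv_destroy_qp(qp); ibv_destroy_cq(cq); ibv_dereg_mr(mr);
          free(buf); ibv_dealloc_pd(pd); ibv_close_device(ctx);
          ibv_free_device_list(devs);
          return 0;
      }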
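
    For the socket half of Exercise 1, a bare-bones ping-pong sketch like the following measures
    round trips per second over TCP. PORT, MSG_SIZE, and ITERS are arbitrary placeholder choices,
    and error handling is omitted for brevity.

      /* Bare-bones TCP ping-pong benchmark sketch.
         Server:  ./pingpong          Client:  ./pingpong <server-ip>
         Compile: gcc -O2 pingpong.c -o pingpong */
      #include <arpa/inet.h>
      #include <netinet/in.h>
      #include <netinet/tcp.h>
      #include <stdio.h>
      #include <sys/socket.h>
      #include <time.h>
      #include <unistd.h>

      #define PORT     18515
      #define MSG_SIZE 64
      #define ITERS    100000

      /* read() may return short counts on a stream socket; loop to a full message */
      static void read_full(int fd, char *buf, int len)
      {
          ssize_t got = 0, n;
          while (got < len && (n = read(fd, buf + got, len - got)) > 0)
              got += n;
      }

      static void pingpong(int fd, int is_client)
      {
          char buf[MSG_SIZE] = {0};
          int one = 1;
          setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));

          struct timespec t0, t1;
          clock_gettime(CLOCK_MONOTONIC, &t0);
          for (int i = 0; i < ITERS; i++) {
              if (is_client) { write(fd, buf, MSG_SIZE); read_full(fd, buf, MSG_SIZE); }
              else           { read_full(fd, buf, MSG_SIZE); write(fd, buf, MSG_SIZE); }
          }
          clock_gettime(CLOCK_MONOTONIC, &t1);
          double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
          printf("%.0f round trips/sec, %.2f us per half round trip\n",
                 ITERS / sec, sec / ITERS / 2 * 1e6);
      }

      int main(int argc, char **argv)
      {
          struct sockaddr_in addr = { .sin_family = AF_INET,
                                      .sin_port   = htons(PORT) };
          if (argc > 1) {                                   /* client */
              int fd = socket(AF_INET, SOCK_STREAM, 0);
              inet_pton(AF_INET, argv[1], &addr.sin_addr);
              connect(fd, (struct sockaddr *)&addr, sizeof(addr));
              pingpong(fd, 1);
              close(fd);
          } else {                                          /* server */
              int lfd = socket(AF_INET, SOCK_STREAM, 0), one = 1;
              setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
              addr.sin_addr.s_addr = htonl(INADDR_ANY);
              bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
              listen(lfd, 1);
              int fd = accept(lfd, NULL, NULL);
              pingpong(fd, 0);
              close(fd); close(lfd);
          }
          return 0;
      }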

Day 3 (AUG-24): Collective Communication
  • Communication semantics
  • Channel Interface
  • Send / Receive
  • RDMA
  • Atomics
  • Design patterns
  • QPs and WQEs
  • Eager vs. rendezvous
  • Read-mostly transactions
  • Collective Communication Overview
  • An introduction to collective communication
  • MPI collective vs. AI collective vs. PGAS collective
  • Algorithms and optimizations in collective communication
  • Allreduce – ring vs. tree (see the sketch after this day's list)
  • Collective communication – network offloads
  • CORE-direct
  • Persistent Communication Offload
  • SHARP
  • State-of-the-art existing libraries
  • NCCL, Gloo, MPI
  • EXERCISE 2 (OPTIONAL): KEY-VALUE STORE OVER RDMA VERBS (see the sketch after this list)
  • Using RDMA verbs to write a client-server application
  • Server keeps an in-memory key-value table; clients read/write key-value pairs
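
    To make the ring-allreduce structure above concrete, here is a small single-process
    simulation (no network, P in-memory "ranks") of its two phases, reduce-scatter followed by
    allgather; in a real implementation each inner loop body would be a send/receive with a ring
    neighbor. Each rank moves about 2(P-1)/P of its data in total, in 2(P-1) steps, versus the
    roughly log(P) rounds of a tree allreduce.

      /* Single-process simulation of ring allreduce (sum). */
      #include <stdio.h>

      #define P 4                  /* simulated ranks */
      #define N 8                  /* elements per rank (N % P == 0) */

      int main(void)
      {
          double data[P][N];
          int chunk = N / P;

          /* Rank r contributes the value r+1 everywhere, so the expected
             allreduce(sum) result is 1+2+...+P = P*(P+1)/2 in every slot. */
          for (int r = 0; r < P; r++)
              for (int i = 0; i < N; i++)
                  data[r][i] = r + 1;

          /* Phase 1: reduce-scatter. In step s, rank r receives chunk
             (r-s-1) mod P from its left neighbor and accumulates it.
             After P-1 steps, rank r holds the full sum of chunk (r+1) mod P. */
          for (int s = 0; s < P - 1; s++)
              for (int r = 0; r < P; r++) {
                  int left = (r - 1 + P) % P;
                  int c = ((r - s - 1) % P + P) % P;
                  for (int i = 0; i < chunk; i++)
                      data[r][c * chunk + i] += data[left][c * chunk + i];
              }

          /* Phase 2: allgather. In step s, rank r receives the finished
             chunk (r-s) mod P from its left neighbor and overwrites its copy. */
          for (int s = 0; s < P - 1; s++)
              for (int r = 0; r < P; r++) {
                  int left = (r - 1 + P) % P;
                  int c = ((r - s) % P + P) % P;
                  for (int i = 0; i < chunk; i++)
                      data[r][c * chunk + i] = data[left][c * chunk + i];
              }

          /* Every rank should now hold P*(P+1)/2 = 10 in every element. */
          for (int r = 0; r < P; r++)
              printf("rank %d: first element = %g\n", r, data[r][0]);
          return 0;
      }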
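
    One way to start Exercise 2 is to fix a wire format first. The sketch below shows a
    hypothetical request header; the names, layout, and 4 KiB threshold are all invented for
    illustration. Small values travel inline in the send ("eager", cf. the eager vs. rendezvous
    design pattern above), while large values advertise a registered buffer's address and rkey
    so the peer can move them with RDMA Read/Write ("rendezvous").

      /* Hypothetical wire format for the RDMA key-value store. */
      #include <stdint.h>
      #include <stdio.h>
      #include <string.h>

      #define EAGER_MAX 4096            /* assumed eager/rendezvous threshold */

      enum kv_op { KV_GET = 0, KV_SET = 1 };

      struct kv_request {
          uint8_t  op;                  /* KV_GET or KV_SET */
          uint8_t  rendezvous;          /* 1 if value moves via RDMA */
          char     key[64];             /* fixed-size key for simplicity */
          uint32_t value_len;
          /* rendezvous case: where the peer's registered value buffer lives */
          uint64_t remote_addr;
          uint32_t rkey;
          /* eager case: value bytes follow this header on the wire */
      };

      /* A SET with a small value is sent eagerly in one message; a large
         one advertises (addr, rkey) so the server can RDMA-Read it from
         the client's registered memory. */
      static void fill_set(struct kv_request *req, const char *key,
                           uint32_t len, uint64_t addr, uint32_t rkey)
      {
          memset(req, 0, sizeof(*req));
          req->op = KV_SET;
          snprintf(req->key, sizeof(req->key), "%s", key);
          req->value_len = len;
          if (len > EAGER_MAX) {
              req->rendezvous  = 1;
              req->remote_addr = addr;  /* from ibv_reg_mr() on the client */
              req->rkey        = rkey;
          }
      }

      int main(void)
      {
          struct kv_request req;
          fill_set(&req, "user:42", 16, 0, 0);
          printf("SET %s: %s\n", req.key, req.rendezvous ? "rendezvous" : "eager");
          fill_set(&req, "blob:7", 1 << 20, 0xdeadbeef, 0x1234);
          printf("SET %s: %s\n", req.key, req.rendezvous ? "rendezvous" : "eager");
          return 0;
      }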

Day 4 (AUG-25): Unified Communication X (UCX)
  • Introduction to the Unified Collective Communication (UCC) library
  • Working Group status report
  • Architecture – APIs and design goals
  • Unified Communication X (UCX)
  • UCX overview – past, present, and future
  • UCX APIs for HPC
  • UCX APIs for non-HPC applications and use cases
  • UCX architecture and design
  • Code examples
  • UCX internals – UCX implementation details
  • EXERCISE 3: UCX-BASED NETWORK FILE SYSTEM (see the sketch after this list)
  • Using UCX, write a server application and a client library for a remote filesystem
  • The client library exports open/read/write/seek/close, plus connect/disconnect calls to the server
  • Each request selects a protocol based on its size and caching behavior, then accesses the remote file
  • Measure performance (ops/sec) for different access patterns
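
    Every Exercise 3 client and server starts with the same UCP bootstrap: create a context,
    create a worker, and obtain the worker address that peers exchange out of band. A minimal
    sketch of that bootstrap follows, assuming UCX is installed; the feature set and thread mode
    chosen here are illustrative, and a real application would go on to create endpoints with
    ucp_ep_create() and issue tagged or RMA operations.

      /* Minimal UCX (UCP) bootstrap sketch.
         Compile: gcc ucx_init.c -o ucx_init -lucp -lucs */
      #include <stdio.h>
      #include <ucp/api/ucp.h>

      int main(void)
      {
          /* Read UCX configuration from the environment (UCX_* variables) */
          ucp_config_t *config;
          if (ucp_config_read(NULL, NULL, &config) != UCS_OK)
              return 1;

          /* Create a UCP context; features declare what the app will use */
          ucp_params_t params = {
              .field_mask = UCP_PARAM_FIELD_FEATURES,
              .features   = UCP_FEATURE_TAG | UCP_FEATURE_RMA,
          };
          ucp_context_h context;
          ucs_status_t status = ucp_init(&params, config, &context);
          ucp_config_release(config);
          if (status != UCS_OK)
              return 1;

          /* A worker is the progress engine; one per thread is typical */
          ucp_worker_params_t wparams = {
              .field_mask  = UCP_WORKER_PARAM_FIELD_THREAD_MODE,
              .thread_mode = UCS_THREAD_MODE_SINGLE,
          };
          ucp_worker_h worker;
          status = ucp_worker_create(context, &wparams, &worker);
          if (status != UCS_OK) { ucp_cleanup(context); return 1; }

          /* The worker address is what peers exchange out of band to
             create endpoints for the client/server connection */
          ucp_address_t *addr;
          size_t addr_len;
          ucp_worker_get_address(worker, &addr, &addr_len);
          printf("UCP worker ready, address is %zu bytes\n", addr_len);
          ucp_worker_release_address(worker, addr);

          ucp_worker_destroy(worker);
          ucp_cleanup(context);
          return 0;
      }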

Day 5 (AUG-26): RDMA Applications in HPC, Storage & AI
  • The Ohio State University (OSU)
  • RDMA in HPC and AI applications
  • Tsinghua University
  • RDMA practice sharing in HPC and AI competition
  • University of Science and Technology of China (USTC)
  • RDMA practice in storage applications
  • Course summary
  • RDMA Programming Hackathon Q&A

Day 6 & Day 7 (AUG-27/28): RDMA Programming Hackathon

Day 8 (AUG-29): RDMA Programming Hackathon & Interview
  • RDMA programming hackathon
  • Interview (10 min. presentation and 10 min. Q&A)
  • Final results will be announced at China SC2020 in Beijing
  • Workshop closing
  • Conclusion

    RDMA is becoming a key technology for optimizing the applications that run in modern HPC, AI, and data centers, and tuning a system for the highest utilization is the best way to protect the investment in it. We welcome students who want to push application performance to its limits to register and participate in this Advanced Data Center Networks and RDMA Programming Workshop. The workshop is also an opportunity to meet and talk with RDMA industry leaders and experts from the HPC-AI Advisory Council and top universities.

    Reference Materials
    1. IBTA specification: https://www.infinibandta.org/
    2. UCX reference: https://www.openucx.org/ and
       https://github.com/openucx/ucx/releases/
    3. UCC reference: https://www.ucfconsortium.org/projects/ucc/