Syllabus
Day 1: High Performance Networking Principles
Trends in computer architecture and network-based computing
CPUs, GPUs, Accelerators, FPGAs
Networking technologies and trends
Ethernet and RoCE
InfiniBand
Proprietary networks
OSI model and typical SW/HW implementations of various layers of the model
Typical BW and latency of TCP vs. offloaded transports (RDMA)
Number of CPU cycles per packet at 100 Gb/s with 64-byte packets
Breakdown of CPU utilization
Cost of copy
Cost of kernel vs. user
OS bypass fundamentals
HW Context per app: Isolation, scalability
Transport offload
Memory translation
Synchronous POSIX copy semantics vs. asynchronous zero-copy
Implications for memory management
Implications for memory overcommit
The Verbs channel provider model
QPs and WQEs
CQs and CQEs
Memory registration
Shared receive queues
Arming and signaling
Communication semantics
Channel Interface
Send / Receive
RDMA
Atomics
EXERCISE 1: COMPARATIVE ANALYSIS BETWEEN INFINIBAND AND IP
Using TCP socket interface to write a p2p benchmark application
Using IB verbs to write a p2p benchmark application
Measure performance (ops/sec), compare different aspects
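Worked example for the Day 1 item on cycles per packet: a rough sketch of the arithmetic, not course material. It reads the packet size as 64 bytes (the minimum Ethernet frame), adds the standard 20 bytes of per-frame wire overhead (preamble, start-of-frame delimiter, inter-frame gap), and assumes a 3 GHz core clock. The clock rate and all names below are illustrative assumptions.

```python
# Back-of-the-envelope per-packet cycle budget at line rate.
LINK_RATE_BPS = 100e9      # 100 Gb/s link, from the syllabus item
FRAME_BYTES = 64           # minimum-size Ethernet frame (assumed reading of the item)
WIRE_OVERHEAD_BYTES = 20   # preamble (7) + SFD (1) + inter-frame gap (12)
CPU_HZ = 3.0e9             # assumed 3 GHz core clock (illustrative)


def packets_per_second(link_bps, frame_bytes, overhead_bytes):
    """Maximum packet rate the link can carry for a given frame size."""
    bits_on_wire = (frame_bytes + overhead_bytes) * 8
    return link_bps / bits_on_wire


def cycles_per_packet(cpu_hz, pps):
    """Cycles one core may spend per packet to keep up with the link."""
    return cpu_hz / pps


pps = packets_per_second(LINK_RATE_BPS, FRAME_BYTES, WIRE_OVERHEAD_BYTES)
budget = cycles_per_packet(CPU_HZ, pps)
print(f"{pps / 1e6:.1f} Mpps, {budget:.1f} cycles per packet")
```

Under these assumptions a single core has roughly 20 cycles per packet, which is the motivation for the transport offload and OS bypass topics that follow.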
Day 2: High Performance Networking Software Design, Applications & Scalability
Design patterns
QPs and WQEs
Eager vs. rendezvous
Read-mostly transactions
Initiator-target execution semantics
Reactor vs. proactor
Thread-less libraries and progress routines
Task-based scheduling
Registration techniques
Buffer pool
Registration cache
On Demand Paging
Polling vs. interrupts
Optimizations and heuristics
Offloading technologies (Part I)
CORE-Direct (offloading to the NIC)
PeerDirect / GPUDirect
Congestion & Flow Control
Deadline-aware TCP
Multipath
Credit-based Flow Control
App scaling
“Active message” paradigm
Connection multiplexing
Load balancing
Optimizations and heuristics
EXERCISE 2: KEY-VALUE STORE OVER RDMA VERBS
Using RDMA verbs to write a client-server application
Server keeps an in-memory key-value table, clients read/write key-value pairs
Day 3: Unified Communication X (UCX)
Unified Communication X (UCX)
UCX overview – past, present and future
UCX APIs for HPC
UCX APIs for non-HPC applications and use cases
UCX architecture and design
EXERCISE 3: UCX-BASED NETWORK FILE SYSTEM
Using UCX, write a server app and a client library for a remote filesystem
Client library exports open/read/write/seek/close, and connect/disconnect to server
Each request, based on size and caching, selects a protocol and accesses the remote file
Measure performance (ops/sec) for different access patterns
Day 4: Collective Communication
Collective Communication Overview
An introduction to collective communication
MPI collective vs. AI collective vs. PGAS collective
Algorithms and optimizations in collective communication
All reduce – ring vs. tree
Collective communication – network offloads
CORE-Direct
Persistent Communication Offload
SHARP
State-of-the-art existing libraries
NCCL, Gloo, MPI
Introduction to Unified Collective Communication Library (UCC)
Working Group status report
Architecture – APIs and design goals
EXERCISE 4: COLLECTIVE COMMUNICATION OFFLOAD IN HPC AND AI
MPI collective operation in HPC application
AI collective operation in AI application
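Illustration for the Day 4 item on ring vs. tree all-reduce: a minimal single-process simulation of the ring algorithm, offered as a sketch and not as the course's or any library's implementation. Ranks are modeled as list entries, the function name `ring_allreduce` is hypothetical, and the sketch assumes the vector length divides evenly by the number of ranks.

```python
def ring_allreduce(inputs):
    """Simulate a sum all-reduce over a ring; inputs[r] is rank r's vector."""
    n = len(inputs)
    length = len(inputs[0])
    assert length % n == 0, "sketch assumes the vector splits into n equal chunks"
    size = length // n
    # Each rank keeps its own copy of the vector, split into n chunks.
    data = [[list(v[c * size:(c + 1) * size]) for c in range(n)] for v in inputs]

    # Phase 1, reduce-scatter: in step s, rank r sends chunk (r - s) % n to
    # its right neighbor, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        sends = [(r, (r - s) % n, list(data[r][(r - s) % n])) for r in range(n)]
        for r, c, payload in sends:
            dst = (r + 1) % n
            data[dst][c] = [a + b for a, b in zip(data[dst][c], payload)]

    # Rank r now holds the fully reduced chunk (r + 1) % n.
    # Phase 2, all-gather: in step s, rank r forwards chunk (r + 1 - s) % n
    # to its right neighbor, which overwrites its stale copy.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n, list(data[r][(r + 1 - s) % n]))
                 for r in range(n)]
        for r, c, payload in sends:
            data[(r + 1) % n][c] = payload

    # Flatten each rank's chunks back into a single vector.
    return [[x for chunk in rank for x in chunk] for rank in data]


result = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(result)
```

The ring takes 2(n - 1) steps, and each rank sends only 1/n of the vector per step, so it is bandwidth-efficient for large messages. A tree needs about log2(n) steps per phase, which tends to favor small, latency-bound messages.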
Day 5: RDMA Applications in HPC, Storage & AI
The Ohio State University (OSU)
RDMA in HPC and AI applications
Tsinghua University
RDMA practice sharing in HPC and AI competition
USTC
RDMA practice in storage applications
Course summary
RDMA Programming Hackathon Q&A
Day 6: RDMA Programming Hackathon
Day 7: RDMA Programming Hackathon & Interview
RDMA programming hackathon
Interview (5 min. presentation and 5 min. Q&A)
Final results will be announced at China SC2020, Beijing
Workshop closing