Chairs: Gilad Shainer and Sadaf Alam
The rapid increase in the performance of graphics hardware, coupled with recent improvements in its programmability, has made graphics accelerators a compelling platform for computationally demanding tasks in a wide variety of application domains. Because of the great computational power of the GPU, general-purpose GPU (GPGPU) computing has proven valuable in various areas of science and technology. The modern GPU is a highly data-parallel processor, optimized to provide very high floating point arithmetic throughput for problems that fit a single program, multiple data (SPMD) model. On a GPU, this model works by launching thousands of threads that run the same program on different data. The GPU's ability to switch rapidly between threads, combined with the high number of threads, keeps the hardware busy: while some threads wait on memory, others execute. This effectively hides memory latency and, together with the several layers of very high bandwidth memory available in modern GPUs, improves GPU performance. The performance advantage of GPUs enables HPC systems to deliver the performance demanded by ever-increasing simulation complexity.
The mission of the HPC|GPU working group is to explore usage models of GPU components as part of next-generation compute environments, and potential optimizations for GPU-based computing.
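The single program, multiple data launch pattern described above can be illustrated with a small CPU-side model. This is only a sketch in plain Python, not GPU code: the names `saxpy_kernel` and `launch`, the block size of 256, and the problem size are all assumptions chosen for illustration, and the loops run sequentially where a GPU would schedule the threads concurrently.

```python
# CPU-side model of an SPMD launch; illustrative only, not GPU code.

def saxpy_kernel(thread_id, a, x, y, out):
    """Body run by every logical thread: same program, different element."""
    if thread_id < len(x):  # threads past the end of the data do nothing
        out[thread_id] = a * x[thread_id] + y[thread_id]

def launch(kernel, num_blocks, threads_per_block, *args):
    """Run the kernel once per logical thread (sequentially in this model)."""
    for block in range(num_blocks):
        for thread in range(threads_per_block):
            global_id = block * threads_per_block + thread
            kernel(global_id, *args)

n = 1000
x = [float(i) for i in range(n)]
y = [1.0] * n
out = [0.0] * n

threads_per_block = 256  # assumed block size
# Round up so that every element is covered by some thread.
num_blocks = (n + threads_per_block - 1) // threads_per_block

launch(saxpy_kernel, num_blocks, threads_per_block, 2.0, x, y, out)
```

Each logical thread derives its own index and touches only its own element, which is what lets the hardware run thousands of such threads independently and swap between them while memory requests are in flight.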
Mellanox InfiniBand / NVIDIA Tesla GPUDirect technology
While GPUs have been shown to provide worthwhile performance acceleration, with benefits to price/performance and power/performance, several areas of GPU-based clusters could be improved to provide higher performance and efficiency. One issue with deploying clusters of multi-GPU nodes involves the interaction between the GPU and the high-speed network, in particular the way GPUs use the network to transfer data between them. Before the GPUDirect technology, a performance issue existed with the user-mode DMA mechanisms used by GPU devices and RDMA-based interconnects. The issue was the lack of a software/hardware mechanism for “pinning” pages of virtual memory to physical pages that can be shared by both the GPU devices and the networking devices. In general, GPUs use pinned buffers in host memory to increase DMA performance, because this eliminates the need for intermediate buffers or for pinning and unpinning regions of memory on the fly. The use of pinned memory buffers can allow well-written code to achieve zero-copy message passing semantics via RDMA. The lack of a mechanism for managing memory pinning between user-mode accelerator libraries and message passing libraries creates performance issues, because a third device, the host CPU, must move the data between the separate GPU and InfiniBand pinned memory regions.
A better communication mechanism between GPUs and RDMA interconnect devices would perform DMA and RDMA operations directly between GPUs and bypass the host entirely. Such an interface could conceivably allow RDMA from one GPU device directly to another GPU on a remote host. An intermediate solution can use host memory for the data transactions, but it requires eliminating the host CPU’s involvement by having the acceleration devices and the RDMA networking adapters share the same pinned memory.
The new hardware/software mechanism is called GPUDirect-1. It eliminates the need for the CPU to be involved in the data movement, which not only enables higher GPU-based cluster efficiency but also paves the way for the creation of "floating point services". GPUDirect-1 is based on a new interface between the NVIDIA GPU and the InfiniBand device that enables both devices to share pinned memory buffers, and enables the GPU to notify the network device to stop using the pinned memory so that it can be destroyed. This new communication interface allows the GPU to maintain control of the user-space pinned memory, and eliminates the issues of data reliability. GPUDirect-1 enables faster communication between remote GPUs (GPUs located on separate servers). NVIDIA also released GPUDirect 2, which enables a direct connection between GPUs on the same server and the same IOH.
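The difference between the two host-memory data paths can be sketched as a toy model that simply counts host CPU copies. Everything here is hypothetical: `PinnedBuffer`, `send_with_staging`, and `send_with_shared_buffer` are illustrative names, not part of any NVIDIA or Mellanox API, and the real mechanism lives in the device drivers rather than in application code.

```python
# Toy model of the two data paths; counts host CPU copies only.

class PinnedBuffer:
    """Stand-in for a page-locked region of host memory (hypothetical)."""
    def __init__(self, size):
        self.data = bytearray(size)

def send_with_staging(payload, gpu_buf, nic_buf):
    """Separate pinned regions: the host CPU must copy between them."""
    n = len(payload)
    cpu_copies = 0
    gpu_buf.data[:n] = payload           # GPU DMA into its own pinned region
    nic_buf.data[:n] = gpu_buf.data[:n]  # host CPU copy between regions
    cpu_copies += 1
    sent = bytes(nic_buf.data[:n])       # network adapter reads its region
    return sent, cpu_copies

def send_with_shared_buffer(payload, shared_buf):
    """Shared pinned region: both devices use the same pages."""
    n = len(payload)
    cpu_copies = 0
    shared_buf.data[:n] = payload        # GPU DMA into the shared region
    sent = bytes(shared_buf.data[:n])    # network adapter reads the same region
    return sent, cpu_copies

payload = b"halo exchange data"
staged, staged_copies = send_with_staging(
    payload, PinnedBuffer(64), PinnedBuffer(64))
shared, shared_copies = send_with_shared_buffer(payload, PinnedBuffer(64))
```

Both paths deliver the same bytes. The shared-buffer path does so without the intermediate host copy, which is the step the text identifies as the source of the performance issue.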
A performance evaluation of the GPUDirect-1 technology can be seen in the graphs below. It uses Amber, a molecular dynamics software package that is one of the most widely used programs for biomolecular studies and has an extensive user base. GPUDirect-1 enables up to a 33% performance increase on an 8-GPU system (NVIDIA Fermi C2050 GPUs).
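To put the 33% figure in terms of run time, the short calculation below converts a performance gain into a reduction in wall-clock time. It assumes that "performance" means throughput (work completed per unit of time); the text does not state the metric, so treat the result as an illustration rather than a measured value.

```python
# Convert a throughput gain into a run-time reduction.
# Assumption: "performance" means work completed per unit of time.

performance_gain = 0.33                   # "up to 33%" from the text
speedup = 1.0 + performance_gain          # 1.33x throughput
relative_runtime = 1.0 / speedup          # fraction of the original run time
runtime_reduction = 1.0 - relative_runtime

print(f"relative run time: {relative_runtime:.3f}")
print(f"run-time reduction: {runtime_reduction:.1%}")
```

Under that assumption, a 33% gain in throughput corresponds to roughly a quarter less run time for the same job, not a third.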