HPC|Scale Special Interest Subgroup

Chair: Richard Graham

Computational science is the field of study concerned with constructing mathematical models and numerical techniques that represent a range of domains, such as the natural sciences, social sciences and engineering. Such models are run on computational systems to analyze and explore these areas of study. Numerical simulation enables the study of complex phenomena that would be too expensive or dangerous to study by direct experimentation. The quest for ever higher levels of detail and realism in such simulations requires enormous computational capacity and has provided the impetus for breakthroughs in computer algorithms and architectures.

Today’s high-performance computing systems use a diverse set of computing elements, including a range of CPUs, GPUs and other programmable elements, connected by a high-speed, low-latency interconnect. Such interconnects generally fall into two categories: those that offload communication management to the network, and those that use CPU-side, or onloaded, resources for such management. Offloaded networks are also starting to add computational capabilities within the network to better distribute work across the system. With many HPC systems around the world reaching scales on the order of thousands of endpoints, application and system scalability issues come to the forefront of the challenges facing the users and administrators of such systems.

The HPC|Scale working group’s mission is to explore the capabilities of upcoming advanced interconnect technologies and their use in addressing the challenges of large-scale HPC systems.

Base Technologies and Capabilities

Communication Management – Offload and Onload Approaches
For network activity to take place, such activity needs to be initiated, managed, progressed and completed. Such traffic is typically initiated and completed at the application level, whether by an end-user or a system application. However, the approaches to managing and progressing such activity tend to fall into two categories: offloaded and onloaded.

Communication management and progress offload uses dedicated network resources, typically the host network adapter, for such activity. This approach allows communication to progress independent of activity on the compute portion of the system, which forms a good foundation for overlapping communication and computation. Because it resides in the communication infrastructure, it also readily lends itself to the creation of efficient and specialized communication capabilities as part of the network hardware, fitting naturally with available network capabilities.

Communication management and progress onload uses host-side compute elements for such activity. This consumes the same resources that applications use, reducing the compute resources available to the application by either dedicating some of these resources to network activity or sharing them with the running application, which poses a challenge for overlapping communication and computation. In addition, it uses resources intended for computation for communication management purposes, further reducing the efficiency of this approach.
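The onloaded approach can be illustrated with a toy sketch (the class and method names are invented for illustration and do not correspond to any vendor's API): a dedicated host thread drains a send queue so that communication advances while the main thread computes. In an offloaded design, this progress loop would run on the network adapter rather than consuming a host core.

```python
import queue
import threading

class OnloadEngine:
    """Toy model of onloaded communication progress (illustrative only)."""

    def __init__(self):
        self.pending = queue.Queue()   # sends posted by the application
        self.completed = []            # sends that have "gone out on the wire"
        self._stop = threading.Event()
        # Dedicating this thread models spending a host core on progress.
        self.thread = threading.Thread(target=self._progress_loop, daemon=True)
        self.thread.start()

    def post_send(self, dest, payload):
        # Non-blocking from the application's point of view.
        self.pending.put((dest, payload))

    def _progress_loop(self):
        while not self._stop.is_set():
            try:
                msg = self.pending.get(timeout=0.01)
            except queue.Empty:
                continue
            self.completed.append(msg)  # the "wire transfer" happens here

    def finalize(self):
        while not self.pending.empty():
            pass  # wait for outstanding sends to drain
        self._stop.set()
        self.thread.join()

engine = OnloadEngine()
for i in range(4):
    engine.post_send(dest=1, payload=i)
result = sum(range(1000))  # overlapped "computation" on the main thread
engine.finalize()
print(len(engine.completed))  # all four sends progressed off the main thread
```

The overlap works only because a whole core is spent on the progress loop; sharing that core with the computation instead would stall one or the other, which is the trade-off described above.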

Collective Communication Acceleration
Collective communication operations, used by many scientific applications, tend to limit overall parallel application performance and scalability. Computer systems are becoming more heterogeneous, with increasing node and core-per-node counts. Many scientific applications use collective communications to satisfy a range of communication needs, such as determining the magnitude of residual vectors in iterative numerical solvers, performing Fourier transforms, and performing distributed-data reductions. The global nature of these communication patterns makes them a major factor in determining the performance and scalability of simulation codes, with these effects increasing with process count. Developing effective strategies for collective operations is essential for applications to make effective use of available compute resources, and offers the potential for large improvements in application performance. These benefits increase with system complexity and size. However, the algorithmic data dependencies between processes are such that a large application load imbalance may hide much of the benefit of such optimized algorithms.
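To make the scaling behavior concrete, the following sketch simulates recursive doubling, one of the classic allreduce algorithms that such collective optimizations (in software or in hardware) build on. It is a simulation of the communication pattern, not an implementation from any particular library: for p = 2^k processes, every rank holds the full reduction after k exchange rounds rather than p − 1.

```python
def allreduce_recursive_doubling(values):
    """Simulate a sum-allreduce across len(values) 'ranks' via recursive doubling."""
    p = len(values)
    assert p & (p - 1) == 0, "this sketch assumes a power-of-two rank count"
    bufs = list(values)
    distance = 1
    while distance < p:
        # In each round, rank r exchanges partial sums with rank r XOR distance.
        bufs = [bufs[r] + bufs[r ^ distance] for r in range(p)]
        distance *= 2
    return bufs  # every rank now holds the full sum

print(allreduce_recursive_doubling([1, 2, 3, 4, 5, 6, 7, 8]))
# each of the 8 ranks ends with 36, after log2(8) = 3 exchange rounds
```

The data dependency is also visible here: a rank cannot enter round k until its round-k partner has finished round k − 1, which is why a load imbalance among processes can erase the gains of even a well-optimized algorithm.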

Offloading collective communication management to network hardware has been done by several system vendors. Most recently, InfiniBand hardware has been enhanced to support reduction operations in the switches with the SHARP technology [1]. Earlier InfiniBand HCAs added the ability to coordinate activity generated by different network queue pairs [2]. Cray has supported small reduction operations in their Aries network [3], and Quadrics [4] supported collective operations in some of their network devices.

Point-to-Point Acceleration
Point-to-point network operations, whether one- or two-sided, are the most widely used form of network communication. Thus, accelerating such communication is essential for obtaining well-performing applications at scale.

The first optimization step is moving point-to-point communication management to the network, giving applications the support needed for overlapping communication and computation. Communication libraries can use this support to implement non-blocking communication protocols.

In the scientific HPC community in particular, the Message Passing Interface (MPI) is ubiquitous in application use, with two-sided communication especially common. Therefore, adding network-level support for MPI tag matching provides the support needed by most MPI applications to enable communication-computation overlap.

InfiniBand networks have recently been enhanced to support offloading the MPI tag-matching semantics. In addition, the Portals API defines a tag-matching protocol on which MPI communication can be implemented.
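The semantics being offloaded can be sketched as follows. This is a simplified illustration of MPI-style tag matching, not an actual MPI or Portals implementation, and all names are invented: arriving messages are matched in posting order against a posted-receive queue on (source, tag), with wildcards allowed, and arrivals that match nothing wait on an unexpected-message queue. Performing exactly this matching in the adapter is what lets receives complete without host CPU involvement.

```python
from collections import deque

ANY_SOURCE = -1  # wildcard, analogous to MPI_ANY_SOURCE
ANY_TAG = -1     # wildcard, analogous to MPI_ANY_TAG

class TagMatcher:
    """Toy model of the two matching queues used for MPI-style tag matching."""

    def __init__(self):
        self.posted = deque()      # receives posted by the application
        self.unexpected = deque()  # arrivals with no matching receive yet
        self.delivered = []        # (receive label, payload) completion records

    @staticmethod
    def _matches(recv, source, tag):
        rsrc, rtag, _ = recv
        return rsrc in (ANY_SOURCE, source) and rtag in (ANY_TAG, tag)

    def post_recv(self, source, tag, label):
        # First check for an already-arrived (unexpected) message.
        for i, (msrc, mtag, payload) in enumerate(self.unexpected):
            if self._matches((source, tag, label), msrc, mtag):
                del self.unexpected[i]
                self.delivered.append((label, payload))
                return
        self.posted.append((source, tag, label))

    def arrive(self, source, tag, payload):
        # Match against posted receives in posting order (MPI's ordering rule).
        for i, recv in enumerate(self.posted):
            if self._matches(recv, source, tag):
                del self.posted[i]
                self.delivered.append((recv[2], payload))
                return
        self.unexpected.append((source, tag, payload))

m = TagMatcher()
m.post_recv(source=0, tag=7, label="r0")
m.arrive(source=0, tag=9, payload="late")          # no match: goes unexpected
m.arrive(source=0, tag=7, payload="data")          # matches the posted receive
m.post_recv(source=ANY_SOURCE, tag=9, label="r1")  # drains the unexpected queue
print(m.delivered)  # [('r0', 'data'), ('r1', 'late')]
```

The in-order traversal of both queues is what makes the matching logic hard to parallelize on the host, and is a large part of why moving it into the adapter pays off.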

Network Efficiency
As job scale increases and the number of jobs running on a given system grows, the potential for interference between data streams increases, reducing network efficiency. Such interference generally cannot be predicted, so methods for handling overlapping communication patterns as they occur are essential for achieving high levels of network utilization. Adaptive routing is one such mechanism, and such mechanisms can increase overall system utilization considerably [5].
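The core idea behind adaptive routing can be shown with a toy model (port names and the load metric are invented for illustration): when several equal-cost output ports lead toward a destination, an adaptive switch picks the least-loaded one per packet, while deterministic routing always uses the same port and can create a hot spot when streams collide.

```python
def route_packets(n_packets, ports, adaptive):
    """Distribute packets over equal-cost output ports; return per-port load."""
    load = {p: 0 for p in ports}
    for _ in range(n_packets):
        if adaptive:
            chosen = min(ports, key=lambda p: load[p])  # congestion-aware choice
        else:
            chosen = ports[0]  # deterministic routing always picks one path
        load[chosen] += 1
    return load

static_load = route_packets(120, ["p0", "p1", "p2"], adaptive=False)
adaptive_load = route_packets(120, ["p0", "p1", "p2"], adaptive=True)
print(static_load)    # {'p0': 120, 'p1': 0, 'p2': 0}: a single hot spot
print(adaptive_load)  # {'p0': 40, 'p1': 40, 'p2': 40}: load spread evenly
```

Real implementations use hardware congestion signals rather than a simple queue-depth counter, but the effect is the same: interfering streams are spread across the available paths as the interference occurs, rather than being predicted in advance.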

Network Resilience
The increased component count of growing networks makes the ability to handle network failures without user intervention critical to running production systems. Transparency to the end user and system administrator is a key ingredient in making such capabilities generally useful for increasing overall system utilization.

In current system architectures, hosts are interconnected by a network so that data can be exchanged between them. Each host is connected to this network by a host network adapter, and these adapters are connected by cables to forwarding entities within the network, such as switches. Routing of network traffic between the network adapters is managed by routing algorithms.

When part of the interconnect fails, such as a network adapter, network cable or switch, an alternative route between endpoints needs to be selected so that applications can continue to progress their work. If alternative routes do not exist, the applications that require such routes will cease to progress. When alternative routes do exist, network-level protocols allow traffic to be re-routed around the failed components. Data that is in flight, and may not yet have been received at the destination, can be routed without loss over alternative network paths by algorithms managed at the network level. With the network routing algorithm adjusted to ignore the failed components, new traffic avoids the failed portions of the network. Such approaches require no external intervention, user or administrative, to change the routing, and are therefore very effective at keeping the overall system functioning in the presence of network failures, thus increasing system utility.
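The re-routing step above can be sketched as a shortest-path recomputation that excludes failed links. The topology, node names and BFS routing below are invented for illustration (real fabric managers use topology-aware routing engines, not plain BFS), but the structure of the decision is the same: if a path avoiding the failure exists, use it; if not, traffic to that destination cannot progress.

```python
from collections import deque

def shortest_path(links, src, dst, failed=frozenset()):
    """BFS over undirected links, skipping any link marked as failed."""
    adj = {}
    for a, b in links:
        if (a, b) in failed or (b, a) in failed:
            continue  # the routing algorithm ignores failed components
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    seen, q = {src: None}, deque([src])
    while q:
        node = q.popleft()
        if node == dst:
            path = []
            while node is not None:       # walk predecessors back to src
                path.append(node)
                node = seen[node]
            return path[::-1]
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen[nxt] = node
                q.append(nxt)
    return None  # no alternative route: traffic to dst cannot progress

# Toy two-level fabric: hosts behind leaf switches, two redundant spines.
links = [("h0", "leaf0"), ("h1", "leaf1"),
         ("leaf0", "spine0"), ("leaf0", "spine1"),
         ("leaf1", "spine0"), ("leaf1", "spine1")]

print(shortest_path(links, "h0", "h1"))                                  # via spine0
print(shortest_path(links, "h0", "h1", failed={("leaf0", "spine0")}))    # re-routed via spine1
```

The redundant spine is what makes the transparent recovery possible; with a single spine, the same link failure would leave h0 and h1 disconnected and the function would return None.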

[1] Graham, R.L., Bureddy, D., Lui, P., Rosenstock, H., Shainer, G., Bloch, G., Goldenberg, D., Dubman, M., Kotchubievsky, S., Koushnir, V., Levi, L., Margolin, A., Ronen, T., Shpiner, A., Wertheim, O., Zahavi, E.: Scalable hierarchical aggregation protocol (SHArP): A hardware architecture for efficient data reduction. In: Proceedings of the First Workshop on Optimization of Communication in HPC. pp. 1–10. COM-HPC ’16, IEEE Press, Piscataway, NJ, USA (2016), DOI: 10.1109/COM-HPC.2016.6

[2] Graham, R.L., Poole, S., Shamis, P., Bloch, G., Bloch, N., Chapman, H., Kagan, M., Shahar, A., Rabinovitz, I., Shainer, G.: Connectx-2 infiniband management queues: First investigation of the new support for network offloaded collective operations. In: Proceedings of the 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing. pp. 53–62. CCGRID ’10, IEEE Computer Society, Washington, DC, USA (2010), DOI: 10.1109/CCGRID.2010.9

[3] Alverson, B., Froese, E., Kaplan, L., Roweth, D.: Cray XC Series Network. https://www.cray.com/sites/default/files/resources/CrayXCNetwork.pdf

[4] Petrini, F., Frachtenberg, E., Hoisie, A., Coll, S.: Performance Evaluation of the Quadrics Interconnection Network. In: Proceedings of the 15th International Parallel and Distributed Processing Symposium (2001), pp. 1698–1706

[5] Vazhkudai, S.S., et al.: The Design, Deployment, and Evaluation of the CORAL Pre-Exascale Systems. In: Proceedings of SC18: The International Conference for High Performance Computing, Networking, Storage and Analysis, Dallas, TX, November 2018