HPC|Cloud Special Interest Subgroup

Cloud Computing Will Usher in a New Era of Science Discovery
High-performance computing requires access to a massive number of computers for performing large-scale experiments. Traditionally, these needs have been addressed by local HPC systems, which are not always within reach of every scientist. Cloud computing can provide scientists with a completely new model for using computing infrastructure. Compute resources, storage resources, and applications can be dynamically provisioned on a pay-per-use basis and released once they are no longer needed. Such services are often offered under a Service Level Agreement (SLA), which guarantees the desired Quality of Service (QoS). To support HPC applications, HPC clouds, whether public or private, need to deliver compute performance, latency, and efficiency as close as possible to those of local HPC systems. The motivation of the HPC|Cloud special interest subgroup is to explore the different aspects of high-performance clouds. The HPC|Cloud subgroup includes the following HPC Advisory Council member organizations: AMD, Dell, Mellanox Technologies, Platform Computing, The Federal University of Rio de Janeiro, The 451 Group and VMware.

Computing “in a cloud” typically refers to a hosted computational environment (local or remote) that provides elastic compute and storage services to users on demand. Accordingly, the current usage model of cloud environments is aimed at computational science. Future clouds could serve as environments for distributed science, allowing researchers and engineers to share their data with peers around the globe and letting results that were expensive to obtain be reused in further research projects and scientific discoveries.

To enable the shift to the fourth mode of “science discovery,” cloud environments will need not only to share the data produced by computational science and by observations, but also to provide cost-effective high-performance computing capabilities, similar to those of today’s leading supercomputers, so that the data flood can be analyzed rapidly and effectively. Moreover, an important criterion for clouds is fast provisioning of resources, both compute and storage, so that they can serve many users and many different analyses, and can suspend tasks and quickly bring them back to life. Reliability is another concern: clouds need to be “self-healing,” replacing failing components with spares or on-demand resources to guarantee constant access and resource availability.

Case: From Computational Science to Science Discovery: The Next Computing Landscape

HPC as a Service
One of the main advantages of HPC clusters is the flexibility and efficiency they bring to their users. As the number of applications served by HPC systems grows, new systems need to serve multiple users and multiple applications. Traditional HPC systems typically ran a single application at a given time; to maintain high flexibility, a new concept, HPC as a Service (HPCaaS), has been developed. The HPC Advisory Council was one of the first organizations to perform research in this area and to provide guidelines for OEMs and end users on developing HPCaaS clusters.

Smart scheduling strategies for HPCaaS are essential for hosting multiple applications simultaneously while maintaining, or even increasing, total system productivity.

Case: Scheduling Strategies for HPC as a Service (HPCaaS) for Bio-Science Applications - PDF

HPC in a Cloud
In the past, high-performance computing has not been a good candidate for cloud computing because it requires tight integration between server nodes via low-latency interconnects. The performance overhead of host virtualization, a prerequisite technology for migrating local applications to the cloud, quickly erodes application scalability and efficiency in an HPC context. Furthermore, HPC has been slow to adopt virtualization, not only because of this overhead, but also because HPC servers generally run fully utilized and therefore do not benefit from consolidation. The performance overhead inherent in virtualization has, in turn, slowed the adoption of low-latency interconnects by cloud providers as part of their service offerings. Instead, cloud offerings have focused primarily on applications that are neither mission-critical nor performance-demanding.
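The consolidation argument can be made concrete with a little arithmetic. The sketch below is a hypothetical illustration, not taken from the Council's studies. It assumes each workload's average CPU utilization is expressed as a fraction of one host and can be packed perfectly onto shared hosts. It ignores virtualization overhead, memory limits, and peak-load headroom. The function name `hosts_after_consolidation` and the utilization figures are invented for the example.

```python
import math


def hosts_after_consolidation(utilizations, capacity=1.0):
    """Hypothetical lower bound on hosts needed after consolidation.

    Assumes each workload's average utilization (fraction of one host)
    can be packed perfectly; ignores virtualization overhead, memory
    limits, and headroom for peak load.
    """
    total = sum(utilizations)
    return math.ceil(total / capacity)


# Hypothetical lightly loaded servers: ten machines at 25% utilization.
lightly_loaded = [0.25] * 10
# Hypothetical HPC nodes: ten machines running fully utilized.
hpc_nodes = [1.0] * 10

print(hosts_after_consolidation(lightly_loaded))  # 3
print(hosts_after_consolidation(hpc_nodes))       # 10
```

Under these assumptions, the lightly loaded servers shrink from ten hosts to three. The fully utilized HPC nodes still need all ten, so virtualization adds overhead without any consolidation savings to offset it.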

Image Courtesy: 451 Group

The HPC Advisory Council performs studies to explore and assess the performance overhead of high-performance applications in cloud environments. These studies provide an in-depth analysis of the overhead of running high-performance applications over high-speed networks in a cloud environment, and address the need for virtualization in HPC clouds.

Case: HPC in a Cloud - PDF