Category Archives: HPC

ROI through efficiency and utilization

High-performance computing plays an invaluable role in research, product development and education. It helps accelerate time to market, and provides significant cost reductions in product development and tremendous flexibility. One strength of high-performance computing is the ability to achieve the best sustained performance by driving CPU performance towards its limits. Over the past decade, high-performance computing has migrated from supercomputers to commodity clusters. More than eighty percent of the world’s Top500 compute system installations in June 2009 were clusters. The driver for this move appears to be a combination of Moore’s Law (enabling higher performance computers at lower costs) and the ultimate drive for the best cost/performance and power/performance. Cluster productivity and flexibility are the most important factors for a cluster’s hardware and software configuration.

A deeper examination of the world’s Top500 systems based on commodity clusters shows that two main interconnect solutions are used to connect the servers into these powerful compute systems: InfiniBand and Ethernet. If we divide the systems according to interconnect family, we see that the same CPUs, memory speeds and other settings are common to both groups. The only difference between the two groups, besides the interconnect, is system efficiency: how many CPU cycles can be dedicated to application work, and how many are wasted. The graph below lists the systems according to their interconnect and their measured efficiency.


As seen, systems connected with Ethernet achieve an average efficiency of 50%, which means that 50% of the CPU cycles are wasted on non-application work or are idle, waiting for data to arrive. Systems connected with InfiniBand achieve an average efficiency above 80%, which means that less than 20% of the CPU cycles are wasted. Moreover, the latest InfiniBand-based systems have demonstrated up to 94% efficiency (the best Ethernet-connected systems demonstrated 63% efficiency).

People might argue that the Linpack benchmark is not the best benchmark for measuring parallel application efficiency, and that it does not fully utilize the network. The graph results clearly indicate that even for Linpack the network does make a difference, and for more demanding parallel applications the gap will be much larger.
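
For readers unfamiliar with the metric: Top500 efficiency is conventionally the ratio of measured Linpack performance (Rmax) to theoretical peak performance (Rpeak). The following is a minimal Python sketch of that ratio. The function name and the two example systems are illustrative assumptions, not entries from the list.

```python
def linpack_efficiency(rmax_tflops: float, rpeak_tflops: float) -> float:
    """Return the sustained-to-peak ratio (Rmax / Rpeak) as a fraction."""
    if rpeak_tflops <= 0:
        raise ValueError("rpeak_tflops must be positive")
    return rmax_tflops / rpeak_tflops


# Two hypothetical systems with the same 100 TFlops theoretical peak.
ethernet_like = linpack_efficiency(50.0, 100.0)
infiniband_like = linpack_efficiency(80.0, 100.0)

# The share of peak capability that is not delivered to the benchmark.
wasted_ethernet = 1.0 - ethernet_like
wasted_infiniband = 1.0 - infiniband_like

print(f"Ethernet-like:   {ethernet_like:.0%} efficient, {wasted_ethernet:.0%} wasted")
print(f"InfiniBand-like: {infiniband_like:.0%} efficient, {wasted_infiniband:.0%} wasted")
```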

When choosing the system configuration with the goal of maximizing return on investment, one needs to make sure no artificial bottlenecks are created. Multi-core platforms, parallel applications, large databases and the like require fast data exchange, and lots of it. Ethernet can become the system bottleneck because of its latency and bandwidth and the CPU overhead of TCP/UDP processing (TOE solutions introduce other issues, sometimes more complicated ones, but this is a topic for another blog), and can reduce system efficiency to 50%. This means that half of the compute system is wasted, and just consumes power and cooling. The same performance capability could have been achieved with half of the servers if they were connected with InfiniBand. More data on different application performance, productivity and ROI can be found at the HPC Advisory Council web site, under content/best practices.
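
The server-count argument can be sketched with a deliberately simple model, assuming sustained performance equals node count times per-node peak times efficiency, with identical nodes in both systems. The function name is hypothetical, and the efficiencies plugged in are the averages and best case quoted above.

```python
def relative_server_count(baseline_efficiency: float, new_efficiency: float) -> float:
    """Fraction of the baseline's servers needed to match its sustained
    performance, assuming identical nodes and
    sustained = nodes * peak_per_node * efficiency."""
    for eff in (baseline_efficiency, new_efficiency):
        if not 0.0 < eff <= 1.0:
            raise ValueError("efficiency must be in (0, 1]")
    return baseline_efficiency / new_efficiency


# Ethernet average (50%) versus InfiniBand average (80%) and best case (94%).
vs_average = relative_server_count(0.50, 0.80)
vs_best = relative_server_count(0.50, 0.94)

print(f"Servers needed at 80% efficiency: {vs_average:.1%} of the baseline")
print(f"Servers needed at 94% efficiency: {vs_best:.1%} of the baseline")
```

Under this model the 80% average corresponds to about 62.5% of the servers, and the fraction approaches one half as efficiency approaches 100%.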

While InfiniBand will demonstrate higher efficiency and productivity, there are several ways to increase Ethernet efficiency. One of them is optimizing the transport layer to provide zero copy and lower CPU overhead (not by using TOE solutions, as those introduce single points of failure in the system). This capability is known as LLE (low latency Ethernet). More on LLE will be discussed in future blogs.

Gilad Shainer HPC Advisory Council Chairman

Cloud computing for HPC?

One of the interesting projects we are dealing with is the feasibility of using cloud computing for high-performance computing. I remember a paper on using Amazon EC2 for HPC, and the conclusion was that some GB of bandwidth are missing between the compute nodes… In the past, high-performance computing has not been a good candidate for cloud computing due to its requirement for tight integration between the server nodes via low-latency interconnects. Moreover, the performance overhead associated with host virtualization, a prerequisite technology for migrating local applications to the cloud, quickly erodes application scalability and efficiency in an HPC context. Furthermore, HPC has been slow to adopt virtualization, not only due to the performance overhead, but also because HPC servers generally run fully utilized, and therefore do not benefit from consolidation.

Not all clouds are the same, nor will they be. While virtualization is needed for enterprise applications, it is not a must for HPC clouds, where application provisioning can be done at the granularity of a physical server. Moreover, there are emerging virtualization solutions that reduce the overhead and enable native application performance.

The Council presented some of the first findings from the HPC cloud project at ISC’09 (posted in the advanced topics section). We have submitted a full paper for publication, and hope to post it on the web site soon.

The next phase of the project will add the virtualization aspect, in particular Xen and KVM, and explore its effects on application performance, as well as on system utilization and efficiency.

Gilad Shainer,
HPC Advisory Council Chairman

Inauguration of 1st European Petaflop Computer in Jülich, Germany

On Tuesday, May 26, the Research Center Jülich reached a significant milestone in German and European supercomputing with the inauguration of two new supercomputers: the supercomputer JUROPA and the fusion machine HPC-FF. The symbolic start of the systems was triggered by the German Federal Minister for Education and Research, Prof. Dr. Annette Schavan, the Prime Minister of North Rhine-Westphalia, Dr. Jürgen Rüttgers, and Prof. Dr. Achim Bachem, Chairman of the Board of Directors at Research Center Jülich, together with high-ranking international guests from academia, industry and politics.

JUROPA (which stands for Juelich Research on Petaflop Architectures) will be used across Europe by more than 200 research groups to run their data-intensive applications. JUROPA is based on a cluster configuration of Sun Blade servers, Intel Nehalem processors, Mellanox 40Gb/s InfiniBand and the ParaStation cluster operating software from ParTec Cluster Competence Center GmbH. The system was jointly developed by experts of the Jülich Supercomputing Center and implemented with partner companies Bull, Sun, Intel, Mellanox and ParTec. It consists of 2,208 compute nodes with a total computing power of 207 teraflops and was sponsored by the Helmholtz Community. Prof. Dr. Dr. Thomas Lippert, Head of Jülich Supercomputing Center, explains the HPC installation in Jülich in the video below.

HPC-FF (High Performance Computing – for Fusion), drawn up by the team headed by Dr. Thomas Lippert, director of the Jülich Supercomputing Centre, was optimized and implemented together with the partner companies Bull, Sun, Intel, Mellanox and ParTec. This new best-of-breed system, one of Europe’s most powerful, will support advanced research in many areas such as health, information, environment, and energy. It consists of 1,080 compute nodes, each equipped with two Nehalem EP quad-core processors from Intel. Their total computing power of 101 teraflops corresponds, at the present moment, to 30th place in the list of the world’s fastest supercomputers. The combined cluster will achieve 300 teraflops of computing power and will be included in the Top500 list published this month at ISC’09 in Hamburg, Germany.

40Gb/s InfiniBand from Mellanox is used as the system interconnect. The administrative infrastructure is based on NovaScale R422-E2 servers from French supercomputer manufacturer Bull, who supplied the compute hardware and the Sun ZFS/Lustre file system. The cluster operating system “ParaStation V5” is supplied by Munich software company ParTec. HPC-FF is being funded by the European Commission (EURATOM), the member institutes of EFDA, and Forschungszentrum Jülich.

Complete system facts: 3,288 compute nodes; 79 TB main memory; 26,304 cores; 308 teraflops peak performance.
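
The combined figures follow from the per-system numbers quoted above, as this short Python check illustrates. It assumes eight cores per node on both systems (two quad-core processors); the post states that configuration only for HPC-FF, so applying it to JUROPA is an assumption. The variable names are illustrative.

```python
# Per-system figures as quoted in the post.
juropa_nodes, juropa_tflops = 2208, 207
hpcff_nodes, hpcff_tflops = 1080, 101

# Two quad-core processors per node (stated for HPC-FF, assumed for JUROPA).
cores_per_node = 2 * 4

total_nodes = juropa_nodes + hpcff_nodes
total_cores = total_nodes * cores_per_node
total_tflops = juropa_tflops + hpcff_tflops

print(f"{total_nodes} nodes, {total_cores} cores, {total_tflops} teraflops peak")
```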

Gilad Shainer,
HPC Advisory Council Chairman

The HPC Advisory Council Cluster Center – update

We recently completed a small refresh of the Cluster Center. The Cluster Center offers an environment for developing, testing, benchmarking and optimizing products free of charge. The center, located in Sunnyvale, California, provides on-site technical support and enables secure sessions on site or remotely. The Cluster Center provides a unique ability to access the latest clustering technology, sometimes even before it reaches public availability.

In the last few weeks, we have completed the installation of a Windows HPC Server 2008 cluster, and now it is available for testing (via the Vulcan cluster). We have also received the Scyld ClusterWare™ HPC cluster management solution from Penguin Computing (a member company) and installed it on the Osiris cluster.

Scyld was designed to make the deployment and management of Linux clusters as easy as the deployment and management of a single system. A Scyld ClusterWare cluster consists of a master node and compute nodes. The master node is the central point of control for the entire cluster. Compute nodes appear as attached processor and memory resources. More information on Scyld can be found here.

Adding Scyld to Osiris helps the Council with its best practices research activities, which provide guidelines to end-users on how to maximize productivity for various applications using 20 and 40Gb/s InfiniBand or 10 Gigabit Ethernet. I would like to thank Matt Jacobs and Joshua Bernstein from Penguin Computing for their donation and support during the Scyld installation.

Best regards,
Gilad Shainer
Chairman of the HPC Advisory Council

Interactive Supercomputing

Interactive Supercomputing’s mission is to bridge the gap between easy-to-use desktop modeling, simulation and development tools and the power, scalability and low cost of parallel computer systems, clusters and grids. To fulfill this mission, we have developed the Star-P software platform. It is an interactive parallel computing platform that extends existing desktop simulation tools for simple, user-friendly parallel computing on a spectrum of computing architectures such as multi-core clusters.

Our customers are scientists, engineers and analysts who want to solve large and complex problems that can no longer be handled productively on a desktop computer. By eliminating the re-programming associated with porting desktop application code to parallel systems, Star-P fundamentally transforms the workflow, substantially shortening the “time to solution,” and delivers the “best of both worlds”: the interactive and familiar use of the desktop coupled with supercomputer-like problem-solving capabilities.

We presented the performance capabilities of Star-P at SC08, and would like to share the presentation with you.