High-performance computing provides an invaluable role in research, product development and education. It helps accelerate time to market, and provides significant cost reductions in product development and tremendous flexibility. One strength in high-performance computing is the ability to achieve best sustained performance by driving the CPU performance towards its limits. Over the past decade, high-performance computing has migrated from supercomputers to commodity clusters. More than eighty percent of the world’s Top500 compute system installations in June 2009 were clusters. The driver for this move appears to be a combination of Moore’s Law (enabling higher performance computers at lower costs) and the ultimate drive for the best cost/performance and power/performance. Cluster productivity and flexibility are the most important factors for a cluster’s hardware and software configuration.
A deeper examination of the world’s Top500 systems based on commodity clusters shows two main interconnect solutions that are being used to connect the servers for creating those compute powerful systems – InfiniBand and Ethernet. If we divide the systems according to the interconnect family, we will see that the same CPUs, memory speed and other settings are common between the two groups. The only difference between the two groups, besides the interconnect, is the system efficiency, or how many of CPU cycles can be dedicated to the application work, and how many of them will be wasted. The below graph list the systems according to their interconnect setting, and their measured efficiency.
As seen, systems connected with Ethernet achieves an average 50% efficiency, which means that 50% of the CPU cycles are wasted on non-application work or are idle, waiting for data to arrive. Systems connected with InfiniBand achieve an above 80% efficiency average, which means that less than 20% of the CPU cycles are wasted. Moreover, the latest InfiniBand based systems have demonstrated up to 94% efficiency (the best Ethernet connected systems demonstrated 63% efficiency).
People might argue that the Linpack benchmark is not the best benchmark for measuring parallel application efficiency, and does not fully utilize the network. The graph results are a clear indication that even for the Linpack application, the network does make a difference, and for better parallel application, the gap will be much higher.
When choosing the system setting, with the notion of maximizing return on investment, one needs to make sure no artificial bottlenecks will be created. Multi-core platforms, parallel applications, large databases etc require fast data exchange and lots of it. Ethernet can become the system bottleneck due to latency/bandwidth and CPU overhead due to the TCP/UDP processing (TOE solutions introduce other issues, sometime more complicated, but this is a topic for another blog) and reduce the system efficiency to 50%. This means that half of the compute system is wasted, and just consumes power and cooling. Same performance capability could have been achieved with half of the servers if they were connected with InfiniBand. More data on different application performance, productivity and ROI, can be found at the HPC Advisory Council web site, under content/best practices.
While InfiniBand will demonstrate higher efficiency and productivity, there are several ways to increase Ethernet efficiency. One of them is optimizing the transport layer to provide zero copy and lower CPU overhead (not by using TOE solutions, as those introduce single points of failure in the system). This capability is known as LLE (low latency Ethernet). More on LLE will be discussed in future blogs.
Gilad Shainer HPC Advisory Council Chairman