Inspur Information and MEGWARE Help the Friedrich-Alexander-Universität Erlangen-Nürnberg Build an Advanced GPU Cluster for Scientific Research

Background introduction:

The Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) is renowned for its science and engineering, and is rated as the second most innovative university in Europe by Reuters. Its areas of expertise include materials science, chemistry, life sciences, computer science, and biomedical engineering. High-performance computing (HPC) is a key research focus at FAU with numerous applications across its faculties, particularly HPC-related teaching and research in computer science.

FAU is a member of the German NHR-Alliance, a federation of nine tier-2 computing centers in Germany. The university’s Erlangen National High Performance Computing Center (NHR@FAU) supports HPC-based research on a national scale. To meet the massively parallel computing needs of scientific research, NHR@FAU sought to build the largest computing cluster in the university’s history. This HPC project would significantly expand FAU’s research and HPC capabilities, and also provide new opportunities for researchers throughout Germany.

Challenges:

The use of machine learning (ML) and molecular dynamics (MD) has become increasingly important for many areas of research, including materials science, chemistry, life sciences, computer science, biomedical engineering, and linguistics. As FAU’s research in these areas grows more complex, it requires state-of-the-art HPC. To bolster the university’s scientific research capabilities, NHR@FAU sought to construct a GPU cluster that is both energy-efficient and cost-effective.

Business Needs/Challenges 

1. Machine Learning and AI

A wide array of research is turning to machine learning and AI for scientific advances. At FAU, major research utilizing ML and AI is being done in computer science for pattern recognition, biomedical engineering, and linguistic studies in spoken word and gesture pairing.

2. Molecular Dynamics

FAU uses MD to simulate the time-dependent properties of macromolecules and of biological systems such as proteins, as well as in pharmacology. These simulations require immense computing power, and demand for them is growing rapidly.

Solution introduction:

Inspur and MEGWARE’s HPC solution has greatly enhanced FAU’s scientific research capabilities. The floating-point performance for model inference and training exceeded the university’s original expectations by 115%. For the ML nodes, Inspur recommended the NF5488A5 GPU server powered by NVIDIA A100 GPUs, a next-generation data center GPU with improved CUDA cores, Tensor Cores, GPU memory, and overall computing power. For the MD nodes, Inspur recommended the NF5468A5 GPU server powered by NVIDIA A40 Tensor Core GPUs. This solution maximized performance while minimizing costs. FAU had relatively low requirements for server PCIe expansion, which made the NF5488A5 and NF5468A5 servers an ideal choice. Alternative server products offer more PCIe slot expansion capacity, but this is a costly feature the university would not fully utilize. Consequently, the NF5488A5 and NF5468A5 servers fulfilled all project requirements while also reducing procurement costs.

Detailed solution:

1) ML nodes supporting diverse machine learning and artificial intelligence tasks

The software and algorithms used in ML-based research have diverse computing characteristics. Different packages make very different use of CPUs, GPUs, memory, and hard disks, and resource utilization also varies considerably from task to task. Inspur therefore recommended the NF5488A5 GPU server, which houses eight NVIDIA A100 GPUs fully interconnected via third-generation NVLink and two 64-core AMD EPYC 7713 CPUs in a 4U chassis. It can handle massive ML datasets and improve training efficiency.

2) MD nodes with strong computing power

Due to the complexity of molecular simulation theory, the requirements for HPC in this field are extremely high. Inspur recommended the NF5468A5 GPU server. The NF5468A5 adopts the industry's most advanced NVIDIA NVSwitch interconnect structure for high-speed performance and communication bandwidth. It houses eight NVIDIA A40 Tensor Core GPUs and two AMD EPYC 7713 CPUs in a 4U chassis. The CPUs and GPUs are connected directly over a high-speed PCIe 4.0 interface, without a PCIe switch, which avoids the communication delay a switch would add between them. It is a powerful match for an assortment of HPC MD requirements.
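To give a sense of scale for the PCIe 4.0 link mentioned above, the following minimal sketch estimates the theoretical one-direction bandwidth of a single link. It assumes each GPU is attached over a x16 link, which the text does not state; the 16 GT/s per-lane rate and 128b/130b encoding are the standard PCIe 4.0 parameters. The function and constant names are illustrative only.

```python
# Rough estimate of theoretical PCIe 4.0 bandwidth per direction.
# Assumption (not from the source): each GPU uses a x16 link.

PCIE4_GT_PER_S = 16.0          # PCIe 4.0 raw rate per lane, in GT/s
ENCODING_EFFICIENCY = 128 / 130  # 128b/130b line encoding overhead


def pcie4_bandwidth_gb_per_s(lanes: int) -> float:
    """Return the theoretical one-direction bandwidth in GB/s for a link."""
    if lanes <= 0:
        raise ValueError("lanes must be a positive integer")
    # Usable gigabits per second per lane, then converted to gigabytes.
    gbit_per_lane = PCIE4_GT_PER_S * ENCODING_EFFICIENCY
    return gbit_per_lane / 8 * lanes


if __name__ == "__main__":
    x16 = pcie4_bandwidth_gb_per_s(16)
    print(f"PCIe 4.0 x16, per direction: about {x16:.1f} GB/s")
```

This is an upper bound only; protocol overhead and the application's transfer pattern lower the achievable rate in practice.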

3) A GPU Cluster that can support a wide variety of complex scientific research

The GPU cluster composed of these ML and MD nodes, named “Alex” by NHR@FAU, ranks in the TOP500 and Green500 lists of the world's most powerful and most energy-efficient HPC systems. Alex is the core component of NHR@FAU’s HPC infrastructure for handling the rapidly growing demand for ML and MD computing resources in scientific research. It has a total of 256 NVIDIA A100 Tensor Core GPUs and 304 NVIDIA A40 Tensor Core GPUs. In addition to these GPU resources, there are 140 AMD EPYC 7713 CPUs, and the total memory capacity is over 50TB. The cluster is interconnected through a high-speed HDR InfiniBand network. The result is a general-purpose system with excellent HPC and AI performance that runs a multitude of research-specific software with varying hardware requirements, supporting massive ML datasets and molecular dynamics simulations while improving training efficiency.
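The totals quoted for Alex are consistent with the per-node configurations described earlier (eight GPUs and two CPUs per server). As a quick sanity check, the sketch below derives the node counts these figures imply. The node counts are inferred from the quoted totals rather than stated in the text, and all names in the code are illustrative.

```python
# Sanity check: node counts implied by the quoted device totals.
# The per-node figures come from the server descriptions in the text;
# the resulting node counts are an inference, not a published figure.

A100_TOTAL = 256     # NVIDIA A100 GPUs in the cluster
A40_TOTAL = 304      # NVIDIA A40 GPUs in the cluster
CPU_TOTAL = 140      # AMD EPYC 7713 CPUs in the cluster
GPUS_PER_NODE = 8    # both server models hold eight GPUs
CPUS_PER_NODE = 2    # both server models hold two CPUs


def implied_nodes(total_devices: int, per_node: int) -> int:
    """Return the node count implied by a device total, or raise if uneven."""
    nodes, remainder = divmod(total_devices, per_node)
    if remainder:
        raise ValueError(
            f"{total_devices} devices do not divide evenly into nodes of {per_node}"
        )
    return nodes


if __name__ == "__main__":
    ml_nodes = implied_nodes(A100_TOTAL, GPUS_PER_NODE)
    md_nodes = implied_nodes(A40_TOTAL, GPUS_PER_NODE)
    cpu_nodes = implied_nodes(CPU_TOTAL, CPUS_PER_NODE)
    print(f"ML nodes (A100): {ml_nodes}")
    print(f"MD nodes (A40):  {md_nodes}")
    print(f"Nodes implied by the CPU total: {cpu_nodes}")
    print(f"GPU and CPU totals agree: {ml_nodes + md_nodes == cpu_nodes}")
```

Under these assumptions, the GPU totals and the CPU total point to the same overall number of servers.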

Customer Benefits

NHR@FAU’s HPC cluster, built on Inspur GPU servers and developed by MEGWARE, runs applications such as TensorFlow, PyTorch, Quantum ESPRESSO, and VASP, as well as scientific research software such as NAMD, LAMMPS, AMBER, and GROMACS. In addition, NHR@FAU ran HPL tests using the HPC-Benchmarks 21.4-hpl Docker image, which delivered 80 TFLOPS on a single NF5488A5 node. Drawing on the HPL requirements of this project and its deep understanding of HPC in scientific computing, Inspur's HPC application analysis team provided a performance optimization solution that raised this figure by roughly 15%, to 91 TFLOPS per node, which greatly exceeded NHR@FAU’s expectations. FAU and other German researchers now have access to some of the most advanced HPC resources in Germany, which are being used for cutting-edge research in fields such as computer science and molecular dynamics.
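The per-node HPL improvement can be checked from the two figures quoted above. The sketch below computes the relative gain from those values. Both inputs are the rounded TFLOPS numbers from the text, so the result is only approximate, and the function name is illustrative.

```python
# Relative gain between the baseline and optimized single-node HPL results.
# Inputs are the rounded TFLOPS figures quoted in the text.

BASELINE_TFLOPS = 80.0    # single-node result with the stock container
OPTIMIZED_TFLOPS = 91.0   # single-node result after optimization


def relative_gain(baseline: float, optimized: float) -> float:
    """Return the fractional improvement of optimized over baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (optimized - baseline) / baseline


if __name__ == "__main__":
    gain = relative_gain(BASELINE_TFLOPS, OPTIMIZED_TFLOPS)
    print(f"Relative gain from the rounded figures: {gain:.2%}")
```

With these rounded inputs the gain comes out slightly below the quoted 15%. One plausible explanation, which is an assumption here, is that the quoted percentage was computed from unrounded measurements.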
