Supercomputing applications are defined by these types of tightly connected concurrent processes, which puts more emphasis on the performance of the interconnect, in particular its latency. Running a traditional supercomputing application on an infrastructure designed for elastic applications, such as AWS or Azure, typically yields slow-downs by a factor of 50 to 100. Measured in terms of cost, such an application would cost 50 to 100 times more to execute on a typical public cloud computing infrastructure.
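To see why latency can dominate, consider a minimal sketch of a tightly coupled application in which every step performs some local computation and then waits for one synchronous message exchange. The model, the function name `slowdown`, and all numeric values below are illustrative assumptions, not measurements from any particular system.

```python
def slowdown(compute_us: float, fast_latency_us: float, slow_latency_us: float) -> float:
    """Ratio of per-step time on a high-latency vs. a low-latency interconnect.

    Assumed model: each step costs local compute time plus one blocking
    message exchange, so step time = compute_us + latency_us.
    """
    fast_step = compute_us + fast_latency_us
    slow_step = compute_us + slow_latency_us
    return slow_step / fast_step


# Hypothetical values: 1 us of compute per step, 1 us latency on a
# low-latency interconnect, 100 us latency on a general-purpose network.
print(slowdown(compute_us=1.0, fast_latency_us=1.0, slow_latency_us=100.0))  # 50.5

# With no compute to hide behind, the slowdown equals the latency ratio.
print(slowdown(compute_us=0.0, fast_latency_us=1.0, slow_latency_us=100.0))  # 100.0
```

In this simplified model, the less computation a step does between exchanges, the closer the slow-down gets to the raw latency ratio of the two networks.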
Most supercomputing applications are associated with very valuable economic activities of the business. As mentioned earlier, production optimization and logistics applications save companies like Exxon Mobil and FedEx billions of dollars per year. Those applications are tightly integrated into the business operations and strategic decision making of these organizations and pay for themselves many times over. For the SMB market, these supercomputing applications offer great opportunity for revenue growth and margin improvement as well. However, their economic value is limited by the size of the revenue stream they optimize: a 10% improvement on a $10B revenue stream yields a $1B net benefit, but on a $10M revenue stream the benefit is just $1M, not enough to compensate for the risk and cost that deploying a supercomputer would require.
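The arithmetic behind this argument can be written out as a short sketch. The function names and the `deployment_cost` figure are hypothetical placeholders chosen for illustration; only the 10% improvement and the $10B and $10M revenue streams come from the text.

```python
def net_benefit(revenue: int, improvement_pct: int) -> float:
    """Benefit in dollars of improving a revenue stream by improvement_pct percent."""
    return revenue * improvement_pct / 100


def worth_deploying(revenue: int, improvement_pct: int, deployment_cost: int) -> bool:
    """True if the benefit exceeds the (assumed) cost of owning a supercomputer."""
    return net_benefit(revenue, improvement_pct) > deployment_cost


# Hypothetical cost of deploying and operating a dedicated supercomputer.
ASSUMED_DEPLOYMENT_COST = 5_000_000

large = net_benefit(10_000_000_000, 10)  # $10B revenue stream
small = net_benefit(10_000_000, 10)      # $10M revenue stream
print(large, small)

print(worth_deploying(10_000_000_000, 10, ASSUMED_DEPLOYMENT_COST))
print(worth_deploying(10_000_000, 10, ASSUMED_DEPLOYMENT_COST))
```

The benefit scales linearly with revenue, while the cost of a dedicated machine does not shrink with the size of the business, so under any fixed cost of this order the smaller company cannot justify the purchase.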
Enter On-Demand Supercomputing.
In 2011, we were asked to design, construct, and deploy an On-Demand Supercomputing service for a Chinese cloud vendor. The idea was to build an interconnected set of supercomputer centers in China and offer a multi-tenant, on-demand service for high-value, high-touch applications, such as logistics, digital content creation, and engineering design and optimization. The pilot program consisted of one supercomputer center in Beijing and one in Shanghai. The basic building block we designed was a quad-rack, redundant QDR InfiniBand (IB) fat-tree architecture with blade chassis at the leaves. The architecture was inspired by the observation that, for the SMB market, the granularity of deployment would fall in the range of 16 to 32 processors, which could be serviced by a single chassis, keeping all communication traffic local to that chassis. The topology is shown in the following figure: