Emails and other communications sent between Microsoft employees are literally making liquid boil inside a steel holding tank packed with computer servers at its data centre on the eastern bank of the Columbia River.
Unlike water, the fluid inside the couch-shaped tank is harmless to electronic equipment and engineered to boil at 122 degrees Fahrenheit (50 degrees Celsius), 90 degrees Fahrenheit below the boiling point of water. The boiling effect, which is generated by the work the servers are doing, carries heat away from labouring computer processors. The low-temperature boil enables the servers to operate continuously at full power without risk of failure due to overheating.
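As a sanity check on those figures, the Fahrenheit-to-Celsius conversion is a one-liner (an illustrative sketch; only the two boiling points come from the article):

```python
def f_to_c(fahrenheit):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (fahrenheit - 32) * 5 / 9

fluid_boil_f = 122  # boiling point of the engineered fluid, per the article
water_boil_f = 212  # boiling point of water at sea level

print(f_to_c(fluid_boil_f))          # 50.0 (degrees Celsius)
print(water_boil_f - fluid_boil_f)   # 90 (degrees Fahrenheit below water)
```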
Inside the tank, the vapour rising from the boiling fluid contacts a cooled condenser in the tank lid, which causes the vapour to condense back into liquid and rain down onto the immersed servers, creating a closed-loop cooling system.
“We are the first cloud provider that is running two-phase immersion cooling in a production environment,” Husam Alissa, a principal hardware engineer on Microsoft’s team for data centre advanced development in Redmond, Washington, explains.
Moore’s Law for the data centre
The production environment deployment of two-phase immersion cooling is the next step in Microsoft’s long-term plan to keep up with demand for faster, more powerful data centre computers at a time when reliable advances in air-cooled computer chip technology have slowed.
For decades, chip advances stemmed from the ability to pack more transistors onto the same size chip, roughly doubling the speed of computer processors every two years without increasing their electric power demand.
This doubling phenomenon is called Moore’s Law after Intel co-founder Gordon Moore, who observed the trend in 1965 and predicted it would continue for at least a decade. It held through the 2010s and has now begun to slow.
That is because transistor widths have shrunk to the atomic scale and are reaching a physical limit. Meanwhile, the demand for faster computer processors for high performance applications such as artificial intelligence has accelerated, Alissa notes.
To meet the need for performance, the computing industry has turned to chip architectures that can handle more electric power. Central processing units, or CPUs, have increased from 150 watts to more than 300 watts per chip, for example. Graphics processing units, or GPUs, have increased to more than 700 watts per chip.
The more electric power pumped through these processors, the hotter the chips get. The increased heat has ramped up cooling requirements to prevent the chips from malfunctioning.
“Air cooling is not enough,” Christian Belady, distinguished engineer and vice president of Microsoft’s data centre advanced development group in Redmond, says. “That’s what’s driving us to immersion cooling, where we can directly boil off the surfaces of the chip.”
Heat transfer in liquids, he noted, is orders of magnitude more efficient than in air. What is more, he added, the switch to liquid cooling brings a Moore’s Law-like mindset to the whole of the data centre. “Liquid cooling enables us to go denser, and thus continue the Moore’s Law trend at the data centre level,” he says.
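The “orders of magnitude” claim can be illustrated with Newton’s law of cooling, q = h·A·ΔT. The coefficients below are generic order-of-magnitude values from heat-transfer references, not Microsoft figures, and the chip dimensions are invented for illustration:

```python
def heat_removed_watts(h_w_per_m2k, area_m2, delta_t_k):
    """Newton's law of cooling: q = h * A * dT."""
    return h_w_per_m2k * area_m2 * delta_t_k

# Typical order-of-magnitude heat-transfer coefficients, W/(m^2*K):
H_FORCED_AIR = 100           # forced-air convection
H_NUCLEATE_BOILING = 10_000  # nucleate boiling on an immersed surface

chip_area = 0.001  # a 10 cm^2 chip lid, hypothetical
delta_t = 30       # 30 K between chip surface and coolant

air_q = heat_removed_watts(H_FORCED_AIR, chip_area, delta_t)
boil_q = heat_removed_watts(H_NUCLEATE_BOILING, chip_area, delta_t)
# boil_q / air_q: boiling removes ~100x more heat at the same temperature gap
```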
Lesson learned from cryptocurrency miners
Liquid cooling is a proven technology, Belady notes. Most cars on the road today rely on it to prevent engines from overheating. Several technology companies, including Microsoft, are experimenting with cold plate technology, in which liquid is piped through metal plates, to chill servers.
Participants in the cryptocurrency industry pioneered liquid immersion cooling for computing equipment, using it to cool the chips that log digital currency transactions.
Microsoft investigated liquid immersion as a cooling solution for high-performance computing applications such as AI. Among other things, the investigation revealed that two-phase immersion cooling reduced power consumption for any given server by five to 15 per cent. The findings motivated the Microsoft team to work with Wiwynn, a data centre IT system manufacturer and designer, to develop a two-phase immersion cooling solution. The first solution is now running at Microsoft’s data centre in Quincy.
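To put the quoted five to 15 per cent in perspective, a back-of-the-envelope estimate of yearly energy saved per server is straightforward. The 500 W server is a hypothetical figure chosen for illustration; only the percentage range comes from the article:

```python
HOURS_PER_YEAR = 8760

def annual_savings_kwh(server_power_w, reduction_fraction):
    """Energy saved per year by one server, in kilowatt-hours."""
    return server_power_w * reduction_fraction * HOURS_PER_YEAR / 1000

# Hypothetical 500 W server across the article's 5-15% reduction range:
low = annual_savings_kwh(500, 0.05)
high = annual_savings_kwh(500, 0.15)
# roughly 219 to 657 kWh saved per server per year
```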
That couch-shaped tank is filled with an engineered fluid from 3M. 3M’s liquid cooling fluids have dielectric properties that make them effective insulators, allowing the servers to operate normally while fully immersed in the fluid.
This shift to two-phase liquid immersion cooling enables increased flexibility for the efficient management of cloud resources, according to Marcus Fontoura, a technical fellow and corporate vice president at Microsoft who is the chief architect of Azure compute.
For example, software that manages cloud resources can allocate sudden spikes in data centre compute demand to the servers in the liquid-cooled tanks. That is because these servers can run at elevated power – a process called overclocking – without risk of overheating. “For instance, we know that with Teams when you get to 1 o’clock or 2 o’clock, there is a huge spike because people are joining meetings at the same time,” Fontoura adds. “Immersion cooling gives us more flexibility to deal with these burst-y workloads.”
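A scheduler that prefers immersion-cooled tanks for burst demand might look like the sketch below. All names, the 1.2x overclock ceiling and the load model are invented for illustration; this is not Azure’s actual resource manager:

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    immersion_cooled: bool
    load: float  # current load as a fraction of rated power

def place_burst(servers, demand):
    """Assign burst demand (in rated-power units) to servers, preferring
    immersion-cooled ones, which can safely run above rated power
    (an assumed 1.2x ceiling here)."""
    placements = []
    # Immersion-cooled servers first, least-loaded first.
    ordered = sorted(servers, key=lambda s: (not s.immersion_cooled, s.load))
    for server in ordered:
        if demand <= 0:
            break
        ceiling = 1.2 if server.immersion_cooled else 1.0
        headroom = max(ceiling - server.load, 0)
        take = min(demand, headroom)
        if take > 0:
            placements.append((server.name, take))
            demand -= take
    return placements
```

Routing a midday spike through a policy like this fills the tanks’ overclock headroom before touching air-cooled machines.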
Sustainable data centres
Adding the two-phase immersion cooled servers to the mix of available compute resources will also allow machine learning software to manage these resources more efficiently across the data centre, from power and cooling to maintenance technicians, Fontoura adds. “We will have not only a huge impact on efficiency, but also a huge impact on sustainability because you make sure that there is not wastage, that every piece of IT equipment that we deploy will be well utilised,” he says.
Liquid cooling is also a waterless technology, which will help Microsoft meet its commitment to replenish more water than it consumes by the end of this decade.
The cooling coils that run through the tank and enable the vapour to condense are connected to a separate closed-loop system that uses fluid to transfer heat from the tank to a dry cooler outside the tank’s container. Because the fluid in these coils is always warmer than the ambient air, there is no need to spray water to condition the air for evaporative cooling, Alissa explains.
Microsoft, together with infrastructure industry partners, is also investigating how to run the tanks in ways that mitigate fluid loss and will have little to no impact on the environment. “If done right, two-phase immersion cooling will attain all our cost, reliability and performance requirements simultaneously with essentially a fraction of the energy spend compared to air cooling,” Ioannis Manousakis, a principal software engineer with Azure, says.
Bringing the sea to servers
Microsoft’s investigation into two-phase immersion cooling is part of the company’s multi-pronged strategy to make data centres more sustainable and efficient to build, operate and maintain.
For example, the data centre advanced development team is also exploring the potential to use hydrogen fuel cells instead of diesel generators for backup power generation at data centres.
The liquid cooling project is similar to Microsoft’s Project Natick, which is exploring the potential of underwater data centres that are quick to deploy and can operate for years on the seabed sealed inside submarine-like tubes without any onsite maintenance by people.
Instead of an engineered fluid, the underwater data centre is filled with dry nitrogen air. The servers are cooled with fans and a heat exchange plumbing system that pumps piped seawater through the sealed tube.
A key finding from Project Natick is that the servers on the seafloor experienced one-eighth the failure rate of replica servers in a land data centre. Preliminary analysis indicates that the absence of humidity and of the corrosive effects of oxygen was primarily responsible for the superior performance of the servers underwater.
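The one-eighth figure translates directly into fleet-level expectations. A toy calculation (the baseline failure count is hypothetical; only the one-eighth ratio comes from Project Natick):

```python
# Hypothetical baseline: 400 server failures per year in a land data centre.
land_failures_per_year = 400

# Project Natick observed one-eighth that failure rate on the seafloor.
sea_failures_per_year = land_failures_per_year / 8

print(sea_failures_per_year)  # 50.0
```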
Alissa anticipates the servers inside the liquid immersion tank will experience similar superior performance. “We brought the sea to the servers rather than put the data centre under the sea,” he says.
If the servers in the immersion tank experience reduced failure rates as anticipated, Microsoft could move to a model where components are not immediately replaced when they fail. This would limit vapour loss as well as allow tank deployment in remote, hard-to-service locations. What is more, the ability to densely pack servers in the tank enables a re-envisioned server architecture that is optimised for low-latency, high-performance applications as well as low-maintenance operation, Belady notes.
Such a tank, for example, could be deployed under a 5G cellular communications tower in the middle of a city for applications such as self-driving cars. For now, Microsoft has one tank running workloads in a hyperscale data centre. For the next several months, the Microsoft team will perform a series of tests to prove the viability of the tank and the technology.
“This first step is about making people feel comfortable with the concept and showing we can run production workloads,” Belady says.