Machine Learning Guided Cooling Optimization for Data Centers

Jan 1, 2026 · Shrenik Jadhav, Zheng Liu
Abstract
Effective data center cooling is crucial for reliable operation, yet cooling systems often exhibit inefficiencies that lead to excessive energy consumption. This paper presents a three-stage, physics-guided machine learning framework for identifying and reducing cooling energy waste in a high-performance computing facility. Using one year of 10-minute resolution operational data from the Frontier exascale supercomputer, we first train a monotonicity-constrained gradient boosting surrogate to predict facility accessory power from coolant flow rates, temperatures, and server power. The surrogate achieves a mean absolute error of 0.026 MW and predicts power usage effectiveness within 0.01 of measured values for 98.7% of test samples. In the second stage, the surrogate serves as a physics-consistent baseline to quantify excess cooling energy, revealing approximately 85 MWh of annual inefficiency concentrated in specific months, hours, and operating regimes. The third stage evaluates guardrail-constrained counterfactual adjustments to supply temperature and subloop flows, demonstrating that up to 96% of identified excess can be recovered through small, safe setpoint changes while respecting thermal limits and operational constraints.
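To make the first stage concrete, the sketch below shows one way a monotonicity-constrained gradient boosting surrogate of this kind can be set up, here with scikit-learn's HistGradientBoostingRegressor. This is a minimal illustration rather than the paper's implementation: the feature set, constraint signs, synthetic data, and the assumption that PUE is (server power + accessory power) / server power are all placeholders for the purposes of the example.

```python
# Minimal sketch of a monotonicity-constrained gradient boosting surrogate
# for facility accessory power, plus a derived PUE check. All features,
# constraint signs, and data below are illustrative assumptions.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 5000
# Hypothetical 10-minute samples: coolant flow, supply temperature, server power.
X = np.column_stack([
    rng.uniform(200, 600, n),   # coolant flow rate (kg/s)
    rng.uniform(15, 32, n),     # supply temperature (deg C)
    rng.uniform(5, 25, n),      # server (IT) power (MW)
])
# Toy accessory-power relation: grows with flow and IT load, shrinks with warmer supply.
y = 0.002 * X[:, 0] + 0.05 * X[:, 2] - 0.01 * X[:, 1] + rng.normal(0, 0.02, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Per-feature monotonic constraints: +1 non-decreasing, -1 non-increasing, 0 free.
model = HistGradientBoostingRegressor(monotonic_cst=[1, -1, 1], random_state=0)
model.fit(X_tr, y_tr)

pred = model.predict(X_te)
print(f"Accessory-power MAE (MW): {mean_absolute_error(y_te, pred):.4f}")

# Derived PUE, assuming total facility power = IT power + accessory power.
it_power = X_te[:, 2]
pue_pred = (it_power + pred) / it_power
pue_true = (it_power + y_te) / it_power
within = np.mean(np.abs(pue_pred - pue_true) <= 0.01)
print(f"Share of test samples with |PUE error| <= 0.01: {within:.3f}")
```

Once fitted, a surrogate like this can serve as the physics-consistent baseline of the second stage: excess cooling energy is the gap between measured accessory power and the surrogate's prediction under the same operating conditions, accumulated over time.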
Type
Publication
arXiv preprint arXiv:2601.02275