Abstract

Large-scale content and cloud infrastructure providers strive to offer the highest level of availability across the infrastructure stack. This, however, is not an easy feat given the fast pace of technology evolution, infrastructure expansion and global reach. Google’s network infrastructure has been built to achieve scale, efficiency and very high reliability by following a set of key architectural principles, which we refer to as the “zero-touch network”. Failures do happen in any global-scale network infrastructure such as Google’s. By analyzing past failures, we found that a large number of them happened while a network management operation was in progress. To minimize such failures, we have built a network infrastructure in which all network operations are automated, requiring no additional steps beyond the instantiation of intent. The network infrastructure is fully declarative: the changes applied to individual network elements are derived by the infrastructure from the high-level, network-wide intent. Any network change is automatically halted and rolled back by the management infrastructure if the network displays unintended behavior. Finally, the infrastructure does not allow operations that violate network policies. While it might be tempting to limit the rate at which the network evolves in order to minimize the risk of network failures, we have internally come to the opposite conclusion. In a zero-touch network, continuous incremental evolution results in a more robust infrastructure than infrequent large changes do.
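The principles above (declarative intent, derived per-element changes, policy enforcement, and automatic halt and rollback) can be illustrated with a minimal Python sketch. It is not Google's implementation: every name in it (`derive_changes`, `apply_intent`, `keep_one_active`, `has_capacity`, the `"drain"` intent key, and the `"active"`/`"drained"` states) is a hypothetical stand-in chosen for illustration, and the network is modeled as a plain dictionary.

```python
from typing import Callable, Dict, List

State = Dict[str, str]  # hypothetical model: device name -> "active" or "drained"


class PolicyViolation(Exception):
    """Raised when an intent would break a network policy (illustrative)."""


def derive_changes(intent: Dict[str, List[str]]) -> State:
    # Translate network-wide intent into per-device changes.
    # The "drain" key is an assumption made for this sketch.
    return {device: "drained" for device in intent.get("drain", [])}


def keep_one_active(state: State, changes: State) -> bool:
    # Example policy: the end state must leave at least one device active.
    proposed = {**state, **changes}
    return any(status == "active" for status in proposed.values())


def has_capacity(state: State) -> bool:
    # Stand-in health signal: at least two devices must remain active.
    return sum(status == "active" for status in state.values()) >= 2


def apply_intent(
    state: State,
    intent: Dict[str, List[str]],
    policies: List[Callable[[State, State], bool]],
    healthy: Callable[[State], bool],
) -> bool:
    """Apply intent one device at a time; roll back on unhealthy behavior."""
    changes = derive_changes(intent)

    # Reject policy-violating operations before touching the network.
    for policy in policies:
        if not policy(state, changes):
            raise PolicyViolation(policy.__name__)

    snapshot = dict(state)  # known-good state to return to
    for device, desired in changes.items():
        state[device] = desired
        if not healthy(state):
            # Halt the rollout and restore the snapshot.
            state.clear()
            state.update(snapshot)
            return False
    return True
```

In this sketch, draining one of three devices succeeds, draining two is halted and rolled back once the health check fails, and draining all three is refused up front by the policy check. A real system would derive changes from a far richer intent model and use live telemetry as its health signal.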