Datacenter CPU landscape increases in vendor, architectural diversity as performance gains sought
May 6 2021
by James Sanders, John Abbott
Public cloud platform operators wield immense buying power – as workloads are migrated to cloud platforms, this concentration of compute creates an environment where the cloud platforms can strongly influence OEMs' design and product decisions. On one hand, this puts the likes of AWS, Azure and GCP in a kingmaker position; on the other, if a required component is not available on the open market, their economies of scale make it viable to simply build the necessary parts themselves. Intel's latter-day difficulties have pushed hyperscalers to open their facilities to AMD-powered systems, while custom silicon efforts such as Amazon's Graviton2 CPUs are increasingly gaining traction. With multiple competitors targeting Intel's Data Center Group, a vendor- and architecturally diverse landscape appears to be the future of enterprise compute.
The 451 Take
Competition has opened for enterprise-scale CPUs, and the reasons for this are multifaceted: the protracted difficulties Intel faced in manufacturing, diminished performance gains for each successive generation, and the advent of workloads (e.g., AI/ML) that have to date been more efficiently handled on GPUs and FPGAs/ASICs. This, combined with the buying power of cloud platform operators, is leading to increasingly tailored and often bespoke silicon for cloud platforms.
While this trend has been seen with Graviton, catering to it is an integral component of Pat Gelsinger's plan to transform Intel. In recent years Intel has seen its leadership eroded on multiple fronts – a resurgent AMD offers seamless x86-64 compatibility with higher core counts and greater power efficiency (the latter of which is enabled by TSMC's 7nm process), while Arm enjoys relevance in public cloud via Graviton at AWS, deployment of Ampere CPUs at Oracle and an abundance of speculation that Microsoft and Google are pursuing the design of custom CPUs for their respective cloud operations. Intel's attempts to turn the tide continue with the release of Ice Lake-SP: while this will undoubtedly be deployed by hyperscalers, meaningful competition in CPUs and the proliferation of customized silicon for specific workloads will persist.
Of note, 44% of respondents in 451 Research's Infrastructure Evolution 2020 - Quarterly Advisory Report indicated they have AMD-based servers in production, with 22% indicating they have Arm-based servers in production and a further 24% planning to implement them in the next 24 months. While Intel remains the incumbent provider (with 64% of respondents indicating Intel-based servers in production), there is markedly more diversity in the datacenter CPU landscape as competitive products have been introduced by vendors and deployed by customers.
Pat Gelsinger's return to Intel: How much is needed to right the ship?
On March 24, newly appointed Intel CEO Pat Gelsinger outlined his plan to reorient Intel following years of protracted issues with manufacturing products on its 10nm process. Intel is scheduled to formally introduce Xeon CPUs built on its 10nm Ice Lake microarchitecture in April. Gelsinger noted that cascading delays to 10nm affected Intel's 7nm designs – the 7nm product, Meteor Lake, is anticipated to tape-out in Q2 2021, and is intended to be the underpinning for Intel's 2023 client and datacenter CPU lineup.
Meteor Lake represents a pivot in Intel's strategy – it will leverage Intel's Foveros packaging technology, allowing Intel to produce the CPU as a 'tile' that can be combined with tiles produced at outside foundries, such as TSMC, to create a finished product with best-in-class IP cores from various firms. Manufacturing was the cornerstone of Gelsinger's presentation, with the IDM 2.0 strategy noting Intel's use of outside foundries. Gelsinger also announced Intel Foundry Services, a stand-alone business unit intended to manufacture chips for customers, as well as offer its IP portfolio to customers – including x86-64. In theory, this scenario would permit a cloud platform operator to contract the production of a custom system on a chip (SOC) incorporating Intel's x86-64 core with custom hypervisor silicon – providing the VM security/lifecycle management favored by the cloud operator while retaining binary compatibility with existing x86-64-dependent workloads.
Qualcomm's purchase of NUVIA and potential return to the datacenter
Qualcomm CEO-elect Cristiano Amon is seeking a turnaround in performance as the company contends with Intel and AMD for clients' computing tasks – NUVIA appears to be central to this strategy. While Qualcomm briefly played in the datacenter market, its effort (Qualcomm Centriq) was canceled just a year after launch, as the company dealt with the failed NXP acquisition and fought off a hostile takeover attempt from Broadcom. A return to the datacenter is plausible – NUVIA was founded explicitly to disrupt the datacenter-class CPU market – although plans along these lines are unlikely to be telegraphed before finalization.
Microsoft has a long-standing relationship with Qualcomm and had publicly pledged to support Centriq prior to its cancellation. Qualcomm produces customized SOCs for Microsoft's Surface Pro X tablet products; this collaboration could be extended to Azure, although Microsoft was rumored in December 2020 to be building its own Arm CPUs for Azure.
NVIDIA's planned Arm acquisition and datacenter ambitions
NVIDIA announced plans to acquire Arm Ltd from SoftBank in September 2020, in a complex deal anticipated to take 18 months to close. This situation creates interesting optics for the GPU manufacturer; NVIDIA today competes with many of Arm's licensees. NVIDIA has spent over a decade aiming to increase its presence in datacenters, introducing Compute Unified Device Architecture (CUDA) in 2007, allowing for general-purpose workloads to be executed on GPUs, a capability instrumental in the rise of AI/ML use in enterprise compute.
NVIDIA offers a lineup of datacenter-targeted GPUs, as well as preconfigured server hardware (NVIDIA DGX) with Mellanox interconnect equipment and, for now, Intel or AMD CPUs, depending on model. NVIDIA acquired Mellanox in 2020 – it stands to reason that with the acquisition of Arm, it could produce more tightly integrated servers, reducing reliance on AMD and Intel in the process. At NVIDIA GTC 2021, the firm announced Grace, a high-performance Arm-powered server CPU intended for use in AI systems, starting in 2023.
The Grace CPU is touted as being built on a 'next-generation' Arm Neoverse CPU platform (likely one incorporating Armv9 architecture) with LPDDR5X memory bandwidth greater than 500GB/sec, NVLink interface providing GPU-to-CPU speeds in excess of 900GB/sec and CPU-to-CPU speeds in excess of 600GB/sec.
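To give a rough sense of scale for these figures, the following is a minimal back-of-the-envelope sketch, not a benchmark. It treats the quoted bandwidths as lower bounds (each is stated as "greater than" or "in excess of") and ignores latency, protocol overhead and contention. The 100GB payload and the function name are illustrative assumptions, not figures from NVIDIA.

```python
GB = 10**9  # decimal gigabyte, matching the GB/sec units quoted above

# Bandwidths as quoted in the text, in bytes per second (stated lower bounds).
QUOTED_BANDWIDTH = {
    "LPDDR5X memory": 500 * GB,
    "NVLink CPU-to-CPU": 600 * GB,
    "NVLink GPU-to-CPU": 900 * GB,
}


def transfer_seconds(payload_bytes: int, bandwidth_bytes_per_s: int) -> float:
    """Idealized time to move a payload at a sustained bandwidth."""
    if payload_bytes < 0 or bandwidth_bytes_per_s <= 0:
        raise ValueError("payload must be >= 0 and bandwidth must be > 0")
    return payload_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    payload = 100 * GB  # hypothetical working set, chosen only for illustration
    for link, bandwidth in QUOTED_BANDWIDTH.items():
        millis = transfer_seconds(payload, bandwidth) * 1000
        print(f"{link}: about {millis:.0f} ms at the quoted rate")
```

Because the quoted rates are floors, the idealized times printed here are ceilings; real transfers would also depend on access patterns and software overhead.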
Independent of NVIDIA, Arm is preparing additions to its Neoverse datacenter CPU platform – upon which Amazon's Graviton2 and Ampere Altra products are built – with the Neoverse V1 (Zeus) representing a high-performance design available now to licensees and the Neoverse N2 (Perseus) scale-out design planned for availability this year.
Armv9 architecture focuses on enterprise with security and performance upgrades
Arm Ltd announced the Armv9 architecture – the first new major architecture update in a decade – introducing the Arm Confidential Compute Architecture, allowing applications in use, in transit and at rest to be encrypted. This enhances the security posture in multi-tenant environments (e.g., public cloud) where other tenant workloads are not guaranteed to be benign. (As a quick refresher, the Spectre and Meltdown vulnerabilities permitted adjacent tenant workloads to peek at data in RAM; the software mitigation for this imposed a non-trivial performance penalty.) The Armv9 implementation further shields the tenant workloads from the host; in this case, the cloud platform operator.
Arm also introduced SystemReady, a standardization framework for Arm-powered devices (extending and replacing the principles in the existing ServerReady framework), and made Scalable Vector Extension 2 (SVE2) standard. The first generation of the extension (SVE) can be seen in Fujitsu's Fugaku supercomputer; SVE2 is intended to further accelerate ML and DSP capabilities.
It stands to reason that products implementing Armv9 designs would be forthcoming in product refresh cycles for Amazon Graviton and Ampere Altra product lines; likewise, the Arm Neoverse reference designs (on which Graviton and Altra are built) would be a natural starting point for competitive Arm-powered datacenter-class SOCs.
AMD EPYC 3 Milan cements status in datacenter
On March 15, AMD introduced Milan, the third generation of its EPYC series of server chips, which began shipping in volume in Q4 2020. The first generation, EPYC 7001 Naples in 2017, brought AMD back as a contender in the datacenter after more than a decade's absence, but it wasn't until EPYC 7002 Rome in 2019 that it was fully production ready. By launching it on time and on spec, AMD proved to potential OEMs and cloud customers that it was committed to and capable of a predictably executed roadmap. HPC, cloud and enterprise users rapidly adopted it. Milan (EPYC 7003) is the next step. It delivers strong per-core performance using up to 64 new 7nm Zen 3 cores per processor, or 16 to 128 threads per socket, and features more efficient use of per-core cache memory, 4, 6 or 8 interleaved memory channels, 128 lanes of PCIe 4.0 connectivity and an integrated security chip on die. It's also more easily updated from the previous generation with just a BIOS update, rather than a platform change (as with Ice Lake).
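As a quick sanity check on the thread figures above, the sketch below assumes two-way simultaneous multithreading (two hardware threads per core), under which the quoted range of 16 to 128 threads per socket corresponds to 8 to 64 cores. The function names are illustrative, and the two-threads-per-core default is an assumption stated here rather than a figure taken from the text.

```python
# Assumption: each core exposes two hardware threads (two-way SMT).
THREADS_PER_CORE = 2


def threads_per_socket(cores: int, threads_per_core: int = THREADS_PER_CORE) -> int:
    """Hardware threads exposed by one socket with the given core count."""
    if cores <= 0 or threads_per_core <= 0:
        raise ValueError("cores and threads_per_core must be positive")
    return cores * threads_per_core


def cores_for_threads(threads: int, threads_per_core: int = THREADS_PER_CORE) -> int:
    """Core count implied by a thread count; rejects counts that do not divide evenly."""
    if threads <= 0 or threads_per_core <= 0:
        raise ValueError("threads and threads_per_core must be positive")
    if threads % threads_per_core != 0:
        raise ValueError("thread count is not a multiple of threads per core")
    return threads // threads_per_core


if __name__ == "__main__":
    # Thread range quoted in the text: 16 to 128 threads per socket.
    for threads in (16, 128):
        print(f"{threads} threads per socket implies {cores_for_threads(threads)} cores")
```

In a dual-socket system the same arithmetic simply doubles, so a fully populated two-socket server at the top of the range would expose 256 hardware threads under this assumption.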
AMD has found differentiated value points as a means of competing against Intel, most notably in offering options on single or dual sockets, core counts and frequency without limiting any of the memory and connectivity options. EPYC 7003 comes in 19 variants in three groups: core performance (high frequency, more cache per core), core density (high core and thread count) and balanced. HPC users typically require high throughput and memory for complex datasets, cloud providers are after high density, and enterprises value transactional and analytics performance. But just as Intel is now positioning itself as an XPU company, AMD has additional offerings beyond x86. Its Instinct GPUs are now providing viable competition to NVIDIA for AI and ML, and its ongoing acquisition of Xilinx promises networking and edge inference capabilities.
With new datacenter CPUs either just making it into the channel – like AMD EPYC 3 or Intel Ice Lake-SP – or just around the corner, and with custom designs integrating Armv9 likely forthcoming, the variety of deployment options for cloud platform operators and for hardware vendors integrating these CPUs into systems is greater than it has been in years. This should be a welcome change: competition spurs innovation, and innovation is desperately needed to accommodate the increased complexity of business workloads.