While everyone has had a reasonably good time bashing the Itanium for the past several years, Intel’s high-end chip does have some significant upper-echelon features that x86/x64 systems could only dream about. Many of those features are in the RAS (Reliability, Availability, and Serviceability) arena — capabilities like failed DIMM isolation, hot-swappable RAM, inter-socket memory mirroring, corrupt data containment, and CPU hot-adds. Until the release of the Nehalem-EX, these features simply didn’t exist in the Xeon world. They do now.
The Nehalem-EX chip is designed for high-capacity SMP servers, scaling from two to 256 sockets at up to 256GB of addressable RAM per socket. Each chip has eight physical cores and 24MB of L3 cache, and can present 16 logical cores through Hyper-Threading. These are big-time numbers. It’s possible to drop 1TB of RAM into a four-socket Nehalem-EX server.
It’s also important to understand the differences between the Nehalem-EX and the Westmere-EP. The Westmere-EP is built on a 32nm process, while the Nehalem-EX is built on a 45nm process. Where the Westmere-EP has six cores (as did the X7400-series Dunnington), the Nehalem-EX has eight. Where the Westmere-EP tops out at 12MB of L3 cache, the Nehalem-EX runs up to 24MB. Where the Westmere-EP runs up to 3.33GHz per core, the Nehalem-EX currently tops out at 2.26GHz per core. Where the Westmere-EP has two QuickPath interconnects, the Nehalem-EX has four, and it can address twice the RAM of the Westmere-EP. Both offer Hyper-Threading, Intel VT virtualization hooks, and Turbo Mode.
The Nehalem-EX is suited for very large scaled workloads. Although the Westmere-EP has the bump in clock rate, it doesn’t scale anywhere near the levels provided by the Nehalem-EX. That said, some workloads are better suited to the Westmere-EP, especially single-threaded tasks that benefit from the higher clock rate.
EX-treme performance
To test the Nehalem-EX, I opted for my suite of real-world concurrency tests. Lacking an Intel X7400-series server in the lab, I pitted a Dell R810 running two Intel X7560 Nehalem-EX CPUs against an older HP DL580 G3 running four Intel X7350 Tigerton CPUs. Note the differences between these systems before digging into the results: The HP DL580 had four quad-core X7350 CPUs running at 2.93GHz per core with 8MB of L2 cache apiece (Tigerton has no L3 cache). The Dell R810 had only two eight-core X7560s running at 2.26GHz per core with a 24MB L3 cache. Whereas the X7560 Nehalem-EX CPUs support Hyper-Threading, the X7350s in the DL580 do not. It’s not apples-to-apples, but it gives a good sense of the performance gains to expect if your servers are more than a year old and running on the X7300-series platform.
The tests I ran are based on common operations found in many applications. The LAME tests convert a 152MB WAV file to MP3 at a 256Kbps bit rate. The compression tests use gzip and bzip2 to compress and uncompress a 55MB MP3 file. The MD5 tests calculate MD5 sums on 152MB files, and the MP4-to-FLV tests transcode a 24MB MP4 file to FLV. These tests are single-threaded, but run concurrently with increasing levels of concurrency to stress physical and logical cores, memory bandwidth, and memory interconnects, as well as disk I/O.
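The article doesn’t publish the actual test scripts, but a harness along these lines can be sketched in a few lines of shell; the function name, timing method, and the commented invocation are illustrative assumptions, not the author’s suite itself.

```shell
# Minimal sketch of a concurrency harness (assumed, not the author's actual
# scripts): launch N copies of a single-threaded command in the background,
# wait for them all, and report wall-clock time for that concurrency level.
run_concurrent() {
    level=$1; shift
    start=$(date +%s)
    i=0
    while [ "$i" -lt "$level" ]; do
        "$@" &              # one single-threaded worker per slot
        i=$((i + 1))
    done
    wait                    # block until every worker has finished
    end=$(date +%s)
    echo "$level workers: $((end - start))s"
}

# Illustrative invocation: 16 concurrent MD5 runs over one staged file.
# run_concurrent 16 md5sum /tmp/sample.bin
```

Stepping the `level` argument through 8, 16, 32, and beyond reproduces the oversubscription pattern described above: once the worker count exceeds the number of logical CPUs, scheduler, cache, and memory-bandwidth behavior dominates the results.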
On the Nehalem-EX, I ran these tests with Hyper-Threading enabled and disabled. For comparison, I’ll reference the results with Hyper-Threading disabled so that the figures represent the same number of logical CPUs. All tests were run on CentOS 5.4. The reported figures were drawn from tests run from ramdisk to eliminate disk I/O from being a bottleneck.
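The article doesn’t say how the ramdisk was set up; one unprivileged way to stage test files in RAM on a Linux box is the tmpfs mount at `/dev/shm` (the directory name and file size below are illustrative).

```shell
# /dev/shm is a tmpfs (RAM-backed) mount present on most Linux systems and
# writable without root -- handy for keeping disk I/O out of a benchmark.
RAMDISK=/dev/shm/bench
mkdir -p "$RAMDISK"

# Stage an 8MB sample file in RAM (size chosen only for illustration).
dd if=/dev/zero of="$RAMDISK/sample.bin" bs=1M count=8 2>/dev/null

ls "$RAMDISK"
```

A dedicated mount (`mount -t tmpfs -o size=2g tmpfs /mnt/ramdisk`) works the same way and allows a larger size cap, but requires root.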
The results start out somewhat unimpressively. With eight concurrent processes, the four X7350 CPUs in the DL580 were evenly matched against the two Nehalem-EX CPUs in the R810 in the LAME and gzip tests, but were significantly behind in the other tests. At a concurrency level of 16, the gap widened substantially on all tests, with the older system slightly ahead of the Nehalem-EX in the LAME and gzip tests, but running way behind in the remainder. Once the testing started to significantly oversubscribe the number of logical CPUs on each server, the Nehalem-EX pulled way into the lead and stayed there across all tests.
In fact, I ran many test passes at the 48, 64, and 96 concurrent process levels to verify the results because the performance differences were so huge. For example, at 64 concurrent processes, the two-CPU Nehalem-EX system completed the MP4-to-FLV test in 2 minutes, 12 seconds; the four-CPU X7350 system took more than 30 minutes to complete the same task. That’s a massive performance difference, and the delta between the two servers only grew wider as the concurrency increased. Not only was I able to ramp the Nehalem-EX up to 768 concurrent processes, but it was still running the tests about 50 percent faster than the X7350 system could run 64 concurrent processes.
This extreme performance increase comes down to several factors. The older X7350 system may have had two additional CPUs and a 670MHz clock-rate advantage per core, but it had only 8MB of L2 cache per CPU and no L3 at all, compared to the 24MB L3 cache on the Nehalem-EX. The X7350 also lacked the benefit of QuickPath, and its shared memory bus became a bottleneck. Thus, in the heavier workload tests, the Nehalem-EX blew the X7350 out of the water, even with a lower clock rate per core and the same total number of cores. In the lighter workloads, the difference was not nearly as significant.
I also ran the same suite of tests on a four-CPU AMD Opteron 8435 server. These six-core, 2.6GHz Istanbul CPUs have been out for the better part of a year now, and don’t quite match up to the Nehalem-EX (due to slower RAM, a quarter of the L3 cache, and the lower speed of this version of HyperTransport vs. QPI). But they make a reasonable comparison for the Nehalem-EX in terms of real-world deployment.
These tests showed that the Nehalem-EX definitely benefits from the faster, 1,066MHz DDR3 RAM (vs. Istanbul’s 800MHz DDR2), QPI, and the increased cache, as the X7560 bested the AMD Opteron 8435 in most tests, although not nearly as substantially as you might think. I ran the tests against a 24-core Istanbul system and again with an artificial constraint limiting the AMD box to only 16 physical cores. It’s not a perfect comparison, considering there were still four CPUs in the AMD box, but it’s reasonable.
The results: The full 24-core AMD Istanbul system held a performance edge at several concurrency levels against the X7560 with Hyper-Threading enabled. However, the Istanbul system lost ground when limited to only 16 cores, compared to the X7560 with or without Hyper-Threading enabled. In most cases, the margin was around 10 percent in favor of the X7560 over the AMD Opteron 8435, although it fluctuated somewhat throughout the concurrency levels. Both servers blew the doors off the X7350-based server, especially in the higher concurrency levels. The moral of this story is that the Nehalem-EX scales out extremely well. However, AMD’s new 12-core Magny Cours chip could make it a whole new ballgame.
Blurring the lines
One of the major differences between x86/x64 servers and most RISC servers and mainframes is the ability of higher-end RISC platforms to handle error detection, correction, and recovery at the system level. This is not a matter of simply determining that a DIMM has gone bad and displaying the location of the failure, but automatically blocking off that memory segment and permitting the DIMM to be hot-swapped with another, then resuming normal operations with the replacement without any downtime. The MCA (Machine Check Architecture) in the Nehalem-EX provides this capability, as well as other enhanced reliability features.
Providing these features isn’t as simple as it may sound. The OS needs to play a significant part in this dance too, since the processor needs to inform the operating system of a RAM failure and allow the OS to either restart a process that was using that memory or otherwise shuffle data away from the bad RAM prior to isolation and replacement.
There’s also support in the Nehalem-EX for hot-add RAM and CPUs, meaning that memory and processors can be added to a running system without a reboot. Naturally, enabling these features requires close cooperation between the processor, the firmware, and the operating system, so don’t expect them to be available on older OS platforms, though most major operating system vendors have said they will support them at the processor’s release.
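To make the OS side of this concrete: on Linux, CPU and memory hot-plug is exposed through sysfs. The sketch below only reads state; the commented write operations (which need root, hot-plug-capable hardware, and a kernel built with hotplug support) show the general shape of the interface, and the specific cpu/memory block numbers are placeholders.

```shell
# Which logical CPUs the kernel currently has online (readable without root).
cat /sys/devices/system/cpu/online

# Bringing a newly hot-added CPU online would look like this (root required):
#   echo 1 > /sys/devices/system/cpu/cpu16/online
# Offlining a memory block ahead of a DIMM swap would look like this:
#   echo offline > /sys/devices/system/memory/memory32/state
```

The point is that the hardware capability is only half the story; the kernel has to migrate processes and data off the affected resource before it can be isolated or removed, which is exactly the OS cooperation described above.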
There are more RAS features too, such as QPI packet retry and CRC checking to bolster the reliability of the QuickPath interconnects, I/O hub hot-add, and memory thermal throttling. Suffice it to say, Intel has thrown a whole bunch of extremely high-level reliability functions into the new Xeon.
Although the Nehalem-EX doesn’t offer the fastest clock rates, it offers more cores per CPU than any other Intel processor, it can address massive amounts of RAM, and it adds a whole host of reliability features — features previously only found in the Itanium. We’ll soon see what AMD’s just-arrived Magny Cours can deliver, but whatever the outcome, it’s clear that x86/x64 computing has never been better.