Name: Nehalem Part 3: The Cache Debate, LGA-1156 and the 32nm Future
Author: Anand Lal Shimpi

Original Link: https://www.anandtech.com/show/2671

by Anand Lal Shimpi on November 19, 2008 8:00 PM EST

Another Part? Oh there will be more

In an unexpected turn of events I found myself deep in conversation with many Intel engineers as well as Pat Gelsinger himself about the design choices made in Nehalem. At the same time, Intel just released its 2009 roadmap which outlined some of the lesser known details of the mainstream LGA-1156 Nehalem derivatives.

I hadn’t planned on my next Nehalem update being about caches and mainstream parts, but here we go. For further reading I'd suggest our first two Nehalem articles and the original Nehalem architecture piece.

Nehalem’s Cache: More Controversial Than You’d Think

I spoke with Ronak Singhal, Chief Architect on Nehalem, at Intel’s Core i7 launch event last week in San Francisco and I said to him: “I think you got the cache sizes wrong on Nehalem”. I must be losing my shyness.

He thought I was talking about the L3 cache and asked if I meant it needed to be bigger, and I clarified that I was talking about the anemic 256KB L2 per core.

We haven’t seen a high end Intel processor with only 256KB L2 per core since Willamette, the first Pentium 4. Since then Intel has been on a steady ramp upwards as far as cache sizes go. I made a graph of L2 cache size per core of all of the major high end Intel cores for the past decade:

[Chart: L2 cache size per core for major high-end Intel cores over the past decade]

For the most part we’ve got a linear trend. There are a few outliers, but by early 2008 you’d expect Intel CPUs to have around 2 - 3MB of L2 cache per core. Now look at the lower right of the chart and see the little orange outlier? Yeah, that’s the Core i7 with its 256KB L2 cache per core. It’s like 2002 - 2007 never happened.

If we look at total on-chip cache size however (L2 + L3), the situation is very different:

[Chart: total on-chip cache size (L2 + L3) for the same Intel cores]

Now we’ve got exponential growth in cache size, not linear, and all of a sudden the Core i7 conforms to societal norms. To understand why, we have to look at what happened around 2005 - 2006: Intel started shipping dual-core CPUs. As core count went up, so did the total amount of cache per chip. Dual-core CPUs quickly started shipping with 2MB and 4MB of cache per chip, and the outgoing 45nm quad-core Penryns had 12MB of L2 cache on a single package.

The move to multi-core chip designs meant that the focus was no longer on feeding the individual core, but making sure all of the cores on the chip were taken care of. It’s all so very socialist (oh no! ;) ).

Nehalem was designed to be a quad-core product, but also one that’s able to scale up to 8 cores and down to 2 cores. Intel believes in this multi-core future, so designing primarily for dual-core didn’t make sense: eventually dual-core will go away in desktops. That future is still a few years away, but it’s the course we’re on nonetheless.

AMD's shift to an all quad-core client roadmap

Intel is pushing the shift to quad-core, much like AMD is. By 2010 all of AMD’s mainstream and enthusiast CPUs will be quad-core with the ultra low end being dual-core, a trend that will continue into 2011. The shift to quad-core makes sense, but unfortunately very few consumer applications benefit from four cores today. I hate to keep re-using this same table but it most definitely applies here:

Back when AMD introduced its triple-core Phenom parts I put together a little table illustrating the speedup you get from one, two and four cores in SYSMark 2007:

	SYSMark 2007 Overall	E-Learning	Video Creation	Productivity	3D
Intel Celeron 420 (1 core, 512KB, 1.6GHz)	55	52	55	54	58
Intel Celeron E1200 (2 cores, 512KB, 1.6GHz)	76	68	91	70	78
% Increase from 1 to 2 cores	38%	31%	65%	30%	34%
Intel Core 2 Duo E6750 (2 cores, 4MB, 2.66GHz)	138	147	141	120	145
Intel Core 2 Quad Q6700 (4 cores, 8MB, 2.66GHz)	150	145	177	121	163
% Increase from 2 to 4 cores	8.7%	0%	26%	1%	12%
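
The percentage rows in the table can be reproduced from the raw scores. The snippet below is a minimal sketch of that arithmetic; the scores are copied from the table, while the dictionary keys and the helper name `percent_increase` are just illustrative choices. Note that the E-Learning result for two to four cores works out slightly negative, which the table shows as 0%.

```python
# SYSMark 2007 scores copied from the table above, in column order.
COLUMNS = ["Overall", "E-Learning", "Video Creation", "Productivity", "3D"]

SCORES = {
    "Celeron 420 (1 core)": [55, 52, 55, 54, 58],
    "Celeron E1200 (2 cores)": [76, 68, 91, 70, 78],
    "Core 2 Duo E6750 (2 cores)": [138, 147, 141, 120, 145],
    "Core 2 Quad Q6700 (4 cores)": [150, 145, 177, 121, 163],
}


def percent_increase(before, after):
    """Return the percentage change from `before` to `after`, column by column."""
    return [100.0 * (a - b) / b for b, a in zip(before, after)]


# Same-clock, same-cache pairs, so the only variable is core count.
one_to_two = percent_increase(
    SCORES["Celeron 420 (1 core)"], SCORES["Celeron E1200 (2 cores)"]
)
two_to_four = percent_increase(
    SCORES["Core 2 Duo E6750 (2 cores)"], SCORES["Core 2 Quad Q6700 (4 cores)"]
)

for name, small, large in zip(COLUMNS, one_to_two, two_to_four):
    print(f"{name:15s} 1->2 cores: {small:5.1f}%   2->4 cores: {large:5.1f}%")
```
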

Not only are four cores unnecessary for most consumers today, but optimizing a design for four cores by opting for very small, low latency L2 caches and a large, higher latency L3 cache for the chip isn’t going to yield the best desktop performance.

A Nehalem optimized for two cores would have a large L2 cache similar to what we saw happening on the first graph, but one optimized for four or more cores would look like what the Core i7 ended up being. What’s impressive is that Intel, in optimizing for a quad-core design, was still able to ensure that performance either didn’t change at all or improved in applications that aren’t well threaded.

Apparently the L2 cache size was and still is a controversial issue within Intel; many engineers still feel it is too small for current workloads. The problem with making it larger is not just one of die size, but also one of latency. Intel managed to get Nehalem’s L2 cache down to 10 cycles, and the next bump in L2 size would add another 1 - 2 cycles to its latency. At 512KB per core, taking up to 20% longer to access the cache was simply unacceptable to the designers.

In fact, going forward there’s no guarantee that the L2 caches will see growth in size, but the focus instead may be on making the L3 cache faster. Right now the 8MB L3 cache takes around 41 cycles to access, but there’s clearly room for improvement - getting a 30 cycle L3 should be within the realm of possibility. I pushed Ronak for more details on how Intel would achieve a lower latency L3, but the best I got was “microarchitectural tweaks”.
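
To get a feel for the size/latency tradeoff, here is a rough sketch of an expected-latency calculation for a load that has already missed L1. The cycle counts (10 and 12 for L2, 41 and 30 for L3) come from the discussion above. The hit rates and the 200-cycle memory latency are hypothetical placeholders I made up for illustration, not measurements, and the function name is my own.

```python
def expected_load_latency(l2_cycles, l2_hit_rate, l3_cycles, l3_hit_rate, memory_cycles):
    """Expected latency, in core cycles, of a load that has already missed L1.

    Each latency is treated as the total load-to-use time for a hit at that
    level, so the result is a weighted average rather than a sum of stages.
    """
    l2_miss_rate = 1.0 - l2_hit_rate
    l3_miss_rate = 1.0 - l3_hit_rate
    beyond_l2 = l3_hit_rate * l3_cycles + l3_miss_rate * memory_cycles
    return l2_hit_rate * l2_cycles + l2_miss_rate * beyond_l2


# Latencies quoted in the article.
L2_CYCLES = 10          # 256KB L2 as shipped
L2_CYCLES_BIGGER = 12   # 512KB L2, worst case of the 1 - 2 cycle penalty
L3_CYCLES = 41          # 8MB L3 as shipped
L3_CYCLES_TARGET = 30   # the lower-latency L3 discussed above

# HYPOTHETICAL placeholders, not measurements: swap in profiled values.
L2_HIT = 0.80
L2_HIT_BIGGER = 0.85
L3_HIT = 0.90
MEMORY_CYCLES = 200

# The "20% longer" figure is simply the extra cycles over the base latency.
l2_penalty_pct = 100.0 * (L2_CYCLES_BIGGER - L2_CYCLES) / L2_CYCLES

scenarios = {
    "as shipped": expected_load_latency(L2_CYCLES, L2_HIT, L3_CYCLES, L3_HIT, MEMORY_CYCLES),
    "512KB L2": expected_load_latency(L2_CYCLES_BIGGER, L2_HIT_BIGGER, L3_CYCLES, L3_HIT, MEMORY_CYCLES),
    "30-cycle L3": expected_load_latency(L2_CYCLES, L2_HIT, L3_CYCLES_TARGET, L3_HIT, MEMORY_CYCLES),
}

print(f"L2 latency penalty at 512KB: {l2_penalty_pct:.0f}%")
for name, cycles in scenarios.items():
    print(f"{name:12s} {cycles:6.2f} cycles")
```

Which scenario comes out ahead depends entirely on the hit rates you plug in, so treat the printed numbers as an illustration of the tradeoff, not a verdict.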

As I mentioned before, Ronak wanted the L3 to be bigger on Nehalem; at 8MB that’s only 2MB per core and merely sufficient in his eyes. There are two 32nm products due out in the next 2 years, I suspect that at least one of them will have an even larger L3 to continue the exponential trend I showed in the second chart above.

Could the L2 be larger? Sure. But Ronak and his team ultimately felt that the tradeoff between size/latency was necessary for what Intel’s targets were with Nehalem. And given its 0 - 60% performance increase, clock for clock, over Penryn - I can’t really argue.

Mainstream Nehalem: On-chip GPU and On-chip PCIe

Let’s take a look at the Core i7, the first Nehalem incarnation:

Three DDR3 memory channels, a QPI link to the X58 chipset and support for multiple GPUs off of the X58 IOH. The Core i7, as you know by now, plugs into Intel’s new LGA-1366 socket. But in the second half of next year, there will be a new socket for mainstream users: LGA-1156.

Meet Lynnfield. It’s also a 4-core/8-thread design with an 8MB L3 cache, just like the Core i7, but it plugs into LGA-1156. Instead of three DDR3 channels it’s got two, and instead of QPI it’s got Intel’s DMI connecting it to the chipset. DMI is a lower bandwidth interconnect, but Lynnfield doesn’t need a ton of bandwidth between it and the chipset thanks to its secret weapon: Lynnfield has an on-package PCIe controller.

There are 16 PCIe lanes on Lynnfield (presumably PCIe 2.0) and they can be used as two x8s or a single x16, so you’ll get 2-way SLI/CrossFire support assuming all licensing silliness is worked out. The close proximity of the PCIe controller to the CPU could mean some very interesting things for latency. If well designed, Lynnfield could have the lowest latency CPU-GPU connection we’ve seen on a desktop PC. Whether or not that’ll actually mean anything for real world performance remains to be seen. I’d guess not, but it’s neat to talk about nonetheless.

Next up we’ve got Havendale, this is a 2-core/4-thread part with a 4MB L3 (still 2MB of L3 per core, just like Lynnfield and the Core i7). The “pinout” (if we can still call it that on these pinless CPUs) is the same as Lynnfield, so we’ve got a two channel DDR3 memory controller and DMI to the chipset.

Havendale’s secret sauce is that it’s got an on-package GPU. I’d expect it to be a bigger, better, faster variant of G45 (hopefully a lot better/faster) built on a 45nm process. This should beat AMD to the punch with the first single-chip CPU/GPU for mainstream desktops/notebooks, as AMD delayed its first APUs until 2011. Alongside the on-package GPU we've also got the same PCIe controller from Lynnfield.

The actual display output on Havendale will be through the chipset itself but the GPU and PCIe interface are on the CPU’s package. Havendale only offers a single x16 PCIe slot; you can’t run it in 2 x8 mode.
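
With three parts and two sockets in play, it may help to see the platform differences side by side. The sketch below just restates the specs given above as a small data structure; the class and field names are my own invention, and the empty `pcie_modes` for the Core i7 reflects that its PCIe lanes hang off the X58 IOH instead of the CPU package.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class NehalemPart:
    name: str
    socket: str
    cores: int
    threads: int
    l3_mb: int
    ddr3_channels: int
    chipset_link: str
    pcie_modes: Tuple[str, ...]  # on-package PCIe configurations, if any
    on_package_gpu: bool

    @property
    def l3_mb_per_core(self) -> float:
        """L3 cache per core in MB."""
        return self.l3_mb / self.cores


PARTS = [
    NehalemPart("Core i7", "LGA-1366", 4, 8, 8, 3, "QPI", (), False),
    NehalemPart("Lynnfield", "LGA-1156", 4, 8, 8, 2, "DMI", ("1 x16", "2 x8"), False),
    NehalemPart("Havendale", "LGA-1156", 2, 4, 4, 2, "DMI", ("1 x16",), True),
]

for part in PARTS:
    pcie = ", ".join(part.pcie_modes) or "via chipset"
    gpu = "yes" if part.on_package_gpu else "no"
    print(
        f"{part.name:10s} {part.socket}  {part.cores}C/{part.threads}T  "
        f"{part.l3_mb_per_core:.0f}MB L3/core  {part.ddr3_channels}ch DDR3  "
        f"{part.chipset_link}  PCIe: {pcie}  GPU: {gpu}"
    )
```
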

At the right clock speeds, Havendale should be perfect for notebooks and desktops as well. These days two-cores with Hyper Threading would be the perfect mixture of cores/performance for the majority of consumers. As I noted in part 2 of our Nehalem coverage, give me Nehalem’s power efficiency in a notebook and I’ll be beyond happy.

The hiccup however is that we won’t see Havendale until Q1 2010. It’ll start production in Q4’09 but systems won’t ship until the beginning of the next year. This does leave a hole in Intel’s Nehalem roadmap as there won’t be any Intel integrated graphics chipsets between now and Q1 2010, which should give NVIDIA ample opportunity to sell chipsets into the mainstream Nehalem market.

What to Buy: Mainstream vs. High End Nehalem

With two sockets targeted at desktops, how will the Core i7s that launched this month stack up to the mainstream Lynnfield and Havendale parts?

The absolute highest frequencies will only be available in LGA-1366 packages and I’d expect this is where we’d see 8-core/16-thread Nehalem parts first (if not exclusively). We’ve already shown that the three DDR3 channels don’t really help for most desktop applications, but this could change when Nehalem moves to 8 cores. Overclockability may also be better on LGA-1366 as the CPUs themselves will be higher bins.

Intel’s roadmaps show three pricepoints of Lynnfield processors in 2009. The top end Lynnfield part looks to be something that’s similar in price/frequency to the i7-940 (or whatever replaces it in Q3 2009). If I were to guess I’d say that’d be a $562 3GHz+ Lynnfield with performance somewhere in between an i7-940 and i7-965.

There will be a midrange Lynnfield, most likely priced/clocked similarly to the i7-920 or its eventual replacement. I’d guess a 2.66GHz - 2.93GHz CPU priced at around $284. Finally the low-end Lynnfield will be somewhere near $200 and probably weigh in at 2.4/2.53GHz. With Havendale not arriving until 2010, it’s currently absent from all Intel roadmaps.

Intel is going to support both platforms, LGA-1366 and LGA-1156, for the long term; the difference will be in the type of processors enabled. LGA-1366 may end up being more of a high end enthusiast play. Intel indicated that LGA-1366 CPUs would be binned higher, so you can expect higher overclocks and obviously higher top end frequencies.

At the same time you should be able to get pretty far with LGA-1156. Simple 500MHz overclocks shouldn’t be a problem, but the 1GHz+ overclocks we’re used to on LGA-1366 and LGA-775 may not be as achievable - at least not at 45nm.

Intel isn’t going to do anything to limit overclocking on LGA-1156 platforms. The same current limit bypass that’s on LGA-1366 boards will be optional on 1156 boards should the motherboard manufacturer choose to support it.

The breakdown seems pretty simple: if you’re the type of person who bought the Q6600/Q9300, then Lynnfield may be the Nehalem for you. If you spent a bit more on your CPU or are more of an enthusiast overclocker, the current Core i7 seems like the path Intel wants you to take.

The issue with Lynnfield is that it’s a good 6+ months away, and if Core i7 can speed up your workloads a lot today then you’ll be tempted to make the upgrade now. In notebooks we’ll see Lynnfield in the larger machines and Havendale in most of the platforms.

Without mainstream mobile Nehalem until Q1 2010, next year will be a very long wait for a serious mobile upgrade. But if you can wait it out, or buy something cheaper today, the time to upgrade will be in Q1 2010. I’m going to go ahead and revise my Apple notebook recommendation given that we probably won’t see a Nehalem based MacBook until 2010. Buy the cheapest MacBook you can today and make it last, upgrade again in 2010. Ooh, that rhymes.

What’s Next: A Preview of Westmere and Sandy Bridge

Conroe was designed to address a deficiency in the desktop and mobile markets and Nehalem to tidy up the workstation/server space, so what’s next? Westmere and Sandy Bridge are the 32nm follow-ons and Intel has already hinted at the major changes coming in that generation: power consumption and floating point performance.

Westmere will be little more than a die shrink to 32nm. We may get some more cache but I wouldn’t expect significant performance improvements other than from clock speeds. Westmere will take Nehalem’s power efficiency and combine it with a pure power reduction to be quite a threat.

Sandy Bridge will add support for AVX:

By the end of 2009 we should have support for both OpenCL and DirectX 11 by GPUs from all vendors, including Intel with Larrabee. These APIs in combination with the highly parallel nature of the GPUs that will be able to run them, should allow for some incredible speedups on highly data parallel applications. While most of these applications are currently limited to the scientific field, we’ll start to see them appear in the consumer space (we’re starting to already with video transcoding and Photoshop).

Not all applications are data parallel enough to run well on a GPU, but they may require more than what present day CPUs can offer in terms of floating point throughput. Intel’s AVX instructions are designed to bridge the gap between the CPU and the GPU, offering an alternative to developers who could stand to gain performance from running some of their code on a GPU but would rather keep the work on the CPU itself to make programming simpler. Developers will move their code off to the GPU if the performance is worthwhile, but if you can get similar performance gains without recoding, that’s the preferred avenue. Make sense?
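
As a rough illustration of where the extra floating point throughput would come from, the sketch below counts how many vector instructions it takes to cover an array at two register widths. It assumes the publicly documented 256-bit AVX register width against 128-bit SSE; the function names and the 1024-element array are hypothetical. It only counts instructions and says nothing about how fast Sandy Bridge will actually execute them.

```python
import math

SSE_REGISTER_BITS = 128   # existing SSE vector registers
AVX_REGISTER_BITS = 256   # assumption: AVX width as publicly documented
FLOAT32_BITS = 32         # single-precision float


def lanes(register_bits, element_bits=FLOAT32_BITS):
    """Number of elements that fit in one vector register."""
    return register_bits // element_bits


def vector_ops_needed(n_elements, register_bits, element_bits=FLOAT32_BITS):
    """Vector instructions needed to touch every element once."""
    return math.ceil(n_elements / lanes(register_bits, element_bits))


N = 1024  # hypothetical array of single-precision floats
for label, bits in (("SSE", SSE_REGISTER_BITS), ("AVX", AVX_REGISTER_BITS)):
    print(f"{label}: {lanes(bits)} floats per register, "
          f"{vector_ops_needed(N, bits)} ops for {N} elements")
```
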

Eventually I’m guessing we’ll see the Larrabee and Nehalem lines of the x86 ISA merge; AVX is merely the first step in that direction.

Final Words

Another day, another chapter on Nehalem comes to an end. I’m back from 10 days in Texas and California, visiting the usual suspects and there’s much more to write about. We’re finally getting wind of X58 motherboards at well below $300, we have much more to talk about with overclocking and there’s still that issue of multi-tasking performance.

Nehalem may have launched, but our work is far from done this year. Stay tuned.