Original Link: https://www.anandtech.com/show/1751



Intel CPU Roadmap Update

We have a small update to the Intel desktop roadmap, and not much has really changed. Everything from our last update remains the same, and it's basically business as usual. So what's new? We'll start off with the most interesting area in our view, the dual core units. As usual, we'll highlight the updates and additions.

Intel Desktop Performance Roadmap
Processor        | Core Name | Clock Speed (GHz), L2 Cache | Socket  | Launch Date
???              | Conroe    | ???                         | ???     | 2H'06
Pentium D >= 950 | Presler   | ???                         | LGA 775 | Q2'06
Pentium D 950    | Presler   | 3.4, 2x2MB                  | LGA 775 | Q1'06
Pentium D 940    | Presler   | 3.2, 2x2MB                  | LGA 775 | Q1'06
Pentium D 930    | Presler   | 3.0, 2x2MB                  | LGA 775 | Q1'06
Pentium D 920    | Presler   | 2.8, 2x2MB                  | LGA 775 | Q1'06

We already covered the arrival of the Presler Pentium D cores last month (and Smithfield has been available for a few months). The chips will be dual core 65nm parts with EM64T, VT, EIST, and XD. If you're not familiar with those acronyms, here's the recap:

  • EM64T adds 64-bit support and is the Intel equivalent of AMD64.
  • XD provides some protection against buffer overflow attacks, again matching up to AMD's NX (No-eXecute) technology.
  • VT stands for Virtualization Technology and provides hardware level support for running multiple OSes concurrently on a single computer.

As we mentioned in our recent AMD roadmap update, it was only possible to run multiple OSes concurrently in the past through third party tools such as VMware, and the hardware support should increase performance quite a bit. As with the other technologies mentioned, VT has an AMD counterpart, dubbed Pacifica. The remaining technology warrants further explanation.

EIST stands for Enhanced Intel SpeedStep Technology, which allows the processors to throttle down to lower clock speeds and voltages when idle and thus conserve power. The version of EIST in the Presler core should be superior to that of the Smithfield core, as it will also be available on the 2.8 GHz model. Current EIST on Pentium and Pentium D chips reduces the clock speed to 2.8 GHz, making it a useless feature for a chip that runs at 2.8 GHz by default. We don't have any specific details on the new EIST, but we hope that it will offer more benefits than a static clock speed and voltage reduction. Ideally, we'd like to see something like AMD's Cool'n'Quiet, where all lower CPU multipliers are unlocked - that's what Intel has in their Pentium M chips as well. Overclockers in particular like to have such control; however, Intel may or may not offer that degree of tuning.
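To put rough numbers on why a fixed 2.8 GHz floor does nothing for a 2.8 GHz chip, here is a minimal sketch using the standard approximation that dynamic power scales with capacitance times voltage squared times frequency. The function name and both voltages are our own illustrative assumptions, not Intel specifications, and the model ignores leakage power entirely.

```python
def relative_dynamic_power(freq_ghz, volts, base_freq_ghz, base_volts):
    """Dynamic power relative to a baseline, using P ~ C * V^2 * f."""
    return (volts / base_volts) ** 2 * (freq_ghz / base_freq_ghz)

# Hypothetical operating points: the voltages are illustrative guesses,
# not values taken from any Intel datasheet.
throttled_34 = relative_dynamic_power(2.8, 1.20, 3.4, 1.40)
throttled_28 = relative_dynamic_power(2.8, 1.40, 2.8, 1.40)

print(f"3.4 GHz part throttled to 2.8 GHz: {throttled_34:.0%} of full dynamic power")
print(f"2.8 GHz part 'throttled' to 2.8 GHz: {throttled_28:.0%} of full dynamic power")
```

Under these assumed voltages, the 3.4 GHz part drops to about 60% of its full dynamic power, while the 2.8 GHz part has nowhere to go and stays at 100%.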

We have one new entry for a potentially faster Presler model: 960 running at 3.6 GHz is the most probable candidate, although whether or not Intel decides to release such a chip will depend on a variety of factors. The more interesting addition is Conroe, which will use Intel's next generation architecture. Details on what Conroe will bring to the table are scarce, but we would imagine that all the previously mentioned technologies will be present. The major change is that Conroe will not use the NetBurst architecture that has been used in the Pentium 4 (and derivatives) line.

For those that don't follow processors closely, here's a brief explanation on why this decision was made. The long pipeline of NetBurst has become a liability with clock speeds beyond 4 GHz producing a lot of heat. Increasing clock speeds have always created more heat, but now we're hitting the point where they begin to scale out of control. Rather than trying to find ways of dealing with 150W power levels (or perhaps even higher), Intel has designed a new architecture "from the ground up." Of course, they're not really starting over, as they'll be using elements of all of their previous designs, but Conroe will be enough of a change that it will have a new name.



In order to have any inkling of what Conroe will offer, we need to take a step back for a minute. The last truly new architecture that Intel introduced was the IA-64/EPIC platform for Itanium (although depending on how you look at it, some would say that NetBurst actually came after IA-64). Prior to that, Intel had the P6 architecture, which was preceded by P5 (Pentium), 486, 386, etc. all the way back to the first parts Intel made. At present there are three major architectures that are all in production at Intel: P6 (Pentium Pro/II/III now evolved to Pentium M), NetBurst (Pentium 4 and derivatives), and IA-64/EPIC used in Itanium processors. P6 isn't actually the real name of the architecture for Pentium M, of course - Intel has never come forth with an official name. While Pentium M does use an extension of P6, the Banias and Dothan cores really change things quite a bit. We'll talk about how in a moment, but we'll refer to the architecture as P6-M for the remainder of this article. When we say P6-M, we mean Banias, Dothan, and Yonah. Let's take a quick look at the benefits and problems of each architecture before we talk about Conroe.

Prescott is used on the recent Pentium and Celeron chips and has a 31 stage pipeline, coupled to a separate 8 stage fetch/decode front end. (Earlier Northwood and Willamette cores use a 20 stage pipeline with the 8 stage front end.) Together the total pipeline length comes in at 39 stages - over twice the length of the current AMD K8 pipeline. In fact, the next longest pipelines outside of Intel aren't even out yet: Cell and Xenon are both around 21 stages long. The benefit of a long pipeline is raw clock speed. It's no surprise that NetBurst is the only architecture currently shipping at speeds greater than 3 GHz, and Cell and Xenon are slated to join that "elite" group of processors in the future.

While a lengthy pipeline allows for high clock speeds, it also introduces inefficiencies in cases where a branch prediction misses. When that occurs, everything following the branch instruction has to be cleared from the CPU pipeline and execution begins again - a penalty of as much as 30 cycles in the case of Prescott. (Of course, it could be even longer if there's a cache miss and main memory needs to be accessed, but that delay would occur with or without the branch miss so we'll ignore it.) In order to avoid the full penalty of a branch misprediction (39 cycles), Intel decoupled the fetch/decode unit from the main pipeline and turned the L1 cache into a "trace cache" where instructions are stored in decoded form. The trace cache is actually a very interesting concept and certainly helped improve performance. It basically allows many instructions to skip 1/4 to 1/3 of the standard pipeline. While Intel no longer holds the performance crown, it wasn't until the launch of the K8 that Intel really lost the lead.
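As a rough illustration of how much a pipeline flush costs, here is a minimal sketch of the textbook approximation that average cycles per instruction equals the ideal figure plus the branch fraction times the misprediction rate times the flush penalty. The function name and the workload numbers (20% branches, 5% mispredicted, an ideal of three instructions per clock) are hypothetical; only the 30 and 39 cycle penalties come from the discussion above.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Average cycles per instruction once branch-miss flushes are included."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles

# Hypothetical workload: these three values are assumptions for illustration.
BASE_CPI = 1 / 3          # ideal case: three instructions retired per clock
BRANCH_FRACTION = 0.20    # one instruction in five is a branch
MISPREDICT_RATE = 0.05    # one branch in twenty is mispredicted

for label, penalty in [("main pipeline flush (30 cycles)", 30),
                       ("full flush incl. front end (39 cycles)", 39)]:
    cpi = effective_cpi(BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE, penalty)
    print(f"{label}: CPI = {cpi:.3f}, IPC = {1 / cpi:.2f}")
```

With these assumed numbers, the 30 cycle penalty works out to roughly 1.58 instructions per clock and the 39 cycle penalty to roughly 1.38, which gives a feel for why skipping the front end on a miss is worth the effort.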

In terms of internal functioning of the NetBurst pipeline, each clock cycle at most three traces (instructions decoded into micro-ops) can be issued from the trace cache to the queues within the main pipeline. The NetBurst queues (schedulers) can then dispatch up to six micro-ops per cycle, but there are restrictions and in many cases there are execution slots that can't be filled on any given cycle. Based on the number of traces issued per clock, most would call NetBurst a three-wide issue design. That makes NetBurst the same as AMD's K7/K8 cores as well as the P6/P6-M cores in terms of issue rate. Purely from a theoretical standpoint, NetBurst could execute 3 instructions per clock, multiplied by the clock speed to give the final performance. Nothing ever reaches the theoretical performance of course - if it did, then NetBurst would still be over 35% faster than any other architecture, given its high clock speeds. Branch misses, cache misses, instruction dependencies, etc. all serve to reduce the theoretical performance offered.
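The "over 35%" figure is easy to reproduce as back-of-the-envelope arithmetic. The sketch below multiplies issue width by clock speed for NetBurst at 3.8 GHz (the top clock listed in this roadmap) and for a three-wide K8. The 2.8 GHz K8 clock is our assumption for the fastest competing part, and the function name is purely illustrative.

```python
def peak_instructions_per_ns(issue_width, clock_ghz):
    """Theoretical peak: instructions issued per clock times clocks per nanosecond."""
    return issue_width * clock_ghz

netburst_peak = peak_instructions_per_ns(3, 3.8)  # 3.8 GHz from the roadmap
k8_peak = peak_instructions_per_ns(3, 2.8)        # assumed top K8 clock speed

advantage = netburst_peak / k8_peak - 1
print(f"NetBurst peak: {netburst_peak:.1f} instructions/ns")
print(f"K8 peak:       {k8_peak:.1f} instructions/ns")
print(f"Theoretical NetBurst advantage: {advantage:.1%}")
```

Since both designs are three-wide, the issue width cancels out and the theoretical gap is simply the clock speed ratio, about 35.7% under these assumptions.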

Moving on to the Pentium M core, you can find out some of the details of what was changed in our Dothan investigation from last year. The basic idea is to take the P6 core and add some of the latest technologies to the design. To recap the earlier article, the Pentium M has several major design features. First, it goes with a more moderate pipeline length: longer than P6 to allow higher clock speeds, but shorter than NetBurst. (Intel isn't saying more than that, though guesstimates would put the length around 14 to 17 stages.) Next, Intel added micro-ops fusion to the core, which helps some instructions move through the core faster and avoids delays associated with out-of-order cores. Micro-ops fusion in essence eliminates dependency problems on certain instructions, since they are "fused" together. The core also has a dedicated stack manager that helps improve memory access efficiency as well as lower power use. Better branch prediction is another major improvement relative to the P6 design - take something like the branch prediction of NetBurst and put it on the P6 core and that's a rough description of what was done. Branch prediction is one of the features of an architecture that generally makes all code run a bit faster, and it once again reduces inefficiencies. The number of execution units remains the same as in P6, which means there's less wasted power on idle parts of the chip, while the faster system bus of NetBurst helps to keep the processor fed with data. Finally, power saving features were added to the cache, allowing the CPU to only fully power up small areas of the L2 cache for each cache access. The end result is a processor that has certain limitations but ends up achieving a very high performance per Watt rating, which is important for a mobile part. As we've shown in several articles, Pentium M makes for an attractive laptop processor but still can't compete with desktop parts in certain tasks.

Moving on to the final architecture, we come to IA-64/EPIC. While similar in some ways to VLIW (Very Long Instruction Word) architectures of the past, Intel worked to overcome some of the problems with VLIW (specifically the need to recompile code for every processor update) and called their new approach EPIC: "Explicitly Parallel Instruction Computing". In contrast to the P6, NetBurst, K7, and K8 architectures that can issue up to three instructions per cycle, the current Itanium 2 chips can issue six instructions per clock. From a purely theoretical standpoint, the fastest Itanium 2 running at 1.6 GHz actually has more computational power than any other Intel chip. Throw in dual core designs with HyperThreading - HyperThreading that actually works much better than NetBurst HTT due to the wide design of EPIC - and each chip not only has the potential to issue six instructions per clock, but it should actually come relatively close to that number. Another difference between Itanium and the other designs is that large amounts of cache are present in order to keep the pipelines fed with data. Current models ship with up to 9MB of L3 cache, while future parts like the Montecito will have 24MB of L3 cache (and a count of 1.7 billion transistors - about eight times that of the Pentium D Smithfield core)!

Of course, with the wide issue rate of Itanium 2 (the original Itanium had a 6-wide core as well, but could generally only get 3.5 to 4.0 IPC at best), you need a lot of execution units. NetBurst has 7 execution units in Prescott: two simple integer units (which can function as 4 integer units if you count the double pumped design), a complex integer unit, two FP/SIMD units, and dedicated memory load and store units. If you want to count the simple integer units as 2 each, you could make a stretch and say NetBurst has nine execution units. AMD's K7 and K8 both have nine execution units as well, only they go for a less customized approach and instead have three each of the integer, FP/SIMD, and memory units. Each of AMD's units is fully functional, unlike the "simple" and "complex" integer units in NetBurst. In contrast to these architectures, the current Itanium 2 chips have six ALUs (Arithmetic Logic Units), three BRUs (Branch Units), two FPUs, one SIMD, two load units, and two store units - call it 16 functional units if you prefer, though the specialization of some of them makes it slightly less than that. While Itanium 2 is very wide, the length of the pipeline is only 8 stages - significantly shorter than that of any modern x86 processor. That certainly plays a role in the reduced clock speeds, but like Athlon 64, lower clock speeds with a more efficient architecture can outperform long pipelines in many instances. In order to extract all of the potential performance from Itanium, however, a lot of work needs to be done during code compilation. This is the Achilles' heel of VLIW designs: processor updates require the code to be recompiled. While EPIC doesn't require that you recompile the code, newer compiler optimizations can improve performance significantly.
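For readers who want to check the unit counts, the tallies above can be reproduced with a few lines of arithmetic. This is only a bookkeeping sketch; the dictionary layout and labels are ours, and the counts are taken directly from the paragraph above.

```python
# Execution unit counts as described in the text; labels are illustrative.
execution_units = {
    "NetBurst (Prescott)": {
        "simple integer": 2, "complex integer": 1, "FP/SIMD": 2,
        "load": 1, "store": 1,
    },
    "K7/K8": {"integer": 3, "FP/SIMD": 3, "memory": 3},
    "Itanium 2": {
        "ALU": 6, "branch": 3, "FPU": 2, "SIMD": 1, "load": 2, "store": 2,
    },
}

totals = {name: sum(units.values()) for name, units in execution_units.items()}

# Counting each double pumped simple integer unit twice adds one extra
# unit per simple integer unit to the NetBurst total.
netburst_units = execution_units["NetBurst (Prescott)"]
netburst_stretch = totals["NetBurst (Prescott)"] + netburst_units["simple integer"]

for name, total in totals.items():
    print(f"{name}: {total} units")
print(f"NetBurst with double pumped units counted twice: {netburst_stretch} units")
```

The totals come out to 7 for NetBurst (9 with the double pumped stretch), 9 for K7/K8, and 16 for Itanium 2.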

All that talk about other Intel architectures (as well as some of AMD's), and yet we still haven't said exactly what Conroe is. The simple truth is that no one other than Intel and people under strict NDA really know for sure what the Conroe architecture will entail. There is a point to all of this discussion of previous architectures, though. While we've really only skimmed the surface of the designs, hopefully you can see how wildly different each architecture is from the others. NetBurst is long and narrow, EPIC is short and wide, and P6-M is a medium length pipeline that is narrower than either of the others but requires less power. The high clock speeds and resultant power levels have created problems for NetBurst, but there are still cases where it substantially outperforms P6-M. Itanium is still a better solution for certain types of big business work (databases in particular) than any of the other Intel architectures. While all three architectures have their strong points, none of them qualify as a universally superior solution. With Intel having fallen behind AMD's performance in many areas, we seriously doubt that Intel wants to create a design that merely aims at being "faster than AMD in most areas." Whether or not they can succeed in building something universally superior is of course a question for the future.

If we don our speculation hats for a minute, we'd say that Conroe will return to more typical pipeline lengths and also reduce the maximum clock speed of the processors based on it relative to NetBurst. A 20 stage pipeline design, give or take, seems to be reasonable - we heard a few people at WinHEC suggest that NetBurst was hubris in terms of pipeline lengths, and that 20 or fewer stages is where all foreseeable pipelines - Intel and otherwise - are heading. The concept of a trace cache also seems to have merit, so some variant of that concept could show up in Conroe - micro-ops fusion plus a trace cache larger than that of NetBurst sounds interesting to us at least, though we're not at all sure it's feasible. Along with the shorter, more efficient pipeline, Conroe could also go to a wider issue rate. Some people have argued (rather convincingly) that x86 code is not conducive to issuing more than 3 instructions per clock without expending significant die resources, however, and current designs rarely manage to issue three instructions per clock anyway. A better solution could be to simply add more execution units, better branch prediction, prefetch logic, etc. to ensure that the core can actually reach the maximum issue rate more frequently. Taking something like Pentium M and adding more FP/SIMD computational power isn't too much of a stretch (though that seems to be where Yonah is already heading).

Whether any of these ideas make the final design of Conroe is basically just an educated guess. The main point right now is that new architectures from Intel are not a frequent occurrence, so we expect it to be substantially different from P6/P6-M, NetBurst, and EPIC. Depending on how much collaboration there is between the various CPU design teams, we could see many elements of all three architectures or we could see a design largely derived from one or two of the others. If you consider that the Northwood to Prescott changes were pretty significant and Intel still didn't dub Prescott a new architecture, Conroe (and derivatives) ought to be a more significant redesign than going from 20 to 31 pipeline stages, adding EM64T, and changing cache sizes. Chances are that by the time we know more, we'll be under NDA as well until the official launch, so consider this our last chance at some enthusiast speculation.



Besides all the talk about the future architecture, there's not much to say about the desktop parts. Here's the current roadmap.

Intel Desktop Performance Roadmap
Processor      | Core Name          | Clock Speed (GHz), L2 Cache | Socket  | Launch Date
Pentium 673    | Cedar Mill         | 3.8, 2MB                    | LGA 775 | 2H'06
Pentium 672    | Prescott 2M + VT   | 3.8, 2MB                    | LGA 775 | Q4'05
Pentium 663    | Cedar Mill         | 3.6, 2MB                    | LGA 775 | Q1'06
Pentium 662    | Prescott 2M + VT   | 3.6, 2MB                    | LGA 775 | Q4'05
Pentium 653    | Cedar Mill         | 3.4, 2MB                    | LGA 775 | Q1'06
Pentium 643    | Cedar Mill         | 3.2, 2MB                    | LGA 775 | Q1'06
Pentium >= 633 | Cedar Mill         | 3.0, 2MB                    | LGA 775 | Q2'06
Pentium 631    | Cedar Mill (no VT) | 3.0, 2MB                    | LGA 775 | Q1'06
Pentium 571    | Prescott           | 3.8, 1MB                    | LGA 775 | Now
Pentium 561    | Prescott           | 3.6, 1MB                    | LGA 775 | Now
Pentium 551    | Prescott           | 3.4, 1MB                    | LGA 775 | Now
Pentium 541    | Prescott           | 3.2, 1MB                    | LGA 775 | Now
Pentium 531    | Prescott           | 3.0, 1MB                    | LGA 775 | Now
Pentium 521    | Prescott           | 2.8, 1MB                    | LGA 775 | Now

The single core Pentiums remain unchanged from last month, with the exception of the 673 showing up at the top. Processor models ending with a 3 will use the new Cedar Mill core, the single core version of Presler. They will be based on 65nm process technology and will include all the same extra technologies we mentioned earlier. They will also have HyperThreading enabled, whereas the dual core Presler chips will not. There is also a potential lower end 633 model scheduled to be introduced in Q2'06, though it may or may not be released, likely depending on demand and yields.

One final update for the mainstream desktop market is that the EM64T enabled 5x1 Pentium chips are finally available. We've been talking about them for a few months, and retail availability was expected before now. They were probably held back to let inventory of the earlier versions clear out. You can see the new chips in our Pricing Engine.

Intel Desktop Value Roadmap
Processor     | Core Name               | Clock Speed (GHz), L2 Cache | Socket  | Launch Date
Celeron D ??? | Cedar Mill 512K + EM64T | ???                         | LGA 775 | 2H'06
Celeron D 355 | Prescott 256K + EM64T   | 3.33, 256K                  | LGA 775 | Q4'05
Celeron D 351 | Prescott 256K + EM64T   | 3.2, 256K                   | LGA 775 | Now
Celeron D 346 | Prescott 256K + EM64T   | 3.06, 256K                  | LGA 775 | Now/Soon
Celeron D 341 | Prescott 256K + EM64T   | 2.93, 256K                  | LGA 775 | Now/Soon
Celeron D 336 | Prescott 256K + EM64T   | 2.8, 256K                   | LGA 775 | Now/Soon
Celeron D 331 | Prescott 256K + EM64T   | 2.66, 256K                  | LGA 775 | Now/Soon
Celeron D 326 | Prescott 256K + EM64T   | 2.53, 256K                  | LGA 775 | Now/Soon

The Celeron picture is similar to the single core Pentium market. The EM64T enabled Celeron D chips are all starting to ship, after a month or two of waiting. Once again, you can check the current prices and availability on our Pricing Engine - at present the 351 is available, but we aren't picking up any of the slower "+1" parts. Current generation Celerons do not have VT, HT, or EIST support, but they do include XD (as have all Celeron D chips since the "J" variants started shipping almost a year ago).

Once Intel transitions to 65nm, a new version of the Celeron based off the Cedar Mill core will arrive. Clock speeds are not yet set, but we do know that it will continue to use a 533 MHz FSB, and it will increase the amount of L2 cache to 512K. That will make the chip relatively interesting, as the old Northwood core also included 512K of cache. Of course, the pipeline of Northwood was only 20 stages rather than 31, so clock for clock Northwood may still be faster. With a 65nm process, however, we expect the chips to be able to hit relatively high clock speeds, and architectural tweaks may help them to surpass Northwood performance. You're still looking at relatively equivalent performance for around $100 a CPU, which compares favorably to the higher end Northwood cores of the past.

If you're in the market for a value system and you feel the need to purchase a 64-bit processor, we'd recommend that Intel buyers get a motherboard that uses the 945P or 945G chipsets. If you don't feel the support for dual core processors is important, we'd still recommend 915G/P as the next best alternative. We would stay away from the 915GV/GL/PL chipsets as they have limitations that make them unattractive, not to mention the fact that those chipsets are almost EOL by now too. 915GV is like 915G, but no external graphics port is provided. 915GL has the same problem, and it also supports only DDR memory instead of DDR or DDR2. Finally, 915PL supports an external X16 PCIe slot but eliminates DDR2 support and allows the use of only 1 DIMM per memory channel, limiting you to a maximum of 2GB RAM. The bottom line is that the P or the G versions are what most mainstream users will desire. There's also the 955X chipset on the high end, which allows the use of up to 8GB of RAM. Those who really need the increased address space of a 64-bit OS will likely want the option to use more than 4GB of RAM as well.
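The 4GB figure falls straight out of the address width. As a quick sketch, the helper names below are our own, and the calculation assumes a flat address space with one byte per address (it ignores workarounds such as PAE as well as any address ranges reserved for devices).

```python
GIB = 2 ** 30  # bytes in one gibibyte

def addressable_gib(address_bits):
    """Memory reachable with a flat address space of the given width."""
    return 2 ** address_bits / GIB

def address_bits_needed(ram_gib):
    """Smallest flat address width that can reach every byte of ram_gib."""
    return (ram_gib * GIB - 1).bit_length()

print(f"32-bit flat address space: {addressable_gib(32):.0f} GB")
for ram in (2, 4, 8):
    print(f"{ram} GB of RAM needs at least {address_bits_needed(ram)} address bits")
```

Under these assumptions, the 2GB ceiling of the 915PL fits comfortably within 32 bits, while the 8GB supported by the 955X needs 33 address bits, which is where a 64-bit OS starts to matter.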

Final Thoughts

Most of the major releases of 2005 have now occurred, and other than a speed bump or two, there's not much else awaiting Intel owners (at least in the desktop market) this year. 2006 has quite a bit more in store. We haven't covered the mobile section, but many expect to see the dual core Pentium M Yonah available for the desktop as well as mobile markets. Beyond that, we're all waiting for Intel's answer to the performance conundrums of the past year, where they officially lost the performance crown to AMD and haven't been able to regain it. The change to 65nm processes will likely have some advantages, but the real counterattack is going to come in the form of the next Intel architecture. We've given some speculation as to what may be present, but most of the real details are still closely guarded secrets.

For nearly 25 years Intel was the leader in PC processor technology and performance, and we have a suspicion that they will be pulling out all the stops to regain that lead in 2006. The fact that they still lead in profits and market share gives them a lot of resources they can apply to that goal, and we eagerly await additional details of the new architectures. AMD will naturally have their own new processors and architectures, but we know even less about K9 than we do about Conroe. Waiting is the hardest part, unfortunately.
