
Intel Microarchitecture 2020?

(2016 Dec)
There is an article on WCCFtech reporting that a new Intel processor architecture, to succeed the Lake processors (Sky, Cannon, Ice and Tiger), will be "faster and leaner" and, more interestingly, might not be entirely compatible with older software. I suppose it is curious that the Lake processors form a double tick-tock, or now process-architecture-optimization (PAO), cycle, but skipped Kaby and Cannon. Both the Bridge processors (Sandy and Ivy) and the Well processors (Has and Broad) each had only one tick-tock pair.

Naturally, I cannot resist commenting on this. About time!

For perspective, in the really old days, processor architecture and instruction set architecture (ISA) were much the same thing. The processor implemented the instruction set, so that was the architecture. I am excluding the virtual-architecture concept, in which a lower-cost version would not implement the complete instruction set in hardware.

The Intel Pentium Pro was a significant step away from this, with micro-architecture and instruction set architecture now largely different topics. The Pentium Pro had its own internal instructions, called micro-operations. The processor dynamically decodes X86 instructions to the "native" micro-operations. This was one of the main concepts that allowed Intel to borrow many of the important technologies from RISC.
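As a rough sketch of the idea, here is how a CISC instruction with a memory operand might crack into RISC-like micro-operations. The instruction and µop names are purely illustrative, not Intel's actual encodings:

```python
# Hypothetical sketch: cracking one X86-style instruction into µops.
# Names and formats are illustrative only.

def decode(instr):
    """Crack one (op, dst, src) instruction into a list of µops."""
    op, dst, src = instr
    uops = []
    if src.startswith("["):                  # memory source operand
        uops.append(("LOAD", "tmp0", src))   # separate load µop
        src = "tmp0"
    uops.append((op.upper(), dst, src))      # the ALU µop itself
    return uops

# 'add eax, [0x1000]' becomes a load µop followed by an ALU µop
print(decode(("add", "eax", "[0x1000]")))
# [('LOAD', 'tmp0', '[0x1000]'), ('ADD', 'eax', 'tmp0')]
```

The point is only that the back end sees simple, fixed-form operations, which is what lets the out-of-order machinery borrowed from RISC work on X86 code.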

The Pentium 4 processor, codename Willamette, had a trace cache, which was a cache for decoded instructions. This may not have been in the Core 2 architecture that followed Pentium 4.

My recollection is that the Pentium Pro had 36 physical registers, of which only 8 were visible to the X86 ISA. The processor would rename the ISA registers as necessary to support out-of-order execution. Pentium 4 increased this to 128 registers.
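A minimal sketch of the renaming idea, assuming the 8-architectural/36-physical split recalled above. This is simplified to a rename table plus a free list; real hardware also reclaims physical registers at retirement, which is not shown:

```python
# Sketch of register renaming: 8 architectural, 36 physical registers
# (the split recalled for the Pentium Pro; simplified, no reclamation).
from collections import deque

class Renamer:
    def __init__(self, n_arch=8, n_phys=36):
        # rename (alias) table: architectural -> physical
        self.alias = {f"r{i}": f"p{i}" for i in range(n_arch)}
        # free list of unassigned physical registers
        self.free = deque(f"p{i}" for i in range(n_arch, n_phys))

    def rename(self, dst, srcs):
        """Map source registers, then give dst a fresh physical register."""
        phys_srcs = [self.alias[s] for s in srcs]
        self.alias[dst] = self.free.popleft()   # breaks WAW/WAR hazards
        return self.alias[dst], phys_srcs

r = Renamer()
# Two writes to r1 get distinct physical registers, so the second write
# need not wait on readers of the first (no false dependency).
print(r.rename("r1", ["r2", "r3"]))   # ('p8', ['p2', 'p3'])
print(r.rename("r1", ["r1"]))         # ('p9', ['p8'])
```

Giving each write its own physical register is what frees the scheduler to reorder instructions that only appear to conflict through the small architectural register set.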

Also see MIT 6.838 and NJIT rlopes

The Nehalem micro-architecture diagrams do not mention a µop cache (the acronym, somehow, is DSB, for Decoded Stream Buffer), but Sandy Bridge and subsequent processors do. This is curious because both Willamette and Nehalem are Oregon designs, while Core 2 and Sandy Bridge are Haifa designs.

The other stream that comes into this topic involves the Intel Itanium adventure. The original plan for Itanium was to have a hardware (silicon) X86 unit. Naturally, this would not be comparable to the then-contemporary X86 processors, which for Merced would have been the Pentium III, codename Coppermine, at 900MHz. So by implication, X86 execution would probably be comparable to something several years old, a Pentium II 266MHz with luck, and Itanium was not lucky.

By the time of Itanium 2, the sophistication of software CPU emulation was sufficiently advanced that the hardware X86 unit was discarded. In its place was the IA-32 Execution Layer. Also see the IEEE Micro paper on this topic. My recollection was that the Execution Layer's emulation was not great, but not bad either.

The two relevant technologies are: one, the processor having native µops instead of the visible X86 instructions, and two, the Execution Layer for non-native code. With both of these in hand, why is the compiler still generating X86 binaries (ok, Intel wants to call these IA-32 and Intel 64 instructions)?

Why not make the native processor µops visible to the compiler? When the processor detects a binary with native micro-instructions, it could bypass the decoder. Also make the full set of physical registers visible to the compiler. If Hyper-Threading is enabled, then the compiler should know to use only the appropriate fraction of the registers.

Have one or two generations of overlap, during which Microsoft and the Linux players build a native micro-op operating system. Then ditch the hardware decoders for X86. Any old code would then run on the Execution Layer, which may not be 100% compatible. But we need a clean break from old baggage, or it will sink us.

Off topic, but who thinks legacy baggage is sinking the Windows operating system?


Of course, I still think that one major issue is that Intel is stretching their main-line processor core over too broad a spectrum. The same core is used in both high-performance and high-efficiency roles. For high performance, it is capable of well over 4GHz, probably limited more by power than by transistor switching speed. For power efficiency, the core is throttled to 2GHz or even 1GHz.

If Intel wants to do this in a mobile processor, it is probably not that big a deal. However, in the big server chips, with 24 cores in Xeon v4 and possibly 32 cores in the next generation (v5), it becomes a significant matter.

The theory is that if a given core is designed to operate at a certain performance level, then doubling the logic should achieve about a 40% increase in performance. So if Intel is deliberately de-rating the core in the Xeon HCC die, then they could build a different core designed specifically for half the original performance at perhaps one quarter the complexity.

So it should be possible to have 100 cores, each with half the performance of the 4GHz-capable Broadwell core, i.e., equivalent to Broadwell at 2GHz? If this supposed mini-core were very power efficient, then perhaps the thermal envelope could even accommodate 100 of them?
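As a sanity check on the arithmetic above, assuming performance scales roughly as the square root of logic complexity (often called Pollack's rule), a sketch:

```python
# Assume performance ∝ sqrt(complexity), i.e., complexity ∝ performance².
big_perf, big_area = 1.0, 1.0            # the 4GHz-capable core, normalized

# A core built for half the performance needs (1/2)² = 1/4 the logic.
mini_perf = 0.5
mini_area = mini_perf ** 2               # 0.25

# In the area of a 24-core HCC die, how many mini-cores fit?
n_mini = int(24 * big_area / mini_area)  # 96, roughly the 100 suggested
aggregate = n_mini * mini_perf           # 48.0, vs 24.0 for the big cores
print(n_mini, aggregate)
```

So under this assumption, the same die area yields roughly double the aggregate throughput, provided the workload parallelizes across that many cores.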

Of course, not every application is suitable for wide parallelism. I would like to see Intel do a processor with mixed cores. Perhaps 2 or 4 high-performance cores and 80 or so mini-cores?

Cornell ECE 4750 Computer Architecture