Parent: Processor Architectures

Intel and AMD Historical, Pentium 4, AMD Opteron, Dual Core, Pentium M to Core 2, Nehalem, Sandy Bridge to Haswell, Hyper-Threading, SIMD Extensions

This section now incorporates all Hyper-Threading material. Also see Hyper-Thread Performance.

Intel Hyper-Threading

Hyper-Threading from NetBurst to Nehalem (Mar 2010?)

There seems to be significant confusion about Hyper-Threading. Part of the problem is that vendors like to tout every new feature as the greatest invention since the six-pack, and its follow-on the 12-pack. I used to think the 4-pack was a travesty. But now I am older and can no longer finish a 12-pack with each meal. Suddenly the 4-pack is not such a travesty.

But I digress. I do applaud innovation. And I do accept that the first generation is never perfect, or close to problem-free for that matter. That's why it is called the bleeding edge. If you want to play in this ballpark, you had better know how to do a proper investigation. Given that Intel is a company with immense resources, they could have put out a detailed document on when to use HT and when not to. Instead, like every other company, they touted the positives and hid the negatives.

The Intel Pentium 4 (NetBurst) architecture processors, up through the Xeon 50x0 and 7100 (2001 to 2006), were the first generation of x86 processors with HT. Technically, the 180nm Willamette/Foster and 130nm Northwood/Gallatin processors were the first HT generation, and the 90nm Prescott/Potomac and 65nm Cedar Mill/Tulsa processors were generation 1.5: improvements, but without enough time for second-generation changes.

The NetBurst architecture HT could benefit a very limited range of SQL operations (high network round-trip volume, backup with compression). HT was neutral in many operations, and negative in some, particularly parallel execution plans. For most people, it was better to disable HT, or to lock SQL Server to the lower half of the logical processors (one logical processor per physical core; Windows OS processor enumeration has since changed).

The subsequent generation of Intel Core 2 architecture processors, Xeon 5100-5400 (2006 to 2009) and Xeon 7200-7400 (2007-2010), was designed by a completely different team and did not have HT, so this is a moot point.

Now Hyper-Threading has returned with the Intel Xeon 5500 (Nehalem architecture, introduced 2009) and 5600 (Westmere), and soon the 7500 (Nehalem-EX). BTW, Nehalem was designed by the former Willamette/Prescott team. Every indication is that most of the HT issues of NetBurst have been fixed. So I would definitely not recommend disabling HT in BIOS. I generally recommend limiting MAXDOP to 4, but for certain tested queries, explicitly setting a higher MAXDOP as appropriate.
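As a sketch of that recommendation in T-SQL (the table name below is hypothetical, and any such setting should be tested against your own workload first):

```sql
-- Server-wide default: cap parallel execution plans at 4 schedulers
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 4;
RECONFIGURE;

-- Per-query override for a specific, tested query that benefits from
-- higher parallelism (dbo.BigFactTable is a hypothetical example)
SELECT COUNT(*)
FROM dbo.BigFactTable
OPTION (MAXDOP 8);
```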

AMD Opteron to date has never had HT.

The Intel Itanium processors with dual-core have HT, and I believe this is also a good HT implementation, as there was sufficient time lapse between the first generation of HT to correct any critical issues.

More about Hyper-Threading

The original intent of HT in Willamette (Pentium 4, NetBurst) was to make better use of the 3 superscalar execution units. The Pentium Pro expanded the 2 execution units of the Pentium to 3 units. The x86/IA-32 instruction set architecture knows nothing about the microprocessor having superscalar execution units. The microprocessor must figure out on the fly whether multiple instructions can be issued simultaneously. So while superscalar architecture did improve microprocessor performance, there was a rapid fall-off in benefit and a rapid increase in complexity in adding more superscalar execution units. I recall that the average number of instructions completed per CPU cycle in the 3-unit design was about 1, as there were many dead cycles.

(This is also why the Intel/HP joint effort went with Explicitly Parallel Instruction Computing (EPIC) for Itanium. If the compiler can figure out which instructions can be executed in parallel, then the processor does not have to expend silicon/transistors on this.)

So the idea was to have an extra program counter and register set to run two threads simultaneously, given that the average utilization of the execution units was low. I recall there being some discussion that without a priority thread instruction, there would be problems with transactional database engine style code.

Simultaneous versus Time-slice Multi-Threading

In both Itanium and Nehalem, it is my understanding that Intel has backed away from simultaneous multi-threading toward time-slice multi-threading. The theory here is that in modern microprocessors, the clock cycle is so short relative to memory access, and even L3 cache access, that there will be many dead cycles every time there is an L3 or memory access. So now HT switches in the alternate thread any time a thread encounters dead cycles.

In the NetBurst HT, I did an extensive investigation and could find no SQL operation that benefited from HT, with some negatives. The only database application that did benefit was network round-trip intensive operations, like SAP, that fetch a single row at a time. The gain here was about 15%. When I did work on Quest LiteSpeed, I found that the compression engine could get an astounding 40-50% performance gain with HT. So the theory of HT was sound, and whatever in the SQL engine has problems with HT does not occur in a simple multi-threaded compression engine.

In summary, HT with the current generation Xeon, and even Itanium, is too good to blindly disable in BIOS, based on tribal knowledge that really applied to the older generation NetBurst processors. Too many "best practices" are written by people who do not actually investigate the underlying reason for a "rule", they just want a list of rules to live by.

Hyper-Threading or HyperThreading

A Google search seems to prefer HyperThreading over Hyper-Threading. The official Intel term, and they have many marketing people who do nothing but determine the official term, seems to be Hyper-Threading, but many pages on the Intel web site have HyperThreading. Here is the Intel article on the original Willamette Pentium 4 (NetBurst) Hyper-Threading, and the Wikipedia entry on Hyper-Threading.

Below was moved from Nehalem or Sandy Bridge sections

Intel Hyper-Threading Recap: Simultaneous and Temporal
Several Intel processor architectures have Hyper-Threading, including the Pentium 4 (NetBurst), the more recent Itanium (Montecito forward) and even Atom. The original Pentium 4 implementation was simultaneous multi-threading. The Itanium and Nehalem (Atom?) implementations are time-slice or temporal multi-threading. (Wikipedia says Nehalem is SMT, not temporal?)

The general idea in microprocessor architecture has always been to make the best use of the available transistors. Long ago, the objective was single-threaded performance at the processor socket level (and multi-threaded performance at the system level). For the last several years, the objective has shifted to multi-threaded performance within a reasonable power envelope, with consideration for the fact that single-threaded performance is still important.

The ideal microprocessor design has all units uniformly running at maximum load continuously. Of course, the ideal cannot be achieved across a broad range of applications, except perhaps for brief intervals.

In a single-threaded architecture, the pattern under Moore's Law has been that each doubling of the transistor budget could increase performance by about 40%. In a multi-core design, the aggregate throughput could be linear with the number of cores, and hence with the transistor budget. One criterion is the power budget of the combined cores: in many multi-core designs, the clock frequency is restricted to below the top frequency capability of a single core. So multi-core processors can achieve better scaling than the historical single-core pattern if an application is effectively multi-threaded.

Early in the original Pentium 4 (Willamette) design phase, it was thought that simultaneous multi-threading (SMT), as it was called before Hyper-Threading, could contribute a 30% throughput performance gain in multi-threaded server applications at a cost of approximately 10% in transistor budget. As it turned out, unanticipated complications resulted in less gain in the key TPC-C benchmark, probably on the order of 10%.

It is possible or suspected that all of the Pentium 4 HT performance gain can be attributed to the network round-trip portion, with no gain in the core SQL Server engine. In addition, there was erratic behavior in other applications, probably due to neither the operating system nor the application being properly designed for an HT processor architecture. This is especially evident in parallel execution plans. The Prescott-based Pentium 4 processor did show a 40-50% performance gain on Quest LiteSpeed database backup compression, probably the highest reported HT performance gain. Hyper-Threading does have potential, but there are issues that need to be worked out.

For the Nehalem generation, an informal statement puts the HT performance gain in the range of 30% for high call volume workloads, and in the range of 10% for DW, but additional details are necessary before making conclusive statements. If the original Willamette die size impact were still the case, then a 30% gain in transaction processing for 10% die size is a good trade. However, the 10% gain in DW is only marginal, about the same as spending the die area on more non-HT cores.

In a test with a custom non-transactional (no locking code) index search engine, an astounding nearly 100% performance gain was observed with HT on a 2-way quad-core Nehalem system. The code was almost entirely a repetitive sequence of a string comparison followed by a memory fetch. If the comparison operation is shorter than a local memory fetch (50-60ns, or 150-180 CPU cycles), then a 100% gain would seem to be possible. Perhaps there are still unresolved HT contention points in the SQL Server engine.

With the Bulldozer architecture, AMD is asserting that the Fetch/Decode and FP units are under-utilized in previous Opteron architectures. Of course, Bulldozer is targeted at server applications, which tend to emphasize integer performance. Hence the Bulldozer design has two integer cores sharing the under-utilized units.