CorsairVengeance

Heat Sink for Server Memory

(2019-10)
In the previous article, Low Latency Memory in Servers (Sep-2019, also included below), I mistakenly attributed the performance of extreme memory for gaming systems largely to screening standard DRAM chips for parts able to meet better timing specifications. The Onur Mutlu lecture SAMOS Tutorial - Memory Systems - Part 4: Low-Latency Memory (video), with reference to Lee et al., states that lower operating temperature has a large effect. This accounts for most of the latency difference between standard and performance memory, though screening may still be a contributing element.

The server systems community does not show interest in low latency memory at higher cost. Historically, there were good reasons for this. Servers were typically multi-processor systems with a very large memory subsystem. The path from processor to memory is long, complex and possibly non-uniform. As a result, a very large part of the latency occurred outside of the DRAM chip. The benefit of faster DRAM would have been marginal, as the percentage change in overall round-trip latency was small. And this was compounded because the higher cost of a DRAM chip would have been multiplied over very many DIMMs, each with many chips.

For the last few years, it has been possible to have significant compute on a single die processor directly connected to sufficient memory that is capable of handling most workloads, as detailed in Multi-Processors Must Die (and here). Even with a single die processor connecting directly to memory, the majority of round-trip latency is still outside the DRAM chip. However, the relative impact of a 30% reduction in latency at the DRAM chip is now large enough to be worthwhile, netting perhaps 11% for software characterized by pointer-chasing code. The prominent example is database transaction processing, for which software licensing is on a per-core basis and outweighs hardware costs. Hence, an increase in performance per core has high value. Even if the cost of performance memory is 2-3 times higher than standard memory, the net gain is still positive.

The methods employed in performance memory for gaming systems can be used on ECC RDIMMs and LRDIMMs for servers. The question is whether there is sufficient space for a heatsink with the DIMM spacing on existing server motherboards. This appears to be true, although the spacing is less than that of desktop/gaming motherboards. With little effort and development cost, vendors can evaluate the viability of low latency memory in single processor server systems. The ultimate goal is to justify the effort necessary to produce a purpose-designed low latency DRAM for use as main memory. Demonstrating the impact possible on this front might also prod CPU manufacturers to substantially reduce memory latency sources outside of DRAM, specifically the 18-19.5ns L3 latency of Xeon SP processors (hint, Intel).

Performance Memory I

Performance memory products, all unbuffered non-ECC, from Corsair, G.Skill and others lead with a dazzling data transfer rate, which could be as high as 4,000MT/s. Compare this to 2,666MT/s on the Xeon SP and 2,933MT/s for gen 2. However, it is the latency timing specifications that matter for certain important server applications.

The Corsair Vengeance LPX 16GB DIMM (32GB kit), tested at 3,600MT/s and 16-18-18-36 latency, cites an SPD speed of 2,133MT/s, 15-15-15-36.

*JEDEC Standard No. 21-C Annex L: Serial Presence Detect (SPD) for DDR4 SDRAM Modules states: "The SPD data provides critical information about all modules on the memory channel and is intended to be used by the system's BIOS in order to properly initialize and optimize the system memory channels."

I presume that this means Corsair sourced a 2,133MT/s part rated for 15-15-15-36 timing at the standard temperature specification of 95°C.

The G.Skill Trident Z F4-3333C16D-32GTZ is a 16GB DDR4 product tested at 3,333MT/s and 16-16-16-36 timing. The F4-3200C14D-32GTZ is tested to 3,200MT/s and 14-14-14-34 timings. Both cite SPD as 2133MT/s but do not indicate latency timing.

The Micron 8Gb DDR4 SDRAM

The Micron 8Gb DDR4 SDRAM datasheet cites a range of specifications, including refresh time for various temperatures, timings for various data transfer rates, and operating temperatures. The commercial part specifies 0° ≤ Tc ≤ 95°C.

Note: Refresh time is 64ms at Tc range -40° to 85°C and 32ms at >85° to 95°C.

Components typically have two grades, a regular part and an E part. There might also be a Y part.

The DDR data transfer rate is twice the clock rate, as its name implies. A data transfer rate of 2,666 MT/s corresponds to a 1,333MHz clock rate. The clock cycle time at this rate is 0.75ns, which Micron labels as the -075 Speed Grade for the regular part and -075E for the E part.
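As a quick check, the clock cycle follows directly from the MT/s rating. Below is a minimal sketch of the conversion (the division by 2 reflects DDR's two transfers per clock):

```python
def clock_ns(mt_per_s: float) -> float:
    """Clock rate in MHz is MT/s divided by 2 (double data rate),
    so the cycle time in ns is 2000 / MT/s."""
    return 2000.0 / mt_per_s

# 2,666 MT/s -> 1,333 MHz clock -> 0.75 ns cycle (Micron -075 grade)
print(round(clock_ns(2666), 2))  # 0.75
```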

The table below lists key timing parameters at speed grades -075 and -093 for the regular and E parts.

Speed Grade   Data Rate (MT/s)   Target CL-nRCD-nRP   t (ns)
-075E         2666               18-18-18             13.50
-075          2666               19-19-19             14.25*
-093E         2133               15-15-15             14.06
-093          2133               16-16-16             15.00

* The datasheet shows 13.75ns in parentheses for the -075 part (and 13.50ns in parentheses for the -093E part). These may be the true timings, with the listed value being the lowest integer multiple of the clock cycle that is not below the true value?
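If the parenthesized values are indeed the true analog timings, the listed figures follow from rounding up to a whole number of clocks. A small sketch of that interpretation:

```python
import math

def listed_timing(true_ns: float, clock_ns: float):
    """Round the true timing up to the next integer multiple of the clock.
    Returns (clock count, listed timing in ns)."""
    clocks = math.ceil(true_ns / clock_ns)
    return clocks, clocks * clock_ns

# -075 part: 13.75 ns true timing at a 0.75 ns clock
print(listed_timing(13.75, 0.75))    # (19, 14.25)
# -093E part: 13.50 ns true timing at a 0.9375 ns clock
print(listed_timing(13.50, 0.9375))  # (15, 14.0625)
```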

Micron lists tAA, tRCD, and tRP separately, but they all have the same value and are shown as just t above. tAA corresponds to tCL (CAS latency).

Note that the Corsair Vengeance LPX, citing an SPD speed of 2,133 MT/s and 15-15-15 timings, matches the Micron -093E specifications. This might imply that Corsair buys parts from a DRAM manufacturer pre-screened to a better timing. They might also conduct additional screening afterwards. This is consistent with Onur Mutlu's report in which limiting temperature to 55°C contributes to a 30% faster read.

Performance Memory II

The Corsair and G.Skill performance memory specifications from the previous section are below.

Product         Data Rate (MT/s)   Timing        tCL (ns)   tRCD/tRP (ns)
Vengeance LPX   3600               16-18-18-36   8.9        10.0
F4-3333C16D     3333               16-16-16-36   9.60       9.60
F4-3200C14D     3200               14-14-14-34   8.75       8.75

The Corsair Vengeance LPX at 3600 MT/s has 16 clock tCAS time, and 18 for each tRCD and tRP. The two G.Skill Trident Z products specify uniform component timings.

There are streaming workloads in gaming for which data transfer rate is a factor. But it is clear that latency is also a factor. For database transaction processing, latency is most important, while transfer above some value may not be particularly important.

Also, it might be that the Xeon SP processors only support the specified data transfer rates of 2,666 MT/s in gen 1 and 2,933 in gen 2. In this case, we are just interested in latency timing at the specified rate and not higher rates.

In the server system, overall round-trip latency after an L2 miss comprises the following elements:
  1) L3,
  2) memory controller,
  3) transmission time (both directions),
  4) DRAM timing (tCL + tRCD at light load and with good luck?)
  5) transfer time for 8 words (4 clocks)?

If each memory channel has a good pattern in which accesses to a given rank (within one or two DIMMs) and bank (within the DRAM chip) are sufficiently spaced out, then tRP is hidden? Otherwise tRP is incurred?

E1EIKI_2013_v7n1_53_f004b

In any case, DRAM timing is complicated, see  DRAM.

The assumption for server workloads is that memory access is somewhat random, and hence we do not expect repeated access within a given row (tCL only)? For B-tree navigation, certain index pages, too many to fit in L3, are accessed frequently enough that we incur tRP as well?

The DRAM contribution to latency is then tCL + tRCD + 4 clock cycles (to transfer 8 words) in favorable circumstances?

For the Trident Z products at 3,333 MT/s with 16 clock timing and 3,200 MT/s with 14 clock timing, the table below shows the calculated DRAM latency.

At 3,200MT/s and 14-clock timing, the clock is 0.625ns, tCL + tRCD = 17.5ns, and four clocks is 2.5ns, for 20.0ns total.

Clock (ns)   Data Rate (MT/s)   Timing     t (ns)   tCL+tRCD+4×clock
0.600        3,333              16-16-16   9.60     = 21.6ns
0.625        3,200              14-14-14   8.75     = 20.0ns
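The table values can be reproduced with a short calculation. This is a sketch assuming the favorable case of tCL + tRCD plus four clocks for the 8-word burst:

```python
def dram_latency_ns(mt_per_s: float, tcl: int, trcd: int) -> float:
    """Favorable-case DRAM latency: tCL + tRCD plus 4 clocks for the burst."""
    clock = 2000.0 / mt_per_s           # DDR: clock cycle (ns) = 2000 / MT/s
    return (tcl + trcd + 4) * clock

print(round(dram_latency_ns(3333, 16, 16), 1))  # 21.6
print(round(dram_latency_ns(3200, 14, 14), 1))  # 20.0
```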

It is unclear why the 3,333 MT/s part could not achieve 15 clock latency corresponding to 9.0ns when the 3,200 MT/s part could do 8.75ns. A similar anomaly occurs between the G.Skill 3,600 and 4,000 MT/s parts. It is unclear as to what balance between data transfer rate (bandwidth) and latency is best for gaming systems.

The DIMM

Temperature is a large factor in the achievable latency of current production DRAM parts. That, plus one extra clock of timing reduction from screened parts, could enable 35% lower latency at the DRAM. The question now is whether there is room on existing server motherboards, which have different specifications than desktop motherboards?

The image below is from the Micron 64GB DDR4 LRDIMM datasheet. This is the full-size module with dimensions 133.48 x 31.25mm, or 5.25 x 1.23in.

  DIMM1c

The thickness of a DIMM is 3.9mm. Note the DIMM is double sided in that there are chip packages on both sides of the module. Below left is a figure from the Micron datasheet. Below right is a representation of a double sided DIMM.

      DIMM1d       DIMMp1

The Motherboard

The image below is from the Supermicro X11DPU motherboard, showing one of the two Xeon SP sockets with accompanying 12 DIMM slots.

    XeonSPskt2

There are memory channels on each side of the processor and 2 DIMMs per channel. The interior dimension between the white locking tabs of the DIMM slot appears to be 126.65mm or 4.99 in.

Based on the DIMM slot interior length of 126.65mm, the spacing between DIMM slots appears to be 8.6mm? The gap between populated DIMM slots is then 8.6 - 3.9 = 4.7mm. Of course, it would help if Supermicro or others could provide the actual specification.

Below is a photo of a Xeon SP 3647-pin socket with 6 DIMMs on each side.

  XeonSPskt4
Photo credit: TG

Below is a closeup of the populated DIMM slots, which may support my guess of around 4.7mm spacing between DIMMs.

      XeonSPskt3

In case anyone is curious, the image below is for DIMM slots of an Intel Desktop and Xeon E LGA 1151 motherboard.

      XeonEskt

The spacing between DIMMs appears to be 10mm, more than the LGA 3647 motherboard.

DIMM with Heatsink

Below is a representation of two DIMMs based on an 8.6mm spacing between slots and a 4.7mm gap between DIMMs.

      DIMM1

The gap is narrow, but we could have a heat sink for a double-sided DIMM as follows.

      DIMM1

The fins of the heat sink could have different offsets between the front and back sides. When multiple adjacent DIMMs are populated, the heat sink fins alternate to use the space between the remaining gap.

It is possible that some sort of mechanism would be needed to insert the entire bank of DIMMs together. Air ducts should be employed to direct airflow over the fins.

Performance Memory at 2,666MT/s

For an ECC RDIMM at 2,666MT/s, the clock is 0.75ns. The standard part timing is 19 clocks, and the E part timing is 18. Assuming that similar timing gains could be achieved at the 55°C temperature bound as in gaming memory, we also look at 14 and 13 clock timings below.

Grade                      Data Rate (MT/s)   Timing     t (ns)   tCL+tRCD+4×clock
-075                       2,666              19-19-19   14.25    = 31.5ns
-075E                      2,666              18-18-18   13.50    = 30.0ns
(make this part please!)   2,666              14-14-14   10.50    = 24.0ns
(make this part please!)   2,666              13-13-13    9.75    = 22.5ns

Note, there is an extra clock for registered DIMMs versus a single unbuffered DIMM, not accounted for above.

Single Processor Xeon SP Memory Latency

In a single processor (socket) Xeon SP system, the memory latency should be 77ns, based on the L3 latency plus 58ns with the standard -075 part. The L3 latency for the 28-core XCC die is reported to be 19.5ns, though this might be less for the LCC and HCC dies?

Of the 58ns occurring after L3 miss, we expect 31.5ns to be in the DRAM, and the remainder to be split between the memory controller and transmission time (possibly including the RDIMM overhead?).

With the E part, we expect memory latency to be reduced by 2 DRAM clocks, or 1.5ns, to 75.5ns. The impact in database transaction processing, characterized by pointer-chasing code, is 77 ÷ 75.5 = 1.0199, for a 2% gain. This is perhaps too small to be of interest?

However, if either of 14 or 13 clock timings are possible, then the expected gains would be 10.8% or 13.2%. This is noticeable!
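The gains above follow from applying the DRAM-only savings to the 77ns round-trip baseline. A sketch of the arithmetic, assuming pointer-chasing throughput scales directly with round-trip latency:

```python
def pointer_chase_gain(base_roundtrip_ns, base_dram_ns, new_dram_ns):
    """Estimated speedup when only the DRAM portion of the round-trip changes."""
    new_roundtrip = base_roundtrip_ns - base_dram_ns + new_dram_ns
    return base_roundtrip_ns / new_roundtrip - 1.0

# Baseline: 77 ns round-trip, 31.5 ns of it in DRAM (-075 at 19 clocks)
for clocks in (18, 14, 13):
    new_dram = (2 * clocks + 4) * 0.75   # tCL + tRCD + 4 clocks at 0.75 ns
    print(clocks, round(pointer_chase_gain(77.0, 31.5, new_dram) * 100, 1))
# 18 -> 2.0%, 14 -> 10.8%, 13 -> 13.2%
```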

But how valuable is it? Consider that our system has 12 DIMM slots and the current price of a 64GB DIMM is about $387, for a total memory price of $4,644. At a 3X premium, the extra cost is roughly $9,000. Would a 10% performance gain be worth $9,000?

Does that seem to be an extraordinary price step for 4.5ns of tCL and tRCD?

The SQL Server 2017 Enterprise Edition license is $6,736 per core at list price. The price for 28-core licenses is $188,615 at list. It would help if Microsoft could suggest a typical discount percentage applicable in the US?

From this point of view, 10% improvement in performance per core could justify such a cost structure, if a 30% reduction in DRAM latency can be delivered.

At 2,933 MT/s, the clock cycle is 0.682ns. The standard timing is 21 clocks or 14.32ns. At 20 clocks, the latency is 13.64ns.

Summary

There are existing performance memory products having substantially lower latency than the ECC RDIMM/LRDIMMs available for server systems. These are sourced from standard DRAM chips, probably with tighter screening and additional testing. The production methods could be employed for server memory. There is probably sufficient room for a heatsink on server motherboards.

While an enhanced memory module could be used in multi-processor systems, the relative performance impact is larger in single socket systems. Given the cost structure of software licensing, the value of low latency memory can justify the expected price delta over standard latency parts.

The longer term goal is to establish sufficient market demand for DRAM manufacturers to justify the cost and effort necessary for a product designed from the beginning for low latency.

Even better would be if Intel or others could eliminate or hide the L3 latency, either with a very fast L3 tag check or by issuing the memory access concurrently with the L3 lookup.

 

References

Onur Mutlu, Professor of Computer Science at ETH Zurich
website, lecture-videos

ACACES Summer School 2018, Lecture 5 Low-Latency Memory
DRAM memory latency: temperature, row location, voltage, etc.

Low temperature operation also contributes substantially to lower DRAM latency. See Onur Fall 2017 Computer Architecture, Lecture 6, Low-Latency DRAM and Processing In Memory.

Young Hoon Son, Seongil O, Yuhwan Ro, Jae W. Lee, Jung Ho Ahn,
Reducing Memory Access Latency with Asymmetric DRAM Bank Organizations, ISCA 2013 (CHARM).

 

Comments

1. If a significant reduction is made in DRAM latency, the next thing to think about is L3. The large die, high core count processors now have 18-19ns L3 latency.

This reflects long wire distances, perhaps 22mm horizontal and 14.7mm vertical, on the 32.18 × 21.56mm die.

  Skylake_XCCd

The smaller quad-core client processor die has L3 latency of 10-11ns, with die dimensions 13.31 × 9.19mm, and the length of the ring is perhaps 6.3mm?

  Skylake_4c

2. DRAM refresh - temperature. Could refresh period be increased at lower temperature?
The Onur Mutlu presentation at MemCon 2013, Memory Scaling: A System Architecture Perspective, slide 22, says refresh overhead will rise sharply beyond 8Gb density?

Locality of Errors

In Onur's lecture, there is a slide showing that errors with tRCD = 7.5ns showed spatial locality on the DRAM chip. This could be because the bit cells further away, either on the addressing side or on the sensing side, will encounter errors first. Or it could be manufacturing variations, in which presumably an entire batch will have slow cells in certain locations.

The common scenario in modern systems is that the memory configuration is far beyond what is required. More than half of the memory probably serves to reduce IO from moderate levels down to noise.

The extra 3ns reduction in DRAM component timing would be highly valuable. There is no need for a sophisticated multi-level timing scheme. We could simply test a kit of DIMM modules with chips manufactured in the same batch, mark the slow cells, which hopefully fall within common operating system pages, then ignore those pages? If we lose 20-30% of the DIMM capacity, no problem. In a prior article, I suggested disabling the entire upper half of each DRAM bank.

If component timing could be reduced to 7.5ns, then the low load DRAM latency is 15ns, and round-trip memory latency is 63.5ns. This is 21% better than the baseline of 77ns.

 

Low Latency Memory in Servers from Linkedin 2019-09-17

  eDRAM1Gb

In the gaming world, fanatics go to extraordinary lengths to achieve an advantage. There is no prize for second place in mortal combat. For this, vendors offer low latency non-ECC unbuffered memory with component timings of approximately 9-10ns, well below the standard 14+ns. DRAM vendors do not actually make special chips designed for gaming systems, though they may offer a part with enhanced timing. The high performance memory is mostly achieved by (edit: lower temperature operation? and partly by) cherry picking standard parts able to meet better timing specifications. The best gaming memory can carry a price three times higher than a conventional part.

The question here is whether a 3X price premium for low latency memory is also viable in server systems. We should be able to use screened DRAM chips to make ECC RDIMMs in the same way as for modules used by gaming systems.

Obviously, lower memory latency is good. The issue in servers was that the full system architecture, including the memory subsystem was complex, having multiple hops between processors and DRAM. This is inherent in multi-processor systems with expander buffers to connect extra memory slots.

  XeonE7_2

The result is that a large part of the complete memory access latency takes place outside of the DRAM chip. If considerable expense were incurred for a faster DRAM chip, there would be little performance gain at system level.

Multi-processor systems with complex architecture were important years ago. In recent years, there are enough cores on a single processor chip and socket to handle even very heavy workloads. DRAM chip density is now high enough that memory expansion buffers are no longer necessary. Direct connection between processor and DRAM is now standard while still having sufficient memory capacity for most server workloads.

Simply stepping down from a multi-processor system by itself appears to lower memory latency by 13-30% on the Intel Xeon SP processor. The 2-socket system has 89ns local node and 139ns remote node memory latency.

  XeonSP_2S

Single socket memory latency is 76-77ns. Both are based on standard DRAM tCAS, tRCD and tRP timings of approximately 14.25ns each. For the single socket system, 18-19ns of the memory latency occurs in the L3 cache, and the remaining 58ns is split between the memory controller, inter-chip transmission and the DRAM chip.

  XeonSP_1S

The 7-cpu.com website reports memory latency for the Intel i7-7820X (same die as the Xeon SP LCC) at 4.3GHz as 79 cycles + 50ns using DDR4-3400 at 16-18-18, corresponding to approximately 10ns DRAM component timings. An 8ns reduction in latency at the DRAM chip, from 77ns to 69ns at the system level, translates to an 11% performance gain for database transaction processing workloads, characterized by pointer-chasing code.
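The 7-cpu.com figure, reported as a cycle count plus a fixed nanosecond term, can be converted to a single latency number as follows (a sketch of the conversion):

```python
def latency_ns(cycles: int, ghz: float, extra_ns: float) -> float:
    """Convert 'N cycles + M ns' latency reports to total ns at a given clock."""
    return cycles / ghz + extra_ns

# i7-7820X at 4.3 GHz: 79 cycles + 50 ns with DDR4-3400 16-18-18
print(round(latency_ns(79, 4.3, 50.0), 1))  # 68.4 ns, roughly 69 ns
# Implied pointer-chasing gain versus the 77 ns baseline
print(round(77.0 / 69.0, 3))                # 1.116, about 11%
```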

Assume our system has 12 x 64GB = 768GB total system memory. The current price of the 64GB ECC LRDIMM is about $350, contributing $4,200 to our system cost. Based on the pattern in gaming memory of 3X premium for cherry picked parts, our low latency 64GB ECC DIMM is now $1050, and $12,600 for a set of 12.

Does it seem that $8,400 is a great deal to pay for 11% on a specific workload? Consider that the Intel 2nd gen Xeon SP 6252 24-core 2.1GHz processor carries a published price of $3,655 or $152 per core and the Xeon SP 8276 28-core 2.2GHz at $8,719 or $311 per core. (There are also 24-core processors at 2.4 and 2.9GHz, and 28-core processors at 2.7 and 3GHz.)

In essence, Intel has established that a 17% increase in core count at this level justifies a premium of $5000. So the 3X premium for memory is in the right ballpark if it can deliver 11% performance gain. This is just the hardware cost. Now factor in the SQL Server Enterprise Edition per-core licensing cost of $6,736 list and perhaps $4,100 discounted? And also the annual support cost.

This is a sufficient argument to justify low-latency ECC memory for server systems, which can be made now with the existing method used for gaming systems. Ideally, the right thing to do is a DRAM chip purpose-designed for low latency instead of screening parts from a design optimized for low cost. This is probably a very involved effort, and DRAM vendors are extremely secretive about their technology.

Elsewhere I have suggested an interim measure of disabling the upper half of each DRAM bank, which bypasses the slowest part of the bank. The apparent cost overhead of this is 2X, which is less than the 3X premium of cherry picked parts, though we also give up half of the system memory capacity.

  DRAM_bank2

Further measures could be to finally abandon multiplexed row-column addressing and to sub-divide the DRAM chip into smaller banks, as in RLDRAM. We may not need to go to the extreme measures of the Intel 1Gbit eDRAM chip with 128 banks citing 3.75ns latency?

Addendum Oct 17

See Onur Mutlu Lectures on Youtube,  SAMOS Tutorial - Memory Systems - Part 4: Low-Latency Memory

At around 13:00, he mentions DRAM chips being spec'd to 85°C. Operating at 55°C, read latency can be reduced by 32.7% and write latency by 55.1%, sometimes more! We should have heat sinks on both sides of server R/LRDIMMs, as allowed by spacing, and ducts to force airflow through the heat sinks.

In testing at low timings, errors tend to occur in specific locations. What if there were a mechanism to mark bad rows on individual DIMMs? Perhaps something in the BIOS/UEFI could mark certain rows as bad and let the OS know not to use those addresses?

Server memory has ECC to detect and correct soft errors, normally radiation induced. But another source of errors is operating DRAM close to the timing spec. Perhaps we should have 80-bit wide DIMMs for extra detect and correct capability. Furthermore, if a clean data page has a detected but uncorrectable error, instead of an OS blue screen, let SQL Server handle it.

 
