
Fast DRAM Update (2019-10)

The SAMOS Tutorial - Memory Systems - Part 4: Low-Latency Memory video, on the Onur Mutlu Lectures YouTube channel, mentions several ways in which existing DRAM products could be used to achieve better performance with current-generation server processors. DRAM manufactured as components for main memory has conservative timing specifications to allow for high yield and a broad operating range. The rated operating temperature is typically 85°C, if not higher. When operated beyond specification (at lower latency), errors tend to occur in specific rows. Also, ECC can correct errors to some degree.

Timing
Micron DRAM components come in two grades: a regular part and an E part. For the Micron 8Gb DDR4 SDRAM at 2666MT/s, the -075 speed grade has tCL-tRCD-tRP timings of 19-19-19, corresponding to 14.25ns (the Micron datasheet labels the first timing as tAA). The -075E part has 18-18-18 timings, corresponding to 13.50ns. At any given data transfer rate, the programmed timing, being an integer number of clock cycles, must be equal to or higher than the true time specification.
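
As a check on the arithmetic, here is a minimal sketch converting cycle counts to nanoseconds at a given transfer rate, and back (the function names are mine, purely for illustration):

    import math

    # DDR4 transfers data on both clock edges, so the clock period in ns
    # is 2000 / (data rate in MT/s): 0.75ns at 2666 MT/s.
    def timing_ns(clocks, mt_per_s):
        return clocks * 2000.0 / mt_per_s

    # smallest integer clock count whose duration meets the true timing spec
    def min_clocks(true_ns, mt_per_s):
        return math.ceil(true_ns * mt_per_s / 2000.0)

    print(timing_ns(19, 2666))     # -075 grade:  ~14.25ns
    print(timing_ns(18, 2666))     # -075E grade: ~13.50ns
    print(min_clocks(13.5, 2666))  # 18 clocks at 2666 MT/s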

All commercially available ECC RDIMM and LRDIMM products show timings of 19-19-19, meaning that no one bothers to make DIMMs for server systems from the E part?

Temperature
Onur states that the upper-bound operating temperature of 85°C is generous, and that 55°C is achievable. The Micron commercial part has an operating temperature range of 0 to 95°C; however, the refresh period is 64ms only up to 85°C, and must be reduced to 32ms at higher temperatures. At 55°C, his investigations showed that 30% better read timings were possible, more for writes? Does this mean timings around 10ns? This would be consistent with the specialty unbuffered non-ECC DIMMs for gaming systems, which have heatsinks for better cooling.

The DIMM

The image below is for a Micron 64GB DDR4 LRDIMM. This is presumably the full-size module with dimensions 133.48 × 31.25mm, or 5.25 × 1.23in.

DIMM1

The interior dimension between the white locking tabs of the DIMM socket appears to be 126.65mm or 4.99 in.

The thickness of a DIMM is 3.9mm. Note the DIMM is double-sided, in that there are chip packages on both sides of the module. Below left is a figure from the Micron 64GB ECC LRDIMM datasheet. Below right is a representation of a double-sided DIMM.

   DIMM1       DIMM1

The Motherboard

The image below is from the Supermicro X11DPU motherboard, showing one of the two Xeon SP sockets with its accompanying 12 DIMM slots, 2 DIMMs per channel.

DIMM1

Based on the DIMM slot interior length of 126.65mm, the spacing between DIMM slots appears to be 8.6mm. The gap between populated DIMM slots is then 8.6 - 3.9mm = 4.7mm. Of course, it would help if Supermicro or others could provide the actual specification.
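
The gap estimate as a worked calculation (note that the 8.6mm slot pitch is my estimate from the image, not a published figure):

    slot_pitch_mm = 8.6       # estimated spacing between DIMM slots
    dimm_thickness_mm = 3.9   # module thickness, packages on both sides
    gap_mm = slot_pitch_mm - dimm_thickness_mm
    print(f"{gap_mm:.1f}mm")  # 4.7mm between adjacent populated DIMMs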

Below is a photo of the DIMM slots populated, which appears to support my guess of around 4.7mm spacing between DIMMs.

DIMM1

DIMM with Heatsink

Below is a representation of two DIMMs based on an 8.6mm spacing between slots and a 4.7mm gap between DIMMs.

      DIMM1

The gap is narrow, but we could have a heat sink for a double-sided DIMM as follows.

      DIMM1

The fins of the heat sink could have different offsets between the front and back sides. When multiple adjacent DIMMs are populated, the heat sink fins alternate to use the remaining gap between modules.

It is possible that some sort of mechanism would be needed to insert the entire bank of DIMMs together. Air ducts should be employed to direct airflow over the fins.

Single Processor Xeon SP Memory Latency

For a single (XCC die) Xeon SP processor (or socket), the full round-trip latency is probably 77ns, of which 19ns is in the L3. The difference of 58ns is split between the memory controller, transmission time and the DRAM chip.

With conventional timings of 14.25ns for each of tRCD and tCAS, the two combined are 28.5ns. Under light to medium load, the third component, tRP, is hidden? This would imply the 58ns of latency after L3 is split as 28.5ns on the DRAM chip and 29.5ns between the controller and transmission?

If the timing components can be reduced from 14.25ns to 10.5ns, that reduces the combined tCAS + tRCD by 7.5ns. This would reduce round-trip memory access from 77ns to 69.5ns. For a database transaction processing workload, this could translate to a 10% increase in performance?
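
This arithmetic, together with the 7.5ns scenario considered later, can be collected into a minimal latency model (the 77ns round trip, 19ns L3, and 29.5ns controller-plus-transmission figures are the estimates above, not measurements):

    L3_NS = 19.0          # L3 latency on the XCC die (estimate)
    CTRL_XMIT_NS = 29.5   # controller + transmission, inferred as 58 - 28.5
    BASELINE_NS = 77.0    # conventional round-trip estimate

    def round_trip_ns(component_ns):
        # light-load model: tRP is hidden, so the DRAM chip
        # contributes tRCD + tCAS = 2x the component timing
        return L3_NS + CTRL_XMIT_NS + 2 * component_ns

    for t in (14.25, 10.5, 7.5):
        rt = round_trip_ns(t)
        print(f"{t}ns timing -> {rt}ns round trip, "
              f"{(BASELINE_NS / rt - 1) * 100:.0f}% better")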

Locality of Errors

In Onur's lecture, there is a slide showing that errors at tRCD = 7.5ns exhibit spatial locality on the DRAM chip. This could be because the bit cells further away, either on the addressing side or on the sensing side, encounter errors first. Or it could be manufacturing variation, in which case presumably an entire batch will have slow cells in the same locations.

The common scenario in modern systems is that the memory configuration is far beyond what is required. More than half of the memory probably serves to reduce IO from a moderate level to a noise level.

The extra 3ns reduction in DRAM component timing would be highly valuable. There is no need for a sophisticated multi-level timing scheme. We could simply test a kit of DIMM modules with chips manufactured in the same batch, mark the slow cells, which hopefully fall within common operating system pages, then ignore those pages? If we lose 20-30% of the DIMM capacity, no problem. In a prior article, I suggested disabling the entire upper half of each DRAM bank.
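
A minimal sketch of the page-retirement idea, assuming binning produces a list of known slow cell addresses (the names, addresses, and 4KB page size here are illustrative):

    PAGE_SIZE = 4096  # typical OS page size in bytes

    def pages_to_retire(slow_cell_addrs):
        # map each known-slow physical byte address to its containing page;
        # the OS would then exclude these pages from allocation
        return {addr // PAGE_SIZE for addr in slow_cell_addrs}

    # two hypothetical slow cells in one page, a third in another
    retired = pages_to_retire([0x1000_0040, 0x1000_0FFC, 0x2000_0000])
    print([hex(p * PAGE_SIZE) for p in sorted(retired)])
    # ['0x10000000', '0x20000000']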

If component timing could be reduced to 7.5ns, then the low load DRAM latency is 15ns, and round-trip memory latency is 63.5ns. This is 21% better than the baseline of 77ns.


Fast DDR4 SDRAM Proposal (2018-10)

Memory latency is of such paramount importance that we should, to a large degree, abandon multi-processor systems in favor of single-processor (socket) systems, as detailed in Multi-Processors Must Die. The advantage of having all memory on the local node outweighs much of the benefit of the additional resources in a multi-socket system, which is partially negated by the burden of remote node cache coherency penalties.

To this purpose, DRAM vendors are called upon to offer low latency DRAM suitable for use as main memory in server systems. The value of lower memory latency outweighs almost any cost incurred. Furthermore, we can sacrifice system memory capacity if necessary, as detailed in Too Much Memory.

DRAM_bank2

It is understood that this is not a trivial request. Despite its conceptual simplicity, modern DRAM is in fact incredibly complicated, with almost every detail a tightly guarded trade secret. Asking manufacturers to produce a special DRAM product for low latency could incur significant up-front costs.

Given that the value is expected to be very high, a temporary product that could be brought to market with much less effort and expense is proposed. Much of the latency in modern DRAM is attributed to the bank size and organization. The supporting elements outside of the DRAM bit array comprise a significant fraction of the die size. As cost was viewed as one of the primary drivers, the existing design attempts to minimize the area necessary for non-bit-array elements.

Hence, the DRAM is made with as few banks as necessary to support the sustained bandwidth objective, this being the other major factor in main memory. The bank arrangement is also structured with cost in mind? DDR4 has 16 banks.

At 8Gb density in a ×4 configuration, there are 2G 4-bit words. Each bank is 512Mbit, or 128M words. The full address is 31 bits (2^31 words), of which 4 bits are for the bank, including the bank group.

The original purpose of multiplexed row and column addressing was to minimize the number of pins, which contributes to system cost. The 4Kbit Mostek MK4096 had 16 pins, versus 22 pins for the Intel 2107 4Kb product.

If pin count were still important, the multiplexed addressing scheme for 2G words might have 17 signals: 4 bank + 13 row address bits in the first group, followed by 14 column address bits in the second group (the bank address needs to be among the first set).

But instead, the 8Gb (2G×4) DRAM has 21 address bits. The 4 bank and 17 row address bits are sent first, followed by a 10-bit column address in the second group. In other words, the bank organization is highly asymmetric at 128K (131,072) rows × 1024 columns (× 4 bits per word). Presumably this arrangement minimizes the area necessary for the sense amp array (and IO gating).
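
The address-bit arithmetic as a sketch:

    import math

    density_bits = 8 * 2**30   # 8Gb part
    word_width = 4             # x4 configuration
    banks = 16                 # DDR4: 4 bank groups x 4 banks each

    words = density_bits // word_width       # 2G words
    addr_bits = int(math.log2(words))        # 31 bits in total
    bank_bits = int(math.log2(banks))        # 4, including bank group
    row_bits, col_bits = 17, 10              # the actual DDR4 split
    assert bank_bits + row_bits + col_bits == addr_bits

    print(2**row_bits, "rows x", 2**col_bits, "columns per bank")
    # 131072 rows x 1024 columns (x 4 bits per word)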

The long path for the rows is said to be a major contributor to latency in existing DDR4 products. The proposal is then to simply disable the upper half of the rows in the bit array furthest away from the sense amps.

DRAM_bank2

Timings can now be based on the rows closest to the sense amps. This is analogous to the old practice of short-stroking hard disk drives for lower seek times. Perhaps the term is now short-banking?
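
In terms of the addressing above, disabling the far half of each bank drops one row address bit, leaving half the capacity:

    # disabling the far half of each bank drops one row address bit
    banks, row_bits, col_bits, word_width = 16, 16, 10, 4
    remaining = banks * 2**row_bits * 2**col_bits * word_width
    print(remaining / 2**30, "Gb usable")  # 4.0 Gb of the original 8Gb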

There are academic papers that propose to split the DRAM into fast and slow cells. The fast rows are those closer to the sense amps; the slow rows are further away. The driving requirement in these proposals is to not increase the cost of the DRAM.

The arguments made in the Multi-Processors and Memory references cited earlier are that memory latency is extremely valuable, and that system memory capacity is probably much larger than necessary.

A tiered memory model would require software support, both in the operating system and at the application levels. The desire here is to have a quick drop-in solution, regardless of cost implications.

The justification is sufficient to absorb double the cost per capacity. Halving system memory capacity is also acceptable. Even better is if we could bin production parts for best characteristics, then use a control signal to disable the far rows.

The timing values of the three principal components, tCAS, tRCD and tRP, in current generation DDR4 are approximately 14ns each. This is a conservative value to allow for high yield. By binning parts and employing a heat sink, memory modules with 10ns timings in the principal components are possible.

The objective for the short-stroked rows would be 7ns, with tRC perhaps less than 30ns?

In summary, this is only the near-term proposal. Long-term, dispensing with multiplexed row-column addressing is desired.

Special parts with higher bank counts, similar to RL-DRAM (16 banks at 1.125Gb density) may be desirable. The extreme bank count of the Intel 1Gb eDRAM (128 banks?) is probably not necessary.

If it happens that binning is an effective strategy, then it might be better to bin the standard part with half of the bit array disabled instead of having a special part.


References

Onur Mutlu, Professor of Computer Science at ETH Zurich
website, lecture-videos

ACACES Summer School 2018, Lecture 5 Low-Latency Memory
DRAM memory latency: temperature, row location, voltage, etc.

Low temperature operation also contributes substantially to lower DRAM latency. See Onur Fall 2017 Computer Architecture, Lecture 6, Low-Latency DRAM and Processing In Memory.

Young Hoon Son, Seongil O, Yuhwan Ro, Jae W. Lee, and Jung Ho Ahn,
Reducing Memory Access Latency with Asymmetric DRAM Bank Organizations, ISCA 2013, isca13_charm.


Comments

1. If a significant reduction is made in DRAM latency, the next thing to think about is the L3. The large-die, high core count processors now have 18-19ns L3 latency.

This reflects the long distances involved, perhaps 22mm horizontally and 14.7mm vertically on the 32.18 × 21.56mm die.

  Skylake_XCCd

The smaller quad-core client processor die has an L3 latency of 10-11ns, with die dimensions of 13.31 × 9.19mm; the length of the ring is perhaps 6.3mm?

  Skylake_4c   Skylake_4c

If we have very low latency to DRAM, then we might consider, on an L2 miss, issuing the memory access concurrently with the L3 lookup. If the L3 comes back first, then the memory access is discarded?
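
A rough model of when this would help, assuming the 63.5ns round trip from the update above splits as 19ns L3 plus a 44.5ns memory path that could be started at L2-miss time (these splits are my assumptions):

    def avg_latency_ns(l3_hit_rate, l3_ns=19.0, mem_path_ns=44.5):
        # serial: the memory access waits for the L3 miss (19 + 44.5 = 63.5ns)
        serial = l3_hit_rate * l3_ns + (1 - l3_hit_rate) * (l3_ns + mem_path_ns)
        # concurrent: both issued at L2-miss time; on an L3 hit the
        # memory access is discarded, wasting some DRAM bandwidth
        concurrent = l3_hit_rate * l3_ns + (1 - l3_hit_rate) * mem_path_ns
        return serial, concurrent

    print(avg_latency_ns(0.5))  # (41.25, 31.75): saves the 19ns on every miss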

2. What is the net effect of DRAM refresh?
Does it increase the average memory access latency?
If so, then perhaps even SRAM as main memory is viable for critical systems?
Onur Mutlu's presentation at MemCon 2013, Memory Scaling: A System Architecture Perspective, slide 22, says refresh overhead will rise sharply beyond 8Gb density?
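
A rough estimate of the refresh effect, assuming the standard DDR4 average refresh interval tREFI of 7.8µs and a tRFC of about 350ns for an 8Gb part:

    tREFI_ns = 7800.0   # average refresh interval (64ms / 8192 commands)
    tRFC_ns = 350.0     # refresh cycle time for an 8Gb DDR4 part

    busy = tRFC_ns / tREFI_ns
    # a request arriving at a random time collides with a refresh in
    # progress with probability `busy`, then waits tRFC/2 on average
    added_ns = busy * tRFC_ns / 2
    print(f"{busy:.1%} unavailable, ~{added_ns:.1f}ns added on average")
    # 4.5% unavailable, ~7.9ns added on average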