
Low Latency Memory (2019-09)

In the gaming world, fanatics go to extraordinary lengths to achieve an advantage. There is no prize for second place in mortal combat. For this market, vendors offer low latency non-ECC unbuffered memory with component timings of approximately 9-10ns, well below the standard 14+ns. DRAM vendors do not actually make special chips designed for gaming systems, though they may offer a part with enhanced timing. The high performance memory is achieved perhaps partly by lower temperature operation, and mostly by cherry-picking standard parts able to meet better timing specifications. The best gaming memory can carry a price three times higher than a conventional part.

The question here is whether a 3X price premium for low latency memory is also viable in server systems. We should be able to use screened DRAM chips to make ECC RDIMMs in the same way as for modules used by gaming systems.

Obviously, lower memory latency is good. The issue in servers was that the full system architecture, including the memory subsystem, was complex, with multiple hops between processors and DRAM. This is inherent in multi-processor systems with expander buffers to connect extra memory slots.

XeonE7_2

The result is that a large part of the complete memory access latency takes place outside of the DRAM chip. If considerable expense were incurred for a faster DRAM chip, there would be little performance gain at system level.

Multi-processor systems with complex architecture were important years ago. In recent years, there are enough cores on a single processor chip and socket to handle even very heavy workloads. DRAM chip density is now high enough that memory expansion buffers are no longer necessary. Direct connection between processor and DRAM is now standard while still having sufficient memory capacity for most server workloads.

Simply stepping down from a multi-processor system to a single socket appears to lower memory latency by 13-30% on the Intel Xeon SP processor. The 2-socket system has 89ns local node and 139ns remote node memory latency.

XeonSP_1S

Single socket memory latency is 76-77ns. Both are based on standard DRAM tCAS, tRCD and tRP timings of approximately 14.25ns each. For the single socket system, 18-19ns of the memory latency occurs in the L3 cache, and the remaining 58ns is split between the memory controller, inter-chip transmission and the DRAM chip.


The 7-cpu website reports memory latency for the Intel i7-7820X (same die as the Xeon SP LCC) at 4.3GHz as 79 cycles + 50ns using DDR4-3400 at 16-18-18, corresponding to approximately 10ns DRAM component timings. An 8ns reduction in latency at the DRAM chip, from 77ns to 69ns at the system level, translates to an 11% performance gain for database transaction processing workloads, characterized by pointer-chasing code.
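
As a rough check on that figure, here is a quick Python sketch assuming pointer-chasing performance scales as the inverse of memory latency (the 77ns and 69ns values are the ones cited above):

# Sketch: pointer-chasing performance taken as the inverse of memory latency.
baseline_ns = 77.0   # single socket, standard DDR4 component timings
reduced_ns  = 69.0   # same system with ~10ns DRAM component timings

speedup = baseline_ns / reduced_ns
print(f"speedup {speedup:.3f}x, gain {(speedup - 1) * 100:.1f}%")
# speedup 1.116x, gain 11.6% -- roughly the 11% cited above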

Assume our system has 12 x 64GB = 768GB total system memory. The current price of the 64GB ECC LRDIMM is about $350, contributing $4,200 to our system cost. Based on the pattern in gaming memory of a 3X premium for cherry-picked parts, our low latency 64GB ECC DIMM would be $1,050, or $12,600 for a set of 12.
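
The cost arithmetic, as a small Python sketch (the $350 street price and the 3X premium are the assumptions stated above):

# Memory cost for a 12 x 64GB (768GB) configuration.
dimm_count     = 12
standard_price = 350    # 64GB ECC LRDIMM, approximate current price
premium_factor = 3      # following the gaming memory pattern

standard_set = dimm_count * standard_price                    # 4200
low_lat_set  = dimm_count * standard_price * premium_factor   # 12600
print(standard_set, low_lat_set, low_lat_set - standard_set)  # 4200 12600 8400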

Does it seem that $8,400 is a great deal to pay for 11% on a specific workload? Consider that the Intel 2nd gen Xeon SP 6252 24-core 2.1GHz processor carries a published price of $3,655 or $152 per core and the Xeon SP 8276 28-core 2.2GHz at $8,719 or $311 per core. (There are also 24-core processors at 2.4 and 2.9GHz, and 28-core processors at 2.7 and 3GHz.)

In essence, Intel has established that a 17% increase in core count at this level justifies a premium of $5,000. So the 3X premium for memory is in the right ballpark if it can deliver 11% performance gain. This is just the hardware cost. Now factor in the SQL Server Enterprise Edition per-core licensing cost of $6,736 list and perhaps $4,100 discounted? And also the annual support cost.
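
For reference, a Python sketch of the per-core arithmetic behind this comparison (published processor prices as cited above; the SQL Server per-core list price is the figure used above):

# Per-core pricing for the two Xeon SP models cited above.
price_6252, cores_6252 = 3655, 24
price_8276, cores_8276 = 8719, 28

print(round(price_6252 / cores_6252), round(price_8276 / cores_8276))  # 152 311
print(round((cores_8276 / cores_6252 - 1) * 100))   # 17 (% more cores)
print(price_8276 - price_6252)                       # 5064 (~$5,000 premium)

# Per-core licensing makes additional cores even more expensive.
sql_ee_per_core = 6736
print(cores_6252 * sql_ee_per_core, cores_8276 * sql_ee_per_core)  # 161664 188608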

This is a sufficient argument to justify low-latency ECC memory for server systems, which can be made now with the existing method used for gaming systems. Ideally, the right thing to do is a DRAM chip purpose-designed for low latency instead of screening parts from a design built for low cost. This is probably a very involved effort, and DRAM vendors are extremely secretive about their technology.

Elsewhere I have suggested an interim measure of disabling the upper half of each DRAM bank, which bypasses the slowest part of the bank. The apparent cost overhead of this is 2X, which is less than the 3X premium of cherry-picked parts, though we also give up half of the system memory capacity.

DRAM_bank1a

Further measures could be to finally abandon multiplexed row-column addressing and to sub-divide the DRAM chip into smaller banks, as in RL-DRAM. We may not need to go to the extreme of the Intel 1Gbit eDRAM chip, with 128 banks and a cited 3.75ns latency.

eDRAM1Gb

Addendum (2019-10-17)

See Onur Mutlu Lectures on Youtube, SAMOS Tutorial - Memory Systems - Part 4: Low-Latency Memory

At around 13:00, he mentions DRAM chips being spec'd to 85C. Operating at 55C, read latency can be reduced by 32.7% and write latency by 55.1%, sometimes more! We should have heat sinks on both sides of server R/LRDIMMs as allowed by spacing, and ducts to force air flow through the heat sinks.

In testing at low timings, errors tend to occur in specific locations. It would help if there were a mechanism to mark bad rows on individual DIMMs, perhaps something in the BIOS/UEFI to mark certain rows as bad and let the OS know not to use those addresses.

Server memory has ECC to detect and correct soft errors, normally radiation induced. But another source of errors is operating DRAM close to the timing spec. Perhaps we should have 80-bit wide DIMMs for extra detect and correct capability. Furthermore, if a clean data page has a detected but uncorrectable error, instead of an OS blue screen, let SQL Server handle it.

Low Latency Memory (2018-03)

For the last twenty years, the standard practice has been to scale server performance with multi-processor (MP) systems. What is not widely understood is that after processors integrated the memory controller, the initial step from single processor to 2-way has only moderate scaling due to the introduction of non-uniform memory access (NUMA) in the MP system architecture. Scaling versus cores within a single processor however, can be excellent. The new practice should now be to employ single processor systems for the very large majority of situations, with multi-processor systems relegated to extreme boundary cases.

Once we accept that the single processor system is the correct strategy, the next conclusion is that low latency DRAM should be pursued. Over the last twenty years, the data transfer rate for memory has increased from 100-133MHz in the SDRAM generation to over 2666MT/s in DDR4. But the full access latency (or row cycle time) has barely changed, from perhaps 60ns to 45ns. This is not because some fundamental limit of DRAM has been reached. Rather, it is a deliberate decision to prioritize manufacturing cost. It was presumed that low latency would not have significant incremental market value to justify the higher cost.

A significant portion of the memory latency in multi-processor servers occurs in the system elements outside of DRAM. A reduction in latency on the DRAM chip may not impact total memory latency in the MP system sufficiently to outweigh the cost. However, in a single processor system with less outside contribution, the impact is sufficient to justify even a very large increase in DRAM cost. The first opportunity is for low latency DRAM that would be compatible with the existing memory interface of current or near-term next generation processors, be it DDR4 now or DDR5 in a couple of years.

The next step is for memory with a new optimized interface, which must be implemented in conjunction with the processor. The most obvious change is to demultiplex the address bus, basically RL-DRAM, but optimized for server systems. Memory latency is so important that it is likely even SRAM cost structure is viable, but the discussion here focuses on DRAM.

Memory Latency in Single and Multi-Processor Systems

There are three die options in the Skylake based Intel Xeon SP: LCC, HCC and XCC. The figure below shows a representation of the Xeon SP XCC model. Memory access latency on an L2 miss consists of L3 plus the DRAM access. L3 latency may be less than 19ns on the smaller LCC and HCC die.

XCC_1S_mem

Intel cites L3 at 19.5ns for the XCC die. 7-cpu reports memory access for Skylake X as L3 + 50ns = 69ns for DDR4-3400 16-18-18-36. On more conservative ECC memory at 2666MT/s 18-18-18 timing, the full memory access might be 76ns, or perhaps slightly higher.

The figure below, from Anandtech Dissecting Intel's... originally sourced from Intel, shows memory latency for a 2-way system with Xeon SP 8180 28-core processors. Also see Sizing Up Servers....

LocalRemote

Note that local node memory latency is shown as 89ns, significantly higher than latency on the single processor system. Remote node memory latency is 139ns. I am assuming that this difference is due to remote node cache coherency, but processor architects and engineers are welcome to elaborate on this. There is a slide in Notes on NUMA Architecture, Intel Software Conference 2014 Brazil stating that in local memory access, the memory request is sent concurrently with the remote node snoop, implying minimal penalty, and yet there is a difference of 13ns?

Be aware that different values are cited by various sources. The figure below is from computerbase.de.

MemoryCacheLatency2

Even if a database or other application had been architected to achieve a high degree of memory locality on a NUMA system, there is still a memory latency penalty on the multi-processor system relative to the single processor system.

Almost zero real-world databases have been architected to achieve memory locality on a NUMA system. I have heard that a handful of Oracle environments went to the effort to re-architect for RAC, which is the same type of architecture necessary to achieve memory locality on NUMA. Most environments should see average memory latency of 114ns on a 2-way Xeon SP system based on a 50/50 local-remote node mix. Average memory latency on a 2-way Xeon E5 v4 system might be 120ns.

Memory Latency and Performance

The main line Intel processor cores manufactured on the original 14nm process can operate with base frequency at the high 3GHz level. Cores manufactured on the 14nm+ process can operate at the low 4GHz level, base. In the high core-count Xeon models, this is typically de-rated to 2-2.5GHz base frequency with turbo boost at about 3GHz.

At 2.5GHz, the CPU clock is 0.4ns. Single processor memory latency of 76ns corresponds to 190 CPU-cycles. Local node memory access in a multi-processor system of 89ns is 222.5 cycles and remote node latency of 139ns is 347.5 cycles. Average 2-way memory access based on 50/50 local-remote node split is 114ns or 285 cycles.
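
A quick Python sketch of the cycle conversion at the assumed 2.5GHz (0.4ns per cycle):

# Memory latency in CPU cycles at 2.5GHz (0.4ns per cycle).
cycle_ns = 0.4

latencies = {"1S": 76, "2S local": 89, "2S remote": 139}
for label, ns in latencies.items():
    print(label, ns / cycle_ns, "cycles")    # 190.0, 222.5, 347.5

avg_2s = (latencies["2S local"] + latencies["2S remote"]) / 2   # 114ns
print("2S average", avg_2s / cycle_ns, "cycles")                # 285.0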

If even a very small percentage of code involves a round-trip memory access to determine the next operation, then most CPU cycles are spent as no-ops. In this case, performance is essentially the inverse of memory latency. A 30% reduction in memory latency corresponds to 1.4X performance and 50% to 2X. See Memory Latency for more on this topic.

The DIMM

The Dual In-line Memory Module has been a standard for many years now. In the figure below, a memory module is shown with 9 DRAM chip packages forming a 72-bit wide word or line.

DIMM_2R

Of this, 64 bits are data and the 8 extra bits are for ECC. There is an extra chip for registered DIMMs. Desktop systems use unbuffered DIMMs and do not have ECC.

A module can be double-sided. When there are chip packages for two separate sets, the DIMM is said to have 2 ranks. In the case above, there is one set on each side of the DIMM, hence dual.

The figure below shows two separate sets on each side, for a total of four sets.

DIMM_4R

In this case, the DIMM is rank 4. The DIMM datasheet will have the actual configuration as there are exceptions. Below is a DIMM probably of rank 4.

RDIMM

Each package can have stacked DRAM die. The image below is probably NAND die as it appears to be stacked 16 high. DRAM packages with stacked die are usually 2 or 4 high? See Samsung’s Colossal 128GB DIMM.

chip_stack

 

DRAM - preliminary

The DRAM chip is an enormously complicated entity for which only some aspects are briefly summarized here. See Onur Mutlu and other sources for more.

Below is a simplified rendering of the Micron 8Gb DDR4 in the 2G x 4 organization. See the Micron datasheet for a more complete and fully connected layout (DDR4 8Gb 2Gx4).

Micron_8Gb

The die is sub-divided into 4 bank groups. Each group has 4 banks for a total of 16 banks. Within a bank, there are rows and columns. The row is a word-line and the column is a bit-line.

The 2G x 4 organization means there are 2G words of 4 bits. The address bus has 21 signals. Two bits are for addressing the bank groups, and two bits are for the banks. The remaining 17 signals are multiplexed for the row and column addresses. All 17 bits are used for the row address (128 x 1024 = 131,072 rows). Only 10 bits are used for the column address. The total address is then 4 + 17 + 10 = 31 bits, sufficient for 2G words.
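
A small Python sketch of the address-bit accounting for the 2G x 4 organization described above:

# Address-bit accounting for the Micron 8Gb DDR4, 2G x 4 organization.
bank_group_bits = 2    # 4 bank groups
bank_bits       = 2    # 4 banks per group
row_bits        = 17   # 128K rows per bank
col_bits        = 10   # 1K column addresses per bank

total_bits = bank_group_bits + bank_bits + row_bits + col_bits   # 31
print(total_bits, 2 ** total_bits)        # 31, 2147483648 = 2G words
print(2 ** row_bits)                      # 131072 rows (128 x 1024)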

Notice that only 7 bits of the column address go into the column decoder. Eight words are fetched from the bank on each access.

The diagrams below are from Bounding Worst-Case DRAM Performance on Multicore Processors, on Oak Central. The original source is "Memory Systems: Cache, DRAM, Disk" by Bruce Jacob et al. (2008). The first image shows "Command and data movement for an individual DRAM device access on a generic DRAM device."

E1EIKI_2013_v7n1_53_f003b

The next figure is "A cycle of a DRAM device access to read data."

E1EIKI_2013_v7n1_53_f004b

The three DDR timing parameters commonly cited are CL, tRCD, and tRP. A fourth number that may be cited after the first three is tRAS. Wikipedia Memory timings has definitions for these terms. "The time to read the first bit of memory from a DRAM with the wrong row open is tRP + tRCD + CL." The tRCD, CL and tRP elements are often identical and in the 13-14ns range. In DDR4, the data burst phase transfers 8 words. For DDR4-2666, the command-address clock is 0.75ns and the data transfer rate is one word every 0.375ns, so 8 transfers take 3ns.
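
A worked Python sketch of that conversion for DDR4-2666 (the 19-clock CL/tRCD/tRP values are an assumption here, a common grade for 2666 ECC RDIMMs):

# DDR4-2666: 1333MHz command/address clock, data transfers at 2666 MT/s.
cmd_clock_ns = 0.75          # per command/address clock
transfer_ns  = 0.375         # per data word (two transfers per clock)

cl = trcd = trp = 19         # clocks; assumed 19-19-19 grade
print(cl * cmd_clock_ns)                     # 14.25ns for each component
print((trp + trcd + cl) * cmd_clock_ns)      # 42.75ns first bit, wrong row open
print(8 * transfer_ns)                       # 3.0ns for the 8-word data burst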

The data restore phase begins shortly after tCAS (CL) and overlaps with tBurst. There is a gap between the column read and the (array) precharge periods. As mentioned earlier, DRAM is very complicated. For database transaction processing and other pointer-chasing code, the parameter of interest is tRC. This value is sometimes not cited, and it cannot be determined from the three commonly cited timing parameters.

Micron SDRAM through DDR4

The table below shows some organization details for selected Micron DRAM products from SDRAM through DDR4. The year column is the year of the datasheet, which might be a revision published after initial product launch.

 
type   Density  Org      banks  Row addr. bits  Col addr. bits  Rows x Columns  Year  MT/s  timing   tRC
SDRAM  256M     64M x4   4      13              11              8K x 2K         1999  133   15ns     60ns
DDR    512M     128M x4  4      13              12              8K x 4K         2000  400   15ns     55ns
DDR2   2G       512M x4  8      15              11              32K x 2K        2006  1066  13.13ns  54ns
DDR3   1G       256M x4  8      14              11              16K x 2K        2008  1866  13.91ns  48ns
DDR3   4G       1G x4    8      16              11              64K x 2K        2017  2133  13.09ns  47ns
DDR4   8G       2G x4    16     17              10 (7)          128K x 1K       2017  3200  13.75ns  46ns

Most of the parameters above are in the DRAM chip datasheet, but tRC is usually found in the DIMM datasheet, and sometimes buried in the back pages. The timing column is the CL-tRCD-tRP amalgamated into a single value and converted from (command/address) clock cycles to nanoseconds.

The CL-tRCD-tRP values decreased only slightly from SDRAM to DDR2. From DDR2 to current DDR4, they have not changed for the registered ECC DIMMs used in server systems.

There are unbuffered DIMM products without the ECC bits, made with specially screened DRAM parts having lower (better) values. These products also have heat sinks for improved cooling, as temperature affects timing. Presumably, it would not be feasible to rely on this approach for server systems.

The 256Mb SDRAM datasheet is listed as 1999. This could be a 250 or 180nm manufacturing process. The 8Gb DDR4 is probably a 2016 or 2017 product, possibly on a 22nm or 2x process. From 256Mb to 8Gb, there are 5 doublings of density. In between 180nm and 22nm are five manufacturing processes: 130, 90, 65, 45 and 32nm.

The manufacturing process used to mean the transistor gate length. However, in the last 15 or so years, it is just an artificial placeholder representing a doubling of transistor density between generations. In the Intel 22 to 14nm process transition, the density increased by 2.7X (Intel's 10nm).

It would seem that one doubling of DRAM density between 256Mb at 180nm and 8Gb at 22nm is missing, but this could be the upcoming 16Gb single die. There is a convention in which the DRAM bit cell size is specified as xF2, where F is the process linear dimension.

DRAM Die Images

In Under the Hood: The race is on for 50-nm DRAM, the Samsung 1Gb DDR2 is mentioned as 50nm-class in 2009, with a 6F2-based cell design; the die image is shown below. In this case, 50nm-class is 58nm.

samsung_dram_1Gb2

Also in Under the Hood: "On the other hand, Hynix's 8F2 cell design showed a 16.5 percent larger cell than Samsung's. It should be noted, that despite the larger cell size, Hynix's 1-Gbit DDR2 SDRAM achieved an impressive chip size of 45.1 mm2, only 2.7 percent larger than Samsung's 1-Gbit DDR2 SDRAM."

Semiconductor Manufacturing & Design has a die image of the Samsung 1Gb DDR3 in How to Get 5 Gbps Out of a Samsung Graphics DRAM, shown below.

DDR3_Samsung1Gb_DDR3

Embedded has a discussion on Micron SDRAM and DDR in Analysis: Micron cell innovation changes DRAM economics.

DRAM Latency

Two items are mentioned as having a significant impact on DRAM latency. One is the bank size. Another is the multiplexing of row and column addresses. In principle, it is a simple matter to sub-divide the DRAM chip into more banks, as the design of each bank and its control logic is the same. This would, however, increase the die size, as there would be more control logic associated with each bank. Earlier we said that DRAM manufacturers understood that the market prioritized cost over latency.

In the 256Mbit SDRAM, the bank size is 8Kx2K = 16M words, with 4 banks at x4 word size. (I am using the term word to mean the data path out. This could apply to a DRAM chip at 4, 8 or 16 bits, and a DIMM at 64 or 72 bits.) For the 8Gbit DDR4, the bank size is 128Kx1K = 128M words, with 16 banks at x4 word size.

There is some improvement in latency between the manufacturing process for SDRAM in 1999 and DDR4 in 2017. Long ago, the expectation was a 30% reduction in transistor switching time per generation, corresponding to a 40% increase in frequency for logic. From about the 90nm process forwards, transistor performance improved at a much diminished rate.

In DRAM, there are both the logic sub-units made of transistors and the bit cell array made of transistors and capacitors. I do not recall mention of capacitor performance versus process generation. Presumably it is minimal? Still, this allowed the main timing parameters to remain at 13-14ns each as the bank size increased from 16M to 128M words at 4-bit data width.

A single bank can only transfer one 8-word burst every tRC. This takes 4 clocks, as DDR transfers 2 data words every clock. The tRC is somewhat longer than tRCD + CL + tRP. For DDR4-2666, these components are 19 clocks each for a total of 57 clocks at 0.75ns per clock. The cited tRC at 3200 MT/s is 45.75ns and at 2933 MT/s is 46.32ns. Presumably then tRC at 2666 MT/s is either 61 clocks for 45.75ns or 62 clocks for 46.5ns. (61 clocks for tRC is 57 + 4?)
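
The clock accounting as a Python sketch (the 61- and 62-clock tRC candidates are inferred above, not taken from a datasheet):

# DDR4-2666, 0.75ns per command/address clock.
clk_ns = 0.75
trcd = cl = trp = 19            # clocks
burst_clocks = 4                # 8 words at 2 per clock

print((trcd + cl + trp) * clk_ns)    # 42.75ns for the three components
print(61 * clk_ns, 62 * clk_ns)      # 45.75ns or 46.5ns candidate tRC values
print(round(burst_clocks / 61, 3))   # 0.066 -- a bank transfers data ~4 of 61 clocks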

A single bank can transfer data for only 4 out of 61 or 62 clock cycles. DDR4 currently has 16 banks. So, it would seem that sustained memory bandwidth was important and warranted an increase in the number of banks. But latency was not deemed sufficiently important to further increase the number of banks.

Reduced Latency DRAM

Micron has an RLDRAM 3 product at up to 1.125Gb. One organization is 64M x 18 with 16 banks. The datasheet is 2015, so this might be on the same manufacturing process as the 2Gb DDR3. The RLDRAM package is 13.5mm x 13.5mm versus 8mm x 14mm for the 2Gb in x16 form. This might be due to the difference in pins, 168 versus 96. The two products might have the same die size, or the 1.125Gb RLDRAM could have a larger die size than the 2Gb DDR3, and they may or may not be on a common manufacturing process.

The 1.125Gb RL-DRAM has 16 banks. Each bank is 16K rows by 256 columns = 4M words, or 72M bits.

The appropriate comparison for this is the 2Gbit DDR3 in 128M x 16 organization. The comparable DDR3 chip has 8 banks. Each bank is 16K x 1K = 16M words at 16 bits per word = 256M bits. So the RL-DRAM bank is 4 times smaller in words and roughly 3.55 times smaller in bits per bank.
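
The bank-size arithmetic, for reference, as a Python sketch (organizations as given above):

# RLDRAM3 1.125Gb, 64M x 18: 16 banks, each 16K rows x 256 columns.
rl_bank_words = 16 * 1024 * 256          # 4M words
rl_bank_bits  = rl_bank_words * 18       # 72M bits

# DDR3 2Gb, 128M x 16: 8 banks, each 16K rows x 1K columns.
ddr3_bank_words = 16 * 1024 * 1024       # 16M words
ddr3_bank_bits  = ddr3_bank_words * 16   # 256M bits

print(ddr3_bank_words / rl_bank_words)            # 4.0x smaller in words
print(round(ddr3_bank_bits / rl_bank_bits, 2))    # 3.56x smaller in bits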

The second aspect of RLDRAM is that the row and column addresses are not multiplexed. For the 64M x 18 organization, there are 25 address bits, even though 26 bits are required to address 64M words. There are 4 bits for the bank address. The row address is 14 bits. The column address is 7 bits, but only 5 bits go into the column decoder. The bank is shown as 16K x 32 x 8 (x18 data). The 5 bits going into the decoder correspond to 32 columns. I am presuming that the address has 2-word granularity?

Between the smaller bank size and the non-multiplexed address, Micron cites their RL-DRAM as having a tRC minimum value of 6.67ns (8 cycles at 0.833ns per clock, with a 2400 MT/s data rate). But why only cite the minimum value? We are mostly interested in the average value and possibly the min-max range, excluding refresh, which all DRAM must have.

My guess would be that by having extra banks, the tRP period can be hidden if accesses are randomly distributed between banks. If so, then the smaller banks reduce the 13-14ns timings to 6.67ns?

It is presumed that both the smaller bank size and the non-multiplexed address contribute to significantly lower latency. Some details on how much each aspect contributes would be helpful.

As we are accustomed to multi-GHz processors, it might seem strange that the multiplexed row and column address has much of a penalty. The tRCD and CAS latency are each 13-14ns in mainstream DRAM. In this regard, we should recall that processors have been at 10 pipeline stages since Pentium Pro in 1995. The post-Pentium 4 Intel processors are probably 14-16 pipeline stages, though Intel no longer shows the number of stages in their micro-architecture diagrams.

In this regard, the full start-to-finish time to execute an instruction is on the order of 5ns. Then factor in that DRAM has a very different manufacturing process than logic, one that is not optimized for performance. It is presumed that the logic on DRAM is not pipelined except for the data transfer sub-units? (The DDR logic core runs at 1/8th of the data transfer rate, or 1/4 of the command/address clock. On DDR4-2666, the core runs at 333MHz.)

Low Latency DRAM

As mentioned earlier, we are interested in two separate approaches to low latency DRAM for server systems handling transaction processing type workloads. The near-term approach is to retain the existing mainstream DDR4 interface, and DDR5 when appropriate. Ideally, the new special memory would be compatible with existing or then-current processors designed for conventional memory. But some divergence may be necessary and is acceptable.

This would preclude a non-multiplexed address bus. The number of banks would be increased from 16 in the 8Gb DDR4 presumably to 64 or more. The memory controller would have to know that there are more bits for bank addressing, which is why this new memory may or may not work (at full capability?) in existing processors. (The memory controller tries to schedule requests so that accesses do not go to a given bank within a tRC interval.)

But it is possible the next generation processor could work with both conventional and the new high bank count DRAM memory. (Before conventional memory was made with multi-banks, there was a Multibank DRAM product from Mosys.)

The low latency DRAM die would be larger than a similar density conventional DRAM chip. Either the new memory would have lower DIMM capacity options, or the DIMM form factor would have to allow for a taller module. A 2X reduction in capacity would not be a serious issue. If the difference in areal size were 4X, then the larger form factor would be preferable.

In the longer term strategy, more avenues for latency reduction are desired. The RL-DRAM approach of non-multiplexed row and column addresses is the right choice. Somewhere it is mentioned that the RL-DRAM interface is not very different from SRAM. This could be an additional option, which I looked at in SRAM as Main Memory.

Summary

We need to wake up to the fact that scaling performance with multi-processor systems should no longer be the standard default approach. Most situations would benefit greatly from the single processor system. On a hardware-cost-only evaluation, the cost-performance value could go either way. Two 20-core processors cost less than one 28-core, but one 24-core is about the same as two 16-core processors. When software per-core licensing is factored in, the single processor system advantage is huge. Once we accept the single processor system approach, it is easy to then realize that new low latency memory provides huge additional value.

This needs to be communicated to both DRAM and processor manufacturers. A cost multiplier of 2 or even 4X is well worth it if system level memory latency could be reduced by 20ns. A near-term future product with a backward compatibility option would mitigate risk. The longer term approach is a clean break from the past to do what is right for the future.

Addendum

There is manufacturing variation in DRAM latency. Most mainstream DRAM is set at a conservative rating for high yield. Micron does have parts with slightly better latency, usually denoted with an E in the speed grade. Some companies (e.g. G.Skill) offer memory made with specially screened parts. One product is DDR4-4266 (2133MHz clock) at 17-18-18 timings. CL 17 at 2133MHz is 7.97ns. tRCD and tRP at 18 are 8.44ns. This strategy works for extreme gaming, allowing a few deep-pocket players to get an advantage. For servers, it is probably worthwhile to offer two or three binned parts. If two, then perhaps the top one-third as the premium and the rest as ordinary.
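
The timing conversion for that screened part, as a Python sketch (values as cited above):

# DDR4-4266: 2133MHz command/address clock.
clk_ns = 1 / 2.133            # ~0.469ns per clock
cl, trcd, trp = 17, 18, 18    # screened gaming part timings

print(round(cl * clk_ns, 2), round(trcd * clk_ns, 2), round(trp * clk_ns, 2))
# 7.97 8.44 8.44 -- versus ~14.25ns on standard server DIMMs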

The other factor that affects latency is temperature. Lower temperature allows the charge state of the capacitor to be measured more quickly. Many DRAM parts list a 64ms, 8192-cycle refresh for up to 85C, and a 32ms refresh for >85C to 95C. Onur's slides also say that lower temperature allows lower timing values. This is one reason why the special DIMMs for gaming have a heat sink. So, a question is whether we want this or even more elaborate cooling for our low latency strategy. Perhaps enhanced cooling should be used for the specially binned parts, and normal cooling for the bulk parts.

Also see The Memory Guy:
Super-Cooled DRAM for Big Power Savings,
Is Intel Adding Yet Another Memory Layer?,
A 1T SRAM? Sounds Too Good to be True!, Zeno Semi, A 1T (or 2T) SRAM Bit Cell.