
XeonSP2

Optane Persistent Memory Assessment

Intel's view of the world is that servers should be configured with a moderate amount of DRAM, manufactured by various other vendors, and a large amount of persistent memory (PMEM). The 2nd generation Xeon Scalable Processors support Intel's Optane DC persistent memory. From the database perspective, including Microsoft SQL Server, it is argued that a hybrid memory configuration is viable for environments in which the highly active dataset is greater than the system (volatile) memory capacity of an all-DRAM configuration, but less than the combined capacity of DRAM plus PMEM. This is based on the characteristics of DRAM, PMEM and solid-state storage. DDR4 DRAM modules are available up to 128GB and Intel Optane memory modules up to 512GB. With 12 DIMM slots per socket, the viability range is then between 1.5TB (12 × 128GB DRAM) and 3.75TB (6 × 128GB DRAM plus 6 × 512GB PMEM) per Xeon SP socket.

This window is rather too narrow to support a significant market. Is this then the final word? For the current generation Xeon SP, probably so. But it is possible that a different architecture could be suited to hybrid memory over a very broad range of requirements. The limitation of the Xeon SP is that it is designed for 12 DIMM slots over 6 memory channels.

Prior to the Xeon SP, the Intel Xeon E7 processors employed a memory expander in the form of the Scalable Memory Buffer. The Xeon E7 connects twice as many DIMM slots per socket over just 4 memory interfaces. In a modified form, a similar device could greatly broaden the range of viability for PMEM.

PMEM Viable Range Basis

The basis for the hybrid memory assessment revolves around the characteristics of DRAM, PMEM, and solid-state storage as explained in Persistent Memory and Hybrid Buffer Pool - The good, bad and ugly. Full path round-trip latency to local DRAM is around 90ns in a 2-socket system. Random access latency to Intel Optane persistent memory is around 360ns, or about 4X longer.

The best storage devices, such as Intel Optane SSDs, the Samsung 983 ZET or other small page NAND, have latency somewhat over 10µsec, or roughly 28 times higher than PMEM. Note that a storage read IO loads an entire page into memory, so the IO and memory latency terms are not directly comparable entities. Storage based on mass market NAND, characterized by large page size and three or more bits per cell, typically has latency around 100µsec.
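The ratios between the tiers follow directly from the figures quoted above. The short sketch below just does that arithmetic; the dictionary keys and the `ratio` helper are illustrative names, and the storage numbers are rounded lower bounds ("somewhat over 10µsec").

```python
# Approximate latencies quoted in the text, in nanoseconds.
LATENCY_NS = {
    "dram": 90,                   # local DRAM, full path round trip, 2-socket system
    "pmem": 360,                  # Optane persistent memory, random access
    "fast_ssd": 10_000,           # Optane SSD or small page NAND (lower bound)
    "mass_market_nand": 100_000,  # large page, three or more bits per cell
}


def ratio(slower: str, faster: str) -> float:
    """Return how many times longer the slower tier takes than the faster one."""
    return LATENCY_NS[slower] / LATENCY_NS[faster]


if __name__ == "__main__":
    for slower, faster in [
        ("pmem", "dram"),
        ("fast_ssd", "pmem"),
        ("fast_ssd", "dram"),
        ("mass_market_nand", "pmem"),
    ]:
        print(f"{slower:>17} vs {faster:<5}: {ratio(slower, faster):7.1f}x")
```

Keep in mind that one storage IO brings in a whole page while one memory access brings in a cache line, so these ratios are only a rough guide.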

The baseline system is DRAM plus solid-state storage. System performance per processor core on a workload characterized by pointer-chasing code is a blend of memory latency and IO latency if the active dataset is larger than memory.

XeonSP1

In a hybrid memory configuration, half of the DRAM slots are given up for PMEM modules of much larger capacity.

XeonSP2

The presumption is that the increase in latency for memory accesses that previously went to DRAM in the baseline system but go to PMEM in the hybrid configuration is more than offset by the reduction in IO. Both net total latency and the extra CPU overhead for storage system IO should be considered. There are situations in which hybrid memory could have a negative impact. An example is when the reduction in IO does not offset the higher latency of PMEM accesses.
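The trade-off can be put into numbers with a very simple weighted-average model. This is only a sketch: it treats one storage IO as a single access with a fixed cost, which glosses over the fact that an IO loads a whole page, and the access fractions in the example are made up for illustration, not measured. The function name and the `io_overhead_ns` parameter are hypothetical.

```python
DRAM_NS = 90     # from the text
PMEM_NS = 360    # from the text
IO_NS = 10_000   # best-case SSD latency from the text


def average_latency_ns(frac_dram: float, frac_pmem: float, frac_io: float,
                       io_overhead_ns: float = 0.0) -> float:
    """Weighted average cost per access for a given mix of DRAM, PMEM and IO.

    io_overhead_ns is an optional allowance for the CPU cost of issuing
    and completing a storage IO (an assumption, default zero).
    """
    total = frac_dram + frac_pmem + frac_io
    if abs(total - 1.0) > 1e-9:
        raise ValueError("access fractions must sum to 1")
    return (frac_dram * DRAM_NS
            + frac_pmem * PMEM_NS
            + frac_io * (IO_NS + io_overhead_ns))


if __name__ == "__main__":
    # Hypothetical case 1: hybrid removes most of the IO.
    baseline = average_latency_ns(0.97, 0.0, 0.03)
    hybrid = average_latency_ns(0.90, 0.095, 0.005)
    print(f"case 1: baseline {baseline:.1f} ns, hybrid {hybrid:.1f} ns")

    # Hypothetical case 2: little IO to begin with, so PMEM latency dominates.
    baseline = average_latency_ns(0.995, 0.0, 0.005)
    hybrid = average_latency_ns(0.80, 0.199, 0.001)
    print(f"case 2: baseline {baseline:.1f} ns, hybrid {hybrid:.1f} ns")
```

In the first made-up mix the hybrid configuration comes out well ahead; in the second it is slower than the baseline, which is the negative-impact situation described above.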

There are real-world workloads that fit into this bracket of conditions, but each use needs to be determined on a case-by-case basis. This could change with time as the dataset grows or as the relative cost and capacity of DRAM and PMEM evolve. The main issue is that there may not be sufficiently broad suitability to justify a hybrid memory configuration as one of the standard systems in a large environment.

Processor and System Architecture

The narrow range of viability for PMEM stems largely from the Xeon SP architecture having six memory channels accommodating 2 DIMMs per channel for a total of 12 DIMM slots in each processor socket. In this, we want to have one DRAM DIMM per channel for maximum round-trip memory access performance (or raw bandwidth). This leaves room for 6 PMEM modules. The current upper limit is 6 × 128GB DRAM + 6 × 512GB PMEM for a configuration of 768GB DRAM and 3TB Optane.
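The capacity limits are simple slot arithmetic. The sketch below reproduces them, assuming one DRAM DIMM per channel in the hybrid case and the largest modules mentioned in the text; the function and key names are only illustrative.

```python
GB_PER_TB = 1024


def socket_capacity_gb(channels: int = 6, dimms_per_channel: int = 2,
                       dram_dimm_gb: int = 128, pmem_dimm_gb: int = 512) -> dict:
    """Per-socket capacity for an all-DRAM and a hybrid DRAM + PMEM layout."""
    slots = channels * dimms_per_channel
    all_dram = slots * dram_dimm_gb
    # Hybrid: one DRAM DIMM on every channel, PMEM in the remaining slots.
    hybrid_dram = channels * dram_dimm_gb
    hybrid_pmem = (slots - channels) * pmem_dimm_gb
    return {
        "slots": slots,
        "all_dram_gb": all_dram,
        "hybrid_dram_gb": hybrid_dram,
        "hybrid_pmem_gb": hybrid_pmem,
        "hybrid_total_gb": hybrid_dram + hybrid_pmem,
    }


if __name__ == "__main__":
    cap = socket_capacity_gb()
    print(f"DIMM slots per socket: {cap['slots']}")
    print(f"all DRAM:     {cap['all_dram_gb'] / GB_PER_TB:.2f} TB")
    print(f"hybrid DRAM:  {cap['hybrid_dram_gb']} GB")
    print(f"hybrid PMEM:  {cap['hybrid_pmem_gb'] / GB_PER_TB:.2f} TB")
    print(f"hybrid total: {cap['hybrid_total_gb'] / GB_PER_TB:.2f} TB")
```

The all-DRAM and hybrid totals are the two ends of the per-socket viability window. Changing `channels` lets the same arithmetic be rerun for other channel counts.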

Obviously, a processor/system architecture having more DIMM slots (both in total and per socket) would give more flexibility for hybrid memory configurations. Before the Xeon Scalable Processor, the Xeon E7 generations from 2011 to 2016 had four memory interfaces, each connected to a memory expander in the form of the Scalable Memory Buffer (SMB).

XeonE7

On the downstream side of each SMB, there are two memory channels with 3 DIMMs per channel (DPC). The contemporary Xeon E5 processors interfaced directly to DRAM without the SMB.

Presumably, Intel had feedback from customers that the extra-large memory capacity of the Xeon E7 processors enabled by the SMB would no longer be required for a large majority of workloads in the post-2016 timeframe.

Furthermore, neither would the 3 DPC configuration. It is possible to have 2 DPC at a fairly high transfer rate. The 3 DPC requires stepping down the memory channel data transfer rate.

There must have been strong evidence that the DRAM memory capacity was more than large enough at 12 DIMM slots per processor socket to justify changing the memory architecture between the Xeon E7 v4 and the first generation Xeon Scalable Processor.

But that was before Persistent Memory in the form of Optane attaching to the memory interface was a factor. With PMEM as a factor, we would now like to have an extra large number of DIMM slots. However, there are some caveats.

One reason to give up the SMB was to avoid adding latency to memory accesses. It is highly desirable for DRAM memory latency to be as low as possible. Many people fail to realize that a significantly lower latency is important enough to justify both substantially higher cost and lower capacity. Instead, it is common practice to regurgitate the now obsolete notion that DRAM should be optimized for cost. In any case, we do not want the SMB on the channels to DRAM.

On a processor with 6 memory interfaces, it might be tolerable to have 5 channels connecting directly to DRAM, and the 6th interface connecting to an SMB device having multiple channels of PMEM on the downstream side. Of course, there is no reason the SMB should only have 2 channels as in the Xeon E7 generations. At 4 channels, we could have 12 DIMM slots per SMB with 3 per downstream channel.

  SMB4

In this case, with the SMB downstream side having only PMEM modules, the lower data transfer rate of 3 DPC is not a liability. Nor is the extra latency added by the SMB between the processor and PMEM. There are about 150 or so pins for each memory channel. An SMB with 4 downstream channels would need pins for a total of 5 channels, including the upstream. Given that we have only one or two SMBs in our system, and that this is an optional feature, cost is not a big factor. For that matter, 6 downstream channels should also be possible.
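To get a feel for what this arrangement buys, the sketch below counts slots for the Xeon E7 style SMB and for the 4-channel SMB suggested above. The capacity figures are assumptions layered on top of the text: one 128GB DRAM DIMM on each of the 5 direct channels and 512GB Optane modules in every SMB slot. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expander:
    """A memory expander (SMB) described by its downstream side."""
    downstream_channels: int
    dimms_per_channel: int

    @property
    def slots(self) -> int:
        return self.downstream_channels * self.dimms_per_channel


XEON_E7_SMB = Expander(downstream_channels=2, dimms_per_channel=3)
PROPOSED_SMB = Expander(downstream_channels=4, dimms_per_channel=3)


def e7_slots_per_socket(memory_interfaces: int = 4) -> int:
    """DIMM slots per Xeon E7 socket, one SMB per memory interface."""
    return memory_interfaces * XEON_E7_SMB.slots


def proposed_socket(direct_dram_channels: int = 5, dram_dimm_gb: int = 128,
                    pmem_dimm_gb: int = 512) -> dict:
    """5 direct DRAM channels plus one interface feeding a PMEM-only SMB.

    Assumes one DRAM DIMM per direct channel and current module sizes.
    """
    return {
        "dram_gb": direct_dram_channels * dram_dimm_gb,
        "pmem_slots": PROPOSED_SMB.slots,
        "pmem_gb": PROPOSED_SMB.slots * pmem_dimm_gb,
    }


if __name__ == "__main__":
    print(f"Xeon E7 slots per socket: {e7_slots_per_socket()}")
    cfg = proposed_socket()
    print(f"proposed: {cfg['dram_gb']} GB DRAM, "
          f"{cfg['pmem_slots']} PMEM slots, {cfg['pmem_gb'] / 1024:.0f} TB PMEM")
```

Under these assumptions a single SMB doubles the PMEM capacity of the current 6 + 6 layout while giving up only one of the six DRAM channels.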

It has been reported that the next generation Xeon SP will have 8 memory channels. Even in that case, the 8 DRAM plus 8 PMEM per socket is still limited in flexibility. Some mix of processor channels connecting directly to DRAM and via SMB to PMEM is the better strategy.

Persistent Memory on UPI

SMBonUPI

There are more options that could be considered. An SMB device could be made to attach to one of the UPI (formerly QPI) interfaces. Either this would be a single socket system, or possibly a 2-socket system in which only 2 UPI links are used to connect the two processor sockets.

The Xeon SP is a die with cores, UPI, PCI-E, and memory controllers. Below is a representation of the Xeon SP LCC die layout.

SPLCC

While silicon is more complicated than cut and paste, the UPI-attached SMB would be a die with just the UPI and memory controllers. The existing memory controllers already provide 6 channels, here in a 3 DPC configuration.

UPISMB3

But it is just as likely that Intel would simply employ the Xeon SP LCC die with the cores disabled (or hidden) instead of going to the effort of making a separate die.

We could reduce the number of PCI-E lanes to support some interface, memory or UPI, for connecting to a large number of PMEM slots via an expander device. Since we are moving a very large portion of the highly active dataset from storage to PMEM, we do not need anywhere near as much bandwidth to PCI-E as before.

Optane/Persistent Memory Summary

In summary, the existing use case for Optane as persistent memory with the 2nd generation Xeon SP is limited. Furthermore, each use must be evaluated on a case-by-case basis. If there are insufficient use cases, then adopting the hybrid memory configuration as one of the standard systems is problematic and it becomes an exception case. However, there are options for future processor/system architectures that could greatly expand the usefulness of a hybrid memory configuration.

 

Access Frequency vs. Incremental Memory

In general, we expect the frequency of memory access versus incremental memory size to be something like the following:

MemoryAccess1

The first blocks of memory are absolutely essential, as they hold key data structures accessed by almost all operations. The next range is hot data, frequently accessed. As memory size increases, the data buffered is of progressively decreasing importance.

At least, this is how memory should be allocated. So long as there is sufficient memory, the remainder can be handled with disk IO.

In the scenario with Optane, we give up half of the DIMM slots populated with DRAM for Optane modules. The Optane modules have 4× the capacity of current DDR4 modules.

MemoryAccess1

The justification for Optane then rests on the following:
The data that occupied the half of DRAM capacity given up now resides in PMEM. Additional data that used to incur IO is now brought into PMEM.

The benefit of bringing the formerly storage-resident data into PMEM must outweigh the higher latency of PMEM compared to DRAM for the portion of data that had to be moved from DRAM to PMEM.
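How this plays out depends on the shape of the access frequency curve. The sketch below assumes, purely for illustration, that access density falls off exponentially with distance into the dataset (hottest data first); the real curve for any workload has to be measured. The `scale_gb` parameter, the function names, and the dataset sizes are hypothetical. The latencies and capacities are the ones used earlier in the text, and one IO is again treated as one access.

```python
import math

DRAM_NS, PMEM_NS, IO_NS = 90, 360, 10_000


def cumulative_share(x_gb: float, dataset_gb: float, scale_gb: float) -> float:
    """Fraction of all accesses that hit the hottest x_gb of the dataset.

    Assumes access density proportional to exp(-x / scale_gb), normalized
    over the whole dataset (an illustrative shape, not a measured one).
    """
    x = min(max(x_gb, 0.0), dataset_gb)
    return (1.0 - math.exp(-x / scale_gb)) / (1.0 - math.exp(-dataset_gb / scale_gb))


def average_latency_ns(dram_gb: float, pmem_gb: float,
                       dataset_gb: float, scale_gb: float) -> float:
    """Average access cost with the hottest data in DRAM, then PMEM, then storage."""
    in_dram = cumulative_share(dram_gb, dataset_gb, scale_gb)
    in_memory = cumulative_share(dram_gb + pmem_gb, dataset_gb, scale_gb)
    in_pmem = in_memory - in_dram
    in_storage = 1.0 - in_memory
    return in_dram * DRAM_NS + in_pmem * PMEM_NS + in_storage * IO_NS


def compare(dataset_gb: float, scale_gb: float) -> tuple:
    """Return (baseline, hybrid) average latency for one Xeon SP socket."""
    baseline = average_latency_ns(1536, 0, dataset_gb, scale_gb)   # 12 x 128GB DRAM
    hybrid = average_latency_ns(768, 3072, dataset_gb, scale_gb)   # 6 x 128GB + 6 x 512GB
    return baseline, hybrid


if __name__ == "__main__":
    for dataset_gb in (1400, 3500, 8000):
        baseline, hybrid = compare(dataset_gb, scale_gb=1000)
        print(f"dataset {dataset_gb:>5} GB: baseline {baseline:8.1f} ns, "
              f"hybrid {hybrid:8.1f} ns")
```

With a dataset that fits in the all-DRAM configuration, the hybrid layout can only lose; once the dataset exceeds DRAM capacity, the outcome depends on how quickly the curve falls off.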

 

 Too Much Memory,

Addendum

Elsewhere, I have argued that the cost of DRAM memory was inconsequential, and that was last year when it was $16/GB. It is now less than $6/GB. For database transaction processing, performance per core is king, and that criterion trumps almost every other criterion.

Performance per core or thread in traditional databases is achieved with low latency memory on a single die processor. No multi-socket NUMA complexity. For this, we can give up capacity and afford memory at several times the cost of DRAM. Something like eDRAM or Reduced-Latency (RL-)DRAM without multiplexed row/column addressing would be desirable.

Even better is an operating system that can handle multi-tier memory sub-systems. Fast memory should attach directly to the processor (if not on the processor die?). The option of having some combination of conventional DRAM and Optane attach via an SMB would complete the solution.

The Intel C112/C114 Scalable Memory Buffer package has 873 pins, not all used. In addition to DDR3/4 signals, there are many power and ground pins. Presumably an SMB with 4 downstream channels would have about 1500 pins, and one with 6 channels might have just over 2000 pins?
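One way to arrive at figures in this neighborhood is to scale the existing package linearly with the number of channels. The sketch below does that, counting the upstream link as one channel-equivalent alongside the 2 downstream channels of the existing SMB. That scaling rule is an assumption made here for illustration, as are the function names; real packages do not scale this neatly. The 150 pins per channel figure is the rough signal pin count given earlier.

```python
EXISTING_SMB_PINS = 873      # C112/C114 package pins quoted in the text
EXISTING_SMB_CHANNELS = 3    # assumed: 1 upstream + 2 downstream


def estimated_package_pins(downstream_channels: int) -> int:
    """Scale total package pins linearly with channel count (upstream included)."""
    pins_per_channel = EXISTING_SMB_PINS / EXISTING_SMB_CHANNELS
    return round(pins_per_channel * (downstream_channels + 1))


def signal_pins(downstream_channels: int, pins_per_channel: int = 150) -> int:
    """Memory channel signal pins only, excluding power and ground."""
    return pins_per_channel * (downstream_channels + 1)


if __name__ == "__main__":
    for n in (2, 4, 6):
        print(f"{n} downstream channels: about {estimated_package_pins(n)} package pins, "
              f"{signal_pins(n)} channel signal pins")
```

The gap between the two estimates is the allowance for power, ground, and unused pins.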

Original 3DXPoint

Additional References

SNIA Persistent Memory Summit?
Converging Memory and Storage Persistent Frank Hady