
RISC vs. CISC

RISC vs. CISC - Modern Analogy (2018-08)

Both can achieve a significant reduction in memory latency for a single-socket system, but will probably come with lower memory capacity. This should not be a problem now that all-flash is the more practical operational storage choice after factoring in space and power consumption.

Origins of RISC
Intel Pentium Pro
Multi-Processor Systems - 1990's
Multi-Core Processors - 2000's, Paxville to Dunnington

https://en.wikichip.org/wiki/intel/microarchitectures/haswell_(client)

https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)

The early 2-way systems were really just client systems with 2 processor sockets. In the 2000's, 2-way systems began to feature server-oriented memory and IO capability. With the advent of quad-core processors, many people realized 4-way was overkill for medium database workloads.

The Intel 45nm architecture with codename Nehalem finally integrated the memory controller. The Nehalem codename with an EP suffix was a quad-core in the Xeon 5500 series for 2-way systems in 2009. The EX suffix applied to an 8-core die as the Xeon 7500 series for 4S and 8S systems in 2010 with expanded memory capacity.

The 32nm Westmere EP 6-core followed as the Xeon 5600 series for 2S in 2010 and the EX 10-core for 4S+ as the Xeon E7 series in 2011, also with expanded memory capacity.

In 2012, 32nm 8-core Sandy Bridge became Xeon E5, for both 2S and 4S with normal memory capacity.

From 2013/14 on, Intel had 3 die options for their server processors. In 22nm Ivy Bridge, these were 6, 10 and 15 cores: the 2S/4S parts with normal memory capacity in the Xeon E5 v2 line, and the 4S+ parts with expanded memory capacity in the Xeon E7 v2 line.

Ivy Bridge HCC die layout, 15 cores, 541mm2, 25.38 × 21.32 mm, 2013-14.

In 2014/15, the 22nm Haswell die options were 8, 12 and 18 cores in the Xeon E5/7 v3 product lines.

Haswell HCC die layout, 18 cores, 661mm2, 31.81 × 20.78 mm, 2014-15.


Not only does it inherently have non-uniform memory access (NUMA), but even the complexity of the interconnect rings or mesh makes the on-die L3 cache check a 17-18.5ns affair.

Twenty years ago, people were thrilled that a 2-way multi-processor could be purchased for only a nominally higher price than the single-processor system. Regardless of whether a specific situation would benefit from a second processor, ready access to MP systems facilitated multi-threaded software development.

Today, there are multi-core processors with very many cores, up to 28 in the Xeon SP. The questions are:
  1) Is there a specific need for multiple processors (sockets)?
  2) Is there a disadvantage in having multiple sockets?

Obviously, with the processor having integrated memory controllers, the 2S system has twice the nominal memory bandwidth and capacity of a 1S system.

On the first, many server applications do not generate much memory bandwidth demand.
On the second, many people follow the old rule: more memory is better.

The purpose of memory in database servers was to bring IO down to a manageable level. Memory capacity has long since surpassed requirements, but many people are obsessed with bringing disk IO down to noise levels. More recently, all-flash has become the more practical choice for many storage uses, with exceptions like data archiving.

Today, it is still possible to have a fairly low-cost 2S system. The 8-core Xeon SP 4108 is $417, or $52.1 per core. On the other hand, the 16-core Xeon 6130 is $1900, or $118.75 per core. The 28-core 8176 is $8,790, or $314 per core.
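As a quick sanity check of the per-core arithmetic, a throwaway sketch using the list prices quoted above (not a pricing reference):

    /* Cost per core for the Xeon SP SKUs cited in the text (quoted list prices). */
    #include <stdio.h>

    int main(void)
    {
        struct { const char *sku; int cores; double price; } skus[] = {
            { "Xeon Silver 4108",    8,  417.0 },
            { "Xeon Gold 6130",     16, 1900.0 },
            { "Xeon Platinum 8176", 28, 8790.0 },
        };
        for (int i = 0; i < 3; i++)
            printf("%-20s %2d cores  $%7.0f  $%6.1f per core\n",
                   skus[i].sku, skus[i].cores, skus[i].price,
                   skus[i].price / skus[i].cores);
        return 0;
    }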

So it would seem that there is a purpose for 2S systems in enabling lower cost per core in some circumstances.
Except that cores are not equivalent between 1S and 2S systems.

The 2 Processor/Socket (2S) System

The modern multi-processor system with multi-core processors is a complex entity. The figure below shows 8 cores on a single ring. The higher core-count processors since Xeon E5/7 v2 (Ivy Bridge) actually have two sets of rings, and Xeon SP (Skylake) has a mesh.

    MemLat_2S

The full path memory latency is about 90ns for local node and 140ns for remote node in an MP server system, and the important server systems are typically MP.

In principle, it would be a good idea for software to be aware of the underlying system architecture. On NUMA systems, the application should try to achieve a high degree of memory locality. Threads running on cores in one socket should preferentially work with data in the local memory node. For this to happen in a database environment, the application and database must be architected with NUMA in mind, and have a common strategy.
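For illustration, a minimal sketch of what NUMA awareness looks like at the code level, assuming Linux with libnuma (link with -lnuma); the node number and working-set size are arbitrary placeholders. A worker thread is restricted to one socket and its working memory is allocated from that socket's local node:

    /* Minimal NUMA-locality sketch: run on a chosen node and allocate from it.
       Assumes Linux + libnuma; the node and size below are illustrative only. */
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA API not available\n");
            return 1;
        }
        int node = 0;                          /* illustrative target node */
        size_t bytes = 256UL * 1024 * 1024;    /* illustrative 256 MB working set */

        numa_run_on_node(node);                /* keep this thread on the node's cores */
        char *buf = numa_alloc_onnode(bytes, node);  /* memory backed by the same node */
        if (!buf) {
            fprintf(stderr, "allocation failed\n");
            return 1;
        }
        buf[0] = 1;                            /* touch so pages fault in locally */
        numa_free(buf, bytes);
        return 0;
    }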

In practice, almost everyone has already built their important business systems without consideration for NUMA optimization, and almost no one is interested in re-architecting for NUMA.

Present Day Schwerpunkt

The weak point in modern computer systems is the huge disparity between the processor clock cycle, typically 0.2-0.4 ns (corresponding to 5 and 2.5GHz respectively), and the roundtrip memory access latency, on the order of 100ns.

Of this, perhaps 46ns occurs at the DDR4 SDRAM interface. This is the full tRC value. If only the tRCD and tCAS components are involved, then it might be 31.5ns (14.25ns for each of tRCD and tCAS, plus 8 data transfers at 2666MT/s).
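The component arithmetic, as a sketch using the DDR4-2666 timings quoted in this article:

    /* Back-of-the-envelope DDR4-2666 latency components as quoted in the text. */
    #include <stdio.h>

    int main(void)
    {
        double trcd  = 14.25;            /* ns: 19 clocks at 0.75ns (CL19) */
        double tcas  = 14.25;            /* ns */
        double burst = 8.0 / 2.666;      /* ns: 8 data transfers at 2666 MT/s, ~3ns */
        double trc   = 46.16;            /* ns: full row cycle as quoted */

        printf("tRCD + tCAS + burst = %.1f ns\n", trcd + tcas + burst);   /* ~31.5 ns */
        printf("full tRC            = %.2f ns\n", trc);
        return 0;
    }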

On a 2S system, for an application without NUMA optimization, the expectation is a 50/50 split between local and remote node accesses. A memory round trip of 115ns is 287.5 cycles of a 2.5GHz core, and the Intel Skylake core is really designed to run at 4GHz+. If even 3% of instructions involve a serialized memory access outside of cache, in which that memory access must complete before the next step can be determined, then 90% of CPU cycles are dead cycles.
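A rough model of that claim, assuming one cycle per instruction when not stalled (a generous assumption) and the 115ns / 2.5GHz figures above:

    /* Rough dead-cycle model: 3% of instructions stall on a 115ns memory round trip. */
    #include <stdio.h>

    int main(void)
    {
        double freq_ghz   = 2.5;
        double mem_ns     = 115.0;                 /* 50/50 mix of 90ns local, 140ns remote */
        double stall_frac = 0.03;                  /* serialized memory accesses per instruction */

        double stall_cycles = mem_ns * freq_ghz;   /* 287.5 cycles per miss */
        double cpi = (1.0 - stall_frac) * 1.0 + stall_frac * stall_cycles;
        printf("stall cycles per miss: %.1f\n", stall_cycles);
        printf("dead-cycle fraction  : %.0f%%\n",
               100.0 * stall_frac * stall_cycles / cpi);   /* ~90% */
        return 0;
    }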

Memory Latency

If memory latency is so important, then why has it not been addressed? Because the obstacles, most of the self-inflicted variety, are apparently insurmountable.

There are memory products with lower latency than mainstream DDR4. All have lower density than DDR4, implying higher cost per GB. If latency at the memory interface were reduced by 30ns, from about 46ns to 16ns, then the 2S system average memory latency drops from 115ns to 85ns, a 26% reduction (a 1.35X ratio). It might translate to a 31% performance gain, which is strong but may or may not be compelling enough to justify the action.
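The same rough arithmetic, applied to the 30ns trim (a sketch; the 31% figure presumably also discounts work that does not wait on memory):

    /* Effect of a 30ns cut in memory-interface latency on the 2S average. */
    #include <stdio.h>

    int main(void)
    {
        double before = 115.0;   /* ns: 2S average, 50/50 local/remote */
        double after  =  85.0;   /* ns: after a 30ns cut at the memory interface */

        printf("latency reduction: %.0f%%\n", 100.0 * (before - after) / before);  /* ~26% */
        printf("latency ratio    : %.2fx\n", before / after);                      /* ~1.35x */
        return 0;
    }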

Single Socket - All Memory Access Local


Intel Software Conference, Brazil 2014 (slides referenced below)

Slide 4, Local Memory Access Example, says: if the data is not present in any local node cache, then request the data from memory and snoop the remote node; local memory latency is the maximum of the two. The slide does not explicitly say that the DRAM request is sent simultaneously with the remote CPU snoop, nor does it say that the steps are serialized.

Slide 5, Remote Memory Access Example, says: the local node requests data from the remote node over QPI, and the remote CPU IMC requests the data from its DRAM and snoops the caches in its node. Again, it is not specified whether the DRAM access and the snoop are concurrent or serialized.
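The distinction matters: if the DRAM read and the remote snoop are issued concurrently, the local latency is the slower of the two; if they are serialized, it is the sum. A small sketch of the two interpretations; the local_latency helper and the 60/70ns component values are purely illustrative, not figures from the slides:

    /* Two interpretations of the local-access path: concurrent vs. serialized
       DRAM read and remote snoop. Component latencies are illustrative only. */
    #include <stdio.h>

    static double local_latency(double dram_ns, double snoop_ns, int concurrent)
    {
        if (concurrent)
            return dram_ns > snoop_ns ? dram_ns : snoop_ns;  /* max of the two */
        return dram_ns + snoop_ns;                           /* fully serialized */
    }

    int main(void)
    {
        double dram = 60.0, snoop = 70.0;   /* illustrative ns values */
        printf("concurrent : %.0f ns\n", local_latency(dram, snoop, 1));
        printf("serialized : %.0f ns\n", local_latency(dram, snoop, 0));
        return 0;
    }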

Initial Note and Ramblings

RISC Revisited

Memory in Database Systems

In the very early days of the relational database, a mini-computer like the VAX 11/780 had a maximum memory of 2MB in 128KB increments, presumably with 288 x 4Kbit DRAM chips on one board? Just getting the executable code, the core system tables, and the index upper and intermediate levels into main memory was absolutely critical. Any incremental memory put to this purpose had a tremendous effect in reducing disk IO. And so, DBAs learned that more memory is good, but did not bother with the action-effect details.

For 40 years, DRAM manufacturers were told that only capacity and price were important, as is reflected in the mainstream products available today.

The early discrete transistors had feature sizes on the order of 50 µm. In 1971, the Intel 4004 consolidated a simple 4-bit microprocessor on a single chip using a 10µm process with 2,250 transistors on a 12mm2 die. In 1978, the Intel 8086 16-bit microprocessor on a 3 µm process having 29,000 transistors on a 33mm2 die was introduced.

That the Intel X86 processor line ultimately won is immaterial. The main RISC tenets were reasonably correct. Yet, are we carrying unnecessary baggage from the past?

RISC

At the time of the original, formal RISC proposal by Patterson at Berkeley and Hennessy at Stanford around 1980, there were doubters of the correctness of its concepts. The early RISC prototype chips, particularly Berkeley RISC-II, were sufficiently successful. This led to the original SPARC processor in 1987 (the MB86900: 20k gates, ~100k transistors?, 1.2µm?), used in the commercially successful Sun-4, succeeded by MicroSPARC (800k transistors, 225mm2, 0.8µm, 1992) and then UltraSPARC I.

The Stanford group led to the MIPS R2000 (110k transistors, 80mm2, 2µm, 1986). By 1990, many companies accepted that the main tenets of RISC were correct and represented the best path forward. HP produced the PA-7000 in 1991 and DEC produced the Alpha (initially DECchip) 21064. IBM is complicated, in principle pre-dating Berkeley-Stanford RISC with the ROMP and the RT PC. Later, IBM had not only the RS/6000 with its POWER architecture, but also PowerPC, and a superset in RS64. (Anyone who can bring clarity to the IBM RISC strategy or strategies is welcome to it.)

Even Intel had a RISC processor in the 80960 or i960, begun around 1984, but elected not to pursue it, with the design team redirected to the Pentium Pro (with the 486 DX4 in between). Intel accepted the RISC tenets (and had conducted a study on a 64-bit RISC processor with 8 GP registers, or something like this).

In the early 1990's, RISC was already ten years old. Given that RISC itself was a rethink of what processors should be going forward, perhaps another rethink could produce an even better idea. The (logic) transistor budget for 1997 on the 250nm process would have been on the order of 20M at the die size limit of somewhat over 400mm2 (SRAM cache transistors have much higher areal density than logic transistors). Deschutes (Pentium II on 250nm) was 7.5M transistors and 113mm2. Perhaps 2M of those transistors were used in 32K of combined L1 cache, at 6 transistors per bit, plus ECC and tags. This idea for post-RISC became Explicitly Parallel Instruction Computing (EPIC), which was really VLIW, but Intel must have its own branded terms, because no idea is a good idea unless it comes from happy valley.

It is rather curious that Intel did not appreciate that EPIC might be good for scientific computing, but not suitable for serial, logic-intensive business computing. In hindsight, business applications would favor many light cores, while scientific computing ultimately went the GPU route.

The reason this should be of interest is that after RISC versus CISC was won by X86, the techniques for building good multi-processor systems were established in the 1990's. Multi-processors proved to be enormously popular in server systems, both for the aggregate compute capability and for the lower queuing delays. Since then, many generations of Moore's Law have elapsed. Over this period, computer system architecture has been mostly evolutionary. But what was once important is no longer as important today. Contemporary systems suffer serious problems for which there are only band-aid work-arounds and not true solutions.

The formal RISC concept assumed that the semiconductor building blocks (chips or dies) would support higher transistor density and higher operating frequency with each process generation.

That the Intel X86 processor line ultimately won is immaterial. The Intel 8086 line was not entirely the CISC processor that the RISC originators were targeting. The original RISC concept was more targeted at the processor implemented on many chips. In a way, the 8086 was a simplified form of CISC, but implemented as a single-die processor. The x86 won in part by implementing the concepts of RISC, with the exception of the instruction set architecture, and in the remaining part by executing to Moore's Law. In theory, the X86 instruction set would be either at a 15% performance disadvantage to RISC processors at equal die size (cost), or at a cost disadvantage to match performance, as considerable silicon real estate must be spent on the decoders. It so happened that Intel had excellent, industry-leading manufacturing technology and the enormous advantage of economies of scale.

The RISC (Berkeley - Patterson, and Stanford - Hennessy) concept was forward looking, anticipating the best strategy to make optimal use of (near to mid-term) future semiconductor technology. But to prove the concept, the first RISC processors had to be built within the constraints of the day. The Berkeley RISC-I processor in 1982 had 32 instructions and 44,420 transistors on a 2µm process. This was followed by RISC-II, with 40,760 transistors and 39 instructions, in 1983.

For reference, the Intel 80286 in 1982 was 134,000 transistors on a 1.5um process with 47mm2 die size. The RISC-I/II were 32-bit with many registers, while the 80286 was 16-bit with fewer registers.

At the time, Moore's Law was in its adolescent phase, having seen a reduction of gate length from 10µm in the early 1970's (4004), to 6µm (8080?), to 3µm in the 8086, and then to 1.5µm. In each process generation, linear dimensions are reduced to 0.707 of the previous generation, supporting 2X transistor density and, in principle, a 1.41X increase in frequency. Later, the actual frequency increase was probably 1.5X per generation.
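Compounded over several generations, those per-generation factors are what the RISC argument was counting on; a small sketch using the 2X density and 1.41X (ideal) or 1.5X (observed) frequency figures above:

    /* Compounding of per-generation scaling: 0.707x linear shrink -> 2x density,
       with frequency gains of 1.41x (ideal) or 1.5x (as observed per the text). */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        for (int gen = 1; gen <= 5; gen++)
            printf("after %d generations: density %4.0fx, frequency %4.1fx / %4.1fx\n",
                   gen, pow(2.0, gen), pow(1.41, gen), pow(1.5, gen));
        return 0;
    }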

There were two implications of Moore's Law. One, increasing transistor density would eventually allow even a full 32-bit processor to be implemented on a single chip. Two, while transistor switching frequency was expected to increase at an exponential rate, the expectation was that chip-to-chip communication rate and latency would improve at a much lower rate. Hence, a multi-chip processor would eventually be at a serious disadvantage to single-chip processors.

The RISC argument incorporated a strategy for future manufacturing processes with ever increasing transistor density. One element was to employ pipelined execution. Another was superscalar execution.

For the proof-of-concept, the immediate goal was to fit a full 32-bit processor on to a single chip.

The theory was that it would be simpler to implement both pipelining and superscalar execution on the simpler instruction set with regular features.

The Mostek MK4096 was a 16-pin package, versus 22 pins for the Intel 2107; both were about 300ns access time, so multiplexing the row and column addresses did not have a latency impact?


A DRAM access consists of tRCD, tCAS, and the data burst (4 clocks, 8 data transfers) overlapped with the data restore, followed by tRP. The full sequence is tRC; everything but the last component (tRP) is tRAS.

Modern DDR4 DRAM is a complicated entity in itself. The 8Gbit density has been available for a couple of years now. One product is 2G x4. Internally, it is divided into 16 banks. Row addressing is 17 bits (131,072 rows). Technically there are 10 column bits, but the lower 3 bits are consumed by the 8-word burst. The organization is then 16 banks, 128K rows, 128 column groups, and 32 bits (8 words x4) per burst (bits: 4 + 17 + 7 + 5 = 33, matching the 8Gbit = 2^33 density).
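A sanity check of that address-bit breakdown, as a sketch using the bit counts quoted above:

    /* Address-bit breakdown of the 8Gbit x4 DDR4 part described in the text. */
    #include <stdio.h>

    int main(void)
    {
        int bank_bits  = 4;    /* 16 banks */
        int row_bits   = 17;   /* 131,072 rows */
        int col_bits   = 7;    /* 128 column groups (low 3 column bits go to the burst) */
        int burst_bits = 5;    /* 8 words x 4 bits = 32 bits per access */

        int total = bank_bits + row_bits + col_bits + burst_bits;
        printf("total address+burst bits: %d -> %llu Gbit\n",
               total, (1ULL << total) >> 30);   /* 33 bits -> 8 Gbit */
        return 0;
    }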

Conservative specifications of approximately 14ns for tRCD, tCAS and tRP are possible. For DDR4-2666, where 2666 is the data transfer rate in MT/s, 1333MHz is the command/address clock (the internal clock is 1/4 or 1/8 of this?), so the clock cycle is 0.75ns and 19-19-19 timings are 14.25ns each. Micron has E parts with about 13.5ns timings. The full tRC is 46.16ns.

The presumed mode of operation is that 8-word bursts are accessed randomly. Each bank can be accessed once per tRC. As long as any given bank is not accessed more than once per tRC (61 clocks, or 122 data transfers, or 15.3 8-word bursts), the DRAM can sustain the rated data transfer rate. A DIMM module could be single, dual, or quad ranked, potentially increasing the effective number of banks.
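The per-bank arithmetic behind that statement, as a sketch using the DDR4-2666 figures quoted above (the text rounds the clock count down to 61):

    /* How many 8-word bursts fit in one tRC at DDR4-2666, i.e. how many banks'
       worth of random accesses are needed to sustain the rated transfer rate. */
    #include <stdio.h>

    int main(void)
    {
        double clock_ns = 0.75;    /* 1333MHz command/address clock */
        double trc_ns   = 46.16;   /* full row cycle */

        double trc_clocks = trc_ns / clock_ns;   /* ~61.5 clocks */
        double transfers  = 2.0 * trc_clocks;    /* ~123 data transfers (2 per clock) */
        double bursts     = transfers / 8.0;     /* ~15.4 8-word bursts per bank per tRC */

        printf("tRC = %.1f clocks = %.0f transfers = %.1f bursts\n",
               trc_clocks, transfers, bursts);
        return 0;
    }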

There is manufacturing variation. By binning for the better parts, timings as low as 10ns can be achieved, and this is marketed to gaming systems, for those willing to accept some risk of memory errors.

Making lower latency DRAM, even with conservative timings, can be done simply by reducing the bank size. Micron has a 1.125Gb RLDRAM 3 part, contemporary with 2 or 4Gb DDR3, organized as 16 banks, 16K rows, 32 columns and 8 words of x18 (bits: 4 + 14 + 5 + 3 + 4 = 30, plus parity), that has tRC between 6.67 and 8ns. It can be presumed then that an 8Gb part divided into 128 banks should be capable of similar timings.


RISC Revisited

In the 1980's, around the 2µm process (2000nm!), a single chip die could have 40,000+ transistors. This was when the argument for RISC was made. By simplifying the Central Processing Unit of complex processors, the entire CPU could fit on a single die. While not all instructions could be implemented in silicon, the benefits of eliminating inter-chip delays outweighed the advantages of multi-chip processors.

The Intel 80286 in 1982 was manufactured on 1.5µm with 134K transistors.
The Berkeley RISC-I in 1982 was 44,420 transistors.
The first MIPS may have been the R2000 on 2.0µm, 80mm2, 110,000 transistors, in 1986.
Wikipedia cites early SPARC as 1987, 1300nm, and 0.1-1.8M transistors. Presumably 100K transistors was for the core, and 1.7M was for the L1 cache?

Now, it is time to abandon the multi-chip system architectures of the multi-processor type, to eliminate their impact on memory latency.



Transistor count by process generation (see Wikipedia, Microprocessor chronology):
10 micron: Intel 4004 (2,250 transistors, 12mm2, 1971)
6 micron: Intel 8080
3 micron: Intel 8086 (29k transistors, 33mm2, 1978)
2 micron: MIPS R2000 (110k transistors, 80mm2, 1986); DEC CVAX 78034 (134k transistors, 71.78mm2, 1987)
1.5 micron: Intel 80286 (134k transistors, 47mm2, 1982)
1 micron: HP PA-7000 (580k transistors, 201.6mm2, 1991)
0.8 micron: HP PA-7100 (850k transistors, 204.49mm2, 1992)
1-0.75 micron: DEC Alpha 21064 (1.68M transistors, 233.52mm2, 1993)

The main RISC tenets:
higher reliance on the compiler
standard instruction sizes and few formats - little decode
load/store memory access
more registers - (all) general purpose
fewer addressing modes
less (or no) micro-programming
single-cycle instructions
pipelining