Joe Chang, SQL Server Performance Engineer (former Consultant)

jchang6 @   US Cell: (407) 791-1126   Deerfield, MA & Dallas, TX

Website:   LinkedIn:  Joe's Linkedin   Other:


20+ years SQL Server, Storage systems, Windows Server OS, Performance Testing and Monitoring
Reverse engineered the SQL Server Query Optimizer cost model and formulas
Quantified the true cost of SQL Server execution plan operations by processor and system architecture
Exposed unusual behavior in SQL Server scaling at high parallelism and NUMA issues
Originator of SQL Clone, a shell database having objects and data distribution statistics, but no data,
 now a standard feature in SQL Server as of version 2014 SP2 and 2016 SP1
Decoded SQL Server statistics binary structure for write-your-own statistics
C/C# programming, data visualization with NPlot, 3rd party components for Excel and PDF
Publisher of SQL ExecStats for cross referencing execution plans to index usage.
Security Clearance: 2009


Deloitte, Dallas, TX, 2019-present

Tax Manager - still doing SQL Server and other HW/SW performance.

Independent Consultant, QDPMA, 2000 to 2019 (& SolidQ 2007-19)

Database Architect and Performance Consultant
Major Projects:

Public Storage 2017-18

A redo of the original line-of-business system was undertaken 4-5 years ago. The company elected to go with a less expensive consulting team, which built on Entity Framework. This was stopped after a few years of unclear progress and the project was completed with in-house staff on straight C# etc. The software was deployed to production with traffic brought over from the previous generation on an incremental basis. It was clear that even with new much more powerful hardware, there was insufficient compute to run the new code. And there was also a mysterious problem with occasional system lock-ups.
At this price was no longer an object and I came to take a look. The standard SQL performance review, re-indexes and re-coding was deployed to production while waiting for the mystery problem to appear. The problem symptoms manifested in procedures or other code not being able to compile, and also a rather small procedure cache with nothing older than 30 min. The primary candidate for this is virtual address space consumption by operations using resources outside of the buffer pool - 8KB pages.
The usual suspects were examined, network packet size larger than 8060 bytes, CLR, and others but none were the true source. Ultimately, this was trace the sp_xml_preparedocument calls without the sp_xml_removedocument afterwards. The new XML data type and query() method should have been used, but the easier fix was to put in the sp_xml_removedocument.

SCOR (Reinsurance) 2014-16:

Support a planned 4X increase in volume from a corporate merger. Initially brought in as infrastructure (server and storage) expert. Rewrote critical SQL code sequences for correctness, efficiency, repeatability. Eliminated unnecessary recompiles. Identified and resolved blocking to achieve scaling at high concurrency. Restructured primary/clustered keys. Complete overhaul of nonclustered indexes with a drastic reduction. Worked around tables with computed columns defined by scalar functions at first. Then worked out plan to remove computed columns, touching several hundred procedures.

Mercury Payment Systems 2012-13

Database re-architecture to support 10X volume increase. On investigation, determined that objectives could be achieved by tuning key SQL code. Identified the true critical root source bottleneck being in the latency for remote server connection initialization, proving objectives could be achieved without re-architecture.

CRM – 4th Gen Programming 2011

The client had built a CRM app with a 4GL product achieving OK performance, but a subsequent patch resulted in daily system seize-ups. Vendor engineers came on site but could not find anything wrong. I was recommended by an Oracle expert. A CPU filtered profiler trace showed some SQL called with API Server cursor. A short-term unfiltered trace was then run, showing that cursors were opened but never closed. A nuance was that the client app (with permanently open connections) generating API Server cursor calls did not have symptoms, while another application using connection pooling did. The victim of unclosed cursors was the sp_reset_connection call. Vendor was not able to provide a quick fix and probably did not understand why this was a problem either. Moving from 32-bit to 64-bit would alleviate the problem. Just resetting the client app to force close connections and cursors was a workaround.

Evolver (kCura Relativity Project) 2011

Client had encountered serious performance problems with kCura Relativity in supporting a very large legal document discovery case with a large number of active users. A new high bandwidth storage system was specified and procured. Developed a plan for migrating the 2TB database off an iSCSI storage system on 1GbE within a limited maintenance windows. SQL Server did not produce efficient parallel execution plans for the style of SQL that Relativity generated. Demonstrated that an alternative style did allow SQL Server to generate parallel execution plans 100-1000 times faster. The kCura SQL expert instead directed developers to tackle the problem symptoms which did not help.

FBI CODIS (Unisys) 2010-11

Corrected serious flaws in the original server and storage configurations. Identified memory round-trip latency, not bandwidth as the key determinant. Rewrote critical SQL sequences to meet project requirements to support processing 50M DNA specimen in 8 hours. Notable code was the double pivot with false aggregation. Eliminated abusively frequent user defined function calls. Recognized that even though the contracted performance requirements could be met, the 10-hour turn-around time for QA cycles posed serious problems. Either the project schedule was not going to be met or the code would have an excessive number of bugs, and probably both. A moderate cost hardware configuration allowed QA cycles to be turned inside of 2 hours.

Customs and Border Patrol (Agilex) 2009

Typical government IT contracting, spending money generating useless paperwork instead of directly tackling the important applications.

US Army Funds Control Management (ATT/Ingenuity) 2007-09.

Mandate was to substantially improve performance as quickly as possible in the new US Army Funds Control management system. Instructions were to do so without changing the business logic. The project was facing cancellation in as little as six weeks if corrective action was not achieved. Performance gains far above requirements were achieved, with the necessary gains in two weeks by rewriting key business logic SQL. In the critical period of the first two weeks, coding rework was done on the production system.

Imceda/Quest LiteSpeed 2004-06

Worked with developers in identify Windows API and C code strategies for backup and restore performance on high-bandwidth IO sub-systems and scaling on (HP Integrity) NUMA system architecture. Wrote papers on system and storage configurations for high bandwidth IO. Advised on tuning strategies in buffer size and count, balancing performance against available 32-bit VAS, as opposed to work by others on peak performance, allocating excessively large and overly numerous buffers when contiguous address space was not likely to be available in a production system.

Cendant Hotel Reservation System Proof of concept (Microsoft Consulting Services 2004)

Existing system on 3 Sun E10000 Starfire server (each with 60 processors) and Informix. Each server could support 300 searches per second. The goal was to determine if 2-socket Opteron systems could support 150 searches per second. A team of Cendant internal people and Oracle engineers ported code from Informix to Oracle. MCS was told that this was to be an unbiased apple-to-apples comparison between Oracle and SQL Server. I was to port the code to SQL Server accordingly.
In the actual course of the project, the Oracle team kept talking about how they would achieve an advantage over SQL Server using native 64-bit code able to use 8GB of memory. SQL Server 2000 Enterprise Edition was precluded, and hence limited to 2GB in standard edition. That being the case, I felt no obligation to abide by the original instructions, and rewrote key code structures, achieving 3000 tps, versus 50 for Oracle on their port. It took the Oracle team 3 weeks to figure out my re-coding logic was compatible with both RDBMS', and both would have comparable performance. Cendant then declared both implementations to be equal, and selected Oracle because that’s what they wanted in the first place.

Public Storage 2002-3

A local consulting firm built a new line-of-business system, specifying deployment on a Unisys E7000 with 16 processors and a storage with 12 HDDs. In the critical first day of the month payment processing, the system would seize up for reasons not understood. This was quickly determined to be caused by a sub-procedure that was called once at night to reconcile an entire day or month of transactions, and then called in volume during the day to process mostly single transactions. The night time call activated a temp table statistics re-compute, producing a parallel execution plan, scanning moderately large tables.
The day time calls retained this plan, resulting in the system being unresponsive until another re-compute occurred. Immediate action was to put in a force recompile each morning after the night run. There continued to be problems with the payment procedure but having different symptoms. The execution plan for the procedure looked to be efficient. A cursor was used to process line items even though most day time transactions had just 1 item. It was shown that SQL Server 2000 could run this procedure efficiently with up to 4 concurrently running threads, but not more on a NUMA system. This problem would not have occurred on a 4-way SMP system. The cursor was removed. Also, SQL Server 2005 fixed the NUMA cursor issue, as cursors were a part of the TPC-E benchmark.

S1/Edify 2000-1

Unisys had sold 12 ProFusion 8-way Xeon servers to a large foreign bank, bringing in S1/Edify for software and implementation. One pair of systems were for the database and the others for application servers. There were some index deficiencies in the database, easily fixed. The app servers were expected to support 1000 users, in part to justify the platform cost. Each user required 2 threads. Problems were traced to the thread stack size (1M), and a multi-threaded heap allocation method (MpHeap on NT 4.0) with poor choice of caching, and another critical segment that made avoidable large memory allocations. Fixes allowed the 8-way application servers to meet objectives even though more 2-way servers would have been a better solution.

Others: National Guard G1 Lifecycle, Kaiser Permanente

Intel Corporation, Portland, OR, 1997-1999

Senior Software Engineer
Microprocessor Products Group, ISV Performance Lab
- Proved that scaling Exchange 5.5 on the 8-way ProFusion was possible, after everyone else gave up hope. Compaq requested assistance from the Intel benchmark team for the first time in a long while.

Performance Microprocessor Division, Strategic Marketing
- Coordinate efforts of architects, design team managers and platform architects in formalizing product strategy.
- Made the successful argument for one more year of IA-32 in 4-way servers when the official POR had just decided to terminate IA-32 processors after Tanner/Cascade (Pentium III Xeon on 180nm), rejecting the ESG argument for 2 years overlap with Itanium in 4-way. Xeon was subsequently continued indefinitely.
- Explained to WPG why Tanner with 1MB off-die L2 cache would not be competitive with Coppermine having 256KB on-die cache at 3-cycle latency. (Tanner was subsequently terminated in favor of Cascades.)
- Contributed the idea of a bus expander for PCI at 66MHz to get around the 2 slot limit.
- Initiated the effort to relabel the upcoming Pentium III 180nm product (Coppermine) for 133.3MHz bus at multiplier 5 as 667MHz before any official document was distributed to OEMs showing the frequency rounded to the integer down from the numeric value.
- Proposed the SKU-price stability concept
At the time, the Intel flagship desktop processor SKU would be introduced at $700, decline to $500 the next quarter, followed by successive price reductions each quarter, ending at below $200 after 5 quarters.

It was necessary to increase the performance at each price point every quarter for competitive reasons. In the SKU-price stability strategy, the SKU that introduced at $700 would remain at that price, and new SKUs would be introduced at each price point as necessary.
(It was several years before the product line could be adjusted to be amenable to this strategy, sometime in the 2005-6 time frame. Over time, this evolved into the Core i7, i5, i3 and Pentium stratification.)

Insignia Solutions, Mountain View, CA, 1995-1997

System Engineer

Acer America, San Jose, CA, 1994-1995

Product Manager, Server Systems

US Navy, 1989-1994

Unrestricted Line Officer, Nuclear Propulsion, USS Arkansas CGN-41


California Institute of Technology

BS Physics

Professional Certifications

Microsoft Certified System Engineer, 1995
Novell Certified NetWare Engineer, 1994

Speaking Engagements and Publications

PASS Summit 2018, Seattle
Rethink Server Sizing     Joe Chang   Date: Nov 7, Time: 4:45PM - 6:00PM, Room: 618

Data Architecture Summit 2017, Chicago Is Performance Still Important

SQL Saturday Bratislava, Jun 2015 SQL Saturday New York, May 2015
SQL Saturday Redmond, May 2014 SQL Saturday Atlanta, May 2014
SQL Saturday Budapest, Mar 2014 Silicon Valley Code Camp , 2013 Oct 6
SQL Saturday Riyadh, 23 May 2013 SQL Rally Nordic, Copenhagen, 2-3 Oct 2012
SQL Bits 7, 2010,   SQL Server Connections 2003, 2004, 2005,   HP World 2004 ,   CMG 2004

Presentations at SQL Server Users Groups
Munich, Germany, 10 Jan 2013 Execution Plan Cost Formulas
Stockholm, Sweden, 24 Jan 2011 Malmo, Sweden, 25 Jan 2011
Copenhagen, Denmark, 26 Jan 2011 Munich, Germany, 16 Dec 2010
SQL Performance for Developers
Melbourne, Australia, 9 Aug, 2010 Curacao, 22 Apr 2010
Selected Articles

SQL Server Cost Based Optimizer
Queries barely over the Cost Threshold for Parallelism
Hyper-Threading Performance

In-Memory Database Technology
Big Iron Systems and NUMA System Architecture
Intel Server Strategy Shift with Sandy Bridge

Storage Performance 2013
Storage Performance for Data Warehouses
IO Queue Depth Strategy
Solid-State Devices