When TOPS No Longer Equals Performance: Where Is the True Bottleneck in AI Compute?

In recent years, almost every AI chip launch has revolved around a single keyword: TOPS. Whether in data center GPUs, automotive SoCs, or edge AI processors, each generation boasts higher compute numbers, climbing from tens or hundreds of TOPS to thousands and beyond. On the surface, raw AI compute no longer seems to be a problem, yet performance bottlenecks still appear frequently in real-world system design and deployment. This has led engineers to reconsider a key question: is TOPS really the core metric that determines AI performance?

TOPS (Tera Operations Per Second) is the theoretical peak number of operations, in trillions, that a chip can perform per second at a specific numeric precision. Most AI chips quote the figure for INT8, INT4, or other low-precision formats, because these are the most relevant to inference and also yield the most impressive numbers. A TOPS figure quoted at one precision is therefore not directly comparable to one quoted at another. More importantly, TOPS alone does not equate to real-world performance: it is more like the maximum horsepower of an engine, and it says nothing about whether the system can sustain that output over time.
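
To make the definition concrete, here is a minimal sketch of how a peak TOPS figure is commonly derived from hardware parameters. It assumes the usual convention that one multiply-accumulate (MAC) counts as two operations. The function names, the 4096-MAC array, the 1 GHz clock, and the 25% utilization are hypothetical values chosen for illustration, not figures for any real chip.

```python
def peak_tops(mac_units: int, clock_hz: float, ops_per_mac: int = 2) -> float:
    """Theoretical peak throughput in TOPS.

    Assumes every MAC unit completes one multiply-accumulate per cycle,
    counted as ops_per_mac operations (multiply + add by convention).
    """
    return mac_units * ops_per_mac * clock_hz / 1e12


def sustained_tops(peak: float, utilization: float) -> float:
    """Throughput actually delivered when only a fraction of cycles do useful work."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return peak * utilization


if __name__ == "__main__":
    # Hypothetical NPU: 4096 INT8 MAC units clocked at 1 GHz.
    peak = peak_tops(mac_units=4096, clock_hz=1.0e9)
    print(f"peak:      {peak:.3f} TOPS")
    # Hypothetical workload that keeps the MAC array busy 25% of the time.
    print(f"sustained: {sustained_tops(peak, 0.25):.3f} TOPS")
```

With these assumed numbers the datasheet would read about 8.2 TOPS, while the workload would see roughly 2 TOPS. The gap between the two is what the rest of this article is about.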

Take NVIDIA as an example: the peak low-precision AI compute of its recent data center GPUs has already reached the thousands of TOPS. Products like the H100 and B200 offer extremely high theoretical throughput in their low-precision modes, sufficient for running large language models and generative AI inference. In the endpoint and edge markets, NVIDIA (with its Jetson series), Qualcomm, MediaTek, Apple, and Google have also released SoCs whose AI accelerators are rated at tens to hundreds of TOPS for image recognition, speech processing, and on-device AI inference. Judging by the datasheets alone, the compute problem appears to be solved.

In practice, however, AI inference is not just about computation. Every operation needs its weights and feature data read from memory first and its results written back afterward. As models grow and their working sets exceed what can be kept close to the compute units, data can be reused less before it has to be fetched again, and system performance is often limited by data movement rather than by the compute units themselves. This explains why, in many applications, actual performance does not scale linearly with the chip’s stated TOPS.
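
The relationship between compute and data movement described above is often summarized with a roofline-style estimate: attainable throughput is the smaller of the compute peak and memory bandwidth multiplied by the number of operations performed per byte moved. The sketch below is a simplified illustration of that idea. The 100 TOPS and 100 GB/s figures and the function names are hypothetical, and real systems have additional limits such as latency and cache behavior that this model ignores.

```python
def attainable_tops(peak_tops: float, bandwidth_gbs: float, ops_per_byte: float) -> float:
    """Roofline-style bound: min(compute peak, bandwidth x arithmetic intensity)."""
    memory_bound_tops = bandwidth_gbs * 1e9 * ops_per_byte / 1e12
    return min(peak_tops, memory_bound_tops)


def ridge_point(peak_tops: float, bandwidth_gbs: float) -> float:
    """Ops per byte needed before the compute units, not memory, become the limit."""
    return peak_tops * 1e12 / (bandwidth_gbs * 1e9)


if __name__ == "__main__":
    # Hypothetical edge SoC: 100 TOPS peak, 100 GB/s of DRAM bandwidth.
    peak, bandwidth = 100.0, 100.0
    print("ridge point:", ridge_point(peak, bandwidth), "ops/byte")
    for intensity in (10, 50, 500, 2000):
        tops = attainable_tops(peak, bandwidth, intensity)
        print(f"{intensity:>5} ops/byte -> {tops:6.1f} TOPS attainable")
```

In this assumed configuration, a layer performing 50 operations per byte of memory traffic can reach at most 5 TOPS, or 5% of the datasheet figure. Only workloads above 1000 operations per byte can approach the full 100 TOPS.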

This brings the focus back to memory architecture. To keep high-TOPS compute units fed, AI chip vendors have been steadily increasing memory bandwidth. Data center GPUs increasingly adopt HBM, which stacks DRAM dies inside the same package as the processor and connects them through a very wide interface, raising bandwidth and shortening the distance between the compute units and external memory. In SoC and NPU designs, the share of die area devoted to on-chip SRAM continues to rise, making it a critical resource.

SRAM in AI chips is more than just temporary storage. It handles high-frequency, low-latency data access, supporting weight caching, feature map buffering, and intermediate result storage. In CNNs, Transformers, and similar models, the same weights and activations are read again and again. If every one of those accesses has to go back to external DRAM, latency increases and power consumption rises sharply. Therefore, many AI architectures keep critical data in on-chip SRAM so that the compute potential represented by TOPS can actually be realized.
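
The effect of keeping data on chip can be illustrated with a deliberately simplified traffic model for a matrix multiplication C = A x B, where A is M x K and B is K x N. The sketch compares a design with no on-chip reuse against one that holds a square tile of the output in SRAM while streaming in the matching rows of A and columns of B. The one-byte (INT8) element size, the tile size, and the function names are assumptions made for illustration. Real accelerators use more elaborate dataflows.

```python
def dram_bytes_no_reuse(m: int, n: int, k: int, elem_bytes: int = 1) -> int:
    """DRAM traffic if every MAC fetches both of its operands from DRAM."""
    reads = 2 * m * n * k * elem_bytes   # one A element and one B element per MAC
    writes = m * n * elem_bytes          # each output element written once
    return reads + writes


def dram_bytes_tiled(m: int, n: int, k: int, tile: int, elem_bytes: int = 1) -> int:
    """DRAM traffic when a tile x tile block of C is accumulated in on-chip SRAM.

    For each output tile, a (tile x k) strip of A and a (k x tile) strip of B
    are read once, and every value fetched is reused `tile` times on chip.
    """
    if m % tile or n % tile:
        raise ValueError("tile must divide both m and n in this simple model")
    output_tiles = (m // tile) * (n // tile)
    reads = output_tiles * 2 * tile * k * elem_bytes
    writes = m * n * elem_bytes
    return reads + writes


if __name__ == "__main__":
    m = n = k = 1024  # hypothetical layer dimensions
    naive = dram_bytes_no_reuse(m, n, k)
    tiled = dram_bytes_tiled(m, n, k, tile=64)
    print(f"no reuse: {naive / 2**20:8.1f} MiB")
    print(f"tiled:    {tiled / 2**20:8.1f} MiB")
    print(f"reduction: {naive / tiled:.1f}x")
```

Under these assumptions, a 64 x 64 output tile cuts DRAM traffic from about 2 GiB to 33 MiB, roughly a 62x reduction, for the same number of arithmetic operations. A larger tile reduces traffic further but needs more SRAM, which is one reason on-chip memory capacity has become such a contested resource.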

Because SRAM plays a key role, its reliability and test coverage become increasingly important. Under high-frequency, long-duration operation, SRAM is prone to issues like read disturb, coupling interference, and aging. Any error can affect not just a single operation but the stability of the entire AI inference result. As a result, memory testing and repair mechanisms have become indispensable in AI chip design.

From the rapid growth of TOPS to the continuous evolution of memory architecture, it is clear that the competitive focus of AI chips is shifting. Compute power remains important, but the true differentiator is often who can make data flow more smoothly and memory more reliable. When TOPS is no longer just a marketing number but can be fully realized, the true value of an AI chip is finally unleashed.