12  AI Acceleration

DALL·E 3 Prompt: Create an intricate and colorful representation of a System on Chip (SoC) design in a rectangular format. Showcase a variety of specialized machine learning accelerators and chiplets, all integrated into the processor. Provide a detailed view inside the chip, highlighting the rapid movement of electrons. Each accelerator and chiplet should be designed to interact with neural network neurons, layers, and activations, emphasizing their processing speed. Depict the neural networks as a network of interconnected nodes, with vibrant data streams flowing between the accelerator pieces, showcasing the enhanced computation speed.

Machine learning has emerged as a transformative technology across many industries. However, deploying ML capabilities in real-world edge devices faces challenges due to limited computing resources. Specialized hardware acceleration has become essential to enable high-performance machine learning under these constraints. Hardware accelerators optimize compute-intensive operations like inference using custom silicon optimized for matrix multiplications. This provides dramatic speedups over general-purpose CPUs, unlocking real-time execution of advanced models on size, weight and power-constrained devices.

This chapter provides essential background on hardware acceleration techniques for embedded machine learning and their tradeoffs. The goal is to equip readers to make informed hardware selections and software optimizations to develop performant on-device ML capabilities.

Learning Objectives
  • Understand why hardware acceleration is needed for AI workloads

  • Survey key accelerator options like GPUs, TPUs, FPGAs, and ASICs and their tradeoffs

  • Learn about programming models, frameworks, compilers for AI accelerators

  • Appreciate the importance of benchmarking and metrics for hardware evaluation

  • Recognize the role of hardware-software co-design in building efficient systems

  • Gain exposure to cutting-edge research directions like neuromorphics and quantum computing

  • Understand how ML is beginning to augment and enhance hardware design

12.1 Introduction

Machine learning has emerged as a transformative technology across many industries, enabling systems to learn and improve from data. To deploy machine learning capabilities in real-world environments, there is a growing demand for embedded ML solutions - where models are built into edge devices like smartphones, home appliances and autonomous vehicles. However, these edge devices have limited computing resources compared to data center servers.

To enable high-performance machine learning on resource-constrained edge devices, specialized hardware acceleration has become essential. Hardware acceleration refers to using custom silicon chips and architectures to offload compute-intensive ML operations from the main processor. In neural networks, the most intensive computations are the matrix multiplications during inference. Hardware accelerators can optimize these matrix operations, providing 10-100x speedups over general-purpose CPUs. This acceleration unlocks the ability to run advanced neural network models in real-time on devices with size, weight and power constraints.

This chapter overviews hardware acceleration techniques for embedded machine learning and their design tradeoffs. The goal of this chapter is to equip readers with essential background on embedded ML acceleration. This will enable informed hardware selection and software optimization to develop high-performance machine learning capabilities on edge devices.

12.2 Background and Basics

12.2.1 Historical Background

The origins of hardware acceleration date back to the 1960s, with the advent of floating point math co-processors to offload calculations from the main CPU. One early example was the Intel 8087 chip released in 1980 to accelerate floating point operations for the 8086 processor. This established the practice of using specialized processors to handle math-intensive workloads efficiently.

In the 1990s, the first graphics processing units (GPUs) emerged to process graphics pipelines for rendering and gaming rapidly. Nvidia’s GeForce 256 in 1999 was one of the earliest programmable GPUs capable of running custom software algorithms. GPUs exemplify domain-specific fixed-function accelerators as well as evolving into parallel programmable accelerators.

In the 2000s, GPUs were applied to general-purpose computing under GPGPU. Their high memory bandwidth and computational throughput made them well-suited for math-intensive workloads. This included breakthroughs in using GPUs to accelerate training of deep learning models such as AlexNet in 2012.

In recent years, Google’s Tensor Processing Units (TPUs) represent customized ASICs specifically architected for matrix multiplication in deep learning. Their optimized tensor cores achieve higher TeraOPS/watt than CPUs or GPUs during inference. Ongoing innovation includes model compression techniques like pruning and quantization to fit larger neural networks on edge devices.

This evolution demonstrates how hardware acceleration has focused on solving compute-intensive bottlenecks, from floating point math to graphics to matrix multiplication for ML. Understanding this history provides a crucial context for specialized AI accelerators today.

12.2.2 The Need for Acceleration

The evolution of hardware acceleration is closely tied to the broader history of computing. In the early decades, chip design was governed by Moore’s Law and Dennard Scaling, which observed that the number of transistors on an integrated circuit double every year and that as transistors become smaller their peformance (speed) increased while power density (power per unit area) remains constant, respectively. These two laws were held through the single-core era. Figure 12.1 shows the trends of different microprocessor metrics. As the figure denotes, Dennard Scaling fails around the mid-2000s, notice how the clock speed (frequency) remains almost constant even as the number of transistors kept increasing.

However, as Patterson and Hennessy (2016) describe, technological constraints eventually forced a transition to the multicore era, with chips containing multiple processing cores to deliver gains in performance. As power limitations prevented further scaling, this led to “dark silicon” (Dark Silicon) where not all chip areas could be simultaneously active (Xiu 2019).

Patterson, David A, and John L Hennessy. 2016. Computer Organization and Design ARM Edition: The Hardware Software Interface. Morgan kaufmann.
Xiu, Liming. 2019. “Time Moore: Exploiting Moore’s Law from the Perspective of Time.” IEEE Solid-State Circuits Mag. 11 (1): 39–55. https://doi.org/10.1109/mssc.2018.2882285.

The concept of dark silicon emerged as a consequence of these constraints. “Dark silicon” refers to portions of the chip that cannot be powered on at the same time due to thermal and power limitations. Essentially, as the density of transistors increased, the proportion of the chip that could be actively used without overheating or exceeding power budgets shrank.

This phenomenon meant that while chips had more transistors, not all could be operational simultaneously, limiting potential performance gains. This power crisis necessitated a shift to the accelerator era, with specialized hardware units tailored for specific tasks to maximize efficiency. The explosion in AI workloads further drove demand for customized accelerators. Enabling factors included new programming languages, software tools, and manufacturing advances.

Figure 12.1: Microprocessor trends. Credit: Karl Rupp.

Fundamentally, hardware accelerators are evaluated on performance, power, and silicon area (PPA). The nature of the target application - whether memory-bound or compute-bound - heavily influences the design. For example, memory-bound workloads demand high bandwidth and low latency access, while compute-bound applications require maximal computational throughput.

12.2.3 General Principles

The design of specialized hardware accelerators involves navigating complex trade-offs between performance, power efficiency, silicon area, and workload-specific optimizations. This section outlines core considerations and methodologies for achieving an optimal balance based on application requirements and hardware constraints.

Performance Within Power Budgets

Performance refers to the throughput of computational work per unit time, commonly measured in floating point operations per second (FLOPS) or frames per second (FPS). Higher performance enables completing more work, but power consumption rises with activity.

Hardware accelerators aim to maximize performance within set power budgets. This requires careful balancing of parallelism, clock frequency of the chip, operating voltage of the chip, workload optimization and other techniques to maximize operations per watt.

  • Performance = Throughput * Efficiency
  • Throughput ~= Parallelism * Clock Frequency
  • Efficiency = Operations / Watt

For example, GPUs achieve high throughput via massively parallel architectures. However, their efficiency is lower than customized application-specific integrated circuits (ASICs) like Google’s TPU that optimize for a specific workload.

Managing Silicon Area and Costs

Chip area directly impacts manufacturing cost. Larger die sizes require more materials, lower yields, and higher defect rates. Mulit-die packages help scale designs but add packaging complexity. Silicon area depends on:

  • Computational resources - e.g. number of cores, memory, caches
  • Manufacturing process node - smaller transistors enable higher density
  • Programming model - programmed accelerators require more flexibility

Accelerator design involves squeezing maximim performance within area constraints. Techniques like pruning and compression help fit larger models on chip.

Workload-Specific Optimizations

The target workload dictates optimal accelerator architectures. Some of the key considerations include:

  • Memory vs Compute boundedness: Memory-bound workloads require more memory bandwidth, while compute-bound apps need arithmetic throughput.
  • Data locality: Data movement should be minimized for efficiency. Near-compute memory helps.
  • Bit-level operations: Low precision datatypes like INT8/INT4 optimize compute density.
  • Data parallelism: Multiple replicated compute units allow parallel execution.
  • Pipelining: Overlapped execution of operations increases throughput.

Understanding workload characteristics enables customized acceleration. For example, convolutional neural networks use sliding window operations that are optimally mapped to spatial arrays of processing elements.

By navigating these architectural tradeoffs, hardware accelerators can deliver massive performance gains and enable emerging applications in AI, graphics, scientific computing and other domains.

Sustainable Hardware Design

In recent years, AI sustainability has become a pressing concern driven by two key factors - the exploding scale of AI workloads and their associated energy consumption.

First, the size of AI models and datasets has rapidly grown. For example, the amount of compute used to train state-of-the-art models doubles every 3.5 months based on OpenAI’s AI compute trends. This exponential growth requires massive computational resources in data centers.

Second, the energy usage of AI training and inference presents sustainability challenges. Data centers running AI applications now consume substantial amounts of energy, contributing to high carbon emissions. It’s estimated that training a large AI model can have a carbon footprint of 626,000 pounds of CO2 equivalent, almost 5 times the lifetime emissions of an average car.

As a result, AI research and practice must prioritize energy efficiency and carbon impact alongside accuracy. There is increasing focus on model efficiency, data center design, hardware optimization and other solutions to improve sustainability. Striking a balance between AI progress and environmental responsibility has emerged as a key consideration and an area of active research across the field.

The scale of AI systems is expected to keep growing. Developing sustainable AI is crucial for managing the environmental footprint and enabling widespread beneficial deployment of this transformative technology.

We will learn about Sustainable AI in a later chapter where we will go into more detail about it.

12.3 Accelerator Types

Hardware accelerators can take on many forms. They can exist as a widget (like the Neural Engine in the Apple M1 chip) or as entire chips specially designed to perform certain tasks very well. In this section, we will examine processors for machine learning workloads along the spectrum from highly specialized ASICs to more general-purpose CPUs. We first focus on custom hardware purpose-built for AI to understand the most extreme optimizations possible when design constraints are removed. This establishes a ceiling for performance and efficiency.

We then progressively consider more programmable and adaptable architectures with discussions of GPUs and FPGAs. These make tradeoffs in customization to maintain flexibility. Finally, we cover general-purpose CPUs which sacrifice optimizations for a particular workload in exchange for versatile programmability across applications.

By structuring the analysis along this spectrum, we aim to illustrate the fundamental tradeoffs in accelerator design between utilization, efficiency, programmability, and flexibility. The optimal balance point depends on the constraints and requirements of the target application. This spectrum perspective provides a framework for reasoning about hardware choices for machine learning and the capabilities required at each level of specialization.

Figure 12.2 illustrates the complex interplay between flexibility, performance, functional diversity, and area of architecture design. Notice how the ASIC is on the bottom-right corner, with minimal area, flexibility, and power consumption and maximal performance, due to its highly specialized application-specific nature. A key tradeoff is functinoal diversity vs performance: general purpose architechtures can serve diverse applications but their application performance is degraded as compared to more customized architectures.

The progression begins with the most specialized option, ASICs purpose-built for AI, to ground our understanding in the maximum possible optimizations before expanding to more generalizable architectures. This structured approach aims to elucidate the accelerator design space.

Figure 12.2: Design tradeoffs. Credit: S. Huang, Waeijen, and Corporaal (2022)
Huang, Shihua, Luc Waeijen, and Henk Corporaal. 2022. “How Flexible Is Your Computing System?” ACM Transactions on Embedded Computing Systems (TECS) 21 (4): 1–41.

12.3.1 Application-Specific Integrated Circuits (ASICs)

An Application-Specific Integrated Circuit (ASIC) is a type of integrated circuit (IC) that is custom-designed for a specific application or workload, rather than for general-purpose use. Unlike CPUs and GPUs, ASICs do not support multiple applications or workloads. Rather, they are optimized to perform a single task extremely efficiently. The Google TPU is an example of an ASIC.

ASICs achieve this efficiency by tailoring every aspect of the chip design - the underlying logic gates, electronic components, architecture, memory, I/O, and manufacturing process - specifically for the target application. This level of customization allows removing any unnecessary logic or functionality required for general computation. The result is an IC that maximizes performance and power efficiency on the desired workload. The efficiency gains from application-specific hardware are so substantial that these software-centric firms are dedicating enormous engineering resources to designing customized ASICs.

The rise of more complex machine learning algorithms has made the performance advantages enabled by tailored hardware acceleration a key competitive differentiator, even for companies traditionally concentrated on software engineering. ASICs have become a high-priority investment for major cloud providers aiming to offer faster AI computation.

Advantages

ASICs provide significant benefits over general purpose processors like CPUs and GPUs due to their customized nature. The key advantages include the following.

Maximized Performance and Efficiency

The most fundamental advantage of ASICs is the ability to maximize performance and power efficiency by customizing the hardware architecture specifically for the target application. Every transistor and design aspect is optimized for the desired workload - no unnecessary logic or overhead is needed to support generic computation.

For example, Google’s Tensor Processing Units (TPUs) contain architectures tailored exactly for the matrix multiplication operations used in neural networks. To design the TPU ASICs, Google’s engineering teams need to clearly define the chip specifications, write the architecture description using Hardware Description Languages like Verilog, synthesize the design to map it to hardware components, and carefully place-and-route transistors and wires based on the fabrication process design rules. This complex design process, known as very-large-scale integration (VLSI), allows them to build an IC optimized just for machine learning workloads.

As a result, TPU ASICs achieve over an order of magnitude higher efficiency in operations per watt than general purpose GPUs on ML workloads by maximizing performance and minimizing power consumption through a full-stack custom hardware design.

Specialized On-Chip Memory

ASICs incorporate on-chip SRAM and caches specifically optimized to feed data to the computational units. For example, Apple’s M1 system-on-a-chip contains special low-latency SRAM to accelerate the performance of its Neural Engine machine learning hardware. Large local memory with high bandwidth enables keeping data as close as possible to the processing elements. This provides tremendous speed advantages compared to off-chip DRAM access, which is up to 100x slower.

Data locality and optimizing memory hierarchy is crucial for both high throughput and low power.Below is a table “Numbers Everyone Should Know” from Jeff Dean.

Operation Latency Notes
L1 cache reference 0.5 ns
Branch mispredict 5 ns
L2 cache reference 7 ns
Mutex lock/unlock 25 ns
Main memory reference 100 ns
Compress 1K bytes with Zippy 3,000 ns 3 μs
Send 1 KB bytes over 1 Gbps network 10,000 ns 10 μs
Read 4 KB randomly from SSD 150,000 ns 150 μs
Read 1 MB sequentially from memory 250,000 ns 250 μs
Round trip within same datacenter 500,000 ns 0.5 ms
Read 1 MB sequentially from SSD 1,000,000 ns 1 ms
Disk seek 10,000,000 ns 10 ms
Read 1 MB sequentially from disk 20,000,000 ns 20 ms
Send packet CA->Netherlands->CA 150,000,000 ns 150 ms
Custom Datatypes and Operations

Unlike general purpose processors, ASICs can be designed to natively support custom datatypes like INT4 or bfloat16 that are widely used in ML models. For instance, Nvidia’s Ampere GPU architecture has dedicated bfloat16 Tensor Cores to accelerate AI workloads. Low precision datatypes enable higher arithmetic density and performance. ASICs can also directly incorporate non-standard operations common in ML algorithms as primitive operations - for example, natively supporting activation functions like ReLU makes execution more efficient. We encourage you to refer to the Efficient Numeric Representations chapter for additional details.

High Parallelism

ASIC architectures can leverage much higher parallelism tuned for the target workload versus general purpose CPUs or GPUs. More computational units tailored for the application means more operations execute simultaneously. Highly parallel ASICs achieve tremendous throughput for data parallel workloads like neural network inference.

Advanced Process Nodes

Cutting edge manufacturing processes allow packing more transistors into smaller die areas, increasing density. ASICs designed specifically for high volume applications can better amortize the costs of bleeding edge process nodes.

Disadvantages

Long Design Timelines

The engineering process of designing and validating an ASIC can take 2-3 years. Synthesizing the architecture using hardware description languages, taping out the chip layout, and fabricating the silicon on advanced process nodes involves long development cycles. For example, to tape out a 7nm chip, teams need to carefully define specifications, write the architecture in HDL, synthesize the logic gates, place components, route all interconnections, and finalize the layout to send for fabrication. This very large scale integration (VLSI) flow means ASIC design and manufacturing can traditionally take 2-5 years.

There are a few key reasons why the long design timelines of ASICs, often 2-3 years, can be challenging for machine learning workloads:

  • ML algorithms evolve rapidly: New model architectures, training techniques, and network optimizations are constantly emerging. For example, Transformers became hugely popular in NLP in just the last few years. By the time an ASIC finishes tapeout, the optimal architecture for a workload may have changed.
  • Datasets grow quickly: ASICs designed for certain model sizes or datatypes can become undersized relative to demand. For instance, natural language models are scaling exponentially with more data and parameters. A chip designed for BERT might not accommodate GPT-3.
  • ML applications change frequently: The industry focus shifts between computer vision, speech, NLP, recommender systems etc. An ASIC optimized for image classification may have less relevance in a few years.
  • Faster design cycles with GPUs/FPGAs: Programmable accelerators like GPUs can adapt much quicker by upgrading software libraries and frameworks. New algorithms can be deployed without hardware changes.
  • Time-to-market needs: Getting a competitive edge in ML requires rapidly experimenting with new ideas and deploying them. Waiting several years for an ASIC is not aligned with fast iteration.

The pace of innovation in ML is not well matched to the multi-year timescale for ASIC development. Significant engineering efforts are required to extend ASIC lifespan through modular architectures, process scaling, model compression, and other techniques. But the rapid evolution of ML makes fixed function hardware challenging.

High Non-Recurring Engineering Costs

The fixed costs of taking an ASIC from design to high volume manufacturing can be very capital intensive, often tens of millions of dollars. Photomask fabrication for taping out chips in advanced process nodes, packaging, and one-time engineering efforts are expensive. For instance, a 7nm chip tapeout alone could cost tens of millions of dollars. The high non-recurring engineering (NRE) investment narrows ASIC viability to high-volume production use cases where the upfront cost can be amortized.

Complex Integration and Programming

ASICs require extensive software integration work including drivers, compilers, OS support, and debugging tools. They also need expertise in electrical and thermal packaging. Additionally, programming ASIC architectures efficiently can involve challenges like workload partitioning and scheduling across many parallel units. The customized nature necessitates significant integration efforts to turn raw hardware into fully operational accelerators.

While ASICs provide massive efficiency gains on target applications by tailoring every aspect of the hardware design to one specific task, their fixed nature results in tradeoffs in flexibility and development costs compared to programmable accelerators, which must be weighed based on the application.

12.3.2 Field-Programmable Gate Arrays (FPGAs)

FPGAs are programmable integrated circuits that can be reconfigured for different applications. Their customizable nature provides advantages for accelerating AI algorithms compared to fixed ASICs or inflexible GPUs. While Google, Meta, and NVIDIA which are looking at putting ASICs in data centers, Microsoft deployed FPGAs in their data centers (Putnam et al. 2014) in 2011 to efficiently serve diverse data center workloads.

Advantages

FPGAs provide several benefits over GPUs and ASICs for accelerating machine learning workloads.

Flexibility Through Reconfigurable Fabric

The key advantage of FPGAs is the ability to reconfigure the underlying fabric to implement custom architectures optimized for different models, unlike fixed-function ASICs. For example, quant trading firms use FPGAs to accelerate their algorithms because they change frequently, and the low NRE cost of FPGAs is more viable than taping out new ASICs. Figure 12.3 contains a table comparison of three different FPGAs.

Figure 12.3: Comparison of FPGAs. Credit: Gwennap (n.d.).
Gwennap, Linley. n.d. “Certus-NX Innovates General-Purpose FPGAs.”

FPGAs are composed of basic building blocks - configurable logic blocks, RAM blocks, and interconnects. Vendors provide a base amount of these resources, and engineers program the chips by compiling HDL code into bitstreams that rearrange the fabric into different configurations. This makes FPGAs adaptable as algorithms evolve.

While FPGAs may not achieve the utmost performance and efficiency of workload-specific ASICs, their programmability provides more flexibility as algorithms change. This adaptability makes FPGAs a compelling choice for accelerating evolving machine learning applications. For machine learning workloads, Microsoft has deployed FPGAs in its Azure data centers to serve diverse applications, instead of using ASICs. The programmability enables optimization across changing ML models.

Customized Parallelism and Pipelining

FPGA architectures can leverage spatial parallelism and pipelining by tailoring the hardware design to mirror the parallelism in ML models. For example, Intel’s HARPv2 FPGA platform splits the layers of an MNIST convolutional network across separate processing elements to maximize throughput. Unique parallel patterns like tree ensemble evaluations are also possible on FPGAs. Deep pipelines with optimized buffering and dataflow can be customized to each model’s structure and datatypes. This level of tailored parallelism and pipelining is not feasible on GPUs.

Low Latency On-Chip Memory

Large amounts of high bandwidth on-chip memory enables localized storage for weights and activations. For instance, Xilinx Versal FPGAs contain 32MB of low latency RAM blocks along with dual-channel DDR4 interfaces for external memory. Bringing memory physically closer to the compute units reduces access latency. This provides significant speed advantages over GPUs that must traverse PCIe or other system buses to reach off-chip GDDR6 memory.

Native Support for Low Precision

A key advantage of FPGAs is the ability to natively implement any bit width for arithmetic units, such as INT4 or bfloat16 used in quantized ML models. For example, Intel’s Stratix 10 NX FPGAs have dedicated INT8 cores that can achieve up to 143 INT8 TOPS at ~1 TOPS/W Intel® Stratix® 10 NX FPGA. Lower bit widths increase arithmetic density and performance. FPGAs can even support mixed precision or dynamic precision tuning at runtime.

Disadvatages

Lower Peak Throughput than ASICs

FPGAs cannot match the raw throughput numbers of ASICs customized for a specific model and precision. The overheads of the reconfigurable fabric compared to fixed function hardware result in lower peak performance. For example, the TPU v5e pods allow up to 256 chips to be connected with more than 100 petaOps of INT8 performance while FPGAs can offer up to 143 INT8 TOPS or 286 INT4 TOPS Intel® Stratix® 10 NX FPGA.

This is because FPGAs are composed of basic building blocks - configurable logic blocks, RAM blocks, and interconnects. Vendors provide a set amount of these resources. To program FPGAs, engineers write HDL code and compile into bitstreams that rearrange the fabric, which has inherent overheads versus an ASIC purpose-built for one computation.

Programming Complexity

To optimize FPGA performance, engineers must program the architectures in low-level hardware description languages like Verilog or VHDL. This requires hardware design expertise and longer development cycles versus higher level software frameworks like TensorFlow. Maximizing utilization can be challenging despite advances in high-level synthesis from C/C++.

Reconfiguration Overheads

To change FPGA configurations requires reloading a new bitstream, which has considerable latency and storage size costs. For example, partial reconfiguration on Xilinx FPGAs can take 100s of milliseconds. This makes dynamically swapping architectures in real-time infeasible. The bitstream storage also consumes on-chip memory.

Diminishing Gains on Advanced Nodes

While smaller process nodes benefit ASICs greatly, they provide less advantages for FPGAs. At 7nm and below, effects like process variation, thermal constraints, and aging disproportionately impact FPGA performance. The overheads of configurable fabric also diminish gains vs fixed function ASICs.

Case Study

FPGAs have found widespread application in various fields, including medical imaging, robotics, and finance, where they excel in handling computationally intensive machine learning tasks. In the context of medical imaging, an illustrative example is the application of FPGAs for brain tumor segmentation, a traditionally time-consuming and error-prone process. For instance, Xiong et al. developed a quantized segmentation accelerator, which they retrained using the BraTS19 and BraTS20 datasets. Their work yielded remarkable results, achieving over 5x and 44x performance improvements, as well as 11x and 82x energy efficiency gains compared to GPU and CPU implementations, respectively (Xiong et al. 2021).

Xiong, Siyu, Guoqing Wu, Xitian Fan, Xuan Feng, Zhongcheng Huang, Wei Cao, Xuegong Zhou, et al. 2021. MRI-Based Brain Tumor Segmentation Using FPGA-Accelerated Neural Network.” BMC Bioinf. 22 (1): 421. https://doi.org/10.1186/s12859-021-04347-6.

12.3.3 Digital Signal Processors (DSPs)

The first digital signal processor core was built in 1948 by Texas Instruments (“The Evolution of Audio DSPs”). Traditionally, DSPs would have logic to allow them to directly access digital/audio data in memory, perform an arithmetic operation (multiply-add-accumulate–MAC–was one of the most common operations) and then write the result back to memory. The DSP would also include specialized analog components to retrieve said digital/audio data.

Once we entered the smartphone era, DSPs started encompassing more sophisticated tasks. They required Bluetooth, Wi-Fi, and cellular connectivity. Media also became much more complex. Today, it’s not common to have entire chips dedicated to just DSP, but a System on Chip would include DSPs in addition to general-purpose CPUs. For example, Qualcomm’s Hexagon Digital Signal Processor claims to be a “world-class processor with both CPU and DSP functionality to support deeply embedded processing needs of the mobile platform for both multimedia and modem functions.” Google Tensors, the chip in the Google Pixel phones, also includes both CPUs and specialized DSP engines.

Advatages

DSPs architecturally provide advantages in vector math throughput, low latency memory access, power efficiency, and support for diverse datatypes - making them well-suited for embedded ML acceleration.

Optimized Architecture for Vector Math

DSPs contain specialized data paths, register files, and instructions optimized specifically for vector math operations commonly used in machine learning models. This includes dot product engines, MAC units, and SIMD capabilities tailored for vector/matrix calculations. For example, the CEVA-XM6 DSP (“Ceva SensPro Fuses AI and Vector DSP”) has 512-bit vector units to accelerate convolutions. This efficiency on vector math workloads is far beyond general CPUs.

Low Latency On-Chip Memory

DSPs integrate large amounts of fast on-chip SRAM memory to hold data locally for processing. Bringing memory physically closer to the computation units reduces access latency. For example, Analog’s SHARC+ DSP contains 10MB of on-chip SRAM. This high-bandwidth local memory provides speed advantages for real-time applications.

Power Efficiency

DSPs are engineered to provide high performance per watt on digital signal workloads. Efficient data paths, parallelism, and memory architectures enable trillions of math operations per second within tight mobile power budgets. For example, Qualcomm’s Hexagon DSP can deliver 4 trillion operations per second (TOPS) while consuming minimal watts.

Support for Integer and Floating Point Math

Unlike GPUs which excel at single or half precision, DSPs can natively support both 8/16-bit integer and 32-bit floating point datatypes used across ML models. Some DSPs even support dot product acceleration at INT8 precision for quantized neural networks.

Disadvatages

DSPs make architectural tradeoffs that limit peak throughput, precision, and model capacity compared to other AI accelerators. But their advantages in power efficiency and integer math make them a strong edge compute option. So while DSPs provide some benefits over CPUs, they also come with limitations for machine learning workloads:

Lower Peak Throughput than ASICs/GPUs

DSPs cannot match the raw computational throughput of GPUs or customized ASICs designed specifically for machine learning. For example, Qualcomm’s Cloud AI 100 ASIC delivers 480 TOPS on INT8, while their Hexagon DSP provides 10 TOPS. DSPs lack the massive parallelism of GPU SM units.

Slower Double Precision Performance

Most DSPs are not optimized for higher precision floating point needed in some ML models. Their dot product engines focus on INT8/16 and FP32 which provides better power efficiency. But 64-bit floating point throughput is much lower. This can limit usage in models requiring high precision.

Constrained Model Capacity

The limited on-chip memory of DSPs constrains the model sizes that can be run. Large deep learning models with hundreds of megabytes of parameters would exceed on-chip SRAM capacity. DSPs are best suited for small to mid-sized models targeted for edge devices.

Programming Complexity

Efficiently programming DSP architectures requires expertise in parallel programming and optimizing data access patterns. Their specialized microarchitectures have more learning curve than high-level software frameworks. This makes development more complex.

12.3.4 Graphics Processing Units (GPUs)

The term graphics processing unit existed since at least the 1980s. There had always been a demand for graphics hardware in both video game consoles (high demand, needed to be relatively lower cost) and scientific simulations (lower demand, but needed higher resolution, could be at a high price point).

The term was popularized, however, in 1999 when NVIDIA launched the GeForce 256 mainly targeting the PC games market sector (Lindholm et al. 2008). As PC games became more sophisticated, NVIDIA GPUs became more programmable over time as well. Soon, users realized they could take advantage of this programmability and run a variety of non-graphics related workloads on GPUs and benefit from the underlying architecture. And so, starting in the late 2000s, GPUs became general-purpose graphics processing units or GP-GPUs.

Lindholm, Erik, John Nickolls, Stuart Oberman, and John Montrym. 2008. NVIDIA Tesla: A Unified Graphics and Computing Architecture.” IEEE Micro 28 (2): 39–55. https://doi.org/10.1109/mm.2008.31.

Intel Arc Graphics and AMD Radeon RX have also developed their GPUs over time.

Advatages

High Computational Throughput

The key advantage of GPUs is their ability to perform massively parallel floating point calculations optimized for computer graphics and linear algebra (Raina, Madhavan, and Ng 2009). Modern GPUs like Nvidia’s A100 offers up to 19.5 teraflops of FP32 performance with 6912 CUDA cores and 40GB of graphics memory that is tightly coupled with 1.6TB/s of graphics memory bandwidth.

Raina, Rajat, Anand Madhavan, and Andrew Y. Ng. 2009. “Large-Scale Deep Unsupervised Learning Using Graphics Processors.” In Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009, edited by Andrea Pohoreckyj Danyluk, Léon Bottou, and Michael L. Littman, 382:873–80. ACM International Conference Proceeding Series. ACM. https://doi.org/10.1145/1553374.1553486.

This raw throughput stems from the highly parallel streaming multiprocessor (SM) architecture tailored for data-parallel workloads (Zhihao Jia, Zaharia, and Aiken 2019). Each SM contains hundreds of scalar cores optimized for float32/64 math. With thousands of SMs on chip, GPUs are purpose-built for matrix multiplication and vector operations used throughout neural networks.

For example, Nvidia’s latest H100 GPU provides 4000 TFLOPs of FP8, 2000 TFLOPs of FP16, 1000 TFLOPs of TF32, 67 TFLOPs of FP32 and 34 TFLOPs of FP64 Compute performance, which can dramatically accelerate large batch training on models like BERT, GPT-3, and other transformer architectures. The scalable parallelism of GPUs is key to speeding up computationally intensive deep learning.

Mature Software Ecosystem

Nvidia provides extensive runtime libraries like cuDNN and cuBLAS that are highly optimized for deep learning primitives. Frameworks like TensorFlow and PyTorch integrate with these libraries to enable GPU acceleration with no direct programming. CUDA provides lower-level control for custom computations.

This ecosystem enables quickly leveraging GPUs via high-level Python without GPU programming expertise. Known workflows and abstractions provide a convenient on-ramp for scaling up deep learning experiments. The software maturity supplements the throughput advantages.

Broad Availability

The economies of scale of graphics processing make GPUs broadly accessible in data centers, cloud platforms like AWS and GCP, and desktop workstations. Their availability in research environments has provided a convenient platform for ML experimentation and innovation. For example, nearly every state-of-the-art deep learning result has involved GPU acceleration because of this ubiquity. The broad access supplements the software maturity to make GPUs the standard ML accelerator.

Programmable Architecture

While not fully flexible as FPGAs, GPUs do provide programmability via CUDA and shader languages to customize computations. Developers can optimize data access patterns, create new ops, and tune precisions for evolving models and algorithms.

Disadvatages

While GPUs have become the standard accelerator for deep learning, their architecture also comes with some key downsides.

Less Efficient than Custom ASICs

The statement “GPUs are less efficient than ASICs” could spark intense debate within the ML/AI field and cause this book to explode 🤯.

Typically, GPUs are perceived as less efficient than ASICs because the latter are custom-built for specific tasks and thus can operate more efficiently by design. GPUs, with their general-purpose architecture, are inherently more versatile and programmable, catering to a broad spectrum of computational tasks beyond ML/AI.

However, modern GPUs, however, have evolved to include specialized hardware support for essential AI operations, such as generalized matrix multiplication (GEMM) and other matrix operations, native support for quantization, native support for pruning which are critical for running ML models effectively. These enhancements have significantly improved the efficiency of GPUs for AI tasks, to the point where they can rival the performance of ASICs for certain applications.

Consequently, some might argue that contemporary GPUs represent a convergence of sorts, incorporating specialized, ASIC-like capabilities within a flexible, general-purpose processing framework. This adaptability has blurred the lines between the two types of hardware, with GPUs offering a strong balance of specialization and programmability that is well-suited to the dynamic needs of ML/AI research and development.

High Memory Bandwidth Needs

The massively parallel architecture requires tremendous memory bandwidth to supply thousands of cores as shown in Figure 1. For example, the Nvidia A100 GPU requires 1.6TB/sec to fully saturate its compute. GPUs rely on wide 384-bit memory buses to high bandwidth GDDR6 RAM, but even the fastest GDDR6 tops out around 1 TB/sec. This dependence on external DRAM incurs latency and power overheads.

Programming Complexity

While tools like CUDA help, optimally mapping and partitioning ML workloads across the massively parallel GPU architecture remains challenging. Achieving both high utilization and memory locality requires low-level tuning (Zhe Jia et al. 2018). Abstractions like TensorFlow can leave performance on the table.

Jia, Zhe, Marco Maggioni, Benjamin Staiger, and Daniele P. Scarpazza. 2018. “Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking.” ArXiv Preprint. https://arxiv.org/abs/1804.06826.
Limited On-Chip Memory

GPUs have relatively small on-chip memory caches compared to the large working set requirements of ML models during training. They are reliant on high bandwidth access to external DRAM, which ASICs minimize with large on-chip SRAM.

Fixed Architecture

Unlike FPGAs, the fundamental GPU architecture cannot be altered post-manufacture. This constraint limits adapting to novel ML workloads or layers. The CPU-GPU boundary also creates data movement overheads.

Case Study

The recent groundbreaking research conducted by OpenAI (Brown et al. 2020) with their GPT-3 model. GPT-3, a language model consisting of 175 billion parameters, demonstrated unprecedented language understanding and generation capabilities. Its training, which would have taken months on conventional CPUs, was accomplished in a matter of days using powerful GPUs, thus pushing the boundaries of natural language processing (NLP) capabilities.

12.3.5 Central Processing Units (CPUs)

The term CPUs has a long history that dates back to 1955 (Weik 1955) while the first microprocessor CPU–the Intel 4004–was invented in 1971 (Who Invented the Microprocessor?). Compilers compile high-level programming languages like Python, Java, or C to assembly instructions (x86, ARM, RISC-V, etc.) for CPUs to process. The set of instructions a CPU understands is called the “instruction set” and must be agreed upon by both the hardware and software running atop it (See section 5 for a more in-depth description on instruction set architectures–ISAs).

Weik, Martin H. 1955. A Survey of Domestic Electronic Digital Computing Systems. Ballistic Research Laboratories.

An overview of significant developments in CPUs:

  • Single-core Era (1950s- 2000): This era is known for seeing aggressive microarchitectural improvements. Techniques like speculative execution (executing an instruction before the previous one was done), out-of-order execution (re-ordering instructions to be more effective), and wider issue widths (executing multiple instructions at once) were implemented to increase instruction throughput. The term “System on Chip” also originated in this era as different analog components (components designed with transistors) and digital components (components designed with hardware description languages that are mapped to transistors) were put on the same platform to achieve some task.
  • Multi-core Era (2000s): Driven by the decrease of Moore’s Law, this era is marked by scaling the number of cores within a CPU. Now tasks can be split across many different cores each with its own datapath and control unit. Many of the issues arising in this era pertained to how to share certain resources, which resources to share, and how to maintain coherency and consistency across all the cores.
  • Sea of accelerators (2010s): Again, driven by the decrease of Moore’s law, this era is marked by offloading more complicated tasks to accelerators (widgets) attached the the main datapath in CPUs. It’s common to see accelerators dedicated to various AI workloads, as well as image/digital processing, and cryptography. In these designs, CPUs are often described more as arbiters, deciding which tasks should be processed rather than doing the processing itself. Any task could still be run on the CPU rather than the accelerators, but the CPU would generally be slower. However, the cost of designing and especially programming the accelerator became be a non-trivial hurdle that led to a spike of interest in design-specific libraries (DSLs).
  • Presence in data centers: Although we often hear that GPUs dominate the data center marker, CPUs are still well suited for tasks that don’t inherently possess a large amount of parallelism. CPUs often handle serial and small tasks and coordinate the data center as a whole.
  • On the edge: Given the tighter resource constraints on the edge, edge CPUs often only implement a subset of the techniques developed in the sing-core era because these optimizations tend to be heavy on power and area consumption. Edge CPUs still maintain a relatively simple datapath with limited memory capacities.

Traditionally, CPUs have been synonymous with general-purpose computing–a term that has also changed as the “average” workload a consumer would run changes over time. For example, floating point components were once considered reserved for “scientific computing” so it was usually implemented as a co-processor (a modular component that worked in tandem with the datapath) and seldom deployed to average consumers. Compare this attitude to today, where FPUs are built into every datapath.

Advatages

While limited in raw throughput, general-purpose CPUs do provide some practical benefits for AI acceleration.

General Programmability

CPUs support diverse workloads beyond ML, providing flexible general-purpose programmability. This versatility comes from their standardized instruction sets and mature compiler ecosystems that allow running any application from databases and web servers to analytics pipelines (Hennessy and Patterson 2019).

Hennessy, John L., and David A. Patterson. 2019. “A New Golden Age for Computer Architecture.” Commun. ACM 62 (2): 48–60. https://doi.org/10.1145/3282307.

This avoids the need for dedicated ML accelerators and enables leveraging existing CPU-based infrastructure for basic ML deployment. For example, X86 servers from vendors like Intel and AMD can run common ML frameworks using Python and TensorFlow packages alongside other enterprise workloads.

Mature Software Ecosystem

For decades, highly optimized math libraries like BLAS, LAPACK, and FFTW have leveraged vectorized instructions and multithreading on CPUs (Dongarra 2009). Major ML frameworks like PyTorch, TensorFlow, and SciKit-Learn are designed to integrate seamlessly with these CPU math kernels.

Dongarra, Jack J. 2009. “The Evolution of High Performance Computing on System z.” IBM J. Res. Dev. 53: 3–4.

Hardware vendors like Intel and AMD also provide low-level libraries to fully optimize performance for deep learning primitives (AI Inference Acceleration on CPUs). This robust, mature software ecosystem allows quickly deploying ML on existing CPU infrastructure.

Wide Availability

The economies of scale of CPU manufacturing, driven by demand across many markets like PCs, servers, and mobile, make them ubiquitously available. Intel CPUs, for example, have powered most servers for decades (Ranganathan 2011). This wide availability in data centers reduces hardware costs for basic ML deployment.

Ranganathan, Parthasarathy. 2011. “From Microprocessors to Nanostores: Rethinking Data-Centric Systems.” Computer 44 (1): 39–48. https://doi.org/10.1109/mc.2011.18.

Even small embedded devices typically integrate some CPU, enabling edge inference. The ubiquity reduces need for purchasing specialized ML accelerators in many situations.

Low Power for Inference

Optimizations like vector extensions in ARM Neon and Intel AVX provide power efficient integer and floating point throughput optimized for “bursty” workloads like inference (Ignatov et al. 2018). While slower than GPUs, CPU inference can be deployed in power-constrained environments. For example, ARM’s Cortex-M CPUs now deliver over 1 TOPS of INT8 performance under 1W, enabling keyword spotting and vision applications on edge devices (ARM).

Disadvatages

While providing some advantages, general-purpose CPUs also come with limitations for AI workloads.

Lower Throughput than Accelerators

CPUs lack the specialized architectures for massively parallel processing that GPUs and other accelerators provide. Their general-purpose design results in lower computational throughput for the highly parallelizable math operations common in ML models (N. P. Jouppi et al. 2017a).

Jouppi, Norman P., Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, et al. 2017a. “In-Datacenter Performance Analysis of a Tensor Processing Unit.” In Proceedings of the 44th Annual International Symposium on Computer Architecture, 1–12. ISCA ’17. New York, NY, USA: ACM. https://doi.org/10.1145/3079856.3080246.
Not Optimized for Data Parallelism

The architectures of CPUs are not specifically optimized for data parallel workloads inherent to AI (Sze et al. 2017). They allocate substantial silicon area to instruction decoding, speculative execution, caching, and flow control that provide little benefit for the array operations used in neural networks (AI Inference Acceleration on CPUs). However, modern CPUs are equipped with vector instructions like AVX-512 specifically to accelerate certain key operations like matrix multiplication.

GPU streaming multiprocessors, for example, devote most transistors to floating point units instead of complex branch prediction logic. This specialization allows much higher utilization for ML math.

Higher Memory Latency

CPUs suffer from higher latency accessing main memory relative to GPUs and other accelerators (DDR). Techniques like tiling and caching can help, but the physical separation from off-chip RAM bottlenecks data-intensive ML workloads. This emphasizes the need for specialized memory architectures in ML hardware.

Power Inefficiency Under Heavy Workloads

While suitable for intermittent inference, sustaining near-peak throughput for training results in inefficient power consumption on CPUs, especially mobile CPUs (Ignatov et al. 2018). Accelerators explicitly optimize the dataflow, memory, and computation for sustained ML workloads. For training large models, CPUs are energy-inefficient.

12.3.6 Comparison

Accelerator Description Key Advantages Key Disadvantages
ASICs Custom ICs designed for target workload like AI inference Maximizes perf/watt.
Optimized for tensor ops
Low latency on-chip memory
Fixed architecture lacks flexibility
High NRE cost
Long design cycles
FPGAs Reconfigurable fabric with programmable logic and routing Flexible architecture
Low latency memory access
Lower perf/watt than ASICs
Complex programming
GPUs Originally for graphics, now used for neural network acceleration High throughput
Parallel scalability
Software ecosystem with CUDA
Not as power efficient as ASICs.
Require high memory bandwidth
CPUs General purpose processors Programmability
Ubiquitous availability
Lower performance for AI workloads

In general, CPUs provide a readily available baseline, GPUs deliver broadly accessible acceleration, FPGAs offer programmability, and ASICs maximize efficiency for fixed functions. The optimal choice depends on the scale, cost, flexibility and other requirements of the target application.

Although first developed for data center deployment, where [cite some benefit that google cites], Google has also put considerable effort into developing Edge TPUs. These Edge TPUs maintain the inspiration from systolic arrays but are tailored to the limited resources accessible at the edge.

12.4 Hardware-Software Co-Design

Hardware-software co-design is based on the principle that AI systems achieve optimal performance and efficiency when the hardware and software components are designed in tight integration. This involves an iterative, collaborative design cycle where the hardware architecture and software algorithms are concurrently developed and refined with continuous feedback between teams.

For example, a new neural network model may be prototyped on an FPGA-based accelerator platform to obtain real performance data early in the design process. These results provide feedback to both the hardware designers on potential optimizations as well as the software developers on refinements to the model or framework to better leverage the hardware capabilities. This level of synergy is difficult to achieve with the common practice of software being developed independently to deploy on fixed commodity hardware.

Co-design is particularly critical for embedded AI systems which face significant resource constraints like low power budgets, limited memory and compute capacity, and real-time latency requirements. Tight integration between algorithm developers and hardware architects helps unlock optimizations across the stack to meet these restrictions. Enabling techniques include algorithmic improvements like neural architecture search and pruning along with hardware advances like specialized dataflows and memory hierarchies.

By bringing hardware and software design together, rather than developing them separately, holistic optimizations can be made that maximize performance and efficiency. The next sections provide more details on specific co-design approaches.

12.4.1 The Need for Co-Design

There are several key factors that make a collaborative hardware-software co-design approach essential for building efficient AI systems.

Increasing Model Size and Complexity

State-of-the-art AI models have been rapidly growing in size, enabled by advances in neural architecture design and availability of large datasets. For example, the GPT-3 language model contains 175 billion parameters (Brown et al. 2020), requiring huge computational resources for training. This explosion in model complexity necessitates co-design to develop efficient hardware and algorithms in tandem. Techniques like model compression (Cheng et al. 2018) and quantization must be co-optimized with the hardware architecture.

Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. “Language Models Are Few-Shot Learners.” In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, Virtual, edited by Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin. https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.
Cheng, Yu, Duo Wang, Pan Zhou, and Tao Zhang. 2018. “Model Compression and Acceleration for Deep Neural Networks: The Principles, Progress, and Challenges.” IEEE Signal Process Mag. 35 (1): 126–36. https://doi.org/10.1109/msp.2017.2765695.

Constraints of Embedded Deployment

Deploying AI applications on edge devices like mobile phones or smart home appliances introduces significant constraints on resources such as energy, memory, and silicon area (Sze et al. 2017). To enable real-time inference under these restrictions requires co-exploring hardware optimizations like specialized dataflows and compression with efficient neural network design and pruning techniques. Co-design maximizes performance within the tight deployment constraints.

Rapid Evolution of AI Algorithms

The field of AI is evolving extremely rapidly, with new model architectures, training methodologies, and software frameworks constantly emerging. For example, Transformers have become hugely popular for NLP just in the last few years (Young et al. 2018). Keeping pace with these algorithmic innovations requires hardware-software co-design to quickly adapt platforms and avoid accrued technical debt.

Young, Tom, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. 2018. “Recent Trends in Deep Learning Based Natural Language Processing [Review Article].” IEEE Comput. Intell. Mag. 13 (3): 55–75. https://doi.org/10.1109/mci.2018.2840738.

Complex Hardware-Software Interactions

There are many subtle interactions and tradeoffs between hardware architectural choices and software optimizations that have significant impacts on overall efficiency. For instance, techniques like tensor partitioning and batching affect parallelism. Data access patterns impact memory utilization. Co-design provides a cross-layer perspective to unravel these dependencies.

Need for Specialization

AI workloads benefit from specialized operations like low precision math and customized memory hierarchies. This motivates incorporating custom hardware tailored to neural network algorithms rather than relying solely on flexible software running on generic hardware (Sze et al. 2017). But to realize the benefits, the software stack must explicitly target the custom hardware operations.

Demand for Higher Efficiency

With growing model complexity, there are diminishing returns and overhead from optimizing only the hardware or software in isolation (Putnam et al. 2014). Inevitable tradeoffs arise that require a global optimization across layers. Jointly co-designing hardware and software provides large compound efficiency gains.

Putnam, Andrew, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, et al. 2014. “A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services.” ACM SIGARCH Computer Architecture News 42 (3): 13–24. https://doi.org/10.1145/2678373.2665678.

12.4.2 Principles of Hardware-Software Co-Design

To build high-performance and efficient AI systems, there must be tight integration and co-optimization between the underlying hardware architecture and software stack. Neither can be designed in isolation - maximizing their synergies requires a holistic approach known as hardware-software co-design.

The key goal is tailoring the hardware capabilities to match the algorithms and workloads run by the software. This requires a feedback loop between hardware architects and software developers to converge on optimized solutions. Several techniques enable effective co-design:

Hardware-Aware Software Optimization

The software stack can be optimized to better leverage the underlying hardware capabilities:

  • Parallelism: Parallelize matrix computations like convolution or attention layers to maximize throughput on vector engines.
  • Memory Optimization: Tune data layouts to improve cache locality based on hardware profiling. This maximizes reuse and minimizes expensive DRAM access.
  • Compression: Lerverage sparsity in the models to reduce storage space as well as save on computation by zero-skipping operations.
  • Custom Operations: Incorporate specialized ops like low precision INT4 or bfloat16 into models to capitalize on dedicated hardware support.
  • Dataflow Mapping: Explicitly map model stages to computational units to optimize data movement on hardware.

Algorithm-Driven Hardware Specialization

Hardware can be tailored to better suit the characteristics of ML algorithms:

  • Custom Datatypes: Support low precision INT8/4 or bfloat16 in hardware for higher arithmetic density.
  • On-Chip Memory: Increase SRAM bandwidth and lower access latency to match model memory access patterns.
  • Domain-Specific Ops: Add hardware units for key ML functions like FFTs or matrix multiplication to reduce latency and energy.
  • Model Profiling: Use model simulation and profiling to identify computational hotspots and guide hardware optimization.

The key is collaborative feedback - insights from hardware profiling guide software optimizations, while algorithmic advances inform hardware specialization. This mutual enhancement provides multiplicative efficiency gains compared to isolated efforts.

Algorithm-Hardware Co-exploration

Jointly exploring innovations in neural network architectures along with custom hardware design is a powerful co-design technique. This allows finding ideal pairings tailored to each other’s strengths (Sze et al. 2017).

Sze, Vivienne, Yu-Hsin Chen, Tien-Ju Yang, and Joel S. Emer. 2017. “Efficient Processing of Deep Neural Networks: A Tutorial and Survey.” Proc. IEEE 105 (12): 2295–2329. https://doi.org/10.1109/jproc.2017.2761740.
Howard, Andrew G., Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications.” ArXiv Preprint. https://arxiv.org/abs/1704.04861.
Jacob, Benoit, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew G. Howard, Hartwig Adam, and Dmitry Kalenichenko. 2018. “Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference.” In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, 2704–13. IEEE Computer Society. https://doi.org/10.1109/CVPR.2018.00286.
Gale, Trevor, Erich Elsen, and Sara Hooker. 2019. “The State of Sparsity in Deep Neural Networks.” ArXiv Preprint abs/1902.09574. https://arxiv.org/abs/1902.09574.
Mishra, Asit K., Jorge Albericio Latorre, Jeff Pool, Darko Stosic, Dusan Stosic, Ganesh Venkatesh, Chong Yu, and Paulius Micikevicius. 2021. “Accelerating Sparse Deep Neural Networks.” CoRR abs/2104.08378. https://arxiv.org/abs/2104.08378.

For instance, the shift to mobile architectures like MobileNets (Howard et al. 2017) was guided by edge device constraints like model size and latency. The quantization (Jacob et al. 2018) and pruning techniques (Gale, Elsen, and Hooker 2019) that unlocked these efficient models became possible thanks to hardware accelerators with native low-precision integer support and pruning support (Mishra et al. 2021).

Attention-based models have thrived on massively parallel GPUs and ASICs where their computation maps well spatially, as opposed to RNN architectures reliant on sequential processing. Co-evolution of algorithms and hardware unlocked new capabilities.

Effective co-exploration requires close collaboration between algorithm researchers and hardware architects. Rapid prototyping on FPGAs (C. Zhang et al. 2015) or specialized AI simulators allows quickly evaluating different pairings of model architectures and hardware designs pre-silicon.

Zhang, Chen, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Optimizing Cong. 2015. FPGA-Based Accelerator Design for Deep Convolutional Neural Networks Proceedings of the 2015 ACM.” In SIGDA International Symposium on Field-Programmable Gate Arrays-FPGA, 15:161–70.

For example, Google’s TPU architecture evolved in conjunction with optimizations to TensorFlow models to maximize performance on image classification. This tight feedback loop yielded models tailored for the TPU that would have been unlikely in isolation.

Studies have shown 2-5x higher performance and efficiency gains with algorithm-hardware co-exploration compared to isolated algorithm or hardware optimization efforts (Suda et al. 2016). Parallelizing the joint development also reduces time-to-deployment.

Suda, Naveen, Vikas Chandra, Ganesh Dasika, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, and Yu Cao. 2016. “Throughput-Optimized OpenCL-Based FPGA Accelerator for Large-Scale Convolutional Neural Networks.” In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 16–25. ACM. https://doi.org/10.1145/2847263.2847276.

Overall, exploring the tight interdependencies between model innovation and hardware advances unlocks opportunities not visible when tackled sequentially. This synergistic co-design yields solutions greater than the sum of their parts.

12.4.3 Challenges

While collaborative co-design can improve efficiency, adaptability, and time-to-market, it also comes with engineering and organizational challenges.

Increased Prototyping Costs

More extensive prototyping is required to evaluate different hardware-software pairings. The need for rapid, iterative prototypes on FPGAs or emulators increases validation overhead. For example, Microsoft found more prototypes needed for co-design of an AI accelerator versus sequential design (Fowers et al. 2018).

Fowers, Jeremy, Kalin Ovtcharov, Michael Papamichael, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, et al. 2018. “A Configurable Cloud-Scale DNN Processor for Real-Time AI.” In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), 1–14. IEEE; IEEE. https://doi.org/10.1109/isca.2018.00012.

Team and Organizational Hurdles

Co-design requires close coordination between traditionally disconnected hardware and software groups. This could introduce communication issues or misaligned priorities and schedules. Navigating different engineering workflows is also challenging. Some organizational inertia to adopting integrated practices may exist.

Simulation and Modeling Complexity

Capturing subtle interactions between hardware and software layers for joint simulation and modeling adds significant complexity. Full cross-layer abstractions are difficult to construct quantitatively pre-implementation. This makes holistic optimizations harder to quantify ahead of time.

Over-Specialization Risks

Tight co-design bears the risk of overfitting optimizations to current algorithms, sacrificing generality. For example, hardware tuned exclusively for Transformer models could underperform on future techniques. Maintaining flexibility requires foresight.

Adoption Challenges

Engineers comfortable with established discrete hardware or software design practices may resist adopting unfamiliar collaborative workflows. Projects could face friction in transitioning to co-design, despite long-term benefits.

12.5 Software for AI Hardware

At this time it should be obvious that specialized hardware accelerators like GPUs, TPUs, and FPGAs are essential to delivering high-performance artificial intelligence applications. But to leverage these hardware platforms effectively, an extensive software stack is required, spanning the entire development and deployment lifecycle. Frameworks and libraries form the backbone of AI hardware, offering sets of robust, pre-built code, algorithms, and functions specifically optimized to perform a wide array of AI tasks on the different hardware. They are designed to simplify the complexities involved in utilizing the hardware from scratch, which can be time-consuming and prone to error. Software plays an important role in the following:

  • Providing programming abstractions and models like CUDA and OpenCL to map computations onto accelerators.
  • Integrating accelerators into popular deep learning frameworks like TensorFlow and PyTorch.
  • Compilers and tools to optimize across the hardware-software stack.
  • Simulation platforms to model hardware and software together.
  • Infrastructure to manage deployment on accelerators.

This expansive software ecosystem is as important as the hardware itself in delivering performant and efficient AI applications. This section provides an overview of the tools available at each layer of the stack shown in Figure 12.4 to enable developers to build and run AI systems powered by hardware acceleration.

flowchart TD
  A(Model Development) --> B(Model Optimization)
  B --> C(Deployment onto Simulator) & D(Deployment onto Hardware)
Figure 12.4: AI software stack spanning development, optimization, simulation, and deployment.

12.5.1 Programming Models

Programming models provide abstractions to map computations and data onto heterogeneous hardware accelerators:

  • CUDA: Nvidia’s parallel programming model to leverage GPUs using extensions to languages like C/C++. Allows launching kernels across GPU cores (Luebke 2008).
  • OpenCL: Open standard for writing programs spanning CPUs, GPUs, FPGAs and other accelerators. Specifies a heterogeneous computing framework (Munshi 2009).
  • OpenGL/WebGL: 3D graphics programming interfaces that can map general-purpose code to GPU cores (Segal and Akeley 1999).
  • Verilog/VHDL: Hardware description languages (HDLs) used to configure FPGAs as AI accelerators by specifying digital circuits (Gannot and Ligthart 1994).
  • TVM: Compiler framework providing Python frontend to optimize and map deep learning models onto diverse hardware back-ends (Chen et al. 2018).
Luebke, David. 2008. CUDA: Scalable Parallel Programming for High-Performance Scientific Computing.” In 2008 5th IEEE International Symposium on Biomedical Imaging: From Nano to Macro, 836–38. IEEE. https://doi.org/10.1109/isbi.2008.4541126.
Munshi, Aaftab. 2009. “The OpenCL Specification.” In 2009 IEEE Hot Chips 21 Symposium (HCS), 1–314. IEEE. https://doi.org/10.1109/hotchips.2009.7478342.
Segal, Mark, and Kurt Akeley. 1999. “The OpenGL Graphics System: A Specification (Version 1.1).”
Gannot, G., and M. Ligthart. 1994. “Verilog HDL Based FPGA Design.” In International Verilog HDL Conference, 86–92. IEEE. https://doi.org/10.1109/ivc.1994.323743.
Chen, Tianqi, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, et al. 2018. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning.” In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 578–94.

Key challenges include expressing parallelism, managing memory across devices, and matching algorithms to hardware capabilities. Abstractions must balance portability with allowing hardware customization. Programming models enable developers to harness accelerators without hardware expertise. More of these details are discussed in the AI frameworks section.

12.5.2 Libraries and Runtimes

Specialized libraries and runtimes provide software abstractions to access and maximize utilization of AI accelerators:

  • Math Libraries: Highly optimized implementations of linear algebra primitives like GEMM, FFTs, convolutions etc. tailored to target hardware. Nvidia cuBLAS, Intel MKL, and Arm compute libraries are examples.
  • Framework Integrations: Libraries to accelerate deep learning frameworks like TensorFlow, PyTorch, and MXNet on supported hardware. For example, cuDNN for accelerating CNNs on Nvidia GPUs.
  • Runtimes: Software to handle execution on accelerators, including scheduling, synchronization, memory management and other tasks. Nvidia TensorRT is an inference optimizer and runtime.
  • Drivers and Firmware: Low-level software to interface with hardware, initialize devices, and handle execution. Vendors like Xilinx provide drivers for their accelerator boards.

For instance, PyTorch integrators use cuDNN and cuBLAS libraries to accelerate training on Nvidia GPUs. The TensorFlow XLA runtime optimizes and compiles models for accelerators like TPUs. Drivers initialize devices and offload operations.

The challenges include efficiently partitioning and scheduling workloads across heterogeneous devices like multi-GPU nodes. Runtimes must also minimize overhead of data transfers and synchronization.

Libraries, runtimes and drivers provide optimized building blocks that deep learning developers can leverage to tap into accelerator performance without hardware programming expertise. Their optimization is essential for production deployments.

12.5.3 Optimizing Compilers

Optimizing compilers play a key role in extracting maximum performance and efficiency from hardware accelerators for AI workloads. They apply optimizations spanning algorithmic changes, graph-level transformations, and low-level code generation.

  • Algorithm Optimization: Techniques like quantization, pruning, and neural architecture search to enhance model efficiency and match hardware capabilities.
  • Graph Optimizations: Graph-level optimizations like operator fusion, rewriting, and layout transformations to optimize performance on target hardware.
  • Code Generation: Generating optimized low-level code for accelerators from high-level models and frameworks.

For example, the TVM open compiler stack applies quantization for a BERT model targeting Arm GPUs. It fuses pointwise convolution operations and transforms weight layout to optimize memory access. Finally it emits optimized OpenGL code to run the workload on the GPU.

Key compiler optimizations include maximizing parallelism, improving data locality and reuse, minimizing memory footprint, and exploiting custom hardware operations. Compilers build and optimize machine learning workloads holistically across hardware components like CPUs, GPUs, and other accelerators.

However, efficiently mapping complex models introduces challenges like efficiently partitioning workloads across heterogeneous devices. Production-level compilers also require extensive time tuning on representative workloads. Still, optimizing compilers are indispensable in unlocking the full capabilities of AI accelerators.

12.5.4 Simulation and Modeling

Simulation software is important in hardware-software co-design. It enables joint modeling of proposed hardware architectures and software stacks:

  • Hardware Simulation: Platforms like Gem5 allow detailed simulation of hardware components like pipelines, caches, interconnects, and memory hierarchies. Engineers can model hardware changes without physical prototyping (Binkert et al. 2011).
  • Software Simulation: Compiler stacks like TVM support simulation of machine learning workloads to estimate performance on target hardware architectures. This assists with software optimizations.
  • Co-simulation: Unified platforms like the SCALE-Sim (Samajdar et al. 2018) integrate hardware and software simulation into a single tool. This enables what-if analysis to quantify the system-level impacts of cross-layer optimizations early in the design cycle.
Binkert, Nathan, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, et al. 2011. “The Gem5 Simulator.” ACM SIGARCH Computer Architecture News 39 (2): 1–7. https://doi.org/10.1145/2024716.2024718.
Samajdar, Ananda, Yuhao Zhu, Paul Whatmough, Matthew Mattina, and Tushar Krishna. 2018. “Scale-Sim: Systolic Cnn Accelerator Simulator.” ArXiv Preprint abs/1811.02883. https://arxiv.org/abs/1811.02883.

For example, an FPGA-based AI accelerator design could be simulated using Verilog hardware description language and synthesized into a Gem5 model. Verilog is well-suited for describing the digital logic and interconnects that make up the accelerator architecture. Using Verilog allows the designer to specify the datapaths, control logic, on-chip memories, and other components that will be implemented in the FPGA fabric. Once the Verilog design is complete, it can be synthesized into a model that simulates the behavior of the hardware, such as using the Gem5 simulator. Gem5 is useful for this task because it allows modeling of full systems including processors, caches, buses, and custom accelerators. Gem5 supports interfacing Verilog models of hardware to the simulation, enabling unified system modeling.

The synthesized FPGA accelerator model could then have ML workloads simulated using TVM compiled onto it within the Gem5 environment for unified modeling. TVM allows optimized compilation of ML models onto heterogeneous hardware like FPGAs. Running TVM-compiled workloads on the accelerator within the Gem5 simulation provides an integrated way to validate and refine the hardware design, software stack, and system integration before ever needing to physically realize the accelerator on a real FPGA.

This type of co-simulation provides estimations of overall metrics like throughput, latency, and power to guide co-design before expensive physical prototyping. They also assist with partitioning optimizations between hardware and software to guide design tradeoffs.

However, limitations exist in accurately modeling subtle low-level interactions between components. Quantified simulations are an estimate but cannot wholly replace physical prototypes and testing. Still, unified simulation and modeling provides invaluable early insights into system-level optimization opportunities during the co-deign process.

12.6 Benchmarking AI Hardware

Benchmarking is a critical process that quantifies and compares the performance of various hardware platforms designed to speed up artificial intelligence applications. It guides purchasing decisions, development focus, and performance optimization efforts for both hardware manufacturers and software developers.

The benchmarking chapter explores this topic in great detail and why it has become an indispensable part of the AI hardware development cycle and how it impacts the broader technology landscape. Here, we will briefly review the main concepts but refer you to the chapter for more details.

Benchmarking suites such as MLPerf, Fathom, and AI Benchmark offer a set of standardized tests that can be used across different hardware platforms. These suites measure AI accelerator performance across various neural networks and machine learning tasks, from basic image classification to complex language processing. By providing a common ground for comparison, they help ensure that performance claims are consistent and verifiable. These “tools” are applied not only to guide the development of hardware but also to ensure that the software stack leverages the full potential of the underlying architecture.

  • MLPerf: Includes a broad set of benchmarks covering both training (Mattson et al. 2020) and inference (Reddi et al. 2020) for a range of machine learning tasks.
  • Fathom: Focuses on core operations found in deep learning models, emphasizing their execution on different architectures (Adolf et al. 2016).
  • AI Benchmark: Targets mobile and consumer devices, assessing AI performance in end-user applications (Ignatov et al. 2018).
Mattson, Peter, Vijay Janapa Reddi, Christine Cheng, Cody Coleman, Greg Diamos, David Kanter, Paulius Micikevicius, et al. 2020. MLPerf: An Industry Standard Benchmark Suite for Machine Learning Performance.” IEEE Micro 40 (2): 8–16. https://doi.org/10.1109/mm.2020.2974843.
Reddi, Vijay Janapa, Christine Cheng, David Kanter, Peter Mattson, Guenther Schmuelling, Carole-Jean Wu, Brian Anderson, et al. 2020. MLPerf Inference Benchmark.” In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), 446–59. IEEE; IEEE. https://doi.org/10.1109/isca45697.2020.00045.
Adolf, Robert, Saketh Rama, Brandon Reagen, Gu-yeon Wei, and David Brooks. 2016. “Fathom: Reference Workloads for Modern Deep Learning Methods.” In 2016 IEEE International Symposium on Workload Characterization (IISWC), 1–10. IEEE; IEEE. https://doi.org/10.1109/iiswc.2016.7581275.
Ignatov, Andrey, Radu Timofte, William Chou, Ke Wang, Max Wu, Tim Hartley, and Luc Van Gool. 2018. AI Benchmark: Running Deep Neural Networks on Android Smartphones,” 0–0.

Benchmarks also have performance metrics that are the quantifiable measures used to evaluate the effectiveness of AI accelerators. These metrics provide a comprehensive view of an accelerator’s capabilities and are used to guide the design and selection process for AI systems. Common metrics include:

  • Throughput: Usually measured in operations per second, this metric indicates the volume of computations an accelerator can handle.
  • Latency: The time delay from input to output in a system, vital for real-time processing tasks.
  • Energy Efficiency: Calculated as computations per watt, representing the trade-off between performance and power consumption.
  • Cost Efficiency: This evaluates the cost of operation relative to performance, an essential metric for budget-conscious deployments.
  • Accuracy: Particularly in inference tasks, the precision of computations is critical and sometimes balanced against speed.
  • Scalability: The ability of the system to maintain performance gains as the computational load scales up.

Benchmark results give insights beyond just numbers - they can reveal bottlenecks in the software and hardware stack. For example, benchmarks may show how increased batch size improves GPU utilization by providing more parallelism. Or how compiler optimizations boost TPU performance. These learnings enable continuous optimization (Zhihao Jia, Zaharia, and Aiken 2019).

Jia, Zhihao, Matei Zaharia, and Alex Aiken. 2019. “Beyond Data and Model Parallelism for Deep Neural Networks.” In Proceedings of Machine Learning and Systems 2019, MLSys 2019, Stanford, CA, USA, March 31 - April 2, 2019, edited by Ameet Talwalkar, Virginia Smith, and Matei Zaharia. mlsys.org. https://proceedings.mlsys.org/book/265.pdf.
Zhu, Hongyu, Mohamed Akrout, Bojian Zheng, Andrew Pelegris, Anand Jayarajan, Amar Phanishayee, Bianca Schroeder, and Gennady Pekhimenko. 2018. “Benchmarking and Analyzing Deep Neural Network Training.” In 2018 IEEE International Symposium on Workload Characterization (IISWC), 88–100. IEEE; IEEE. https://doi.org/10.1109/iiswc.2018.8573476.

Standardized benchmarking provides quantified, comparable evaluation of AI accelerators to inform design, purchasing, and optimization. But real-world performance validation remains essential as well (H. Zhu et al. 2018).

12.7 Challenges and Solutions

AI accelerators offer impressive performance improvements, but their integration into the broader AI landscape is often hindered by significant portability and compatibility challenges. The crux of the issue lies in the diversity of the AI ecosystem — a vast array of machine learning accelerators, frameworks and programming languages exists, each with its unique features and requirements.

12.7.1 Portability/Compatibility Issues

Developers frequently encounter difficulties when attempting to transfer their AI models from one hardware environment to another. For example, a machine learning model developed for a desktop environment in Python using the PyTorch framework, optimized for an Nvidia GPU, may not easily transition to a more constrained device such as the Arduino Nano 33 BLE. This complexity stems from stark differences in programming requirements — Python and PyTorch on the desktop versus a C++ environment on an Arduino, not to mention the shift from x86 architecture to ARM ISA.

These divergences highlight the intricacy of portability within AI systems. Moreover, the rapid advancement in AI algorithms and models means that hardware accelerators must continually adapt, creating a moving target for compatibility. The absence of universal standards and interfaces compounds the issue, making it challenging to deploy AI solutions consistently across various devices and platforms.

Solutions and Strategies

To address these hurdles, the AI industry is moving towards several solutions:

Standardization Initiatives

The Open Neural Network Exchange (ONNX) is at the forefront of this pursuit, proposing an open and shared ecosystem that promotes model interchangeability. ONNX facilitates the use of AI models across various frameworks, allowing for models trained in one environment to be efficiently deployed in another, which significantly reduces the need for time-consuming rewrites or adjustments.

Cross-Platform Frameworks

Complementing the standardization efforts, cross-platform frameworks such as TensorFlow Lite and PyTorch Mobile have been developed specifically to create cohesion between diverse computational environments ranging from desktops to mobile and embedded devices. These frameworks offer streamlined, lightweight versions of their parent frameworks, ensuring compatibility and functional integrity across different hardware types without sacrificing performance. This ensures that developers can create applications with the confidence that they will work on a multitude of devices, bridging a gap that has traditionally posed a considerable challenge in AI development.

Hardware-agnostic Platforms

The rise of hardware-agnostic platforms has also played an important role in democratizing the use of AI. By creating environments where AI applications can be executed on various accelerators, these platforms remove the burden of hardware-specific coding from developers. This abstraction not only simplifies the development process but also opens up new possibilities for innovation and application deployment, free from the constraints of hardware specifications.

Advanced Compilation Tools

In addition, the advent of advanced compilation tools like TVM—an end-to-end tensor compiler—offers an optimized path through the jungle of diverse hardware architectures. TVM equips developers with the means to fine-tune machine learning models for a broad spectrum of computational substrates, ensuring optimal performance and avoiding the need for manual model adjustment each time there is a shift in the underlying hardware.

Community and Industry Collaboration

The collaboration between open-source communities and industry consortia cannot be understated. These collective bodies are instrumental in forming shared standards and best practices that all developers and manufacturers can adhere to. Such collaboration fosters a more unified and synergistic AI ecosystem, significantly diminishing the prevalence of portability issues and smoothing the path toward global AI integration and advancement. Through these combined efforts, the field of AI is steadily moving toward a future where seamless model deployment across various platforms becomes a standard, rather than an exception.

Solving the portability challenges is crucial for the AI field to realize the full potential of hardware accelerators in a dynamic and diverse technological landscape. It requires a concerted effort from hardware manufacturers, software developers, and standard bodies to create a more interoperable and flexible environment. With continued innovation and collaboration, the AI community can pave the way for seamless integration and deployment of AI models across a multitude of platforms.

12.7.2 Power Consumption Concerns

Power consumption is a crucial issue in the development and operation of data center AI accelerators, like Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs) (N. P. Jouppi et al. 2017b) (Norrie et al. 2021) (N. Jouppi et al. 2023). These powerful components are the backbone of contemporary AI infrastructure, but their high energy demands contribute to the environmental impact of technology and drive up operational costs significantly. As data processing needs become more complex, with the popularity of AI and deep learning increasing, there’s a pressing demand for GPUs and TPUs that can deliver the necessary computational power more efficiently. The impact of such advancements is two-fold: they can lower the environmental footprint of these technologies and also reduce the cost of running AI applications.

———, et al. 2017b. “In-Datacenter Performance Analysis of a Tensor Processing Unit.” In Proceedings of the 44th Annual International Symposium on Computer Architecture, 1–12. ISCA ’17. New York, NY, USA: ACM. https://doi.org/10.1145/3079856.3080246.
Norrie, Thomas, Nishant Patil, Doe Hyun Yoon, George Kurian, Sheng Li, James Laudon, Cliff Young, Norman Jouppi, and David Patterson. 2021. “The Design Process for Google’s Training Chips: Tpuv2 and TPUv3.” IEEE Micro 41 (2): 56–63. https://doi.org/10.1109/mm.2021.3058217.
Jouppi, Norm, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, et al. 2023. TPU V4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings.” In Proceedings of the 50th Annual International Symposium on Computer Architecture. ISCA ’23. New York, NY, USA: ACM. https://doi.org/10.1145/3579371.3589350.

Emerging hardware technologies are at the cusp of revolutionizing power efficiency in this sector. Photonic computing, for instance, uses light rather than electricity to carry information, offering a promise of high-speed processing with a fraction of the power usage. We delve deeper into this and other innovative technologies in the “Emerging Hardware Technologies” section, exploring their potential to address current power consumption challenges.

At the edge of the network, AI accelerators are engineered to process data on devices like smartphones, IoT sensors, and smart wearables. These devices often work under severe power limitations, necessitating a careful balancing act between performance and power usage. A high-performance AI model may provide quick results but at the cost of depleting battery life swiftly and increasing thermal output, which may affect the device’s functionality and durability. The stakes are higher for devices deployed in remote or hard-to-reach areas, where consistent power supply cannot be guaranteed, underscoring the need for low-power consuming solutions.

The challenge of power efficiency at the edge is further compounded by latency issues. Edge AI applications in fields such as autonomous driving and healthcare monitoring require not just speed but also precision and reliability, as delays in processing can lead to serious safety risks. For these applications, developers are compelled to optimize both the AI algorithms and the hardware design to strike an optimal balance between power consumption and latency.

This optimization effort is not just about making incremental improvements to existing technologies; it’s about rethinking how and where we process AI tasks. By designing AI accelerators that are both power-efficient and capable of quick processing, we can ensure these devices serve their intended purposes without unnecessary energy use or compromised performance. Such developments could propel the widespread adoption of AI across various sectors, enabling smarter, safer, and more sustainable use of technology.

12.7.3 Overcoming Resource Constraints

Resource constraints also pose a significant challenge for Edge AI accelerators, as these specialized hardware and software solutions must deliver robust performance within the limitations of edge devices. Due to power and size limitations, edge AI accelerators often have restricted computation, memory, and storage capacity (L. Zhu et al. 2023). This scarcity of resources necessitates a careful allocation of processing capabilities to execute machine learning models efficiently.

Zhu, Ligeng, Lanxiang Hu, Ji Lin, Wei-Ming Chen, Wei-Chen Wang, Chuang Gan, and Song Han. 2023. PockEngine: Sparse and Efficient Fine-Tuning in a Pocket.” In 56th Annual IEEE/ACM International Symposium on Microarchitecture. ACM. https://doi.org/10.1145/3613424.3614307.
Lin, Ji, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. 2023. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration.” arXiv.
Li, Yuhang, Xin Dong, and Wei Wang. 2020. “Additive Powers-of-Two Quantization: An Efficient Non-Uniform Discretization for Neural Networks.” In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net. https://openreview.net/forum?id=BkgXT24tDS.
Wang, Tianzhe, Kuan Wang, Han Cai, Ji Lin, Zhijian Liu, Hanrui Wang, Yujun Lin, and Song Han. 2020. APQ: Joint Search for Network Architecture, Pruning and Quantization Policy.” In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, 2075–84. IEEE. https://doi.org/10.1109/CVPR42600.2020.00215.

Moreover, managing constrained resources demands innovative approaches, including model quantization (Lin et al. 2023) (Li, Dong, and Wang 2020), pruning (Wang et al. 2020), and optimizing inference pipelines. Edge AI accelerators must strike a delicate balance between providing meaningful AI functionality and not exhausting the available resources, all while maintaining low power consumption. Overcoming these resource constraints is crucial to ensure the successful deployment of AI at the edge, where many applications, from IoT to mobile devices, rely on the efficient use of limited hardware resources to deliver real-time and intelligent decision-making.

12.8 Emerging Technologies

Thus far we have discussed AI hardware technology in the context of conventional von Neumann architecture design and CMOS-based implementation. These specialized AI chips offer benefits like higher throughput and power efficiency but rely on traditional computing principles. The relentless growth in demand for AI compute power is driving innovations in integration methods for AI hardware.

Two leading approaches have emerged for maximizing compute density - wafer-scale integration and chiplet-based architectures, which we will discuss in this section. Looking much further ahead, we will look into emerging technologies that diverge from conventional architectures and adopt fundamentally different approaches for AI-specialized computing.

Some of these unconventional paradigms include neuromorphic computing which mimics biological neural networks, quantum computing that leverages quantum mechanical effects, and optical computing utilizing photons instead of electrons. Beyond novel computing substrates, new device technologies are enabling additional gains through better memory and interconnect.

Examples include memristors for in-memory computing and nanophotonics for integrated photonic communication. Together, these technologies offer the potential for orders of magnitude improvements in speed, efficiency, and scalability compared to current AI hardware. We will examine these in this section.

12.8.1 Integration Methods

Integration methods refer to the approaches used to combine and interconnect the various computational and memory components in an AI chip or system. The goal of integration is to maximize performance, power efficiency, and density by closely linking the key processing elements.

In the past, AI compute was primarily performed on CPUs and GPUs built using conventional integration methods. These discrete components were manufactured separately then connected together on a board. However, this loose integration creates bottlenecks like data transfer overheads.

As AI workloads have grown, there is increasing demand for tighter integration between compute, memory, and communication elements. Some key drivers of integration include:

  • Minimizing data movement: Tight integration reduces latency and power for moving data between components. This improves efficiency.
  • Customization: Tailoring all components of a system to AI workloads allows optimizations throughout the hardware stack.
  • Parallelism: Integrating a large number of processing elements enables massively parallel computation.
  • Density: Tighter integration allows packing more transistors and memory into a given area.
  • Cost: Economies of scale from large integrated systems can reduce costs.

In response, new manufacturing techniques like wafer-scale fabrication and advanced packaging now allow much higher levels of integration. The goal is to create unified, specialized AI compute complexes tailored for deep learning and other AI algorithms. Tighter integration is key to delivering the performance and efficiency needed for the next generation of AI.

Wafer-scale AI

Wafer-scale AI takes an extremely integrated approach, manufacturing an entire silicon wafer as one gigantic chip. This differs drastically from conventional CPUs and GPUs which cut each wafer into many smaller individual chips. Figure 12.5 shows a comparison between Cerebras Wafer Scale Engine 2, which’s the largest chip ever built, and the largest GPU. While some GPUs may contain billions of transistors, they still pale in comparison to the scale of a wafer-size chip with over a trillion transistors.

The wafer-scale approach also diverges from more modular system-on-chip designs that still have discrete components communicating by bus. Instead, wafer-scale AI enables full customization and tight integration of computation, memory, and interconnects across the entire die.

Figure 12.5: Wafer-scale vs. GPU. Credit: Cerebras.

By designing the wafer as one integrated logic unit, data transfer between elements is minimized. This provides lower latency and power consumption compared to discrete system-on-chip or chiplet designs. While chiplets can offer flexibility by mixing and matching components, communication between chiplets is a challenge. The monolithic nature of wafer-scale integration eliminates these inter-chip communication bottlenecks.

However, the ultra-large scale also poses difficulties for manufacturability and yield with wafer-scale designs. Defects in any region of the wafer can make (certian parts of) the chip unusable. And specialized lithography techniques are required to produce such large dies. So wafer-scale integration pursues the maximum performance gains from integration but requires overcoming substantial fabrication challenges. The following video will provide additional context.

Chiplets for AI

Chiplet design refers to a semiconductor architecture in which a single integrated circuit (IC) is constructed from multiple smaller, individual components known as chiplets. Each chiplet is a self-contained functional block, typically specialized for a specific task or functionality. These chiplets are then interconnected on a larger substrate or package to create a complete, cohesive system. Figure 12.6 illustrates this concept. For AI hardware, chiplets enable mixing different types of chips optimized for tasks like matrix multiplication, data movement, analog I/O, and specialized memories. This heterogeneous integration differs greatly from wafer-scale integration where all logic is manufactured as one monolithic chip. Companies like Intel and AMD have adopted chiplet design for their CPUs.

Chiplets are interconnected using advanced packaging techniques like high-density substrate interposers, 2.5D/3D stacking, and wafer-level packaging. This allows combining chiplets fabricated with different process nodes, specialized memories, and various optimized AI engines.

Figure 12.6: Chiplet partitioning. Credit: Vivet et al. (2021).
Vivet, Pascal, Eric Guthmuller, Yvain Thonnart, Gael Pillonnet, Cesar Fuguet, Ivan Miro-Panades, Guillaume Moritz, et al. 2021. IntAct: A 96-Core Processor with Six Chiplets 3D-Stacked on an Active Interposer with Distributed Interconnects and Integrated Power Management.” IEEE J. Solid-State Circuits 56 (1): 79–97. https://doi.org/10.1109/jssc.2020.3036341.

Some key advantages of using chiplets for AI include:

  • Flexibility: Flexibility: Chiplets allow combining different chip types, process nodes, and memories tailored for each function. This is more modular versus a fixed wafer-scale design.
  • Yield: Smaller chiplets have higher yield than a gigantic wafer-scale chip. Defects are contained to individual chiplets.
  • Cost: Leverages existing manufacturing capabilities versus requiring specialized new processes. Reduces costs by reusing mature fabrication.
  • Compatibility: Can integrate with more conventional system architectures like PCIe and standard DDR memory interfaces.

However, chiplets also face integration and performance challenges:

  • Lower density compared to wafer-scale, as chiplets are limited in size.
  • Added latency when communicating between chiplets versus monolithic integration. Requires optimization for low-latency interconnect.
  • Advanced packaging adds complexity versus wafer-scale integration, though this is arguable.

The key objective of chiplets is finding the right balance between modular flexibility and integration density for optimal AI performance. Chiplets aim for efficient AI acceleration while working within the constraints of conventional manufacturing techniques. Overall, chiplets take a middle path between the extremes of wafer-scale integration and fully discrete components. This provides practical benefits but may sacrifice some computational density and efficiency versus a theoretical wafer-size system.

12.8.2 Neuromorphic Computing

Neuromorphic computing is an emerging field aiming to emulate the efficiency and robustness of biological neural systems for machine learning applications. A key difference from classical Von Neumann architectures is the merging of memory and processing in the same circuit (Schuman et al. 2022; Marković et al. 2020; Furber 2016), as illustrated in Figure 12.7. This integrated approach is inspired by the structure of the brain. A key advantage is the potential for orders of magnitude improvement in energy efficient computation compared to conventional AI hardware. For example, some estimates project 100x-1000x gains in energy efficiency versus current GPU-based systems for equivalent workloads.

Marković, Danijela, Alice Mizrahi, Damien Querlioz, and Julie Grollier. 2020. “Physics for Neuromorphic Computing.” Nature Reviews Physics 2 (9): 499–510. https://doi.org/10.1038/s42254-020-0208-2.
Furber, Steve. 2016. “Large-Scale Neuromorphic Computing Systems.” J. Neural Eng. 13 (5): 051001. https://doi.org/10.1088/1741-2560/13/5/051001.
Figure 12.7: Comparison of the von Neumann architecture with the neuromorphic architecture. Credit: Schuman et al. (2022).
Schuman, Catherine D., Shruti R. Kulkarni, Maryam Parsa, J. Parker Mitchell, Prasanna Date, and Bill Kay. 2022. “Opportunities for Neuromorphic Computing Algorithms and Applications.” Nature Computational Science 2 (1): 10–19. https://doi.org/10.1038/s43588-021-00184-y.

Intel and IBM are leading commercial efforts in neuromorphic hardware. Intel’s Loihi and Loihi 2 chips (Davies et al. 2018, 2021) offer programmable neuromorphic cores with on-chip learning. IBM’s Northpole (Modha et al. 2023) device comprises more than 100 million magnetic tunnel junction synapses and 68 billion transistors. These specialized chips deliver benefits like low power consumption for edge inference.

Davies, Mike, Narayan Srinivasa, Tsung-Han Lin, Gautham Chinya, Yongqiang Cao, Sri Harsha Choday, Georgios Dimou, et al. 2018. “Loihi: A Neuromorphic Manycore Processor with on-Chip Learning.” IEEE Micro 38 (1): 82–99. https://doi.org/10.1109/mm.2018.112130359.
Davies, Mike, Andreas Wild, Garrick Orchard, Yulia Sandamirskaya, Gabriel A. Fonseca Guerra, Prasad Joshi, Philipp Plank, and Sumedh R. Risbud. 2021. “Advancing Neuromorphic Computing with Loihi: A Survey of Results and Outlook.” Proc. IEEE 109 (5): 911–34. https://doi.org/10.1109/jproc.2021.3067593.
Modha, Dharmendra S., Filipp Akopyan, Alexander Andreopoulos, Rathinakumar Appuswamy, John V. Arthur, Andrew S. Cassidy, Pallab Datta, et al. 2023. “Neural Inference at the Frontier of Energy, Space, and Time.” Science 382 (6668): 329–35. https://doi.org/10.1126/science.adh1174.
Maass, Wolfgang. 1997. “Networks of Spiking Neurons: The Third Generation of Neural Network Models.” Neural Networks 10 (9): 1659–71. https://doi.org/10.1016/s0893-6080(97)00011-7.

Spiking neural networks (SNNs) (Maass 1997) are computational models suited for neuromorphic hardware. Unlike deep neural networks that communicate via continuous values, SNNs use discrete spikes more akin to biological neurons. This allows efficient event-based computation rather than constant processing. Additionally, SNNs take into account temporal characteristics of input data in addition to spatial characteristics. This better mimics biological neural networks, where timing of neuronal spikes plays an important role. However, training SNNs remains challenging due to the added temporal complexity. Figure 12.8 provides an overview of the spiking methodlogy: (a) Diagram of a neuron; (b) Measuring an action potential propagated along the axon of a neuron. Only the action potential is detectable along the axon; (c) The neuron’s spike is approximated with a binary representation; (d) Event-Driven Processing; (e) Active Pixel Sensor and Dynamic Vision Sensor. You can also watch the video linked below for a more detailed explanation.

Figure 12.8: Neuromoprhic spiking. Credit: Eshraghian et al. (2023).
Eshraghian, Jason K., Max Ward, Emre O. Neftci, Xinxin Wang, Gregor Lenz, Girish Dwivedi, Mohammed Bennamoun, Doo Seok Jeong, and Wei D. Lu. 2023. “Training Spiking Neural Networks Using Lessons from Deep Learning.” Proc. IEEE 111 (9): 1016–54. https://doi.org/10.1109/jproc.2023.3308088.

Specialized nanoelectronic devices called memristors (Chua 1971) serve as the synaptic components in neuromorphic systems. Memristors act as non-volatile memory with adjustable conductance, emulating the plasticity of real synapses. By combining memory and processing functions, memristors enable in-situ learning without separate data transfers. However, memristor technology has not yet reached maturity and scalability for commercial hardware.

Chua, L. 1971. “Memristor-the Missing Circuit Element.” #IEEE_J_CT# 18 (5): 507–19. https://doi.org/10.1109/tct.1971.1083337.

Recently, the integration of photonics with neuromorphic computing (Shastri et al. 2021) has emerged as an active research area. Using light for computation and communication allows high speeds and reduced energy consumption. However, fully realizing photonic neuromorphic systems requires overcoming design and integration challenges.

Neuromorphic computing offers promising capabilities for efficient edge inference but still faces obstacles around training algorithms, nanodevice integration, and system design. Ongoing multidisciplinary research across computer science, engineering, materials science, and physics will be key to unlocking the full potential of this technology for AI use cases.

12.8.3 Analog Computing

Analog computing is an emerging approach that uses analog signals and components like capacitors, inductors, and amplifiers rather than digital logic for computing. It represents information as continuous electrical signals instead of discrete 0s and 1s. This allows the computation to directly reflect the analog nature of real-world data, avoiding digitization errors and overhead.

Analog computing has generated renewed interest for efficient AI hardware, particularly for inference directly on low-power edge devices. Operations like multiplication and summation at the core of neural networks can be performed with very low energy consumption using analog circuits. This makes analog well-suited for deploying ML models on energy-constrained end nodes. Startups like Mythic are developing analog AI accelerators.

While analog computing was popular in early computers, the boom of digital logic led to its decline. However, analog is compelling for niche applications requiring extreme efficiency (Haensch, Gokmen, and Puri 2019). It contrasts with digital neuromorphic approaches that still use digital spikes for computation. Analog may allow lower precision computation but requires expertise in analog circuit design. Tradeoffs around precision, programming complexity, and fabrication costs remain active areas of research.

Haensch, Wilfried, Tayfun Gokmen, and Ruchir Puri. 2019. “The Next Generation of Deep Learning Hardware: Analog Computing.” Proc. IEEE 107 (1): 108–22. https://doi.org/10.1109/jproc.2018.2871057.
Hazan, Avi, and Elishai Ezra Tsur. 2021. “Neuromorphic Analog Implementation of Neural Engineering Framework-Inspired Spiking Neuron for High-Dimensional Representation.” Front. Neurosci. 15: 627221. https://doi.org/10.3389/fnins.2021.627221.

Neuromorphic computing, which aims to emulate biological neural systems for efficient ML inference, can for instance use analog circuits to implement the key components and behaviors of brains. For example, researchers have designed analog circuits to model neurons and synapses using capacitors, transistors, and operational amplifiers (Hazan and Ezra Tsur 2021). The capacitors can exhibit the spiking dynamics of biological neurons, while the amplifiers and transistors provide weighted summation of inputs to mimic dendrites. Variable resistor technologies like memristors can realize analog synapses with spike-timing dependent plasticity - the ability to strengthen or weaken connections based on spiking activity.

Startups like SynSense have developed analog neuromorphic chips containing these biomimetic components (Bains 2020). This analog approach results in very low power consumption and high scalability for edge devices versus complex digital SNN implementations.

Bains, Sunny. 2020. “The Business of Building Brains.” Nature Electronics 3 (7): 348–51. https://doi.org/10.1038/s41928-020-0449-1.

However, training analog SNNs on chip remains an open challenge. Overall, analog realization is a promising technique for delivering the efficiency, scalability, and biological plausibility envisioned with neuromorphic computing. The physics of analog components combined with neural architecture design could enable large improvements in inference efficiency over conventional digital neural networks.

12.8.4 Flexible Electronics

While much of the new hardware technology in the ML workspace has been focused on optimizing and making systems more efficient, there’s a parallel trajectory aiming to adapt hardware for specific applications (Gates 2009; Musk et al. 2019; Tang et al. 2023; Tang, He, and Liu 2022; Kwon and Dong 2022). One such avenue is the development of flexible electronics for AI use cases.

Gates, Byron D. 2009. “Flexible Electronics.” Science 323 (5921): 1566–67. https://doi.org/10.1126/science.1171230.
Tang, Xin, Hao Shen, Siyuan Zhao, Na Li, and Jia Liu. 2023. “Flexible Braincomputer Interfaces.” Nature Electronics 6 (2): 109–18. https://doi.org/10.1038/s41928-022-00913-9.
Tang, Xin, Yichun He, and Jia Liu. 2022. “Soft Bioelectronics for Cardiac Interfaces.” Biophysics Reviews 3 (1). https://doi.org/10.1063/5.0069516.

Flexible electronics refer to electronic circuits and devices fabricated on flexible plastic or polymer substrates rather than rigid silicon. This allows the electronics to bend, twist, and conform to irregular shapes, unlike conventional rigid boards and chips. Figure 12.9 shows an example of a flexible device prototype that wirelessly measures body temperature, which can be seamlessly integrated into clothing or skin patches. The flexibility and bendability of emerging electronic materials allows them to be integrated into thin, lightweight form factors well-suited for embedded AI and TinyML applications.

Flexible AI hardware can conform to curvy surfaces and operate efficiently with microwatt power budgets. Flexibility also enables rollable or foldable form factors to minimize device footprint and weight, which is ideal for small, portable smart devices and wearables incorporating TinyML. Another key advantage of flexible electronics compared to conventional technologies is lower manufacturing costs and simpler fabrication processes, which could democratize access to these technologies. While silicon masks and fabrication costs typically cost millions of dollars, flexible hardware typically costs only tens of cents to manufacture (T.-C. Huang et al. 2011; Biggs et al. 2021). The potential to fabricate flexible electronics directly onto plastic films using high-throughput printing and coating processes can reduce costs and improve manufacturability at scale versus rigid AI chips (Musk et al. 2019).

Huang, Tsung-Ching, Kenjiro Fukuda, Chun-Ming Lo, Yung-Hui Yeh, Tsuyoshi Sekitani, Takao Someya, and Kwang-Ting Cheng. 2011. “Pseudo-CMOS: A Design Style for Low-Cost and Robust Flexible Electronics.” IEEE Trans. Electron Devices 58 (1): 141–50. https://doi.org/10.1109/ted.2010.2088127.
Biggs, John, James Myers, Jedrzej Kufel, Emre Ozer, Simon Craske, Antony Sou, Catherine Ramsdale, Ken Williamson, Richard Price, and Scott White. 2021. “A Natively Flexible 32-Bit Arm Microprocessor.” Nature 595 (7868): 532–36. https://doi.org/10.1038/s41586-021-03625-w.
Figure 12.9: Flexible device prototype. Credit: Jabil Circuit.

The field is enabled by advances in organic semiconductors and nanomaterials that can be deposited on thin, flexible films. However, fabrication remains challenging compared to mature silicon processes. Flexible circuits typically exhibit lower performance than rigid equivalents right now. Still, they promise to transform electronics into lightweight, bendable materials.

Flexible electronics use cases are well-suited for intimate integration with the human body. Potential medical AI applications include biointegrated sensors, soft assistive robots, and implants to monitor or stimulate the nervous system intelligently. Specifically, flexible electrode arrays could enable higher density, less invasive neural interfaces compared to rigid equivalents.

Therefore, flexible electronics are ushering in a new era of wearables and body sensors, largely due to innovations in organic transistors. These components allow for more lightweight and bendable electronics, which are ideal for wearables, electronic skin, and body-conforming medical devices.

In terms of biocompatibility, they are well-suited for bioelectronic devices, opening avenues for applications in both brain and cardiac interfaces. For example, research in flexible brain–computer interfaces and soft bioelectronics for cardiac applications demonstrates the potential for wide-ranging medical applications.

Companies and research institutions are not only developing and investing great amounts of resources in flexible electrodes, as showcased in Neuralink’s work (Musk et al. 2019), but are also pushing the boundaries to integrate machine learning models within the systems (Kwon and Dong 2022). These smart sensors aim for a seamless, long-lasting symbiosis with the human body.

Musk, Elon et al. 2019. “An Integrated Brain-Machine Interface Platform with Thousands of Channels.” J. Med. Internet Res. 21 (10): e16194. https://doi.org/10.2196/16194.
Kwon, Sun Hwa, and Lin Dong. 2022. “Flexible Sensors and Machine Learning for Heart Monitoring.” Nano Energy 102: 107632. https://doi.org/10.1016/j.nanoen.2022.107632.
Segura Anaya, L. H., Abeer Alsadoon, N. Costadopoulos, and P. W. C. Prasad. 2017. “Ethical Implications of User Perceptions of Wearable Devices.” Sci. Eng. Ethics 24 (1): 1–28. https://doi.org/10.1007/s11948-017-9872-8.
Goodyear, Victoria A. 2017. “Social Media, Apps and Wearable Technologies: Navigating Ethical Dilemmas and Procedures.” Qualitative Research in Sport, Exercise and Health 9 (3): 285–302. https://doi.org/10.1080/2159676x.2017.1303790.
Farah, Martha J. 2005. “Neuroethics: The Practical and the Philosophical.” Trends Cogn. Sci. 9 (1): 34–40. https://doi.org/10.1016/j.tics.2004.12.001.
Roskies, Adina. 2002. “Neuroethics for the New Millenium.” Neuron 35 (1): 21–23. https://doi.org/10.1016/s0896-6273(02)00763-8.

Ethically, the incorporation of smart, machine-learning-driven sensors within the body raises important questions . Issues surrounding data privacy, informed consent, and the long-term societal implications of such technologies are the focus of ongoing work in neuroethics and bioethics (Segura Anaya et al. 2017; Goodyear 2017; Farah 2005; Roskies 2002). The field is progressing at a pace that necessitates parallel advancements in ethical frameworks to guide the responsible development and deployment of these technologies. Overall, while there are limitations and ethical hurdles to overcome, the prospects for flexible electronics are expansive and hold immense promise for future research and applications.

12.8.5 Memory Technologies

Memory technologies are critical to AI hardware, but conventional DDR DRAM and SRAM create bottlenecks. AI workloads require high bandwidth (>1 TB/s) and extreme scientific applications of AI require extremely low latency (<50 ns) to feed data to compute units (Duarte et al. 2022), high density (>128Gb) to store large model parameters and data sets, and excellent energy efficiency (<100 fJ/b) for embedded use (Verma et al. 2019). New memories are needed to meet these demands. Emerging options include several new technologies:

Duarte, Javier, Nhan Tran, Ben Hawks, Christian Herwig, Jules Muhizi, Shvetank Prakash, and Vijay Janapa Reddi. 2022. FastML Science Benchmarks: Accelerating Real-Time Scientific Edge Machine Learning.” ArXiv Preprint abs/2207.07958. https://arxiv.org/abs/2207.07958.
Verma, Naveen, Hongyang Jia, Hossein Valavi, Yinqi Tang, Murat Ozatay, Lung-Yen Chen, Bonan Zhang, and Peter Deaville. 2019. “In-Memory Computing: Advances and Prospects.” IEEE Solid-State Circuits Mag. 11 (3): 43–55. https://doi.org/10.1109/mssc.2019.2922889.
  • Resistive RAM (ReRAM) can improve density with simple, passive arrays. However, challenges around variability remain (Chi et al. 2016).
  • Phase change memory (PCM) exploits the unique properties of chalcogenide glass. Crystalline and amorphous phases have different resistances. Intel’s Optane DCPMM provides fast (100ns), high endurance PCM. But challenges include limited write cycles and high reset current (Burr et al. 2016).
  • 3D stacking can also boost memory density and bandwidth by vertically integrating memory layers with TSV interconnects (Loh 2008). For example, HBM provides 1024-bit wide interfaces.
Burr, Geoffrey W., Matthew J. BrightSky, Abu Sebastian, Huai-Yu Cheng, Jau-Yi Wu, Sangbum Kim, Norma E. Sosa, et al. 2016. “Recent Progress in Phase-Change\(<\)?Pub _Newline ?\(>\)Memory Technology.” IEEE Journal on Emerging and Selected Topics in Circuits and Systems 6 (2): 146–62. https://doi.org/10.1109/jetcas.2016.2547718.
Loh, Gabriel H. 2008. 3D-Stacked Memory Architectures for Multi-Core Processors.” ACM SIGARCH Computer Architecture News 36 (3): 453–64. https://doi.org/10.1145/1394608.1382159.

New memory technologies are critical to unlock the next level of AI hardware performance and efficiency through their innovative cell architectures and materials. Realizing their benefits in commercial systems remains an ongoing challenge.

In-Memory Computing is gaining traction as a promising avenue for optimizing machine learning and high-performance computing workloads. At its core, the technology co-locates data storage and computation to improve energy efficiency and reduce latency Wong et al. (2012). Two key technologies under this umbrella are Resistive RAM (ReRAM) and Processing-In-Memory (PIM).

Wong, H.-S. Philip, Heng-Yuan Lee, Shimeng Yu, Yu-Sheng Chen, Yi Wu, Pang-Shiu Chen, Byoungil Lee, Frederick T. Chen, and Ming-Jinn Tsai. 2012. MetalOxide RRAM.” Proc. IEEE 100 (6): 1951–70. https://doi.org/10.1109/jproc.2012.2190369.
Chi, Ping, Shuangchen Li, Cong Xu, Tao Zhang, Jishen Zhao, Yongpan Liu, Yu Wang, and Yuan Xie. 2016. “Prime: A Novel Processing-in-Memory Architecture for Neural Network Computation in ReRAM-Based Main Memory.” ACM SIGARCH Computer Architecture News 44 (3): 27–39. https://doi.org/10.1145/3007787.3001140.

ReRAM (Wong et al. 2012) and PIM (Chi et al. 2016) serve as the backbone for in-memory computing by storing and computing data in the same location. ReRAM focuses on issues of uniformity, endurance, retention, multibit operation, and scalability. On the other hand, PIM involves CPU units integrated directly into memory arrays, specialized for tasks like matrix multiplication which are central in AI computations.

These technologies find applications in AI workloads and high-performance computing, where the synergy of storage and computation can lead to significant performance gains. The architecture is particularly useful for compute-intensive tasks common in machine learning models.

While in-memory computing technologies like ReRAM and PIM offer exciting prospects for efficiency and performance, they come with their own set of challenges such as issues with data uniformity and scalability in ReRAM (Imani, Rahimi, and S. Rosing 2016). Nonetheless, the field is ripe for innovation, and addressing these limitations can potentially open new frontiers in both AI and high-performance computing.

Imani, Mohsen, Abbas Rahimi, and Tajana S. Rosing. 2016. “Resistive Configurable Associative Memory for Approximate Computing.” In Proceedings of the 2016 Design, Automation &Amp; Test in Europe Conference &Amp; Exhibition (DATE), 1327–32. IEEE; Research Publishing Services. https://doi.org/10.3850/9783981537079\_0454.

12.8.6 Optical Computing

In AI acceleration, a burgeoning area of interest lies in novel technologies that deviate from traditional paradigms. Some emerging technologies mentioned above such as flexible electronics, in-memory computing or even neuromorphics computing are close to becoming a reality, given their ground-breaking innovations and applications. One of the promising and leading the next-gen frontiers are optical computing technologies H. Zhou et al. (2022). Companies like [LightMatter] are pioneering the use of light photonics for calculations, thereby utilizing photons instead of electrons for data transmission and computation.

Zhou, Hailong, Jianji Dong, Junwei Cheng, Wenchan Dong, Chaoran Huang, Yichen Shen, Qiming Zhang, et al. 2022. “Photonic Matrix Multiplication Lights up Photonic Accelerator and Beyond.” Light: Science &Amp; Applications 11 (1): 30. https://doi.org/10.1038/s41377-022-00717-8.
Shastri, Bhavin J., Alexander N. Tait, T. Ferreira de Lima, Wolfram H. P. Pernice, Harish Bhaskaran, C. D. Wright, and Paul R. Prucnal. 2021. “Photonics for Artificial Intelligence and Neuromorphic Computing.” Nat. Photonics 15 (2): 102–14. https://doi.org/10.1038/s41566-020-00754-y.

Optical computing utilizes photons and photonic devices rather than traditional electronic circuits for computing and data processing. It takes inspiration from fiber optic communication links that already rely on light for fast, efficient data transfer (Shastri et al. 2021). Light can propagate with much less loss compared to electrons in semiconductors, enabling inherent speed and efficiency benefits.

Some specific advantages of optical computing include:

  • High throughput: Photons can transmit with bandwidths >100 Tb/s using wavelength division multiplexing.
  • Low latency: Photons interact on femtosecond timescales, millions of times faster than silicon transistors.
  • Parallelism: Multiple data signals can propagate through the same optical medium simultaneously.
  • Low power: Photonic circuits utilizing waveguides and resonators can achieve complex logic and memory with only microwatts of power.

However, optical computing currently faces significant challenges:

  • Lack of optical memory equivalent to electronic RAM
  • Requires conversion between optical and electrical domains.
  • Limited set of available optical components compared to rich electronics ecosystem.
  • Immature integration methods to combine photonics with traditional CMOS chips.
  • Complex programming models required to handle parallelism.

As a result, optical computing is still in the very early research stage despite its promising potential. But technical breakthroughs could enable it to complement electronics and unlock performance gains for AI workloads. Companies like Lightmatter are pioneering early optical AI accelerators. Long term, it could represent a revolutionary computing substrate if key challenges are overcome.

12.8.7 Quantum Computing

Quantum computers leverage unique phenomena of quantum physics like superposition and entanglement to represent and process information in ways not possible classically. Instead of binary bits, the fundamental unit is the quantum bit or qubit. Unlike classical bits limited to 0 or 1, qubits can exist in a superposition of both states simultaneously due to quantum effects.

Multiple qubits can also be entangled, leading to exponential information density but introducing probabilistic results. Superposition enables parallel computation on all possible states, while entanglement allows nonlocal correlations between qubits.

Quantum algorithms carefully manipulate these inherently quantum mechanical effects to solve problems like optimization or search more efficiently than their classical counterparts in theory.

  • Faster training of deep neural networks by exploiting quantum parallelism for linear algebra operations.
  • Efficient quantum ML algorithms making use of the unique capabilities of qubits.
  • Quantum neural networks with inherent quantum effects baked into the model architecture.
  • Quantum optimizers leveraging quantum annealing or adiabatic algorithms for combinatorial optimization problems.

However, quantum states are fragile and prone to errors that require error-correcting protocols. The non-intuitive nature of quantum programming also introduces challenges not present in classical computing.

  • Noisy and fragile quantum bits difficult to scale up. The largest quantum computer today has less than 100 qubits.
  • Restricted set of available quantum gates and circuits relative to classical programming.
  • Lack of datasets and benchmarks to evaluate quantum ML in practical domains.

While meaningful quantum advantage for ML remains far off, active research at companies like D-Wave, Rigetti, and IonQ is advancing quantum computer engineering and quantum algorithms. Major technology companies like Google, IBM, and Microsoft are actively exploring quantum computing. Google recently announced a 72-qubit quantum processor called Bristlecone and plans to build a 49-qubit commercial quantum system. Microsoft also has an active research program in topological quantum computing and collaborates with quantum startup IonQ

Quantum techniques may first make inroads for optimization before more generalized ML adoption. Realizing the full potential of quantum ML awaits major milestones in quantum hardware development and ecosystem maturity.

12.10 Conclusion

Specialized hardware acceleration has become indispensable for enabling performant and efficient artificial intelligence applications as models and datasets explode in complexity. In this chapter, we examined the limitations of general-purpose processors like CPUs for AI workloads. Their lack of parallelism and computational throughput cannot train or run state-of-the-art deep neural networks quickly. These motivations have driven innovations in customized accelerators.

We surveyed GPUs, TPUs, FPGAs and ASICs specifically designed for the math-intensive operations inherent to neural networks. By covering this spectrum of options, we aimed to provide a framework for reasoning through accelerator selection based on constraints around flexibility, performance, power, cost, and other factors.

We also explored the role of software in actively enabling and optimizing AI acceleration. This spans programming abstractions, frameworks, compilers and simulators. We discussed hardware-software co-design as a proactive methodology for building more holistic AI systems by closely integrating algorithm innovation and hardware advances.

But there is so much more to come! Exciting frontiers like analog computing, optical neural networks, and quantum machine learning represent active research directions that could unlock orders of magnitude improvements in efficiency, speed, and scale compared to present paradigms.

In the end, specialized hardware acceleration remains indispensable for unlocking the performance and efficiency necessary to fulfill the promise of artificial intelligence from cloud to edge. We hope this chapter actively provided useful background and insights into the rapid innovation occurring in this domain.