DGX A100 5 Miracles

The world's largest 7nm chip

The fastest available HBM2 memory

Fabricated on the TSMC 7nm N7 manufacturing process, the NVIDIA Ampere architecture-based GA100 GPU that powers the A100 includes 54.2 billion transistors with a die size of 826 mm².

Those tiny transistors act as electrical gates that switch on and off when performing calculations. The smaller the transistor, the less energy it requires to operate. By moving to the 7nm process, the NVIDIA Ampere architecture delivers highly powerful GPUs that go further than Moore’s law promises.

 

3rd Generation NVLink and NVSwitch

The strongest scaling within a system
and the largest scaling across nodes

Deep learning algorithms are becoming more powerful and more complex as the number of hidden layers processing information grows. This complexity demands computing resources from multiple GPUs. For multiple GPUs to act seamlessly as one accelerator, a powerful interconnect is needed. High-speed GPU-to-GPU interconnection is provided by NVIDIA® NVLink®. NVIDIA® NVSwitch® incorporates multiple NVLink® connections, ensuring all-to-all GPU communication at full NVLink® speed. Together, this interconnect fabric scales performance across multiple GPUs.
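To see this interconnect from software, here is a minimal PyTorch sketch (assuming a machine with CUDA and at least two visible GPUs) that checks whether each pair of GPUs has the direct peer-to-peer access that NVLink and NVSwitch enable on DGX A100:

```python
import torch

def report_peer_access() -> None:
    """Print whether each GPU pair can access the other's memory directly."""
    n = torch.cuda.device_count()
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'available' if ok else 'unavailable'}")

if __name__ == "__main__":
    report_peer_access()
```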

 

NVIDIA NVLink

NVIDIA® A100 PCIe with NVLink® GPU-to-GPU connection

 

NVIDIA® A100 with NVLink® GPU-to-GPU connections

 

 

NVIDIA NVSwitch

For simplicity, the NVSwitch® topology diagram shows the connection of only two GPUs; eight or sixteen GPUs connect all-to-all through NVSwitch® in the same way.

 

 

3rd Generation Tensor Cores

TF32 with Sparsity
Faster, more flexible, and easier to use

 

NVIDIA Tensor Core technology has brought dramatic speedups to AI, reducing training time from weeks to hours and providing massive acceleration to inference. The NVIDIA Ampere architecture provides a huge performance boost and delivers new precisions to cover the full spectrum required by researchers (TF32, FP64, FP16, INT8, and INT4), accelerating and simplifying AI adoption and extending the power of NVIDIA Tensor Cores to HPC.

Tensor Float 32

As AI networks and datasets continue to expand exponentially, their computing appetite has similarly grown. Lower-precision math has brought huge performance speedups, but it has historically required some code changes. A100 brings a new precision, TF32, which works just like FP32 while delivering speedups of up to 20X for AI, without requiring any code change.

NVIDIA V100 FP32 vs. NVIDIA A100 Tensor Core TF32 with Sparsity
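In practice, frameworks expose TF32 as a simple switch. A minimal PyTorch sketch, assuming PyTorch 1.7 or later on an A100, shows the flags that route ordinary FP32 matrix multiplies and convolutions through the TF32 Tensor Cores while the model code stays plain FP32:

```python
import torch

# Allow FP32 matrix multiplies and cuDNN convolutions to run in TF32 on
# Ampere Tensor Cores; tensors and model code stay plain FP32 throughout.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

a = torch.randn(4096, 4096, device="cuda")   # ordinary FP32 tensors
b = torch.randn(4096, 4096, device="cuda")
c = a @ b                                    # executed with TF32 Tensor Core math
print(c.dtype)                               # still torch.float32
```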

 

FP64 Tensor Cores

A100 brings the power of Tensor Cores to HPC, providing the biggest milestone since the introduction of double-precision GPU computing for HPC. By enabling matrix operations in FP64 precision, a whole range of HPC applications that need double-precision math can now get a 2.5X boost in performance and efficiency compared to prior generations of GPUs.

NVIDIA V100 FP64 vs. NVIDIA A100 Tensor Core FP64
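A minimal sketch, assuming PyTorch on an A100: double-precision code stays unchanged, and the underlying FP64 matrix multiply is dispatched to the FP64 Tensor Cores by the math libraries underneath.

```python
import torch

# Ordinary double-precision tensors; on A100 the underlying DGEMM is executed
# with FP64 Tensor Core math, so no source change is needed for the speedup.
a = torch.randn(2048, 2048, dtype=torch.float64, device="cuda")
b = torch.randn(2048, 2048, dtype=torch.float64, device="cuda")
c = a @ b
print(c.dtype)   # torch.float64, full double precision is preserved
```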

 

FP16 Tensor Cores

A100 Tensor Cores enhance FP16 for deep learning, providing a 2X speedup compared to NVIDIA Volta™ for AI. This dramatically boosts throughput and cuts time to convergence.

NVIDIA V100 Tensor Core FP16 vs. NVIDIA A100 Tensor Core FP16 with Sparsity
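A minimal PyTorch sketch of mixed-precision training, assuming a CUDA-capable GPU: autocast runs eligible operations in FP16 on the Tensor Cores, while the gradient scaler guards against FP16 underflow.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()        # rescales the loss to avoid FP16 underflow

data = torch.randn(64, 1024, device="cuda")
target = torch.randn(64, 1024, device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():             # eligible ops run in FP16 on Tensor Cores
    loss = torch.nn.functional.mse_loss(model(data), target)

scaler.scale(loss).backward()               # backward pass on the scaled loss
scaler.step(optimizer)                      # unscales gradients, then steps
scaler.update()
```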

 

INT8 Precision

First introduced in NVIDIA Turing™, INT8 Tensor Cores dramatically accelerate inference throughput and deliver huge boosts in efficiency. INT8 in the NVIDIA Ampere architecture delivers 10X the comparable throughput of Volta for production deployments. This versatility enables industry-leading performance for both high-batch and real-time workloads in core and edge data centers.

NVIDIA V100 INT8 vs. NVIDIA A100 Tensor Core INT8 with Sparsity
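As a rough illustration of what INT8 precision means, the NumPy sketch below shows symmetric scale-and-round quantization of a weight tensor; production INT8 inference would go through TensorRT or a framework quantizer rather than hand-written code.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric quantization: map the observed range of x onto [-127, 127]."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(weights)
error = np.abs(weights - dequantize_int8(q, scale)).max()
print(q.dtype, error)   # int8, with a small quantization error
```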

 

Sparsity Acceleration

2X more performance compared
to dense matrix operations

In network science, a sparse network has far fewer links than the maximum possible number of links within that network (the opposite is a dense network). The study of sparse networks is a relatively new area, primarily stimulated by the study of real networks such as social and computer networks. (Wikipedia, 2020)

For years, researchers have been playing a kind of Jenga with numbers in their efforts to accelerate AI using sparsity. They try to pull out of a neural network as many unneeded parameters as possible without unraveling AI’s uncanny accuracy. The goal is to reduce the mounds of matrix multiplication deep learning requires, shortening the time to good results.

The NVIDIA Ampere architecture introduces third-generation Tensor Cores in NVIDIA A100 GPUs that take advantage of fine-grained sparsity in network weights. They offer up to 2X the maximum throughput of dense math without sacrificing the accuracy of the matrix multiply-accumulate operations at the heart of deep learning.

 

 

The NVIDIA Ampere architecture takes advantage of the prevalence of small values in neural networks in a way that benefits the widest possible swath of AI applications. Specifically, it defines a method for training a neural network with half its weights removed, or what’s known as 50 percent sparsity.
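The pattern the A100's Sparse Tensor Cores accelerate is 2:4 fine-grained structured sparsity: at most two nonzero values in every group of four weights. A minimal NumPy sketch of this pruning step, keeping the two largest-magnitude weights in each group of four:

```python
import numpy as np

def prune_2_of_4(weights: np.ndarray) -> np.ndarray:
    """Zero the two smallest-magnitude weights in every group of four."""
    groups = weights.reshape(-1, 4).copy()               # assumes size divisible by 4
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]     # two smallest per group
    np.put_along_axis(groups, drop, 0.0, axis=1)
    return groups.reshape(weights.shape)

w = np.random.randn(8, 8).astype(np.float32)
w_sparse = prune_2_of_4(w)
print((w_sparse == 0).mean())   # ~0.5, i.e. 50 percent of the weights are zero
```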

 

 

New Multi-Instance GPU

Divide your GPU into smaller instances

The new Multi-Instance GPU (MIG) feature allows the NVIDIA A100 GPU to be securely partitioned into up to seven separate GPU Instances for CUDA applications, providing multiple users with separate GPU resources for optimal GPU utilization. This feature is particularly beneficial for workloads that do not fully saturate the GPU’s compute capacity, allowing users to run different workloads in parallel to maximize utilization.
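A minimal sketch of the partitioning workflow, assuming administrator privileges on a machine with an A100 and a MIG-capable driver. It simply wraps the standard nvidia-smi MIG commands; the exact profile names and counts should be taken from the -lgip listing on the target system.

```python
import subprocess

def run(cmd):
    """Echo and execute an nvidia-smi command, failing loudly on error."""
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["nvidia-smi", "-i", "0", "-mig", "1"])      # enable MIG mode on GPU 0 (may require a GPU reset)
run(["nvidia-smi", "mig", "-lgip"])              # list the GPU instance profiles on this GPU
# Create seven 1g.5gb GPU instances and their default compute instances ("-C").
run(["nvidia-smi", "mig", "-cgi", ",".join(["1g.5gb"] * 7), "-C"])
run(["nvidia-smi", "-L"])                        # the MIG devices now appear alongside the GPU
```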

For Cloud Service Providers (CSPs), who have multi-tenant use cases, MIG ensures one client cannot impact the work or scheduling of other clients, in addition to providing enhanced isolation for customers.

With MIG, each instance’s processors have separate and isolated paths through the entire memory system: the on-chip crossbar ports, L2 cache banks, memory controllers, and DRAM address buses are all assigned uniquely to an individual instance. This ensures that an individual user’s workload can run with predictable throughput and latency, with the same L2 cache allocation and DRAM bandwidth, even if other tasks are thrashing their own caches or saturating their DRAM interfaces. MIG can partition available GPU compute resources (including streaming multiprocessors, or SMs, and GPU engines such as copy engines or decoders) to provide a defined quality of service (QoS) with fault isolation for different clients such as VMs, containers, or processes. MIG enables multiple GPU Instances to run in parallel on a single, physical A100 GPU.

With the NVIDIA A100 GPU, users can see and schedule jobs on their new virtual GPU Instances as if they were physical GPUs. MIG works with Linux operating systems and supports containers using Docker Engine, with support for Kubernetes and for virtual machines using hypervisors such as Red Hat Virtualization and VMware vSphere coming soon.
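A minimal sketch, assuming the nvidia-ml-py (pynvml) bindings and a MIG-enabled A100: it enumerates the MIG devices on each physical GPU, the kind of inventory a scheduler would use before pinning a job to one instance (for example via CUDA_VISIBLE_DEVICES set to the MIG device’s UUID).

```python
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        gpu = pynvml.nvmlDeviceGetHandleByIndex(i)
        current, _pending = pynvml.nvmlDeviceGetMigMode(gpu)
        if current != pynvml.NVML_DEVICE_MIG_ENABLE:
            continue                                     # skip GPUs without MIG enabled
        for m in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)):
            try:
                mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, m)
            except pynvml.NVMLError:
                continue                                 # this MIG slot is not populated
            mem = pynvml.nvmlDeviceGetMemoryInfo(mig)
            print(f"GPU {i} / MIG {m}: {pynvml.nvmlDeviceGetUUID(mig)}, "
                  f"{mem.total // 2**20} MiB")
finally:
    pynvml.nvmlShutdown()
```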