C++ | Jun Kwan

This post is based on this self-paced course offered by NVIDIA’s Deep Learning Institute. NVIDIA’s stdpar lets you offload standard C++20 algorithms onto the GPU without CUDA kernels or new syntax. To show how this works in practice, we’ll use DAXPY, which is a simple but memory-intensive linear algebra operation, and see how a few small code changes take it from a single-threaded CPU loop to full GPU execution.

DAXPY as a Bandwidth Benchmark

DAXPY stands for Double-precision AX Plus Y, and it computes the following equation. ...
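
For orientation, here is a minimal sketch of the operation in plain C++, assuming the conventional BLAS definition of DAXPY, where each element is updated as y[i] = a * x[i] + y[i]. The names `daxpy_loop` and `daxpy_transform` are hypothetical and chosen for illustration; this is not the course's code. The first function is the single-threaded loop, and the second expresses the same update as a standard algorithm call, which is the form a parallel execution policy can be attached to.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Baseline: single-threaded loop computing y[i] = a * x[i] + y[i].
void daxpy_loop(double a, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < y.size(); ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// The same update written as a standard algorithm call.
// Assumption: with a stdpar-capable compiler, passing an execution policy
// such as std::execution::par_unseq (from <execution>) as the first argument
// is what allows the call to run in parallel. The policy is left out here so
// the sketch builds with any C++ compiler and no extra libraries.
void daxpy_transform(double a, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    std::transform(x.begin(), x.end(), y.begin(), y.begin(),
                   [a](double xi, double yi) { return a * xi + yi; });
}
```

Each element costs two 8-byte loads and one 8-byte store for only two floating-point operations (one multiply and one add), which is why the operation tends to be limited by memory bandwidth rather than by arithmetic.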