Programming model of GPUs using CUDA
Types of Parallelism
Task Parallelism: different operations performed on the same or different data
- Example: a text editor with a spellchecker running on top: different operations done at the same time on the same data.
- usually has a modest number of tasks, unleashing a modest amount of parallelism
- in a single application there are only so many different kinds of operations you would run at the same time
Data Parallelism: same operations performed on different data
- Example: rendering a screen, where each pixel runs the same operations on different data.
- potentially massive amounts of data unleashing massive amounts of parallelism
- taking the same program and running it on a larger dataset, like rendering an image on a higher resolution screen
- most suitable for GPUs
System Organization
CPU: the host; its RAM (main memory) is called host memory
GPU: the device; its memory (device memory) is called global memory
The CPU and the GPU have separate memories and cannot access each other's memory.
We can use PCIe to transfer data between the two; another interconnect, specific
to NVIDIA, is NVLink.
Steps:
- allocate GPU memory
- copy data from CPU memory to GPU memory
- perform computation on GPU
- copy data from GPU memory to CPU memory
- deallocate GPU memory
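A minimal sketch of these five steps on the host, assuming host arrays a, b, c of n floats already exist and a kernel named vecAdd (sketched later in the vector addition section); error checking is omitted for brevity:

int n = 1 << 20;                        // number of elements (assumption for the sketch)
size_t size = n * sizeof(float);
float *d_a, *d_b, *d_c;                 // device pointers
// 1. allocate GPU (device) memory
cudaMalloc(&d_a, size);
cudaMalloc(&d_b, size);
cudaMalloc(&d_c, size);
// 2. copy input data from CPU (host) memory to GPU memory
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
// 3. perform the computation on the GPU (launch configuration and kernel shown later)
vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
// 4. copy the result from GPU memory back to CPU memory
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
// 5. deallocate GPU memory
cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);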
Vector addition
sequential code:
for (int i = 0; i < arraySize; i++) {
c[i] = a[i] + b[i];
}
To make this operation faster on the GPU, we can assign a thread per vector element so we can run all the computation in parallel instead of sequentially.
grid is an array of threads
block is a group of threads inside a grid
[look up] threads in the same block can collaborate in ways that threads in different blocks cannot
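As a rough illustration of what that collaboration looks like (this example is an assumption, not from the notes): threads in the same block can share data through __shared__ memory and synchronize with __syncthreads(), which is not possible across blocks. The sketch below assumes the kernel is launched with 256 threads per block:

__global__ void blockSum(const float *in, float *out) {
    __shared__ float tile[256];               // visible only to threads in this block
    tile[threadIdx.x] = in[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();                          // wait for every thread in the block
    if (threadIdx.x == 0) {
        float s = 0.0f;
        for (int i = 0; i < blockDim.x; i++) s += tile[i];
        out[blockIdx.x] = s;                  // one partial result per block
    }
}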
Launch a grid
- threads in the same grid execute the same function known as a kernel
- grid can be launched by calling a kernel and configuring it with appropriate block and grid sizes
[look up] Current maximum number of threads per block in modern GPUs is 1024
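A sketch of the launch syntax, using the same assumed names (vecAdd, n, d_a, d_b, d_c) as the host-side sketch above: the kernel name is followed by <<<number of blocks, threads per block>>>, where threads per block must not exceed the 1024 limit noted above.

int threadsPerBlock = 256;                                        // must not exceed 1024
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;  // enough blocks to cover all n elements
vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);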
keywords:
- gridDim.x: number of blocks in the grid
- blockIdx.x: position of the block in the grid along the x-axis
- blockDim.x: number of threads in the block
- threadIdx.x: position of the thread in the block along the x-axis
To find the global index of a thread based on all the threads in the grid
you do the following:
blockDim.x * blockIdx.x + threadIdx.x
This is very important to remember.
In the case of vector additions, we need to find the global index of a thread because we want to assign each thread to an element of the vector by matching their index.
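A minimal sketch of such a kernel (the name vecAdd and the parameter names are assumptions): each thread computes its global index and, if that index is within bounds, adds the corresponding elements.

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    // global index of this thread across the whole grid
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    // the grid may contain more threads than elements, so guard the access
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}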
Compilation with NVCC
NVCC splits the code into 2 parts. Host C/C++ compilation:
- host C/C++ code is passed into the host C/C++ compiler
- host assembly is generated
- host assembly (e.g. x86, ARM)
- host assembly is run on the CPU
CUDA compilation:
- device CUDA code is compiled into PTX (.ptx), a virtual ISA
- the device just-in-time (JIT) compiler takes the PTX and compiles it to device assembly
- this happens at runtime, when we run the program
- device assembly (e.g. SASS)
- device assembly is run on the GPU
When the program reaches a kernel call, the CPU launches the corresponding device assembly code on the GPU.
Kernel launches are asynchronous: the CPU does not wait for the kernel to finish unless it explicitly synchronizes.
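A short sketch of what this means in practice (reusing the assumed names from above): the launch returns immediately, so the host must synchronize, or issue a blocking call such as cudaMemcpy, before it can rely on the kernel's results.

vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);  // returns immediately
// ... the CPU can do other work here while the GPU runs the kernel ...
cudaDeviceSynchronize();  // block the CPU until all previously launched kernels finish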
What is CUDA?
The way graphics works is very similar to the way other domains work, but with a different emphasis. CUDA started with supercomputing but is now heavily used in the AI world as well.
CUDA is written in C and started out as a language and a compiler. It is now much more than just a language; it has a lot of different pieces.
It is essentially an abstraction: for example, a Python function that does image processing may run CUDA in the background.
CUDA sees the CPU and the GPU as one -> heterogeneous computing (using more than one type of processor, CPU and GPU)
example program:
- Load the config file in the CPU
- Fetch data from an API over the internet on the CPU
- Do some image processing on the GPU
We as programmers are able to tell CUDA where the instructions go; on the surface it will look like one program.
CUDA allows communication between the CPU and GPU over the PCIe bus. NVIDIA also offers Confidential Computing, which provides a fully encrypted channel for that communication over PCIe.
CUDA is like an interpreter: it takes the commands you give it and turns them into the command streams that the hardware needs.
Originally GPUs were made to push pixels onto a screen. There are deep pipelines for graphics processing, but it turns out that pipelines for pixels are much the same as pipelines for matrix operations in AI and other domains.
- compilers that create the binary files
- runtimes that control the hardware and dispatch the binaries to it
- assembly language to actually run the program
CUDA can take in many different types of software and interacts with many different types of hardware.
It then makes the software think there is only one type of hardware, and makes the hardware think there is only one type of software.