Introduction
In scientific computing and engineering, calculating derivatives is a fundamental operation. When dealing with functions of multiple variables, computing partial derivatives becomes essential for gradient computations, optimization algorithms, and solving partial differential equations. Traditional CPU implementations can be time-consuming for large datasets due to their sequential nature. This is where CUDA (Compute Unified Device Architecture) comes into play, allowing us to harness the parallel processing power of NVIDIA GPUs.
In this blog post, we'll explore how to compute the partial derivatives of a two-variable function using CUDA. We'll focus on a simplified kernel that does not handle edge cases, accepting that the derivatives are only defined on an (nx − 1) × (ny − 1) interior subgrid: the forward difference needs a neighbor to the right (in x) and in the next row (in y), which doesn't exist on the last column and row. This approach simplifies the code and can lead to performance improvements.
Mathematical Background
Consider a function f(x,y) of two variables. The partial derivatives with respect to x and y can be approximated using the forward difference method:
- Partial derivative with respect to x: \(\frac{\partial f}{\partial x} \approx \frac{f(x + \Delta x, y) - f(x, y)}{\Delta x}\)
- Partial derivative with respect to y: \(\frac{\partial f}{\partial y} \approx \frac{f(x, y + \Delta y) - f(x, y)}{\Delta y}\)
By discretizing the function over a grid and applying these formulas, we can compute the derivatives at each grid point.
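Why this is only first-order accurate: a Taylor expansion of \(f\) in \(x\) gives
\[ f(x + \Delta x, y) = f(x, y) + \Delta x \, \frac{\partial f}{\partial x} + O(\Delta x^2) \;\Rightarrow\; \frac{f(x + \Delta x, y) - f(x, y)}{\Delta x} = \frac{\partial f}{\partial x} + O(\Delta x) \]
so the error at each point shrinks linearly with the step size; the same holds for \(\Delta y\).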
Implementation Steps
1. Defining the Function and Discrete Domain
I'll consider a simple two-variable polynomial function for demonstration purposes: \(f(x, y) = 2 x^3 y - 5 x^2 y^2 - 9 x y^3\).
I'll discretize the domain as follows:
- x: from 0 to 2 with step 0.001
- y: from 0 to 0.5 with step 0.001
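With \(x_i = i \cdot 0.001\) for \(i = 0, \dots, n_x - 1\) (and similarly for y), the grid has \(n_x = 2 / 0.001 = 2000\) points in x and \(n_y = 0.5 / 0.001 = 500\) points in y, the right endpoints being excluded; these match the constants used in the host code below.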
The analytical partial derivatives are:
- With respect to x: \(\frac{\partial f}{\partial x} = 6 x^2 y - 10 x y^2 - 9 y^3\)
- With respect to y: \(\frac{\partial f}{\partial y} = 2 x^3 - 10 x^2 y - 27 x y^2\)
Our goal is to approximate these derivatives numerically using CUDA.
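The host code below samples f on the grid and validates the results against the analytical derivatives, so here is a minimal sketch of those three formulas as plain C++ helpers (named f, f_prime_x and f_prime_y, as the host code expects):

// The test function and its analytical partial derivatives.
inline float f(float x, float y)
{
    return 2.0f * x * x * x * y - 5.0f * x * x * y * y - 9.0f * x * y * y * y;
}

inline float f_prime_x(float x, float y)
{
    return 6.0f * x * x * y - 10.0f * x * y * y - 9.0f * y * y * y;
}

inline float f_prime_y(float x, float y)
{
    return 2.0f * x * x * x - 10.0f * x * x * y - 27.0f * x * y * y;
}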
2. Writing the CUDA Kernel
The CUDA kernel computes the discrete partial derivatives, with each thread handling one grid point. Threads on the last column or row do nothing, since their forward neighbors would fall outside the grid.
__global__
void compute_partial_derivatives(float *d_f, float *d_f_prime_x, float *d_f_prime_y, float delta_x, float delta_y, int nx, int ny)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;

    // Skip the last column and row: the forward difference needs a
    // neighbor at (idx + 1) and (idy + 1), which would fall off the grid.
    if (idx < nx - 1 && idy < ny - 1)
    {
        int index = idy * nx + idx;

        // Forward difference in x: (f[i+1, j] - f[i, j]) / delta_x
        int idx_x_forward = idy * nx + (idx + 1);
        d_f_prime_x[index] = (d_f[idx_x_forward] - d_f[index]) / delta_x;

        // Forward difference in y: (f[i, j+1] - f[i, j]) / delta_y
        int idx_y_forward = (idy + 1) * nx + idx;
        d_f_prime_y[index] = (d_f[idx_y_forward] - d_f[index]) / delta_y;
    }
}
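To make the row-major indexing concrete: with nx = 2000 (as in the host code below), the thread at (idx = 3, idy = 2) writes to index = 2 · 2000 + 3 = 4003; its x-forward neighbor sits at 4004 and its y-forward neighbor at 3 · 2000 + 3 = 6003.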
3. Host Code
The host code allocates memory, initializes the data, launches the kernel, retrieves the results, and compares them with the analytical derivatives.
#include <iostream>
#include <cmath>
#include <cuda_runtime.h>

// f, f_prime_x and f_prime_y are the helper functions defined above.

int main()
{
    int nx = 2000;
    int ny = 500;
    float delta_x = 0.001f;
    float delta_y = 0.001f;
    int num_elements = nx * ny;

    float *h_f = new float[num_elements];
    // Same size as the input grid; the last row and column stay unused,
    // since the forward difference is not defined there.
    float *h_f_prime_x = new float[num_elements];
    float *h_f_prime_y = new float[num_elements];
    // Sample f on the grid (row-major: row j holds all x values for one y).
    for (int j = 0; j < ny; ++j)
    {
        float y = j * delta_y;
        for (int i = 0; i < nx; ++i)
        {
            float x = i * delta_x;
            int idx = j * nx + i;
            h_f[idx] = f(x, y);
        }
    }
    // Allocate device buffers and upload the sampled function.
    float *d_f;
    float *d_f_prime_x;
    float *d_f_prime_y;
    cudaMalloc(&d_f, num_elements * sizeof(float));
    cudaMalloc(&d_f_prime_x, num_elements * sizeof(float));
    cudaMalloc(&d_f_prime_y, num_elements * sizeof(float));
    cudaMemcpy(d_f, h_f, num_elements * sizeof(float), cudaMemcpyHostToDevice);

    // One thread per grid point; round the grid size up so the whole
    // domain is covered.
    dim3 blockSize(32, 32);
    dim3 gridSize((nx + blockSize.x - 1) / blockSize.x,
                  (ny + blockSize.y - 1) / blockSize.y);
    compute_partial_derivatives<<<gridSize, blockSize>>>(d_f, d_f_prime_x, d_f_prime_y, delta_x, delta_y, nx, ny);
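    // (Optional) check that the launch succeeded; launch errors are
    // otherwise silent and would only surface as garbage results.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        std::cerr << "Kernel launch failed: " << cudaGetErrorString(err) << std::endl;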
    // Copy the results back (cudaMemcpy synchronizes with the kernel).
    cudaMemcpy(h_f_prime_x, d_f_prime_x, num_elements * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_f_prime_y, d_f_prime_y, num_elements * sizeof(float), cudaMemcpyDeviceToHost);
    // Compare against the analytical derivatives on the interior points.
    float max_error = 0.0f;
    for (int j = 0; j < ny - 1; ++j)
    {
        float y = j * delta_y;
        for (int i = 0; i < nx - 1; ++i)
        {
            float x = i * delta_x;
            int idx = j * nx + i;
            float error_x = std::fabs(f_prime_x(x, y) - h_f_prime_x[idx]);
            if (error_x > max_error)
                max_error = error_x;
            float error_y = std::fabs(f_prime_y(x, y) - h_f_prime_y[idx]);
            if (error_y > max_error)
                max_error = error_y;
        }
    }
    std::cout << "Max error: " << max_error << std::endl;
    // Release device and host memory.
    cudaFree(d_f);
    cudaFree(d_f_prime_x);
    cudaFree(d_f_prime_y);
    delete[] h_f;
    delete[] h_f_prime_x;
    delete[] h_f_prime_y;
    return 0;
}
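As a rough sanity check on the printed value: the forward difference has truncation error of about \(\frac{\Delta}{2} \max |f''|\). On this domain the largest second derivative is \(\frac{\partial^2 f}{\partial y^2} = -10 x^2 - 54 x y\), whose magnitude peaks near \(x = 2, y = 0.5\) at roughly \(|-40 - 54| = 94\), so the max error should be on the order of \(0.0005 \times 94 \approx 0.05\). If the program prints something wildly larger, suspect an indexing bug.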
4. Compilation and Execution
To compile and run the code, use the following commands:
nvcc compute_first_partial_derivatives.cu -o compute_first_partial_derivatives
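Depending on your GPU, you may also want to pass an architecture flag explicitly, for example (assuming a compute capability 7.0 card; adjust sm_70 to match your hardware):
nvcc -arch=sm_70 compute_first_partial_derivatives.cu -o compute_first_partial_derivatives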
Alternatively, you can run the build.sh shell script.
Note: Ensure that you have the NVIDIA CUDA Toolkit installed and a compatible GPU to run CUDA programs.
To run the compiled program:
./compute_first_partial_derivatives
Alternatively, you can run the run.sh shell script.
Code Repository
You can find the complete code in my GitHub repository.
