Introduction
In the realm of scientific computing and numerical analysis, the computation of derivatives is a fundamental operation. Whether it’s for solving differential equations, performing signal processing, or modeling physical systems, efficient calculation of derivatives is crucial. Traditional CPU implementations can be slow for large-scale problems due to their limited parallel processing capabilities. This is where CUDA (Compute Unified Device Architecture) comes into play.
CUDA is a parallel computing platform and programming model developed by NVIDIA that allows developers to leverage the power of NVIDIA GPUs for general-purpose computing. By utilizing CUDA, we can significantly accelerate the computation of discrete derivatives by parallelizing the workload across thousands of GPU cores.
This blog post will guide you through the implementation of computing the discrete first derivative of a function using CUDA. I’ll cover the mathematical background and the CUDA programming model, and provide a step-by-step example with code to illustrate how to achieve high performance for this task.
Mathematical Background
The discrete first derivative of a continuous function can be approximated using finite differences. One common method is the forward difference formula:

\[ f'(x_i) \approx \frac{f(x_{i+1}) - f(x_i)}{\Delta x} \]
where:
- \(f'(x_i)\) is the approximate derivative at point \(x_i\)
- \(f(x_i)\) and \(f(x_{i+1})\) are the function values at points \(x_i\) and \(x_{i+1}\)
- \(\Delta x\) is the spacing between consecutive discrete points.
This formula provides a first-order approximation of the derivative: its error shrinks linearly with \(\Delta x\), as the Taylor expansion below shows.
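To see where the first-order behavior comes from, expand \(f(x_{i+1})\) in a Taylor series around \(x_i\):

\[ f(x_{i+1}) = f(x_i) + \Delta x\, f'(x_i) + \frac{\Delta x^2}{2} f''(\xi), \quad \xi \in (x_i, x_{i+1}) \]

Rearranging gives

\[ \frac{f(x_{i+1}) - f(x_i)}{\Delta x} = f'(x_i) + \frac{\Delta x}{2} f''(\xi), \]

so the truncation error is proportional to \(\Delta x\).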
The CUDA Programming Model
CUDA lets developers write programs that execute on the GPU by extending the C/C++ language with a small set of keywords. The key concepts in CUDA programming include:
- Kernels: Functions executed on the GPU.
- Threads: Lightweight execution units that run the kernel code.
- Blocks and Grids: Threads are organized into blocks, and blocks are organized into grids. This hierarchy allows for scalable parallelism.
By mapping each data point to a separate thread, we can compute the derivative at each point in parallel, greatly accelerating the computation.
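As a minimal illustration of this mapping (a toy kernel, not part of the derivative example), each thread derives a unique global index from its block and thread coordinates and uses it to address one array element:

```c
__global__
void scale(float *data, float factor, int n)
{
    // Global index: which element this thread owns.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Guard: the grid may contain more threads than elements.
    if (idx < n)
        data[idx] *= factor;
}

// Launch with enough 256-thread blocks to cover all n elements:
// scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
```

The same index-then-guard pattern appears in the derivative kernel below.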
Implementation Steps
1. Defining the Function and Discrete Domain
I’ll consider a simple polynomial function for demonstration purposes:

\[ f(x) = 2x^3 - 5x^2 - 9x + 2 \]
I’ll discretize the domain \([0, 1)\) into 1000 equally spaced points, so the spacing is \(\Delta x = 0.001\).
A polynomial function was deliberately used for demonstration purposes so that we can calculate its first derivative analytically:

\[ f'(x) = 6x^2 - 10x - 9 \]
This will allow me, as shown below, to verify that the discrete calculation performed by the CUDA kernel is correct.
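Before diving into the code, a quick hand check at \(x_0 = 0\) shows what accuracy to expect. The forward difference there is

\[ \frac{f(0.001) - f(0)}{0.001} = \frac{1.990995 - 2}{0.001} \approx -9.005, \]

while the analytic value is \(f'(0) = -9\). The gap of roughly \(0.005\) is exactly the \(\frac{\Delta x}{2} f''\) truncation term from the Taylor expansion above, since \(f''(0) = -10\).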
These are the host-side (CPU) C/C++ implementations of the function and its derivative:
```c
#include <cmath> // powf

// f(x) = 2x^3 - 5x^2 - 9x + 2
float f(float x)
{
    return 2 * powf(x, 3) - 5 * powf(x, 2) - 9 * x + 2;
}

// Analytic derivative: f'(x) = 6x^2 - 10x - 9
float f_prime(float x)
{
    return 6 * powf(x, 2) - 10 * x - 9;
}
```
2. Writing the CUDA Kernel
The CUDA kernel computes the discrete forward difference, with each thread handling one point of the domain. The guard `idx < n - 1` matters because the forward difference at \(x_i\) reads \(f(x_{i+1})\): the last point has no forward neighbor, so only \(n - 1\) derivative values are produced.
```c
__global__
void compute_derivative(float *d_f, float *d_f_prime, float delta_x, int n)
{
    // Global index of the point this thread is responsible for.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // The last point has no forward neighbor, so only n - 1 outputs exist.
    if (idx < n - 1)
    {
        d_f_prime[idx] = (d_f[idx + 1] - d_f[idx]) / delta_x;
    }
}
```
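As an aside, if more accuracy were needed, the interior points could use a second-order central difference instead. This is a sketch of the idea, not part of the original example; the boundary points would still need one-sided formulas:

```c
__global__
void compute_derivative_central(float *d_f, float *d_f_prime, float delta_x, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Central difference (f(x+dx) - f(x-dx)) / (2*dx), valid for interior points.
    if (idx > 0 && idx < n - 1)
    {
        d_f_prime[idx] = (d_f[idx + 1] - d_f[idx - 1]) / (2.0f * delta_x);
    }
}
```

Its truncation error scales with \(\Delta x^2\) rather than \(\Delta x\).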
3. Host Code
The host code allocates memory, initializes the data, launches the kernel, retrieves the results, and compares them with the analytically expected values.
```c
#include <iostream>
#include <cmath> // fabs
#include <cuda_runtime.h>

int main()
{
    int n = 1000;
    float delta_x = 0.001f;

    // Host buffers: n samples of f, n - 1 forward differences.
    float *h_f = new float[n];
    float *h_f_prime = new float[n - 1];

    // Sample f on the discrete domain x_i = i * delta_x.
    for (int i = 0; i < n; i++)
    {
        float x = i * delta_x;
        h_f[i] = f(x);
    }

    // Device buffers and host-to-device copy of the samples.
    float *d_f;
    float *d_f_prime;
    cudaMalloc(&d_f, n * sizeof(float));
    cudaMalloc(&d_f_prime, (n - 1) * sizeof(float));
    cudaMemcpy(d_f, h_f, n * sizeof(float), cudaMemcpyHostToDevice);

    // One thread per point, rounded up to whole blocks of 256 threads.
    int blockSize = 256;
    int gridSize = (n + blockSize - 1) / blockSize;
    compute_derivative<<<gridSize, blockSize>>>(d_f, d_f_prime, delta_x, n);

    // Copy the results back (cudaMemcpy also synchronizes with the kernel).
    cudaMemcpy(h_f_prime, d_f_prime, (n - 1) * sizeof(float), cudaMemcpyDeviceToHost);

    // Compare against the analytic derivative and track the worst deviation.
    float max_error = 0;
    for (int i = 0; i < n - 1; i++)
    {
        float x = i * delta_x;
        float error = fabs(f_prime(x) - h_f_prime[i]);
        if (error > max_error)
            max_error = error;
    }
    std::cout << "Max error: " << max_error << std::endl;

    cudaFree(d_f);
    cudaFree(d_f_prime);
    delete[] h_f;
    delete[] h_f_prime;
    return 0;
}
```
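For brevity, the example above omits error checking. In real code you would typically check the return status of every CUDA runtime call and the kernel launch; here is a minimal sketch of one common way to do this (the `CHECK_CUDA` macro name is my own, not from the original code):

```c
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CHECK_CUDA(call)                                               \
    do {                                                               \
        cudaError_t err = (call);                                      \
        if (err != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error at %s:%d: %s\n",          \
                         __FILE__, __LINE__, cudaGetErrorString(err)); \
            std::exit(EXIT_FAILURE);                                   \
        }                                                              \
    } while (0)

// Usage:
//   CHECK_CUDA(cudaMalloc(&d_f, n * sizeof(float)));
//   compute_derivative<<<gridSize, blockSize>>>(d_f, d_f_prime, delta_x, n);
//   CHECK_CUDA(cudaGetLastError()); // catches launch-configuration errors
```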
4. Compilation and Execution
To compile the code, use the following command:

```
nvcc compute_first_derivative.cu -o compute_first_derivative
```
Alternatively, you can run the build.sh shell script.
Note: Ensure that you have the NVIDIA CUDA Toolkit installed and a compatible GPU to run CUDA programs.
To run the compiled program, execute:

```
./compute_first_derivative
```
Alternatively, you can run the ./run.sh shell script.
5. Expected Output
The output compares the computed derivative with the exact derivative and shows the maximum error:

```
Max error: 0.00508976
```
This error is small and comes from the discretization itself: the forward difference is only a first-order approximation. As the estimate below shows, the observed value matches the theoretical truncation error, so we can conclude that the first derivative calculated by the CUDA kernel is correct.
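From the Taylor analysis above, the truncation error at \(x_i\) is roughly \(\frac{\Delta x}{2} |f''(x_i)|\). With \(f''(x) = 12x - 10\), the largest magnitude on the domain is \(|f''(0)| = 10\), which gives

\[ \frac{0.001}{2} \times 10 = 0.005, \]

in close agreement with the measured 0.00508976; the small remainder is attributable to single-precision rounding.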
Code Repository
You can find the complete code in my GitHub repository.
