Ettore Messina’s
Tech Blog

Numerical Integration with CUDA: Accelerating Calculations on GPU


Introduction

Integrating a function can be a computationally intensive task, especially when a high number of subintervals is needed for good accuracy. Leveraging GPU parallelism with CUDA allows us to significantly accelerate the numerical integration process. Today, I’ll walk you through a simple CUDA program that computes the integral of a function numerically and compares the result with the one obtained analytically.

 

Mathematical Background

The technique used to implement the numerical integration is the trapezoidal rule. It works by dividing the area under the curve into a series of trapezoids rather than rectangles, which provides a more accurate approximation, especially for functions that are not linear. In this method, the function values at the endpoints of each subinterval are averaged, and the area of each trapezoid is calculated. By summing up the areas of all the trapezoids, we obtain an estimate of the total integral. The trapezoidal rule is particularly useful when the function is smooth and continuous, as it tends to converge to the exact value more quickly than the simpler rectangle (Riemann sum) rule.
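In formulas: with \(n\) subintervals of width \(\Delta x = (b - a) / n\) and nodes \(x_i = a + i \, \Delta x\), the composite trapezoidal rule reads:

\(\int_a^b f(x) \, dx \approx \sum_{i=0}^{n-1} \frac{f(x_i) + f(x_{i+1})}{2} \, \Delta x\)

Each term of the sum is the area of one trapezoid; this is exactly the quantity the GPU threads will accumulate in parallel.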

 

Defining the Function and Discrete Domain

I’ll consider a simple polynomial function for demonstration purposes:

\(f(x) = x^2 + 4 x + 2\)

I’ll discretize the domain for x from 0 to 2 in 1,000,000 intervals.

The analytic antiderivative of f(x) is:

\(F(x) = \frac{1}{3} x^3 + 2 x^2 + 2 x + C\)

where C is an arbitrary constant.
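Evaluating the antiderivative at the integration bounds gives the exact value that the program will use as a benchmark:

\(\int_0^2 (x^2 + 4x + 2) \, dx = F(2) - F(0) = \frac{8}{3} + 8 + 4 = \frac{44}{3} \approx 14.6667\)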

 

Overview of the Code

Below is the code:

#include <iostream>
#include <cmath>

__device__ float f(float x)
{
    // Integrand: f(x) = x^2 + 4x + 2
    return x * x + 4.0f * x + 2.0f;
}

// Analytic antiderivative F(x) (with C = 0), used on the host for verification.
float F(float x)
{
    return (1.0f / 3.0f) * x * x * x + 2.0f * x * x + 2.0f * x;
}

__global__ void integrate(float a, float b, float *result, int n)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    int stride = blockDim.x * gridDim.x;

    float dx = (b - a) / n;
    float local_sum = 0.0f;

    // Grid-stride loop: each thread sums the trapezoids of its subset of subintervals.
    for (int i = idx; i < n; i += stride)
    {
        float x = a + i * dx;
        local_sum += 0.5f * (f(x) + f(x + dx)) * dx;
    }

    // Accumulate the per-thread partial sums into the global result without races.
    atomicAdd(result, local_sum);
}

int main()
{
    float a = 0.0f;
    float b = 2.0f;
    int n = 1000000;

    float h_result = 0.0f;
    float *d_result;

    cudaMalloc(&d_result, sizeof(float));
    cudaMemcpy(d_result, &h_result, sizeof(float), cudaMemcpyHostToDevice);

    int blockSize = 256;
    int numBlocks = (n + blockSize - 1) / blockSize;

    integrate<<<numBlocks, blockSize>>>(a, b, d_result, n);

    cudaMemcpy(&h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_result);

    std::cout << "The numerically computed integral result is : " << h_result << std::endl;
    std::cout << "The analytically computed integral result is: " << F(b) - F(a) << std::endl;

    return 0;
}
 

Here’s a breakdown of what the code does:

  1. Define the Function to Integrate: The function f(x) is defined in the device code using the __device__ specifier so that it can be called by GPU threads.
  2. CUDA Kernel for Integration: The kernel function integrate() performs numerical integration using the trapezoidal rule. Each thread contributes to the total sum by calculating the function value at given points and accumulating it locally. To avoid race conditions, atomicAdd is used to add the local sums to the global result.
  3. Host Code: The main() function allocates memory on the GPU, sets up the kernel execution configuration, and copies the results back to the host after computation. It also computes the integral analytically using the antiderivative of f(x) for verification.
 
The Kernel Function

The kernel function integrate() divides the integration range among the threads of the grid: each thread computes the areas of its subset of trapezoids in a grid-stride loop, and atomicAdd() is then used to safely accumulate the per-thread partial sums into the shared result variable.
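One atomicAdd() per thread is simple but can become a contention point at large thread counts. A common alternative is a block-level reduction in shared memory followed by a single atomicAdd() per block. The following is only a sketch of that pattern, assuming the same __device__ function f and a power-of-two block size:

```cuda
// Sketch: reduce per-thread sums within each block, then one atomic per block.
__global__ void integrate_reduced(float a, float b, float *result, int n)
{
    extern __shared__ float partial[]; // one slot per thread, sized at launch

    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    int stride = blockDim.x * gridDim.x;
    float dx = (b - a) / n;

    float local_sum = 0.0f;
    for (int i = idx; i < n; i += stride)
    {
        float x = a + i * dx;
        local_sum += 0.5f * (f(x) + f(x + dx)) * dx;
    }

    partial[threadIdx.x] = local_sum;
    __syncthreads();

    // Tree reduction within the block (requires blockDim.x to be a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        atomicAdd(result, partial[0]); // one atomic per block instead of per thread
}
```

The launch would pass the dynamic shared-memory size as the third configuration parameter, e.g. integrate_reduced<<<numBlocks, blockSize, blockSize * sizeof(float)>>>(a, b, d_result, n). For this small example the difference is negligible, but the pattern scales much better when many more threads contend for the same result.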

 
Host Code

In the main() function, we allocate memory on the GPU, launch the kernel, and finally compare the result with the analytical solution.

 
Results
  • Numerical Integration: The GPU computes the integral using parallelism, making the process fast and efficient, especially with a large number of intervals (n = 1,000,000 in this case).
  • Analytical Integration: The analytical result is calculated using the antiderivative of the function, which serves as a benchmark to check the accuracy of the numerical integration.
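A quick error estimate puts these two results in perspective: the truncation error of the trapezoidal rule is bounded by \(\frac{b - a}{12} \, \Delta x^2 \max_{\xi} |f''(\xi)|\). Here \(f''(x) = 2\), \(b - a = 2\) and \(\Delta x = 2 \times 10^{-6}\), so the truncation error is roughly \(1.3 \times 10^{-12}\), far below single-precision resolution. In practice, the discrepancy you observe is dominated by float rounding in the accumulation, not by the method.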
 
Key Takeaways
  • Parallelism: GPU threads allow us to break down the integration task into smaller parts, speeding up the calculation significantly compared to CPU-based approaches.
  • Atomic Operations: When accumulating results from multiple threads, atomic operations are essential to avoid race conditions and ensure accurate results.

Using CUDA for numerical integration is a great way to leverage the power of GPUs for computationally intensive tasks. This example demonstrates the potential for GPU acceleration in scientific computations and provides a foundation for more complex problems.

By utilizing CUDA and the GPU, you can efficiently compute numerical integrals with high accuracy. This approach can be applied to more complex functions or extended to multiple dimensions. If you are looking to boost the performance of your numerical computations, learning CUDA can be a game-changer.

 
Compilation and Execution

To compile and run the code, use the following commands:

nvcc compute_integral.cu -o compute_integral

Alternatively, you can run the build.sh shell script.

Note: Ensure that you have the NVIDIA CUDA Toolkit installed and a compatible GPU to run CUDA programs.

 

To run the compiled program:

./compute_integral

Alternatively, you can run the ./run.sh shell script.

 

Code Repository

You can find the complete code in my GitHub repository.

 

Related Reads

High-Performance Computing for Two-Variable Partial Derivatives with CUDA
Oct 03 2024

An example of using the CUDA toolkit to compute the discrete first partial derivatives of a two-variable function.

High-Performance CUDA for Discrete Single Variable Derivative Computation
Sep 14 2024

An example of using the CUDA toolkit to compute the discrete first derivative of a single-variable function.