云栈社区»论坛 › 站务中心「 Forum Service 」 › 深入解析CUDA并行任务模式：从Map到Reduce的七种核心范式与实践 ...

发回帖发新帖

1243 积分	0 好友	161 主题

发消息

深入解析CUDA并行任务模式：从Map到Reduce的七种核心范式与实践代码

发表于 15 小时前 | 查看: 0| 回复: 0

CUDA并行计算概念图

在CUDA编程中，并行任务的处理是其核心魅力所在。通过之前的学习，我们已经掌握了数据访问、线程组织等基础知识，明白了线程与数据之间的基本关系。但要真正驾驭GPU的并行能力，我们需要理解一个更上层的概念：并行任务的操作范式。这本质上是定义线程空间如何映射到数据空间，并进行高效计算的抽象模型。在CUDA实践中，常见的并行任务范式主要包括Map、Gather、Scatter、Transpose、Stencil、Reduce和Scan。

或许你对“线程空间”和“数据空间”感到陌生，我们可以用一个简单的编程常识来理解。比如，在C/C++中，数组可以有一维、二维、三维等逻辑结构，但它们在物理内存中永远是以一维线性地址存放的。类似地，CUDA中的线程可以有二维或三维的组织形式（Grid和Block），而输入数据却通常是逻辑上的一维数组或二维矩阵。将多维线程索引映射到数据存储位置，正是实现各种并行模式的基础，这类似于我们在计算线程全局索引(blockIdx * blockDim + threadIdx)时所做的事情。

七种核心并行任务类型详解

下面我们来逐一解析CUDA中几种核心的并行计算模式。

Map（映射）
这是最简单、最基础的并行模式。输出数组中的每个元素由输入数组中对应的一个元素独立计算而来，一一对应，互不干扰。在CUDA中，我们通常启动与数据量相等的线程数，每个线程处理一个数据点，无需任何线程间通信。它适用于计算密集型且内存访问连续的场景，比如对数组中的每个元素求平方。
Gather（收集）
该模式负责从不规则或非连续的输入地址“收集”数据，并将结果写入一个连续的输出空间。每个CUDA线程根据一个独立的索引值，从输入数组的指定位置读取数据，进行计算后放入输出数组的固定位置。这常用于根据索引列表查询数据。
Scatter（散射）
与Gather相反，Scatter将数据从一个连续的输入数组“散射”到输出数组的不规则、非连续位置。每个线程读取一个输入元素和一个目标索引，然后将该元素写入输出数组的索引指定处。需要注意的是，多个线程可能试图写入同一个输出位置（索引冲突），因此必要时需使用原子操作来确保数据正确性。
Transpose（转置）
通常用于矩阵转置操作，可以看作是一种特殊且规则的Scatter。它交换数据的行和列坐标。在CUDA中实现矩阵转置时，内存访问模式的优化是关键，常会用到共享内存以及通过填充（Padding）来缓解存储体冲突（Bank Conflict），以提升性能。
Stencil（模板）
这是一种具有局部性的特殊Gather操作。计算每个输出元素时，需要读取输入数据中一个固定模式（即“模板”，如3x3窗口）内的相邻元素。这涉及到CUDA中的内存预取和共享内存优化，常见于图像处理（如卷积）或科学计算领域。
Reduce（归约）
如果你有分布式计算经验，会很容易理解这个概念。归约通过二元操作（如加法、求最大值）将一组数据合并为单个结果，通常采用分治法和树形结构来实现。由于其计算过程存在并行与合并的复杂度，优化空间很大，例如使用循环展开、向量化加载等技巧。常见的归约算法包括相邻归约、树形归约和分块归约。
Scan（扫描）
扫描操作处理的是数据间的依赖关系，第n个输出结果是前n个输入元素累积计算的结果。它分为包含性前缀和与排他性前缀和两种。由于其固有的串行依赖特性，并行化实现更具挑战。在CUDA中，通常采用Hillis-Steele算法（工作量效率较低）或Blelloch算法（工作量效率较高）来实现。

除了以上七种主要模式，还有一些其他并行模式，例如：分治常用于排序算法，递归分解问题；流水线将任务分解为多个阶段重叠执行；稀疏模式专门用于优化不规则和稀疏数据结构的计算。

模式选择与CUDA生态工具

上述并行类型并非CUDA API直接提供的函数，而是一种算法设计和优化的指导思想，类似于设计模式在软件工程中的作用。理解它们有助于我们针对不同问题选择最高效的实现方案。

那么在实际项目中如何选择呢？这里有一些经验性的指导：

数据转换无依赖：首选Map模式，就像简单的函数映射。
需要查询/收集数据：Gather模式更合适。
处理图的邻接关系或卷积：推荐使用Stencil模式。
数据重排或矩阵转置：自然要选择Transpose。
散射写入数据：需谨慎使用Scatter，因为它可能引发写冲突，且输出是非线性的。
大规模数据累积或聚合：Scan（用于前缀和等累积操作）和Reduce（用于求和、求极值等归约操作）是利器。

为了便于开发者实现这些模式，NVIDIA提供了丰富的库支持：

Thrust库：高级抽象库，非常适合快速实现Map、Reduce、Scan、Sort等操作。
CUB库：底层优化库，提供块（Block）级和线程束（Warp）级的最佳性能原语。
cuBLAS库：专攻线性代数，包含高度优化的矩阵转置（Transpose）操作。
NPP库：针对图像处理，内置了多种Stencil操作（如滤波、卷积）。

在实际应用中，一个复杂算法往往是多种并行模式的组合。例如，流压缩（Stream Compaction）通常会结合Map（标记有效数据）和Scan（计算输出位置）；图像处理管线可能结合Gather（读取像素）和Stencil（进行卷积滤波）。正所谓取长补短，协同工作。

实战代码示例

理论说再多，不如一行代码。下面我们将通过一个综合示例，展示如何在CUDA中实现上述大部分并行模式。代码中同时提供了CPU版本的实现作为对照。

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

void initArray(float* arr, int size, int seed = 0) {
    srand(seed);
    for (int i = 0; i < size; i++) {
        arr[i] = (float)(rand() % 100) / 10.0f;
    }
}

void initMatrix(float* matrix, int rows, int cols, int seed = 0) {
    srand(seed);
    for (int i = 0; i < rows * cols; i++) {
        matrix[i] = (float)(rand() % 100) / 10.0f;
    }
}

// MAP模式
__global__ void mapKernel(float* input, float* output, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        output[idx] = input[idx] * input[idx];   
    }
}

void mapCPU(float* input, float* output, int n) {
    for (int i = 0; i < n; i++) {
        output[i] = input[i] * input[i];
    }
}

// GATHER模式
__global__ void gatherKernel(float* input, int* indices, float* output, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        int srcIdx = indices[idx];
        output[idx] = input[srcIdx];
    }
}

void gatherCPU(float* input, int* indices, float* output, int n) {
    for (int i = 0; i < n; i++) {
        output[i] = input[indices[i]];
    }
}

// SCATTER模式
__global__ void scatterKernel(float* input, int* indices, float* output, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        int dst_idx = indices[idx];
        output[dst_idx] = input[idx];  
    }
}

void scatterCPU(float* input, int* indices, float* output, int n, int output_size) {
    //有可能有冲突处理
    for (int i = 0; i < n; i++) {
        output[indices[i]] = input[i];
    }
}

// STENCIL模式 (3x3卷积)
__global__ void stencilKernel(float* input, float* output, int width, int height,
                              float* kernel, int kernel_size) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x >= 1 && x < width - 1 && y >= 1 && y < height - 1) {
        float sum = 0.0f;
        for (int ky = -1; ky <= 1; ky++) {
            for (int kx = -1; kx <= 1; kx++) {
                int idx = (y + ky) * width + (x + kx);
                int kernelIdx = (ky + 1) * kernel_size + (kx + 1);
                sum += input[idx] * kernel[kernelIdx];
            }
        }
        output[y * width + x] = sum;
    } else if (x < width && y < height) {
        output[y * width + x] = input[y * width + x];
    }
}

void stencilCPU(float* input, float* output, int width, int height,
                float* kernel, int kernelSize) {
    for (int y = 1; y < height - 1; y++) {
        for (int x = 1; x < width - 1; x++) {
            float sum = 0.0f;
            for (int ky = -1; ky <= 1; ky++) {
                for (int kx = -1; kx <= 1; kx++) {
                    int idx = (y + ky) * width + (x + kx);
                    int kernelIdx = (ky + 1) * kernelSize + (kx + 1);
                    sum += input[idx] * kernel[kernelIdx];
                }
            }
            output[y * width + x] = sum;
        }
    }

    for (int y = 0; y < height; y++) {
        output[y * width] = input[y * width];
        output[y * width + width - 1] = input[y * width + width - 1];
    }
    for (int x = 0; x < width; x++) {
        output[x] = input[x];
        output[(height - 1) * width + x] = input[(height - 1) * width + x];
    }
}

//TRANSPOSE模式
__global__ void transposeNaiveKernel(float* input, float* output, int rows, int cols) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < cols && y < rows) {
        output[x * rows + y] = input[y * cols + x];
    }
}

__global__ void transposeSharedKernel(float* input, float* output, int rows, int cols) {
    __shared__ float tile[32][33];  // 通过padding处理bank冲突

    int blockX = blockIdx.x * 32;
    int blockY = blockIdx.y * 32;

    int x = blockX + threadIdx.x;
    int y = blockY + threadIdx.y;

    // 加载到共享内存
    if (x < cols && y < rows) {
        tile[threadIdx.y][threadIdx.x] = input[y * cols + x];
    }

    __syncthreads();

    // 转置并写入
    int out_x = blockY + threadIdx.x;
    int out_y = blockX + threadIdx.y;

    if (out_x < rows && out_y < cols) {
        output[out_y * rows + out_x] = tile[threadIdx.x][threadIdx.y];
    }
}

void transposeCPU(float* input, float* output, int rows, int cols) {
    for (int y = 0; y < rows; y++) {
        for (int x = 0; x < cols; x++) {
            output[x * rows + y] = input[y * cols + x];
        }
    }
}

// SCAN模式
__global__ void scanBlockKernel(float* input, float* output, int n) {
    extern __shared__ float tmp[];

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x * 2 + threadIdx.x;

    float val1 = (idx < n) ? input[idx] : 0;
    float val2 = (idx + blockDim.x < n) ? input[idx + blockDim.x] : 0;

    int offset = 1;
     tmp[2 * tid] = val1;
     tmp[2 * tid + 1] = val2;

    for (int d = blockDim.x; d > 0; d >>= 1) {
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
             tmp[bi] +=  tmp[ai];
        }
        offset *= 2;
    }

    if (tid == 0) {
         tmp[2 * blockDim.x - 1] = 0;
    }

    for (int d = 1; d <= blockDim.x; d *= 2) {
        offset >>= 1;
        __syncthreads();
        if (tid < d) {
            int ai = offset * (2 * tid + 1) - 1;
            int bi = offset * (2 * tid + 2) - 1;
            float t =  tmp[ai];
             tmp[ai] =  tmp[bi];
             tmp[bi] += t;
        }
    }
    __syncthreads();

    if (idx < n) {
        output[idx] = tmp[2 * tid];
    }
    if (idx + blockDim.x < n) {
        output[idx + blockDim.x] = tmp[2 * tid + 1];
    }
}

// 简化单块
void scanCPU(float* input, float* output, int n) {
    output[0] = 0;
    for (int i = 1; i < n; i++) {
        output[i] = output[i - 1] + input[i - 1];
    }
}

// REDUCE模式
__global__ void reduceKernel(float* input, float* output, int n) {
    extern __shared__ float sdata[];

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x * 2 + tid;

    sdata[tid] = 0;
    if (i < n) sdata[tid] += input[i];
    if (i + blockDim.x < n) sdata[tid] += input[i + blockDim.x];

    __syncthreads();

    // 树形归约
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) {
        output[blockIdx.x] = sdata[0];
    }
}

float reduceCPU(float* input, int n) {
    float sum = 0;
    for (int i = 0; i < n; i++) {
        sum += input[i];
    }
    return sum;
}

int main() {

    // 元数据
    const int N = 1 << 20;      
    const int IMAGE_WIDTH = 512;  // 图像宽度
    const int IMAGE_HEIGHT = 512; // 图像高度
    const int MATRIX_ROWS = 1024;  // 矩阵行数
    const int MATRIX_COLS = 1024;  // 矩阵列数

    // 主机内存
    float *hInput = (float*)malloc(N * sizeof(float));
    float *hOutputCpu = (float*)malloc(N * sizeof(float));
    float *hOutputGpu = (float*)malloc(N * sizeof(float));

    // Gather/Scatter索引
    int *hIndices = (int*)malloc(N * sizeof(int));

    // 图像
    int imgSize = IMAGE_WIDTH * IMAGE_HEIGHT;
    float *hImgInput = (float*)malloc(imgSize * sizeof(float));
    float *hImgOutputCpu = (float*)malloc(imgSize * sizeof(float));
    float *hImgOutputGpu = (float*)malloc(imgSize * sizeof(float));

    // 矩阵
    int mat_size = MATRIX_ROWS * MATRIX_COLS;
    float *hMatInput = (float*)malloc(mat_size * sizeof(float));
    float *hMatOutputCpu = (float*)malloc(mat_size * sizeof(float));
    float *hMatOutputGpu = (float*)malloc(mat_size * sizeof(float));

    // 卷积核
    float hKernel[9];
    for (int i = 0; i < 9; i++) {
        hKernel[i] = 1.0f / 9.0f;
    }

    // 设备内存
    float * dInput, * dOutput=NULL;
    int * dIndices=NULL;
    float * dImgInput, * dImgOutput=NULL;
    float * dMatInput, * dMatOutput=NULL;
    float * dKernel=NULL;
    float * dScanOutput, * dReduceOutput=NULL;

    cudaMalloc(&dInput, N * sizeof(float));
    cudaMalloc(&dOutput, N * sizeof(float));
    cudaMalloc(&dIndices, N * sizeof(int));
    cudaMalloc(&dImgInput, imgSize * sizeof(float));
    cudaMalloc(&dImgOutput, imgSize * sizeof(float));
    cudaMalloc(&dMatInput, mat_size * sizeof(float));
    cudaMalloc(&dMatOutput, mat_size * sizeof(float));
    cudaMalloc(&dKernel, 9 * sizeof(float));
    cudaMalloc(&dScanOutput, N * sizeof(float));
    cudaMalloc(&dReduceOutput, ((N + 511) / 512) * sizeof(float));

    // 初始化
    initArray(hInput, N, 68);
    initArray(hImgInput, imgSize, 333);
    initMatrix(hMatInput, MATRIX_ROWS, MATRIX_COLS, 666);

    // 初始化索引（用于Gather/Scatter）
    srand(127);
    for (int i = 0; i < N; i++) {
        hIndices[i] = rand() % N;
    }

    // 拷贝数据到设备
     cudaMemcpy(dInput, hInput, N * sizeof(float), cudaMemcpyHostToDevice);
     cudaMemcpy(dIndices, hIndices, N * sizeof(int), cudaMemcpyHostToDevice);
     cudaMemcpy(dImgInput, hImgInput, imgSize * sizeof(float), cudaMemcpyHostToDevice);
     cudaMemcpy(dMatInput, hMatInput, mat_size * sizeof(float), cudaMemcpyHostToDevice);
     cudaMemcpy(dKernel, hKernel, 9 * sizeof(float), cudaMemcpyHostToDevice);

    //执行
    int blockSize = 256;
    int gridSize = (N + blockSize - 1) / blockSize;
    mapKernel <<<gridSize, blockSize >>>(dInput, dOutput, N);
    cudaDeviceSynchronize();

    cudaMemcpy(hOutputGpu, dOutput, N * sizeof(float), cudaMemcpyDeviceToHost);
    mapCPU(hInput, hOutputCpu, N);

    // GATHER
    gatherKernel <<<gridSize, blockSize >>>(dInput, dIndices, dOutput, N);cudaDeviceSynchronize();

     cudaMemcpy(hOutputGpu, dOutput, N * sizeof(float), cudaMemcpyDeviceToHost);
    gatherCPU(hInput, hIndices, hOutputCpu, N);

    // SCATTER
    // 重置
     cudaMemset(dOutput, 0, N * sizeof(float));

    scatterKernel <<<gridSize, blockSize >>>(dInput, dIndices, dOutput, N);cudaDeviceSynchronize();

    cudaMemcpy(hOutputGpu, dOutput, N * sizeof(float), cudaMemcpyDeviceToHost);

    // STENCIL
    dim3 blockDimStencil(16, 16);
    dim3 gridDimStencil((IMAGE_WIDTH + blockDimStencil.x - 1) / blockDimStencil.x,
                       (IMAGE_HEIGHT + blockDimStencil.y - 1) / blockDimStencil.y);

    stencilKernel <<<gridDimStencil, blockDimStencil >>>(dImgInput, dImgOutput, IMAGE_WIDTH, IMAGE_HEIGHT, dKernel, 3);
    cudaDeviceSynchronize();

     cudaMemcpy(hImgOutputGpu, dImgOutput, imgSize * sizeof(float),
        cudaMemcpyDeviceToHost);
    stencilCPU(hImgInput, hImgOutputCpu, IMAGE_WIDTH, IMAGE_HEIGHT, hKernel, 3);

    // TRANSPOSE
    dim3 blockDimTranspose(16, 16);
    dim3 gridDimTranspose((MATRIX_COLS + blockDimTranspose.x - 1) / blockDimTranspose.x,
                         (MATRIX_ROWS + blockDimTranspose.y - 1) / blockDimTranspose.y);

    transposeSharedKernel <<<gridDimTranspose, blockDimTranspose >>>(dMatInput, dMatOutput, MATRIX_ROWS, MATRIX_COLS);
    cudaDeviceSynchronize();

    cudaMemcpy(hMatOutputGpu, dMatOutput, mat_size * sizeof(float),cudaMemcpyDeviceToHost);
    transposeCPU(hMatInput, hMatOutputCpu, MATRIX_ROWS, MATRIX_COLS);

    // SCAN

    // 使用较小的数据测试SCAN
    const int SCAN_SIZE = 1 << 16;  // 64K
    float *hScanInput = (float*)malloc(SCAN_SIZE * sizeof(float));
    float *hScanOutputCpu = (float*)malloc(SCAN_SIZE * sizeof(float));
    float *hScanOutputGpu = (float*)malloc(SCAN_SIZE * sizeof(float));

    initArray(hScanInput, SCAN_SIZE, 99);

    float * d_scan_input;
    cudaMalloc(&d_scan_input, SCAN_SIZE * sizeof(float));
    cudaMemcpy(d_scan_input, hScanInput, SCAN_SIZE * sizeof(float),cudaMemcpyHostToDevice);

    int scanBlockSize = 512;
    int scanGridSize = (SCAN_SIZE + scanBlockSize * 2 - 1) / (scanBlockSize * 2);
    int sharedMemSize = 2 * scanBlockSize * sizeof(float);

    scanBlockKernel <<<scanGridSize, scanBlockSize, sharedMemSize >>>(d_scan_input, dScanOutput, SCAN_SIZE);
    cudaDeviceSynchronize();

    cudaMemcpy(hScanOutputGpu, dScanOutput, SCAN_SIZE * sizeof(float),cudaMemcpyDeviceToHost);
    scanCPU(hScanInput, hScanOutputCpu, SCAN_SIZE);

    // REDUCE
    int reduceBlockSize = 512;
    int reduceGridSize = (N + reduceBlockSize * 2 - 1) / (reduceBlockSize * 2);
    int reduceSharedMem = reduceBlockSize * sizeof(float);

    // 最上层归约
    reduceKernel <<<reduceGridSize, reduceBlockSize, reduceSharedMem >>>(dInput, dReduceOutput, N);

    // 继续归约
    float hReduceResult;
    if (reduceGridSize > 1) {
        float * dFinalResult;
         cudaMalloc(&dFinalResult, sizeof(float));

        //拷贝至主机并最终完成
        float *hIntermediate = (float*)malloc(reduceGridSize * sizeof(float));
        cudaMemcpy(hIntermediate, dReduceOutput,reduceGridSize * sizeof(float), cudaMemcpyDeviceToHost);

        hReduceResult = 0;
        for (int i = 0; i < reduceGridSize; i++) {
            hReduceResult += hIntermediate[i];
        }

        free(hIntermediate);
        cudaFree(dFinalResult);
    } else {
        cudaMemcpy(&hReduceResult, dReduceOutput,sizeof(float), cudaMemcpyDeviceToHost);
    }

    float cpu_reduce_result = reduceCPU(hInput, N);

    // 清理资源
    free(hInput);
    free(hOutputCpu);
    free(hOutputGpu);
    free(hIndices);
    free(hImgInput);
    free(hImgOutputCpu);
    free(hImgOutputGpu);
    free(hMatInput);
    free(hMatOutputCpu);
    free(hMatOutputGpu);
    free(hScanInput);
    free(hScanOutputCpu);
    free(hScanOutputGpu);

    // 释放设备内存
    cudaFree(dInput);
    cudaFree(dOutput);
    cudaFree(dIndices);
    cudaFree(dImgInput);
    cudaFree(dImgOutput);
    cudaFree(dMatInput);
    cudaFree(dMatOutput);
    cudaFree(dKernel);
    cudaFree(d_scan_input);
    cudaFree(dScanOutput);
    cudaFree(dReduceOutput);

    return 0;
}

建议你复制这段代码到实际的CUDA开发环境中编译运行，通过调试和性能分析，可以更深刻地理解每种模式的工作原理和优化点。在云栈社区的智能 & 数据 & 云板块，你也能找到更多关于GPU高性能计算的实战讨论与资源。

总结

本文更侧重于CUDA并行编程中的应用层设计，但这与CUDA底层开发是紧密相连的。文章提供了一个关于并行任务模式的“纲要”，由于篇幅所限，一些实现细节（如共享内存的具体使用、原子操作的注意事项、更优的Scan/Reduce算法）未能深入展开。这正需要读者以此为基础，在实践中去不断完善和探索。

学习过程就是不断总结、迭代，最终将知识内化为自己体系的过程。虽然每个人理解问题的角度可能不同，但透过现象看到的计算本质和并行思想是一致的。掌握这些并行范式，将帮助你更好地设计算法，释放GPU的强大算力。

上一篇：并发与并行详解：从概念差异到Go/Java实际应用场景
下一篇：Evolver技术拆解：基于OpenClaw的AI智能体如何实现自我进化

CUDA, C＋＋, 并行计算, 高性能计算, GPU编程

深入解析CUDA并行任务模式：从Map到Reduce的七种核心范式与实践代码

七种核心并行任务类型详解

模式选择与CUDA生态工具

实战代码示例

总结

相关帖子