

Author: 五速梦信息网
Date: June 17, 2026, 09:56


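This tutorial is Part 2 of the CUDA by Numba Examples series. Follow the series from the beginning to learn CUDA programming from scratch with Python.

Introduction

In the first part of this series, we discussed how to run embarrassingly parallel algorithms on the GPU. Embarrassingly parallel tasks are those whose subtasks are completely independent of each other, such as summing two arrays or applying any element-wise function.

In this tutorial

Many tasks, although not embarrassingly parallel, can still benefit from parallelization. In this installment of CUDA by Numba Examples, we cover some common techniques that allow threads to cooperate on a computation.

Google Colab code for this part: https://colab.research.google.com/drive/1hproEOKvQyBNNxvjr0qM2LPjJWNDfyp9?usp=sharing

Getting started

Import and load the libraries, and make sure you have a GPU.

    from time import perf_counter

    import numpy as np

    import numba
    from numba import cuda

    print(np.__version__)
    print(numba.__version__)

Output:

    1.25.2
    0.59.1

    cuda.detect()

Output:

    Found 1 CUDA devices
    id 0    b'Tesla T4'    [SUPPORTED]
        Compute Capability: 7.5
        PCI Device ID: 4
        PCI Bus ID: 0
        UUID: GPU-0f022a60-18f8-5de0-1f24-ad861dcd84ae
        Watchdog: Disabled
        FP32/FP64 Performance Ratio: 32
    Summary:
        1/1 devices are supported
    True

Thread cooperation

Simple parallel reduction algorithm

We will start this section with a very simple problem: summing all the elements of an array. In essence, the algorithm is very simple. Without resorting to NumPy, we could implement it as:

    def sum_cpu(array):
        s = 0.0
        for i in range(array.size):
            s += array[i]
        return s

I know, this doesn't look very Pythonic. But it does emphasize that s keeps track of every element in the array. How can we parallelize this algorithm if s depends on every element of the array? First, we need to rewrite the algorithm to allow for some sort of parallelization. If there are parts we cannot parallelize, we should allow threads to communicate with each other.

So far, however, we have not learned how to make threads talk to each other. In fact, we said previously that threads in different blocks do not communicate. We could consider launching only one block, but remember that on most GPUs a block can have at most 1024 threads.

How do we overcome this? Well, what if we split the array into chunks of 1024 elements (or an appropriate number of threads_per_block) and sum each chunk separately? At the end, we can add up the per-chunk sums. Figure 2.1 shows a very simple example of this for a split into 2 chunks.

How do we do this on the GPU? First we need to split the array into chunks. Each chunk corresponds to exactly one block, which has a fixed number of threads. Within each block, each thread may sum more than one array element (grid-stride loop). Then we must combine these per-thread values over the entire block. This part requires the threads to communicate. We will see how to do this in the next example.

Since we are parallelizing over blocks, the kernel output should have one entry per block, that is, blocks_per_grid entries. To finish the reduction, we copy this output to the CPU and complete the job there.

    threads_per_block = 1024  # Why not!
    blocks_per_grid = 32 * 80  # Use 32 * multiple of streaming multiprocessors

    # Example 2.1: Naive reduction
    @cuda.jit
    def reduce_naive(array, partial_reduction):
        i_start = cuda.grid(1)
        threads_per_grid = cuda.blockDim.x * cuda.gridDim.x
        s_thread = 0.0
        for i_arr in range(i_start, array.size, threads_per_grid):
            s_thread += array[i_arr]

        # We need to create a special *shared* array which will be able to be read
        # from and written to by every thread in the block. Each block will have its
        # own shared array. See the warning below!
        s_block = cuda.shared.array((threads_per_block,), numba.float32)

        # We now store the local temporary sum of a single thread into the
        # shared array. Since the shared array is sized
        #     threads_per_block == blockDim.x
        # (1024 in this example), we should index it with `threadIdx.x`.
        tid = cuda.threadIdx.x
        s_block[tid] = s_thread

        # The next line synchronizes the threads in a block. It ensures that after
        # that line, all values have been written to `s_block`.
        cuda.syncthreads()

        # Finally, we need to sum the values from all threads to yield a single
        # value per block. We only need one thread for this.
        if tid == 0:
            # We store the sum of the elements of the shared array in its first
            # coordinate
            for i in range(1, threads_per_block):
                s_block[0] += s_block[i]
            # Move this partial sum to the output. Only one thread is writing here.
            partial_reduction[cuda.blockIdx.x] = s_block[0]

⚠️ Note: the shared array must satisfy three requirements. First, it must be "small". The exact limit depends on the compute capability of the GPU, typically between 48 KB and 163 KB. See the item "Maximum amount of shared memory per thread block" in this table: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications__technical-specifications-per-compute-capability. Second, it must have a size known at compile time, which is why we size the shared array with threads_per_block and not blockDim.x. It is true that we can always define a factory function for shared arrays of any size, but be wary of the compile time of these kernels. Third, its dtype must be specified by a Numba type, not a NumPy type (don't ask me why!).

    N = 1_000_000_000
    a = np.arange(N, dtype=np.float32)
    a /= a.sum()  # a will have sum = 1 (to float32 precision)

    s_cpu = a.sum()

    # Highly-optimized NumPy CPU code
    timing_cpu = np.empty(21)
    for i in range(timing_cpu.size):
        tic = perf_counter()
        a.sum()
        toc = perf_counter()
        timing_cpu[i] = toc - tic
    timing_cpu *= 1e3  # convert to ms

    print(f"Elapsed time CPU: {timing_cpu.mean():.0f} ± {timing_cpu.std():.0f} ms")

Output:

    Elapsed time CPU: 557 ± 307 ms

    dev_a = cuda.to_device(a)
    dev_partial_reduction = cuda.device_array((blocks_per_grid,), dtype=a.dtype)

    reduce_naive[blocks_per_grid, threads_per_block](dev_a, dev_partial_reduction)
    s = dev_partial_reduction.copy_to_host().sum()  # Final reduction in CPU

    np.isclose(s, s_cpu)  # Ensure we have the right number

Output:

    True

    timing_naive = np.empty(21)
    for i in range(timing_naive.size):
        tic = perf_counter()
        reduce_naive[blocks_per_grid, threads_per_block](dev_a, dev_partial_reduction)
        s = dev_partial_reduction.copy_to_host().sum()
        cuda.synchronize()
        toc = perf_counter()
        assert np.isclose(s, s_cpu)
        timing_naive[i] = toc - tic
    timing_naive *= 1e3  # convert to ms

    print(f"Elapsed time naive: {timing_naive.mean():.0f} ± {timing_naive.std():.0f} ms")

Output:

    Elapsed time naive: 30 ± 11 ms

I ran this on Google Colab and got a speedup of nearly 20x. Pretty great!

A better parallel reduction algorithm

You may be wondering why we named everything "naive". The name implies that there is some non-naive way of doing the same thing. In fact, there are many tricks to speed up this kind of code; see the Optimizing Parallel Reduction in CUDA presentation for a benchmark.

Before we show a better approach, let's recall the last part of the kernel:

    if tid == 0:  # Single thread taking care of business
        for i in range(1, threads_per_block):
            s_block[0] += s_block[i]
        partial_reduction[cuda.blockIdx.x] = s_block[0]

We parallelized almost everything, but at the end of the kernel we make a single thread responsible for summing all threads_per_block elements of the shared array s_block. Why don't we parallelize this sum as well?

Sounds good, but how? Figure 2.2 shows how this can be done for a threads_per_block of 16. We start with 8 threads working. The first thread sums the values in s_block[0] and s_block[8], the second sums the values in s_block[1] and s_block[9], and so on until the last thread, which sums the values in s_block[7] and s_block[15].

In the next step, only the first 4 threads need to work. The first thread sums s_block[0] and s_block[4], the second sums s_block[1] and s_block[5], the third sums s_block[2] and s_block[6], and the fourth and final thread sums s_block[3] and s_block[7].

In the third step, we only need 2 threads to take care of the first 4 elements of s_block. The fourth and final step uses a single thread to sum 2 elements.

Since the work is divided among the threads, it is parallelized. Sure, it is not divided equally among the threads, but it is an improvement. Computationally, this algorithm takes O(log2(threads_per_block)) steps, whereas the first algorithm takes O(threads_per_block) steps. In our example, the naive algorithm performs 1023 sequential additions in a single thread, while the improved algorithm needs only 10 steps (log2(1024) = 10).

There is one final detail. At every step, we need to ensure that all threads have written to the shared array, so we must call cuda.syncthreads().

Source: Mark Harris, Optimizing Parallel Reduction in CUDA.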
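To make the halving scheme concrete, here is a minimal CPU-only sketch in plain Python. It is not GPU code: the name tree_reduce and its steps counter are hypothetical helpers introduced only for illustration, the inner for loop stands in for threads that would run at the same time on the GPU, and the input length is assumed to be a power of two.

```python
def tree_reduce(values):
    """CPU emulation of the in-block halving reduction (illustrative only)."""
    s_block = list(values)  # stands in for the shared array
    n = len(s_block)
    # Assumption: the block size is a power of two, as in the tutorial.
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"

    steps = 0
    i = n // 2
    while i > 0:
        # On the GPU, threads 0..i-1 would perform these additions at once.
        for tid in range(i):
            s_block[tid] += s_block[tid + i]
        # This is where every thread would call cuda.syncthreads().
        i //= 2
        steps += 1
    return s_block[0], steps


total, steps = tree_reduce(range(16))
print(total, steps)  # 120 4
```

With 16 elements the loop runs 4 times (i = 8, 4, 2, 1), matching the four steps described for a threads_per_block of 16. Within a step, each emulated thread reads an index at or above i and writes an index below i, which is why the sequential loop gives the same result as truly simultaneous threads.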

    # Example 2.2: Better reduction
    @cuda.jit
    def reduce_better(array, partial_reduction):
        i_start = cuda.grid(1)
        threads_per_grid = cuda.blockDim.x * cuda.gridDim.x
        s_thread = 0.0
        for i_arr in range(i_start, array.size, threads_per_grid):
            s_thread += array[i_arr]

        # We need to create a special *shared* array which will be able to be read
        # from and written to by every thread in the block. Each block will have its
        # own shared array. See the warning below!
        s_block = cuda.shared.array((threads_per_block,), numba.float32)

        # We now store the local temporary sum of the thread into the shared array.
        # Since the shared array is sized threads_per_block == blockDim.x,
        # we should index it with `threadIdx.x`.
        tid = cuda.threadIdx.x
        s_block[tid] = s_thread

        # The next line synchronizes the threads in a block. It ensures that after
        # that line, all values have been written to `s_block`.
        cuda.syncthreads()

        # Halve the number of active threads at every step.
        i = cuda.blockDim.x // 2
        while (i > 0):
            if (tid < i):
                s_block[tid] += s_block[tid + i]
            cuda.syncthreads()
            i //= 2

        if tid == 0:
            partial_reduction[cuda.blockIdx.x] = s_block[0]

    reduce_better[blocks_per_grid, threads_per_block](dev_a, dev_partial_reduction)
    s = dev_partial_reduction.copy_to_host().sum()  # Final reduction in CPU

    np.isclose(s, s_cpu)

Output:

    True

    timing_naive = np.empty(21)
    for i in range(timing_naive.size):
        tic = perf_counter()
        reduce_better[blocks_per_grid, threads_per_block](dev_a, dev_partial_reduction)
        s = dev_partial_reduction.copy_to_host().sum()
        cuda.synchronize()
        toc = perf_counter()
        assert np.isclose(s, s_cpu)
        timing_naive[i] = toc - tic
    timing_naive *= 1e3  # convert to ms

    print(f"Elapsed time better: {timing_naive.mean():.0f} ± {timing_naive.std():.0f} ms")

Output:

    Elapsed time better: 23 ± 1 ms

On Google Colab, this is about 30% faster than the naive approach.

⚠️ Note: you may be tempted to move syncthreads inside the if block, since after every step the threads beyond half of the current thread count are no longer used. However, doing so would make the CUDA threads that call syncthreads stop and wait for all the others, while the other threads simply keep going. The stopped threads would then wait forever for threads that never reach the synchronization point. The takeaway: if you synchronize threads, make sure that cuda.syncthreads() is called by all threads in the block.

    i = cuda.blockDim.x // 2
    while (i > 0):
        if (tid < i):
            s_block[tid] += s_block[tid + i]
            cuda.syncthreads()  # do NOT put it here
        cuda.syncthreads()  # put it here instead
        i //= 2

Numba reduce

Because the reduction algorithm above is not trivial, Numba offers a convenient cuda.reduce decorator that converts a binary function into a reduction. The long, complicated algorithm above can be replaced with the following:

    # Example 2.3: Numba reduction
    @cuda.reduce
    def reduce_numba(a, b):
        return a + b

    # Compile and check
    s = reduce_numba(dev_a)

    np.isclose(s, s_cpu)

Output:

    True

    # Time
    timing_numba = np.empty(21)
    for i in range(timing_numba.size):
        tic = perf_counter()
        s = reduce_numba(dev_a)
        toc = perf_counter()
        assert np.isclose(s, s_cpu)
        timing_numba[i] = toc - tic
    timing_numba *= 1e3  # convert to ms

    print(f"Elapsed time better: {timing_numba.mean():.0f} ± {timing_numba.std():.0f} ms")

Output:

    Elapsed time better: 20 ± 0 ms

Personally, I have found that hand-written reductions are usually much faster (at least 2x), even though in this particular benchmark the two timings are close. Still, the Numba reduce is very easy to use. That said, I encourage you to read the reduction code in the Numba source code.

Note also that, by default, the reduction copies its result to the host, which forces a synchronization. To avoid this, you can call the reduction with a device array as output:

    dev_s = cuda.device_array((1,), dtype=dev_a.dtype)

    reduce_numba(dev_a, res=dev_s)

    s = dev_s.copy_to_host()[0]
    np.isclose(s, s_cpu)

Output:

    True

2D reduction example

The parallel reduction technique is great, but it is not obvious how to extend it to higher dimensions. While we can always call the Numba reduction on a flattened array (array2d.ravel()), it is important to understand how to reduce multidimensional arrays manually.

In this example, we combine what we learned about 2D kernels with what we learned about 1D reductions to compute a 2D reduction.

    threads_per_block_2d = (16, 16)  # 256 threads total
    blocks_per_grid_2d = (64, 64)

    # Total number of threads in a 2D block (has to be an int)
    shared_array_len = int(np.prod(threads_per_block_2d))

    # Example 2.4: 2D reduction with 1D shared array
    @cuda.jit
    def reduce2d(array2d, partial_reduction2d):
        ix, iy = cuda.grid(2)
        threads_per_grid_x, threads_per_grid_y = cuda.gridsize(2)

        s_thread = 0.0
        # Stride each axis by the grid size along that same axis.
        for i0 in range(iy, array2d.shape[0], threads_per_grid_y):
            for i1 in range(ix, array2d.shape[1], threads_per_grid_x):
                s_thread += array2d[i0, i1]

        # Allocate shared array
        s_block = cuda.shared.array(shared_array_len, numba.float32)

        # Index the threads linearly: each tid identifies a unique thread in the
        # 2D block.
        tid = cuda.threadIdx.x + cuda.blockDim.x * cuda.threadIdx.y
        s_block[tid] = s_thread

        cuda.syncthreads()

        # We can use the same smart reduction algorithm by remembering that
        #     shared_array_len == blockDim.x * cuda.blockDim.y
        # So we just need to start our indexing accordingly.
        i = (cuda.blockDim.x * cuda.blockDim.y) // 2
        while (i != 0):
            if (tid < i):
                s_block[tid] += s_block[tid + i]
            cuda.syncthreads()
            i //= 2

        # Store reduction in a 2D array the same size as the 2D blocks
        if tid == 0:
            partial_reduction2d[cuda.blockIdx.x, cuda.blockIdx.y] = s_block[0]

    N_2D = (20_000, 20_000)
    a_2d = np.arange(np.prod(N_2D), dtype=np.float32).reshape(N_2D)
    a_2d /= a_2d.sum()  # a_2d will have sum = 1 (to float32 precision)

    s_2d_cpu = a_2d.sum()

    dev_a_2d = cuda.to_device(a_2d)
    dev_partial_reduction_2d = cuda.device_array(blocks_per_grid_2d, dtype=a_2d.dtype)

    reduce2d[blocks_per_grid_2d, threads_per_block_2d](dev_a_2d, dev_partial_reduction_2d)
    s_2d = dev_partial_reduction_2d.copy_to_host().sum()  # Final reduction in CPU

    np.isclose(s_2d, s_2d_cpu)  # Ensure we have the right number

Output:

    True

    timing_2d = np.empty(21)
    for i in range(timing_2d.size):
        tic = perf_counter()
        reduce2d[blocks_per_grid_2d, threads_per_block_2d](dev_a_2d, dev_partial_reduction_2d)
        s_2d = dev_partial_reduction_2d.copy_to_host().sum()
        cuda.synchronize()
        toc = perf_counter()
        assert np.isclose(s_2d, s_2d_cpu)
        timing_2d[i] = toc - tic
    timing_2d *= 1e3  # convert to ms

    print(f"Elapsed time better: {timing_2d.mean():.0f} ± {timing_2d.std():.0f} ms")

Output:

    Elapsed time better: 11 ± 0 ms

Device functions

So far we have only talked about kernels, which are special GPU functions that launch threads. Kernels often rely on smaller functions that are defined on the GPU and only have access to GPU arrays. These are called device functions. Unlike kernels, they can return values.

To close this part of the tutorial, we show an example of using a device function across different kernels. The example also stresses how important it is to synchronize threads when using shared arrays.

* Note: in newer versions of CUDA, kernels can launch other kernels. This is called dynamic parallelism and is not yet supported by Numba CUDA.

2D shared array example

In this example, we create a ripple pattern in a fixed-size array. We first need to declare the number of threads we will use, since the shared array requires it.

    threads_16 = 16

    import math

    @cuda.jit(device=True, inline=True)  # inlining can speed up execution
    def amplitude(ix, iy):
        return (1 + math.sin(2 * math.pi * (ix - 64) / 256)) * (
            1 + math.sin(2 * math.pi * (iy - 64) / 256)
        )

    # Example 2.5a: 2D Shared Array
    @cuda.jit
    def blobs_2d(array2d):
        ix, iy = cuda.grid(2)
        tix, tiy = cuda.threadIdx.x, cuda.threadIdx.y

        shared = cuda.shared.array((threads_16, threads_16), numba.float32)
        shared[tiy, tix] = amplitude(iy, ix)
        cuda.syncthreads()

        array2d[iy, ix] = shared[15 - tiy, 15 - tix]

    # Example 2.5b: 2D Shared Array without synchronize
    @cuda.jit
    def blobs_2d_wrong(array2d):
        ix, iy = cuda.grid(2)
        tix, tiy = cuda.threadIdx.x, cuda.threadIdx.y

        shared = cuda.shared.array((threads_16, threads_16), numba.float32)
        shared[tiy, tix] = amplitude(iy, ix)

        # When we don't sync threads, we may have not written to shared
        # yet, or even have overwritten it by the time we write to array2d
        array2d[iy, ix] = shared[15 - tiy, 15 - tix]

    N_img = 1024
    blocks = (N_img // threads_16, N_img // threads_16)
    threads = (threads_16, threads_16)

    dev_image = cuda.device_array((N_img, N_img), dtype=np.float32)
    dev_image_wrong = cuda.device_array((N_img, N_img), dtype=np.float32)

    blobs_2d[blocks, threads](dev_image)
    blobs_2d_wrong[blocks, threads](dev_image_wrong)

    image = dev_image.copy_to_host()
    image_wrong = dev_image_wrong.copy_to_host()

    import matplotlib.pyplot as plt

    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.imshow(image.T, cmap="nipy_spectral")
    ax2.imshow(image_wrong.T, cmap="nipy_spectral")
    for ax in (ax1, ax2):
        ax.set_xticks([])
        ax.set_yticks([])
        ax.set_xticklabels([])
        ax.set_yticklabels([])

Conclusion

In this tutorial you learned how to develop kernels that require a reduction pattern to process 1D and 2D arrays. Along the way, we learned how to make use of shared arrays and device functions.
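As a recap of the two-stage pattern behind the 1D reduction kernels in this part, the following CPU-only sketch in plain Python emulates it: each emulated thread accumulates elements with a grid-stride loop, each block combines its threads' values into one partial sum, and the partial sums are added at the end. The function name two_stage_sum is hypothetical, and the nested loops only mimic what the GPU would do in parallel; it is a sketch of the data flow, not a substitute for the kernels.

```python
def two_stage_sum(array, blocks_per_grid, threads_per_block):
    """CPU emulation of a block-wise partial reduction (illustrative only)."""
    threads_per_grid = blocks_per_grid * threads_per_block
    partial = [0.0] * blocks_per_grid  # one entry per block, like the kernel output

    for block in range(blocks_per_grid):
        s_block = [0.0] * threads_per_block  # stands in for the shared array
        for tid in range(threads_per_block):
            # Global thread index, i.e. the value cuda.grid(1) would return.
            i_start = block * threads_per_block + tid
            # Grid-stride loop: this thread handles every threads_per_grid-th element.
            for i_arr in range(i_start, len(array), threads_per_grid):
                s_block[tid] += array[i_arr]
        # In-block reduction (done by one or several threads on the GPU).
        partial[block] = sum(s_block)

    # Final reduction, performed on the CPU in the tutorial.
    return sum(partial)


data = [float(i) for i in range(1000)]
print(two_stage_sum(data, blocks_per_grid=4, threads_per_block=8))  # 499500.0
```

Because the grid-stride loop starts at the global thread index and advances by the total number of threads, every element is visited exactly once even when the array is much longer (or shorter) than the grid.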