python - Anaconda's NumbaPro CUDA assertion error
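I am trying to use NumbaPro's CUDA extension to multiply large matrices. What I ultimately want is to multiply a matrix of size NxN by a diagonal matrix that is supplied as a 1D array (that is, a.dot(numpy.diagflat(b)), which I have found to be equivalent to a * b). However, I get an assertion error that provides no information.
I can only avoid this assertion error if I multiply two 1D arrays, but that is not what I want to do.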
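The equivalence mentioned above can be checked on the CPU with plain NumPy. This is only a small sketch: the array size and the fixed random seed are arbitrary choices made for illustration and are not part of the original question.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
n = 4                           # small illustrative size
a = rng.random((n, n))
b = rng.random(n) + 10

# Explicit form: build the full n x n diagonal matrix, then matrix-multiply.
explicit = a.dot(np.diagflat(b))

# Broadcast form: b is stretched across the rows, so column j is scaled by b[j].
broadcast = a * b

print(np.allclose(explicit, broadcast))
```

The broadcast form never materialises the diagonal matrix, which is why it is the cheaper way to express the same product.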

    from numbapro import vectorize, cuda
    from numba import f4,f8
    import numpy as np

    def generate_input(n):
        import numpy as np
        A = np.array(np.random.sample((n,n)))
        B = np.array(np.random.sample(n) + 10)
        return A, B

    def product(a, b):
        return a * b

    def main():
        cu_product = vectorize([f4(f4, f4), f8(f8, f8)], target='gpu')(product)

        N = 1000

        A, B = generate_input(N)
        D = np.empty(A.shape)

        stream = cuda.stream()

        with stream.auto_synchronize():
            dA = cuda.to_device(A, stream)
            dB = cuda.to_device(B, stream)
            dD = cuda.to_device(D, stream, copy=False)
            cu_product(dA, dB, out=dD, stream=stream)
            dD.to_host(stream)

    if __name__ == '__main__':
        main()

This is what my terminal prints:

    Traceback (most recent call last):
      File "cuda_vectorize.py", line 32, in <module>
        main()
      File "cuda_vectorize.py", line 28, in main
        cu_product(dA, dB, out=dD, stream=stream)
      File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/_cudadispatch.py", line 109, in __call__
      File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/_cudadispatch.py", line 191, in _arguments_requirement
    AssertionError

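Solution:
The problem is that you are using vectorize on a function that takes non-scalar arguments. The idea behind NumbaPro's vectorize is that it takes a scalar function as input and generates a function that applies that scalar operation in parallel to all elements of a vector. See the NumbaPro documentation.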
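To illustrate the scalar-function contract described above, here is a CPU-only sketch that uses NumPy's np.vectorize as a stand-in. It is not NumbaPro's implementation and runs nothing on the GPU; it only shows that the wrapped function sees one pair of scalars per call.

```python
import numpy as np

def product(a, b):
    # Scalar function: a and b are single numbers here, not arrays.
    return a * b

# np.vectorize is a CPU stand-in used purely for illustration (an assumption,
# not the NumbaPro API): it calls product once per pair of elements.
elementwise_product = np.vectorize(product)

x = np.array([1.0, 2.0, 3.0])
y = np.array([10.0, 20.0, 30.0])
result = elementwise_product(x, y)
print(result)
```

Note that NumPy's version also broadcasts inputs of different shapes, which the GPU target in the question evidently did not accept.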
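Your function takes a matrix and a vector, which are definitely not scalars. [Edit] You can do what you want on the GPU either with NumbaPro's wrapper for cuBLAS or by writing your own simple kernel function. Here is an example that demonstrates both. Note that it requires NumbaPro 0.12.2 or later (just released at the time of this edit).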

    from numbapro import jit, cuda
    from numba import float32
    import numbapro.cudalib.cublas as cublas
    import numpy as np
    from timeit import default_timer as timer

    def generate_input(n):
        A = np.array(np.random.sample((n,n)), dtype=np.float32)
        B = np.array(np.random.sample(n), dtype=A.dtype)
        return A, B

    @cuda.jit(argtypes=[float32[:,:], float32[:,:], float32[:]])
    def diagproduct(c, a, b):
        # Each thread starts at its own (x, y) position and strides by the
        # total number of threads along each axis until the matrix is covered.
        startX, startY = cuda.grid(2)
        gridX = cuda.gridDim.x * cuda.blockDim.x
        gridY = cuda.gridDim.y * cuda.blockDim.y
        height, width = c.shape

        for y in range(startY, height, gridY):
            for x in range(startX, width, gridX):
                # One multiply per element: column x is scaled by b[x].
                c[y, x] = a[y, x] * b[x]

    def main():
        N = 1000

        A, B = generate_input(N)
        D = np.empty(A.shape, dtype=A.dtype)
        E = np.zeros(A.shape, dtype=A.dtype)
        F = np.empty(A.shape, dtype=A.dtype)

        # Reference result on the CPU.
        start = timer()
        E = np.dot(A, np.diag(B))
        numpy_time = timer() - start

        # cuBLAS SGEMM with the diagonal expanded to a full N x N matrix.
        blas = cublas.api.Blas()

        start = timer()
        blas.gemm('N', 'N', N, N, N, 1.0, np.diag(B), A, 0.0, D)
        cublas_time = timer() - start

        diff = np.abs(D-E)
        print("Maximum CUBLAS error %f" % np.max(diff))

        blockdim = (32, 8)
        griddim  = (16, 16)

        # Custom kernel; the timing includes the host/device copies.
        start = timer()
        dA = cuda.to_device(A)
        dB = cuda.to_device(B)
        dF = cuda.to_device(F, copy=False)
        diagproduct[griddim, blockdim](dF, dA, dB)
        dF.to_host()
        cuda_time = timer() - start

        diff = np.abs(F-E)
        print("Maximum CUDA error %f" % np.max(diff))

        print("Numpy took %f seconds" % numpy_time)
        print("CUBLAS took %f seconds, %0.2fx speedup" % (cublas_time, numpy_time / cublas_time))
        print("CUDA JIT took %f seconds, %0.2fx speedup" % (cuda_time, numpy_time / cuda_time))

    if __name__ == '__main__':
        main()

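The strided loops in diagproduct can be emulated on the CPU to see that every element is written exactly once even when the launch has fewer threads than the matrix has elements. The following is a sketch only: emulate_diagproduct, the 10 x 7 matrix and the tiny launch configuration are hypothetical choices for illustration, and the Python loops stand in for CUDA threads rather than running on a GPU.

```python
import numpy as np

def emulate_diagproduct(a, b, griddim, blockdim):
    """CPU emulation of the grid-stride loops in diagproduct (illustrative only)."""
    height, width = a.shape
    c = np.zeros_like(a)
    writes = np.zeros(a.shape, dtype=np.int64)  # number of writes per element
    # Total number of emulated threads along each axis.
    grid_x = griddim[0] * blockdim[0]
    grid_y = griddim[1] * blockdim[1]
    # Each (start_x, start_y) pair plays the role of one CUDA thread.
    for start_y in range(grid_y):
        for start_x in range(grid_x):
            for y in range(start_y, height, grid_y):
                for x in range(start_x, width, grid_x):
                    c[y, x] = a[y, x] * b[x]
                    writes[y, x] += 1
    return c, writes

rng = np.random.default_rng(0)
a = rng.random((10, 7)).astype(np.float32)
b = rng.random(7).astype(np.float32)

# Deliberately tiny launch: 2 x 2 blocks of 2 x 2 threads is a 4 x 4 thread grid,
# smaller than the 10 x 7 matrix, so the strided loops have to cover the rest.
c, writes = emulate_diagproduct(a, b, griddim=(2, 2), blockdim=(2, 2))
print(np.allclose(c, a * b), writes.min(), writes.max())
```

The same reasoning applies to the real launch above: blockdim (32, 8) with griddim (16, 16) gives 512 x 128 threads, fewer than the 1000 x 1000 elements, so each thread handles several elements.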
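The kernel is significantly faster because SGEMM performs a full matrix-matrix multiplication (O(n^3)) and requires the diagonal to be expanded into a full matrix. The diagproduct function is smarter: it performs a single multiplication per matrix element and never expands the diagonal into a full matrix. Here are the results on my NVIDIA Tesla K20c GPU with N = 1000:

    Maximum CUBLAS error 0.000000
    Maximum CUDA error 0.000000
    Numpy took 0.024535 seconds
    CUBLAS took 0.010345 seconds, 2.37x speedup
    CUDA JIT took 0.004857 seconds, 5.05x speedup
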
The timings include all copies to and from the GPU, which is a significant bottleneck for small matrices. If we set N to 10,000 and run again, we get a much bigger speedup:

    Maximum CUBLAS error 0.000000
    Maximum CUDA error 0.000000
    Numpy took 7.245677 seconds
    CUBLAS took 1.371524 seconds, 5.28x speedup
    CUDA JIT took 0.264598 seconds, 27.38x speedup

For very small matrices, however, CUBLAS SGEMM has an optimized path, so its performance is closer to that of the CUDA kernel. Here, N = 100:

    Maximum CUBLAS error 0.000000
    Maximum CUDA error 0.000000
    Numpy took 0.006876 seconds
    CUBLAS took 0.001425 seconds, 4.83x speedup
    CUDA JIT took 0.001313 seconds, 5.24x speedup

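As a rough back-of-the-envelope check on the O(n^3) versus per-element argument, the arithmetic work of the two approaches can be compared directly. This is a sketch under a stated assumption: it uses the conventional 2N^3 flop count for a dense GEMM, and operation_counts is a hypothetical helper, not something from the answer.

```python
def operation_counts(n):
    """Rough arithmetic cost of the two GPU approaches for an n x n matrix."""
    # Assumed conventional GEMM count: n multiply-adds (2n flops) per output element.
    gemm_flops = 2 * n ** 3
    # diagproduct performs one multiply per output element.
    kernel_multiplies = n ** 2
    return gemm_flops, kernel_multiplies

for n in (100, 1000, 10000):
    gemm, kernel = operation_counts(n)
    print(f"N={n}: GEMM ~{gemm:.1e} flops, "
          f"diagproduct {kernel:.1e} multiplies, ratio {gemm // kernel}x")
```

The measured speedups are far smaller than this ratio, presumably because the timings include the host/device copies and the per-element kernel is limited by memory traffic rather than arithmetic.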