CUDA Python/C++

Docusaurus Plushie

CAUTION : WORK IN PROGRESS

Matrix Multiplication

import numpy
def dot(a, b):
    """Multiply the matrix a with the matrix b.
    
    Parameters
    ----------
    a: ndarray
        left matrxi
    b: ndarray
        right matrix
        
    Return
    ------
    c: ndarray
        result matrix
    """
    c = numpy.zeros((a.shape[0], b.shape[1]))
    for i in range(a.shape[0]):
        for j in range(b.shape[1]):
            for k in range(a.shape[1]):
                c[i, j] += a [i, k] * b[k, j]
    return c

Let's two small matrices and see how long the above function takes to conpute matrix multiplication.

n = 256
a = numpy.random.random((n, n))
b = numpy.random.random((n, n))

t_dot = %timeit -o dot(a, b)

A matrix multiplication of two n by n matrices performs 2n^3 operations. The dot function achieves

print("%.3f GFLOP/s" % (2e-9 * n**3 / t_dot.best))

That's why they tell is Python is slower.

Luckily, we have Numba, an open source JIT compiler that translates a subset of Python and NumPy code into fast machine code.

Numba makes Python code fast.

from numba import jit
jdot = jit(dot)

t_jit = %timeit -o jdot(a, b)

print("%.3f GFLOP/s" % (2e-9 * n**3 / t_jit.best))

CUDA Python/C++

CAUTION : WORK IN PROGRESS​

CAUTION : WORK IN PROGRESS