Numba - @vectorize and @guvectorize
An explanation of the @vectorize and @guvectorize decorators from Numba.
Updated 16.08.2019. Better syntax, fewer errors and an example showing the inlining of functions.
Why should I use them?
Both decorators are extremely powerful as Numba compiles the instructions of the decorated function to machine code. That makes them significantly faster than the same operation written with Numpy. Of course, this is also true for functions decorated with @jit, but the other two decorators enable additional features like reduction, accumulation and broadcasting (see the example in the appendix) which make operations on bigger matrices even faster.
When should I use them?
“[P]remature optimization is the root of all evil” (Donald Knuth in Computer Programming as an Art, p. 671). Before you start to optimize your implementation, use tools like line_profiler or snakeviz to profile your code and find the real bottlenecks, not the ones you merely suspect. Often it is sufficient to find better implementations in the standard library or to rewrite parts such that they use Numpy or similarly optimized frameworks. If you use Numpy, do not use loops but array operations, as sketched below. Use higher-dimensional arrays to gain performance through broadcasting. Avoid copies. Also, refactoring and rewriting parts of the code may not give direct speed improvements, but your code becomes more compact and you can single out expensive operations.
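A minimal sketch of the loop-to-array rewrite (my example with hypothetical function names, not from the original post):
import numpy as np
def squared_distance_loop(x, y):
    # Element-wise Python loop - slow for big arrays.
    total = 0.0
    for i in range(len(x)):
        total += (x[i] - y[i]) ** 2
    return total
def squared_distance_np(x, y):
    # The same computation as a single vectorized array expression.
    return ((x - y) ** 2).sum()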
Imagine you have done all of this, but there are still some functions which take up the majority of the runtime. Next, decide whether you need a function that operates on single elements of an array or on whole arrays.
Using the @vectorize decorator
The @vectorize decorator is for writing efficient functions which work on every element of an array. One useful application for me was to compute the probability of x under a normal distribution with mean mu and standard deviation sigma. There already exists an implementation in scipy, but it is incredibly slow. Let us start by writing our own implementation.
import numpy as np
def get_prob_norm_dist(x, mu=0, sigma=1):
    # Density of the normal distribution with mean mu and standard deviation sigma.
    return 1 / (np.sqrt(2 * np.pi) * sigma) * np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
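For reference, this implements the well-known density of the normal distribution (the standard textbook formula, not specific to this post):

$$f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$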
We can compare our implementation with the scipy version to make sure it works as expected. Use the testing utilities from numpy.testing, as the precision of floating point numbers will always differ to some extent. (Tip: Use doctest to document and test your function at the same time. There is an example in the appendix.)
from scipy.stats import norm
np.random.seed(42)
a = np.random.randn(10)
np.testing.assert_array_almost_equal(norm.pdf(a), get_prob_norm_dist(a), decimal=15)
Now, create test data and measure the performance of our two methods.
a = np.random.randn(10000)
%timeit norm.pdf(a)
647 µs ± 47.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit get_prob_norm_dist(a)
78.4 µs ± 1.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Surprisingly, we are already faster by a factor of eight. At this point you should ask yourself again whether further optimization is really necessary and whether your time is better spent elsewhere - remember Donald Knuth! In this example we will go down the rabbit hole a little bit further. The next step is rewriting the function with the @vectorize decorator.
The major new thing is that the decorator requires a signature, or even multiple signatures, which define the input and output types of the function. The output type wraps the input types with round brackets. There are two ways to define the signature: declare the types of the arguments with types from Numba or declare the types with a string. I prefer the latter as it keeps your import statements short. With nopython=True, Numba does not fall back to plain Python objects if compilation fails; instead we get an error. Also, the new function does not support default arguments.
from numba import vectorize
@vectorize(["float64(float64, float64, float64)"], nopython=True)
def get_prob_norm_dist_fast(x, mu, sigma):
return 1 / (np.sqrt(2 * np.pi) * sigma) * np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
%timeit get_prob_norm_dist_fast(a, 0, 1)
248 µs ± 6.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
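As an aside, the same signature can be written with Numba's type objects instead of a string. This is just the other notation mentioned above; the function is otherwise identical (the _typed name is my own, not from the post):
from numba import float64
@vectorize([float64(float64, float64, float64)], nopython=True)
def get_prob_norm_dist_fast_typed(x, mu, sigma):
    return 1 / (np.sqrt(2 * np.pi) * sigma) * np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))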
Back to the timings: the new Numba version actually requires about three times the runtime of the Numpy version. But writing for Numba requires a slightly different coding style. As appealing as the one-liner may look, Numba works better if we store intermediate results in separate variables and combine them later.
@vectorize(["float64(float64, float64, float64)"], nopython=True, target="cpu")
def get_prob_norm_dist_fast(x, mu, sigma):
y = x - mu
a = np.sqrt(2 * np.pi) * sigma
b = np.exp(-y ** 2 / (2 * sigma ** 2))
probability = 1 / a * b
return probability
%timeit get_prob_norm_dist_fast(a, 0, 1)
248 µs ± 3.92 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
We do not gain any performance, but this is just a showcase. If the operation does not involve many calculations and the input is not big, you cannot expect more performance than plain Numpy. If we increase the number of calculations and set the target keyword from cpu to parallel, we get similar performance. The different target means that the compiled function does not use a single core, but a recommended number of cores on your machine.
@vectorize(["float64(float64, float64, float64)"], nopython=True, target="parallel")
def get_prob_norm_dist_fast_parallel(x, mu, sigma):
y = x - mu
a = np.sqrt(2 * np.pi) * sigma
b = np.exp(-y ** 2 / (2 * sigma ** 2))
probability = 1 / a * b
return probability
x = np.random.randn(10000, 10000)
%timeit get_prob_norm_dist_fast_parallel(x, 0, 1)
2.72 s ± 161 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit get_prob_norm_dist(x, 0, 1)
2.32 s ± 66.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Even if the resulting function is slower than the Numpy implementation, it can still make sense to write one. For example, you can only call Numba-compiled functions from other Numba-compiled functions. When this PR for Numba is finished, nesting functions will no longer be an obstacle, but until then it is not always possible. Also, Numba inlines functions, which means the function and its calls are compiled as one function. So you do not have to worry whether calling another function reduces performance.
Using the @guvectorize decorator
I found @guvectorize extremely useful and powerful, but at the same time hard to understand, as I was not experienced with the more general concepts of array computation. The following example is a Monte Carlo integration, which was my first point of contact with Numba.
Trying to understand what extrapolating from arrays to arrays of higher dimension means can be hard. Thus, we will start with the simplest example and then gradually expand the problem. Along the way, you will learn about the flexibility and, to some extent, the elegance of @guvectorize :).
The problem is that we have an agent faced with four choices which yield constant utilities plus a stochastic component that is i.i.d. and normally distributed. We want to find the expected maximum utility (emax) over all choices, which can be simulated by calculating the maximum utility for a number of draws. The average over these maxima yields the emax.
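In formulas (my notation, not from the original post): with static utilities $u_j$ for choices $j = 1, \dots, 4$ and i.i.d. draws $\epsilon_{rj} \sim N(0, 1)$ for $r = 1, \dots, R$,

$$\text{emax} = E\Big[\max_j \,(u_j + \epsilon_j)\Big] \approx \frac{1}{R} \sum_{r=1}^{R} \max_j \,(u_j + \epsilon_{rj}).$$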
import matplotlib.pyplot as plt
np.random.seed(42)
static_utilities = np.arange(1, 5)
1. Simulate emax for one agent with Numpy
We start by simulating the emax for one agent with Numpy to showcase the standard solution. We create 1,000 draws, add them to the deterministic component of the utility, take the maximum over choices and then the average over all maxima.
n_draws = 1000
draws = np.random.randn(n_draws, 4)
def simulate_emax_np(static_utilities, draws):
    # Deterministic plus stochastic components, shape (n_draws, n_choices).
    utilities = static_utilities + draws
    # Maximum utility over choices for each draw.
    max_utilities = utilities.max(axis=1)
    # Averaging the maxima yields the emax.
    emax = max_utilities.mean()
    return emax
simulate_emax_np(static_utilities, draws)
4.247215674679622
2. Simulate emax for one agent with Numba
Now, we will do the same thing with Numba. I will give you the function first and then we will discuss its components.
from numba import guvectorize
@guvectorize(
    ["f8[:], f8[:, :], f8[:]"], "(n_choices), (n_draws, n_choices) -> ()", nopython=True, target="cpu"
)
def simulate_emax(static_utilities, draws, emax):
    n_draws, n_choices = draws.shape
    emax_ = 0
    for i in range(n_draws):
        max_utility = 0
        for j in range(n_choices):
            # Deterministic plus stochastic component of choice j.
            utility = static_utilities[j] + draws[i, j]
            # Keep the running maximum; j == 0 initializes it.
            if utility > max_utility or j == 0:
                max_utility = utility
        emax_ += max_utility
    # The average over the maxima, written to the first entry of the output.
    emax[0] = emax_ / n_draws
emax = simulate_emax(static_utilities, draws)
emax
4.247215674679617
First, we will take a look at the decorator and the two new components.
- The first argument is the signature, a list containing type declarations of the inputs. f8 is just short-hand for float64; f8[:] represents a one-dimensional and f8[:, :] a two-dimensional array (types and signatures). Note that the output type does not wrap the input types. Instead, it is added to the end of the list. Furthermore, this function should return a scalar, which has to be declared as an array in the signature. Later, we will only write to the first entry of this array.
- The second argument is the layout, which specifies the dimensions of the inputs. You can use an arbitrary letter for a dimension, but assign the same letter to dimensions which should match. () represents a scalar and (n_draws, n_choices) an array where the first dimension is the number of draws and the second the number of choices. The return value is separated from the rest with an arrow, ->. Here the return is declared as a scalar.
Now, we will examine the function. In the special case of gufuncs, the return value is passed as an additional argument to the function.
Instead of array operations, we are very explicit within the function and do everything with loops. For each set of draws, we add the deterministic and stochastic components and keep only the maximum, which is stored in a temporary variable. In the end, the average over the sum of maximum utilities is written to the first slot of the output array to get a scalar output.
This explicit style helps Numba optimize the code. As before, temporary variables can help Numba optimize memory accesses. A tiny standalone example of the signature and layout mechanics follows below.
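Here is that sketch: a minimal gufunc that reduces a one-dimensional core dimension to a scalar (the function row_mean is my own illustration, not from the original post):
@guvectorize(["f8[:], f8[:]"], "(n) -> ()", nopython=True)
def row_mean(row, res):
    total = 0.0
    for i in range(row.shape[0]):
        total += row[i]
    # Scalar output: write to the first entry of the output array.
    res[0] = total / row.shape[0]
row_mean(np.arange(6.0).reshape(2, 3))  # array([1., 4.]): repeated over the leading dimension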
3. Simulate emax for many agents and many draws
The power of the gufunc becomes visible if we apply the function to many agents. This is possible without further adjustments, as the implementation automatically infers that the defined operation has to be repeated over all other dimensions. Specifically, @guvectorize extends the function defined for dimensions (n_choices,) and (n_draws, n_choices) to whichever dimensions are added to the left, meaning (..., n_choices) and (..., n_draws, n_choices).
Note that we use the same draws for all individuals.
n_agents = 500
static_utilities = np.tile(np.arange(1, 5), (n_agents, 1))
Unfortunately, we have to adjust the Numpy function as it cannot handle the new dimension. The gufunc is just fine ;). Let’s timeit!
def simulate_emax_np(static_utilities, draws):
utilities = static_utilities.reshape(-1, 1, 4) + draws
max_utilities = utilities.max(axis=2)
emax = max_utilities.mean(axis=1)
return emax
%timeit simulate_emax_np(static_utilities, draws)
37.1 ms ± 948 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit simulate_emax(static_utilities, draws)
2.22 ms ± 121 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This is the magic of the gufunc. You only have to define what should happen with the innermost dimensions, and the function will recognize that the operations should be repeated for all other dimensions added to the left. The basic rules of broadcasting apply. In the previous example we have reused the same set of draws for all agents. We could also have passed a draw matrix with dimensions (n_agents, n_draws, n_choices), as sketched below.
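For instance (my example, not from the original post), agent-specific draws broadcast just as well:
draws_per_agent = np.random.randn(n_agents, n_draws, 4)
simulate_emax(static_utilities, draws_per_agent)  # one emax per agent, shape (n_agents,)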
Just to drive the point home, assume that we do the operation for 500 individuals in each of five villages. So we add another dimension on the left side of the static utilities. Here is only the case for the gufunc, because I do not want to adjust the Numpy function again.
n_villages = 5
static_utilities = np.tile(np.arange(1, 5), (n_villages, n_agents, 1))
%timeit simulate_emax(static_utilities, draws)
11 ms ± 323 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Inlining functions
To show an example of @jit and how to inline functions, here is the following trick. Assume that you want to test different utility functions, but you do not want to duplicate the Monte Carlo integration over and over again. In this case, we can single out the utility function as a @jit-compiled function (here @njit, which is short for @jit with nopython=True). Within the @jit-function, assume that you only receive the objects which you define within the @guvectorize-function - a single static utility and one draw.
from numba import njit
@njit
def utility_func(static_utility, draw):
return static_utility + draw
@guvectorize(
["f8[:], f8[:, :], f8[:]"], "(n_choices), (n_draws, n_choices) -> ()", nopython=True, target="cpu"
)
def simulate_emax_inlined(static_utilities, draws, emax):
n_draws, n_choices = draws.shape
emax_ = 0
for i in range(n_draws):
max_utility = 0
for j in range(n_choices):
utility = utility_func(static_utilities[j], draws[i, j])
if utility > max_utility or j == 0:
max_utility = utility
emax_ += max_utility
emax[0] = emax_ / n_draws
%timeit simulate_emax_inlined(static_utilities, draws)
11.1 ms ± 254 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
There is no hit to performance, but a clear division between numerical technique and economic model.
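As a usage sketch (my example, not from the post): to test another specification, redefine utility_func and then re-run the @guvectorize definition, because the gufunc is compiled eagerly at decoration time and will not pick up a later redefinition on its own.
@njit
def utility_func(static_utility, draw):
    # Hypothetical alternative specification with a dampened shock.
    return static_utility + 0.5 * draw
# Re-running the simulate_emax_inlined definition above recompiles the
# integration with the new utility function inlined.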
Conclusion
I have illustrated how the @vectorize and @guvectorize decorators work and in which settings they can be applied. @jit will be covered in a different post and linked here. Luckily, it is easier to understand with the documentation and does not require much knowledge besides basic Numpy. But because the subset of supported functions and keywords is so restricted, it was not really useful to me. I achieved major performance boosts with the presented decorators and would recommend using them. If you find something helpful which is not covered here but suits the topic, write me an email.
Appendix
More information on Numba
- It is also possible to set target="cuda" and transfer the computation to the processor of your graphics card, the GPU. This can provide better performance if the arithmetic intensity per byte transferred to the GPU is sufficiently high. What does that mean? There is an additional overhead to running the function on the GPU instead of on the CPU (the processor of the computer), namely the transfer of the data. The function can only make up for the additional cost if there are enough operations on each byte. I read somewhere that you need at least 10 arithmetic operations per byte as a rule of thumb. (More information: here and here.) A sketch of the decorator call follows below.
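For illustration, a minimal sketch (assuming a CUDA-capable GPU and an installed CUDA toolkit; the function name is my own and the snippet is untested here):
from numba import vectorize
@vectorize(["float64(float64, float64, float64)"], target="cuda")
def get_prob_norm_dist_gpu(x, mu, sigma):
    return 1 / (np.sqrt(2 * np.pi) * sigma) * np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))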
Example for reduction, accumulation and broadcasting
import numpy as np
from numba import vectorize
@vectorize(["int32(int32, int32)", "int64(int64, int64)"])
def f(x, y):
return x + y
a = np.arange(12).reshape(3, 4)
a
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
f.reduce(a, axis=0)
array([12, 15, 18, 21])
f.reduce(a, axis=1)
array([ 6, 22, 38])
f.accumulate(a)
array([[ 0, 1, 2, 3],
[ 4, 6, 8, 10],
[12, 15, 18, 21]])
f.accumulate(a, axis=1)
array([[ 0, 1, 3, 6],
[ 4, 9, 15, 22],
[ 8, 17, 27, 38]])
Broadcasting is a complicated but fundamentally useful concept to understand when writing Numba (g)ufuncs. It describes how Numpy treats arrays of different shapes during arithmetic operations. The simplest example is the following.
a = np.array([1, 2, 3])
b = 2
a * b
array([2, 4, 6])
Obviously, the shapes of the two arrays do not match, but the result is still what we would expect, namely that b is applied to each element of a. This also works for higher-dimensional structures.
a = np.ones((2, 2, 5))
a
array([[[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.]],
[[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.]]])
b = np.full(5, 2)
b
array([2, 2, 2, 2, 2])
a * b
array([[[2., 2., 2., 2., 2.],
[2., 2., 2., 2., 2.]],
[[2., 2., 2., 2., 2.],
[2., 2., 2., 2., 2.]]])
We can see that the array b is applied along every dimension which is not part of b itself. The general rule behind broadcasting is that, when operating on two arrays, Numpy compares their shapes element-wise, starting with the trailing dimensions and working its way forward. Two dimensions are compatible when
- they are equal, or
- one of them is 1.
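For instance (my example, not from the original post), shapes (3, 4) and (4,) are compatible, while (3, 4) and (3,) are not:
np.ones((3, 4)) * np.ones(4)  # works: trailing dimensions 4 and 4 match
np.ones((3, 4)) * np.ones(3)  # ValueError: operands could not be broadcast together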
Example using doctest
We extend the function by writing an example in the docstring. Then we run doctest, which executes the example and checks whether the outputs match. Note that doctest does not work for gufuncs.
def get_prob_norm_dist(x, mu=0, sigma=1):
"""Get probability of ``x`` under the normal distribution.
Example
-------
The second test will fail!
>>> from scipy.stats import norm
>>> norm.pdf(0)
0.3989422804014327
>>> get_prob_norm_dist(0)
0.3989422804014328
"""
return (
1 /
(np.sqrt(2 * np.pi) * sigma) *
np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
)
import doctest; doctest.testmod()
**********************************************************************
File "__main__", line 11, in __main__.get_prob_norm_dist
Failed example:
get_prob_norm_dist(0)
Expected:
0.3989422804014328
Got:
0.3989422804014327
**********************************************************************
1 items had failures:
1 of 3 in __main__.get_prob_norm_dist
***Test Failed*** 1 failures.
TestResults(failed=1, attempted=3)