NumPy - Views vs. Copies
This article explains the difference between views and copies of a NumPy array.
Introduction
In one of my recent projects, I needed to accelerate a discrete choice dynamic programming model. After I changed a part of the implementation, the program was indeed faster. But according to profiling with snakeviz, the most expensive operation was now ~:0(<method 'copy' of 'numpy.ndarray' objects>). I was puzzled, because I was sure that the code did not call np.copy() anywhere. After reading some StackOverflow posts and blog entries, it became clear that some operations and, more importantly, some indexing methods return copies instead of views. The difference between the two is that a view refers to the same underlying data in memory as the original array, whereas a copy allocates a new data buffer and duplicates the values. The disadvantages of a copy are:
- takes more time
- takes more memory
But which operations return copies?
How to identify views and copies?
To check whether two objects refer to the same data buffer in memory, we use the following function.
import numpy as np
def aid(x):
    """This function returns the memory block address of an object."""
    return x.__array_interface__["data"][0]
Let us start simple. We construct an array and compare the memory block address of the array with that of the same array starting at the second element.
x = np.array([1, 2, 3])
aid(x), aid(x[1:])
(2065100659856, 2065100659860)
Indeed, the addresses are very similar, but the slice is offset by 4. Addresses are counted in bytes, so the first element occupies 4 bytes, or 32 bits. Since the array contains integers, the dtype must be int32. (On platforms where NumPy's default integer is 64 bits wide, the offset would be 8 and the dtype int64.)
x.dtype
dtype('int32')
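The relation between the address offset and the dtype can be checked directly. The following is a minimal sketch that fixes the dtype explicitly (an assumption, so that the result does not depend on the platform default) and compares the offset with the array's itemsize and strides.

```python
import numpy as np


def aid(x):
    """Return the address of the first byte of the array's data."""
    return x.__array_interface__["data"][0]


# Assumption: the dtype is set explicitly instead of relying on the default.
x = np.array([1, 2, 3], dtype=np.int32)

# Distance in bytes between the start of x and the start of x[1:].
offset = aid(x[1:]) - aid(x)

# itemsize is the size of one element in bytes, strides is the step in
# bytes to the next element along each axis.
print(offset, x.itemsize, x.strides)
```

For a one-dimensional contiguous array, all three values describe the same thing: the number of bytes from one element to the next.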
The addresses are identical only if both arrays start at the same first element.
aid(x), aid(x[:2])
(2065100659856, 2065100659856)
As we are more interested in whether two objects come from the same memory block than in whether they start at the same offset, we define two other functions.
def get_data_base(x):
    base = x
    while isinstance(base.base, np.ndarray):
        base = base.base
    return base
def arrays_share_data(x, y):
    return get_data_base(x) is get_data_base(y)
Just a quick test.
arrays_share_data(x, x.copy())
False
arrays_share_data(x, x[1:])
True
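NumPy also ships a built-in check, np.shares_memory, which can serve as a cross-check for the helper functions. The sketch below is not part of the original workflow. Note that it answers a slightly different question: it reports whether two arrays actually overlap in memory, not whether they descend from the same base.

```python
import numpy as np

x = np.array([1, 2, 3])

# A slice overlaps with the original array.
slice_shares = np.shares_memory(x, x[1:])

# A copy owns a separate buffer.
copy_shares = np.shares_memory(x, x.copy())

# Two views of the same base that do not overlap: they come from the same
# memory block, but they have no element in common.
disjoint_shares = np.shares_memory(x[:1], x[2:])

print(slice_shares, copy_shares, disjoint_shares)
```

For the question of whether an operation produced a copy, both approaches agree. They only differ for views of the same base that have no element in common.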
View vs. Copy
Let us start examining when a copy or a view is returned.
In-place operations
x = np.arange(16).reshape(4, 4)
x_base = get_data_base(x)
x *= 2
get_data_base(x) is x_base
True
a = x * 2
arrays_share_data(x, a)
False
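The in-place behaviour is also available through the out argument of NumPy's ufuncs. The following is a small sketch, not taken from the original project, which shows that np.multiply with out=x writes into the existing buffer just like x *= 2.

```python
import numpy as np

x = np.arange(16).reshape(4, 4)
address_before = x.__array_interface__["data"][0]

# x = x * 2 would allocate a new array and rebind the name x.
# With out=x, the result is written into the buffer x already uses.
np.multiply(x, 2, out=x)

same_buffer = x.__array_interface__["data"][0] == address_before
print(same_buffer)
```

This can be handy in longer expressions, where each intermediate result would otherwise allocate a temporary array.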
Matrix multiplication
x = np.arange(16).reshape(4, 4)
x_base = get_data_base(x)
x = x.dot(np.eye(4))
get_data_base(x) is x_base
False
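A matrix product cannot be computed in place, but the allocation can at least be moved out of a hot loop by passing a preallocated array via the out argument of np.dot. This is only a sketch under the assumption that the result buffer (the hypothetical name out below) has exactly the shape and dtype of the product and is C-contiguous, which np.dot requires.

```python
import numpy as np

x = np.arange(16, dtype=np.float64).reshape(4, 4)

# Preallocated result buffer with the shape and dtype of the product.
out = np.empty((4, 4), dtype=np.float64)

# The product is written into out instead of a freshly allocated array.
result = np.dot(x, np.eye(4), out=out)

print(np.shares_memory(result, out))  # result lives in the preallocated buffer
print(np.shares_memory(result, x))    # but never in the buffer of the input
```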
Indexing
arrays_share_data(x, x[0])
True
arrays_share_data(x, x[0, 0])
False
arrays_share_data(x, x[:1, :1])
True
arrays_share_data(x, x[0][0])
False
This is a little bit mind-boggling, right? There is no problem with the first case, as you probably expected that the two arrays share the same base. The other three cases all index the same element of the matrix, but in two of the cases a copy is returned. Why is that? x[0, 0] and x[0][0] select a single element, and NumPy returns a single element as a scalar rather than as an array, so nothing is shared with x. x[:1, :1] uses only slices and returns a 1x1 array, which is a view. More generally, there are two kinds of indexing. The first group, basic indexing, comprises simple indices, x[0], and slices, x[:2]. Whenever the result is an array, these methods return a view and not a copy. The second group is called fancy indexing and basically means that we use arrays of indices to access multiple values at once. Boolean masks, x[x > 0], belong to this group as well and return a copy. The simplest way of fancy indexing is using lists of indices.
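Which indexing expressions share memory with x can be checked one by one. The sketch below uses np.shares_memory instead of the helper functions defined earlier, purely to keep the example self-contained.

```python
import numpy as np

x = np.arange(16).reshape(4, 4)

row_is_view = np.shares_memory(x, x[0])         # integer index
slice_is_view = np.shares_memory(x, x[:1, :1])  # slices only
list_is_view = np.shares_memory(x, x[[0, 1]])   # list of indices
mask_is_view = np.shares_memory(x, x[x > 0])    # boolean mask

# Selecting a single element yields a NumPy scalar, not an array.
element_is_array = isinstance(x[0, 0], np.ndarray)

print(row_is_view, slice_is_view, list_is_view, mask_is_view, element_is_array)
```

Only the integer index and the slices produce arrays that share memory with x.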
In the following example, a has just a different ordering of rows than x.
x = np.arange(16).reshape(4, 4)
x_base = get_data_base(x)
a = x[[0, 2, 1, 3]]
a
array([[ 0,  1,  2,  3],
       [ 8,  9, 10, 11],
       [ 4,  5,  6,  7],
       [12, 13, 14, 15]])
arrays_share_data(x, a)
False
We can also combine fancy indexing with other indexing schemes, but the return value is always a copy.
a = x[1, [1, 2]]
a
array([5, 6])
arrays_share_data(x, a)
False
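The practical consequence shows up as soon as the result is modified. In this sketch, the same two elements are selected once with a slice and once with a list of indices. The names view and copy are only illustrative.

```python
import numpy as np

x = np.arange(16).reshape(4, 4)

view = x[1, 1:3]     # basic indexing: shares memory with x
copy = x[1, [1, 2]]  # fancy indexing: independent buffer

view[0] = -1         # writes through to x[1, 1]
copy[1] = -2         # changes only the copy, x[1, 2] stays 6

print(x[1])
print(copy)
```

After both assignments, only the write through the view is visible in x.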
So every form of fancy indexing returns a copy. What about the following case?
a = x[(1,)]
What do you expect?
arrays_share_data(x, a)
True
And, this one?
x = np.arange(16).reshape(4, 4, -1)
a = x[(1, 2)]
arrays_share_data(x, a)
True
I was a little bit puzzled by this one at first, as I thought that it would be the same as indexing with the list [1, 2]. Compare the two results.
a
array([6])
b = x[[1, 2]]
b
array([[[ 4],
        [ 5],
        [ 6],
        [ 7]],

       [[ 8],
        [ 9],
        [10],
        [11]]])
arrays_share_data(x, b)
False
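The reason is that the parentheses only build a tuple, and indexing with a tuple is the same as writing its elements separated by commas: x[(1, 2)] is equivalent to x[1, 2]. Since this index contains only integers, it is not fancy and the return value is a view. A list, in contrast, is treated as an array of indices and triggers fancy indexing.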
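The two index expressions can be compared side by side. The following sketch uses np.shares_memory in place of the helper functions from above so that it runs on its own.

```python
import numpy as np

x = np.arange(16).reshape(4, 4, -1)  # shape (4, 4, 1)

a = x[(1, 2)]  # tuple index: the same as x[1, 2]
b = x[[1, 2]]  # list index: selects rows 1 and 2 along the first axis

a_is_view = np.shares_memory(x, a)
b_is_view = np.shares_memory(x, b)

print(a.shape, a_is_view)
print(b.shape, b_is_view)
```

The shapes already reveal the difference: the tuple consumes two axes, while the list picks two entries along the first axis only.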