Date Category blog Tags Numpy Read in 21 minutes

Binder

Introduction

In one of my recent projects, I needed to accelarate a discrete choice dynamic programming model. After I changed a part of the implementation, the program was indeed faster. But, the most expensive operation according to profiling with snakeviz was now ~:0(<method 'copy' of 'numpy.ndarray' objects>). I was puzzled. I was sure that there was no use of np.copy() at all. After reading some StackOverflow posts and blog entries, it became clear that some operations and more importantly indexing methods return copies instead of views. The difference between the two is that views refer to the same underlying data in memory whereas a copy creates a new object. The disadvantages of a copy are:

  • takes more time
  • takes more memory

But, what operations return copies?

How to identify views and copies?

To notice whether two objects do not refer to the same data buffer in the memory, we use the following function.

In [1]:
import numpy as np
In [2]:
def aid(x):
    """This function returns the memory block address of an object."""
    return x.__array_interface__["data"][0]

Let us start simple. We construct an array and take a look at the memory block address of the array and the same array starting at the first position.

In [3]:
x = np.array([1, 2, 3])
aid(x), aid(x[1:])
Out[3]:
(2065100659856, 2065100659860)

Indeed, they have very similar addresses but the slice has an offset of 4. Every offset represents a byte so that the first number blocks 32bit. Furthermore, we know that the array contains integers. Thus, the dtype must be int32.

In [4]:
x.dtype
Out[4]:
dtype('int32')

Addresses are only identical, if they share the same first element.

In [5]:
aid(x), aid(x[:2])
Out[5]:
(2065100659856, 2065100659856)

As we are more interested in whether two objects come from the same memory block address instead whether they start at the same offset, we define two other functions.

In [6]:
def get_data_base(x):
    base = x
    while isinstance(base.base, np.ndarray):
        base = base.base
    return base

def arrays_share_data(x, y):
    return get_data_base(x) is get_data_base(y)

Just a quick test.

In [7]:
arrays_share_data(x, x.copy())
Out[7]:
False
In [8]:
arrays_share_data(x, x[1:])
Out[8]:
True

View vs. Copy

Let us start examining when a copy or a view is returned.

In-place operations

In [9]:
x = np.arange(16).reshape(4, 4)
In [10]:
x_base = get_data_base(x)
In [11]:
x *= 2
In [12]:
get_data_base(x) is x_base
Out[12]:
True
In [13]:
a = x * 2
In [14]:
arrays_share_data(x, a)
Out[14]:
False

Matrix multiplication

In [15]:
x = np.arange(16).reshape(4, 4)

x_base = get_data_base(x)
In [16]:
x = x.dot(np.eye(4))
In [17]:
get_data_base(x) is x_base
Out[17]:
False

Indexing

In [18]:
arrays_share_data(x, x[0])
Out[18]:
True
In [19]:
arrays_share_data(x, x[0, 0])
Out[19]:
False
In [20]:
arrays_share_data(x, x[:1, :1])
Out[20]:
True
In [21]:
arrays_share_data(x, x[0][0])
Out[21]:
False

This is a little bit mind-boggling, right? There is no problem with the first case as you probably expected that the two array share the same base. The other three cases all index the same element of the matrix, but in two of the cases a copy is returned. Why is that? The reason is that there are two kinds of indexing. The first group of indexing comprises simple indices, x[0], slices, x[:2] and boolean masks, x[x > 0]. These methods all return a view and not a copy. The second group is called fancy indexing and basically means that we use arrays of indices to access multiple values at once. The simplest way of fancy indexing is using lists of indices.

In the following example, a has just a different ordering of rows than x.

In [22]:
x = np.arange(16).reshape(4, 4)
In [23]:
x_base = get_data_base(x)
In [24]:
a = x[[0, 2, 1, 3]]
In [25]:
a
Out[25]:
array([[ 0,  1,  2,  3],
       [ 8,  9, 10, 11],
       [ 4,  5,  6,  7],
       [12, 13, 14, 15]])
In [26]:
arrays_share_data(x, a)
Out[26]:
False

We can also combine fancy indexing with other indexing schemes, but the return value is always a copy.

In [27]:
a = x[1, [1, 2]]
In [28]:
a
Out[28]:
array([5, 6])
In [29]:
arrays_share_data(x, a)
Out[29]:
False

So every form of fancy indexing returns a copy. What about the following case?

In [30]:
a = x[(1,)]

What do you expect?

In [31]:
arrays_share_data(x, a)
Out[31]:
True

And, this one?

In [32]:
x = np.arange(16).reshape(4, 4, -1)
In [33]:
a = x[(1, 2)]
In [34]:
arrays_share_data(x, a)
Out[34]:
True

I was a little bit puzzled by this one at first as I thought it is the same as the following.

In [35]:
a
Out[35]:
array([6])
In [36]:
b = x[[1, 2]]
In [37]:
b
Out[37]:
array([[[ 4],
        [ 5],
        [ 6],
        [ 7]],

       [[ 8],
        [ 9],
        [10],
        [11]]])
In [38]:
arrays_share_data(x, b)
Out[38]:
False

The reason is that ellipses are simply ommitted and if the resulting index is not fancy the return value is a view.