NumPy Formalized

deep learning
Author

Koen Baak

Published

March 2, 2024

The word tensor seems to be everywhere in Deep Learning. You will encounter it in most theoretical texts as well as in most frameworks. Heck, one of the most popular frameworks around is even called tensorflow! But what exactly is a tensor?

Being unsure about what the word tensor precisely means, the first thing any sane person will do is try to find a mathematical definition. Here, some confusion will often arise, and I have myself struggled with this confusion longer than I would have liked. The problem is that all the really nice and formal definitions you will find define a notion that is called a tensor by mathematicians and physicists, but that does not seem to be the same thing all the Deep Learning people are happily talking about. You will also never see this definition in the context of Deep Learning, which makes the situation especially suspicious.

To keep a long story short, it turns out to be really simple: tensors (in DL) are just multi-dimensional arrays. That is, exactly what is called an ndarray in NumPy. Nothing more, nothing less. The definition found in Maths and Physics defines a distinct notion.

Multi-dimensional arrays are used everywhere in scientific computing, but just as their mathematical definition is never written out explicitly, one is also hard pressed to find a formal theory on how to do linear algebra with them. Of course, the space of all multi-dimensional arrays (or let us also just call them tensors) of a given shape is finite dimensional, thus isomorphic to \(\R^n\) for some \(n\). So we should not really expect a lot of fancy new maths to turn up. In fact, it is actually just regular linear algebra with a bunch of annoying index shuffling on top of it. However, because the role of tensors is so important in Deep Learning (and beyond), it might be worthwhile to explore a formal treatment of the theory. We will try to do that in this post!

In the end Deep Learning is just gradient descent on functions from tensors to tensors, so it is also important to formally consider the notion of a derivative of such a map. This will be the dessert of this post.

Warning

This blog post is really for those who have an unhealthy appetite for formalizing everything, even when it is not necessary. If you are a more healthy individual and this is the first post on this site you are reading (and it might be, since it is the first post chronologically), please consider just skipping this one.

A Formal Definition

As anyone who has read an explanatory blog post on tensors before knows, it is required by law to include a picture that illustrates how tensors are generalizations of vectors and matrices. Here is my variant.

From Scalar to Tensor

We can make this more explicit as follows:

  • A scalar is just a real number \(\lambda\in\R\). This can be viewed as a map \(\{0\}\to\R\).
  • A vector is an element \(\bs x \in \R^n\) for some \(n\in\N\).1 This can be viewed as a map \(\{0, \ldots, n-1\}\to\R\).
  • A real matrix of shape \((m, n)\) is a map \(\{0, \ldots, m-1\}\times \{0, \ldots, n-1\}\to \R\).

A real life example of a rank \(3\) tensor could be an RGB color image with one axis for color, one for \(x\) coordinate and one for \(y\) coordinate.
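In NumPy such an image might look as follows (a small sketch; the axis order and sizes are just an assumption for illustration):

import numpy as np

# a hypothetical 4-by-5 RGB image: axis 0 is the color channel, axes 1 and 2 are the y and x coordinates
image = np.zeros((3, 4, 5))
image.ndim  # the rank of the tensor
3
image.shape
(3, 4, 5)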

It now becomes quite clear how we should formally define tensors. Before we do, some notation that will be used throughout this post:

  • \(\N\) is the set of natural numbers starting at \(1\).
  • For any set \(A\), the set \(A^{<\omega}\) denotes the set of finite sequences in \(A\), that is, \(A^{<\omega}=\bigcup_{n=0}^\infty A^n\).
  • For any natural number \(n\in \N\) we identify \(n\) with the set \(\{0, \ldots, n-1\}\) whenever useful.
Definition.

A tensor of shape \(\bs n = (n_1, \ldots, n_{K}) \in \N^{<\omega}\) is a map \[\begin{align*} \bs{\cal X}\c n_1\times\ldots\times n_{K}\to \R. \end{align*}\] The number \(K\) is called the rank of \(\bs{\cal X}\). We will write \(I_{\bs n}\) for the set \(n_1\times\ldots\times n_K\) and its elements will be called indices.

Note that the terminology “rank” is not universally used. It is indeed called rank in tensorflow, but ndim in numpy and dim in pytorch.
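To connect the definition with NumPy: an ndarray of shape \(\bs n\) is nothing but a table of values of such a map on \(I_{\bs n}\). A tiny sketch (from_map is a made-up helper, not a NumPy function):

import numpy as np

def from_map(shape: tuple[int, ...], f) -> np.ndarray:
    # build the ndarray whose entry at index i is f(i), for every index i in I_n
    x = np.empty(shape)
    for i in np.ndindex(shape):
        x[i] = f(i)
    return x

# the tensor of shape (2, 3) given by the map (i, j) -> i + j
from_map((2, 3), lambda i: i[0] + i[1])
array([[0., 1., 2.],
       [1., 2., 3.]])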

When we have a tensor \(\cal{\bs X}\) of shape \(\bs n\) and rank \(K\), we can make tensors of rank \(L\) for every \(L\ge K\) holding exactly the same data by just adding ones to the shape. To illustrate this with numpy:

import numpy as np

x = np.array([1, 2, 3])
x.shape
(3,)
y = x.reshape((1, 3))
y
array([[1, 2, 3]])
y.shape
(1, 3)

This might make it feel as if our notion is not completely well defined, but it is actually not a problem for our theory, and we can sometimes even use this as a trick.

Now that we know what a tensor is, let’s consider the structure of a space of tensors. For a given shape \(\bs n \in \N^{<\omega}\), let \(\R^{\bs n}\) denote the set of all tensors of shape \(\bs n\). Pointwise addition and scalar multiplication give this set the structure of a real vector space.2 For any \(\bs i\in I_{\bs n}\) we define the tensor \(\bs{\cal E}_{\bs i}\) to be zero everywhere except at \(\bs i\) where it is \(1\). The set \(\{\bs{\cal E}_{\bs i} : \bs i \in I_{\bs n}\}\) is a basis for \(\R^{\bs n}\), showing that \(\R^{\bs n}\) is finite dimensional with dimension \(|I_{\bs n}| = \prod_{k=1}^{K}n_k\).
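A quick illustration of these basis tensors in NumPy (basis_tensor is a made-up helper, not a NumPy function):

import numpy as np

def basis_tensor(shape: tuple[int, ...], i: tuple[int, ...]) -> np.ndarray:
    # the tensor E_i: zero everywhere except for a single 1 at index i
    e = np.zeros(shape)
    e[i] = 1.0
    return e

basis_tensor((2, 3), (0, 1))
array([[0., 1., 0.],
       [0., 0., 0.]])
# the dimension of R^(2, 3) equals the number of basis tensors
int(np.prod((2, 3)))
6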

So we are now firmly in the world of linear algebra. Let’s come back to the word “tensor” one last time and see why it is not such a bad word.

Proposition.

Let \(V_1, \ldots, V_{K}\) be real finite dimensional vector spaces with dimensions \(n_1, \ldots, n_{K}\). Let \(\bs n\) denote the sequence \((n_1, \ldots, n_{K})\). We have an isomorphism of vector spaces \[\begin{align*} \bigotimes_{k=1}^{K}V_k \cong \R^{\bs n} \end{align*}\]

Proof.

For \(1\le k \le K\), let \(\{b^k_i\}_{i=0}^{n_k-1}\) be a basis of \(V_k\). Any element in \(\bigotimes_{k=1}^{K} V_k\) can be uniquely written as \[\begin{align*} \sum_{(i_1, \ldots, i_{K})\in I_{\bs n}}\lambda_{i_1, \ldots, i_{K}}(b^1_{i_1}\otimes\ldots\otimes b^{K}_{i_{K}}). \end{align*}\] So we can just map this element to the tensor \[\begin{align*} \mathcal{\bs\Lambda}\c (i_1, \ldots, i_{K})\mapsto \lambda_{i_1, \ldots, i_{K}} \end{align*}\] and this defines an isomorphism.
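For \(K=2\) this isomorphism is easy to see concretely in NumPy: the outer product of two standard basis vectors is exactly one of the basis tensors \(\bs{\cal E}_{\bs i}\) from before. A small sketch:

import numpy as np

# standard basis vectors e_1 of R^2 and e_2 of R^3 (0-indexed)
b1 = np.array([0.0, 1.0])
b2 = np.array([0.0, 0.0, 1.0])

# the outer product b1 (x) b2 corresponds to the basis tensor E_(1, 2) of shape (2, 3)
np.multiply.outer(b1, b2)
array([[0., 0., 0.],
       [0., 0., 1.]])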

Note.

For simplicity, we have done everything above over the real numbers. Of course, no one stops you from considering all of this over an arbitrary field.

The Tensordot Product

If you have worked with tensors in practice before, you have probably encountered the tensordot product. And you would be quite tough if you were not at least a little bit intimidated. The tensordot product is not the simplest operation to wrap your head around, and it does not help that it is not straightforward to give a simple definition. Informally we can say that the tensordot product of two tensors \(\mathcal{\bs X}\) and \(\mathcal{\bs Y}\) is the tensor that results after contracting some paired axes of \(\mathcal{\bs X}\) and \(\mathcal{\bs Y}\). In the case of an \(n\times m\) matrix \(A\) and an \(m\times k\) matrix \(B\), we can pair the second axis of \(A\) with the first axis of \(B\) and after contraction we retrieve an \(n\times k\) matrix (this is just matrix multiplication).
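Before giving the formal definition, here is a quick NumPy sanity check of the matrix example (nothing here depends on the formalism that follows):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 4))  # an n x m matrix
B = rng.standard_normal((4, 3))  # an m x k matrix

# pair the second axis of A with the first axis of B and contract
C = np.tensordot(A, B, axes=([1], [0]))
np.allclose(C, A @ B)
True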

Simple or not, we will give a formal definition!

Definition.

Let \(\cal{\bs X}\) be a rank \(K\) tensor of shape \(\bs n\) and \(\cal{\bs Y}\) a rank \(L\) tensor of shape \(\bs m\). An axis pairing of \(\cal{\bs X}\) and \(\cal{\bs Y}\) is a subset \(P \subseteq \{1, \ldots, K\}\times \{1,\ldots,L\}\) in which every axis of \(\cal{\bs X}\) and every axis of \(\cal{\bs Y}\) occurs at most once, such that for all \((k, \ell)\in P\) we have \(n_k = m_\ell\). The set \(\bigtimes_{(k, \ell)\in P} n_k\) is denoted by \(I_P\).

We introduce some notation to do index manipulation for us:

  • For any \(\bs{n}, \bs{m} \in \N^{< \omega}\) we define \(\bs{n}\odot\bs{m}\) to be the concatenation of \(\bs n\) and \(\bs m\).
  • If \(P\) is an axis pairing, then we define \(\hat{\bs n}_P\) to be the shape \(\bs n\) with all entries whose axis occurs in \(P\) omitted (and similarly for \(\hat{\bs m}_P\)).
  • For an index \(\bs{i}\in I_{\hat{\bs n}_P}\) and \(\bs a\in I_P\) we define \(\bs{i}_{P=\bs a}\) to be \(\bs i\) with the entries of \(\bs a\) inserted at the axes paired by \(P\), giving an index in \(I_{\bs n}\) (and similarly for indices of \(\hat{\bs m}_P\)).

If this is somewhat vague, we can also express it in Python:

import typing as t

# just to make type hints more readable
Shape: t.TypeAlias = tuple[int, ...]
Pairing: t.TypeAlias = list[tuple[int, int]]
Index: t.TypeAlias = tuple[int, ...]

def hat(n: Shape, p: Pairing, left: bool) -> Shape:
    omitted_axes = set([pair[not left] for pair in p]) # pair[0] if left else pair[1]

    return tuple([
        ni for axis, ni in enumerate(n) if axis not in omitted_axes
    ])

def ipa(i: Index, p: Pairing, a: Index, left: bool) -> Index:
    # also a nice beer :)
    result = list(i)

    to_insert = {}
    for pair_idx, pair in enumerate(p):
        paired_axis = pair[not left] # pair[0] if left else pair[1]
        to_insert[paired_axis] = a[pair_idx]
    
    for paired_axis in sorted(to_insert):
        result.insert(paired_axis, to_insert[paired_axis])
    
    return tuple(result)

# example

n: Shape = (2, 3, 4)
m: Shape = (3, 2)
p: Pairing = [(0, 1), (1, 0)]

hat(n, p, left=True) # \hat{n}_P
(4,)
hat(m, p, left=False) #\hat{m}_P
()
i_left: Index = (3, ) # indexer for tensor of shape n
i_right: Index = () # indexer for tensor of shape m
a: Index = (1, 2) # indexer for the paired axes

ipa(i_left, p, a, left=True) 
(1, 2, 3)
ipa(i_right, p, a, left=False)
(2, 1)
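For completeness, the index set \(I_P\) from the definition of an axis pairing corresponds to iterating over the shape built from the paired axes. A small sketch of two more helpers (made up here, not used later):

def is_axis_pairing(n: Shape, m: Shape, p: Pairing) -> bool:
    # a pairing is valid when every pair of paired axes has equal length
    return all(n[k] == m[ell] for k, ell in p)

def paired_shape(n: Shape, p: Pairing) -> Shape:
    # the shape whose index set is I_P
    return tuple(n[k] for k, _ in p)

is_axis_pairing(n, m, p)  # using n, m and p from the example above
True
paired_shape(n, p)
(2, 3)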

We now have the notation to give a nice definition of the tensordot product:

Definition.

Let \(\cal{\bs X}\) be a rank \(K\) tensor of shape \(\bs n\) and \(\cal{\bs Y}\) a rank \(L\) tensor of shape \(\bs m\). Let \(P\) be an axis pairing of \(\cal{\bs X}\) and \(\cal{\bs Y}\). The \(P\)-tensordot product \(\mathcal{\bs X}\ast_P\mathcal{\bs Y}\) of \(\cal{\bs X}\) and \(\cal{\bs Y}\) is the tensor of shape \(\hat{\bs n}_P\odot\hat{\bs m}_P\) defined by

\[\begin{align*} \mathcal{\bs X}\ast_P\mathcal{\bs Y}(\bs i\odot\bs j) = \sum_{\bs a \in I_P}\mathcal{\bs X}(\bs i_{P=\bs a})\mathcal{\bs Y}(\bs j_{P=\bs a}). \end{align*}\]

In the special case that \(P = \{(K - M + i, i): 1\le i \le M \}\) for some \(M\le \min(K, L)\) we can also simply write \(\ast_M\). This allows us to use common tensordot products without explicitly needing to define the pairing. For example, if \(K=L=2\), then the tensors are just matrices and the tensordot product \(\ast_1\) is matrix multiplication.
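We can already check this shorthand against NumPy’s integer axes argument (a quick sanity check):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
X = rng.standard_normal((2, 3))

np.allclose(np.tensordot(A, B, axes=1), A @ B)  # *_1 on matrices is matrix multiplication
True
np.allclose(np.tensordot(A, X, axes=2), np.sum(A * X))  # *_2 contracts both axes
True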

And we can also write a custom implementation!

import numpy as np
import numpy.typing as npt

def tensordot(x: npt.NDArray, y: npt.NDArray, p: Pairing) -> npt.NDArray:
    n, m = x.shape, y.shape
    K, L = x.ndim, y.ndim

    nhat, mhat = hat(n, p, left=True), hat(m, p, left=False)
    result_shape = nhat + mhat
    p_shape = tuple([n[i] for i, _ in p])
    result = np.zeros(shape=result_shape)

    for i in np.ndindex(nhat):
        for j in np.ndindex(mhat):
            for a in np.ndindex(p_shape):
                result[i+j] += x[ipa(i, p, a, left=True)] * y[ipa(j, p, a, left=False)]

    return result

Let’s check whether our implementation gives the same answer as the numpy implementation.

# check answer of first tensordot example from the NumPy documentation
a = np.arange(60.).reshape(3,4,5)
b = np.arange(24.).reshape(4,3,2)
p = ([1,0],[0,1])
np_c = np.tensordot(a, b, axes=p)
custom_c = tensordot(a, b, p)
np.array_equal(np_c, custom_c)
True

The tensordot product generalizes a lot of notions we already know! Let’s see some examples.

  • Let \(A\) be an \(m\times n\) matrix and \(B\) an \(n\times k\) matrix. The tensordot product \(A\ast_1 B\) is just good old matrix multiplication.
  • Let \(\mathcal{\bs X}\) and \(\mathcal{\bs Y}\) be rank \(K\) tensors of shape \(\bs n\). We have \[\begin{align*} \mathcal{\bs X}\ast_K\mathcal{\bs Y} = \sum_{\bs i\in I_{\bs n}}\mathcal{\bs X}(\bs i)\mathcal{\bs Y}(\bs i). \end{align*}\] This generalizes the inner product.
  • Let \(\mathcal{\bs X}\) and \(\mathcal{\bs Y}\) be rank \(K\) tensors of shape \(\bs n\). Let \(\mathrm{diag}(\mathcal{\bs X})\) denote the tensor of shape \(\bs n\odot \bs n\) given by \[\begin{align*} \mathrm{diag}(\mathcal{\bs X})(\bs i\odot \bs j) = \begin{cases} \mathcal{\bs X}(\bs i) & \bs i = \bs j \\ 0 & \text{otherwise.} \end{cases} \end{align*}\] The tensordot product \(\mathrm{diag}(\mathcal{\bs X})\ast_K\mathcal{\bs Y}\) is the same as pointwise multiplication of \(\mathcal{\bs X}\) and \(\mathcal{\bs Y}\).
  • Let \(\mathcal{\bs X}\) be a tensor of shape \(\bs n\) and \(\mathcal{\bs Y}\) a tensor of shape \(\bs m\). We can view \(\mathcal{\bs X}\) as a tensor of shape \(\bs n \odot (1)\) and \(\mathcal{\bs Y}\) as a tensor of shape \((1)\odot \bs m\). The tensordot product \(\mathcal{\bs X}\ast_1 \mathcal{\bs Y}\) is called the Kronecker product of \(\mathcal{\bs X}\) and \(\mathcal{\bs Y}\).
  • OK… Let’s do one actual real-life application. We can use the tensordot product to implement a convolutional layer in a neural network. We will not explain what that is, so if you are unsure, please skip this example. For simplicity we will only look at a specific situation. The input is a batch of images and the input shape looks like (batch_size, n_channels, height, width). The user is always kind enough to provide a kernel whose shape looks like (n_channels, kernel_height, kernel_width, n_output_channels), where the number of kernel input channels coincides with the number of channels of the input batch. Note that the implementation given here is very inefficient in the way it prepares the input tensor x. In reality you should probably do some fancy stuff, like stride manipulation. However, the goal here is to show the usefulness of the tensordot operation and not to distract with all kinds of numpy black magic.
import numpy as np
import numpy.typing as npt

def subtensor(x: npt.NDArray, start_height: int, kernel_height: int, start_width: int, kernel_width: int) -> npt.NDArray:
    # the kernel-sized window starting at (start_height, start_width), for every image and channel
    return x[:, :, start_height:start_height+kernel_height, start_width:start_width+kernel_width]

def conv(x: npt.NDArray, kernel: npt.NDArray) -> npt.NDArray:
    (batch_size, n_channels, height, width) = x.shape
    (n_channels, kernel_height, kernel_width, n_out_channels) = kernel.shape
    # stack all kernel-sized windows along a new leading axis:
    # shape (n_positions, batch_size, n_channels, kernel_height, kernel_width)
    x = np.array([
        subtensor(x, i, kernel_height, j, kernel_width)
        for i in range(height - kernel_height + 1)
        for j in range(width - kernel_width + 1)
    ])
    # contract the (n_channels, kernel_height, kernel_width) axes against the kernel
    return np.tensordot(x, kernel, axes=3)
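A quick shape check on random data (the sizes here are arbitrary). Note that the output is indexed by position first and is not reshaped back into an image grid, since that is beside the point here:

x = np.random.default_rng(0).standard_normal((2, 3, 8, 8))       # (batch, channels, height, width)
kernel = np.random.default_rng(1).standard_normal((3, 5, 5, 4))  # (channels, kh, kw, out_channels)
conv(x, kernel).shape  # ((8-5+1) * (8-5+1) positions, batch, out_channels)
(16, 2, 4)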

That was the only applied stuff we will see in this blog post, so let’s quickly get back to the theory. A cool thing about \(m\times n\) matrices is that they are in one-to-one correspondence with linear maps \(\R^n\to\R^m\), with matrix multiplication playing the role of function evaluation. We now have a nice theory where \(\R^{\bs n}\) generalizes \(\R^n\), a tensor generalizes a matrix and the tensordot product generalizes matrix multiplication. This probably allows us to call the following result trivial, or at least obvious.

Proposition.

Let \(\bs n = (n_1, \ldots, n_{K})\) and \(\bs m = (m_1, \ldots, m_{L})\) be shapes. There is an isomorphism \[\begin{align*} \varphi\c\hom(\R^{\bs n}, \R^{\bs m})\isoto \R^{\bs m\odot\bs n} \end{align*}\] such that \[\begin{align*} f(\mathcal{\bs X}) = \varphi(f)\ast_K\mathcal{\bs X} \end{align*}\] for all \(\mathcal{\bs X}\in\R^{\bs n}\).

Proof.

Let \(f\in \hom(\R^{\bs n}, \R^{\bs m})\). For any \(\bs i \in I_{\bs n}\) there are \(a_{\bs i, \bs j}\) for \(\bs j \in I_{\bs m}\) such that \[\begin{align*} f(\mathcal{\bs E}_{\bs i}) = \sum_{\bs j \in I_{\bs m}} a_{\bs i, \bs j}\mathcal{\bs E}_{\bs j}. \end{align*}\] We now define \(\varphi(f) \in \R^{\bs m\odot \bs n}\) to be the tensor defined by \[\begin{align*} \varphi(f)(\bs j\odot \bs i) = a_{\bs i, \bs j}. \end{align*}\] It is not hard to see that this assignment satisfies the claim.
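Numerically, this isomorphism just says that applying a tensor of shape \(\bs m\odot\bs n\) via the tensordot product is ordinary matrix-vector multiplication after flattening. A small check (the shapes are chosen arbitrarily):

import numpy as np

rng = np.random.default_rng(0)
n, m = (2, 3), (4, 5)
W = rng.standard_normal(m + n)  # a tensor of shape m ⊙ n, i.e. a linear map R^n -> R^m
X = rng.standard_normal(n)

lhs = np.tensordot(W, X, axes=len(n))  # W *_K X with K = len(n)
rhs = (W.reshape(int(np.prod(m)), int(np.prod(n))) @ X.ravel()).reshape(m)
np.allclose(lhs, rhs)
True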

Differentiating Tensor-Valued Functions

Let’s take a small step back. Tensors turn up as parameters in neural networks. For example, a simple linear layer with input size \(n\) and \(m\) neurons is given by \(\bs{x}\mapsto \sigma(W\bs x + \bs b)\), where \(W\) is an \(m\times n\) matrix, \(\sigma\) is some activation function \(\R^m\to\R^m\) and \(\bs{b}\in\R^m\). During back propagation of the weight \(W\) we view this as a function \(\R^{(m, n)}\to \R^m\). In general, during back propagation, layers are often functions whose input space is \(\R^{\bs n}\) for some shape \(\bs n\). If we want to compute the gradients of these functions, we will need a theory of differentiation for \(\R^\bs{n}\). As one might expect, there is actually nothing really difficult to do: we can just generalize all the theory we have on \(\R^n\). We will briefly mention some points.

Differentiability implies continuity, so we must first think about what continuous functions on \(\R^{\bs n}\) are. The topological structure on \(\R^n\) comes from the Euclidean distance \(\|\bs{x} - \bs{y}\|\). The norm \(\|.\|\) here is actually defined by the standard inner product: \(\|\bs{x}\| = \sqrt{\langle \bs{x}, \bs{x}\rangle}\). We saw that we have a generalization of the standard inner product on \(\R^{\bs n}\) (the second example of the tensordot product above)! This gives us a norm on \(\R^{\bs n}\), namely \[\begin{align*} \|\mathcal{\bs{X}}\| = \sqrt{\mathcal{\bs{X}}\ast_K\mathcal{\bs{X}}}. \end{align*}\]
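As a quick sanity check, this norm is just the Euclidean norm of the flattened tensor:

import numpy as np

X = np.random.default_rng(0).standard_normal((2, 3, 4))
norm = np.sqrt(np.tensordot(X, X, axes=X.ndim))  # sqrt(X *_K X)
np.allclose(norm, np.linalg.norm(X.ravel()))
True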
This norm induces a metric, which allows us to talk about continuity, limits and all other kinds of nice topological stuff. Without giving all the details, we trust that the following makes sense.

Definition.

Let \(U\subseteq \R^{\bs n}\) be open, \(\mathcal{\bs A}\in U\) and \(f\c U \to \R^{\bs m}\). Write \(f_{\bs j}\) for the component function \(\mathcal{\bs X}\mapsto f(\mathcal{\bs X})(\bs j)\). For \(\bs{i}\in I_{\bs{n}}\) and \(\bs{j}\in I_{\bs{m}}\) the partial derivative \(\frac{\partial f_{\bs{j}}}{\partial x_{\bs i}}\) at \(\mathcal{\bs{A}}\) is defined by

\[\begin{align*} \lim_{h\to 0} \dfrac{f_{\bs{j}} (\dltensor{A} + h\dltensor{E}_{\bs{i}}) - f_{\bs{j}}(\dltensor{A} )}{h}. \end{align*}\]

We can gather all these partial derivatives in a Jacobian tensor. This tensor has shape \(\bs{m}\odot\bs{n}\) and is given by

\[\begin{align*} \dltensor{J}_f(\dltensor{A})(\bs{j}\odot\bs{i}) = \dfrac{\partial f_{\bs{j}}}{\partial x_{\bs{i}}}(\dltensor A). \end{align*}\]
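To make this concrete, here is a small numerical sketch: a finite-difference approximation of the Jacobian tensor (numerical_jacobian is a made-up helper), checked on a linear map, whose Jacobian is the representing tensor itself:

import numpy as np

def numerical_jacobian(f, A: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # Jacobian tensor of shape m ⊙ n with J[j ⊙ i] ≈ ∂f_j/∂x_i evaluated at A
    out = f(A)
    J = np.zeros(out.shape + A.shape)
    for i in np.ndindex(A.shape):
        E = np.zeros(A.shape)
        E[i] = 1.0
        J[(...,) + i] = (f(A + eps * E) - out) / eps
    return J

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 2, 3))  # a linear map R^(2, 3) -> R^(4,)

def f(X: np.ndarray) -> np.ndarray:
    return np.tensordot(W, X, axes=2)  # f(X) = W *_2 X

A = rng.standard_normal((2, 3))
# for a linear map, the Jacobian at any point is just W itself
np.allclose(numerical_jacobian(f, A), W, atol=1e-4)
True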

Footnotes

  1. Actually this is only true in common language. For mathematicians, a vector is an element of any vector space.↩︎

  2. In fact, pointwise multiplication makes it even an \(\R\)-algebra.↩︎