Dense Layers#

The dense layer is the simplest unit of a neural network, performing a single matrix multiplication.

The heart of any neural network is the dense layer. It also goes by other names – linear layer, feedforward layer – but these all refer to the same operation. A dense layer is a single matrix multiplication followed by an optional bias term.

import numpy as np

# Weight matrix mapping 64 input features to 128 output features, plus a bias.
W = np.random.randn(64, 128)
b = np.random.randn(128)

def dense(x):
    return x @ W + b

What does a dense layer accomplish?#

The general purpose of a dense layer is, of course, to handle arbitrary computation. As discussed in the neural networks section, and in the famous [bitter lesson], the most effective models are those that do not rely on hand-designed structure. In this way, a dense layer carries almost no inductive bias, save for regularizing structure such as smoothness.

The gradient of a dense layer is simple to compute. Because each output feature is a linear combination of the input features (with the combination weights given by the parameters), the derivative of the output with respect to the parameters is simply the input features themselves.
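
Concretely, here is a minimal sketch of the forward and backward pass of a dense layer (the names dense_forward and dense_backward are ours, chosen for illustration; np is the NumPy import from the first code cell):

# Forward: y = x @ W + b, with x of shape (batch, m) and W of shape (m, n).
def dense_forward(x, W, b):
    return x @ W + b

# Backward: given the upstream gradient dL_dy of shape (batch, n),
#   dL/dW = x^T @ dL/dy   (the input features themselves scale the gradient)
#   dL/db = sum of dL/dy over the batch
#   dL/dx = dL/dy @ W^T
def dense_backward(x, W, dL_dy):
    dL_dW = x.T @ dL_dy
    dL_db = dL_dy.sum(axis=0)
    dL_dx = dL_dy @ W.T
    return dL_dW, dL_db, dL_dx

Note how dL_dW is built entirely from the input x and the upstream gradient, which is exactly the observation made above.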

Which behaviors do we want to avoid?#

The main pitfall in a dense layer is magnitude explosion. If the parameters of a dense layer blow up in magnitude, then the corresponding features blow up as well, and the resulting gradients become meaningless. For this reason it is important to keep track of the parameter magnitude. Thankfully, dense layers are generally stable to train and do not require much machinery to keep regularized.

Initialization#

To prevent magnitude explosion, we must properly initialize our dense layers to a reasonable starting point. Remember that the main parameter of a dense layer is an (m, n) matrix. A common way to measure the magnitude of a vector is the root mean square (RMS) norm, which is equivalent to the standard deviation for zero-mean vectors and takes the form:

\[ \mathrm{RMS}(x) = \sqrt{\frac{1}{N} \sum_{n=1}^{N} x_n^2} \]
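
For reference, this is the same quantity computed inline in the code cells below; here it is as a small helper (the name rms is ours, not something defined elsewhere in this page):

def rms(x):
    # Root mean square: square root of the mean of the squared entries.
    return np.sqrt(np.mean(np.square(x)))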

Let’s assume our input features are well-behaved – they are independently distributed and have an RMS of 1. If we naively initialize the weight matrix from a standard normal distribution, the RMS of the output features increases by a factor of sqrt(m).

W = np.random.randn(64, 128)   # naive initialization: entries drawn from N(0, 1)
b = np.random.randn(128) * 0.1

def dense(x):
    return x @ W + b

x = np.random.randn(256, 64) # (Batch of 256, 64-dimensional vectors)
y = dense(x)
print('Average norm of x:', np.sqrt(np.mean(np.square(x))))
print('Average norm of y:', np.sqrt(np.mean(np.square(y))))
Average norm of x: 1.002288982152277
Average norm of y: 8.169048515484485

With naive initialization the norm increases, and if we stacked multiple layers, the norms would increase exponentially with depth, eventually causing training to collapse.
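
To see the compounding effect, here is a rough sketch (the depth of 4 and the square 64x64 shape are arbitrary choices for illustration) that stacks several naively initialized layers and prints the feature RMS after each one:

x = np.random.randn(256, 64)
for layer in range(4):
    W = np.random.randn(64, 64)   # naive initialization at every layer
    x = x @ W
    print('RMS after layer', layer + 1, ':', np.sqrt(np.mean(np.square(x))))

Each layer multiplies the feature RMS by roughly sqrt(64) = 8, so within a handful of layers the activations are thousands of times larger than the input.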

LeCun Initialization#

The reason for the norm increase is that each output feature y is defined as a linear combination of m input features. The variance of a sum of m independent unit-variance terms is m, so its standard deviation scales with sqrt(m).

We can solve the problem by accounting for this structure and dividing by sqrt(m) during initialization. This strategy is known as LeCun initialization.

\[ W \sim \mathcal{N}(0, 1) / \sqrt{m} \]
W = np.random.randn(64, 128) / np.sqrt(64)   # LeCun initialization: scale by 1/sqrt(fan_in)
b = np.random.randn(128) * 0.1

def dense(x):
    return x @ W + b

x = np.random.randn(256, 64) # (Batch of 256, 64-dimensional vectors)
y = dense(x)
print('Average norm of x:', np.sqrt(np.mean(np.square(x))))
print('Average norm of y:', np.sqrt(np.mean(np.square(y))))
Average norm of x: 1.000468286120868
Average norm of y: 1.0038317906064507
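
Repeating the earlier stacking sketch with LeCun-scaled matrices (again an illustrative example, not code from above), the feature RMS now stays close to 1 at every depth:

x = np.random.randn(256, 64)
for layer in range(4):
    W = np.random.randn(64, 64) / np.sqrt(64)   # LeCun initialization at every layer
    x = x @ W
    print('RMS after layer', layer + 1, ':', np.sqrt(np.mean(np.square(x))))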

Breaking Symmetry#

You may notice that we initialize our parameter matrices from random distributions. If we used a constant initialization instead, the matrix would never escape its rank-1 form – backpropagation would calculate the same gradient for every column of the parameter matrix, so the columns would stay identical forever. This is bad, so don’t do it.
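
As a quick illustration (the two-layer setup, the constant value 0.1, and the sum loss are arbitrary choices for this sketch), constant initialization makes every column of the first layer’s gradient identical, so the matrix can never leave its rank-1 form:

x = np.random.randn(256, 64)
W1 = np.full((64, 128), 0.1)   # constant initialization
W2 = np.full((128, 10), 0.1)

# Forward pass: every column of h is identical.
h = x @ W1
y = h @ W2

# Manual backward pass for loss = y.sum().
dL_dy = np.ones_like(y)
dL_dh = dL_dy @ W2.T           # constant everywhere
dL_dW1 = x.T @ dL_dh           # every column identical -> rank 1

print(np.allclose(dL_dW1, dL_dW1[:, :1]))   # True: identical columns
print(np.linalg.matrix_rank(dL_dW1))        # 1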

Bias Initialization#

While a number of theories and tricks exist for matrix initialization, the bias parameters are generally initialized from a simple scaled normal, or even to a vector of zeros. In contrast to the matrix parameters, biases don’t suffer from any symmetry issues (a vector cannot lose rank the way a matrix can). And while matrix initialization needs careful consideration, since any deviation compounds exponentially across layers, biases only interact additively with the features, so they are far less sensitive.
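
Either choice is a one-liner (the 0.01 scale below is an arbitrary example value, not a recommendation from this page):

b = np.zeros(128)                  # zero initialization
b = np.random.randn(128) * 0.01    # or a small scaled normal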
