# Recitation 5

## Homework Tips

• Be familiar with the “What Can I Ask On Diderot?” policy.
• Talk to other students taking the course – they can help you and you can help them.
• Look for the “Common Problems in Homework x” post on Diderot before asking questions online.
• We limit the number of submissions in parts of this homework.
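
import numpy as np
import scipy.sparse as sp
from IPython.display import HTML, display
import tabulate

def pp(a):
    # Pretty-print a table (a list of rows) as HTML in the notebook.
    display(HTML(tabulate.tabulate(a, tablefmt='html')))

def aa(a, v):
    # Report whether the computed value a is close to the expected value v.
    print(f"{'CORRECT' if np.allclose(a, v) else 'INCORRECT'}\t{a}")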


## Working With Small Numbers

In parts of this course you have had to deal with small probabilities, likelihoods, and scores. Floating-point numbers have limited precision and range, so very small numbers can cause problems. For example:

x = np.array([2**-150, 2**-151], dtype=np.float32)
x[0]/(x[0] + x[1])
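
To see what goes wrong in the example above, it helps to look at the stored values. The sketch below assumes only NumPy; `ratio` is a name introduced here for illustration. Both inputs lie below the smallest positive number float32 can represent, so they are stored as zero and the fraction becomes 0/0.

```python
import numpy as np

# Both values are below float32's smallest positive (subnormal) value, 2**-149.
x = np.array([2**-150, 2**-151], dtype=np.float32)
print(x)  # both entries are stored as 0.0

# 0 / (0 + 0) is undefined, so NumPy returns nan (and normally warns).
with np.errstate(invalid="ignore"):
    ratio = x[0] / (x[0] + x[1])
print(ratio)  # nan
```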


### More Precision?

You can escape this using higher-precision floating-point numbers, but that’s slower, takes more memory, and doesn’t fundamentally solve the problem.

x = np.array([2**-150, 2**-151], dtype=np.float64)
aa(x[0]/(x[0] + x[1]), 2/3)

CORRECT	0.6666666666666666


x = np.array([2**-1490, 2**-1491], dtype=np.float64)
aa(x[0]/(x[0] + x[1]), 2/3)

/home/gauravmm/.pyenv/versions/3.6.7/envs/homework1/lib/python3.6/site-packages/ipykernel_launcher.py:2: RuntimeWarning: invalid value encountered in double_scalars



NumPy will tell you the precision it can give you:

finfo = np.finfo(np.float32)
print("Number:")
# getattr reads each field without relying on how finfo stores it internally.
pp([(k, getattr(finfo, k)) for k in ["bits", "eps", "precision"]])
print("Exponent:")
pp([(k, getattr(finfo, k)) for k in ["iexp", "minexp", "maxexp"]])

| Field | Value |
|---|---|
| iexp | 8 |
| minexp | -126 |
| maxexp | 128 |

This is how the number is represented:

Single-precision floating-point format, from Wikipedia, used under CC-BY-SA.

That’s why we ask you to work with the logarithm of the probability density; we are trading off precision in the fraction/mantissa for additional range of values that we can represent.

x = np.array([-1490 * np.log(2), -1491 * np.log(2)])
aa(x[0]/(x[0] + x[1]), 2/3)

INCORRECT	0.4998322710499832



This doesn’t work, because you can’t directly add logarithms: adding two logarithms corresponds to multiplying the underlying values, not adding them. So far, in our language models and Bayes homework, we’ve only asked you to multiply numbers, which you do by adding their logarithms.
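
As a small illustration of multiplying by adding logarithms, the sketch below compares a direct product with a sum of logs. The 1000 probabilities of 0.01 are made-up values chosen only so that the product underflows; `probs`, `direct`, and `log_total` are names introduced here.

```python
import numpy as np

# 1000 independent events, each with probability 0.01 (made-up illustration).
probs = np.full(1000, 0.01)

# The true product is 1e-2000, far below what float64 can represent.
direct = np.prod(probs)
print(direct)  # underflows to 0.0

# Summing logarithms keeps the same information in a representable range.
log_total = np.sum(np.log(probs))
print(log_total)  # about -4605.17
```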

### One Weird Trick

For Unsupervised Learning (and in the future), you’ll need to add very small numbers together when using them in a fraction. The trick is to divide the numerator and the denominator by the same small number, which in log space means subtracting the same constant from every exponent.

If you divide by the numerator: $\frac{e^{-x}}{e^{-x} + e^{-y}} = \frac{e^{-x}}{e^{-x} + e^{-y}} \cdot \frac{e^{x}}{e^{x}} = \frac{e^{0}}{e^{0} + e^{x-y}} = \frac{1}{1 + e^{x-y}}$

This works when the exponents $x$ and $y$ are of similar magnitude. A good heuristic is to divide by the term with the smaller of $x$ and $y$ (that is, whichever of $e^{-x}$ and $e^{-y}$ is closer to 1).

Why is this heuristic good?

Implementing this:

x = np.array([-1490 * np.log(2), -1491 * np.log(2)])

x -= x.max() # This is the trick
x = np.exp(x) # Now we can convert it from the log-value to the value.

aa(x[0] / (x[0] + x[1]), 2/3)

CORRECT	0.6666666666666544



Great!
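
The same idea extends to any number of terms. The following is a minimal sketch, not the course's reference solution; the function name `normalize_log_weights` and the test values are assumptions made for illustration.

```python
import numpy as np

def normalize_log_weights(log_w):
    """Return exp(log_w) / sum(exp(log_w)) without underflow or overflow."""
    log_w = np.asarray(log_w, dtype=np.float64)
    shifted = log_w - log_w.max()   # largest entry becomes 0, so exp() gives 1
    w = np.exp(shifted)             # every entry is now in [0, 1]
    return w / w.sum()              # the sum is at least 1, so no 0/0

small = normalize_log_weights([-1490 * np.log(2), -1491 * np.log(2)])
print(small)   # approximately [2/3, 1/3]

spread = normalize_log_weights([0.0, -2000.0])
print(spread)  # [1. 0.]: the tiny term underflows harmlessly
```

Subtracting the maximum guarantees that at least one term equals 1 after exponentiation, so the denominator can never be zero. Subtracting the minimum instead would turn the second example into `exp(2000)`, which overflows.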

## Neural Networks

Colored Neural Network, from Wikimedia Commons, used under CC-BY-SA.

The basic layer type is a fully-connected layer, shown here. Our data has $64 \times 64 \times 3 = 12288$ elements. If we had one fully-connected hidden layer with a modest 1000 elements, we would need $12288 \times 1000 \approx 12.3$ million weights for that layer alone.
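
The count for that single layer can be worked out directly. This is a plain-arithmetic sketch; it assumes one bias per hidden element, which the text does not specify, and the variable names are introduced here.

```python
# Parameter count for one fully-connected layer on a 64x64 RGB image.
n_inputs = 64 * 64 * 3      # 12288 input elements
n_hidden = 1000             # hidden elements, as in the text

n_weights = n_inputs * n_hidden   # one weight per (input, hidden) pair
n_biases = n_hidden               # one bias per hidden element (assumed here)
n_total = n_weights + n_biases
print(n_total)  # 12289000
```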

The problem with fully-connected networks was discussed by Prof. Kolter in class (this slide).

### Hierarchical Structure

Visual processing systems in nature are organized hierarchically (Hubel & Wiesel, 1959). In particular, the building blocks are similar and local. That’s where convolutional layers come in:

Convolutional Layer from Dive Into Deep Learning, used under CC-NC-BY-SA.

We slide the same small matrix over the entire input space, performing the convolution at each position and storing the output in the corresponding position. This is the building block of almost every network operating on image data, and the associated operation is nn.Conv2d.
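
The sliding operation can be sketched in plain NumPy. This is an illustration for a single channel with no padding and a stride of 1, not how `nn.Conv2d` is implemented; `slide_kernel` is a hypothetical helper. Like `nn.Conv2d`, it does not flip the kernel (strictly speaking, it is a cross-correlation).

```python
import numpy as np

def slide_kernel(image, kernel):
    """Apply `kernel` at every position where it fits entirely inside `image`."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(window * kernel)  # same weights at every position
    return out

image = np.arange(9.0).reshape(3, 3)          # 3x3 input holding 0..8
kernel = np.array([[0.0, 1.0], [2.0, 3.0]])   # 2x2 kernel
result = slide_kernel(image, kernel)
print(result)  # [[19. 25.] [37. 43.]]
```

Note that only the four kernel entries are learned, no matter how large the image is.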

Given an input tensor shaped like [example, channel, height, width], it slides over the height and width dimensions and maps any number of input channels to a (potentially different) number of output channels. You should experiment with these options:

torch.nn.Conv2d(
    in_channels, out_channels,
    kernel_size,
    stride=1,
    padding=0,
    dilation=1,
    padding_mode='zeros',
)


Note that the output size can differ from the input size. Experiment with kernel_size, stride, and padding and see how they change the shape of the output; padding_mode only controls which values fill the padded border. (That’s why we print the shapes in the test case!)
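
The output height and width follow a formula given in the PyTorch documentation for `nn.Conv2d`. The helper below is a sketch of that formula for one spatial dimension; `conv_output_size` is a name introduced here, not a PyTorch function.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0, dilation=1):
    """Output length along one spatial dimension of a 2-D convolution."""
    effective_kernel = dilation * (kernel_size - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1

print(conv_output_size(64, kernel_size=3))             # 62
print(conv_output_size(64, kernel_size=3, padding=1))  # 64
print(conv_output_size(64, kernel_size=5, stride=2))   # 30
```

With a 3x3 kernel, a padding of 1 keeps the spatial size unchanged, which is a common choice when stacking layers.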

The sliding-window convolution is a popular building block, but you can perform other operations over each window, like taking the maximum value within each channel. These are called pooling layers.
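
As an illustration, 2x2 max pooling with a stride of 2 can be written in a few lines of NumPy. This sketch assumes a single example shaped [channel, height, width] with even height and width; `max_pool_2x2` is a hypothetical helper, not a library function.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a [channel, height, width] array."""
    c, h, w = x.shape
    assert h % 2 == 0 and w % 2 == 0, "sketch assumes even height and width"
    # Split each spatial axis into (blocks, 2) and take the max inside each block.
    blocks = x.reshape(c, h // 2, 2, w // 2, 2)
    return blocks.max(axis=(2, 4))

x = np.arange(16.0).reshape(1, 4, 4)   # one channel holding 0..15
pooled = max_pool_2x2(x)
print(pooled.shape)  # (1, 2, 2)
print(pooled[0])     # [[ 5.  7.] [13. 15.]]
```

Pooling has no learned parameters; it only shrinks the height and width, here by a factor of two.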

Besides pooling, you can also include non-linear operations like nn.ReLU, nn.Tanh, etc. These are sometimes called activation functions.
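
One reason these non-linear operations matter is that linear layers stacked without them collapse into a single linear layer. The sketch below checks this numerically with random matrices; biases are omitted for brevity, and all names and sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)
W1 = rng.normal(size=(4, 5))   # first "layer": 5 inputs -> 4 outputs
W2 = rng.normal(size=(3, 4))   # second "layer": 4 inputs -> 3 outputs

# Two linear layers back to back equal one layer with weights W2 @ W1.
stacked = W2 @ (W1 @ x)
collapsed = (W2 @ W1) @ x
print(np.allclose(stacked, collapsed))  # True

# With a ReLU in between, the result generally differs from that single layer.
def relu(z):
    return np.maximum(z, 0.0)

with_relu = W2 @ relu(W1 @ x)
print(np.allclose(with_relu, collapsed))
```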

### Tips

• Look at the network linked to in the handout; the typical structure is to use a few convolutional and/or pooling layers to get the size manageable and then use fully connected layers.
• Convolutional layers have few parameters, so you can afford to stack them. We Have To Go Deeper!
• You can use half (or more) of your parameters in the fully-connected layers; that’s normal.
• Remember to put a non-linear operation between linear operations like fully-connected layers. (Why?)
• A useful thing to think about is the receptive field of each cell in your intermediate tensor. Here’s an online calculator.
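
The receptive field mentioned in the last tip can also be computed by hand. The sketch below uses the standard recurrence for a stack of convolution or pooling layers along one axis and ignores dilation; `receptive_field` and the example layer stacks are assumptions made for illustration.

```python
def receptive_field(layers):
    """Receptive field (along one axis) after a stack of conv/pool layers.

    `layers` is a list of (kernel_size, stride) pairs, first layer first.
    """
    rf, jump = 1, 1   # an input pixel sees itself; neighbours are 1 pixel apart
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump   # each layer widens the view by (k - 1) steps
        jump *= stride                   # strides spread later steps further apart
    return rf

print(receptive_field([(3, 1), (3, 1)]))          # 5
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8
```

Two stacked 3x3 convolutions see a 5x5 patch of the input, and adding a stride-2 pooling layer between them makes the view grow faster.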

If you want a more detailed introduction, this open-source book is pretty good.