
Recitation 5

Homework Tips

import numpy as np
import scipy.sparse as sp
from IPython.display import HTML, display
import tabulate

def pp(a):
    """Pretty-print a list of (key, value) rows as an HTML table."""
    display(HTML(tabulate.tabulate(a, tablefmt='html')))


def aa(a, v):
    """Check an answer: print CORRECT if a is approximately equal to the expected value v."""
    print(f"{'CORRECT' if np.allclose(a, v) else 'INCORRECT'}\t{a}")

Working With Small Numbers

In parts of this course you have had to deal with small probabilities, likelihoods, and scores. Floating-point numbers have limited precision and range, and when your numbers become very small they can underflow to zero, which causes problems. For example:

x = np.array([2**-150, 2**-151], dtype=np.float32)
# Both values underflow to 0 in float32 (the smallest positive float32 is about 2**-149),
# so this computes 0/0 and returns nan.
x[0]/(x[0] + x[1])

More Precision?

You can escape this using higher-precision floating-point numbers, but that’s slower, takes more memory, and doesn’t fundamentally solve the problem.

x = np.array([2**-150, 2**-151], dtype=np.float64)
aa(x[0]/(x[0] + x[1]), 2/3)
CORRECT	0.6666666666666666

x = np.array([2**-1490, 2**-1491], dtype=np.float64)
aa(x[0]/(x[0] + x[1]), 2/3)
/home/gauravmm/.pyenv/versions/3.6.7/envs/homework1/lib/python3.6/site-packages/ipykernel_launcher.py:2: RuntimeWarning: invalid value encountered in double_scalars
  

Numpy will tell you the precision it can give you:

finfo = np.finfo(np.float32)
print("Number:")
pp([(k, finfo.__dict__[k]) for k in ["bits", "eps", "precision"]])
print("Exponent:")
pp([(k, finfo.__dict__[k]) for k in ["iexp", "minexp", "maxexp"]])
iexp      8
minexp   -126
maxexp    128

This is how the number is represented:

Single-precision floating-point format, from Wikipedia, used under CC-BY-SA.

That’s why we ask you to work with the logarithm of the probability density; we are trading off precision in the fraction/mantissa for additional range of values that we can represent.

x = np.array([-1490 * np.log(2), -1491 * np.log(2)])
aa(x[0]/(x[0] + x[1]), 2/3)
INCORRECT	0.4998322710499832

This doesn’t work, because the logarithm of a sum is not the sum of the logarithms; adding log-values corresponds to multiplying the underlying numbers. So far, in our language models and Bayes homework, we’ve only asked you to multiply numbers, which you can do by adding their logarithms.
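
For example, here is a small sketch (the exponents are illustrative) of multiplying two tiny probabilities by adding their logs:

log_a = -1490 * np.log(2)  # log(2**-1490); 2**-1490 itself underflows to 0.0 in float64
log_b = -1491 * np.log(2)
log_ab = log_a + log_b     # log of the product -- an ordinary, representable float
print(log_ab)              # about -2066.3; np.exp(log_ab) would underflow back to 0.0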

One Weird Trick

For Unsupervised Learning (and in the future), you’ll need to add very small numbers together when they appear in a fraction. The trick is to divide the numerator and the denominator by the same small number, which in log space means subtracting a constant from every log-value.

If you simply divide through by the numerator $e^{-x}$: \(\frac{e^{-x}}{e^{-x} + e^{-y}} = \frac{e^{-x}}{e^{-x} + e^{-y}} \cdot \frac{e^{x}}{e^{x}} = \frac{e^{0}}{e^{0} + e^{x-y}} = \frac{1}{1 + e^{x-y}}\)

This works as long as the exponents $x$ and $y$ have similar orders of magnitude. A more robust rule is to divide by the term whose exponent is smaller, that is, whichever of $e^{-x}$ and $e^{-y}$ is larger (closer to 1).

Why is this heuristic good?

Implementing this:

x = np.array([-1490 * np.log(2), -1491 * np.log(2)])

x -= x.max() # This is the trick: divide every term by the largest one by subtracting its log.
x = np.exp(x) # Now we can convert it from the log-value to the value.

aa(x[0] / (x[0] + x[1]), 2/3)
CORRECT	0.6666666666666544

Great!
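
For reference (not something the homework requires), scipy.special.logsumexp implements this same subtract-the-max trick, so you can compute the ratio entirely from the log-values:

from scipy.special import logsumexp

x = np.array([-1490 * np.log(2), -1491 * np.log(2)])
ratio = np.exp(x[0] - logsumexp(x))  # log-numerator minus log-denominator, then exponentiate
aa(ratio, 2/3)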

Neural Networks

Colored Neural Network, from Wikimedia Commons, used under CC-BY-SA.

The basic layer type is a fully-connected layer, shown here. Our data has $64*64*3 = 12288$ elements; if we had one fully-connected hidden layer with a modest 1000 units, we would need a total of:

\[(12,288 * 1,000 + 1,000) + (1,000 * 1 + 1) = 12,290,001~\text{parameters}\]
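
As a sanity check, here is a minimal sketch (assuming PyTorch; the layer sizes are just the ones in the count above, not a homework architecture) that builds such a network and counts its parameters:

import torch.nn as nn

mlp = nn.Sequential(
    nn.Flatten(),                  # [example, 3, 64, 64] -> [example, 12288]
    nn.Linear(64 * 64 * 3, 1000),  # 12,288 * 1,000 weights + 1,000 biases
    nn.ReLU(),
    nn.Linear(1000, 1),            # 1,000 * 1 weights + 1 bias
)
print(sum(p.numel() for p in mlp.parameters()))  # 12290001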

The problem with fully-connected networks was discussed by Prof. Kolter in class (this slide).

Hierarchical Structure

Visual processing systems in nature are organized hierarchically (Hubel & Wiesel, 1959). In particular, the building blocks are similar and local. That’s where convolutional layers come in:

Convolutional Layer from Dive Into Deep Learning, used under CC-NC-BY-SA.

We slide the same small matrix over the entire input space, performing the convolution at each position and storing the output in the corresponding position. This is the building block of almost every network operating on image data, and the associated operation is nn.Conv2d.

Given an input tensor shaped like [example, channel, height, width], it slides over the height and width dimensions and convolves any number of input channels to a (potentially different) number of output channels. You should experiment with these options:

torch.nn.Conv2d(
    in_channels, out_channels,  # number of input and output channels
    kernel_size,                # size of the sliding window, e.g. 3 or (3, 3)
    stride=1,                   # how far the window moves at each step
    padding_mode='zeros')       # how values beyond the border are filled when padding is used

Note that the output size is different from the input size. Experiment with kernel_size and padding_mode and see how they change the shape of the output. (That’s why we print the shapes in the test case!)
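
For example, a small sketch (the input size and channel counts are arbitrary) of how kernel_size and padding change the output shape:

import torch

x = torch.zeros(1, 3, 64, 64)  # [example, channel, height, width]

# With stride=1, padding=(kernel_size - 1)//2 keeps the height and width unchanged.
for kernel_size, padding in [(3, 0), (3, 1), (5, 0), (5, 2)]:
    conv = torch.nn.Conv2d(in_channels=3, out_channels=16,
                           kernel_size=kernel_size, padding=padding)
    print(kernel_size, padding, tuple(conv(x).shape))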

The sliding-window convolution is a popular building block, but you can perform other operations in the same sliding-window fashion, like taking the maximum value over each window in every channel. These are called pooling layers.
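
For example, a brief sketch of nn.MaxPool2d (the tensor shape is arbitrary):

import torch

x = torch.zeros(1, 16, 64, 64)
pool = torch.nn.MaxPool2d(kernel_size=2)  # max over each 2x2 window, separately per channel
print(pool(x).shape)                      # torch.Size([1, 16, 32, 32]): height and width halved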

Besides pooling, you can also include non-linear operations like nn.ReLU, nn.Tanh, etc. These are called activation functions.
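
Putting these pieces together, a tiny convolutional stack might look like the following sketch (the sizes are illustrative, not a homework architecture):

import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 64x64 stays 64x64 because of the padding
    nn.ReLU(),                                   # activation function
    nn.MaxPool2d(2),                             # 64x64 -> 32x32
    nn.Flatten(),
    nn.Linear(16 * 32 * 32, 1),                  # fully-connected output layer
)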

Tips

If you want a more detailed introduction, this open-source book is pretty good.