# Recitation 5

## Homework Tips

- Be familiar with the “What Can I Ask On Diderot?” policy
- Talk to other students taking the course – they can help you and you can help them.
- Look for the “Common Problems in Homework x” post on Diderot before asking questions online.
- We limit the number of submissions in parts of this homework.

```
import numpy as np
import scipy.sparse as sp
from IPython.display import HTML, display
import tabulate

def pp(a):
    # Pretty-print a list of rows as an HTML table.
    display(HTML(tabulate.tabulate(a, tablefmt='html')))

def aa(a, v):
    # Check a computed answer against the expected value.
    print(f"{'CORRECT' if np.allclose(a, v) else 'INCORRECT'}\t{a}")
```

## Working With Small Numbers

In parts of this course you have had to deal with small probabilities, likelihoods, and scores. Floating-point numbers have limited precision, and when your numbers become very small, it can cause problems. For example:

```
x = np.array([2**-150, 2**-151], dtype=np.float32)
# Both values underflow to zero in float32, so this is 0/0 = nan
# (with a RuntimeWarning) instead of the correct answer, 2/3.
x[0]/(x[0] + x[1])
```

### More Precision?

You can escape this using higher-precision floating-point numbers, but that’s slower, takes more memory, and doesn’t fundamentally solve the problem.

```
x = np.array([2**-150, 2**-151], dtype=np.float64)
aa(x[0]/(x[0] + x[1]), 2/3)
```

CORRECT 0.6666666666666666

```
x = np.array([2**-1490, 2**-1491], dtype=np.float64)
aa(x[0]/(x[0] + x[1]), 2/3)
```

RuntimeWarning: invalid value encountered in double_scalars

INCORRECT nan

Numpy will tell you the precision it can give you:

```
finfo = np.finfo(np.float32)
print("Number:")
pp([(k, finfo.__dict__[k]) for k in ["bits", "eps", "precision"]])
print("Exponent:")
pp([(k, finfo.__dict__[k]) for k in ["iexp", "minexp", "maxexp"]])
```

Number:

| field     | value         |
|-----------|---------------|
| bits      | 32            |
| eps       | 1.1920929e-07 |
| precision | 6             |

Exponent:

| field  | value |
|--------|-------|
| iexp   | 8     |
| minexp | -126  |
| maxexp | 128   |
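Running the same check for `np.float64` explains the failure above: even double precision bottoms out at an exponent of $-1022$ (around $10^{-308}$), far short of $2^{-1490}$.

```
finfo64 = np.finfo(np.float64)
print(finfo64.minexp, finfo64.maxexp)  # -1022 1024
```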

This is how the number is represented:

Single-precision floating-point format, from Wikipedia, used under CC-BY-SA.

That’s why we ask you to work with the *logarithm* of the probability density; we are trading off precision in the fraction/mantissa for additional range of values that we can represent.

```
x = np.array([-1490 * np.log(2), -1491 * np.log(2)])
aa(x[0]/(x[0] + x[1]), 2/3)
```

INCORRECT 0.4998322710499832

This doesn’t work, because you *can’t directly add logarithms*. So far, in our language models and Bayes homework, we’ve only asked you to multiply numbers, which you do by adding their logarithms.
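Multiplication is exactly the case where logarithms shine. As a quick sketch, the product of many small probabilities underflows in linear space, while the sum of their logs stays perfectly representable:

```
p = np.full(1000, 1e-5)     # a thousand small probabilities
print(np.prod(p))           # 0.0 -- the true product, 1e-5000, underflows
print(np.sum(np.log(p)))    # about -11512.9, the log of the product
```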

### One Weird Trick

For Unsupervised Learning (and in the future), you’ll need to add very small numbers together when using them in a fraction. The trick to doing that is to divide the numerator and denominator by the same small number, $e^{-c}$:

$$\frac{e^{-x}}{e^{-x} + e^{-y}} = \frac{e^{c-x}}{e^{c-x} + e^{c-y}}$$

If you just use the numerator (that is, pick $c = x$):

$$\frac{e^{-x}}{e^{-x} + e^{-y}} = \frac{1}{1 + e^{x-y}}$$

This works if the exponents $x$ and $y$ are similar orders-of-magnitude, but $e^{x-y}$ can overflow when $x$ is much larger than $y$. A good trick to use is to pick $c$ as the smaller of $x$ and $y$ (that is, the number closer to 1 between $e^{-x}$, $e^{-y}$).

*Why is this heuristic good?*

Implementing this:

```
x = np.array([-1490 * np.log(2), -1491 * np.log(2)])
x -= x.max() # The trick: divide everything by the largest value, i.e. pick c = min(x, y).
x = np.exp(x) # Now we can convert it from the log-value to the value.
aa(x[0] / (x[0] + x[1]), 2/3)
```

CORRECT 0.6666666666666544

Great!
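Incidentally, this trick is common enough that SciPy ships it as `scipy.special.logsumexp`; here is a sketch of the same computation using it:

```
from scipy.special import logsumexp  # subtracts the max internally, like above

x = np.array([-1490 * np.log(2), -1491 * np.log(2)])
# log(ratio) = log-numerator minus log-denominator; exponentiate to finish.
aa(np.exp(x[0] - logsumexp(x)), 2/3)
```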

## Neural Networks

Colored Neural Network, from Wikimedia Commons, used under CC-BY-SA.

The basic layer type is a fully-connected layer, shown here. Our data has $64 \times 64 \times 3 = 12288$ elements; if we had one fully-connected hidden layer with a modest 1000 elements, we would need a total of $12288 \times 1000 = 12{,}288{,}000$ weights (plus 1000 biases) in that layer alone.
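You can sanity-check that count in PyTorch, which stores a fully-connected layer as a weight matrix plus a bias vector:

```
import torch.nn as nn

fc = nn.Linear(64*64*3, 1000)
print(sum(p.numel() for p in fc.parameters()))  # 12289000 = 12288*1000 + 1000
```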

The problem with fully-connected networks was discussed by Prof. Kolter in class (this slide).

### Hierarchical Structure

Visual processing systems in nature are organized hierarchically (Hubel & Wiesel, 1959). In particular, the building blocks are *similar* and *local*. That’s where convolutional layers come in:

Convolutional Layer from Dive Into Deep Learning, used under CC-NC-BY-SA.

We slide the same small matrix over the entire input space, performing the *convolution* at each position and storing the output in the corresponding position. This is the building block of almost every network operating on image data, and the associated operation is `nn.Conv2d`.

Given an input tensor shaped like `[example, channel, height, width]`, it slides over the `height` and `width` dimensions and convolves any number of input channels to a (potentially different) number of output channels. You should experiment with these options:

```
torch.nn.Conv2d(
    in_channels, out_channels,
    kernel_size,
    stride=1,
    padding_mode='zeros')
```

Note that the output size is different from the input size. Experiment with `kernel_size` and `padding_mode` and see how they change the shape of the output. (That’s why we print the shapes in the test case!)
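For example, here is a sketch of the kind of experiment we mean, using the $64 \times 64$ input size from above (the channel counts are arbitrary):

```
import torch
import torch.nn as nn

x = torch.randn(1, 3, 64, 64)  # [example, channel, height, width]
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=5)
print(conv(x).shape)  # torch.Size([1, 8, 60, 60]): 64 - 5 + 1 = 60
conv_same = nn.Conv2d(3, 8, kernel_size=5, padding=2)
print(conv_same(x).shape)  # torch.Size([1, 8, 64, 64]): padding preserves the size
```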

The sliding-window convolution is a popular building block, but you can perform other operations, like taking the *maximum* value in each window of each channel. These are called pooling layers.

Besides pooling, you can also include non-linear operations like `nn.ReLU`, `nn.Tanh`, etc. These are sometimes called *activation functions*.
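As a small illustration (the channel counts are arbitrary), pooling and activation layers slot in between convolutions like this:

```
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),                    # activation function
    nn.MaxPool2d(kernel_size=2),  # keeps the max in each 2x2 window
)
print(block(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 8, 32, 32])
```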

### Tips

- Look at the network linked to in the handout; the typical structure is to use a few convolutional and/or pooling layers to get the size manageable and then use fully-connected layers (see the sketch after this list).
- Convolutional layers are quite small, so you can afford to stack them. We Have To Go Deeper!
- You can use half (or more) of your parameters in the fully-connected layers; that’s normal.
- Remember your non-linear operation between linear operations like fully-connected layers. (Why?)
- A useful thing to think about is the *receptive field* of each cell in your intermediate tensor. Here’s an online calculator.
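A minimal sketch of that typical conv-then-fully-connected shape (every size below is an arbitrary placeholder, not a recommended architecture):

```
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),               # 64x64 -> 32x32
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),               # 32x32 -> 16x16
    nn.Flatten(),                  # 32 * 16 * 16 = 8192 features
    nn.Linear(32 * 16 * 16, 128),
    nn.ReLU(),                     # non-linearity between the linear layers
    nn.Linear(128, 10),            # e.g. 10 output classes
)
print(net(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 10])
```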

If you want a more detailed introduction, this open-source book is pretty good.