CSCI 497P/597P - Homework 5

Fall 2020

You may collaborate freely on this homework; the one rule is that you have to write up your own solutions.

Complete the following problems and submit your solutions to the HW5 assignment on Canvas. For all questions, justify your answers either by showing your work or by giving a brief explanation. Please typeset your solutions using LaTeX or similar; you may include neatly hand-drawn figures so long as the scan quality is good.

  1. What property must an activation function have?

  2. For each of the following, give a description, in one sentence or less, of the problem it solves or the way in which it improves neural network training. You do not need to explain what it is or how it works, just why you’d want to use it.

    1. batch normalization
    2. data augmentation
    3. momentum
    4. dropout
  3. Why can’t you call backward() on a tensor that isn’t a scalar? (A minimal sketch demonstrating this behavior appears at the end of this handout.)

  4. Give two reasons why minibatch SGD is used instead of standard (full-batch) gradient descent to train CNNs.

  5. In this problem, we’ll look into the size of CNNs, both in terms of the number of parameters (weights) learned and the size of the feature maps.

    1. How many parameters are learned in a linear (fully connected) layer that takes a 64-by-64 color image as input and produces a 1000-dimensional output vector? Assume biases are used.

    2. How many parameters are there in a 5x5 convolution layer that operates on a 64-by-64 color input image and has 16 output channels? Assume biases are not used.

    3. Can the transformation performed by the convolution layer from part (2) be performed by the linear layer from part (1), if the weights of the linear layer are set appropriately?

    4. Consider the following architecture, which closely resembles AlexNet. For brevity, we’ve excluded the ReLU operations, because they don’t change the dimensions or require any parameters, and they can be applied to the feature map in place.

      nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=5), # conv1
      nn.MaxPool2d(kernel_size=2, stride=2),
      nn.Conv2d(64, 192, kernel_size=5, padding=2), # conv2
      nn.MaxPool2d(kernel_size=2, stride=2),
      nn.Conv2d(192, 384, kernel_size=3, padding=1), # conv3
      nn.Conv2d(384, 256, kernel_size=3, padding=1), # conv4
      nn.Conv2d(256, 256, kernel_size=3, padding=1), # conv5 
      nn.MaxPool2d(kernel_size=2, stride=2),
      nn.Linear(256 * 7 * 7, 4096), # fc1
      nn.Linear(4096, 4096), # fc2
      nn.Linear(4096, 1000) # fc3

      How many parameters are learned in each layer? As above, assume conv layers have no biases and linear layers do. (A short PyTorch sketch for sanity-checking this part and the next two appears at the end of this handout.)

    5. Assume the network’s input dimensions are 224x224. Calculate the size of the feature map produced by each layer of the network. Note that Conv2d layers do not pad beyond the padding you specify: the output spatial dimension is floor((n + 2p - k)/s) + 1, where n is the input size, k the kernel size, p the padding, and s the stride.

    6. To train the network on a GPU, we need to store arrays containing each feature map and each set of parameters in GPU RAM. We also need to store the gradient of the loss with respect to each set of parameters. Assuming feature maps and parameters are all of type float32 and thus require 4 bytes per element to store, how much GPU RAM would be necessary to train this network? Note that we’re ignoring any overhead for each array, the space required to store the input image, and a variety of other details.

    7. In practice, we train by pushing batches of images through the network at once. If my GPU has 8 GB of RAM, what’s the largest batch size I could use?
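
Sketch for question 3 (optional). The following minimal PyTorch snippet, which is not part of the assignment, demonstrates the behavior the question asks about: backward() succeeds on a scalar result but raises a RuntimeError on a vector-valued one. It uses only standard torch calls.

    import torch

    # Scalar-valued result: backward() needs no arguments.
    x = torch.ones(3, requires_grad=True)
    s = (x ** 2).sum()
    s.backward()
    print(x.grad)  # tensor([2., 2., 2.])

    # Vector-valued result: backward() with no arguments raises a RuntimeError
    # (PyTorch says it can only implicitly create the gradient for scalar outputs).
    y = torch.ones(3, requires_grad=True)
    v = y ** 2
    try:
        v.backward()
    except RuntimeError as err:
        print(err)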
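
Sketch for problem 5, parts 4 through 6 (optional). The snippet below is one way to sanity-check hand calculations of parameter counts, feature-map sizes, and the rough memory estimate; it is a sketch under the problem's assumptions, not a reference solution. It wraps the listed layers in nn.Sequential, passes bias=False to the conv layers to match the stated assumption, and inserts an nn.Flatten() (which has no parameters) before fc1 only so that a dummy forward pass runs. The memory figure follows the problem's simplified accounting and is not what PyTorch would actually allocate.

    import torch
    from torch import nn

    # The architecture from problem 5; bias=False matches the "no conv biases"
    # assumption, and Flatten is added (parameter-free) so a forward pass works.
    model = nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=5, bias=False),  # conv1
        nn.MaxPool2d(kernel_size=2, stride=2),
        nn.Conv2d(64, 192, kernel_size=5, padding=2, bias=False),           # conv2
        nn.MaxPool2d(kernel_size=2, stride=2),
        nn.Conv2d(192, 384, kernel_size=3, padding=1, bias=False),          # conv3
        nn.Conv2d(384, 256, kernel_size=3, padding=1, bias=False),          # conv4
        nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False),          # conv5
        nn.MaxPool2d(kernel_size=2, stride=2),
        nn.Flatten(),
        nn.Linear(256 * 7 * 7, 4096),  # fc1
        nn.Linear(4096, 4096),         # fc2
        nn.Linear(4096, 1000),         # fc3
    )

    x = torch.zeros(1, 3, 224, 224)  # one dummy 224x224 color image
    total_params = 0
    total_activations = 0
    for layer in model:
        x = layer(x)
        n_params = sum(p.numel() for p in layer.parameters())
        total_params += n_params
        total_activations += x.numel()
        print(f"{layer.__class__.__name__:<9} params={n_params:>10} "
              f"output shape={tuple(x.shape)}")

    # Simplified accounting from part 6: parameters, their gradients, and one
    # copy of each feature map, at 4 bytes per float32 element.
    total_bytes = 4 * (2 * total_params + total_activations)
    print(f"approx. {total_bytes / 2**20:.1f} MiB")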