The input to a network is a 3-channel RGB image. The first layer of the network is a convolution layer that learns 8 filters, each of which is 3x3; the filters also include a bias parameter. How many parameters (weights) need to be learned for this layer?
The layer described in #1 is applied to a 227x227 color image with “valid” output size. What are the dimensions of the layer’s output on this input?
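If you want to check your answers to #1 and #2 empirically, one option (assuming PyTorch is available; this is only a verification aid, not part of the problem) is to build the layer and inspect it:

import torch
import torch.nn as nn

# the layer from #1: 3 input channels, 8 filters, each 3x3, with biases
layer = nn.Conv2d(3, 8, 3)

# total learnable parameters (weights + biases)
print(sum(p.numel() for p in layer.parameters()))

# output shape for a single 227x227 color image with "valid" (i.e. no) padding
x = torch.zeros(1, 3, 227, 227)
print(layer(x).shape)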
In traditional 2D image convolution, what transformations could be accomplished by a convolution with a 1x1 kernel?
What transformations can be accomplished by a CNN convolution layer with a 1x1 kernel?
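As a starting point for #4 (a minimal sketch; the channel counts here are arbitrary), you can watch what a 1x1 conv layer does to the shape of its input:

import torch
import torch.nn as nn

# a conv layer with kernel_size=1
conv1x1 = nn.Conv2d(256, 64, 1)
x = torch.zeros(1, 256, 56, 56)
print(conv1x1(x).shape)  # spatial size is preserved; the channel count changes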
We saw in the notebook that typical convolutional architectures progressively reduce the spatial dimensions (using strides or pooling) and increase the channel dimensions (by learning more filters per layer) of the feature map as it passes through the network. This problem explores: what if we didn’t reduce the spatial dimensions? Suppose we have a network, loosely modeled after AlexNet, with 5 convolution layers and 3 fully-connected layers:
# recall: nn.Conv2d(in_channels, out_channels, kernel_size)
conv1 = nn.Conv2d(3, 96, 3)
conv2 = nn.Conv2d(96, 256, 3)
conv3 = nn.Conv2d(256, 384, 3)
conv4 = nn.Conv2d(384, 384, 3)
conv5 = nn.Conv2d(384, 256, 3)
dense1 = nn.Linear(??, 4096)
dense2 = nn.Linear(4096, 4096)
dense3 = nn.Linear(4096, 1000)  # 1000 output classes
Assume the model’s inputs are 227x227 color images. Using “same” output size for simplicity, calculate the input size for the first linear layer (marked ?? above).
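One way to verify your value of ?? (a sketch only; it assumes a recent PyTorch version where nn.Conv2d accepts padding='same') is to push a dummy input through the five conv layers and flatten the result:

import torch
import torch.nn as nn

# the conv stack from #5, with "same" padding so the spatial size stays 227x227
convs = nn.Sequential(
    nn.Conv2d(3, 96, 3, padding='same'),
    nn.Conv2d(96, 256, 3, padding='same'),
    nn.Conv2d(256, 384, 3, padding='same'),
    nn.Conv2d(384, 384, 3, padding='same'),
    nn.Conv2d(384, 256, 3, padding='same'),
)

x = torch.zeros(1, 3, 227, 227)             # one 227x227 color image
features = convs(x)
print(features.shape)                       # feature map shape after conv5
print(features.flatten(start_dim=1).shape)  # flattened size = input size for dense1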
For the architecture in #5, give an order-of-magnitude estimate of the number of parameters in the model.
Which layer(s) dominate the parameter count?
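To check your estimate for #6, it is easier to count parameters arithmetically than to instantiate the model, since at this input size dense1 alone would need on the order of hundreds of gigabytes of memory in float32. A sketch (the layer names simply mirror the code in #5):

# conv layer: kernel_h * kernel_w * in_channels * out_channels (+ out_channels biases)
# linear layer: in_features * out_features (+ out_features biases)
flat = 256 * 227 * 227  # plug in your answer for ?? from #5
counts = {
    'conv1': 3 * 3 * 3 * 96 + 96,
    'conv2': 3 * 3 * 96 * 256 + 256,
    'conv3': 3 * 3 * 256 * 384 + 384,
    'conv4': 3 * 3 * 384 * 384 + 384,
    'conv5': 3 * 3 * 384 * 256 + 256,
    'dense1': flat * 4096 + 4096,
    'dense2': 4096 * 4096 + 4096,
    'dense3': 4096 * 1000 + 1000,
}
for name, n in counts.items():
    print(f'{name}: {n:,}')
print(f"total: {sum(counts.values()):,}")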