1. Problems
We take a binary classification problem (labels \(\pm 1\)) for data with \(d\)-dimensional features as an example. A neural network with one hidden layer can be written as
\[ o = w_2^T \sigma(W_1 v + b_1) + b_2 \tag{1}\]
where \(v\) is the \(d\)-dimensional input feature vector of the datum, and \(W_1\), \(w_2\), \(b_1\), \(b_2\) are the parameters of the model: \(W_1\) is an \(n \times d\) matrix, \(w_2\) is an \(n\)-dimensional vector, \(b_1\) is an \(n\)-dimensional bias vector, and \(b_2\) is a scalar bias. When \(o > 0\) we classify the datum as label \(+1\), and when \(o \leq 0\) we classify it as label \(-1\). This is a neural network, or multi-layer perceptron, with one hidden layer containing \(n\) neurons. In this problem, we focus on the pre-training setting with frozen first-layer parameters: \(W_1\) and \(b_1\) must be fixed after seeing all inputs \(v\) but before seeing any labels \((l_1, \dots, l_i)\), while \(w_2\) and \(b_2\) may be chosen after seeing all the labels \((l_1, \dots, l_i)\).
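For concreteness, here is a minimal NumPy sketch of the forward pass in Eq. (1) and the sign-thresholding rule; the names (`forward`, `classify`, `W1`, `w2`, etc.) are illustrative, not part of the problem statement.

```python
import numpy as np

def forward(v, W1, b1, w2, b2, sigma):
    """One-hidden-layer network: o = w2^T sigma(W1 v + b1) + b2."""
    h = sigma(W1 @ v + b1)      # hidden activations, shape (n,)
    return w2 @ h + b2          # scalar pre-threshold output o

def classify(o):
    """Label +1 if o > 0, else -1."""
    return 1 if o > 0 else -1

# Example with n = 3 hidden units and d = 2 input features
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
w2, b2 = rng.normal(size=3), 0.5
v = np.array([1.0, -2.0])
relu = lambda x: np.maximum(0.0, x)   # ReLU activation (Problem 2)
print(classify(forward(v, W1, b1, w2, b2, relu)))
```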
1.1 Problem 1
Given \(n, d\), calculate the VC dimension of the neural network for the linear activation case, i.e. \(\sigma(x) = x\). Prove your result.
1.2 Problem 2
Given \(n, d\), calculate the VC dimension of the neural network for the ReLU activation case, i.e. \(\sigma(x) = \max(0, x)\). Prove your result.
1.3 Hints
- Recall the definition of VC dimension (restated below for reference).
- Consider \(n > d\) and \(n \leq d\).
- For Problem 2, start from \(d = 1\).
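For reference, the standard definition: a hypothesis set \(\mathcal H\) shatters a set of points \(\{x_1, \dots, x_m\}\) if for every labeling \(y \in \{-1, +1\}^m\) there exists some \(h \in \mathcal H\) with \(h(x_i) = y_i\) for all \(i\). The VC dimension is the size of the largest set that \(\mathcal H\) can shatter:
\[ VC(\mathcal H) = \max\{\, m : \exists\, \{x_1, \dots, x_m\} \text{ shattered by } \mathcal H \,\} \]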
2. Solutions
2.1 Problem 1
Let \(\mathcal{H}\) be the hypothesis set realized by the network above with linear activation.
2.1.1 Result
\[VC(\mathcal{H}) = d + 1\]
2.1.2 Proof
\[\sigma(x) = x \tag{1.1}\]
\[\begin{align} o &= w_2^T \sigma(W_1v + b_1) + b_2\\ &= w_2^T(W_1v + b_1) + b_2\\ &= w_2^TW_1v + (w_2^Tb_1 + b_2) \tag{1.2} \end{align}\]
The network therefore collapses to a linear classifier, which we write as \[ o = W_3X \]
\[ W_3 = \left[ \begin{matrix} w_3 & b_3 \end{matrix} \right] \]
\[ X = \left[ \begin{matrix} v \\ 1 \\ \end{matrix} \right] \]
\[ w_3 = w^T_2W_1 \]
\[ b_3 = w^T_2b_1 + b_2 \]
where \(w_3 = w_2^T W_1\) is a \(1 \times d\) row vector, \(b_3\) is a scalar bias, and thus \(W_3\) is a \(1 \times (d + 1)\) matrix.
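As a quick numerical sanity check (a sketch, not part of the proof), one can verify that the collapsed linear form agrees with the original network when \(\sigma(x) = x\):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 3
W1, b1 = rng.normal(size=(n, d)), rng.normal(size=n)
w2, b2 = rng.normal(size=n), rng.normal()
v = rng.normal(size=d)

# Original network with identity activation
o_net = w2 @ (W1 @ v + b1) + b2

# Collapsed linear classifier: w3 = w2^T W1, b3 = w2^T b1 + b2
w3 = w2 @ W1
b3 = w2 @ b1 + b2
o_lin = w3 @ v + b3

assert np.isclose(o_net, o_lin)   # identical outputs
```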
First, we need to prove that \[VC(\mathcal{H}) \geq d + 1\]
We need to show that \(\mathcal H\) can shatter some set of \(d + 1\) points. That is, it suffices to exhibit a set of \(d + 1\) points such that \(\mathcal H\) can produce any pre-specified \(\{-1, +1\}\) assignment \(y = (y_1, \dots, y_{d+1})\). Let \[\begin{align} X &= \left[ \begin{matrix} x_1^T\\ \vdots\\ x_{d+1}^T \\ \end{matrix} \right] \\ &= \left[ \begin{matrix} 1 & 0 & 0 & \cdots & 0\\ 1 & 1 & 0 & \cdots & 0\\ 1 & 0 & 1 & \cdots & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 0 & 0 & \cdots & 1\\ \end{matrix} \right] \in \mathbb{R}^{(d +1) \times (d + 1)} \end{align}\] where each row is a point in homogeneous coordinates (the leading \(1\) is the bias coordinate). Since \(X\) is invertible, for any labeling \(y\) we can choose \(W_3^T = X^{-1} y\); then the vector of outputs is \(X W_3^T = y\), so the sign of \(o_i\) equals \(y_i\) for every point, and these \(d + 1\) points are shattered.
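A small numerical illustration of this construction (a sketch; the point set and solver are just one convenient choice):

```python
import numpy as np
from itertools import product

d = 3
# Rows of X: the d+1 points in homogeneous coordinates (bias column first)
X = np.hstack([np.ones((d + 1, 1)),
               np.vstack([np.zeros((1, d)), np.eye(d)])])

# For every labeling y, solve X w = y and check the resulting signs
for y in product([-1.0, 1.0], repeat=d + 1):
    y = np.array(y)
    w = np.linalg.solve(X, y)        # weights realizing o_i = y_i exactly
    o = X @ w
    assert np.all(np.sign(o) == y)   # every labeling is realized

print(f"all {2 ** (d + 1)} labelings realized: {d + 1} points shattered")
```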
Next, we need to prove that \[VC(\mathcal{H}) \leq d + 1\] Since every hypothesis in \(\mathcal H\) has the form \(\mathrm{sign}(w_3 v + b_3)\), \(\mathcal H\) is contained in the class of affine linear classifiers on \(\mathbb{R}^d\), whose VC dimension is exactly \(d + 1\). Hence no set of \(d + 2\) points can be shattered, and combining the two bounds gives \(VC(\mathcal{H}) = d + 1\).
2.2 Problem 2
Let \(\mathcal{H}\) be the hypothesis set realized by the network above with ReLU activation.
2.2.1 Result
\[VC(\mathcal{H}) = n + 1\]
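To illustrate the lower bound for \(d = 1\) (following the hint), one standard construction places \(n + 1\) points \(v_0 < v_1 < \dots < v_n\) on the line and freezes one ReLU breakpoint \(c_j\) strictly between \(v_{j-1}\) and \(v_j\), so that unit \(j\) computes \(\max(0, v - c_j)\). The resulting feature matrix (with a bias column prepended) is lower triangular with a positive diagonal, hence invertible, so \((b_2, w_2)\) can be solved to realize any labeling. A minimal sketch, assuming this construction; all names are illustrative:

```python
import numpy as np
from itertools import product

n = 4                                   # hidden ReLU units
v = np.arange(n + 1, dtype=float)       # n + 1 points on the line: 0, 1, ..., n
c = np.arange(n) + 0.5                  # breakpoints c_j between v_{j-1} and v_j

# Frozen first layer (chosen from the inputs only, no labels):
# W1 row j is 1 and b1_j = -c_j, so unit j computes max(0, v - c_j)
features = np.maximum(0.0, v[:, None] - c[None, :])      # shape (n+1, n)
Phi = np.hstack([np.ones((n + 1, 1)), features])         # prepend bias column

# Phi is lower triangular with positive diagonal, hence invertible:
# for every labeling y we can solve Phi @ [b2, w2] = y
for y in product([-1.0, 1.0], repeat=n + 1):
    y = np.array(y)
    params = np.linalg.solve(Phi, y)
    o = Phi @ params
    assert np.all(np.sign(o) == y)

print(f"all {2 ** (n + 1)} labelings realized: {n + 1} points shattered")
```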