1. Problems
We take a binary classification problem (labels \(\pm 1\)) for data with \(d\)-dimensional features as an example. A neural network with one hidden layer can be written as
\[ o = w_2^T \sigma(W_1 v + b_1) + b_2 \tag{1}\]
where \(v\) is the \(d\)-dimensional input feature vector of the datum, and \(W_1\), \(w_2\), \(b_1\), \(b_2\) are the parameters of the model: \(W_1\) is an \(n \times d\) matrix, \(w_2\) is an \(n\)-dimensional vector, \(b_1\) is an \(n\)-dimensional bias vector, and \(b_2\) is a scalar bias. When \(o > 0\) we classify the datum as label \(+1\), and when \(o \leq 0\) we classify it as label \(-1\). This is a neural network, or multi-layer perceptron, with one hidden layer containing \(n\) neurons. In this problem, we focus on the pre-training setting with frozen first-layer parameters: \(W_1\) and \(b_1\) must be fixed after seeing all inputs \(v\) but before seeing any labels \((l_1, \dots, l_i)\), while \(w_2\) and \(b_2\) may be chosen after seeing all the labels \((l_1, \dots, l_i)\).
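For concreteness, here is a minimal NumPy sketch of the forward pass in Eq. (1) and the sign-thresholding rule; the names (`forward`, `classify`, `W1`, `w2`, etc.) are illustrative, not part of the problem statement.

```python
import numpy as np

def forward(v, W1, b1, w2, b2, sigma):
    """One-hidden-layer network: o = w2^T sigma(W1 v + b1) + b2."""
    h = sigma(W1 @ v + b1)      # hidden activations, shape (n,)
    return w2 @ h + b2          # scalar pre-threshold output o

def classify(o):
    """Label +1 if o > 0, else -1."""
    return 1 if o > 0 else -1

# Example with n = 3 hidden units and d = 2 input features
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
w2, b2 = rng.normal(size=3), 0.5
v = np.array([1.0, -2.0])
relu = lambda x: np.maximum(0.0, x)   # ReLU activation (Problem 2)
print(classify(forward(v, W1, b1, w2, b2, relu)))
```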
1.1 Problem 1
Given \(n, d\), calculate the VC dimension of the neural network for the linear activation case, i.e. \(\sigma(x) = x\). Prove your result.
1.2 Problem 2
Given \(n, d\), calculate the VC dimension of the neural network for the ReLU activation case, i.e. \(\sigma(x) = \max(0, x)\). Prove your result.
1.3 Hints
- Recall the definition of VC dimension (restated below for reference).
- Consider \(n > d\) and \(n \leq d\).
- For Problem 2, start from \(d = 1\).
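For reference, the standard definition: a hypothesis set \(\mathcal H\) shatters a set of points \(\{x_1, \dots, x_m\}\) if for every labeling \(y \in \{-1, +1\}^m\) there exists some \(h \in \mathcal H\) with \(h(x_i) = y_i\) for all \(i\). The VC dimension is the size of the largest set that \(\mathcal H\) can shatter:
\[ VC(\mathcal H) = \max\{\, m : \exists\, \{x_1, \dots, x_m\} \text{ shattered by } \mathcal H \,\} \]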
2. Solutions
2.1 Problem 1
Let \(\mathcal{H}\) be the hypothesis set realized by the network above with linear activation.
2.1.1 Result
\[VC(\mathcal{H}) = d + 1\]
2.1.2 Proof
\[\sigma(x) = x \tag{1.1}\]
\[\begin{align} o &= w_2^T \sigma(W_1v + b_1) + b_2\\ &= w_2^T(W_1v + b_1) + b_2\\ &= w_2^TW_1v + (w_2^Tb_1 + b_2) \tag{1.2} \end{align}\]
The network therefore collapses to a linear classifier, which we write as \[ o = W_3X \]
\[ W_3 = \left[ \begin{matrix} w_3 & b_3 \end{matrix} \right] \]
\[ X = \left[ \begin{matrix} v \\ 1 \\ \end{matrix} \right] \]
\[ w_3 = w^T_2W_1 \]
\[ b_3 = w^T_2b_1 + b_2 \]
where \(w_3 = w_2^T W_1\) is a \(1 \times d\) row vector, \(b_3\) is a scalar bias, and thus \(W_3\) is a \(1 \times (d + 1)\) matrix.
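As a quick numerical sanity check (a sketch, not part of the proof), one can verify that the collapsed linear form agrees with the original network when \(\sigma(x) = x\):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 3
W1, b1 = rng.normal(size=(n, d)), rng.normal(size=n)
w2, b2 = rng.normal(size=n), rng.normal()
v = rng.normal(size=d)

# Original network with identity activation
o_net = w2 @ (W1 @ v + b1) + b2

# Collapsed linear classifier: w3 = w2^T W1, b3 = w2^T b1 + b2
w3 = w2 @ W1
b3 = w2 @ b1 + b2
o_lin = w3 @ v + b3

assert np.isclose(o_net, o_lin)   # identical outputs
```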
First, we need to prove that \[VC(\mathcal{H}) \geq d + 1\]
We need to show that \(\mathcal H\) can shatter some set of \(d + 1\) points. That is, it suffices to exhibit a set of \(d + 1\) points such that \(\mathcal H\) can produce any pre-specified \(\{-1, +1\}\) assignment \(y = (y_1, \dots, y_{d+1})\). Let \[\begin{align} X &= \left[ \begin{matrix} x_1^T\\ \vdots\\ x_{d+1}^T \\ \end{matrix} \right] \\ &= \left[ \begin{matrix} 1 & 0 & 0 & \cdots & 0\\ 1 & 1 & 0 & \cdots & 0\\ 1 & 0 & 1 & \cdots & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 0 & 0 & \cdots & 1\\ \end{matrix} \right] \in \mathbb{R}^{(d +1) \times (d + 1)} \end{align}\] where each row is a point in homogeneous coordinates (the leading \(1\) is the bias coordinate). Since \(X\) is invertible, for any labeling \(y\) we can choose \(W_3^T = X^{-1} y\); then the vector of outputs is \(X W_3^T = y\), so the sign of \(o_i\) equals \(y_i\) for every point, and these \(d + 1\) points are shattered.
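A small numerical illustration of this construction (a sketch; the point set and solver are just one convenient choice):

```python
import numpy as np
from itertools import product

d = 3
# Rows of X: the d+1 points in homogeneous coordinates (bias column first)
X = np.hstack([np.ones((d + 1, 1)),
               np.vstack([np.zeros((1, d)), np.eye(d)])])

# For every labeling y, solve X w = y and check the resulting signs
for y in product([-1.0, 1.0], repeat=d + 1):
    y = np.array(y)
    w = np.linalg.solve(X, y)        # weights realizing o_i = y_i exactly
    o = X @ w
    assert np.all(np.sign(o) == y)   # every labeling is realized

print(f"all {2 ** (d + 1)} labelings realized: {d + 1} points shattered")
```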
Next, we need to prove that \[VC(\mathcal{H}) \leq d + 1\] Since every hypothesis in \(\mathcal H\) has the form \(\mathrm{sign}(w_3 v + b_3)\), \(\mathcal H\) is contained in the class of affine linear classifiers on \(\mathbb{R}^d\), whose VC dimension is exactly \(d + 1\). Hence no set of \(d + 2\) points can be shattered, and combining the two bounds gives \(VC(\mathcal{H}) = d + 1\).
2.2 Problem 2
Let \(\mathcal{H}\) be the hypothesis set realized by the network above with ReLU activation.
2.2.1 Result
\[VC(\mathcal{H}) = n + 1\]
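To illustrate the lower bound for \(d = 1\) (following the hint), one standard construction places \(n + 1\) points \(v_0 < v_1 < \dots < v_n\) on the line and freezes one ReLU breakpoint \(c_j\) strictly between \(v_{j-1}\) and \(v_j\), so that unit \(j\) computes \(\max(0, v - c_j)\). The resulting feature matrix (with a bias column prepended) is lower triangular with a positive diagonal, hence invertible, so \((b_2, w_2)\) can be solved to realize any labeling. A minimal sketch, assuming this construction; all names are illustrative:

```python
import numpy as np
from itertools import product

n = 4                                   # hidden ReLU units
v = np.arange(n + 1, dtype=float)       # n + 1 points on the line: 0, 1, ..., n
c = np.arange(n) + 0.5                  # breakpoints c_j between v_{j-1} and v_j

# Frozen first layer (chosen from the inputs only, no labels):
# W1 row j is 1 and b1_j = -c_j, so unit j computes max(0, v - c_j)
features = np.maximum(0.0, v[:, None] - c[None, :])      # shape (n+1, n)
Phi = np.hstack([np.ones((n + 1, 1)), features])         # prepend bias column

# Phi is lower triangular with positive diagonal, hence invertible:
# for every labeling y we can solve Phi @ [b2, w2] = y
for y in product([-1.0, 1.0], repeat=n + 1):
    y = np.array(y)
    params = np.linalg.solve(Phi, y)
    o = Phi @ params
    assert np.all(np.sign(o) == y)

print(f"all {2 ** (n + 1)} labelings realized: {n + 1} points shattered")
```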