# Deep Networks for Named Entity Recognition

Following Socher’s CS224d course @ Stanford, the second part of the assignment requires developing deep networks for the task of named entity recognition (NER). For this we define a two-layer network of the form:

$$h = \tanh(Wx + b_1)$$

$$\hat{y} = \mathrm{softmax}(Uh + b_2)$$

$$J(\theta) = CE(y, \hat{y}) = -\sum_{i=1}^{5} y_i \log \hat{y}_i$$

We know $y \in {\mathbb R}^5$ is a one-hot vector for the various classes of possible named entities (person, organization, location, miscellaneous, or not a named entity). We define the input

$$x = [x_{t-1}L,\; x_{t}L,\; x_{t+1}L]$$

as a window of words. Each $x_i$ is a one-hot row vector that indexes into a word vector matrix $L$, so $x_i L$ selects the embedding for word $i$.
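As a sanity check, here is a minimal NumPy sketch of the forward pass. All dimensions (vocabulary size, embedding size, hidden size) are hypothetical toy values, and the one-hot multiplication $x_i L$ is implemented as a direct row lookup, which is equivalent:

```python
import numpy as np

# Hypothetical toy sizes: vocab of 10 words, 50-dim embeddings,
# a 3-word window, hidden size 100, and 5 output classes.
V, d, H, C = 10, 50, 100, 5
rng = np.random.default_rng(0)

L = rng.normal(scale=0.1, size=(V, d))      # word vector matrix
W = rng.normal(scale=0.1, size=(H, 3 * d))  # hidden-layer weights
b1 = np.zeros(H)
U = rng.normal(scale=0.1, size=(C, H))      # output-layer weights
b2 = np.zeros(C)

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def forward(window):
    """window: indices (x_{t-1}, x_t, x_{t+1}); one-hot times L == row select."""
    x = np.concatenate([L[i] for i in window])  # shape (3d,)
    h = np.tanh(W @ x + b1)
    return softmax(U @ h + b2)

yhat = forward((3, 1, 7))  # predicted class distribution for the center word
```

The output `yhat` is a valid probability distribution over the 5 classes.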

Now we need to compute gradients.

First, let’s compute $\frac{\partial J}{\partial \theta}$, where $\theta = Uh + b_2$ is the input to the softmax. Since $y$ is one-hot, only the term for the true class $i$ survives the sum in $J$, leaving $J = -\log \hat{y}_i$; if the sum appears to be missing below, that’s why. For $i \neq j$:

$$\frac{\partial J}{\partial \theta_j} = -\frac{1}{\hat{y}_i}\frac{\partial \hat{y}_i}{\partial \theta_j} = -\frac{1}{\hat{y}_i}\left(-\hat{y}_i\hat{y}_j\right) = \hat{y}_j$$

And for when $i = j$:

$$\frac{\partial J}{\partial \theta_i} = -\frac{1}{\hat{y}_i}\frac{\partial \hat{y}_i}{\partial \theta_i} = -\frac{1}{\hat{y}_i}\,\hat{y}_i\left(1 - \hat{y}_i\right) = \hat{y}_i - 1$$

Combining both cases into a single vector expression gives the familiar $\frac{\partial J}{\partial \theta} = \hat{y} - y$.
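We can verify that the softmax cross-entropy gradient is $\hat{y} - y$ numerically with central finite differences. The logits and true class below are arbitrary example values:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([0.5, -1.0, 2.0, 0.1, -0.3])  # arbitrary logits
true_class = 2
y = np.eye(5)[true_class]                      # one-hot target

yhat = softmax(theta)
analytic = yhat - y  # the gradient derived above

# central finite differences on J = -log(yhat[true_class])
eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(len(theta)):
    tp, tm = theta.copy(), theta.copy()
    tp[j] += eps
    tm[j] -= eps
    Jp = -np.log(softmax(tp)[true_class])
    Jm = -np.log(softmax(tm)[true_class])
    numeric[j] = (Jp - Jm) / (2 * eps)
```

The analytic and numeric gradients agree to within the finite-difference error.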

Now we move on to the other derivatives with the help of our old friend, the chain rule. Writing $\delta^{(2)} = \hat{y} - y$ for the error at the softmax layer:

$$\frac{\partial J}{\partial U} = \delta^{(2)} h^\top, \qquad \frac{\partial J}{\partial b_2} = \delta^{(2)}$$

Backpropagating through the $\tanh$ layer, with $\delta^{(1)} = \left(U^\top \delta^{(2)}\right) \circ \left(1 - h \circ h\right)$ (using $\tanh'(z) = 1 - \tanh^2(z)$):

$$\frac{\partial J}{\partial W} = \delta^{(1)} x^\top, \qquad \frac{\partial J}{\partial b_1} = \delta^{(1)}, \qquad \frac{\partial J}{\partial x} = W^\top \delta^{(1)}$$

The last gradient distributes back to the rows of $L$ that appear in the input window.
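The chain-rule gradients above can be checked end to end with a finite-difference test on $W$. This is a sketch with tiny hypothetical dimensions, not the assignment's actual starter code:

```python
import numpy as np

rng = np.random.default_rng(1)
d3, H, C = 6, 4, 5  # hypothetical tiny sizes: input (3d), hidden, classes

W = rng.normal(size=(H, d3)); b1 = rng.normal(size=H)
U = rng.normal(size=(C, H));  b2 = rng.normal(size=C)
x = rng.normal(size=d3)
y = np.eye(C)[2]  # arbitrary true class

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(W, b1, U, b2):
    h = np.tanh(W @ x + b1)
    return -np.log(softmax(U @ h + b2) @ y)

# analytic gradients via the chain rule
h = np.tanh(W @ x + b1)
yhat = softmax(U @ h + b2)
d2 = yhat - y
dU, db2 = np.outer(d2, h), d2
d1 = (U.T @ d2) * (1 - h * h)   # backprop through tanh
dW, db1 = np.outer(d1, x), d1

# central finite-difference check on W
eps = 1e-6
num_dW = np.zeros_like(W)
for i in range(H):
    for j in range(d3):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num_dW[i, j] = (loss(Wp, b1, U, b2) - loss(Wm, b1, U, b2)) / (2 * eps)
```

If the backprop equations are right, `dW` and `num_dW` match up to finite-difference error.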

Now, we want to avoid parameters exploding or becoming highly correlated, so we augment our cost with a Gaussian prior as follows. “This tends to push parameter weights closer to zero, without constraining their direction, and often leads to classifiers with better generalization ability.”

Maximizing the log likelihood under this Gaussian prior amounts to adding an L2 penalty on the weights with regularization strength $\lambda$:

$$J_{reg} = \frac{\lambda}{2}\left[\sum_{i,j} W_{ij}^2 + \sum_{i,j} U_{ij}^2\right]$$

The combined loss function then becomes:

$$J_{full} = J + J_{reg}$$

Our update gradients will thus include a new term for each weight matrix respectively:

$$\frac{\partial J_{full}}{\partial W} = \frac{\partial J}{\partial W} + \lambda W, \qquad \frac{\partial J_{full}}{\partial U} = \frac{\partial J}{\partial U} + \lambda U$$
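In code, the regularization term just adds $\lambda W$ to the data gradient before the update. The matrices, gradient, learning rate, and $\lambda$ below are hypothetical example values:

```python
import numpy as np

lam, alpha = 1e-3, 0.1  # hypothetical regularization strength and learning rate
W = np.array([[0.5, -1.0],
              [2.0,  0.1]])
grad_J = np.array([[ 0.1, 0.2],
                   [-0.3, 0.4]])  # dJ/dW from backprop (example values)

# With the regularized loss, the gradient gains a lam * W term
grad_full = grad_J + lam * W

# One SGD step; the lam * W term decays weights toward zero
W_new = W - alpha * grad_full
```

Note that biases are typically left out of the penalty, since shrinking them does not help generalization.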

Now, we want random initializations to help avoid bad local minima. The following scheme (often called Xavier initialization, after Glorot and Bengio) has been found to work well: for a matrix $A$ of dimension $m \times n$, select values $A_{ij}$ uniformly in the range $[-\epsilon, \epsilon]$, where:

$$\epsilon = \frac{\sqrt{6}}{\sqrt{m + n}}$$

In code we can implement this as a simple function.
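A minimal NumPy sketch; the function name `random_weight_matrix` follows the assignment’s starter code, but treat the exact signature as an assumption:

```python
import numpy as np

def random_weight_matrix(m, n, rng=np.random.default_rng()):
    """Xavier init: uniform in [-eps, eps] with eps = sqrt(6) / sqrt(m + n)."""
    eps = np.sqrt(6.0) / np.sqrt(m + n)
    return rng.uniform(-eps, eps, size=(m, n))

# e.g. initialize the hidden-layer weights for a 150-dim input, 100-dim hidden layer
A = random_weight_matrix(100, 150)
```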