# Training a Classifier.

May 18th, 2018: Data generation and model training for classification using TensorFlow.

## Preface.

I've been wanting to write a post like this for a while. This post targets programmers with an intermediate skill level. I expect the reader to be comfortable with the Python language, and to be aware of the basic APIs of TensorFlow, NumPy, and Matplotlib. Follow this post to set up this software on your system. We will use Jupyter Notebook.

## The Classification Setting.

We define a classification problem as a mapping from a measurement vector x with N entries to a probability distribution y across M class labels. The function f performs the mapping, and is parameterized by the collection of weights theta.

We define the dataset D as a collection of B tuples, each containing a measurement vector x and its associated class label l. We define a loss function eta across the dataset. We wish to find the parameters theta that minimize the loss.
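The equations for this section did not survive in this copy of the post, so here is my reconstruction from the definitions above. The symbols x, y, f, theta, D, B, l, and eta come from the text; the zero-based label indexing is an assumption I took from the code later in the post.

```latex
f_\theta : \mathbb{R}^N \to [0, 1]^M, \qquad
y = f_\theta(x), \qquad
\sum_{m=0}^{M-1} y_m = 1

\mathcal{D} = \{ (x_i, l_i) \}_{i=1}^{B}, \qquad
l_i \in \{ 0, \dots, M-1 \}

\theta^{*} = \arg\min_{\theta} \; \eta(\theta; \mathcal{D})
```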

Intuitively, we want to push up the probability of the correct class label l for each output y produced from input x, conditioned on theta. Equivalently, we minimize the sum of the negative log probabilities of the correct labels, and use that sum as the loss.
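To make the loss concrete, here is a minimal NumPy sketch of the sum of negative log probabilities. The function name `negative_log_likelihood` and the toy probabilities are my own, chosen only for illustration. The TensorFlow code later in the post averages over examples instead of summing, which only rescales the loss by 1/B.

```python
import numpy as np

def negative_log_likelihood(probs, labels):
    """Sum over examples of -log(probability assigned to the correct label)."""
    # Pick, for each row i, the probability of its correct class labels[i].
    picked = probs[np.arange(len(labels)), labels]
    return -np.sum(np.log(picked))

# Two toy examples over three classes (hypothetical values).
labels = np.array([0, 1])
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
sharper = np.array([[0.9, 0.05, 0.05],
                    [0.05, 0.9, 0.05]])

loss = negative_log_likelihood(probs, labels)
sharper_loss = negative_log_likelihood(sharper, labels)
print(loss, sharper_loss)
```

Putting more probability on the correct labels lowers the loss, which is exactly the behaviour we want the optimizer to exploit.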

## Starting a Jupyter Notebook.

This experiment will run inside a Jupyter Notebook server. This provides easy data visualization and code organization, which is essential for presenting research to other scientists. Run the following in your terminal to start a notebook.

``name@computer:~$ jupyter notebook``

In the first panel of the notebook, import the following packages.

``import numpy as np``
``import tensorflow as tf``
``import matplotlib.pyplot as plt``

After running this panel, these packages will be part of the global namespace.

## Generating a Dataset.

In the second panel of the Jupyter Notebook, define these global variables. I encourage you to experiment with these values, and see how performance is affected. These values control the dimensions and complexity of the dataset.

``N = 100 # The size of the measurement vector``
``M = 5 # The number of class labels``
``B = 10000 # The number of dataset examples``

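We then define the dataset by sampling M centroids in N dimensional space, and generating (B/M) measurement vectors around each centroid by adding random offsets. Both steps use the Normal distribution. The centroids are drawn with a standard deviation of 10.0 and the offsets with a standard deviation of 1.0, so the clusters are well separated.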

``measurements = np.random.normal(0.0, 10.0, (M, N))``
``labels = np.arange(M, dtype=int)``
``measurements = np.tile(measurements[:, np.newaxis, :], (1, B//M, 1)).reshape((B, N))``
``labels = np.tile(labels[:, np.newaxis], (1, B//M)).reshape((B,))``
``measurements += np.random.normal(0.0, 1.0, (B, N))``
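The tile and reshape calls are easier to follow on a tiny example. This is a small sketch with made-up sizes (3 classes, 2 dimensions, 6 examples) and fixed centroids instead of random ones; the `_demo` names are mine and are kept separate from the notebook's globals.

```python
import numpy as np

M_demo, N_demo, B_demo = 3, 2, 6  # hypothetical tiny sizes for illustration

# Fixed centroids so the result is easy to read: [[0, 1], [2, 3], [4, 5]].
centroids = np.arange(M_demo * N_demo, dtype=float).reshape((M_demo, N_demo))

# Repeat each centroid B/M times, then flatten to one row per example.
tiled = np.tile(
    centroids[:, np.newaxis, :], (1, B_demo // M_demo, 1)
).reshape((B_demo, N_demo))

# Repeat each label the same number of times, in the same order.
demo_labels = np.tile(
    np.arange(M_demo)[:, np.newaxis], (1, B_demo // M_demo)
).reshape((B_demo,))

print(tiled)
print(demo_labels)
```

Each row of `tiled` is the centroid of the class in `demo_labels` at the same position. Note that the examples come out grouped by class, so they would need shuffling before being split into batches.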

There is a notation difference here. I defined the dataset D as a list of tuples for simplicity in the math notation. In the actual code, keeping the inputs and labels in separate arrays is more efficient and easier to implement. Try the following, which plots the first two dimensions of the measurements.

``plt.plot(measurements[:, 0], measurements[:, 1], "bo")``

I obtained this graph, which clearly depicts the five clusters.

## Defining the Model.

The model I use is an Artificial Neural Network that takes an N dimensional vector as input and produces an M dimensional vector as output, normalized to be a probability distribution. In particular, we have the following definition.

The Neural Network has V layers, where each layer multiplies the previous hidden state h by the weight matrix W and offsets the result by b. Each hidden state is passed through the nonlinearity sigma. The output of the final layer, z, is passed through softmax instead of sigma.
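The definition itself is missing from this copy of the post, so the block below is my reconstruction from the description above. I write the hidden state as a row vector multiplied on the right by W, which is an assumption that matches the inference code further down.

```latex
h_0 = x

h_v = \sigma\left( h_{v-1} W_v + b_v \right), \qquad v = 1, \dots, V - 1

z = h_{V-1} W_V + b_V

y_m = \mathrm{softmax}(z)_m = \frac{\exp(z_m)}{\sum_{k=0}^{M-1} \exp(z_k)}
```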

Theta is defined as the set of all W and b for each layer of the network. Here, I use the ReLU nonlinearity. Enter this code into the next panel of the notebook, for a model inference function, parameterized by theta.

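``def inference(x, theta):``
``    theta = list(theta)  # accept any iterable of (W, b) pairs, such as zip(W, b)``
``    for i, (W, b) in enumerate(theta):``
``        x = tf.tensordot(x, W, 1) + b  # affine projection of the hidden state``
``        if i != len(theta) - 1:  # no ReLU after the final layer``
``            x = tf.nn.relu(x)``
``    return x  # unnormalized scores (logits) for each class``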

Note the change in notation here. We previously defined theta as a set of trainable parameters W and b for each layer. For simplicity in code, the theta function parameter is an ordered list of tuples containing (W, b) for each layer.

``weights = [``
``    tf.get_variable(``
``        "weights_one",``
``        [N, N*2],  # (inputs, outputs), so that tensordot(x, W, 1) lines up``
``        initializer=tf.truncated_normal_initializer()),``
``    tf.get_variable(``
``        "weights_two",``
``        [N*2, N*2],``
``        initializer=tf.truncated_normal_initializer()),``
``    tf.get_variable(``
``        "weights_three",``
``        [N*2, M],  # the final layer maps the hidden state to M class scores``
``        initializer=tf.truncated_normal_initializer())]``
``biases = [``
``    tf.get_variable(``
``        "biases_one",``
``        [N*2],``
``        initializer=tf.constant_initializer(1.0)),``
``    tf.get_variable(``
``        "biases_two",``
``        [N*2],``
``        initializer=tf.constant_initializer(1.0)),``
``    tf.get_variable(``
``        "biases_three",``
``        [M],``
``        initializer=tf.constant_initializer(1.0))]``

This code will initialize the weights and biases in a three-layer Neural Network. Notice that the weights are drawn from a truncated Normal distribution, and the biases are set to a constant value of 1.0. A positive bias is a common choice with ReLU, since it keeps the units active at the start of training.
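If you want to check the shapes without building a TensorFlow graph, the same forward pass can be sketched in plain NumPy. Everything here is illustrative: the names `forward`, `softmax`, `n_features`, and `n_classes` are mine, the small weight scale of 0.01 is an assumption and not the initializer used above, and each weight matrix is stored as (inputs, outputs) so that `x @ W` is well defined.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, theta):
    """Apply each (W, b) layer with ReLU in between, then softmax at the end."""
    theta = list(theta)
    for i, (W, b) in enumerate(theta):
        x = x @ W + b
        if i != len(theta) - 1:
            x = np.maximum(x, 0.0)  # ReLU on every layer except the last
    return softmax(x)

rng = np.random.default_rng(0)
n_features, n_classes = 100, 5  # same values as N and M in this post
sizes = [n_features, 2 * n_features, 2 * n_features, n_classes]
theta = [
    (rng.normal(0.0, 0.01, (n_in, n_out)), np.ones(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]

x = rng.normal(size=(4, n_features))  # a batch of four random inputs
y = forward(x, theta)
print(y.shape, y.sum(axis=1))
```

Each row of `y` is a probability distribution over the five classes, one row per input.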

## Training.

We have a model parameterized by many numbers. In deep learning, we HOPE there is SOME combination of parameters that performs really well. We estimate these optimal parameters using variants of Gradient Descent.

I use a variant of vanilla Gradient Descent named Adam. In this experiment, we perform gradient descent across the whole dataset at every step, but in practice I would shuffle the dataset and sample batches from it. Define the following code in the notebook.

``def train(x_inputs, y_labels, W, b):``
``    y_pred = inference(``
``        x_inputs,``
``        zip(W, b))``
``    loss = tf.reduce_mean(``
``        tf.nn.sparse_softmax_cross_entropy_with_logits(``
``            logits=y_pred,``
``            labels=y_labels))``
``    return tf.argmax(y_pred, axis=-1), loss, tf.train.AdamOptimizer(``
``        learning_rate=0.1).minimize(``
``        loss, var_list=(W + b))``

Here, given the dataset as input, and the weights and biases for each layer in the Neural Network, we perform inference to obtain the class logits, which are used to calculate the mean cross entropy loss with respect to the labels. The softmax that turns the logits into class probabilities is applied inside tf.nn.sparse_softmax_cross_entropy_with_logits. I use the AdamOptimizer class to calculate the gradient of the loss with respect to the parameters, and to apply Adam's adaptive update rule.
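For the curious, the update rule is not that mysterious. Below is a minimal NumPy sketch of a single Adam step as it is usually written, applied to a toy quadratic instead of our network. The function name `adam_step` and the toy problem are my own, the hyperparameters other than the learning rate are the commonly used defaults, and TensorFlow's implementation may differ in small details.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and v are running averages of grad and grad**2."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # Correct the bias from initializing m and v at zero (t starts at 1).
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Toy problem: minimize f(w) = sum(w**2), whose gradient is 2 * w.
w = np.array([5.0, -3.0])
initial_loss = np.sum(w ** 2)
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 201):
    grad = 2.0 * w
    w, m, v = adam_step(w, grad, m, v, t)

final_loss = np.sum(w ** 2)
print(initial_loss, final_loss)
```

The division by the running average of squared gradients is what gives each parameter its own effective step size.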

``x_holder = tf.placeholder(tf.float32)``
``y_holder = tf.placeholder(tf.int32)``
``data_out = train(x_holder, y_holder, weights, biases)``

Enter the code above in a new panel. We are using the static computational graph version of TensorFlow, so all computation must be predefined in terms of placeholders, which we can fill in with actual data later.

``losses = []``
``accuracies = []``

``session = tf.Session()``
``session.run(tf.global_variables_initializer())``
``for step in range(10):``
``    # One full-batch step: fetch predictions and loss, and run the Adam update.``
``    pred, loss, _ = session.run(``
``        data_out,``
``        feed_dict={x_holder: measurements, y_holder: labels})``
``    acc = np.mean(pred == labels)  # fraction of correct predictions, between 0 and 1``
``    print(``
``        "Accuracy: %.2f %%" % acc,``
``        "Loss: %.2f" % loss)``
``    losses.append(loss)``
``    accuracies.append(acc)``

Here, we have entered a training session, where the computational graph is executed. In each of the 10 training steps, we compute the prediction of the network for each example, and backpropagate the derivative of the loss to tune the weights and biases. Note that the accuracy is printed as a fraction between 0 and 1, despite the percent sign. If all goes well, you should see something like the following printed to your notebook.

``Accuracy: 0.15 % Loss: 11533.53``
``Accuracy: 0.58 % Loss: 5829.85``
``Accuracy: 1.00 % Loss: 1.39``
``Accuracy: 1.00 % Loss: 9.62``
``Accuracy: 0.98 % Loss: 232.09``
``Accuracy: 0.99 % Loss: 83.18``
``Accuracy: 1.00 % Loss: 7.97``
``Accuracy: 1.00 % Loss: 0.17``
``Accuracy: 1.00 % Loss: 0.33``
``Accuracy: 1.00 % Loss: 1.60``

Nice!! We were able to get really high accuracy out of such a simple model, and very high dimensional data. Try plotting the accuracies and losses we collected during training, like what I provide below.

## Parting Thoughts.

In this post, we have built a small Neural Network that classifies high dimensional data. This sort of implementation can be adapted for recognizing handwritten digits, and naming objects. I challenge you to explore these next! The code for my implementation of this classifier is on this GitHub repository. Happy coding!