
The Softmax Function | PyTorch

Created : 20/01/2022 | on Linux: 5.4.0-91-generic
Updated: 20/01/2022 | on Linux: 5.4.0-91-generic
Status: Draft

previous topic 1: Starting Development with PyTorch
previous topic 2: Tensors and Data Handling with PyTorch
previous topic 3: Building a network in eager mode

Warning!! This post is under construction!

Softmax function
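For reference, the standard definition used throughout this post: given a vector of scores z with one entry per class, the softmax turns it into a probability distribution,

$$
\mathrm{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{C} e^{z_j}}, \qquad k = 1, \dots, C .
$$

Because the expression only depends on differences between scores, softmax(z) = softmax(z - max_j z_j); this shift is the numerical-stability trick used in the code below.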

Probabilistic and Information Theory perspectives
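One standard way to read the loss used in the code below: treating the softmax outputs as a categorical distribution over the C classes, minimising the cross-entropy between the one-hot target and that distribution is the same as maximising the log-likelihood of the correct class. Per example this gives

$$
L_i = -\log \hat{y}_{y_i} = -z_{y_i} + \log \sum_{j=1}^{C} e^{z_j},
$$

where z = x_i W are the scores for example i and y_i is its label. This is exactly the term accumulated inside the loop of the naive implementation.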

Proof for Softmax Gradient
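For the implementations below, the key result (the standard gradient of the softmax cross-entropy loss, with y_hat = softmax(z)) is

$$
\frac{\partial L_i}{\partial z_k} = \hat{y}_k - \mathbf{1}[k = y_i],
\qquad
\frac{\partial L_i}{\partial W_{:,k}} = x_i \, \bigl(\hat{y}_k - \mathbf{1}[k = y_i]\bigr),
$$

i.e. the predicted probability minus the one-hot target, scaled by the input. The naive code applies this column by column; the vectorised code applies it as a single matrix product.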

Implementations with Python

The code below is taken from my answers to the CS231n assignments.

from builtins import range
import numpy as np

def softmax_loss_naive(W, X, y, reg):
    """
    Softmax loss function, naive implementation (with loops)

    Inputs have dimension D, there are C classes, and we operate on minibatches
    of N examples.

    Inputs:
    - W: A numpy array of shape (D, C) containing weights.
    - X: A numpy array of shape (N, D) containing a minibatch of data.
    - y: A numpy array of shape (N,) containing training labels; y[i] = c means
      that X[i] has label c, where 0 <= c < C.
    - reg: (float) regularization strength

    Returns a tuple of:
    - loss as single float
    - gradient with respect to weights W; an array of same shape as W
    """
    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)

    batch_size = X.shape[0]
    n_features = X.shape[1]
    n_classes  = W.shape[1]

    for i in range(batch_size):

        # Class scores (unnormalised log-probabilities) for example i,
        # shifted by their max for numerical stability; the loss
        # -s[y[i]] + log(sum(exp(s))) is invariant to this shift.
        un_normalised_probs = X[i] @ W
        un_normalised_probs -= np.max(un_normalised_probs)

        loss += -un_normalised_probs[y[i]] + np.log(np.sum(np.exp(un_normalised_probs)))

        # Softmax probabilities via the "log-sum-exp trick"
        y_hat = np.exp(un_normalised_probs) / np.sum(np.exp(un_normalised_probs))

        for k in range(n_classes):
            # Here we implement the gradient we derived above
            dW[:, k] += X[i] * (y_hat[k] - (k == y[i]))

    loss /= batch_size
    dW /= batch_size

    # Add L2 regularisation to the loss and its gradient
    loss += reg * np.sum(W * W)
    dW += reg * 2 * W

    return loss, dW

In the inner loop we accumulate the gradient column by column in the dW matrix; an equivalent outer-product view is sketched after the snippet.

for k in range(n_classes):
    # Here we implement the gradient we derived above
    dW[:, k] += X[i] * (y_hat[k] - (k == y[i]))
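The same column-by-column update is just the outer product of X[i] with the error vector y_hat - one_hot(y[i]), which is what makes the jump to the vectorised version natural. A minimal, self-contained sketch with made-up numbers (x_i, y_hat and y_i below are toy values, not variables defined earlier):

import numpy as np

# Toy single example: D = 3 features, C = 4 classes (arbitrary values)
x_i   = np.array([1.0, -2.0, 0.5])
y_hat = np.array([0.1, 0.2, 0.6, 0.1])   # softmax probabilities for this example
y_i   = 2                                 # correct class label

# Column-by-column loop, as in the snippet above
dW_loop = np.zeros((3, 4))
for k in range(4):
    dW_loop[:, k] += x_i * (y_hat[k] - (k == y_i))

# Equivalent single outer product: x_i (y_hat - one_hot(y_i))^T
error = y_hat.copy()
error[y_i] -= 1
dW_outer = np.outer(x_i, error)

print(np.allclose(dW_loop, dW_outer))   # True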

Here is a faster vectorised version of the code above.

un_normalised_probs = X @ W

# Shift each row by its max for numerical stability; the loss is invariant to this shift.
un_normalised_probs -= np.amax(un_normalised_probs, axis=1, keepdims=True)

# Per-example loss: log-sum-exp of the scores minus the score of the correct class
losses = (np.log(np.sum(np.exp(un_normalised_probs), axis=1, keepdims=True))
          - np.take_along_axis(un_normalised_probs, np.reshape(y, (-1, 1)), axis=1))
loss = np.sum(losses) / batch_size + reg * np.sum(W * W)

# Softmax probabilities, with 1 subtracted at the correct class of each row
y_hats = np.exp(un_normalised_probs) / np.sum(np.exp(un_normalised_probs), axis=1, keepdims=True)
y_hats[np.arange(batch_size), y] -= 1
dW = (X.T @ y_hats) / batch_size + reg * 2 * W
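As a quick sanity check, the vectorised snippet can be wrapped in a function and compared against the naive loop on random data. The wrapper name softmax_loss_vectorized and the shapes below are just assumptions for this sketch, not something defined elsewhere in the post:

def softmax_loss_vectorized(W, X, y, reg):
    """Vectorised softmax loss and gradient (assumed wrapper around the snippet above)."""
    batch_size = X.shape[0]

    un_normalised_probs = X @ W
    un_normalised_probs -= np.amax(un_normalised_probs, axis=1, keepdims=True)

    losses = (np.log(np.sum(np.exp(un_normalised_probs), axis=1, keepdims=True))
              - np.take_along_axis(un_normalised_probs, np.reshape(y, (-1, 1)), axis=1))
    loss = np.sum(losses) / batch_size + reg * np.sum(W * W)

    y_hats = np.exp(un_normalised_probs) / np.sum(np.exp(un_normalised_probs), axis=1, keepdims=True)
    y_hats[np.arange(batch_size), y] -= 1
    dW = (X.T @ y_hats) / batch_size + reg * 2 * W
    return loss, dW


# Compare both implementations on a small random problem (shapes chosen arbitrarily)
rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((10, 4))   # D = 10 features, C = 4 classes
X = rng.standard_normal((32, 10))         # N = 32 examples
y = rng.integers(0, 4, size=32)

loss_naive, dW_naive = softmax_loss_naive(W, X, y, reg=0.1)
loss_vec, dW_vec = softmax_loss_vectorized(W, X, y, reg=0.1)

print(np.isclose(loss_naive, loss_vec))   # expect True
print(np.allclose(dW_naive, dW_vec))      # expect True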


Check the next topic

Source: PyTorch Tutorial

