Softmax Function | PyTorch
Created : 20/01/2022 | on Linux: 5.4.0-91-generic
Updated: 20/01/2022 | on Linux: 5.4.0-91-generic
Status: Draft
previous topic 1: Starting Development with PyTorch
previous topic 2: Tensors and Data Handling with PyTorch
previous topic 3: Building a network in eager mode
Warning!! This post is under construction!
Softmax function
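For reference, given a vector of scores $s \in \mathbb{R}^{C}$ (one score per class), softmax turns the scores into a probability distribution over the classes:

$$\mathrm{softmax}(s)_k = \frac{e^{s_k}}{\sum_{j=1}^{C} e^{s_j}}, \qquad k = 1, \dots, C.$$

Because the output is invariant to adding a constant to every score, $\mathrm{softmax}(s) = \mathrm{softmax}(s - \max_j s_j)$; this is the "log-sum-exp trick" used in the code below to avoid overflow in the exponentials.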
Probabilistic and Information Theory Perspectives
Proof for Softmax Gradient
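A sketch of the derivation the implementations below rely on. For one example $x_i$ with label $y_i$ and scores $s = x_i W$ (so $s_k = x_i \cdot W_{:,k}$), the cross-entropy loss is

$$L_i = -s_{y_i} + \log \sum_{j=1}^{C} e^{s_j}.$$

Differentiating with respect to a single score gives

$$\frac{\partial L_i}{\partial s_k} = \frac{e^{s_k}}{\sum_j e^{s_j}} - \mathbb{1}[k = y_i] = \hat{y}_k - \mathbb{1}[k = y_i],$$

and since $s_k$ depends only on the column $W_{:,k}$, the chain rule ($\partial s_k / \partial W_{:,k} = x_i$) yields the per-column weight gradient used in the code:

$$\frac{\partial L_i}{\partial W_{:,k}} = x_i \, \bigl(\hat{y}_k - \mathbb{1}[k = y_i]\bigr).$$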
Implementations in Python
The code below is taken from my answers to the assignments in CS231n.
from builtins import range

import numpy as np


def softmax_loss_naive(W, X, y, reg):
    """
    Softmax loss function, naive implementation (with loops).

    Inputs have dimension D, there are C classes, and we operate on minibatches
    of N examples.

    Inputs:
    - W: A numpy array of shape (D, C) containing weights.
    - X: A numpy array of shape (N, D) containing a minibatch of data.
    - y: A numpy array of shape (N,) containing training labels; y[i] = c means
      that X[i] has label c, where 0 <= c < C.
    - reg: (float) regularization strength

    Returns a tuple of:
    - loss as single float
    - gradient with respect to weights W; an array of same shape as W
    """
    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)
    batch_size = X.shape[0]
    n_classes = W.shape[1]
    for i in range(batch_size):
        scores = X[i] @ W
        # Shift the scores with the "log-sum-exp trick": subtracting the
        # maximum leaves the loss unchanged but avoids overflow in np.exp.
        scores -= np.max(scores)
        loss += -scores[y[i]] + np.log(np.sum(np.exp(scores)))
        y_hat = np.exp(scores) / np.sum(np.exp(scores))
        for k in range(n_classes):
            # Here we implement the gradient we derived above.
            dW[:, k] += X[i] * (y_hat[k] - (k == y[i]))
    loss /= batch_size
    dW /= batch_size
    # Add L2 regularization to the loss and its gradient.
    loss += reg * np.sum(W * W)
    dW += reg * 2 * W
    return loss, dW
We apply the gradient to each column of the dW matrix:

        for k in range(n_classes):
            # Here we implement the gradient we derived above.
            dW[:, k] += X[i] * (y_hat[k] - (k == y[i]))
Here is a faster vectorised version of the code above.
def softmax_loss_vectorized(W, X, y, reg):
    batch_size = X.shape[0]
    scores = X @ W
    # Shift the scores row-wise ("log-sum-exp trick") for numerical stability.
    scores -= np.amax(scores, axis=1, keepdims=True)
    losses = (np.log(np.sum(np.exp(scores), axis=1, keepdims=True))
              - np.take_along_axis(scores, np.reshape(y, (-1, 1)), axis=1))
    loss = np.sum(losses) / batch_size + reg * np.sum(W * W)
    y_hats = np.exp(scores) / np.sum(np.exp(scores), axis=1, keepdims=True)
    # Subtracting 1 from the correct-class probabilities makes X.T @ y_hats
    # accumulate X[i] * (y_hat[k] - (k == y[i])) over the whole minibatch.
    y_hats[np.arange(batch_size), y] -= 1
    dW = (X.T @ y_hats) / batch_size + reg * 2 * W
    return loss, dW
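As a quick sanity check (this snippet is mine, not part of the assignment), the two implementations should agree, and with small random weights and no regularization the loss should be close to log(C), since the predicted distribution is nearly uniform:

import numpy as np

np.random.seed(0)
N, D, C = 64, 32, 10                       # minibatch size, features, classes
X = np.random.randn(N, D)
y = np.random.randint(C, size=N)
W = 1e-3 * np.random.randn(D, C)           # small weights => y_hat ~ uniform

loss_naive, dW_naive = softmax_loss_naive(W, X, y, reg=0.0)
loss_vec, dW_vec = softmax_loss_vectorized(W, X, y, reg=0.0)

print(loss_naive, np.log(C))               # both ~ 2.3026 for C = 10
print(abs(loss_naive - loss_vec))          # ~ 0
print(np.linalg.norm(dW_naive - dW_vec))   # ~ 0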
Check the next topic
Source: PyTorch Tutorial
Click here to report Errors, make Suggestions or Comments!