A one-layer model fits a linear function of its input, defined as output = kernel * input + bias
The kernel and bias are tunable parameters (the weights) of the dense layer.
These weights contain the information learned by the network from exposure to the training data.
Initially, these weights are filled with small random values (a step called random initialization).
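As a rough illustration of the lines above, here is a minimal plain-JavaScript sketch of a one-unit linear model with randomly initialized weights. The names createDenseWeights and denseForward, and the scale value 0.1, are hypothetical choices for this sketch; they are not part of the TensorFlow.js API, and real libraries pick their own initializers.

```javascript
// Hypothetical stand-in for a one-unit dense layer; not the TensorFlow.js API.
function createDenseWeights(scale = 0.1) {
  // Random initialization: small starting values in [-scale, scale).
  return {
    kernel: (Math.random() * 2 - 1) * scale,
    bias: (Math.random() * 2 - 1) * scale,
  };
}

// output = kernel * input + bias, applied to every input in the array.
function denseForward(weights, inputs) {
  return inputs.map((x) => weights.kernel * x + weights.bias);
}

const initialWeights = createDenseWeights();
// Before any training, these outputs are essentially arbitrary.
console.log(initialWeights, denseForward(initialWeights, [1, 2, 3]));
```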
To find a good setting for the kernel and bias (collectively, the weights), we need two things:
1. A measure that tells us how well we are doing at a given setting of the weights. This measure is called the loss function.
2. A method to update the weights’ values so that next time we will do better than we are currently doing, according to that measure. This is the job of the optimizer, i.e., the algorithm by which the network updates its weights (kernel and bias, in this case) based on the data and the loss function.
The compile() method specifies 'sgd' as the optimizer and 'meanAbsoluteError' as the loss.
'meanAbsoluteError' means that the loss function will calculate the differences between the predictions and the targets, take their absolute values (making them all positive), and then return the average of those values:
meanAbsoluteError = average( absolute(modelOutput - targets))
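The formula above can be written out in plain JavaScript as follows. This is only a sketch that mirrors the formula on ordinary arrays; it is not the TensorFlow.js implementation, and the input check is an added assumption.

```javascript
// Plain-array version of the formula; not the library's own implementation.
function meanAbsoluteError(modelOutput, targets) {
  if (modelOutput.length !== targets.length || targets.length === 0) {
    throw new Error('Inputs must be non-empty and of equal length');
  }
  let total = 0;
  for (let i = 0; i < targets.length; i++) {
    // Absolute difference between one prediction and its target.
    total += Math.abs(modelOutput[i] - targets[i]);
  }
  return total / targets.length;
}

// Absolute errors are 1, 0 and 2, so the average is 1.
console.log(meanAbsoluteError([2, 4, 6], [1, 4, 8])); // logs 1
```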
'sgd' stands for stochastic gradient descent, an algorithm that uses calculus to determine what adjustments should be made to the weights in order to reduce the loss.
The fit() method runs the training process of a model in TensorFlow.js. Training can be long-running, lasting for seconds or minutes, so fit() is asynchronous: it returns a Promise, which is why it is typically called with async/await.
The evaluate() method calculates the loss function as applied to the provided example features and targets. It is similar to the fit() method in that it calculates the same loss, but evaluate() does not update the model’s weights.
The training loop iterates through the following steps:
1. Draw a batch of training samples x and corresponding targets y_true. A batch is simply a number of input examples put together as a tensor. The number of examples in a batch is called the batch size. In practical deep learning, it is often set to be a power of 2, such as 128 or 256. Examples are batched together to take advantage of the GPU’s parallel processing power and to make the calculated values of the gradients more stable.
2. Run the network on x (a step called the forward pass) to obtain predictions y_pred.
3. Compute the loss of the network on the batch, a measure of the mismatch between y_true and y_pred. Recall that the loss function is specified when model.compile() is called.
4. Update all the weights (parameters) in the network in a way that slightly reduces the loss on this batch. The detailed updates to the individual weights are managed by the optimizer, which was specified during the model.compile() call.
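The four steps above can be sketched in plain JavaScript for a one-unit linear model trained with mean absolute error. Everything here is an assumption made for illustration: the toy data (y = 2x + 1), the names runTrainingLoop, forwardPass and batchMae, the hyperparameter values, and the hand-written update rule. Batches are taken in order to keep the run reproducible, whereas real training usually draws them at random. In TensorFlow.js this work is done inside fit() by the optimizer.

```javascript
// Toy data following y = 2x + 1 (made up for this sketch).
const toyInputs = [0, 1, 2, 3, 4, 5, 6, 7];
const toyTargets = toyInputs.map((x) => 2 * x + 1);

// Forward pass of a one-unit linear model.
const forwardPass = (w, xs) => xs.map((x) => w.kernel * x + w.bias);

// Mean absolute error between predictions and targets.
const batchMae = (yPred, yTrue) =>
  yPred.reduce((sum, p, i) => sum + Math.abs(p - yTrue[i]), 0) / yTrue.length;

function runTrainingLoop(xs, ys, { epochs, batchSize, learningRate }) {
  const w = { kernel: 0, bias: 0 }; // fixed start keeps the run reproducible
  const lossHistory = [];
  for (let epoch = 0; epoch < epochs; epoch++) {
    for (let start = 0; start < xs.length; start += batchSize) {
      // Step 1: draw a batch of samples x and targets yTrue.
      const x = xs.slice(start, start + batchSize);
      const yTrue = ys.slice(start, start + batchSize);

      // Step 2: forward pass to obtain predictions yPred.
      const yPred = forwardPass(w, x);

      // Step 3: compute the loss on this batch.
      lossHistory.push(batchMae(yPred, yTrue));

      // Step 4: nudge the weights against the (sub)gradient of the MAE.
      let gradKernel = 0;
      let gradBias = 0;
      for (let i = 0; i < x.length; i++) {
        const sign = Math.sign(yPred[i] - yTrue[i]);
        gradKernel += (sign * x[i]) / x.length;
        gradBias += sign / x.length;
      }
      w.kernel -= learningRate * gradKernel;
      w.bias -= learningRate * gradBias;
    }
  }
  return { weights: w, lossHistory };
}

const lossBefore = batchMae(forwardPass({ kernel: 0, bias: 0 }, toyInputs), toyTargets);
const trained = runTrainingLoop(toyInputs, toyTargets, {
  epochs: 200,
  batchSize: 4,
  learningRate: 0.01,
});
const lossAfter = batchMae(forwardPass(trained.weights, toyInputs), toyTargets);
console.log({ lossBefore, lossAfter, weights: trained.weights });
```

Because the MAE gradient depends only on the sign of each error, the updates do not shrink as the fit improves, so with a fixed learning rate the weights end up hovering near the best values rather than settling exactly on them.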
The loss, viewed as a function of all tunable parameters, is known as the loss surface.
The loss surface for this example has a bowl shape, with a global minimum at the bottom of the bowl representing the best parameter settings.
In general, however, the loss surface of a deep-learning model is much more complex. It will have many more than two dimensions and could have many local minima, i.e., points that are lower than anything nearby but not the lowest overall.
A naive way to improve the weights is to nudge them in a randomly chosen direction and keep the change only if the loss goes down. For larger problems, i.e., when optimizing millions of weights, the likelihood of randomly selecting a good direction becomes vanishingly small.
A much better approach is to take advantage of the fact that all operations used in the network are differentiable, and hence to compute the gradient of the loss with respect to the network’s parameters.
By its mathematical definition, the gradient points in the direction along which the loss function increases most steeply. When training neural networks, the loss should gradually decrease. Therefore, the weights should be moved in the direction opposite to the gradient. This training process is aptly named gradient descent.
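To make the idea concrete, here is a small sketch of gradient descent on a bowl-shaped loss with two parameters. The loss function, its minimum at (2, 1), the learning rate and the step count are all assumptions chosen for illustration; the gradient is written out by hand rather than computed by a library.

```javascript
// A bowl-shaped loss with its minimum at kernel = 2, bias = 1 (chosen for this sketch).
const bowlLoss = ([k, b]) => (k - 2) ** 2 + (b - 1) ** 2;

// Hand-derived gradient: the partial derivatives of bowlLoss.
const bowlGradient = ([k, b]) => [2 * (k - 2), 2 * (b - 1)];

function gradientDescent(start, learningRate, steps) {
  let params = [...start];
  for (let i = 0; i < steps; i++) {
    const grad = bowlGradient(params);
    // Move each parameter a small step in the direction opposite to the gradient.
    params = params.map((p, j) => p - learningRate * grad[j]);
  }
  return params;
}

const startParams = [0, 0];
const startGrad = bowlGradient(startParams);
// One step along the gradient raises the loss; one step against it lowers the loss.
const uphill = startParams.map((p, j) => p + 0.1 * startGrad[j]);
const downhill = startParams.map((p, j) => p - 0.1 * startGrad[j]);
console.log(bowlLoss(startParams), bowlLoss(uphill), bowlLoss(downhill));

// Repeating the downhill step approaches the bottom of the bowl at (2, 1).
console.log(gradientDescent(startParams, 0.1, 100));
```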
One of the most desirable properties of deep neural networks is that they are universal approximators, which means they should be able to fit non-convex functions as well. The problem with a non-convex loss surface is that the initial guess might not be near the global minimum, so gradient descent might converge to a local minimum instead. Stochastic gradient descent can help with this problem: the noise introduced by random batches can knock the weights out of shallow local minima, although it does not guarantee reaching the global minimum.
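The local-minimum problem can be seen on a one-parameter example. The sketch below runs plain (non-stochastic) gradient descent on a made-up non-convex function, w^4 - 3w^2 + w, which has two valleys of different depth. The function, the starting points and the learning rate are assumptions chosen for illustration.

```javascript
// A made-up non-convex loss with two valleys of different depth.
const bumpyLoss = (w) => w ** 4 - 3 * w ** 2 + w;

// Its derivative, worked out by hand.
const bumpyLossSlope = (w) => 4 * w ** 3 - 6 * w + 1;

function descendFrom(startW, learningRate = 0.01, steps = 500) {
  let w = startW;
  for (let i = 0; i < steps; i++) {
    // Plain gradient descent: step against the slope.
    w -= learningRate * bumpyLossSlope(w);
  }
  return w;
}

// Starting on the left ends in the deeper valley (w is about -1.30).
const fromLeft = descendFrom(-2);
// Starting on the right gets stuck in the shallower valley (w is about 1.13).
const fromRight = descendFrom(2);

console.log(fromLeft, bumpyLoss(fromLeft));
console.log(fromRight, bumpyLoss(fromRight));
```

Both runs follow the same rule and differ only in the initial guess, yet they end at different losses, which is exactly the sensitivity to initialization described above.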
The term “stochastic” means drawing random samples from the training data during each gradient-descent step for efficiency, as opposed to using every training data sample at every step. In short, stochastic gradient descent is simply a modification of gradient descent for computational efficiency.
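The random sampling described above can be sketched as a helper that picks which examples go into the next batch. The name sampleBatchIndices and the optional random argument are hypothetical; this shows one possible way to draw a batch without replacement and is not how TensorFlow.js does it internally.

```javascript
// Pick batchSize distinct example indices at random (sampling without replacement).
function sampleBatchIndices(datasetSize, batchSize, random = Math.random) {
  if (batchSize > datasetSize) {
    throw new Error('batchSize cannot exceed datasetSize');
  }
  const indices = Array.from({ length: datasetSize }, (_, i) => i);
  // Partial Fisher-Yates shuffle: only the first batchSize slots are needed.
  for (let i = 0; i < batchSize; i++) {
    const j = i + Math.floor(random() * (datasetSize - i));
    [indices[i], indices[j]] = [indices[j], indices[i]];
  }
  return indices.slice(0, batchSize);
}

// Each gradient-descent step would use only these 4 of the 1000 examples.
const batchIndices = sampleBatchIndices(1000, 4);
console.log(batchIndices);
```

The gradient computed on such a batch is a noisy estimate of the gradient over the full dataset, but it costs work proportional to the batch size rather than the dataset size.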
Stochastic means nondeterministic or unpredictable. Random generally means unrecognizable, not adhering to a pattern. A random variable is also called a stochastic variable. (https://math.stackexchange.com/questions/114373/whats-the-difference-between-stochastic-and-random)