<aside> 🤔 Some materials define the update rule as follows. What is the difference? $\theta_j := \theta_j + {\alpha \over n} \sum_{i=1}^n(y^{(i)} - h_{\theta}(x^{(i)}))x_j^{(i)}$ (it divides by $n$, thus taking an average)

</aside>

In gradient descent optimization, especially for machine learning algorithms, the cost function is typically the average of the losses computed for each individual example in the training dataset. This averaging makes the cost function (also known as the loss or objective function) invariant to the size of the training set.
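
As a concrete illustration, here is a minimal NumPy sketch of one batch gradient-descent step for linear regression using the averaged update from the aside. The names (`X`, `y`, `theta`, `alpha`) and the synthetic data are illustrative assumptions, not part of the original notes:

```python
import numpy as np

# A minimal sketch (illustrative names and synthetic data) of one batch
# gradient-descent step for linear regression, using the averaged update
# from the aside:
#   theta_j := theta_j + (alpha / n) * sum_i (y^(i) - h_theta(x^(i))) * x_j^(i)

rng = np.random.default_rng(0)
n, d = 100, 3                       # n training examples, d features
X = rng.normal(size=(n, d))         # design matrix
y = X @ np.array([1.0, -2.0, 0.5])  # targets generated from a known theta

theta = np.zeros(d)                 # parameters to learn
alpha = 0.1                         # learning rate

residual = y - X @ theta            # (y^(i) - h_theta(x^(i))) for every i
grad = X.T @ residual / n           # averaged over the n examples
theta = theta + alpha * grad        # one update step
```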

Here's why this is important:

  1. Scalability: By averaging the loss, the final loss value is not affected by the number of training examples. This is important because it makes the cost function and the parameter updates during training scalable to different-sized training sets. If you didn't scale by the number of training examples, the magnitude of the gradients (and thus the updates to the parameters during training) would grow proportionally with the dataset size, potentially leading to instability in learning (see the sketch after this list).
  2. Comparability: It allows for a fair comparison of the cost function value across different training sets of varying sizes. Because the cost is an average, a cost of, say, "5" means the same thing regardless of whether your training set has 10 examples or 10,000.
  3. Stability and Learning Rate: Dividing by the number of training examples keeps the cost function in a stable numerical range across datasets of different sizes. This helps keep the learning rate hyperparameter consistent, since it would otherwise have to be retuned whenever the dataset size changes.
  4. Convergence Analysis: When analyzing the convergence of the gradient descent algorithm theoretically, having a cost function that is an average (i.e., divided by the number of training examples) often simplifies the mathematics, as the expected loss per example is typically what's considered.
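
The sketch below illustrates points 1 and 3: for data drawn from the same distribution, the sum-based gradient grows roughly linearly with $n$, while the mean-based gradient stays on a constant scale. The setup is assumed purely for illustration:

```python
import numpy as np

# Assumed setup for illustration: compare gradient magnitudes for a small
# and a large dataset drawn from the same distribution. The sum-based
# gradient grows with n; the mean-based gradient stays on the same scale.

rng = np.random.default_rng(0)
theta = np.zeros(3)

for n in (10, 10_000):
    X = rng.normal(size=(n, 3))
    y = X @ np.array([1.0, -2.0, 0.5])
    residual = y - X @ theta
    grad_sum = X.T @ residual        # scales roughly linearly with n
    grad_mean = grad_sum / n         # size-invariant
    print(f"n={n}: |grad_sum|={np.linalg.norm(grad_sum):.1f}, "
          f"|grad_mean|={np.linalg.norm(grad_mean):.2f}")
```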

While it's most common to divide the aggregate loss by the total number of training samples (i.e., to calculate the mean loss), it's not fundamentally "wrong" to use the sum instead, as long as you're consistent in how you compute the gradients and adjust the learning rate accordingly. However, using the mean loss is the more standard approach and has the advantages listed above.
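
To make "adjust the learning rate accordingly" concrete: the two variants take exactly the same step whenever their learning rates differ by a factor of $n$,

$$\theta_j := \theta_j + \alpha_{\text{sum}} \sum_{i=1}^n(y^{(i)} - h_{\theta}(x^{(i)}))x_j^{(i)} \quad\Longleftrightarrow\quad \theta_j := \theta_j + {\alpha_{\text{mean}} \over n} \sum_{i=1}^n(y^{(i)} - h_{\theta}(x^{(i)}))x_j^{(i)}, \qquad \text{when } \alpha_{\text{sum}} = {\alpha_{\text{mean}} \over n}.$$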