{% extends "layout.html" %}

{% block content %}
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Gradient Descent Study Guide</title>
<script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>
<script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
<style>
body {
  background-color: #ffffff;
  color: #000000;
  font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Helvetica, Arial, sans-serif;
  font-weight: normal;
  line-height: 1.8;
  margin: 0;
  padding: 20px;
}

.container {
  max-width: 800px;
  margin: 0 auto;
  padding: 20px;
}

h1, h2, h3 {
  color: #000000;
  border: none;
  font-weight: bold;
}

h1 {
  text-align: center;
  border-bottom: 3px solid #000;
  padding-bottom: 10px;
  margin-bottom: 30px;
  font-size: 2.5em;
}

h2 {
  font-size: 1.8em;
  margin-top: 40px;
  border-bottom: 1px solid #ddd;
  padding-bottom: 8px;
}

h3 {
  font-size: 1.3em;
  margin-top: 25px;
}

strong {
  font-weight: 900;
}

p, li {
  font-size: 1.1em;
  border-bottom: 1px solid #e0e0e0;
  padding-bottom: 10px;
  margin-bottom: 10px;
}

li:last-child {
  border-bottom: none;
}

ul {
  list-style-type: none;
  padding-left: 0;
}

li::before {
  content: "•";
  color: #000;
  font-weight: bold;
  display: inline-block;
  width: 1em;
  margin-left: 0;
}
pre {
  background-color: #f4f4f4;
  border: 1px solid #ddd;
  border-radius: 5px;
  padding: 15px;
  white-space: pre-wrap;
  word-wrap: break-word;
  font-family: "Courier New", Courier, monospace;
  font-size: 0.95em;
  font-weight: normal;
  color: #333;
  border-bottom: none;
}

.story {
  background-color: #f9f9f9;
  border-left: 4px solid #4a90e2;
  margin: 15px 0;
  padding: 10px 15px;
  font-style: italic;
  color: #555;
  font-weight: normal;
  border-bottom: none;
}

@media (max-width: 768px) {
  body, .container {
    padding: 10px;
  }

  h1 {
    font-size: 2em;
  }

  h2 {
    font-size: 1.5em;
  }

  h3 {
    font-size: 1.2em;
  }

  p, li {
    font-size: 1em;
  }

  pre {
    font-size: 0.85em;
  }
}
</style>
</head>
<body>

<div class="container">
<h1>Gradient Descent Study Guide</h1>

<div>
  <!-- 3D effect (hard shadow) comes from shadow-[0_8px_0_rgb(29,78,216)];
       the pressed state moves the button down and removes the shadow. -->
  <a
    href="/gradient-descent-three"
    target="_blank"
    rel="noopener"
    onclick="playSound()"
    class="cursor-pointer inline-block relative bg-blue-500 text-white font-bold
           py-4 px-8 rounded-xl text-2xl transition-all duration-150
           shadow-[0_8px_0_rgb(29,78,216)]
           active:shadow-none active:translate-y-[8px]"
  >
    Tap Me!
  </a>
</div>

<script>
  // Plays the click sound if an <audio id="clickSound"> element exists on the page.
  // No such element is defined in this template, so the guard below simply skips playback.
  function playSound() {
    const audio = document.getElementById("clickSound");
    if (audio) {
      audio.currentTime = 0;
      audio.play().catch(e => console.log("Audio play failed:", e));
    }
  }
</script>

<h2>🔹 Core Concepts</h2>
<div class="story">
<p><strong>The Story: The Lost Hiker</strong></p>
<p>Imagine a hiker lost in thick fog on a mountain who wants to reach the lowest valley. They can't see far, but they can feel the slope of the ground under their feet. Their strategy is simple: feel for the steepest downward slope and take a step in that direction. By repeating this process, they slowly but surely make their way downhill, hoping to find the bottom. In this story, the hiker is Gradient Descent, their position is the model's parameters, and the mountain's altitude is the error (or loss). The goal is to find the lowest point.</p>
</div>
<h3>Definition:</h3>
<p>
<strong>Gradient Descent (GD)</strong> is an <strong>optimization algorithm</strong> used to minimize a <strong>loss/cost function</strong> by iteratively updating model <strong>parameters</strong> (weights, biases).
</p>

<h3>Why optimization in ML?</h3>
<ul>
<li>In ML, models (like <strong>Linear Regression, Neural Networks</strong>) try to learn <strong>parameters</strong> that best fit the training data.</li>
<li>We measure how well the model fits using a <strong>loss function</strong> (e.g., <strong>Mean Squared Error</strong>).</li>
<li><strong>Optimization</strong> finds the minimum loss, meaning the best model parameters.</li>
</ul>

<h3>Basic Idea:</h3>
<p>Think of standing on a hill (the cost function surface). To reach the lowest valley (minimum loss):</p>
<ul>
<li>Look at the slope (<strong>gradient</strong>).</li>
<li>Take a small step in the <strong>opposite direction</strong> of the slope.</li>
<li>Repeat until you reach the bottom.</li>
</ul>
<p>👉 <strong>Example: Linear Regression</strong></p>
<p>If the predicted line is too high, the <strong>gradient</strong> tells us to decrease the slope/intercept; if it is too low, to increase them.</p>
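<p>To make the "walk downhill" idea concrete, here is a minimal sketch (not part of the original guide) that minimizes the one-dimensional function \( f(x) = x^2 \), whose slope at any point is \( 2x \):</p>
<pre><code>
# Minimal gradient descent on f(x) = x^2 (gradient: 2x).
x = 5.0        # starting position on the "hill"
alpha = 0.1    # learning rate (step size)

for step in range(50):
    grad = 2 * x             # slope at the current position
    x = x - alpha * grad     # step in the opposite direction of the slope

print(x)  # close to 0, the minimum of f(x) = x^2
</code></pre>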

<h2>🔹 Mathematical Foundation</h2>
<div class="story">
<p><strong>The Story: The Hiker's Rulebook</strong></p>
<p>The hiker needs a precise set of instructions for each step. This formula is their rulebook. It says: "Your <strong>new position</strong> is your <strong>old position</strong>, minus a small step (the <strong>learning rate</strong>) in the direction of the steepest slope (the <strong>gradient</strong>)." This rule ensures every step is a calculated move towards the valley floor, preventing them from walking in circles.</p>
</div>
<h3>General Update Rule:</h3>
<p>
$$ \theta_j := \theta_j - \alpha \cdot \frac{\partial J(\theta)}{\partial \theta_j} $$
</p>
<ul>
<li>\( \theta_j \): <strong>parameter</strong> (e.g., a weight or bias).</li>
<li>\( \alpha \): <strong>learning rate</strong> (step size).</li>
<li>\( J(\theta) \): <strong>cost function</strong>.</li>
<li>\( \frac{\partial J}{\partial \theta_j} \): <strong>slope</strong> of the cost with respect to the parameter.</li>
</ul>

<h3>Example: Linear Regression with Mean Squared Error (MSE):</h3>
<p>
$$ J(w,b) = \frac{1}{m} \sum_{i=1}^m (y_i - (wx_i + b))^2 $$
</p>
<p><strong>Gradient with respect to w:</strong>
$$ \frac{\partial J}{\partial w} = -\frac{2}{m} \sum_{i=1}^m x_i(y_i - (wx_i+b)) $$
</p>
<p><strong>Gradient with respect to b:</strong>
$$ \frac{\partial J}{\partial b} = -\frac{2}{m} \sum_{i=1}^m (y_i - (wx_i+b)) $$
</p>
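<p>As a small sketch (assuming the toy data used later in this guide, where y = 2x), the two gradient formulas translate directly into NumPy for a single update step:</p>
<pre><code>
import numpy as np

X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])      # y = 2x
w, b, alpha, m = 0.0, 0.0, 0.01, len(X)

y_pred = w * X + b
dw = -(2 / m) * np.sum(X * (y - y_pred))   # dJ/dw from the formula above
db = -(2 / m) * np.sum(y - y_pred)         # dJ/db from the formula above

w, b = w - alpha * dw, b - alpha * db      # one application of the update rule
print(w, b)
</code></pre>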

<h3>Visualization:</h3>
<p>Imagine a U-shaped curve (a parabola). Starting near the top, you repeatedly step downward along the slope until you reach the bottom.</p>

<h2>🔹 Types of Gradient Descent</h2>
<div class="story">
<p><strong>The Story: Three Different Hikers</strong></p>
<p>Our lost hiker could have one of three personalities:</p>
<ul>
<li><strong>The Deliberate Hiker (Batch GD):</strong> Before taking a single step, this hiker scans the <em>entire</em> visible landscape around them, averages out the slope, and then takes one perfect, confident step. It's a very slow but very stable process.</li>
<li><strong>The Impulsive Hiker (Stochastic GD):</strong> This hiker is in a hurry. They just feel the slope under one foot and immediately take a step. Their path is erratic and zig-zaggy, but they move very fast, and their chaotic nature can help them jump out of small ditches (local minima).</li>
<li><strong>The Pragmatic Hiker (Mini-Batch GD):</strong> This hiker finds a middle ground. They scan a small patch of ground around them (not everything, but more than one spot), decide on a direction, and take a step. This is the most popular strategy: a good balance of speed and stability.</li>
</ul>
</div>

<h3>Batch Gradient Descent</h3>
<ul>
<li>Uses the <strong>whole dataset</strong> to compute the gradient.</li>
<li>Accurate but slow for large datasets.</li>
<li>👉 <strong>Example:</strong> 1 million rows → 1 update per pass.</li>
</ul>

<h3>Stochastic Gradient Descent (SGD)</h3>
<ul>
<li>Uses <strong>1 data point</strong> at a time.</li>
<li>Faster; introduces noise → helps escape local minima.</li>
<li>👉 <strong>Example:</strong> updates the model after each single sample.</li>
</ul>

<h3>Mini-Batch Gradient Descent</h3>
<ul>
<li>Uses small random batches (e.g., <strong>32 or 64 samples</strong>).</li>
<li>Balance between speed & stability.</li>
<li>Standard in <strong>deep learning</strong>.</li>
<li>👉 <strong>Example:</strong> For 1000 samples, batch size = 100 → 10 updates per pass (see the sketch after this list).</li>
</ul>
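<p>The sketch below is a hypothetical illustration (reusing the w/b linear-regression setup from this guide on synthetic data) of how many updates one pass over the data produces under each strategy:</p>
<pre><code>
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=1000)
y = 2 * X + rng.normal(0, 1, size=1000)   # noisy y = 2x
w, b, alpha, batch_size = 0.0, 0.0, 0.01, 100

# Mini-batch GD: shuffle, then update once per batch -> 10 updates per pass here.
idx = rng.permutation(len(X))
for start in range(0, len(X), batch_size):
    batch = idx[start:start + batch_size]
    Xb, yb = X[batch], y[batch]
    y_pred = w * Xb + b
    dw = -(2 / len(Xb)) * np.sum(Xb * (yb - y_pred))
    db = -(2 / len(Xb)) * np.sum(yb - y_pred)
    w, b = w - alpha * dw, b - alpha * db

# Batch GD would use all 1000 rows in a single update per pass;
# SGD would loop over the samples one at a time (1000 updates per pass).
print(w, b)
</code></pre>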

<h2>🔹 Variants / Improvements</h2>
<div class="story">
<p><strong>The Story: Upgrading the Hiker's Gear</strong></p>
<p>A basic hiker is good, but a hiker with advanced gear is better. These variants are like giving our hiker special equipment to navigate the mountain more effectively.</p>
<ul>
<li><strong>Momentum Boots:</strong> These boots are heavy. Once they start moving in a direction, they build momentum, helping the hiker roll over small bumps (local minima) and speed up on long, straight downhill paths.</li>
<li><strong>Adam's All-Terrain Boots:</strong> This is the ultimate hiking gear. It combines the momentum boots with adaptive traction. The boots automatically adjust their grip (the learning rate), taking smaller, careful steps on slippery, steep slopes and longer strides on flat, easy terrain. This is why it's the most popular and reliable choice.</li>
</ul>
</div>
<ul>
<li><strong>Momentum</strong> → remembers the previous update direction to speed up learning.</li>
<li><strong>Nesterov Accelerated Gradient (NAG)</strong> → looks ahead before updating; more precise.</li>
<li><strong>Adagrad</strong> → adapts the learning rate for each parameter; useful for sparse data.</li>
<li><strong>RMSProp</strong> → smooths updates using a moving average of squared gradients.</li>
<li><strong>Adam</strong> → combines Momentum + RMSProp (the most popular optimizer in deep learning).</li>
</ul>
<p>👉 <strong>Example:</strong> When training a CNN, <strong>Adam</strong> often converges much faster than plain GD.</p>
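<p>As a rough sketch of how one of these upgrades changes the update rule, here is classical momentum added to the plain gradient step. This is not code from the guide; the 0.9 coefficient is a common default used here as an illustrative assumption:</p>
<pre><code>
# Plain GD:      w = w - alpha * dw
# Momentum GD:   keep a running "velocity" that accumulates past gradients.

def momentum_step(w, dw, v, alpha=0.01, beta=0.9):
    """One momentum update for a single parameter w with gradient dw and velocity v."""
    v = beta * v - alpha * dw   # accumulate the past direction, then add the new gradient
    w = w + v                   # move along the accumulated velocity
    return w, v

# Hypothetical usage: repeated steps against a constant gradient keep gaining speed.
w, v = 0.0, 0.0
for _ in range(3):
    w, v = momentum_step(w, dw=1.0, v=v)
    print(w, v)
</code></pre>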

<h2>🔹 Key Parameters</h2>
<div class="story">
<p><strong>The Story: Calibrating the Hiker's Tools</strong></p>
<p>Before starting, the hiker must calibrate their approach. These are the settings they control:</p>
<ul>
<li><strong>Step Size (Learning Rate):</strong> How big a step should they take? If it's too large, they might leap right over the valley. If it's too small, it could take them forever to get down the mountain.</li>
<li><strong>Journey Duration (Number of Iterations):</strong> How many steps should they plan to take? They need to walk long enough to reach the valley, but not so long that they waste energy wandering around the bottom.</li>
<li><strong>Scan Area (Batch Size):</strong> For the pragmatic hiker, how large a patch of ground should they examine for each step? A small patch gives a quick decision, while a larger patch gives more stability.</li>
</ul>
</div>
<ul>
<li><strong>Learning Rate (\(\alpha\))</strong>
  <ul>
  <li><strong>Too high</strong> → overshooting, divergence.</li>
  <li><strong>Too low</strong> → very slow convergence.</li>
  </ul>
</li>
<li><strong>Number of Iterations (epochs)</strong> → how many times to repeat the updates.</li>
<li><strong>Batch size</strong> → affects speed & stability.</li>
<li><strong>Initialization of parameters</strong>
  <ul>
  <li><strong>Random initialization</strong> prevents symmetry.</li>
  <li><strong>Xavier/He initialization</strong> is used in deep networks.</li>
  </ul>
</li>
</ul>
<p>👉 <strong>Example:</strong> If the learning rate is 1, the model may oscillate or diverge; if it is 0.0001, training may take forever.</p>
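<p>A quick way to see this trade-off is to rerun the same toy problem with several learning rates and compare the final loss. This is a small sketch, reusing the y = 2x data from the implementation later in this guide; the specific learning rates are illustrative:</p>
<pre><code>
import numpy as np

X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])
m = len(X)

for alpha in [1.0, 0.05, 0.01, 0.0001]:
    w, b = 0.0, 0.0
    for _ in range(50):
        y_pred = w * X + b
        dw = -(2 / m) * np.sum(X * (y - y_pred))
        db = -(2 / m) * np.sum(y - y_pred)
        w, b = w - alpha * dw, b - alpha * db
    loss = np.mean((y - (w * X + b)) ** 2)
    # alpha=1.0 blows up, 0.05 and 0.01 converge, 0.0001 barely moves in 50 steps
    print(f"alpha={alpha}: loss after 50 steps = {loss:.6f}")
</code></pre>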

<h2>🔹 Strengths & Weaknesses</h2>
<div class="story">
<p><strong>The Story: Reviewing the Hiking Strategy</strong></p>
<p>The "take a step downhill" strategy is simple and effective, but not foolproof.</p>
<ul>
<li><strong>Strengths:</strong> It's a universal strategy that works on almost any mountain (problem), and it's the foundation for more advanced hiking techniques.</li>
<li><strong>Weaknesses:</strong> The hiker might find a small ditch and think they've reached the main valley (a local minimum). The whole journey also depends heavily on picking the right step size, and if the mountain has many different types of terrain (unscaled features), the hiker's path can become very inefficient.</li>
</ul>
</div>
<h3>Advantages:</h3>
<ul>
<li>✅ Simple and effective.</li>
<li>✅ Works well for large problems.</li>
<li>✅ Forms the base of advanced optimizers.</li>
</ul>
<h3>Disadvantages:</h3>
<ul>
<li>❌ Can get stuck in <strong>local minima/saddle points</strong>.</li>
<li>❌ Very sensitive to the <strong>learning rate</strong>.</li>
<li>❌ Requires <strong>normalized data</strong> for efficiency.</li>
</ul>

<h2>🔹 When to Use</h2>
<div class="story">
<p><strong>The Story: Choosing the Right Mountain</strong></p>
<p>This downhill hiking strategy is perfect for any "mountain" that has a smooth, continuous surface where you can always calculate a slope. This applies to countless problems in the real world, from predicting house prices (a gentle hill) to training massive AI for image recognition (a complex, high-dimensional mountain range).</p>
</div>
<ul>
<li><strong>Regression & Classification</strong> models.</li>
<li><strong>Neural networks & deep learning</strong>.</li>
<li>Any optimization problem with a differentiable cost function.</li>
</ul>
<p>👉 <strong>Example:</strong> Logistic Regression, CNNs, and RNNs are all trained with GD.</p>

<h2>🔹 Python Implementation</h2>
<div class="story">
<p><strong>The Story: Writing Down the Hiking Plan</strong></p>
<p>This code is the hiker's plan, written down before they start their journey. It defines the map of the mountain (the <code>X</code> and <code>y</code> data), sets the calibration (learning rate, epochs), and contains the step-by-step instructions for the hike (the loop). Finally, it plots a chart of their altitude (loss) over time to confirm they successfully made it downhill.</p>
</div>
<h3>Example: Gradient Descent for Linear Regression</h3>
<pre><code>
import numpy as np
import matplotlib.pyplot as plt

# Data
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])   # y = 2x

# Parameters
w, b = 0, 0
alpha = 0.01
epochs = 1000
m = len(X)
losses = []

# Gradient Descent
for i in range(epochs):
    y_pred = w*X + b

    # Gradients of MSE with respect to w and b
    dw = -(2/m) * np.sum(X * (y - y_pred))
    db = -(2/m) * np.sum(y - y_pred)

    # Update step: move against the gradient
    w -= alpha * dw
    b -= alpha * db

    loss = np.mean((y - y_pred)**2)
    losses.append(loss)

print("Final w:", w, "Final b:", b)

# Plot convergence
plt.plot(losses)
plt.xlabel("Iterations")
plt.ylabel("Loss")
plt.title("Convergence Curve")
plt.show()
</code></pre>
<p>👉 Try changing the <strong>learning rate</strong> (\(\alpha\)) and observe how the convergence behavior changes.</p>

<h2>🔹 Real-World Applications</h2>
<div class="story">
<p><strong>The Story: Famous Mountains Conquered</strong></p>
<p>This hiking method isn't just a theory; it's used to conquer real-world challenges every day. It's the strategy used to train the models that power Netflix's recommendations, Google's speech recognition, and the AI that can diagnose diseases from medical scans. Each of these is a complex "mountain" that Gradient Descent learns to navigate.</p>
</div>
<ul>
<li>Training <strong>Linear & Logistic Regression</strong>.</li>
<li>Optimizing <strong>Neural Networks</strong> (CNN, RNN, Transformers).</li>
<li>Used in <strong>Gradient Boosting, XGBoost</strong>.</li>
<li><strong>Applications:</strong> Speech recognition, image classification, recommendation systems.</li>
</ul>

<h2>🔹 Best Practices</h2>
<div class="story">
<p><strong>The Story: Pro-Tips for Every Hiker</strong></p>
<p>Experienced hikers have a checklist for a successful journey:</p>
<ul>
<li><strong>Prepare the Terrain (Normalize Data):</strong> Before you start, smooth out the mountain path. Rescaling the terrain makes the slope more consistent and the hike much faster.</li>
<li><strong>Use the Best Gear (Adaptive Optimizers):</strong> Don't just walk. Use the Adam all-terrain boots to automatically adapt your steps to the terrain.</li>
<li><strong>Check Your GPS (Monitor Loss):</strong> Regularly check your altitude chart. If you're going up instead of down, you know something is wrong (for example, your step size is too big).</li>
<li><strong>Know When to Camp (Early Stopping):</strong> Once you've reached a good-enough spot in the valley, set up camp. Continuing to wander around risks getting lost again (overfitting).</li>
</ul>
</div>
<ul>
<li><strong>Normalize/Standardize Data:</strong> Scale your features to a similar range. This helps GD converge much faster.</li>
<li><strong>Use Adaptive Optimizers:</strong> Start with <strong>Adam</strong> instead of standard SGD. It often works well with little tuning.</li>
<li><strong>Monitor the Loss Curve:</strong> Always plot the loss vs. iterations. If it's flat, learning has stalled. If it's increasing, your learning rate is likely too high.</li>
<li><strong>Use Early Stopping:</strong> To prevent overfitting, monitor performance on a validation set and stop training when the validation loss starts to increase.</li>
</ul>
<p>👉 <strong>Example:</strong> A standard, robust approach for a deep learning project is to <strong>standardize the input data</strong>, use the <strong>Adam optimizer</strong>, and apply <strong>early stopping</strong>.</p>
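<p>The sketch below illustrates two of these practices, standardization and early stopping, on the same linear-regression setup. It uses plain gradient descent rather than Adam to stay dependency-free; the synthetic data and the patience value are illustrative assumptions, not part of the original guide:</p>
<pre><code>
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 100, size=200)          # deliberately large-scale feature
y = 3 * X + rng.normal(0, 10, size=200)

# Standardize the feature so the gradient steps are well-conditioned
X = (X - X.mean()) / X.std()

# Hold out a validation split for early stopping
X_tr, X_val, y_tr, y_val = X[:150], X[150:], y[:150], y[150:]

w, b, alpha = 0.0, 0.0, 0.05
best_val, patience, bad_epochs = np.inf, 10, 0

for epoch in range(1000):
    y_pred = w * X_tr + b
    dw = -(2 / len(X_tr)) * np.sum(X_tr * (y_tr - y_pred))
    db = -(2 / len(X_tr)) * np.sum(y_tr - y_pred)
    w, b = w - alpha * dw, b - alpha * db

    val_loss = np.mean((y_val - (w * X_val + b)) ** 2)
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
    if bad_epochs >= patience:           # stop once validation loss stops improving
        print("Early stopping at epoch", epoch)
        break

print("w:", w, "b:", b, "best validation loss:", best_val)
</code></pre>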
</div>

</body>
</html>

{% endblock %}