Logistic Regression
> [!NOTE]
> Despite the word “regression” in its name, Logistic Regression is a fundamental algorithm used for classification problems. It predicts the probability that an observation belongs to a particular class.
1. Intuition: Mapping to Probabilities
Linear Regression output is continuous and unbounded—it can predict any number from negative infinity to positive infinity. But if you want to classify an email as “Spam” (1) or “Not Spam” (0), you need a model that outputs a value strictly between 0 and 1, representing a probability.
We achieve this by taking the linear equation (y = mx + b) and passing it through a special mathematical function called the Sigmoid Function (or Logistic Function).
1.1 The Sigmoid Function
The sigmoid function “squashes” any real number into the range (0, 1).
σ(z) = 1 / (1 + e^(-z))

Where z is the output of our linear equation: z = θ^T x
- If z is very large and positive, e^(-z) approaches 0, and σ(z) approaches 1.
- If z is very large and negative, e^(-z) approaches infinity, and σ(z) approaches 0.
- If z is 0, σ(z) is exactly 0.5.
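These three limiting cases can be checked with a minimal NumPy sketch of the sigmoid function:

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(10))   # large positive z -> close to 1
print(sigmoid(-10))  # large negative z -> close to 0
print(sigmoid(0))    # z = 0 -> exactly 0.5
```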
2. Interactive Analysis: Decision Boundary
The two parameters control the shape of the sigmoid curve and the resulting decision boundary: the weight (slope) sets how steeply the predicted probability transitions between the classes (0 or 1), while the bias (intercept) shifts the point where the curve crosses 0.5.
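A short sketch of this effect, in place of an interactive plot: increasing the weight steepens the curve, and the boundary itself sits where the probability equals 0.5.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Larger weight -> steeper transition, so the probability at x = 1
# moves closer to 1 (bias fixed at 0 here).
for w in (0.5, 1.0, 4.0):
    print(f"w={w}: p(x=1) = {sigmoid(w * 1.0):.3f}")

# The boundary is where sigmoid(w*x + b) = 0.5, i.e. w*x + b = 0,
# so x = -b / w. A positive bias shifts the boundary to the left.
w, b = 2.0, 3.0
print("decision boundary at x =", -b / w)
```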
3. The Rigor: Log Loss (Cross-Entropy)
We cannot use Mean Squared Error (MSE) for Logistic Regression. Because we are passing the linear equation through a non-linear sigmoid function, plotting the MSE cost against the parameters would result in a “wavy,” non-convex surface with many local minima. Gradient Descent would likely get stuck and fail to find the global minimum.
Instead, we use a cost function called Log Loss (or Binary Cross-Entropy). It is derived from Maximum Likelihood Estimation.
For a single observation i:
- If the true label y = 1, the cost is -log(hθ(x)).
- If the true label y = 0, the cost is -log(1 - hθ(x)).
3.1 Why Log Loss Works
- If y = 1 and our model predicts 0.99, the cost is -log(0.99), which is very close to 0. (Low penalty for being right.)
- If y = 1 and our model predicts 0.01, the cost is -log(0.01), which is a very large number. (Massive penalty for being confidently wrong.)
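The asymmetry of these two penalties is easy to verify directly:

```python
import math

# Confident and correct: y = 1, prediction 0.99
print(-math.log(0.99))  # ~0.01, a tiny penalty

# Confident and wrong: y = 1, prediction 0.01
print(-math.log(0.01))  # ~4.61, a large penalty
```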
The combined, convex cost function for the entire dataset of m examples is:
J(θ) = -(1/m) × Σ_{i=1}^{m} [ y^(i) log(hθ(x^(i))) + (1 - y^(i)) log(1 - hθ(x^(i))) ]
Because this function is convex (bowl-shaped), Gradient Descent is guaranteed to find the global minimum.
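The full cost function translates almost line-for-line into vectorized NumPy. A minimal sketch (the `eps` clipping is a standard numerical safeguard, not part of the formula itself):

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    """Binary cross-entropy, averaged over the m examples.

    eps clips predictions away from exactly 0 or 1 so the
    logarithms stay finite.
    """
    p = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.1, 0.8, 0.3])
print(log_loss(y, p))
```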
4. Implementation: Scikit-Learn
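In practice you rarely implement the optimization yourself; scikit-learn's `LogisticRegression` handles it. A minimal sketch on a hypothetical toy dataset (the random data and seed below are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy binary dataset: one feature, classes separated around x = 0.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression()  # uses the L-BFGS solver by default
model.fit(X_train, y_train)

probs = model.predict_proba(X_test)  # P(class 0) and P(class 1) per sample
preds = model.predict(X_test)        # probabilities thresholded at 0.5
print("accuracy:", accuracy_score(y_test, preds))
```

`predict_proba` exposes the raw sigmoid outputs, which is often more useful than the hard 0/1 labels when you need to rank or threshold predictions yourself.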
5. Complexity Analysis
| Phase | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Training | O(K × N × M) | O(N × M) | K=iterations, N=samples, M=features. Uses optimization algorithms like gradient descent or L-BFGS. |
| Prediction | O(M) | O(1) | Extremely fast; dot product followed by sigmoid calculation. |