Logistic Regression
> [!NOTE]
> Despite the word “regression” in its name, Logistic Regression is a fundamental algorithm used for classification problems. It predicts the probability that an observation belongs to a particular class.
1. Intuition: Mapping to Probabilities
Linear Regression output is continuous and unbounded—it can predict any number from negative infinity to positive infinity. But if you want to classify an email as “Spam” (1) or “Not Spam” (0), you need a model that outputs a value strictly between 0 and 1, representing a probability.
We achieve this by taking the linear equation (y = mx + b) and passing it through a special mathematical function called the Sigmoid Function (or Logistic Function).
1.1 The Sigmoid Function
The sigmoid function “squashes” any real number into the range (0, 1).
σ(z) = 1 / (1 + e^(-z))

Where z is the output of our linear equation: z = θ^T x
- If z is very large and positive, e^(-z) approaches 0, and σ(z) approaches 1.
- If z is very large and negative, e^(-z) approaches infinity, and σ(z) approaches 0.
- If z is 0, σ(z) is exactly 0.5.
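These three limiting cases can be checked with a minimal NumPy sketch of the sigmoid function:

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(10))   # large positive z -> close to 1
print(sigmoid(-10))  # large negative z -> close to 0
print(sigmoid(0))    # z = 0 -> exactly 0.5
```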
2. Interactive Analysis: Decision Boundary
The two parameters control the shape of the sigmoid curve and the resulting decision boundary: the weight (slope) sets how steeply the predicted probability transitions between the classes (0 or 1), while the bias (intercept) shifts the point where the curve crosses 0.5.
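A short sketch of this effect, in place of an interactive plot: increasing the weight steepens the curve, and the boundary itself sits where the probability equals 0.5.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Larger weight -> steeper transition, so the probability at x = 1
# moves closer to 1 (bias fixed at 0 here).
for w in (0.5, 1.0, 4.0):
    print(f"w={w}: p(x=1) = {sigmoid(w * 1.0):.3f}")

# The boundary is where sigmoid(w*x + b) = 0.5, i.e. w*x + b = 0,
# so x = -b / w. A positive bias shifts the boundary to the left.
w, b = 2.0, 3.0
print("decision boundary at x =", -b / w)
```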
3. The Rigor: Log Loss (Cross-Entropy)
We cannot use Mean Squared Error (MSE) for Logistic Regression. Because we are passing the linear equation through a non-linear sigmoid function, plotting the MSE cost against the parameters would result in a “wavy,” non-convex surface with many local minima. Gradient Descent would likely get stuck and fail to find the global minimum.
Instead, we use a cost function called Log Loss (or Binary Cross-Entropy). It is derived from Maximum Likelihood Estimation.
For a single observation i:
- If the true label y = 1, the cost is -log(hθ(x)).
- If the true label y = 0, the cost is -log(1 - hθ(x)).
3.1 Why Log Loss Works
- If y = 1 and our model predicts 0.99, the cost is -log(0.99), which is very close to 0. (Low penalty for being right.)
- If y = 1 and our model predicts 0.01, the cost is -log(0.01), which is a very large number. (Massive penalty for being confidently wrong.)
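The asymmetry of these two penalties is easy to verify directly:

```python
import math

# Confident and correct: y = 1, prediction 0.99
print(-math.log(0.99))  # ~0.01, a tiny penalty

# Confident and wrong: y = 1, prediction 0.01
print(-math.log(0.01))  # ~4.61, a large penalty
```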
The combined, convex cost function for the entire dataset of m examples is:
J(θ) = -(1/m) × Σ_{i=1}^{m} [ y^(i) log(hθ(x^(i))) + (1 - y^(i)) log(1 - hθ(x^(i))) ]
Because this function is convex (bowl-shaped), Gradient Descent is guaranteed to find the global minimum.
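The full cost function translates almost line-for-line into vectorized NumPy. A minimal sketch (the `eps` clipping is a standard numerical safeguard, not part of the formula itself):

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    """Binary cross-entropy, averaged over the m examples.

    eps clips predictions away from exactly 0 or 1 so the
    logarithms stay finite.
    """
    p = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.1, 0.8, 0.3])
print(log_loss(y, p))
```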
4. Implementation: Scikit-Learn
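In practice you rarely implement the optimization yourself; scikit-learn's `LogisticRegression` handles it. A minimal sketch on a hypothetical toy dataset (the random data and seed below are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy binary dataset: one feature, classes separated around x = 0.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression()  # uses the L-BFGS solver by default
model.fit(X_train, y_train)

probs = model.predict_proba(X_test)  # P(class 0) and P(class 1) per sample
preds = model.predict(X_test)        # probabilities thresholded at 0.5
print("accuracy:", accuracy_score(y_test, preds))
```

`predict_proba` exposes the raw sigmoid outputs, which is often more useful than the hard 0/1 labels when you need to rank or threshold predictions yourself.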
5. Complexity Analysis
| Phase | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Training | O(K × N × M) | O(N × M) | K=iterations, N=samples, M=features. Uses optimization algorithms like gradient descent or L-BFGS. |
| Prediction | O(M) | O(1) | Extremely fast; dot product followed by sigmoid calculation. |