Logistic Regression

[!NOTE] Despite the word “regression” in its name, Logistic Regression is a fundamental algorithm used for classification problems. It predicts the probability that an observation belongs to a particular class.

1. Intuition: Mapping to Probabilities

Linear Regression output is continuous and unbounded—it can predict any number from negative infinity to positive infinity. But if you want to classify an email as “Spam” (1) or “Not Spam” (0), you need a model that outputs a value strictly between 0 and 1, representing a probability.

We achieve this by taking the linear equation (y = mx + b) and passing it through a special mathematical function called the Sigmoid Function (or Logistic Function).

1.1 The Sigmoid Function

The sigmoid function “squashes” any real number into the range (0, 1).

σ(z) = 1 / (1 + e^(-z))

Where z is the output of our linear equation: z = θᵀx

  • If z is very large and positive, e^(-z) approaches 0, and σ(z) approaches 1.
  • If z is very large and negative, e^(-z) approaches infinity, and σ(z) approaches 0.
  • If z is 0, σ(z) is 0.5.
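The three regimes above can be checked with a minimal sketch of the sigmoid function (the function name `sigmoid` is our own choice here, not part of any library):

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(10))   # very close to 1
print(sigmoid(-10))  # very close to 0
print(sigmoid(0))    # exactly 0.5
```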

2. Interactive Analysis: Decision Boundary

Let’s visualize how the parameters (slope and intercept) change the shape of the sigmoid curve and the resulting decision boundary.

[Interactive widget: Model Parameters — adjust the weight and bias to see how the probability curve changes to classify the points (0 or 1), and where the decision boundary (the x-value at which the predicted probability crosses 0.5) falls.]
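For a single feature, the decision boundary is the point where the sigmoid crosses 0.5, which happens exactly where the linear part is zero: w·x + b = 0, i.e. x = -b/w. A tiny sketch with hypothetical parameter values:

```python
# Hypothetical weight and bias, chosen only for illustration
w, b = 2.0, -1.0

# The sigmoid outputs 0.5 exactly where w*x + b = 0
boundary = -b / w
print(boundary)  # 0.5
```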

3. The Rigor: Log Loss (Cross-Entropy)

We cannot use Mean Squared Error (MSE) for Logistic Regression. Because we are passing the linear equation through a non-linear sigmoid function, plotting the MSE cost against the parameters would result in a “wavy,” non-convex surface with many local minima. Gradient Descent would likely get stuck and fail to find the global minimum.

Instead, we use a cost function called Log Loss (or Binary Cross-Entropy). It is derived from Maximum Likelihood Estimation.

For a single observation i:

  • If the true label y = 1, the cost is -log(hθ(x))
  • If the true label y = 0, the cost is -log(1 - hθ(x))

3.1 Why Log Loss Works

  • If y = 1 and our model predicts 0.99, the cost is -log(0.99), which is very close to 0. (Low penalty for being right).
  • If y = 1 and our model predicts 0.01, the cost is -log(0.01), which is a very large number. (Massive penalty for being confidently wrong).
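The asymmetry in these two bullets is easy to verify directly (variable names here are our own):

```python
import math

# y = 1 cases from the bullets above
low_penalty = -math.log(0.99)   # confident and correct
high_penalty = -math.log(0.01)  # confident and wrong
print(low_penalty)   # ≈ 0.01: almost no cost
print(high_penalty)  # ≈ 4.61: a massive cost
```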

The combined, convex cost function for the entire dataset of m examples is:

J(θ) = -(1/m) × Σ_{i=1}^{m} [y^(i) log(hθ(x^(i))) + (1 - y^(i)) log(1 - hθ(x^(i)))]

Because this function is convex (bowl-shaped), Gradient Descent is guaranteed to find the global minimum.
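The cost function above vectorizes naturally in NumPy. This is a minimal sketch (the `log_loss` name and the clipping constant `eps` are our own; the clip guards against log(0) when a predicted probability is exactly 0 or 1):

```python
import numpy as np

def log_loss(y_true, y_prob, eps=1e-15):
    """Binary cross-entropy: J = -(1/m) Σ [y log(p) + (1-y) log(1-p)]."""
    p = np.clip(y_prob, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([0, 0, 1, 1])
p = np.array([0.1, 0.2, 0.8, 0.9])
print(log_loss(y, p))  # small but nonzero: predictions are good, not perfect
```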

4. Implementation: Scikit-Learn

Python (Scikit-Learn)
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# 1. Prepare Data (e.g., hours studied vs. passed exam)
X = np.array([[1], [2], [3], [5], [6], [7]])
y = np.array([0, 0, 0, 1, 1, 1])  # 0 = Fail, 1 = Pass

# 2. Initialize Model
# By default, scikit-learn applies L2 regularization to prevent overfitting
model = LogisticRegression(penalty='l2')

# 3. Fit the model using an iterative solver (like L-BFGS)
model.fit(X, y)

# 4. View Parameters
print(f"Intercept: {model.intercept_[0]:.2f}")
print(f"Coefficient (Weight): {model.coef_[0][0]:.2f}")

# 5. Make Predictions
# .predict() automatically thresholds probabilities at 0.5
predictions = model.predict(X)
print(f"Accuracy: {accuracy_score(y, predictions):.2f}")

# 6. View Raw Probabilities
# Returns an array of shape (n_samples, n_classes)
probs = model.predict_proba(X)
print(f"Probabilities of passing: {probs[:, 1]}")
```

5. Complexity Analysis

| Phase      | Time Complexity | Space Complexity | Notes |
|------------|-----------------|------------------|-------|
| Training   | O(K × N × M)    | O(N × M)         | K = iterations, N = samples, M = features. Uses optimization algorithms like gradient descent or L-BFGS. |
| Prediction | O(M)            | O(1)             | Extremely fast: a dot product followed by a sigmoid calculation. |
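The O(M) prediction cost can be seen directly: a single prediction is one dot product over the M features plus a sigmoid. A sketch with hypothetical learned parameters (the values of `theta`, `b`, and `x` below are made up for illustration):

```python
import numpy as np

theta = np.array([0.5, -1.2, 2.0])  # hypothetical learned weights (M = 3)
b = 0.1                             # hypothetical learned intercept
x = np.array([1.0, 0.5, -0.3])      # one sample with M features

# O(M) dot product, then a constant-time sigmoid
prob = 1.0 / (1.0 + np.exp(-(x @ theta + b)))
print(prob)  # a probability strictly between 0 and 1
```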