# Machine Learning Glossary
## A
### AUC (Area Under the Curve)
Specifically, the area under the ROC curve. A scalar value between 0 and 1 measuring a classifier's overall performance. 0.5 represents a random guess, while 1.0 represents a perfect classifier.
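As an illustration, AUC can be computed without tracing the ROC curve at all: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties counting half). A minimal plain-Python sketch, with hypothetical example scores:

```python
def auc(labels, scores):
    """AUC via pairwise comparison: the fraction of (positive, negative)
    pairs where the positive scores higher (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical scores: one positive is out-ranked by a negative,
# so the classifier is better than random but not perfect.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```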
## B
### Bias
The error introduced by approximating a real-world problem, which may be complex, by a simplified model. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
## C
### Confusion Matrix
A table used to describe the performance of a classification model. It shows True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).
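The four cells can be tallied directly from paired label lists. A small sketch for binary labels (1 = positive, 0 = negative), using made-up example data:

```python
def confusion_matrix(y_true, y_pred):
    """Count the four outcome cells for binary classification."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

# Hypothetical predictions: one miss (FN) and one false alarm (FP).
print(confusion_matrix([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
# {'TP': 2, 'TN': 1, 'FP': 1, 'FN': 1}
```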
### Cross-Validation
A resampling procedure used to evaluate machine learning models on a limited data sample. The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into (e.g., K-Fold Cross-Validation).
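The splitting step of K-Fold can be sketched as follows; this version produces contiguous folds without shuffling (shuffling first is common but omitted here for clarity):

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds whose sizes differ by at most 1.
    The first n % k folds absorb the remainder."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

# 10 samples, k=3: each fold serves once as the validation set
# while the model trains on the remaining two.
print(k_fold_indices(10, 3))  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```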
## F
### F1 Score
The harmonic mean of Precision and Recall. It provides a single metric that balances both concerns, useful when you need to take both false positives and false negatives into account. Formula: 2 * (Precision * Recall) / (Precision + Recall).
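The formula above translates directly to code. Note how the harmonic mean punishes imbalance: a model with perfect recall but only 50% precision scores well below the arithmetic mean of 0.75.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.5, 1.0))  # 0.666... (not 0.75)
```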
## M
### MSE (Mean Squared Error)
A regression metric that measures the average squared difference between the estimated values and the actual values. Because errors are squared, it penalizes large errors much more heavily than small ones.
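A one-line sketch, with hypothetical example values:

```python
def mse(y_true, y_pred):
    """Mean of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Residuals are 0.5, -0.5, 0, -1 -> squares 0.25, 0.25, 0, 1 -> mean 0.375.
print(mse([3, -0.5, 2, 7], [2.5, 0.0, 2, 8]))  # 0.375
```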
## O
### Overfitting
A modeling error that occurs when a function is too closely fit to a limited set of data points. An overfitted model learns the noise in the training data rather than the underlying pattern, leading to poor performance on new data (high variance).
## P
### Precision
The ratio of correctly predicted positive observations to the total predicted positive observations. Formula: TP / (TP + FP). It answers: "Of all the instances predicted as positive, how many were actually positive?"
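Counting TP and FP directly from label lists gives a minimal sketch (1 = positive):

```python
def precision(y_true, y_pred):
    """TP / (TP + FP): share of positive predictions that were right."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fp)

# Hypothetical predictions: 3 predicted positive, 2 actually positive.
print(precision([1, 0, 1, 0], [1, 1, 1, 0]))  # 0.666...
```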
## R
### Recall (Sensitivity)
The ratio of correctly predicted positive observations to all observations in the actual positive class. Formula: TP / (TP + FN). It answers: "Of all the actual positive instances, how many did we correctly identify?"
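The mirror image of the precision sketch, counting misses (FN) instead of false alarms:

```python
def recall(y_true, y_pred):
    """TP / (TP + FN): share of actual positives we caught."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn)

# Hypothetical predictions: 3 actual positives, 2 caught, 1 missed.
print(recall([1, 1, 1, 0], [1, 0, 1, 0]))  # 0.666...
```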
### RMSE (Root Mean Squared Error)
The square root of the Mean Squared Error. It measures the standard deviation of the residuals (prediction errors). It is in the same units as the target variable.
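As a sketch, RMSE is simply the square root applied after the MSE computation:

```python
import math

def rmse(y_true, y_pred):
    """Square root of the mean squared error."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

# Every prediction is off by exactly 1 unit, so RMSE is 1.0
# in the target's own units.
print(rmse([1, 2, 3], [2, 3, 4]))  # 1.0
```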
### ROC (Receiver Operating Characteristic) Curve
A graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. It plots the True Positive Rate (Recall) against the False Positive Rate.
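Sweeping the threshold can be sketched as follows: at each distinct score, everything at or above the threshold is predicted positive, yielding one (FPR, TPR) point. (Plotting these points is omitted; only the coordinates are computed.)

```python
def roc_points(labels, scores):
    """(FPR, TPR) pairs, one per distinct threshold, highest first."""
    p = sum(labels)           # number of actual positives
    n = len(labels) - p       # number of actual negatives
    points = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(y == 1 and s >= thr for y, s in zip(labels, scores))
        fp = sum(y == 0 and s >= thr for y, s in zip(labels, scores))
        points.append((fp / n, tp / p))
    return points

# Hypothetical scores: the curve steps up toward (0, 1) where possible.
print(roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
# [(0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
```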
### R-squared (Coefficient of Determination)
A statistical measure that represents the proportion of the variance for a dependent variable that's explained by an independent variable or variables in a regression model.
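The standard computation is 1 minus the ratio of residual to total sum of squares; a sketch with hypothetical data:

```python
def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot: variance explained by the model."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y = [1, 2, 3, 4, 5]
print(r_squared(y, y))                # 1.0 (perfect fit)
print(r_squared(y, [3, 3, 3, 3, 3])) # 0.0 (no better than the mean)
```

Always predicting the mean of the targets scores exactly 0; a model can score below 0 if it fits worse than that baseline.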
## U
### Underfitting
A modeling error that occurs when a model is too simple to capture the underlying structure of the data. An underfitted model performs poorly on both training and testing data (high bias).
## V
### Variance
The amount by which the estimate of the target function changes if different training data was used. High variance indicates that the model is sensitive to small fluctuations in the training set (overfitting).