1) General Metrics

Loss: Measures the difference between predicted and actual values (e.g., cross-entropy or mean squared error). Lower is better. It is the quantity optimized during training.

Accuracy: Proportion of correct predictions among all predictions. Simple but can be misleading on imbalanced datasets.

Micro Accuracy: Computes accuracy by aggregating the individual prediction outcomes (correct vs. incorrect) across all classes or labels before dividing by the total, making it suitable for multiclass or multilabel problems.

Token Accuracy: Measures how often the predicted tokens match the true tokens. Useful in sequence prediction tasks such as NLP tagging or generation.
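
A minimal sketch of how these look in code, assuming scikit-learn and NumPy are available; the toy labels, probabilities, and token arrays below are made up purely for illustration:

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
y_prob = [0.2, 0.9, 0.4, 0.1, 0.8]  # predicted probability of class 1

print("Loss (log loss):", log_loss(y_true, y_prob))  # lower is better
print("Accuracy:", accuracy_score(y_true, y_pred))   # fraction of correct predictions

# Token accuracy: fraction of positions where the predicted token matches the
# reference token, e.g., in a sequence-labelling task.
pred_tokens = np.array([[3, 7, 7, 1], [2, 2, 5, 1]])
true_tokens = np.array([[3, 7, 4, 1], [2, 2, 5, 1]])
print("Token accuracy:", (pred_tokens == true_tokens).mean())
```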

2) Precision, Recall & Specificity

Precision: Out of all positive predictions, how many were correct. Precision = TP / (TP + FP). Helps when false positives are costly.

Recall (Sensitivity): Out of all actual positives, how many were predicted correctly. Recall = TP / (TP + FN). Important when missing positives is risky.

Specificity: True negative rate. Measures how well the model identifies negatives. Specificity = TN / (TN + FP). Useful in medical testing to avoid false alarms.
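
All three fall straight out of the binary confusion matrix. A small sketch assuming scikit-learn; the toy labels are illustrative only:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() returns the four cells in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)      # TP / (TP + FP)
recall = tp / (tp + fn)         # TP / (TP + FN), a.k.a. sensitivity
specificity = tn / (tn + fp)    # TN / (TN + FP), the true negative rate

print(precision, recall, specificity)
# Same precision and recall via the built-in helpers:
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
```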

3) Macro, Micro, and Weighted Averages

Macro Precision / Recall / F1: Averages the metric across all classes, treating each class equally regardless of class frequency. Best when class sizes are balanced or when every class, including rare ones, should count equally.

Micro Precision / Recall / F1: Aggregates TP, FP, and FN across all classes before computing the metric. Gives a global, instance-level view and is common for multilabel and class-imbalanced problems; note that frequent classes contribute the most, and for single-label multiclass data micro F1 equals accuracy.

Weighted Precision / Recall / F1: Averages the metric across classes, weighted by the number of true instances per class. Balances the importance of classes based on frequency.
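
The difference between the averaging modes is easiest to see on a deliberately imbalanced toy example; a sketch assuming scikit-learn, where the `average=` parameter selects the mode:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [0, 0, 0, 0, 1, 1, 2]   # class 0 is much more frequent than 1 and 2
y_pred = [0, 0, 1, 0, 1, 2, 2]

for avg in ("macro", "micro", "weighted"):
    p = precision_score(y_true, y_pred, average=avg, zero_division=0)
    r = recall_score(y_true, y_pred, average=avg, zero_division=0)
    f = f1_score(y_true, y_pred, average=avg, zero_division=0)
    print(f"{avg:>8}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```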

4) Average Precision (PR-AUC Variants)

Average Precision Macro: Precision-Recall AUC averaged across all classes equally. Useful for balanced multiclass problems.

Average Precision Micro: Global Precision-Recall AUC computed over all instances at once. Best for imbalanced data or multilabel classification.

Average Precision Samples: Precision-Recall AUC averaged across individual samples (not classes). Ideal for multilabel problems where each sample can belong to multiple classes.
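
A sketch of the three variants using scikit-learn's `average_precision_score` on a made-up multilabel problem, where the true labels form a binary indicator matrix and the model outputs one score per label:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Rows are samples, columns are labels (multilabel indicator matrix).
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
# Predicted scores (e.g., probabilities) for each label.
y_score = np.array([[0.8, 0.2, 0.6],
                    [0.3, 0.9, 0.1],
                    [0.7, 0.4, 0.2],
                    [0.1, 0.3, 0.9]])

for avg in ("macro", "micro", "samples"):
    print(avg, average_precision_score(y_true, y_score, average=avg))
```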

5) ROC-AUC Variants

ROC-AUC: Measures the model's ability to distinguish between classes. AUC = 1 is perfect; 0.5 is random guessing. Use for binary classification.

Macro ROC-AUC: Averages the AUC across all classes equally. Suitable when classes are balanced and of equal importance.

Micro ROC-AUC: Computes AUC from aggregated predictions across all classes. Useful in multiclass or multilabel settings with imbalance.
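
A sketch assuming scikit-learn's `roc_auc_score`: the binary case takes positive-class scores, and the macro/micro averages are shown on a made-up multilabel indicator matrix:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Binary case: scores are probabilities of the positive class.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print("Binary ROC-AUC:", roc_auc_score(y_true, y_score))

# Multilabel case: indicator matrix of true labels and one score per label.
Y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
Y_score = np.array([[0.8, 0.2, 0.6], [0.3, 0.9, 0.1], [0.7, 0.4, 0.2], [0.1, 0.3, 0.9]])
print("Macro ROC-AUC:", roc_auc_score(Y_true, Y_score, average="macro"))
print("Micro ROC-AUC:", roc_auc_score(Y_true, Y_score, average="micro"))
```

For single-label multiclass problems, `roc_auc_score` additionally needs `multi_class='ovr'` (or `'ovo'`) to be set.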

6) Ranking Metrics

Hits at K: Measures whether the true label is among the top-K predictions. Common in recommendation systems and retrieval tasks.
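
A small NumPy sketch, assuming exactly one relevant item per query; the `hits_at_k` helper and the toy score matrix are illustrative only:

```python
import numpy as np

def hits_at_k(scores: np.ndarray, true_items: np.ndarray, k: int) -> float:
    """Fraction of queries whose true item appears among the k highest-scored items."""
    top_k = np.argsort(-scores, axis=1)[:, :k]          # indices of the top-k items per query
    hits = (top_k == true_items[:, None]).any(axis=1)   # is the true item in the top k?
    return float(hits.mean())

scores = np.array([[0.1, 0.7, 0.2],    # query 0: item 1 ranked first
                   [0.5, 0.2, 0.3],    # query 1: item 0 first, item 2 second
                   [0.2, 0.3, 0.5]])   # query 2: item 2 ranked first
true_items = np.array([1, 2, 2])       # the single relevant item per query

print(hits_at_k(scores, true_items, k=1))  # 2/3: queries 0 and 2 hit at K=1
print(hits_at_k(scores, true_items, k=2))  # 3/3: query 1's item 2 enters the top 2
```

For plain classification tasks, scikit-learn's `top_k_accuracy_score` computes the same idea directly from class scores.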

7) Confusion Matrix Stats (Per Class)

True Positives / True Negatives (TP / TN): Correct predictions for positives and negatives, respectively.

False Positives / False Negatives (FP / FN): Incorrect predictions: false alarms (FP) and missed detections (FN).
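
The per-class counts can be read off a multiclass confusion matrix; a sketch assuming scikit-learn and NumPy, with toy labels for illustration:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 2, 2, 0, 1, 1, 2]
y_pred = [0, 2, 2, 2, 0, 0, 1, 1]

cm = confusion_matrix(y_true, y_pred)   # rows = true classes, columns = predicted classes

tp = np.diag(cm)                        # correct predictions per class
fp = cm.sum(axis=0) - tp                # predicted as this class but actually another class
fn = cm.sum(axis=1) - tp                # actually this class but predicted as another class
tn = cm.sum() - (tp + fp + fn)          # everything else

for cls in range(len(tp)):
    print(f"class {cls}: TP={tp[cls]} FP={fp[cls]} FN={fn[cls]} TN={tn[cls]}")
```

scikit-learn's `multilabel_confusion_matrix` returns the same per-class 2x2 counts directly.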

8) Other Useful Metrics

Cohen's Kappa: Measures agreement between predicted and actual labels, adjusted for the agreement expected by chance. Useful for multiclass classification with imbalanced labels.

Matthews Correlation Coefficient (MCC): Balanced measure of prediction quality that takes TP, TN, FP, and FN into account. Particularly effective for imbalanced datasets.
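
Both are one-line calls in scikit-learn; a minimal sketch on the same toy labels used above:

```python
from sklearn.metrics import cohen_kappa_score, matthews_corrcoef

y_true = [0, 1, 2, 2, 0, 1, 1, 2]
y_pred = [0, 2, 2, 2, 0, 0, 1, 1]

print("Cohen's kappa:", cohen_kappa_score(y_true, y_pred))  # 1 = perfect, 0 = chance-level agreement
print("MCC:", matthews_corrcoef(y_true, y_pred))            # 1 = perfect, 0 = random, -1 = inverse
```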

9) Metric Recommendations

- Use Accuracy + F1 for balanced data.
- Use Precision, Recall, and ROC-AUC for imbalanced datasets.
- Use Average Precision Micro for multilabel or class-imbalanced problems.
- Use Macro scores when all classes should be treated equally.
- Use Weighted scores when class imbalance should be accounted for without ignoring small classes.
- Use Confusion Matrix stats to analyze class-wise performance (see the report sketch after this list).
- Use Hits at K for recommendation or ranking-based tasks.
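
As a closing sketch, scikit-learn's `classification_report` bundles the per-class precision, recall, and F1 with the macro and weighted averages, covering much of the class-wise analysis recommended above in one call (toy labels for illustration):

```python
from sklearn.metrics import classification_report

y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 0, 1, 2, 2]

# Per-class precision/recall/F1/support, plus macro and weighted averages.
print(classification_report(y_true, y_pred, zero_division=0))
```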