1) General Metrics
Loss: Measures the discrepancy between predicted and actual values (e.g., cross-entropy for classification). Lower is better; it is typically the quantity optimized during training.
Accuracy: Proportion of correct predictions among all predictions. Simple but can be misleading for imbalanced datasets.
Micro Accuracy: Computes accuracy globally by aggregating the per-class prediction counts (TP, FP, FN, TN) across all classes before dividing, rather than averaging per-class accuracies. Suitable for multiclass or multilabel problems; in the single-label multiclass case it reduces to plain accuracy.
Token Accuracy: Measures how often predicted tokens in a sequence match the reference tokens. Useful in sequence prediction tasks such as those in NLP.
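A minimal sketch of these general metrics, assuming scikit-learn and NumPy; the arrays and token lists below are made-up placeholders, not from any specific dataset:

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss

# Toy binary-classification example (placeholder data)
y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.4, 0.1])   # predicted probability of class 1
y_pred = (y_prob >= 0.5).astype(int)           # hard predictions at a 0.5 threshold

print("Loss (cross-entropy):", log_loss(y_true, y_prob))    # lower is better
print("Accuracy:", accuracy_score(y_true, y_pred))          # fraction of correct predictions

# Token accuracy: fraction of positions where predicted tokens match reference tokens
ref_tokens  = ["the", "cat", "sat", "on", "the", "mat"]
pred_tokens = ["the", "cat", "sat", "in", "the", "mat"]
token_acc = np.mean([r == p for r, p in zip(ref_tokens, pred_tokens)])
print("Token accuracy:", token_acc)
```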
2) Precision, Recall & Specificity
Precision: Out of all positive predictions, how many were correct. Precision = TP / (TP + FP). Helps when false positives are costly.
Recall (Sensitivity): Out of all actual positives, how many were predicted correctly. Recall = TP / (TP + FN). Important when missing positives is risky.
Specificity: True negative rate. Measures how well the model identifies negatives. Specificity = TN / (TN + FP). Useful in medical testing to avoid false alarms.
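All three rates can be read off a binary confusion matrix. A rough sketch, assuming scikit-learn and toy placeholder labels:

```python
from sklearn.metrics import confusion_matrix

# Toy binary labels (placeholder data)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels {0, 1}, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision   = tp / (tp + fp)   # TP / (TP + FP)
recall      = tp / (tp + fn)   # TP / (TP + FN), a.k.a. sensitivity
specificity = tn / (tn + fp)   # TN / (TN + FP), the true negative rate

print(precision, recall, specificity)
```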
3) Macro, Micro, and Weighted Averages
Macro Precision / Recall / F1: Averages the metric across all classes, treating each class equally, regardless of class frequency. Best when class sizes are balanced.
Micro Precision / Recall / F1: Aggregates TP, FP, and FN across all classes before computing the metric, giving a single global, instance-level score. Often used for multilabel or class-imbalanced problems, with the caveat that frequent classes dominate the result.
Weighted Precision / Recall / F1: Averages each metric across classes, weighted by the number of true instances per class. Balances importance of classes based on frequency.
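These averaging modes map directly to the `average=` parameter of scikit-learn's scoring functions; a minimal sketch on made-up, imbalanced multiclass labels:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy 3-class problem with imbalanced labels (placeholder data)
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 0, 1, 2, 2, 2, 0, 2]

for avg in ("macro", "micro", "weighted"):
    p = precision_score(y_true, y_pred, average=avg, zero_division=0)
    r = recall_score(y_true, y_pred, average=avg, zero_division=0)
    f = f1_score(y_true, y_pred, average=avg, zero_division=0)
    print(f"{avg:>8}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```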
4) Average Precision (PR-AUC Variants)
Average Precision Macro: Precision-Recall AUC averaged across all classes equally. Useful for balanced multi-class problems.
Average Precision Micro: Global Precision-Recall AUC using all instances. Best for imbalanced data or multi-label classification.
Average Precision Samples: Precision-Recall AUC averaged across individual samples (not classes). Ideal for multi-label problems where each sample can belong to multiple classes.
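scikit-learn's `average_precision_score` exposes these variants through its `average` argument. A sketch on a small made-up multilabel problem (the binary indicator targets and per-label scores are placeholders):

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Toy multilabel problem: 4 samples, 3 labels (placeholder data)
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_score = np.array([[0.8, 0.2, 0.7],
                    [0.1, 0.9, 0.3],
                    [0.6, 0.7, 0.2],
                    [0.3, 0.4, 0.9]])

for avg in ("macro", "micro", "samples"):
    ap = average_precision_score(y_true, y_score, average=avg)
    print(f"Average precision ({avg}): {ap:.3f}")
```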
5) ROC-AUC Variants
ROC-AUC: Measures the model's ability to distinguish between classes across all decision thresholds. AUC = 1 is perfect; 0.5 is equivalent to random guessing. Primarily used for binary classification.
Macro ROC-AUC: Averages the AUC across all classes equally. Suitable when classes are balanced and of equal importance.
Micro ROC-AUC: Computes AUC from aggregated predictions across all classes. Useful in multiclass or multilabel settings with imbalance.
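A sketch of the binary, macro, and micro variants using scikit-learn's `roc_auc_score` on the same kind of multilabel indicator data (scikit-learn also handles multiclass AUC via `multi_class='ovr'`); the arrays are placeholders:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy multilabel targets and per-label scores (placeholder data)
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_score = np.array([[0.8, 0.2, 0.7],
                    [0.1, 0.9, 0.3],
                    [0.6, 0.7, 0.2],
                    [0.3, 0.4, 0.9]])

# Binary case: one column of scores against binary labels
print("Binary ROC-AUC:", roc_auc_score(y_true[:, 0], y_score[:, 0]))

# Macro: per-label AUCs averaged equally; micro: one AUC over all label decisions pooled together
print("Macro ROC-AUC:", roc_auc_score(y_true, y_score, average="macro"))
print("Micro ROC-AUC:", roc_auc_score(y_true, y_score, average="micro"))
```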
6) Ranking Metrics
Hits at K: Measures whether the true label is among the top-K predictions. Common in recommendation systems and retrieval tasks.
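There is no single canonical library helper for Hits at K; the NumPy sketch below shows one common formulation, where each sample has a score for every candidate item. The function name `hits_at_k` and all data are made up for illustration:

```python
import numpy as np

def hits_at_k(y_true, y_score, k):
    """Fraction of samples whose true label is among the top-k scored candidates."""
    # Indices of the k highest-scoring candidates per sample
    top_k = np.argsort(-y_score, axis=1)[:, :k]
    return np.mean([label in row for label, row in zip(y_true, top_k)])

# Toy example: 3 samples, 5 candidate items (placeholder scores)
y_true = np.array([2, 0, 4])
y_score = np.array([[0.1, 0.3, 0.4, 0.1, 0.1],
                    [0.2, 0.5, 0.1, 0.1, 0.1],
                    [0.1, 0.1, 0.2, 0.3, 0.3]])

print("Hits@1:", hits_at_k(y_true, y_score, 1))   # true label must be ranked first
print("Hits@3:", hits_at_k(y_true, y_score, 3))   # true label anywhere in the top 3
```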
7) Confusion Matrix Stats (Per Class)
True Positives / Negatives (TP / TN): Correct predictions for positives and negatives respectively.
False Positives / Negatives (FP / FN): Incorrect predictions; false alarms (FP) and missed detections (FN), respectively.
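A sketch of extracting per-class TP / FP / FN / TN counts from a multiclass confusion matrix, assuming scikit-learn; the toy labels are placeholders:

```python
from sklearn.metrics import confusion_matrix

# Toy 3-class labels (placeholder data)
y_true = [0, 0, 1, 1, 2, 2, 2, 0]
y_pred = [0, 1, 1, 1, 2, 0, 2, 0]

cm = confusion_matrix(y_true, y_pred)   # rows = true class, columns = predicted class

for c in range(cm.shape[0]):
    tp = cm[c, c]                       # correctly predicted as class c
    fp = cm[:, c].sum() - tp            # other classes predicted as c (false alarms)
    fn = cm[c, :].sum() - tp            # class c predicted as something else (misses)
    tn = cm.sum() - tp - fp - fn        # everything else
    print(f"class {c}: TP={tp} FP={fp} FN={fn} TN={tn}")
```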
8) Other Useful Metrics
Cohen's Kappa: Measures agreement between predicted and actual labels, corrected for the agreement expected by chance. Useful for multiclass classification with imbalanced labels.
Matthews Correlation Coefficient (MCC): Balanced measure of prediction quality that takes into account TP, TN, FP, and FN. Particularly effective for imbalanced datasets.
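Both are one-liners in scikit-learn; a minimal sketch on made-up, imbalanced binary labels:

```python
from sklearn.metrics import cohen_kappa_score, matthews_corrcoef

# Toy imbalanced binary labels (placeholder data)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 0, 1, 0, 1, 0, 1, 0]

# Both range roughly from -1 to 1; 1 is perfect agreement, 0 is chance-level
print("Cohen's kappa:", cohen_kappa_score(y_true, y_pred))
print("MCC:", matthews_corrcoef(y_true, y_pred))
```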
9) Metric Recommendations
- Use Accuracy + F1 for balanced data.
- Use Precision, Recall, ROC-AUC for imbalanced datasets.
- Use Average Precision Micro for multilabel or class-imbalanced problems.
- Use Macro scores when all classes should be treated equally.
- Use Weighted scores when class imbalance should be accounted for without ignoring small classes.
- Use Confusion Matrix stats to analyze class-wise performance.
- Use Hits at K for recommendation or ranking-based tasks.