# Problem Statement

1. Once we’ve trained multiple detection/classification models, how do we choose the best one?
2. Once we’ve chosen the best model, how do we choose the optimum operating point (the best threshold)?

# Solution

There are a few metrics that can help us here. If you’ve come across terms like AP, mAP and F1 score in research papers, these are precisely the metrics that address the problems above. Let’s begin by defining precision and recall, which are prerequisites to understanding the other metrics.

### Precision and Recall

Let’s assume that we’ve trained a car detector. A true positive (TP) is a detection that matches a real car, a false positive (FP) is a detection where there is no car, and a false negative (FN) is a real car that the detector missed.

$Precision = \frac{TP}{TP+FP}$

$Recall = \frac{TP}{TP+FN}$
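These definitions can be sketched directly from the counts; the numbers below are hypothetical counts for a car detector, not results from the text.

```python
def precision(tp: int, fp: int) -> float:
    # Precision: of everything the detector flagged as "car",
    # what fraction were real cars?
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    # Recall: of all the real cars, what fraction did the detector find?
    return tp / (tp + fn)

# Hypothetical counts: 80 correct detections, 20 spurious detections,
# 40 missed cars.
print(precision(80, 20))  # 0.8
print(recall(80, 40))     # ≈ 0.667
```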

### Average Precision (AP)

$AP = \int_0^1 P(r) \,dr$

We can use the standard sklearn package to compute this area under the precision–recall curve, or we can approximate the area numerically, as shown in the figure above.
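The integral above can be approximated as a discrete sum over sampled (recall, precision) points. A minimal sketch follows, using hypothetical PR points; in practice, `sklearn.metrics.average_precision_score` computes AP directly from labels and scores.

```python
def average_precision(recalls, precisions):
    # Approximate AP = ∫ P(r) dr as a sum of P(r) * Δr rectangles,
    # assuming the recall values are sorted in ascending order.
    ap = 0.0
    prev_r = 0.0
    for r, p in zip(recalls, precisions):
        ap += p * (r - prev_r)
        prev_r = r
    return ap

# Hypothetical precision-recall points for one class:
# precision typically falls as recall rises.
recalls    = [0.2, 0.4, 0.6, 0.8, 1.0]
precisions = [1.0, 0.9, 0.8, 0.6, 0.4]
print(average_precision(recalls, precisions))  # ≈ 0.74
```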

Mean Average Precision (mAP) is the mean of AP for all classes.

$mAP = \frac{1}{N}\sum_{i=1}^{N} AP(i)$, where $N$ is the number of classes.
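The averaging step is a one-liner; the per-class AP values below are hypothetical, for illustration only.

```python
def mean_average_precision(ap_per_class):
    # mAP: the mean of the per-class AP values.
    return sum(ap_per_class) / len(ap_per_class)

# Hypothetical APs for three classes (e.g. car, pedestrian, cyclist).
print(mean_average_precision([0.74, 0.60, 0.55]))  # ≈ 0.63
```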

### F1 score

$F1 = \frac{2 \cdot P \cdot R}{P + R}$

The threshold that maximizes F1 balances precision and recall; the operating point where the two are equal is also known as the Equal Error Rate (EER) point.
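Choosing the operating threshold then amounts to sweeping candidate thresholds and keeping the one with the highest F1. A minimal sketch, assuming hypothetical precision/recall values measured at each threshold:

```python
def f1(p: float, r: float) -> float:
    # Harmonic mean of precision and recall; 0 when both are 0.
    return 2 * p * r / (p + r) if (p + r) else 0.0

def best_threshold(thresholds, precisions, recalls):
    # Return the threshold whose (precision, recall) pair maximizes F1,
    # along with that F1 score.
    scores = [f1(p, r) for p, r in zip(precisions, recalls)]
    i = max(range(len(scores)), key=scores.__getitem__)
    return thresholds[i], scores[i]

# Hypothetical measurements: raising the threshold trades recall
# for precision.
thresholds = [0.3, 0.5, 0.7]
precisions = [0.60, 0.80, 0.95]
recalls    = [0.90, 0.75, 0.50]
print(best_threshold(thresholds, precisions, recalls))  # picks 0.5
```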

## Quick Takeaways

1. For choosing the best model among multiple variants (differing in architecture, augmentation, or training methodology), use mAP.
2. For choosing the best operating threshold, use the F1 score.