Recently I have been reading a lot about evaluation metrics in information retrieval for Recommender Systems and I have discovered (with great surprise) that there is no general consensus over the definition of some of these metrics. I am obviously not the first one to notice this, as demonstrated by a full tutorial at ACM RecSys 2015 discussing this issue.
In the past few weeks I have spent some time going through this maze of definitions. My hope is that you won’t have to do the same after reading this post.
Here I will present a summary of what I have discovered during this time. The material is inspired by several sources, which are detailed in the “References” section at the end of this post. I will cite these sources as we proceed, while also identifying edge scenarios, providing test cases and giving example Scala code.
We will be looking at six popular metrics: Precision, Recall, F1-measure, Average Precision (and its mean, MAP), Reciprocal Rank (and its mean, MRR) and Normalized Discounted Cumulative Gain (NDCG).
Before starting, it is useful to write down a few definitions. Let us first assume that there are $U$ users. Each user $i$ ($i=1,\dots,U$) receives a set $R_i$ of recommendations of dimension $\rho_i$ (the predictions vector). Out of all these recommendations, only a fraction will be actually useful (relevant) to the user. The set of all relevant items represents the “ground truth” or labels vector. Here it will be represented by the vector $G_i$, with dimension $\gamma_i$. In summary,
$$\begin{eqnarray} R_i = \{r_{i1},\dots,r_{i\rho_i}\} \\ G_i = \{g_{i1},\dots,g_{i\gamma_i}\} \nonumber \end{eqnarray}$$
Whether a recommended document $r_{ij}\in R_i$ is relevant or not, i.e. whether or not it belongs to the ground truth set $G_i$, is denoted by an indicator function $\textrm{rel}_{ij}$ (“relevance”)
$$\textrm{rel}_{ij} = \left\{
\begin{eqnarray}
1 \ \ \textrm{if} \ r_{ij} \in G_i\nonumber \\
0 \ \ \textrm{otherwise}
\nonumber
\end{eqnarray}
\right.$$
Also, as usual I will denote with TP$_i$ the number of true positives for user $i$ (i.e., the number of recommended items that are indeed relevant) and with FP$_i$ the number of false positives (i.e., the number of recommended items that are not relevant).
At this point we have all the ingredients needed to properly define our metrics. For each metric, we will compute the results for the five test cases shown below. These will be our “unit tests”, in that any code that implements these metrics (in the way defined here) should recover the same results.
User Test Cases
User#1.
$$\begin{equation} \textrm{Labels} = \{1,2,3,4,5,6\} \\ \textrm{Predictions} = \{1,6,8\} \nonumber \end{equation}$$
For this user, the dimension of the label vector is larger than the dimension of the prediction vector.
User#2.
$$\begin{equation} \textrm{Labels} = \{2,4,6\} \\ \textrm{Predictions} = \{1,2,3,4,5\} \nonumber \end{equation}$$
For this user, the dimension of the label vector is smaller than the dimension of the prediction vector.
User#3.
$$\begin{equation} \textrm{Labels} = \{2,4,6\} \\ \textrm{Predictions} = \{\} \nonumber \end{equation}$$
For this user, there are no recommendations available.
User#4.
$$\begin{equation} \textrm{Labels} = \{\} \\ \textrm{Predictions} = \{1,2,3,4\} \nonumber \end{equation}$$
For this user, there are no labels available.
User#5.
$$\begin{equation} \textrm{Labels} = \{\} \\ \textrm{Predictions} = \{\} \nonumber \end{equation}$$
For this user, there are no labels and no predictions available.
1. Precision@k
Precision is the probability that a predicted item is relevant,
$$\begin{equation}
\textrm{P}_i@k = \frac{\textrm{TP}_i}{\textrm{TP}_i + \textrm{FP}_i} = \frac{\sum_{j=1}^{\min\{k,\rho_i\}}\textrm{rel}_{ij}}{k}
\label{eq:precision}
\end{equation}$$
1.1 Edge cases
1. There are fewer recommended documents than the value $k$ (in other words, $\rho_i<k$). The definition above takes care of this scenario by using the value $\min\{k,\rho_i\}$ in the summation.
2. There are no recommended documents. Expected behavior: $\textrm{P}_i@k$ should return zero.
3. The vector of labels is empty. Expected behavior: $\textrm{P}_i@k$ should return NaN.
1.2 Mean Precision@k
This is simply the user-averaged value of the Precision@k,
$$\textrm{P}@k = \frac{\sum_{i=1}^U\textrm{P}_i@k}{U}$$
Edge case. Some precision values are NaNs. Expected behavior: the mean should exclude NaN values (i.e., average only over users with a non-empty label vector).
1.3 Sources of confusion
There is not much to say about the general definition of Precision@k; there is really little room for misunderstanding here. However, the devil lies in the details, and in particular in the edge cases. When calculating the Precision@k for a user that has no labels (case 3 in the “edge cases” subsection above), should the metric return zero or NaN? Currently the choice of Spark MLlib is to return zero and a warning, as shown in the method precisionAt in the Scala code here (commit f830bb9170f6b853565d9dd30ca7418b93a54fe3 of 29 Nov 2016). This means that users for which no labels are available count as if the recommender did not retrieve any relevant document for them (precision=0). For this reason, I would personally rather use a NaN value, in the sense of simply discarding cases where the label vector is empty. In any case, this choice should not be critical, as hopefully only a negligible fraction of the label vectors will be empty.
1.4 Application to User Test Cases
Using the definition in eq. (\ref{eq:precision}), here are the precision values for the test cases above,
User Test  Precision@1  Precision@3  Precision@5 

#1  1/1=1  (1+1+0)/3=2/3=0.667  (1+1+0)/5=2/5=0.400 
#2  0/1=0  (0+1+0)/3=1/3=0.333  (0+1+0+1+0)/5=2/5=0.400 
#3  0  0  0 
#4  NaN  NaN  NaN 
#5  NaN  NaN  NaN 
Mean Precision  (1+0+0)/3=1/3=0.333  (2/3+1/3+0)/3=1/3=0.333  (2/5+2/5+0)/3=4/15=0.267 
1.5 Scala worksheet
This code returns the same values shown in the subsection “Application to User Test Cases” above.
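The original interactive worksheet is not reproduced here; below is a minimal, self-contained sketch of the same logic (the function names are my own, not Spark MLlib's), using the NaN convention for empty label vectors discussed above.

```scala
// Precision@k following this post's conventions:
// empty label vector -> NaN, empty prediction vector -> 0.
def precisionAtK(labels: Set[Int], predictions: Seq[Int], k: Int): Double =
  if (labels.isEmpty) Double.NaN
  else predictions.take(k).count(labels.contains).toDouble / k

// Mean over users, excluding NaN values (users with empty label vectors).
def meanExcludingNaN(values: Seq[Double]): Double = {
  val valid = values.filterNot(_.isNaN)
  if (valid.isEmpty) Double.NaN else valid.sum / valid.size
}

// The five user test cases from this post: (labels, predictions).
val users: Seq[(Set[Int], Seq[Int])] = Seq(
  (Set(1, 2, 3, 4, 5, 6), Seq(1, 6, 8)),       // User #1
  (Set(2, 4, 6),          Seq(1, 2, 3, 4, 5)), // User #2
  (Set(2, 4, 6),          Seq.empty),          // User #3
  (Set.empty,             Seq(1, 2, 3, 4)),    // User #4
  (Set.empty,             Seq.empty)           // User #5
)

for (k <- Seq(1, 3, 5)) {
  val mean = meanExcludingNaN(users.map { case (g, r) => precisionAtK(g, r, k) })
  println(f"Mean Precision@$k = $mean%.3f")
}
```

Running this reproduces the Mean Precision row of the table above.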


2. Recall@k
Recall is the probability that a relevant item (i.e., an item belonging to the labels vector) is recommended,
$$\begin{equation}
\textrm{R}_i@k = \frac{\textrm{TP}_i}{\textrm{TP}_i + \textrm{FN}_i} = \frac{\sum_{j=1}^{\min\{k,\rho_i\}}\textrm{rel}_{ij}}{\gamma_i}
\label{eq:recall}
\end{equation}$$
where FN$_i$ denotes the number of false negatives for user $i$ (relevant documents that are not recommended) and, as before, $\gamma_i$ is the total number of relevant documents (labels).
2.1 Edge cases
(These are the same as for the Precision@k metric.)
1. There are fewer recommended documents than the value $k$ (in other words, $\rho_i<k$). The definition above takes care of this scenario by using the value $\min\{k,\rho_i\}$ in the summation.
2. There are no recommended documents. Expected behavior: $\textrm{R}_i@k$ should return zero.
3. The vector of labels is empty. Expected behavior: $\textrm{R}_i@k$ should return NaN.
2.2 Mean Recall@k
This is simply the user-averaged value of the Recall@k,
$$\textrm{R}@k = \frac{\sum_{i=1}^U\textrm{R}_i@k}{U}$$
Edge case. Some recall values are NaNs. Expected behavior: it should exclude NaN values.
2.3 Sources of confusion
As for the Precision@k metric discussed above, there is potential confusion arising from the scenario in which the label vector is empty. Actually, MLlib does not seem to support recall following the same pattern as the method precisionAt in RankingMetrics.scala, as there is no equivalent recallAt method. A recall metric is defined in MulticlassMetrics.scala via the method recall, but this does not seem to handle the case in which the label vector is empty.
2.4 Application to User Test Cases
Using the definition in eq. (\ref{eq:recall}), here are the recall values for the test cases above,
User Test  Recall@1  Recall@3  Recall@5 

#1  1/6=0.167  (1+1+0)/6=1/3=0.333  (1+1+0)/6=1/3=0.333 
#2  0/3=0  (0+1+0)/3=1/3=0.333  (0+1+0+1+0)/3=2/3=0.667 
#3  0  0  0 
#4  NaN  NaN  NaN 
#5  NaN  NaN  NaN 
Mean Recall  (1/6+0+0)/3=1/18=0.056  (1/3+1/3+0)/3=2/9=0.222  (1/3+2/3+0)/3=1/3=0.333 
2.5 Scala worksheet
This code returns the same values shown in the subsection “Application to User Test Cases” above.
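As above, here is a minimal self-contained sketch of eq. (\ref{eq:recall}) with the same NaN convention (function names my own, not MLlib's).

```scala
// Recall@k following this post's conventions:
// empty label vector -> NaN, empty prediction vector -> 0.
def recallAtK(labels: Set[Int], predictions: Seq[Int], k: Int): Double =
  if (labels.isEmpty) Double.NaN
  else predictions.take(k).count(labels.contains).toDouble / labels.size

// User #1: labels {1,2,3,4,5,6}, predictions {1,6,8} -> Recall@1 = 1/6
println(recallAtK(Set(1, 2, 3, 4, 5, 6), Seq(1, 6, 8), 1))
// User #2: labels {2,4,6}, predictions {1,2,3,4,5} -> Recall@5 = 2/3
println(recallAtK(Set(2, 4, 6), Seq(1, 2, 3, 4, 5), 5))
// User #4: empty label vector -> NaN
println(recallAtK(Set.empty[Int], Seq(1, 2, 3, 4), 5))
```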


3. F1-score@k
The F1-score aims at capturing precision and recall in a single metric, and is usually defined as the harmonic mean between them,
$$\begin{equation}
\textrm{F1}_i@k = 2\frac{\textrm{P}_i@k \cdot \textrm{R}_i@k}{\textrm{P}_i@k + \textrm{R}_i@k}
\label{eq:F1score}
\end{equation}$$
3.1 Edge cases
In addition to the edge cases inherited from the definitions of Precision and Recall, I would add this one: if $\textrm{P}_i@k=0$ and $\textrm{R}_i@k=0$, then $\textrm{F1}_i@k=0$ (otherwise the harmonic mean would be undefined).
3.2 Mean F1-score@k
This is simply the user-averaged value of the F1-score@k,
$$\textrm{F1}@k = \frac{\sum_{i=1}^U\textrm{F1}_i@k}{U}$$
3.3 Sources of confusion
The F1-score really just inherits the “confusion” coming from the Precision and Recall metrics, so there is not much to add here.
3.4 Application to User Test Cases
Using the definition in eq. (\ref{eq:F1score}), here are the F1-scores for the test cases above,
User Test  F1@1  F1@3  F1@5 

#1  2/6/(1+1/6)=2/7=0.286  4/9/(2/3+1/3)=4/9=0.444  4/15/(2/5+1/3)=4/11=0.364 
#2  0  2/9/(1/3+1/3)=1/3=0.333  8/15/(2/3+2/5)=1/2=0.500 
#3  0  0  0 
#4  NaN  NaN  NaN 
#5  NaN  NaN  NaN 
Mean F1-score  (2/7+0+0)/3=2/21=0.095  (4/9+1/3+0)/3=7/27=0.259  (4/11+1/2+0)/3=19/66=0.288 
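There is no worksheet for this section in the original post, but a sketch is easy to assemble from the two previous metrics (again, the function names are my own, with the edge case $\textrm{P}_i@k=\textrm{R}_i@k=0 \Rightarrow \textrm{F1}_i@k=0$ handled explicitly):

```scala
def precisionAtK(labels: Set[Int], predictions: Seq[Int], k: Int): Double =
  if (labels.isEmpty) Double.NaN
  else predictions.take(k).count(labels.contains).toDouble / k

def recallAtK(labels: Set[Int], predictions: Seq[Int], k: Int): Double =
  if (labels.isEmpty) Double.NaN
  else predictions.take(k).count(labels.contains).toDouble / labels.size

// F1@k as the harmonic mean of Precision@k and Recall@k,
// with the 0/0 edge case mapped to 0 and NaNs propagated.
def f1AtK(labels: Set[Int], predictions: Seq[Int], k: Int): Double = {
  val p = precisionAtK(labels, predictions, k)
  val r = recallAtK(labels, predictions, k)
  if (p.isNaN || r.isNaN) Double.NaN
  else if (p + r == 0.0) 0.0
  else 2 * p * r / (p + r)
}

// User #1: F1@1 = 2*(1)*(1/6)/(1+1/6) = 2/7
println(f1AtK(Set(1, 2, 3, 4, 5, 6), Seq(1, 6, 8), 1))
```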
4. Average Precision@k
This metric represents the average of the Precision@j values computed at each position $j$ (up to $k$) where a relevant item appears,
$$\begin{equation}
\textrm{AP}_i@k = \frac{\sum_{j=1}^{\min\{k,\rho_i\}}\textrm{rel}_{ij} \textrm{P}_i@j}{\sum_{j=1}^{\min\{k,\rho_i\}}\textrm{rel}_{ij}}
\label{eq:ap}
\end{equation}$$
Compared to the Precision, the Average Precision biases the result towards the top ranks. It is designed to work for binary outcomes (in this sense, the NDCG defined at the end of this post is more general, as it also takes into account non-binary relevance).
4.1 Edge cases
1. There are fewer recommended documents than the value $k$ (in other words, $\rho_i<k$). The definition above takes care of this scenario by using the value $\min\{k,\rho_i\}$ in the summation.
2. There are fewer labels available than the value $k$ (in other words, $\gamma_i<k$). The definition above handles this automatically, since the denominator counts only the relevant items actually recommended, which can never exceed $\gamma_i$.
3. There are no recommended documents. Expected behavior: $\textrm{AP}_i@k$ should return zero.
4. The vector of labels is empty. Expected behavior: $\textrm{AP}_i@k$ should return NaN.
4.2 Mean Average Precision@k (MAP@k)
This is simply the average of the Average Precision@k for all users,
$$\textrm{MAP}@k = \frac{\sum_{i=1}^U\textrm{AP}_i@k}{U}$$
4.3 Sources of confusion
Here is complete chaos. The problem is that the denominator in eq. (\ref{eq:ap}) is not the only possible option. We shall see why. The usual formal way in which the average precision is defined is as the area under the precision-recall curve, or
$$\begin{equation} \textrm{AP} = \int P\,\textrm{d}R \simeq \sum_j P_{@j}\,(R_{@j}-R_{@j-1}) = \sum_j P_{@j}\,\Delta R_{@j} \end{equation}$$
where P and R stand for Precision and Recall. Note that $\Delta R_{@j} = 0$ when the recall does not change between two consecutive retrieved documents (i.e., when the j-th recommended document is not relevant). If the recall does change, it is because the j-th document is relevant, and in this case $\Delta R_{@j} = 1/(\textrm{number of relevant documents})$. The expression above can thus be rewritten as a sum over only the relevant documents, divided by a generic “total number of relevant documents”,
$$\begin{equation} AP = \frac{\sum_j P_{@j}\textrm{rel}_j}{\textrm{number of relevant documents}}\ . \end{equation}$$
Here is the crux of the matter: what is the number of relevant documents (the ground truth)? There are a few options here:
1. The ground truth is only represented by the relevant items that are recommended up to the kth document. This leads to the denominator being the one used in equation (\ref{eq:ap}). The bottom line is that with this definition, the Average Precision @k represents the average of the relevant recommendations, up to the kth recommendation. This definition is consistent with several sources, including papers such as this one and this one, the Coursera course from the University of Minnesota and this video from Victor Lavrenko.
2. The ground truth is represented by the number of items that are recommended up to a maximum of $k$. This leads to the denominator being $\min\{k,\rho_i\}$, as for example used in this Kaggle definition.
3. The ground truth is represented by the actual ground truth vector $G_i$. This leads to the denominator being $\gamma_i$. It is the choice made by MLlib, as can be appreciated by looking at the MLlib documentation as well as the function meanAveragePrecision in the Scala source code (commit f830bb9170f6b853565d9dd30ca7418b93a54fe3 of 29 Nov 2016). I don’t agree with this definition: in real Recommender Systems applications each user may have a largely different number of ground truth recommendations, so the Average Precision defined this way seems over-dependent on something other than the precision and ranking values.
4.4 Application to User Test Cases
Using the definition in eq. (\ref{eq:ap}), here are the Average Precision scores for the test cases above,
User Test  AP@1  AP@3  AP@5 

#1  1  (1+1+0)/2=1.000  (1+1+0)/2=1.000 
#2  0  (0+1/2+0)/1=1/2=0.500  (0+1/2+0+1/2+0)/2=1/2=0.500 
#3  0  0  0 
#4  NaN  NaN  NaN 
#5  NaN  NaN  NaN 
MAP  (1+0+0)/3=1/3=0.333  (1+1/2+0)/3=1/2=0.500  (1+1/2+0)/3=1/2=0.500 
4.5 Scala worksheet
This code returns the same values shown in the subsection “Application to User Test Cases” above.
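Again, a minimal self-contained sketch of eq. (\ref{eq:ap}) (function names my own; the denominator counts the relevant items retrieved within the top k, i.e. option 1 above):

```scala
// Average Precision@k: the mean of Precision@j at the positions j
// where a relevant item appears within the top k.
def averagePrecisionAtK(labels: Set[Int], predictions: Seq[Int], k: Int): Double =
  if (labels.isEmpty) Double.NaN
  else {
    var hits = 0
    var accum = 0.0
    for ((p, j) <- predictions.take(k).zipWithIndex if labels.contains(p)) {
      hits += 1
      accum += hits.toDouble / (j + 1) // Precision@(j+1) at a relevant position
    }
    if (hits == 0) 0.0 else accum / hits
  }

// User #2: labels {2,4,6}, predictions {1,2,3,4,5} -> AP@5 = (1/2 + 2/4)/2 = 1/2
println(averagePrecisionAtK(Set(2, 4, 6), Seq(1, 2, 3, 4, 5), 5))
```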


5. Reciprocal Rank@k
The reciprocal rank of a set of recommendations served to user $i$ is defined as the inverse of the position (rank) of the first relevant document among the first k recommendations,
$$\begin{equation}
\textrm{RR}_i@k = \frac{1}{\textrm{rank}_{i,1}}
\nonumber
\end{equation}$$
Because of this definition, this metric is mostly suited to cases in which it is the top-ranked result that matters the most.
5.1 Edge case
If there are no relevant documents within the set of recommendations, it is assumed that $\textrm{rank}_{i,1}=\infty$ and so $\textrm{RR}_i@k=0$.
5.2 Mean Reciprocal Rank@k (MRR@k)
This is simply the user-averaged value of the Reciprocal Rank@k,
$$\begin{equation} \textrm{MRR}@k = \frac{\sum_{i=1}^{U}\textrm{RR}_i@k}{U} \label{eq:mrr} \end{equation}$$
5.3 Sources of confusion
Here the main source of confusion comes from the fact that the typical definition of the MRR is for search queries (see for example Wikipedia),
$$\begin{equation} \textrm{MRR} = \frac{1}{Q}\sum_{i=1}^{Q}\frac{1}{\textrm{rank}_i} \end{equation}$$
where $Q$ is the total number of queries. The temptation (as I have seen, for example, in a project cloned from MLlib, not the official MLlib release) is therefore to replace queries with recommendations for a single user. In this wrong interpretation, the summation would run over all the relevant documents for a single user, and the resulting quantity would then need to be averaged again over all the users to give the final result. Instead, the correct way of translating this formula to Recommender Systems is to replace queries with users. The summation becomes over the users, and each term simply represents the reciprocal rank of the top relevant document for that user.
5.4 Application to User Test Cases
Using the definition in eq. (\ref{eq:mrr}), here are the Reciprocal Rank scores for the test cases above,
User Test  RR@1  RR@3  RR@5 

#1  1  1  1 
#2  0  1/2=0.500  1/2=0.500 
#3  0  0  0 
#4  NaN  NaN  NaN 
#5  NaN  NaN  NaN 
MRR  (1+0+0)/3=1/3=0.333  (1+1/2+0)/3=1/2=0.500  (1+1/2+0)/3=1/2=0.500 
5.5 Scala worksheet
This code returns the same values shown in the subsection “Application to User Test Cases” above.
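A minimal self-contained sketch of the Reciprocal Rank@k (function names my own):

```scala
// Reciprocal Rank@k: inverse position of the first relevant item
// within the top k; 0 if none is relevant (rank = infinity).
def reciprocalRankAtK(labels: Set[Int], predictions: Seq[Int], k: Int): Double =
  if (labels.isEmpty) Double.NaN
  else {
    val idx = predictions.take(k).indexWhere(labels.contains)
    if (idx < 0) 0.0 else 1.0 / (idx + 1)
  }

// User #2: the first relevant item (2) is in second position -> RR@3 = 1/2
println(reciprocalRankAtK(Set(2, 4, 6), Seq(1, 2, 3, 4, 5), 3))
```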


6. Normalized Discounted Cumulative Gain@k
The Normalized Discounted Cumulative Gain (NDCG) is similar to MAP, except that NDCG also works for non-binary relevance. NDCG has a heavier tail at high ranks, which means that it does not discount lower ranks as much as MAP does. For this reason, some practitioners prefer MAP to NDCG (for binary outcomes).
To define NDCG, we first need to define the discounted cumulative gain (DCG)
$$\begin{equation}
\textrm{DCG}_i@k = \sum_{j=1}^{\min\{k,\rho_i\}}\frac{2^{\textrm{rel}_{ij}}-1}{\ln (j+1)}
\label{eq:dcg}
\end{equation}$$
and then the Ideal Discounted Cumulative Gain (IDCG)
$$\begin{equation}
\textrm{IDCG}_i@k = \sum_{j=1}^{\min\{k,\rho_i\}}\frac{2^{\textrm{SORT}(\textrm{rel}_i)_j}-1}{\ln (j+1)}\ .
\label{eq:idcg}
\end{equation}$$
The IDCG represents the ideal scenario in which the given recommendations are ranked as well as possible: the SORT function in the definition above sorts the relevance values of the top-k recommendations in decreasing order.
Using eqs. (\ref{eq:dcg}) and (\ref{eq:idcg}) we finally arrive at the definition of the Normalized Discounted Cumulative Gain (NDCG) as
$$\begin{equation}
\textrm{NDCG}_i@k = \frac{\textrm{DCG}_i@k}{\textrm{IDCG}_i@k}\ .
\label{eq:ndcg}
\end{equation}$$
6.1 Edge case
1. If there are no relevant documents within the set of recommendations, it is assumed that NDCG$=0$.
2. If the vector of labels is empty, NDCG should return NaN.
6.2 Sources of confusion
Just as with MAP, I have found NDCG to be defined differently by different authors. The two main sources of confusion that I found are:
1. the logarithm in the denominator is sometimes natural, other times base 2.
2. the IDCG is sometimes calculated without using the SORT function on the recommended items, but rather assuming that an ideal recommender would recommend all relevant items in the best ranking order. In this case the numerator in eq. (\ref{eq:idcg}) would be positive for all items (and strictly equal to one for binary outcomes). This is how MLlib defines NDCG. In my research, I have found the definition in eq. (\ref{eq:idcg}) to be more common, for example this paper (see their Definition 1) or (again) the Coursera course from the University of Minnesota or this video from Victor Lavrenko.
6.3 Application to User Test Cases
Before going into the actual calculations, it is useful to elucidate what the SORT function in eq. (\ref{eq:idcg}) does. For our test example #2, we have a label vector = {2,4,6} and a prediction vector = {1,2,3,4,5}. The relevance vector is then rel={0,1,0,1,0} and the “ideal” relevance vector is SORT(rel)={1,1,0,0,0}. In other words, an ideal recommender would have ranked items 2 and 4 at the top of the recommendations list. We can now go ahead and use eq. (\ref{eq:ndcg}) for our test cases,
k = 1
User Test  DCG@1  IDCG@1  NDCG@1 

#1  1/ln(2)  1/ln(2)  DCG@1/IDCG@1=1.000 
#2  0  0  0 
#3  0  0  0 
#4  NaN  NaN  NaN 
#5  NaN  NaN  NaN 
Mean NDCG@1      (1+0+0)/3=1/3=0.333 
k = 3
User Test  DCG@3  IDCG@3  NDCG@3 

#1  1/ln(2)+1/ln(3)  1/ln(2)+1/ln(3)  DCG@3/IDCG@3=1.000 
#2  1/ln(3)  1/ln(2)  DCG@3/IDCG@3=0.631 
#3  0  0  0 
#4  NaN  NaN  NaN 
#5  NaN  NaN  NaN 
Mean NDCG@3      (1+0.631+0)/3=0.544 
k = 5
User Test  DCG@5  IDCG@5  NDCG@5 

#1  1/ln(2)+1/ln(3)  1/ln(2)+1/ln(3)  DCG@5/IDCG@5=1.000 
#2  1/ln(3)+1/ln(5)  1/ln(2)+1/ln(3)  DCG@5/IDCG@5=0.651 
#3  0  0  0 
#4  NaN  NaN  NaN 
#5  NaN  NaN  NaN 
Mean NDCG@5      (1+0.651+0)/3=0.550 
6.4 Scala worksheet
This code returns the same values shown in the subsection “Application to User Test Cases” above.
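A minimal self-contained sketch of eqs. (\ref{eq:dcg})-(\ref{eq:ndcg}) (function names my own), using natural logarithms and sorting the relevance values of the served recommendations to build the IDCG, as in the tables above:

```scala
// NDCG@k with binary relevance and natural logarithms. The ideal
// ranking (IDCG) reorders the relevance values of the top-k
// recommendations in decreasing order, per the SORT function above.
def ndcgAtK(labels: Set[Int], predictions: Seq[Int], k: Int): Double =
  if (labels.isEmpty) Double.NaN
  else {
    val rel = predictions.take(k).map(p => if (labels.contains(p)) 1.0 else 0.0)
    def dcg(rs: Seq[Double]): Double =
      rs.zipWithIndex.map { case (r, j) => (math.pow(2, r) - 1) / math.log(j + 2) }.sum
    val idcg = dcg(rel.sortBy(-_))
    if (idcg == 0.0) 0.0 else dcg(rel) / idcg
  }

// User #2: rel = {0,1,0} -> DCG@3 = 1/ln(3), IDCG@3 = 1/ln(2), NDCG@3 ≈ 0.631
println(ndcgAtK(Set(2, 4, 6), Seq(1, 2, 3, 4, 5), 3))
```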


References
 Coursera course by the University of Minnesota.
 Spark MLlib documentation
B. McFee and G. R. G. Lanckriet, Metric Learning to Rank, Twenty-seventh International Conference on Machine Learning (ICML, 2010).
 C. Moreira, P. Calado and B. Martins, Learning to Rank for Expert Search in Digital Libraries of Academic Publications, arXiv:1302.0413 (2013).
 C. D. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval, Cambridge University Press (2008).
M. Weimer, A. Karatzoglou, Q. Le and A. Smola, COFI RANK: Maximum Margin Matrix Factorization for Collaborative Ranking, Advances in Neural Information Processing Systems 20 (2008).
The Python Recsys rank-based evaluation metrics.
⁂