Title: Best Practices For Offline Evaluation For Top-N Recommendation: Candidate Set Sampling And Statistical Inference
Program: Doctor of Philosophy in Computing
Advisor: Dr. Edoardo Serra, Computer Science
Committee Members: Dr. Michael Ekstrand, Computer Science (Co-Chair); and Dr. Hoda Mehrpouyan, Computer Science
Evaluation of recommender systems is key to ensuring that the field makes progress by promoting only those proposed algorithms that actually outperform the state of the art. Evaluation can be performed offline, using logged historical data, or online, for example through A/B tests. Offline evaluation is the most popular evaluation paradigm in recent research publications because of its accessibility. It has been studied since the field's earliest days and remains an active area of research. Even though significant work has been done to improve the process and metrics of offline evaluation, it still faces fundamental difficulties. Given the central role offline evaluation plays in recommender system evaluation, it is important that these difficulties be mitigated. This dissertation aims to improve offline evaluation of recommender systems by identifying gaps in existing practices in two specific components of the offline evaluation protocol: candidate set sampling, the selection of the set of candidate items that the recommender system is expected to rank for each user in an experiment; and statistical inference techniques, which are used to analyze evaluation results in order to draw inferences about the effectiveness of a proposed system relative to a baseline.
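To make the candidate set sampling step concrete, the sketch below contrasts uniform and popularity-weighted sampling of decoy items for a single user. The helper name `sample_candidates`, its parameters, and the popularity weights are hypothetical illustrations of the general idea, not code from the dissertation:

```python
import random

def sample_candidates(user_pos, all_items, item_pop, n=100, weighted=True, rng=None):
    """Sample n candidate (decoy) items for one user, excluding known positives.

    weighted=True draws items with probability proportional to popularity
    (e.g., interaction count); weighted=False draws uniformly at random.
    Hypothetical helper for illustration only.
    """
    rng = rng or random.Random()
    pool = [i for i in all_items if i not in user_pos]
    if not weighted:
        return rng.sample(pool, n)  # uniform sampling without replacement
    weights = [item_pop[i] for i in pool]
    # random.choices samples with replacement, so collect unique items
    # until n distinct candidates have been drawn
    chosen = set()
    while len(chosen) < n:
        chosen.update(rng.choices(pool, weights=weights, k=n - len(chosen)))
    return list(chosen)
```

In an experiment, each user's test items would then be ranked against these sampled candidates rather than against the full item catalog.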
This dissertation addresses gaps in candidate set sampling by showing that uniform sampling of the candidate set exacerbates popularity bias, while popularity-weighted sampling mitigates it. It further demonstrates that candidate set sampling improves the accuracy of effectiveness estimates for top-N recommender systems. With respect to statistical inference, this dissertation identifies a lack of rigorous statistical analysis in evaluations within the RecSys community, and shows that the Wilcoxon and Sign tests exhibit higher-than-expected Type I error rates at large sample sizes, recommending their discontinuation in recommender system experiments. It demonstrates that, in top-N recommendation and large search evaluation data, most tests are likely to yield statistically significant results, emphasizing the need to prioritize effect size when judging practical or scientific significance. Additionally, it finds that the Benjamini-Yekutieli correction exhibits the lowest error rate and greater power than the Bonferroni correction, recommending it as the default correction for comparing multiple systems in information retrieval and recommender system experiments. By addressing these gaps, this dissertation contributes to the improvement of offline evaluation, equipping recommender system researchers with evidence-based knowledge to make informed decisions when configuring their evaluation experiments.
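The Benjamini-Yekutieli correction recommended above can be sketched in a few lines. This is an assumed pure-Python rendering of the standard step-up adjustment (which controls the false discovery rate under arbitrary dependence among tests), not code from the dissertation:

```python
from math import fsum

def benjamini_yekutieli(pvals, alpha=0.05):
    """Benjamini-Yekutieli FDR correction over a list of p-values.

    Returns (reject_flags, adjusted_pvals) in the original input order.
    """
    m = len(pvals)
    # Harmonic-number penalty c(m) that distinguishes BY from Benjamini-Hochberg
    c_m = fsum(1.0 / k for k in range(1, m + 1))
    order = sorted(range(m), key=lambda i: pvals[i])
    adj = [0.0] * m
    prev = 1.0
    # Step-up pass: enforce monotonicity from the largest p-value downward
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        prev = min(prev, pvals[i] * m * c_m / rank)
        adj[i] = prev
    reject = [p <= alpha for p in adj]
    return reject, adj
```

When comparing many systems, a hypothesis would be declared significant only if its adjusted p-value stays below the chosen alpha.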