[Colloq] PhD Thesis Defense - July 24, 1:30pm, 166 WVH - Optimally Selecting and Combining Assessment and Assessor Types for Information Retrieval Evaluation

Jessica Biron bironje at ccs.neu.edu
Mon Jul 21 10:37:30 EDT 2014



Title: Optimally Selecting and Combining Assessment and Assessor Types for Information Retrieval Evaluation 



Speaker: Maryam Bashir 



Date: Thursday, July 24, 2014 



Time: 1:30pm to 2:30pm 



Location: 166 WVH 



Committee: 



Mark D. Smucker (External Committee Member, University of Waterloo) 
David A. Smith 
Yizhou Sun 
Javed A. Aslam -- Thesis Advisor 



Abstract: 



Current test collection construction methodologies for Information Retrieval evaluation generally rely on large numbers of document relevance assessments obtained from experts at great cost. Recently, the use of inexpensive crowd workers has been proposed instead. However, while crowd workers are inexpensive, their assessments are also generally highly inaccurate, rendering their collective assessments far less useful than those obtained from experts in the traditional manner. Our thesis is that instead of using either experts or crowd workers alone, one can obtain the advantages of both (inexpensive and accurate assessments) by optimally combining them.

A related problem in Information Retrieval evaluation is asking the right kind of question of assessors when collecting relevance judgments. Traditional methods collect binary or graded nominal judgments, but such judgments are limited by factors such as inter-assessor disagreement and the arbitrariness of grades. Previous research has shown that it is easier for assessors to make pairwise preference judgments. However, unless the preferences collected are largely transitive, it is not clear how to combine them in order to obtain document relevance scores. A further difficulty is that the number of pairs to be assessed is quadratic in the number of documents. We show how to combine a linear number of pairwise preference judgments from multiple assessors to compute relevance scores for every document.

We propose a general Bayesian framework for leveraging disparate categories of workers on the cost-accuracy scale, asking the right worker the right kind of question at the right time in order to obtain accurate assessments most cost-effectively for test collection construction. Experiments with Mechanical Turk workers and expert assessors show promising results for our framework. 
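The abstract mentions turning pairwise preference judgments from multiple assessors into per-document relevance scores. As a rough illustration of that general task only (not the thesis's Bayesian framework, whose details are not given in this announcement), the sketch below fits a simple Bradley-Terry-style model to a set of preference pairs; the function name, toy document ids, and hyperparameters are invented for the example. One property of such a fit is that it tolerates some intransitivity in the observed preferences, since each pair contributes only probabilistically to the shared scores.

# Illustrative sketch: inferring document relevance scores from pairwise
# preference judgments with a simple Bradley-Terry model. This is NOT the
# thesis's method, only a common baseline for the same aggregation task.
import math
from collections import defaultdict

def bradley_terry_scores(preferences, n_iters=200, lr=0.1):
    """preferences: list of (winner_doc, loser_doc) pairs, possibly from
    multiple assessors (duplicate pairs count as repeated observations).
    Returns a dict mapping each document id to a real-valued relevance score."""
    docs = {d for pair in preferences for d in pair}
    score = {d: 0.0 for d in docs}  # log-strength parameters, start at 0
    for _ in range(n_iters):
        grad = defaultdict(float)
        for winner, loser in preferences:
            # P(winner preferred over loser) under the current scores
            p_win = 1.0 / (1.0 + math.exp(score[loser] - score[winner]))
            # gradient of the log-likelihood of the observed preference
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        for d in docs:
            score[d] += lr * grad[d]  # gradient ascent step
    return score

# Toy usage: a few judged pairs; d1 is preferred most often and ranks first.
prefs = [("d1", "d2"), ("d1", "d3"), ("d2", "d3"), ("d1", "d2")]
print(sorted(bradley_terry_scores(prefs).items(), key=lambda kv: -kv[1]))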


