[Colloq] PhD Defense - Maryam Bashir - Time/Location Changed to 7/24, 1pm, 366 WVH - Optimally Selecting and Combining Assessment and Assessor Types for Information Retrieval Evaluation
Jessica Biron
bironje at ccs.neu.edu
Thu Jul 24 08:30:53 EDT 2014
Title: Optimally Selecting and Combining Assessment and Assessor Types for Information Retrieval Evaluation
Speaker: Maryam Bashir
Date: Thursday, July 24, 2014
Time: 1:00pm to 2:00pm
Location: 366 WVH
Committee:
Mark D. Smucker (External Committee Member, University of Waterloo)
David A. Smith
Yizhou Sun
Javed A. Aslam -- Thesis Advisor
Abstract:
Current test collection construction methodologies for Information Retrieval evaluation generally rely on large numbers of document relevance assessments, obtained from experts at great cost. Recently, the use of inexpensive crowd workers has been proposed instead. However, while crowd workers are inexpensive, their assessments are also generally highly inaccurate, rendering their collective assessments far less useful than those obtained from experts in the traditional manner. Our thesis is that instead of using either experts or crowd workers alone, one can obtain the advantages of both (inexpensive and accurate assessments) by optimally combining them.

A related problem in Information Retrieval evaluation is asking the right kind of question of assessors when collecting relevance judgments. Traditional methods collect binary or graded nominal judgments, but such judgments are limited by factors such as inter-assessor disagreement and the arbitrariness of grades. Previous research has shown that it is easier for assessors to make pairwise preference judgments. However, unless the preferences collected are largely transitive, it is not clear how to combine them to obtain document relevance scores. A further difficulty is that the number of pairs to be assessed is quadratic in the number of documents. We show how to combine a linear number of pairwise preference judgments from multiple assessors to compute relevance scores for every document.

We propose a general Bayesian framework for leveraging disparate categories of workers on the cost-accuracy scale, asking the right worker the right kind of question at the right time in order to obtain accurate assessments most cost-effectively for test collection construction. Experiments with Mechanical Turk workers and expert assessors show promising results for our framework.
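The announcement gives no algorithmic detail, but the general idea of turning pooled pairwise preferences into per-document relevance scores can be illustrated with a standard Bradley-Terry style estimate. The sketch below is purely illustrative (Python, with a hypothetical function name bradley_terry_scores) and simply pools preferences from all assessors with equal weight; it is not the model defended in the thesis.

    from collections import defaultdict

    def bradley_terry_scores(preferences, n_iter=200, smoothing=0.1, tol=1e-8):
        """Estimate a relevance score per document from pairwise preference
        judgments, given as (preferred_doc, other_doc) pairs pooled across
        assessors, using Bradley-Terry minorization-maximization updates."""
        wins = defaultdict(float)         # how often each document was preferred
        pair_counts = defaultdict(float)  # comparisons per unordered pair
        docs = set()
        for winner, loser in preferences:
            wins[winner] += 1.0
            pair_counts[tuple(sorted((winner, loser)))] += 1.0
            docs.update((winner, loser))

        scores = {d: 1.0 for d in docs}
        for _ in range(n_iter):
            new_scores = {}
            for d in docs:
                denom = 0.0
                for (a, b), n in pair_counts.items():
                    if d in (a, b):
                        other = b if d == a else a
                        denom += n / (scores[d] + scores[other])
                # A small pseudo-count keeps never-preferred documents above zero.
                new_scores[d] = (wins[d] + smoothing) / denom
            total = sum(new_scores.values())
            new_scores = {d: s / total for d, s in new_scores.items()}
            if max(abs(new_scores[d] - scores[d]) for d in docs) < tol:
                scores = new_scores
                break
            scores = new_scores
        return scores

    if __name__ == "__main__":
        # Two assessors each judged a few document pairs for one topic.
        judged = [("d1", "d2"), ("d1", "d3"), ("d2", "d3"),   # assessor A
                  ("d1", "d2"), ("d3", "d2")]                 # assessor B
        print(bradley_terry_scores(judged))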