[Colloq] PhD Defense - Maryam Bashir - Time/Location Changed to 7/24, 1pm, 366 WVH - Optimally Selecting and Combining Assessment and Assessor Types for Information Retrieval Evaluation
Jessica Biron
bironje at ccs.neu.edu
Thu Jul 24 08:30:53 EDT 2014
Title: Optimally Selecting and Combining Assessment and Assessor Types for Information Retrieval Evaluation
Speaker: Maryam Bashir
Date: Thursday, July 24, 2014
Time: 1:00pm to 2:00pm
Location: 366 WVH
Committee:
Mark D. Smucker (External Committee Member, University of Waterloo)
David A. Smith
Yizhou Sun
Javed A. Aslam -- Thesis Advisor
Abstract:
Current test collection construction methodologies for Information Retrieval evaluation generally rely on large numbers of document relevance assessments, obtained from experts at great cost. Recently, the use of inexpensive crowd workers has been proposed instead. However, while crowd workers are inexpensive, their assessments are also generally highly inaccurate, rendering their collective assessments far less useful than those obtained from experts in the traditional manner. Our thesis is that instead of using either experts or crowd workers alone, one can obtain the advantages of both (inexpensive and accurate assessments) by optimally combining them.

A related problem in Information Retrieval evaluation is asking the right kind of question of assessors when collecting relevance judgments. Traditional methods collect binary or graded nominal judgments, but such judgments are limited by factors such as inter-assessor disagreement and the arbitrariness of grades. Previous research has shown that it is easier for assessors to make pairwise preference judgments. However, unless the preferences collected are largely transitive, it is not clear how to combine them to obtain document relevance scores. A further difficulty is that the number of pairs to be assessed is quadratic in the number of documents. We show how to combine a linear number of pairwise preference judgments from multiple assessors to compute relevance scores for every document.

We propose a general Bayesian framework for leveraging disparate categories of workers on the cost-accuracy scale, asking the right worker the right kind of question at the right time in order to obtain accurate assessments most cost-effectively for test collection construction. Experiments with Mechanical Turk workers and expert assessors show promising results for our framework.
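The announcement gives no algorithmic detail, but the general idea of turning pooled pairwise preferences into per-document relevance scores can be illustrated with a standard Bradley-Terry style estimate. The sketch below is purely illustrative (Python, with a hypothetical function name bradley_terry_scores) and simply pools preferences from all assessors with equal weight; it is not the model defended in the thesis.

    from collections import defaultdict

    def bradley_terry_scores(preferences, n_iter=200, smoothing=0.1, tol=1e-8):
        """Estimate a relevance score per document from pairwise preference
        judgments, given as (preferred_doc, other_doc) pairs pooled across
        assessors, using Bradley-Terry minorization-maximization updates."""
        wins = defaultdict(float)         # how often each document was preferred
        pair_counts = defaultdict(float)  # comparisons per unordered pair
        docs = set()
        for winner, loser in preferences:
            wins[winner] += 1.0
            pair_counts[tuple(sorted((winner, loser)))] += 1.0
            docs.update((winner, loser))

        scores = {d: 1.0 for d in docs}
        for _ in range(n_iter):
            new_scores = {}
            for d in docs:
                denom = 0.0
                for (a, b), n in pair_counts.items():
                    if d in (a, b):
                        other = b if d == a else a
                        denom += n / (scores[d] + scores[other])
                # A small pseudo-count keeps never-preferred documents above zero.
                new_scores[d] = (wins[d] + smoothing) / denom
            total = sum(new_scores.values())
            new_scores = {d: s / total for d, s in new_scores.items()}
            if max(abs(new_scores[d] - scores[d]) for d in docs) < tol:
                scores = new_scores
                break
            scores = new_scores
        return scores

    if __name__ == "__main__":
        # Two assessors each judged a few document pairs for one topic.
        judged = [("d1", "d2"), ("d1", "d3"), ("d2", "d3"),   # assessor A
                  ("d1", "d2"), ("d3", "d2")]                 # assessor B
        print(bradley_terry_scores(judged))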