[Colloq] PhD thesis proposal presentation - Keshi Dai

Tue Jun 21 10:59:56 EDT 2011

The College of Computer and Information Science presents a PhD proposal presentation 

Speaker: Keshi Dai 

Date: Thursday, June 23, 2011 
Time: 10:00am 
Location: 366 WVH 

Title: 
Modeling Score Distributions for Information Retrieval 

Abstract: 

Inferring the score distributions of relevant and non-relevant documents is an essential task for many retrieval applications (e.g., information filtering, recall-oriented IR, meta-search, and distributed IR). Accurate modeling of score distributions is the basis of any such inference. In the first part of the proposal, we propose a better empirical model that represents the relevant documents’ scores by a mixture of Gaussians and the non-relevant documents’ scores by a Gamma distribution. By applying variational Bayesian inference, we automatically trade off goodness of fit against model complexity. We show that our model outperforms the traditional model on typical retrieval functions and on actual search engines submitted to TREC. 

Furthermore, we model score distributions in a rather different, systematic manner. We start with a basic assumption about the distribution of terms in a document. Following the transformations applied to term frequencies, we derive the distribution of the resulting scores for retrieved documents. We then present a general mathematical framework which, given any score distribution for all retrieved documents, produces an analytical formula for the score distribution of relevant documents. In particular, assuming a Gamma distribution for all retrieved documents, we show that the derived distribution for the relevant documents resembles a Gaussian distribution with a heavy right tail. 

Finally, we propose a novel framework for inferring the probability of document relevance in the absence of relevance information, by combining estimates of that probability obtained from the score distributions of multiple retrieval systems. We also extend the expectation-maximization algorithm so that it integrates with this new framework to estimate the mixture model parameters of the score distribution. Taken together, we propose to show how one can utilize score distributions to solve practical IR problems when relevance information is unavailable or limited. 

Committee: 
* Javed Aslam (advisor) 
* Harriet Fell 
* Rajmohan Rajaraman 
* Avi Arampatzis (external)