[Colloq] Hiring Talk - Eugene Wu - Closing the Loop on Data Analysis - March 25, 10:30am, 366 WVH

Northeastern University CCIS bironje at ccs.neu.edu
Wed Mar 19 10:21:17 EDT 2014


Closing the Loop on Data Analysis

Tuesday March 25th, 2014 10:30am - 11:30am

366 WVH

Eugene Wu

Although data processing systems now execute queries faster than ever before,
they address only the first half of the data analysis cycle. The latter half
(presenting and interpreting the results in order to clean the data, formulating
new queries, generating hypotheses, and summarizing and presenting results)
is currently ill-served by existing systems. In this talk, I will
describe two examples of systems that “close the loop” by letting
users query the results of their data analysis.
The first, Scorpion, answers “why are these results outliers?” in
the context of aggregation queries. Aggregation is commonly used to reduce large
data sets to a manageable size, but it also obscures which input records are
correlated with outliers and which are not. Scorpion identifies
the input records that most contributed to an outlier value and generates
predicates that describe their common properties.
The second, SubZero, answers “what records generated this result?”
in the context of scientific workflows. For example, astronomers want to know
which pixels in the set of all input images were used to detect an interesting
star. Naively storing input-output relationships (lineage) for every pixel in
each step of the workflow can incur significant storage and runtime costs.
SubZero is a workflow system that efficiently tracks lineage information while
also meeting user-specified storage and runtime overhead constraints.


Eugene Wu is a Ph.D. student in the database group at MIT, advised by Samuel
Madden and Michael Stonebraker. He is broadly interested in building systems for
data management and has contributed to research in a wide variety of areas
including data cleaning, core database performance, human computation, and
complex event processing.



