Distinguished Lecture Series:
Accounting for burstiness in topic models
Speaker: Dr. Charles Elkan
When: Friday, April 10th, 2009
Time: 2:00pm
Where: ECS 243
Abstract:
A topic model is a statistical model of a collection of documents that identifies the multiple themes that are present in the documents. Topic modeling is more useful than clustering, because clustering assumes that each document belongs to a single theme, which is often not true in reality. All current topic models assume that each theme is represented by a multinomial distribution, an assumption that is also not true in reality, since it contradicts the fact that words in topics and in documents are bursty: if a word is used once, it is more likely to be used again. In this talk, I will present a new topic model that uses the Dirichlet compound multinomial (DCM) distribution to allow for burstiness. Experimental results show that the DCM-based model fits real-world data much better than the corresponding traditional topic model. I will also describe a new extended topic model that lets us learn correlations between topics using a generalization of the DCM. Note: Joint work with Gabe Doyle from Linguistics at UCSD.
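To make the burstiness argument concrete, the following is a minimal sketch that compares the standard multinomial and DCM (Polya) log-probabilities of two four-word documents. It is not the topic model presented in the talk. The function names, the four-word vocabulary, the uniform word probabilities, and the concentration value 0.1 are all assumptions chosen for illustration.

```python
from math import lgamma, log

def multinomial_log_prob(counts, probs):
    """Log-probability of a word-count vector under a multinomial."""
    n = sum(counts)
    # log of the multinomial coefficient n! / prod(x_w!)
    log_coef = lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)
    return log_coef + sum(c * log(p) for c, p in zip(counts, probs) if c > 0)

def dcm_log_prob(counts, alpha):
    """Log-probability of a word-count vector under the Dirichlet compound
    multinomial with parameter vector alpha (all entries > 0)."""
    n = sum(counts)
    a = sum(alpha)
    log_coef = lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)
    return (log_coef + lgamma(a) - lgamma(n + a)
            + sum(lgamma(c + a_w) - lgamma(a_w)
                  for c, a_w in zip(counts, alpha)))

# Hypothetical example: a vocabulary of four words, documents of length four.
bursty = [4, 0, 0, 0]   # one word repeated four times
spread = [1, 1, 1, 1]   # every word used once

probs = [0.25] * 4      # assumed uniform multinomial
alpha = [0.1] * 4       # assumed small DCM parameters, which favor repetition

for name, doc in [("bursty", bursty), ("spread", spread)]:
    print(name,
          "multinomial:", multinomial_log_prob(doc, probs),
          "DCM:", dcm_log_prob(doc, alpha))
```

Under these assumed parameters, the multinomial assigns higher probability to the evenly spread document, while the DCM assigns higher probability to the document that repeats a single word, which is the behavior the abstract calls burstiness.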
Biography:
Dr. Charles Elkan is a professor in the Department of Computer Science and Engineering at the University of California, San Diego. In 2005/06 he was on sabbatical at MIT, and in 1998/99 he was Visiting Associate Professor at Harvard. Dr. Elkan is known for his research in machine learning, data mining, and computational biology. The MEME algorithm he developed with his Ph.D. student Tim Bailey has been cited over 1000 times in biology and computer science. Dr. Elkan has won several best paper awards and data mining contests, and his Ph.D. students have held tenure-track or similar positions at Columbia University, the University of Washington, the University of Queensland, Google, IBM Research, and other universities and companies.