Home
Welcome to the Yahoo_LDA wiki!
This repository contains Yahoo!'s topic modelling framework using Latent Dirichlet Allocation.
Topic Modelling is a Machine Learning technique for categorizing data. Suppose we have to group the home pages of Singapore Airlines, the National University of Singapore & Chijmes, a restaurant in Singapore: we can group them all as belonging to Singapore. Now if we have more pages to group, say United Airlines, the Australian National University and a restaurant in Berkeley, then we can group the combined set of pages in multiple ways: by country, by type of business and so on. Choosing just one of these groupings is hard because each plays a different role depending on the context. In a document that talks about nations' strengths, the grouping by nationality is apt; in a document that talks about universities, it is better to use the grouping by type of business. So the only alternative is to assign or tag each page with all the categories it belongs to: we tag the United Airlines page as an airline company in the US, and so on.
So what's the big difference between grouping objects, or clustering, and Topic Models? The following example clarifies the distinction. Consider objects of different colors. Clustering them means finding that there are 3 prototypical colors, R, G & B, and grouping each object by its primary color; that is, we group objects by prototypes. With topic models, on the other hand, we try to find the composition of R, G & B in the color of each object; that is, we say that this color is composed of 80% R, 9% G & 11% B. So topic models are definitely richer, in the sense that any color can be decomposed into the prototypical colors, but not every color can be unambiguously grouped.
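To make the contrast concrete, here is a tiny illustrative snippet (not project code; the values are the ones from the example above) showing the two representations side by side: a hard cluster label versus a mixture over the three prototype colors.

```cpp
#include <cstdio>
#include <vector>

int main() {
    // Clustering: a single hard assignment ("this object is primarily Red").
    int cluster_id = 0;
    // Topic-model style: a decomposition over the prototypes {R, G, B}.
    std::vector<double> mixture = {0.80, 0.09, 0.11};
    std::printf("cluster: %d, mixture: %.2f %.2f %.2f\n",
                cluster_id, mixture[0], mixture[1], mixture[2]);
    return 0;
}
```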
Though conceptually this sounds very good, making it work on hundreds of millions of pages, with thousands of topics to infer and no editorial data, is a very hard problem. The state of the art can only handle sizes that are 10 to 100 times smaller. We would also like the solution to scale with the number of computers, so that we can add more machines and solve bigger problems.
One way of solving the Topic Modelling problem is called Latent Dirichlet Allocation (LDA). This is a statistical model which specifies a probabilistic procedure to generate data. It defines a topic as a probability distribution over words: essentially, think of each topic as having a vocabulary of its own, with its preference over words specified as a probability distribution.
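As a concrete illustration, here is a minimal C++ sketch of LDA's generative story. All names and values here (K topics, vocabulary size V, document length, Dirichlet concentrations alpha and beta) are illustrative assumptions, not part of the Yahoo_LDA code:

```cpp
#include <random>
#include <vector>

std::mt19937 rng(42);

// Draw a probability vector from a symmetric Dirichlet(concentration).
std::vector<double> sample_dirichlet(int dim, double concentration) {
    std::gamma_distribution<double> gamma(concentration, 1.0);
    std::vector<double> p(dim);
    double sum = 0.0;
    for (int i = 0; i < dim; ++i) { p[i] = gamma(rng); sum += p[i]; }
    for (int i = 0; i < dim; ++i) p[i] /= sum;
    return p;
}

// Draw an index from a discrete distribution p.
int sample_discrete(const std::vector<double>& p) {
    std::discrete_distribution<int> d(p.begin(), p.end());
    return d(rng);
}

int main() {
    const int K = 3, V = 1000, doc_len = 100;   // illustrative sizes
    const double alpha = 0.1, beta = 0.01;      // illustrative priors

    // Each topic is a probability distribution over the vocabulary.
    std::vector<std::vector<double>> topics(K);
    for (auto& t : topics) t = sample_dirichlet(V, beta);

    // A document mixes the topics in proportions theta.
    std::vector<double> theta = sample_dirichlet(K, alpha);

    // Generate each word: first pick a topic, then a word from that topic.
    std::vector<int> doc(doc_len);
    for (int n = 0; n < doc_len; ++n) {
        int z = sample_discrete(theta);
        doc[n] = sample_discrete(topics[z]);
    }
    return 0;
}
```

Inference runs this story in reverse: given only the observed words, it recovers the topics and each document's mixing proportions.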
We have implemented a framework which solves the topic modelling problem using LDA and works at very large scale. Considerable effort has also been spent on an architecture for the framework that is flexible enough to allow the reuse of the infrastructure for implementing fancier models and extensions. One of the main aims is that scaling a new model should take minimal effort. For more details please take a look at An Architecture for Parallel Topic Models.
It provides a fast C++ implementation of the inferencing algorithm which can use both multi-core parallelism and multi-machine parallelism using a Hadoop cluster. It can infer about a thousand topics on a million-document corpus, running a thousand iterations on an eight-core machine, in one day.
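For a sense of what the inner loop of such inference looks like, here is a sketch of the textbook collapsed Gibbs sampling update for a single token. This is illustrative only, under assumed count arrays and names; the sampler Yahoo_LDA actually ships is considerably more optimized (see the architecture paper above):

```cpp
#include <random>
#include <vector>

// n_dk[d][k]: topic counts per document; n_kw[k][w]: word counts per topic;
// n_k[k]: total tokens per topic; z[d][n]: current topic of each token.
void resample_token(int d, int n, int w,
                    std::vector<std::vector<int>>& n_dk,
                    std::vector<std::vector<int>>& n_kw,
                    std::vector<int>& n_k,
                    std::vector<std::vector<int>>& z,
                    double alpha, double beta, int V, std::mt19937& rng) {
    const int K = static_cast<int>(n_k.size());
    const int old_k = z[d][n];
    // Remove the token's current assignment from all counts.
    --n_dk[d][old_k]; --n_kw[old_k][w]; --n_k[old_k];

    // p(z = k | rest) is proportional to
    //   (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta)
    std::vector<double> p(K);
    for (int k = 0; k < K; ++k)
        p[k] = (n_dk[d][k] + alpha) * (n_kw[k][w] + beta) / (n_k[k] + V * beta);

    std::discrete_distribution<int> dist(p.begin(), p.end());
    const int new_k = dist(rng);

    // Add the token back under its newly sampled topic.
    ++n_dk[d][new_k]; ++n_kw[new_k][w]; ++n_k[new_k];
    z[d][n] = new_k;
}
```

One sweep applies this update to every token in the corpus; parallelism comes from distributing documents across cores and machines while keeping the topic-word counts synchronized.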
To aid discussion, we have created [email protected].