What are the real-world problems:
- Generating photos from text.
- Generating video from a script.
- Generating voice from text.
- Generating fake news! :P
- Generating UI code from images, stories, speeches, movies, ... anything!
- Data compression! (Compression is fundamentally about prediction, so it fits the generative-modelling mould quite well.)
Estimate the distribution of data from some samples. For example, from the ImageNet dataset, find the distribution of natural images. We have a small empirical sample of the distribution (ImageNet) and, from that sample, estimate the true distribution.
"How likely is this data point under the true distribution?"
A likelihood-based model is a model of the joint distribution over data.
We have some IID data points from the true data distribution; in practice we restrict ourselves to a dataset (maybe ImageNet).
For this class, a distribution is a function that takes an input and outputs the probability that it was generated by the true data process.
In the first lecture we deal only with discrete data; we will move on to continuous data later.
Potential uses include fast anomaly detection.
Sampling: generate a random variable X that has the same distribution as the model.
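A minimal sketch of these two operations for discrete data (the class name `HistogramModel` and the toy data are illustrative assumptions, not the lecture's model): a histogram of empirical frequencies already gives both a probability for a data point and a way to sample.

```python
import numpy as np

class HistogramModel:
    """Toy likelihood model for discrete data: empirical frequencies of the dataset."""
    def __init__(self, data, num_values):
        counts = np.bincount(data, minlength=num_values)
        self.probs = counts / counts.sum()              # estimated p(x) for each value x

    def prob(self, x):
        # "How likely is this data point under the (estimated) distribution?"
        return self.probs[x]

    def sample(self, n):
        # Sampling: random variables with the same distribution as the model.
        return np.random.choice(len(self.probs), size=n, p=self.probs)

data = np.random.randint(0, 10, size=1000)              # stand-in for real discrete data
model = HistogramModel(data, num_values=10)
print(model.prob(3), model.sample(5))
```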
Deep learning helps to estimate distributions of complex, high-dimensional data. This motivates the class: older classical statistical techniques fail in this high-dimensional scenario.
We want our model to be:
- Small
- Fast
- Expressive
- Generalisable
We estimate p_data from samples.
To do that, we use a function approximator to learn parameters theta, such that p_theta approximates the real distribution.
- How do we design function approximations that effectively represent complex joint distributions over x, yet remain easy to train?
- There will be many choices, each with different tradeoffs.
Designing the model and the training procedure go hand in hand; a toy parameterisation is sketched below.
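As one toy example of such a function approximator (purely illustrative, not the lecture's model): parameterise a categorical distribution over discrete values with a vector of logits theta and a softmax. It is small, fast to evaluate, expressive enough for this toy case, and trainable by gradient descent.

```python
import torch
import torch.nn.functional as F

# theta: one logit per discrete value; softmax turns the logits into a valid distribution.
theta = torch.zeros(10, requires_grad=True)

def log_prob(x):
    """log p_theta(x) for a batch of discrete data points x (a LongTensor of indices)."""
    return F.log_softmax(theta, dim=0)[x]
```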
We also want:
- The loss function + search procedure to work for large datasets.
- To yield theta similar to the true data distribution; think of the loss as a distance between distributions.
- Note that the training procedure can only see the empirical data distribution, so it should be able to generalise.
Maximum likelihood finds theta given a dataset by solving an optimisation problem.
Statistics also tells us that if the model is expressive enough and given enough data, then solving the maximum likelihood problem will yield parameters that generate the data.
It is equivalent to minimising the KL divergence between the true distribution and the approximate model.
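A one-line reminder of why (standard derivation, not spelled out in the notes): the term involving p_data does not depend on theta, so minimising the KL divergence and maximising expected log-likelihood pick out the same theta.

```latex
\mathrm{KL}(p_{\mathrm{data}} \,\|\, p_\theta)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log p_{\mathrm{data}}(x) - \log p_\theta(x)\right]
  = \underbrace{-H(p_{\mathrm{data}})}_{\text{constant in }\theta}
    \; - \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log p_\theta(x)\right]
\quad\Rightarrow\quad
\arg\min_\theta \mathrm{KL}(p_{\mathrm{data}} \,\|\, p_\theta)
  = \arg\max_\theta \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log p_\theta(x)\right].
```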
We can solve maximum likelihood by using SGD to minimise expectations.
If f is a differentiable function of theta, SGD solves argmin_theta E[f(theta)].
This works because maximum likelihood is an average, and SGD minimises averages!
With maximum likelihood, our optimisation problem is:
argmin_theta sum_{x in dataset} -log p_theta(x)
The noise comes from sampling the true data distribution; as in the supervised setting, we can think of this noise as mini-batches over the dataset. (Slightly confusing.)
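Putting the pieces together, a minimal sketch of maximum likelihood by SGD over mini-batches (reusing the toy softmax parameterisation above; the data, batch size, and learning rate are made-up placeholders):

```python
import torch
import torch.nn.functional as F

theta = torch.zeros(10, requires_grad=True)             # logits for 10 discrete values
optimizer = torch.optim.SGD([theta], lr=0.1)

data = torch.randint(0, 10, (1000,))                    # stand-in for samples from p_data

for step in range(500):
    batch = data[torch.randint(0, len(data), (64,))]    # noisy mini-batch of the dataset
    log_probs = F.log_softmax(theta, dim=0)[batch]      # log p_theta(x) for each x in batch
    loss = -log_probs.mean()                            # average negative log-likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```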