GEOMDN
GEOMDN is an implementation of Continuous Representation of Location for Geolocation and Lexical Dialectology using Mixture Density Networks (EMNLP 2017).
The neural network is implemented using Theano/Lasagne, but it shouldn't be difficult to adapt it to other NN frameworks.
The work has 3 main modules:
- lang2loc.py implements mixture density networks to predict location from text input.
- lang2loc_mdnshared.py implements mixture density networks to predict location from text input, with the difference that the mus, sigmas and corxys of the mixture of Gaussians are shared between all input samples and only the pis are conditioned on the input. This improves the model because the global mixture-of-Gaussians structure exists and can be learned from all the samples rather than predicted for each individual sample.
- loc2lang.py implements a lexical dialectology model that, given 2D coordinate inputs, predicts a unigram probability distribution over the vocabulary. The input is a normal 2D input layer, but the hidden layer consists of several Gaussian distributions whose mus, sigmas and corxys are learned; its output is the probability of the input under each of the Gaussian components.
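
To make the roles of the mus, sigmas, corxys and pis concrete, here is a minimal NumPy sketch of a bivariate Gaussian mixture. It is not the Theano/Lasagne code from this repository: the function names (`bivariate_gaussian_pdf`, `softmax`, `shared_mdn_nll`), the array shapes and the `eps` constant are assumptions made for illustration only. `bivariate_gaussian_pdf` corresponds roughly to what the Gaussian hidden layer in loc2lang.py is described as computing, and `shared_mdn_nll` to a negative log-likelihood loss in the shared-component setting, where only the pis depend on the input.

```python
import numpy as np


def bivariate_gaussian_pdf(xy, mus, sigmas, corxys):
    """Density of each 2D point under each of K bivariate Gaussians.

    xy: (N, 2) points; mus: (K, 2) means; sigmas: (K, 2) positive
    standard deviations; corxys: (K,) correlations in (-1, 1).
    Returns an (N, K) array of densities.
    """
    # Standardised offsets of every point from every component mean.
    dx = (xy[:, None, 0] - mus[None, :, 0]) / sigmas[None, :, 0]
    dy = (xy[:, None, 1] - mus[None, :, 1]) / sigmas[None, :, 1]
    rho = corxys[None, :]
    one_minus_rho2 = 1.0 - rho ** 2
    z = dx ** 2 + dy ** 2 - 2.0 * rho * dx * dy
    norm = (2.0 * np.pi * sigmas[None, :, 0] * sigmas[None, :, 1]
            * np.sqrt(one_minus_rho2))
    return np.exp(-z / (2.0 * one_minus_rho2)) / norm


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)


def shared_mdn_nll(pi_logits, xy, mus, sigmas, corxys, eps=1e-12):
    """Mean negative log-likelihood of the true coordinates.

    pi_logits: (N, K) unnormalised mixture weights, one row per sample
    (in the shared setting these are the only input-dependent values);
    mus, sigmas and corxys are shared across all samples.
    """
    pis = softmax(pi_logits)                                # (N, K)
    comp = bivariate_gaussian_pdf(xy, mus, sigmas, corxys)  # (N, K)
    likelihood = (pis * comp).sum(axis=1)                   # (N,)
    return -np.mean(np.log(likelihood + eps))
```

In the non-shared setting described for lang2loc.py, the mus, sigmas and corxys would also be predicted per sample, so they would carry a leading sample dimension instead of being broadcast across all rows.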
Take a look at some of the maps, the local words (including named entities) retrieved for several DARE dialect regions, and the city terms (including named entities) for about 100 U.S. cities.
Local words retrieved for the dialect region Delmarva:
"delmarva": [
"llsssss",
"llssss",
"llsss",
"downingtown",
"ardd",
"dickeating",
"llss",
"brovah",
"millersville",
"erked",
"rehoboth",
"suitland",
"arddd",
"oldhead",
"deptford",
"exton",
"youngbull",
"harford",
"fraudin",
"drawlin",
"dfl",
"cheltenham",
"reisterstown",
"ared",
"parkville",
"nizz",
"#ttm",
"marlton",
"xib",
"llls",
"norristown",
"horsham",
"owings",
"schuylkill",
"ard",
"kutztown",
"manayunk",
"bensalem",
"elkridge",
"btfu",
"fyd",
"llab",
The datasets are GEOTEXT, a.k.a. CMU (a small Twitter geolocation dataset), and TwitterUS, a.k.a. NA (a larger Twitter geolocation dataset). Both cover the continental U.S. and can be downloaded from here
- Download the datasets and place them in "./datasets/cmu" and "./datasets/na" for GEOTEXT and TwitterUS, respectively (contact me for the datasets).
- For lang2loc geolocation run:
For GEOTEXT a.k.a CMU run:
THEANO_FLAGS='device=cpu,floatX=float32' nice -n 10 python lang2loc.py -d ./datasets/cmu/ -enc latin1 -reg 0 -drop 0.5 -mindf 10 -hid 100 -ncomp 100
For TwitterUS a.k.a NA run:
THEANO_FLAGS='device=cpu,floatX=float32' nice -n 10 python lang2loc.py -d ./datasets/na/ -enc utf-8 -reg 1e-5 -drop 0.0 -mindf 10 -hid 300 -ncomp 100
- For lang2loc_mdnshared geolocation run:
For GEOTEXT a.k.a CMU run:
THEANO_FLAGS='device=cpu,floatX=float32' nice -n 10 python lang2loc_mdnshared.py -d ~/datasets/cmu/ -enc latin1 -reg 0.0 -drop 0.0 -mindf 10 -hid 100 -ncomp 300 -batch 200
For TwitterUS a.k.a NA run:
THEANO_FLAGS='device=cpu,floatX=float32' nice -n 10 python lang2loc_mdnshared.py -d ~/datasets/na/ -enc utf-8 -reg 0.0 -drop 0.0 -mindf 10 -hid 900 -ncomp 900 -batch 2000
- For loc2lang lexical dialectology model run:
THEANO_FLAGS='device=cpu,floatX=float32' nice -n 10 python loc2lang.py -d ~/datasets/na/ -enc utf-8 -reg 0.0 -drop 0.0 -mindf 100 -hid 1000 -ncomp 500 -batch 5000
Note that CMU is too small to be used for lexical dialectology.
@InProceedings{rahimicontinuous2017,
author = {Rahimi, Afshin and Baldwin, Timothy and Cohn, Trevor},
title = {Continuous Representation of Location for Geolocation and Lexical Dialectology using Mixture Density Networks},
booktitle = {Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP2017)},
month = {September},
year = {2017},
address = {Copenhagen, Denmark},
publisher = {Association for Computational Linguistics},
url = {http://people.eng.unimelb.edu.au/tcohn/papers/emnlp17geomdn.pdf}
}
Afshin Rahimi [email protected]