-
Hello @crtnx, I appreciate your interest in the inner workings of the project. You can find the answers to many theoretical (as well as practical) questions by checking out the scientific paper on which the project is based. It contains details about our models' architecture, the experimental setup in which the models were trained, some challenges we faced and how they steered the research, as well as a detailed analysis of our models. As for the questions you have listed, here is a brief answer to each one:
In the context of natural language processing, Sequence-to-Sequence (Seq2Seq) is a general term for a model composed of an encoder and a decoder. The encoder ingests an entire sequence as input and encodes it, while the decoder produces an output sequence from the provided encoding. The underlying model can indeed be a transformer; however, our models use a type of recurrent neural network called Long Short-Term Memory (LSTM for short). The reason we chose a Seq2Seq composed of LSTMs is that this architecture was the state of the art in address parsing when we first started working on the project back in 2019. In addition, the problem of address parsing can also (and more simply) be tackled with an encoder-only architecture using a single recurrent network or a transformer encoder such as BERT.
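To make the encoder/decoder split concrete, here is a minimal, hypothetical PyTorch sketch of an LSTM-based Seq2Seq tagger. It is not the project's actual implementation; the dimensions, the simplified (non-autoregressive) decoder, and names like `tag_vocab_size` are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class Seq2SeqTagger(nn.Module):
    """Illustrative LSTM encoder-decoder that maps a sequence of word
    embeddings to a sequence of address-component tags."""

    def __init__(self, embedding_dim=300, hidden_size=512, tag_vocab_size=9):
        super().__init__()
        self.encoder = nn.LSTM(embedding_dim, hidden_size, batch_first=True)
        self.decoder = nn.LSTM(embedding_dim, hidden_size, batch_first=True)
        self.prediction_layer = nn.Linear(hidden_size, tag_vocab_size)

    def forward(self, embedded_address):
        # The encoder ingests the whole address and summarizes it in its
        # final hidden and cell states.
        _, (hidden, cell) = self.encoder(embedded_address)
        # The decoder reuses that encoding to emit one output per word,
        # which the prediction layer turns into tag scores.
        decoder_out, _ = self.decoder(embedded_address, (hidden, cell))
        return self.prediction_layer(decoder_out)

model = Seq2SeqTagger()
fake_batch = torch.randn(2, 7, 300)  # 2 addresses, 7 words, 300-dim embeddings
tag_scores = model(fake_batch)       # shape: (2, 7, tag_vocab_size)
```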
That's a good question; unfortunately, I would need to run some experiments to answer with 100% confidence. However, here are some impressions I have on the subject: our framework is similar to NER with an IO tag scheme, except that instead of the O tag, each word is assigned an entity type, so I would imagine that a performant NER model following this framework would yield good enough performance. If other tag schemes are employed, the task becomes more complicated, and I would expect a relative drop in performance depending on the models used. As for spaCy's SpanCat, it would have to be tested, as I'm not really sure.
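As a concrete illustration of that framing (the tag names below are examples, not necessarily the project's exact tag set), every word of the address receives an address-component tag, so there is no "outside" label to predict:

```python
# Word-per-word tagging: every token gets an address-component tag,
# unlike a classic NER scheme where most tokens would be tagged "O".
address = ["350", "rue", "des", "Lilas", "Ouest", "Quebec", "G1L", "1B6"]
tags    = ["StreetNumber", "StreetName", "StreetName", "StreetName",
           "Orientation", "Municipality", "PostalCode", "PostalCode"]

for word, tag in zip(address, tags):
    print(f"{word:12s} -> {tag}")
```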
The models were coded in PyTorch, and the training was conducted using Poutyne on Nvidia GPUs (RTX 2080 and Titan Xp). The training time varies by model: fastText models can take up to a few hours, while BPEmb models can take a few weeks because of the added complexity of the model.
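For readers unfamiliar with Poutyne, the training loop roughly reduces to wrapping the PyTorch network in a Poutyne `Model`. The sketch below is a generic example with a toy network and made-up data, not the project's training script.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from poutyne import Model

# Toy stand-ins for the real network and datasets.
network = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))
train_loader = DataLoader(TensorDataset(torch.randn(64, 10),
                                        torch.randint(0, 3, (64,))), batch_size=8)
valid_loader = DataLoader(TensorDataset(torch.randn(16, 10),
                                        torch.randint(0, 3, (16,))), batch_size=8)

optimizer = optim.SGD(network.parameters(), lr=0.1)
loss_function = nn.CrossEntropyLoss()

# Poutyne wraps the network and handles the training/validation loop;
# the model can also be moved to an Nvidia GPU if one is available.
model = Model(network, optimizer, loss_function, batch_metrics=["accuracy"])
model.fit_generator(train_loader, valid_loader, epochs=5)
```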
You can indeed train/fine-tune our models on a different dataset and a different set of tags. You can check out the docs for an example. I hope this answers your questions, and do not hesitate to let me know if you have further inquiries!
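As a rough sketch of what retraining on your own data and tags might look like (the class names, arguments, file path, and tag names below are assumptions based on my reading of the docs, so double-check them against the current documentation):

```python
from deepparse.dataset_container import PickleDatasetContainer
from deepparse.parser import AddressParser

# A pickled list of (address, list_of_tags) pairs; the path and the tag
# names below are placeholders for your own data.
training_container = PickleDatasetContainer("./my_training_dataset.p")

# Map each custom tag to an index; an end-of-sequence tag is included.
tag_dictionary = {"StreetNumber": 0, "StreetName": 1, "Unit": 2,
                  "Municipality": 3, "PostalCode": 4, "EOS": 5}

address_parser = AddressParser(model_type="fasttext", device=0)
address_parser.retrain(training_container, train_ratio=0.8, epochs=5,
                       batch_size=32, prediction_tags=tag_dictionary)
```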
-
Hi @crtnx, First, I think it is important to recall the reasons behind this project. The project was an academic machine learning summer internship for @MAYAS3. His objective was to develop a tool to parse addresses as a pre-processing step for my master's thesis. Our constraints were: 1) it had to be bilingual, 2) it should not require an Internet connection to work (i.e. not an API), and 3) it had to be developed in Python so it could be maintained (if needed) by the industrial partner (thus not in C/C++ or Java). Along the way, we developed a Canadian bilingual address parser for my master's thesis and found promising results (and a dataset) for a multinational address parser. However, since it is more of an academic-first project, we did not simply train with all the data in the world and focused more on transfer learning (the capacity to learn task B from task A).
We chose the Seq2Seq architecture as our first approach (since it was a research project) because we framed the problem as a translation (sequence-to-tags) problem. Then, we improved our approach using a Seq2Seq model with attention. If I recall correctly, we used the Bahdanau attention implementation. The approach frames the address as a sequence that must be translated into a sequence of tags. We also experimented with plain LSTM and RNN approaches, but Seq2Seq showed more promising results for multinational parsing than a single LSTM. We also experimented with a Transformer-like architecture (BERT), and there is currently a live branch for this approach. Preliminary experimentation shows a performance improvement. However, we don't have enough time to implement such a large feature right now.
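For reference, Bahdanau (additive) attention scores each encoder hidden state against the current decoder state to build a context vector. Here is a minimal, self-contained PyTorch sketch (not the project's code; the hidden size and batch shapes are illustrative):

```python
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    """Additive attention: score(s_t, h_i) = v^T tanh(W_s s_t + W_h h_i)."""

    def __init__(self, hidden_size):
        super().__init__()
        self.W_s = nn.Linear(hidden_size, hidden_size)
        self.W_h = nn.Linear(hidden_size, hidden_size)
        self.v = nn.Linear(hidden_size, 1)

    def forward(self, decoder_state, encoder_outputs):
        # decoder_state: (batch, hidden); encoder_outputs: (batch, seq_len, hidden)
        scores = self.v(torch.tanh(self.W_s(decoder_state).unsqueeze(1)
                                   + self.W_h(encoder_outputs)))  # (batch, seq_len, 1)
        weights = torch.softmax(scores, dim=1)                    # attention weights
        context = (weights * encoder_outputs).sum(dim=1)          # (batch, hidden)
        return context, weights.squeeze(-1)

attention = BahdanauAttention(hidden_size=512)
context, weights = attention(torch.randn(2, 512), torch.randn(2, 7, 512))
```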
To continue on @MAYAS3's response: it could work with similar performance, since (I'm pretty sure of it but have not verified) spaCy's current NER uses a Transformer-like architecture with Bloom subword embeddings. They train the model using different tasks that I don't recall. One significant advantage of that approach (the NER one) is that it could even make it possible to extract the address from within a text and parse it. BUT, from my research, I could not find a multinational dataset for that. Thus, it would require a large effort to develop a single-language model, use various techniques to develop a transfer-learning scheme, and develop a multinational model (now that I think about it, it could be a Google Summer of Code project).
Totally! We would be more than happy to add your usage example to the docs as well! The architecture allows you to change the tags, the prediction layer (a fully connected network), and the whole architecture (mostly the hidden size) of the Seq2Seq.
-
Dear project founders,
I personally, and probably many others, would be very interested in understanding the theoretical foundations, decisions, and challenges encountered over the course of the project's development.
Here is a partial list of questions that could be interesting for the dev community (I believe many, like me, have some basic understanding of neural networks but are interested in understanding things more deeply):
I would appreciate it if you could find the time to answer these questions.