-
Hello @crtnx, I appreciate your interest in the inner workings of the project. You can find the answers to many theoretical (as well as practical) questions by checking out the scientific paper on which the project is based. It contains details about our models' architecture, the experimental setup in which the models were trained, some challenges we faced and how they steered the research, as well as a detailed analysis of our models. As for the questions you have listed, here is a brief answer to each one:
In the context of natural language processing, Sequence-to-Sequence (Seq2Seq) is a general term for a model composed of an encoder and a decoder. The encoder ingests an entire sequence as input and encodes it, while the decoder produces an output sequence from the provided encoding. The underlying model can indeed be a transformer; however, our models use a type of recurrent neural network called Long Short-Term Memory (LSTM for short). The reason we chose a Seq2Seq composed of LSTMs is that this architecture was the state of the art in address parsing when we first started working on the project back in 2019. In addition, the problem of address parsing can also (and more simply) be tackled with an encoder-only architecture using a single recurrent network or a transformer encoder such as BERT.
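To make the encoder/decoder split concrete, here is a minimal, hypothetical PyTorch sketch of an LSTM-based Seq2Seq tagger. It is not the project's actual implementation; the dimensions, the simplified (non-autoregressive) decoder, and names like `tag_vocab_size` are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class Seq2SeqTagger(nn.Module):
    """Illustrative LSTM encoder-decoder that maps a sequence of word
    embeddings to a sequence of address-component tags."""

    def __init__(self, embedding_dim=300, hidden_size=512, tag_vocab_size=9):
        super().__init__()
        self.encoder = nn.LSTM(embedding_dim, hidden_size, batch_first=True)
        self.decoder = nn.LSTM(embedding_dim, hidden_size, batch_first=True)
        self.prediction_layer = nn.Linear(hidden_size, tag_vocab_size)

    def forward(self, embedded_address):
        # The encoder ingests the whole address and summarizes it in its
        # final hidden and cell states.
        _, (hidden, cell) = self.encoder(embedded_address)
        # The decoder reuses that encoding to emit one output per word,
        # which the prediction layer turns into tag scores.
        decoder_out, _ = self.decoder(embedded_address, (hidden, cell))
        return self.prediction_layer(decoder_out)

model = Seq2SeqTagger()
fake_batch = torch.randn(2, 7, 300)  # 2 addresses, 7 words, 300-dim embeddings
tag_scores = model(fake_batch)       # shape: (2, 7, tag_vocab_size)
```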
That's a good question; unfortunately, I would need to run some experiments to answer with 100% confidence. However, here are some impressions I have on the subject: our framework is similar to NER with an IO tag scheme, except that instead of the O tag, each word is assigned an entity type, so I would imagine that a performant NER model following this framework would yield good enough performance. If other tag schemes are employed, the task becomes more complicated, and I would expect a relative drop in performance depending on the models used. As for spaCy's SpanCat, it would have to be tested, as I'm not really sure.
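As a concrete illustration of that framing (the tag names below are examples, not necessarily the project's exact tag set), every word of the address receives an address-component tag, so there is no "outside" label to predict:

```python
# Word-per-word tagging: every token gets an address-component tag,
# unlike a classic NER scheme where most tokens would be tagged "O".
address = ["350", "rue", "des", "Lilas", "Ouest", "Quebec", "G1L", "1B6"]
tags    = ["StreetNumber", "StreetName", "StreetName", "StreetName",
           "Orientation", "Municipality", "PostalCode", "PostalCode"]

for word, tag in zip(address, tags):
    print(f"{word:12s} -> {tag}")
```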
The models were coded in PyTorch, and the training was conducted using Poutyne on Nvidia GPUs (RTX 2080 and Titan Xp). The training time varies by model: fastText models can take up to a few hours, while BPEmb models can take a few weeks because of the added complexity of the model.
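For readers unfamiliar with Poutyne, the training loop roughly reduces to wrapping the PyTorch network in a Poutyne `Model`. The sketch below is a generic example with a toy network and made-up data, not the project's training script.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from poutyne import Model

# Toy stand-ins for the real network and datasets.
network = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))
train_loader = DataLoader(TensorDataset(torch.randn(64, 10),
                                        torch.randint(0, 3, (64,))), batch_size=8)
valid_loader = DataLoader(TensorDataset(torch.randn(16, 10),
                                        torch.randint(0, 3, (16,))), batch_size=8)

optimizer = optim.SGD(network.parameters(), lr=0.1)
loss_function = nn.CrossEntropyLoss()

# Poutyne wraps the network and handles the training/validation loop;
# the model can also be moved to an Nvidia GPU if one is available.
model = Model(network, optimizer, loss_function, batch_metrics=["accuracy"])
model.fit_generator(train_loader, valid_loader, epochs=5)
```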
You can indeed train/fine-tune our models on a different dataset and a different set of tags. You can check out the docs for an example. I hope this answers your questions, and do not hesitate to let me know if you have further inquiries!
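As a rough sketch of what retraining on your own data and tags might look like (the class names, arguments, file path, and tag names below are assumptions based on my reading of the docs, so double-check them against the current documentation):

```python
from deepparse.dataset_container import PickleDatasetContainer
from deepparse.parser import AddressParser

# A pickled list of (address, list_of_tags) pairs; the path and the tag
# names below are placeholders for your own data.
training_container = PickleDatasetContainer("./my_training_dataset.p")

# Map each custom tag to an index; an end-of-sequence tag is included.
tag_dictionary = {"StreetNumber": 0, "StreetName": 1, "Unit": 2,
                  "Municipality": 3, "PostalCode": 4, "EOS": 5}

address_parser = AddressParser(model_type="fasttext", device=0)
address_parser.retrain(training_container, train_ratio=0.8, epochs=5,
                       batch_size=32, prediction_tags=tag_dictionary)
```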
-
Hi @crtnx, First, I think it is important to recall the reasons behind this project. The project was an academic machine learning summer internship for @MAYAS3. His objective was to develop a tool to parse addresses as a pre-processing step for my master's thesis. Our constraints were: 1) it had to be bilingual, 2) it should not require an Internet connection to work (i.e. not an API), and 3) it had to be developed in Python so it could be maintained (if needed) by the industrial partner (thus not in C/C++ or Java). Along the way, we developed a Canadian bilingual address parser for my master's thesis and found promising results (and a dataset) for a multinational address parser. However, since it is more of an academic-first project, we did not simply train with all the data in the world and focused more on transfer learning (the capacity to learn task B from task A).
We chose the Seq2Seq architecture as our first approach (since it was a research project) because we framed the problem as a translation (sequence-to-tags) problem. Then, we improved our approach using a Seq2Seq model with attention. If I recall correctly, we used the Bahdanau attention implementation. The approach frames the address as a sequence that must be translated into a sequence of tags. We also experimented with plain LSTM and RNN approaches, but Seq2Seq showed more promising results for multinational parsing than a single LSTM. We also experimented with a Transformer-like architecture (BERT), and there is currently a live branch for this approach. Preliminary experimentation shows a performance improvement. However, we don't have enough time to implement such a large feature right now.
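For reference, Bahdanau (additive) attention scores each encoder hidden state against the current decoder state to build a context vector. Here is a minimal, self-contained PyTorch sketch (not the project's code; the hidden size and batch shapes are illustrative):

```python
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    """Additive attention: score(s_t, h_i) = v^T tanh(W_s s_t + W_h h_i)."""

    def __init__(self, hidden_size):
        super().__init__()
        self.W_s = nn.Linear(hidden_size, hidden_size)
        self.W_h = nn.Linear(hidden_size, hidden_size)
        self.v = nn.Linear(hidden_size, 1)

    def forward(self, decoder_state, encoder_outputs):
        # decoder_state: (batch, hidden); encoder_outputs: (batch, seq_len, hidden)
        scores = self.v(torch.tanh(self.W_s(decoder_state).unsqueeze(1)
                                   + self.W_h(encoder_outputs)))  # (batch, seq_len, 1)
        weights = torch.softmax(scores, dim=1)                    # attention weights
        context = (weights * encoder_outputs).sum(dim=1)          # (batch, hidden)
        return context, weights.squeeze(-1)

attention = BahdanauAttention(hidden_size=512)
context, weights = attention(torch.randn(2, 512), torch.randn(2, 7, 512))
```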
To continue on @MAYAS3's response: it could work with similar performance, since (I'm pretty sure of it but have not verified) spaCy's current NER uses a Transformer-like architecture with Bloom subword embeddings. They train the model using different tasks that I don't recall. One significant advantage of that approach (the NER one) is that it could even make it possible to extract the address from within a text and parse it. BUT, from my research, I could not find a multinational dataset for that. Thus, it would require a large effort to develop a single-language model, use various techniques to develop a transfer-learning scheme, and develop a multinational model (now that I think about it, it could be a Google Summer of Code project).
Totally! We would be more than happy to add your usage example to the docs as well! The architecture allows you to change the tags, the prediction layer (a fully connected network), and the whole architecture (mostly the hidden size) of the Seq2Seq.
-
Dear project founders,
I personally, and probably many others, would be very interested in understanding the theoretical foundations, decisions, and challenges encountered over the course of the project's development.
Here is a partial list of questions that could be interesting for the dev community (I believe many, like me, have some basic understanding of neural networks but are interested in understanding things more deeply):
I would appreciate it if you could find the time to answer these questions.