Idea: Date and Time Parsing #513

cmuell89 · 2016-09-30T23:06:27Z

cmuell89
Sep 30, 2016

@honnibal congrats on the release milestone!

I'm posting to see what sort of community exists amongst spaCy users for more robust date and time parsing. For many NL-based applications, date and time parsing is tremendously useful, but it is difficult for a statistical parser to give consistent results from application to application. There are so many arbitrary dates, ranges, and holidays that can confound a purely statistical date-time parser. There is potential within spaCy's existing features (the Matcher and PhraseMatcher classes) to build out a rule-based parser similar to Stanford's SUTime (see SUTime in action) or wit.ai's open-source Duckling.

SUTime parses text using token-oriented regex patterns that map to semantic objects representing dates, durations, ranges, etc. These semantic objects can be combined to create higher-order date and time annotations based on additional rules.

Duckling takes a more functional approach (it is written in Clojure) and uses primitive rule-based functions that can be composed into higher-order functions, ultimately returning potential date and time annotations. I believe it further uses a Naive Bayes classifier to pick the likeliest annotation, since Duckling can also parse other kinds of rule-based annotation (currency, etc.) and therefore has to resolve conflicts between them.
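To make the idea of composing primitive rules into higher-order ones concrete, here is a minimal Python sketch. It is not Duckling's implementation or API; the names `weekday`, `hour`, `word`, `sequence` and `find_all` are made up for illustration, and a rule is simply assumed to be a function from a token list and a start index to a `(value, next_index)` pair or `None`.

```python
import re

WEEKDAYS = {"monday": 0, "tuesday": 1, "wednesday": 2, "thursday": 3,
            "friday": 4, "saturday": 5, "sunday": 6}

def weekday(tokens, i):
    """Primitive rule: match a weekday name at position i."""
    if i < len(tokens) and tokens[i].lower() in WEEKDAYS:
        return {"weekday": WEEKDAYS[tokens[i].lower()]}, i + 1
    return None

def hour(tokens, i):
    """Primitive rule: match a 12-hour clock token such as '3pm' or '11am'."""
    if i >= len(tokens):
        return None
    m = re.fullmatch(r"(1[0-2]|0?[1-9])(am|pm)", tokens[i].lower())
    if m is None:
        return None
    value = int(m.group(1)) % 12 + (12 if m.group(2) == "pm" else 0)
    return {"hour": value}, i + 1

def word(expected):
    """Build a rule that matches one literal word and contributes no value."""
    def rule(tokens, i):
        if i < len(tokens) and tokens[i].lower() == expected:
            return {}, i + 1
        return None
    return rule

def sequence(*rules):
    """Higher-order rule: all rules must match one after another."""
    def rule(tokens, i):
        merged = {}
        for r in rules:
            result = r(tokens, i)
            if result is None:
                return None
            value, i = result
            merged.update(value)
        return merged, i
    return rule

# A composed rule built only from the primitives above.
weekday_at_hour = sequence(weekday, word("at"), hour)

def find_all(rule, tokens):
    """Try the rule at every position; return (start, end, value) triples."""
    matches = []
    for start in range(len(tokens)):
        result = rule(tokens, start)
        if result is not None:
            value, end = result
            matches.append((start, end, value))
    return matches
```

Calling `find_all(weekday_at_hour, "Let's meet Friday at 3pm".split())` yields one match covering tokens 2 to 5. A real system would also need a way to rank competing matches, which is the role the classifier plays in the description above.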

I am curious if there is a market for creating a similar parser built within the SpaCy ecosystem (and under the same license of course). Some requirements would be as follows:

- Create a DSL to represent rules and patterns.
- Recognize which existing spaCy features can be extended (see "Named Entity Recognition: get accepted patterns using regex" #486 and "how to add hashtags to the part of speech tagger" #503) and which features must be implemented.
- Provide a method to combine rules into higher-order representations (object composition? tree structure? function composition?).
- Documentation: tutorials on rule creation, source code, etc.
- Integrate the parser into spaCy's pipeline.
- Make it FAST, in the spirit of spaCy.

If this is not the right forum for this discussion, please let me know. (Is spaCy's Gitter up and running?) I'd be happy to maintain this project's development, but it would be a significant undertaking and a bit beyond my level of expertise.

Cheers!

Carl

aborsu · 2016-10-01T10:50:01Z

aborsu
Oct 1, 2016

I have a pretty good date parsing grammar that works for en_US, en_UK and fr that was written a while back using unitex but I would be willing to convert it and open source it.

I have been trying to switch my company's NLP activities to spaCy, but since we have French and Dutch NLP pipelines, we are still far away from it. This would definitely help though.

0 replies

honnibal · 2016-10-01T11:33:41Z

honnibal
Oct 1, 2016
Maintainer

@aborsu: That'd be really great!

I'm keen to have time and date parsing in spaCy. I'd love for someone to step forward to lead development on this.

Lately I've been thinking a bit about local grammars/finite-state automata. I think these things are standard and well-proven. It'd be good to have an implementation of this, especially if we support a standard format such as unitex.

0 replies

cmuell89 · 2016-10-03T01:41:03Z

cmuell89
Oct 3, 2016
Author

@aborsu That sounds wonderful. Might you elaborate more on what you meant by converting what you have written or used from unitex for use in SpaCy? I am not very familiar with unitex but I am assuming you patched together some of the underlying C/C++ modules for your purposes rather than make use of the GUI. I would be glad to help out if you need.

@honnibal Do you have any reading recommendations on how FSAs are generally implemented for use in something like date-time parsing? What sort of states are represented, what constitutes a transition to a new state, etc.? Do you envision a custom implementation, or do you think an external library might suit our needs?

What do you see as the major steps needed to bring date and time parsing into spaCy?

Just some questions to get ideas brewing.

0 replies

aborsu · 2016-10-03T08:32:03Z

aborsu
Oct 3, 2016

Unitex is actually written in C++. The GUI uses JNI bindings (very well done JNI bindings, may I add) to drive the C++ code. Unitex is actually a lot more than just finite-state automata: it also supports transformative grammars, cascades of transformative grammars, probabilistic taggers (but they don't provide any) and other stuff...
Since we don't currently have the ability to train our own models for lack of data, we mostly use it as a date and event extractor with hand-written rules.
I really recommend reading the Unitex manual, as it is really powerful for pattern matching.

Patterns in Unitex can be represented as graphs. They are read left to right, and the top path is preferred in case of ambiguity. Graphs created using Unitex are saved as .grf files, which look like this:

#Unigraph
SIZE 1188 840
FONT Times New Roman:  10
OFONT Arial Unicode MS:B 12
BCOLOR 16777215
FCOLOR 0
ACOLOR 13487565
SCOLOR 16711680
CCOLOR 255
DBOXES y
DFRAME y
DDATE y
DFILE y
DDIR n
DRIG n
DRST n
FITS 100
PORIENT L
#
8
"<E>" 70 203 1 5 
"" 624 203 0 
"<!DIC>" 209 203 1 3 
"road+street+way+avenue" 284 203 1 6 
"<E>/{,street=$NAME$.LOC}" 429 203 1 1 
"$NAME(" 134 203 2 2 7 
"$NAME)" 354 203 1 4 
"upper+lower" 154 203 1 2

The top part contains information about how to display the graph; the bottom part contains the graph itself. This graph detects street names by finding a word that is not in our dictionary and is followed by road OR street OR way OR avenue, and it outputs ,street=UNKNOWN_WORD.LOC.

If I run Unitex and tell it to output a concordance, it gives me a list of spans with their output, for example:
17 26 ,street=Berthelot.LOC
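For readers who don't know Unitex, the rule described in prose above can be approximated in a few lines of Python. This is only a rough sketch of that description, not a reimplementation of Unitex or of the .grf semantics: `DICTIONARY` is a tiny hypothetical stand-in for Unitex's dictionary lookup, `find_streets` is an invented name, and the offsets it reports are character offsets of the whole match, which may not correspond to the numbers in the concordance line.

```python
import re

# Hypothetical stand-in for the dictionary: any word NOT in this set
# plays the role of the <!DIC> box in the graph.
DICTIONARY = {"the", "a", "on", "this", "that", "main", "upper", "lower",
              "road", "street", "way", "avenue"}
STREET_WORDS = {"road", "street", "way", "avenue"}

def find_streets(text):
    """Return (start, end, output) for each unknown word followed by a street word."""
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]
    results = []
    # Walk over adjacent token pairs: (candidate name, following word).
    for (name, start, _), (following, _, end) in zip(tokens, tokens[1:]):
        if name.lower() not in DICTIONARY and following.lower() in STREET_WORDS:
            results.append((start, end, f",street={name}.LOC"))
    return results
```

For example, `find_streets("We moved to Berthelot street last year")` returns a single span whose output is `,street=Berthelot.LOC`, while "the street" alone produces nothing because "the" is in the dictionary.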

The grf files can link one another and can be compiled to a binary format used in unitex.

Incorporating this into spaCy would require either writing code in spaCy that is compatible with the grf or bin files used in Unitex, or wrapping Unitex itself.

0 replies

honnibal · 2016-10-03T08:55:51Z

honnibal
Oct 3, 2016
Maintainer

There's a Python wrapper for unitex here: https://pypi.python.org/pypi/pyunitex

We should check out whether this is good enough, and if it's not, write our own wrapper using Cython. The advantage of the Cython wrapper is that we can call into the C++ code with no performance penalties --- this might be appealing for some workflows.

Unitex is LGPL licensed, so we can't have it as an installation requirement of spaCy. Instead, we could have it as a capability requirement, so you do pip install spacy[unitex]. Some organisations have different policies about using MIT and LGPL licensed code, and we can't stay MIT licensed if we have an LGPL installation requirement.

0 replies

aborsu · 2016-10-03T09:06:15Z

aborsu
Oct 3, 2016

The wrapper hasn't been maintained in a long time. Also, to work properly, we would have to share the dictionary, the tokens and the token features. Coupled with the licensing issue, that tells me it might be better to take inspiration from Unitex than to wrap it. (But I'm willing to admit that I don't have much time to do either... 😁) (I don't even really have time to do much NLP.)

0 replies

cmuell89 · 2016-10-03T18:21:07Z

cmuell89
Oct 3, 2016
Author

@honnibal Is date-time parsing something you would like to incorporate into spaCy under the same license? Do you think it is worth the development time, or would you be satisfied with a capability requirement? I guess another potential cost would be the context switch from spaCy to Unitex. To make such a capability useful, a wrapper would have to ensure sensible mappings between the annotations created by Unitex and spaCy's objects (Doc, Span, Token, Lexeme, etc.). And if we're leaning towards a capability requirement, it might be worth looking into other candidates (although Unitex appears to be a very good one the more I look into it).

@aborsu and @honnibal I'd certainly like to assist in leading any sort of implementation (in-house or external capability) but would likely need some help/guidance from those with more expertise than myself. Capable and customizable date, time, and temporal expression parsing will be very valuable for users of SpaCy, including myself, which is why I am willing to commit time to its development.

0 replies

cmuell89 · 2016-10-05T00:28:19Z

cmuell89
Oct 5, 2016
Author

In case anyone is interested, I've come across a paper that describes CasSys, a software system that uses finite-state transducer cascades for NER and various other tasks. It was developed using the Intex tool set, which is the closed-source precursor to Unitex.

0 replies

honnibal · 2016-10-05T08:47:03Z

honnibal
Oct 5, 2016
Maintainer

Let's have a call to discuss this. Carl, are you free Friday morning (PDT)? Alternatively, if this can wait until next Wednesday, I'll be able to do a bit more reading, and be better prepared.

0 replies

cmuell89 · 2016-10-05T15:36:29Z

cmuell89
Oct 5, 2016
Author

Yes, I am free for a call either day. Let's shoot for next Wednesday, the 12th. As you guessed, I am in California (PDT), but I am flexible about waking up early or staying up late if that works best for you.

0 replies

cmuell89 · 2016-10-05T16:39:06Z

cmuell89
Oct 5, 2016
Author

The additional resources referenced in this tutorial for OpenFST provide good background and motivation for applying FSTs to language. This is all still quite new to me, so I have a lot of reading to do.

0 replies

cmuell89 · 2016-10-10T21:58:55Z

cmuell89
Oct 10, 2016
Author

@honnibal Any update on having a discussion this Wednesday? Cheers.

0 replies

viksit · 2016-10-14T08:45:50Z

viksit
Oct 14, 2016

Incidentally, one of the best date parsers out there happens to be https://github.com/bear/parsedatetime. Might be worth incorporating with a nice spaCy API. It uses the Apache license.

0 replies

cmuell89 · 2016-10-14T15:57:38Z

cmuell89
Oct 14, 2016
Author

I have toyed with this library a bit. Correct me if I am wrong, but I found it effective at parsing expressions and not at extracting those expressions from text. It may parse "last week" into a date and time annotation, but it does not pick "last week" out of "We met for dinner and a movie last week."
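The distinction between the two steps can be sketched with the standard library alone. The snippet below is not how parsedatetime works; it is a toy with a hand-picked list of expressions (`RESOLVERS`, `PATTERN` and `extract_dates` are hypothetical names) that first locates an expression inside running text and only then resolves it against a reference date. Treating "last week" as exactly seven days earlier is a simplifying assumption.

```python
import re
from datetime import date, timedelta

# Step 2 (parsing): turn a known expression into a date, given a reference date.
RESOLVERS = {
    "today": lambda ref: ref,
    "yesterday": lambda ref: ref - timedelta(days=1),
    "tomorrow": lambda ref: ref + timedelta(days=1),
    "last week": lambda ref: ref - timedelta(weeks=1),  # simplification: a single day
    "next week": lambda ref: ref + timedelta(weeks=1),
}

# Step 1 (extraction): find where such expressions occur in free text.
# Longer expressions are listed first so they win over shorter alternatives.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in sorted(RESOLVERS, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)

def extract_dates(text, reference):
    """Return (start, end, matched_text, resolved_date) for each expression found."""
    return [
        (m.start(), m.end(), m.group(), RESOLVERS[m.group().lower()](reference))
        for m in PATTERN.finditer(text)
    ]
```

With `reference=date(2016, 10, 14)`, the sentence "We met for dinner and a movie last week." gives one result whose matched text is "last week" and whose resolved date is 2016-10-07.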

0 replies

rmdort · 2016-10-20T08:43:44Z

rmdort
Oct 20, 2016

I actually wrote a date parser for chatterbot.

https://github.com/gunthercox/ChatterBot/blob/master/chatterbot/utils/parsing.py

If interested, I can port it to spaCy.

0 replies

evanmiltenburg · 2017-11-07T23:13:43Z

evanmiltenburg
Nov 7, 2017

Nobody seems to have mentioned HeidelTime yet. As far as I know, this is one of the best tools for getting times and dates in many different languages.

0 replies

ines · 2017-11-09T16:53:40Z

ines
Nov 9, 2017
Maintainer

Quick update: This might be a nice use case for the new custom processing pipeline components and extension attributes introduced in v2.0!

0 replies

DuyguA · 2017-11-20T10:56:28Z

DuyguA
Nov 20, 2017

By the way, date-time format is a tricky question: it depends on which sort of strings one wants to recognize.

For instance, my company does sales and customer care. We have CFG parsers available for both English and German. We also recognize relative time strings:

- yesterday around 15
- in 6 months
- after Christmas vacation
- 2 March Mon., 15.00
- tomorrow 12.00
- this Friday, after 12.00

Here are some example strings that are recognized in German:

seit ein paar Jahren
seit Februar
seit mehr als 10 Jahren
seit bald 4 Jahren
seit September letztes jahres
seit Mitte 2010
seit vielen Jahren

Ende September
im Januar
ende Februar/Anfang Maerz
ab Maerz nachsten Jahres
im Sommer
Sommer 2016
2. Quartal 2016
dritten Quartal 2016
vierten Quartal
IV. Quartal 2015
am Donnerstag um 10.00Uhr
Dienstag den 22.3. von 08:30 - 10:30 und 13:00-15:00Uhr
Morgen oder Freitag um 9Uhr
Di., den 29.12. oder Mi, den 30.12
freitag Nachmittag zwischen 1530 und 17Uhr
Freitag 15:30 oder 16:00 Uhr

... it really depends on what sort of strings one wants to recognize. For instance, we recognize quarters (Quartal) because we work in the business domain.

Both German and English parsers are CFG parsers, hence lexical and language dependent. We use Lark and pyparsing.

@ines, if you're interested, I'll speak to my boss and we can provide the parsers. We'll be happy to provide them :)

0 replies

ines · 2017-11-21T23:15:16Z

ines
Nov 21, 2017
Maintainer

@DuyguA This looks great – and yes, this is definitely interesting! 👍

It'd be really cool to have this as a separate plugin that users could plug into their spaCy pipeline. I don't know if your company already publishes open-source software, but this could be a nice opportunity: you get to publish the parser and maintain the project, and I'm sure a lot of spaCy users would really appreciate it. We have a lot of users working with business text in similar domains, so your plugin could definitely be very popular!

As an idea, once you've extracted the spans, the pipeline component could set additional TIME and DATE entities on the Doc (see here for more info). For example, something like this (pseudo code and all, but I hope you get the idea):

from spacy.tokens import Span

parser = load_your_parser()  # placeholder: load your parser somehow

def time_and_date_parser(doc):
    times = parser(doc)  # run your parser over the document
    for start, end in times:  # assuming it returns the start and end token index of each span
        entity = Span(doc, start, end, label=doc.vocab.strings['TIME'])  # define span
        doc.ents = list(doc.ents) + [entity]  # add entity to doc.ents
    return doc  # return Doc object for next pipeline component

Usage could then look something like this:

import spacy
from time_and_date_parser import time_and_date_parser  # or some other cool name

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(time_and_date_parser, before='ner')  # add it before spaCy's entity recognizer

doc = nlp("How about this Friday, after 12.00?")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('this Friday, after 12.00', 'TIME')]

spaCy's entity recognizer respects already set entities, so if the component is added before 'ner' in the pipeline, it shouldn't interfere.

0 replies

Meekohi · 2017-11-30T19:41:59Z

Meekohi
Nov 30, 2017

I would be extremely interested in this. Currently trying to parse stuff like "Fourth and Second Wednesday of every month, 8-9 a.m. (2018)" and am pretty stumped.

0 replies

DuyguA · 2018-03-29T15:30:02Z

DuyguA
Mar 29, 2018

Sorry for the late answer. I wrote a blog post describing how to design a context-free grammar:

https://duygua.github.io/blog/2018/03/28/chatbot-nlu-series-datetimeparser/

I used Lark, but one can use PyParsing and of course yacc:)

For more advanced colleagues, I also included efficiency issues and language modelling.

0 replies

azarezade · 2018-05-13T07:45:15Z

azarezade
May 13, 2018

@DuyguA Thanks for the blog post!
But how can someone convert the parsed string to a standard time format (given a reference time)?

0 replies

DuyguA · 2018-05-13T08:09:23Z

DuyguA
May 13, 2018

One needs an attribute grammar. Basically, the parse tree is mapped to an assembly language. Something like this:

https://en.m.wikipedia.org/wiki/Attribute_grammar

https://github.com/lark-parser/lark/blob/master/examples/calc.py

I can extend the blog post. This way one can convert relative dates to Python objects.
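As a rough illustration of the attribute-grammar idea (each grammar rule computes a value from the values of its children), here is a small stdlib-only sketch. It does not use Lark and is not the grammar from the blog post; the three-rule grammar, the function names and the supported units are all assumptions made for this example.

```python
from datetime import datetime, timedelta

# Terminal attributes: each unit word carries a timedelta value.
UNITS = {"hour": timedelta(hours=1), "day": timedelta(days=1), "week": timedelta(weeks=1)}
# Anchor words carry an offset in days relative to the reference time.
ANCHORS = {"today": 0, "tomorrow": 1, "yesterday": -1}

def parse_unit(token):
    """UNIT -> 'hour' | 'day' | 'week' (optional plural 's'); value: a timedelta."""
    singular = token[:-1] if token.endswith("s") else token
    if singular not in UNITS:
        raise ValueError(f"unknown unit: {token!r}")
    return UNITS[singular]

def parse_offset(tokens):
    """OFFSET -> NUMBER UNIT; value: NUMBER.value * UNIT.value."""
    number, unit = tokens
    return int(number) * parse_unit(unit)

def resolve(expression, reference):
    """EXPR -> ANCHOR | 'in' OFFSET | OFFSET 'ago'; value: a datetime."""
    tokens = expression.lower().split()
    if len(tokens) == 1 and tokens[0] in ANCHORS:
        return reference + timedelta(days=ANCHORS[tokens[0]])
    if len(tokens) == 3 and tokens[0] == "in":
        return reference + parse_offset(tokens[1:])
    if len(tokens) == 3 and tokens[2] == "ago":
        return reference - parse_offset(tokens[:2])
    raise ValueError(f"unsupported expression: {expression!r}")
```

For instance, `resolve("in 2 weeks", datetime(2018, 5, 13, 12, 0))` returns `datetime(2018, 5, 27, 12, 0)`. Months and quarters are left out on purpose, since they cannot be expressed as a fixed `timedelta`.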

0 replies

azarezade · 2018-05-13T11:07:34Z

azarezade
May 13, 2018

Thanks, that would be nice if you can extend your blog post.

0 replies

azarezade · 2018-05-14T04:32:44Z

azarezade
May 14, 2018

Has anybody compared NLTK with Lark in terms of speed and accuracy?

0 replies

DuyguA · 2018-05-14T08:41:24Z

DuyguA
May 14, 2018

Yes, myself :)

Lark is the fastest, PyParsing takes second place, and NLTK is not very competitive.

NLTK is not very efficient in general, so it isn't really suitable for real-time applications; it looks good for academic purposes. Nobody I know uses NLTK for industrial applications. Most of my text-mining friends build on spaCy + TensorFlow/Keras backends (including myself 😁 )

0 replies

azarezade · 2018-05-14T08:45:11Z

azarezade
May 14, 2018

Great! 👍
So, I will use Lark too ;)

0 replies

DuyguA · 2018-05-14T15:48:47Z

DuyguA
May 14, 2018

OK, I don't want to speak too much 😁 but:

Importing two similar packages (in this case both spaCy and NLTK) is not good from a software engineering point of view.
Importing a big library for only one piece of functionality (in this case NLTK's parser) is not good for efficiency either.

If you really want to use NLTK, go for the Earley parser (there is one). We have both English and German CFGs for our in-house e-mail bots, and both grammars are very ambiguous. I think any parser less powerful than Earley cannot cover the language of date-time strings.

0 replies

piyushrj · 2018-06-12T12:48:19Z

piyushrj
Jun 12, 2018

Hello, I'm following this thread to extract dates from a text corpus. I'm using spaCy's NER model to get dates and have created some custom rules using the Matcher class to identify some missed date formats. Here is my code. I need to test my approach and want to see its coverage for different kinds of date formats, but I don't have the data to do so. If I could get such data for this coverage analysis, it would be of great help.

0 replies

honnibal · 2019-03-11T15:08:46Z

honnibal
Mar 11, 2019
Maintainer

I think we can finally close this 🎉

While we don't have time and date parsing in the library itself, in v2.1 we now have:

- Correct handling of quantifiers in the Matcher
- Support for matching against token extensions
- Support for matching with regular expressions
- Support for matching with comparison and set-valued predicates
- The EntityRuler class, to make the match dictionaries into an entity recognizer
- Entry point hooks for custom pipeline components

The entry point hooks mean that grammars for things like times and dates don't need to live in the main library --- they can be published as extension modules, with no difference to user experience. I think the matcher support we now have basically completes the core library functionality needed to do this.

0 replies
