Idea: Date and Time Parsing #513
Replies: 45 comments
-
I have a pretty good date parsing grammar that works for en_US, en_UK and fr. It was written a while back using Unitex, but I would be willing to convert it and open-source it. I have been trying to switch my company's NLP activities to spaCy, but since we have French and Dutch NLP pipelines, we are still far away from it. This would definitely help though.
-
@aborsu: That'd be really great! I'm keen to have time and date parsing in spaCy. I'd love for someone to step forward to lead development on this. Lately I've been thinking a bit about local grammars/finite-state automata. I think these things are standard and well-proven. It'd be good to have an implementation of this, especially if we support a standard format such as Unitex.
-
@aborsu That sounds wonderful. Could you elaborate on what you meant by converting what you have written in Unitex for use in spaCy? I am not very familiar with Unitex, but I assume you patched together some of the underlying C/C++ modules for your purposes rather than using the GUI. I would be glad to help out if you need. @honnibal Do you have any reading recommendations on how FSAs are generally implemented for something like date and time parsing? What sort of states are represented, what constitutes a transition to a new state, etc.? Do you envision a custom implementation, or do you think an external library might suit our needs? What do you see as the major steps needed to bring date and time parsing into spaCy? Just some questions to get ideas brewing.
-
Unitex is actually written in C++. The GUI uses JNI bindings (very well done JNI bindings, may I add) to drive the C++ code. Unitex is actually a lot more than just finite-state automata: it also supports transformative grammars, cascades of transformative grammars, probabilistic taggers (though it doesn't provide any) and other stuff... Patterns in Unitex can be represented as graphs and look like this. They are read left to right, and the top path is preferred in case of ambiguity. Graphs created using Unitex are saved as .grf files, which look like this.
The top part contains information about how to display the graph; the bottom part contains the graph itself. This graph detects street names by detecting a word that is not in our dictionary and is followed by road OR street OR way OR avenue, and it outputs this. If I run Unitex and tell it to output a concordance, it will give me a list of the spans it output, so for example ->
The .grf files can link to one another and can be compiled to a binary format used in Unitex. Incorporating this in spaCy would require either rewriting the code in spaCy to be compatible with the .grf or .bin files used in Unitex, or wrapping Unitex.
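For readers who have not used Unitex, the street-name pattern described above can be approximated in a few lines of plain Python. This is only a sketch of the idea (an unknown word followed by a street-type word), not Unitex's graph format or API; `KNOWN_WORDS`, `STREET_TYPES` and `find_street_names` are hypothetical names, and the dictionary is a toy stand-in for a real lexicon.

```python
# Toy stand-in for the dictionary: any word in here counts as "known".
KNOWN_WORDS = {
    "turn", "left", "at", "and", "then", "the", "old", "take",
    "road", "street", "way", "avenue",
}
# The trigger words named in the comment above.
STREET_TYPES = {"road", "street", "way", "avenue"}


def find_street_names(text):
    """Return (start_token, end_token, span_text) for each match.

    A match is a word that is NOT in the dictionary, immediately
    followed by one of the street-type words.
    """
    # Very naive tokenization, good enough for a sketch.
    tokens = text.replace(",", " ").replace(".", " ").split()
    matches = []
    for i in range(len(tokens) - 1):
        first, second = tokens[i], tokens[i + 1]
        if first.lower() not in KNOWN_WORDS and second.lower() in STREET_TYPES:
            matches.append((i, i + 2, f"{first} {second}"))
    return matches


if __name__ == "__main__":
    for match in find_street_names("Turn left at Abbey Road and then Penny Way."):
        print(match)
```

The returned token offsets play the role of the concordance spans: they are what a wrapper would need to map back onto spaCy tokens.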
-
There's a Python wrapper for Unitex here: https://pypi.python.org/pypi/pyunitex We should check whether this is good enough, and if it's not, write our own wrapper using Cython. The advantage of the Cython wrapper is that we can call into the C++ code with no performance penalty, which might be appealing for some workflows. Unitex is LGPL licensed, so we can't have it as an installation requirement of spaCy. Instead, we could have it as a capability requirement, so you do
-
The wrapper hasn't been maintained in a long time. Also, to work properly, we would have to share the dictionary, the tokens and the token features, which, coupled with the licensing issue, tells me that it might be better to take inspiration from Unitex than to wrap it. (But I'm willing to admit that I don't have much time to do either... 😁) (I don't even really have time to do much NLP)
-
@honnibal Is date and time parsing something you would like to incorporate into spaCy under the same license? Do you think it is worth the development time, or would you be satisfied with a capability requirement? I guess another potential cost would be the context switch from spaCy to Unitex. To make such a capability useful, a wrapper would have to ensure sensible mappings between annotations created by Unitex and spaCy (doc, span, token, lexemes, etc.). And if we're leaning towards a capability requirement, it might be worth looking into other candidates (although Unitex appears to be a very good one the more I look into it). @aborsu and @honnibal I'd certainly like to assist in leading any sort of implementation (in-house or external capability), but would likely need some help/guidance from those with more expertise than myself. Capable and customizable date, time, and temporal expression parsing will be very valuable for users of spaCy, including myself, which is why I am willing to commit time to its development.
-
In case anyone is interested, I've come across a paper that describes CasSys, a software system that uses finite-state transducer cascades for NER and various other tasks. It was developed using the Intex tool set, which is the closed-source precursor to Unitex.
-
Let's have a call to discuss this. Carl, are you free Friday morning (PDT)? Alternatively, if this can wait until next Wednesday, I'll be able to do a bit more reading, and be better prepared.
-
Yes, I am free for a call either day. Let's shoot for next Wednesday, the 12th. As you guessed, I am in California (PDT), but I am flexible about waking up early or staying up late if it works best for you.
-
Additional resources referenced in this tutorial for OpenFST provide good background and motivation on FSTs for application in language. It's all still quite new to me, so I have quite a lot of reading to do.
-
@honnibal Any update on having a discussion this Wednesday? Cheers.
-
Incidentally, one of the best date parsers out there happens to be https://github.com/bear/parsedatetime. Might be worth incorporating with a nice spaCy API. It uses the Apache license.
-
I have toyed with this library a bit. Correct me if I am wrong, but I found it effective at parsing expressions but not at extracting those expressions from text. It may parse "last week" into a date and time annotation, but it does not pick "last week" out of "We met for dinner and a movie last week."
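To make the parsing/extraction distinction concrete, here is a minimal stdlib-only sketch. It does not use parsedatetime and says nothing about what that library can or cannot do; the vocabulary in `RELATIVE_OFFSETS` and the function names are hypothetical. Parsing resolves an isolated expression to a date, while extraction first has to locate the expression inside running text.

```python
import re
from datetime import date, timedelta

# Hypothetical, deliberately tiny vocabulary of relative expressions.
RELATIVE_OFFSETS = {
    "yesterday": timedelta(days=-1),
    "today": timedelta(days=0),
    "tomorrow": timedelta(days=1),
    "last week": timedelta(weeks=-1),
    "next week": timedelta(weeks=1),
}

# Longest phrases first, so multi-word expressions are tried before shorter ones.
_ALTERNATIVES = sorted(RELATIVE_OFFSETS, key=len, reverse=True)
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(p) for p in _ALTERNATIVES) + r")\b",
    re.IGNORECASE,
)


def parse_expression(expression, today):
    """Parsing: resolve an isolated expression to a concrete date."""
    return today + RELATIVE_OFFSETS[expression.lower()]


def extract_expressions(text, today):
    """Extraction: find expressions in running text, then parse each one.

    Returns (start_char, end_char, matched_text, resolved_date) tuples.
    """
    return [
        (m.start(), m.end(), m.group(0), parse_expression(m.group(0), today))
        for m in PATTERN.finditer(text)
    ]


if __name__ == "__main__":
    reference_day = date(2017, 4, 12)
    print(parse_expression("last week", reference_day))
    print(extract_expressions("We met for dinner and a movie last week.", reference_day))
```

The character offsets returned by the extraction step are what a spaCy integration would need in order to attach the resolved value to a span.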
-
I actually wrote a date parser for ChatterBot: https://github.com/gunthercox/ChatterBot/blob/master/chatterbot/utils/parsing.py If you're interested, I can port it to spaCy.
-
Nobody seems to have mentioned HeidelTime yet. As far as I know, this is one of the best tools for getting times and dates in many different languages.
-
Quick update: This might be a nice use case for the new custom processing pipeline components and extension attributes introduced in v2.0!
-
By the way, the date-time format is a tricky question: which sort of strings does one want to recognize? For instance, my company does sales and customer care. We have CFG parsers available for both English and German. We also recognize relative time strings:
Here are some example strings that are recognized in German:
... it really depends on what sort of strings one wants to recognize. For instance, we recognize quarters because we work in the business domain. Both the German and English parsers are CFG parsers, hence lexical and language-dependent. We use Lark and pyparsing. @ines, if you're interested, I'll speak to my boss and we can provide the parsers. We'll be happy to provide them :)
-
@DuyguA This looks great, and yes, this is definitely interesting! 👍 It'd be really cool to have this as a separate plugin that users could plug into their spaCy pipeline. I don't know if your company already publishes open-source software, but this could be a nice opportunity: you get to publish the parser and maintain the project, and I'm sure a lot of spaCy users would really appreciate it. We have a lot of users working with business text in similar domains, so your plugin could definitely be very popular! As an idea, once you've extracted the spans, the pipeline component could set additional

    from spacy.tokens import Span

    parser = load_your_parser()  # load your parser somehow

    def time_and_date_parser(doc):
        times = parser(doc)  # run your parser over the document
        for start, end in times:  # assuming it returns the start and end token of the span
            entity = Span(doc, start, end, label=doc.vocab.strings['TIME'])  # define span
            doc.ents = list(doc.ents) + [entity]  # add entity to doc.ents
        return doc  # return Doc object for next pipeline component

Usage could then look something like this:

    import spacy
    from time_and_date_parser import time_and_date_parser  # or some other cool name

    nlp = spacy.load('en_core_web_sm')
    nlp.add_pipe(time_and_date_parser, before='ner')  # add it before spaCy's entity recognizer
    doc = nlp("How about this Friday, after 12.00?")
    print([(ent.text, ent.label_) for ent in doc.ents])
    # [('this Friday, after 12.00', 'TIME')]

spaCy's entity recognizer respects already set entities, so if the component is added before
-
I would be extremely interested in this. Currently trying to parse stuff like "Fourth and Second Wednesday of every month, 8-9 a.m. (2018)" and am pretty stumped.
-
Sorry for the late answer. I wrote a blog post describing how to design a context-free grammar: https://duygua.github.io/blog/2018/03/28/chatbot-nlu-series-datetimeparser/ I used Lark, but one can use PyParsing and of course yacc :) For more advanced colleagues, I also covered efficiency issues and language modelling.
-
@DuyguA Thanks for the blog post!
-
One needs an attribute grammar. Basically, the parse tree is mapped to an assembly language. Something like this: https://en.m.wikipedia.org/wiki/Attribute_grammar https://github.com/lark-parser/lark/blob/master/examples/calc.py I can extend the blog post. This way one can convert relative dates to Python objects.
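As a rough illustration of the attribute-grammar idea, here is a stdlib-only sketch in which each grammar rule computes a value (a synthesized attribute) from the values of its parts, so that a relative date ends up as a Python `date`. The grammar, the rule names and the hand-written recursive-descent style are all assumptions made for this example; it does not use Lark and is not taken from the blog post.

```python
from datetime import date, timedelta

# Assumed toy grammar:
#   relative_date -> ANCHOR | "in" duration | duration "ago"
#   duration      -> NUMBER UNIT
UNIT_DAYS = {"day": 1, "days": 1, "week": 7, "weeks": 7}
ANCHORS = {"yesterday": -1, "today": 0, "tomorrow": 1}


def parse_number(tokens, pos):
    """NUMBER.value is the integer the token denotes."""
    if pos < len(tokens) and tokens[pos].isdecimal():
        return int(tokens[pos]), pos + 1
    raise ValueError(f"expected a number at token {pos}")


def parse_duration(tokens, pos):
    """duration.value = NUMBER.value * UNIT.value, as a timedelta."""
    count, pos = parse_number(tokens, pos)
    if pos < len(tokens) and tokens[pos] in UNIT_DAYS:
        return timedelta(days=count * UNIT_DAYS[tokens[pos]]), pos + 1
    raise ValueError(f"expected a unit at token {pos}")


def parse_relative_date(text, today):
    """relative_date.value is a date computed relative to `today`."""
    tokens = text.lower().split()
    if len(tokens) == 1 and tokens[0] in ANCHORS:
        return today + timedelta(days=ANCHORS[tokens[0]])
    if tokens and tokens[0] == "in":
        delta, pos = parse_duration(tokens, 1)
        if pos == len(tokens):
            return today + delta
    else:
        delta, pos = parse_duration(tokens, 0)
        if tokens[pos:] == ["ago"]:
            return today - delta
    raise ValueError(f"not a relative date: {text!r}")


if __name__ == "__main__":
    reference_day = date(2018, 3, 28)
    for phrase in ("tomorrow", "in 2 weeks", "3 days ago"):
        print(phrase, "->", parse_relative_date(phrase, reference_day))
```

With a parser generator such as Lark, the same computation would usually live in a tree transformer rather than inside the parsing functions, but the principle of computing values bottom-up is the same.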
-
Thanks, it would be nice if you could extend your blog post.
-
Has anybody compared NLTK with Lark in terms of speed and accuracy?
-
Yes, myself :) Lark is the fastest, PyParsing takes second place, and NLTK is not very competitive. NLTK is not very competitive on efficiency in general. It's not really suited to real-time applications; NLTK looks like it's good for academic purposes. Nobody I know uses NLTK for industrial applications. Most of my text-mining friends build spaCy + TensorFlow/Keras backends (including myself 😁)
-
Great! 👍
-
OK, I don't want to speak too much 😁 but:
If you really want to use NLTK, go for the Earley parser (there exists one). We have both English and German CFGs for our in-house e-mail bots, and both grammars are very ambiguous. I think any parser less powerful than Earley cannot cover the language of date-time strings.
-
Hello, I'm following this thread to extract dates from a text corpus. I'm using spaCy's NER model to get dates and have created some custom rules using
-
I think we can finally close this 🎉 While we don't have time and date parsing in the library itself, in v2.1 we now have:
The entry point hooks mean that grammars for things like times and dates don't need to live in the main library: they can be published as extension modules, with no difference in user experience. I think the matcher support we now have basically completes the core library functionality needed to do this.
-
@honnibal congrats on the release milestone!
I'm posting to see what sort of community exists amongst spaCy users for more robust date and time parsing. For many NL-based applications, date and time parsing is tremendously useful, but it is difficult for a statistical parser to provide consistent results from application to application. There are so many arbitrary dates, ranges, and holidays that can confound a purely statistical date-time parser. There is potential within spaCy's existing features (the Matcher and PhraseMatcher classes) to build out a rule-based parser similar to Stanford's SUTime (see SUTime in action) or wit.ai's open-source Duckling.
SUTime parses text using token-oriented regex patterns that map to semantic objects representing dates, durations, ranges, etc. These semantic objects can be combined to create higher-order date and time annotations based on additional rules.
Duckling takes a more functional approach (it is written in Clojure) and uses primitive rule-based functions that can be composed into higher-order functions, ultimately returning potential date and time annotations. I believe it further uses a Naive Bayes classifier to pick out the likeliest annotation, thereby resolving conflicts, since Duckling can also parse out any sort of rule-based annotation (currency, etc.).
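To give a feel for what "primitive rules composed into higher-order rules" can mean, here is a small Python sketch. It is an illustration of the general technique only: it is not Duckling's or SUTime's API, all names are hypothetical, and there is no classifier for ranking competing annotations. Each rule takes a token list and a position and returns either `None` or a pair of (value, next position).

```python
MONTHS = {
    "january": 1, "february": 2, "march": 3, "april": 4,
    "may": 5, "june": 6, "july": 7, "august": 8,
    "september": 9, "october": 10, "november": 11, "december": 12,
}


def month(tokens, pos):
    """Primitive rule: a month name yields {'month': n}."""
    if pos < len(tokens) and tokens[pos].lower() in MONTHS:
        return {"month": MONTHS[tokens[pos].lower()]}, pos + 1
    return None


def day_of_month(tokens, pos):
    """Primitive rule: a number from 1 to 31 yields {'day': n}."""
    if pos < len(tokens) and tokens[pos].isdecimal() and 1 <= int(tokens[pos]) <= 31:
        return {"day": int(tokens[pos])}, pos + 1
    return None


def sequence(*rules):
    """Higher-order rule: apply rules one after another and merge their values."""
    def combined(tokens, pos):
        merged = {}
        for rule in rules:
            result = rule(tokens, pos)
            if result is None:
                return None
            value, pos = result
            merged.update(value)
        return merged, pos
    return combined


def either(*rules):
    """Higher-order rule: return the result of the first rule that matches."""
    def combined(tokens, pos):
        for rule in rules:
            result = rule(tokens, pos)
            if result is not None:
                return result
        return None
    return combined


# "March 15" or "15 March", built purely by composition.
month_day = either(sequence(month, day_of_month), sequence(day_of_month, month))


def annotate(text):
    """Scan the text and return (start_token, end_token, value) annotations."""
    tokens = text.replace(",", " ").split()
    found, pos = [], 0
    while pos < len(tokens):
        result = month_day(tokens, pos)
        if result is None:
            pos += 1
        else:
            value, end = result
            found.append((pos, end, value))
            pos = end
    return found


if __name__ == "__main__":
    print(annotate("Invoice due March 15, reminder sent 2 April"))
```

Here `either` simply takes the first rule that matches; a system that ranks candidates, as described above, would instead collect every match and score them.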
I am curious whether there is a market for creating a similar parser built within the spaCy ecosystem (and under the same license, of course). Some requirements would be as follows:
If this is not the right forum for this discussion, please let me know. (Is spaCy's Gitter up and running?) I'd be happy to maintain this project's development, but it would be a significant undertaking and a bit beyond my level of expertise.
Cheers!
Carl