How to use multiple regex patterns with the Normalizer in Spark NLP? #2599
Unanswered · SameekshaS asked this question in Q&A
I am working with a PySpark DataFrame. I need to compute TF-IDF, and for the prior steps (tokenizing, normalization, etc.) I am using Spark NLP.
I have a DataFrame that looks like this after applying the Tokenizer:
The next step is to apply the Normalizer.
I want to set multiple cleanup patterns. So far,
cleanup = ["[^A-Za-z]"]
fulfils the first condition, but I don't understand how to apply the second one. I tried this:
Help would be much appreciated!
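For context, the Normalizer's cleanupPatterns parameter accepts a Python list, so several regexes can be supplied at once, and each one is applied to every token. Below is a minimal end-to-end sketch of the pipeline described above, feeding into Spark ML's CountVectorizer and IDF for the TF-IDF step. The sample data, column names, and the second regex are assumptions for illustration, not the asker's actual patterns:

```python
import sparknlp
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import Tokenizer, Normalizer
from pyspark.ml import Pipeline
from pyspark.ml.feature import CountVectorizer, IDF

spark = sparknlp.start()

# Toy input; the column name "text" is an assumption.
df = spark.createDataFrame(
    [("The quick brown fox, 42 times!!",), ("Another short doc.",)],
    ["text"],
)

document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

# cleanupPatterns takes a list: every regex in it is applied to each token,
# and whatever matches is removed. The second pattern here is only an example.
normalizer = (
    Normalizer()
    .setInputCols(["token"])
    .setOutputCol("normalized")
    .setLowercase(True)
    .setCleanupPatterns(["[^A-Za-z0-9]", "[0-9]+"])
)

# Finisher turns the annotations back into a plain array<string> column,
# which Spark ML's feature transformers can consume directly.
finisher = Finisher().setInputCols(["normalized"]).setOutputCols(["finished"])

cv = CountVectorizer(inputCol="finished", outputCol="tf")
idf = IDF(inputCol="tf", outputCol="tfidf")

pipeline = Pipeline(stages=[document_assembler, tokenizer, normalizer, finisher, cv, idf])
model = pipeline.fit(df)
model.transform(df).select("finished", "tfidf").show(truncate=False)
```

The patterns are applied in order, so the list above first strips non-alphanumeric characters and then digit runs. Conditions that cannot be expressed as "delete whatever matches" are better handled elsewhere in the pipeline, as the reply below suggests.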
Reply:
Spark NLP's Tokenizer has minLength and maxLength parameters; you can set minLength to filter out tokens shorter than a certain length.
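A minimal sketch of that suggestion, assuming a recent Spark NLP version; the column names and both length thresholds are arbitrary choices for illustration:

```python
from sparknlp.annotator import Tokenizer

# minLength drops tokens shorter than the threshold at tokenization time;
# maxLength bounds the other end. Both values here are illustrative.
tokenizer = (
    Tokenizer()
    .setInputCols(["document"])
    .setOutputCol("token")
    .setMinLength(3)
    .setMaxLength(30)
)
```

Filtering by length at the Tokenizer stage is usually simpler than trying to encode a length condition as a Normalizer regex.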