Extracting Consumer information from 10-k text #2122

yoonchanheee · 2018-03-21T04:55:44Z

Hi!

I am so surprised by how efficient this spaCy module is!

Thanks for the great help!

I'm trying to extract customer information from 10-K filings.

Below is an example sentence.

Net sales to the Company’s three major customers, Staples, Inc., Office Max, and United Stationers, Inc., represented approximately 43% in 2004, 46% in 2003 and 46% in 2002.

I want to extract Staples, Office Max, and United Stationers from this text.

At first, I thought NER could handle this problem.

However, some sentences contain entities that are not customers.

For example,

For fiscal 2003, Fujitsu accounted for approximately 31 percent of our consolidated accounts receivable and approximately 13 percent of our consolidated gross sales.

Since the NER pipeline classifies both Fleetwood and Home Depot as organizations, it can only partly solve the problem.

Next, I thought the dependency parser would help. However, there are many different sentence forms and verbs that indicate whether an entity is a customer or not...

To deal with this, I tried to train the NER pipeline. I annotated 2000+ sentences in IOB format, marking whether each word is part of a customer name. However, when I train the NER pipeline on these texts, the loss does not go down, and the overall accuracy seems poor.

I suspect this is because the NER pipeline cannot capture the context in the text.

So my question is: is there any way I could deal with these problems? (I'm thinking of extracting features (entities, dependency parse features) with spaCy and then trying machine learning on those features.)
Do you have any suggestions?
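
The feature idea in the parentheses above could look roughly like the sketch below. It is only an illustration under stated assumptions: the character offsets of each organization are assumed to come from an earlier NER step (for example the `start_char` and `end_char` of spaCy entity spans), and `CUSTOMER_CUES` and `context_features` are hypothetical names with a made-up cue list, not part of spaCy.

```python
import re

# Hypothetical cue words; a real list would come from inspecting annotated sentences.
CUSTOMER_CUES = {"customer", "customers", "sales", "accounted", "receivable"}


def context_features(text, start, end, window=10):
    """Return cue-word features from the words around the span text[start:end].

    `start` and `end` are character offsets of an organization mention,
    assumed to be produced by a separate NER step.
    """
    # Keep only alphabetic words; numbers and punctuation are ignored.
    left = re.findall(r"[a-z]+", text[:start].lower())[-window:]
    right = re.findall(r"[a-z]+", text[end:].lower())[:window]

    feats = {}
    for word in left:
        if word in CUSTOMER_CUES:
            feats["left=" + word] = 1
    for word in right:
        if word in CUSTOMER_CUES:
            feats["right=" + word] = 1
    # Number of distinct cue features found on either side.
    feats["n_cues"] = len(feats)
    return feats


text = (
    "For fiscal 2003, Fujitsu accounted for approximately 31 percent of our "
    "consolidated accounts receivable and approximately 13 percent of our "
    "consolidated gross sales."
)
start = text.index("Fujitsu")
feats = context_features(text, start, start + len("Fujitsu"))
print(feats)
```

Feature dictionaries like this could then be fed to any standard classifier that decides, per organization mention, whether it is a customer.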

Your Environment

Operating System: Win 10 / 64bit
Python Version Used: 3.6
spaCy Version Used: 2.0 +
Environment Information: Anaconda

jainaayush05 · 2018-07-04T03:48:07Z

@ines @honnibal, any help on this?

honnibal · 2018-07-06T13:06:15Z

Maintainer

Hi @yoonchanheee,

It's not always easy to tag arbitrary text with the NER model. The NER model looks for information in the surrounding context to figure out what the tag is. Tags like "organization" are good because the context provides good clues about whether the word is an organization. But how is the model supposed to predict whether something is a customer?

I think you're best off focussing on getting high accuracy on labelling organizations, and then having a follow-up process figure out whether it's a customer. I think this will probably look like a list of organizations that are customers. You might need another process to match the name of the organization into your list, e.g. if you need some name normalization.
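
As a rough sketch of that follow-up step, the snippet below normalizes organization names and checks them against a customer list. Everything here is an assumption for illustration: `LEGAL_SUFFIXES`, `normalize_org`, and `is_known_customer` are hypothetical names, the suffix list is incomplete, and the customer list is simply taken from the example sentence in the question. In practice the candidate names would come from the ORG entities predicted by the NER model.

```python
import re

# Hypothetical, incomplete list of legal suffixes to strip during normalization.
LEGAL_SUFFIXES = {"inc", "corp", "corporation", "co", "company", "ltd", "llc"}


def normalize_org(name):
    """Lowercase, drop punctuation, and strip trailing legal suffixes."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


# Illustrative customer list, stored in normalized form.
known_customers = {
    normalize_org(n)
    for n in ["Staples, Inc.", "Office Max", "United Stationers, Inc."]
}


def is_known_customer(org_name):
    """Check whether an organization name matches the list after normalization."""
    return normalize_org(org_name) in known_customers


# Candidate names as they might come out of the ORG tagger.
candidates = ["Staples", "United Stationers Inc", "Fujitsu"]
matches = [c for c in candidates if is_known_customer(c)]
print(matches)
```

Exact matching after normalization is the simplest option; if names vary more than this (abbreviations, typos), a fuzzy comparison could replace the set lookup.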

For improving the training of the ORG model, you might find our annotation tool Prodigy useful: https://prodi.gy . Hope that helps!
