Feat/sentiment analysis #34

Open. Wants to merge 2 commits into base: master.

Conversation

tbittencourt
This macro iterates through a piece of text to return the overall sentiment of that text.

First, the macro pre-processes the text, removing unnecessary punctuation and stopwords to help improve the model's accuracy. It then uses the transformers library to apply a sentiment analysis pipeline based on a pre-trained model, which returns either a score or a label for the text.
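
The pre-processing step described above could look roughly like the sketch below. The PR text does not show the actual implementation, so the `preprocess` name and the tiny `STOPWORDS` set are assumptions made purely for illustration.

```python
import string

# Hypothetical stopword list for illustration only; the PR does not say
# which list the macro actually uses.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "to", "of", "and", "it", "this"}


def preprocess(text: str) -> str:
    """Lowercase the text, strip ASCII punctuation, and drop stopwords."""
    # Remove every character listed in string.punctuation.
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace and keep only tokens that are not stopwords.
    tokens = [tok for tok in no_punct.split() if tok not in STOPWORDS]
    return " ".join(tokens)


cleaned = preprocess("This is a GREAT product, and it works!")
print(cleaned)
```

One thing to watch: many standard stopword lists include negations such as "not", and dropping those can flip the meaning of a sentence before the model ever sees it.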

We recommend using one of the following popular models:

  1. cardiffnlp/twitter-roberta-base-sentiment-latest:
    (https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest)
    This model is trained on 124M tweets from January 2018 to December 2021, and is finetuned for sentiment analysis.
    It outputs a label - Neutral, Positive or Negative - and a score ranging from 0 to 1 - 0 being the most negative and 1,
    the most positive.

  2. nlptown/bert-base-multilingual-uncased-sentiment:
    (https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment)
    This model is fine-tuned for sentiment analysis on product reviews in six languages: English, Dutch, German, French,
    Spanish and Italian. It outputs a label - 1 to 5 stars - and a score ranging from 0 to 1 - 0 being the
    most negative and 1, the most positive.

The macro returns a STRING. If 'score' is used as the output, the result has to be cast to FLOAT.

@cris-seaton cris-seaton left a comment

First iteration of review. Let's chat about it when you're back in the 'office'.

This model is trained on 124M tweets from January 2018 to December 2021,
and is finetuned for sentiment analysis.
It outputs a label - Neutral, Positive or Negative - and a score ranging
from 0 to 1 - 0 being the most negative and 1, the most positive.

I think you've misinterpreted score here. Score is more akin to the model's confidence in its assessment (the label), not a negative-to-positive scale.

[two screenshots]

I actually get mixed results with the nlptown model as well: the same positive text in the screenshots yields only 1 star, with a score of 0.76.
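
To make the label/score distinction concrete, here is a small sketch. It assumes the usual shape of a transformers text-classification result (a dict with `label` and `score` keys) and reuses the 1 star / 0.76 figures from the comment above; the helper names `parse_result` and `star_rating` are hypothetical.

```python
from typing import Tuple


def parse_result(result: dict) -> Tuple[str, float]:
    """Split one classification result into (label, confidence)."""
    return result["label"], float(result["score"])


def star_rating(label: str) -> int:
    """Convert an nlptown-style label such as '1 star' or '5 stars' to an int."""
    return int(label.split()[0])


# Values taken from the review comment: 1 star with a score of 0.76.
sample = {"label": "1 star", "score": 0.76}

label, confidence = parse_result(sample)
# The sentiment is carried by the label; the score only says how confident
# the model is in that label.
print(f"{star_rating(label)} star(s), confidence {confidence:.2f}")
```

Read this way, a score of 0.76 on a "1 star" label means the model is fairly sure the text is strongly negative, not that the text is 76% positive.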

Comment on lines +3 to +4
create or replace function {{target.schema}}.udf_sentiment_analysis(text STRING, output INTEGER)
returns STRING

Your arguments need revision:
text STRING, model INTEGER or STRING [more on this later], output VARIANT (if you still want to differentiate between label and score)

And since it returns STRING, you have to make sure the value returned on L32 is cast to a string.

model = "cardiffnlp/twitter-roberta-base-sentiment-latest"
-- model = 'nlptown/bert-base-multilingual-uncased-sentiment'

def sentiment_analysis(text, model_id, output='score'):

Your function name (sentiment_analysis) and your handler on L8 have to match. You should also have the same number of arguments here and in your UDF call.
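
A rough sketch of a handler that lines up with a three-argument UDF signature (text, model, output) and always returns a string. Everything here is an assumption for illustration: the handler name mirrors `udf_sentiment_analysis` from the diff, `output` is treated as a plain string, and `_classify` is a stand-in for the real transformers pipeline call.

```python
def _classify(text: str, model: str) -> dict:
    """Hypothetical stand-in for the pipeline call; returns a fixed result."""
    return {"label": "positive", "score": 0.98}


def udf_sentiment_analysis(text: str, model: str, output: str) -> str:
    """Handler sketch: same name as the declared handler, same three
    arguments as the UDF signature, and a string return value."""
    if output not in ("label", "score"):
        raise ValueError("output must be 'label' or 'score'")
    result = _classify(text, model)
    # Cast explicitly so a float score is returned as a string too.
    return str(result[output])


print(udf_sentiment_analysis("great product", "some-model-id", "score"))
```

With this shape, the caller casts the string back to FLOAT when it asked for the score, which matches the note in the PR description.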

@Mayurjit
Mayurjit commented Sep 6, 2024

Function name and handler should be equivalent.

3 participants