To use NLP to predict stock price movement based on news from Reuters, we need the following 5 steps:
1. Data Collection
1.1 get the whole ticker list
1.2 crawl news from Reuters using BeautifulSoup
1.3 crawl prices using urllib2 (Yahoo Finance API is outdated)
2. Apply GloVe to train dense word vectors from the Reuters corpus in NLTK
2.1 build the word-word co-occurrence matrix
2.2 factorize the weighted log of the co-occurrence matrix
3. Feature Engineering
3.1 Unify word format: unify tense, singular & plural, remove punctuation & stop words
3.2 Extract features using feature hashing based on the trained word vectors (step 2)
3.3 Pad the word sequence (essentially a matrix) to keep the same dimension
4. Train a ConvNet to predict the stock price movement based on a reasonable parameter selection
5. The result shows a significant 1-2% improvement on the test set
1.1 Download the ticker list from NASDAQ
./crawler_allTickers.py 20 # keep the top e.g. 20% marketcap companies
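As a rough sketch of what "keep the top 20% by market cap" could look like, assuming the ticker list is already loaded as (symbol, market cap) pairs. The function name, the data layout, and the sample values are hypothetical and are not taken from crawler_allTickers.py.

```python
import math


def top_percent_by_marketcap(tickers, percent):
    """Return the symbols in the top `percent` of the list by market cap.

    `tickers` is a list of (symbol, market_cap) pairs; this layout is an
    assumption of the sketch, not the format used by crawler_allTickers.py.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    ranked = sorted(tickers, key=lambda pair: pair[1], reverse=True)
    # Round the cut-off up and always keep at least one company.
    keep = max(1, math.ceil(len(ranked) * percent / 100))
    return [symbol for symbol, _ in ranked[:keep]]


# Made-up symbols and market caps, for illustration only.
sample = [("AAA", 50.0), ("BBB", 400.0), ("CCC", 5.0), ("DDD", 120.0), ("EEE", 80.0)]
selected = top_percent_by_marketcap(sample, 20)
```

Restricting the crawl to large companies keeps the ticker-by-date iteration manageable, since small caps tend to have few news items per day.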
1.2 Use BeautifulSoup to crawl news headlines from Reuters
Note: fetching the news you want may take over a month.
Suppose we find a news item about Facebook dated Dec. 13, 2016 on reuters.com.
We can use the following script to crawl it and save it to a local file in our format:
./crawler_reuters.py # relates each news item to a company and a date, which is more precise than Bloomberg News
By brute-force iterating over company tickers and dates, we end up with a dataset of roughly 30,000 to 200,000 news items. Since a company may have multiple news items on a single day, the current version only handles the topStory and ignores the others.
Improvement here: use the return normalized against the S&P 500 [4] instead of the raw return.
./crawler_yahoo_finance.py # generate stock price raw data: stockPrices_raw.json, containing open, close, ..., adjClose
./create_label.py # use raw price data to generate stockReturns.json
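One plausible reading of "normalized return" is the stock's return minus the S&P 500 return over the same period. The sketch below works under that assumption; the function names, the binary labeling rule, and the prices are made up for illustration and are not taken from create_label.py.

```python
def simple_return(prev_close, close):
    """Fractional price change between two adjusted closes."""
    return (close - prev_close) / prev_close


def normalized_return(stock_prev, stock_close, index_prev, index_close):
    """Stock return minus the index return over the same period."""
    return simple_return(stock_prev, stock_close) - simple_return(index_prev, index_close)


def movement_label(excess_return):
    """1 if the stock beat the index, else 0 (binary labels are an assumption)."""
    return 1 if excess_return > 0 else 0


# Made-up prices: the stock gains 3% while the index gains 1%.
excess = normalized_return(100.0, 103.0, 2000.0, 2020.0)
label = movement_label(excess)
```

Subtracting the index return removes market-wide moves, so the label reflects how the company did relative to the market rather than whether the whole market went up that day.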
To use our own customized word vectors, apply GloVe to train word vectors from the Reuters corpus in NLTK
./embeddingWord.py
Read the details of the method here and the implementation here.
We can also directly use pretrained GloVe word vectors from here
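The two GloVe ingredients named in the overview (the word-word co-occurrence matrix and the weighting applied before factorization) can be sketched as follows. This is a minimal pure-Python illustration, not the code in embeddingWord.py; the function names and the toy corpus are hypothetical, and the defaults `x_max=100` and `alpha=0.75` are the values suggested in the GloVe paper.

```python
from collections import defaultdict


def cooccurrence_counts(sentences, window=2):
    """Accumulate symmetric co-occurrence weights within a context window.

    Each pair contributes 1/distance, so nearer words weigh more.
    """
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look only to the right and add both directions to stay symmetric.
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                weight = 1.0 / (j - i)
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return counts


def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): down-weights rare pairs and caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


# Toy corpus, for illustration only.
toy_corpus = [
    ["stocks", "rise", "on", "earnings"],
    ["stocks", "fall", "on", "earnings"],
]
counts = cooccurrence_counts(toy_corpus, window=2)
```

Training then fits word vectors so that their dot products (plus biases) approximate the log of each count, with each squared error scaled by `glove_weight` of that count.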
Unify the word format and project each word to a word vector, so that every sentence results in a matrix.
The steps for unifying the word format are: convert to lower case, remove punctuation, get rid of stop words, and unify tense and singular & plural using en
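A minimal sketch of the first three normalization steps (lower case, punctuation, stop words), assuming Python 3 and whitespace tokenization. The function name, the tiny stop-word list, and the sample headline are made up for illustration; tense and plural unification is left out because it needs a linguistic library.

```python
import string

# Tiny illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "on", "in", "to", "and", "is", "for"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_headline(text):
    """Lowercase, strip ASCII punctuation, and drop stop words."""
    tokens = []
    for raw in text.lower().split():
        word = raw.translate(_PUNCT_TABLE)  # e.g. "p.f." becomes "pf"
        if word and word not in STOP_WORDS:
            tokens.append(word)
    return tokens


# Made-up headline, for illustration only.
tokens = normalize_headline("Facebook's shares rise on the news, says P.F. Chang")
```

Deleting punctuation inside a token, rather than splitting on it, is what turns an initialed name such as "P.F." into the single token "pf".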
Separate the test set from the training + validation set; otherwise we would get an overly optimistic result.
./genFeatureMatrix.py
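The padding and hold-out steps can be sketched as below. This is an illustration under stated assumptions, not the logic of genFeatureMatrix.py: unknown words map to a zero vector, the split is chronological, and the function names and toy vectors are hypothetical.

```python
def sentence_to_matrix(tokens, vectors, max_len, dim):
    """Map tokens to vectors, then truncate or zero-pad to max_len rows."""
    zero = [0.0] * dim
    # Unknown words map to the zero vector (an assumption of this sketch).
    rows = [list(vectors.get(tok, zero)) for tok in tokens[:max_len]]
    rows.extend(list(zero) for _ in range(max_len - len(rows)))
    return rows


def chronological_split(samples, test_fraction=0.2):
    """Hold out the most recent samples as the test set.

    `samples` must already be sorted by date. Splitting by time is one way
    to keep the test set separate; the original script may do it differently.
    """
    n_test = int(round(len(samples) * test_fraction))
    cut = len(samples) - n_test
    return samples[:cut], samples[cut:]


# Toy 3-dimensional vectors, for illustration only.
toy_vectors = {"stocks": [0.1, 0.2, 0.3], "rise": [0.4, 0.5, 0.6]}
matrix = sentence_to_matrix(["stocks", "rise", "sharply"], toy_vectors, max_len=5, dim=3)
train_val, test = chronological_split(list(range(10)), test_fraction=0.2)
```

Every sentence comes out as a `max_len` by `dim` matrix, which is what lets the headlines be stacked into one fixed-shape input for the network.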
For the sake of simplicity, I just applied a ConvNet in Keras. The detailed operations on text data are slightly different from those on images; we can use the architecture from Figure 1 in Yoon Kim's paper
./model_cnn.py
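To show how convolution on text differs from convolution on images, here is the core operation of Kim's architecture written in plain Python rather than Keras: one filter spans the full vector width of h adjacent words, slides along the sentence, and only the largest response is kept (max-over-time pooling). This is a sketch of a single filter, not the model in model_cnn.py; the function name, the ReLU choice, and the toy values are assumptions.

```python
def conv_max_over_time(sentence, filt, bias=0.0):
    """Slide one filter over windows of h words and keep the largest response.

    `sentence` is a list of word vectors; `filt` is h rows of the same width,
    so the filter only moves along the word axis, never across dimensions.
    """
    h = len(filt)
    if len(sentence) < h:
        raise ValueError("sentence is shorter than the filter height")
    responses = []
    for start in range(len(sentence) - h + 1):
        total = bias
        for offset in range(h):
            word_vec = sentence[start + offset]
            total += sum(w * x for w, x in zip(filt[offset], word_vec))
        responses.append(max(0.0, total))  # ReLU non-linearity
    return max(responses)  # max-over-time pooling


# Toy 2-dimensional word vectors and a filter of height 2, for illustration only.
toy_sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
toy_filter = [[1.0, 0.0], [0.0, 1.0]]
feature = conv_max_over_time(toy_sentence, toy_filter)
```

A full model applies many such filters with several heights, concatenates the pooled features, and feeds them to a dense softmax layer.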
As shown in the result, the prediction accuracy improves significantly, by around 1% - 2%, compared to random pick.
According to the work by Tim Loughran and Bill McDonald, some words strongly indicate positive or negative effects in finance; we may need to dig into these words to find more information. A very simple but interesting example can be found in Financial Sentiment Analysis part1, part2
As suggested by H Lee, we may consider including earnings surprise features because of their great value
- remove_punctuation() handles middle name (e.g., P.F -> pf)
- Yoon Kim, Convolutional Neural Networks for Sentence Classification, EMNLP, 2014
- J Pennington, R Socher, CD Manning, GloVe: Global Vectors for Word Representation, EMNLP, 2014
- Tim Loughran and Bill McDonald, 2011, “When is a Liability not a Liability? Textual Analysis, Dictionaries, and 10-Ks,” Journal of Finance, 66:1, 35-65.
- H Lee et al., On the Importance of Text Analysis for Stock Price Prediction, LREC, 2014
- Xiao Ding, Deep Learning for Event-Driven Stock Prediction, IJCAI, 2015
- Implementing a CNN for Text Classification in TensorFlow
- Keras predict sentiment-movie-reviews using deep learning
- Keras sequence-classification-lstm-recurrent-neural-networks
- tf-idf + t-sne
- Implementation of CNN in sequence classification
- Getting Started with Word2Vec and GloVe in Python