Intro
In this article, I’ll present a demo project for classifying the sentiment of posts from the Stocktwits social media platform.
The community on Stocktwits is full of investors, traders, and entrepreneurs. Each message posted is called a Twit. This is similar to Twitter’s version of a post, called a Tweet. Using PyTorch, we’ll build a model around these twits that generates a sentiment score.
This project represents my solution for one of the hands-on assignments for the Udacity “AI for Trading” Nanodegree.
For this task, the Udacity team has collected and hand-labeled a bunch of twits with their sentiment score. To capture the degree of sentiment, they’ve used a five-point scale: very negative, negative, neutral, positive, and very positive. Each twit is labeled -2 to 2 in steps of 1, from very negative to very positive, respectively.
We’ll build a sentiment analysis model that will learn to assign sentiment to twits on its own, using this labeled data.
Run the Notebook
You can access the notebook in Google Colab here. It is completely self-contained with all the required resources, so you can dig into it and see how it works step by step. If you want to run the code, you’ll have to make a copy in your own Colab environment.
Alternatively, I’ve also posted the notebook on GitHub.
“AI for Trading” Nanodegree – Overview
First, let me do a quick summary of the Nanodegree in case you’re interested in what it offers.
You will learn the basics of quantitative analysis, including data processing, trading signal generation, and portfolio management. You’ll use Python to work with historical stock data, develop trading strategies, and construct a multi-factor model with optimization.
Following is a list of the main topics covered and the practical projects you’ll complete.
Basic Quantitative Trading
Learn about market mechanics and how to generate signals with stock data. Work on developing a momentum-trading strategy in your first project.
Hands-on project – Trading with Momentum
Implement a momentum trading strategy and test if it has the potential to be profitable. You will work with historical data of a given stock universe and generate a trading signal based on a momentum indicator. You will then compute the signal and produce projected returns. Finally, you will perform a statistical test to conclude if there is alpha in the signal.
Advanced Quantitative Trading
Learn the quant workflow for signal generation, and apply advanced quantitative methods commonly used in trading.
Hands-on project – Breakout Strategy
Code and evaluate a breakout signal. You will run statistical tests for normality and for the presence of alpha. You will also learn about the effect that filtered outliers could have on your trading signal and identify whether the outliers could be a valid trading signal. You will make a judgment call about what should be kept versus what should not.
Stocks, Indices, and ETFs
Learn about portfolio optimization and financial securities formed by stocks, including market indices, vanilla ETFs, and Smart Beta ETFs.
Hands-on project – Smart Beta and Portfolio Optimization
Create two portfolios using smart beta methodology and optimization. You will evaluate the performance of the portfolios by calculating tracking errors. You will also calculate the turnover of your portfolio and find the best timing to rebalance. You will come up with the portfolio weights by analyzing fundamental data and by using quadratic programming.
Factor Investing and Alpha Research
Learn about alpha and risk factors, and construct a portfolio with advanced optimization techniques.
Hands-on project – Alpha Research and Factor Modeling
Research and generate multiple alpha factors. You will apply various techniques to evaluate the performance of your alpha factors and learn to pick the best ones for your portfolio. You will formulate an advanced portfolio optimization problem by working with constraints such as risk models, leverage, market neutrality, and limits on factor exposures.
Sentiment Analysis with Natural Language Processing
Learn the fundamentals of text processing, and analyze corporate filings to generate sentiment-based trading signals.
Hands-on project – Sentiment Analysis using NLP
Work with corporate 10-Q and 10-K filings and apply your newly learned knowledge of Natural Language Processing, from cleaning data and text processing to feature extraction and modeling. You will use bag-of-words and TF-IDF to generate company-specific sentiments. Then you will come up with trading strategies and measure the performance of your strategies.
Advanced Natural Language Processing with Deep Learning
Learn to apply deep learning in quantitative analysis and use recurrent neural networks and long short-term memory to generate trading signals.
Hands-on project – Deep Neural Network with News Data
Build deep neural networks to process and interpret news data. You will play with different ways of embedding words into vectors. You will construct and train LSTM networks for classifying sentiments. You will run backtests and apply the models to news data for signal generation.
Combining Multiple Signals
Learn advanced techniques to select and combine the factors you’ve generated from traditional and alternative data.
Hands-on project – Combine Signals for Enhanced Alpha
Create a prediction model for the S&P 500 and its constituent stocks by training it on a large data set that includes market data, fundamental data, and alternative data. You will validate your model to ensure there is no overfitting. You will rank and select stocks to construct a long/short portfolio based on the prediction results.
Simulating Trades with Historical Data
Learn to refine trading signals by running rigorous backtests. Track your P&L while your algorithm buys and sells.
Hands-on project – Backtesting
Construct an OHLC data feed and a backtesting framework. You will learn about various visualization techniques for backtesting. You will construct trading strategies using various parameters such as trade days, take profit levels, stop loss levels, etc. You will then optimize the parameters and evaluate the performance by analyzing the results of your backtests.
Sentiment Analysis of Stocktwits Messages – Implementation
In this section, I’ll walk you through the Sentiment Analysis notebook, but I encourage you to visit the Colab notebook to play with the actual implementation.
Import Twits
Load Twits Data
The input twits.json file contains a list of objects for each twit in the data field:
{'data': [{'message_body': 'Neutral twit body text here', 'sentiment': 0}, {'message_body': 'Happy twit body text here', 'sentiment': 1}, ...]}
The fields represent the following:
message_body: The text of the twit.
sentiment: Sentiment score for the twit, ranging from -2 to 2 in steps of 1, with 0 being neutral.
Let’s add a few utility methods for storing and loading files that are publicly accessible on the internet or saved on the file system. They will come in handy later on when we’d like to load pre-saved data in order to save some processing time.
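Here’s a minimal sketch of what such helpers could look like. The load_file_from_external_source name matches the function referenced below; the other helpers and their exact signatures are illustrative assumptions.

```python
import os
import urllib.request

def load_file_from_external_source(url):
    """Download a publicly hosted file and return its contents as a string."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode('utf-8')

def load_file_from_disk(path):
    """Read a previously saved file from the local file system."""
    with open(path, 'r', encoding='utf-8') as f:
        return f.read()

def save_file_to_disk(path, contents):
    """Persist intermediate results so expensive steps don't have to be re-run."""
    os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
    with open(path, 'w', encoding='utf-8') as f:
        f.write(contents)
```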
Let’s load the twits and print a few to the output.
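A sketch of the loading step; the URL below is a placeholder for wherever the pre-stored twits.json file is hosted.

```python
import json

TWITS_URL = 'https://example.com/twits.json'  # placeholder URL for the hosted dataset

twits = json.loads(load_file_from_external_source(TWITS_URL))

# Print a few twits to inspect the structure.
for twit in twits['data'][:3]:
    print(twit)
```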
You can see that we’re using the load_file_from_external_source function. The results of all the expensive procedures are pre-stored in files and hosted publicly for free download, so anyone can run the Colab notebook.
Of course, if you wish, you can run the code yourself instead of using pre-calculated values.
Length of Data
Now let’s look at the number of twits in the dataset.
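Once the JSON is loaded, this is a one-liner:

```python
print(len(twits['data']))
```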
Split Message Body and Sentiment Score
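A small sketch of this step, assuming the twits dictionary loaded above: we pull the text and the label into two parallel lists.

```python
messages = [twit['message_body'] for twit in twits['data']]
sentiments = [twit['sentiment'] for twit in twits['data']]
```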
Preprocessing the Data
With our data in hand, we need to preprocess our text. These twits were collected by filtering on ticker symbols, which are denoted with a leading $ symbol in the twit itself. For example:
{'message_body': 'RT @google Our annual look at the year in Google blogging (and beyond) http://t.co/sptHOAh8 $GOOG', 'sentiment': 0}
The ticker symbols don’t provide information on the sentiment, and they are in every twit, so we should remove them. This twit also has the @google username, which again provides no sentiment information, so we should remove it as well. We also see a URL, http://t.co/sptHOAh8. Let’s remove these too.
The easiest way to remove specific words or phrases is with a regex, using the re module. You can sub out specific patterns with a space:
re.sub(pattern, ' ', text)
This will substitute a space anywhere the pattern matches in the text. Later, when we tokenize the text, we’ll split it on those spaces.
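Putting this together, a preprocess function could look like the sketch below. It lowercases the text, substitutes URLs, $TICKER symbols, @usernames, and non-letter characters with spaces, and then splits on whitespace; the actual notebook may apply additional steps such as lemmatization.

```python
import re

def preprocess(message):
    """Lowercase a twit, strip URLs, ticker symbols and usernames, and tokenize."""
    text = message.lower()
    # Replace URLs with a space.
    text = re.sub(r'https?://[^\s]+', ' ', text)
    # Replace ticker symbols ($GOOG) and usernames (@google) with a space.
    text = re.sub(r'\$\w+', ' ', text)
    text = re.sub(r'@\w+', ' ', text)
    # Keep letters only, then split on whitespace to tokenize.
    text = re.sub(r'[^a-z]', ' ', text)
    return text.split()
```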
Pre-Processing
Preprocess All the Twits
Now we can preprocess each of the twits in our dataset.
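With a helper like the one sketched above, this is a single list comprehension:

```python
tokenized = [preprocess(message) for message in messages]
```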
Bag of Words
Now with all of our messages tokenized, we want to create a vocabulary and count how often each word appears in our entire corpus.
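A straightforward way to do this is with collections.Counter, as in this sketch:

```python
from collections import Counter

# Count how often each word appears across the entire corpus.
bow = Counter(word for tokens in tokenized for word in tokens)
print(bow.most_common(10))
```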
Frequency of Words Appearing in Message
With our vocabulary, now we’ll remove some of the most common words such as “the”, “and”, “it”, etc. These words don’t contribute to identifying sentiment and are really common, resulting in a lot of noise in our input. If we can filter these out, then our network should have an easier time learning.
We also want to remove really rare words that show up in only a few twits.
Note: There is no exact number for the low and high-frequency cut-offs; however, there is a reasonable range. You should ideally set the low-frequency cut-off between 0.0000002 and 0.000007 (inclusive) and the high-frequency cut-off between 5 and 20 (inclusive). If the cut-offs are too aggressive, we lose lots of important words that we could otherwise use.
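A sketch of the filtering step, with illustrative cut-off values picked from within the ranges suggested in the note above:

```python
total_count = sum(bow.values())
freqs = {word: count / total_count for word, count in bow.items()}

low_cutoff = 5e-6   # drop words rarer than this fraction of all tokens
high_cutoff = 15    # drop the 15 most common words

most_common_words = {word for word, _ in bow.most_common(high_cutoff)}
filtered_words = [word for word in freqs
                  if freqs[word] > low_cutoff and word not in most_common_words]
```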
Updating Vocabulary by Removing Filtered Words
Let’s create three variables that will help with our vocabulary.
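My reading of these three variables is a word-to-id mapping, its reverse, and the tokenized messages restricted to the kept words; a sketch:

```python
# Word -> integer id; ids start at 1 so that 0 can be reserved for padding.
vocab = {word: i for i, word in enumerate(filtered_words, 1)}
# The reverse mapping, id -> word.
id2vocab = {i: word for word, i in vocab.items()}
# Tokenized messages with out-of-vocabulary words removed.
filtered = [[word for word in tokens if word in vocab] for tokens in tokenized]
```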
Balancing the classes
Let’s do a few last pre-processing steps. If we look at how our twits are labeled, we’ll find that 50% of them are neutral. This means that our network will be 50% accurate just by guessing 0 every single time. To help our network learn appropriately, we’ll want to balance our classes. That is, make sure each of our different sentiment scores shows up roughly as frequently in the data.
We should also take this opportunity to remove messages with length 0.
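One way to do this, sketched below: keep every non-neutral twit, keep each neutral twit with a probability chosen so that class 0 ends up roughly as frequent as each of the other four classes, and drop empty messages along the way. The exact balancing ratio in the notebook may differ.

```python
import random

n_neutral = sum(1 for sentiment in sentiments if sentiment == 0)
n_examples = len(sentiments)
# Probability of keeping a neutral twit so that class 0 is about as
# frequent as each of the other four classes.
keep_prob = (n_examples - n_neutral) / 4 / n_neutral

balanced = {'messages': [], 'sentiments': []}
for tokens, sentiment in zip(filtered, sentiments):
    if len(tokens) == 0:
        continue  # drop messages that became empty after filtering
    if sentiment != 0 or random.random() < keep_prob:
        balanced['messages'].append(tokens)
        balanced['sentiments'].append(sentiment)
```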
Check we did it correctly
Finally, let’s convert our tokens into integer ids which we can pass to the network.
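A sketch of that conversion; the +2 shift is my assumption so that the -2..2 labels become valid 0..4 class indices for the loss function later on.

```python
# Map each token to its integer id.
token_ids = [[vocab[word] for word in tokens] for tokens in balanced['messages']]
# Shift labels from -2..2 to 0..4 (assumption: needed as class indices).
labels = [sentiment + 2 for sentiment in balanced['sentiments']]
```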
Neural Network
Now we have our vocabulary, which means we can transform our tokens into ids, which are then passed to our network. So, let’s define the network now!
The network architecture looks like so:
Implement the text classifier
The network consists of three main parts: 1) the init function __init__, 2) the forward pass forward, and 3) the hidden state initialization init_hidden. We are using softmax to find the probability for each outcome.
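Below is a sketch of such a classifier: an embedding layer feeding a multi-layer LSTM, whose last output passes through dropout, a fully connected layer, and a (log-)softmax over the five classes. The layer sizes, dropout value, and the choice of log_softmax (to be paired with NLLLoss during training) are my assumptions rather than the notebook’s exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_size, lstm_size, output_size,
                 lstm_layers=2, dropout=0.2):
        super().__init__()
        self.lstm_size = lstm_size
        self.lstm_layers = lstm_layers

        # Token ids -> dense vectors; padding id 0 maps to the zero vector.
        self.embedding = nn.Embedding(vocab_size, embed_size, padding_idx=0)
        # Sequence model over the embedded tokens.
        self.lstm = nn.LSTM(embed_size, lstm_size, lstm_layers,
                            dropout=dropout, batch_first=False)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(lstm_size, output_size)

    def init_hidden(self, batch_size):
        """Create zeroed hidden and cell states for a new batch."""
        weight = next(self.parameters())
        return (weight.new_zeros(self.lstm_layers, batch_size, self.lstm_size),
                weight.new_zeros(self.lstm_layers, batch_size, self.lstm_size))

    def forward(self, nn_input, hidden_state):
        # nn_input shape: (sequence_length, batch_size)
        embeds = self.embedding(nn_input)
        lstm_out, hidden_state = self.lstm(embeds, hidden_state)
        # Use the output of the last time step for classification.
        lstm_out = lstm_out[-1]
        out = self.fc(self.dropout(lstm_out))
        # Log-probabilities over the five sentiment classes.
        log_ps = F.log_softmax(out, dim=1)
        return log_ps, hidden_state
```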
Test Model
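A quick smoke test with a tiny model and a random dummy batch, just to confirm the shapes line up:

```python
model = TextClassifier(vocab_size=len(vocab) + 1, embed_size=10,
                       lstm_size=6, output_size=5)
dummy_input = torch.randint(1, len(vocab) + 1, (40, 4))  # (sequence_length, batch_size)
hidden = model.init_hidden(batch_size=4)
log_ps, hidden = model(dummy_input, hidden)
print(log_ps.shape)  # expected: torch.Size([4, 5]) -> one row of log-probabilities per twit
```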
Training
DataLoaders and Batching
Now we should build a generator that we can use to loop through our data. It’ll be more efficient if we can pass our sequences in batches. Our input tensors should look like (sequence_length, batch_size). So if our sequences are 40 tokens long and we pass in 25 sequences, then we’d have an input size of (40, 25).
If we set our sequence length to 40, what do we do with messages that are more or less than 40 tokens? For messages with fewer than 40 tokens, we will pad the empty spots with zeros. We should be sure to left pad so that the RNN starts from nothing before going through the data. If the message has 20 tokens, then the first 20 spots of our 40 long sequence will be 0. If a message has more than 40 tokens, we’ll just keep the first 40 tokens.
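A sketch of such a generator: each message is truncated or left-padded with zeros to sequence_length, and batches come out shaped (sequence_length, batch_size).

```python
import random
import torch

def dataloader(messages, labels, sequence_length=40, batch_size=32, shuffle=False):
    """Yield (text, label) tensor batches shaped (sequence_length, batch_size)."""
    if shuffle:
        indices = list(range(len(messages)))
        random.shuffle(indices)
        messages = [messages[i] for i in indices]
        labels = [labels[i] for i in indices]

    for i in range(0, len(messages), batch_size):
        batch_messages = messages[i:i + batch_size]

        # Left-pad with zeros (or keep only the first sequence_length tokens)
        # so every column is exactly sequence_length long.
        batch = torch.zeros((sequence_length, len(batch_messages)), dtype=torch.int64)
        for j, tokens in enumerate(batch_messages):
            token_tensor = torch.tensor(tokens[:sequence_length])
            batch[sequence_length - len(token_tensor):, j] = token_tensor

        yield batch, torch.tensor(labels[i:i + len(batch_messages)])
```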
Training and Validation
With our data in nice shape, we’ll split it into training and validation sets.
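A simple split works here; the 90/10 ratio below is an illustrative choice.

```python
split_idx = int(len(token_ids) * 0.9)

train_features, valid_features = token_ids[:split_idx], token_ids[split_idx:]
train_labels, valid_labels = labels[:split_idx], labels[split_idx:]
```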
Training
It’s time to train the neural network!
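A sketch of the training loop, pairing the model’s log-softmax output with NLLLoss and the Adam optimizer. The hyperparameters (embedding size, LSTM size, batch size, learning rate, number of epochs) are illustrative rather than the notebook’s exact values.

```python
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = TextClassifier(vocab_size=len(vocab) + 1, embed_size=1024,
                       lstm_size=512, output_size=5).to(device)
criterion = nn.NLLLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

epochs = 3
batch_size = 512

for epoch in range(epochs):
    model.train()
    for text_batch, batch_labels in dataloader(train_features, train_labels,
                                               batch_size=batch_size, shuffle=True):
        text_batch = text_batch.to(device)
        batch_labels = batch_labels.to(device)

        # Fresh hidden state for every batch; twits are independent of each other.
        hidden = model.init_hidden(batch_labels.shape[0])

        model.zero_grad()
        log_ps, hidden = model(text_batch, hidden)
        loss = criterion(log_ps, batch_labels)
        loss.backward()
        optimizer.step()

    print(f'Epoch {epoch + 1}/{epochs} - last batch training loss: {loss.item():.4f}')
```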
Making Predictions
Prediction
Okay, now that you have a trained model, try it on some new twits and see if it works appropriately. Remember that for any new text, you’ll need to preprocess it first before passing it to the network. Implement the predict function to generate the prediction vector from a message.
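A sketch of such a predict function: preprocess the text, keep only in-vocabulary tokens, run the model on a batch of one, and exponentiate the log-probabilities back into probabilities.

```python
def predict(text, model, vocab):
    """Return the model's probability for each of the five sentiment classes."""
    tokens = [word for word in preprocess(text) if word in vocab]
    if not tokens:
        raise ValueError('No known tokens left after preprocessing.')

    # Shape (sequence_length, batch_size=1).
    text_input = torch.tensor([vocab[word] for word in tokens]).unsqueeze(1)
    text_input = text_input.to(next(model.parameters()).device)

    model.eval()
    with torch.no_grad():
        hidden = model.init_hidden(1)
        log_ps, _ = model(text_input, hidden)

    return torch.exp(log_ps).squeeze().cpu().numpy()

# Example twit (hypothetical) to sanity-check the trained model.
print(predict("Google is working on self-driving cars, I'm bullish on $goog", model, vocab))
```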
You can see that the prediction returns a probability of 0.69 for the “positive” class and 0.28 for the “very positive” class.