This article was first published on Medium on March 6, 2022 and has now been moved to this website. It details how I train a Recurrent Neural Network (RNN) model to analyze the sentiment of a tweet.
There are 4 main steps that we will go through:
- Step 1: Understand the problem.
- Step 2: Load, Analyze & Process Data.
- Step 3: Build the Model.
- Step 4: Train the Model.
The notebook can be found here.
Step 1: Understand the Problem
In this step we will take a look at the problem we’re trying to solve. We want to build a machine learning model that takes a string (a tweet) as input and predicts whether its content is negative, positive or neutral.
Now that we have a solid understanding of the problem, the next step is to understand the available data.
Step 2: Load, Analyze & Process Data
Before diving into loading data, make sure that you’ve imported all the necessary libraries:
import re
import string
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import InputLayer, TextVectorization, Embedding, LSTM, Dense, Activation
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau
from tensorflow.keras import layers
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
2.1. Load Data
There are several ways to obtain data, such as gathering it manually, using data augmentation techniques to enrich existing data, or looking for an existing dataset. For our problem, the data is available on Kaggle, and if you’re running your Jupyter Notebook on Kaggle it can be loaded easily:
url_train = "/kaggle/input/twitter-entity-sentiment-analysis/twitter_training.csv"
url_val = "/kaggle/input/twitter-entity-sentiment-analysis/twitter_validation.csv"
data_train_org = pd.read_csv(url_train, names=["ID", "entity", "sentiment", "content"])
data_val_org = pd.read_csv(url_val, names=["ID", "entity", "sentiment", "content"])
If your notebook is not on Kaggle, you may want to download the files from this link and upload them to the platform that you’re using. After that, change url_train and url_val in the script above to the file locations on your system.
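One way to keep the same notebook runnable both on and off Kaggle is to pick the first path that actually exists. This is only a minimal sketch: the helper name first_existing_path and the local ./data directory are assumptions made for illustration and are not part of the original notebook.

```python
from pathlib import Path

def first_existing_path(candidates):
    """Return the first candidate path that exists on disk, or None."""
    for candidate in candidates:
        if Path(candidate).exists():
            return str(candidate)
    return None

# The second entry is a hypothetical local fallback; point it at wherever
# you saved the CSV file on your own system.
url_train = first_existing_path([
    "/kaggle/input/twitter-entity-sentiment-analysis/twitter_training.csv",
    "./data/twitter_training.csv",
])
```

If url_train comes back as None, neither location exists and the file still needs to be downloaded.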
2.2. Analyze Data
Let’s take a look at the first few records:
data_train_org.head()

We only care about the sentiment and content fields: content will be the input sequence, whereas sentiment will be the label.
Let’s check the number of samples in the dataset. The script below outputs 74682, which is the number of instances in our dataset:
len(data_train_org) # 74682
Next, we will plot the sentiment classes and their counts:
data_train_org.sentiment.value_counts().plot(kind='bar')

There are four classes in total: negative, positive, neutral and irrelevant. Since the irrelevant tweets add noise to our data, we will remove them later on.
Also, although the numbers of negative, positive and neutral tweets are pretty close to each other, there’s a quite noticeable gap between irrelevant and the rest. In some cases, this kind of class imbalance may affect model performance and we may need to watch out for it carefully. However, since we’re not including irrelevant tweets in our data, we can safely set this issue aside.
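To put a number on "pretty close" versus "noticeable gap", one option is to compute each class's share and the ratio between the largest and smallest class. The sketch below uses made-up toy labels, not the real dataset counts, and the helper names class_shares and imbalance_ratio are hypothetical.

```python
from collections import Counter

def class_shares(labels):
    """Return each label's share of the total as a fraction."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def imbalance_ratio(labels):
    """Ratio of the most frequent class count to the least frequent one."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Toy labels for illustration only; these are not the real dataset counts.
toy_labels = ["Negative"] * 6 + ["Positive"] * 5 + ["Neutral"] * 5 + ["Irrelevant"] * 4
shares = class_shares(toy_labels)
ratio_all = imbalance_ratio(toy_labels)
ratio_kept = imbalance_ratio([l for l in toy_labels if l != "Irrelevant"])
```

In the notebook, the same functions could be called on data_train_org.sentiment. A ratio close to 1 means the classes are roughly balanced.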
Next, we will plot the frequency histogram of the tweet lengths:
data_train_org.content.str.len().plot.hist(bins=np.arange(0, 1000, 50))

You can change the number of bins as well as the bin size in the plot. From this plot, we can see that roughly 50% of the tweets are shorter than 100 characters. The distribution is right-skewed (or positively skewed), which is expected for tweets.
You may also want to examine some other sample statistics:
print(data_train_org.content.str.len().max())
# 957.0
print(data_train_org.content.str.len().mean())
# 108.78365046759285
print((data_train_org.content.str.len() > 100).sum())
# 33223
print((data_train_org.content.str.len() > 100).sum() / len(data_train_org))
# 0.44485953777349296
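The statistics above count characters, while a sequence model sees tokens. As a rough sketch, we could instead measure length in whitespace-separated tokens and look for the cutoff that covers a chosen share of tweets. Splitting on whitespace and the helper name token_length_cutoff are assumptions made for illustration; the toy tweets are invented.

```python
import math

def token_length_cutoff(texts, coverage=0.95):
    """Smallest token count n such that at least `coverage` of texts have <= n tokens."""
    lengths = sorted(len(text.split()) for text in texts)
    if not lengths:
        raise ValueError("texts must not be empty")
    # Nearest-rank percentile: position of the last text that must be covered.
    rank = max(1, math.ceil(coverage * len(lengths)))
    return lengths[rank - 1]

# Invented examples with 2, 5, 1 and 5 tokens respectively.
toy_tweets = ["good game", "this update is really bad", "ok", "love it so much today"]
cutoff = token_length_cutoff(toy_tweets, coverage=0.75)
```

In the notebook, this could be called on data_train_org.content.dropna() to get a feel for how many tokens a typical tweet contains.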
Also take a quick look at the contents:
for i, t_c_100 in data_train_org.content[data_train_org.content.str.len() > 200][0:20].items():
    print('******\n', t_c_100, '===>', data_train_org.sentiment[i:i+1].values)
Understanding the data is one of the key steps before building any machine learning model. For the scope of this article, this is enough and we can proceed to the next step.
2.3. Pre-process Data
As mentioned earlier, we will drop all tweets with the Irrelevant label:
print(len(data_train_org[data_train_org.sentiment == 'Irrelevant']))
print(len(data_val_org[data_val_org.sentiment == 'Irrelevant']))
data_train = data_train_org.drop(data_train_org[data_train_org.sentiment == 'Irrelevant'].index)
data_val = data_val_org.drop(data_val_org[data_val_org.sentiment == 'Irrelevant'].index)
print(len(data_train[data_train.sentiment == 'Irrelevant']))
print(len(data_val[data_val.sentiment == 'Irrelevant']))
Drop missing values:
print(data_train.isna().sum())
data_train = data_train.dropna()
print(data_train.isna().sum())
print(data_val.isna().sum())
data_val = data_val.dropna()
print(data_val.isna().sum())
If you print out the labels, they will look like this:
data_train.sentiment[10:20]
# 10 Positive
# 11 Positive
# 12 Neutral
# 13 Neutral
# 14 Neutral
# 15 Neutral
# 16 Neutral
# 17 Neutral
# 18 Positive
# 19 Positive
# Name: sentiment, dtype: object
We need to transform those strings into numerical values. One way is to use one-hot encoding:
# integer encode
label_encoder = LabelEncoder()
Y_train_integer_encoded = label_encoder.fit_transform(data_train.sentiment)
Y_val_integer_encoded = label_encoder.transform(data_val.sentiment)
# binary encode
# Note: on scikit-learn 1.2 and later, pass sparse_output=False instead of sparse=False.
Y_onehot_encoder = OneHotEncoder(sparse=False)
Y_train_onehot = Y_onehot_encoder.fit_transform(Y_train_integer_encoded.reshape(-1, 1))
Y_val_onehot = Y_onehot_encoder.transform(Y_val_integer_encoded.reshape(-1, 1))
After the above step, the labels look like this:
Y_train_onehot[10:20]
# array([[0., 0., 1.],
# [0., 0., 1.],
# [0., 1., 0.],
# [0., 1., 0.],
# [0., 1., 0.],
# [0., 1., 0.],
# [0., 1., 0.],
# [0., 1., 0.],
# [0., 0., 1.],
# [0., 0., 1.]])
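To see why Positive ends up as [0, 0, 1] and Neutral as [0, 1, 0], it may help to redo the two encoding steps by hand. LabelEncoder assigns integers to the classes in sorted order, so Negative is 0, Neutral is 1 and Positive is 2. The sketch below mimics that in plain Python; the function name one_hot_labels is hypothetical and this is not how scikit-learn implements it internally.

```python
def one_hot_labels(labels):
    """Integer-encode labels in sorted order, then one-hot encode them."""
    classes = sorted(set(labels))  # e.g. ['Negative', 'Neutral', 'Positive']
    index_of = {name: i for i, name in enumerate(classes)}
    vectors = []
    for label in labels:
        row = [0.0] * len(classes)
        row[index_of[label]] = 1.0  # a single 1 at the class position
        vectors.append(row)
    return classes, vectors

classes, vectors = one_hot_labels(["Positive", "Neutral", "Negative", "Positive"])
```

The classes list is also what you would need later to turn a predicted class index back into a readable label.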
Step 3: Build the Model
We will leverage TensorFlow’s Dataset API to make our training process easier. The code below converts the training data into a TensorFlow Dataset:
train_ds = tf.data.Dataset.from_tensor_slices((data_train.content.to_numpy(dtype=str), Y_train_onehot))
3.1. Vectorize Text Input
For sequence models, most frameworks expect inputs as encoded vectors instead of raw strings. TensorFlow’s TextVectorization layer can be used to build a vocabulary containing the unique tokens extracted from our text data. Moreover, this layer can normalize (or standardize) text, that is, remove unwanted characters or symbols such as punctuation marks.
# Vocabulary size and number of words in a sequence.
vocab_max_size = 10000
tweet_max_length = 50
# Use the text vectorization layer to normalize, split, and map strings to
# integers. With no `standardize` argument, the layer applies its default
# standardization (lowercase and strip punctuation).
# Set output_sequence_length because the samples are not all the same length.
vectorize_layer = TextVectorization(
    max_tokens=vocab_max_size,
    output_sequence_length=tweet_max_length,
)
# Make a text-only dataset (no labels) and call adapt to build the vocabulary.
vocab_data = train_ds.map(lambda x, y: x)
vectorize_layer.adapt(vocab_data)
You can define a custom standardization function and pass it to the standardize argument of the TextVectorization constructor, for example:
# Create a custom standardization function to strip HTML break tags '<br />'.
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
    return tf.strings.regex_replace(stripped_html,
                                    '[%s]' % re.escape(string.punctuation), '')
To test the layer and see the result:
vectorized_input = vectorize_layer(np.array(['test string']))
print(vectorized_input.shape) # (1, 50), since tweet_max_length = 50
print(vectorized_input)
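To build some intuition for what comes out of the layer, here is a simplified pure-Python sketch of the same idea: lowercase and strip punctuation, split on whitespace, keep the most frequent tokens, reserve index 0 for padding and index 1 for unknown words, then pad or truncate to a fixed length. This is only an approximation written for illustration, with hypothetical function names; it is not Keras's actual implementation.

```python
import re
import string
from collections import Counter

PUNCT_RE = re.compile("[%s]" % re.escape(string.punctuation))

def standardize(text):
    """Lowercase the text and strip ASCII punctuation."""
    return PUNCT_RE.sub("", text.lower())

def build_vocab(texts, max_tokens):
    """Index 0 is padding, index 1 is out-of-vocabulary, then tokens by frequency."""
    counts = Counter(tok for text in texts for tok in standardize(text).split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(text, vocab, length):
    """Map tokens to indices, then truncate or zero-pad to a fixed length."""
    index_of = {tok: i for i, tok in enumerate(vocab)}
    ids = [index_of.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:length]
    return ids + [0] * (length - len(ids))

vocab = build_vocab(["I love this game!", "I hate this lag..."], max_tokens=10)
encoded = vectorize("I love pizza", vocab, length=5)
```

Here "pizza" was never seen while building the vocabulary, so it maps to index 1, and the two trailing zeros are padding.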
3.2. Use Pre-trained Embeddings
After vectorizing the input, each tweet becomes a vector of integer token indices that is zero-padded to a fixed length, so for short tweets it contains mostly zeros and just a few non-zero numbers. This representation is inefficient for computation, and the indices don’t reflect the relationships between words in the vocabulary. Thus we will add a word embedding layer to help us capture the semantic meanings of words in the vocabulary.
You can train your own word embedding layer or use a pre-trained embedding. In this notebook, we will use the GloVe pre-trained embeddings. First, we need to download the embeddings:
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip
Load the embedding vectors from the downloaded files. We will use the 200-dimensional embeddings, but you can change this to a smaller embedding dimension (the glove.6B archive also contains 50- and 100-dimensional files):
embedding_dim = 200
path_to_glove_file = f'./glove.6B.{embedding_dim}d.txt'
embeddings_index = {}
with open(path_to_glove_file, encoding="utf-8") as f:
    for line in f:
        # Each line is a word followed by its space-separated coefficients.
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs
print("Found %s word vectors." % len(embeddings_index))
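Each line of a GloVe text file is a word followed by its space-separated coefficients. To check the parsing logic without downloading the full archive, a small sketch on an in-memory stand-in file can help. The two vectors below are made up, and parse_glove is a hypothetical helper that uses plain Python lists instead of NumPy arrays.

```python
import io

def parse_glove(lines):
    """Parse 'word v1 v2 ... vn' lines into a dict of word -> list of floats."""
    index = {}
    for line in lines:
        word, coefs = line.rstrip().split(maxsplit=1)
        index[word] = [float(value) for value in coefs.split()]
    return index

# Two made-up 3-dimensional vectors standing in for the real GloVe file.
fake_file = io.StringIO("good 0.1 0.2 0.3\nbad -0.1 -0.2 -0.3\n")
toy_index = parse_glove(fake_file)
```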
Next, we will convert each word index produced by vectorize_layer into a dense representation. To do so, we first need to build a Python dictionary mapping each word in the vocabulary to an index:
voc = vectorize_layer.get_vocabulary()
word_index = dict(zip(voc, range(len(voc))))
Use the loaded GloVe vectors to fill an embedding matrix with one row per vocabulary index:
misses = 0
hits = 0
num_tokens = len(voc) + 2
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
        hits += 1
    else:
        # Words not found in the embedding index stay all-zeros.
        # This includes the representation for "padding" and "OOV".
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))
embedding_layer_trained = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,
)
Check the embedding layer:
embedding_layer_trained(vectorized_input).shape
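Conceptually, a frozen embedding layer is a row lookup into the embedding matrix: every token index is replaced by the corresponding row. The toy sketch below shows how a batch of index sequences with shape (batch, sequence length) becomes a nested structure with shape (batch, sequence length, embedding dimension). The matrix values and the helper name embed are invented for illustration.

```python
def embed(batch_of_ids, matrix):
    """Replace every token index with its row of the embedding matrix."""
    return [[matrix[token_id] for token_id in ids] for ids in batch_of_ids]

# Toy vocabulary of 4 tokens with 2-dimensional vectors.
# Rows 0 (padding) and 1 (OOV) stay all-zero, as in the matrix built above.
toy_matrix = [[0.0, 0.0], [0.0, 0.0], [0.5, -0.5], [1.0, 1.0]]
toy_batch = [[2, 3, 0], [3, 1, 0]]       # shape (2, 3)
embedded = embed(toy_batch, toy_matrix)  # shape (2, 3, 2)
```

With the settings used in this article (sequence length 50, embedding dimension 200), a single input string should therefore come out of the embedding layer with shape (1, 50, 200).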
3.3. Create Model
Besides the TextVectorization layer and the pre-trained embedding layer, we will add a few other layers to our model. Specifically, we will wrap a SimpleRNN layer in a Bidirectional layer, then add a Dropout layer and two Dense layers:
output_len = Y_train_onehot.shape[-1]
model = Sequential([
    # Each sample is a single raw string, so the input dtype is tf.string.
    InputLayer(input_shape=(1,), dtype=tf.string),
    vectorize_layer,
    embedding_layer_trained,
    tf.keras.layers.Bidirectional(layers.SimpleRNN(128)),
    tf.keras.layers.Dropout(0.1),
    Dense(16, activation='relu'),
    Dense(output_len, activation=tf.keras.activations.softmax),
])
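As a sanity check on the architecture, the number of trainable parameters can be worked out by hand. The sketch below assumes the usual layer formulas (a SimpleRNN has a kernel, a recurrent kernel and a bias; a Bidirectional wrapper holds a forward and a backward copy and concatenates their outputs by default) and three output classes. The helper names are hypothetical.

```python
def simple_rnn_params(input_dim, units):
    """Kernel + recurrent kernel + bias of one SimpleRNN layer."""
    return units * (input_dim + units + 1)

def dense_params(input_dim, units):
    """Weights + bias of one Dense layer."""
    return input_dim * units + units

embedding_dim, rnn_units, num_classes = 200, 128, 3
bi_rnn = 2 * simple_rnn_params(embedding_dim, rnn_units)  # forward + backward copies
hidden = dense_params(2 * rnn_units, 16)                  # both directions concatenated
output = dense_params(16, num_classes)
trainable_total = bi_rnn + hidden + output
```

The embedding layer is frozen (trainable=False), so its weights are not part of this total. The Dropout layer has no parameters.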
The model will use the Adam algorithm to minimize the cost. Note that the categorical_crossentropy loss is used since we’re working on a multi-class classification problem. To evaluate the model performance we need a metric, and in our specific case accuracy is chosen: the higher, the better. There are other metrics as well, such as recall and precision, or mean squared error (MSE) for regression. The choice depends on the type of problem we’re working with, and for our problem accuracy is a reasonable choice.
opt = tf.keras.optimizers.Adam()
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['acc'])
model.summary()
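To make the loss and the metric more concrete, here is a small pure-Python sketch of what they compute for a single sample: softmax turns the raw scores of the last layer into probabilities, categorical cross-entropy is the negative log-probability given to the true class, and accuracy checks whether the most probable class matches the label. The scores are invented and the function names are hypothetical; Keras computes the same quantities over whole batches.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)

def is_correct(one_hot, probs):
    """A prediction counts as correct when the argmax matches the label."""
    return probs.index(max(probs)) == one_hot.index(max(one_hot))

probs = softmax([2.0, 1.0, 0.1])                         # invented scores for 3 classes
loss = categorical_crossentropy([1.0, 0.0, 0.0], probs)  # true class is index 0
```

The more probability the model puts on the true class, the closer the loss gets to zero.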
Step 4: Train the Model
To adjust the learning rate during training, we will leverage the ReduceLROnPlateau callback, which lowers the learning rate when the validation loss stops improving, and also EarlyStopping to stop the training if no progress has been made:
reduce_lr = ReduceLROnPlateau(monitor='val_loss',
                              factor=0.2,
                              patience=3,
                              verbose=1,
                              min_delta=0.0001)
early_stopping = EarlyStopping(monitor='val_loss', min_delta=0, patience=50, verbose=1)
callbacks_list = [early_stopping, reduce_lr]
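To get a feel for how factor, patience and min_delta interact, the plateau rule can be simulated on a list of validation losses. This is a simplified sketch that ignores the callback's cooldown and min_lr options. The starting rate of 0.001 is Adam's default in Keras, the losses are invented, and simulate_reduce_lr is a hypothetical helper, not part of Keras.

```python
def simulate_reduce_lr(val_losses, lr=0.001, factor=0.2, patience=3, min_delta=0.0001):
    """Return the learning rate in effect after each epoch's validation loss."""
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best - min_delta:  # counts as an improvement
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:     # plateau detected: shrink the learning rate
                lr = lr * factor
                wait = 0
        history.append(lr)
    return history

# Invented losses: two improvements, then three epochs without progress.
lrs = simulate_reduce_lr([0.90, 0.80, 0.80, 0.80, 0.80, 0.70])
```

In this toy run, the learning rate is multiplied by 0.2 after the third epoch in a row without an improvement.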
Shuffle, repeat and batch the dataset before training:
BATCH_SIZE = 1024
SHUFFLE_BUFFER = 1000
train_ds = train_ds.shuffle(SHUFFLE_BUFFER).repeat().batch(BATCH_SIZE)
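Because repeat() makes the dataset endless, the training loop needs to be told how many batches make up one epoch. The sketch below imitates repeat().batch() with plain Python iterators (shuffling is left out to keep the output predictable) and shows the arithmetic on a toy dataset of 10 samples. The helper name repeat_and_batch is hypothetical.

```python
import itertools

def repeat_and_batch(items, batch_size):
    """Endless stream of batches, similar to repeat().batch(batch_size) without shuffling."""
    stream = itertools.cycle(items)
    while True:
        yield list(itertools.islice(stream, batch_size))

num_samples, batch_size = 10, 4
steps_per_epoch = num_samples // batch_size  # number of full batches per epoch
batches = repeat_and_batch(range(num_samples), batch_size)
first_epoch = [next(batches) for _ in range(steps_per_epoch)]
next_batch = next(batches)                   # wraps around into the next pass
```

Note that the batch after the first epoch mixes the last samples of one pass with the first samples of the next. That is also why applying shuffle, repeat and batch a second time to the same dataset gives unexpected shapes.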
Finally, we can start training our model:
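NUM_EPOCHS = 100
model.fit(
    train_ds,  # already batched above, so batch_size is not passed again
    steps_per_epoch=len(data_train) // BATCH_SIZE,
    epochs=NUM_EPOCHS,
    # Validation inputs are the raw tweet strings, matching the training dataset.
    validation_data=(data_val.content.to_numpy(dtype=str), Y_val_onehot),
    callbacks=callbacks_list,
)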
If you encounter errors, make sure that you shuffle, repeat and batch the dataset only once. A good way to do that is to run the notebook again from the start.
Final Thoughts
Though the model accuracy is pretty high, we can still experiment with some parameters that I haven’t covered in detail. One of them is the output sequence length in the TextVectorization layer: try changing it to a bigger number, for example 200, and watch the training process closely. Another point worth mentioning is the model architecture. Try replacing the SimpleRNN layer with an LSTM layer, remove the Bidirectional layer, and see what happens.