How To Train Your AI Models With tagtog
Training your machine learning models in a collaborative platform has never been easier.
tagtog is a collaborative text annotation platform to find, create, and maintain NLP datasets efficiently. You can find a quick introduction to tagtog here. But, apart from being just a text annotation platform, tagtog has many other functionalities, and in this article I will explain how to take advantage of one of them: training your artificial intelligence models.
The first step to start an AI project is to decide what problem we want to solve, and then choose a model that could perform adequately. Also, bear in mind that the performance of the model will depend on the quality of the data we gather. Thus, it is of paramount importance that we have access to some decent datasets. At tagtog, users can share their projects in a public way, so go to tagtog to check if you can find the dataset you were looking for among our public datasets!
In my example, I will be covering one of the most popular problems since the early stages of NLP: spam detection. That is, we are going to build an NLP model that, by analyzing the content of a text, will determine whether it's spam or not. The code of this example can be found in this repository.
This topic has been covered by many NLP practitioners over the past decades, so it is easy to find well-structured, labeled data to train your model with. The dataset I used can be found in tagtog.
Regarding the model, for the purpose of explainability and simplicity, I have decided to use a simple recurrent neural network, using the popular Python library Keras. Nevertheless, simplicity must not be confused with inaccuracy, as we will see later that my model performs sufficiently well.
Let’s see now how to use the model. First, we need to clean and process the data. The dataset I chose was already clean enough, so I just included a Tokenizer to vectorize the text corpus. I also mapped the values spam and ham (ham refers to the emails that are not spam) in the labels list to 1 and 0, respectively.
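To make the vectorization step concrete, here is a minimal pure-Python sketch of the kind of transformation a tokenizer plus padding performs. It is a stand-in for illustration, not the Keras Tokenizer API: the names fit_word_index, texts_to_padded and LABEL_TO_INT, as well as the two sample messages, are hypothetical.

```python
import re
from collections import Counter

# Label mapping described in the text: spam -> 1, ham -> 0.
LABEL_TO_INT = {"spam": 1, "ham": 0}

WORD_PATTERN = re.compile(r"[a-z0-9']+")


def fit_word_index(texts):
    """Give every word an integer id, most frequent first; 0 is kept for padding."""
    counts = Counter(
        word for text in texts for word in WORD_PATTERN.findall(text.lower())
    )
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_padded(texts, word_index, maxlen):
    """Turn texts into fixed-length id lists: unknown words are dropped,
    long texts keep their last `maxlen` ids, short ones are left-padded with 0."""
    rows = []
    for text in texts:
        ids = [
            word_index[w]
            for w in WORD_PATTERN.findall(text.lower())
            if w in word_index
        ]
        ids = ids[max(0, len(ids) - maxlen):]
        rows.append([0] * (maxlen - len(ids)) + ids)
    return rows


texts = ["Win a free prize now", "Are we still meeting now"]
labels = ["spam", "ham"]

word_index = fit_word_index(texts)
x = texts_to_padded(texts, word_index, maxlen=6)
y = [LABEL_TO_INT[label] for label in labels]
```

The same word index has to be reused later for retraining and prediction, otherwise the integer ids would no longer mean the same words.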
Another important step is to split our dataset into two parts, one for training the model and the other for testing it. This can be done in different ways, but I chose to take 80% of the dataset for training and the remaining 20% for testing. In general, you should also set aside part of the dataset for validation. Validation helps you detect overfitting, which happens when your model fits the training set too closely and generalizes poorly to new data. Again, for the sake of simplicity, I have not used a validation set.
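The 80/20 split can be sketched in a few lines of plain Python. This is only one way to do it, under the assumption that a shuffled split is acceptable; the function name and the seed are hypothetical.

```python
import random


def train_test_split(samples, labels, test_fraction=0.2, seed=42):
    """Shuffle the data with a fixed seed and hold out `test_fraction` for testing."""
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(sequence, idx):
        return [sequence[i] for i in idx]

    return (
        pick(samples, train_idx),
        pick(samples, test_idx),
        pick(labels, train_idx),
        pick(labels, test_idx),
    )


samples = [f"message {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]
x_train, x_test, y_train, y_test = train_test_split(samples, labels)
```

Shuffling before splitting matters when the dataset is sorted by label; fixing the seed keeps the split reproducible between runs.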
Now, let’s move to the model definition. When building a basic Keras model, it is recommended to start with the Keras Sequential model, which is a linear stack of layers. We just need to pass a list of layer instances to the constructor. In my network architecture, the first layer is an Embedding layer. This type of layer can only be used as the first layer in a model, and it essentially represents the text data as dense vectors. Notice that the text data must first be processed with the Tokenizer API before it can be fed to the embedding layer. The parameters passed to the Embedding constructor are the size of the vocabulary and the dimension of the dense embedding. Then we add a SimpleRNN layer, which is a fully connected RNN layer whose output is fed back as input at the next step. Finally, we just need to add the output layer; in our case it outputs a single value, and the activation function is the sigmoid function. This output is a number between 0 and 1, so I decided to set the decision boundary at 0.5. This means that the input text will be classified as spam when the output is higher than 0.5, and as ham otherwise.
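The architecture described above could look roughly like the following sketch. The layer sizes (embedding_dim, rnn_units) are assumptions, not the values I used, and the helper names build_model and classify are hypothetical. The Keras import sits inside the function so that the threshold helper can be used on its own.

```python
def build_model(vocab_size, embedding_dim=32, rnn_units=32):
    """Embedding -> SimpleRNN -> single sigmoid output, as a Keras Sequential model."""
    # Imported lazily so classify() below works without TensorFlow installed.
    from tensorflow.keras.layers import Dense, Embedding, SimpleRNN
    from tensorflow.keras.models import Sequential

    return Sequential([
        # vocab_size must be larger than the highest word id (id 0 is padding).
        Embedding(input_dim=vocab_size, output_dim=embedding_dim),
        SimpleRNN(rnn_units),
        Dense(1, activation="sigmoid"),
    ])


def classify(probability, threshold=0.5):
    """Apply the decision boundary: above the threshold is spam, otherwise ham."""
    return "spam" if probability > threshold else "ham"
```

For example, classify(0.93) returns "spam", while an output of exactly 0.5 falls on the ham side of the boundary.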
Once we have our model defined, we must compile it, including some optional parameters such as the metrics we want our model to evaluate. After compiling the model, we need to fit it to our training data by calling the method fit. To obtain some information about how our model performs, we can call the method model.evaluate with the test data. In my case the accuracy is around 98%! This confirms that the model is not underperforming, although we should not assume that it will make the right prediction 98% of the time in practice, as it may suffer from overfitting. If we wanted to prevent our model from overfitting, we could try many different techniques, one of the most common being regularization. Once we have followed all the previous steps, we just need to save the model to an h5 file.
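The compile, fit, evaluate and save sequence can be sketched as a single helper. The optimizer, loss, number of epochs and batch size below are assumptions chosen as common defaults for binary classification, and train_and_save is a hypothetical name.

```python
def train_and_save(model, x_train, y_train, x_test, y_test,
                   path="model.h5", epochs=5, batch_size=32):
    """Compile, fit, evaluate and save a Keras binary classifier."""
    # Binary cross-entropy matches a single sigmoid output.
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size)
    # With one metric configured, evaluate() returns [loss, accuracy].
    loss, accuracy = model.evaluate(x_test, y_test)
    model.save(path)
    return loss, accuracy
```

The returned accuracy is measured on the test split only, so it is an estimate of how the model behaves on unseen data rather than a guarantee.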
Now let’s define some methods for retraining the model with new data. For this purpose, we need to follow three steps: load the model, retrain it, and save it again. To load the model, we just need to call the Keras method load_model(model_file_path), where model_file_path is the path to the h5 file we saved earlier. Next we have to decide whether the new data needs to be preprocessed and, if so, proceed as explained before: use the previously created tokenizer to transform the training text into sequences and then pad them. Once our data is ready, we just need to call the method model.fit() again, only this time with the new data. Finally, we save the model again and the training is finished.
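The three retraining steps (load, fit, save) can be sketched as follows. The function name retrain and the loader argument are hypothetical; loader only exists so the function can be exercised without TensorFlow, and by default the Keras load_model function is used. The inputs are assumed to be already tokenized and padded.

```python
def retrain(model_file_path, x_new, y_new, epochs=1, loader=None):
    """Load a saved model, fit it on new (already padded) data and save it back."""
    if loader is None:
        # Default to the Keras loader; imported lazily on purpose.
        from tensorflow.keras.models import load_model as loader
    model = loader(model_file_path)
    model.fit(x_new, y_new, epochs=epochs)
    # Overwrite the same file so the next call picks up the updated weights.
    model.save(model_file_path)
    return model
```

Training on a single new document per call is cheap, but it can also make the model drift toward the most recent examples, so in practice it may be worth batching several confirmed documents together.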
Now that we know how to build and train a model, let’s go back to tagtog to see how we can train the model more easily. The first thing to do is to go to tagtog.net and create a new project. Then we go to the project settings and define what to annotate, for example entities and document labels. In my case, as I want to know whether a text is spam or not, I created a boolean document label named IsSpam.
After that, let's create a webhook, which will trigger the training method each time a document's annotations are saved or confirmed. We just need to include the endpoint of our webhook (notice that we must include the http:// part) and then fill the rest of the parameters as follows:
We have already finished setting up our project… let’s see now how we can connect the tagtog webhook with the Python ML model! Here I include a list of the steps that we need to follow.
- Take a piece of unlabelled text
- Label it using the current version of your model
- Upload it to tagtog
- Review the annotations and change the ones that are not correct
- Confirm the document annotations
- The webhook is triggered and the model is retrained
- Repeat the previous steps so that the model can be improved
To connect the webhook with my ML model, I chose Flask, a Python micro web framework that stands out for its simplicity and for not requiring any particular tools or libraries. To connect the Flask app to the webhook, we can just import Flask, call the Flask app constructor and then define the method as follows:
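A minimal sketch of such an endpoint is shown below. It assumes that the webhook sends a JSON body containing a tagtogID field and that the endpoint lives at the root path; both are assumptions to check against your own webhook configuration. The names extract_doc_id, create_app and on_confirmed are hypothetical.

```python
def extract_doc_id(payload):
    """Read the document id from the webhook body (field name is an assumption)."""
    if not isinstance(payload, dict):
        return None
    return payload.get("tagtogID")


def create_app(on_confirmed):
    """Build a Flask app whose POST endpoint calls on_confirmed(doc_id)."""
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/", methods=["POST"])
    def webhook():
        payload = request.get_json(silent=True) or {}
        doc_id = extract_doc_id(payload)
        if doc_id is None:
            return jsonify(error="missing tagtogID"), 400
        # This is where the retraining logic would be triggered.
        on_confirmed(doc_id)
        return jsonify(status="ok", tagtogID=doc_id)

    return app
```

To try it locally, something like create_app(print).run(port=5000) starts the development server and prints the id of every document whose annotations are confirmed.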
Having defined that method as explained before, now we just need to interact with tagtog’s API and with our AI model. The first thing to do is to get the data from the request sent by the webhook (tagtog -> your system), which includes the parameters we need to later obtain the training data. Notice that this method is the one that will be triggered when we confirm the annotations in a document, so that our model can be retrained. Once we have the parameters, we need to send two GET requests, the first to retrieve the annotations from a document that has been annotated, and the second to obtain its text.
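The two GET requests could be sketched like this. The endpoint URL, the query parameter names (owner, project, ids, output) and the output values "ann.json" and "text" reflect my understanding of the tagtog documents API and should be treated as assumptions to verify against the official API documentation. The helper names are hypothetical.

```python
# Assumed documents API endpoint; check it against the tagtog API docs.
TAGTOG_API_URL = "https://www.tagtog.net/-api/documents/v1"


def build_get_params(owner, project, doc_id, output):
    """Query parameters for retrieving one document in the given output format."""
    return {"owner": owner, "project": project, "ids": doc_id, "output": output}


def fetch_annotations_and_text(owner, project, doc_id, auth):
    """Send the two GET requests: first the annotations, then the plain text.

    `auth` is a (username, password) tuple used for HTTP basic authentication.
    """
    import requests  # third-party; imported lazily so the helper above has no dependency

    ann_response = requests.get(
        TAGTOG_API_URL,
        params=build_get_params(owner, project, doc_id, "ann.json"),
        auth=auth,
    )
    ann_response.raise_for_status()
    text_response = requests.get(
        TAGTOG_API_URL,
        params=build_get_params(owner, project, doc_id, "text"),
        auth=auth,
    )
    text_response.raise_for_status()
    return ann_response.json(), text_response.text
```

Keeping the parameter construction in its own function makes it easy to reuse the same owner and project values for the upload requests later on.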
The following steps are to load the model we previously saved as an h5 file, train the model as explained before, make a prediction, and then format the output of the prediction so that it complies with the ann.json format.
The method collect_unlabeled_sample() should be implemented so that it returns the text that is supposed to be annotated by the model. In my case, I will just return a list of strings to keep it simple, but it could be, for example, a method that reads a .txt file and returns its contents. The prediction method also needs to preprocess the data the same way as before. Then we would just call the model.predict_classes() method and return its output. Note that predict_classes() was removed in TensorFlow 2.6; on newer versions, call model.predict() and compare its output against the 0.5 threshold instead.
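A minimal sketch of both methods, assuming the input has already been tokenized and padded, and using model.predict() with an explicit threshold rather than predict_classes(). The sample strings and the name predict_labels are hypothetical.

```python
def collect_unlabeled_sample():
    """Return the texts the model should annotate (hard-coded here for simplicity)."""
    return [
        "Congratulations, you won a free cruise!",
        "See you at the meeting tomorrow",
    ]


def predict_labels(model, padded_sequences, threshold=0.5):
    """Return 1 (spam) or 0 (ham) for each input row.

    model.predict() yields one sigmoid probability per row, shaped (n, 1),
    so each row's first element is compared against the threshold.
    """
    probabilities = model.predict(padded_sequences)
    return [int(row[0] > threshold) for row in probabilities]
```

Because the function only relies on predict(), it works with any object exposing that method, which also makes it easy to test with a stub model.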
Regarding the formatting of the annotations, this is the way I implemented the parser. I just needed to paste the ann.json structure and then change the metas to include the document label.
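As an illustration, a parser along these lines could build the structure as a Python dictionary. The field layout below (annotatable, anncomplete, sources, metas, entities, relations, and the confidence block) is my understanding of the ann.json format and should be compared with a document exported from your own project. The label id "m_1", the who value and the function name are hypothetical; the real id of the IsSpam label has to be looked up in your project.

```python
import json


def format_annotation(is_spam, label_id, probability=1.0,
                      who="ml:spam-model", parts=None):
    """Build an ann.json-like dictionary carrying a single boolean document label."""
    return {
        "annotatable": {"parts": list(parts or [])},
        # Left unconfirmed so that a human can still review the prediction.
        "anncomplete": False,
        "sources": [],
        "metas": {
            label_id: {
                "value": bool(is_spam),
                "confidence": {
                    "state": "pre-added",
                    "who": [who],
                    "prob": probability,
                },
            }
        },
        "entities": [],
        "relations": [],
    }


annotation = format_annotation(is_spam=1, label_id="m_1")
annotation_json = json.dumps(annotation)
```

Since spam detection only needs a document label, the entities and relations lists stay empty.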
The remaining part is to upload the new documents to tagtog. For that, I changed the parameters variable to send a new request. These parameters are essentially the same as the ones I used before, only without the tagtogID. After that, I sent a POST request with the text I wanted to annotate, and this request returned the plain.html of the document, which I will use to annotate the document:
Finally, everything is ready for annotating the document. To do that, we need to add the format field to the parameters and set it to "anndoc". Then we bundle the annotations and the text in a variable and send a new POST request.
Now, if we go to tagtog and have a look at the project, we will see that a new document has shown up! If we click on the document, we will see that it has already been labeled! 💫
If you repeat this process multiple times, you will see your model get better at classifying spam. Once you have your trained model, you can always take the model.h5 file and use it, for example, in any web or mobile application. But spam detection is just one use case: you can use tagtog to train any model you want in any artificial intelligence field, from sentiment analysis to machine translation, so please let me know in the comments what model you would train in tagtog!