Build a Sentiment Analysis app from scratch

Overview

This guide outlines the full process of building a state-of-the-art sentiment classifier app. We will tackle the following steps:

  1. Importing your WhatsApp data into your personal and private POD.
  2. Setting up your new project
  3. Labelling imported data with the built-in labelling tool
  4. Loading and training your model in Google Colab
  5. Adding your model to an app plugin and testing
  6. Deploying your data app
  7. Adding a UI to your app

Import your WhatsApp data into your POD

With Memri, you can import data from third-party services like WhatsApp into your Memri Pod. Once your data is in the Pod, you can easily label it and use it to build data-powered applications. All your data is fully encrypted in the Pod and is accessible only with keys that you own.

To import your WhatsApp messages:

In the Memri app, on the Data page, click the WhatsApp importer and follow the instructions. You will need your mobile phone with WhatsApp installed to link a device from within WhatsApp.

  • Select WhatsApp importer if you are using an up-to-date version of the app.
  • Select WhatsApp Legacy importer if you’re using WhatsApp version 2.22.10.10 or earlier.

That’s it! Once WhatsApp is successfully authorised, your messages will automatically sync with your POD.

Notes:
You can use up to four linked devices at a time.
Your phone doesn’t need to stay online to use WhatsApp on linked devices, but your linked devices will be logged out if you don’t use your phone for over 14 days.

Set up your new project

Project name and data source

Once your data is imported into the pod, you can create your first ML project. To do that you need to provide a name for your new project (this will also be the name of your app).

Next, select the data source for your new app. In this tutorial, that is WhatsApp.

Select feature variables

Feature variables represent different information available in your connected datasets. You can think of them as column headers: for instance choosing “message text” means that your ML model will use the content of your messages as input to make predictions. The variables chosen should be relevant to the predictions you wish your app to make.

Since we are building a sentiment analysis app that makes predictions on the content of your messages, select: content.

You have now created a dataset (a subset of all the data in the pod) for your ML app. By default, the dataset has 500 items.
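To make the idea of a dataset more concrete, the selection can be pictured as keeping only the chosen feature variables from a list of records and capping the result at the default size. The sketch below is purely illustrative: `build_dataset`, the `service` field, and the sample messages are hypothetical and are not part of the Memri app or API.

```python
from typing import Dict, List

DEFAULT_DATASET_SIZE = 500  # default dataset size mentioned in this guide


def build_dataset(
    items: List[Dict[str, str]],
    features: List[str],
    limit: int = DEFAULT_DATASET_SIZE,
) -> List[Dict[str, str]]:
    """Keep only the chosen feature variables and at most `limit` items."""
    dataset: List[Dict[str, str]] = []
    for item in items:
        if len(dataset) >= limit:
            break
        # Skip items where a selected feature is missing or empty.
        if not all(item.get(name) for name in features):
            continue
        dataset.append({name: item[name] for name in features})
    return dataset


# Hypothetical messages as they might look after import.
messages = [
    {"content": "this is great", "service": "whatsapp"},
    {"content": "", "service": "whatsapp"},
    {"content": "see you tomorrow", "service": "whatsapp"},
]
dataset = build_dataset(messages, features=["content"])
print(dataset)
```

Only the `content` field survives, and the empty message is dropped because it gives the model nothing to predict on.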

Label imported data with the built-in labelling tool

Once the dataset has been created, we need to configure the built-in labeling tool and label or annotate the items in the dataset to provide enough data samples for the model to make accurate predictions.

Add labels

Labels help your model to learn to detect and identify the class of objects in raw data (for instance, to identify all of the messages in your POD with positive sentiment).

We are building a sentiment analysis app, so you need to add the following labels:

  • positive
  • negative
  • neutral

You are then ready for the next step: labelling your data.

Label your data

To train your machine learning algorithm, you will need data with proper labels. It is important to label the data accurately, as it provides a basis for testing and validating your model.

To do that, tag each message with one of the previously created labels. If none of them is relevant, you may skip the item. Use the index numbers as keyboard shortcuts to help you label at warp speed!

You don’t need to label the entire dataset upfront, though more accurately labeled data will increase your model’s accuracy. Once you feel you have labeled enough to train the model, quit labelling and move to the next step. If necessary, you will be able to come back to this step and label more data.
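One rough way to judge whether you have "labeled enough" is to look at how many items each class has received so far. The following is a minimal sketch with made-up data; `label_counts` and the `session` list are hypothetical and do not reflect how the labelling tool stores its results.

```python
from collections import Counter
from typing import Dict, List, Optional

LABELS = ["positive", "negative", "neutral"]


def label_counts(labels: List[Optional[str]]) -> Dict[str, int]:
    """Count labelled items per class; skipped items (None) are ignored."""
    counts = Counter(label for label in labels if label is not None)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return {label: counts.get(label, 0) for label in LABELS}


# Hypothetical labelling session: None marks a skipped message.
session = ["positive", "neutral", None, "negative", "positive", None]
counts = label_counts(session)
print(counts)
```

If one class has far fewer examples than the others, labelling a few more items of that class is usually a better use of time than labelling more of the majority class.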

Load and train your model on Google Colab

Once you have labeled your data, you can use it to train your model. For this we are going to use Google Colab, a hosted Jupyter notebook environment that lets you create, edit and run live code in a single document.

If you have never used notebooks, please go through the Google Colab quick intro.

Google Colab Sentiment analysis template

In the Google Colab sentiment analysis template for training your model you will:

  1. import the libraries necessary for training your model
  2. load your labeled dataset from the POD. In this step you will be prompted to provide the corresponding dataset name and pod keys. You will find them in the Memri app under the ‘You will need this information’ section
  3. use the dataset to fine-tune a RoBERTa text classifier
  4. upload the trained model to your repo at gitlab.memri.io so that it can be used in a plugin*. To do this, you are required to create a personal project on gitlab.memri.io for your plugin.

At the end of this process, your model should be accessible via the package registry and you can use it as a plugin.
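Before a text classifier can be fine-tuned, the labelled messages are typically converted to integer class ids and split into training and validation sets. The sketch below shows that preparation step in plain Python. It is not the code from the Colab template: the label order in `LABELS`, the helper names, and the sample rows are assumptions made for illustration.

```python
import random
from typing import Dict, List, Tuple

# Assumed label order; it must match whatever order the training code uses.
LABELS = ["negative", "neutral", "positive"]
LABEL2ID: Dict[str, int] = {label: i for i, label in enumerate(LABELS)}

Example = Tuple[str, int]  # (message text, label id)


def encode(rows: List[Tuple[str, str]]) -> List[Example]:
    """Replace each string label with its integer id."""
    return [(text, LABEL2ID[label]) for text, label in rows]


def train_val_split(
    examples: List[Example], val_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[Example], List[Example]]:
    """Shuffle reproducibly and hold out a fraction for validation."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Hypothetical labelled rows.
rows = [
    ("this is great", "positive"),
    ("this is awful", "negative"),
    ("see you at five", "neutral"),
    ("c'est incroyable", "positive"),
    ("c'est horrible", "negative"),
]
train, val = train_val_split(encode(rows))
print(len(train), "training examples,", len(val), "validation example(s)")
```

Holding out a validation set is what lets you check that accuracy improves on messages the model has not seen during training.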

Google Colab is an interactive environment, so after following the instructions for each code cell (snippet), remember to execute the cell: either click it and press the play button to the left of the code, or press ‘Command/Ctrl+Enter’.

The first time you upload a model to GitLab, you need to create an access token (access_token). Click the blue ‘Create personal access token’ button, then paste the token into the last step of the Google Colab template.

Deploy your data app

You have created your model and made it accessible via the package registry. In this section you will learn how to use this model from within your data app.

For the next steps we assume that you have Python installed and know how to manage Python environments. If you don’t, please start with a basic Python setup guide.

Create a plugin from template

The backend of your data app is a Memri plugin, written in Python. Memri has a Python library that will help you build plugins: pymemri.

First things first, let’s install pymemri:

pip install pymemri

The Pymemri template module offers a way to create a project from a template, with things like setup, testing and CI preconfigured.

On the Memri Gitlab*, create a blank public repository** for your plugin and clone the repo. Note that gitlab.memri.io is a self-hosted GitLab, so this won’t work with an account from gitlab.com.

From within the new repo, run the pymemri plugin template CLI.

plugin_from_template --template="classifier_plugin"  --description="A transformer based sentiment analysis plugin" \
                     --install_requires=transformers,sentencepiece,protobuf,torch==1.10.0
pip install -e .

The plugin_from_template call creates the following folder structure:

├── setup.cfg                           
├── setup.py                            
├── Dockerfile                          <- Dockerfile for your plugin, which builds the plugin docker image
├── metadata.json                       <- Metadata for your plugin, the Pymemri frontend uses this during installation
├── .gitignore                          
├── .gitlab-ci.yml                      <- CI/CD for your plugin, which 1) installs your plugin and pod 2) runs tests 3) deploys your plugin
├── sentiment_plugin                    <- Source code for your plugin
│   ├── model.py                        <- Model definition of your classifier
│   ├── plugin.py                       <- Plugin class
│   ├── schema.py                       <- The schema definition for your plugin
│   └── utils.py                        <- Utility functions for plugins: converting items, converting photos, etc.
├── tests                               <- Tests for your plugin 
│   └── test_plugin.py
├── tools                               
│    └── preload.py                     <- You can define logic here that downloads models and assets required to run your plugin

The resulting plugin in sentiment_plugin/plugin.py is the entrypoint of your project. The Plugin.run method is called when the pod runs your plugin.

The model used by this plugin is defined in model.py, which we will edit in the next step to define our sentiment classifier.

Load a pretrained Transformer pipeline from Hugging Face

We are building a sentiment analysis plugin that could be used by people in different countries whose data is in different languages. In this guide we are not training the model from scratch. Instead, we will use a pretrained multilingual RoBERTa model (twitter-xlm-roberta-base-sentiment).

The model card contains all the code we need to build a functioning sentiment analysis plugin. We are slightly modifying the standard example to deal with messages that are longer than the default max length. We insert the following code into the model template in sentiment_plugin/model.py:

from typing import List, Any, Optional

# Add this line
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

class Model:
    def __init__(self, name: str = "cardiffnlp/twitter-xlm-roberta-base-sentiment", version: Optional[str] = None):
        self.name = name
        self.version = version

        # Add these lines: load the model and a tokenizer capped at 512 tokens,
        # then build a pipeline that truncates messages longer than that.
        model = AutoModelForSequenceClassification.from_pretrained(self.name)
        tokenizer = AutoTokenizer.from_pretrained(self.name, model_max_length=512)
        self.pipeline = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer, return_all_scores=True, truncation=True)

    def predict(self, x: List[str]) -> List[dict]:
        # Add this line
        return self.pipeline(x)
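With return_all_scores=True, the pipeline is expected to return, for every input message, a list with one {"label", "score"} entry per class rather than a single prediction. If you only need the most likely class, it can be picked out as sketched below. The `top_label` helper is hypothetical, and the label names and scores in `raw_output` are made-up values for illustration, not real model output.

```python
from typing import Dict, List, Union

ScoreEntry = Dict[str, Union[str, float]]


def top_label(scores: List[ScoreEntry]) -> str:
    """Return the label with the highest score for one message."""
    best = max(scores, key=lambda entry: float(entry["score"]))
    return str(best["label"])


# Illustrative pipeline output for two messages (values are invented).
raw_output: List[List[ScoreEntry]] = [
    [
        {"label": "negative", "score": 0.02},
        {"label": "neutral", "score": 0.08},
        {"label": "positive", "score": 0.90},
    ],
    [
        {"label": "negative", "score": 0.85},
        {"label": "neutral", "score": 0.10},
        {"label": "positive", "score": 0.05},
    ],
]
labels = [top_label(scores) for scores in raw_output]
print(labels)
```

Keeping all scores is still useful when you want to apply a confidence threshold before writing a prediction back.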

Test your plugin locally

To test your plugin, it needs to

  1. get data from the pod
  2. make predictions on that data
  3. write data back to the pod.

Step 3 is handled by the template, and we just implemented step 2. We need to manually define step 1: getting data from the pod.
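The three steps can be pictured with in-memory stand-ins before wiring them to a real pod. Everything in this sketch is hypothetical: `InMemoryPod` is not the pymemri PodClient, and `keyword_model` is a toy classifier that exists only so the example runs without downloading a model.

```python
from typing import Dict, List


class InMemoryPod:
    """Hypothetical stand-in for a pod; NOT the pymemri PodClient."""

    def __init__(self, messages: List[str]) -> None:
        self.messages = messages
        self.predictions: Dict[str, str] = {}

    def get_messages(self) -> List[str]:
        return list(self.messages)

    def write_predictions(self, predictions: Dict[str, str]) -> None:
        self.predictions.update(predictions)


def keyword_model(texts: List[str]) -> List[str]:
    """Toy classifier: flags a message as negative if it has a known bad word."""
    negative_words = {"awful", "horrible"}
    return [
        "negative" if negative_words & set(text.lower().split()) else "positive"
        for text in texts
    ]


def run(pod: InMemoryPod) -> None:
    texts = pod.get_messages()                        # 1. get data from the pod
    labels = keyword_model(texts)                     # 2. make predictions
    pod.write_predictions(dict(zip(texts, labels)))   # 3. write data back


pod = InMemoryPod(["this is great", "this is awful"])
run(pod)
print(pod.predictions)
```

A test for the real plugin follows the same shape: put known messages in, run the plugin, and assert on what was written back.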

In tests/test_plugin.py, a simple pytest setup is defined. To implement step 1, we write the create_dummy_data function in that file; it adds data for our tests to the Pod and returns a query that retrieves that data. The test then runs the plugin on this data and verifies the output.

def create_dummy_data(client: PodClient) -> dict:
    client.add_to_schema(Message)
    
    # Add multilingual data
    sample_data = [
        "this is great",
        "this is awful",
        "c'est incroyable",
        "c'est horrible"
    ]

    client.bulk_action(
        create_items = [Message(content=sample, service="sentiment_test") for sample in sample_data]
    )
    
    # Query to retrieve dummy data
    search_query = {"type": "Message", "service": "sentiment_test"}
    return search_query

Next, start a local pod. You can either check the pod readme, or close your eyes and run this one-liner (assuming you have Docker installed):

docker run --rm --init -p 3030:3030 --name pod -v /var/run/docker.sock:/var/run/docker.sock --entrypoint /pod gitlab.memri.io:5050/memri/pod:dev-latest --owners=ANY --insecure-non-tls=0.0.0.0 --plugins-callback-address=http://pod:3030
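The container can take a moment to start, and tests will fail if they run before the pod accepts connections. As a small optional check, the sketch below tests whether anything is listening on port 3030, the port published by the command above. The `pod_is_reachable` helper is hypothetical and only checks that a TCP connection can be opened; it does not verify that the service behind the port is actually a pod.

```python
import socket


def pod_is_reachable(
    host: str = "localhost", port: int = 3030, timeout: float = 1.0
) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


if __name__ == "__main__":
    print("pod reachable:", pod_is_reachable())
```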

With a Pod running, you can now run your tests using:

pytest

Push the plugin to gitlab

To be able to use your plugin on the data saved in your POD, you have to publish a docker container with your plugin to the GitLab container registry of your repo.

Using the plugin template, publishing is just a matter of pushing your code to your repo in the dev or prod branch. Let’s try that:

git add -A
git commit -m "publish v0.1"
git push

It will take a few minutes before the pipeline completes and the image shows up in your container registry. You can follow your CI pipeline and its progress in your repo under CI/CD -> Pipelines; for an example, see this.

Configure your plugin

To install the plugin in the frontend, we need to link the docker image from our container registry to the frontend. Before we can do this, we need a file called config.json. You can create the config file using the pymemri CLI:

create_plugin_config

The resulting config.json describes the arguments declared by your Plugin class in sentiment_plugin/plugin.py. These let you, for instance, change the model or the language settings in some cases. It looks roughly like this:

[
  {
    "name": "model_name",
    "display": "Model Name",
    "data_type": "Text",
    "type": "textbox",
    "default": "cardiffnlp/twitter-xlm-roberta-base-sentiment",
    "optional": true
  },
  {
    "name": "model_version",
    "display": "Model Version",
    "data_type": "Text",
    "type": "textbox",
    "default": "0.1",
    "optional": true
  },
  ...
]
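To see how such a file maps to plugin arguments, the sketch below reads a config of this shape and collects the default value of each argument. It uses only the two entries shown above, embedded as a string. The `default_arguments` helper is hypothetical and is not part of pymemri.

```python
import json
from typing import Any, Dict

# The two entries shown above; further entries in a real file are omitted here.
CONFIG_JSON = """
[
  {"name": "model_name", "display": "Model Name", "data_type": "Text",
   "type": "textbox", "default": "cardiffnlp/twitter-xlm-roberta-base-sentiment",
   "optional": true},
  {"name": "model_version", "display": "Model Version", "data_type": "Text",
   "type": "textbox", "default": "0.1", "optional": true}
]
"""


def default_arguments(config_text: str) -> Dict[str, Any]:
    """Map each argument name to its default value (None if absent)."""
    entries = json.loads(config_text)
    return {entry["name"]: entry.get("default") for entry in entries}


defaults = default_arguments(CONFIG_JSON)
print(defaults)
```

Each `name` here is expected to correspond to an argument of your Plugin class, so renaming an argument in plugin.py means regenerating the config.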

Register the plugin in the Memri app

Equipped with our config, we can now register the plugin in the frontend. First push the new changes to git; then simply enter the URL of your repo in the Memri app (e.g. https://gitlab.memri.io/eelcovdw/sentiment_plugin).

Add a UI to your app

In the last step of this tutorial, you will add a user interface to your app. You can do this from within the Memri app using an embedded Ace editor (no recompilation is needed).

Memri comes with the CVU (pronounced "c view") language, which lets you control how you view and use your information. CVU stands for Cascading Views. Writing CVU feels like modern UI programming for mobile or desktop.

Building blocks

You can compose your UI using standard building blocks like VStacks, HStacks, Text and Buttons, similar to how SwiftUI and Flutter do this.

CVU Documentation

If you want to learn more about how to build apps using CVU, check out the CVU documentation.