Text Classification

By: Frédéric Branchaud-Charron (@Dref360)

In this tutorial, we will see how to use Baal inside of Label Studio, a widely known labelling tool.

By using Bayesian active learning in your labelling setup, you will be able to label only the most informative examples and avoid duplicates and easy examples.

This is also a good way to start the conversation between your labelling team and your machine learning team as they need to communicate early in the process!

We will built upon Label Studio's Text classification example, so be sure to download it and try to run it before adding Baal to it. The full example can be found here.

More info:

Support:

Github
Slack

Installing Baal

To install Baal, you will need to add baal in the generated Dockerfile .

# Dockerfile
RUN pip install --no-cache \
                -r requirements.txt \
                uwsgi==2.0.19.1 \
                supervisor==4.2.2 \
                label-studio==1.0.2 \
                baal \
                click==7.1.2 \
                git+https://github.com/heartexlabs/label-studio-ml-backend

and when developing, you should install Baal in your local environment.

pip install baal[nlp]

Modifying `simple_text_classifier.py`

The overall changes are pretty minor, so we will go step by step, specifying the class and method we are modifying. Again, the full script is available here.

Model

The simplest way of doing Bayesian uncertainty estimation in active learning is MC-Dropout (Gal and Ghahramani, 2015) which requires Dropout layers. Fortunately, HuggingFace models come with one Dropout layer, but feel free to add more!

from baal.bayesian.dropout import patch_module

# SimpleTextClassifier
def reset_model(self):
    BASE_MODEL = 'distilbert-base-uncased'
    use_cuda = torch.cuda.is_available()

    # Load model using distilbert as base.
    self.model = AutoModelForSequenceClassification.from_pretrained(pretrained_model_name_or_path=BASE_MODEL,
                                                                    num_labels=self.num_classes)
    self.tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=BASE_MODEL)
    self.model = patch_module(self.model)
    if use_cuda:
        self.model.cuda()

    # Use BaalTransformerTrainer to replace HF Trainer.
    self.trainer = BaalTransformersTrainer(model=self.model)

Create dataset

We modify the function make_dataset to be fed to HuggingFace:

from baal.active.dataset.nlp_datasets import HuggingFaceDatasets
from datasets import Dataset

def make_dataset(self, texts, labels):
    dataset = Dataset.from_dict({
        'text': texts,
        'label': labels
    })
    return HuggingFaceDatasets(dataset, self.tokenizer, target_key='label', input_key='text', max_seq_len=128, )

Training loop

We can simplify the training loop by using HuggingFace. There is a lot of data manipulation that needs to be done, but the actual training can be done as such:

# SimpleTextClassifier
def train(self, annotations, num_epochs=5):
    ...
    # train the model
    print(f'Start training on {len(input_texts)} samples')
    self.reset_model()
    self.trainer.train_dataset = self.make_dataset(input_texts, output_labels_idx)
    self.trainer.train()
    ...

Prediction

We draw multiple predictions from the model's parameter distribution using MC-Dropout. In this script we will make 20 predictions per example. Next, we use BALD (Houlsby et al, 2013) to estimate the epistemic uncertainty of each item.

Finally, we notify Label Studio to prioritise uncertain items by adding a score field to the response.

# SimpleTextClassifier
NUM_DRAWS = 20
def predict(self, tasks, **kwargs):
    # collect input texts
    input_texts = []
    for task in tasks:
        input_text = task['data'].get(self.value) or task['data'].get(DATA_UNDEFINED_NAME)
        input_texts.append(input_text)
    dataset = self.make_dataset(input_texts, [0] * len(input_texts))

    # get model predictions
    probabilities = self.trainer.predict_on_dataset(dataset, NUM_DRAWS)
    uncertainties = BALD().get_uncertainties(probabilities).tolist()
    predictions = probabilities.mean(-1)
    predicted_label_indices = np.argmax(predictions, axis=1).tolist()
    predictions = []
    for idx, score in zip(predicted_label_indices, uncertainties):
        predicted_label = self.labels[idx]
        # prediction result for the single task
        result = [{
            'from_name': self.from_name,
            'to_name': self.to_name,
            'type': 'choices',
            'value': {'choices': [predicted_label]}
        }]

        # expand predictions with their scores for all tasks
        predictions.append({'result': result, 'score': score})

    return predictions

Launching LabelStudio

Following Label Studio tutorial, you can start your ML Backend as usual:

Environment:

export LABEL_STUDIO_HOSTNAME=http://localhost:8080
export LABEL_STUDIO_ML_BACKEND_V2=True
export API_KEY=${YOUR_API_KEY}

Dependencies:

pip install baal[nlp]

How to:

Run label-studio-ml init my_ml_backend --script label_studio_baal_hf.py --force
Run label-studio-ml start my_ml_backend
Run label-studio start my-annotation-project --init --ml-backend http://localhost:9090

In the Settings, do not forget to checkbox all boxes:

and to use active learning, order by Predictions score:

Results!

To test our methodology, we used the same parameters to perform active learning on CLINC-OOS.

In Kirsch et al. 2022, we compare Entropy to Uniform sampling on this dataset:

Note that for this particular dataset, BALD is not recommended.

In conlusion, we can now use Bayesian active learning in Label Studio which would help your labelling process be more efficient. Please do not hesitate to reach out on our Slack or on Label Studio's Slack if you have feedback or questions.