How to use BaaL with Scikit-Learn models

In this tutorial, you will learn how to use BaaL with a scikit-learn model. In this case, we will use RandomForestClassifier.

This tutorial is based on the tutorial from Saimadhu Polamuri.

First, if you have not done so already, let’s install BaaL.

pip install baal
[12]:
%load_ext autoreload
%autoreload 2
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
HEADERS = ["CodeNumber", "ClumpThickness", "UniformityCellSize", "UniformityCellShape", "MarginalAdhesion",
           "SingleEpithelialCellSize", "BareNuclei", "BlandChromatin", "NormalNucleoli", "Mitoses", "CancerType"]

import pandas as pd
data = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'
# The UCI file has no header row, so pass the column names directly.
dataset = pd.read_csv(data, names=HEADERS)

# Drop rows with missing values ('?') in the BareNuclei feature.
dataset = dataset[dataset[HEADERS[6]] != '?']


# Split
train_x, test_x, train_y, test_y = train_test_split(dataset[HEADERS[1:-1]], dataset[HEADERS[-1]],
                                                    train_size=0.7)


clf = RandomForestClassifier()
clf.fit(train_x, train_y)

# Get metrics
predictions = clf.predict(test_x)
print("Train Accuracy :: ", accuracy_score(train_y, clf.predict(train_x)))
print("Test Accuracy  :: ", accuracy_score(test_y, predictions))
print(" Confusion matrix ", confusion_matrix(test_y, predictions))

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Train Accuracy ::  1.0
Test Accuracy  ::  0.9658536585365853
 Confusion matrix  [[119   3]
 [  4  79]]
/home/fred/miniconda3/envs/pytorch/lib/python3.7/site-packages/sklearn/model_selection/_split.py:2179: FutureWarning: From version 0.21, test_size will always complement train_size unless both are specified.
  FutureWarning)
/home/fred/miniconda3/envs/pytorch/lib/python3.7/site-packages/sklearn/ensemble/forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)

Now that you have a trained model, you can use it to perform uncertainty estimation. The scikit-learn API directly exposes RandomForestClassifier.predict_proba, which returns the mean predicted probabilities across the trees of the RandomForest.

But if you wish to try one of our heuristics from baal.active.heuristics, here’s how.

[13]:
import numpy as np
from baal.active.heuristics import BALD
print(f"Using {len(clf.estimators_)} estimators")

# Predict independently for all estimators.
x = np.array(list(map(lambda e: e.predict_proba(test_x), clf.estimators_)))
# Roll axis because BaaL expects [n_samples, n_classes, ..., n_estimations]
x = np.rollaxis(x, 0, 3)
print("Uncertainty per sample")
print(BALD().compute_score(x))

print("Ranks")
print(BALD()(x))

Using 10 estimators
Uncertainty per sample
[0.         0.         0.         0.         0.         0.
 0.32508297 0.         0.         0.32508297 0.         0.32508297
 0.         0.         0.         0.         0.32508297 0.
 0.         0.         0.         0.         0.         0.50040242
 0.         0.         0.32508297 0.         0.32508297 0.
 0.         0.         0.32508297 0.         0.         0.32508297
 0.         0.         0.         0.         0.         0.
 0.         0.50040242 0.         0.69314718 0.         0.
 0.         0.32508297 0.         0.6108643  0.         0.32508297
 0.         0.         0.         0.         0.         0.
 0.         0.         0.32508297 0.         0.         0.
 0.         0.32508297 0.         0.         0.         0.50040242
 0.         0.6108643  0.         0.         0.         0.
 0.         0.32508297 0.         0.         0.         0.
 0.         0.50040242 0.6108643  0.         0.         0.50040242
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.69314718 0.         0.         0.67301167 0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.32508297 0.         0.32508297 0.50040242 0.50040242
 0.         0.         0.         0.         0.         0.67301167
 0.         0.         0.         0.         0.         0.6108643
 0.         0.32508297 0.         0.         0.         0.32508297
 0.         0.         0.         0.         0.         0.
 0.         0.         0.6108643  0.         0.         0.
 0.         0.         0.         0.67301167 0.         0.
 0.         0.         0.         0.         0.         0.6108643
 0.32508297 0.         0.         0.         0.         0.
 0.         0.32508297 0.         0.         0.32508297 0.
 0.         0.         0.         0.         0.         0.32508297
 0.         0.         0.         0.         0.         0.
 0.         0.         0.6108643  0.32508297 0.67301167 0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.        ]
Ranks
[102  45 105 125 153 190 146  73 161 188  51 131  86  89  23 118 119  71
  43  85  67  53  35  62 162 172  32 117 115 133 169  49  28  26  79   6
   9  11 189  16 137 179  77  70  58  90  88  59  87  72  84  60  74  61
  83  63  78  82  64  75  57  65  66  80  68  69  76  81  46  56  13  22
  21  20  19  18  17  15  14  12  25  10   8   7   5   4   3   2   1  24
  27  55  41  54  52  50  48  47  92  44  42  40  29  39  38  37  36  34
  33  31  30  91 204  93 164 175 174 173 171 170 168 167 166 165 163  94
 160 159 158 157 156 155 154 152 151 176 177 178 180 202 201 200 199 198
 197 196 195 194 193 192 191 187 186 185 184 183 182 181 150 149 148 120
 114 113 112 111 110 109 108 107 106 104 103 203 101 100  99  98  97  96
  95 116 121 147 122 145 144 143 142 141 140 139 138 136 135 134 132 130
 129 128 127 126 124 123   0]
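For reference, BALD scores the mutual information between the predictions and the model posterior: the entropy of the averaged prediction minus the average entropy of the individual estimators' predictions. A minimal NumPy sketch (`bald_score` is an illustrative helper, not part of BaaL) reproduces the extremes visible above: 0 when all trees agree, and log(2) ≈ 0.693 when they split evenly between two classes.

```python
import numpy as np

def bald_score(probs):
    """probs: array of shape [n_samples, n_classes, n_estimations]."""
    eps = 1e-12
    mean = probs.mean(axis=-1)                                       # [n_samples, n_classes]
    entropy_of_mean = -(mean * np.log(mean + eps)).sum(axis=-1)      # predictive entropy
    mean_entropy = -(probs * np.log(probs + eps)).sum(axis=1).mean(axis=-1)  # expected entropy
    return entropy_of_mean - mean_entropy

# Two estimators, two classes, one sample each.
agree = np.array([[[1.0, 1.0], [0.0, 0.0]]])     # both trees predict class 0
disagree = np.array([[[1.0, 0.0], [0.0, 1.0]]])  # trees split evenly
print(bald_score(agree))     # ≈ 0
print(bald_score(disagree))  # ≈ log(2) ≈ 0.693
```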

Active learning with Scikit-Learn

You can also try active learning by using ActiveNumpyArray.

NOTE: Because we focus on images, we have not run experiments on this setup.

[14]:
from baal.active.dataset import ActiveNumpyArray
dataset = ActiveNumpyArray((train_x, train_y))

# We start with 10 labelled samples.
dataset.label_randomly(10)

heuristic = BALD()

# We will use a RandomForest in this case.
clf = RandomForestClassifier()

def predict(test, clf):
    # Predict with all fitted estimators.
    x = np.array(list(map(lambda e: e.predict_proba(test[0]), clf.estimators_)))

    # Roll axis because BaaL expects [n_samples, n_classes, ..., n_estimations]
    x = np.rollaxis(x, 0, 3)
    return x

# Number of samples to label at each step.
ndata_to_label = 10

for _ in range(5):
    print("Dataset size", len(dataset))
    clf.fit(*dataset.dataset)
    predictions = clf.predict(test_x)
    print("Test Accuracy  :: ", accuracy_score(test_y, predictions))
    probs = predict(dataset.pool, clf)
    to_label = heuristic(probs)
    if len(to_label) > 0:
        dataset.label(to_label[:ndata_to_label])
    else:
        break
Dataset size 10
Test Accuracy  ::  0.9219512195121952
Dataset size 20
Test Accuracy  ::  0.9658536585365853
Dataset size 30
Test Accuracy  ::  0.9414634146341463
Dataset size 40
Test Accuracy  ::  0.9512195121951219
Dataset size 50
Test Accuracy  ::  0.9609756097560975
/home/fred/miniconda3/envs/pytorch/lib/python3.7/site-packages/sklearn/ensemble/forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)