Scikit-Learn: Active learning with Random Forest¶
In this tutorial, you will learn how to use Baal on a scikit-learn model.
In this case, we will use RandomForestClassifier.
This tutorial is based on the tutorial from Saimadhu Polamuri.
First, if you have not done it yet, let's install Baal.
pip install baal
%load_ext autoreload
%autoreload 2
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
HEADERS = ["CodeNumber", "ClumpThickness", "UniformityCellSize", "UniformityCellShape", "MarginalAdhesion",
"SingleEpithelialCellSize", "BareNuclei", "BlandChromatin", "NormalNucleoli", "Mitoses", "CancerType"]
import pandas as pd
data = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'
# The file has no header row, so pass the column names explicitly
# (otherwise the first data row is silently consumed as the header).
dataset = pd.read_csv(data, header=None, names=HEADERS)
# Drop rows where BareNuclei is missing (encoded as '?').
dataset = dataset[dataset[HEADERS[6]] != '?']
# Split
train_x, test_x, train_y, test_y = train_test_split(dataset[HEADERS[1:-1]], dataset[HEADERS[-1]],
train_size=0.7)
clf = RandomForestClassifier()
clf.fit(train_x, train_y)
# Get metrics
predictions = clf.predict(test_x)
print("Train Accuracy :: ", accuracy_score(train_y, clf.predict(train_x)))
print("Test Accuracy :: ", accuracy_score(test_y, predictions))
print(" Confusion matrix ", confusion_matrix(test_y, predictions))
The autoreload extension is already loaded. To reload it, use: %reload_ext autoreload
Train Accuracy ::  1.0
Test Accuracy ::  0.9658536585365853
 Confusion matrix  [[119   3]
                    [  4  79]]
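Under the hood, a random forest's `predict_proba` is simply the average of the per-tree probabilities — the per-estimator predictions we exploit in the next section. A quick self-contained check on synthetic data (not part of the original tutorial) illustrates this:

```python
# Sketch: verify that RandomForestClassifier.predict_proba averages the
# per-tree probabilities. Synthetic data, so the snippet runs on its own.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Shape: [n_estimators, n_samples, n_classes]
per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
assert np.allclose(per_tree.mean(axis=0), forest.predict_proba(X))
```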
Now that you have a trained model, you can use it to perform uncertainty estimation. The scikit-learn API directly offers RandomForestClassifier.predict_proba, which returns the mean response across the forest's estimators. But if you wish to try one of our heuristics from baal.active.heuristics, here's how.
import numpy as np
from baal.active.heuristics import BALD
print(f"Using {len(clf.estimators_)} estimators")
# Predict independently for all estimators.
x = np.array(list(map(lambda e: e.predict_proba(test_x), clf.estimators_)))
# Roll axis because Baal expects [n_samples, n_classes, ..., n_estimations]
x = np.rollaxis(x, 0, 3)
print("Uncertainty per sample")
print(BALD().compute_score(x))
print("Ranks")
print(BALD()(x))
Using 10 estimators
Uncertainty per sample
[0. 0. 0. 0. 0. 0. 0.32508297 0. 0. 0.32508297 0. 0.32508297 0. 0. 0. 0. 0.32508297 0. 0. 0. 0. 0. 0. 0.50040242 0. 0. 0.32508297 0. 0.32508297 0. 0. 0. 0.32508297 0. 0. 0.32508297 0. 0. 0. 0. 0. 0. 0. 0.50040242 0. 0.69314718 0. 0. 0. 0.32508297 0. 0.6108643 0. 0.32508297 0. 0. 0. 0. 0. 0. 0. 0. 0.32508297 0. 0. 0. 0. 0.32508297 0. 0. 0. 0.50040242 0. 0.6108643 0. 0. 0. 0. 0. 0.32508297 0. 0. 0. 0. 0. 0.50040242 0.6108643 0. 0. 0.50040242 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.69314718 0. 0. 0.67301167 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.32508297 0. 0.32508297 0.50040242 0.50040242 0. 0. 0. 0. 0. 0.67301167 0. 0. 0. 0. 0. 0.6108643 0. 0.32508297 0. 0. 0. 0.32508297 0. 0. 0. 0. 0. 0. 0. 0. 0.6108643 0. 0. 0. 0. 0. 0. 0.67301167 0. 0. 0. 0. 0. 0. 0. 0.6108643 0.32508297 0. 0. 0. 0. 0. 0. 0.32508297 0. 0. 0.32508297 0. 0. 0. 0. 0. 0. 0.32508297 0. 0. 0. 0. 0. 0. 0. 0. 0.6108643 0.32508297 0.67301167 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
Ranks
[102 45 105 125 153 190 146 73 161 188 51 131 86 89 23 118 119 71 43 85 67 53 35 62 162 172 32 117 115 133 169 49 28 26 79 6 9 11 189 16 137 179 77 70 58 90 88 59 87 72 84 60 74 61 83 63 78 82 64 75 57 65 66 80 68 69 76 81 46 56 13 22 21 20 19 18 17 15 14 12 25 10 8 7 5 4 3 2 1 24 27 55 41 54 52 50 48 47 92 44 42 40 29 39 38 37 36 34 33 31 30 91 204 93 164 175 174 173 171 170 168 167 166 165 163 94 160 159 158 157 156 155 154 152 151 176 177 178 180 202 201 200 199 198 197 196 195 194 193 192 191 187 186 185 184 183 182 181 150 149 148 120 114 113 112 111 110 109 108 107 106 104 103 203 101 100 99 98 97 96 95 116 121 147 122 145 144 143 142 141 140 139 138 136 135 134 132 130 129 128 127 126 124 123 0]
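To see what these scores mean, BALD can be computed by hand: it is the entropy of the mean prediction minus the mean per-estimator entropy, so unanimous estimators score 0 and a 5/5 split scores ln 2 ≈ 0.69314718 — exactly the largest values in the output above. Below is a minimal sketch (`bald_score` is a hypothetical helper, not Baal's API) on a toy `[n_samples, n_classes, n_estimations]` array:

```python
import numpy as np

def bald_score(probs):
    # Hypothetical helper: entropy of the consensus minus mean per-estimator
    # entropy, for probs of shape [n_samples, n_classes, n_estimations].
    eps = 1e-12
    mean_p = probs.mean(axis=-1)                                   # consensus
    entropy_mean = -(mean_p * np.log(mean_p + eps)).sum(-1)        # H(mean)
    mean_entropy = -(probs * np.log(probs + eps)).sum(1).mean(-1)  # mean H
    return entropy_mean - mean_entropy

# Two binary samples, 10 estimators each:
agree = np.tile([[1.0], [0.0]], (1, 10))              # all trees pick class 0
split = np.hstack([np.tile([[1.0], [0.0]], (1, 5)),
                   np.tile([[0.0], [1.0]], (1, 5))])  # trees disagree 5/5
probs = np.stack([agree, split])                      # shape [2, 2, 10]
print(bald_score(probs))  # ≈ [0., 0.6931] — ln(2) for full disagreement
```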
Active learning with SkLearn¶
You can also try active learning by using ActiveNumpyArray.

NOTE: Because Baal focuses on images, we have not experimented extensively with this setup.
from baal.active.dataset import ActiveNumpyArray
dataset = ActiveNumpyArray((train_x, train_y))

# We start with 10 labelled samples.
dataset.label_randomly(10)

heuristic = BALD()

# We will use a RandomForest in this case.
clf = RandomForestClassifier()

def predict(test, clf):
    # Predict with all fitted estimators.
    x = np.array(list(map(lambda e: e.predict_proba(test[0]), clf.estimators_)))
    # Roll axis because Baal expects [n_samples, n_classes, ..., n_estimations]
    x = np.rollaxis(x, 0, 3)
    return x

for _ in range(5):
    print("Dataset size", len(dataset))
    clf.fit(*dataset.dataset)
    predictions = clf.predict(test_x)
    print("Test Accuracy :: ", accuracy_score(test_y, predictions))
    probs = predict(dataset.pool, clf)
    to_label = heuristic(probs)
    query_size = 10
    if len(to_label) > 0:
        dataset.label(to_label[:query_size])
    else:
        break
Dataset size 10
Test Accuracy ::  0.9219512195121952
Dataset size 20
Test Accuracy ::  0.9658536585365853
Dataset size 30
Test Accuracy ::  0.9414634146341463
Dataset size 40
Test Accuracy ::  0.9512195121951219
Dataset size 50
Test Accuracy ::  0.9609756097560975
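To judge whether BALD actually helps, a natural check is a random-acquisition baseline: run the same loop but label random pool items instead of the top-BALD ones. Here is a minimal sketch, independent of Baal and on synthetic data so it is self-contained — the same loop could be run with `ActiveNumpyArray.label_randomly` on the real dataset above:

```python
# Sketch: the pool-based loop with a plain boolean labelled-mask and random
# acquisition, as a baseline against the BALD-driven loop above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7,
                                                    random_state=0)
rng = np.random.default_rng(0)
labelled = np.zeros(len(X_train), dtype=bool)
labelled[rng.choice(len(X_train), size=10, replace=False)] = True

clf = RandomForestClassifier(n_estimators=10, random_state=0)
for _ in range(5):
    clf.fit(X_train[labelled], y_train[labelled])
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"Labelled: {labelled.sum():3d}  Test accuracy: {acc:.3f}")
    # Random acquisition: label 10 more pool items at random.
    pool_idx = np.flatnonzero(~labelled)
    labelled[rng.choice(pool_idx, size=min(10, len(pool_idx)),
                        replace=False)] = True
```

If BALD is doing its job, its accuracy curve should dominate this baseline for the same labelling budget.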