Active learning functionality
This module contains the utilities needed to do active learning:
- Dataset management
- Active loop implementation
Baal takes care of the dataset split between labelled and unlabelled examples. It also takes care of the active learning loop:
- Predict on the unlabelled examples.
- Label the most uncertain examples.
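The two steps above can be sketched without Baal at all. The following is a minimal, standard-library illustration, not Baal's implementation: `entropy` and `active_step` are hypothetical names, and the toy probabilities stand in for a model's predictions on the pool.

```python
import math


def entropy(probs):
    """Shannon entropy of one predictive distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def active_step(pool_probs, labelled, query_size):
    """Move the `query_size` most uncertain pool items into `labelled`.

    pool_probs: dict mapping an unlabelled example id to its class probabilities.
    labelled: set of example ids that already have labels (updated in place).
    Returns the ids that were selected, most uncertain first.
    """
    ranked = sorted(pool_probs, key=lambda i: entropy(pool_probs[i]), reverse=True)
    selected = ranked[:query_size]
    labelled.update(selected)
    return selected


# Toy predictions for three unlabelled examples (assumed values).
pool_probs = {
    "a": [0.98, 0.01, 0.01],  # confident
    "b": [0.34, 0.33, 0.33],  # nearly uniform, most uncertain
    "c": [0.70, 0.20, 0.10],
}
labelled = set()
picked = active_step(pool_probs, labelled, query_size=2)
print(picked)
```

Entropy is only one possible uncertainty score; it is used here because it needs nothing beyond the predicted class probabilities.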
Example
from baal.active.dataset import ActiveLearningDataset
al_dataset = ActiveLearningDataset(your_dataset)
# To start, we can select 1000 random examples to be labelled
al_dataset.label_randomly(1000)
# Our training set now contains 1000 examples.
len(al_dataset)
# We can label examples by their indices.
al_dataset.label([32, 10, 4])
# Our dataset length is now 1003.
len(al_dataset)
# At initialization, we can also swap attributes for the pool.
al_dataset = ActiveLearningDataset(your_dataset, pool_specifics={"transform": None})
assert al_dataset.pool.transform is None
API
baal.active.ActiveLearningDataset
Bases: SplittedDataset
A dataset that allows for active learning.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset | Dataset | The baseline dataset. | required |
labelled | Optional[ndarray] | An array that acts as a mask, with a value greater than or equal to 1 for every data point that is labelled and 0 for every data point that is not labelled. | None |
make_unlabelled | Callable | The function that returns an unlabelled version of a datum so that it can still be used in the DataLoader. | _identity |
random_state | | Set the random seed for label_randomly(). | None |
pool_specifics | Optional[dict] | Attributes to set when creating the pool. Useful to remove data augmentation. | None |
last_active_steps | int | If specified, will iterate over the last active steps instead of the full dataset. Useful when doing partial finetuning. | -1 |
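To make the `labelled` mask and `last_active_steps` more concrete, here is a small sketch in plain Python. It assumes, for illustration only, that each labelled entry records the active step at which it was labelled; `indices_to_iterate` is a hypothetical helper, not part of Baal's API.

```python
def indices_to_iterate(labelled_map, current_step, last_active_steps=-1):
    """Return the dataset indices to iterate over.

    labelled_map: list of ints, 0 for unlabelled, otherwise the (1-based)
        active step at which the item was labelled (an assumption of this sketch).
    current_step: number of the most recent active step.
    last_active_steps: -1 keeps every labelled item; a positive value keeps
        only items labelled during that many most recent steps.
    """
    if last_active_steps == -1:
        min_step = 0
    else:
        min_step = max(0, current_step - last_active_steps)
    return [i for i, step in enumerate(labelled_map) if step > min_step]


# Items 0 and 3 were labelled at step 1, item 4 at step 2, item 1 at step 3.
labelled_map = [1, 3, 0, 1, 2, 0]

everything = indices_to_iterate(labelled_map, current_step=3)
recent = indices_to_iterate(labelled_map, current_step=3, last_active_steps=2)
print(everything)
print(recent)
```

Restricting iteration to recently labelled items is what makes partial finetuning cheaper: the model only revisits the newest labels instead of the whole labelled set.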
Source code in baal/active/dataset/pytorch_dataset.py
pool: ActiveLearningPool (property)
Returns a new Dataset made from unlabelled samples.
ActiveIter
Iterator over an ActiveLearningDataset.
__getitem__(index)
Return items from the original dataset based on the labelled index.
check_dataset_can_label()
Check if a dataset can be labelled.
Returns:
Type | Description |
---|---|
Whether the dataset's label can be modified or not. |
Notes
To be labelled, a dataset needs a method label with definition label(self, idx, value), where value is the label for index idx.
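The requirement in this note can be illustrated with a duck-typing check. This is a sketch, not Baal's implementation: `can_label` and `ToyDataset` are hypothetical names.

```python
class ToyDataset:
    """A dataset whose labels can be modified after creation."""

    def __init__(self, items):
        self.items = list(items)
        self.labels = [None] * len(self.items)

    def label(self, idx, value):
        """Assign `value` as the label of the item at index `idx`."""
        self.labels[idx] = value


def can_label(dataset):
    """True if `dataset` exposes a callable `label` attribute."""
    return callable(getattr(dataset, "label", None))


toy = ToyDataset(["x0", "x1"])
toy.label(1, "cat")
print(can_label(toy), can_label(["x0", "x1"]))
```

A plain list has no `label` method, so it fails the check, while any object that defines `label(self, idx, value)` passes.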
get_raw(idx)
label(index, value=None)
Label data points. The index should be relative to the pool, not the overall data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
index | Union[list, int] | One or many indices to label. | required |
value | Optional[Any] | The label value. If not provided, no modification to the underlying dataset is done. | None |
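Because `index` is relative to the pool, it has to be translated into a position in the full dataset before the mask is updated. The sketch below shows that translation with plain lists; `pool_to_dataset_index` is a hypothetical helper used only for illustration.

```python
def pool_to_dataset_index(labelled_mask, pool_index):
    """Map a pool-relative index to an index in the full dataset.

    labelled_mask: list of bools, True where the item is already labelled.
    pool_index: position within the unlabelled items only.
    """
    unlabelled_positions = [i for i, done in enumerate(labelled_mask) if not done]
    return unlabelled_positions[pool_index]


# Items 0 and 2 are labelled, so the pool holds dataset items 1, 3 and 4.
mask = [True, False, True, False, False]

target = pool_to_dataset_index(mask, 1)  # second item of the pool
mask[target] = True
print(target, mask)
```

In this sketch the remaining pool indices shift once an item is marked, so a batch of pool indices should be resolved against the same mask before any of them is marked as labelled.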
load_state_dict(state_dict)
Load the labelled map and random_state from the given state_dict.
reset_labelled()
state_dict()
baal.active.ActiveLearningLoop
Object that performs the active learning iteration.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset | ActiveLearningDataset | Dataset with some samples already labelled. | required |
get_probabilities | Function | Dataset -> **kwargs -> ndarray [n_samples, n_outputs, n_iterations]. | required |
heuristic | Heuristic | Heuristic from baal.active.heuristics. | Random() |
query_size | int | Number of samples to label per step. | 1 |
max_sample | int | Limit the number of samples used (-1 is no limit). | -1 |
uncertainty_folder | Optional[str] | If provided, will store uncertainties on disk. | None |
ndata_to_label | int | DEPRECATED, please use | None |
**kwargs | | Parameters forwarded to | {} |
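The shape contract for `get_probabilities` can be illustrated with nested lists standing in for an ndarray. This sketch assumes the last axis holds repeated stochastic forward passes; `mean_over_iterations` and `rank_by_entropy` are hypothetical helpers, not Baal heuristics, and `max_sample` is simplified to a plain truncation.

```python
import math


def mean_over_iterations(probs):
    """Average [n_samples][n_outputs][n_iterations] over the last axis."""
    return [[sum(iters) / len(iters) for iters in sample] for sample in probs]


def rank_by_entropy(probs, query_size=1, max_sample=-1):
    """Return indices of the `query_size` most uncertain samples.

    max_sample limits how many samples are considered (-1 means no limit);
    this sketch simply keeps the first `max_sample` entries.
    """
    if max_sample != -1:
        probs = probs[:max_sample]
    means = mean_over_iterations(probs)
    scores = [-sum(p * math.log(p) for p in dist if p > 0) for dist in means]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:query_size]


# 3 samples, 2 outputs, 2 iterations (toy values).
probs = [
    [[0.9, 0.8], [0.1, 0.2]],  # confident in class 0
    [[0.6, 0.4], [0.4, 0.6]],  # iterations disagree, mean is 0.5 / 0.5
    [[0.7, 0.7], [0.3, 0.3]],
]
top_two = rank_by_entropy(probs, query_size=2)
print(top_two)
```

Averaging before scoring is one design choice among several; scores that compare the iterations with each other would use the last axis directly instead of collapsing it.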
Source code in baal/active/active_loop.py
step(pool=None)
Perform an active learning step.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pool | iterable | Optional dataset pool indices. If not set, will use pool from the active set. | None |
Returns:
Type | Description |
---|---|
bool | Flag indicating whether to continue training. |
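A typical way to use the returned flag is to drive an outer loop that alternates training and querying. The toy class below mimics that control flow only: `ToyLoop` is hypothetical, and the assumption that the flag becomes false once the pool is empty is made for illustration.

```python
class ToyLoop:
    """Stand-in for an active learning loop over a pool of item ids."""

    def __init__(self, pool, query_size=1):
        self.pool = list(pool)
        self.labelled = []
        self.query_size = query_size

    def step(self):
        """Label up to `query_size` items; return False when nothing is left."""
        if not self.pool:
            return False
        batch = self.pool[: self.query_size]
        self.pool = self.pool[self.query_size :]
        self.labelled.extend(batch)
        return True


loop = ToyLoop(pool=range(5), query_size=2)
rounds = 0
while loop.step():
    rounds += 1  # in a real setting, the model would be retrained here
print(rounds, loop.labelled)
```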
baal.active.FileDataset
Bases: Dataset
Dataset object that loads the files and applies a transformation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
files | List[str] | The files. | required |
lbls | List[Any] | The labels; -1 indicates that the label is unknown. | None |
transform | Optional[Callable] | torchvision.transforms pipeline. | None |
target_transform | Optional[Callable] | Function that modifies the target. | None |
image_load_fn | Optional[Callable] | Function that loads the image, by default uses PIL. | None |
seed | Optional[int] | Will set a seed before and between data augmentation (DA) steps. | None |
Source code in baal/active/file_dataset.py
label(idx, lbl)
Label the sample idx with lbl.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
idx | int | The sample index. | required |
lbl | Any | The label to assign. | required |
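The `-1` convention for unknown labels and the `label(idx, lbl)` method can be sketched as follows. `ToyFileDataset` is a hypothetical stand-in that stores file names only and never opens them; it is not `FileDataset` itself, and the default of all-unknown labels is an assumption of this sketch.

```python
class ToyFileDataset:
    """Keeps file paths alongside labels, with -1 marking an unknown label."""

    def __init__(self, files, lbls=None):
        self.files = list(files)
        # Assumption: when no labels are given, every file starts as unknown.
        self.lbls = list(lbls) if lbls is not None else [-1] * len(self.files)

    def label(self, idx, lbl):
        """Assign label `lbl` to the sample at index `idx`."""
        self.lbls[idx] = lbl

    def unknown(self):
        """Indices whose label is still unknown."""
        return [i for i, lbl in enumerate(self.lbls) if lbl == -1]


ds = ToyFileDataset(["a.png", "b.png", "c.png"])
ds.label(1, 0)
print(ds.lbls, ds.unknown())
```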