nlpatl
NLPatl is a library for active learning in machine learning experiments. Its goal is to help users build high-quality labeled datasets. It is built on top of transformers, scikit-learn and other machine learning packages. It can be applied to both the cold-start scenario (no labeled data at all) and the limited-labeled-data scenario.
Learning
Model
Classification
Clustering
nlpatl.models.clustering.sklearn_clustering
- class nlpatl.models.clustering.sklearn_clustering.SkLearnClustering(model_name='kmeans', model_config={}, name='sklearn_clustering')[source]
Bases:
nlpatl.models.clustering.clustering.Clustering
A wrapper of the scikit-learn clustering class.
- Parameters
model_name (str) – scikit-learn clustering model name. Possible value is kmeans.
model_config (dict) – Model parameters. Refer to https://scikit-learn.org/stable/modules/classes.html#module-sklearn.cluster
name (str) – Name of this clustering
>>> import nlpatl.models.clustering as nmclu
>>> model = nmclu.SkLearnClustering()
- predict_proba(x, predict_config={})[source]
- Parameters
x (np.ndarray) – Raw features
predict_config (dict) – Model prediction parameters. Refer to https://scikit-learn.org/stable/modules/classes.html#module-sklearn.cluster
- Returns
Feature and probabilities
- Return type
nlpatl.dataset.Dataset
nlpatl.models.clustering.sklearn_extra_clustering
Embeddings
nlpatl.models.embeddings.sentence_transformers
- class nlpatl.models.embeddings.sentence_transformers.SentenceTransformers(model_name_or_path, batch_size=16, name='sentence_transformers')[source]
Bases:
nlpatl.models.embeddings.embeddings.Embeddings
A wrapper of the sentence transformers class.
- Parameters
model_name_or_path (str) – sentence transformers model name.
batch_size (int) – Batch size of data processing. Default is 16
model_config (dict) – Model parameters. Refer to https://www.sbert.net/docs/pretrained_models.html
name (str) – Name of this embeddings
>>> import nlpatl.models.embeddings as nme
>>> model = nme.SentenceTransformers()
nlpatl.models.embeddings.transformers
- class nlpatl.models.embeddings.transformers.Transformers(model_name_or_path, batch_size=16, padding=False, truncation=False, nn_fwk=None, name='transformers')[source]
Bases:
nlpatl.models.embeddings.embeddings.Embeddings
A wrapper of transformers class.
- Parameters
model_name_or_path (str) – transformers model name.
batch_size (int) – Batch size of data processing. Default is 16
padding (bool) – Inputs may not have same size. Set True to pad it. Default is False
truncation (bool) – Inputs may not have same size. Set True to truncate it. Default is False
nn_fwk (str) – Neural network framework. Either pt (for PyTorch) or tf (for TensorFlow)
name (str) – Name of this embeddings
>>> import nlpatl.models.embeddings as nme
>>> model = nme.Transformers()
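The padding and truncation options address inputs of unequal length. The following toy sketch illustrates the two ideas on lists of token ids. It is not the transformers tokenizer: the helper names pad_batch and truncate_batch, the pad id 0 and the token ids are hypothetical and chosen only for illustration.

```python
from typing import List


def pad_batch(batch: List[List[int]], pad_id: int = 0) -> List[List[int]]:
    """Right-pad every sequence to the length of the longest one."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]


def truncate_batch(batch: List[List[int]], max_length: int) -> List[List[int]]:
    """Cut every sequence down to at most max_length items."""
    return [seq[:max_length] for seq in batch]


# Hypothetical token ids for two inputs of different length.
batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]

padded = pad_batch(batch)                        # both rows now have 5 items
truncated = truncate_batch(batch, max_length=4)  # no row is longer than 4 items
```

After padding, every row has the same length, so the batch can be stacked into a single array. Real tokenizers typically also handle special tokens and attention masks, which this sketch leaves out.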
nlpatl.models.embeddings.torchvision
- class nlpatl.models.embeddings.torchvision.TorchVision(model_name_or_path, batch_size=16, model_config={'pretrained': True}, transform=None, name='torchvision')[source]
Bases:
nlpatl.models.embeddings.embeddings.Embeddings
A wrapper of torch vision class.
- Parameters
model_name_or_path (str) – torch vision model name. Possible values are resnet18, alexnet and vgg16.
batch_size (int) – Batch size of data processing. Default is 16
model_config (dict) – Model parameters. Refer to https://pytorch.org/vision/stable/models.html
transform – Preprocessing function
name (str) – Name of this embeddings
>>> import nlpatl.models.embeddings as nme
>>> model = nme.TorchVision()
nlpatl.models.embeddings.nemo
- class nlpatl.models.embeddings.nemo.Nemo(model_name_or_path='titanet_large', batch_size=16, target_sr=16000, device='cuda', name='nemo')[source]
Bases:
nlpatl.models.embeddings.embeddings.Embeddings
A wrapper of nemo class.
- Parameters
model_name_or_path (str) – nemo model name. Verified models are titanet_large, speakerverification_speakernet and ecapa_tdnn. Refer to https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speaker_recognition/intro.html
batch_size (int) – Batch size of data processing. Default is 16
target_sr (int) – Sample rate. Audio will be resampled to this value.
device (str) – Device for processing data
name (str) – Name of this embeddings
>>> import nlpatl.models.embeddings as nme
>>> model = nme.Nemo()
Sampling
Certainty Sampling
nlpatl.sampling.certainty.most_confidence
- class nlpatl.sampling.certainty.most_confidence.MostConfidenceSampling(threshold=0.85, name='most_confidence_sampling')[source]
Bases:
nlpatl.sampling.sampling.Sampling
Sampling data points whose confidence is higher than the threshold. Refer to https://markcartwright.com/files/wang2019active.pdf
- Parameters
threshold (float) – Minimum probability of model prediction. Default value is 0.85
name (str) – Name of this sampling
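The thresholding idea can be sketched in plain Python. This is a minimal illustration and not NLPatl's implementation: the function name most_confidence_sample is hypothetical, and keeping data points that sit exactly on the threshold is an assumption.

```python
from typing import List, Tuple


def most_confidence_sample(
    probs: List[List[float]], threshold: float = 0.85
) -> Tuple[List[int], List[float]]:
    """Return indices and confidences of rows whose top probability meets the threshold."""
    indices, values = [], []
    for i, row in enumerate(probs):
        confidence = max(row)  # probability of the predicted class
        if confidence >= threshold:  # assumption: the boundary value is kept
            indices.append(i)
            values.append(confidence)
    return indices, values


# Three data points, two classes each.
probs = [[0.90, 0.10], [0.60, 0.40], [0.05, 0.95]]
indices, values = most_confidence_sample(probs, threshold=0.85)
# indices -> [0, 2]
```

Rows 0 and 2 are selected because the model is already confident about them, so they are candidates for labeling with the predicted class.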
Uncertainty Learning
nlpatl.sampling.uncertainty.least_confidence
- class nlpatl.sampling.uncertainty.least_confidence.LeastConfidenceSampling(name='least_confidence_sampling')[source]
Bases:
nlpatl.sampling.sampling.Sampling
Sampling data points according to the least confidence. Pick the data points whose highest class probability is the lowest. Refer to https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.219.1846&rep=rep1&type=pdf
- Parameters
name (str) – Name of this sampling
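Least confidence sampling can be sketched as follows. This is a minimal illustration under the assumption that each row holds class probabilities for one data point. The function name least_confidence_sample is hypothetical and the code is not taken from NLPatl.

```python
from typing import List, Tuple


def least_confidence_sample(
    probs: List[List[float]], num_sample: int
) -> Tuple[List[int], List[float]]:
    """Return the num_sample rows whose top class probability is lowest."""
    confidences = [max(row) for row in probs]
    # Sort row indices from least to most confident.
    order = sorted(range(len(probs)), key=lambda i: confidences[i])
    chosen = order[:num_sample]
    return chosen, [confidences[i] for i in chosen]


probs = [[0.90, 0.10], [0.55, 0.45], [0.70, 0.30]]
indices, values = least_confidence_sample(probs, num_sample=2)
# indices -> [1, 2]
```

Row 1 comes first because its best class only reaches 0.55, which makes it the data point the model is least sure about.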
nlpatl.sampling.uncertainty.entropy
nlpatl.sampling.uncertainty.margin
- class nlpatl.sampling.uncertainty.margin.MarginSampling(name='margin_sampling')[source]
Bases:
nlpatl.sampling.sampling.Sampling
Sampling data points according to the margin confidence. Pick the data points with the lowest probability difference between the highest class and the second highest class.
- Parameters
name (str) – Name of this sampling
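Margin sampling can be sketched as follows. This is a minimal illustration that assumes at least two classes per row. The function name margin_sample is hypothetical and the code is not taken from NLPatl.

```python
from typing import List, Tuple


def margin_sample(
    probs: List[List[float]], num_sample: int
) -> Tuple[List[int], List[float]]:
    """Return the num_sample rows with the smallest gap between the top two classes."""
    margins = []
    for row in probs:
        # Requires at least two classes per row.
        top, second = sorted(row, reverse=True)[:2]
        margins.append(top - second)
    # Sort row indices from the smallest to the largest margin.
    order = sorted(range(len(probs)), key=lambda i: margins[i])
    chosen = order[:num_sample]
    return chosen, [margins[i] for i in chosen]


probs = [[0.50, 0.30, 0.20], [0.40, 0.38, 0.22], [0.80, 0.10, 0.10]]
indices, values = margin_sample(probs, num_sample=2)
# indices -> [1, 0]
```

Row 1 is picked first because its two best classes are almost tied, even though its top probability alone would not single it out.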
nlpatl.sampling.uncertainty.mismatch
- class nlpatl.sampling.uncertainty.mismatch.MismatchSampling(name='mismatch_sampling')[source]
Bases:
nlpatl.sampling.sampling.Sampling
Sampling data points according to the mismatch. Pick N data points randomly.
- Parameters
name (str) – Name of this sampling
- sample(data1, data2, num_sample)[source]
- Parameters
data1 (Union[List[str], List[int], List[float], np.ndarray]) – Values used to determine the sampling
data2 (Union[List[str], List[int], List[float], np.ndarray]) – Values used to determine the sampling
num_sample (int) – Total number of samples for labeling
- Returns
Tuple of target indices and sampling values
- Return type
Tuple of numpy.ndarray, numpy.ndarray
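One possible reading of mismatch sampling is sketched below: compare two sequences element by element and randomly pick up to N positions where they disagree. The element-wise comparison, the function name mismatch_sample and the seed argument are assumptions made for this illustration. Unlike the documented method, the sketch returns only the indices.

```python
import random
from typing import List, Optional, Sequence


def mismatch_sample(
    data1: Sequence, data2: Sequence, num_sample: int, seed: Optional[int] = None
) -> List[int]:
    """Randomly pick up to num_sample positions where data1 and data2 differ."""
    mismatched = [i for i, (a, b) in enumerate(zip(data1, data2)) if a != b]
    rng = random.Random(seed)  # local generator keeps the sketch reproducible
    k = min(num_sample, len(mismatched))
    return sorted(rng.sample(mismatched, k))


# Hypothetical outputs of two labeling sources for the same five items.
first = ["cat", "dog", "dog", "bird", "cat"]
second = ["cat", "cat", "dog", "cat", "dog"]

indices = mismatch_sample(first, second, num_sample=2, seed=0)
```

Positions 1, 3 and 4 disagree, so the two returned indices are always drawn from that set.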
Clustering Sampling
nlpatl.sampling.clustering.farthest
- class nlpatl.sampling.clustering.farthest.FarthestSampling(name='farthest_sampling')[source]
Bases:
nlpatl.sampling.sampling.Sampling
Sampling data points according to their distance from the cluster centroid. Pick the n farthest data points from each cluster. Refer to http://zhaoshuyang.com/static/documents/MAL2.pdf
- Parameters
name (str) – Name of this sampling
- sample(data, groups, num_sample)[source]
- Parameters
data (np.ndarray) – Values used to determine the sampling
groups (np.ndarray) –
num_sample (int) – Total number of samples for labeling
- Returns
Tuple of target indices and sampling values
- Return type
Tuple of numpy.ndarray, numpy.ndarray
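Farthest sampling can be sketched in plain Python. This is a minimal illustration and not NLPatl's implementation. It assumes that data holds raw feature vectors, that groups holds one cluster id per data point, and that the centroid is the mean of the cluster members. The function name farthest_sample is hypothetical.

```python
import math
from collections import defaultdict
from typing import List, Tuple


def farthest_sample(
    data: List[List[float]], groups: List[int], num_sample: int
) -> Tuple[List[int], List[float]]:
    """Pick the num_sample points farthest from the centroid in every cluster."""
    members = defaultdict(list)
    for i, group in enumerate(groups):
        members[group].append(i)

    indices, distances = [], []
    for group in sorted(members):
        idx = members[group]
        dim = len(data[idx[0]])
        # Assumption: the centroid is the mean of the cluster members.
        centroid = [sum(data[i][d] for i in idx) / len(idx) for d in range(dim)]
        dist = {i: math.dist(data[i], centroid) for i in idx}
        farthest = sorted(idx, key=lambda i: dist[i], reverse=True)[:num_sample]
        indices.extend(farthest)
        distances.extend(dist[i] for i in farthest)
    return indices, distances


data = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0],
        [10.0, 10.0], [10.0, 12.0], [10.0, 17.0]]
groups = [0, 0, 0, 1, 1, 1]
indices, distances = farthest_sample(data, groups, num_sample=1)
# indices -> [2, 5]
```

Points far from their centroid are the least typical members of a cluster, which is why they are the ones proposed for labeling.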
See Module Index for API.