nlpatl

NLPatl is a library for active learning in machine learning experiments. Its goal is to help users build high-quality labeled datasets. It is built on top of transformers, scikit-learn and other machine learning packages. It can be applied to both the cold-start scenario (no labeled data at all) and the limited-labeled-data scenario.

Learning

nlpatl.learning.mismatch_farthest_learning

nlpatl.learning.semi_supervised_learning

nlpatl.learning.supervised_learning

nlpatl.learning.unsupervised_learning

Model

Classification

nlpatl.models.classification.sklearn_classification

nlpatl.models.classification.xgboost_classification

Clustering

nlpatl.models.clustering.sklearn_clustering

class nlpatl.models.clustering.sklearn_clustering.SkLearnClustering(model_name='kmeans', model_config={}, name='sklearn_clustering')

Bases: nlpatl.models.clustering.clustering.Clustering

A wrapper of scikit-learn clustering classes.

>>> import nlpatl.models.clustering as nmclu
>>> model = nmclu.SkLearnClustering()

predict_proba(x, predict_config={})

Returns

Feature and probabilities

Return type

nlpatl.dataset.Dataset

train(x)

Parameters

x (np.ndarray) – Raw features

nlpatl.models.clustering.sklearn_extra_clustering

Embeddings

nlpatl.models.embeddings.sentence_transformers

class nlpatl.models.embeddings.sentence_transformers.SentenceTransformers(model_name_or_path, batch_size=16, name='sentence_transformers')

Bases: nlpatl.models.embeddings.embeddings.Embeddings

A wrapper of the sentence transformers class.

Parameters
  • model_name_or_path (str) – sentence transformers model name.

  • batch_size (int) – Batch size of data processing. Default is 16

  • model_config (dict) – Model parameters. Refer to https://www.sbert.net/docs/pretrained_models.html

  • name (str) – Name of this embeddings

>>> import nlpatl.models.embeddings as nme
>>> model = nme.SentenceTransformers('all-MiniLM-L6-v2')  # example model name; model_name_or_path is required

convert(x)

Parameters

x (np.ndarray) – Raw features

Returns

Vectors of features

Return type

np.ndarray

nlpatl.models.embeddings.transformers

class nlpatl.models.embeddings.transformers.Transformers(model_name_or_path, batch_size=16, padding=False, truncation=False, nn_fwk=None, name='transformers')

Bases: nlpatl.models.embeddings.embeddings.Embeddings

A wrapper of transformers class.

Parameters
  • model_name_or_path (str) – transformers model name.

  • batch_size (int) – Batch size of data processing. Default is 16

  • padding (bool) – Inputs may not have same size. Set True to pad it. Default is False

  • truncation (bool) – Inputs may not have same size. Set True to truncate it. Default is False

  • nn_fwk (str) – Neural network framework. Either pt (for PyTorch) or tf (for TensorFlow)

  • name (str) – Name of this embeddings

>>> import nlpatl.models.embeddings as nme
>>> model = nme.Transformers('bert-base-uncased')  # example model name; model_name_or_path is required

convert(x)

Parameters

x (np.ndarray) – Raw features

Returns

Vectors of features

Return type

np.ndarray

nlpatl.models.embeddings.torchvision

class nlpatl.models.embeddings.torchvision.TorchVision(model_name_or_path, batch_size=16, model_config={'pretrained': True}, transform=None, name='torchvision')

Bases: nlpatl.models.embeddings.embeddings.Embeddings

A wrapper of torch vision class.

Parameters
  • model_name_or_path (str) – torch vision model name. Possible values are resnet18, alexnet and vgg16.

  • batch_size (int) – Batch size of data processing. Default is 16

  • model_config (dict) – Model parameters. Refer to https://pytorch.org/vision/stable/models.html

  • transform – Preprocessing function

  • name (str) – Name of this embeddings

>>> import nlpatl.models.embeddings as nme
>>> model = nme.TorchVision('resnet18')  # model_name_or_path is required

convert(x)

Parameters

x (np.ndarray) – Raw features

Returns

Vectors of features

Return type

np.ndarray

nlpatl.models.embeddings.nemo

class nlpatl.models.embeddings.nemo.Nemo(model_name_or_path='titanet_large', batch_size=16, target_sr=16000, device='cuda', name='nemo')

Bases: nlpatl.models.embeddings.embeddings.Embeddings

A wrapper of nemo class.

>>> import nlpatl.models.embeddings as nme
>>> model = nme.Nemo()

convert(x)

Parameters

x (np.ndarray) – Raw features

Returns

Vectors of features

Return type

np.ndarray

Sampling

Certainty Sampling

nlpatl.sampling.certainty.most_confidence

class nlpatl.sampling.certainty.most_confidence.MostConfidenceSampling(threshold=0.85, name='most_confidence_sampling')

Bases: nlpatl.sampling.sampling.Sampling

Samples data points whose prediction confidence is higher than the threshold. Refer to https://markcartwright.com/files/wang2019active.pdf

Parameters
  • threshold (float) – Minimum probability of model prediction. Default value is 0.85

  • name (str) – Name of this sampling

sample(data, num_sample)

Parameters
  • data – Values that determine the sampling

  • num_sample (int) – Total number of samples for labeling

Returns

Tuple of target indices and sampling values

Return type

Tuple of numpy.ndarray, numpy.ndarray
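
The idea behind most-confidence sampling can be illustrated with plain NumPy. This is a minimal sketch, not NLPatl's implementation: the function name most_confidence_sample is hypothetical, the input is assumed to be a 2-D array of class probabilities, and ordering the result from most to least confident is an assumption.

```python
import numpy as np


def most_confidence_sample(probs, num_sample, threshold=0.85):
    """Hypothetical helper (not part of NLPatl): keep rows whose top-class
    probability reaches the threshold, most confident first."""
    probs = np.asarray(probs, dtype=float)
    top = probs.max(axis=1)  # confidence of the predicted class per row
    # Assumption: the threshold is treated as an inclusive minimum.
    candidates = np.flatnonzero(top >= threshold)
    # Order the candidates from most to least confident.
    order = candidates[np.argsort(-top[candidates], kind="stable")]
    picked = order[:num_sample]
    return picked, top[picked]


probs = np.array([[0.95, 0.05], [0.60, 0.40], [0.10, 0.90], [0.88, 0.12]])
indices, values = most_confidence_sample(probs, num_sample=2)
```

Rows 0, 2 and 3 pass the default threshold here, and the two most confident of them are returned together with their confidence values.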

Uncertainty Sampling

nlpatl.sampling.uncertainty.least_confidence

class nlpatl.sampling.uncertainty.least_confidence.LeastConfidenceSampling(name='least_confidence_sampling')

Bases: nlpatl.sampling.sampling.Sampling

Samples data points according to the least confidence: picks the data points whose highest class probability is lowest. Refer to https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.219.1846&rep=rep1&type=pdf

Parameters

name (str) – Name of this sampling

sample(data, num_sample)

Parameters
  • data – Values that determine the sampling

  • num_sample (int) – Total number of samples for labeling

Returns

Tuple of target indices and sampling values

Return type

Tuple of numpy.ndarray, numpy.ndarray
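
Least-confidence selection can be sketched in a few lines of NumPy. The function name least_confidence_sample is hypothetical and the input is assumed to be a 2-D array of class probabilities; this illustrates the technique rather than the library's own code.

```python
import numpy as np


def least_confidence_sample(probs, num_sample):
    """Hypothetical helper (not part of NLPatl): pick the rows whose
    highest class probability is lowest."""
    probs = np.asarray(probs, dtype=float)
    top = probs.max(axis=1)  # probability of the predicted class per row
    # Ascending sort puts the least confident predictions first.
    picked = np.argsort(top, kind="stable")[:num_sample]
    return picked, top[picked]


probs = np.array([[0.90, 0.10], [0.55, 0.45], [0.70, 0.30]])
indices, values = least_confidence_sample(probs, num_sample=2)
```

Row 1 is selected first because its predicted class only reaches 0.55.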

nlpatl.sampling.uncertainty.entropy

class nlpatl.sampling.uncertainty.entropy.EntropySampling(name='entropy_sampling')

Bases: nlpatl.sampling.sampling.Sampling

Samples data points according to entropy, picking the N data points with the highest entropy.

Parameters

name (str) – Name of this sampling

sample(data, num_sample)

Parameters
  • data – Values that determine the sampling

  • num_sample (int) – Total number of samples for labeling

Returns

Tuple of target indices and sampling values

Return type

Tuple of numpy.ndarray, numpy.ndarray
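
As an illustration of entropy-based selection, the sketch below computes the Shannon entropy of each row of class probabilities and keeps the rows with the highest values. The name entropy_sample and the clipping constant are assumptions made for this example, not NLPatl's implementation.

```python
import numpy as np


def entropy_sample(probs, num_sample):
    """Hypothetical helper (not part of NLPatl): pick the rows with the
    highest Shannon entropy (natural log)."""
    probs = np.asarray(probs, dtype=float)
    # Clip before taking the log so that zero probabilities do not yield -inf.
    clipped = np.clip(probs, 1e-12, 1.0)
    entropy = -(probs * np.log(clipped)).sum(axis=1)
    # Negate so that argsort returns the highest entropy first.
    picked = np.argsort(-entropy, kind="stable")[:num_sample]
    return picked, entropy[picked]


probs = np.array([[0.5, 0.5], [0.9, 0.1], [0.7, 0.3]])
indices, values = entropy_sample(probs, num_sample=2)
```

The uniform row [0.5, 0.5] has the maximum entropy for two classes (ln 2), so it is selected first.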

nlpatl.sampling.uncertainty.margin

class nlpatl.sampling.uncertainty.margin.MarginSampling(name='margin_sampling')

Bases: nlpatl.sampling.sampling.Sampling

Samples data points according to the margin of confidence: picks the data points with the smallest probability difference between the highest class and the second-highest class.

Parameters

name (str) – Name of this sampling

sample(data, num_sample)

Parameters
  • data – Values that determine the sampling

  • num_sample (int) – Total number of samples for labeling

Returns

Tuple of target indices and sampling values

Return type

Tuple of numpy.ndarray, numpy.ndarray
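
Margin-based selection can be sketched as follows. The function name margin_sample is hypothetical, and the input is assumed to be a 2-D array of class probabilities with at least two classes; the library's own code may differ.

```python
import numpy as np


def margin_sample(probs, num_sample):
    """Hypothetical helper (not part of NLPatl): pick the rows with the
    smallest gap between the top two class probabilities."""
    probs = np.asarray(probs, dtype=float)
    # Sort each row ascending; the last two columns are the top-two classes.
    ordered = np.sort(probs, axis=1)
    margin = ordered[:, -1] - ordered[:, -2]
    # Ascending sort puts the smallest margins first.
    picked = np.argsort(margin, kind="stable")[:num_sample]
    return picked, margin[picked]


probs = np.array([[0.50, 0.40, 0.10], [0.80, 0.10, 0.10], [0.45, 0.45, 0.10]])
indices, values = margin_sample(probs, num_sample=2)
```

Row 2 has a margin of zero (two classes tied), so it is the first candidate for labeling.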

nlpatl.sampling.uncertainty.mismatch

class nlpatl.sampling.uncertainty.mismatch.MismatchSampling(name='mismatch_sampling')

Bases: nlpatl.sampling.sampling.Sampling

Samples data points according to mismatch, picking N of the mismatched data points at random.

Parameters

name (str) – Name of this sampling

sample(data1, data2, num_sample)

Parameters
  • data1 (Union[List[str], List[int], List[float], np.ndarray]) – First set of values that determine the sampling

  • data2 (Union[List[str], List[int], List[float], np.ndarray]) – Second set of values that determine the sampling

  • num_sample (int) – Total number of samples for labeling

Returns

Tuple of target indices and sampling values

Return type

Tuple of numpy.ndarray, numpy.ndarray
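
One way to read mismatch sampling is: compare two aligned sequences (for example, labels from two sources) and draw at random from the positions where they disagree. The sketch below follows that reading. The name mismatch_sample, the seed parameter, and the choice to return the disagreeing value pairs as the second element are all assumptions for illustration, not NLPatl's implementation.

```python
import numpy as np


def mismatch_sample(data1, data2, num_sample, seed=None):
    """Hypothetical helper (not part of NLPatl): randomly pick positions
    where two aligned sequences disagree."""
    data1 = np.asarray(data1)
    data2 = np.asarray(data2)
    mismatched = np.flatnonzero(data1 != data2)  # positions that disagree
    rng = np.random.default_rng(seed)
    size = min(num_sample, mismatched.size)
    # Draw without replacement so no position is returned twice.
    picked = rng.choice(mismatched, size=size, replace=False)
    # Assumption: report the disagreeing pairs as the "sampling values".
    pairs = np.column_stack((data1[picked], data2[picked]))
    return picked, pairs


labels_a = ["cat", "dog", "cat", "bird"]
labels_b = ["cat", "cat", "cat", "dog"]
indices, pairs = mismatch_sample(labels_a, labels_b, num_sample=5, seed=0)
```

Only positions 1 and 3 disagree in this example, so at most two indices can be returned even though five were requested.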

Clustering Sampling

nlpatl.sampling.clustering.farthest

class nlpatl.sampling.clustering.farthest.FarthestSampling(name='farthest_sampling')

Bases: nlpatl.sampling.sampling.Sampling

Samples data points according to their distance from the cluster centroid, picking the n farthest data points per cluster. Refer to http://zhaoshuyang.com/static/documents/MAL2.pdf

Parameters

name (str) – Name of this sampling

sample(data, groups, num_sample)

Parameters
  • data – Values that determine the sampling

  • groups – Cluster assignments of the data points

  • num_sample (int) – Total number of samples for labeling

Returns

Tuple of target indices and sampling values

Return type

Tuple of numpy.ndarray, numpy.ndarray
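
Farthest-from-centroid selection can be sketched with NumPy as below. The function name farthest_sample and its num_per_cluster parameter are hypothetical, the centroid is assumed to be the mean of each cluster's members, and Euclidean distance is assumed; how the library maps num_sample to a per-cluster count is not covered here.

```python
import numpy as np


def farthest_sample(features, groups, num_per_cluster):
    """Hypothetical helper (not part of NLPatl): for each cluster, pick the
    points farthest from that cluster's mean."""
    features = np.asarray(features, dtype=float)
    groups = np.asarray(groups)
    picked, values = [], []
    for group in np.unique(groups):
        members = np.flatnonzero(groups == group)
        centroid = features[members].mean(axis=0)
        # Euclidean distance of each member to its cluster centroid.
        dist = np.linalg.norm(features[members] - centroid, axis=1)
        order = np.argsort(-dist, kind="stable")[:num_per_cluster]
        picked.extend(members[order])
        values.extend(dist[order])
    return np.array(picked), np.array(values)


features = np.array([[0, 0], [1, 0], [5, 0], [10, 10], [10, 12], [10, 20]])
groups = np.array([0, 0, 0, 1, 1, 1])
indices, values = farthest_sample(features, groups, num_per_cluster=1)
```

In this example the centroids are (2, 0) and (10, 14), so points 2 and 5 are the farthest members of their clusters.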
