nlpatl
NLPatl is a library for active learning in machine learning experiments. Its goal is to help users build high-quality labeled datasets. It is built on top of transformers, scikit-learn and other machine learning packages. It can be applied to both the cold-start scenario (no labeled data at all) and the limited-labeled-data scenario.
Learning
nlpatl.learning.mismatch_farthest_learning
- class nlpatl.learning.mismatch_farthest_learning.MismatchFarthestLearning(clustering_sampling, embeddings, clustering, classification, embeddings_type=None, embeddings_model_config=None, clustering_model_config=None, classification_model_config=None, multi_label=False, name='mismatch_farthest_learning')[source]
Bases: nlpatl.learning.learning.Learning
Applies the mismatch-first farthest-traversal approach (with modification) to annotate the most valuable data points. You may refer to http://zhaoshuyang.com/static/documents/MAL2.pdf . Here is the pseudocode:
1. [NLPatl] Convert raw data to features (Embeddings model)
2. [NLPatl] Train the model and cluster the data points (Clustering model)
3. [NLPatl] Estimate the most valuable data points (Sampling)
4. [Human] Subject matter experts annotate the most valuable data points
5. [NLPatl] Train the classification model (Classification model)
6. [NLPatl] Classify the unlabeled data points and compare the result with the clustering model result according to the farthest mismatched data points
7. [Human] Subject matter experts annotate the most valuable data points
8. Repeat steps 2 to 7 until enough data points are acquired or another exit criterion is reached.
- Parameters
clustering_sampling (str or function) – Clustering sampling method for stage 1 exploration. Provide the name of a certified method (nearest_mean) or a custom function.
embeddings (str or nlpatl.models.embeddings.Embeddings) – Function for converting raw data to embeddings. Provide a model name that matches the embeddings type, for example multi-qa-MiniLM-L6-cos-v1 for sentence_transformers, bert-base-uncased for transformers, or vgg16 for torch_vision.
embeddings_model_config (dict) – Configuration for the embeddings model. Optional. Ignored if a custom embeddings class is used.
embeddings_type (str) – Type of embeddings: sentence_transformers for text, transformers for text, or torch_vision for image.
clustering (str or nlpatl.models.clustering.Clustering) – Function for clustering inputs. Provide either a certified method (kmeans) or a custom function.
clustering_model_config (dict) – Configuration for the clustering model. Optional. Ignored if a custom clustering class is used.
classification (nlpatl.models.classification.Classification) – Function for classifying inputs. Provide either a certified method (logistic_regression, svc, linear_svc, random_forest or xgboost) or a custom function.
classification_model_config (dict) – Configuration for the classification model. Optional. Ignored if a custom classification class is used.
multi_label (bool) – Indicates whether the classification model is multi-label or multi-class (or binary). Default is False.
name (str) – Name of this learning.
- clear_learn_data()
Clear all learn data points
- educate(index, x, x_features, y)
Annotate a data point. Data points must be annotated one at a time; batch annotation is not supported.
- Parameters
index (int) – Index of data point.
x (string, int, float or np.ndarray) – Raw data input. It can be text, a number or a numpy array (for image).
x_features (int, float or np.ndarray) – Data features
y (string, int, list of string (multi-label case) or list of int (multi-label case)) – Label of data point
- explore(x, return_type='dict', num_sample=10)
Estimate the most valuable data points for annotation.
- Parameters
x (list of string, int or float or np.ndarray) – Raw data inputs. They can be text, numbers or a numpy array (for image).
return_type (str) – Data type of the returned object. Possible values are dict and object. If dict is assigned, the returned object is a dict.
num_sample (int) – Maximum number of data points for annotation.
- Returns
The most valuable data points.
- Return type
nlpatl.dataset.Dataset objects or dict
- explore_educate_in_notebook(x, num_sample=5, num_sample_per_cluster=2, data_type='text')[source]
Estimate the most valuable data points for annotation and annotate them in an IPython Notebook. Runs the explore function and the educate function sequentially.
- Parameters
x (list of string, int or float or np.ndarray) – Raw data inputs. They can be text, numbers or a numpy array (for image).
return_type (str) – Data type of the returned object. Possible values are dict and object. If dict is assigned, the returned object is a dict.
num_sample (int) – Maximum number of data points for annotation.
data_type (str) – Indicates the data format for displaying in the IPython Notebook. Possible values are text and image.
num_sample_per_cluster (int) –
- get_learn_data()
Get all learn data points
- Returns
Learnt data points
- Return type
Tuple of index list of int, x (str or numpy.ndarray), x_features (numpy.ndarray) and y (numpy.ndarray)
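The bookkeeping behind educate, get_learn_data and clear_learn_data can be pictured as four parallel lists. The sketch below is a minimal, hypothetical stand-in: the class name LearnDataStore and its internals are assumptions for illustration, not NLPatl code. It only mirrors the documented method signatures and the documented order of the returned tuple.

```python
class LearnDataStore:
    """Hypothetical store of annotated data points (not NLPatl's own class)."""

    def __init__(self):
        self.clear_learn_data()

    def clear_learn_data(self):
        # Reset all four parallel lists.
        self.indices, self.x, self.x_features, self.y = [], [], [], []

    def educate(self, index, x, x_features, y):
        # One data point per call, as the educate description requires.
        self.indices.append(index)
        self.x.append(x)
        self.x_features.append(x_features)
        self.y.append(y)

    def get_learn_data(self):
        # Same ordering as the documented return tuple.
        return self.indices, self.x, self.x_features, self.y


store = LearnDataStore()
store.educate(3, "great movie", [0.1, 0.9], "positive")
store.educate(7, "dull plot", [0.8, 0.2], "negative")
indices, x, x_features, y = store.get_learn_data()
```

Plain lists are used here to keep the sketch dependency-free; the documented return type uses numpy arrays for the features and labels.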
- learn(x, y, include_learn_data=True)
Train the classification model.
- Parameters
x (list of string, int or float or np.ndarray) – Raw data inputs. They can be text, numbers or a numpy array.
y (bool) – Label of data inputs
include_learn_data (bool) – Whether to include the human-annotated data and the machine self-annotated data when training the model. Default is True.
nlpatl.learning.semi_supervised_learning
- class nlpatl.learning.semi_supervised_learning.SemiSupervisedLearning(sampling, embeddings, classification, embeddings_type=None, embeddings_model_config=None, classification_model_config=None, multi_label=False, self_learn_threshold=0.9, name='semi_supervised_learning')[source]
Bases: nlpatl.learning.learning.Learning
Applies both active learning and semi-supervised learning approaches to annotate the most valuable data points. You may refer to https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0162075&type=printable . Here is the pseudocode:
1. [NLPatl] Convert raw data to features (Embeddings model)
2. [NLPatl] Train the model and classify the data points (Classification model)
3. [NLPatl] Estimate the most valuable data points (Sampling)
4. [Human] Subject matter experts annotate the most valuable data points
5. [NLPatl] Retrain the classification model
6. [NLPatl] Classify the unlabeled data points and label those whose confidence is higher than self_learn_threshold
7. Repeat steps 2 to 6 until enough data points are acquired.
- Parameters
sampling (str or function) – Sampling method for getting the most valuable data points. Provide the name of a certified method (most_confidence, entropy, least_confidence, margin, nearest_mean, fathest) or a custom function.
embeddings (str or nlpatl.models.embeddings.Embeddings) – Function for converting raw data to embeddings. Provide a model name that matches the embeddings type, for example multi-qa-MiniLM-L6-cos-v1 for sentence_transformers, bert-base-uncased for transformers, or vgg16 for torch_vision.
embeddings_model_config (dict) – Configuration for the embeddings model. Optional. Ignored if a custom embeddings class is used.
embeddings_type (str) – Type of embeddings: sentence_transformers for text, transformers for text, or torch_vision for image.
classification (nlpatl.models.classification.Classification) – Function for classifying inputs. Provide either a certified method (logistic_regression, svc, linear_svc, random_forest or xgboost) or a custom function.
classification_model_config (dict) – Configuration for the classification model. Optional. Ignored if a custom classification class is used.
self_learn_threshold (float) – The minimum threshold for classifying probabilities. Data will be labeled automatically if probability is higher than this value. Default is 0.9
name (str) – Name of this learning.
multi_label (bool) –
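The role of self_learn_threshold can be illustrated without the library. The sketch below is an assumption about the general technique, not NLPatl's implementation: the helper name self_label is hypothetical, and it simply keeps the predictions whose top class probability is higher than the threshold.

```python
def self_label(probabilities, threshold=0.9):
    """Return (row index, predicted class) pairs confident enough to auto-label."""
    labeled = []
    for i, probs in enumerate(probabilities):
        # Predicted class is the one with the highest probability.
        best_class = max(range(len(probs)), key=lambda c: probs[c])
        if probs[best_class] > threshold:
            labeled.append((i, best_class))
    return labeled


probs = [
    [0.95, 0.05],  # confident in class 0: labeled automatically
    [0.60, 0.40],  # not confident: left for a human annotator
    [0.08, 0.92],  # confident in class 1: labeled automatically
]
auto_labeled = self_label(probs, threshold=0.9)
```

Rows that do not pass the threshold remain unlabeled and stay candidates for human annotation in the next round.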
- clear_learn_data()
Clear all learn data points
- educate(index, x, x_features, y)
Annotate a data point. Data points must be annotated one at a time; batch annotation is not supported.
- Parameters
index (int) – Index of data point.
x (string, int, float or np.ndarray) – Raw data input. It can be text, a number or a numpy array (for image).
x_features (int, float or np.ndarray) – Data features
y (string, int, list of string (multi-label case) or list of int (multi-label case)) – Label of data point
- explore(x, return_type='dict', num_sample=10)[source]
Estimate the most valuable data points for annotation.
- Parameters
x (list of string, int or float or np.ndarray) – Raw data inputs. They can be text, numbers or a numpy array (for image).
return_type (str) – Data type of the returned object. Possible values are dict and object. If dict is assigned, the returned object is a dict.
num_sample (int) – Maximum number of data points for annotation.
- Returns
The most valuable data points.
- Return type
nlpatl.dataset.Dataset objects or dict
- explore_educate_in_notebook(x, num_sample=2, data_type='text')[source]
Estimate the most valuable data points for annotation and annotate them in an IPython Notebook. Runs the explore function and the educate function sequentially.
- Parameters
x (list of string, int or float or np.ndarray) – Raw data inputs. They can be text, numbers or a numpy array (for image).
return_type (str) – Data type of the returned object. Possible values are dict and object. If dict is assigned, the returned object is a dict.
num_sample (int) – Maximum number of data points for annotation.
data_type (str) – Indicates the data format for displaying in the IPython Notebook. Possible values are text and image.
- get_learn_data()
Get all learn data points
- Returns
Learnt data points
- Return type
Tuple of index list of int, x (str or numpy.ndarray), x_features (numpy.ndarray) and y (numpy.ndarray)
- get_self_learn_data()[source]
Get all self learnt data points
- Returns
Self learnt data points
- Return type
Tuple of index list of int, x (numpy.ndarray) and y (numpy.ndarray)
- learn(x=None, y=None, include_learn_data=True)[source]
Train the classification model.
- Parameters
x (list of string, int or float or np.ndarray) – Raw data inputs. They can be text, numbers or a numpy array.
y (bool) – Label of data inputs
include_learn_data (bool) – Whether to include the human-annotated data and the machine self-annotated data when training the model. Default is True.
nlpatl.learning.supervised_learning
- class nlpatl.learning.supervised_learning.SupervisedLearning(sampling, embeddings, classification, embeddings_type=None, embeddings_model_config=None, classification_model_config=None, multi_label=False, name='supervised_learning')[source]
Bases: nlpatl.learning.learning.Learning
Applies the typical active learning approach to annotate the most valuable data points. Here is the pseudocode:
1. [NLPatl] Convert raw data to features (Embeddings model)
2. [NLPatl] Train the model and classify the data points (Classification model)
3. [NLPatl] Estimate the most valuable data points (Sampling)
4. [Human] Subject matter experts annotate the most valuable data points
5. Repeat steps 2 to 4 until enough data points are acquired.
- Parameters
sampling (str or function) – Sampling method for getting the most valuable data points. Provide the name of a certified method (most_confidence, entropy, least_confidence, margin, nearest_mean, fathest) or a custom function.
embeddings (str or nlpatl.models.embeddings.Embeddings) – Function for converting raw data to embeddings. Provide a model name that matches the embeddings type, for example multi-qa-MiniLM-L6-cos-v1 for sentence_transformers, bert-base-uncased for transformers, or vgg16 for torch_vision.
embeddings_model_config (dict) – Configuration for the embeddings model. Optional. Ignored if a custom embeddings class is used.
embeddings_type (str) – Type of embeddings: sentence_transformers for text, transformers for text, or torch_vision for image.
classification (nlpatl.models.classification.Classification) – Function for classifying inputs. Provide either a certified method (logistic_regression, svc, linear_svc, random_forest or xgboost) or a custom function.
classification_model_config (dict) – Configuration for the classification model. Optional. Ignored if a custom classification class is used.
multi_label (bool) – Indicates whether the classification model is multi-label or multi-class (or binary). Default is False.
name (str) – Name of this learning.
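The train, sample, annotate loop described for this class can be sketched with a toy model. Everything below is an assumption made for illustration and none of it is NLPatl code: the one-dimensional pool, the nearest-mean classifier, the softmax-style pseudo-probabilities, and the oracle function that stands in for the human annotator. Sampling here uses least confidence.

```python
import math


def train(labeled):
    """Fit a toy nearest-mean classifier: one mean per class."""
    sums, counts = {}, {}
    for x, y in labeled.values():
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_proba(means, x):
    """Pseudo-probabilities: softmax over negative distances to class means."""
    weights = {y: math.exp(-abs(x - m)) for y, m in means.items()}
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}


def oracle(x):
    """Stand-in for the subject matter expert."""
    return int(x >= 5.0)


# Step 1: the "features" are plain floats in this toy example.
pool = [0.0, 1.0, 2.0, 3.0, 4.0, 4.5, 5.5, 6.0, 7.0, 8.0, 9.0, 10.0]
# Two seed annotations, one per class, keyed by pool index.
labeled = {0: (pool[0], oracle(pool[0])), 11: (pool[11], oracle(pool[11]))}

for _ in range(3):
    means = train(labeled)  # Step 2: train and classify
    unlabeled = [i for i in range(len(pool)) if i not in labeled]
    confidence = {
        i: max(predict_proba(means, pool[i]).values()) for i in unlabeled
    }
    query = min(unlabeled, key=lambda i: confidence[i])  # Step 3: sample
    labeled[query] = (pool[query], oracle(pool[query]))  # Step 4: annotate
```

Each round queries the point the current model is least sure about, so annotations tend to concentrate near the decision boundary rather than being spread evenly over the pool.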
- clear_learn_data()
Clear all learn data points
- educate(index, x, x_features, y)
Annotate a data point. Data points must be annotated one at a time; batch annotation is not supported.
- Parameters
index (int) – Index of data point.
x (string, int, float or np.ndarray) – Raw data input. It can be text, a number or a numpy array (for image).
x_features (int, float or np.ndarray) – Data features
y (string, int, list of string (multi-label case) or list of int (multi-label case)) – Label of data point
- explore(x, return_type='dict', num_sample=10)[source]
Estimate the most valuable data points for annotation.
- Parameters
x (list of string, int or float or np.ndarray) – Raw data inputs. They can be text, numbers or a numpy array (for image).
return_type (str) – Data type of the returned object. Possible values are dict and object. If dict is assigned, the returned object is a dict.
num_sample (int) – Maximum number of data points for annotation.
- Returns
The most valuable data points.
- Return type
nlpatl.dataset.Dataset objects or dict
- explore_educate_in_notebook(x, num_sample=2, data_type='text')
Estimate the most valuable data points for annotation and annotate them in an IPython Notebook. Runs the explore function and the educate function sequentially.
- Parameters
x (list of string, int or float or np.ndarray) – Raw data inputs. They can be text, numbers or a numpy array (for image).
return_type (str) – Data type of the returned object. Possible values are dict and object. If dict is assigned, the returned object is a dict.
num_sample (int) – Maximum number of data points for annotation.
data_type (str) – Indicates the data format for displaying in the IPython Notebook. Possible values are text and image.
- get_learn_data()
Get all learn data points
- Returns
Learnt data points
- Return type
Tuple of index list of int, x (str or numpy.ndarray), x_features (numpy.ndarray) and y (numpy.ndarray)
- learn(x, y, include_learn_data=True)[source]
Train the classification model.
- Parameters
x (list of string, int or float or np.ndarray) – Raw data inputs. They can be text, numbers or a numpy array.
y (bool) – Label of data inputs
include_learn_data (bool) – Whether to include the human-annotated data and the machine self-annotated data when training the model. Default is True.
nlpatl.learning.unsupervised_learning
- class nlpatl.learning.unsupervised_learning.UnsupervisedLearning(sampling, embeddings, clustering, embeddings_type=None, embeddings_model_config=None, clustering_model_config=None, multi_label=False, name='unsupervised_learning')[source]
Bases: nlpatl.learning.learning.Learning
Applies the unsupervised learning approach to annotate the most valuable data points. You may refer to https://homepages.tuni.fi/tuomas.virtanen/papers/active-learning-sound.pdf. Here is the pseudocode:
1. [NLPatl] Convert raw data to features (Embeddings model)
2. [NLPatl] Train the model and cluster the data points (Clustering model)
3. [NLPatl] Estimate the most valuable data points (Sampling)
4. [Human] Subject matter experts annotate the most valuable data points
5. Repeat steps 2 to 4 until enough data points are acquired.
- Parameters
sampling (str or function) – Sampling method for getting the most valuable data points. Provide the name of a certified method (most_confidence, entropy, least_confidence, margin, nearest_mean, fathest) or a custom function.
embeddings (str or nlpatl.models.embeddings.Embeddings) – Function for converting raw data to embeddings. Provide a model name that matches the embeddings type, for example multi-qa-MiniLM-L6-cos-v1 for sentence_transformers, bert-base-uncased for transformers, or vgg16 for torch_vision.
embeddings_model_config (dict) – Configuration for the embeddings model. Optional. Ignored if a custom embeddings class is used.
embeddings_type (str) – Type of embeddings: sentence_transformers for text, transformers for text, or torch_vision for image.
clustering (str or nlpatl.models.clustering.Clustering) – Function for clustering inputs. Provide either a certified method (kmeans) or a custom function.
clustering_model_config (dict) – Configuration for the clustering model. Optional. Ignored if a custom clustering class is used.
multi_label (bool) – Indicates whether the classification model is multi-label or multi-class (or binary). Default is False.
name (str) – Name of this learning.
- clear_learn_data()
Clear all learn data points
- educate(index, x, x_features, y)
Annotate a data point. Data points must be annotated one at a time; batch annotation is not supported.
- Parameters
index (int) – Index of data point.
x (string, int, float or np.ndarray) – Raw data input. It can be text, a number or a numpy array (for image).
x_features (int, float or np.ndarray) – Data features
y (string, int, list of string (multi-label case) or list of int (multi-label case)) – Label of data point
- explore(inputs, return_type='dict', num_sample=2)[source]
Estimate the most valuable data points for annotation.
- Parameters
x (list of string, int or float or np.ndarray) – Raw data inputs. They can be text, numbers or a numpy array (for image).
return_type (str) – Data type of the returned object. Possible values are dict and object. If dict is assigned, the returned object is a dict.
num_sample (int) – Maximum number of data points for annotation.
inputs (List[str]) –
- Returns
The most valuable data points.
- Return type
nlpatl.dataset.Dataset objects or dict
- explore_educate_in_notebook(x, num_sample=2, data_type='text')
Estimate the most valuable data points for annotation and annotate them in an IPython Notebook. Runs the explore function and the educate function sequentially.
- Parameters
x (list of string, int or float or np.ndarray) – Raw data inputs. They can be text, numbers or a numpy array (for image).
return_type (str) – Data type of the returned object. Possible values are dict and object. If dict is assigned, the returned object is a dict.
num_sample (int) – Maximum number of data points for annotation.
data_type (str) – Indicates the data format for displaying in the IPython Notebook. Possible values are text and image.
- get_learn_data()
Get all learn data points
- Returns
Learnt data points
- Return type
Tuple of index list of int, x (str or numpy.ndarray), x_features (numpy.ndarray) and y (numpy.ndarray)
- learn(x, y, include_learn_data=True)
Train the classification model.
- Parameters
x (list of string, int or float or np.ndarray) – Raw data inputs. They can be text, numbers or a numpy array.
y (bool) – Label of data inputs
include_learn_data (bool) – Whether to include the human-annotated data and the machine self-annotated data when training the model. Default is True.
Model
Classification
nlpatl.models.classification.sklearn_classification
scikit-learn classification wrapper
- class nlpatl.models.classification.sklearn_classification.SkLearnClassification(model_name='logistic_regression', model_config={}, name='sklearn_classification')[source]
Bases: nlpatl.models.classification.classification.Classification
A wrapper of the scikit-learn classification class.
- Parameters
model_name (str) – scikit-learn classification model name. Possible values are logistic_regression, svc, linear_svc and random_forest.
model_config (dict) – Model parameters. Refer to https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model
name (str) – Name of this classification
>>> import nlpatl.models.classification as nmcla
>>> model = nmcla.SkLearnClassification()
- predict_proba(x, predict_config={})[source]
- Parameters
x (np.ndarray) – Raw features
predict_config (dict) – Model prediction parameters. Refer to https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model
- Returns
Feature and probabilities
- Return type
nlpatl.dataset.Dataset
nlpatl.models.classification.xgboost_classification
- class nlpatl.models.classification.xgboost_classification.XGBoostClassification(model_config={}, name='xgboost_classification')[source]
Bases: nlpatl.models.classification.sklearn_classification.SkLearnClassification
A wrapper of the xgboost classification class.
- Parameters
model_config (dict) – Model parameters. Refer to https://xgboost.readthedocs.io/en/stable/python/python_api.html
name (str) – Name of this classification
>>> import nlpatl.models.classification as nmcla
>>> model = nmcla.XGBoostClassification()
- predict_proba(x, predict_config={})[source]
- Parameters
x (np.ndarray) – Raw features
predict_config (dict) – Model prediction parameters. Refer to https://xgboost.readthedocs.io/en/stable/python/python_api.html
- Returns
Feature and probabilities
- Return type
nlpatl.dataset.Dataset
- train(x, y)
- Parameters
x (np.ndarray) – Raw features
y (list of string, int or float or np.ndarray) – Label of data inputs
Clustering
nlpatl.models.clustering.sklearn_clustering
- class nlpatl.models.clustering.sklearn_clustering.SkLearnClustering(model_name='kmeans', model_config={}, name='sklearn_clustering')[source]
Bases: nlpatl.models.clustering.clustering.Clustering
A wrapper of the scikit-learn clustering class.
- Parameters
model_name (str) – scikit-learn clustering model name. The only possible value is kmeans.
model_config (dict) – Model parameters. Refer to https://scikit-learn.org/stable/modules/classes.html#module-sklearn.cluster
name (str) – Name of this clustering
>>> import nlpatl.models.clustering as nmclu
>>> model = nmclu.SkLearnClustering()
- predict_proba(x, predict_config={})[source]
- Parameters
x (np.ndarray) – Raw features
predict_config (dict) – Model prediction parameters. Refer to https://scikit-learn.org/stable/modules/classes.html#module-sklearn.cluster
- Returns
Feature and probabilities
- Return type
nlpatl.dataset.Dataset
Embeddings
nlpatl.models.embeddings.sentence_transformers
- class nlpatl.models.embeddings.sentence_transformers.SentenceTransformers(model_name_or_path, batch_size=16, name='sentence_transformers')[source]
Bases: nlpatl.models.embeddings.embeddings.Embeddings
A wrapper of the sentence transformers class.
- Parameters
model_name_or_path (str) – sentence transformers model name.
batch_size (int) – Batch size of data processing. Default is 16
model_config (dict) – Model parameters. Refer to https://www.sbert.net/docs/pretrained_models.html
name (str) – Name of this embeddings
>>> import nlpatl.models.embeddings as nme
>>> model = nme.SentenceTransformers('multi-qa-MiniLM-L6-cos-v1')
nlpatl.models.embeddings.transformers
- class nlpatl.models.embeddings.transformers.Transformers(model_name_or_path, batch_size=16, padding=False, truncation=False, nn_fwk=None, name='transformers')[source]
Bases: nlpatl.models.embeddings.embeddings.Embeddings
A wrapper of the transformers class.
- Parameters
model_name_or_path (str) – transformers model name.
batch_size (int) – Batch size of data processing. Default is 16
padding (bool) – Inputs may not have the same size. Set True to pad them. Default is False
truncation (bool) – Inputs may not have the same size. Set True to truncate them. Default is False
nn_fwk (str) – Neural network framework. Either pt (for PyTorch) or tf (for TensorFlow)
model_config (dict) – Model parameters. Refer to https://huggingface.co/docs/transformers/index
name (str) – Name of this embeddings
>>> import nlpatl.models.embeddings as nme
>>> model = nme.Transformers('bert-base-uncased')
nlpatl.models.embeddings.torchvision
- class nlpatl.models.embeddings.torchvision.TorchVision(model_name_or_path, batch_size=16, model_config={'pretrained': True}, transform=None, name='torchvision')[source]
Bases: nlpatl.models.embeddings.embeddings.Embeddings
A wrapper of the torch vision class.
- Parameters
model_name_or_path (str) – torch vision model name. Possible values are resnet18, alexnet and vgg16.
batch_size (int) – Batch size of data processing. Default is 16
model_config (dict) – Model parameters. Refer to https://pytorch.org/vision/stable/models.html
transform – Preprocessing function
name (str) – Name of this embeddings
>>> import nlpatl.models.embeddings as nme
>>> model = nme.TorchVision('vgg16')
Sampling
Certainty Sampling
nlpatl.sampling.certainty.most_confidence
- class nlpatl.sampling.certainty.most_confidence.MostConfidenceSampling(threshold=0.85, name='most_confidence_sampling')[source]
Bases: nlpatl.sampling.sampling.Sampling
Samples data points whose confidence is higher than the threshold. Refer to https://markcartwright.com/files/wang2019active.pdf
- Parameters
threshold (float) – Minimum probability of model prediction. Default value is 0.85
name (str) – Name of this sampling
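As an illustration of the thresholding idea, the sketch below keeps the rows whose highest class probability exceeds the threshold and returns the most confident ones first. It is written under assumptions and is not NLPatl's implementation: the function name most_confidence_sample and the num_sample argument are hypothetical.

```python
def most_confidence_sample(probabilities, threshold=0.85, num_sample=2):
    """Return indices and confidences of rows above the threshold, best first."""
    # Confidence of a row is its highest class probability.
    scored = [(max(row), i) for i, row in enumerate(probabilities)]
    kept = [(conf, i) for conf, i in scored if conf > threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    top = kept[:num_sample]
    return [i for _, i in top], [conf for conf, _ in top]


probs = [[0.9, 0.1], [0.5, 0.5], [0.02, 0.98], [0.86, 0.14]]
indices, values = most_confidence_sample(probs, threshold=0.85, num_sample=2)
```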
Uncertainty Sampling
nlpatl.sampling.uncertainty.least_confidence
- class nlpatl.sampling.uncertainty.least_confidence.LeastConfidenceSampling(name='least_confidence_sampling')[source]
Bases: nlpatl.sampling.sampling.Sampling
Samples data points according to the least confidence. Picks the data points whose highest class probability is the lowest.
- Parameters
name (str) – Name of this sampling
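A minimal sketch of least-confidence selection follows. The function name and arguments are hypothetical and the code is an illustration of the general technique, not NLPatl's implementation: each row is scored by its highest class probability, and the rows with the lowest scores are returned.

```python
def least_confidence_sample(probabilities, num_sample=2):
    """Return indices and confidences of the num_sample least confident rows."""
    confidences = [max(row) for row in probabilities]
    # Ascending order: the least confident rows come first.
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    chosen = order[:num_sample]
    return chosen, [confidences[i] for i in chosen]


probs = [[0.9, 0.1], [0.55, 0.45], [0.7, 0.3], [0.51, 0.49]]
indices, values = least_confidence_sample(probs, num_sample=2)
```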
nlpatl.sampling.uncertainty.entropy
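No description accompanies the entropy module on this page. Under the common definition of entropy sampling, which is assumed here rather than taken from NLPatl's source, each row of class probabilities is scored by its Shannon entropy and the highest-entropy rows are selected. The function name below is hypothetical.

```python
import math


def entropy_sample(probabilities, num_sample=2):
    """Return indices and entropies of the num_sample most uncertain rows."""
    entropies = [
        # Zero probabilities are skipped, treating 0 * log(0) as 0.
        -sum(p * math.log(p) for p in row if p > 0)
        for row in probabilities
    ]
    # Descending order: the highest entropy (most uncertain) comes first.
    order = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    chosen = order[:num_sample]
    return chosen, [entropies[i] for i in chosen]


probs = [[0.5, 0.5], [0.9, 0.1], [0.7, 0.3]]
indices, values = entropy_sample(probs, num_sample=2)
```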
nlpatl.sampling.uncertainty.margin
- class nlpatl.sampling.uncertainty.margin.MarginSampling(name='margin_sampling')[source]
Bases: nlpatl.sampling.sampling.Sampling
Samples data points according to the margin confidence. Picks the data points with the smallest difference in probability between the highest class and the second-highest class.
- Parameters
name (str) – Name of this sampling
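The margin criterion can be sketched as follows. This is an illustrative implementation under assumptions (hypothetical function name, plain Python lists, at least two classes per row), not NLPatl's code: the margin of a row is the gap between its two largest probabilities, and the rows with the smallest gaps are selected.

```python
def margin_sample(probabilities, num_sample=2):
    """Return indices and margins of the num_sample rows with the smallest margin."""
    margins = []
    for row in probabilities:
        # Two largest probabilities in the row.
        top, second = sorted(row, reverse=True)[:2]
        margins.append(top - second)
    order = sorted(range(len(margins)), key=lambda i: margins[i])
    chosen = order[:num_sample]
    return chosen, [margins[i] for i in chosen]


probs = [[0.5, 0.3, 0.2], [0.4, 0.35, 0.25], [0.8, 0.1, 0.1]]
indices, values = margin_sample(probs, num_sample=2)
```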
nlpatl.sampling.uncertainty.mismatch
- class nlpatl.sampling.uncertainty.mismatch.MismatchSampling(name='mismatch_sampling')[source]
Bases: nlpatl.sampling.sampling.Sampling
Samples data points according to the mismatch. Picks N data points randomly.
- Parameters
name (str) – Name of this sampling
- sample(data1, data2, num_sample)[source]
- Parameters
x – Values that determine the sampling
num_sample (int) – Total number of samples for labeling
data1 (Union[List[str], List[int], List[float], np.ndarray]) –
data2 (Union[List[str], List[int], List[float], np.ndarray]) –
- Returns
Tuple of target indices and sampling values
- Return type
Tuple of numpy.ndarray, numpy.ndarray
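One plausible reading of mismatch sampling is sketched below: find the positions where two label sequences disagree, then draw up to num_sample of them at random. The function name, the seed argument and the choice to return the disagreeing pairs as the sampling values are all assumptions made for illustration, not NLPatl's implementation.

```python
import random


def mismatch_sample(data1, data2, num_sample, seed=None):
    """Randomly pick up to num_sample positions where data1 and data2 disagree."""
    mismatched = [i for i, (a, b) in enumerate(zip(data1, data2)) if a != b]
    rng = random.Random(seed)
    # Never ask for more positions than there are mismatches.
    chosen = rng.sample(mismatched, min(num_sample, len(mismatched)))
    return chosen, [(data1[i], data2[i]) for i in chosen]


first_labels = ["a", "b", "a", "c", "b"]
second_labels = ["a", "a", "a", "b", "b"]
indices, pairs = mismatch_sample(first_labels, second_labels, num_sample=5, seed=0)
```

Passing a seed makes the random draw reproducible, which is convenient when comparing annotation runs.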
Clustering Sampling
nlpatl.sampling.clustering.farthest
- class nlpatl.sampling.clustering.farthest.FarthestSampling(name='farthest_sampling')[source]
Bases: nlpatl.sampling.sampling.Sampling
Samples data points according to their distance from the cluster centroid. Picks the n farthest data points per cluster. Refer to http://zhaoshuyang.com/static/documents/MAL2.pdf
- Parameters
name (str) – Name of this sampling
- sample(data, groups, num_sample)[source]
- Parameters
x – Values that determine the sampling
num_sample (int) – Total number of samples for labeling
data (np.ndarray) –
groups (np.ndarray) –
- Returns
Tuple of target indices and sampling values
- Return type
Tuple of numpy.ndarray, numpy.ndarray
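The farthest-from-centroid idea can be sketched in plain Python. The code below is an illustration under assumptions, not NLPatl's implementation: the function name, the num_sample_per_cluster argument, the use of the per-cluster mean as the centroid, and Euclidean distance are all choices made for this example.

```python
import math
from collections import defaultdict


def farthest_sample(data, groups, num_sample_per_cluster=1):
    """Return indices and distances of the farthest points in each cluster."""
    members = defaultdict(list)
    for i, group in enumerate(groups):
        members[group].append(i)

    chosen, distances = [], []
    for group in sorted(members):
        idxs = members[group]
        dim = len(data[idxs[0]])
        # Centroid is the coordinate-wise mean of the cluster members.
        centroid = [sum(data[i][d] for i in idxs) / len(idxs) for d in range(dim)]
        dist = {i: math.dist(data[i], centroid) for i in idxs}
        ranked = sorted(idxs, key=lambda i: dist[i], reverse=True)
        for i in ranked[:num_sample_per_cluster]:
            chosen.append(i)
            distances.append(dist[i])
    return chosen, distances


data = [[0, 0], [1, 0], [5, 0], [10, 10], [10, 12], [10, 20]]
groups = [0, 0, 0, 1, 1, 1]
indices, values = farthest_sample(data, groups, num_sample_per_cluster=1)
```

Points far from their centroid are the ones the clustering represents least well, which is why they are treated as valuable candidates for annotation.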
See Module Index for API.