nlpatl.learning.mismatch_farthest_learning
- class nlpatl.learning.mismatch_farthest_learning.MismatchFarthestLearning(clustering_sampling, embeddings, clustering, classification, embeddings_type=None, embeddings_model_config=None, clustering_model_config=None, classification_model_config=None, multi_label=False, name='mismatch_farthest_learning')
Bases: nlpatl.learning.learning.Learning
Applies a modified mismatch-first farthest-traversal approach to select the most valuable data points for annotation. See http://zhaoshuyang.com/static/documents/MAL2.pdf for details. The procedure is:
1. [NLPatl] Convert raw data to features (Embeddings model)
2. [NLPatl] Train the model and cluster data points (Clustering model)
3. [NLPatl] Estimate the most valuable data points (Sampling)
4. [Human] Subject matter experts annotate the most valuable data points
5. [NLPatl] Train the classification model (Classification model)
6. [NLPatl] Classify unlabeled data points and compare the results with the clustering model output to find the farthest mismatched data points
7. [Human] Subject matter experts annotate the most valuable data points
8. Repeat steps 2 to 7 until enough data points are acquired or another exit criterion is met.
- Parameters
clustering_sampling (str or function) – Clustering sampling method for stage 1 exploration. Provide the name of a supported method (nearest_mean) or a custom function.
embeddings (str or nlpatl.models.embeddings.Embeddings) – Function for converting raw data to embeddings. Provide a model name that matches the embeddings type, for example multi-qa-MiniLM-L6-cos-v1 for sentence_transformers, bert-base-uncased for transformers, or vgg16 for torch_vision.
embeddings_model_config (dict) – Configuration for the embeddings model. Optional. Ignored if a custom embeddings class is used.
embeddings_type (str) – Type of embeddings: sentence_transformers for text, transformers for text, or torch_vision for images.
clustering (str or nlpatl.models.clustering.Clustering) – Function for clustering inputs. Provide either the name of a supported method (kmeans) or a custom function.
clustering_model_config (dict) – Configuration for the clustering model. Optional. Ignored if a custom clustering class is used.
classification (nlpatl.models.classification.Classification) – Function for classifying inputs. Provide either the name of a supported method (logistic_regression, svc, linear_svc, random_forest or xgboost) or a custom function.
classification_model_config (dict) – Configuration for the classification model. Optional. Ignored if a custom classification class is used.
multi_label (bool) – Indicates whether the classification model is multi-label, as opposed to multi-class (or binary). Default is False.
name (str) – Name of this learning.
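Steps 6 and 7 of the procedure can be illustrated with a small, self-contained sketch. This is not NLPatl's implementation. It assumes that a "mismatch" is a point whose clustering-derived label differs from the classifier's prediction, and that candidates are ranked by Euclidean distance to the points already annotated. The function names `farthest_mismatch` and `nearest_reference_distance` and all the data are hypothetical.

```python
import math


def nearest_reference_distance(point, reference_points):
    """Distance from point to its closest reference point (inf if there are none)."""
    if not reference_points:
        return math.inf
    return min(math.dist(point, ref) for ref in reference_points)


def farthest_mismatch(features, cluster_labels, predicted_labels, labeled_idx, num_sample):
    """Select up to num_sample mismatched points by farthest-first traversal.

    Assumption: a mismatch is an unlabeled point whose clustering-derived
    label disagrees with the classifier prediction.
    """
    labeled = set(labeled_idx)
    candidates = [
        i for i in range(len(features))
        if i not in labeled and cluster_labels[i] != predicted_labels[i]
    ]
    # Points already annotated act as the starting reference set.
    reference = [features[i] for i in labeled_idx]
    selected = []
    while candidates and len(selected) < num_sample:
        # Pick the candidate farthest from everything labeled or selected so far.
        best = max(
            candidates,
            key=lambda i: nearest_reference_distance(features[i], reference),
        )
        selected.append(best)
        reference.append(features[best])
        candidates.remove(best)
    return selected


# Toy data: five 2-D points, with point 0 already annotated.
features = [(0, 0), (1, 0), (5, 0), (9, 0), (10, 0)]
cluster_labels = ["a", "a", "b", "b", "b"]
predicted_labels = ["a", "b", "a", "b", "a"]

picked = farthest_mismatch(
    features, cluster_labels, predicted_labels, labeled_idx=[0], num_sample=2
)
print(picked)  # prints [4, 2]
```

Points 1, 2 and 4 are mismatches. Point 4 is farthest from the annotated point 0, so it is picked first. Point 2 is then farthest from both point 0 and point 4.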
- clear_learn_data()
Clear all learn data points
- educate(index, x, x_features, y)
Annotate a data point. Data points can only be annotated one at a time, not in batches.
- Parameters
index (int) – Index of data point.
x (string, int, float or np.ndarray) – Raw data input. It can be text, a number or a numpy array (for an image).
x_features (int, float or np.ndarray) – Data features
y (string, int, list of string (multi-label case) or list of int (multi-label case)) – Label of data point
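Because educate accepts a single data point per call, a batch of annotations has to be fed through a loop. The sketch below uses `ToyLearner`, a hypothetical stand-in that mirrors only the method names and argument order documented on this page (educate, get_learn_data, clear_learn_data). It is not the NLPatl class, its storage is an assumption, and the sample data is made up.

```python
class ToyLearner:
    """Hypothetical stand-in that mirrors the documented method names only."""

    def __init__(self):
        self.clear_learn_data()

    def educate(self, index, x, x_features, y):
        # One data point per call, matching the documented restriction.
        self._index.append(index)
        self._x.append(x)
        self._x_features.append(x_features)
        self._y.append(y)

    def get_learn_data(self):
        # Returned in the documented order: index, x, x_features, y.
        return self._index, self._x, self._x_features, self._y

    def clear_learn_data(self):
        self._index, self._x, self._x_features, self._y = [], [], [], []


learner = ToyLearner()

# A batch of annotations is applied by calling educate once per item.
batch = [
    (3, "cheap flights to Rome", [0.1, 0.9], "travel"),
    (7, "quarterly stock prices", [0.8, 0.2], "finance"),
]
for index, x, x_features, y in batch:
    learner.educate(index, x, x_features, y)

indices, xs, feats, ys = learner.get_learn_data()
print(indices, ys)
```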
- explore(x, return_type='dict', num_sample=10)
Estimate the most valuable data points for annotation.
- Parameters
x (list of string, int or float or np.ndarray) – Raw data inputs. It can be text, numbers or a numpy array (for images).
return_type (str) – Data type of the returned object. If dict is assigned, the returned object is a dict. Possible values are dict and object.
num_sample (int) – Maximum number of data points for annotation.
- Returns
The most valuable data points.
- Return type
nlpatl.dataset.Dataset objects or dict
- explore_educate_in_notebook(x, num_sample=5, num_sample_per_cluster=2, data_type='text')
Estimate the most valuable data points for annotation and annotate them in an IPython Notebook. Executes the explore function and the educate function sequentially.
- Parameters
x (list of string, int or float or np.ndarray) – Raw data inputs. It can be text, numbers or a numpy array (for images).
return_type (str) – Data type of the returned object. If dict is assigned, the returned object is a dict. Possible values are dict and object.
num_sample (int) – Maximum number of data points for annotation.
data_type (str) – Indicates the data format for displaying in an IPython Notebook. Possible values are text and image.
num_sample_per_cluster (int) –
- get_learn_data()
Get all learn data points
- Returns
Learnt data points
- Return type
Tuple of index (list of int), x (str or numpy.ndarray), x_features (numpy.ndarray) and y (numpy.ndarray)
- learn(x, y, include_learn_data=True)
Train the classification model.
- Parameters
x (list of string, int or float or np.ndarray) – Raw data inputs. It can be text, numbers or a numpy array.
y (bool) – Label of data inputs
include_learn_data (bool) – Whether to include human-annotated data and machine self-annotated data when training the model. Default is True.