nlpatl.learning.semi_supervised_learning

class nlpatl.learning.semi_supervised_learning.SemiSupervisedLearning(sampling, embeddings, classification, embeddings_type=None, embeddings_model_config=None, classification_model_config=None, multi_label=False, self_learn_threshold=0.9, name='semi_supervised_learning')

Bases: nlpatl.learning.learning.Learning

Applies both active learning and semi-supervised learning approaches to annotate the most valuable data points. For background, see https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0162075&type=printable . The procedure is:
1. [NLPatl] Convert raw data to features (Embeddings model)
2. [NLPatl] Train the model and classify data points (Classification model)
3. [NLPatl] Estimate the most valuable data points (Sampling)
4. [Human] Subject matter experts annotate the most valuable data points
5. [NLPatl] Retrain the classification model
6. [NLPatl] Classify unlabeled data points and automatically label those whose confidence is higher than self_learn_threshold
7. Repeat steps 2 to 6 until enough data points have been acquired.
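The self-learning step (step 6 above) can be sketched in plain Python. This is only an illustration of the thresholding idea, not NLPatl's implementation: the function name self_label and the list-of-lists probability format are assumptions made for this example.

```python
from typing import Dict, List, Sequence, Tuple


def self_label(
    probabilities: Sequence[Sequence[float]],
    threshold: float = 0.9,
) -> Tuple[Dict[int, int], List[int]]:
    """Split predictions into auto-labeled points and points left unlabeled.

    probabilities[i][c] is the predicted probability that point i is class c.
    (Hypothetical helper, not part of NLPatl.)
    """
    auto_labeled: Dict[int, int] = {}
    remaining: List[int] = []
    for i, row in enumerate(probabilities):
        # Index of the most probable class for this data point.
        best_class = max(range(len(row)), key=row.__getitem__)
        # Label automatically only when confidence is strictly above the threshold.
        if row[best_class] > threshold:
            auto_labeled[i] = best_class
        else:
            remaining.append(i)
    return auto_labeled, remaining


probs = [[0.95, 0.05], [0.60, 0.40], [0.08, 0.92]]
auto_labeled, remaining = self_label(probs, threshold=0.9)
print(auto_labeled)  # {0: 0, 2: 1}
print(remaining)     # [1]
```

Points that stay in remaining are the natural candidates for the sampling step in the next round.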
Parameters
  • sampling (str or function) – Sampling method for selecting the most valuable data points. Provide the name of a built-in method (most_confidence, entropy, least_confidence, margin, nearest_mean, fathest) or a custom function.

  • embeddings (str or nlpatl.models.embeddings.Embeddings) – Model for converting raw data to embeddings. Provide a model name that matches the embeddings type, for example multi-qa-MiniLM-L6-cos-v1 for sentence_transformers, bert-base-uncased for transformers, or vgg16 for torch_vision.

  • embeddings_model_config (dict) – Configuration for the embeddings model. Optional. Ignored if a custom embeddings class is used.

  • embeddings_type (str) – Type of embeddings: sentence_transformers for text, transformers for text, or torch_vision for images.

  • classification (str or nlpatl.models.classification.Classification) – Model for classifying inputs. Provide the name of a built-in method (logistic_regression, svc, linear_svc, random_forest or xgboost) or a custom classification object.

  • classification_model_config (dict) – Configuration for the classification model. Optional. Ignored if a custom classification class is used.

  • self_learn_threshold (float) – Minimum predicted probability for automatic labeling. A data point is labeled automatically if its probability is higher than this value. Default is 0.9.

  • name (str) – Name of this learning.

  • multi_label (bool) – Whether a data point can carry more than one label. Default is False.
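Three of the built-in sampling names (entropy, least_confidence, margin) match uncertainty measures that are common in the active learning literature. The sketch below shows those textbook definitions in plain Python. It is an assumption that NLPatl follows the same formulas; details such as the log base, sign or normalization may differ in the library.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (natural log); higher means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def least_confidence(probs: Sequence[float]) -> float:
    """One minus the top probability; higher means more uncertain."""
    return 1.0 - max(probs)


def margin(probs: Sequence[float]) -> float:
    """Gap between the two highest probabilities; smaller means more uncertain.

    Requires at least two classes.
    """
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


confident = [0.90, 0.05, 0.05]
uncertain = [0.40, 0.35, 0.25]
for name, score in (
    ("entropy", entropy),
    ("least_confidence", least_confidence),
    ("margin", margin),
):
    print(name, round(score(confident), 3), round(score(uncertain), 3))
```

Note that entropy and least_confidence grow with uncertainty while margin shrinks, so a ranking built on margin has to be sorted in the opposite direction.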

clear_learn_data()

Clear all learn data points

educate(index, x, x_features, y)

Annotate a data point. Data points can only be annotated one at a time, not in batches.

Parameters
  • index (int) – Index of data point.

  • x (string, int, float or np.ndarray) – Raw data input. It can be text, number or numpy (for image).

  • x_features (int, float or np.ndarray) – Data features

  • y (string, int, list of string (multi-label case) or list of int (multi-label case)) – Label of data point
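Because educate accepts a single data point per call, several annotations have to be submitted in a loop. The following is a minimal sketch of such a loop. educate_many is a hypothetical helper, and RecordingLearner is a stand-in object used only so the example runs without NLPatl installed; the one assumption about the real class is the educate(index, x, x_features, y) signature documented above.

```python
from typing import Any, List, Sequence, Tuple


def educate_many(
    learner: Any,
    indices: Sequence[int],
    xs: Sequence[Any],
    x_features: Sequence[Any],
    ys: Sequence[Any],
) -> None:
    """Call learner.educate once per data point (hypothetical helper)."""
    if not (len(indices) == len(xs) == len(x_features) == len(ys)):
        raise ValueError("indices, xs, x_features and ys must have the same length")
    for index, x, features, y in zip(indices, xs, x_features, ys):
        learner.educate(index, x, features, y)


class RecordingLearner:
    """Stand-in for a learning object; it only records the calls it receives."""

    def __init__(self) -> None:
        self.calls: List[Tuple[int, Any, Any, Any]] = []

    def educate(self, index: int, x: Any, x_features: Any, y: Any) -> None:
        self.calls.append((index, x, x_features, y))


learner = RecordingLearner()
educate_many(
    learner,
    indices=[3, 7],
    xs=["good movie", "bad plot"],
    x_features=[[0.1, 0.2], [0.3, 0.4]],
    ys=["pos", "neg"],
)
print(len(learner.calls))  # 2
```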

explore(x, return_type='dict', num_sample=10)

Estimate the most valuable data points for annotation.

Parameters
  • x (list of string, int or float or np.ndarray) – Raw data inputs. They can be text, numbers or NumPy arrays (for images).

  • return_type (str) – Type of the returned object. Possible values are dict and object. If dict is given, a dict is returned.

  • num_sample (int) – Maximum number of data points for annotation.

Returns

The most valuable data points.

Return type

nlpatl.dataset.Dataset objects or dict
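The role of num_sample as an upper bound can be illustrated with a small ranking sketch. This is not how explore is implemented; pick_most_valuable and the dictionary keys (indices, inputs, scores) are hypothetical names chosen for the example, and the scores are assumed to be "higher is more valuable".

```python
from typing import Any, Dict, List, Sequence


def pick_most_valuable(
    x: Sequence[Any],
    scores: Sequence[float],
    num_sample: int = 10,
) -> Dict[str, List[Any]]:
    """Return at most num_sample inputs, ordered from highest to lowest score."""
    # sorted() is stable, so tied scores keep their original order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = ranked[:num_sample]
    return {
        "indices": chosen,
        "inputs": [x[i] for i in chosen],
        "scores": [scores[i] for i in chosen],
    }


texts = ["a", "b", "c", "d"]
scores = [0.1, 0.9, 0.5, 0.7]
result = pick_most_valuable(texts, scores, num_sample=2)
print(result["indices"])  # [1, 3]
print(result["inputs"])   # ['b', 'd']
```

If fewer than num_sample inputs are available, the sketch simply returns all of them.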

explore_educate_in_notebook(x, num_sample=2, data_type='text')

Estimate the most valuable data points for annotation and annotate them in an IPython Notebook. Runs the explore and educate functions sequentially.

Parameters
  • x (list of string, int or float or np.ndarray) – Raw data inputs. They can be text, numbers or NumPy arrays (for images).

  • return_type (str) – Type of the returned object. Possible values are dict and object. If dict is given, a dict is returned.

  • num_sample (int) – Maximum number of data points for annotation.

  • data_type (str) – Data format used for display in the IPython Notebook. Possible values are text and image.

get_learn_data()

Get all learn data points

Returns

Learnt data points

Return type

Tuple of index (list of int), x (str or numpy.ndarray), x_features (numpy.ndarray) and y (numpy.ndarray)

get_self_learn_data()

Get all self learnt data points

Returns

Self learnt data points

Return type

Tuple of index (list of int), x (numpy.ndarray) and y (numpy.ndarray)

learn(x=None, y=None, include_learn_data=True)

Train the classification model.

Parameters
  • x (list of string, int or float or np.ndarray) – Raw data inputs. They can be text, numbers or NumPy arrays.

  • y (list of string or int) – Labels of the data inputs

  • include_learn_data (bool) – Whether to include human-annotated data and self-annotated (machine-labeled) data when training the model. Default is True.
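One way to read include_learn_data is that the inputs passed to learn are merged with previously annotated data before training. The sketch below shows that reading in plain Python. It is an assumption about the behavior, not the library's code: build_training_set is a hypothetical helper, and the (inputs, labels) pair format for the stored data is chosen only for the example.

```python
from typing import Any, List, Optional, Sequence, Tuple

Pairs = Tuple[Sequence[Any], Sequence[Any]]  # (inputs, labels)


def build_training_set(
    x: Optional[Sequence[Any]],
    y: Optional[Sequence[Any]],
    human_labeled: Pairs,
    self_labeled: Pairs,
    include_learn_data: bool = True,
) -> Tuple[List[Any], List[Any]]:
    """Merge new inputs with stored annotations (hypothetical helper)."""
    inputs = list(x) if x is not None else []
    labels = list(y) if y is not None else []
    if include_learn_data:
        # Append human-annotated data first, then machine self-annotated data.
        for extra_x, extra_y in (human_labeled, self_labeled):
            inputs.extend(extra_x)
            labels.extend(extra_y)
    return inputs, labels


human = (["h1"], ["neg"])
machine = (["s1"], ["pos"])
inputs, labels = build_training_set(["new"], ["pos"], human, machine)
print(inputs)  # ['new', 'h1', 's1']
print(labels)  # ['pos', 'neg', 'pos']
```

With x and y left as None and include_learn_data set to True, the sketch trains on the stored annotations alone, which is one plausible use of the default arguments.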