nltk.tag.crf module

A module for POS tagging using CRFSuite

class nltk.tag.crf.CRFTagger[source]

Bases: TaggerI

A module for POS tagging using CRFSuite https://pypi.python.org/pypi/python-crfsuite

>>> from nltk.tag import CRFTagger
>>> ct = CRFTagger()  
>>> train_data = [[('University','Noun'), ('is','Verb'), ('a','Det'), ('good','Adj'), ('place','Noun')],
... [('dog','Noun'),('eat','Verb'),('meat','Noun')]]
>>> ct.train(train_data,'model.crf.tagger')  
>>> ct.tag_sents([['dog','is','good'], ['Cat','eat','meat']])  
[[('dog', 'Noun'), ('is', 'Verb'), ('good', 'Adj')], [('Cat', 'Noun'), ('eat', 'Verb'), ('meat', 'Noun')]]
>>> gold_sentences = [[('dog','Noun'),('is','Verb'),('good','Adj')] , [('Cat','Noun'),('eat','Verb'), ('meat','Noun')]]
>>> ct.accuracy(gold_sentences)  
1.0

Setting learned model file

>>> ct = CRFTagger()
>>> ct.set_model_file('model.crf.tagger')
>>> ct.accuracy(gold_sentences)
1.0

__init__(feature_func=None, verbose=False, training_opt={})[source]

Initialize the CRFSuite tagger

Parameters:
  • feature_func – The function that extracts features for each token of a sentence. It should take two parameters, tokens and index, and return the list of features for the token at position index in the tokens list. See the built-in _get_features function for more detail; a sketch of a custom feature function is given after the option list below.

  • verbose (boolean) – whether to output debugging messages during training.

  • training_opt (dictionary) – python-crfsuite training options

Set of possible training options (using the LBFGS training algorithm):

'feature.minfreq':
  The minimum frequency of features.

'feature.possible_states':
  Force to generate possible state features.

'feature.possible_transitions':
  Force to generate possible transition features.

'c1':
  Coefficient for L1 regularization.

'c2':
  Coefficient for L2 regularization.

'max_iterations':
  The maximum number of iterations for L-BFGS optimization.

'num_memories':
  The number of limited memories for approximating the inverse Hessian matrix.

'epsilon':
  Epsilon for testing the convergence of the objective.

'period':
  The duration of iterations to test the stopping criterion.

'delta':
  The threshold for the stopping criterion; an L-BFGS iteration stops when the improvement of the log likelihood over the last ${period} iterations is no greater than this threshold.

'linesearch':
  The line search algorithm used in L-BFGS updates:

    • 'MoreThuente': More and Thuente's method,

    • 'Backtracking': Backtracking method with regular Wolfe condition,

    • 'StrongBacktracking': Backtracking method with strong Wolfe condition

'max_linesearch':
  The maximum number of trials for the line search algorithm.
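
As a concrete illustration, here is a minimal sketch that combines a custom feature function with a few of the options above. The feature function, the option values, and the model file name are illustrative choices only (the feature function is deliberately simpler than the built-in _get_features); the training data is the toy set from the example at the top of this page.

>>> from nltk.tag import CRFTagger
>>> def simple_features(tokens, idx):
...     # Return a list of string features for the token at position idx.
...     token = tokens[idx]
...     features = ['WORD_' + token, 'SUF_' + token[-2:]]
...     if token[0].isupper():
...         features.append('CAPITALIZED')
...     if any(ch.isdigit() for ch in token):
...         features.append('HAS_DIGIT')
...     return features
>>> opts = {'c1': 1.0, 'c2': 1e-3, 'max_iterations': 50, 'feature.minfreq': 1}
>>> ct = CRFTagger(feature_func=simple_features, training_opt=opts)
>>> train_data = [[('University','Noun'), ('is','Verb'), ('a','Det'), ('good','Adj'), ('place','Noun')],
... [('dog','Noun'),('eat','Verb'),('meat','Noun')]]
>>> ct.train(train_data, 'model.crf.tagger')
>>> tagged = ct.tag(['dog', 'is', 'good'])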

set_model_file(model_file)[source]
tag(tokens)[source]

Tag a sentence using the Python CRFSuite tagger. Note that before using this function, the user should specify the model file either by

  • training a new model using the train function, or

  • loading a pre-trained model via the set_model_file function

Parameters:

  tokens – the list of tokens to be tagged.

Returns:

list of tagged tokens.

Return type:

list(tuple(str,str))
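
For example, assuming a model has already been trained and saved under the illustrative file name used above:

>>> from nltk.tag import CRFTagger
>>> ct = CRFTagger()
>>> ct.set_model_file('model.crf.tagger')
>>> tagged = ct.tag(['dog', 'is', 'good'])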

tag_sents(sents)[source]

Tag a list of sentences. Note that before using this function, the user should specify the model file either by

  • training a new model using the train function, or

  • loading a pre-trained model via the set_model_file function

Parameters:

  sents – the list of sentences to be tagged.

Returns:

list of tagged sentences.

Return type:

list(list(tuple(str,str)))

train(train_data, model_file)[source]

Train the CRF tagger using CRFSuite.

Parameters:

  • train_data (list(list(tuple(str,str)))) – the list of annotated sentences.

  • model_file – the model will be saved to this file.