nltk.corpus.reader.crubadan module
An NLTK interface for the n-gram statistics gathered from the corpora for each language using An Crubadan.
The data has several potential applications, but this reader was created for use in language identification.
For details about An Crubadan, this data, and its potential uses, see: http://borel.slu.edu/crubadan/index.html
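As a rough illustration of how per-language n-gram statistics can drive language identification, the sketch below scores a text sample against character-trigram profiles. The helper names (`char_trigrams`, `score`, `guess_language`) and the `toy_profiles` counts are hypothetical and made up for the example; they are not real An Crubadan data, not part of this module, and not necessarily the scoring method used by any NLTK classifier.

```python
from collections import Counter


def char_trigrams(text):
    """Count overlapping character trigrams in lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def score(sample_counts, profile):
    """Sum the profile counts of each trigram in the sample, weighted by
    how often that trigram occurs in the sample."""
    return sum(n * profile.get(gram, 0) for gram, n in sample_counts.items())


def guess_language(text, profiles):
    """Return the key of the profile with the highest score for the text."""
    sample = char_trigrams(text)
    return max(profiles, key=lambda lang: score(sample, profiles[lang]))


# Toy profiles with invented counts (assumption: NOT real An Crubadan data).
toy_profiles = {
    "en": {"the": 50, " th": 45, "he ": 40, "ing": 30},
    "de": {"der": 50, " de": 45, "er ": 40, "sch": 30},
}

print(guess_language("the thing", toy_profiles))
print(guess_language("der schuh", toy_profiles))
```

In practice the profiles would come from the n-gram files the reader exposes rather than from hand-written dictionaries, and a more robust distance measure would likely replace the simple weighted sum used here.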
- class nltk.corpus.reader.crubadan.CrubadanCorpusReader
Bases: CorpusReader
A corpus reader used to access the An Crubadan n-gram files for each language.
- __init__(root, fileids, encoding='utf8', tagset=None)
- Parameters:
root (PathPointer or str) – A path pointer identifying the root directory for this corpus. If a string is specified, it will be converted to a PathPointer automatically.
fileids – A list of the files that make up this corpus. This list can be specified either explicitly, as a list of strings, or implicitly, as a regular expression over file paths. The absolute path for each file will be constructed by joining the reader’s root to each file name.
encoding – The default unicode encoding for the files that make up the corpus. The value of encoding can be any of the following:
- A string: encoding is the encoding name for all files.
- A dictionary: encoding[file_id] is the encoding name for the file whose identifier is file_id. If file_id is not in encoding, then the file contents will be processed using non-unicode byte strings.
- A list: encoding should be a list of (regexp, encoding) tuples. The encoding for a file whose identifier is file_id will be the encoding value for the first tuple whose regexp matches the file_id. If no tuple’s regexp matches the file_id, the file contents will be processed using non-unicode byte strings.
- None: the file contents of all files will be processed using non-unicode byte strings.
tagset – The name of the tagset used by this corpus, to be used for normalizing or converting the POS tags returned by the tagged_...() methods.
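The four accepted forms of the encoding parameter can be easier to follow as code. The sketch below is a hypothetical stand-alone helper, `resolve_encoding`, that mirrors the rules described above; it is not NLTK's implementation. It assumes that "matches" means a match anchored at the start of the file identifier (`re.match`), and the file identifiers used are invented for the example.

```python
import re


def resolve_encoding(encoding, file_id):
    """Return the encoding name to use for file_id, or None when the
    file should be processed as non-unicode byte strings."""
    # A string applies to every file; None means raw bytes for every file.
    if encoding is None or isinstance(encoding, str):
        return encoding
    # A dictionary maps file identifiers to encoding names; a missing
    # identifier falls back to raw bytes.
    if isinstance(encoding, dict):
        return encoding.get(file_id)
    # Otherwise treat it as a list of (regexp, encoding) tuples; the
    # first tuple whose regexp matches wins.
    for regexp, name in encoding:
        if re.match(regexp, file_id):  # assumption: anchored at the start
            return name
    return None


# Hypothetical file identifiers, for illustration only.
print(resolve_encoding("utf8", "a.txt"))
print(resolve_encoding({"a.txt": "latin-1"}, "b.txt"))
print(resolve_encoding([(r".*\.txt", "utf8"), (r".*", "ascii")], "a.txt"))
```

Because the list form is checked in order, more specific patterns should appear before catch-all patterns such as `.*`.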