Utils¶
Algorithms¶
- diaparser.utils.alg.kmeans(x, k, max_it=32)[source]¶
KMeans algorithm for clustering the sentences by length.
- Parameters
x (list[int]) – The list of sentence lengths to be clustered.
k (int) – The number of clusters.
max_it (int) – Maximum number of iterations. Default: 32.
- Returns
The first list contains average lengths of sentences in each cluster. The second is the list of clusters holding indices of data points.
- Return type
list[float], list[list[int]]
Examples
>>> x = torch.randint(10, 20, (10,)).tolist()
>>> x
[15, 10, 17, 11, 18, 13, 17, 19, 18, 14]
>>> centroids, clusters = kmeans(x, 3)
>>> centroids
[10.5, 14.0, 17.799999237060547]
>>> clusters
[[1, 3], [0, 5, 9], [2, 4, 6, 7, 8]]
- diaparser.utils.alg.tarjan(sequence)[source]¶
Tarjan algorithm for finding Strongly Connected Components (SCCs) of a graph.
- Parameters
sequence (list) – List of head indices.
- Yields
A list of indices that make up an SCC. All self-loops are ignored.
Examples
>>> next(tarjan([2, 5, 0, 3, 1]))  # (1 -> 5 -> 2 -> 1) is a cycle
[2, 5, 1]
- diaparser.utils.alg.chuliu_edmonds(s)[source]¶
ChuLiu/Edmonds algorithm for non-projective decoding.
Some code is borrowed from tdozat’s implementation. Descriptions of notations and formulas can be found in Non-projective Dependency Parsing using Spanning Tree Algorithms.
Notes
The algorithm is not guaranteed to produce a single-root tree.
References
Ryan McDonald, Fernando Pereira, Kiril Ribarov and Jan Hajic. 2005. Non-projective Dependency Parsing using Spanning Tree Algorithms.
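Examples
The following is an illustrative sketch, not taken from the original docs: it reuses the score matrix of the mst() example below and assumes chuliu_edmonds() decodes a single [seq_len, seq_len] matrix of dependent-head scores, masking the pseudo-root row and self-loops internally.
>>> # row i holds the head scores for token i; column 0 is the pseudo-root
>>> s = torch.tensor([[-11.9436, -13.1464,  -6.4789, -13.8917],
                      [-60.6957, -60.2866, -48.6457, -63.8125],
                      [-38.1747, -49.9296, -45.2733, -49.5571],
                      [-19.7504, -23.9066,  -9.9139, -16.2088]])
>>> chuliu_edmonds(s)  # expected to agree with the mst() example below
tensor([0, 2, 0, 2])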
- diaparser.utils.alg.mst(scores, mask, multiroot=False)[source]¶
MST algorithm for decoding non-projective trees. This is a wrapper for the ChuLiu/Edmonds algorithm.
The algorithm first runs ChuLiu/Edmonds to parse a tree and then checks for multiple roots. If multiroot=False and multiple roots do exist, the algorithm seeks the best single-root tree by iterating over all possible single-root trees parsed by ChuLiu/Edmonds. Otherwise, the resulting trees are taken directly as the final outputs.
- Parameters
scores (Tensor) – [batch_size, seq_len, seq_len]. Scores of all dependent-head pairs.
mask (BoolTensor) – [batch_size, seq_len]. The mask to avoid parsing over padding tokens. The first column, serving as pseudo words for roots, should be False.
multiroot (bool) – If False, ensures the parsed tree has a single root. Default: False.
- Returns
A tensor with shape [batch_size, seq_len] for the resulting non-projective parse trees.
- Return type
Tensor
Examples
>>> scores = torch.tensor([[[-11.9436, -13.1464,  -6.4789, -13.8917],
                            [-60.6957, -60.2866, -48.6457, -63.8125],
                            [-38.1747, -49.9296, -45.2733, -49.5571],
                            [-19.7504, -23.9066,  -9.9139, -16.2088]]])
>>> scores[:, 0, 1:] = float('-inf')
>>> scores.diagonal(0, 1, 2)[1:].fill_(float('-inf'))
>>> mask = torch.tensor([[False, True, True, True]])
>>> mst(scores, mask)
tensor([[0, 2, 0, 2]])
- diaparser.utils.alg.eisner(scores, mask)[source]¶
First-order Eisner algorithm for projective decoding.
References
Ryan McDonald, Koby Crammer and Fernando Pereira. 2005. Online Large-Margin Training of Dependency Parsers.
- Parameters
scores (Tensor) – [batch_size, seq_len, seq_len]. Scores of all dependent-head pairs.
mask (BoolTensor) – [batch_size, seq_len]. The mask to avoid parsing over padding tokens. The first column, serving as pseudo words for roots, should be False.
- Returns
A tensor with shape [batch_size, seq_len] for the resulting projective parse trees.
- Return type
Tensor
Examples
>>> scores = torch.tensor([[[-13.5026, -18.3700, -13.0033, -16.6809],
                            [-36.5235, -28.6344, -28.4696, -31.6750],
                            [ -2.9084,  -7.4825,  -1.4861,  -6.8709],
                            [-29.4880, -27.6905, -26.1498, -27.0233]]])
>>> mask = torch.tensor([[False, True, True, True]])
>>> eisner(scores, mask)
tensor([[0, 2, 0, 2]])
- diaparser.utils.alg.eisner2o(scores, mask)[source]¶
Second-order Eisner algorithm for projective decoding. This is an extension of the first-order one that further incorporates sibling scores into tree scoring.
References
Ryan McDonald and Fernando Pereira. 2006. Online Learning of Approximate Dependency Parsing Algorithms.
- Parameters
scores (Tensor, Tensor) – A tuple of two tensors representing the first-order and second-order scores respectively. The first ([batch_size, seq_len, seq_len]) holds the scores of all dependent-head pairs. The second ([batch_size, seq_len, seq_len, seq_len]) holds the scores of all dependent-head-sibling triples.
mask (BoolTensor) – [batch_size, seq_len]. The mask to avoid parsing over padding tokens. The first column, serving as pseudo words for roots, should be False.
- Returns
A tensor with shape [batch_size, seq_len] for the resulting projective parse trees.
- Return type
Tensor
Examples
>>> s_arc = torch.tensor([[[ -2.8092,  -7.9104,  -0.9414,  -5.4360],
                           [-10.3494,  -7.9298,  -3.6929,  -7.3985],
                           [  1.1815,  -3.8291,   2.3166,  -2.7183],
                           [ -3.9776,  -3.9063,  -1.6762,  -3.1861]]])
>>> s_sib = torch.tensor([[[[ 0.4719,  0.4154,  1.1333,  0.6946],
                            [ 1.1252,  1.3043,  2.1128,  1.4621],
                            [ 0.5974,  0.5635,  1.0115,  0.7550],
                            [ 1.1174,  1.3794,  2.2567,  1.4043]],
                           [[-2.1480, -4.1830, -2.5519, -1.8020],
                            [-1.2496, -1.7859, -0.0665, -0.4938],
                            [-2.6171, -4.0142, -2.9428, -2.2121],
                            [-0.5166, -1.0925,  0.5190,  0.1371]],
                           [[ 0.5827, -1.2499, -0.0648, -0.0497],
                            [ 1.4695,  0.3522,  1.5614,  1.0236],
                            [ 0.4647, -0.7996, -0.3801,  0.0046],
                            [ 1.5611,  0.3875,  1.8285,  1.0766]],
                           [[-1.3053, -2.9423, -1.5779, -1.2142],
                            [-0.1908, -0.9699,  0.3085,  0.1061],
                            [-1.6783, -2.8199, -1.8853, -1.5653],
                            [ 0.3629, -0.3488,  0.9011,  0.5674]]]])
>>> mask = torch.tensor([[False, True, True, True]])
>>> eisner2o((s_arc, s_sib), mask)
tensor([[0, 2, 0, 2]])
- diaparser.utils.alg.cky(scores, mask)[source]¶
The Cocke-Kasami-Younger (CKY) algorithm for parsing constituency trees.
References
Yu Zhang, Houquan Zhou and Zhenghua Li. 2020. Fast and Accurate Neural CRF Constituency Parsing.
- Parameters
scores (Tensor) – [batch_size, seq_len, seq_len]. Scores of all candidate constituents.
mask (BoolTensor) – [batch_size, seq_len, seq_len]. The mask to avoid parsing over padding tokens. For each square matrix in a batch, all positions except the upper triangular part should be masked out.
- Returns
Sequences of factorized predicted bracketed trees that are traversed in pre-order.
Examples
>>> scores = torch.tensor([[[ 2.5659,  1.4253, -2.5272,  3.3011],
                            [ 1.3687, -0.5869,  1.0011,  3.3020],
                            [ 1.2297,  0.4862,  1.1975,  2.5387],
                            [-0.0511, -1.2541, -0.7577,  0.2659]]])
>>> mask = torch.tensor([[[False,  True,  True,  True],
                          [False, False,  True,  True],
                          [False, False, False,  True],
                          [False, False, False, False]]])
>>> cky(scores, mask)
[[(0, 3), (0, 1), (1, 3), (1, 2), (2, 3)]]
Dataset¶
- class diaparser.utils.Dataset(transform, data, **kwargs)[source]¶
Dataset compatible with torch.utils.data.Dataset. This serves as a wrapper for manipulating all data fields with the operating behaviours defined in Transform. The data fields of all the instantiated sentences can be accessed as attributes of the dataset.
- Parameters
transform (Transform) – An instance of Transform or one of its derivations. The instance holds a series of loading and processing behaviours with regard to the specific data format.
data (list[list] or str) – A list of lists of strings or a filename. This will be passed into transform.load().
kwargs (dict) – Keyword arguments that will be passed into transform.load() together with data to control the loading behaviour.
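Examples
A hypothetical usage sketch (the field name and the file path are placeholders, not from the original docs):
>>> WORD = Field('words', pad='<pad>', unk='<unk>', bos='<bos>', lower=True)
>>> transform = CoNLL(FORM=WORD)
>>> dataset = Dataset(transform, 'train.conllx')  # any input loadable by transform.load()
>>> len(dataset)    # number of instantiated sentences
>>> dataset.words   # values of the 'words' field, accessed as a dataset attribute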
Field¶
- class diaparser.utils.RawField(name, fn=None)[source]¶
Defines a general datatype.
A RawField object does not assume any property of the datatype and it holds parameters relating to how the datatype should be processed.
- Parameters
name (str) – The name of the field.
fn (function) – The function used for preprocessing the examples. Default: None.
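Examples
A minimal sketch (assuming, as described above, that fn is applied when preprocessing an example; the field name is arbitrary):
>>> field = RawField('text', fn=str.lower)  # fn preprocesses each example
>>> field.preprocess('Some RAW text')
'some raw text'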
- class diaparser.utils.Field(name, pad=None, unk=None, bos=None, eos=None, lower=False, use_vocab=True, tokenize=None, fn=None, mask_token_id=0)[source]¶
Defines a datatype together with instructions for converting it to a Tensor. Field models common text processing datatypes that can be represented by tensors. It holds a Vocab object that defines the set of possible values for elements of the field and their corresponding numerical representations. The Field object also holds other parameters relating to how a datatype should be numericalized, such as a tokenization method.
- Parameters
name (str) – The name of the field.
pad_token (str) – The string token used as padding. Default: None.
unk_token (str) – The string token used to represent OOV words. Default: None.
bos_token (str) – A token that will be prepended to every example using this field, or None for no bos_token. Default: None.
eos_token (str) – A token that will be appended to every example using this field, or None for no eos_token. Default: None.
lower (bool) – Whether to lowercase the text in this field. Default: False.
use_vocab (bool) – Whether to use a Vocab object. If False, the data in this field should already be numerical. Default: True.
tokenize (function) – The function used to tokenize strings using this field into sequential examples. Default: None.
fn (function) – The function used for preprocessing the examples. Default: None.
- preprocess(sequence)[source]¶
Loads a single example using this field, tokenizing if necessary. The sequence is first passed to fn if available. If tokenize is not None, the input is tokenized. Finally, the input is optionally lowercased.
- Parameters
sequence (list) – The sequence to be preprocessed.
- Returns
The preprocessed sequence as a list.
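Examples
A minimal sketch of the pipeline above, using plain str.split as a stand-in tokenizer (not part of the original docs):
>>> field = Field('words', lower=True, tokenize=str.split)
>>> field.preprocess('She enjoys PLAYING Tennis')  # tokenize, then lowercase
['she', 'enjoys', 'playing', 'tennis']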
- build(dataset, min_freq=1, embed=None)[source]¶
Constructs a Vocab object for this field from the dataset. If the vocabulary already exists, this function has no effect.
- Parameters
dataset (Dataset) – A Dataset object. One of its attributes should be named after this field.
min_freq (int) – The minimum frequency needed to include a token in the vocabulary. Default: 1.
embed (Embedding) – An Embedding object whose words will be added to the vocabulary. Default: None.
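Examples
A hedged sketch, where train_data is a hypothetical Dataset whose sentences expose an attribute named words:
>>> WORD = Field('words', pad='<pad>', unk='<unk>')
>>> WORD.build(train_data, min_freq=2)  # keep only tokens seen at least twice
>>> len(WORD.vocab)                     # vocabulary size, specials included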
- class diaparser.utils.SubwordField(*args, fix_len=0, **kwargs)[source]¶
A field that conducts tokenization and numericalization over each token rather than over the whole sequence.
This is customized for models requiring character/subword-level inputs, e.g., CharLSTM and BERT.
- Parameters
fix_len (int) – A fixed length that all subword pieces will be padded to. This is also used for truncating subword pieces that exceed this length. To save memory, the final length will be the smaller of the max length of subword pieces in a batch and fix_len.
Examples
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
>>> field = SubwordField('bert',
                         pad=tokenizer.pad_token,
                         unk=tokenizer.unk_token,
                         bos=tokenizer.cls_token,
                         eos=tokenizer.sep_token,
                         fix_len=20,
                         tokenize=tokenizer.tokenize)
>>> field.vocab = tokenizer.get_vocab()  # no need to re-build the vocab
>>> field.transform([['This', 'field', 'performs', 'token-level', 'tokenization']])[0]
tensor([[  101,     0,     0],
        [ 1188,     0,     0],
        [ 1768,     0,     0],
        [10383,     0,     0],
        [22559,   118,  1634],
        [22559,  2734,     0],
        [  102,     0,     0]])
- build(dataset, min_freq=1, embed=None)[source]¶
Constructs a Vocab object for this field from the dataset. If the vocabulary already exists, this function has no effect.
- Parameters
dataset (Dataset) – A Dataset object. One of its attributes should be named after this field.
min_freq (int) – The minimum frequency needed to include a token in the vocabulary. Default: 1.
embed (Embedding) – An Embedding object whose words will be added to the vocabulary. Default: None.
- class diaparser.utils.BertField(name, tokenizer, **kwargs)[source]¶
A field whose tokenization is handled by a transformer tokenizer.
- Parameters
name (str) – The name of the field.
tokenizer (AutoTokenizer) – The tokenizer for the transformer.
fix_len (int) – A fixed length that all subword pieces will be padded to. This is also used for truncating subword pieces that exceed this length. To save memory, the final length will be the smaller of the max length of subword pieces in a batch and fix_len.
Examples
>>> tokenizer = BertField.tokenizer('bert-base-cased')
>>> field = BertField('bert', tokenizer, fix_len=20)
>>> field.transform([['This', 'field', 'performs', 'token-level', 'tokenization']])[0]
tensor([[  101,     0,     0],
        [ 1188,     0,     0],
        [ 1768,     0,     0],
        [10383,     0,     0],
        [22559,   118,  1634],
        [22559,  2734,     0],
        [  102,     0,     0]])
- class diaparser.utils.ChartField(name, pad=None, unk=None, bos=None, eos=None, lower=False, use_vocab=True, tokenize=None, fn=None, mask_token_id=0)[source]¶
Field dealing with constituency trees.
This field receives sequences of binarized trees factorized in pre-order, and returns charts filled with labels on each constituent.
Examples
>>> sequence = [(0, 5, 'S'), (0, 4, 'S|<>'), (0, 1, 'NP'), (1, 4, 'VP'), (1, 2, 'VP|<>'),
                (2, 4, 'S+VP'), (2, 3, 'VP|<>'), (3, 4, 'NP'), (4, 5, 'S|<>')]
>>> field.transform([sequence])[0]
tensor([[ -1,  37,  -1,  -1, 107,  79],
        [ -1,  -1, 120,  -1, 112,  -1],
        [ -1,  -1,  -1, 120,  86,  -1],
        [ -1,  -1,  -1,  -1,  37,  -1],
        [ -1,  -1,  -1,  -1,  -1, 107],
        [ -1,  -1,  -1,  -1,  -1,  -1]])
- build(dataset, min_freq=1)[source]¶
Constructs a Vocab object for this field from the dataset. If the vocabulary already exists, this function has no effect.
- Parameters
dataset (Dataset) – A Dataset object. One of its attributes should be named after this field.
min_freq (int) – The minimum frequency needed to include a token in the vocabulary. Default: 1.
Transform¶
- class diaparser.utils.Transform[source]¶
A Transform object corresponds to a specific data format. It holds several instances of data fields that provide instructions for preprocessing and numericalizing, etc.
- training¶
Sets the object in training mode. If False, some data fields not required for predictions won't be returned. Default: True.
- Type
bool
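Examples
A brief sketch using CoNLL, a concrete Transform defined below (the bound fields are placeholders, not from the original docs):
>>> transform = CoNLL(FORM=Field('words'), HEAD=Field('heads', use_vocab=False))
>>> transform.training          # True by default
True
>>> transform.training = False  # at prediction time, gold-only fields are not returned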
- class diaparser.utils.CoNLL(ID=None, FORM=None, LEMMA=None, CPOS=None, POS=None, FEATS=None, HEAD=None, DEPREL=None, PHEAD=None, PDEPREL=None, reader=<built-in function open>)[source]¶
The CoNLL object holds the ten fields required by the CoNLL-X data format. Each field can be bound to one or more Field objects. For example, FORM can contain both a Field and a SubwordField to produce tensors for words and subwords.
- ID¶
Token counter, starting at 1.
- FORM¶
Words in the sentence.
- LEMMA¶
Lemmas or stems (depending on the particular treebank) of words, or underscores if not available.
- CPOS¶
Coarse-grained part-of-speech tags, where the tagset depends on the treebank.
- POS¶
Fine-grained part-of-speech tags, where the tagset depends on the treebank.
- FEATS¶
Unordered set of syntactic and/or morphological features (depending on the particular treebank), or underscores if not available.
- HEAD¶
Heads of the tokens, which are either values of ID or zeros.
- DEPREL¶
Dependency relations to the HEAD.
- PHEAD¶
Projective heads of tokens, which are either values of ID or zeros, or underscores if not available.
- PDEPREL¶
Dependency relations to the PHEAD, or underscores if not available.
References
Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X Shared Task on Multilingual Dependency Parsing.
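Examples
A hedged sketch of the field binding described above, where FORM is bound to both a word-level Field and a character-level SubwordField (field names are placeholders):
>>> WORD = Field('words', pad='<pad>', unk='<unk>', bos='<bos>', lower=True)
>>> CHAR = SubwordField('chars', pad='<pad>', unk='<unk>', bos='<bos>', fix_len=20)
>>> transform = CoNLL(FORM=(WORD, CHAR), HEAD=Field('heads', use_vocab=False))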
- classmethod toconll(tokens)[source]¶
Converts a list of tokens to a string in CoNLL-X format. Missing fields are filled with underscores.
- Parameters
tokens (list[str] or list[tuple]) – This can be either a list of words or word/pos pairs.
- Returns
A string in CoNLL-X format.
Examples
>>> print(CoNLL.toconll(['She', 'enjoys', 'playing', 'tennis', '.']))
1       She     _       _       _       _       _       _       _       _
2       enjoys  _       _       _       _       _       _       _       _
3       playing _       _       _       _       _       _       _       _
4       tennis  _       _       _       _       _       _       _       _
5       .       _       _       _       _       _       _       _       _
- classmethod isprojective(sequence)[source]¶
Checks if a dependency tree is projective. This also works for partial annotation.
Besides the obvious crossing arcs, the examples below illustrate two non-projective cases which are hard to detect in the scenario of partial annotation.
- Parameters
sequence (list[int]) – List of head indices.
- Returns
True if the tree is projective, False otherwise.
Examples
>>> CoNLL.isprojective([2, -1, 1])  # -1 denotes un-annotated cases
False
>>> CoNLL.isprojective([3, -1, 2])
False
- classmethod istree(sequence, proj=False, multiroot=False)[source]¶
Checks if the arcs form a valid dependency tree.
- Parameters
sequence (list[int]) – List of head indices.
proj (bool) – If True, requires the tree to be projective. Default: False.
multiroot (bool) – If True, allows the tree to have multiple roots. Default: False.
- Returns
True if the arcs form a valid tree, False otherwise.
Examples
>>> CoNLL.istree([3, 0, 0, 3], multiroot=True)
True
>>> CoNLL.istree([3, 0, 0, 3], proj=True)
False
- class diaparser.utils.Tree(WORD=None, POS=None, TREE=None, CHART=None)[source]¶
The Tree object factorizes a constituency tree into four fields, each associated with one or more Field objects.
- WORD¶
Words in the sentence.
- POS¶
Part-of-speech tags, or underscores if not available.
- TREE¶
The raw constituency tree in nltk.tree.Tree format.
- CHART¶
The factorized sequence of binarized tree traversed in pre-order.
- classmethod totree(tokens, root='')[source]¶
Converts a list of tokens to a nltk.tree.Tree. Missing fields are filled with underscores.
- Parameters
tokens (list[str] or list[tuple]) – This can be either a list of words or word/pos pairs.
root (str) – The root label of the tree. Default: ''.
- Returns
A nltk.tree.Tree object.
Examples
>>> print(Tree.totree(['She', 'enjoys', 'playing', 'tennis', '.'], 'TOP'))
(TOP (_ She) (_ enjoys) (_ playing) (_ tennis) (_ .))
- classmethod binarize(tree)[source]¶
Conducts binarization over the tree.
First, the tree is transformed to satisfy the Chomsky Normal Form (CNF). Here we call chomsky_normal_form() to conduct left-binarization. Second, all unary productions in the tree are collapsed.
- Parameters
tree (nltk.tree.Tree) – The tree to be binarized.
- Returns
The binarized tree.
Examples
>>> tree = nltk.Tree.fromstring('''
... (TOP (S (NP (_ She)) (VP (_ enjoys) (S (VP (_ playing) (NP (_ tennis))))) (_ .)))
... ''')
>>> print(Tree.binarize(tree))
(TOP (S (S|<> (NP (_ She)) (VP (VP|<> (_ enjoys)) (S+VP (VP|<> (_ playing)) (NP (_ tennis))))) (S|<> (_ .))))
- classmethod factorize(tree, delete_labels=None, equal_labels=None)[source]¶
Factorizes the tree into a sequence. The tree is traversed in pre-order.
- Parameters
tree (nltk.tree.Tree) – The tree to be factorized.
delete_labels (set[str]) – A set of labels to be ignored. This is used for evaluation. If it is a pre-terminal label, delete the word along with the brackets. If it is a non-terminal label, just delete the brackets (don't delete its children). In EVALB, the default set is: {'TOP', 'S1', '-NONE-', ',', ':', '``', "''", '.', '?', '!', ''}. Default: None.
equal_labels (dict[str, str]) – The key-value pairs in the dict are considered equivalent (non-directional). This is used for evaluation. The default dict defined in EVALB is: {'ADVP': 'PRT'}. Default: None.
- Returns
The sequence of the factorized tree.
Examples
>>> tree = nltk.Tree.fromstring('''
... (TOP (S (NP (_ She)) (VP (_ enjoys) (S (VP (_ playing) (NP (_ tennis))))) (_ .)))
... ''')
>>> Tree.factorize(tree)
[(0, 5, 'TOP'), (0, 5, 'S'), (0, 1, 'NP'), (1, 4, 'VP'), (2, 4, 'S'), (2, 4, 'VP'), (3, 4, 'NP')]
>>> Tree.factorize(tree, delete_labels={'TOP', 'S1', '-NONE-', ',', ':', '``', "''", '.', '?', '!', ''})
[(0, 5, 'S'), (0, 1, 'NP'), (1, 4, 'VP'), (2, 4, 'S'), (2, 4, 'VP'), (3, 4, 'NP')]
- classmethod build(tree, sequence)[source]¶
Builds a constituency tree from the sequence. The sequence is generated in pre-order. During building, the sequence is de-binarized to the original format (i.e., the |<> suffixes are ignored and the collapsed labels are recovered).
- Parameters
tree (nltk.tree.Tree) – An empty tree that provides a base for building a result tree.
sequence (list[tuple]) – A list of tuples used for generating a tree. Each tuple consists of the indices of the left/right span boundaries and the label of the span.
- Returns
A result constituency tree.
Examples
>>> tree = Tree.totree(['She', 'enjoys', 'playing', 'tennis', '.'], 'TOP')
>>> sequence = [(0, 5, 'S'), (0, 4, 'S|<>'), (0, 1, 'NP'), (1, 4, 'VP'), (1, 2, 'VP|<>'),
                (2, 4, 'S+VP'), (2, 3, 'VP|<>'), (3, 4, 'NP'), (4, 5, 'S|<>')]
>>> print(Tree.build(tree, sequence))
(TOP (S (NP (_ She)) (VP (_ enjoys) (S (VP (_ playing) (NP (_ tennis))))) (_ .)))
Vocab¶
- class diaparser.utils.Vocab(counter, min_freq=1, specials=[], unk_index=0)[source]¶
Defines a vocabulary object that will be used to numericalize a field.
- Parameters
counter (Counter) – A Counter object holding the frequencies of each value found in the data.
min_freq (int) – The minimum frequency needed to include a token in the vocabulary. Default: 1.
specials (list[str]) – The list of special tokens (e.g., pad, unk, bos and eos) that will be prepended to the vocabulary. Default: [].
unk_index (int) – The index of the unk token. Default: 0.
- itos¶
A list of token strings indexed by their numerical identifiers.
- stoi¶
A defaultdict object mapping token strings to numerical identifiers.
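Examples
A minimal sketch built from the attributes above (the ordering of non-special tokens after the specials is assumed to follow the data):
>>> from collections import Counter
>>> vocab = Vocab(Counter(['apple', 'apple', 'banana']), specials=['<pad>', '<unk>'], unk_index=1)
>>> vocab.itos[:2]         # specials are prepended
['<pad>', '<unk>']
>>> vocab.stoi['apple']
2
>>> vocab.stoi['cherry']   # OOV tokens map to unk_index
1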