km_tokeniser.py

Korpus Malti tokeniser.

class malti.tokeniser.km_tokeniser.km_tokeniser.KMTokeniser

Bases: RegexTokeniser

The tokeniser used by the MLRS Korpus Malti corpus. Taken from https://github.com/UMSpeech/MASRI/blob/main/masri/tokenise/tokenise.py

__init__() → None

Constructor.

Return type:

None

tokenise(text: str) → list[str]

Tokenise a text into a list of tokens.

Parameters:

text (str) – The text to tokenise.

Returns:

The list of tokens.

Return type:

list[str]
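As a rough illustration of what a regex-based tokenise method does, here is a minimal, self-contained sketch. It does not use KMTokeniser: the pattern and the function name sketch_tokenise are hypothetical stand-ins, and the real Korpus Malti rules (for example, how articles such as "Il-" are handled) will most likely split text differently.

```python
import re

# Hypothetical pattern for illustration only; this is NOT the pattern
# used by KMTokeniser. It matches either a run of word characters or a
# single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def sketch_tokenise(text: str) -> list[str]:
    """Return every non-overlapping match of the pattern, in order."""
    return TOKEN_PATTERN.findall(text)


tokens = sketch_tokenise("Il-kelb jiekol.")
print(tokens)  # ['Il', '-', 'kelb', 'jiekol', '.']
```

Whitespace never appears in the result because neither alternative of the assumed pattern can match it.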

tokenise_indices(text: str) → list[tuple[int, int]]

Tokenise a text and return the indices of the tokens rather than the tokens themselves. The result is a list of integer pairs (i, j) such that text[i:j] is a token.

Parameters:

text (str) – The text to tokenise.

Returns:

The list of (start, end) integer pairs giving the location of each token in the text.

Return type:

list[tuple[int, int]]
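The index contract, where text[i:j] recovers each token, can be sketched with Python's re.finditer. As before, this is an assumption-laden stand-in and not the library's implementation: the pattern and the name sketch_tokenise_indices are hypothetical.

```python
import re

# Hypothetical pattern for illustration only; NOT the KMTokeniser pattern.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def sketch_tokenise_indices(text: str) -> list[tuple[int, int]]:
    """Return (start, end) pairs, with end exclusive, for each match."""
    return [match.span() for match in TOKEN_PATTERN.finditer(text)]


text = "Il-kelb jiekol."
spans = sketch_tokenise_indices(text)
print(spans)  # [(0, 2), (2, 3), (3, 7), (8, 14), (14, 15)]

# Slicing the original text with each pair recovers the token strings.
tokens = [text[i:j] for (i, j) in spans]
print(tokens)  # ['Il', '-', 'kelb', 'jiekol', '.']
```

Returning spans instead of strings keeps each token tied to its position in the original text, which helps when annotations must be aligned back to the source. The gap between (3, 7) and (8, 14) is the space character that the assumed pattern skips.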