km_tokeniser.py
Korpus Malti tokeniser.
- class malti.tokeniser.km_tokeniser.km_tokeniser.KMTokeniser
Bases: RegexTokeniser
The tokeniser used by the MLRS Korpus Malti corpus. Taken from https://github.com/UMSpeech/MASRI/blob/main/masri/tokenise/tokenise.py
- __init__() → None
Constructor.
- Return type:
None
- tokenise(text: str) → list[str]
Tokenise a text into a list of tokens.
- Parameters:
text (str) – The text to tokenise.
- Returns:
The list of tokens.
- Return type:
list[str]
- tokenise_indices(text: str) → list[tuple[int, int]]
Tokenise a text and return the indices of the tokens. A list of integer pair tuples [(i, j)] is returned such that text[i:j] is a token.
- Parameters:
text (str) – The text to tokenise.
- Returns:
The list of tuple pairs containing integers specifying the locations of the tokens in the text.
- Return type:
list[tuple[int, int]]
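
The relationship between tokenise_indices and tokenise can be illustrated with a minimal, self-contained sketch. It does not use the actual Korpus Malti regular expression: the pattern below (words or single punctuation marks) and the module-level functions are hypothetical stand-ins, chosen only to show how the (i, j) pairs relate to the returned tokens.

```python
import re

# Hypothetical stand-in pattern, NOT the Korpus Malti regex:
# a run of word characters, or a single non-space, non-word character.
_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenise_indices(text: str) -> list[tuple[int, int]]:
    """Return (i, j) pairs such that text[i:j] is a token."""
    return [match.span() for match in _PATTERN.finditer(text)]


def tokenise(text: str) -> list[str]:
    """Return the tokens by slicing the text at each (i, j) pair."""
    return [text[i:j] for (i, j) in tokenise_indices(text)]


text = "Dan test."
indices = tokenise_indices(text)
tokens = tokenise(text)
```

Under this assumed pattern, slicing the original text with each index pair reproduces the token list, which is the property the documentation states for tokenise_indices. The real KMTokeniser may split Maltese text differently, for example around articles and apostrophes.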