tokeniser.py
A tokeniser.
- class malti.tokeniser.tokeniser.Tokeniser
Bases: ABC
Top-level abstract class representing all tokenisers.
- detokenise(tokens: list[str]) → str
Detokenise the list of tokens back into a single text. By default, the tokens are joined with a space between each pair.
- Parameters:
tokens (list[str]) – The tokenised text.
- Returns:
The text.
- Return type:
str
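The default behaviour described above can be sketched as follows. This is a minimal stand-in written for illustration, not the library's own code: the free function `detokenise` and the sample tokens are assumptions, and only the "join with spaces" rule comes from the description.

```python
def detokenise(tokens: list[str]) -> str:
    # Stand-in for the documented default: join the tokens with a space
    # between each pair. (Hypothetical free function, not the real method.)
    return ' '.join(tokens)


# Sample tokens chosen for illustration only.
tokens = ['Il-', 'kelb', 'raqad', '.']
text = detokenise(tokens)
print(text)
```

Because every pair of tokens is separated by a space, the original spacing of a text is not necessarily recovered: here the full stop ends up preceded by a space.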
- tokenise(text: str) → list[str]
Tokenise a text into a list of tokens.
- Parameters:
text (str) – The text to tokenise.
- Returns:
The list of tokens.
- Return type:
list[str]
- tokenise_indices(text: str) → list[tuple[int, int]]
Tokenise a text and return the indices of the tokens. A list of integer pair tuples [(i, j)] is returned such that text[i:j] is a token.
- Parameters:
text (str) – The text to tokenise.
- Returns:
The list of (start, end) integer pairs specifying the locations of the tokens in the text.
- Return type:
list[tuple[int, int]]
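The relationship between tokenise_indices and tokenise can be illustrated with a small sketch. `WhitespaceTokeniser` below is a hypothetical, self-contained class that only mirrors the documented method signatures; it does not import or subclass the real `Tokeniser`, and splitting on whitespace is an assumption made for the example.

```python
import re


class WhitespaceTokeniser:
    """Hypothetical tokeniser that splits on whitespace (illustration only)."""

    def tokenise_indices(self, text: str) -> list[tuple[int, int]]:
        # Each match of \S+ is a maximal run of non-whitespace characters;
        # span() gives its (start, end) position in the text.
        return [m.span() for m in re.finditer(r'\S+', text)]

    def tokenise(self, text: str) -> list[str]:
        # Slicing the text with each (i, j) pair recovers the token itself.
        return [text[i:j] for (i, j) in self.tokenise_indices(text)]


tokeniser = WhitespaceTokeniser()
text = 'Il-kelb  raqad.'
spans = tokeniser.tokenise_indices(text)
tokens = tokeniser.tokenise(text)

# Every index pair slices out exactly one token, in order.
assert [text[i:j] for (i, j) in spans] == tokens
```

Returning indices rather than strings keeps each token tied to its position in the original text, which is useful when the exact spacing has to be preserved.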