regex_tokeniser.py

Tokeniser class for representing tokenisers that work with regular expressions.

class malti.tokeniser.regex_tokeniser.RegexTokeniser

Tokenise a text by using regular expressions that match tokens.

__init__(pattern: str, flags: RegexFlag = re.UNICODE | re.MULTILINE | re.DOTALL | re.IGNORECASE) → None

Create a regular expression tokeniser from a regular expression.

Parameters:

pattern (str) – A string regular expression which will be compiled into a regular expression re object.
flags (RegexFlag) – Regular expression flags to use from the re module.

Return type:

None

detokenise(tokens: list[str]) → str

Detokenise the list of tokens back into a whole text. The default behaviour is to just join all the tokens with spaces in between.

tokenise(text: str) → list[str]

Tokenise a text into a list of tokens.

tokenise_indices(text: str) → list[tuple[int, int]]

Tokenise a text and return the indices of the tokens. A list of integer pair tuples [(i, j)] is returned such that text[i:j] is a token.

Parameters:: text (str) – The text to tokenise.
Returns:: The list of tuple pairs containing integers specifying the locations of the tokens in the text.
Return type:: list[tuple[int, int]]