regex_tokeniser.py
Tokeniser class for representing tokenisers that work with regular expressions.
- class malti.tokeniser.regex_tokeniser.RegexTokeniser
Bases: Tokeniser
Tokenise a text by using regular expressions that match tokens.
- __init__(pattern: str, flags: RegexFlag = re.UNICODE | re.MULTILINE | re.DOTALL | re.IGNORECASE) → None
Create a regular expression tokeniser from a regular expression.
- Parameters:
pattern (str) – A regular expression string, which will be compiled into a regular expression re object.
flags (RegexFlag) – Regular expression flags to use from the re module.
- Return type:
None
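As a rough illustration of what the constructor's defaults imply, the sketch below compiles a pattern with the same flag combination shown in the signature. It uses only the standard re module rather than RegexTokeniser itself, and the token pattern is a hypothetical example, not one shipped with the library.

```python
import re

# Default flags copied from the documented __init__ signature.
DEFAULT_FLAGS = re.UNICODE | re.MULTILINE | re.DOTALL | re.IGNORECASE

# Hypothetical token pattern: runs of word characters, or a single
# character that is neither a word character nor whitespace.
compiled = re.compile(r"\w+|[^\w\s]", DEFAULT_FLAGS)

# Because re.IGNORECASE is part of the defaults, a lowercase pattern
# also matches uppercase text.
case_insensitive = re.compile(r"hello", DEFAULT_FLAGS)
matched = case_insensitive.match("HELLO") is not None  # True
```

Since re.IGNORECASE and re.DOTALL are on by default, a pattern that relies on case-sensitive matching, or on "." stopping at newlines, would presumably need an explicit flags argument.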
- detokenise(tokens: list[str]) → str
Detokenise the list of tokens back into a whole text. The default behaviour is to just join all the tokens with spaces in between.
- Parameters:
tokens (list[str]) – The tokenised text.
- Returns:
The text.
- Return type:
str
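The default behaviour described above amounts to a space join. The function below is a stand-in written for illustration, not the library's code.

```python
def detokenise(tokens: list[str]) -> str:
    # Stand-in for the documented default: join tokens with single spaces.
    return " ".join(tokens)


tokens = ["Hello", ",", "world", "!"]
text = detokenise(tokens)  # "Hello , world !"
```

Note that a plain space join does not generally reproduce the original text: punctuation that was attached to a word comes back separated by a space.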
- tokenise(text: str) → list[str]
Tokenise a text into a list of tokens.
- Parameters:
text (str) – The text to tokenise.
- Returns:
The list of tokens.
- Return type:
list[str]
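One plausible way to turn pattern matches into tokens is sketched below, assuming that every non-overlapping match of the pattern is one token. The free function and the pattern are hypothetical; the documentation does not say how RegexTokeniser does this internally.

```python
import re


def tokenise(pattern: re.Pattern, text: str) -> list[str]:
    # Assumption: each non-overlapping match is one token. finditer with
    # group(0) always yields the whole match, even if the pattern
    # contains capture groups (findall would return the groups instead).
    return [match.group(0) for match in pattern.finditer(text)]


# Hypothetical token pattern, not taken from the library.
pattern = re.compile(r"\w+|[^\w\s]")
tokens = tokenise(pattern, "Hello, world!")  # ["Hello", ",", "world", "!"]
```

Text that the pattern does not match, such as the space in this example, is simply skipped under this assumption.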
- tokenise_indices(text: str) → list[tuple[int, int]]
Tokenise a text and return the indices of the tokens. A list of integer pair tuples [(i, j)] is returned such that text[i:j] is a token.
- Parameters:
text (str) – The text to tokenise.
- Returns:
The list of integer pairs (i, j) giving the start and end position of each token in the text.
- Return type:
list[tuple[int, int]]
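The relationship between the returned pairs and the tokens can be illustrated with the standard re module. This is a minimal sketch, assuming the indices are the match spans of the pattern; the function and pattern here are hypothetical stand-ins, not the library's implementation.

```python
import re


def tokenise_indices(pattern: re.Pattern, text: str) -> list[tuple[int, int]]:
    # Assumption: each token's (i, j) pair is the span of one match,
    # with i inclusive and j exclusive, as in Python slicing.
    return [match.span() for match in pattern.finditer(text)]


# Hypothetical token pattern, not taken from the library.
pattern = re.compile(r"\w+|[^\w\s]")
text = "Hello, world!"

spans = tokenise_indices(pattern, text)
# [(0, 5), (5, 6), (7, 12), (12, 13)]

# Slicing the text with each pair recovers the tokens.
tokens = [text[i:j] for i, j in spans]
# ["Hello", ",", "world", "!"]
```

Keeping the indices rather than only the token strings makes it possible to map each token back to its exact position in the source text, which a space-joined detokenisation cannot do.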