regex_tokeniser.py
Tokeniser class for representing tokenisers that work with regular expressions.
- class malti.tokeniser.regex_tokeniser.RegexTokeniser
Bases:
Tokeniser
Tokenise a text by using regular expressions that match tokens.
- __init__(pattern: str, flags: RegexFlag = re.UNICODE | re.MULTILINE | re.DOTALL | re.IGNORECASE) -> None
Create a regular expression tokeniser from a regular expression.
- Parameters:
pattern (str) – A regular expression string, which will be compiled into a regular expression re object.
flags (RegexFlag) – Regular expression flags to use from the re module.
- Return type:
None
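The constructor is described as compiling the pattern string with the given flags. A minimal sketch of that step using only the standard re module might look like the following. The token pattern and the names DEFAULT_FLAGS and compiled are illustrative assumptions, not part of the documented API.

```python
import re

# Default flags as listed in the __init__ signature.
DEFAULT_FLAGS = re.UNICODE | re.MULTILINE | re.DOTALL | re.IGNORECASE

# Hypothetical token pattern: a run of word characters,
# or a single character that is neither a word character nor whitespace.
pattern = r"\w+|[^\w\s]"

# Compile once so the same re object can be reused for every text.
compiled = re.compile(pattern, DEFAULT_FLAGS)

print(compiled.findall("Il-kelb, OK!"))  # ['Il', '-', 'kelb', ',', 'OK', '!']
```

With re.DOTALL set, a "." in the pattern also matches newlines, and re.IGNORECASE makes literal letters in the pattern match either case, so patterns written for this tokeniser should take both defaults into account.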
- tokenise(text: str) -> list[str]
Tokenise a text into a list of tokens.
- Parameters:
text (str) – The text to tokenise.
- Returns:
The list of tokens.
- Return type:
list[str]
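As a rough illustration of the behaviour described for tokenise, the sketch below treats every non-overlapping match of the pattern as one token. This is an assumption about how the method works, written with the standard re module only. The function name tokenise_sketch and the example pattern are hypothetical.

```python
import re

def tokenise_sketch(pattern: str, text: str) -> list[str]:
    """Return the matched substrings of `pattern` in `text`, in order."""
    flags = re.UNICODE | re.MULTILINE | re.DOTALL | re.IGNORECASE
    # Each match object yields exactly one token (its full matched text).
    return [m.group(0) for m in re.finditer(pattern, text, flags)]

tokens = tokenise_sketch(r"\w+|[^\w\s]", "Il-kelb jiekol.")
print(tokens)  # ['Il', '-', 'kelb', 'jiekol', '.']
```

Text that the pattern does not match, such as the space in the example, is simply skipped rather than returned as a token.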
- tokenise_indices(text: str) -> list[tuple[int, int]]
Tokenise a text and return the indices of the tokens. A list of integer pair tuples [(i, j)] is returned such that text[i:j] is a token.
- Parameters:
text (str) – The text to tokenise.
- Returns:
The list of tuple pairs containing integers specifying the locations of the tokens in the text.
- Return type:
list[tuple[int, int]]
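The relationship between the returned pairs and the tokens can be sketched with the standard re module, assuming each pair is the start and end offset of one pattern match. The function name tokenise_indices_sketch and the example pattern are hypothetical and only serve the illustration.

```python
import re

def tokenise_indices_sketch(pattern: str, text: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets of every match of `pattern` in `text`."""
    flags = re.UNICODE | re.MULTILINE | re.DOTALL | re.IGNORECASE
    # m.span() is (m.start(), m.end()), with the end offset exclusive.
    return [m.span() for m in re.finditer(pattern, text, flags)]

text = "Il-kelb jiekol."
spans = tokenise_indices_sketch(r"\w+|[^\w\s]", text)
print(spans)  # [(0, 2), (2, 3), (3, 7), (8, 14), (14, 15)]

# Slicing the original text with each pair recovers the tokens.
print([text[i:j] for i, j in spans])  # ['Il', '-', 'kelb', 'jiekol', '.']
```

Keeping the offsets instead of the substrings lets a caller map tokens back to their position in the original text, for example to highlight them or to align annotations.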