km_tokeniser.py
Korpus Malti tokeniser.
- class malti.tokeniser.km_tokeniser.km_tokeniser.KMTokeniser
Bases: RegexTokeniser
The tokeniser used by the MLRS Korpus Malti corpus. Adapted from https://github.com/UMSpeech/MASRI/blob/main/masri/tokenise/tokenise.py
Although the linked repository does not carry an MIT license, its owner, albertgatt, has given permission to include the code in this MIT-licensed project.
- ABBREV_PREFIX = "sant[\\'’]|(a\\.?m|p\\.?m|onor|sra|nru|dott|kap|mons|dr|prof)\\.?"
Captures abbreviations e.g. Sant’ (as in Sant’Anna)
- DECIMAL = '\\d+[.,/]\\d+'
Captures decimal numbers e.g. 10.1
- DEF_ARTICLE = '\\w{0,5}?[dtlrnsxzcżċ]-'
Captures definite articles e.g. għall- or l-
- DEF_NUMERAL = '-i[dtlrnsxzcżċ]'
Captures definite numerals e.g. -il (as in ħdax-il)
- END_PUNCTUATION = '\\?|\\.|,|\\!|;|:|…|"|\\\'|\\.\\.\\.\\\''
Captures end-of-sentence punctuation marks e.g. .
- NUMBER = '\\d+'
Captures whole numbers e.g. 10
- NUMERIC_DATE = '\\d{1,2}[-/]\\d{1,2}[-/]\\d{2,4}|\\d{2,4}[-/]\\d{1,2}[-/]\\d{1,2}'
Captures dates expressed numerically e.g. 10/10/2010
- PROCLITIC_PREP = "^\\w[\\'’]$"
Captures proclitic prepositions e.g. l’
- WORD = "\\w+[`\\']?|\\S"
Captures words e.g. kelb
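The patterns above can be checked directly with Python's `re` module. The following is a minimal sketch that copies a selection of the documented pattern strings and tests them against the examples given in their descriptions. The helper `fully_matches` is hypothetical and is not part of the class. The abbreviation sample is lower-case because this page does not say whether the class compiles its patterns with `re.IGNORECASE`.

```python
import re

# Pattern strings copied from the class attributes documented above.
DECIMAL = r"\d+[.,/]\d+"
NUMBER = r"\d+"
NUMERIC_DATE = r"\d{1,2}[-/]\d{1,2}[-/]\d{2,4}|\d{2,4}[-/]\d{1,2}[-/]\d{1,2}"
DEF_ARTICLE = r"\w{0,5}?[dtlrnsxzcżċ]-"
DEF_NUMERAL = r"-i[dtlrnsxzcżċ]"
PROCLITIC_PREP = r"^\w[\'’]$"
ABBREV_PREFIX = r"sant[\'’]|(a\.?m|p\.?m|onor|sra|nru|dott|kap|mons|dr|prof)\.?"
WORD = r"\w+[`\']?|\S"


def fully_matches(pattern: str, text: str) -> bool:
    """Hypothetical helper: True if the pattern matches the entire string."""
    return re.fullmatch(pattern, text) is not None


# (name, pattern, sample) triples taken from the descriptions above.
checks = [
    ("DECIMAL", DECIMAL, "10.1"),
    ("NUMBER", NUMBER, "10"),
    ("NUMERIC_DATE", NUMERIC_DATE, "10/10/2010"),
    ("DEF_ARTICLE", DEF_ARTICLE, "għall-"),
    ("DEF_ARTICLE", DEF_ARTICLE, "l-"),
    ("DEF_NUMERAL", DEF_NUMERAL, "-il"),
    ("PROCLITIC_PREP", PROCLITIC_PREP, "l’"),
    ("ABBREV_PREFIX", ABBREV_PREFIX, "sant’"),
    ("ABBREV_PREFIX", ABBREV_PREFIX, "dr."),
]

for name, pattern, sample in checks:
    print(f"{name:15} {sample!r:15} {fully_matches(pattern, sample)}")

# WORD falls back to any single non-space character, so punctuation
# that no other pattern claims still becomes its own token.
print(re.findall(WORD, "kelb!"))  # ['kelb', '!']
```

Every sample in `checks` matches its pattern in full. Note that `NUMBER` alone does not match `10.1` and `DECIMAL` alone does not match `10`, which is why both patterns are needed.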
- __init__() -> None
Constructor.
- Return type:
None
- detokenise(tokens: list[str]) -> str
Detokenise the list of tokens back into a whole text.
- Parameters:
tokens (list[str]) – The tokenised text.
- Returns:
The text.
- Return type:
str
- tokenise(text: str) -> list[str]
Tokenise a text into a list of tokens.
- Parameters:
text (str) – The text to tokenise.
- Returns:
The list of tokens.
- Return type:
list[str]
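To illustrate how regex-based tokenisation of this kind can work, here is a minimal sketch that joins a few of the documented patterns into one ordered alternation and scans the text with `re.finditer`. This is not the class's actual implementation: the function name `sketch_tokenise`, the choice of patterns, and the order in which they are tried are all assumptions made for illustration.

```python
import re

# Pattern strings copied from the documented class attributes.
NUMERIC_DATE = r"\d{1,2}[-/]\d{1,2}[-/]\d{2,4}|\d{2,4}[-/]\d{1,2}[-/]\d{1,2}"
DECIMAL = r"\d+[.,/]\d+"
NUMBER = r"\d+"
DEF_ARTICLE = r"\w{0,5}?[dtlrnsxzcżċ]-"
WORD = r"\w+[`\']?|\S"

# ASSUMPTION: more specific patterns are tried before more general ones.
# The real tokeniser may combine its patterns differently.
ORDERED = [NUMERIC_DATE, DECIMAL, NUMBER, DEF_ARTICLE, WORD]
COMBINED = re.compile("|".join(f"(?:{p})" for p in ORDERED))


def sketch_tokenise(text: str) -> list[str]:
    """Hypothetical stand-in: return every match of the combined pattern."""
    return [m.group(0) for m in COMBINED.finditer(text)]


tokens = sketch_tokenise("għall-kelb 10.1")
print(tokens)  # ['għall-', 'kelb', '10.1']
```

Ordering matters in an alternation like this: if `NUMBER` were tried before `DECIMAL`, the text `10.1` would be split into `10`, `.`, and `1`.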
- tokenise_indices(text: str) -> list[tuple[int, int]]
Tokenise a text and return the indices of the tokens. A list of integer pair tuples [(i, j)] is returned such that text[i:j] is a token.
- Parameters:
text (str) – The text to tokenise.
- Returns:
The list of tuple pairs containing integers specifying the locations of the tokens in the text.
- Return type:
list[tuple[int, int]]
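The `text[i:j]` contract described for `tokenise_indices` can be shown with a small stand-in. The sketch below does not call the class. The function `word_spans` is hypothetical and uses only the documented `WORD` pattern, so its token boundaries will differ from the real tokeniser's on input containing articles, dates, or abbreviations.

```python
import re

# Pattern string copied from the documented WORD attribute.
WORD = r"\w+[`\']?|\S"


def word_spans(text: str) -> list[tuple[int, int]]:
    """Hypothetical stand-in: (start, end) offsets of each WORD match."""
    return [m.span() for m in re.finditer(WORD, text)]


text = "kelb u qattus."
spans = word_spans(text)
print(spans)  # [(0, 4), (5, 6), (7, 13), (13, 14)]

# Slicing the original text with each pair recovers the tokens.
tokens = [text[i:j] for i, j in spans]
print(tokens)  # ['kelb', 'u', 'qattus', '.']
```

Returning offsets rather than strings keeps the tokens aligned with the original text, which is useful when annotations need to point back to character positions.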