km_tokeniser.py

Korpus Malti tokeniser.

class malti.tokeniser.km_tokeniser.km_tokeniser.KMTokeniser

Bases: RegexTokeniser

The tokeniser used by the MLRS Korpus Malti corpus. Adapted from https://github.com/UMSpeech/MASRI/blob/main/masri/tokenise/tokenise.py. Although the linked repository does not carry an MIT license, its owner, albertgatt, has given permission to include the code in this MIT-licensed project.

ABBREV_PREFIX = "sant[\\'’]|(a\\.?m|p\\.?m|onor|sra|nru|dott|kap|mons|dr|prof)\\.?": Captures abbreviations e.g. Sant’ (as in Sant’Anna)

DECIMAL = '\\d+[.,/]\\d+': Captures decimal numbers e.g. 10.1 (the separator may be a full stop, a comma, or a slash)

DEF_ARTICLE = '\\w{0,5}?[dtlrnsxzcżċ]-': Captures definite articles e.g. għall- or l-

DEF_NUMERAL = '-i[dtlrnsxzcżċ]': Captures definite numerals e.g. -il (as in ħdax-il)

END_PUNCTUATION = '\\?|\\.|,|\\!|;|:|…|"|\\\'|\\.\\.\\.\\\'': Captures end-of-sentence punctuation marks e.g. .

NUMBER = '\\d+': Captures whole numbers e.g. 10

NUMERIC_DATE = '\\d{1,2}[-/]\\d{1,2}[-/]\\d{2,4}|\\d{2,4}[-/]\\d{1,2}[-/]\\d{1,2}': Captures dates expressed numerically e.g. 10/10/2010

PROCLITIC_PREP = "^\\w[\\'’]$": Captures proclitic prepositions e.g. l’

WORD = "\\w+[`\\']?|\\S": Captures words e.g. kelb
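The patterns above can be checked in isolation with the standard `re` module. The following is a minimal sketch, not part of the library: the patterns are transcribed from the attribute list as raw strings, and the helper `matches` is a hypothetical name introduced here for illustration. It says nothing about how KMTokeniser combines or prioritises the patterns internally.

```python
import re

# Patterns transcribed from the attribute list above, written as raw strings.
DECIMAL = r'\d+[.,/]\d+'
NUMBER = r'\d+'
NUMERIC_DATE = r'\d{1,2}[-/]\d{1,2}[-/]\d{2,4}|\d{2,4}[-/]\d{1,2}[-/]\d{1,2}'
DEF_ARTICLE = r'\w{0,5}?[dtlrnsxzcżċ]-'
DEF_NUMERAL = r'-i[dtlrnsxzcżċ]'
PROCLITIC_PREP = r"^\w[\'’]$"


def matches(pattern: str, s: str) -> bool:
    """Hypothetical helper: True if the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None


# Each entry pairs a pattern name with a sample string from the descriptions.
samples = [
    ("DECIMAL", DECIMAL, "10.1"),
    ("NUMBER", NUMBER, "10"),
    ("NUMERIC_DATE", NUMERIC_DATE, "10/10/2010"),
    ("DEF_ARTICLE", DEF_ARTICLE, "għall-"),
    ("DEF_ARTICLE", DEF_ARTICLE, "l-"),
    ("DEF_NUMERAL", DEF_NUMERAL, "-il"),
    ("PROCLITIC_PREP", PROCLITIC_PREP, "l’"),
]

for name, pattern, sample in samples:
    print(f"{name:15} {sample!r:14} {matches(pattern, sample)}")
```

Note that NUMBER alone does not cover a string such as 10.1 in full, which is why a separate DECIMAL pattern is useful.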

__init__() → None

Constructor.

Return type: None

detokenise(tokens: list[str]) → str

Detokenise the list of tokens back into a whole text.

Parameters: tokens (list[str]) – The tokenised text.
Returns: The text.
Return type: str

tokenise(text: str) → list[str]

Tokenise a text into a list of tokens.

Parameters: text (str) – The text to tokenise.
Returns: The list of tokens.
Return type: list[str]

tokenise_indices(text: str) → list[tuple[int, int]]

Tokenise a text and return the indices of the tokens. The result is a list of integer pairs [(i, j)] such that text[i:j] is a token.

Parameters: text (str) – The text to tokenise.
Returns: A list of (start, end) integer pairs giving the location of each token in the text.
Return type: list[tuple[int, int]]
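The index contract (text[i:j] is a token) can be illustrated without the library. The sketch below is a stand-in, not the KMTokeniser implementation: it joins three of the documented patterns into one alternation, and both that combination order and the function name `stand_in_indices` are assumptions made for this example.

```python
import re

# Patterns copied from the attribute list; joining them in this order is an
# assumption for illustration, not the documented behaviour of KMTokeniser.
NUMERIC_DATE = r'\d{1,2}[-/]\d{1,2}[-/]\d{2,4}|\d{2,4}[-/]\d{1,2}[-/]\d{1,2}'
DECIMAL = r'\d+[.,/]\d+'
WORD = r"\w+[`\']?|\S"

STAND_IN = re.compile("|".join(f"(?:{p})" for p in (NUMERIC_DATE, DECIMAL, WORD)))


def stand_in_indices(text: str) -> list[tuple[int, int]]:
    """Hypothetical stand-in: return (start, end) spans of regex matches."""
    return [m.span() for m in STAND_IN.finditer(text)]


text = "Kelb 10.1 kg."
indices = stand_in_indices(text)

# Slicing the original text with each pair recovers the token itself.
tokens = [text[i:j] for i, j in indices]
print(indices)
print(tokens)
```

Because the pairs are offsets into the original string, they can be used to align tokens with character-level annotations, which a plain list of token strings cannot do once whitespace has been discarded.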