km_tokeniser.py

Korpus Malti tokeniser.

class malti.tokeniser.km_tokeniser.km_tokeniser.KMTokeniser

Bases: RegexTokeniser

The tokeniser used by the MLRS Korpus Malti corpus. Adapted from https://github.com/UMSpeech/MASRI/blob/main/masri/tokenise/tokenise.py. Although the linked repository does not have an MIT license, its owner, albertgatt, has given permission to include the code in this MIT-licensed project.

ABBREV_PREFIX = "sant[\\'’]|(a\\.?m|p\\.?m|onor|sra|nru|dott|kap|mons|dr|prof)\\.?"

Captures abbreviation prefixes, e.g. Sant’ (as in Sant’Anna), as well as optionally dotted forms such as dr. and prof.
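As an informal check of what this pattern accepts, the sketch below applies it to a few whole tokens using only Python's re module. The pattern string is copied from the attribute above; the helper name is_abbrev_prefix, the use of re.fullmatch, and case-insensitive matching are assumptions made for illustration and may not reflect how KMTokeniser applies the pattern internally.

```python
import re

# Pattern copied verbatim from the ABBREV_PREFIX attribute above.
ABBREV_PREFIX = "sant[\\'’]|(a\\.?m|p\\.?m|onor|sra|nru|dott|kap|mons|dr|prof)\\.?"

# Assumption: the pattern is tested against a whole token, ignoring case.
abbrev_re = re.compile(ABBREV_PREFIX, re.IGNORECASE)


def is_abbrev_prefix(token: str) -> bool:
    """Hypothetical helper: True if the whole token matches ABBREV_PREFIX."""
    return abbrev_re.fullmatch(token) is not None


for token in ["Sant’", "Dr.", "p.m.", "kelb"]:
    print(token, is_abbrev_prefix(token))
```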

DECIMAL = '\\d+[.,/]\\d+'

Captures decimal numbers, e.g. 10.1. The separator may be a full stop, a comma, or a slash.
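The separator class in this pattern is [.,/], so it accepts more than the full stop. A minimal sketch, using only the re module and assuming the pattern is tested against a whole string with re.fullmatch (the tokeniser itself may apply it differently):

```python
import re

# Pattern copied verbatim from the DECIMAL attribute above.
DECIMAL = '\\d+[.,/]\\d+'
decimal_re = re.compile(DECIMAL)

samples = ["10.1", "3,14", "1/2", "10"]

# Map each sample to whether the whole string matches DECIMAL.
matches = {s: decimal_re.fullmatch(s) is not None for s in samples}
print(matches)
```

A bare integer such as 10 has no separator and therefore does not match.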

DEF_ARTICLE = '\\w{0,5}?[dtlrnsxzcżċ]-'

Captures definite articles e.g. għall- or l-
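To illustrate how the lazy prefix \w{0,5}? combines with the final consonant and hyphen, the following sketch matches the pattern at the start of a few strings. The helper leading_article is hypothetical, and anchoring with re.match is an assumption rather than documented behaviour of KMTokeniser.

```python
import re
from typing import Optional

# Pattern copied verbatim from the DEF_ARTICLE attribute above.
DEF_ARTICLE = '\\w{0,5}?[dtlrnsxzcżċ]-'
article_re = re.compile(DEF_ARTICLE)


def leading_article(text: str) -> Optional[str]:
    """Hypothetical helper: return the article at the start of text, if any."""
    m = article_re.match(text)
    return m.group() if m else None


for text in ["il-kelb", "għall-iskola", "l-", "kelb"]:
    print(text, leading_article(text))
```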

DEF_NUMERAL = '-i[dtlrnsxzcżċ]'

Captures definite numerals e.g. -il (as in ħdax-il)
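A match for this pattern is always three characters long: a hyphen, the letter i, and one consonant. The sketch below uses that fact to separate a numeral from its suffix. The helper split_numeral and the assumption that the suffix sits at the very end of the token are illustrative only.

```python
import re

# Pattern copied verbatim from the DEF_NUMERAL attribute above.
DEF_NUMERAL = '-i[dtlrnsxzcżċ]'
numeral_re = re.compile(DEF_NUMERAL)


def split_numeral(token: str) -> tuple:
    """Hypothetical helper: split 'ħdax-il' into ('ħdax', '-il').

    Assumes the suffix, if present, is the last three characters.
    """
    suffix = token[-3:]
    if numeral_re.fullmatch(suffix):
        return token[:-3], suffix
    return token, ""


print(split_numeral("ħdax-il"))
print(split_numeral("kelb"))
```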

END_PUNCTUATION = '\\?|\\.|,|\\!|;|:|…|"|\\\'|\\.\\.\\.\\\''

Captures end-of-sentence punctuation marks e.g. .

NUMBER = '\\d+'

Captures whole numbers e.g. 10

NUMERIC_DATE = '\\d{1,2}[-/]\\d{1,2}[-/]\\d{2,4}|\\d{2,4}[-/]\\d{1,2}[-/]\\d{1,2}'

Captures dates expressed numerically, e.g. 10/10/2010
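The pattern has two alternatives: day or month first with the year last, and year first. Each separator is matched independently by [-/], so mixed separators are accepted while dots are not. A small sketch under the assumption that the pattern is tested against a whole string:

```python
import re

# Pattern copied verbatim from the NUMERIC_DATE attribute above.
NUMERIC_DATE = '\\d{1,2}[-/]\\d{1,2}[-/]\\d{2,4}|\\d{2,4}[-/]\\d{1,2}[-/]\\d{1,2}'
date_re = re.compile(NUMERIC_DATE)

samples = ["10/10/2010", "2010-10-10", "10-10/2010", "10.10.2010"]

# Map each sample to whether the whole string matches NUMERIC_DATE.
is_date = {s: date_re.fullmatch(s) is not None for s in samples}
print(is_date)
```

Note that the pattern checks shape only; it does not validate that the day or month is in range.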

PROCLITIC_PREP = "^\\w[\\'’]$"

Captures proclitic prepositions e.g. l’
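Unlike the other patterns, this one is anchored with ^ and $, which suggests it is meant to be tested against an already isolated token rather than searched for inside running text (an inference from the pattern, not something this page states). A minimal sketch with a hypothetical helper:

```python
import re

# Pattern copied verbatim from the PROCLITIC_PREP attribute above.
PROCLITIC_PREP = "^\\w[\\'’]$"
proclitic_re = re.compile(PROCLITIC_PREP)


def is_proclitic(token: str) -> bool:
    """Hypothetical helper: one word character followed by an apostrophe."""
    return proclitic_re.match(token) is not None


for token in ["l’", "t'", "ta’", "l"]:
    print(token, is_proclitic(token))
```

Only single-letter forms qualify: ta’ has two word characters before the apostrophe and is rejected.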

WORD = "\\w+[`\\']?|\\S"

Captures words e.g. kelb
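The pattern matches a run of word characters with an optional trailing backtick or ASCII apostrophe, and otherwise falls back to any single non-whitespace character. Applied on its own with re.findall (an illustration only, not how KMTokeniser combines its patterns), it behaves as follows:

```python
import re

# Pattern copied verbatim from the WORD attribute above.
WORD = "\\w+[`\\']?|\\S"

text = "Il-kelb ta' Marija."

# The pattern has no capture groups, so findall returns whole matches.
pieces = re.findall(WORD, text)
print(pieces)
# → ['Il', '-', 'kelb', "ta'", 'Marija', '.']
```

Used alone, WORD splits Il- into two pieces, which shows why a separate pattern for definite articles is useful.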

__init__() → None

Constructor.

Return type:

None

detokenise(tokens: list[str]) → str

Detokenise the list of tokens back into a whole text.

Parameters:

tokens (list[str]) – The tokenised text.

Returns:

The text.

Return type:

str

tokenise(text: str) → list[str]

Tokenise a text into a list of tokens.

Parameters:

text (str) – The text to tokenise.

Returns:

The list of tokens.

Return type:

list[str]
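This page does not describe how the class combines its patterns, so the following is only a rough sketch of one plausible approach: joining several of the documented patterns into a single alternation, with more specific patterns listed before more general ones. The function sketch_tokenise, the choice of patterns, and their order are all assumptions, and the output of the real tokenise method may differ.

```python
import re

# Patterns copied verbatim from the class attributes documented above.
NUMERIC_DATE = '\\d{1,2}[-/]\\d{1,2}[-/]\\d{2,4}|\\d{2,4}[-/]\\d{1,2}[-/]\\d{1,2}'
DECIMAL = '\\d+[.,/]\\d+'
DEF_ARTICLE = '\\w{0,5}?[dtlrnsxzcżċ]-'
NUMBER = '\\d+'
WORD = "\\w+[`\\']?|\\S"

# Assumption: specific patterns come first so they win over WORD.
ordered = [NUMERIC_DATE, DECIMAL, DEF_ARTICLE, NUMBER, WORD]
combined = re.compile("|".join(f"(?:{p})" for p in ordered))


def sketch_tokenise(text: str) -> list:
    """Hypothetical stand-in for tokenise(); not the library's code."""
    return [m.group() for m in combined.finditer(text)]


print(sketch_tokenise("Il-kelb kiel 2.5 kg."))
# → ['Il-', 'kelb', 'kiel', '2.5', 'kg', '.']
```

Wrapping each pattern in a non-capturing group keeps the alternatives inside NUMERIC_DATE and WORD from leaking into their neighbours.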

tokenise_indices(text: str) → list[tuple[int, int]]

Tokenise a text and return the indices of the tokens. A list of integer pair tuples [(i, j)] is returned such that text[i:j] is a token.

Parameters:

text (str) – The text to tokenise.

Returns:

A list of (start, end) integer pairs giving the location of each token in the text.

Return type:

list[tuple[int, int]]
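The slicing contract described for tokenise_indices, where text[i:j] is a token for each returned pair (i, j), can be illustrated with re.finditer and Match.span(). The function below is a hypothetical stand-in that uses only the WORD pattern; it is not the library's implementation.

```python
import re

# Pattern copied verbatim from the WORD attribute above.
WORD = "\\w+[`\\']?|\\S"


def sketch_tokenise_indices(text: str) -> list:
    """Hypothetical stand-in: return (start, end) pairs for each match."""
    return [m.span() for m in re.finditer(WORD, text)]


text = "kelb u qattus."
spans = sketch_tokenise_indices(text)

# Slicing the original text with each pair recovers the token.
tokens = [text[i:j] for i, j in spans]
print(spans)
print(tokens)
```

Returning indices rather than strings lets a caller map tokens back to their positions in the original text, for example when aligning annotations.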