Tokenisers
Tokenisers break up text represented as a single string (such as the contents of a text file) into a list of tokens, that is, words and punctuation marks.
The tokenise function
The simplest way to tokenise a text in malti is as follows:
import malti.tokeniser

sentence = 'Eżempju ta\' sentenza.'
tokens = malti.tokeniser.tokenise(sentence)
print(tokens)
['Eżempju', "ta'", 'sentenza', '.']
The Tokeniser class
The tokenise function above is a convenience function that uses a default tokeniser (KMTokeniser in this version).
To access all the features of a tokeniser, instantiate its class and call the methods on the instance, for example:
import malti.tokeniser

tokeniser = malti.tokeniser.KMTokeniser()

sentence = 'Eżempju ta\' sentenza.'
tokens = tokeniser.tokenise(sentence)
print(tokens)
['Eżempju', "ta'", 'sentenza', '.']
Apart from tokenise, every tokeniser also has a tokenise_indices method, which returns the (start, end) character positions of the tokens instead of the tokens themselves:
import malti.tokeniser

tokeniser = malti.tokeniser.KMTokeniser()

sentence = 'Eżempju ta\' sentenza.'
indices = tokeniser.tokenise_indices(sentence)
print(indices)
[(0, 7), (8, 11), (12, 20), (20, 21)]
This tells you that the first word is found at sentence[0:7], the second word at sentence[8:11], and so on.
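As a quick check of how these index pairs are read, the sketch below slices the sentence with the pairs from the output above. It hard-codes those indices instead of calling malti, so it runs with plain Python; the gap calculation is an added illustration, not something the library is documented to provide.

```python
# Index pairs copied from the tokenise_indices output shown above.
sentence = "Eżempju ta' sentenza."
indices = [(0, 7), (8, 11), (12, 20), (20, 21)]

# Each pair is a (start, end) slice into the original string.
tokens = [sentence[start:end] for (start, end) in indices]
print(tokens)

# Positions not covered by any pair (here, the spaces between tokens).
covered = {i for (start, end) in indices for i in range(start, end)}
gaps = [i for i in range(len(sentence)) if i not in covered]
print(gaps)
```

Because the indices point back into the original string, they keep information that the token list alone discards, such as exactly where the whitespace was.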
There is also a detokenise method, which approximately inverts tokenise by reconstructing text from a list of tokens. Tokenisation is generally a lossy transformation, so there is no guarantee that the original text can be recovered exactly:
import malti.tokeniser

tokeniser = malti.tokeniser.KMTokeniser()

tokens = ['Eżempju', "ta'", 'sentenza', '.']
text = tokeniser.detokenise(tokens)
print(text)
Eżempju ta' sentenza.
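To illustrate why a round trip through tokenisation can lose information, here is a minimal sketch using two hypothetical stand-in functions, naive_tokenise and naive_detokenise. They are not malti's implementation and only split on whitespace, but they show the general effect: the exact spacing of the original text is not stored in the tokens.

```python
def naive_tokenise(text):
    # Hypothetical stand-in, not malti's tokeniser: split on any whitespace.
    return text.split()


def naive_detokenise(tokens):
    # Hypothetical stand-in: join the tokens with single spaces.
    return " ".join(tokens)


# The original contains a double space and a tab.
original = "Eżempju  ta'\tsentenza"
tokens = naive_tokenise(original)
restored = naive_detokenise(tokens)

print(tokens)
print(restored == original)  # the original whitespace was not preserved
```

When the exact original text matters, keeping the source string together with the output of tokenise_indices avoids relying on detokenisation.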
Available tokenisers
The following tokenisers are available:
malti.tokeniser.RegexTokeniser (regex_tokeniser.py): A tokeniser where you have to supply a regular expression that matches words.
malti.tokeniser.KMTokeniser (km_tokeniser.py): A RegexTokeniser that is equivalent to the one used to tokenise the Korpus Malti.
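The idea behind a regex-based tokeniser can be sketched with Python's standard re module. The helper names regex_tokenise and regex_tokenise_indices and the pattern below are assumptions made for illustration: they are not the RegexTokeniser API (its constructor arguments are not described here), and the pattern is not the one used for the Korpus Malti.

```python
import re


def regex_tokenise(pattern, text):
    # Hypothetical helper, not malti's API:
    # every non-overlapping match of the pattern becomes one token.
    return [match.group(0) for match in re.finditer(pattern, text)]


def regex_tokenise_indices(pattern, text):
    # Hypothetical helper: the (start, end) span of each match.
    return [match.span() for match in re.finditer(pattern, text)]


# Illustrative pattern only: a run of word characters optionally followed
# by an apostrophe, or a single non-space, non-word character.
word_pattern = r"\w+'?|[^\w\s]"

sentence = "Eżempju ta' sentenza."
tokens = regex_tokenise(word_pattern, sentence)
spans = regex_tokenise_indices(word_pattern, sentence)
print(tokens)
print(spans)
```

For this one sentence the sketch yields the same tokens and index pairs as the examples above, but that agreement should not be expected to hold for arbitrary Maltese text.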