Tokenisers

Tokenisers are used to break up text represented as a single string (such as from a text file) into a list of words.

The tokenise function

The simplest way to tokenise a text in malti is as follows:

1import malti.tokeniser
2
3sentence = 'Eżempju ta\' sentenza.'
4tokens = malti.tokeniser.tokenise(sentence)
5print(tokens)
['Eżempju', "ta'", 'sentenza', '.']

The Tokeniser class

The above is a convenience function that makes use of a default tokeniser (KMTokeniser in this version). To gain access to all the features of tokenisers, they should be used in their class form, for example:

1import malti.tokeniser
2
3tokeniser = malti.tokeniser.KMTokeniser()
4
5sentence = 'Eżempju ta\' sentenza.'
6tokens = tokeniser.tokenise(sentence)
7print(tokens)
['Eżempju', "ta'", 'sentenza', '.']

Apart from tokenise, every tokeniser can also return a list of indices of the tokens instead of the tokens themselves by calling the tokenise_indices method:

1import malti.tokeniser
2
3tokeniser = malti.tokeniser.KMTokeniser()
4
5sentence = 'Eżempju ta\' sentenza.'
6indices = tokeniser.tokenise_indices(sentence)
7print(indices)
[(0, 7), (8, 11), (12, 20), (20, 21)]

This tells you that the first word is found at sentence[0:7], the second word at sentence[8:11], and so on.

There is also a detokenise method that is meant to approximately invert the tokenise method by returning the original text given a list of tokens (although tokenisation is generally a lossy transformation which means that there is no guarantee that the original text can be recovered):

1import malti.tokeniser
2
3tokeniser = malti.tokeniser.KMTokeniser()
4
5tokens = ['Eżempju', "ta'", 'sentenza', '.']
6text = tokeniser.detokenise(tokens)
7print(text)
'Eżempju ta\' sentenza.'

Available tokenisers

The following tokenisers are available:

  • malti.tokeniser.RegexTokeniser (regex_tokeniser.py): A tokeniser where you have to supply a regular expression that matches words.

  • malti.tokeniser.KMTokeniser (km_tokeniser.py): A RegexTokeniser that is equivalent to the one used to tokenise the Korpus Malti.