Tokenisers
==========

Tokenisers are used to break up text represented as a single string (such as from a text file) into a list of words.

The ``tokenise`` function
-------------------------

The simplest way to tokenise a text in ``malti`` is as follows:

.. code-block:: python
    :linenos:

    import malti.tokeniser

    sentence = 'Eżempju ta\' sentenza.'
    tokens = malti.tokeniser.tokenise(sentence)
    print(tokens)

.. code-block:: python

    ['Eżempju', "ta'", 'sentenza', '.']

The ``Tokeniser`` class
-----------------------

The above is a convenience function that makes use of a default tokeniser (``KMTokeniser`` in this version).
To gain access to all the features of tokenisers, they should be used in their class form, for example:

.. code-block:: python
    :linenos:

    import malti.tokeniser

    tokeniser = malti.tokeniser.KMTokeniser()
    sentence = 'Eżempju ta\' sentenza.'
    tokens = tokeniser.tokenise(sentence)
    print(tokens)

.. code-block:: python

    ['Eżempju', "ta'", 'sentenza', '.']

Apart from ``tokenise``, every tokeniser can also return a list of indices of the tokens instead of the tokens themselves by calling the ``tokenise_indices`` method:

.. code-block:: python
    :linenos:

    import malti.tokeniser

    tokeniser = malti.tokeniser.KMTokeniser()
    sentence = 'Eżempju ta\' sentenza.'
    indices = tokeniser.tokenise_indices(sentence)
    print(indices)

.. code-block:: python

    [(0, 7), (8, 11), (12, 20), (20, 21)]

This tells you that the first word is found at ``sentence[0:7]``, the second word at ``sentence[8:11]``, and so on.

There is also a ``detokenise`` method that is meant to *approximately* invert the ``tokenise`` method by returning the original text given a list of tokens (although tokenisation is generally a lossy transformation which means that there is no guarantee that the original text can be recovered):

.. code-block:: python
    :linenos:

    import malti.tokeniser

    tokeniser = malti.tokeniser.KMTokeniser()
    tokens = ['Eżempju', "ta'", 'sentenza', '.']
    text = tokeniser.detokenise(tokens)
    print(text)

.. code-block:: python

    'Eżempju ta\' sentenza.'

Available tokenisers
--------------------

The following tokenisers are available:

* ``malti.tokeniser.RegexTokeniser`` (:doc:`../malti/tokeniser/regex_tokeniser`): A tokeniser where you have to supply a regular expression that matches words.
* ``malti.tokeniser.KMTokeniser`` (:doc:`../malti/tokeniser/km_tokeniser/km_tokeniser`): A ``RegexTokeniser`` that is equivalent to the one used to tokenise the `Korpus Malti `_.
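
To illustrate the general idea behind a tokeniser driven by a regular expression, and how token indices relate to the tokens themselves, here is a minimal sketch that uses only Python's standard ``re`` module.
It is not the ``malti`` implementation: the pattern ``WORD_PATTERN`` and the functions ``sketch_tokenise_indices`` and ``sketch_tokenise`` are hypothetical names made up for this example, and the pattern is far simpler than what a real Maltese tokeniser would need.

.. code-block:: python

    import re

    # Hypothetical pattern (an assumption, not the pattern used by malti):
    # a run of word characters with an optional trailing apostrophe
    # (as in "ta'"), or any single character that is neither a word
    # character nor whitespace (such as punctuation).
    WORD_PATTERN = re.compile(r"\w+'?|[^\w\s]")

    def sketch_tokenise_indices(text):
        """Return the (start, end) character offsets of every match in text."""
        return [match.span() for match in WORD_PATTERN.finditer(text)]

    def sketch_tokenise(text):
        """Return the substrings of text selected by the offsets above."""
        return [text[start:end] for (start, end) in sketch_tokenise_indices(text)]

    sentence = "Eżempju ta' sentenza."
    print(sketch_tokenise_indices(sentence))
    # [(0, 7), (8, 11), (12, 20), (20, 21)]
    print(sketch_tokenise(sentence))
    # ['Eżempju', "ta'", 'sentenza', '.']

In this sketch the tokens are derived from the indices, so slicing the original string with each index pair always gives back the corresponding token.
The whitespace between matches is discarded, which is one reason why going from tokens back to the original text can only be approximate.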