Line joiners

Join a list of text lines into a single line, adding spaces between lines only where necessary and rejoining words that were hyphenated across a line break. This is useful when working with text extracted from a PDF or produced by an OCR application.

The join_lines function

The simplest way to join multiple lines into a single text in malti is as follows:

import malti.line_joiner

lines = ['Dan it-', 'test huwa', 'maqsum f\'div-', 'ersi linji.']
# Pass the list of lines; the result is a single joined string.
text = malti.line_joiner.join_lines(lines, fix_hyphenated_words=True)
print(text)
Dan it-test huwa maqsum f'diversi linji.

The LineJoiner class

The above is a convenience function that makes use of a default line joiner (RBLineJoiner in this version). To gain access to all the features of line joiners, they should be used in their class form, for example:

import malti.line_joiner

joiner = malti.line_joiner.RBLineJoiner()

lines = ['Dan it-', 'test huwa', 'maqsum f\'div-', 'ersi linji.']
# Assumes the class exposes the same join_lines method as the module-level function.
text = joiner.join_lines(lines, fix_hyphenated_words=True)
print(text)
Dan it-test huwa maqsum f'diversi linji.

Available line joiners

The following line joiners are available:

  • malti.line_joiner.RBLineJoiner (rb_line_joiner.py): A LineJoiner that processes lines with a rule-based algorithm.
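
To give a feel for what a rule-based joiner has to decide, here is a minimal sketch. It is not malti's implementation: naive_join_lines and KEEP_HYPHEN are hypothetical names made up for this illustration, and the set of article forms is deliberately small and incomplete. The sketch keeps a line-final hyphen when it follows a form of the Maltese definite article (as in 'it-test'), removes it otherwise (as in 'div-' + 'ersi'), and inserts a space between all other lines.

```python
# Illustrative only: a small, incomplete set of Maltese article forms that
# keep their hyphen. This is an assumption for the sketch, not malti's rules.
KEEP_HYPHEN = {"il", "l", "id", "in", "ir", "is", "it", "ix"}


def naive_join_lines(lines, fix_hyphenated_words=True):
    """Join lines using a few hand-written rules (hypothetical helper)."""
    text = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if not text:
            text = line
        elif fix_hyphenated_words and text.endswith("-"):
            words = text[:-1].split()
            last_word = words[-1].lower() if words else ""
            if last_word in KEEP_HYPHEN:
                # The hyphen belongs to the article: keep it, add no space.
                text += line
            else:
                # The hyphen only marked a line break: drop it.
                text = text[:-1] + line
        else:
            # Ordinary line boundary: separate with a single space.
            text += " " + line
    return text


lines = ['Dan it-', 'test huwa', 'maqsum f\'div-', 'ersi linji.']
print(naive_join_lines(lines))
```

With the example lines used on this page, the sketch produces the same joined sentence as shown above. A real joiner needs a much richer set of rules, which is why the library's own classes should be preferred.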