
tokenization function and removing all non-alphabetic characters

Open Paweł Tometczak requested to merge pawel.tometczak-master-patch-87597 into master
2 files changed: 28 additions, 0 deletions (this file: 14 additions, 0 deletions)
def get_letter_order():
    """Returns a dictionary mapping each letter or combination of letters to a numerical value representing its
    position in the Kashubian alphabet. This ordering is used to sort words alphabetically in the Kashubian language."""
    # Note: the key '' (value 10) is empty as given; an empty string never matches a character of a word.
    return {'a': 1, 'ą': 2, 'ã': 3, 'b': 4, 'c': 5, 'ch': 6, 'cz': 7, 'd': 8, 'dz': 9, '': 10, 'e': 11,
            'é': 12, 'ë': 13, 'f': 14, 'g': 15, 'h': 16, 'i': 17, 'j': 18, 'k': 19, 'l': 20, 'ł': 21, 'm': 22,
            'n': 23, 'ń': 24, 'ò': 25, 'o': 26, 'ó': 27, 'ô': 28, 'p': 29, 'r': 30, 'rz': 31, 's': 32, 'sz': 33, 't': 34,
            'ù': 35, 'u': 36, 'w': 37, 'y': 38, 'z': 39, 'ż': 40}


def sort_words(words):
    """Sorts a list of words alphabetically in the Kashubian language. Uses the ordering defined by the get_letter_order()
    function to determine the order of letters and combinations of letters in each word."""
    letter_order = get_letter_order()
    # The key is built one character at a time, so the two-letter entries ('ch', 'cz', 'dz', 'rz', 'sz')
    # are never looked up; characters missing from the table fall back to their Unicode code point.
    sorted_words = sorted(words, key=lambda w: [letter_order.get(x, ord(x)) for x in w.lower()])
    return sorted_words