Arabic word tokenization system using the maximum matching model

Shahab Ahmad Almaaytah

doi:10.55214/25768484.v8i6.2682

Word tokenization is the first stage of higher-order Natural Language Processing (NLP) tasks such as Part-of-Speech (PoS) tagging, parsing, and named entity recognition. The volume of text on the World Wide Web grows daily, which calls for more capable processing tools. As the number of Arabic speakers worldwide increases, Arabic language processing systems must improve as well. Arabic word tokenization is difficult because the script has no capitalization and makes frequent use of compound words. This paper proposes a novel knowledge-based Arabic word tokenization system. The system is built on the maximum matching model in its two variants, forward and reverse maximum matching, and is implemented in Python. The evaluation results indicate high performance.
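For readers unfamiliar with maximum matching, the following is a minimal Python sketch of the general technique, not the system evaluated in the paper. It assumes the input is an unspaced string, that the lexicon is a plain set of known words, and that an unknown character falls back to a single-character token. The function names, the `max_len` parameter, and the toy Latin-script lexicon are all hypothetical and chosen only for readability; the same slicing logic applies to Arabic strings.

```python
def forward_max_match(text, lexicon, max_len=None):
    """Greedy left-to-right segmentation: at each position take the
    longest lexicon entry that starts there."""
    if max_len is None:
        # Assumption: the longest lexicon entry bounds the search window.
        max_len = max((len(w) for w in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fallback (assumption): emit one character if nothing matches.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


def reverse_max_match(text, lexicon, max_len=None):
    """Greedy right-to-left segmentation: at each position take the
    longest lexicon entry that ends there."""
    if max_len is None:
        max_len = max((len(w) for w in lexicon), default=1)
    tokens = []
    j = len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                j -= size
                break
    tokens.reverse()  # tokens were collected from the end of the string
    return tokens


# Hypothetical toy lexicon, for illustration only.
toy_lexicon = {"the", "theta", "table", "bled", "own", "down", "there"}
sample = "thetabledownthere"

forward_tokens = forward_max_match(sample, toy_lexicon)
reverse_tokens = reverse_max_match(sample, toy_lexicon)
print(forward_tokens)  # ['theta', 'bled', 'own', 'there']
print(reverse_tokens)  # ['the', 'table', 'down', 'there']
```

The two directions can disagree on the same input, as they do here, which is one reason both variants are commonly run and compared. How the paper's system resolves such disagreements is not stated in the abstract.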


How to Cite

Almaaytah, S. A. (2024). Arabic word tokenization system using the maximum matching model. Edelweiss Applied Science and Technology, 8(6), 3210–3217. https://doi.org/10.55214/25768484.v8i6.2682