Arabic word tokenization system using the maximum matching model

https://doi.org/10.55214/25768484.v8i6.2682

Authors

  • Shahab Ahmad Almaaytah, Department of English Language and Humanity, Applied College, King Faisal University

Word tokenization is the first stage of higher-order Natural Language Processing (NLP) tasks such as Part-of-Speech (PoS) tagging, parsing, and named entity recognition. The volume of text on the World Wide Web grows daily, which calls for more capable processing tools. As the number of Arabic speakers worldwide increases, Arabic language processing systems need to improve. Word tokenization is difficult in Arabic because its script has no capitalization and the language makes frequent use of compound words. This paper proposes a novel knowledge-based Arabic word tokenization system. The system uses a maximum matching model in two variations: forward maximum matching and reverse maximum matching. It is implemented in Python. In evaluation, the system is reported to achieve high performance.
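As a rough illustration of the two variations named in the abstract, the following is a minimal sketch of generic forward and reverse maximum matching over a word list. It is not the author's implementation: the function names, the `max_len` parameter, the single-character fallback for unmatched text, and the toy Latin-script lexicon are all assumptions made for this example. The abstract does not describe the system's actual lexicon, its handling of unknown words, or how the two passes are combined.

```python
def forward_max_match(text, lexicon, max_len=None):
    """Greedy left-to-right segmentation (hypothetical helper).

    At each position, take the longest substring found in `lexicon`;
    if nothing matches, emit a single character (an assumed fallback).
    """
    if max_len is None:
        max_len = max((len(w) for w in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a lexicon hit.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                break
        else:
            j = i + 1  # no lexicon entry starts here
        tokens.append(text[i:j])
        i = j
    return tokens


def reverse_max_match(text, lexicon, max_len=None):
    """Greedy right-to-left segmentation (hypothetical helper).

    Same idea as the forward pass, but matching starts from the end
    of the string and the collected tokens are reversed at the end.
    """
    if max_len is None:
        max_len = max((len(w) for w in lexicon), default=1)
    tokens = []
    j = len(text)
    while j > 0:
        # Smallest start index gives the longest candidate ending at j.
        for i in range(max(0, j - max_len), j):
            if text[i:j] in lexicon:
                break
        else:
            i = j - 1  # no lexicon entry ends here
        tokens.append(text[i:j])
        j = i
    return tokens[::-1]


# Toy lexicon (illustrative only, not from the paper).
toy_lexicon = {"the", "theta", "table", "bled", "own", "down", "there"}
forward_tokens = forward_max_match("thetabledownthere", toy_lexicon)
reverse_tokens = reverse_max_match("thetabledownthere", toy_lexicon)
```

With this toy lexicon the forward pass yields `['theta', 'bled', 'own', 'there']`, while the reverse pass yields `['the', 'table', 'down', 'there']`. The two directions can disagree on the same input, which is one common motivation for running both.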

How to Cite

Almaaytah, S. A. (2024). Arabic word tokenization system using the maximum matching model. Edelweiss Applied Science and Technology, 8(6), 3210–3217. https://doi.org/10.55214/25768484.v8i6.2682

Section

Articles

Published

2024-10-29