متن‌ساز

Announcing Naqqāsh

Alongside today's release of Maḵẖzan, we are also releasing Naqqāsh, a library of common string parsing and transformation methods for the Arabic script. For anyone building technology that deals with the Arabic script, Naqqāsh may save hours of work on common transformations.

The technology that underpins Naqqāsh was one of the first things we built for Matnsāz. In fact, Matnsāz would be crippled without it. As we built Maḵẖzan, our Urdu text corpus, Naqqāsh provided the essential functionality we needed to standardize and correct text for use in our natural language processing work for Matnsāz.

Really, Naqqāsh is a set of methods that any developer would rightly expect to come built in with every programming language. Most of the functions Naqqāsh provides, such as identifying whether a Unicode code point is a letter in the Arabic script, are common operations that we perform every day with Latin text. Unfortunately, things often break quickly the moment we move to the Arabic script. So Naqqāsh provides a set of core functionality that builds on Unicode and existing string functions to provide specific features for the Arabic script.

Some examples:

print(Naqqash.isLetter("ب"))
>>> true

print(Naqqash.isLetter("۲"))
>>> false

print(Naqqash.isLeftJoining("ب"))
>>> true

print(Naqqash.addTatweelTo("ب", toDisplay: Naqqash.ContextualForm.Medial))
>>> "ـبـ"

print(Naqqash.removeDiacritics("اَب"))
>>> "اب"
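
To give a sense of what building on Unicode can look like, here is a minimal sketch that uses only the Swift standard library. It is not Naqqāsh's implementation: the names `arabicBlock`, `isArabicLetterSketch` and `removeArabicMarksSketch` are hypothetical. The sketch also assumes that only the main Arabic block (U+0600 to U+06FF) matters, which leaves out the supplementary and presentation-form blocks.

```swift
// Hypothetical helpers for illustration only; not part of Naqqāsh.

// Assumption: only the main Arabic Unicode block is considered.
let arabicBlock: ClosedRange<UInt32> = 0x0600...0x06FF

/// Returns true when the character is a single scalar in the main Arabic
/// block whose Unicode general category is "Letter, other" (Lo).
/// Digits (Nd), combining marks (Mn) and the tatweel (Lm) are rejected.
func isArabicLetterSketch(_ character: Character) -> Bool {
    guard character.unicodeScalars.count == 1,
          let scalar = character.unicodeScalars.first else {
        return false
    }
    return arabicBlock.contains(scalar.value)
        && scalar.properties.generalCategory == .otherLetter
}

/// Drops every nonspacing mark (Mn) in the main Arabic block, which covers
/// the short-vowel diacritics. Caveat: it also drops combining madda and
/// hamza, so text stored in decomposed form would lose them.
func removeArabicMarksSketch(_ text: String) -> String {
    var result = String.UnicodeScalarView()
    for scalar in text.unicodeScalars {
        let isMark = arabicBlock.contains(scalar.value)
            && scalar.properties.generalCategory == .nonspacingMark
        if !isMark {
            result.append(scalar)
        }
    }
    return String(result)
}

// Usage, with scalars written as escapes to avoid bidirectional rendering issues:
print(isArabicLetterSketch("\u{0628}"))   // beh
print(isArabicLetterSketch("\u{06F2}"))   // extended Arabic-Indic digit two
print(removeArabicMarksSketch("\u{0627}\u{064E}\u{0628}") == "\u{0627}\u{0628}")
```

Under these assumptions the three calls print true, false and true. A real library has to go further than general categories, for example to know which letters join to their neighbours, and that is the kind of script-specific knowledge Naqqāsh packages up.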

Naqqāsh is nowhere near complete, but as we build out more features for Matnsāz, we will continue to open-source functionality that would be beneficial to other developers. Please get in touch with questions or feature requests.

See the repository on GitHub

Zeerak Ahmed