Announcing Naqqash
Alongside today's release of Maḵẖzan, we are also releasing Naqqash. Naqqash is a library of common string parsing and transformation methods for the Arabic script. For anyone building technology that deals with the Arabic script, Naqqāsh may save hours of work on common transformations.
The technology that underpins Naqqash was one of the first things we built for Matnsaz. In fact, Matnsaz would be crippled without it. As we built Makhzan, our Urdu text corpus, Naqqash provided the essential functionality we needed to standardize and correct text for use in our natural language processing work for Matnsaz.
Really, Naqqash is a set of methods that any developer would rightly imagine come built in with every programming language. Most of the functions Naqqash provides, such as identifying whether a Unicode code point is a letter in the Arabic script, are operations we perform every day on Latin text. Unfortunately, things often break the moment we move to the Arabic script. So Naqqash builds on Unicode and existing string functions to provide core functionality specific to the Arabic script.
Some examples:
print(Naqqash.isLetter("ب"))
>>> true
print(Naqqash.isLetter("۲"))
>>> false
print(Naqqash.isLeftJoining("ب"))
>>> true
print(Naqqash.addTatweelTo("ب", toDisplay: Naqqash.ContextualForm.Medial))
>>> "ـبـ"
print(Naqqash.removeDiacritics("اَب"))
>>> "اب"
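To give a feel for what "building on Unicode" can look like, here is a minimal sketch of two such checks written against Swift's standard Unicode scalar properties. It is not Naqqash's actual implementation: the function names `isArabicLetterSketch` and `removeArabicMarksSketch` are hypothetical, and the sketch assumes only the main Arabic block (U+0600 to U+06FF) matters, which ignores the supplementary and presentation-form blocks a real library would likely need to handle.

```swift
// Hypothetical sketch only; these helpers are NOT part of Naqqash.

/// Returns true if the scalar is in the main Arabic block (U+0600 to U+06FF)
/// and Unicode classifies it as a letter (general category "Lo").
/// Digits, diacritics, and the tatweel (category "Lm") are therefore excluded.
func isArabicLetterSketch(_ scalar: Unicode.Scalar) -> Bool {
    guard (0x0600...0x06FF).contains(scalar.value) else { return false }
    return scalar.properties.generalCategory == .otherLetter
}

/// Returns a copy of the text with nonspacing marks (general category "Mn")
/// from the main Arabic block removed, e.g. the fatha U+064E.
func removeArabicMarksSketch(_ text: String) -> String {
    var result = String.UnicodeScalarView()
    for scalar in text.unicodeScalars {
        let isArabicMark = (0x0600...0x06FF).contains(scalar.value)
            && scalar.properties.generalCategory == .nonspacingMark
        if !isArabicMark {
            result.append(scalar)
        }
    }
    return String(result)
}

print(isArabicLetterSketch("\u{0628}"))  // ب, prints true
print(isArabicLetterSketch("\u{06F2}"))  // ۲, prints false
print(removeArabicMarksSketch("\u{0627}\u{064E}\u{0628}") == "\u{0627}\u{0628}")  // prints true
```

Even this small sketch has to decide which blocks and general categories count, which is the sort of detail a shared library can settle once.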
This set of functions is nowhere near complete, but as we build out more features for Matnsaz, we will continue to open-source functionality that would benefit other developers. Please get in touch with questions or feature requests.
