Announcing Maḵẖzan

Today I'm proud to announce Maḵẖzan – an Urdu text corpus built to support modern machine learning, natural language processing, and linguistic analysis work.

Building Maḵẖzan took the best part of a year. As we built the auto-correction technology for Matnsāz, we quickly hit a roadblock finding good training data for our algorithms. After we spent weeks attempting to build a proprietary text corpus, it was clear that sourcing high-quality data and then cleaning and analyzing it would be no small task. In fact, this became the primary and most difficult task of our work.

We know these struggles are not unique to us. Every developer of Urdu technology faces the same challenges – searching for clean data in easy-to-use formats. Urdu text corpora do exist, but most have not been updated in years. Many are built on old technology stacks, and some are even behind paywalls. There is little to no attribution, which means there is no way to ensure that the software we build furthers the language instead of sabotaging it with poor-quality recommendations.

All of these issues are intensified by the already subpar state of Urdu publishing software. Because only a handful of tools exist, many of which continue to be used in old, pirated versions, errors in transcription propagate into nearly all text produced. For example, the absence of spell-check means common typographical errors go uncaught even in high-profile publications. Further, spacing issues in Urdu fonts mean that writers and publishers use spaces arbitrarily to add whitespace wherever it is needed aesthetically. Almost no text uses the zero-width non-joiner character, which ends a cursive connection without adding a full space. These issues make it nearly impossible to ascertain word boundaries, and as a result many researchers have had to work hard on word-breaking algorithms just to cope with bad data.
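
To make the spacing problem concrete, here is a minimal Swift sketch – illustrative only, not Matnsāz or Maḵẖzan code – showing why a naive space-based tokenizer miscounts words when a space is used for visual shaping instead of the zero-width non-joiner:

    // "خوب صورت" written with a plain space splits into two tokens, while the
    // zero-width non-joiner (U+200C) ends the cursive join without adding a
    // word boundary, so the compound stays a single word.
    let zwnj = "\u{200C}"
    let withSpace = "خوب صورت"            // space used purely for visual shaping
    let withZWNJ = "خوب" + zwnj + "صورت"  // same word, joining controlled by ZWNJ

    func tokens(_ text: String) -> [String] {
        text.split(separator: " ").map(String.init)
    }

    print(tokens(withSpace).count) // 2 – a naive splitter sees two words
    print(tokens(withZWNJ).count)  // 1 – the compound remains intact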

Unfortunately, bad data prevents us from making any great strides in Urdu technology. Any modern machine learning application – from grammar correction to search, all the way to voice and optical character recognition – starts with an elemental need for a clean data set of Urdu text. As long as we keep spending our time constructing and cleaning data sets, our work on actual, useful applications remains curtailed.

Now it's true that large companies, Google the most notable among them, have managed to get Urdu technology off the ground with corpora they have presumably built from online publications. In many cases Hindi is offered much better native support, and much of that technology should scale to Urdu with clever modifications that are entirely within available technological capability. However, the issue with proprietary language technology, especially for underrepresented languages such as Urdu, is that the lack of attribution, sourcing and auditability makes it impossible to understand why the language algorithms behave as they do, what sort of language they promulgate, and whether they cause damage by highlighting the wrong sources for the wrong purposes.

To fix this, it was clear that we had to open-source Maḵẖzan. Over the course of the last year we have worked with institutions across the world that have access to high-quality Urdu publications. Once they understood the needs of Urdu software, scholars and researchers graciously partnered with us to build a resource for all researchers and developers.

We spent most of the year talking to Urdu publishers, explaining the needs of Urdu technology and how it could be helped by donations of high-quality Urdu literature. The good news is that most Urdu publishers, researchers and writers immediately understood that they are the primary customers for good Urdu software, and that together we can build the infrastructure for the Urdu technology to come.

From then on the work has been to take text from a variety of formats (HTML and InPage being the most prevalent) and convert it into a common syntactic framework for programmatic analysis. In other words, we have developed a format to represent Urdu text that makes it quick to identify what role each piece of text played in the original writing. As a result our XML hierarchy can be used to quickly identify headings, paragraphs, lists, tables, and text in other languages. This makes it incredibly easy to highlight the pieces of text relevant to your analysis and proceed from there.
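
As an illustration, here is a hedged Swift sketch of what pulling paragraphs out of such a file might look like, using Foundation's XMLDocument on macOS. The element names in the sample are hypothetical and may not match Maḵẖzan's actual schema; the point is only that a structured hierarchy makes extraction a few lines of code:

    import Foundation

    // Hypothetical sample in the spirit of the corpus format; the actual
    // element names in Maḵẖzan may differ.
    let sample = """
    <document>
      <heading>تعارف</heading>
      <p>یہ ایک مثال ہے۔</p>
      <p>اردو متن کا ایک اور پیراگراف۔</p>
      <annotation lang="en">An editorial note in English.</annotation>
    </document>
    """

    // Parse and select only the Urdu paragraphs, skipping headings and
    // English annotations.
    let doc = try! XMLDocument(xmlString: sample)
    let paragraphs = try! doc.nodes(forXPath: "//p").compactMap { $0.stringValue }
    print(paragraphs)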

This process also includes the correction of common errors in Urdu text. Mostly these are errors of spacing, the use of incorrect punctuation characters, and typographical mistakes, all of which are incredibly common. We have not been able to fix everything, and the corpus as it stands undoubtedly contains a few errors. But the data has been cleaned to an immense degree, and we continue to identify and fix errors as we find them.
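
For a flavor of what these corrections look like, here is a small hedged Swift sketch – not the actual Maḵẖzan cleaning pipeline – that remaps Latin punctuation to the correct Urdu characters and collapses runs of spaces:

    import Foundation

    // Latin punctuation that creeps in from keyboards and converters, mapped
    // to the correct Urdu characters.
    let punctuationFixes: [Character: Character] = [
        ",": "،",   // Arabic comma, U+060C
        ";": "؛",   // Arabic semicolon, U+061B
        "?": "؟"    // Arabic question mark, U+061F
    ]

    func cleaned(_ text: String) -> String {
        let remapped = String(text.map { punctuationFixes[$0] ?? $0 })
        // Collapse runs of spaces added for visual alignment into a single space.
        return remapped.replacingOccurrences(of: " +", with: " ",
                                             options: .regularExpression)
    }

    print(cleaned("کیا حال ہے?")) // کیا حال ہے؟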

To further reduce the cost of starting an analysis, Maḵẖzan comes with scripts that provide a starting point for basic analyses. For starters we've included two scripts – one to pre-process our text to use fully decomposed Unicode characters, and another to count word frequencies. We'll continue to add more common linguistic analyses to these scripts, but they serve as good starting points for any researcher. The output of our scripts is also provided with the corpus, which means that if you are a researcher or developer who just needs common analyses, such as the most common words in the Urdu language, you can simply copy the results of our work without writing any code yourself. Of course, if you'd like to dig in, this is easy to do.
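
For a sense of what these starting points involve, here is a minimal Swift sketch of both steps, assuming the text has already been extracted to a plain string. This is not the bundled scripts' actual code (those build on Naqqāsh, introduced below); it only illustrates the two analyses:

    import Foundation

    let text = "اردو زبان اردو تحریر"

    // 1. Pre-process to fully decomposed Unicode (NFD), so that a letter plus
    //    a combining diacritic is always stored in one canonical way.
    let decomposed = text.decomposedStringWithCanonicalMapping

    // 2. Count word frequencies over whitespace-separated tokens.
    var frequencies: [String: Int] = [:]
    for word in decomposed.split(whereSeparator: { $0.isWhitespace }) {
        frequencies[String(word), default: 0] += 1
    }

    for (word, count) in frequencies.sorted(by: { $0.value > $1.value }) {
        print(word, count)
    }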

Maḵẖzan's scripts build on the Naqqāsh Swift package, which was also built as a foundational technology for Matnsāz and which we are also open-sourcing today. You can read more about this release in an accompanying blog post.

While Maḵẖzan is not the largest Urdu corpus at the moment, we have sourced enough data for it to become so. Unfortunately transforming data into clean, structured, easy-to-use formats is a labor-intensive process. We will continue to add more data to Maḵẖzan to increase its linguistic breadth, diversity and quality.

I'd like to thank all the writers included in our corpus, the Gurmani Center of Literature and Languages, and the Al-Mawrid Institute.

Special thanks are also due to:

  • Ateeb Gul, previously at the Lahore University of Management Sciences, now at Boston University

  • Bilal Tanweer, Muhammad Naveed, Ghulam Moeen ud Din of the Lahore University of Management Sciences

  • Jawad Ghamidi, Aamer Abdullah and Munir Ahmed of the Al-Mawrid Institute

  • Najeeba Arif of the International Islamic University

  • Frances Pritchett and A. Sean Pue of Columbia University

  • Shiraz Ali, of the University of California, Berkeley

  • Awais Athar, who provided early feedback

  • Hassan Talat, who helped build the engineering pipeline for this corpus

We would love to hear from you about Maḵẖzan. If you have feature requests, questions, concerns or feedback, we can't wait to engage. This is an open-source project, and it would benefit greatly not just from your engagement but also from your contributions. If you would like to contribute data, or can help us transform and analyze text, please do get in touch.

For now, many thanks for your support. And we'd love to see what you do with Maḵẖzan.

See the repository on GitHub

Zeerak Ahmed