Developing Maḵẖzan into the best available Urdu text corpus

The recent launch of the Matnsāz Public Beta was enabled by the largest update to Maḵẖzan we have ever made.

Maḵẖzan is an open-source Urdu text corpus. As we built the autocorrect technology that powers Matnsāz, we found that existing Urdu text corpora were limited. Instead of building a new proprietary corpus, or relying on workarounds for bad data, we decided to create a high-quality Urdu data source and share it.

As of today, Maḵẖzan consists of 3.3 million words of Urdu text, and we are working to grow it to 4 million words in the near future. We selected text from publications with high editorial standards, ensuring linguistic quality. Each document was converted into XML, with rich semantic tagging. We also cleaned each document to remove typographical errors. In total we have made tens of thousands of corrections and additions to the source text. All of this is available for free, for both commercial and non-commercial use.

The core difference between Maḵẖzan and previous work in this field is quality. Our methods are a fundamental shift from prior approaches, and that shift enables a step-function improvement in the quality of the resulting data.

As a result, we believe Maḵẖzan is the best available data source for Urdu text today. If you are building Urdu software or researching trends in the language, Maḵẖzan is the place to start this work.

I am writing this post to share a little about the work that went into building Maḵẖzan, why we believe it is a significant milestone for Urdu technology and research, and to thank all those who made it possible.

The basic problem of building machine learning software for Urdu is that it is difficult to find good data sources from which algorithms can learn the best practices of writing Urdu. In fact, the poor quality of Urdu text input mechanisms and publishing software has meant that many mistakes in Urdu orthography are not just incidental but systemic. Almost the entirety of Urdu text published today contains errors.

After manually reviewing hundreds of thousands of words of typed text, we were able to identify the most common errors in digital Urdu text. Among these errors are:

  • Arbitrary use of spaces in and around words
  • Inconsistent use of spaces around punctuation
  • Diacritics added to the wrong character (often to spaces rather than to letters)
  • Incorrect punctuation (such as two single quotation marks used in place of a double quotation mark)

In addition, for some of the text we received, we made source-specific modifications to make the text better suited for linguistic analysis. For example, from the text we received from the Bunyad Journal, we removed endnotes, which did not contain much linguistic richness.

A lot of the Urdu text we were using quoted text from other languages. English is commonly found in Urdu text, often for technical vocabulary. But we also saw frequent use of Arabic and Farsi, as well as other regional languages of South Asia, often quoted from other material. To make resulting analyses cleaner, we annotated every line of text in our corpus to indicate where the text is no longer Urdu. This allows analyses to ignore other languages if needed, without compromising the integrity of the text.

Each document in Maḵẖzan contains not just the text but also its semantic properties. We have painstakingly identified whether each line of text is a heading, paragraph, list, table, blockquote, or verse. We did this because we wanted to enable as many scenarios of linguistic analysis as possible. Building this semantic richness into the dataset from the get-go allows Maḵẖzan to be used for things we haven't even thought of yet.
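
To give a flavour of what this enables, here is a small sketch of how semantically tagged text can be consumed programmatically. The element names are illustrative only, not necessarily the exact tags in the Maḵẖzan schema.

```python
# A small sketch of consuming semantically tagged text. The element
# names here are illustrative, not necessarily the exact Maḵẖzan schema.
import xml.etree.ElementTree as ET

sample = """
<document>
  <heading>مثال</heading>
  <paragraph>یہ ایک مثال ہے۔</paragraph>
  <verse>…</verse>
</document>
"""

root = ET.fromstring(sample)

# Because every line carries a semantic tag, an analysis can select
# exactly the kind of text it cares about -- e.g. prose only, skipping verse.
prose = [el.text for el in root if el.tag == "paragraph"]
print(prose)
```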

We also tagged each document with data about its authorship and publication. This data is often left out of data sources, but we felt it was essential to build a notion of authorship into Maḵẖzan. Most algorithms today are opaque, and when they make decisions it is hard to tell why. This is particularly true of models based on machine learning, and especially deep learning. The lack of transparency about these models is worsened by the fact that we don't know what data they learned from, and often we find out too late that a model is biased because the data it was fed was biased as well. To be more cognizant of propagating such biases, we wanted to allow software and analyses to select the text from Maḵẖzan that best fits their use case, and to allow us and external researchers to audit Maḵẖzan for its quality and diversity. Consequently, the metadata we have added to each document is not just human-readable but designed to be easily parseable programmatically. There is still work to be done to ensure this metadata is accurate and descriptive, and we are committed to continuing to push the quality bar higher.
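
As a sketch of the kind of audit this metadata makes possible, the snippet below tallies a toy collection of documents by publication. The field names are assumptions for illustration, not the actual Maḵẖzan metadata schema.

```python
# A sketch of auditing a corpus through its metadata. The field names
# ("publication", "words") are illustrative, not the actual Maḵẖzan schema.
from collections import Counter

documents = [
    {"publication": "Ishraq", "words": 1200},
    {"publication": "Bunyad", "words": 5400},
    {"publication": "IBC Urdu", "words": 800},
]

# How much text comes from each publication? Heavily skewed counts are an
# early warning that models trained on the data may inherit that skew.
words_by_publication = Counter()
for doc in documents:
    words_by_publication[doc["publication"]] += doc["words"]

print(words_by_publication.most_common())
```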

The truth of how we were able to do all this is simple. We spent a lot of time reading every document in Maḵẖzan, fixing it manually, and then reviewing our fixes, before adding it to the corpus. A large fraction of the work going into Maḵẖzan has been done by Hamza Safdar and Akbar Zaman. Hamza and Akbar have spent months helping us clean this data and improve our processes, and have ultimately set the foundation for much Urdu technology and knowledge to be built on top of Maḵẖzan.

Ironically, a lot of people ask me if we could use artificial intelligence to do this work for us. We have to explain that we can't, because for artificial intelligence to work in this domain, we have to give it something good to learn from. Without good data, artificial intelligence is useless. And while some of the statistical approaches that underpin artificial intelligence can take us a bit farther, a large, clean text corpus is a game-changer.

So instead of trying to build software that understood the language without having read it, we built software and processes to help our human editors do this work better and faster.

After identifying common sources of error, we converted the deterministic fixes into scripts we ran on every piece of text. For example, we know how many spaces should occur before and after punctuation, and which quotation characters are correct, so these fixes were made programmatically through text editor plugins and scripts we wrote. I have to thank Shaoor Munir and Muhammad Haroon for catalyzing this approach, taking our entirely manual process and moving it forward by a huge leap.
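
As an illustration of what these deterministic fixes look like, here is a simplified sketch (not our actual scripts) that normalizes spacing around common Urdu punctuation and replaces doubled single quotation marks with a double quotation mark.

```python
# A simplified sketch of deterministic clean-up rules, not our actual scripts.
import re

def clean(text: str) -> str:
    # Replace two single quotation marks with a double quotation mark.
    text = text.replace("''", '"')
    # Remove spaces before the Urdu full stop ۔, question mark ؟ and comma ،,
    # then ensure exactly one space after them.
    text = re.sub(r"\s+([۔؟،])", r"\1", text)
    text = re.sub(r"([۔؟،])(?!\s|$)", r"\1 ", text)
    # Collapse runs of spaces into a single space.
    text = re.sub(r" {2,}", " ", text)
    return text.strip()

print(clean("یہ جملہ ہے ۔اگلا جملہ شروع  ہوتا ہے۔"))
```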

To semantically tag text, we spent hundreds of human hours identifying the right tag for each line of source text. Semantic tags identify what function the text serves in the document (such as headings, paragraphs, lists, tables, blockquotes, and verses). Waqas Ali suggested a clever mechanism: use screen automation software to read the styles attached to each line of text in a word processor, and infer from them the semantic tag to attach to each line. This was a quantum leap in our processing. Since this process was not foolproof (occasionally the styles used in the source text were not consistent or perfectly predictable), we followed the automated step with a couple of rounds of manual review.
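
The inference step itself can be as simple as a lookup from style names to semantic tags, with anything unrecognized set aside for a human. The style names below are hypothetical; in practice source styles were less consistent, which is why the manual review rounds remained necessary.

```python
# A sketch of mapping word-processor style names to semantic tags.
# The style names are hypothetical; real documents were less consistent.
STYLE_TO_TAG = {
    "Heading 1": "heading",
    "Heading 2": "heading",
    "Normal": "paragraph",
    "Quote": "blockquote",
    "Verse": "verse",
}

def infer_tag(style_name: str) -> str:
    # Fall back to a review marker so a human looks at anything unexpected.
    return STYLE_TO_TAG.get(style_name, "needs-review")

print(infer_tag("Heading 1"))   # heading
print(infer_tag("Poem Style"))  # needs-review
```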

For text that was published in HTML, we were able to infer semantic tags with more confidence from the underlying markup.
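
A rough sketch of that kind of inference, using Python's built-in HTML parser and an assumed mapping from HTML elements to our semantic tags:

```python
# A rough sketch of inferring semantic tags from HTML markup.
from html.parser import HTMLParser

HTML_TO_TAG = {"h1": "heading", "h2": "heading", "p": "paragraph",
               "blockquote": "blockquote", "li": "list"}

class TagInferrer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.current = None
        self.lines = []          # (semantic_tag, text) pairs

    def handle_starttag(self, tag, attrs):
        self.current = HTML_TO_TAG.get(tag)

    def handle_endtag(self, tag):
        self.current = None

    def handle_data(self, data):
        if self.current and data.strip():
            self.lines.append((self.current, data.strip()))

parser = TagInferrer()
parser.feed("<h1>عنوان</h1><p>یہ ایک پیراگراف ہے۔</p>")
print(parser.lines)
```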

Identifying word boundaries was an extremely interesting challenge. Because current Urdu typography often falls short aesthetically, many Urdu writers add and remove space characters arbitrarily to achieve a better aesthetic fit. Computers rely on explicit space characters to understand where a word ends and where a new one begins. Without this knowledge, the ability to make any sense of the language is deeply constrained.

Most prior work in the field of Urdu language processing has attempted to get around this by building algorithms that make a best guess at how to correct errors in Urdu text. These are based on linguistic rules and probabilistic estimates of the best way to correct erroneous Urdu text. While they achieve some success, they are inherently limited by not having seen a large volume of reliably correct Urdu text.

Our approach was again, in a sense, manual. Instead of guessing the best fixes for words, we read every word and fixed all of the ones we thought were wrong.

In our first pass we look through each piece of text and find single letters that are not connected to any word. These are commonly a symptom of the poor use of spaces around words, and we make manual corrections wherever necessary.
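
A sketch of that first pass: flag single Arabic-script letters standing alone between spaces, which usually point to a misplaced word boundary. The character ranges and edge cases here are simplified.

```python
# Flag single Arabic-script letters that stand alone -- a common
# symptom of a misplaced space. Illustrative sketch only.
import re

# \b is unreliable for Arabic script, so match on whitespace boundaries instead.
ISOLATED_LETTER = re.compile(r"(?:^|\s)([\u0621-\u06FF])(?=\s|$)")

def find_isolated_letters(line: str):
    return [m.group(1) for m in ISOLATED_LETTER.finditer(line)]

print(find_isolated_letters("اس جملے میں ک ایک اکیلا حرف ہے"))  # ['ک']
```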

As a second step, we programmatically go through each word in our corpus and see whether it can be split into two better candidate words. Similarly, we take each pair of words and assess whether removing the space between them results in a more correct candidate word. This method produces a lot of noise, so we filter the results down to candidate suggestions that appear more commonly than the source words. We then manually review each correction before making it in the text. Through this method we fix issues of word boundaries in bulk, and concentrate on the largest sources of error.
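
A minimal sketch of this candidate generation and frequency filter, using toy word counts in place of the full corpus:

```python
# A minimal sketch of generating split/merge candidates and keeping only
# those more frequent than the original. Every surviving suggestion still
# goes to manual review.
from collections import Counter

# Toy counts; in practice these come from the whole corpus.
counts = Counter({"کتاب": 50, "خانہ": 40, "کتابخانہ": 70})

def split_candidates(word):
    # Try every internal split point; suggest it only if both halves
    # are seen more often than the joined form.
    out = []
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if min(counts[left], counts[right]) > counts[word]:
            out.append((word, f"{left} {right}"))
    return out

def merge_candidate(left, right):
    # Suggest removing the space only if the merged form is more common.
    merged = left + right
    if counts[merged] > min(counts[left], counts[right]):
        return (f"{left} {right}", merged)
    return None

print(split_candidates("کتابخانہ"))     # [] -- the joined form is more common
print(merge_candidate("کتاب", "خانہ"))  # ('کتاب خانہ', 'کتابخانہ')
```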

Of course this manual process is not always clear cut. While for some words there is a canonical orthographic tradition, many words do not have a clear tradition, and others are written in multiple ways. To make decisions here, we have experts in the Urdu language and in the orthographic tradition of the Arabic script review all suggestions where we are unsure. Shiraz Ali has been critical in this effort.

To make this work faster, we wrote command line tools to quickly go through candidate suggestions and mark them as ready to fix, reject them, or put them on hold as we conduct further research into the orthographic tradition. Hassan Talat was critical in the effort to scale our operations here.
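
Our actual tooling was richer than this, but a bare-bones sketch of the accept / reject / hold flow looks something like:

```python
# A bare-bones sketch of a command-line review loop for candidate fixes.
def review(candidates):
    accepted, rejected, on_hold = [], [], []
    for original, suggestion in candidates:
        answer = input(f"{original} -> {suggestion}  [a]ccept / [r]eject / [h]old: ").strip().lower()
        if answer == "a":
            accepted.append((original, suggestion))
        elif answer == "r":
            rejected.append((original, suggestion))
        else:
            on_hold.append((original, suggestion))
    return accepted, rejected, on_hold

if __name__ == "__main__":
    print(review([("کتاب خانہ", "کتابخانہ")]))
```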

To identify languages other than Urdu in source text, we relied primarily on visual cues. Text in other scripts is easy to identify. When text in other languages written in the Arabic script is included in Urdu text, it is often typeset in a different style. Using these visual cues we were able to isolate text in other languages, and then use a combination of manual review and translation software to correctly identify and tag the language of the text. Where no visual demarcation is present, we scan the text for names of other languages, and isolate passages for manual review where needed.
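
Two of these signals are easy to automate: characters outside the Arabic script are trivially detectable from Unicode ranges, and mentions of language names can flag a passage for review. The sketch below is illustrative, and everything it flags still goes to manual review.

```python
# A sketch of two automatic signals for non-Urdu text. Everything flagged
# here still goes to manual review.
import re

LATIN = re.compile(r"[A-Za-z]")
# Names of languages as they appear in Urdu text (illustrative list).
LANGUAGE_NAMES = ["عربی", "فارسی", "انگریزی", "پنجابی", "سندھی"]

def flags(line: str):
    found = []
    if LATIN.search(line):
        found.append("contains Latin-script text")
    for name in LANGUAGE_NAMES:
        if name in line:
            found.append(f"mentions a language name: {name}")
    return found

print(flags("یہ لفظ English میں ہے"))
```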

So what can you do with all this data?

Maḵẖzan underpins the intelligence of Matnsāz, a breakthrough new Urdu keyboard. Matnsāz uses an autocorrect algorithm built for Urdu from the ground up, drawing on all the work we put into Maḵẖzan. Critically, Matnsāz also presents a new way of typing Urdu, converting the traditional alphabet-based keyboard into a shape-based layout. This design is only possible in a software-based paradigm, and can only work if supported by high-quality data. Software allows us to change the function of keyboard keys dynamically, in a way that is not possible in hardware: one key can insert the right consonant, based on shape, by understanding the current context of what you are typing. This understanding is only possible when learning from good data. So, as with Matnsāz, if you are building new software for the Urdu language, Maḵẖzan is the place to start.
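
To be clear, the sketch below is a toy illustration of the general idea, not Matnsāz's actual algorithm: several Urdu letters share one base shape (such as ب ت ث پ ٹ), and frequencies from a corpus like Maḵẖzan can suggest which of them the typist most likely intends, given what has been typed so far.

```python
# A toy illustration only -- not Matnsāz's actual algorithm. Letters that
# share a base shape are disambiguated by how often each extended prefix
# appears in the corpus.
from collections import Counter

# Toy prefix counts, as if derived from a corpus like Maḵẖzan.
prefix_counts = Counter({"کت": 120, "کب": 40, "کث": 2, "کپ": 15, "کٹ": 30})

BA_SHAPE = ["ب", "ت", "ث", "پ", "ٹ"]

def best_letter(typed_so_far: str, shape_group=BA_SHAPE) -> str:
    # Rank candidate letters by how often the resulting prefix occurs.
    return max(shape_group, key=lambda ch: prefix_counts[typed_so_far + ch])

print(best_letter("ک"))  # ت, because the prefix کت is most frequent here
```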

But the applications of Maḵẖzan are not limited to software. When we first started this work, we spent time talking to researchers in the humanities and language studies, as well as those in computer science, and found that a corpus of text can allow for an incredible range of analyses.

You can use Maḵẖzan today to identify how often a word has been used, how the use of terminology has changed over time, how the style of different writers varies, and how Urdu itself varies by region and time.

A teacher reached out to us to see if we could help identify the most commonly used pairs of letters in the Urdu language. This knowledge would help them identify which tor-jor exercises to concentrate on to help students learn the rules of cursive in the Arabic script. This analysis was trivial with Maḵẖzan, and we went from request to result within a few minutes.
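
For the curious, the core of that analysis is a few lines of code. The sketch below counts adjacent letter pairs within words, with a single sample sentence standing in for the corpus; a real run would also normalize diacritics and joiners first.

```python
# Count adjacent letter pairs (bigrams) within words. One sample sentence
# stands in for the corpus here.
from collections import Counter

def letter_pairs(text: str) -> Counter:
    pairs = Counter()
    for word in text.split():
        for a, b in zip(word, word[1:]):
            pairs[a + b] += 1
    return pairs

sample = "اردو زبان پاکستان اور بھارت میں بولی جاتی ہے"
print(letter_pairs(sample).most_common(5))
```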

If you are interested in pursuing any analyses that will use Maḵẖzan, please do get in touch. We'd love to hear from you.

Maḵẖzan is currently composed of text from three publications:

  • Ishraq, published by the Al-Mawrid Institute
  • Bunyad, published by the Gurmani Centre at the Lahore University of Management Sciences
  • IBC Urdu

I would like to thank all three institutions for their generosity and partnership. In particular I would also like to thank the following individuals.

From the Al-Mawrid Institute:

  • Jawad Ahmed
  • Aamer Abdullah
  • Munir Ahmed

From the Gurmani Centre at the Lahore University of Management Sciences:

  • Bilal Tanweer

Gratitude is also due to Ateeb Gul for bringing us together with the Gurmani Centre.

And from IBC Urdu:

  • Tahir Umar
  • Sabookh Syed

I would like to express my deepest gratitude to the individuals who have been critical to developing Maḵẖzan (in alphabetical order of last name):

  • Shiraz Ali
  • Waqas Ali
  • Muhammad Haroon
  • Shaoor Munir
  • Hamza Safdar
  • Hassan Talat
  • Akbar Zaman

I would also like to thank Fareed Zaffar, for setting up the partnership between Matnsāz and the Technology for People Initiative at the Lahore University of Management Sciences. This partnership brought me together with Muhammad Haroon, Shaoor Munir, Hamza Safdar, and Akbar Zaman.

We will continue to improve Maḵẖzan. I hope this work enables new knowledge and technology for the Urdu language.

You can read more about Maḵẖzan on its webpage, in our blog post introducing Maḵẖzan, and on the GitHub repository.

Zeerak Ahmed