متن‌ساز

Makhzan (مخزن)

An Urdu text corpus for machine learning, natural language processing and linguistic analysis

Clean, structured, high-quality data

 
 

Spend no more time cleaning your dataset

Most Urdu text is riddled with linguistic and typographical errors. Without software-based spell-checking and grammar correction, many errors are published without ever being caught. Poor-quality typefaces also force publishers to add or remove spaces arbitrarily for aesthetic purposes. As a result, it is impossible to get clean word boundaries. Maḵẖzan was built over months to correct these issues, and more continue to be addressed as we grow the corpus.

 

Stand on solid structure

Currently available corpora are at best a large chunk of text in an unattributed text file. You have no idea where the text is from, when it was published, or who wrote it, let alone which part of the text you are dealing with (headings, paragraphs, lists, tables, etc.). Every piece of text in Maḵẖzan is richly structured. Document hierarchy lives as XML tags, dense metadata identifies details of publication, and text from other languages is annotated. This means that the fidelity of your analyses can be that much higher, and that we can ensure our corpus is representative of Urdu spoken all across the world.
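As a rough illustration of what this structure makes possible, here is a minimal Python sketch that reads publication metadata and extracts Urdu text while skipping annotated foreign-language spans. The sample document and every tag and attribute name in it (document, meta, title, author, body, heading, p, annotation, lang) are assumptions made up for this example, not the actual Maḵẖzan schema; consult the technical documentation for the real element names.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. The tag and attribute names are
# illustrative assumptions, not the real Makhzan schema.
SAMPLE = """<document>
  <meta>
    <title>Sample title</title>
    <author>Sample author</author>
  </meta>
  <body>
    <heading>عنوان</heading>
    <p>یہ ایک <annotation lang="en">test</annotation> جملہ ہے</p>
  </body>
</document>"""


def read_metadata(root):
    """Collect the children of the (assumed) <meta> element into a dict."""
    meta = root.find("meta")
    if meta is None:
        return {}
    return {child.tag: (child.text or "").strip() for child in meta}


def text_without_foreign(element, foreign_tag="annotation"):
    """Return the element's text, dropping spans wrapped in foreign_tag."""
    parts = [element.text or ""]
    for child in element:
        if child.tag != foreign_tag:
            parts.append(text_without_foreign(child, foreign_tag))
        # The tail is the text that follows the child inside its parent,
        # so it is kept even when the child itself is skipped.
        parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring(SAMPLE)
metadata = read_metadata(root)
paragraph = root.find("body/p")
urdu_words = text_without_foreign(paragraph).split()

print(metadata)
print(urdu_words)
```

Because foreign-language spans are marked up rather than mixed into the running text, a filter like this can be a few lines long instead of a script-detection heuristic.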

 

Learn from the language of scholars

Maḵẖzan began with generous initial donations of text from two renowned journals: Bunyad, from the Gurmani Center of Literature and Languages at the Lahore University of Management Sciences (LUMS), and Ishraq, from the Al-Mawrid Institute. This choice of sources gave us a diversity of voices even in a small initial corpus, while ensuring the highest editorial standards available in published Urdu text. As a result, your models can also maintain high linguistic standards.

 

Completely free of cost

All this is available to use immediately, at no cost and without any hindrance. The only limitation is that text from the corpus cannot be distributed with your software (to protect our writers). For any training or analysis, though, it will take you minutes to get started.

 
 

Get started in minutes

Have an application that just needs basic statistics like word frequencies? Those come pre-calculated. Have an analysis you want to build? Start with some of our built-in scripts. Want to use the corpus with a completely different technology stack? Simply use any XML parser with our documentation and get going. Start from scratch and be done in minutes.
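For the "any XML parser" route, a minimal sketch using only the Python standard library might look like the following. It counts word frequencies across paragraphs of a small inline sample. The sample and its tag names (document, body, section, p) are assumptions for illustration only, and whitespace splitting is used here purely as a simple stand-in tokenizer; the pre-calculated frequencies and built-in scripts that ship with the corpus are not reproduced here.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample; the real corpus files may use different tag names.
SAMPLE = """<document>
  <body>
    <section>
      <p>اردو زبان ایک خوبصورت زبان ہے</p>
      <p>اردو ادب</p>
    </section>
  </body>
</document>"""


def word_frequencies(xml_text, paragraph_tag="p"):
    """Count whitespace-separated tokens in every paragraph element."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for paragraph in root.iter(paragraph_tag):
        # itertext() also picks up text nested inside child elements.
        text = "".join(paragraph.itertext())
        counts.update(text.split())
    return counts


freqs = word_frequencies(SAMPLE)
for word, count in freqs.most_common(3):
    print(word, count)
```

To run the same function over a file on disk, read the file's contents and pass them in; the paragraph_tag parameter is there so the assumed tag name can be swapped for whatever the documentation specifies.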

 

 

Read more technical documentation

 

 

Donating to this corpus

This corpus, and the many applications it will enable, depend on generous donations of text. If you are an Urdu writer or publisher who would like to help the cause of Urdu in modern software, please get in touch.

 

How will this corpus grow over time?

At the moment, we hold generous donations of text that we are converting into training-ready formats. This is a long, labor-intensive process. If you'd like to volunteer, we'd love your support.

Our goal is to keep growing this corpus over time, and specifically to make it more representative. At the moment, like Urdu publishing itself, the corpus over-represents works by male authors and publications from the city of Lahore. We are working to add more diversity.