Mohammed Al-Firas

Data Analytics, Software Development, Education

Projects


Optical Character Recognition Model - Ancient Arabic texts

For my undergraduate project, I developed an OCR framework to transcribe ancient Arabic manuscripts. The system first pre-processed images of the manuscripts, then segmented them into individual lines and words. Finally, an LSTM-based deep learning model recognised the text in each word image.
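The segmentation step can be illustrated with a projection-profile approach. This is a minimal sketch under assumptions, not necessarily the method used in the project: it assumes the page has already been binarised (1 = ink, 0 = background), and the function names `find_runs`, `segment_lines` and `segment_columns` are hypothetical.

```python
def find_runs(profile, min_ink=1):
    """Return (start, end) index pairs, end exclusive, for each
    consecutive stretch of the profile holding at least min_ink pixels."""
    runs, start = [], None
    for i, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = i                      # a run of ink begins
        elif ink < min_ink and start is not None:
            runs.append((start, i))        # the run ended on the previous index
            start = None
    if start is not None:                  # run reaches the edge of the image
        runs.append((start, len(profile)))
    return runs


def segment_lines(binary):
    """Split a binarised page (list of pixel rows) into text lines
    by summing the ink in every row."""
    return find_runs([sum(row) for row in binary])


def segment_columns(line_rows):
    """Split one text line into ink-bearing column ranges
    by summing the ink in every column."""
    return find_runs([sum(col) for col in zip(*line_rows)])


# Toy 6x8 "page": two blobs on the first line, one on the second.
page = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]

lines = segment_lines(page)                   # [(1, 3), (4, 5)]
first_line = page[lines[0][0]:lines[0][1]]
blobs = segment_columns(first_line)           # [(1, 3), (5, 7)]
print(lines, blobs)
```

On real manuscripts, a column split like this would over-segment, because Arabic words often contain unconnected letters. A practical version would likely merge runs separated by gaps narrower than some threshold and would need to cope with skewed or touching lines.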

I created and labelled a dataset of roughly 4,500 words taken from Qur'an manuscripts dating from the 7th to 9th centuries CE. The LSTM model was first trained on a larger open-source dataset of medieval Arabic texts, then fine-tuned on my dataset.

Doublets in Literary Texts

Doublets are passages in a text where the same saying, phrase or narrative appears more than once, usually with slight variation. They are a common feature of the New Testament Gospels and the Qur'an. For this project, I developed a script that identifies doublets and long formulaic expressions in texts, with a special focus on the Qur'an.
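One way to approach doublet detection is sketched below. It is an illustration under assumptions, not the project's actual script: passages that share at least one word n-gram become candidate pairs, and each pair is then scored with the standard library's `difflib.SequenceMatcher` so that slight variations still match. The function name `find_doublets`, the parameters `n` and `threshold`, and the sample passages are all hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def find_doublets(passages, n=3, threshold=0.6):
    """Return (id_a, id_b, score) for passage pairs that share at least one
    n-word sequence and whose word-level similarity reaches the threshold."""
    tokens = {pid: text.lower().split() for pid, text in passages.items()}

    # Index every n-gram by the passages it occurs in.
    index = defaultdict(set)
    for pid, words in tokens.items():
        for i in range(len(words) - n + 1):
            index[tuple(words[i:i + n])].add(pid)

    # Any two passages sharing an n-gram form a candidate pair.
    candidates = set()
    for pids in index.values():
        candidates.update(combinations(sorted(pids), 2))

    # Score candidates on whole word sequences to tolerate small variations.
    results = []
    for a, b in sorted(candidates):
        score = SequenceMatcher(None, tokens[a], tokens[b]).ratio()
        if score >= threshold:
            results.append((a, b, round(score, 2)))
    return results


# Placeholder passages for illustration only.
passages = {
    "p1": "whoever has ears to hear let him hear",
    "p2": "he who has ears to hear let him hear",
    "p3": "the sower went out to sow his seed",
}

doublets = find_doublets(passages)
print(doublets)
```

Here only `p1` and `p2` are reported, since `p3` shares no three-word sequence with either. For Arabic text, tokenisation would probably need extra normalisation first (for example, stripping diacritics), which this sketch leaves out.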

© Mohammed Al-Firas