Data Analytics, Software Development, Education
For my undergraduate project, I developed an OCR framework to transcribe ancient Arabic manuscripts. The system first pre-processed images of the manuscripts, then segmented them into individual lines and words. Finally, an LSTM-based deep learning model recognised the text in each word image.
I created and labelled a dataset of roughly 4,500 words taken from Qur'an manuscripts dating from the 7th to the 9th century CE. The LSTM model was first pre-trained on a larger open-source dataset of medieval Arabic texts and then fine-tuned on my dataset.
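As an illustration of the line-segmentation step, here is a minimal sketch using a horizontal projection profile. This is an assumed technique chosen for illustration, not necessarily the method used in the project, and the function name `segment_lines`, its parameters, and the toy `page` image are all hypothetical. The input is a binarised image given as rows of 0 (background) and 1 (ink).

```python
from typing import List, Tuple


def segment_lines(
    binary: List[List[int]], min_ink: int = 1, min_height: int = 1
) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges of text lines; bottom is exclusive.

    Assumption: a text line is a run of consecutive rows that each
    contain at least `min_ink` ink pixels.
    """
    profile = [sum(row) for row in binary]  # ink pixels per row
    lines: List[Tuple[int, int]] = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y  # first inked row of a new line
        elif ink < min_ink and start is not None:
            if y - start >= min_height:
                lines.append((start, y))
            start = None  # back in the gap between lines
    # Close a line that runs to the bottom edge of the image.
    if start is not None and len(profile) - start >= min_height:
        lines.append((start, len(profile)))
    return lines


# Toy 7x5 "page" with two text lines separated by blank rows.
page = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
]
print(segment_lines(page))  # [(1, 3), (5, 6)]
```

The same idea applied to the columns of each line crop gives a rough word segmentation, although real manuscripts with skewed or touching lines would likely need a more robust approach.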
Doublets are passages in a text where the same saying, phrase or narrative appears more than once, usually with slight variation. They are a common feature of the New Testament Gospels and the Qur'an. For this project, I developed a script that identifies doublets and long formulaic expressions in texts, with a special focus on the Qur'an.
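One simple way to surface candidate doublets is to index every word n-gram and keep those that occur in more than one passage. The sketch below shows only that idea and is not the script described above: the function name `find_repeated_ngrams`, the passage labels, and the toy English sentences are all hypothetical, and it finds exact repeats only.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def find_repeated_ngrams(
    passages: Dict[str, str], n: int = 4
) -> Dict[Tuple[str, ...], List[str]]:
    """Map each word n-gram to the passages containing it (if more than one)."""
    index: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for ref, text in passages.items():
        tokens = text.lower().split()  # naive whitespace tokenisation
        seen = set()  # record each n-gram once per passage
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if gram not in seen:
                seen.add(gram)
                index[gram].append(ref)
    # Keep only n-grams shared by at least two passages.
    return {gram: refs for gram, refs in index.items() if len(refs) > 1}


# Toy passages with made-up labels, for illustration only.
passages = {
    "p1": "blessed are those who hear the word and keep it",
    "p2": "he went up the mountain to pray",
    "p3": "and blessed are those who hear the word",
}
result = find_repeated_ngrams(passages, n=4)
for gram, refs in result.items():
    print(" ".join(gram), "->", refs)
```

Because doublets usually differ slightly, exact matching like this is at best a first pass. Handling variation would need some form of approximate matching, and Arabic text would also call for normalisation of diacritics and orthography before tokenising.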
© Mohammed Al-Firas