In Data in Brief
A bi-gram language model for Arabic is constructed from 29,192,662 HTML files obtained from ClueWeb. The dataset covers the standard types of bi-gram analysis, with a focus on the root-pattern paradigm in Arabic. Root-pattern distributions in the form of P(root|pattern), P(pattern|root) and P(pattern|pattern) are additionally estimated. Applying Maximum Likelihood Estimation (MLE) at the root-pattern level, as a higher level of abstraction, has been widely neglected in the Arabic research community, despite its advantage in reducing ambiguities in Arabic morphological analysis and its relevance to the cognitive aspects of Arabic word perception [1]. In the preprocessing phase, the HTML files were converted to 974 unfiltered raw text files with a total size of about 180 GB. These files were morphologically analyzed to extract and count the frequencies of patterns, roots, particles, and stems, and in particular of root-pattern occurrences. Based on the resulting corpus of around 18,482,719 raw words, a language data model was constructed containing 9,311,246 bi-grams of morphologically analyzed word forms, including around 3.49 million bi-directional P(root|pattern) bi-grams and around 1.153 million P(pattern|pattern) bi-grams in the form of conditional probabilities, covering a subset of around 8086 roots with 20413 possible pattern forms. Because this data model captures the root-pattern phenomenon in Arabic, the data are useful for researchers working on cognitive aspects of Arabic such as visual word cognition and morpho-phonetic perception, as well as on morphological analysis, spell-checking, and resolving ambiguities in morphological parsing.
Haddad Bassam, Awwad Ahmad, Hattab Mamoun, Hattab Ammar
2023-Feb
Arabic language model, N-gram models, Probabilistic morphology, Root Pattern Analysis, Root-Pattern Classification, Word Cognition
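
The conditional probabilities named in the abstract, P(root|pattern), P(pattern|root) and P(pattern|pattern), can be illustrated with a minimal Python sketch of relative-frequency (MLE) estimation. This is not the authors' pipeline: the function names, the Buckwalter-style transliterations, and the four-token toy input are hypothetical, and reading P(pattern|pattern) as the probability of a word's pattern given the pattern of the preceding word is an assumption, since the abstract does not state the direction.

```python
from collections import Counter


def mle_root_pattern(tokens):
    """Estimate P(root|pattern) and P(pattern|root) by relative frequency.

    tokens: list of (root, pattern) pairs, one per analyzed word
    (hypothetical input format, not the dataset's actual layout).
    """
    pair_counts = Counter(tokens)
    root_counts = Counter(root for root, _ in tokens)
    pattern_counts = Counter(pattern for _, pattern in tokens)

    # P(root | pattern) = count(root, pattern) / count(pattern)
    p_root_given_pattern = {
        (root, pattern): c / pattern_counts[pattern]
        for (root, pattern), c in pair_counts.items()
    }
    # P(pattern | root) = count(root, pattern) / count(root)
    p_pattern_given_root = {
        (pattern, root): c / root_counts[root]
        for (root, pattern), c in pair_counts.items()
    }
    return p_root_given_pattern, p_pattern_given_root


def mle_pattern_bigrams(patterns):
    """Estimate P(next pattern | previous pattern) over a pattern sequence.

    Assumption: the bi-gram is conditioned on the preceding word's pattern.
    """
    bigram_counts = Counter(zip(patterns, patterns[1:]))
    # Only positions that have a successor can serve as a context.
    context_counts = Counter(patterns[:-1])
    return {
        (prev, nxt): c / context_counts[prev]
        for (prev, nxt), c in bigram_counts.items()
    }


# Toy input (invented for illustration, not taken from the dataset).
toy_tokens = [("ktb", "fAEl"), ("drs", "fAEl"), ("ktb", "mfEwl"), ("ktb", "fAEl")]
p_r_given_p, p_p_given_r = mle_root_pattern(toy_tokens)
p_p_given_prev = mle_pattern_bigrams([pattern for _, pattern in toy_tokens])

print(p_r_given_p[("ktb", "fAEl")])      # count(ktb, fAEl) / count(fAEl) = 2/3
print(p_p_given_r[("mfEwl", "ktb")])     # count(ktb, mfEwl) / count(ktb) = 1/3
print(p_p_given_prev[("fAEl", "mfEwl")])  # 1 of the 2 fAEl contexts is followed by mfEwl
```

Plain relative frequencies assign zero probability to unseen root-pattern pairs; whether and how the published dataset applies smoothing is not stated in the abstract, so none is shown here.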