

The Arabic Treebank project started in the fall of 2001 with the objective of annotating a large Arabic machine-readable text corpus manually and automatically. As in previous Penn Treebanks, two different kinds of information need to be produced by two different (human and computer) processes. The Arabic Treebank project therefore consists of two distinct phases:

- Part-of-Speech (POS) tagging, which includes inflectional features and gloss information not traditionally included with POS annotation.
- Arabic Treebanking (ArabicTB), which characterizes the constituent structures of word sequences, provides categories for each non-terminal node, and identifies null elements, co-reference, traces, etc.

The following table gives a breakdown of the data contained in the entire Arabic Treebank project, with discrepancies between versions for Parts 1 and 3. The fields include source, number of stories, total number of tokens, number of tokens after clitic separation, and number of Arabic word tokens after punctuation, numbers, and Latin strings have been taken out. The totals given at the bottom are calculated from the latest versions where discrepancies exist, and do not include tokens after clitic separation since that number is missing from Part 4.

For this corpus, the An Nahar News Agency stories were taken from Arabic Gigaword (LDC2003T12). This corpus is also referred to as ANNAHAR. The new features include complete vocalization of all Imperfect Verb mood endings: Indicative, Subjunctive, and Jussive. Tim Buckwalter's lexicon and morphological analyzer were used to generate a candidate list of POS tags for each word. The POS task is just to select the correct POS tag. (Please note that some words do not exist in this lexicon.) This corpus has both previous and subsequent versions.
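
The distinction between the total number of tokens and the number of Arabic word tokens (after punctuation, numbers, and Latin strings have been taken out) can be illustrated with a minimal sketch. This is not the procedure LDC used: the function name count_tokens, the whitespace tokenization, and the rule that a token counts as an Arabic word if it contains at least one Arabic letter are all assumptions made for illustration.

```python
import re

# Assumption: a token is an "Arabic word token" if it contains at least one
# letter in the Unicode range U+0621..U+064A (Arabic letters hamza to yeh).
# This range excludes Arabic punctuation and Arabic-Indic digits.
ARABIC_LETTER = re.compile("[\u0621-\u064A]")


def count_tokens(text):
    """Return (total_tokens, arabic_word_tokens) for whitespace-split text."""
    tokens = text.split()
    arabic_words = [t for t in tokens if ARABIC_LETTER.search(t)]
    return len(tokens), len(arabic_words)


if __name__ == "__main__":
    # Hypothetical sample: three Arabic words, one number, one Latin string,
    # and two punctuation tokens.
    sample = "كتب الولد 3 رسائل , LDC ."
    total, arabic = count_tokens(sample)
    print(total, arabic)
```

On the hypothetical sample, the punctuation marks, the digit, and the Latin string count toward the total but not toward the Arabic word tokens. Clitic separation, which the corpus reports as a separate count, is not modeled here.
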
The Penn Arabic Treebank, which started in November 2001 as part of the DARPA TIDES project, is particularly suitable for language developers, computational linguists, and computer scientists who are interested in various aspects of NLP. This corpus is designed for those who study and use languages either professionally or academically, and who need text corpora in their work.

Treebanks are language resources that provide annotations of natural languages at various levels of structure: at the word level, the phrase level, and the sentence level. Treebanks have become crucially important for the development of both data-driven and general linguistic research.

The goal of the Arabic Treebank project is to support the development of data-driven approaches to natural language processing (NLP), human language technologies, automatic content extraction (topic extraction and/or grammar extraction), cross-lingual information retrieval, information detection, and other forms of linguistic research on Modern Standard Arabic in general. LDC was sponsored to develop an Arabic POS and Treebank of 1 million words. This corpus is part three of that project.

Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis) was developed by the Linguistic Data Consortium (LDC) and contains approximately 300,000 Arabic word tokens with both syntactic treebank annotation and annotation on part of speech (POS), gloss, and word segmentation.
