2026
CongoNames Corpus: A Large-Scale Labeled Dataset of Congolese Personal Names
Personal names carry cultural and linguistic identity, yet most African countries lack large-scale, structured name datasets suitable for natural language processing (NLP) research and computational social science. We present CongoNames, the first large-scale corpus of personal names from the Democratic Republic of the Congo (DRC), derived from the publicly released palmarès (result lists) of the national secondary-school examination, published annually by the DRC Ministry of Education. The corpus comprises 8,053,983 name records spanning 16 examination years (2008–2023) across 12 provinces and 304 sub-provincial regions; each record carries a reported sex marker (M/F) and regional provenance metadata. We describe a fully deterministic, layered processing pipeline (bronze–silver–gold architecture) that converts raw PDF documents into structured CSV datasets without manual annotation or machine-learning-based inference. We validate the dataset against school-level census counts extracted from the same source PDFs, obtaining extraction error rates below 2% for all years except 2023 (7.81%, flagged and attributed to a layout change in the source documents). Descriptive analyses document name-length and token-count distributions, character-level n-gram profiles, provincial diversity indices, and inter-provincial name-inventory overlap. Together, these analyses establish the dual linguistic origin that characterizes modern Congolese naming practice: locally rooted Bantu components alongside Christian/French-origin components. The dataset, processing code, and documentation are released openly to support research in African NLP, onomastics, and computational social science.
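As a rough illustration of the validation step and the character-level n-gram profiles mentioned above, the following Python sketch may help. It rests on several assumptions not stated in the abstract: the error rate is taken to be the relative discrepancy between extracted record counts and census counts, the 2% figure is used as a flagging threshold, and all function names, counts, and the sample name are hypothetical and chosen only for illustration. It is not the released processing code.

```python
from collections import Counter


def extraction_error_rate(extracted: int, expected: int) -> float:
    """Relative discrepancy between extracted records and the census count.

    Assumption: the abstract does not define the error rate, so this uses
    |extracted - expected| / expected as one plausible definition.
    """
    if expected <= 0:
        raise ValueError("expected census count must be positive")
    return abs(extracted - expected) / expected


def flag_years(counts: dict[int, tuple[int, int]], threshold: float = 0.02) -> dict[int, bool]:
    """Mark each year whose error rate reaches the (assumed) threshold."""
    return {
        year: extraction_error_rate(extracted, expected) >= threshold
        for year, (extracted, expected) in counts.items()
    }


def char_ngrams(name: str, n: int = 2) -> Counter:
    """Count overlapping character n-grams in an upper-cased name."""
    text = name.upper()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# Made-up counts (extracted, census) for illustration only, not corpus figures.
toy_counts = {2022: (990, 1000), 2023: (922, 1000)}
flags = flag_years(toy_counts)

# Bigram profile of a single illustrative name.
profile = char_ngrams("Ilunga", n=2)
```

With the toy counts above, only the second year would be flagged, mirroring how a single year with a layout change can stand out from the rest. Summing `char_ngrams` counters over all names in a province would give a provincial n-gram profile.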