Publications

Applied Research

Applied computing work shaped by infrastructure, operational cost, data quality, and access realities in the Democratic Republic of the Congo and similar contexts.

2026

CongoNames Corpus: A Large-Scale Labeled Dataset of Congolese Personal Names

Personal names carry cultural and linguistic identity, yet most African countries lack large-scale, structured name datasets suitable for natural language processing (NLP) research and computational social science. We present CONGONAMES, the first large-scale corpus of personal names from the Democratic Republic of the Congo (DRC), derived from the national secondary-school examination palmarès (result lists) published annually by the DRC Ministry of Education. The corpus comprises 8,053,983 name records spanning 16 examination years (2008–2023) across 12 provinces and 304 sub-provincial regions; each record carries a reported sex marker (M/F) and regional provenance metadata. We describe a fully deterministic, layered processing pipeline (bronze–silver–gold architecture) that converts raw PDF documents into structured CSV datasets without manual annotation or machine-learning-based inference. The dataset is validated against school-level census counts extracted from the same source PDFs, yielding extraction error rates below 2% for all years except 2023 (7.81%, flagged due to a layout change). Descriptive analyses document name length and token-count distributions, character-level n-gram profiles, provincial diversity indices, and inter-provincial name-inventory overlap. Together they establish the dual linguistic origin (locally rooted Bantu components and Christian/French-origin components) that characterizes modern Congolese naming practice. The dataset, processing code, and documentation are released openly to support research in African NLP, onomastics, and computational social science.

DOI
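
The validation step described in the abstract above can be illustrated with a minimal sketch. It assumes the error rate for a year is the summed absolute difference between reported and extracted candidate counts across schools, divided by the summed reported count; the abstract does not give the exact formula, so this definition, the function names, and the toy numbers are all hypothetical.

```python
from collections import defaultdict


def extraction_error_by_year(schools):
    """Return {year: error rate in percent}.

    `schools` is an iterable of (year, reported_count, extracted_count)
    tuples, one per school. The error definition is an assumption made
    for illustration, not the formula used by the paper.
    """
    reported = defaultdict(int)
    mismatch = defaultdict(int)
    for year, reported_count, extracted_count in schools:
        reported[year] += reported_count
        mismatch[year] += abs(reported_count - extracted_count)
    return {
        year: 100.0 * mismatch[year] / reported[year]
        for year in reported
        if reported[year] > 0
    }


def flag_years(error_rates, threshold=2.0):
    """List the years whose error rate reaches the threshold (in percent)."""
    return sorted(year for year, rate in error_rates.items() if rate >= threshold)


# Toy counts for illustration only; they are not taken from the corpus.
toy_schools = [
    (2022, 100, 99),
    (2022, 200, 199),
    (2023, 100, 90),
    (2023, 100, 95),
]
rates = extraction_error_by_year(toy_schools)
flagged = flag_years(rates)
```

With these toy counts, only the second year crosses the 2% threshold, mirroring how a single year can be flagged while the rest pass.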

2025 / 2025 International Conference on Emerging Trends in Networks and Computer Communications (ETNCC)

Automated Citation Detection in Congolese Legal Texts: Leveraging LLM-Based NER for Knowledge Graph Construction

This paper builds upon our previous work on Juro, an AI-powered chatbot designed to improve legal information access in the Democratic Republic of the Congo (DRC), by addressing the specific challenge of automated citation detection in unstructured legal texts. We propose an end-to-end approach that combines Large Language Model (LLM)-based annotation and Named Entity Recognition (NER) to extract the key entities needed to construct a legal knowledge graph. A total of 8,400 Congolese legal document titles were collected and annotated using the GPT-4o-mini model. The annotations were then used to train spaCy models under two distinct configurations, one emphasizing accuracy and the other efficiency. We evaluated the system on both a held-out split of the dataset and a human-annotated benchmark, demonstrating strong performance in identifying document types, reference numbers, and publication dates. An initial mapping algorithm connected documents based on annotated entities, revealing a preliminary citation graph of over 1,400 relationships. While the current methodology shows promise in automating entity extraction and preliminary graph construction, future developments will explore deeper relationship modeling, improved type coverage, and integration into the Juro framework to provide enhanced legal support.

DOI
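
The idea of connecting documents through annotated entities, as described in the abstract above, can be sketched as follows. This is not the paper's mapping algorithm, which the abstract does not specify; it assumes each document is identified by a (document type, reference number) pair and that a citation is any mention whose normalized pair matches a known document. All names and sample values are hypothetical.

```python
def normalize(doc_type, ref_number):
    """Lowercase both fields and drop spaces from the reference number."""
    return (doc_type.strip().lower(), ref_number.strip().lower().replace(" ", ""))


def build_citation_edges(documents, mentions):
    """Return sorted (source_id, target_id) citation edges.

    `documents` maps doc_id -> (doc_type, ref_number).
    `mentions` is an iterable of (source_doc_id, doc_type, ref_number),
    e.g. entities that an NER model found in the source document.
    Matching on the normalized pair is an illustrative assumption.
    """
    index = {normalize(t, r): doc_id for doc_id, (t, r) in documents.items()}
    edges = set()
    for source_id, doc_type, ref_number in mentions:
        target_id = index.get(normalize(doc_type, ref_number))
        # Skip mentions with no matching document and self-references.
        if target_id is not None and target_id != source_id:
            edges.add((source_id, target_id))
    return sorted(edges)


# Placeholder identifiers for illustration; not real legal references.
documents = {"d1": ("Loi", "A-001"), "d2": ("Decret", "B 002")}
mentions = [
    ("d2", "loi", "a-001"),     # d2 cites d1
    ("d1", "Loi", "A-001"),     # self-reference, ignored
    ("d2", "Arrete", "C-003"),  # no matching document, ignored
]
edges = build_citation_edges(documents, mentions)
```

Exact matching on normalized keys keeps the sketch deterministic; a real system would likely also need to handle variant spellings and partial references.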