A Pre-Research Article | Data Science, Data Pipelines & Genomic Informatics
Abstract
Paternity testing is one of the most legally and emotionally consequential applications of forensic science. In Kenya, demand for DNA paternity testing has grown sharply over the past decade, with laboratories such as the Bioinformatics Institute of Kenya now processing upwards of 125 cases every month, many feeding directly into court proceedings around child support, custody, and inheritance. Yet the infrastructure underpinning these tests remains largely manual, time-intensive, and constrained by limited laboratory capacity and the absence of a national forensic DNA database.
This article makes the case that machine learning (ML), combined with a well-engineered data pipeline, offers a credible and practical path toward faster, more scalable, and more consistent paternity determination. It surveys the existing research landscape, identifies the specific gaps that remain unaddressed in the African genomic context, and outlines the architecture of a data pipeline that could serve as the foundation for an ML-assisted paternity testing system designed for the Kenyan environment.
1. The Problem: Paternity Disputes in the Kenyan Legal System
Paternity testing accounts for over 90% of all DNA tests conducted in Kenya. Cases span a wide range of contexts: child maintenance and custody battles in family courts, inheritance disputes, immigration documentation, and increasingly, personal verification outside any legal process.
The bottleneck is not awareness; demand is clearly there and growing. The bottleneck is capacity. Until recently, Kenya had only one institution authorised to conduct forensic DNA testing: the Government Chemist, a government-run laboratory that has historically struggled with case backlog. A second public laboratory was later established at the Kenya Medical Research Institute (KEMRI), and a small number of private providers, notably the Bioinformatics Institute of Kenya (BIK) and EasyDNA Kenya, now operate in the market.
Despite this growth, several structural problems persist:
- No national forensic DNA database exists. Bodies are disposed of without DNA records. Cold cases cannot be cross-referenced against stored profiles.
- Manual interpretation remains the norm. Short Tandem Repeat (STR) profiles are compared by trained analysts, introducing potential for human error and analyst inconsistency.
- Turnaround times are slow. Legal-grade tests can take days to weeks, delaying court proceedings.
- Population-specific allele frequency data for East African populations is sparse. Most STR allele frequency databases are built on European, East Asian, or American reference panels, which affects the statistical accuracy of paternity index calculations for East African profiles.
2. How DNA Paternity Testing Currently Works
Before discussing machine learning applications, it is worth briefly describing how the current process works.
Modern paternity testing uses Short Tandem Repeats (STRs), sections of the genome where a short sequence of base pairs (the "repeat unit") is repeated a variable number of times from person to person. Because the number of repeats at each STR locus is highly variable across individuals, and because a child inherits one allele at each locus from each parent, comparing STR profiles across child, mother, and alleged father allows analysts to determine whether the alleles the child must have inherited from the biological father are present in the alleged father's profile.
The output of this process is a Combined Paternity Index (CPI), a likelihood ratio that expresses how much more likely it is that the tested man is the biological father versus a random unrelated man from the same population. A CPI above 10,000 (corresponding to a probability of paternity above 99.99%) is typically required for a legal determination.
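As a minimal sketch of the arithmetic described above: the CPI is the product of the per-locus paternity indices (assuming independent loci), and the probability of paternity follows from Bayes' theorem. The 0.5 prior is an assumed neutral prior, the per-locus values are invented for illustration, and the function names are hypothetical.

```python
from math import prod


def combined_paternity_index(per_locus_pi):
    """Multiply per-locus paternity indices, assuming the loci are independent."""
    return prod(per_locus_pi)


def probability_of_paternity(cpi, prior=0.5):
    """Posterior probability of paternity via Bayes' theorem.

    prior=0.5 is an assumed neutral prior, not a value taken from the article.
    """
    return (cpi * prior) / (cpi * prior + (1.0 - prior))


# Hypothetical per-locus paternity indices, for illustration only.
example_pis = [2.5, 4.0, 10.0, 5.0, 20.0]
cpi = combined_paternity_index(example_pis)
print(f"CPI = {cpi:.0f}, probability = {probability_of_paternity(cpi):.6f}")
```

With these invented values the CPI is 10,000, which under a 0.5 prior gives a probability of paternity just above 99.99%, matching the threshold quoted above.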
The standard in Kenya involves 24 genetic markers. The Bioinformatics Institute of Kenya, for example, offers a 24-marker test that it describes as superior to the panels used by most law enforcement laboratories in the country and across Africa.
The weakness in this process is two-fold: it requires skilled human analysts at every step, and the statistical power of the CPI calculation depends on accurate, population-specific allele frequency tables, which, for East African populations, are still being developed.
3. What the Research Says: Machine Learning and DNA Kinship Analysis
A growing body of peer-reviewed research now demonstrates that machine learning can meaningfully contribute to DNA-based kinship and paternity analysis. The work spans several different approaches.
3.1 Deep Neural Networks on STR Data
A 2023 paper published in the Journal of Intelligent Systems proposed replacing manual STR matching with a Deep Neural Network (DNN) trained on 15-locus STR data. The researchers created a synthetic familial dataset, augmented it to increase sample size, and trained a DNN to predict paternity. This was among the first studies to directly position deep learning as a substitute for, rather than a supplement to, manual forensic interpretation.
The paper explicitly acknowledged that in developing countries, conventional kinship analysis techniques result in inadequate accuracy when dealing with large STR datasets, largely because of the human labour required for profile-by-profile comparison.
3.2 Random Forest and SVM on mtDNA Sequences
A study indexed on PubMed (NIH) applied four machine learning classifiers, Support Vector Machines (SVM), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Random Forest (RF), to mitochondrial DNA hypervariable region I sequences. The data covered African, Asian, and Caucasian samples.
The results were encouraging: a Bag-of-Words + PCA + Random Forest combination achieved 94.4% accuracy in predicting genetic relatedness, outperforming all other configurations. Critically, this study is one of the few to explicitly include African DNA samples, making its findings directly relevant to the Kenyan context.
3.3 SNP-Based Kinship Panels with Supervised ML
A 2024 paper in Expert Systems with Applications introduced a novel panel of 4,849 Single Nucleotide Polymorphisms (SNPs) and applied supervised machine learning to classify kinship relationships across more than 150,000 simulated pairs. The panel was designed to overcome the limitations of STR-based methods for detecting second-degree and more distant relationships.
A key feature of this study was its transparency: the full codebase was made publicly available on GitHub, making it an accessible starting point for researchers who want to replicate or extend the work.
3.4 Dynamic Programming and ML for DNA Sequence Classification
A two-part research series by Dr. Ernest Bonat and colleagues, published on Medium in 2024, explored using dynamic programming (specifically the Smith-Waterman and Needleman-Wunsch sequence alignment algorithms) in combination with machine learning classifiers for paternity DNA sequence classification. Part 2 of the series moved into feature engineering, DNA natural language processing (treating nucleotide sequences as text for embedding and classification), and model deployment strategies.
This work is notable for its practical orientation; it was designed with real-world deployment in hospital and laboratory settings in mind, using efficient, low-cost hardware platforms.
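For readers unfamiliar with the alignment algorithms named above, the following is a minimal sketch of Needleman-Wunsch global alignment scoring. It is not the code from the Bonat series; the scoring values (match, mismatch, gap) and the function name are assumptions chosen for illustration.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of sequences a and b.

    Uses a linear gap penalty and keeps only two rows of the dynamic
    programming table. The scoring values are illustrative assumptions.
    """
    # Row 0: aligning an empty prefix of a against each prefix of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap inserted in b
            left = curr[j - 1] + gap  # gap inserted in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("GATTACA", "GATTACA"))
print(needleman_wunsch_score("ACGT", "ACG"))
```

An alignment score of this kind could serve as one numeric feature for a downstream classifier; Smith-Waterman differs mainly in clamping cell scores at zero to find the best local alignment.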
3.5 AI-Assisted Allele Calling in Forensic DNA Analysis
A 2025 preprint on bioRxiv demonstrated that deep learning models can outperform traditional rule-based systems for allele calling in forensic DNA electropherogram analysis. The researchers showed that deep learning eliminates much of the manual inspection currently required to classify electrophoresis signals into categories such as alleles, stutter artefacts, and baseline noise, which is the core step in any STR-based DNA test.
4. The Gap That Remains: East African Population Data
Despite this research momentum, a significant gap persists. The overwhelming majority of ML models for DNA kinship analysis have been trained on datasets drawn from European, East Asian, or broadly American populations. Allele frequency distributions vary meaningfully across ethnic groups, and a paternity index calculated using a European allele frequency table applied to a Luo, Kikuyu, or Kalenjin profile will produce an inaccurate probability estimate.
This is not a minor technical footnote; it is a potential source of serious legal error.
Addressing it requires two things: first, building and publishing a properly annotated STR/SNP allele frequency reference panel for major Kenyan and East African population groups; and second, retraining or fine-tuning kinship ML models on that East African reference data.
This is the specific research contribution that has not yet been made and that this project intends to address.
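To make the sensitivity discussed in this section concrete, the sketch below uses the textbook single-locus formula for the simplest trio case, where the child's paternal allele is unambiguous: the paternity index is 1/(2p) when the alleged father carries one copy of that allele and 1/p when he carries two, with p the allele's population frequency. The two frequencies are invented for illustration and do not describe any real population; the function name is hypothetical.

```python
def paternity_index(p_obligate, father_copies):
    """Per-locus paternity index when the child's paternal allele is unambiguous.

    p_obligate: population frequency of the paternal (obligate) allele.
    father_copies: copies of that allele carried by the alleged father (1 or 2).
    """
    if not 0.0 < p_obligate <= 1.0:
        raise ValueError("allele frequency must be in (0, 1]")
    if father_copies not in (1, 2):
        raise ValueError("father_copies must be 1 or 2")
    return (father_copies / 2.0) / p_obligate


# Hypothetical frequencies for the same allele in two reference panels.
freq_mismatched_panel = 0.20  # allele looks common in the wrong reference
freq_matched_panel = 0.05     # allele is rarer in the tested population

pi_mismatched = paternity_index(freq_mismatched_panel, father_copies=1)
pi_matched = paternity_index(freq_matched_panel, father_copies=1)
ratio_per_locus = pi_matched / pi_mismatched

print(f"PI with mismatched panel: {pi_mismatched:.2f}")
print(f"PI with matched panel:    {pi_matched:.2f}")
print(f"Distortion over 10 such loci: {ratio_per_locus ** 10:.3g}x")
```

Because the CPI is a product across loci, a per-locus distortion of this size compounds quickly, which is why the choice of reference panel matters more than a single-locus comparison suggests.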
5. Proposed Data Pipeline Architecture
The data pipeline proposed for this research is designed to take raw STR genotype data as input and produce a paternity probability as output.
The pipeline is structured across five stages:
Stage 1: Data Ingestion and Standardisation
Raw genotype data from laboratory electrophoresis systems is ingested in standard file formats (e.g., .csv, FASTA, or vendor-specific formats from Applied Biosystems or equivalent instruments). Data is standardised to a consistent allele notation format, and metadata (sample ID, locus names, collection date, chain-of-custody flags) is attached.
Tools: Python (pandas, biopython), validation schemas, format converters.
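A minimal sketch of the standardisation step might look like the following. It uses only the standard library csv module rather than pandas so that it runs without dependencies. The long-format layout and the column names (sample_id, locus, allele_1, allele_2) are assumptions, not a format defined by any instrument vendor, and the sketch handles numeric STR alleles only.

```python
import csv
import io


def normalise_allele(raw):
    """Return a canonical allele string, e.g. ' 13.0 ' -> '13', '9.3' -> '9.3'."""
    value = float(raw.strip())
    return str(int(value)) if value.is_integer() else f"{value:g}"


def load_genotypes(handle):
    """Read long-format rows into {sample_id: {LOCUS: (allele, allele)}}.

    Locus names are upper-cased and each allele pair is sorted numerically,
    so equivalent genotypes always share the same representation.
    """
    profiles = {}
    for row in csv.DictReader(handle):
        sample = row["sample_id"].strip()
        locus = row["locus"].strip().upper()
        alleles = sorted(
            (normalise_allele(row["allele_1"]), normalise_allele(row["allele_2"])),
            key=float,
        )
        profiles.setdefault(sample, {})[locus] = tuple(alleles)
    return profiles


# Hypothetical input with inconsistent notation, for illustration only.
raw_csv = """sample_id,locus,allele_1,allele_2
child_01,d8s1179,13.0,12
child_01,TH01,9.3,6
"""
profiles = load_genotypes(io.StringIO(raw_csv))
print(profiles)
```

In a real deployment the same function would also attach the chain-of-custody metadata and reject rows that fail schema validation instead of raising on the first malformed value.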
Stage 2: Allele Frequency Reference Construction
For each STR locus in the panel, allele frequencies are computed from a reference population dataset. Where East African population data is available (from published studies, KEMRI archives, or ethically sourced anonymised samples), it is used as the primary reference. Where gaps exist, published African population data from sources such as the 1000 Genomes Project is used as a fallback, clearly flagged.
Output: A locus-by-allele frequency matrix, stratified by population group where sufficient data is available.
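The reference construction and fallback logic can be sketched as below. Frequencies are estimated by simple allele counting, and each lookup reports which source it came from so that fallback values stay flagged. The function names, the "floor" source label, and the 0.001 minimum frequency for unseen alleles are assumptions; the profiles are invented.

```python
from collections import Counter, defaultdict


def allele_frequencies(profiles):
    """Estimate per-locus allele frequencies by counting alleles.

    profiles: {sample_id: {locus: (allele, allele)}}
    """
    counts = defaultdict(Counter)
    for genotype_by_locus in profiles.values():
        for locus, alleles in genotype_by_locus.items():
            counts[locus].update(alleles)  # a homozygote contributes two copies
    return {
        locus: {allele: n / sum(c.values()) for allele, n in c.items()}
        for locus, c in counts.items()
    }


def lookup_frequency(locus, allele, primary, fallback, min_freq=0.001):
    """Return (frequency, source), preferring the primary reference.

    min_freq is an assumed floor for alleles absent from both references.
    """
    if allele in primary.get(locus, {}):
        return primary[locus][allele], "primary"
    if allele in fallback.get(locus, {}):
        return fallback[locus][allele], "fallback"
    return min_freq, "floor"


# Invented reference profiles, for illustration only.
reference_profiles = {
    "ref_01": {"TH01": ("6", "9.3")},
    "ref_02": {"TH01": ("6", "7")},
}
primary = allele_frequencies(reference_profiles)
fallback = {"TH01": {"8": 0.12}}

print(lookup_frequency("TH01", "6", primary, fallback))
print(lookup_frequency("TH01", "8", primary, fallback))
print(lookup_frequency("TH01", "10", primary, fallback))
```

Carrying the source label alongside each frequency lets the final report state how many loci relied on non-East-African data.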
Stage 3: Feature Engineering
Each trio of profiles (child, mother, alleged father) is transformed into a structured feature vector. Features include:
- Per-locus allele match/mismatch flags between child and alleged father
- Per-locus likelihood ratios (using the allele frequency reference from Stage 2)
- Combined Paternity Index (CPI) computed using the classical formula
- Encoded nucleotide sequences (using k-mer or one-hot encoding, for deep learning branches of the pipeline)
- Population group label (where known)
Tools: scikit-learn (preprocessing), numpy, custom STR feature extraction functions.
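A minimal sketch of the trio feature extraction follows. The per-locus likelihood ratio is computed from first principles under a no-mutation model: the probability of the child's genotype given the mother and the alleged father, divided by the probability given the mother and a random man. Function names, feature names, and all genotypes and frequencies are hypothetical; a production system would also need a mutation model for apparent mismatches.

```python
import math


def locus_likelihood_ratio(child, mother, father, freqs):
    """Per-locus paternity index for a trio, assuming no mutation."""
    target = sorted(child)
    numerator = 0.0
    denominator = 0.0
    for maternal in mother:          # each maternal allele passed with prob 1/2
        for paternal in father:      # each paternal allele passed with prob 1/2
            if sorted((maternal, paternal)) == target:
                numerator += 0.25
        for allele in set(child):    # random man's allele, weighted by frequency
            if sorted((maternal, allele)) == target:
                denominator += 0.5 * freqs[allele]
    if denominator == 0.0:
        raise ValueError("child genotype is inconsistent with the mother's profile")
    return numerator / denominator


def trio_features(child, mother, father, freqs):
    """Build a flat feature dict from per-locus match flags and likelihood ratios."""
    features = {}
    log10_cpi = 0.0
    mismatches = 0
    for locus in sorted(child):
        lr = locus_likelihood_ratio(child[locus], mother[locus], father[locus], freqs[locus])
        features[f"{locus}_match"] = int(lr > 0.0)
        features[f"{locus}_lr"] = lr
        if lr > 0.0:
            log10_cpi += math.log10(lr)
        else:
            mismatches += 1
    features["log10_cpi_matching_loci"] = log10_cpi
    features["n_mismatches"] = mismatches
    return features


# Invented frequencies and genotypes, for illustration only.
freqs = {
    "TH01": {"6": 0.2, "7": 0.3, "9.3": 0.5},
    "D8S1179": {"12": 0.1, "13": 0.4, "14": 0.5},
}
child = {"TH01": ("6", "9.3"), "D8S1179": ("12", "13")}
mother = {"TH01": ("6", "7"), "D8S1179": ("13", "14")}
father = {"TH01": ("9.3", "9.3"), "D8S1179": ("13", "14")}

features = trio_features(child, mother, father, freqs)
print(features)
```

Keeping the classical likelihood ratios as explicit features means any model trained on them can be compared directly against the CPI baseline on the same inputs.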
Stage 4: Model Training and Evaluation
Multiple model families are trained and benchmarked:
- Baseline: Logistic Regression and Gradient Boosted Trees (for interpretability and legal transparency)
- Intermediate: Random Forest (strong performance in existing literature on genetic relatedness)
- Advanced: Deep Neural Network trained on locus-by-allele matrix representations
- Sequence model (experimental): Transformer-based model treating nucleotide sequences as text, for cases where raw sequence data is available
All models are evaluated using stratified cross-validation. Primary metrics are accuracy, F1-score, and, critically, calibration (how well predicted probabilities correspond to actual paternity likelihoods). For a legal application, a poorly calibrated model that outputs overconfident probabilities is more dangerous than a slightly less accurate but well-calibrated one.
Tools: scikit-learn, tensorflow/keras, xgboost, matplotlib and shap for interpretability.
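The distinction between accuracy and calibration can be illustrated with two common calibration measures, the Brier score and the expected calibration error, written here in plain Python as a sketch. The labels and predictions are invented, the ten-bin scheme is an assumption, and in practice the equivalent scikit-learn utilities would normally be used instead.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted average gap between mean prediction and observed rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((y, p))
    total = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        ece += (len(members) / total) * abs(observed - predicted)
    return ece


# Invented example: three true fathers, one non-father.
y_true = [1, 1, 1, 0]
overconfident = [0.99, 0.99, 0.99, 0.99]
calibrated = [0.75, 0.75, 0.75, 0.75]

for name, probs in (("overconfident", overconfident), ("calibrated", calibrated)):
    print(name, round(brier_score(y_true, probs), 4),
          round(expected_calibration_error(y_true, probs), 4))
```

Both sets of predictions yield the same accuracy at a 0.5 threshold (three of four correct), yet the overconfident model scores markedly worse on both calibration measures, which is the failure mode a court-facing system most needs to detect.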
Stage 5: Output and Reporting
The pipeline produces a structured report containing:
- Predicted paternity probability with confidence interval
- Per-locus CPI breakdown (for transparency and legal review)
- SHAP-based feature importance plot (showing which loci drove the prediction)
- A plain-language summary suitable for inclusion in a court document
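One way to hold these report elements together is a small structured record that serialises to JSON. The sketch below is an assumption about how such a report could be organised; the class name, field names, case identifier, and all values are hypothetical, and the summary wording is a placeholder rather than court-approved language.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class PaternityReport:
    """Hypothetical container for the Stage 5 outputs."""
    case_id: str
    probability: float
    interval: tuple            # (lower, upper) bounds on the probability
    per_locus_pi: dict         # locus name -> paternity index
    top_loci: list = field(default_factory=list)  # loci ranked by importance

    def summary(self):
        """Return a one-sentence plain-language summary (placeholder wording)."""
        low, high = self.interval
        return (
            f"Case {self.case_id}: the estimated probability of paternity is "
            f"{self.probability:.2%} (interval {low:.2%} to {high:.2%}), "
            f"based on {len(self.per_locus_pi)} loci."
        )

    def to_json(self):
        """Serialise the report for storage or downstream rendering."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Invented values, for illustration only.
report = PaternityReport(
    case_id="CASE-0001",
    probability=0.9999,
    interval=(0.9995, 0.99995),
    per_locus_pi={"TH01": 2.0, "D8S1179": 5.0},
    top_loci=["D8S1179", "TH01"],
)
print(report.summary())
print(report.to_json())
```

Separating the structured record from its rendering keeps the per-locus figures available for legal review even when only the summary appears in the court document.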
6. Why This Matters: The Broader Argument
This pipeline is not intended to remove DNA analysts from the paternity testing process. In a legal context, the chain-of-custody requirements, the professional accountability of a qualified scientist, and the right of courts to examine and cross-examine expert witnesses all require that human expertise remain central.
What the pipeline addresses is the bottleneck before the analyst signs off. The manual steps of computing per-locus paternity indices, constructing the CPI, and interpreting the result against a population reference are repetitive, time-consuming, and, when allele frequency tables are mismatched to the test population, potentially inaccurate. Automating and standardising those steps with a well-validated ML model makes the analyst's job faster, frees up laboratory capacity, and, if the underlying allele frequency data is East African, makes the result more statistically appropriate for the populations actually being tested in Kenyan courts.
There is also a longer-term argument. Kenya currently has no national forensic DNA database. As paternity and forensic DNA testing scales, the data generated by each test represents a potential building block for such a database. A well-designed data pipeline, built with data governance and privacy protections from the ground up, could eventually support population-level allele frequency studies, cold case investigations, and missing persons identification, all areas where current Kenyan capacity is severely limited.
7. Research Questions and Next Steps
This pre-article identifies the following research questions that the forthcoming full paper and associated project will address:
1. How accurately can an ML model trained on East African STR data predict paternity, compared to the classical CPI-based statistical method?
2. Which model architecture (logistic regression, random forest, DNN) offers the best balance between predictive accuracy and interpretability for court-admissible use?
3. How significant is the degradation in paternity index accuracy when non-African allele frequency references are applied to East African DNA profiles?
4. Can a data pipeline for paternity testing be built that meets Kenya's legal chain-of-custody requirements while significantly reducing per-case analyst time?
The next steps are:
- Complete a systematic literature review extending the sources identified here
- Establish a data acquisition plan (simulated datasets for training; anonymised, ethically sourced real STR profiles where possible)
- Develop and test the five-stage pipeline described above
- Write and submit the full research paper
8. Resources and References
Peer-Reviewed Papers
A Deep Neural Network Model for Paternity Testing Based on 15-Loci STR for Iraqi Families (2023)
ML to Predict Genetic Relatedness Using Human mtDNA, African, Asian, Caucasian Samples
Kinship Analysis and ML Algorithms with a New NGS Panel, 4,849 SNPs (ScienceDirect, 2024)
Making AI Accessible for Forensic DNA Profile Analysis (bioRxiv, 2025)
Multidisciplinary Forensic DNA Training in Kenya, KEMRI Perspective (ScienceDirect)
NASTRA: Accurate STR Analysis by Nanopore Sequencing (PMC, 2024)
Research Articles
Advanced Paternity DNA Sequence Classification Using Dynamic Programming and ML - Part 1 (Bonat, 2024)
Advanced Paternity DNA Sequence Classification Using Dynamic Programming and ML - Part 2 (Bonat, 2024)
GitHub Repositories
Forensic DNA pipeline: SNP/STR - kinship, ancestry, mixture analysis
Likelihood ratio model for STR and SNP kinship testing
Full code for the 2024 ScienceDirect SNP kinship paper
Pairwise relatedness from ancient DNA with contamination correction
End-to-end STR profiling pipeline for massively parallel sequencing
Kenya-Specific Context
KEMRI DNA Lab and the Case for a National Forensic Database
Inside Kenya's Paternity Testing Boom (Nation Africa, 2024)
Bioinformatics Institute of Kenya - 24-Marker Paternity Test
EasyDNA Kenya - Legal / Court-Admissible Testing
9. Closing Note
This project is an attempt to begin filling the East African population data gap identified in Section 4. The pipeline described here is a starting point, not an endpoint. The research paper that follows this pre-article will provide a more formal treatment of the methods, a detailed experimental evaluation, and, where the results support it, a clear argument for why ML-assisted paternity testing should be considered as a complement to existing forensic laboratory practice in Kenya.
Pre-research article. All research questions and pipeline architecture are prospective. Full methodology and results will be published in the forthcoming research paper.
by Kipngeno Gregory, Data and Software Engineer