Generating is easy; prioritizing is hard
A simulation of 10,000 synthetic peptides that shows why the bottleneck is not producing sequences but deciding which ones deserve synthesis and testing.
The problem we target
Antimicrobial resistance (AMR) leaves some infections with few therapeutic options. Scientists need to connect dispersed data, prioritize isolates, and choose candidates with limited experimental capacity.
Where AI helps
AI can reduce the search space, filter sequences, prioritize candidates, and guide active learning, but validation remains experimental.
Where GPT-5.5/Codex helps
GPT-5.5/Codex helps build the platform, document, simulate, code, test, and explain. It does not replace biomolecular models or wet lab work.
ProteoGPT-like inside PROTEONEXT
PROTEONEXT is not ProteoGPT 2.0: it is a sovereign platform that could encapsulate specialized protein LLMs and connect them with federated AMR data, governance, MLOps, and experimental validation.
Prioritization funnel
Each stage discards candidates. The key point is that a computational shortlist is still not biological evidence.
Top 30 didactic candidates
These candidates are useful to inspect scoring, not to make real scientific decisions.
| # | Sequence | Score | Length | Charge | Hydrophobicity | Solubility proxy | Toxicity proxy |
|---|---|---|---|---|---|---|---|
| 1 | HRARMWVKRRQ | 99.8 | 11 | +6 | 0.36 | 0.82 | 0.00 |
| 2 | SIQIMERKRIAMKKRLHKFQMPK | 99.7 | 23 | +8 | 0.39 | 0.82 | 0.00 |
| 3 | WKYHIKINQVHSVSIRH | 99.0 | 17 | +6 | 0.35 | 0.78 | 0.00 |
| 4 | CCIHTFIKKNKKAQMRRQSLFA | 98.8 | 22 | +7 | 0.36 | 0.76 | 0.00 |
| 5 | INMKAWHAWMGCANHHKRMRTQER | 98.7 | 24 | +7 | 0.38 | 0.76 | 0.00 |
| 6 | NNIAIVFGPHKHVLRLHGRKSK | 98.5 | 22 | +8 | 0.36 | 0.75 | 0.00 |
| 7 | KMIAKNRCVHHRGNKVTTIVI | 98.4 | 21 | +7 | 0.38 | 0.74 | 0.00 |
| 8 | HSFDHRTMHFFAK | 98.3 | 13 | +4 | 0.38 | 0.74 | 0.00 |
| 9 | SIALDAWSHHQSHQRWHIMASQKVKNLVC | 98.3 | 29 | +6 | 0.41 | 0.74 | 0.00 |
| 10 | MVCDWKKIWKNGHLNKRSVR | 98.3 | 20 | +6 | 0.35 | 0.74 | 0.00 |
| 11 | IFKDKVMYHLLWKTASTKHHD | 98.1 | 21 | +5 | 0.38 | 0.73 | 0.00 |
| 12 | HKHFITVDNLINSLLTRKSC | 98.0 | 20 | +4 | 0.35 | 0.72 | 0.00 |
| 13 | SNCFVFFSEFIQNWKAMKILHKSQDKKYTK | 98.0 | 30 | +5 | 0.37 | 0.72 | 0.00 |
| 14 | KGHSATRHKTIHVAVHAVPEFVDTGQATRV | 98.0 | 30 | +6 | 0.37 | 0.72 | 0.00 |
| 15 | FLNRFIKNKVHDHPKV | 97.9 | 16 | +5 | 0.38 | 0.72 | 0.00 |
| 16 | GILHWRQKYKAKCPHFERWRAKEAMFWHFN | 97.9 | 30 | +8 | 0.40 | 0.72 | 0.00 |
| 17 | RHSHKWPFWITTVRRIHFAPAWWNPKGN | 97.9 | 28 | +8 | 0.39 | 0.71 | 0.00 |
| 18 | HLNNLISTKWMVFKHNT | 97.8 | 17 | +4 | 0.41 | 0.71 | 0.00 |
| 19 | MNTVAIKTFHLHGNKHE | 97.8 | 17 | +4 | 0.35 | 0.71 | 0.00 |
| 20 | MVRQMEHRWLFCANAQKEPMHKRHM | 97.7 | 25 | +6 | 0.40 | 0.71 | 0.00 |
| 21 | VSAKEFHATLWCKVIHPNNLKQVQKIR | 97.7 | 27 | +6 | 0.41 | 0.71 | 0.00 |
| 22 | TERKYMKDHLQAMPRKANQAWRCRFIW | 97.7 | 27 | +6 | 0.37 | 0.71 | 0.00 |
| 23 | HEMKMRAKMHEVTE | 97.7 | 14 | +2 | 0.36 | 0.71 | 0.00 |
| 24 | FCHQDHVAAEAVKCHTKRAVSH | 97.6 | 22 | +5 | 0.36 | 0.70 | 0.00 |
| 25 | WDHIFHREMHT | 97.6 | 11 | +2 | 0.36 | 0.70 | 0.00 |
| 26 | WFLVKCKKVEIHAYAKLSFRRIPFRECHH | 97.6 | 29 | +8 | 0.41 | 0.70 | 0.00 |
| 27 | MCIKKVQAHQTHSI | 97.5 | 14 | +4 | 0.36 | 0.70 | 0.00 |
| 28 | QQNKTHLHFRLIGV | 97.5 | 14 | +4 | 0.36 | 0.70 | 0.00 |
| 29 | HILIPFNKSKHYKRVLWRMMCWLPRHD | 97.5 | 27 | +8 | 0.41 | 0.69 | 0.00 |
| 30 | HWWRSFQTLH | 97.5 | 10 | +3 | 0.40 | 0.69 | 0.00 |
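To inspect how the Length and Charge columns might be derived, the sketch below applies a simple counting rule: K, R, and H count as +1, D and E as -1. This rule is an inference that matches the four rows checked here; it is not the documented formula of `simular_funnel_peptidos.py`, and the function name `net_charge` is hypothetical.

```python
# Assumed convention (inferred from the table, not confirmed by the source):
# K, R and H each contribute +1; D and E each contribute -1.
POSITIVE = "KRH"
NEGATIVE = "DE"


def net_charge(seq: str) -> int:
    """Integer net charge under the assumed counting rule."""
    positives = sum(seq.count(aa) for aa in POSITIVE)
    negatives = sum(seq.count(aa) for aa in NEGATIVE)
    return positives - negatives


# (sequence, Length, Charge) copied from rows 1, 2, 8 and 23 of the table.
ROWS = [
    ("HRARMWVKRRQ", 11, 6),
    ("SIQIMERKRIAMKKRLHKFQMPK", 23, 8),
    ("HSFDHRTMHFFAK", 13, 4),
    ("HEMKMRAKMHEVTE", 14, 2),
]

for seq, length, charge in ROWS:
    # Both comparisons hold for these rows under the assumed rule.
    print(seq, len(seq) == length, net_charge(seq) == charge)
```

Counting histidine as fully positive is a simplification, which is one more reason to treat these values as didactic rather than as physical charge estimates.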
Module README
06 Scientific Problem and AI
This module explains the challenge that PROTEONEXT aims to address and simulates the central bottleneck in antimicrobial peptide discovery:
Generating many sequences is easy; deciding which ones deserve synthesis and assay is hard.
The real problem as of May 1, 2026
Antimicrobial resistance (AMR) is a health, scientific, and industrial problem. In infections caused by multidrug-resistant pathogens, especially hospital Gram-negatives, clinicians may have very few therapeutic options. Scientists need more than new molecules; they also need better ways to prioritize candidates, connect hospital data, leverage experimental results, and reduce pointless iterations.
In PROTEONEXT the initial focus is best understood through three groups:
- CRAB: Acinetobacter baumannii resistant to carbapenems.
- CRE: Carbapenem-resistant Enterobacterales.
- CRPA: Pseudomonas aeruginosa resistant to carbapenems.
Why AI
AI can help in different tasks:
- Federated analytics: understand which data and isolates exist without moving sensitive rows.
- Predictive models: prioritize phenotypes, nodes, mechanisms, and candidates.
- Generative protein AI: explore AMP sequences at scale.
- Filters and scoring: discard unlikely candidates before synthesis.
- Active learning: choose the next experimental batch to learn more with fewer assays.
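The first item in the list, federated analytics, can be illustrated with a minimal sketch in which each node shares only aggregate counts and never rows. The node data, the `group` field, the function names, and the small-cell threshold `MIN_CELL` are all hypothetical; the source does not describe how PROTEONEXT implements this.

```python
from collections import Counter

MIN_CELL = 5  # hypothetical threshold: counts below this are not shared


def local_summary(rows, min_cell=MIN_CELL):
    """Runs inside a node: counts isolates per group and returns no rows."""
    counts = Counter(row["group"] for row in rows)
    # Suppress small cells so rare isolates cannot be singled out.
    return {group: n for group, n in counts.items() if n >= min_cell}


def federated_counts(summaries):
    """Runs at the coordinator: merges the aggregates sent by each node."""
    total = Counter()
    for summary in summaries:
        total.update(summary)  # Counter.update adds counts
    return dict(total)


# Fictitious node contents, for illustration only.
node_a = [{"group": "CRAB"}] * 12 + [{"group": "CRE"}] * 3
node_b = [{"group": "CRAB"}] * 7 + [{"group": "CRPA"}] * 9

merged = federated_counts([local_summary(node_a), local_summary(node_b)])
print(merged)
```

In this toy run the three CRE isolates at the first node fall under the threshold and are withheld, so the coordinator sees only the CRAB and CRPA totals.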
AI does not validate antimicrobial activity. Real validation requires MIC/MBC, hemolysis, cytotoxicity, stability, and additional assays.
Where GPT-5.5/Codex helps
GPT-5.5/Codex helps as a technical copilot:
- Building didactic simulators.
- Generating and reviewing code.
- Creating validators, APIs, dashboards, and tests.
- Translating scientific concepts into Microsoft architecture.
- Helping document decisions and risks.
- Preparing prompts, playbooks, and literature reviews for humans.
It should not be used as a biomedical validator or as a substitute for specialized protein LLMs, QSAR, scientific committees, or wet lab work.
ProteoGPT-like inside PROTEONEXT
PROTEONEXT should not be explained as a "version 2.0" of ProteoGPT. They are different by nature:
- ProteoGPT or a ProteoGPT-like model represents the specialized scientific layer: protein language models capable of generating, transforming, or prioritizing peptide sequences.
- PROTEONEXT represents the sovereign translational platform: federated AMR data, governance, security, confidential computing, MLOps, Fabric/Purview, traceability, scientific partners, and experimental validation.
- GPT-5.5/Codex represents the technical copilot: it helps build, test, document, explain, and operate the platform, but does not replace biomolecular models or wet lab work.
The correct relationship is that PROTEONEXT could encapsulate or integrate ProteoGPT-like models as a specialized generative engine. The differential contribution of PROTEONEXT is not just generating peptides, but connecting that generation with authorized microbiological/genomic data, privacy, governance, MLOps, and an experimental validation loop.
Module 07 didactically simulates that model-lab-model loop: a computational shortlist moves to fictitious assays and those results guide the next iteration. It is an analogy of SPEL-like active learning, not a real biological execution.
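A model-lab-model loop of this kind can be sketched in a few lines. The example below is not the Module 07 code and does not implement SPEL. It assumes a generic heuristic: predict each candidate's value from its nearest already-assayed neighbor and add a bonus for distance from assayed points, used as a crude uncertainty proxy. The descriptors, the `fictitious_assay` function, and all parameters are invented for illustration.

```python
import math
import random


def features(seq):
    """Two toy descriptors: charge per residue and hydrophobic fraction."""
    charge = sum(seq.count(a) for a in "KRH") - sum(seq.count(a) for a in "DE")
    hydrophobic = sum(seq.count(a) for a in "AILMFVWC") / len(seq)
    return (charge / len(seq), hydrophobic)


def fictitious_assay(seq, rng):
    """Stand-in for a wet-lab result: an invented function plus noise."""
    charge_density, hydrophobic = features(seq)
    return charge_density - abs(hydrophobic - 0.4) + rng.gauss(0, 0.05)


def acquisition(seq, labeled, beta=1.0):
    """Nearest-neighbor prediction plus a distance-based exploration bonus."""
    f = features(seq)
    nearest, value = min(
        labeled.items(), key=lambda kv: math.dist(f, features(kv[0]))
    )
    return value + beta * math.dist(f, features(nearest))


def run_loop(pool, rounds=3, batch_size=5, seed=0):
    """Assay a random first batch, then let the acquisition pick each batch."""
    rng = random.Random(seed)
    pool = sorted(set(pool))  # deduplicate, keep a deterministic order
    labeled = {s: fictitious_assay(s, rng) for s in rng.sample(pool, batch_size)}
    for _ in range(rounds):
        remaining = [s for s in pool if s not in labeled]
        batch = sorted(
            remaining, key=lambda s: acquisition(s, labeled), reverse=True
        )[:batch_size]
        for seq in batch:
            labeled[seq] = fictitious_assay(seq, rng)  # "lab" feeds the model
    return labeled


if __name__ == "__main__":
    rng = random.Random(1)
    alphabet = "ACDEFGHIKLMNPQRSTVWY"
    pool = [
        "".join(rng.choice(alphabet) for _ in range(rng.randint(8, 40)))
        for _ in range(200)
    ]
    results = run_loop(pool)
    print(len(results), "candidates assayed (fictitiously)")
```

The `beta` parameter trades exploitation (candidates near known good results) against exploration (candidates far from anything assayed). A real loop would replace the nearest-neighbor rule with a trained model and calibrated uncertainty.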
Funnel simulation
The script simular_funnel_peptidos.py generates 10,000 synthetic peptides of 8 to 40 amino acids and applies a funnel:
- Mass generation.
- Basic physicochemical filter.
- Simulated safety filter.
- Score ranking.
- Shortlist of 30 candidates.
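The five stages above can be sketched as follows. This is not the content of `simular_funnel_peptidos.py`, which the README does not show. Every threshold, the charge rule (K, R, H as +1; D, E as -1), the hydrophobic residue set, the run-length safety proxy, and the scoring formula are assumptions chosen only to make the funnel shape visible.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AILMFVWC")  # assumed residue set


def generate(n, rng, min_len=8, max_len=40):
    """Stage 1: mass generation of random peptides of 8 to 40 residues."""
    return [
        "".join(rng.choice(AMINO_ACIDS) for _ in range(rng.randint(min_len, max_len)))
        for _ in range(n)
    ]


def net_charge(seq):
    return sum(seq.count(a) for a in "KRH") - sum(seq.count(a) for a in "DE")


def hydrophobic_fraction(seq):
    return sum(a in HYDROPHOBIC for a in seq) / len(seq)


def passes_physchem(seq):
    """Stage 2: cationic and moderately hydrophobic (assumed thresholds)."""
    return net_charge(seq) >= 2 and 0.30 <= hydrophobic_fraction(seq) <= 0.50


def longest_hydrophobic_run(seq):
    best = run = 0
    for a in seq:
        run = run + 1 if a in HYDROPHOBIC else 0
        best = max(best, run)
    return best


def passes_safety(seq):
    """Stage 3: toy toxicity proxy that rejects long hydrophobic stretches."""
    return longest_hydrophobic_run(seq) < 5


def score(seq):
    """Stage 4: toy score, charge density minus a hydrophobicity penalty."""
    return net_charge(seq) / len(seq) - abs(hydrophobic_fraction(seq) - 0.38)


def run_funnel(n=10_000, shortlist_size=30, seed=42):
    rng = random.Random(seed)
    generated = generate(n, rng)
    physchem = [s for s in generated if passes_physchem(s)]
    safe = [s for s in physchem if passes_safety(s)]
    ranked = sorted(safe, key=score, reverse=True)
    shortlist = ranked[:shortlist_size]  # Stage 5
    counts = {
        "generated": len(generated),
        "physchem": len(physchem),
        "safe": len(safe),
        "shortlist": len(shortlist),
    }
    return counts, shortlist


if __name__ == "__main__":
    counts, shortlist = run_funnel()
    print(counts)
    print(shortlist[:3])
```

Printing `counts` shows how many candidates survive each stage. The shortlist is only the top of a ranking produced by invented proxies, which is why it cannot stand in for biological evidence.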
Results are written to:
- salida/funnel_resultados.json
- salida/shortlist_peptidos.csv
Run
From the Desarrollo directory, in PowerShell:
& 'C:\ProgramData\miniconda3\python.exe' .\06_problema_cientifico_ia\simular_funnel_peptidos.py