Learning path

The problem we target

Antimicrobial resistance (AMR) leaves some infections with few therapeutic options. Scientists need to connect dispersed data, prioritize isolates, and choose candidates with limited experimental capacity.

Where AI helps

AI can reduce the search space, filter sequences, prioritize candidates, and guide active learning, but validation remains experimental.

Where GPT-5.5/Codex helps

GPT-5.5/Codex helps build, document, simulate, code, test, and explain the platform. It does not replace biomolecular models or wet lab work.

ProteoGPT-like inside PROTEONEXT

PROTEONEXT is not ProteoGPT 2.0: it is a sovereign platform that could encapsulate specialized protein LLMs and connect them with federated AMR data, governance, MLOps, and experimental validation.

Protein LLM / ProteoGPT-like: Specialized scientific engine to generate, transform, or prioritize peptide sequences. It still requires filters, benchmarks, and validation.
PROTEONEXT: Translational ecosystem (federated nodes, Azure, Fabric, Purview, confidential computing, MLOps, traceability, and scientific partners).
GPT-5.5 / Codex: Technical copilot to build, document, test, and explain the platform. It does not decide real biological activity.
Generated sequences: 10,000
Shortlist: 30
Biologically validated: 0

Prioritization funnel

Each stage discards candidates. The key point is that a computational shortlist is still not biological evidence.

Stage                         Candidates   Share of generated
Mass generation               10,000       100.00%
Physicochemical filter        1,682        16.82%
Simulated safety filter       1,680        16.80%
Score ranking                 1,680        16.80%
Synthesis shortlist           30           0.30%
Real biological validation    0            0.00%
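
The percentages in the funnel are each stage's count divided by the 10,000 generated sequences. A minimal sketch of that arithmetic follows; the stage counts are copied from the funnel, while the function name `retention` and the extra "share of previous stage" column are illustrative additions, not part of the module.

```python
# Stage counts copied from the funnel above.
funnel = [
    ("Mass generation", 10000),
    ("Physicochemical filter", 1682),
    ("Simulated safety filter", 1680),
    ("Score ranking", 1680),
    ("Synthesis shortlist", 30),
    ("Real biological validation", 0),
]


def retention(stages):
    """Return (name, count, % of generated, % of previous stage) per stage."""
    total = stages[0][1]
    prev = total
    rows = []
    for name, count in stages:
        pct_total = 100.0 * count / total
        # Guard against division by zero if an earlier stage is empty.
        pct_prev = 100.0 * count / prev if prev else 0.0
        rows.append((name, count, round(pct_total, 2), round(pct_prev, 2)))
        prev = count
    return rows


for name, count, pct_total, pct_prev in retention(funnel):
    print(f"{name:<28}{count:>6}  {pct_total:6.2f}% of generated  "
          f"{pct_prev:6.2f}% of previous stage")
```

Read this way, the two large drops are the physicochemical filter (10,000 to 1,682) and the cut from the ranked list to the shortlist (1,680 to 30); the simulated safety filter removes only two candidates.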

Top 30 didactic candidates

These candidates are useful to inspect scoring, not to make real scientific decisions.

# Sequence Score Length Charge Hydrophobicity Solubility proxy Toxicity proxy
1 HRARMWVKRRQ 99.8 11 +6 0.36 0.82 0.00
2 SIQIMERKRIAMKKRLHKFQMPK 99.7 23 +8 0.39 0.82 0.00
3 WKYHIKINQVHSVSIRH 99.0 17 +6 0.35 0.78 0.00
4 CCIHTFIKKNKKAQMRRQSLFA 98.8 22 +7 0.36 0.76 0.00
5 INMKAWHAWMGCANHHKRMRTQER 98.7 24 +7 0.38 0.76 0.00
6 NNIAIVFGPHKHVLRLHGRKSK 98.5 22 +8 0.36 0.75 0.00
7 KMIAKNRCVHHRGNKVTTIVI 98.4 21 +7 0.38 0.74 0.00
8 HSFDHRTMHFFAK 98.3 13 +4 0.38 0.74 0.00
9 SIALDAWSHHQSHQRWHIMASQKVKNLVC 98.3 29 +6 0.41 0.74 0.00
10 MVCDWKKIWKNGHLNKRSVR 98.3 20 +6 0.35 0.74 0.00
11 IFKDKVMYHLLWKTASTKHHD 98.1 21 +5 0.38 0.73 0.00
12 HKHFITVDNLINSLLTRKSC 98.0 20 +4 0.35 0.72 0.00
13 SNCFVFFSEFIQNWKAMKILHKSQDKKYTK 98.0 30 +5 0.37 0.72 0.00
14 KGHSATRHKTIHVAVHAVPEFVDTGQATRV 98.0 30 +6 0.37 0.72 0.00
15 FLNRFIKNKVHDHPKV 97.9 16 +5 0.38 0.72 0.00
16 GILHWRQKYKAKCPHFERWRAKEAMFWHFN 97.9 30 +8 0.40 0.72 0.00
17 RHSHKWPFWITTVRRIHFAPAWWNPKGN 97.9 28 +8 0.39 0.71 0.00
18 HLNNLISTKWMVFKHNT 97.8 17 +4 0.41 0.71 0.00
19 MNTVAIKTFHLHGNKHE 97.8 17 +4 0.35 0.71 0.00
20 MVRQMEHRWLFCANAQKEPMHKRHM 97.7 25 +6 0.40 0.71 0.00
21 VSAKEFHATLWCKVIHPNNLKQVQKIR 97.7 27 +6 0.41 0.71 0.00
22 TERKYMKDHLQAMPRKANQAWRCRFIW 97.7 27 +6 0.37 0.71 0.00
23 HEMKMRAKMHEVTE 97.7 14 +2 0.36 0.71 0.00
24 FCHQDHVAAEAVKCHTKRAVSH 97.6 22 +5 0.36 0.70 0.00
25 WDHIFHREMHT 97.6 11 +2 0.36 0.70 0.00
26 WFLVKCKKVEIHAYAKLSFRRIPFRECHH 97.6 29 +8 0.41 0.70 0.00
27 MCIKKVQAHQTHSI 97.5 14 +4 0.36 0.70 0.00
28 QQNKTHLHFRLIGV 97.5 14 +4 0.36 0.70 0.00
29 HILIPFNKSKHYKRVLWRMMCWLPRHD 97.5 27 +8 0.41 0.69 0.00
30 HWWRSFQTLH 97.5 10 +3 0.40 0.69 0.00
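
To inspect how two of the columns could be derived, here is a minimal sketch. It rests on an assumption inferred from the table rather than confirmed by the script: the Charge values in the rows spot-checked are consistent with counting K, R, and H as +1 and D and E as -1, and the Hydrophobicity values are consistent with the fraction of residues in the set A, V, I, L, M, F, W. The Score, Solubility proxy, and Toxicity proxy columns cannot be reconstructed from the table and are left out.

```python
# Assumed residue classes, inferred from the table rows, not taken from the script.
POSITIVE = set("KRH")
NEGATIVE = set("DE")
HYDROPHOBIC = set("AVILMFW")


def net_charge(seq: str) -> int:
    """Count of positive residues minus count of negative residues."""
    return sum(aa in POSITIVE for aa in seq) - sum(aa in NEGATIVE for aa in seq)


def hydrophobic_fraction(seq: str) -> float:
    """Fraction of residues that belong to the assumed hydrophobic set."""
    return sum(aa in HYDROPHOBIC for aa in seq) / len(seq)


# Rows 1, 8, and 30 of the table.
for seq in ("HRARMWVKRRQ", "HSFDHRTMHFFAK", "HWWRSFQTLH"):
    print(seq, len(seq), f"{net_charge(seq):+d}", f"{hydrophobic_fraction(seq):.2f}")
```

Under these assumptions the three printed rows match the Length, Charge, and Hydrophobicity values listed for candidates 1, 8, and 30.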

Module README

06 Scientific Problem and AI

This module explains the challenge that PROTEONEXT aims to address and simulates the central bottleneck in antimicrobial peptide discovery:

Generating many sequences is easy; deciding which ones deserve synthesis and assay is hard.

The real problem as of May 1, 2026

Antimicrobial resistance (AMR) is a health, scientific, and industrial problem. In infections caused by multidrug-resistant pathogens, especially hospital Gram-negatives, clinicians may have very few therapeutic options. Scientists need more than new molecules: they need better ways to prioritize candidates, connect hospital data, leverage experimental results, and avoid iterations that teach nothing.

In PROTEONEXT the initial focus is best understood through three groups:

  • CRAB: Carbapenem-resistant Acinetobacter baumannii.
  • CRE: Carbapenem-resistant Enterobacterales.
  • CRPA: Carbapenem-resistant Pseudomonas aeruginosa.

Why AI

AI can help with several tasks:

  • Federated analytics: understand which data and isolates exist without moving sensitive rows.
  • Predictive models: prioritize phenotypes, nodes, mechanisms, and candidates.
  • Generative protein AI: explore AMP sequences at scale.
  • Filters and scoring: discard unlikely candidates before synthesis.
  • Active learning: choose the next experimental batch to learn more with fewer assays.

AI does not validate antimicrobial activity. Real validation requires MIC/MBC, hemolysis, cytotoxicity, stability, and additional assays.

Where GPT-5.5/Codex helps

GPT-5.5/Codex helps as a technical copilot:

  • Building didactic simulators.
  • Generating and reviewing code.
  • Creating validators, APIs, dashboards, and tests.
  • Translating scientific concepts into Microsoft architecture.
  • Helping document decisions and risks.
  • Preparing prompts, playbooks, and literature reviews for humans.

It should not be used as a biomedical validator or as a substitute for specialized protein LLMs, QSAR models, scientific committees, or wet lab work.

ProteoGPT-like inside PROTEONEXT

PROTEONEXT should not be explained as a "version 2.0" of ProteoGPT. They are different by nature:

  • ProteoGPT or a ProteoGPT-like model represents the specialized scientific layer: protein language models capable of generating, transforming, or prioritizing peptide sequences.
  • PROTEONEXT represents the sovereign translational platform: federated AMR data, governance, security, confidential computing, MLOps, Fabric/Purview, traceability, scientific partners, and experimental validation.
  • GPT-5.5/Codex represents the technical copilot: it helps build, test, document, explain, and operate the platform, but does not replace biomolecular models or wet lab work.

The correct relationship is that PROTEONEXT could encapsulate or integrate ProteoGPT-like models as a specialized generative engine. The differential contribution of PROTEONEXT is not just generating peptides, but connecting that generation with authorized microbiological/genomic data, privacy, governance, MLOps, and an experimental validation loop.

Module 07 didactically simulates that model-lab-model loop: a computational shortlist moves to fictitious assays, and those results guide the next iteration. It is an analogy for SPEL-like active learning, not a real biological experiment.
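
The model-lab-model loop can be illustrated with a toy pool-based active learning sketch. This is not Module 07's code: every name here (`fake_assay`, `predict`, `active_learning`, `beta`) is hypothetical, each candidate is reduced to a single made-up feature in [0, 1], and the "assay" is a noisy hidden function with no biological meaning. The surrogate is a 1-nearest-neighbour lookup, and the acquisition rule adds an exploration bonus proportional to the distance from the closest already-assayed candidate.

```python
import random


def fake_assay(x, rng):
    """Fictitious lab result: a hidden response plus noise. Not biology."""
    return -(x - 0.6) ** 2 + rng.gauss(0, 0.01)


def predict(x, labelled):
    """1-nearest-neighbour surrogate: value of, and distance to, the closest assayed point."""
    nearest_x, nearest_y = min(labelled, key=lambda p: abs(p[0] - x))
    return nearest_y, abs(nearest_x - x)


def active_learning(pool, rounds=3, batch_size=5, beta=1.0, seed=0):
    rng = random.Random(seed)
    pool = list(pool)
    # Round 0: a random batch, since nothing has been assayed yet.
    rng.shuffle(pool)
    labelled = [(x, fake_assay(x, rng)) for x in pool[:batch_size]]
    pool = pool[batch_size:]

    for _ in range(rounds):
        def acquisition(x):
            mean, distance = predict(x, labelled)
            # Exploit a high predicted value, explore far from assayed points.
            return mean + beta * distance

        pool.sort(key=acquisition, reverse=True)
        batch, pool = pool[:batch_size], pool[batch_size:]
        # "Send the batch to the lab" and feed the results back to the model.
        labelled += [(x, fake_assay(x, rng)) for x in batch]
    return labelled, pool


candidates = [i / 100 for i in range(101)]
assayed, remaining = active_learning(candidates)
print(len(assayed), "assayed,", len(remaining), "left in the pool")
```

Raising `beta` pushes each batch toward unexplored regions of the pool; setting it to 0 makes the loop purely greedy on the current predictions.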

Funnel simulation

The script simular_funnel_peptidos.py generates 10,000 synthetic peptides of 8 to 40 amino acids and applies a funnel:

  1. Mass generation.
  2. Basic physicochemical filter.
  3. Simulated safety filter.
  4. Score ranking.
  5. Shortlist of 30 candidates.
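
The five steps above can be sketched as follows. This is a minimal illustration, not the contents of simular_funnel_peptidos.py: the residue classes, the filter thresholds, the safety proxy, and the score formula are all hypothetical placeholders, so the stage counts it prints will differ from the module's results.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
POSITIVE, NEGATIVE = set("KRH"), set("DE")   # assumed charge classes
HYDROPHOBIC = set("AVILMFW")                  # assumed hydrophobic class


def generate(n, min_len=8, max_len=40, seed=0):
    """Step 1: random peptides of min_len to max_len residues."""
    rng = random.Random(seed)
    return [
        "".join(rng.choice(AMINO_ACIDS) for _ in range(rng.randint(min_len, max_len)))
        for _ in range(n)
    ]


def net_charge(seq):
    return sum(aa in POSITIVE for aa in seq) - sum(aa in NEGATIVE for aa in seq)


def hydrophobic_fraction(seq):
    return sum(aa in HYDROPHOBIC for aa in seq) / len(seq)


def passes_physchem(seq):
    """Step 2: hypothetical thresholds on charge and hydrophobicity."""
    return net_charge(seq) >= 2 and 0.30 <= hydrophobic_fraction(seq) <= 0.50


def passes_safety(seq, max_run=5):
    """Step 3: placeholder proxy that rejects long hydrophobic runs."""
    run = 0
    for aa in seq:
        run = run + 1 if aa in HYDROPHOBIC else 0
        if run > max_run:
            return False
    return True


def score(seq):
    """Step 4: toy score that rewards charge and a hydrophobic fraction near 0.40."""
    return net_charge(seq) - 10 * abs(hydrophobic_fraction(seq) - 0.40)


def run_funnel(n=10_000, shortlist_size=30, seed=0):
    generated = generate(n, seed=seed)
    physchem = [s for s in generated if passes_physchem(s)]
    safe = [s for s in physchem if passes_safety(s)]
    ranked = sorted(safe, key=score, reverse=True)
    shortlist = ranked[:shortlist_size]          # Step 5
    counts = {
        "generated": len(generated),
        "physicochemical": len(physchem),
        "safety": len(safe),
        "ranked": len(ranked),
        "shortlist": len(shortlist),
    }
    return counts, shortlist


counts, shortlist = run_funnel()
print(counts)
```

Ranking does not discard anything, which is why the "ranked" count always equals the "safety" count; only the final cut to the shortlist reduces the list.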

Results are written to:

  • salida/funnel_resultados.json
  • salida/shortlist_peptidos.csv

Run

From the Desarrollo directory, in PowerShell:

& 'C:\ProgramData\miniconda3\python.exe' .\06_problema_cientifico_ia\simular_funnel_peptidos.py