Protein Topologies - Jack Simon

Structural Analysis Pipeline

I developed a computational pipeline for analyzing structural differences between wild type and mutant protein variants, integrating contact mapping, cavity detection, and LLM reasoning to identify mutation-specific druggable binding sites. The code is available here.

The pipeline incorporates MDTraj for fast PDB/trajectory parsing, contact‑map for residue‑level contact‑frequency matrices, fpocket for cavity detection and druggability scoring, and Anthropic Claude for LLM‑driven hotspot triage. Feed it a wild‑type / mutant pair of structures and it returns contact‑difference maps, pocket metadata, ranked hotspot residues, and ready‑to‑run PyMOL scripts—fully unattended and trivially parallelizable across large variant panels.

The workflow requires wild type and mutant PDB structures as input and produces comprehensive structural analysis data, cavity characterization, and LLM-selected hotspot residues for targeted protein design. The entire analysis completes automatically without manual intervention, scaling to process multiple protein variants systematically.

Architecture

The pipeline orchestrates systematic structural comparison through multiple analysis modules, each contributing essential data to the final LLM-powered hotspot selection. The workflow processes structural data through established computational methods before leveraging an LLM for decision making.

graph TD A[Wild type PDB] --> C[Load Structures
MDTraj] B[Mutant PDB] --> C C --> F[Generate Contact
Frequency Maps] F --> G[Calculate Contact
Differences] G --> H[Export Results] H --> I[Contact Differences
Text File] H --> J[CSV Matrix
All Deltas] H --> K[PNG Heatmap
Contact Map] G --> L[Rank Residues by
Contact Changes] L --> M[Extract Changed
Residues] M --> N[most_changed_residues.json] B --> O[Copy PDB to
Analysis Directory] O --> P[Run fpocket
Cavity Detection] P --> Q[Parse Pocket
Metadata] Q --> R[Assign Residues
to Pockets 4Å] R --> S[pocket_analysis.json] C --> T[Find Numbering
Gaps] T --> U[missing_residues.json] C --> V[Extract Chain
Boundaries Chain A] V --> W[residue_boundaries.json] C --> X[Extract
Sequences] X --> Y[sequences.json] N --> BB[Merge All Outputs] S --> BB U --> BB W --> BB Y --> BB BB --> CC[combined_analysis.json] CC --> DD[Anthropic Claude
API Call] DD --> EE[Generate Design
Specifications] EE --> FF[specifications.json] EE --> GG[PyMOL Commands] style A fill:#e1f5fe style B fill:#e1f5fe style CC fill:#fff3e0 style DD fill:#f3e5f5 style FF fill:#e8f5e8 style GG fill:#e8f5e8

Program Modules

1. Input Processing

The pipeline begins with wild type and mutant PDB structure files, establishing analysis output directories and validating input structures. Both structures are loaded using MDTraj.

2. Contact Map Analysis Module

The contact analysis module generates intermolecular contact frequency maps using the contact-map library's ContactFrequency class. An AtomMismatchedContactDifference object quantifies how each residue's contact pattern changes between wild type and mutant, identifying both direct mutation effects and allosteric structural perturbations. Results are exported in multiple formats: human-readable text files, complete CSV matrices, and color-scaled PNG heatmap visualizations.

3. Residue Change Ranking Module

All residues showing contact pattern changes are ranked by total absolute change magnitude and converted to conventional PDB format. The module extracts and stores the most significantly changed residues with quantitative metrics in most_changed_residues.json, providing input for downstream hotspot selection.

4. Cavity Detection Module

The fpocket integration module focuses exclusively on mutant structure cavity detection. The system copies the mutant PDB to the analysis directory, executes fpocket cavity detection, parses resulting metadata, and geometrically assigns protein residues to detected pockets using a 4Å distance cutoff. Each pocket's druggability metrics and residue composition are recorded for analysis in pocket_analysis.json, providing a direct link between structural changes and potential druggable sites.

5. Structural Quality Assessment Module

Additional analysis modules perform structural quality checks including missing residue identification through numbering gap detection, chain boundary extraction, and complete amino acid sequence extraction with single-letter code conversion. All results are stored in JSON format for downstream processing.

6. Data Integration Module

All individual analysis outputs are consolidated into combined_analysis.json, creating a comprehensive structural dataset that serves as input to an LLM. This unified schema enables custom processing of complex structural information and ensures all features are available for downstream reasoning.

7. LLM Hotspot Selection Module

The Anthropic Claude API processes the complete structural analysis data using carefully crafted prompts that instruct the LLM to identify optimal hotspot residues. Claude applies expert structural biology reasoning (encoded in the prompt, not necessarily parametrically in the model) to select spatially clustered residues from high-scoring pockets that exhibit significant contact changes, returning minimal JSON specifications ready for direct integration with protein design platforms.

8. Output Generation

The pipeline generates comprehensive output files including PyMOL visualization commands for structural interpretation and audits of pipeline runs.

Future Work

Scaling the method to longer MD trajectories and enhanced‑sampling replicas will capture mutation‑induced flexibility that single‑frame analysis may miss. Experimental follow‑up—binding kinetics, cell‑based assays, and structural confirmation—remains the critical path to validating the LLM's hotspot picks. Finally, coupling the module stack directly to generative design engines will enable an end‑to‑end loop from variant upload to inhibitor blueprint.

This agentic approach — where autonomous modules exist within complex, multi-step programs — demonstrates the utility of integrating LLMs as reasoning engines within bioengineering pipelines. By leveraging LLMs for decision-making, triage, and design specification, systems will increasingly transcends traditional automation, enabling adaptive workflows that can interpret, prioritize, and act on diverse biological data. This paradigm shift paves the way for more intelligent, responsive, and creative programs in protein engineering, drug discovery, and all of life science.

Portfolio