Case history: AI-Driven Drug Repurposing Workflow for Rare Disease Research


Context

A biotechnology research institute specializing in rare neurological disorders faced a critical challenge in their drug repurposing pipeline. With limited resources and a small compound library, the organization needed to identify potential therapeutic candidates more efficiently. Traditional high-throughput screening approaches were cost-prohibitive, and the time required to validate new chemical entities would exceed their funding timeline.

The research team had identified a promising lead compound showing preliminary efficacy in preclinical models, but they lacked the infrastructure to systematically explore structural analogs and predict their biological activity profiles. Manual literature reviews and database searches were time-consuming and often incomplete, leaving potentially valuable therapeutic insights undiscovered.

Challenge

The institute needed to develop an automated workflow that would:

  1. Identify structural analogues of the reference compound through robust 2D fingerprint-based similarity searches (Tanimoto ≥ 0.85) against PubChem
  2. Map each analogue’s canonical SMILES to its corresponding PubChem CID and ChEMBL ID, applying fallback strategies to maximize coverage
  3. Enrich the resulting compound list with high-quality biological annotations (direct target assignments restricted to the ChEMBL target type SINGLE PROTEIN, mechanism-of-action metadata, assay details, and literature-level associations), queried programmatically from ChEMBL and PubChem
  4. Deliver reproducible outputs, including structured CSV tables, a PDF report of SMILES-to-target summaries, and both static (heatmaps, scatter plots) and interactive (HTML network) visualizations
  5. Ensure scalability and modularity, so that the same pipeline can handle small test batches or scale to tens of thousands of compounds, integrate new data sources, and accommodate evolving project requirements through configurable parameters and cleanly separated code modules
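
As an illustration of the "configurable parameters" requirement, the following is a minimal sketch of a settings object and a batching helper. The class name, field names, and default batch size are assumptions made for this example; only the 0.85 threshold, the SINGLE PROTEIN target type, and the output formats come from the requirements above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical settings object; all field names are illustrative."""
    similarity_threshold: float = 0.85    # Tanimoto cutoff from the requirements
    target_type: str = "SINGLE PROTEIN"   # target type kept during enrichment
    batch_size: int = 500                 # assumed chunk size, not from the source
    output_formats: tuple = ("csv", "pdf", "html")

    def __post_init__(self):
        # Fail early on settings that cannot be meaningful.
        if not 0.0 <= self.similarity_threshold <= 1.0:
            raise ValueError("similarity_threshold must be in [0, 1]")
        if self.batch_size < 1:
            raise ValueError("batch_size must be at least 1")


def batches(items, size):
    """Yield successive chunks so small and large inputs share one code path."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


config = PipelineConfig(batch_size=2)
chunks = list(batches(["cpd_1", "cpd_2", "cpd_3"], config.batch_size))
```

Keeping every tunable value in one immutable object is one way to make a run reproducible: the same configuration can be logged alongside the outputs it produced.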

Solution

Kode, a cheminformatics solutions development company, designed and implemented a fully automated, end-to-end computational workflow that integrated chemical similarity analysis with biological data enrichment to address the institute’s challenges:

Phase 1 – Chemical Similarity Search: The pipeline performed 2D fingerprint-based similarity searches against the PubChem database using a Tanimoto coefficient threshold of ≥0.85. This threshold balanced structural relevance with enough chemical diversity to capture novel biological interactions. Retrieved compounds were then filtered to remove duplicates and entries lacking experimental annotations.
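
To make the threshold concrete, here is a minimal sketch of Tanimoto filtering on fingerprints represented as sets of on-bit indices. It is not the pipeline's actual code: the function names and the toy fingerprints are invented for illustration, and a real run would obtain fingerprints from PubChem or a cheminformatics toolkit.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient for fingerprints stored as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def filter_analogues(reference, candidates, threshold=0.85):
    """Return (name, score) pairs at or above the threshold, best first.

    Repeated names are skipped, mirroring the duplicate filtering step.
    """
    seen = set()
    hits = []
    for name, fp in candidates:
        if name in seen:
            continue
        seen.add(name)
        score = tanimoto(reference, fp)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy fingerprints with made-up bit indices, not real compounds.
reference = frozenset(range(20))
candidates = [
    ("analogue_A", frozenset(range(19))),             # 19 shared / 20 in union
    ("analogue_B", frozenset(range(10))),             # 10 shared / 20 in union
    ("analogue_A", frozenset(range(19))),             # duplicate entry
    ("analogue_C", frozenset(range(1, 20)) | {99}),   # 19 shared / 21 in union
]
hits = filter_analogues(reference, candidates)
```

With these toy inputs, analogue_A (0.95) and analogue_C (about 0.90) pass the cutoff, while analogue_B (0.50) is discarded.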

Phase 2 – Intelligent Identifier Mapping: Canonical SMILES strings were programmatically mapped to both PubChem CIDs and ChEMBL IDs. The system implemented sophisticated fallback mechanisms to maximize data coverage:

  • When direct matches failed, the pipeline searched for structurally similar analogs within ChEMBL
  • Synonym resolution captured biologically relevant entries under alternative identifiers
  • Salt/desalt normalization ensured that formulated and parent compounds were properly linked
  • Duplicate filtering prevented redundant processing
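
The fallback logic can be pictured as an ordered chain of resolvers that is walked until one succeeds. The sketch below is an assumption about how such a chain might look, not the delivered implementation: the lookup tables stand in for remote ChEMBL and PubChem queries, the IDs are placeholders, and the desalting rule (keep the longest dot-separated SMILES fragment) is deliberately crude.

```python
# Hypothetical in-memory tables standing in for remote identifier services.
# The IDs are placeholders, not real database identifiers.
KNOWN_IDS = {"CCO": "ID-001", "c1ccccc1O": "ID-002"}
SYNONYMS = {"OCC": "CCO"}  # alternative spelling -> preferred form


def strip_salt(smiles):
    """Crude desalting: keep the longest dot-separated fragment."""
    return max(smiles.split("."), key=len)


def exact_match(smiles):
    return KNOWN_IDS.get(smiles)


def synonym_match(smiles):
    return KNOWN_IDS.get(SYNONYMS.get(smiles, ""))


def desalted_match(smiles):
    return KNOWN_IDS.get(strip_salt(smiles))


# Order matters: cheaper and stricter strategies are tried first.
RESOLVERS = [
    ("exact", exact_match),
    ("synonym", synonym_match),
    ("desalted", desalted_match),
]


def map_identifier(smiles, resolvers=RESOLVERS):
    """Return (identifier, strategy) for the first resolver that succeeds."""
    for label, resolver in resolvers:
        found = resolver(smiles)
        if found is not None:
            return found, label
    return None, None


def map_all(smiles_list, resolvers=RESOLVERS):
    """Map each distinct SMILES once; repeated inputs are skipped."""
    results = {}
    for smiles in smiles_list:
        if smiles not in results:
            results[smiles] = map_identifier(smiles, resolvers)
    return results


mapping = map_all(["CCO", "OCC", "c1ccccc1O.[Na+]", "CCO", "CCN"])
```

Recording which strategy produced each match makes it possible to report how much coverage the fallbacks added. A similarity-based fallback could be appended to the list as one more resolver.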

Phase 3 – Comprehensive Biological Enrichment: For each compound, the system retrieved:

  • Direct target annotations from ChEMBL (focusing on single-protein interactions for specificity)
  • Mechanism-of-action metadata
  • Detailed assay information including experimental type, target identifiers, and studied organisms
  • Literature-level associations from scientific publications when direct assay data was unavailable
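
The enrichment step can be sketched as a filter over activity records followed by a check for compounds left without direct annotations. The record layout below (keys such as compound_id, target_id, target_type) is an assumption for illustration and does not reproduce the ChEMBL response schema; the IDs are invented.

```python
from collections import defaultdict


def direct_targets(records, target_type="SINGLE PROTEIN"):
    """Group target IDs by compound, keeping only the requested target type."""
    grouped = defaultdict(set)
    for rec in records:
        if rec["target_type"] == target_type:
            grouped[rec["compound_id"]].add(rec["target_id"])
    return {cid: sorted(targets) for cid, targets in grouped.items()}


def needs_literature(compound_ids, targets):
    """Compounds with no direct target annotation, in input order."""
    return [cid for cid in compound_ids if cid not in targets]


# Made-up records; field names and values are hypothetical.
records = [
    {"compound_id": "cpd_1", "target_id": "T1", "target_type": "SINGLE PROTEIN",
     "assay_type": "B", "organism": "Homo sapiens"},
    {"compound_id": "cpd_1", "target_id": "T2", "target_type": "PROTEIN COMPLEX",
     "assay_type": "B", "organism": "Homo sapiens"},
    {"compound_id": "cpd_2", "target_id": "T1", "target_type": "SINGLE PROTEIN",
     "assay_type": "F", "organism": "Rattus norvegicus"},
    {"compound_id": "cpd_3", "target_id": "T9", "target_type": "ORGANISM",
     "assay_type": "F", "organism": "Mus musculus"},
]
targets = direct_targets(records)
fallback_queue = needs_literature(["cpd_1", "cpd_2", "cpd_3"], targets)
```

Here cpd_3 has no single-protein record, so it lands in the queue for literature-level lookup.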

Phase 4 – Multi-Dimensional Analysis & Visualization: The workflow generated both static and interactive visualizations to support interpretation:

  • Target distribution analysis showing the spectrum from selective to promiscuous binders
  • Similarity heatmaps using multiple fingerprint types (MACCS, Morgan) and metrics (Tanimoto, cosine)
  • Interactive network graphs mapping compound-target relationships
  • Scatter plots highlighting structural conservation across target families
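
The heatmaps rest on pairwise similarity matrices. As a minimal sketch under simplifying assumptions (fingerprints as sets of on-bit indices, toy data, hypothetical function names), the two metrics named above can be computed as follows.

```python
import math


def tanimoto(a, b):
    """Shared on-bits divided by the on-bits in the union."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def cosine(a, b):
    """Cosine similarity for binary fingerprints."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def similarity_matrix(fingerprints, metric):
    """Return (names, square matrix) of pairwise scores in insertion order."""
    names = list(fingerprints)
    matrix = [
        [metric(fingerprints[row], fingerprints[col]) for col in names]
        for row in names
    ]
    return names, matrix


# Toy fingerprints with made-up bit indices.
fingerprints = {
    "cpd_1": {1, 2, 3, 4},
    "cpd_2": {3, 4, 5, 6},
    "cpd_3": {1, 2, 3, 4},
}
names, tanimoto_matrix = similarity_matrix(fingerprints, tanimoto)
_, cosine_matrix = similarity_matrix(fingerprints, cosine)
```

The resulting nested lists can be handed to any plotting library to draw a heatmap. Computing both metrics on the same fingerprints shows how the choice of metric shifts the scores: cpd_1 and cpd_2 score 1/3 under Tanimoto but 0.5 under cosine.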

All outputs were delivered in structured formats (CSV, PDF, HTML) ensuring reproducibility and easy integration into downstream analysis pipelines.

Results

The automated workflow successfully identified and characterized 150+ structural analogs of the lead compound, revealing several key insights:

Biological Activity Patterns:

  • Most compounds exhibited selective binding profiles (fewer than 10 targets), desirable for minimizing off-target effects
  • One compound displayed engagement with 25 unique targets, suggesting potential for multi-target therapeutic approaches
  • Ten recurrent targets were identified across multiple compounds, including cholinesterases, cannabinoid receptors, and myeloperoxidase, indicating privileged structural motifs for these target classes
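
Summaries of this kind can be derived from a compound-to-target mapping with simple counting. The sketch below uses invented compound and target names; the cutoff of fewer than 10 targets for "selective" follows the text above, while the function names and the minimum-compound parameter are assumptions.

```python
from collections import Counter


def targets_per_compound(compound_targets):
    """Number of distinct targets annotated for each compound."""
    return {cid: len(set(targets)) for cid, targets in compound_targets.items()}


def classify(n_targets, selective_below=10):
    """Label a compound by target count, using the cutoff stated in the text."""
    return "selective" if n_targets < selective_below else "promiscuous"


def recurrent_targets(compound_targets, min_compounds=2):
    """Targets hit by at least min_compounds distinct compounds, sorted by name."""
    counts = Counter(
        target
        for targets in compound_targets.values()
        for target in set(targets)
    )
    return sorted(t for t, n in counts.items() if n >= min_compounds)


# Made-up mapping; names do not refer to real compounds or targets.
compound_targets = {
    "cpd_1": ["T1", "T2"],
    "cpd_2": ["T1"],
    "cpd_3": [f"T{i}" for i in range(1, 13)],  # 12 distinct targets
}
counts = targets_per_compound(compound_targets)
labels = {cid: classify(n) for cid, n in counts.items()}
shared = recurrent_targets(compound_targets)
```

In this toy data, cpd_3 is flagged as promiscuous, and T1 and T2 emerge as recurrent targets.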

Chemical Diversity Analysis: Similarity matrices revealed substantial structural diversity despite the common reference scaffold, with most pairwise similarities below 0.5. Discrete high-similarity clusters identified structurally coherent subgroups suitable for focused analog series development.

Operational Impact:

  • Analysis time reduced from weeks to days, accelerating decision-making for preclinical studies
  • The fallback mechanisms recovered biological annotations for 40% of compounds that would otherwise have been excluded
  • Literature-based augmentation provided experimental context for 25+ under-annotated compounds
  • The modular, configurable architecture enabled rapid adaptation when the team needed to analyze additional compound series

The institute prioritized three structural analogs for immediate in vitro validation based on their predicted target profiles and chemical tractability. The reproducible workflow has since been applied to four additional lead compounds, establishing it as a core component of the institute's drug discovery infrastructure. The approach demonstrated that integrating chemical similarity with multi-source biological data can substantially accelerate early-stage drug repurposing, which is particularly valuable for organizations with limited resources.

By achieving these goals, the workflow provides a comprehensive, reproducible platform for compound repurposing studies and downstream pharmacological investigations.
