## The Problem
Bioinformatics pipelines are complex. A typical whole-genome sequencing analysis involves dozens of tools, specific parameter combinations, quality thresholds, and file format conversions. Even experienced bioinformaticians constantly reference documentation for:
- Which aligner to use (BWA-MEM2, Bowtie2, STAR, Minimap2?)
- Correct GATK command sequences for variant calling
- Quality thresholds for filtering (what’s a good Ti/Tv ratio?)
- nf-core pipeline configurations
- R/Python code for differential expression analysis
What if Claude Code could provide expert-level guidance for all of this, instantly?
## The Solution: Claude Code Skills
Claude Code supports skills: markdown files that inject domain expertise into the conversation. Unlike MCP servers, which require running infrastructure, skills are just prompts that load when invoked.
I built a comprehensive genomics skill set that covers the major NGS workflows:
| Skill | Command | Coverage |
|---|---|---|
| Main | `/genomics` | Overview, tool selection, routing |
| WGS/WES | `/genomics:wgs` | GATK, DeepVariant, Mutect2 |
| RNA-seq | `/genomics:rnaseq` | DESeq2, Seurat, Scanpy |
| ChIP/ATAC | `/genomics:chipseq` | MACS2, HOMER, TOBIAS |
| Annotation | `/genomics:annotation` | VEP, SnpEff, ANNOVAR |
| QC | `/genomics:qc` | FastQC, MultiQC, Picard |
| CNV/SV | `/genomics:cnv` | GATK CNV, CNVkit, Manta |
## Why Skills Over MCP?
I considered building an MCP server (like the AWS HealthOmics MCP), but skills made more sense for this use case:
| Aspect | Skill | MCP Server |
|---|---|---|
| Purpose | Guidance & code generation | Runtime execution |
| Infrastructure | None (just markdown) | Server process |
| Latency | Instant | Network overhead |
| Maintenance | Edit text files | Deploy/monitor service |
| Best for | “How do I…” questions | “Run this workflow” actions |
Skills excel at providing expertise - knowing which tool to use, correct parameters, quality thresholds, and best practices. MCP servers excel at execution - actually running pipelines on cloud infrastructure.
## Skill Structure
Each skill file is structured to maximize Claude’s effectiveness:
```markdown
# WGS/WES Variant Calling Pipeline Skill

You are an expert in whole genome sequencing...

## Workflow Overview
FASTQ → QC → Alignment → Post-processing → Variant Calling → Filtering

## Standard Pipeline

### 1. Quality Control
[FastQC commands with parameters]

### 2. Alignment
[BWA-MEM2 commands with read groups]

### 3. Post-Alignment Processing
[GATK MarkDuplicates, BQSR commands]

## Quality Metrics to Check
| Metric | Good Value | Concern |
|--------|------------|---------|
| Mapping rate | >95% | <90% |
...

## Common Issues & Solutions
1. Low mapping rate: Check read quality, contamination...
```
Key elements:
- Role definition - “You are an expert in…”
- Workflow overview - Visual pipeline structure
- Complete commands - Copy-paste ready with realistic parameters
- Quality thresholds - Concrete numbers, not vague guidance
- Troubleshooting - Common issues and solutions
## What’s Included
### WGS/WES Pipeline (`/genomics:wgs`)
Full GATK best practices workflow:
```bash
# Alignment with read groups
bwa-mem2 mem -t 16 \
  -R "@RG\tID:${SAMPLE}\tSM:${SAMPLE}\tPL:ILLUMINA\tLB:lib1" \
  ${REFERENCE} ${R1} ${R2} | \
  samtools sort -@ 8 -m 2G -o ${SAMPLE}.sorted.bam -

# BQSR
gatk BaseRecalibrator \
  -R ${REFERENCE} \
  -I ${SAMPLE}.dedup.bam \
  --known-sites ${DBSNP} \
  --known-sites ${MILLS_INDELS} \
  -O ${SAMPLE}.recal_data.table

# Variant calling
gatk HaplotypeCaller \
  -R ${REFERENCE} \
  -I ${SAMPLE}.bqsr.bam \
  -O ${SAMPLE}.g.vcf.gz \
  -ERC GVCF
```
Plus DeepVariant, Mutect2 for somatic, hard filtering vs VQSR guidance, and nf-core/sarek integration.
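The read-group string in the alignment step is easy to get wrong and must stay consistent across a cohort. Here is a minimal sketch of a helper that assembles it; the function name and defaults are mine, not part of the skill set:

```python
# Hypothetical helper for the @RG string passed to `bwa-mem2 mem -R` above.
# Emits literal "\t" sequences, which bwa-mem2 expands to tabs; the ID/SM/PL/LB
# fields mirror the pipeline's conventions.

def read_group(sample: str, platform: str = "ILLUMINA", library: str = "lib1") -> str:
    """Build a read-group string for `bwa-mem2 mem -R`."""
    fields = {"ID": sample, "SM": sample, "PL": platform, "LB": library}
    return "@RG\\t" + "\\t".join(f"{key}:{value}" for key, value in fields.items())

# read_group("NA12878") -> @RG\tID:NA12878\tSM:NA12878\tPL:ILLUMINA\tLB:lib1
```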
### RNA-seq Analysis (`/genomics:rnaseq`)
Covers both traditional alignment and pseudo-alignment approaches:
```bash
# Salmon quantification
salmon quant -i salmon_index \
  -l A \
  -1 sample_R1.fq.gz \
  -2 sample_R2.fq.gz \
  -p 8 \
  --validateMappings \
  -o salmon_quant/sample
```
Complete DESeq2 workflow in R:
```r
library(DESeq2)
library(tximport)

# Import Salmon counts
txi <- tximport(files, type = "salmon", tx2gene = tx2gene)

# Run DESeq2
dds <- DESeqDataSetFromTximport(txi, colData = sample_info, design = ~ condition)
dds <- DESeq(dds)
res <- lfcShrink(dds, coef = "condition_treatment_vs_control", type = "apeglm")
```
Also includes single-cell analysis with Seurat and Scanpy, pathway analysis with clusterProfiler, and visualization code.
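Once `res` is exported (e.g. with `write.csv`), downstream filtering often happens outside R. A sketch in Python, assuming the table was parsed into dicts with `gene`, `padj`, and `log2FoldChange` keys; the thresholds are common defaults, not mandates:

```python
# Hypothetical post-processing of an exported DESeq2 results table.
# Genes removed by independent filtering have padj reported as NA.

def significant_genes(rows, padj_cutoff=0.05, lfc_cutoff=1.0):
    """Return gene names passing adjusted-p and fold-change cutoffs."""
    hits = []
    for row in rows:
        padj, lfc = row.get("padj"), row.get("log2FoldChange")
        if padj in (None, "", "NA") or lfc in (None, "", "NA"):
            continue  # skip genes dropped by independent filtering
        if float(padj) < padj_cutoff and abs(float(lfc)) > lfc_cutoff:
            hits.append(row["gene"])
    return hits
```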
### CNV/SV Analysis (`/genomics:cnv`)
The newest addition - copy number and structural variant detection:
```bash
# GATK Somatic CNV
gatk DenoiseReadCounts \
  -I tumor.counts.hdf5 \
  --count-panel-of-normals cnv_pon.hdf5 \
  --standardized-copy-ratios tumor.standardizedCR.tsv \
  --denoised-copy-ratios tumor.denoisedCR.tsv

gatk ModelSegments \
  --denoised-copy-ratios tumor.denoisedCR.tsv \
  --allelic-counts tumor.allelicCounts.tsv \
  --output-prefix tumor \
  -O segments_output/
```
Covers CNVkit for WES, Manta/DELLY/GRIDSS for structural variants, SURVIVOR for merging calls, and AnnotSV for annotation.
### Quality Control (`/genomics:qc`)
Comprehensive QC at every stage:
| Stage | Tools | Key Metrics |
|---|---|---|
| Raw reads | FastQC, fastp | Q30%, adapter content |
| Aligned | Picard, mosdepth | Mapping rate, coverage |
| Variants | bcftools stats | Ti/Tv ratio, het/hom |
| Samples | VerifyBamID, Somalier | Contamination, relatedness |
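The Ti/Tv ratio in the variants row is a quick sanity check: germline whole-genome call sets typically land around 2.0-2.1 (higher in exonic regions), and a much lower value suggests false positives. A sketch of the computation over biallelic SNVs; the `(ref, alt)` tuple input is an illustrative format, not a fixed API:

```python
# Transitions are purine<->purine or pyrimidine<->pyrimidine substitutions;
# everything else among SNVs is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def titv_ratio(snvs):
    """Compute Ti/Tv from an iterable of (ref, alt) pairs for biallelic SNVs."""
    snvs = list(snvs)
    ti = sum(1 for pair in snvs if pair in TRANSITIONS)
    tv = len(snvs) - ti
    return ti / tv if tv else float("inf")
```

In practice this number comes straight out of `bcftools stats`; the point of the sketch is only to make the metric concrete.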
### Annotation (`/genomics:annotation`)
VEP with all the plugins you actually need:
```bash
vep -i input.vcf.gz \
  --cache --offline --assembly GRCh38 \
  --vcf \
  --plugin CADD,whole_genome_SNVs.tsv.gz \
  --plugin SpliceAI,snv=spliceai_scores.masked.snv.hg38.vcf.gz \
  --plugin AlphaMissense,file=AlphaMissense_hg38.tsv.gz \
  --plugin REVEL,file=revel_scores.tsv.gz \
  --custom clinvar.vcf.gz,ClinVar,vcf,exact,0,CLNSIG,CLNDN \
  -o annotated.vcf
```

(ClinVar is attached via `--custom` rather than a plugin, and `--vcf` keeps the output in VCF format.)
Plus filtering strategies for rare disease and cancer, ACMG classification guidance, and database references.
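As one illustration of a rare-disease filtering strategy, the sketch below keeps rare, predicted-deleterious variants with severe consequences. The field names (`gnomAD_AF`, `CADD_PHRED`, `Consequence`) and thresholds are assumptions about how the VEP output was parsed, not fixed rules:

```python
# Consequence terms VEP reports that we treat as "severe" for this sketch.
SEVERE = {"stop_gained", "frameshift_variant", "splice_acceptor_variant",
          "splice_donor_variant", "missense_variant"}

def _num(value, default=0.0):
    """Parse a numeric annotation field, tolerating missing/'NA' values."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return default

def candidate_variant(rec, max_af=0.001, min_cadd=20.0):
    """rec: dict of annotation fields for one variant (field names assumed)."""
    af = _num(rec.get("gnomAD_AF"))        # absent population AF -> treat as novel (0)
    cadd = _num(rec.get("CADD_PHRED"))
    consequences = set((rec.get("Consequence") or "").split("&"))
    return af <= max_af and cadd >= min_cadd and bool(consequences & SEVERE)
```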
## Installation
Clone and use directly:
```bash
git clone https://github.com/maratgaliev/claude-skill-genomic-pipelines.git
cd claude-skill-genomic-pipelines
# Claude Code will auto-detect skills
```
Or copy to existing project:
```bash
cp -r claude-skill-genomic-pipelines/.claude /path/to/your/project/
```
## Usage Examples
Designing a WGS pipeline:

```
/genomics:wgs
I need to set up a germline variant calling pipeline for 50 WGS samples.
We're using GRCh38 and want to use DeepVariant. What's the recommended workflow?
```

Troubleshooting RNA-seq:

```
/genomics:rnaseq
My RNA-seq analysis shows very few differentially expressed genes (only 12 with padj < 0.05).
I have 3 replicates per condition. What could be wrong?
```

CNV analysis:

```
/genomics:cnv
I have tumor/normal WES data and need to detect copy number alterations.
Should I use GATK CNV or CNVkit? What are the tradeoffs?
```
## Project Structure
```
.claude/
├── settings.json              # Skill registration
└── skills/
    ├── genomics.md            # Main routing skill
    ├── genomics-wgs.md        # 400+ lines of WGS/WES guidance
    ├── genomics-rnaseq.md     # Bulk + scRNA-seq
    ├── genomics-chipseq.md    # Epigenomics
    ├── genomics-annotation.md
    ├── genomics-qc.md
    └── genomics-cnv.md        # CNV/SV analysis
```
## Extending the Skills
To add new capabilities:

1. Create a new skill file in `.claude/skills/`
2. Register it in `.claude/settings.json`:

```json
{
  "skills": {
    "genomics:metagenomics": {
      "path": "skills/genomics-metagenomics.md",
      "description": "Metagenome analysis (Kraken2, MetaPhlAn, assembly)",
      "invocable": true
    }
  }
}
```
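A quick sanity check for the registration step, assuming the `settings.json` layout shown above: verify that every registered skill path actually points at a file. The helper name and the layout assumption are mine:

```python
import json
from pathlib import Path

def missing_skill_files(settings_path=".claude/settings.json"):
    """Return names of registered skills whose markdown file is missing."""
    settings = Path(settings_path)
    root = settings.parent  # skill paths are relative to .claude/
    skills = json.loads(settings.read_text()).get("skills", {})
    return [name for name, meta in skills.items()
            if not (root / meta["path"]).is_file()]
```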
Potential additions:
- Metagenomics (Kraken2, MetaPhlAn, MAG assembly)
- Long-read analysis (ONT, PacBio specific workflows)
- Spatial transcriptomics
- Multi-omics integration
- Clinical reporting templates
## Conclusion
Skills provide a lightweight way to inject domain expertise into Claude Code. For bioinformatics, this means:
- Instant access to best practices and correct parameters
- Complete commands ready to copy and adapt
- Quality thresholds based on community standards
- Troubleshooting guidance for common issues
The genomics skill set covers the major NGS workflows and can be extended for specific needs. No infrastructure required - just markdown files that make Claude an expert bioinformatician.
Repository: github.com/maratgaliev/claude-skill-genomic-pipelines
## Resources
- Claude Code Documentation
- nf-core Pipelines - Production-ready Nextflow pipelines
- GATK Best Practices
- Biostars - Bioinformatics Q&A