Pfam Domain Variation: Interesting Genes

Same-length haplotypes with high domain architecture diversity (filtered from 18,500 genes in JoGo)

Help — Pfam Domain Variation Viewer

Overview

This page lists 60 genes from the JoGo haplotype-resolved proteome whose Pfam protein domain architectures vary across same-length a-level haplotypes. In these cases amino acid substitutions, rather than indels, change which domains are recognized, which may indicate functional divergence among haplotypes at the protein domain level.

Genes are included if they have ≥5 distinct domain architectures and ≥30 domain differences. They are drawn from the 18,500 genes with Pfam hits in the database.

Data Source

  • Haplotype sequences: JoGo a-level haplotypes (174,376 sequences from 19,193 gene regions)
  • Domain annotation: Pfam-A (27,481 HMM families) via HMMER hmmscan (E-value ≤ 1e-5)
  • Gene annotations: MANE Select v1.2 (GRCh38), 19,316 genes
  • Database: jogo_pfam.db — 1,491,403 domain hits across 18,500 genes, 178,088 domain differences

Selection Criteria

Only comparisons between haplotypes of identical protein length are included (ref aalen = alt aalen). This isolates domain variation caused by amino acid substitutions rather than insertions or deletions, which makes the differences easier to interpret biologically. A gene must have ≥5 distinct domain architectures and ≥30 total differences to appear on this page.
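The selection rule can be sketched in a few lines of Python. This is an illustration only, not the pipeline's actual code: the record keys (gene, ref_aalen, alt_aalen, ref_arch, alt_arch, n_diffs) are hypothetical, and counting the reference architecture among the distinct architectures is an assumption.

```python
from collections import defaultdict

MIN_ARCHITECTURES = 5  # threshold stated on this page
MIN_TOTAL_DIFFS = 30   # threshold stated on this page


def select_genes(comparisons):
    """Return the genes that pass the same-length and threshold criteria.

    `comparisons` is an iterable of dicts with hypothetical keys:
    gene, ref_aalen, alt_aalen, ref_arch, alt_arch, n_diffs.
    """
    architectures = defaultdict(set)
    total_diffs = defaultdict(int)
    for c in comparisons:
        if c["ref_aalen"] != c["alt_aalen"]:
            continue  # skip comparisons where the protein length changed
        # Assumption: the reference architecture counts as one architecture.
        architectures[c["gene"]].update((c["ref_arch"], c["alt_arch"]))
        total_diffs[c["gene"]] += c["n_diffs"]
    return sorted(
        gene
        for gene in architectures
        if len(architectures[gene]) >= MIN_ARCHITECTURES
        and total_diffs[gene] >= MIN_TOTAL_DIFFS
    )
```

A comparison whose alternate haplotype has a different length contributes neither an architecture nor any differences, so a gene can fall below the thresholds once such haplotypes are excluded.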

Difference Types

Type            Badge Color  Description
boundary_shift  Orange       Same domain present but alignment coordinates shifted; substitutions alter domain boundary recognition
domain_lost     Red          Domain present in reference (a0001) is entirely absent in the alternate haplotype
domain_gained   Green        Domain absent in reference but present in the alternate haplotype
copy_lost       Dark Red     Fewer copies of a repeated domain (e.g., immunoglobulin repeats)
copy_gained     Blue         Additional copies of a repeated domain gained
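One way to read the five difference types is as a per-domain comparison of copy counts and alignment coordinates between the reference and an alternate haplotype. The sketch below follows that reading. It is an assumption about how the labels could be assigned, not the viewer's actual algorithm, and it emits at most one label per domain family, whereas the database may count differences per hit.

```python
from collections import defaultdict


def classify_differences(ref_hits, alt_hits):
    """Label each domain family that differs between two haplotypes.

    Each hit is a tuple (domain_name, ali_start, ali_end).
    Returns a dict {domain_name: diff_type}; unchanged domains are omitted.
    """
    def group(hits):
        by_name = defaultdict(list)
        for name, start, end in hits:
            by_name[name].append((start, end))
        return {name: sorted(coords) for name, coords in by_name.items()}

    ref, alt = group(ref_hits), group(alt_hits)
    diffs = {}
    for name in sorted(set(ref) | set(alt)):
        r, a = ref.get(name, []), alt.get(name, [])
        if not a:
            diffs[name] = "domain_lost"      # present in reference only
        elif not r:
            diffs[name] = "domain_gained"    # present in alternate only
        elif len(a) < len(r):
            diffs[name] = "copy_lost"        # fewer copies of a repeat
        elif len(a) > len(r):
            diffs[name] = "copy_gained"      # more copies of a repeat
        elif a != r:
            diffs[name] = "boundary_shift"   # same copies, moved coordinates
    return diffs
```

For instance, a reference with two I-set hits and an alternate with one yields copy_lost for I-set, while a single fn3 hit whose start moves from 200 to 202 yields boundary_shift. The coordinates here are made up for illustration.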

Table Columns

Column          Description
Interest        Star rating (1–5) based on the interest score formula below
Gene            Gene symbol. Click to open the per-gene Pfam domain viewer page
Region          Genomic region name (GENE_chrN_start_end). Click to open JoGo browser
Prot Len        Protein length (aa) of reference haplotype (a0001)
Haplotypes      Number of a-level haplotypes with Pfam domain hits
Architectures   Number of distinct domain architectures across haplotypes
Total Diffs     Sum of all domain differences vs reference haplotype (a0001) among same-length haplotypes
Diff Breakdown  Stacked bar showing proportion of each diff type (orange=shift, red=lost, green=gained, dark red=copy_lost, blue=copy_gained)
Diff Types      Number of distinct diff types present (out of 5)
Counts          Color-coded badges showing count per diff type

Interest Score

Genes are ranked by a composite interest score:

score = n_architectures × 3.0 + n_total_diffs × 0.01 + n_diff_types × 5.0 + domain_lost_bonus (8.0 if any) + domain_gained_bonus (8.0 if any)

Higher scores are meant to flag more biologically interesting domain variation. Genes with many distinct architectures, complete domain loss or gain events, and diverse diff types rank highest; because of its 0.01 weight, the raw number of differences contributes comparatively little. TTN (titin, 72 architectures) and FBN3 (fibrillin-3, 51 architectures) top the list.
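The formula translates directly into a small function. The sketch below uses the weights given above; the function name and the shape of the diff_counts argument (a mapping from diff type to count) are assumptions made for illustration.

```python
def interest_score(n_architectures, diff_counts):
    """Composite interest score using the weights stated on this page.

    `diff_counts` maps a diff type (e.g. "boundary_shift") to its count.
    """
    n_total_diffs = sum(diff_counts.values())
    n_diff_types = sum(1 for n in diff_counts.values() if n > 0)
    score = n_architectures * 3.0 + n_total_diffs * 0.01 + n_diff_types * 5.0
    if diff_counts.get("domain_lost", 0) > 0:
        score += 8.0  # bonus for at least one complete domain loss
    if diff_counts.get("domain_gained", 0) > 0:
        score += 8.0  # bonus for at least one complete domain gain
    return score
```

As a worked example with made-up numbers, a gene with 6 architectures, 40 boundary shifts, and 2 domain losses scores 6 × 3.0 + 42 × 0.01 + 2 × 5.0 + 8.0 = 36.42.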

Filters

Filter                         Description
Search                         Filter by gene name (case-insensitive substring match)
Min Architectures              Only show genes with at least N distinct domain architectures
Min Diffs                      Only show genes with at least N total domain differences
Diff Types: All 5 types        Only show genes exhibiting all 5 difference types
Diff Types: Has domain_lost    At least one domain is completely lost in some haplotype
Diff Types: Has domain_gained  At least one domain is gained in some haplotype
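The filters combine with a logical AND: a gene is shown only if it passes every active filter. The sketch below mirrors that behaviour on a single table row. The row layout (gene, n_architectures, counts) and the diff_filter values are hypothetical, and the page itself presumably applies the filters in the browser rather than in Python.

```python
DIFF_TYPES = (
    "boundary_shift", "domain_lost", "domain_gained", "copy_lost", "copy_gained",
)


def matches_filters(row, search="", min_architectures=0, min_diffs=0,
                    diff_filter=None):
    """Return True if a table row passes all active filters.

    `row` is a dict with hypothetical keys: gene, n_architectures, counts.
    `diff_filter` is None, "all5", "domain_lost", or "domain_gained".
    """
    counts = row["counts"]
    present = {t for t in DIFF_TYPES if counts.get(t, 0) > 0}
    if search.lower() not in row["gene"].lower():
        return False  # case-insensitive substring match on the gene name
    if row["n_architectures"] < min_architectures:
        return False
    if sum(counts.values()) < min_diffs:
        return False
    if diff_filter == "all5" and len(present) < len(DIFF_TYPES):
        return False
    if diff_filter in ("domain_lost", "domain_gained") and diff_filter not in present:
        return False
    return True
```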

Per-Gene Viewer

Clicking a gene name opens the detailed domain viewer page ({REGION}_pfam_viewer.html), which shows:

  • Gene Information — NCBI/Ensembl/HGNC identifiers, RefSeq protein, protein length, reference architecture string
  • Domain Architecture Diagram — Interactive SVG with colored domain blocks per haplotype, diff markers (✗=lost, +=gained, ◄=shifted), and hover tooltips
  • Domain Differences Table — Sortable table of all differences vs reference with color-coded badges
  • Haplotype Domain Details — Collapsible per-haplotype sections with full domain lists (accession, name, description, type, clan, coordinates, scores)
  • Summary Statistics — Stat cards and domain frequency table across haplotypes
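Viewer page names follow the {REGION}_pfam_viewer.html pattern, where the region name has the form GENE_chrN_start_end. A minimal sketch of splitting a region name and deriving the page name is shown below. It assumes that start and end are integers and that only the gene symbol may itself contain underscores; the region used in the usage note is a made-up placeholder.

```python
def viewer_page(region):
    """Split a region name (GENE_chrN_start_end) and build its viewer page name."""
    # Split from the right so an underscore inside the gene symbol is preserved.
    gene, chrom, start, end = region.rsplit("_", 3)
    if not chrom.startswith("chr"):
        raise ValueError(f"unexpected region name: {region!r}")
    return {
        "gene": gene,
        "chrom": chrom,
        "start": int(start),
        "end": int(end),
        "page": f"{region}_pfam_viewer.html",
    }
```

Calling viewer_page("GENE1_chr1_1000_2000") gives the page name GENE1_chr1_1000_2000_pfam_viewer.html.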

Methods

The pipeline consists of the following steps:

  • Download Pfam-A HMM database (27,481 families) from EBI FTP
  • Extract amino acid sequences from the JoGo haplotype TSV (173,858 sequences after removing zero-length sequences)
  • Split into per-chromosome FASTA files and run hmmscan in parallel (8 jobs × 4 CPUs, E-value ≤ 1e-5)
  • Parse domtblout results and merge with Pfam metadata (name, description, type, clan) and MANE gene annotations
  • Compute per-gene domain architecture summaries and cross-haplotype domain differences
  • Build SQLite database (jogo_pfam.db) with 6 tables and indexes
  • Generate 18,515 per-gene HTML viewer pages and this summary page
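The domtblout parsing step can be sketched as follows. The column positions follow the standard HMMER per-domain table layout (22 whitespace-separated fields followed by a free-text description). Two points are assumptions rather than facts from this page: that the query name holds the haplotype identifier, and that the E-value ≤ 1e-5 cutoff is applied to the independent domain E-value (i-Evalue) rather than the full-sequence E-value.

```python
def parse_domtblout(lines, max_evalue=1e-5):
    """Parse hmmscan --domtblout lines into a list of domain-hit dicts.

    Assumption: hits are kept when the independent E-value (column 13)
    is at or below `max_evalue`.
    """
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/footer comments
        # 22 fixed fields, then the description, which may contain spaces.
        f = line.split(maxsplit=22)
        i_evalue = float(f[12])
        if i_evalue > max_evalue:
            continue
        hits.append({
            "domain": f[0],        # target (Pfam family) name
            "accession": f[1],     # Pfam accession
            "haplotype": f[3],     # query name (assumed haplotype ID)
            "i_evalue": i_evalue,
            "score": float(f[13]),  # per-domain bit score
            "ali_from": int(f[17]),
            "ali_to": int(f[18]),
            "description": f[22].strip() if len(f) > 22 else "",
        })
    return hits
```

In practice the function would be called on an open file, for example parse_domtblout(open(path)) with path pointing at one per-chromosome domtblout file, and the resulting hits would then be merged with the Pfam metadata and MANE annotations.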

Reference

Nagasaki M, et al. JoGo 1.0: the ACTG hierarchical nomenclature and database covering 4.7 million haplotypes across 19,194 human genes. Nucleic Acids Research, 2026. doi:10.1093/nar/gkaf1232
