Cell Type Annotation in Biomedical Research: Expert-Grade Transcriptional Labeling Using Large Language Models

Cell type annotation is a foundational step in single-cell biology and spatial transcriptomics, where researchers infer which biological cell populations are present from molecular measurements. The core problem is that raw assays, such as scRNA-seq (single-cell RNA sequencing), scATAC-seq, or spatial RNA profiling, produce high-dimensional feature matrices (gene expression counts or, for scATAC-seq, chromatin accessibility) rather than explicit labels. “Cell type annotation” therefore refers to assigning each cell (or spatial spot) to a biological identity, using marker gene patterns, reference atlases, and curated ontologies.

Accurate annotation is essential because downstream analyses depend on it: differential expression, lineage inference, cell–cell interaction modeling, trajectory reconstruction, and biomarker discovery all assume that labels are biologically meaningful. Misannotation can propagate errors, leading to incorrect mechanistic conclusions—for example, conflating activated immune subtypes with tumor-associated states or incorrectly labeling doublets (two cells captured together) as rare populations.

Classically, annotation pipelines rely on marker genes: experts identify canonical markers for known cell types (e.g., CD3D/CD3E for T cells, MS4A1 for B cells, LST1 for myeloid lineages). However, marker-based methods face limitations. Gene expression varies across donors, tissues, and stimulation conditions, and it is further distorted by batch effects and technical noise. Some cell types exist along continua rather than discrete categories, and many subtypes share overlapping marker sets. Additionally, transcriptional “activation” programs can temporarily mimic unrelated identities.
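As a rough illustration of marker-based labeling and of how overlapping marker sets create ambiguity, the following minimal sketch scores a cluster against the marker genes named above. The function names, the `min_margin` threshold, and the expression values are assumptions made for illustration; real marker panels are larger and scoring is usually done on normalized data.

```python
from statistics import mean

# Marker sets taken from the examples in the text; real panels are larger.
MARKERS = {
    "T cell": ["CD3D", "CD3E"],
    "B cell": ["MS4A1"],
    "Myeloid": ["LST1"],
}

def score_cell_types(expr, markers=MARKERS):
    """Return {cell_type: mean expression of its markers}; missing genes count as 0."""
    return {
        cell_type: mean(expr.get(gene, 0.0) for gene in genes)
        for cell_type, genes in markers.items()
    }

def annotate(expr, markers=MARKERS, min_margin=0.5):
    """Pick the top-scoring type, or 'ambiguous' if the runner-up is too close.

    min_margin is a hypothetical threshold, not a published default.
    """
    ranked = sorted(score_cell_types(expr, markers).items(),
                    key=lambda item: item[1], reverse=True)
    (best, top_score), (_, second_score) = ranked[0], ranked[1]
    return best if top_score - second_score >= min_margin else "ambiguous"

# Made-up cluster-level mean expression values (log scale).
cluster_a = {"CD3D": 2.1, "CD3E": 1.8, "MS4A1": 0.1, "LST1": 0.2}
cluster_b = {"CD3D": 1.0, "CD3E": 1.0, "MS4A1": 0.9, "LST1": 0.1}

print(annotate(cluster_a))  # T cell
print(annotate(cluster_b))  # ambiguous
```

The second cluster shows the failure mode described above: T cell and B cell markers are both present at similar levels (possibly a doublet), so the sketch declines to commit to a label.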

To improve robustness, reference mapping methods compare query cells to annotated reference atlases using statistical alignment. Examples include correlation-based approaches, nearest-neighbor transfer, and supervised classifiers trained on curated datasets. These methods still require careful handling of batch effects, selection of appropriate reference datasets, and calibration of confidence scores. Many researchers also incorporate ontology-aware labeling, ensuring that predictions respect hierarchical relationships (e.g., immune → myeloid → macrophage).
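The nearest-neighbor transfer and ontology-aware labeling described above can be sketched as follows. This is a toy version under stated assumptions: the two-dimensional embedding, the reference points, the `PARENT` table, and the 0.7 confidence threshold are all invented for illustration, and production tools use far larger references and more careful calibration.

```python
import math
from collections import Counter

# Hypothetical fragment of a label hierarchy (child -> parent).
PARENT = {
    "macrophage": "myeloid",
    "monocyte": "myeloid",
    "myeloid": "immune",
    "T cell": "immune",
}

# Made-up reference cells: (embedding coordinates, curated label).
REFERENCE = [
    ([0.0, 0.0], "macrophage"),
    ([0.0, 1.0], "macrophage"),
    ([1.0, 0.0], "monocyte"),
    ([5.0, 5.0], "T cell"),
    ([5.0, 6.0], "T cell"),
    ([6.0, 5.0], "T cell"),
]

def knn_transfer(query, reference, k=3):
    """Majority vote over the k nearest reference cells (Euclidean distance).

    Returns (label, confidence), where confidence is the winning vote share.
    """
    neighbors = sorted(reference, key=lambda ref: math.dist(query, ref[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    label, count = votes.most_common(1)[0]
    return label, count / k

def back_off(label, confidence, threshold=0.7):
    """Report the parent term when the vote is not decisive enough."""
    return label if confidence >= threshold else PARENT.get(label, label)

label, confidence = knn_transfer([0.1, 0.4], REFERENCE)
print(label, round(confidence, 2), "->", back_off(label, confidence))
# macrophage 0.67 -> myeloid
```

Backing off to the parent term is one simple way to respect the hierarchy: a query that sits between macrophages and monocytes is still confidently myeloid, even if the finer label is uncertain.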

Recently, large language models (LLMs) have emerged as a new route for formalizing expert knowledge during annotation. In this paradigm, the model learns to interpret structured descriptions of cell types, including marker gene rationales, gating rules, and known confounders (such as doublet signatures or stress-response genes). When integrated into annotation workflows, an LLM can complement quantitative methods by transforming semi-structured expert annotations into a consistent labeling schema.
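One way to picture such structured descriptions is as small records that are rendered into a prompt. The sketch below is an assumption about how this could look, not a description of any particular system: `CellTypeSpec`, `render_prompt`, and the wording of the prompt are hypothetical, and no model is actually called.

```python
from dataclasses import dataclass, field

@dataclass
class CellTypeSpec:
    """Hypothetical record of expert knowledge about one cell type."""
    name: str
    marker_rationale: dict          # gene symbol -> why the marker matters
    gating_rule: str                # plain-language decision rule
    confounders: list = field(default_factory=list)

def render_prompt(specs, top_genes):
    """Turn expert specs plus a cluster's top genes into one prompt string."""
    lines = ["Candidate cell types:"]
    for spec in specs:
        lines.append(f"- {spec.name}")
        for gene, why in spec.marker_rationale.items():
            lines.append(f"    marker {gene}: {why}")
        lines.append(f"    gating rule: {spec.gating_rule}")
        for confounder in spec.confounders:
            lines.append(f"    confounder: {confounder}")
    lines.append("Top expressed genes in cluster: " + ", ".join(top_genes))
    lines.append("Choose exactly one candidate and cite the markers you relied on.")
    return "\n".join(lines)

t_cell = CellTypeSpec(
    name="T cell",
    marker_rationale={"CD3D": "component of the T cell receptor complex"},
    gating_rule="CD3D or CD3E detected, MS4A1 absent",
    confounders=["T/B doublets co-express CD3D and MS4A1"],
)

print(render_prompt([t_cell], ["CD3D", "CD3E", "IL7R"]))
```

Keeping the expert knowledge in records rather than free text means the same specs can be versioned, reviewed, and reused across datasets.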

A key concept is “expert formalization”: translating the tacit decision-making of domain scientists—how they resolve ambiguous marker patterns—into explicit rules or structured prompts. For instance, an expert may distinguish two macrophage-like states by specific interferon-stimulated gene modules, or by gene programs indicating microglial identity. An LLM can be guided to produce structured outputs such as (1) predicted cell type, (2) evidence markers, (3) rejection criteria, and (4) uncertainty estimates.
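The four structured outputs listed above could be captured in a small schema and validated before use. The following is a minimal sketch: the field names, the JSON layout, and the example response are assumptions, and the response string stands in for whatever a model would actually return.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    cell_type: str            # (1) predicted cell type
    evidence_markers: list    # (2) markers supporting the call
    rejection_criteria: list  # (3) observations that would overturn it
    uncertainty: float        # (4) 0.0 = certain, 1.0 = no information

def parse_annotation(raw_json):
    """Parse and validate a model response; raise on malformed output."""
    data = json.loads(raw_json)
    annotation = Annotation(
        cell_type=str(data["cell_type"]),
        evidence_markers=list(data["evidence_markers"]),
        rejection_criteria=list(data["rejection_criteria"]),
        uncertainty=float(data["uncertainty"]),
    )
    if not 0.0 <= annotation.uncertainty <= 1.0:
        raise ValueError("uncertainty must lie in [0, 1]")
    if not annotation.evidence_markers:
        raise ValueError("a label without evidence markers is rejected")
    return annotation

# Illustrative response text, not real model output.
RAW = """{
  "cell_type": "microglial cell",
  "evidence_markers": ["P2RY12", "TMEM119"],
  "rejection_criteria": ["co-expression of CD3D suggests a doublet"],
  "uncertainty": 0.2
}"""

annotation = parse_annotation(RAW)
print(annotation.cell_type, annotation.evidence_markers)
```

Rejecting responses that lack evidence markers or report an out-of-range uncertainty keeps malformed output from silently entering downstream analysis.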

This approach can improve interpretability and reproducibility. Instead of returning only a single label, the system can provide evidence-driven justifications that align with biological knowledge. In medical and translational research, interpretability matters: investigators must know whether a predicted subtype is supported by strong marker evidence or whether the model is extrapolating beyond reference data.

Methodologically, high-performance annotation systems typically use a hybrid architecture. Quantitative models handle the numeric embedding space derived from gene expression, while LLM components handle narrative reasoning, ontology mapping, and structured evidence formatting. Validation is crucial: accuracy should be assessed against independently curated labels, with attention to known failure modes. These include batch-specific artifacts, rare-cell scarcity in reference sets, and overconfidence in out-of-distribution scenarios.
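To make the validation point concrete, the sketch below compares predictions with curated labels and reports recall per cell type alongside overall accuracy. The labels and counts are invented for illustration; the point is only that a single accuracy figure can hide poor performance on a rare population.

```python
from collections import defaultdict

def accuracy(true_labels, predicted_labels):
    """Fraction of cells whose predicted label matches the curated label."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)

def per_class_recall(true_labels, predicted_labels):
    """Recall per curated cell type: correct calls / cells of that type."""
    totals, hits = defaultdict(int), defaultdict(int)
    for truth, predicted in zip(true_labels, predicted_labels):
        totals[truth] += 1
        hits[truth] += int(truth == predicted)
    return {cell_type: hits[cell_type] / totals[cell_type] for cell_type in totals}

# Made-up evaluation set: 8 T cells and 2 cells of a rare type.
curated = ["T cell"] * 8 + ["pDC"] * 2
predicted = ["T cell"] * 8 + ["T cell", "pDC"]

print(accuracy(curated, predicted))          # 0.9
print(per_class_recall(curated, predicted))  # {'T cell': 1.0, 'pDC': 0.5}
```

Here overall accuracy is 0.9, yet half of the rare cells are mislabeled, which is the rare-cell scarcity failure mode in miniature.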

Quality control is also integral. Before annotation, standard preprocessing steps include doublet detection (typically run on raw counts), normalization, log transformation, highly variable gene selection, and batch correction. Post-annotation, researchers often check for consistency: predicted cell types should show expected marker enrichment, stable proportions across technical replicates (when appropriate), and coherent clustering structure.
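Two of these steps are simple enough to sketch directly: library-size normalization followed by a log transform, and a post-annotation check that a marker is enriched in the group that carries its label. The scale factor of 10,000 is a common convention rather than a requirement, and the function names, the `min_diff` threshold, and the counts are assumptions made for illustration.

```python
import math
from statistics import mean

def normalize_log1p(counts, target_sum=1e4):
    """Scale one cell's counts to target_sum total, then apply log(1 + x)."""
    total = sum(counts.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: math.log1p(c / total * target_sum) for gene, c in counts.items()}

def marker_enriched(cells, labels, cell_type, gene, min_diff=1.0):
    """True if mean log-expression of gene inside cell_type exceeds the
    mean outside it by at least min_diff (a hypothetical threshold)."""
    inside = [c.get(gene, 0.0) for c, l in zip(cells, labels) if l == cell_type]
    outside = [c.get(gene, 0.0) for c, l in zip(cells, labels) if l != cell_type]
    if not inside or not outside:
        raise ValueError("need at least one cell inside and outside the group")
    return mean(inside) - mean(outside) >= min_diff

# Made-up raw counts for two cells and their predicted labels.
raw = [
    {"CD3D": 9, "MS4A1": 0, "ACTB": 91},
    {"CD3D": 0, "MS4A1": 12, "ACTB": 88},
]
labels = ["T cell", "B cell"]
cells = [normalize_log1p(c) for c in raw]

print(marker_enriched(cells, labels, "T cell", "CD3D"))   # True
print(marker_enriched(cells, labels, "T cell", "MS4A1"))  # False
```

A label whose expected markers fail this kind of check is a candidate for re-examination rather than automatic acceptance.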

Despite this promise, LLM-based annotation carries risks. If trained or prompted poorly, the model may hallucinate marker associations that are not empirically supported, or it may conflate correlation with causation. Therefore, medical-grade annotation requires strict guardrails: constrained vocabularies drawn from a curated resource such as the Cell Ontology (CL), evidence grounding to observed marker expression, and calibration using held-out datasets.
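The first two guardrails can be sketched as a check applied to every model response before it is accepted. The vocabulary below is a tiny illustrative subset rather than a real ontology export, and the function name and the `min_expr` threshold are assumptions.

```python
# Illustrative subset; in practice this set would be loaded from a cell ontology.
ALLOWED_LABELS = {"T cell", "B cell", "macrophage", "microglial cell"}

def check_guardrails(label, cited_markers, observed_expr, min_expr=1.0):
    """Return a list of problems; an empty list means the call passes.

    observed_expr maps gene symbol -> measured expression in the cluster.
    min_expr is a hypothetical detection threshold.
    """
    problems = []
    if label not in ALLOWED_LABELS:
        problems.append(f"label not in vocabulary: {label}")
    for gene in cited_markers:
        if observed_expr.get(gene, 0.0) < min_expr:
            problems.append(f"marker not supported by data: {gene}")
    return problems

observed = {"CD3D": 2.0, "CD3E": 1.7}

print(check_guardrails("T cell", ["CD3D"], observed))
# []
print(check_guardrails("super T cell", ["CD3D", "FOXP3"], observed))
# ['label not in vocabulary: super T cell', 'marker not supported by data: FOXP3']
```

The second call fails on both counts: the label is outside the vocabulary, and one cited marker was never observed, which is how a hallucinated marker association would surface.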

In summary, cell type annotation is a critical inferential task that converts molecular measurements into biologically interpretable categories. Expert-driven formalization using large language models aims to encode curated biological reasoning (marker evidence, hierarchical relationships, and uncertainty) into structured, reproducible outputs. When combined with rigorous QC, reference mapping, and evidence grounding, this direction can enhance the reliability of cell type labels in single-cell data and strengthen downstream biomedical discovery.

Source: @razoralign