Now Training: v2.0 Edge Persona Models

Intelligence at the Edge.
Zero Latency. Zero Compromise.

TinyLLMs provides high-reasoning, distilled Small Language Models (SLMs) purpose-built for constrained hardware. We bridge the reasoning gap for mission-critical, offline environments.

Cloud-Scale Training. Edge-Scale Deployment.

Our pipeline requires massive compute to distill deep reasoning capabilities into models small enough to run on local vehicle hardware.

01

RLHF & DPO Training

We utilize high-density H100/A100 GPU clusters to train foundational reward models. We apply advanced Reinforcement Learning (RL) techniques to teach complex spatial and persona-based reasoning.
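For readers new to DPO, here is a minimal sketch of the standard Direct Preference Optimization loss for a single preference pair. It assumes the summed log-probabilities of the chosen and rejected responses are already available from the policy and from a frozen reference model. The function names and the `beta` default are illustrative assumptions, not TinyLLMs' training code.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses.

    All arguments are sequence log-probabilities; beta (assumed value)
    controls how far the policy may drift from the reference model.
    """
    # Implicit rewards: how much more the policy likes each response
    # than the frozen reference model does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) is the same as softplus(-logits).
    return softplus(-logits)
```

The loss falls as the policy assigns relatively more probability to the preferred response than the reference model does, so no separately trained reward model is needed for this step.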

02

Model Distillation

Through proprietary knowledge distillation and quantization, we compress large model weights into highly efficient SLMs (1B-7B parameters) without sacrificing reasoning capability.

03

On-Device Inference

The distilled models are deployed directly onto edge hardware. They execute autonomous logic, persona mimicry, and dynamic routing with zero cellular latency.

The Distillation Engine

Compression Without Compromise

Our proprietary pipeline shrinks massive parameter footprints into edge-deployable formats while retaining complex reasoning pathways.

Knowledge Distillation

Transferring behavioral policies from 70B+ parameter teacher models to sub-3B student models using KL Divergence loss.
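As an illustration, a KL-divergence distillation loss for one token position can be sketched as follows. This is a generic textbook formulation with temperature scaling, not the proprietary pipeline. The temperature value and function names are assumptions chosen for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_kl_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions.

    The T^2 factor keeps gradient magnitudes comparable across
    temperatures, as in the classic distillation recipe.
    """
    p = softmax(teacher_logits, temperature)  # teacher (target)
    q = softmax(student_logits, temperature)  # student (prediction)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge. In practice it is usually averaged over tokens and often mixed with a standard cross-entropy term.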

Quantization (INT8/INT4)

Reducing the numerical precision of the network's weights to sharply cut memory usage and speed up integer arithmetic on edge hardware.
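A minimal sketch of symmetric per-tensor INT8 quantization shows where the memory savings come from. It assumes a single scale per tensor, whereas production schemes typically use per-channel or per-group scales and calibration data. The helper names, including `weight_bytes`, are hypothetical.

```python
def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]


def weight_bytes(num_params, bits):
    """Weight-only storage estimate; ignores activations and KV cache."""
    return num_params * bits // 8
```

Under these assumptions, `weight_bytes(3_000_000_000, 16)` is 6 GB for FP16 weights, while `weight_bytes(3_000_000_000, 4)` is 1.5 GB for INT4. The round-trip error per weight is bounded by half the scale.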

Weight Pruning

Systematically removing non-critical neural connections to enforce sparsity, which accelerates matrix multiplications on hardware and kernels that exploit sparse weights.
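One common way to decide which connections are "non-critical" is magnitude pruning: zero out the weights with the smallest absolute values. The sketch below is unstructured and one-shot, and is offered as an assumption about the general technique rather than a description of the pipeline's actual criterion.

```python
def magnitude_prune(weights, sparsity):
    """Zero the smallest-magnitude fraction of weights.

    sparsity is the fraction to remove, e.g. 0.5 zeroes half of them.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    num_to_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:num_to_prune]:
        pruned[i] = 0.0
    return pruned
```

For example, `magnitude_prune([0.9, -0.05, 0.3, 0.01], 0.5)` returns `[0.9, 0.0, 0.3, 0.0]`. Real speedups usually require structured patterns (such as 2:4 sparsity) or sparse kernels, followed by a short fine-tuning pass to recover accuracy.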

Model Finetuning (LoRA)

Parameter-Efficient Fine-Tuning freezes the pre-trained model and injects trainable rank decomposition matrices.
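The idea can be sketched for a single linear layer: the frozen weight matrix `W` is left untouched, and a low-rank update `B @ A` is added on top. The shapes, the `alpha / r` scaling, and the function names follow the usual LoRA formulation and are assumptions here, not details taken from TinyLLMs' code.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha=1.0):
    """Compute y = W x + (alpha / r) * B (A x).

    W: frozen weights, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r), usually initialized to zero
    """
    r = len(A)
    base = matvec(W, x)              # frozen pre-trained path
    delta = matvec(B, matvec(A, x))  # low-rank trainable path
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]
```

With `B` initialized to zero, the adapted layer starts out identical to the pre-trained one. Only `r * (d_in + d_out)` parameters are trained instead of `d_in * d_out`.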

Industry Validation

Fewer Active Parameters. Superior Reasoning.

NVIDIA's Nemotron Cascade 2 just proved what TinyLLMs was built on: you don't need trillion-parameter models to achieve frontier intelligence. You need intelligence density.

NVIDIA Nemotron Cascade 2

March 2026 · Open Source

A 30B Mixture-of-Experts model that activates only 3B parameters per token — and beats NVIDIA's own 120B model on coding and math benchmarks with 4x fewer active parameters.

Runs on a single RTX 4090 at 24.5GB quantized. Gold-medal performance on IMO 2025, IOI 2025, and 10 of 12 ICPC World Finals problems.

The post-training recipe uses Cascade RL and Multi-Domain On-Policy Distillation — the same families of techniques at the core of TinyLLMs' pipeline.

AIME 2025

92.4

Math Reasoning

LiveCodeBench v6

87.2

Code Generation

IMO 2025

Gold Medal

35 Points · Competition Math

Active Parameters

3B / 30B

10% Activation Ratio
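To make "active parameters" concrete, here is a minimal sketch of top-k routing in a Mixture-of-Experts layer. Only the `k` highest-scoring experts run for a given token, so most of the layer's parameters stay idle. This is a generic illustration with hypothetical names, not a description of Nemotron's actual router.

```python
import math


def top_k_route(router_logits, k):
    """Pick the k best experts and softmax-normalize their scores."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    gate = top_k_route(router_logits, k)
    return sum(weight * experts[i](x) for i, weight in gate.items())
```

With four experts and `k=2`, half of the expert parameters are skipped per token. The 3B / 30B figure above corresponds to the same idea at a roughly 10% activation ratio.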

TinyLLMs' entire architecture — from RLHF training through knowledge distillation to quantized edge deployment — is built on this same principle of intelligence density. As frontier labs open-source techniques like Cascade RL and on-policy distillation, our pipeline absorbs these advances and compresses them further for mission-critical hardware where cloud access is not an option.

Flagship Vertical

Next-Gen ADAS for Emergency Vehicles.

Emergency responders cannot rely on cloud APIs in dead zones. TinyLLMs powers embedded agentic systems that handle complex traffic preemption, dynamic routing, and persona-based dispatcher mimicry—all processed locally on the vehicle's hardware.

  • Traffic Preemption Logic: Real-time intersection override based on RL policy networks.
  • Persona Mimicry: SLMs tuned to interpret dispatcher intent instantly.
  • Air-Gapped Reliability: 100% offline inference capability.

Edge Terminal // Unit 42
> Initializing local TinyLLM core... OK
> Loading ADAS RL Policy (v2.4)... OK
> INCOMING: Code 3 routing requested.
Model Output: Route calculated. Overriding grid intersections 4 through 9. Expected latency: 12ms. Cloud dependency: FALSE. Proceeding to visual navigation mode.
Monitoring telemetry stream...

The Team

Meet the Founders

TinyLLMs is founded by engineering leaders with deep roots in Reinforcement Learning, NLP, and High-Performance Compute—bringing experience from Stanford AI research, IIT, and scaling enterprise platforms.

Prabhjot Singh Rai

Co-Founder & Principal Architect

Stanford Research | UMN | IITR
Reinforcement Learning | NLP | Systems Architecture | Model Distillation

Prabhjot brings deep expertise in reinforcement learning and natural language processing, with research experience at Stanford focused on reward modeling and policy optimization for language agents. He holds degrees from the University of Minnesota and IIT Roorkee and has a track record of building high-throughput ML systems at scale.

At TinyLLMs, Prabhjot leads the end-to-end technical architecture—from RLHF training pipelines and distillation workflows to the edge inference runtime. His work on persona-based reward shaping is central to TinyLLMs' ability to compress complex behavioral policies into sub-3B parameter models.

Prior to TinyLLMs, he designed and scaled enterprise AI platforms, gaining firsthand experience with the infrastructure constraints that edge deployment demands.

Sakthivel Sivaraman

Co-Founder & Principal Scientist

Stanford Research | UPenn | NITK
Edge Computing | NLP | Quantization | On-Device Inference

Sakthivel is a systems researcher with deep expertise in edge computing and NLP. He has research experience at Stanford and holds degrees from UPenn and NITK Surathkal. His work focuses on making large-scale language models practical for resource-constrained environments.

At TinyLLMs, Sakthivel leads the quantization and on-device inference stack, ensuring distilled models meet strict latency and memory budgets on embedded hardware. His research on INT4/INT8 quantization-aware training enables TinyLLMs to deploy reasoning-capable models on hardware with as little as 4GB of available memory.

His prior experience building production NLP systems at scale gives him a unique perspective on bridging the gap between research-grade models and real-world deployment constraints.

Research Affiliations

Stanford | UMN | UPenn | AWS