A high-speed, mathematically grounded 3B parameter language model for interpreting Nordic Non-ETS greenhouse gas emissions data. Built through Knowledge Distillation from a 32B teacher model, anchored to verified government statistics to guarantee factual accuracy.
The CorpusAI CO2 Emissions Model is an applied research project exploring the intersection of large language model distillation, domain-specific grounding, and edge deployment for environmental data analysis. The project addresses a critical gap in the Nordic climate reporting ecosystem: the need for fast, accurate, and interpretable AI systems capable of processing Non-ETS (non-Emissions Trading System) data from national statistical bureaus.
The core innovation is a three-stage methodology:
Knowledge Distillation (32B → 3B) compresses the reasoning capabilities of a large teacher model into a compact student model optimised for CPU inference. Anchor Data from verified government sources (SSB, Miljødirektoratet, Naturvårdsverket) is injected during training to eliminate hallucinations and ensure every output is traceable to "Ground Truth" statistics. The resulting model runs on commodity AMD EPYC hardware via Ollama, achieving sub-second response times without GPU requirements.
This specification documents the complete systems engineering lifecycle: from data preparation and model training through evaluation gates and production deployment. It serves as the technical foundation for the thesis component addressing the research question: "How can knowledge distillation and data anchoring techniques enable accurate, hallucination-free LLM inference for environmental reporting on resource-constrained hardware?"
The system architecture follows a staged distillation pipeline pattern, where each component is decoupled and independently verifiable — a key systems engineering principle enabling incremental validation. The architecture comprises four primary subsystems:
The 32-billion parameter teacher model runs on the Hippo/Viper GPU cluster (RTX 5090). It processes Anchor Data and generates Chain-of-Thought (CoT) reasoning pairs that form the training corpus for the student model. The teacher's role is to demonstrate how to think about emissions data — not just what the answer is.
The 3-billion parameter student model is the deployment target. Trained via LoRA fine-tuning on the teacher's distilled outputs and anchored to verified statistics, it achieves near-teacher-level accuracy at 10× the inference speed. Quantised to Q8_0 GGUF for CPU-only deployment on the S4 server (AMD EPYC).
The anchor layer is the system's "Ground Truth" guarantee. Raw emissions data from SSB (Statistisk sentralbyrå), Miljødirektoratet, and Naturvårdsverket is stored in a normalised MariaDB table and transformed into Natural Language Fact Sheets during training. This ensures zero hallucination on factual queries.
At inference time, a Retrieval-Augmented Generation (RAG) layer performs hybrid search across Qdrant vector embeddings and MariaDB structured data. Retrieved chunks and Anchor Data facts are injected into the prompt context, enabling the 3B student to provide cited, verifiable responses.
Phase 1 of the pipeline focuses on converting structured emissions data into training-ready formats. This is a two-step process: Anchor Data Extraction and Teacher Reasoning Generation.
Raw rows from the nordic_emissions_raw table are extracted and transformed into human-readable Anchor Strings. This transformation is deterministic — every fact sheet maps 1:1 to a database row, ensuring full traceability.
Source Format (MariaDB row):
Anchor String (output):
Each Anchor String carries metadata linking back to the source row ID and the originating statistical table. This enables post-hoc auditing: any model output can be traced back through the Anchor String to the exact database row and government publication it derives from.
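The deterministic row-to-fact-sheet transform can be sketched as follows. Column names (`region`, `sector`, `year`, `co2_tonnes`, `source_table`) and the sentence template are illustrative assumptions, not the project's actual schema:

```python
# Sketch of the deterministic row -> Anchor String transform.
# Column names and the sentence template are illustrative assumptions;
# the real nordic_emissions_raw schema may differ.

def row_to_anchor(row: dict) -> dict:
    """Render one database row as a human-readable fact sheet."""
    text = (
        f"In {row['year']}, the {row['sector']} sector in {row['region']} "
        f"emitted {row['co2_tonnes']:,} tonnes of CO2-equivalents "
        f"(source: {row['source_table']})."
    )
    # Metadata keeps the 1:1 link back to the database row for auditing.
    return {
        "anchor_id": row["id"],
        "source_table": row["source_table"],
        "text": text,
    }

example = {
    "id": 4711,
    "region": "Oslo",
    "sector": "transport",
    "year": 2024,
    "co2_tonnes": 1_200_000,
    "source_table": "SSB 13931",
}
print(row_to_anchor(example)["text"])
```

Because the template contains no free-form generation, running the transform twice on the same row always yields the same Anchor String, which is what makes the 1:1 traceability claim checkable.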
The 32B Teacher model processes each Anchor String and generates structured Chain-of-Thought (CoT) training pairs. These pairs teach the student model how to reason about emissions data — not just memorise facts.
Input Prompt to Teacher:
Output (JSONL training pair):
The <think> block is critical: it exposes the teacher's mathematical reasoning (percentage calculations, baseline comparisons, regulatory framework references) so the student model learns to replicate this analytical process, not just the final text.
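A lightweight validator for the JSONL pairs could look like the sketch below. The field names (`prompt`, `response`, `anchor_id`) are assumptions about the pair schema; the key point is that every pair must carry a closed `<think>` block and a reference back to its Anchor row:

```python
import json

# Minimal validator for distilled CoT training pairs. The JSONL field
# names ("prompt", "response", "anchor_id") are illustrative assumptions.

def validate_pair(line: str) -> bool:
    pair = json.loads(line)
    has_fields = all(k in pair for k in ("prompt", "response", "anchor_id"))
    # The <think> block must be present and closed so the student model
    # learns the reasoning trace, not just the final answer.
    resp = pair.get("response", "")
    has_think = "<think>" in resp and "</think>" in resp
    return has_fields and has_think

sample = json.dumps({
    "prompt": "Compare Oslo and Stockholm waste emissions in 2024.",
    "response": "<think>Oslo: 0.3M t; Stockholm: 0.4M t; "
                "difference = 0.1M t.</think> Stockholm's waste emissions "
                "were 0.1M tonnes higher than Oslo's.",
    "anchor_id": 4711,
})
print(validate_pair(sample))  # -> True
```

Running such a check over the whole corpus before training catches truncated teacher outputs and pairs that lost their anchor linkage.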
The training phase (the "Refinery Pass") uses Unsloth on the Hippo GPU server to maximise throughput on the RTX 5090. The dataset is carefully balanced to produce a model that is both accurate and robust.
The training dataset is structured as a balanced JSONL corpus with three distinct pair types, each serving a specific pedagogical function:
| Pair Type | Share | Example Q → A | Purpose |
|---|---|---|---|
| Anchor Direct | 30% | "What were Oslo's 2024 transport emissions?" → "1.2M tonnes" | Exact factual recall from Ground Truth data |
| Reasoning (CoT) | 50% | "Compare Oslo and Stockholm waste emissions" → `<think>` block + result | Multi-step mathematical and analytical reasoning |
| Negative / Robustness | 20% | "What are the emissions for Mars?" → "The dataset does not contain planetary data outside the Nordics" | Boundary enforcement; teaches the model to refuse out-of-domain queries |
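Assembling the 30/50/20 corpus from pre-generated pools of pairs can be sketched as below. The pools here are placeholders; the real pairs come from the Teacher pipeline:

```python
import random

# Sketch of composing the balanced 30/50/20 training corpus from three
# pools of pre-generated pairs. Pool contents are placeholders.

def compose_dataset(anchor, cot, negative, total, seed=42):
    rng = random.Random(seed)          # fixed seed for reproducibility
    n_anchor = int(total * 0.30)       # Anchor Direct pairs
    n_cot = int(total * 0.50)          # Reasoning (CoT) pairs
    n_neg = total - n_anchor - n_cot   # remainder -> Negative/Robustness
    return (rng.sample(anchor, n_anchor)
            + rng.sample(cot, n_cot)
            + rng.sample(negative, n_neg))

corpus = compose_dataset(list(range(1000)), list(range(1000)),
                         list(range(1000)), total=100)
print(len(corpus))  # -> 100
```

Taking the negative share as the remainder guarantees the three buckets always sum exactly to the requested corpus size.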
Training runs with Unsloth on the Hippo server's RTX 5090 (32 GB VRAM) to maximise throughput. The LoRA configuration prioritises high-density adapter weights to preserve mathematical-reasoning fidelity during distillation.
| Parameter | Value | Rationale |
|---|---|---|
| LoRA Rank (r) | 64 | High density required for mathematical logic preservation across the distillation boundary |
| LoRA Alpha (α) | 128 | Alpha/rank ratio of 2.0 balances adapter influence vs. base model knowledge |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj | Full attention + MLP targeting ensures reasoning-pathway modification |
| Learning Rate | 1 × 10⁻⁴ | Cosine schedule with warm-up; prevents catastrophic forgetting of base capabilities |
| Context Length | 4,096 tokens | Sufficient for CoT blocks + Anchor context; larger windows degrade training speed |
| Batch Size | 4 (gradient accumulation: 8) | Effective batch size 32; fits within the 32 GB VRAM budget with Unsloth optimisations |
| Epochs | 3 | Domain-specific data benefits from multiple passes; overfitting monitored via eval loss |
| Weight Decay | 0.01 | Light regularisation to prevent overfitting on small specialised datasets |
| Quantisation (Training) | 4-bit (NF4) | QLoRA approach: base model in 4-bit, adapters in float16 for precision |
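The table above can be collected into a single configuration object. This is a sketch; the exact Unsloth/PEFT argument names may differ slightly from these keys:

```python
# The hyperparameter table as one config dict (a sketch; actual
# Unsloth/PEFT argument names may differ slightly).
lora_config = {
    "r": 64,
    "lora_alpha": 128,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
    "learning_rate": 1e-4,
    "lr_scheduler_type": "cosine",
    "max_seq_length": 4096,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 8,
    "num_train_epochs": 3,
    "weight_decay": 0.01,
    "load_in_4bit": True,  # QLoRA: NF4 base weights, fp16 adapters
}

# Sanity checks derived from the rationale column:
assert lora_config["lora_alpha"] / lora_config["r"] == 2.0
effective_batch = (lora_config["per_device_train_batch_size"]
                   * lora_config["gradient_accumulation_steps"])
print(effective_batch)  # -> 32
```

Keeping the config as version-controlled data rather than scattered flags also serves the configuration-management requirement noted later in the SE mapping.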
The model must pass three "Blue Note" quality gates before deployment. These gates are designed as sequential verification stages — inspired by systems engineering V&V (Verification and Validation) methodology — where each gate tests a progressively higher level of system capability.
The Logic Test audits the `<think>` block for correct mathematical operations: when the model claims a "Total," is it actually the sum of the referenced "Sectors"? When it calculates a percentage change, is the arithmetic correct?
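The total-versus-sectors audit can be automated along these lines. The regex patterns and 0.5% tolerance are illustrative assumptions about how figures appear in a reasoning trace:

```python
import re

# Sketch of the Logic Test: parse claimed sector figures and the claimed
# total out of a <think> block and re-run the arithmetic. The regexes and
# the 0.5% tolerance are illustrative assumptions.

def check_total(think_block: str, tolerance: float = 0.005) -> bool:
    sectors = [float(x) for x in re.findall(r"sector:\s*([\d.]+)", think_block)]
    total_match = re.search(r"total:\s*([\d.]+)", think_block)
    if not sectors or not total_match:
        return False  # nothing verifiable -> fail the gate
    total = float(total_match.group(1))
    # Allow a small relative tolerance for rounded figures in the trace.
    return abs(total - sum(sectors)) <= tolerance * max(total, 1.0)

trace = ("transport sector: 1.2, industry sector: 0.8, "
         "waste sector: 0.3, total: 2.3")
print(check_total(trace))  # -> True
```

Because the check re-derives the total rather than trusting it, a model that fabricates a plausible-sounding sum fails the gate even when the prose reads fluently.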
*Diagram: quality gate mapping to the systems engineering V-Model.*
Phase 4 transforms the trained LoRA adapters into a production-ready inference system. The pipeline is designed for deterministic reproducibility: every step is scripted, version-controlled, and produces bit-identical outputs from identical inputs.
LoRA adapters are merged into the base Qwen 2.5-Coder-3B model using Unsloth's merge utilities. The merged model is then converted to GGUF format and quantised to Q8_0 (8-bit quantisation). Q8_0 is selected over Q4_K_M for this use case because mathematical precision is paramount — the marginal speed improvement of 4-bit quantisation does not justify the risk of numerical rounding artifacts in emissions calculations.
The deployed model runs via Ollama on Server 4 (AMD EPYC, 64 cores). At query time, a two-step prompt routing process ensures accuracy:
Step 1: Hybrid search retrieves relevant chunks from Qdrant (semantic similarity) and MariaDB (structured query) based on the user's question.
Step 2: The 3B model interprets the retrieved chunks alongside injected Anchor Data facts to produce a final, cited response. Every claim is traceable to a source row.
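The two steps can be sketched as a prompt-assembly function. The retriever functions here are stand-ins for the real Qdrant and MariaDB clients, and the prompt wording is a hypothetical example:

```python
# Sketch of the two-step prompt routing. qdrant_search and mariadb_lookup
# are stand-ins for the real hybrid retrievers; the prompt template and
# anchor ID are hypothetical.

def qdrant_search(question: str) -> list[str]:
    # Step 1a: semantic similarity over vector embeddings (stubbed).
    return ["Chunk: Oslo transport emissions fell year-on-year."]

def mariadb_lookup(question: str) -> list[str]:
    # Step 1b: structured query against the anchor table (stubbed).
    return ["[Anchor 4711] Oslo transport 2024: 1.2M tonnes CO2e (SSB)."]

def build_prompt(question: str) -> str:
    # Step 2: inject retrieved chunks + anchor facts into the context.
    chunks = qdrant_search(question) + mariadb_lookup(question)
    context = "\n".join(chunks)
    return (
        "Answer using ONLY the context below and cite anchor IDs.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

prompt = build_prompt("What were Oslo's 2024 transport emissions?")
```

Embedding the anchor IDs directly in the context is what lets the 3B model emit citations that the audit trail can resolve back to database rows.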
Two critical bottlenecks have been identified during initial prototyping. Both require pre-deployment mitigation to ensure production reliability.
Nordic characters (å, ø, æ, ä, ö) in source data may produce HTML entity artifacts (`&aring;`, `&oslash;`) when scraped from web-based statistical interfaces. If these artifacts persist into the training data, the 3B model may learn to reproduce them in outputs — generating responses like "Milj&oslash;direktoratet" instead of "Miljødirektoratet".
Mitigation: The "Hex Scrub" pre-processing script must run on all source data before the Teacher generates training pairs. This script normalises all HTML entities to their UTF-8 equivalents and validates character encoding consistency across the entire nordic_emissions_raw table.
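A minimal version of such a scrubber, using only the Python standard library, could look like this (the repeated-unescape loop and NFC normalisation are design assumptions, not the project's actual script):

```python
import html
import unicodedata

# Minimal "Hex Scrub" sketch: decode HTML entities and numeric character
# references, then normalise to NFC so å/ø/æ/ä/ö each have one canonical
# byte sequence. The repeated-unescape loop is a design assumption to
# catch double-encoded entities such as "&amp;oslash;".

def hex_scrub(text: str) -> str:
    previous = None
    while text != previous:
        previous = text
        text = html.unescape(text)
    return unicodedata.normalize("NFC", text)

print(hex_scrub("Milj&oslash;direktoratet"))  # -> Miljødirektoratet
```

Running this before the Teacher sees any Anchor String keeps the artifact out of the training corpus entirely, which is cheaper than filtering it from model outputs later.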
On the S4 server (AMD EPYC, CPU-only inference), the Key-Value cache grows linearly with context length. At the full 4,096 token training context, inference latency degrades significantly as the KV cache consumes available RAM bandwidth. The EPYC's memory subsystem, while ample in capacity, cannot match GPU HBM bandwidth for random access patterns typical of transformer attention.
Mitigation: Production inference context is capped at 2,000 tokens. The RAG layer pre-filters retrieved chunks to stay within this budget. This constraint is acceptable because the 3B model's primary function is interpretation of pre-retrieved data, not open-ended generation. The 2K context window comfortably fits: system prompt (~200 tokens) + retrieved chunks (~800 tokens) + Anchor facts (~400 tokens) + generation headroom (~600 tokens).
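The token budget in the paragraph above can be made an explicit check in the RAG layer, so a misconfigured retriever fails fast instead of silently blowing the context cap (the dictionary keys are illustrative):

```python
# The 2,000-token production budget as an explicit sanity check.
# Allocation keys are illustrative labels for the components named above.
BUDGET = 2000
allocation = {
    "system_prompt": 200,
    "retrieved_chunks": 800,
    "anchor_facts": 400,
    "generation_headroom": 600,
}
assert sum(allocation.values()) <= BUDGET
print(sum(allocation.values()))  # -> 2000
```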
This project is developed within the framework of a Master of Science in Innovation and Technology Management with a specialisation in Systems Engineering. The specification deliberately maps to established SE methodologies:
The three Quality Gates (Section 5) directly implement INCOSE SE Handbook requirements verification categories: Inspection (Math Test — automated numerical verification), Analysis (Logic Test — mathematical consistency checking), and Demonstration (Vibe Test — expert panel evaluation). Each gate has explicit pass/fail criteria, ensuring requirements traceability from stakeholder needs to test results.
The project lifecycle follows the V-Model pattern: left side (decomposition) maps Domain Requirements → System Design → Component Specifications, while the right side (integration) maps unit-level verification (Gate 01) through system-level validation (Gate 02) to acceptance testing (Gate 03). This structure is documented in Section 5's V-Model diagram.
The architecture's four subsystems (Teacher, Anchor, Refinery, RAG) communicate through well-defined interfaces: JSONL for training data exchange, SQL for anchor queries, GGUF for model serialisation, and REST APIs for inference. Each interface has a defined data contract, enabling independent development and testing of subsystems.
From an innovation perspective, CorpusAI represents a process innovation in environmental reporting: applying knowledge distillation to create domain-expert AI systems that can operate on commodity hardware. The commercial viability thesis is that organisations (municipalities, environmental agencies) can deploy specialised AI models without cloud dependency or GPU infrastructure costs — a significant barrier reduction for Nordic public sector adoption.
| SE Concept | CorpusAI Implementation | Thesis Section |
|---|---|---|
| Stakeholder Analysis | Nordic climate agencies (SSB, Miljødirektoratet), municipal planners, policy researchers | Chapter 2 |
| Requirements Decomposition | Accuracy (>99%), speed (<1s), domain-bounded, CPU-deployable, hallucination-free | Chapter 3 |
| Architecture Design | Teacher-Student-Anchor-RAG four-subsystem decomposition (Section 2 of this spec) | Chapter 4 |
| Verification & Validation | Three-gate quality framework: Math, Logic, Vibe (Section 5 of this spec) | Chapter 5 |
| Configuration Management | Git-controlled training configs, versioned GGUF artifacts, reproducible pipeline scripts | Chapter 6 |
| Risk Management | Encoding artifacts, KV cache bloat, domain boundary leakage (Section 7 of this spec) | Chapter 7 |
Ensure the nordic_emissions_raw table has at least 5,000 fresh rows from SSB, Miljødirektoratet, and Naturvårdsverket. Run the Hex Scrub encoding normalisation on all ingested data.
Use the 32B Coder on Viper to build the first 2,000 Q&A pairs following the 30/50/20 dataset composition. Validate JSONL format and anchor ID integrity before training.
Execute LoRA training on Hippo via Unsloth. Run all three Blue Note quality gates. Iterate on dataset composition if Gate 01 or Gate 02 fails.
Merge adapters, quantise to Q8_0 GGUF, deploy via Ollama on S4. Configure RAG layer with Qdrant + MariaDB hybrid search. Production context cap: 2,000 tokens.
CorpusAI CO2 Emissions Model v1.0
A GilliganTech Research Project — Blue Note Logic Inc. × Gilligan Tech ENK
Master of Science · Innovation & Technology Management · Systems Engineering