Alexey Kuntsevich
Machine learning engineer (LLM agents, evaluation & memory) • Physics (doctoral studies)
Summary
Staff ML/AI engineer (18 years in software, 10 in ML, 5+ on LLMs) building and evaluating multi-step tool-using agents, document understanding systems, long-term memory/RAG, and fine-tuning pipelines for enterprise workflows. Strong focus on measurement and robustness: offline harnesses, LLM-as-a-judge with variance control, regression suites, and latency/cost benchmarking. Seeking a research engineering role advancing reliable, safe LLM agents for real knowledge work.
Professional experience
Stealth startup
AI Research Engineer (Jan 2026 - present)
- Trajectory robustness analysis. Extended production eval work into a follow-on research track using mid-trajectory perturbations and stochastic resampling to distinguish robust reasoning paths from brittle dependencies in long-horizon tasks. Measured by: outcome sensitivity and state/tool divergence under controlled interventions.
- Multi-agent planning system. Built coordinator -> sub-agent decomposition, context handoff + token budgeting, autonomous + human-in-the-loop modes, and proactive clarification when constraints were under-specified. Validated via: scenario-based evals tracking task success/failure modes, tool-call validity, and cost/latency budget drift.
- Automated taxonomy building and optimization. Building pipelines that infer, normalize, and continuously refine taxonomies from unstructured data, combining LLM-assisted candidate generation with clustering, deduplication, hierarchy repair, and human review loops. Measured by: intrinsic taxonomy quality metrics plus downstream utility, using the efficiency of taxonomy-guided operation planning over the dataset as an end-to-end signal.
Procure.ai - Germany/Europe (fully remote from Munich)
Staff AI Engineer (sole ML engineer on a strategic initiative) (Feb 2025 - Jan 2026)
Built an enterprise knowledge graph, multi-agent planning system, and evaluation stack for corporate procurement. The business goal was to help procurement teams find, compare, and justify supplier choices faster across fragmented external data and internal sourcing constraints.
- Open-ended agent evaluation. Built LLM-as-a-judge pipelines for multi-step trajectories over an evolving supplier knowledge graph; stabilized judges via multi-run aggregation + agreement checks. For hard planning tasks, added teacher-trajectory matching (tool-call sequence, parameter choices, evidence coverage) with domain-expert steering. Validated via: pairwise win-rate / preference tests + calibration spot-checks.
- Regression & safety harnesses. Built smoke tests and regressions to detect capability drops across model swaps, especially around multi-tool orchestration and retrieval grounding. Validated via: fixed test suites, step-level diffs (tool calls + retrieved evidence), and gating thresholds.
- Enterprise knowledge graph. Built pipelines to ingest/merge/dedupe/enrich supplier records from public registries + catalogs across multiple EU countries; exposed structured, semantic, and geo query tools to agents. Validated via: entity-resolution precision/recall on labeled pairs + duplicate-rate monitoring.
Inworld AI - Bay Area / Remote
Staff ML Engineer (2024 - Jan 2025)
Worked on memory mechanisms and latency-sensitive retrieval for roleplay/gaming agents. The business goal was to improve long-session coherence and responsiveness so agents stayed believable and emotionally consistent under real-time product latency constraints.
- Memory ranking beyond semantics. Designed multi-factor memory ranking (emotional salience, recency, episodic relevance, persona affinity) to maintain emotional continuity and long-term persona consistency. Validated via: dialogue harness with memory injection/ablation; metrics for persona consistency, contradiction rate, and emotional-continuity ratings (LLM-judge + spot human checks).
- Selective memorization. Prototyped persona-aware memorization policies + reward-shaped heuristics, including explicit signals for “correct forgetting” (alignment-relevant). Validated via: retention/decay curves and downstream behavior drift when memories are present vs. absent.
- Serving benchmarking under gaming SLAs. Compared vLLM, SGLang, and llama.cpp. Measured by: TTFT, p50/p95 latency, tokens/sec, and throughput under concurrency.
- Multilingual robustness. Mitigated language drift (e.g., unintended language switching) via targeted prompting + routing across EN/RU/CJK. Measured by: language-id stability and drift rate on multilingual dialogue suites.
Allianz SE - Munich, Germany
Language Model Engineer (2020 - 2024)
Led language-model adoption across multiple business units, from early spaCy prototypes through GPT-3 to locally fine-tuned T5/Flan-T5 systems. The business goal was to reduce manual underwriting effort and improve consistency/auditability when screening document-heavy commercial insurance applications across countries and product lines.
- Document understanding for underwriting. Built extraction pipelines to pre-screen commercial insurance applications for eligibility signals from unstructured documents. Validated via: field-level extraction metrics (exact match / token-F1), false-negative monitoring on “must-catch” signals, and review/audit sampling.
- Fine-tuning for structured extraction + formal consistency checks. Fine-tuned T5/Flan-T5 for JSON extraction from messy multi-format surveys (incl. OCR) + conditional logic/rule application; added consistency checks that flag contradictions or missing information and generate targeted clarification requests. Bootstrapped the fine-tuning dataset by adapting public formal-logic corpora into domain templates and augmenting them with LLM-generated survey cases + programmatic pruning/validation. Validated via: JSON validity rate, schema correctness, rule/consistency checks (incl. contradiction + missing-answer cases), and end-to-end document pass/fail accuracy.
- Graph-backed RAG + clarification loop. Modeled evidence as a product -> version -> section -> chunk graph; used LLM-assisted graph construction plus heuristic traversal/pruning. The agent asked disambiguating questions before querying to prevent cross-document mix-ups. Validated via: retrieval recall@k / MRR, citation accuracy, and faithfulness/groundedness checks (LLM-judge + audits).
- Domain-adapted embeddings + internal benchmarking. Fine-tuned an open-weight embedding model on internal terminology/abbreviations; ran an internal model leaderboard and optimized T5-family encoder caching for throughput on Kubernetes. Measured by: retrieval ranking metrics (nDCG@k / recall@k), latency/throughput, and stability across model versions.
Previous experience
Senior Data Scientist / Product Owner, Flixbus (Munich) - 2017-2020 · Data Engineer, CHECK24 (Munich) - 2014-2016 · Software Developer -> Tech Lead, Apnet - 2007-2014
Independent research & lab work
- Trajectory robustness analysis for tool-using agents. Follow-on from production eval gaps observed at Procure.ai; designed mid-trajectory perturbation and stochastic resampling experiments to identify which intermediate decisions were robust vs. brittle in long-horizon workflows (RAG, document creation, web browsing, persistent memory). Measured by: outcome sensitivity and state/tool divergence under controlled interventions.
- Private training + serving lab. Multi-GPU home lab for SFT and experimental RL/GRPO fine-tuning; experimented with Unsloth and Apple MLX for training efficiency and distributed topologies.
- RL sandbox for policy learning. Implemented a transformer-based PPO agent for competitive game AI to study curriculum design, proxy failure, and policy brittleness. Measured by: win rate, invalid-action rate, and generalization vs. held-out opponents/teams.
- Personal tool-using assistants (private). Built tool-integrated agents for notes/reminders capture and lightweight ops with human-in-the-loop control and observability.
Skills
LLM engineering & research: document understanding, multi-agent orchestration, LLM-as-a-judge evals, trajectory / process-level evaluation, memory architectures, RAG/GraphRAG, structured extraction, synthetic data, tool calling, fine-tuning (SFT; experimental RL/GRPO in lab settings)
Training & serving: PyTorch, Hugging Face (Transformers/Datasets), Unsloth, MLX, llama.cpp, vLLM, T5x
Infra & data: Docker, Kubernetes, Linux; PostgreSQL/Redis/MongoDB/RocksDB; FAISS; Kafka
Programming: Python (expert), Rust (proficient), SQL (expert), FastAPI, Pytest
Languages: Russian (native), English (working), German (intermediate)
Education
2024 - Professional courses: Systematically improving RAG applications · Mastering LLMs
2012 – 2017 - Nizhny Novgorod State University - Doctoral studies / postgraduate research program (Aspirantura), Physics (coursework + qualifying exams completed). Research: vacuum electrodynamics in particle accelerators.
2003 – 2009 - Nizhny Novgorod State University - Diploma in Physics (Radiophysics). MSc-equivalent in Electrical Engineering (U.S. credential evaluation).
