About
Every language is a universe of thought.
We keep them alive.
A language dies every two weeks. UNESCO estimates that by 2100, half of the world’s roughly 7,000 languages will be extinct, each taking with it centuries of irreplaceable knowledge, oral history, and cultural identity. The resources to preserve these languages exist, but they’re scattered across obscure PDFs, YouTube videos, academic papers, and dictionary websites. TongueKeeper deploys AI agents that autonomously discover, extract, and cross-reference these scattered fragments into a unified, searchable archive, in minutes rather than months.
How it works
Discover
Autonomous agents scour the web for dictionaries, grammars, recordings, and academic papers in endangered languages.
Extract
AI-powered extraction pulls vocabulary, grammar patterns, and audio from diverse sources into structured archives.
Cross-Reference
Intelligent verification links entries across sources, validating accuracy and building comprehensive language records.
The pipeline
Discovery
AI agents search with 6-tier dynamic queries across Perplexity Sonar and SERP APIs, generating up to 24 targeted queries per language.
Crawl
Each source is fetched through a 3-tier cascade: specialized crawlers, BrightData Web Unlocker for protected content, and a Stagehand headless browser.
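A fetch cascade like this can be sketched as an ordered list of fetchers tried one after another until one returns content. The sketch below is illustrative only: `fetch_with_cascade` and the three stand-in fetchers are hypothetical names, and none of them calls the real BrightData or Stagehand APIs.

```python
from typing import Callable, Optional, Sequence, Tuple

# A fetcher takes a URL and returns page content, or None if it got nothing.
Fetcher = Callable[[str], Optional[str]]


def fetch_with_cascade(url: str, tiers: Sequence[Tuple[str, Fetcher]]) -> Tuple[str, str]:
    """Try each tier in order; return (tier_name, content) from the first success."""
    errors = []
    for name, fetch in tiers:
        try:
            content = fetch(url)
        except Exception as exc:  # one failing tier must not abort the cascade
            errors.append(f"{name}: {exc}")
            continue
        if content:
            return name, content
        errors.append(f"{name}: empty response")
    raise RuntimeError(f"all tiers failed for {url}: " + "; ".join(errors))


# Hypothetical stand-ins for the three tiers (assumptions, not real clients).
def specialized_crawler(url: str) -> Optional[str]:
    return None  # pretend no specialized crawler matches this site


def web_unlocker(url: str) -> Optional[str]:
    raise TimeoutError("blocked")  # pretend the unlocker request timed out


def headless_browser(url: str) -> Optional[str]:
    return f"<html>rendered {url}</html>"


TIERS = [
    ("specialized", specialized_crawler),
    ("unlocker", web_unlocker),
    ("browser", headless_browser),
]
```

Calling `fetch_with_cascade("https://example.org/dict", TIERS)` falls through the first two tiers and returns the headless-browser result. Ordering the tiers from cheapest to most expensive keeps the heavier tools for the sources that actually need them.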
Extraction
Claude processes each source in a tool-use loop, extracting structured vocabulary entries, grammar patterns, IPA transcriptions, and conjugations.
Cross-Reference
A second Claude agent searches for duplicate entries across sources, merging definitions and calculating reliability scores.
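As a rough illustration of merging duplicates and scoring reliability, the sketch below groups entries by a normalized headword, unions their definitions, and scores each merged entry by the share of sources that attest it. The scoring formula, the field names, and the sample words are all assumptions made for this example; the page does not say how TongueKeeper computes its scores.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class MergedEntry:
    headword: str
    definitions: List[str] = field(default_factory=list)
    sources: Set[str] = field(default_factory=set)
    reliability: float = 0.0


def normalize(headword: str) -> str:
    """Case-fold and collapse whitespace so trivial variants share one key."""
    return " ".join(headword.casefold().split())


def merge_entries(raw_entries: Iterable[dict], total_sources: int) -> Dict[str, MergedEntry]:
    merged: Dict[str, MergedEntry] = {}
    for entry in raw_entries:
        key = normalize(entry["headword"])
        target = merged.setdefault(key, MergedEntry(headword=key))
        if entry["definition"] not in target.definitions:
            target.definitions.append(entry["definition"])
        target.sources.add(entry["source"])
    for target in merged.values():
        # Assumed score: fraction of all sources that contain this headword.
        target.reliability = len(target.sources) / total_sources
    return merged


# Invented sample data, not taken from any real language archive.
SAMPLE = [
    {"headword": "Kala", "definition": "fish", "source": "dict_a"},
    {"headword": "kala ", "definition": "fish", "source": "paper_b"},
    {"headword": "kala", "definition": "a fish", "source": "wiki_c"},
    {"headword": "mato", "definition": "stone", "source": "dict_a"},
]
```

With `merge_entries(SAMPLE, total_sources=3)`, the three spellings of the first headword collapse into one entry attested by all three sources, while the second headword keeps a lower score because only one source lists it.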
Archive
All data flows into Elasticsearch with Jina AI embeddings for semantic search, reranking, and knowledge graph generation.
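Semantic search over embeddings comes down to ranking stored vectors by their similarity to a query vector. The sketch below shows that idea with plain cosine similarity over toy 3-dimensional vectors; it is a simplified stand-in and does not use the Elasticsearch or Jina AI APIs, and the function names and vectors are made up for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_vec: Sequence[float], docs: Dict[str, Sequence[float]], top_k: int = 3) -> List[str]:
    """Return the ids of the top_k documents most similar to the query."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in docs.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]


# Toy embeddings (assumed values; real embeddings have hundreds of dimensions).
DOCS = {
    "water": [1.0, 0.0, 0.0],
    "river": [0.9, 0.1, 0.0],
    "fire": [0.0, 0.0, 1.0],
}
```

Here `rank([1.0, 0.0, 0.0], DOCS, top_k=2)` places "water" ahead of "river" and leaves "fire" out. A production search engine does the same ranking with approximate nearest-neighbor indexes so it scales past a handful of entries.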
Data sources
Glottolog
The world's most comprehensive catalog of languages, with data on 5,352 endangered languages including geographic coordinates, endangerment status, and language family classification.
Endangered Languages Project
A collaborative platform documenting the world's endangered languages, providing endangerment assessments and preservation resources.
Community Sources
Dictionaries, academic papers, YouTube content, government archives, and wiki resources discovered autonomously by our AI agents.