The Epstein Files platform is a public-interest research tool for searching, analyzing, and cross-referencing the publicly released documents from the Jeffrey Epstein case. All source data comes from official releases by the U.S. Department of Justice and congressional committees.
In 2023–2025, the U.S. Department of Justice released over 1.3 million documents related to the Jeffrey Epstein investigation across 12 datasets. These documents include court filings, FBI records, bank statements, flight logs, email archives, detention records, and more.
The sheer volume of these releases makes manual review impractical. This platform was built to make these public records searchable and analyzable. Every document has had its text extracted, been OCR-processed where needed, and been indexed and cross-referenced with entity recognition to surface connections that would otherwise remain buried in more than a million files.
This is a transparency and accountability project. The documents are public record. This tool simply makes them accessible.
All datasets originate from official DOJ releases and congressional committee publications.
| Dataset | Files | Content |
|---|---|---|
| DS1 | 3,158 | Court filings, FBI records, and investigative documents |
| DS2 | 574 | Financial documents and bank records |
| DS3 | 67 | Metropolitan Correctional Center (MCC) detention records |
| DS4 | 152 | Communications and correspondence |
| DS5 | 120 | Additional court filings |
| DS6 | 13 | Supplemental materials |
| DS7 | 17 | Supplemental materials |
| DS8 | 10,595 | FBI investigative records |
| DS9 | 531,307 | Bank records, JPMorgan correspondence |
| DS10 | 503,154 | Aviation and flight records |
| DS11 | 331,655 | Email archives |
| DS12 | 152 | Additional FBI files |
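As a quick consistency check, the per-dataset file counts in the table above can be summed and compared with the "over 1.3 million" figure. The snippet below is only an arithmetic sketch. The dictionary simply restates the table, and the variable names are illustrative.

```python
# File counts copied from the dataset table above.
dataset_files = {
    "DS1": 3_158, "DS2": 574, "DS3": 67, "DS4": 152,
    "DS5": 120, "DS6": 13, "DS7": 17, "DS8": 10_595,
    "DS9": 531_307, "DS10": 503_154, "DS11": 331_655, "DS12": 152,
}

total_files = sum(dataset_files.values())

# Rank datasets by size to see where the bulk of the material sits.
largest_three = sorted(dataset_files, key=dataset_files.get, reverse=True)[:3]
share = sum(dataset_files[name] for name in largest_three) / total_files

print(f"total files: {total_files:,}")  # total files: 1,380,964
print(f"largest three: {largest_three}, share of total: {share:.1%}")
```

The three largest datasets (DS9, DS10, DS11) account for nearly all of the files, which is why bulk processing rather than manual review is the practical approach.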
The extraction pipeline runs in stages: recovery of hidden text from PDF layers, OCR of scanned pages with FireRed-OCR, photo classification with CLIP ViT-L/14, and named entity recognition with GLiNER across all 3.13M+ text chunks.
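The first stages of such a pipeline can be sketched as a simple fallback: use the text layer when a page has one, otherwise run OCR, then split the result into chunks. This is a minimal illustration under stated assumptions, not the platform's actual code. The `Page` record, the `extract_text` and `chunk_text` helpers, the chunk size, and the `fake_ocr` stand-in are all hypothetical, and no real OCR model is called.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Page:
    """Hypothetical page record; field names are illustrative only."""
    number: int
    embedded_text: Optional[str]  # text recovered from the PDF text layer, if any
    image: bytes = b""            # rendered page image, used only when OCR is needed


def extract_text(page: Page, ocr: Callable[[bytes], str]) -> tuple[str, str]:
    """Return (text, method): prefer the embedded text layer, fall back to OCR."""
    if page.embedded_text and page.embedded_text.strip():
        return page.embedded_text.strip(), "text_layer"
    return ocr(page.image).strip(), "ocr"


def chunk_text(text: str, max_words: int = 200) -> list[str]:
    """Split text into word-bounded chunks (the chunk size is an assumption)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def fake_ocr(image: bytes) -> str:
    """Stand-in for a real OCR model call."""
    return "scanned page text"


pages = [Page(1, "Filed in the district court."), Page(2, None, b"\x89PNG")]
results = [extract_text(p, fake_ocr) for p in pages]
chunks = [c for text, _ in results for c in chunk_text(text, max_words=3)]

for page, (text, method) in zip(pages, results):
    print(page.number, method, repr(text))
print(chunks)
```

Passing the OCR function in as a parameter keeps the routing logic testable without loading a model. Recording the method next to the text also makes it possible to flag OCR-derived chunks as lower confidence later on.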
Entity data is stored in a Neo4j knowledge graph (1.44M nodes, 6.8M+ relationships), which enables network analysis and relationship discovery. Full-text search is powered by OpenSearch with 3.13M indexed chunks. Semantic search uses 5.3M vector embeddings in Qdrant. Document metadata and ground-truth statistics are maintained in PostgreSQL.
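To illustrate how relationships in an entity graph can arise from extracted mentions, the sketch below builds a small co-occurrence graph in plain Python. It assumes that two entities are linked when they appear in the same chunk. The source does not say how the platform derives its relationships, so this is one common approach rather than a description of the Neo4j schema. All entity names are placeholders.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical chunk -> entity mentions; placeholder names, not real data.
chunk_entities = {
    "chunk-1": {"Person A", "Bank X"},
    "chunk-2": {"Person A", "Person B", "Bank X"},
    "chunk-3": {"Person B", "Airport Y"},
}

# Weight each undirected edge by the number of chunks that mention both entities.
edge_weights = Counter()
for entities in chunk_entities.values():
    for a, b in combinations(sorted(entities), 2):
        edge_weights[(a, b)] += 1

# Adjacency view for simple neighbor lookups.
neighbors = defaultdict(set)
for a, b in edge_weights:
    neighbors[a].add(b)
    neighbors[b].add(a)

strongest = edge_weights.most_common(1)[0]
print(strongest)                      # (('Bank X', 'Person A'), 2)
print(sorted(neighbors["Person B"]))  # ['Airport Y', 'Bank X', 'Person A']
```

Sorting each entity set before pairing gives every undirected edge a single canonical key, so the same pair is never counted under two orderings.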
This platform is a research tool for publicly available documents. All source material comes from official government releases. The platform does not host, generate, or distribute any illegal content.
Some images in the original documents were not properly redacted by the releasing agencies. Where identified, these images have been censored pending review. If you encounter content that should be redacted, please report it.
Inclusion of any individual's name in these documents does not imply guilt or wrongdoing. Many people appear in the records as witnesses, victims, legal representatives, or incidental mentions. The profiles on the Crime Board reflect what the documents contain, not conclusions of culpability.
Entity extraction (names, organizations, locations) is automated with NLP models, so results may carry over OCR errors or contain misattributions. Always verify against the source documents.