Epstein Files — Public Document Analysis Platform

Intelligence Dashboard

Aggregated intelligence from 2M+ document chunks across 12 datasets

Key Documents

Investigative Terms

Most Connected Documents

Ranked by number of linked entities (persons, orgs, emails)

Strongest Connections

Entity pairs co-occurring in the most documents
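
As a rough illustration of how such a ranking can be computed, the sketch below counts, for every unordered pair of entities, the number of documents that mention both. The function name, the input shape (document ID mapped to a list of entity names), and the sample data are hypothetical placeholders, not the platform's actual schema or code.

```python
from collections import Counter
from itertools import combinations


def strongest_connections(doc_entities, top_n=10):
    """Return the top_n entity pairs by number of documents mentioning both."""
    pair_counts = Counter()
    for entities in doc_entities.values():
        # Deduplicate and sort so (a, b) and (b, a) count as the same pair
        # and a pair is counted at most once per document.
        for pair in combinations(sorted(set(entities)), 2):
            pair_counts[pair] += 1
    return pair_counts.most_common(top_n)


# Hypothetical placeholder data: document ID -> entities linked to it.
example = {
    "doc-1": ["Person A", "Org X", "Person B"],
    "doc-2": ["Person A", "Person B"],
    "doc-3": ["Person B", "Org X"],
}
top_pairs = strongest_connections(example, top_n=3)
```

Counting each pair once per document keeps a single long document that repeats two names many times from dominating the ranking.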

What are the Epstein Files?

The Epstein Files are court documents, financial records, flight logs, communications, and other evidence related to Jeffrey Epstein and associated individuals, released by the DOJ and various courts. They span 12 distinct datasets totaling over 1.3 million documents.

Search tips

Keyword mode uses full-text matching with fuzzy typo tolerance.

Semantic mode uses AI embeddings to find conceptually similar passages.

Use the Filters button to narrow results by text source, dataset, or your own tags.

PDF OCR (cyan badge) marks text extracted directly from the PDF text layer. This extraction is deterministic and highly reliable.

Visual OCR (amber badge) marks text recognized from scanned page images. It may contain recognition errors.
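
To make the difference between the two search modes concrete, here is a minimal sketch. The first function builds a fuzzy full-text query body in OpenSearch's query DSL; the field name `text` is an assumption, not the platform's real mapping. The second part ranks passages by cosine similarity between vectors, which is the general idea behind embedding search; the toy 3-dimensional vectors stand in for real embeddings, and a production system would delegate this step to a vector database rather than compute it in pure Python.

```python
import math


def keyword_query(user_text, field="text"):
    """Build an OpenSearch query DSL body for a typo-tolerant full-text match.

    The field name is a hypothetical placeholder.
    """
    return {
        "query": {
            "match": {
                field: {
                    "query": user_text,
                    # "AUTO" scales the allowed edit distance with term length.
                    "fuzziness": "AUTO",
                }
            }
        }
    }


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_rank(query_vec, passages):
    """Sort (passage_id, vector) pairs by similarity to the query, best first."""
    scored = [(pid, cosine_similarity(query_vec, vec)) for pid, vec in passages]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy vectors standing in for real embeddings (illustrative only).
passages = [
    ("p1", [1.0, 0.0, 0.0]),
    ("p2", [0.6, 0.8, 0.0]),
    ("p3", [0.0, 0.0, 1.0]),
]
ranked = semantic_rank([1.0, 0.1, 0.0], passages)
```

Keyword mode finds passages that contain the typed words, or near misspellings of them. Semantic mode can surface passages that share no words with the query but sit close to it in embedding space.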

How was this data extracted?

Each PDF was processed in three ways: text layer extraction (PyMuPDF), visual OCR (FireRed-OCR, a 2B-parameter model), and page image extraction. The resulting text was split into chunks, indexed in OpenSearch for keyword search, and embedded into Qdrant (5.3M vectors) for semantic search. All source data comes directly from publicly released DOJ files.
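
The text layer and chunking steps can be sketched as follows. This is an illustration under stated assumptions, not the platform's actual pipeline: the page does not describe its chunking strategy, so the fixed-size character windows with overlap, the default sizes, and both function names are hypothetical. The only library call used is PyMuPDF's `page.get_text()`, which returns the embedded text layer of a page.

```python
def extract_text_layer(pdf_path):
    """Return the text layer of each page as a list of strings.

    Requires PyMuPDF (`pip install pymupdf`). The import is local so the
    chunker below can be used without the library installed.
    """
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


def chunk_text(text, size=1000, overlap=200):
    """Split text into character windows of `size`, each overlapping the
    previous one by `overlap` characters (assumed strategy, for illustration)."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

Overlapping windows are a common choice because a sentence cut at one chunk boundary still appears whole in the neighboring chunk, which helps both keyword and embedding search. Each chunk would then be sent to the search index and to the embedding model.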