Our story · 2018–2022

ColumbusDoc

A personal search engine: one place to search across all your documents, wherever they lived.

You connected your accounts — email, cloud, archives — and ColumbusDoc indexed every document into one place, searchable and explorable by facets. One search, across all your scattered data.

We designed, built and brought it to market between 2018 and 2022, as co-founders of the ColumbusIT startup. Below: what it did, how it worked inside, and the still-open problem it set out to solve.


The vision

One place to search across all your documents — wherever they lived.

Every day, people lose time searching for documents scattered across dozens of different services (email, cloud storage, archives), and the results rarely satisfy. ColumbusDoc was born from exactly this problem.

The idea was simple to tell and hard to build: connect every account into a single secure container, normalize and index every document, and make them searchable and explorable by facets. One search, across all your data, wherever it lived.

Connectable accounts

OneDrive, SharePoint, Outlook, Gmail, Google Drive, Dropbox, among others

The period video

How we told the story back then

ColumbusDoc's original promo: connect your accounts, search, filter, view and share documents in seconds. Fifty-six seconds, in Italian, with the era's graphics (and soundtrack).

The engineering

Under the hood, a distributed pipeline that turned any file into searchable content.

ColumbusDoc wasn't a UI with search bolted on. It was a distributed system: specialized processing queues — heavy work like OCR kept apart from fast operations — a central orchestrator balancing load across dedicated servers, and a pipeline that carried every document from raw format to indexed, searchable content.

The ingestion pipeline

Convert to PDF

Word, Excel, images, text: everything normalized into a common format.

Extract images

Graphic components are isolated from the document.

OCR

Text inside scans becomes machine-readable.

Searchable PDF

Recognized text is layered over the original image.

Optimize

Compression and linearization, so the file opens quickly in a browser.

Extract text

Plain text feeds the full-text index.

Thumbnails

Every page gets its own preview.

Page indexes

Per-page indexes pre-computed so the viewer can stream only the portions it needs, without downloading the whole file.

Indexing

Data and metadata are written to MongoDB and Elasticsearch.
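
The stages above can be sketched as an ordered list of functions applied to each document. This is a minimal illustration, not the original implementation: the `Document` class, the stage labels and `make_stage` are assumptions, and each stage is a placeholder that only records that it ran.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Document:
    """Hypothetical record carried through the pipeline."""
    name: str
    completed: List[str] = field(default_factory=list)


Stage = Callable[[Document], Document]


def make_stage(label: str) -> Stage:
    """Return a placeholder stage that only records that it ran."""
    def stage(doc: Document) -> Document:
        doc.completed.append(label)
        return doc
    return stage


# Stage order follows the list above; the labels are illustrative names.
PIPELINE: List[Stage] = [
    make_stage(label)
    for label in (
        "convert_to_pdf",
        "extract_images",
        "ocr",
        "searchable_pdf",
        "optimize",
        "extract_text",
        "thumbnails",
        "page_indexes",
        "indexing",
    )
]


def ingest(doc: Document) -> Document:
    """Run every stage in order and return the processed document."""
    for stage in PIPELINE:
        doc = stage(doc)
    return doc
```

Keeping the stages in a plain list makes the order explicit and lets a real system swap a placeholder for actual conversion or OCR code without touching the loop.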

Distributed processing

An orchestrator (WatchDog) coordinated dedicated queues — control, synchronization, intensive processing, fast operations, external-storage integration — spread across multiple servers and sized to the load. Scalability by design, not improvisation.
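
A rough sketch of the routing idea, assuming an in-process model built on Python's standard `queue` module. The queue names mirror the list above; the job kinds, the `ROUTES` table, the `Orchestrator` class and `workers_needed` are hypothetical and do not describe how WatchDog was actually written.

```python
import math
import queue
from typing import Any, Dict

# Queue names taken from the description above.
QUEUE_NAMES = ("control", "synchronization", "intensive", "fast", "external_storage")

# Hypothetical mapping from job kind to queue: heavy work stays apart from fast work.
ROUTES: Dict[str, str] = {
    "ocr": "intensive",
    "convert_to_pdf": "intensive",
    "thumbnail": "fast",
    "extract_text": "fast",
    "sync_account": "synchronization",
    "fetch_remote_file": "external_storage",
}


class Orchestrator:
    """Toy stand-in for an orchestrator: routes jobs and reports backlog."""

    def __init__(self) -> None:
        self.queues = {name: queue.Queue() for name in QUEUE_NAMES}

    def submit(self, kind: str, payload: Any) -> str:
        """Put a job on the queue for its kind; unknown kinds raise KeyError."""
        name = ROUTES[kind]
        self.queues[name].put((kind, payload))
        return name

    def backlog(self) -> Dict[str, int]:
        """Approximate number of pending jobs per queue."""
        return {name: q.qsize() for name, q in self.queues.items()}


def workers_needed(backlog: int, jobs_per_worker: int) -> int:
    """Size a worker pool to the load: one worker per `jobs_per_worker` pending jobs."""
    return math.ceil(backlog / jobs_per_worker)
```

With this split, a burst of OCR jobs grows only the `intensive` backlog, so thumbnails and text extraction on the `fast` queue are not stuck behind it.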

The features

Far more than a search bar.

What it could do, concretely.

Natural-language queries

"Find 2017 documents about Columbus," "show me every email sent by Gianni." A semantic parser turned the sentence into a syntax tree and then a typed query. It even understood colloquial dates — "yesterday," "the day before," "last year."
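
As an illustration of the colloquial-date step only, here is a small sketch that resolves a phrase to a date range. It does not reproduce the semantic parser or its syntax tree; the function name `resolve_date` and the set of supported phrases are assumptions.

```python
import re
from datetime import date, timedelta
from typing import Optional, Tuple


def resolve_date(phrase: str, today: date) -> Optional[Tuple[date, date]]:
    """Map a colloquial date phrase to an inclusive (start, end) range.

    Returns None when the phrase is not recognized.
    """
    text = phrase.strip().lower()
    if text == "yesterday":
        day = today - timedelta(days=1)
        return day, day
    if text == "last year":
        year = today.year - 1
        return date(year, 1, 1), date(year, 12, 31)
    # A bare four-digit year such as "2017" covers the whole year.
    if re.fullmatch(r"(19|20)\d{2}", text):
        year = int(text)
        return date(year, 1, 1), date(year, 12, 31)
    return None
```

Passing `today` explicitly, instead of reading the clock inside the function, keeps the resolution deterministic and easy to test.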

Exploratory search

Interactive filters on account, year, and the presence of attachments or notes. You could browse the entire archive without typing a word, simply by including and excluding facets.
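
Including and excluding facets maps naturally onto an Elasticsearch `bool` query with `filter` and `must_not` clauses, plus `terms` aggregations for the facet counts. The sketch below only builds the request body as a Python dictionary; the field names (`account`, `year`, `has_attachments`, `has_notes`), the page size and the function name are assumptions, not the original index mapping.

```python
from typing import Any, Dict, Mapping, Sequence

# Hypothetical facet fields; the real index mapping is not documented here.
FACET_FIELDS = ("account", "year", "has_attachments", "has_notes")


def build_facet_query(
    include: Mapping[str, Any],
    exclude: Mapping[str, Any],
    facet_fields: Sequence[str] = FACET_FIELDS,
) -> Dict[str, Any]:
    """Build an Elasticsearch request body from included and excluded facets."""
    return {
        "size": 20,  # assumed page size
        "query": {
            "bool": {
                # No typed text: start from every document in the archive.
                "must": [{"match_all": {}}],
                "filter": [{"term": {f: v}} for f, v in include.items()],
                "must_not": [{"term": {f: v}} for f, v in exclude.items()],
            }
        },
        # One terms aggregation per facet to drive the filter counts.
        "aggs": {f: {"terms": {"field": f}} for f in facet_fields},
    }
```

For example, `build_facet_query({"year": 2021}, {"account": "gmail"})` describes "everything from 2021 that is not in the Gmail account", with counts for each remaining facet value.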

SmartSet

Sets of documents shared by semantic rule rather than by folder or ACL: "share every document for client X dated 2021." Any new file matching the rule joined the set and was shared automatically.
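
One way to picture a rule-defined set is a predicate over document metadata that is re-evaluated whenever a new document is indexed. This is a simplified sketch under that assumption: the `SmartSet` class below, its exact-match rule format and the metadata keys are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SmartSet:
    """Hypothetical rule-based set: membership follows metadata, not folders."""
    name: str
    rule: Dict[str, Any]          # every key/value pair must match
    shared_with: List[str]
    members: List[str] = field(default_factory=list)

    def matches(self, metadata: Dict[str, Any]) -> bool:
        """True when the document metadata satisfies every rule condition."""
        return all(metadata.get(key) == value for key, value in self.rule.items())

    def on_new_document(self, doc_id: str, metadata: Dict[str, Any]) -> bool:
        """Called at indexing time; adds the document if it matches the rule."""
        if self.matches(metadata):
            self.members.append(doc_id)
            return True
        return False
```

Because membership is computed from the rule, nobody has to remember to move a new file into a shared folder: it joins the set as soon as it is indexed.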

Complex Italian formats

Certified email (PEC) parsed recursively — sender, recipient, certification timestamp, nested attachments — and e-invoices, including .p7m-signed ones, with a readable HTML preview. Regulated formats, handled natively.
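
The recursive part can be illustrated with Python's standard `email` package: a message that carries another message as an attachment (content type `message/rfc822`) is itself walked, level by level. This sketch covers only the descent into nested parts; certification data and .p7m signatures are out of scope, and the function name `outline` is an assumption.

```python
from email.message import Message
from typing import List, Optional, Tuple

# One row per MIME part: (nesting depth, content type, attachment filename or None)
Row = Tuple[int, str, Optional[str]]


def outline(msg: Message, depth: int = 0) -> List[Row]:
    """Recursively list every MIME part, descending into nested messages."""
    rows: List[Row] = [(depth, msg.get_content_type(), msg.get_filename())]
    # is_multipart() is also true for message/rfc822 parts, whose payload
    # is a one-element list holding the embedded message.
    if msg.is_multipart():
        for part in msg.get_payload():
            rows.extend(outline(part, depth + 1))
    return rows
```

Applied to an envelope that wraps an original message with its own PDF attachment, the result lists the outer parts first and then the embedded message's parts at a deeper level.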

Streaming PDF viewer

The viewer opened documents without downloading them whole: it loaded only the portions of the PDF needed for the page on screen, with intelligent pre-caching based on navigation.
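
A simplified sketch of the idea, assuming the pre-computed index maps each page to one contiguous byte range and that the file is fetched with HTTP Range requests. Both are assumptions for illustration, as are the `PageIndex` format and the function names; real PDF page data is not always contiguous.

```python
from typing import List, Tuple

# Hypothetical index format: one (byte offset, byte length) pair per page.
PageIndex = List[Tuple[int, int]]


def range_header(index: PageIndex, page: int) -> str:
    """Build the HTTP Range header value for one page (end offset is inclusive)."""
    offset, length = index[page]
    return f"bytes={offset}-{offset + length - 1}"


def pages_to_prefetch(current: int, total: int, direction: int = 1, depth: int = 2) -> List[int]:
    """Pick the next pages to cache, following the reader's navigation direction."""
    pages = []
    for step in range(1, depth + 1):
        candidate = current + direction * step
        if 0 <= candidate < total:
            pages.append(candidate)
    return pages
```

A reader paging forward from page 0 would trigger prefetching of pages 1 and 2, while one paging backward from page 2 would prefetch pages 1 and 0.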

Desktop and web

A native WPF desktop client and a browser-based web client: the same engine, two interfaces.

A hard problem

Searching across your own scattered data is a real problem — and far from solved.

Aggregating the documents a person or organization spreads across different services, normalizing them and making them searchable from one place is a concrete, hard problem. It's the same unified-search ground that products like Glean and Dropbox Dash work on today. ColumbusDoc was our way of tackling it, with the technology of the time.

In 2023 Dropbox launched Dash, which brings into focus the same concepts we had worked on — down to sharing sets of documents as a unit: our SmartSet, their Stacks.

If anything, the barriers have grown

Aggregating someone's personal data has become more costly since then, not less: providers have progressively restricted access to their APIs. Google, for example, now requires restricted-scope verification and an annual security assessment for apps that touch Gmail or Drive and store that data elsewhere. The problem stays open precisely because it's hard.

Where we went deep

A parser that interpreted natural-language questions. Indexing and full-text search on Elasticsearch. Highly scalable distributed processing, with all its challenges. Response-time optimization across the entire application architecture. And native handling of Italy's regulated formats: certified email (PEC) parsed recursively, e-invoices including .p7m-signed ones.

What we carried forward

Deep roots, today's tools.

ColumbusIT wound down in 2022, but the experience we built around ingestion, indexing and search over data stayed — and some of those elements still underpin what we build today, with far more evolved tools: Traction, which brings a software project's scattered context together for AI agents; DRH, which makes diagnostic knowledge searchable; Lumina, which uses semantic search to connect editions over time.


From the story to the present

Want to see how we apply this experience to your systems?

Document management, indexing, search over data: tell us about your context.
