Technology
Parsing the Da Capo Catalog
How we turned 1,399 pages of printed classical music compositions into a searchable database — and why no existing metadata system could have done it for us.
The book weighed about two kilos. Jerzy Chwialkowski's The Da Capo Catalog of Classical Music Compositions, published by Da Capo Press in 1996. 1,399 pages. 132 composers. I discovered it while researching how classical music had been catalogued historically — looking for anything that treated the repertoire with the kind of structural rigor we needed and could not find in any existing digital source. Within twenty minutes of getting my hands on a copy I had stopped thinking of it as a reference book. It was a specification — the most obsessively detailed specification I had ever held in my hands. Every major work from Monteverdi to Stockhausen, organized by genre within each composer, then chronologically within each genre. Opus numbers, catalog numbers, key signatures, instrumentation, literary sources for vocal works, movement titles, and for operas and cantatas, every individual aria and recitative listed in performance order.
I brought it to the team and said this is what we need to build from. We tracked down Chwialkowski — my boss led the negotiation, I was involved — and secured the rights to use the catalog's data as the foundation for our metadata system. Once we had the rights, the engineering could begin. The entire parsing logic and matching pipeline was mine. And the first problem was extracting 1,399 pages of structured knowledge from ink on paper.
The Shape of the Problem
Primephonic was being overhauled from a PHP site into native apps, and I was building the catalog, the payment integrations for subscriptions across the major app stores and Adyen, and the metadata layer that would make the whole thing work. The core problem with the catalog was simple to state and brutal to solve. The audio library was excellent — lossless files licensed from hundreds of classical labels. The metadata was chaos. Each label had its own conventions and none of them agreed with each other. Deutsche Grammophon tagged conductors one way, Hyperion another. Some labels filed a symphony's four movements as four separate tracks with no parent relationship. Others grouped them under an album title that bore no consistent relationship to the work's canonical name. I once found the same Brahms symphony tagged as "Brahms: Sym. 4 e-Moll," "Symphony No. 4 in E minor, Op. 98," and simply "Brahms 4" — three different labels, three different ideas about what constitutes a title.
Searching it was like working in a library where every book had been shelved by a different librarian using a different system.
The Da Capo Catalog was the Rosetta Stone. It contained, in one dense volume, the canonical identity of every major classical composition: the correct title, the correct catalog number, the correct opus, the correct key, the correct instrumentation, and the correct hierarchical structure — movements within works, arias within operas, recitatives within cantatas. If I could get this data out of the book and into MongoDB, we would have a canonical reference against which to normalize every piece of label metadata in the library.
The problem was that the data was trapped in 1,399 pages of ink.
THE METADATA PROBLEM
OCR: The First Extraction
We took the book to the University of Amsterdam and used their OCR scanning technology to digitize it. I will spare you the details of getting a 1,399-page hardback through that process. The spine did not survive.
The UvA OCR Process
The University of Amsterdam's document digitization facility used high-resolution flatbed scanning at 400 DPI combined with Tesseract-based OCR tuned for reference typography. We ran each composer section in isolation and manually reviewed a sample against the physical book before committing the output. Character accuracy was approximately 98–99% — which sounds excellent until you are extracting structured musical data, where a single wrong character in a catalog number creates a record that points to nothing. The real failure mode was not character errors but structural collapse: indentation that signified hierarchy in print became whitespace noise in the OCR stream.
The physical layout created problems immediately. The Da Capo Catalog uses a dense, compact typographic style typical of 1990s reference publishing: small font, hierarchical indentation to indicate parent-child relationships between works and movements, liberal use of abbreviations, and a mix of roman, italic, and bold type to distinguish field types. Titles in italic. Catalog numbers in a specific format. Parenthetical annotations for dates, literary sources, and instrumentation woven into the running text of each entry.
Standard OCR handled the character recognition reasonably well — perhaps 1-2% character error rate, which sounds acceptable until you realize that a single wrong character in a catalog number turns BWV 140 into something that does not exist. But the real problem was not characters. It was structure. The indentation that told a human reader "this aria belongs to this opera" was lost. The typographic distinction between a work title and a movement title was flattened. What came out the other end was a river of text that contained all the right words in the right order but had lost the hierarchy that made them meaningful.
I ran the OCR in batches, one composer section at a time, and spot-checked against the physical book. I kept it open on my desk for weeks. The pages started getting soft at the corners. The OCR gave us words. We needed a tree.
FROM INK TO TREE
The Grammar of the Catalog
This is where the work became genuinely interesting. I spent evenings reading entries, hundreds of them, and I started to see it: the Da Capo Catalog has a grammar. A consistent, unwritten grammar. Chwialkowski's explanatory notes describe the intent of the formatting but not its formal rules. So I had to reverse-engineer it. Entry by entry, pattern by pattern.
A typical entry for an orchestral work looks roughly like this:
Symphony No. 5, in C minor, Op. 67 (1808)
I. Allegro con brio
II. Andante con moto
III. Scherzo: Allegro
IV. Allegro — Presto
That looks simple. I thought so too, at first. Consider what the parser actually has to know:
- The first line is a work-level entry. It contains a genre identifier ("Symphony"), a sequence number ("No. 5"), a key ("in C minor"), a catalog reference ("Op. 67"), and a composition date ("1808").
- The indented lines are movements. Each has a Roman numeral index and a tempo marking. Some have subtitles. Some have dashes indicating attacca transitions.
- The date is in parentheses. But parentheses are also used for instrumentation, for literary sources, for alternate titles. You cannot assume parentheses mean "date." I learned this the hard way when the parser confidently tagged "(for 2 oboes and cor anglais)" as a year.
- "Op. 67" is Beethoven's opus number. But for Mozart, the equivalent is "K. 467." For Bach, "BWV 1068." For Schubert, "D. 759." Each composer uses a different catalog system, sometimes multiple competing systems. The parser has to know which system applies to which composer. I built a lookup table. It had forty-seven entries by the time we were done.
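To make the grammar concrete, the work-level line can be captured with a single verbose regex with named captures. This is an illustrative Python sketch of the idea, not the production parser (which was Perl), and the field names and catalog-prefix list here are mine:

```python
import re

# Verbose regex mirroring the grammar of a work-level entry line.
# Field names and the catalog-prefix alternation are illustrative.
WORK_LINE = re.compile(r"""
    ^(?P<genre>[A-Z][a-z]+)                 # genre identifier, e.g. "Symphony"
    \s+No\.\s*(?P<number>\d+)               # sequence number
    (?:,\s*in\s+(?P<key>[A-G]\s?(?:flat\s|sharp\s)?(?:major|minor)))?  # key
    (?:,\s*(?P<catalog>(?:Op|K|BWV|D)\.?\s*\d+[a-z]?))?  # catalog reference
    (?:\s*\((?P<date>\d{4})\))?             # trailing parenthetical: a date here,
                                            # but parentheses can also hold
                                            # instrumentation or literary sources
""", re.VERBOSE)

def parse_work_line(line):
    m = WORK_LINE.match(line.strip())
    return m.groupdict() if m else None

parse_work_line("Symphony No. 5, in C minor, Op. 67 (1808)")
# {'genre': 'Symphony', 'number': '5', 'key': 'C minor',
#  'catalog': 'Op. 67', 'date': '1808'}
```

In practice a pattern this simple is exactly what the parenthesis ambiguity breaks: a four-digit capture is necessary but not sufficient evidence of a date, which is part of why the real rules ended up varying per composer.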
ANATOMY OF A CATALOG ENTRY
Now consider a vocal work:
Cantata No. 140, Wachet auf, ruft uns die Stimme, BWV 140 (1731)
1. Chorus: Wachet auf, ruft uns die Stimme
2. Recitative (T): Er kommt, er kommt
3. Aria (S, B): Wann kommst du, mein Heil?
4. Chorale: Zion hört die Wächter singen
5. Recitative (B): So geh herein zu mir
6. Aria (S): Mein Freund ist mein
7. Chorale: Gloria sei dir gesungen
Here the grammar shifts. Movements use Arabic numerals, not Roman. Each has a type identifier — Chorus, Recitative, Aria, Chorale — that does not exist in instrumental works. Voice parts are abbreviated in parentheses: T for tenor, S for soprano, B for bass. German text with diacritics and noun capitalization everywhere. And then there are operas, where the grammar shifts again: arias in act order, sometimes with scene numbers, sometimes without. Overtures as separate entries. Librettist attribution. Premiere date and venue. Cross-references when a libretto was set by multiple composers.
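The vocal-movement grammar can be sketched the same way. Again a Python illustration rather than the original Perl, with a deliberately abbreviated list of movement types and voice letters:

```python
import re

# A vocal-work movement line: Arabic index, movement type, optional voice
# parts in parentheses, colon, then the text incipit. The type list and
# voice letters are an illustrative subset, not the full vocabulary.
VOCAL_MOVEMENT = re.compile(r"""
    ^(?P<index>\d+)\.\s+
    (?P<type>Chorus|Recitative|Aria|Chorale|Duet)
    (?:\s*\((?P<voices>[SATB](?:,\s*[SATB])*)\))?    # e.g. (T) or (S, B)
    :\s*
    (?P<incipit>.+)$
""", re.VERBOSE)

def parse_vocal_movement(line):
    m = VOCAL_MOVEMENT.match(line.strip())
    if not m:
        return None
    d = m.groupdict()
    # Normalize "(S, B)" into a list of voice parts; absent means none listed.
    d["voices"] = d["voices"].replace(" ", "").split(",") if d["voices"] else []
    return d

parse_vocal_movement("3. Aria (S, B): Wann kommst du, mein Heil?")
# {'index': '3', 'type': 'Aria', 'voices': ['S', 'B'],
#  'incipit': 'Wann kommst du, mein Heil?'}
```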
The CTO and I wrote the parser in Perl, pair programming. Perl was the right language for this and it was not close. The regex engine is not a library you import — it is part of the language itself, with features we used constantly: the /x flag to write readable multi-line patterns for entry structures that would have been unreadable as single-line regexes, named captures to extract fields directly into hashes, and \G anchoring to walk through an entry field by field without losing position. Perl was designed by a linguist for exactly this class of problem — practical extraction from messy structured text — and it showed. Autovivification meant we could build the nested catalog tree incrementally without pre-declaring schemas: $catalog{$composer}{$genre}{$work}{movements} just created the entire path as we went.
Why Perl?
Perl's regex engine is native to the language — /x for readable patterns, \G for stateful matching, named captures, and autovivification for building nested data structures without schema boilerplate. Designed by a linguist for exactly this class of problem.
It was not one parser — it became a family of parsers, and honestly we built it that way because the grammar had so many exception cases that a single clean parser would have been fiction. 132 composers across four centuries of music, each with their own conventions, and Chwialkowski had handled each one slightly differently. Some composers' vocal works got full text source attribution, others got abbreviated groupings. Some operas were broken down aria by aria, others were not. The Kabalevsky entries listed individual Shakespeare sonnet settings but grouped other vocal works as "3 poems (c1927, Blok)" with no further detail. Every time we thought the grammar was stable, another composer's section would break it. We ended up with a dispatcher that selected the appropriate parsing strategy based on the composer and genre section, and a growing collection of exception handlers for the cases that did not fit any strategy. The core logic used Perl's native regex for field extraction, but the real engineering was in the exception handling — identifying the entry level (work, movement, sub-movement), deciding which rules applied to this particular composer, and building the tree without silently dropping the entries that did not conform. I remember a Thursday night debugging why every Dvořák entry was failing. The answer was the háček over the r. The OCR had rendered it as a regular r followed by a stray pixel that the regex was choking on. That was a typical evening.
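The dispatcher shape reduces to something like the following sketch. This is Python rather than the original Perl, and the strategy table, the (composer, genre) keys, and the stand-in parse functions are all hypothetical; the point is the fallback-plus-review structure, never silently dropping an entry:

```python
# Hypothetical strategy table keyed by (composer, genre section); the
# stand-in parse functions only gesture at the real regex grammars.
def parse_instrumental(entry):
    if "No." not in entry:
        raise ValueError("nonconforming entry")
    return {"kind": "instrumental", "raw": entry}

def parse_cantata(entry):
    if "Cantata" not in entry:
        raise ValueError("nonconforming entry")
    return {"kind": "cantata", "raw": entry}

STRATEGIES = {
    ("Bach", "cantatas"): parse_cantata,
    ("Beethoven", "orchestral"): parse_instrumental,
}

def dispatch(composer, genre, entry, unparsed):
    # Fall back to a generic strategy when no (composer, genre) rule exists.
    strategy = STRATEGIES.get((composer, genre), parse_instrumental)
    try:
        return strategy(entry)
    except ValueError:
        # Never drop a nonconforming entry on the floor: queue it for a human.
        unparsed.append((composer, genre, entry))
        return None
```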
The abbreviations alone required a dedicated expansion module. The catalog's front matter lists dozens of abbreviated forms for instruments (fl, ob, cl, bn, hn, tpt, tbn, str, pf), ensembles, voice types, and structural terms. I encoded every one. Without this, the parser could not distinguish cl (clarinet) from a stray OCR artifact, and believe me, at 1-2% error rate, there were stray artifacts everywhere.
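The expansion module itself is conceptually simple: a lookup table applied to whole tokens only. A minimal Python sketch, with a small subset of the front-matter abbreviations:

```python
import re

# Expansion table seeded from the catalog's front matter; this is a small
# illustrative subset, not the full list the parser carried.
ABBREVIATIONS = {
    "fl": "flute", "ob": "oboe", "cl": "clarinet", "bn": "bassoon",
    "hn": "horn", "tpt": "trumpet", "tbn": "trombone",
    "str": "strings", "pf": "piano",
}

def expand_instrumentation(text):
    # Replace whole tokens only, so "cl" buried inside another word survives.
    return re.sub(
        r"\b([a-z]+)\b",
        lambda m: ABBREVIATIONS.get(m.group(1), m.group(1)),
        text,
    )

expand_instrumentation("2 fl, 2 ob, cl, str")
# -> "2 flute, 2 oboe, clarinet, strings"
```

The word-boundary anchoring is the part that matters: without it, an OCR artifact or an abbreviation embedded in a longer word gets mangled instead of left alone.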
Matching Audio to Catalog
With the parsed catalog in MongoDB, we had a canonical identity for every major classical composition. Now came the part that nearly broke us: matching it to actual audio.
Label metadata was, to put it gently, creative. "Beethoven: Symphony No. 5 in C minor, Op. 67 - I. Allegro con brio" from one label. "L. v. Beethoven — Sym. 5 c-Moll op. 67: 1. Allegro con brio" from another. "Symphony no.5 in C minor op.67 - 1st movement" from a third, with no composer field at all — Beethoven was apparently so obvious he did not need naming. One small label had tagged a Schubert symphony with a Köchel number, which is Mozart's catalog system. It took me an hour to figure out why that match was failing.
The matching engine worked in layers:
- Catalog number matching. If the metadata contained a recognizable catalog number (Op. 67, BWV 140, K. 467), we matched on that first. Within a composer's catalog the numbers are unambiguous — Beethoven's Op. 67 is the Fifth Symphony, period. But not all metadata included them, and some used outdated numbering. Older Köchel numbers for Mozart that had been renumbered in later editions were a recurring headache.
- Fuzzy title matching. For entries without clean catalog numbers, we used normalized string comparison: strip diacritics, lowercase, remove stopwords and abbreviations, compute similarity against canonical titles. This required careful threshold tuning. Too loose and you match the wrong work. Too strict and you miss legitimate variations. I spent days adjusting thresholds, running batch comparisons, staring at spreadsheets of near-misses.
- Hierarchical validation. A track claiming to be "Movement 1" of a work was validated against the parsed movement count. If our catalog said four movements and the label said five, something was wrong — possibly an alternate version, possibly a tagging error. These discrepancies went to the musicologists on the team, who could tell the difference. I could not, at least not at first.
- Performer normalization. "Berliner Philharmoniker" and "Berlin Philharmonic Orchestra" and "Berlin Phil." are the same ensemble. We built an alias table, initially seeded from common variations and then expanded as we hit new ones in the wild. Conductor names were worse, compounded by transliteration — Tchaikovsky appears in label metadata as Tchaikovsky, Tschaikowski, Chaikovsky, Chaikovskii, and Cajkovskij, depending on the label's country of origin. I counted seven distinct spellings before I stopped counting.
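The first two layers can be sketched as follows. This is an illustrative Python reduction against a simplified per-composer entry list; the prefix table and the 0.75 threshold are stand-ins, not the tuned production values:

```python
import re
import unicodedata
from difflib import SequenceMatcher

# Layered matcher sketch. Prefix table and threshold are illustrative.
CATALOG_NO = re.compile(r"\b(Op|BWV|K|D|Hob)\.?\s*(\d+[a-z]?)\b", re.IGNORECASE)
PREFIX = {"op": "Op.", "bwv": "BWV", "k": "K.", "d": "D.", "hob": "Hob."}

def normalize(title):
    # Strip diacritics, lowercase, drop punctuation and a few stopwords.
    t = unicodedata.normalize("NFKD", title)
    t = "".join(c for c in t if not unicodedata.combining(c)).lower()
    t = re.sub(r"[^a-z0-9 ]+", " ", t)
    stop = {"the", "a", "an", "in", "no", "op"}
    return " ".join(w for w in t.split() if w not in stop)

def match_track(track_title, catalog_entries, threshold=0.75):
    # Layer 1: a recognizable catalog number wins outright.
    m = CATALOG_NO.search(track_title)
    if m:
        key = f"{PREFIX[m.group(1).lower()]} {m.group(2)}"
        for entry in catalog_entries:
            if entry["catalog"] == key:
                return entry
    # Layer 2: fuzzy comparison against normalized canonical titles.
    best, score = None, 0.0
    for entry in catalog_entries:
        s = SequenceMatcher(None, normalize(track_title),
                            normalize(entry["title"])).ratio()
        if s > score:
            best, score = entry, s
    return best if score >= threshold else None
```

The layering order is the design decision: an exact catalog hit short-circuits the fuzzy pass, so threshold tuning only ever affects the tracks with no usable number.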
THE MATCHING PIPELINE
The fundamental design challenge was the one-to-many relationship. One entry in the Da Capo Catalog — Beethoven's Fifth — mapped to hundreds of recordings. Each recording was a distinct artistic event with its own conductor, orchestra, soloists, venue, and date. The database had to model the work as a canonical entity and the recordings as instances of that entity, linked but not collapsed. This is the thing that pop music metadata gets wrong: in pop, the song is the recording. In classical, the composition and the recording are separate concepts connected by a performance. Getting that distinction right in the data model was everything.
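In document terms, the model looked roughly like this. The field names and sample values are illustrative, not the production MongoDB schema; what matters is that the recording references the work by id instead of duplicating its fields:

```python
# Illustrative document shapes. The work is the canonical entity;
# each recording is an instance that links to it by id.
work = {
    "_id": "beethoven-op-67",
    "composer": "Beethoven",
    "title": "Symphony No. 5 in C minor",
    "catalog": "Op. 67",
    "key": "C minor",
    "movements": [
        {"index": 1, "title": "Allegro con brio"},
        {"index": 2, "title": "Andante con moto"},
        {"index": 3, "title": "Scherzo: Allegro"},
        {"index": 4, "title": "Allegro - Presto"},
    ],
}

recording = {
    "work_id": "beethoven-op-67",   # a link, never a copy of the work's fields
    "conductor": "Herbert von Karajan",
    "orchestra": "Berliner Philharmoniker",
    "year": 1962,
    "tracks": [{"movement": 1, "disc_track": 1}],  # maps audio to movements
}
```

A hundred recordings of the Fifth are a hundred documents like the second one, all pointing at the single canonical work; collapsing them into one flat record is exactly the pop-metadata mistake described above.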
Cross-Field Search
Once the catalog was parsed and the audio matched, we built the search layer. Classical music search has to work across fields simultaneously in ways that pop music search never needs to consider.
A user searching "Karajan Beethoven 5" is expressing a query across three fields: conductor, composer, and work number. "Hilary Hahn Bach" is performer and composer. "Brahms D major violin" is composer, key, and instrument. The search engine had to decompose natural language queries into field-level constraints and return results ranked by match specificity.
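Decomposition can be sketched as vocabulary lookups per token. The sets below are tiny illustrative subsets, and a real decomposer also needs multi-token matching for names like "Hilary Hahn", which this sketch omits:

```python
# Per-token vocabulary lookup; the sets are illustrative subsets only.
CONDUCTORS = {"karajan", "bernstein", "mravinsky"}
COMPOSERS = {"beethoven", "bach", "brahms", "mozart", "shostakovich"}

def decompose(query):
    constraints = {}
    for tok in query.lower().split():
        if tok in CONDUCTORS:
            constraints["conductor"] = tok
        elif tok in COMPOSERS:
            constraints["composer"] = tok
        elif tok.isdigit():
            # A bare number in a classical query usually means a work number.
            constraints["work_number"] = int(tok)
    return constraints

decompose("Karajan Beethoven 5")
# -> {'conductor': 'karajan', 'composer': 'beethoven', 'work_number': 5}
```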
But the real complexity of classical music search goes deeper than combining fields. Consider what a serious classical listener actually wants to find: Shostakovich's Fifth Symphony, performed at the Concertgebouw, conducted by Bernstein, from the Cold War era — because the political context of that performance matters to them. Or they want every recording of the Brahms Violin Concerto made in Berlin before the Wall fell, because those performances carry a specific weight. Or they want to hear how Furtwängler's Beethoven Ninth from 1951 Bayreuth differs from Karajan's 1962 Berlin recording — not different tracks, but different artistic and historical events.
This is the semantics of classical music searching. It is not keyword matching. It is temporal, geographical, political, and interpretive all at once. A venue is not just a location — the Concertgebouw in Amsterdam, the Musikverein in Vienna, the Royal Albert Hall in London each have an acoustic character that shapes the recording. A conductor is not interchangeable — Bernstein's Shostakovich is a fundamentally different artistic statement than Mravinsky's. A recording date is not just metadata — a Soviet-era Leningrad Philharmonic recording of Shostakovich carries meaning that a 2015 digital recording does not, even of the same notes played in the same hall. The search system had to understand these dimensions, not as keywords but as structured relationships that could be queried, filtered, and ranked.
We indexed every field from the parsed catalog: composer, work title, opus number, catalog number, key, genre, movement titles, instrumentation, and for vocal works, the text incipits. On the recording side: conductor, orchestra, soloists, venue, label, recording year, and recording location. The search index supported Boolean combinations across all of these, with fuzzy matching on text fields, exact matching on catalog numbers, and range queries on dates — so you could search for Shostakovich performances from 1945 to 1989 and get precisely the Cold War recordings, or filter Beethoven symphonies to only recordings made in Vienna.
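The range-query side of the index behaves like a simple predicate filter over recording documents. A minimal sketch, with hypothetical field names:

```python
def filter_recordings(recordings, composer=None, year_from=None,
                      year_to=None, venue=None):
    # Each keyword argument, when given, becomes one predicate; the two
    # year arguments together form the range query.
    out = []
    for r in recordings:
        if composer is not None and r["composer"] != composer:
            continue
        if year_from is not None and r["year"] < year_from:
            continue
        if year_to is not None and r["year"] > year_to:
            continue
        if venue is not None and r["venue"] != venue:
            continue
        out.append(r)
    return out
```

With this shape, the Cold War query from above is just `filter_recordings(recs, composer="Shostakovich", year_from=1945, year_to=1989)`.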
The result was that typing "K. 467 Brendel" returned Alfred Brendel's recordings of Mozart's Piano Concerto No. 21 — not because we matched the string against a flat text field, but because the system understood that K. 467 identifies a specific Mozart piano concerto and Brendel is a pianist who recorded it. Searching "Shostakovich 5 Bernstein live" narrowed to live recordings by Bernstein of the Fifth Symphony — a query that requires understanding composer, work number, conductor, and recording type as separate structured dimensions. Every other streaming service in 2018 would have returned noise for these queries. We returned exactly what the user meant. I remember the first time it worked correctly end to end. It felt like the whole project clicked into place.
The Da Capo Catalog is out of print. Used copies sell for a few dollars. Chwialkowski never published a second edition. The book was written for music librarians and serious collectors, not for software engineers. But its obsessive completeness — every opus number, every catalog variant, every movement title, every aria in every cantata — made it, accidentally, the most valuable piece of documentation I have ever used in a software project.
Apple acquired Primephonic in August 2021. Apple Music Classical launched in March 2023, serving five million recordings with the hierarchical search architecture we built. The canonical metadata model that started with a scanned book and a Perl parser runs on every iPhone in the world now. I still think about that sometimes.
Somewhere in the acknowledgments of that 1,399-page catalog, Chwialkowski thanks the librarians and musicologists who helped him compile the data. He could not have imagined what it would become. But that is the thing about good data — when someone finally takes it seriously, it stops being a book and turns into infrastructure.
When someone finally takes data seriously, it stops being a book and turns into infrastructure.
Arindam Paul — on the Da Capo Catalog