
Behind the Scenes: How We Maintain Bond Data Quality at Scale

Bond data is messy by nature—not because systems are broken, but because the source material itself is inconsistent, varied, and complex

Oct 14, 2025 @ London by Natasha Salmi

At ClimateAligned, we're tackling data quality with unprecedented transparency and human expertise. Learn how we handle the inherent challenges of bond data through AI-assisted tools, expert judgment, and complete transparency.

I recently sat down with Leo Browning, our ML expert and senior engineer, to discuss something most data providers don't talk about: what happens when the data isn't perfect.

The reality is that bond data quality challenges aren't primarily about system failures—they stem from the source material itself being inconsistent, varied, and complex. At ClimateAligned, we're taking a different approach: using AI-assisted tools to efficiently identify discrepancies, applying human expertise to make judgment calls, and providing unprecedented transparency so you can trace our reasoning for every data point.

Leo Browning, ML expert and senior engineer at ClimateAligned

Why Bond Data Is So Tricky

Here's the fundamental challenge: we're creating semi-structured data from unstructured, complex, and highly varied sources.

Think about it: bond documentation comes in every format imaginable. Japanese bonds might report in yen while their placement amounts are in USD. Municipal bonds might aggregate reporting across multiple series. French documents need translation. Pre-issuance frameworks use different terminology than post-issuance reports. And that's just scratching the surface.

"The edge cases are inherent to the data," Leo explained. "Even if you had a perfect system, you'd still have these challenges. It's not about the AI being imperfect—though it is—it's that the source material itself is inconsistent."

When we standardise this information into a consistent format—so you can actually analyse thousands of bonds together—discrepancies emerge. Not because something went wrong, but because standardisation reveals the underlying inconsistencies that were always there in the source documents.

Understanding Data Inconsistencies

The challenges we encounter aren't always straightforward errors—they're often symptoms of underlying source data inconsistencies.

Take currency mismatches as a simple example. If we were converting currencies incorrectly, that would be purely an error on our part. But if a data provider occasionally supplies the wrong currency alongside its numbers, that's a source data inconsistency. When you move into actual bond documents, things get even more complex: you might see bonds with placement amounts in one currency while all of their reporting is done in another.

Leo said, "Japanese bonds will be issued in USD, but the reporting is all in yen. This is not an error—there's nothing wrong about that. It's just inconsistent because not everybody does that. And more than that, you don't know who doesn't do that."

This variability means you're often looking for symptoms of data issues rather than the data errors themselves.

For instance, our biggest current challenge: allocation vs. placement mismatches. We extract allocation data from post-issuance reports, but it can differ from the placement information on the bond, sometimes significantly. We currently have about 600 cases with greater-than-10x differences.

Why does this happen? It could be:

  • Currency reporting differences (like those Japanese bonds)
  • Aggregate reporting structures (municipal bonds releasing series under the same umbrella)
  • Timing differences between documents
  • Actual source data errors in CBonds or issuer reports
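
To make that concrete, here is a minimal sketch of the kind of check that surfaces these cases for review. The field names and the 10x threshold are illustrative assumptions rather than our actual schema, and the rule only flags records for a human; it never corrects anything automatically.

```python
# Illustrative only: field names and threshold are assumptions, not our real schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class BondRecord:
    isin: str
    placement_amount: float      # amount at issuance, in placement_currency
    placement_currency: str
    total_allocated: float       # summed from post-issuance reports
    reporting_currency: str

def flag_allocation_mismatch(bond: BondRecord, threshold: float = 10.0) -> Optional[str]:
    """Return a review flag if allocation and placement look inconsistent; never auto-correct."""
    if bond.placement_currency != bond.reporting_currency:
        # Not necessarily an error (e.g. USD placement, JPY reporting),
        # but a human needs to confirm how the two figures should be compared.
        return "currency mismatch: needs manual review"
    if bond.placement_amount == 0:
        return "missing placement amount: cannot compare"
    ratio = bond.total_allocated / bond.placement_amount
    if ratio > threshold or ratio < 1 / threshold:
        return f"allocation is {ratio:.1f}x placement: needs manual review"
    return None
```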

"We had a case where an entity was reporting at a very aggregate level," Leo explained, "so all these bonds have really small placement amounts, but they report everything under this aggregate umbrella. In a case like that, you might be able to do a semi-systematic correction—maybe fix 10 or 15 at once if you make a judgment call."

Each mismatch requires investigation. Sometimes our team members will spend 25 minutes hunting through documentation trying to find what actually happened for a single bond. "And sometimes," Leo said, "there's just nothing. No documentation. So then what do you do?"

Our Approach: Human Judgment, AI Efficiency

Most data quality issues can't be fixed systematically. If you could write a rule to catch and fix an error type in bulk, the fix would be relatively straightforward (though finding the pattern is the hard part). But that's not what we're dealing with.

"AI produces data that often isn't very systematic," Leo noted, "but AI tools also let us make semi-systematic corrections. Maybe not one-off, maybe not 100 at a time based on a rule—something in the middle."

Our workflow looks like this:

  1. Find the discrepancies using our data quality dashboard, which flags issues like missing placement amounts or allocation mismatches
  2. Investigate using multiple tools—our admin panel, the product itself, SQL queries, and source document review
  3. Make expert judgment calls about corrections, just like a human analyst would, but with the efficiency of AI-assisted tools
  4. Document the reasoning so anyone (including us, including you) can trace why a number looks the way it does

That last part is crucial. Anyone who's worked with bond data manually knows what happens without documentation: you spend hours marking data points, making judgment calls, and six months later when someone asks why you classified something a certain way, you honestly can't remember—you processed 200 of those that day.

Our system avoids that trap. When our AI extracts data, it stores its reasoning. When we make corrections, we adjust the reasoning. This means when you see a number in our product, you can click through to understand where it came from and what judgment calls were made.
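
As an illustration (not our actual schema), a data point that carries its own reasoning and citations might look something like this, with corrections preserving the earlier reasoning rather than overwriting it:

```python
# Illustrative shape for a data point with attached reasoning and citations.
# Field names and structure are assumptions for this sketch, not our schema.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class DataPoint:
    value: float
    source_document: str           # which document the value came from
    source_pages: list[int]        # which pages informed the extraction
    reasoning: str                 # why the extraction (or correction) says what it says
    history: list[str] = field(default_factory=list)

    def correct(self, new_value: float, new_reasoning: str) -> None:
        """Apply a manual correction without losing the original reasoning."""
        self.history.append(f"{date.today().isoformat()}: {self.value} ({self.reasoning})")
        self.value = new_value
        self.reasoning = new_reasoning
```

A correction such as point.correct(5_000_000_000, "figures reported in JPY; converted at the issuance-date rate") then replaces the value while keeping the earlier reasoning in the history.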

"If an analyst were doing this by hand, they wouldn't have the time or energy to write out the reasoning behind every decision," Leo said. "But we have it built into our system, which makes identifying and correcting errors much more efficient."

The Hierarchy of Data Priority

We're realistic about resource allocation. We can't manually verify every single bond, and we don't need to.

High priority: Major bonds that everyone holds need to be airtight. These get extra scrutiny even if nothing looks obviously wrong.

Medium priority: Systematic patterns and discrepancies that we can catch with our tools. We investigate these regularly.

Lower priority: Smaller, less-held bonds where the data is broadly correct. We're still responsive when users report issues, but we don't proactively audit every detail.

"It's a hybrid approach," Leo explained. "Find broad systematic problems, check those, manually fix them. Look at the most important bonds. And stay very receptive when people tell us something's wrong."

Why Transparency Matters

Here's what makes our approach different: we show you our work.

Traditional data providers operate as black boxes. If you find an error, you report it and wait months for a correction—if you get one at all. You have no visibility into how the data was created or what assumptions were made.

We've built transparency into every level:

  • Reasoning available for every extraction: See why our AI made specific categorisation decisions
  • Source citations: Know which document and which page numbers informed each data point
  • Methodology documentation: Understand our emissions calculations and categorisation logic
  • Rapid corrections: When something needs fixing, we can do it immediately, not in the next quarterly update

"The semi-structured nature of our data is actually a strength in edge cases," Leo said. "It gives us the flexibility to go back, look at the reasoning, and make nuanced corrections. You're not locked into rigid rules that break down when reality gets messy."

The Bottom Line

Bond data quality is challenging—and anyone who tells you otherwise hasn't spent enough time in the documents. The source material is inconsistent. The reporting varies wildly. Edge cases are inevitable.

What matters is having a process that can:

  • Find problems efficiently using AI-assisted tools
  • Fix them accurately with expert human judgment
  • Show the work so you can verify and understand the data

We're not claiming perfection. We're claiming transparency, rapid iteration, and a team that actually understands the documents we're processing.

"This is an ongoing process," Leo told me. "But I think that's the point. It's not an achievement blog post—it's about how we think about data quality as a continuous practice."

If you've ever spent hours manually correcting bond data, or waited months for a provider to fix an obvious error, or wondered why a number looked suspicious but had no way to investigate—we see you. And we're building something different.

Start here to get access to high-quality, customisable sustainability data in the financial markets.