Freeing 118 GB of W&B experiment data from a broken binary format
I spent the final weeks of 2024 migrating 5,203 deep learning
experiment logs out of the Weights &
Biases (W&B) cloud and the undocumented .wandb
binary file format so that I could more flexibly and efficiently analyse
the data in the course of writing a paper.
Getting a local copy of the data was easier than I expected, but parsing it into a usable format turned out to be much, much harder. Unbeknownst to me, the majority of my experiment logs had been written with a broken version of the W&B SDK that subtly corrupted the binary log files as they were written, meaning the parsers from the SDK couldn’t read them back later.
In the course of recovering 118 GB of data rendered inaccessible by this corruption, I reverse engineered the format, wrote a custom error-recovering parser, and rediscovered the historical bug in the SDK. After days spent reading, debugging, and writing parser code, I had a much better understanding of how and what data W&B stores during experiments, and a set of tools that would serve me well for offline analysis in the future.
This is that story.
- Background
- Accessing the data
- Anatomy of a W&B run folder
- Building an index
- Finding the metrics
- Anatomy of a .wandb file
- Hacking together a parser from SDK internals
- Status: most of my data was unreadable
- Hunting for the source of the formatting errors
- Reviewing the SDK code for logic errors
- Building an error-recovering parser
- Status: most of my data was… still unreadable
- Finally discovering the bug
- Status: Data recovered!
- Extracting and reformatting the metric logs
- Conclusion
§Background
I’m a deep learning scientist. I train thousands of artificial neural networks, large and small, to solve statistical problems or play video games. Then I try to discern and explain the natural laws underpinning what and how these neural networks learn.
Each training run is its own scientific experiment. I use the W&B Python SDK to log various metrics throughout the course of training. This makes it easy to watch the networks evolve over training using W&B’s real-time web dashboard.
When it comes time to look for patterns, I need to run analyses or visualisation code over large sets of metric logs from previously completed experiments. Unfortunately, W&B is not so well-suited for this step, giving rise to a few problems:
- Speed: The web app slows to a crawl when exploring a large number of metrics from a large number of experiments.
- Organisation: The web app makes it difficult to flexibly group a large collection of experiments into sub-groups.
- Search: The web app struggles to accommodate complex search queries over extremely large collections of experiments.
- Visualisation: It’s difficult or impossible to perform complex data analysis or visualisations over multiple experiments (they do also offer an API, but it’s quirky and only partially documented).
- Storage: It’s easy to exceed the free tier’s 100 GB cloud storage limit, with no streamlined tools for compressing or filtering the data stored (and while the free tier is generous, the rates for extra storage are steep).
To be clear, this is all reasonable for a web app. Any cloud-based solution is going to face fundamental network and storage bottlenecks, making it impossible to compete with hosting my data locally with full control over the format and the ability to analyse my data with arbitrary programs.
So, late last year, with each of the above issues growing increasingly frustrating after my latest collection of training runs had sprawled to a total of 5,203, I decided to devote some of my holidays to investing in the tooling I’d need to free my data from the W&B cloud.
§Accessing the data
I had anticipated that downloading all of the data for my 5,203 experiments from the W&B cloud might be difficult. However, as it turned out, I didn’t need to download anything from the W&B cloud at all!
By default, the W&B SDK locally caches every piece of data it logs about an experiment as it sends the data to the cloud. This way, it can recover from network issues without data loss. It stores all of the files associated with a particular experiment in a local ‘run folder’ created when the experiment is launched.
Moreover, these locally cached files are never automatically deleted, even after the data has been synced to the W&B cloud. So, I still had months of run folders lying around on the compute cluster from the start of my research project (taking up approximately 600 GB of space in total).
Getting these files from the cluster to my laptop hard drive was easy:
- First, I deleted all of the image files I had logged with each experiment. These images were very useful when monitoring individual experiments, but not useful during bulk analysis, and I decided I preferred to free up the 427 GB of space they collectively consumed. (This was an example of a storage management task that was impossible via the web app but trivial with shell access: rm -rvI wandb/run-*/files/media/.)
- Then, I tarchived and gzipped each run folder. This reduced the total number of files, which would make transferring easier, and compressing the files took the collection from 173 GB to 30 GB (the high compression factor here was an early sign of redundancy in W&B’s choices of storage format).
- Finally, I quickly and easily transferred these compressed files to my laptop using rsync.
With my run collection slashed from 600 GB on the cluster to 30 GB on my laptop, this was the end of my storage concerns. I deleted these runs from the W&B cloud, earning my way back into the free tier, and proceeded to dive into extracting my data from these run folders…
§Anatomy of a W&B run folder
Here is an example run folder:
run-20240921_210034-vk2qbwd4/
├── files/
│ ├── config.yaml
│ ├── wandb-metadata.json
│ ├── requirements.txt
│ ├── output.log
│ ├── wandb-summary.json
│ └── media/... (I deleted this earlier)
├── logs/
│ ├── debug-core.log
│ ├── debug-internal.log
│ └── debug.log
└── run-vk2qbwd4.wandb
Not all of the run directories had all of these folders and files present; in some cases, files were missing or empty. I later figured out that this normally indicated a run that had crashed shortly after spawning due to, for example, bugs in my own programs or issues with the compute cluster.
For the remaining runs, I assumed that all of the information I needed to filter, analyse, and visualise the valid parts of my experimental data was in here, somewhere! I just needed to figure out where and in which format so that I could parse it as input to my analysis and visualisation scripts.
§Building an index
For starters, which experiment was this?
The folder name run-20240921_210034-vk2qbwd4 revealed
that this run folder was from an experiment I ran on September 21st,
2024, and that W&B gave this run the unique ‘run id’
vk2qbwd4.
This was hardly enough for me to meaningfully decide how I wanted to analyse this experiment. For that, I would need the experiment hyperparameters and metadata (what version of the code I used, what kind of network I trained, etc.).
It turns out that this data was spread across two files in the run
folder: config.yaml and wandb-metadata.json.
Loading these files from the compressed tarchives and parsing them into
memory in order to run queries took about 10 minutes. This was
unworkable, but there was no reason I had to stick to the compressed
tarchive format.
Instead, I loaded and parsed the metadata files once and for all and dumped the resulting dictionary to a 20 MB pickle file that I could load back into memory in under a second. Now, I could essentially instantly find any experiment from my entire collection, querying and filtering with the full expressive power of Python.
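For the curious, the index build amounted to a loop along the lines of this minimal sketch (it assumes the run folders have already been extracted from their tarchives, and the query at the end is a made-up example):

import json
import pickle
from pathlib import Path

import yaml

index = {}
for run_dir in Path("wandb").glob("run-*"):
    run_id = run_dir.name.split("-")[-1]              # e.g. 'vk2qbwd4'
    entry = {"path": str(run_dir)}
    config_path = run_dir / "files" / "config.yaml"
    meta_path = run_dir / "files" / "wandb-metadata.json"
    if config_path.exists():
        entry["config"] = yaml.safe_load(config_path.read_text())
    if meta_path.exists():
        entry["metadata"] = json.loads(meta_path.read_text())
    index[run_id] = entry

with open("index.pkl", "wb") as f:
    pickle.dump(index, f)

# Later: load the index back in well under a second and filter with plain Python.
with open("index.pkl", "rb") as f:
    index = pickle.load(f)
runs = [rid for rid, e in index.items() if "seed" in e.get("config", {})]  # made-up query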
This 20 MB index solved the organisation and search problems and the associated part of the speed problem. There only remained the task of efficiently analysing and visualising the time series of metrics from each experiment.
That was when things started to get more difficult.
§Finding the metrics
Where were the actual time series stored?
Since most of the files were stored in human-readable text formats, I
was able to eliminate them as possibilities, leaving only the 31 MB
binary file run-vk2qbwd4.wandb. I verified this with a
naive inspection of the contents of the file in my editor. Here is an
indicative excerpt:
n<82><01><01>0
1<12><19>ued/layout/prop_walls_avg<82><01><13>0.25924229621887207
/<12><19>perf/env_steps_per_second<82><01><11>636296.7653832459
)<12>#eval-fixed-standard-maze/avg_return<82><01><01>0
&<12><1b>step/env-step-replay-before<82><01><06>983040
)<12><1a>ued/distances/solvable_avg<82><01>
0.98828125
%<12><0c>ppo/avg_loss<82><01><14>0.004299817606806755
0<12><17>ppo/max/max_critic_loss<82><01><14>0.017197374254465103
<<12>6eval-fixed-standard-maze-3/proxy_proxy_pile/avg_return<82><01><01>0
&<12><1a>step/env-step-replay-after<82><01><07>1015808
4<12>.train-all/proxy_first_pile/avg_reward_per_step<82><01><01>0
4<12><1b>ppo/max/max_actor_approxkl3<82><01><14>0.002542640548199415
This inspection revealed that there were actually some plain-text
segments, including the metric names and values I was ultimately looking
for! For example, in the above excerpt you can see that, at some point,
the metric “ued/layout/prop_walls_avg” was logged with the value
0.25924229621887207, and the metric “ppo/avg_loss” was logged with the
value 0.004299817606806755. However, these were interspersed among
various non-printable characters (shown as <xx>)
acting as some kind of delimiters.
While some parts of the file were almost interpretable like this, the whole file wasn’t simply a table of metrics. Long stretches of the file contained larger chunks of non-printable characters, as well as other strings that were recognisably related to the experiment but mostly unrelated to metrics.
I hoped that, in whatever format I was seeing my experiment data encoded, it might be a standard binary file format I could parse with an existing Python library. Unfortunately, with a little research, I discovered that it was basically the opposite: a non-publicly-documented, bespoke format that not even the W&B SDK itself publicly exposed a parser for.
I would soon come to understand the format in intimate detail. Let me now explain it as context for the steps to come.
§Anatomy of a .wandb file
At a high level, a .wandb file is an append-only
sequence of ‘records’. At various points during the run, whenever the
SDK decides to log some information to the cloud, it also writes it as a
new record to this file. At a low level, the format used to serialise
these records is bespoke, but assembled from reasonably standard or
simple components:
LevelDB-like robust log format: As revealed by a comment in the SDK source code, the overall structure of the file follows a variation of the LevelDB log format.
The LevelDB log format structures an append-only log as a sequence of 32 KiB ‘blocks’. Each block contains a sequence of ‘chunks’, each comprising a record (or record fragment) wrapped in a 7-byte header storing its size and a checksum. If a record would straddle the boundary of one of the 32 KiB blocks, it’s instead broken down and stored across a sequence of chunks in consecutive blocks. In principle, between the chunk-level checksums and the reliable, regular placement of block boundaries, this format allows detecting corrupt chunks and recovering subsequent records by skipping to the next block boundary.
In W&B’s case, there are some small differences from the LevelDB log format. Namely, they use a different checksum algorithm, and they include an additional 7-byte file header at the beginning of the first block, so a standard LevelDB log reader tool won’t be able to parse these files.
W&B Protobuf record format: The contents of the records themselves are binary data serialised with protobuf using a custom W&B schema. The schema includes different record types for the various kinds of logging events used by the SDK. Here are some examples:
First and foremost is the ‘history’ record type. A history record is created every time the experiment requested a set of metrics to be logged. This was the main record type I was interested in, as the sequence of these metric sets from throughout the run represented the time series I ultimately wanted to analyse and plot.
The protobuf format for these records was a list of key/value pairs, where the keys and values were serialised as JSON literals. This is why these keys and values were ultimately visible in the binary file itself, separated by protobuf delimiters (and the occasional LevelDB chunk header at a block boundary).
Interestingly, the same information was also effectively contained in another kind of record called a ‘summary’ record. The W&B summary refers to the latest value logged for each experimental metric. Every time this value changes, the SDK logs a ‘summary’ record describing the update, including the old and new latest value of the metric.
I noticed that this meant that almost every metric value was actually logged at least three times throughout the file (once in a history record, once as the new value in the next summary update record, and once as the old value in a subsequent summary update record).
There are a few record types that contain data already available from the other files in the run folder. For example, near the start of the log are ‘config’ records and ‘run’ records that store much the same information about the details of the run as I previously extracted from config.yaml and wandb-metadata.json for my index. Likewise, there are ‘output’ and ‘output_raw’ records (I’m still not sure of the difference), which the SDK creates whenever a line is printed to stdout or stderr. The same information, sans per-line timestamps, is stored in output.log in the run folder.
There were various other record types, including ‘stats’ records containing periodic snapshots of CPU/GPU/memory statistics, ‘telemetry’ records reporting environment analytics, and some other record types less interesting from the perspective of this story.
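To make the history records concrete, converting one of them into an ordinary dictionary of metrics looks roughly like this once the record is deserialised (the field names key, value_json, and history.item follow my reading of the W&B protobuf schema; treat them as assumptions):

import json

def history_to_dict(record):
    # Each history item stores the metric name as a plain string and the value
    # as a JSON literal, which is why both were visible in the raw bytes above.
    return {item.key: json.loads(item.value_json) for item in record.history.item}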
This format wasn’t described in the W&B documentation. I would
have to gradually build the above understanding by looking at the source
code and doing my own research on the LevelDB log format over the course
of extracting the data from within the thousands of .wandb
files in my collection.
Let us now resume that story.
§Hacking together a parser from SDK internals
While the W&B SDK does not expose a parser as part of its public interface, it does at least contain code for reading the format. This made sense: the SDK presumably needs to read these files when syncing data to the cloud after network issues.
Therefore, since the W&B SDK is free and open software, as long as I was up for a little source diving and comfortable working with undocumented and potentially unstable SDK internals, I was free to hack together my own reader script.
I didn’t have to start from scratch. As you might imagine, people had tried to access the contents of this file before, and I was able to find this GitHub issue, where some users had proposed partial scripts that loaded the relevant internal parts of the SDK and used them to parse the file:
- To iterate through the chunks of the LevelDB-like log, I could load a wandb.sdk.internal.datastore.DataStore object and point it at my data path, then repeatedly call the .scan_data() method.
- This method would return successive serialised bytestrings for protobuf records, which I could deserialise by instantiating a wandb.proto.wandb_internal_pb2.Record object and using its .ParseFromString() method (see the sketch below).
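Putting those two pieces together gives a sketch like the following (the method names come from the SDK internals referenced in that issue and may differ between SDK versions; open_for_scan in particular is an internal detail, not a public API):

from wandb.sdk.internal import datastore
from wandb.proto import wandb_internal_pb2

def read_records(path):
    ds = datastore.DataStore()
    ds.open_for_scan(path)        # internal API: open the .wandb file for reading
    while True:
        data = ds.scan_data()     # next record's serialised bytes (assumed None at end of file)
        if data is None:
            break
        record = wandb_internal_pb2.Record()
        record.ParseFromString(data)
        yield record

for record in read_records("run-vk2qbwd4.wandb"):
    if record.HasField("history"):
        ...                       # collect the metrics from this history record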
From this starting point, I developed my own script (gist) that was able to extract the sequence of history records for my example run.
Success!
I set about running this hacked-together parser on the remaining
5,202 .wandb files.
§Status: most of my data was unreadable
To my dismay, the majority of my experiments (exactly 3,428 of the 5,203 runs, or about 66%) crashed the parser. Specifically, the
.scan_data() method failed assertions or crashed with other
runtime errors indicating that unexpected formatting violations were
being encountered.
As I’d later have the tools to confirm, the crashes were happening near the beginning of these files, with 118 GB out of the total uncompressed 166 GB of .wandb files, or around 71%, rendered unreadable due to these errors.
Uh oh!
§Hunting for the source of the formatting errors
I started generating and testing hypotheses as to possible sources of the errors.
Could it be unnatural termination? It was fairly common for experiments to crash, time out, or be terminated early. I had thought W&B would do its best to clean up the log in such cases, but maybe it had failed! Still, there were too many affected runs for this to be the explanation.
What about natural corruption? No way: how could so many files have spontaneously become corrupted without me losing the whole file system? Then again, maybe there was some chance I had systematically damaged the files while preparing to migrate them from the cluster…
At this point, I noticed that one specific example error message was caused by the parser expecting the LevelDB-like log block to have a zero-padded buffer to the next block boundary (since there wasn’t space for a new chunk header), but there were non-zero bytes instead. I had the idea to patch the SDK code to not skip to the next block boundary, and it turned out the non-zero bytes comprised a real chunk header, with a valid checksum and all!
The fact that the reader expected padding where the file had a valid chunk header suggested something systematic was going wrong, with only two possibilities: either there was a logic error in the reader code I was hooking into to parse the files, or there was a logic error in the writing code that had been used to generate these files in the first place.
§Reviewing the SDK code for logic errors
I dug around the DataStore class source code for hours looking for the kind of error in managing the file pointer, during reading or writing, that could mistime block boundaries like this.
On the one hand, there had to be a bug here! This part of the code was an intricate parser/generator implementation with lots of complex state management to get right in every method. It seemed like exactly the kind of thing a programmer could easily get wrong, causing my problem.
On the other hand, how could there be such an obviously destructive bug here? This was code that had surely been carefully tested and had been running in production without major issues for years.
I finally concluded that there was not a single logic error in either the reader or the writer code. Every method had intricately advanced the object state in exactly the right manner.
Wait! This was not necessarily the right codebase to be reviewing!
W&B had recently replaced their SDK’s Python backend with a
faster Golang implementation. This new backend had its own LevelDB-like
log reader/writer code. I remembered opting into this faster backend
part-way through the research project. Many of my .wandb
files had probably been written with the writer from the Golang backend,
not the Python backend I had just reviewed.
If my .wandb files had been written with the Golang
writer, that’s where any bug responsible for persistent formatting
issues would be. So I started my review again from the top of the Golang
writer source.
On the one hand, this had to be the explanation! New code, less battle-tested, structured differently from the Python code but just as intricate and delicate as before.
On the other hand, this module was clearly marked as adapted directly from the LevelDB Go SDK itself, which was certainly battle-tested! How could the W&B developers have introduced a bug while making the minimal changes required to switch from the LevelDB format proper to their variant?
Unfortunately, while it took me longer than reviewing the Python code since I’m not fluent in Go, I eventually managed to make rough sense of the writer, and I was again forced to conclude there were no issues.
At a loss, I moved on to a different approach…
§Building an error-recovering parser
If I couldn’t figure out the source of the parser crashes, maybe I didn’t need to?
I had become more familiar with the LevelDB log format and now understood the idea that parsing could resume after corruption by skipping to the next block boundary. The Python parser from the SDK didn’t offer this feature, but it seemed simple enough to implement the recovery protocol myself. And maybe, just maybe, whatever formatting errors affected my log files would be contained locally to small parts of the files, allowing me to recover enough of my data to analyse it.
So at this point, I took a break from reviewing parser code and decided to build my own error-recovering parser for W&B’s LevelDB-like log format.
I wanted to make sure I didn’t introduce any more errors myself, so I avoided a design that required managing complex state across a large range of methods. Instead, I structured the parser in a more functional style, as a series of transducers: functions transforming a stream of objects at one conceptual level of abstraction into a stream of objects at the next. In particular:
One function strips the W&B header and breaks the entire binary file into a stream of 32 KiB ‘blocks’.
One function mines each block for the chunks it contains, validating the chunk header checksums.
The next function takes the stream of chunks and extracts their internal data, using a stateful loop to reassemble records that were split across block boundaries from their constituent chunks.
The next function takes this stream of records and parses their contents with protobuf, deserialising the actual data we came looking for.
Compared to the W&B SDK’s approach of maintaining a complex ‘state machine’ object, this way of factorising the logic into stream transformations seemed simpler to me. It certainly paid off when it came time to implement error recovery: recovering from errors at each level (bad chunks, bad multi-chunk records, etc.) was fairly straightforward, requiring only modifications confined within each stream transformation function, as the sketch below suggests.
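Here is a trimmed-down sketch of the pipeline (not the published tool itself). The chunk header layout and chunk type codes follow the standard LevelDB log format, which I’m assuming W&B kept, and checksum validation is elided since W&B’s checksum algorithm differs from LevelDB’s:

import struct

BLOCK_SIZE = 32 * 1024
CHUNK_HEADER = 7                        # 4-byte checksum, 2-byte length, 1-byte type
FILE_HEADER = 7                         # W&B's extra header at the start of the first block
FULL, FIRST, MIDDLE, LAST = 1, 2, 3, 4  # standard LevelDB chunk type codes

def blocks(path):
    # Layer 1: strip the W&B file header and yield successive 32 KiB blocks.
    with open(path, "rb") as f:
        first = f.read(BLOCK_SIZE)
        if first:
            yield first[FILE_HEADER:]
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                return
            yield block

def chunks(block_stream):
    # Layer 2: mine each block for chunks (checksum validation elided).
    for block in block_stream:
        pos = 0
        while pos + CHUNK_HEADER <= len(block):
            _checksum, length, ctype = struct.unpack_from("<IHB", block, pos)
            if length == 0 and ctype == 0:
                break                   # zero padding up to the block boundary
            data = block[pos + CHUNK_HEADER : pos + CHUNK_HEADER + length]
            if len(data) < length:
                break                   # malformed chunk: skip to the next block
            yield ctype, data
            pos += CHUNK_HEADER + length

def records(chunk_stream):
    # Layer 3: reassemble records that were split across block boundaries.
    parts = []
    for ctype, data in chunk_stream:
        if ctype == FULL:
            yield data
        elif ctype == FIRST:
            parts = [data]
        elif ctype == MIDDLE:
            parts.append(data)
        elif ctype == LAST:
            parts.append(data)
            yield b"".join(parts)
            parts = []

# Layer 4 parses each record's bytes with the protobuf schema, as before.
# Error recovery then amounts to catching problems inside one layer and skipping
# ahead (to the next block, chunk, or record) without disturbing the other layers.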
Yay, functional programming principles!
After testing against the example file I knew should parse correctly, and ironing out a few errors that had nevertheless crept into my version, I was finally ready to try error recovery on my entire collection.
§Status: most of my data was… still unreadable
Compared to the SDK parser, I was able to recover an additional… 515 MB of data using this approach.
Ouch!
Seemed like the formatting errors were generally not as isolated as I had hoped, and my implementation efforts had barely moved the needle on how much of my experimental data I could access for analysis.
§Finally discovering the bug
My only option was to return to trying to get to the bottom of the mysterious errors.
Fortunately, my work building my own parser was not in vain. I
noticed that the Python SDK and my parser were not throwing errors at
exactly the same parts of the broken .wandb files. This led
me to identify a general corner case to which the parsers responded
differently. Specifically, if there was a chunk that straddled one of
the 32 KiB block boundaries:
The SDK parser would happily read that chunk. It would then continue parsing until it came within the chunk-header-length of a block boundary (at which point, the format insists you should zero-pad until the start of the next block).
My parser was incapable of reading such a chunk. It would have been sliced into two by the bytes-to-blocks transducer, and attempting to parse the chunk would lead to an error because the length in the chunk header would not match up with the number of bytes left in the current block.
In other words, my parser was actually being more strict about the format than the Python SDK’s parser. This allowed me to see that all of the disparate formatting errors that had caused crashes in the Python SDK parser in my broken files were downstream of one common formatting issue that my new parser was detecting:
Near the end of every block, chunks were systematically sized to fit into the block as if the block boundary was supposed to fall exactly 7 bytes later than a multiple of 32 KiB from the start of the file.
For the SDK’s permissive parser, this hadn’t caused an issue for most blocks, until it happened to make precisely the kind of difference that led to the errors I had seen earlier. For example, the shifted boundary would suggest there was just enough space to write the start of a new chunk, where the true block boundary would say it was time to stop writing and zero-pad to the boundary instead (hence the intelligible ‘corruption’ from earlier).
Each block boundary offset by exactly 7 bytes…
I returned to the SDK to look one more time for the counting bug I knew must be there somewhere. I still could see no issues in either the Golang or Python writers that could have caused this alignment issue. Besides, on priors, the Python SDK had been in production for years, and the Golang writer had been a straightforward port of LevelDB’s own battle-tested implementation.
Okay, I tried to replicate the error in a fresh environment to see if I could catch the Python or Golang writers in the act of making a wrong decision about how to break chunks near block boundaries. The issue didn’t replicate. The newly generated logs parsed without errors! The SDK code was solid!
Seven bytes…
That was it! The port of the Golang LevelDB log writer to W&B’s variant format! One of the only differences between the formats was the 7-byte file header at the start of the file. The Golang writer must not have included the header as part of the first block! That would cause this exact issue to affect all blocks, and it would be an easy error to make while executing a deceptively simple port.
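Purely as an illustration of this hypothesis (not the actual SDK code), the two writers would disagree on a calculation like this:

BLOCK_SIZE = 32 * 1024
FILE_HEADER = 7

def space_left_correct(file_offset):
    # Block boundaries fall at multiples of 32 KiB from the start of the file;
    # the 7-byte file header counts towards the first block.
    return BLOCK_SIZE - (file_offset % BLOCK_SIZE)

def space_left_buggy(file_offset):
    # A writer that starts counting after the file header computes every
    # boundary 7 bytes later than the true one.
    return BLOCK_SIZE - ((file_offset - FILE_HEADER) % BLOCK_SIZE)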
The Golang writer I had reviewed did not have this issue, but now I
knew the bug must have been there at some point, so I checked
git blame. Indeed, I found a set of four-month-old commits
and an associated
PR that described and fixed exactly the bug I was looking for:
changing the writer object to start counting the number of bytes it had written from 7 rather than 0, to account for the 7-byte file
header that had already been written before the writer was
initialised.
Unfortunately for me, I had installed this broken version of the Golang writer in my virtual environment on the cluster during the brief window that it was public before being fixed, and hadn’t updated it since then. All of the runs generated from that point onwards had been affected, even though the bug had long been (silently) noticed and patched by the devs.
§Status: Data recovered!
It was a simple matter to add an option to my stream-based parser to exclude the file header from the first 32 KiB block.
I then re-parsed my entire collection of .wandb files. Since some files had been written with a working writer before I switched on the Golang backend, I parsed each file both with and without the option, taking the version with fewer total corrupt bytes as the canonical parsed version.
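The selection logic was essentially the following, where parse_file and its corrupt-byte accounting are hypothetical stand-ins for my actual tool’s interface:

def best_parse(path):
    # Try both interpretations of the first block and keep whichever one
    # leaves fewer unreadable bytes.
    with_header = parse_file(path, file_header_in_first_block=True)      # correct writer
    without_header = parse_file(path, file_header_in_first_block=False)  # buggy Golang writer
    return min(with_header, without_header, key=lambda result: result.corrupt_bytes)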
As a result, over the entirety of my collection of 5,203 runs with a
total of almost 166 GB of .wandb files, only 2.66
MB of data was truly irrecoverable (plausibly due to improperly terminated runs). I didn’t care; I had enough.
§Extracting and reformatting the metric logs
We at last come to the final chapter of this story.
Parsing the entire collection took a couple of hours, so I proceeded
to dump the resulting history records from these .wandb
files in a format that was not so redundant:
Storing each metric key once as a column header instead of in every row.
Storing the metric values as binary floating point numbers rather than JSON strings.
Not storing the many unrelated fields, especially the redundant mentions of all of the history record entries in their accompanying summary records.
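The dump itself was nothing fancy; roughly along these lines (a sketch only: histories stands for the list of per-step metric dictionaries extracted from a run’s history records, and the exact on-disk layout of my pickle files differs in detail):

import pickle

import numpy as np

def dump_run_history(histories, out_path):
    # Store each metric key once, as the key of a column, and the values as a
    # binary float array rather than JSON strings (assumes numeric metrics).
    keys = sorted({k for h in histories for k in h})
    table = {k: np.array([h.get(k, np.nan) for h in histories], dtype=np.float64)
             for k in keys}
    with open(out_path, "wb") as f:
        pickle.dump(table, f)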
The resulting collection of thousands of pickle files still sits on my laptop, taking up less than 7 GB, even after adding data from another 3,000 experiments conducted in the following months.
Loading data into memory from a specific subset of a few hundred runs, identified through the index I built earlier, now takes less than a minute.
At that point, my experimental data was free from the W&B cloud!
§Conclusion
I packaged my stream-transforming, error-recovering
.wandb file parser and published it on GitHub. It seems
acceptably stable, since it depends only on the protobuf schema from the
W&B SDK, and not on the SDK’s internal LevelDB-like log reader code. It will probably remain useful to me, if not to others,
in managing future W&B experiment collections, especially
collections with data corruption or collections generated by the faulty
writer from that specific version of the SDK.
At the time, I was upset at the W&B devs for shipping me a broken SDK and causing me all of this trouble. When they noticed and fixed the problem, they could have not only fixed it but also recognised that it could affect people trying to sync offline run logs created with the faulty writer. The least they could have done would have been to mention the problem in the release notes or a GitHub issue, instead of just noting it in the PR (there is actually an issue referenced from the PR, but the link is broken or not public for some reason). Plausibly, they could also have taken some small steps to make the issue and a fix substantially more discoverable, like shipping a new version of the reader that gives an intelligible error message when it tries to parse a .wandb file with this kind of problem.
On the other hand, I had enabled the Golang backend at a relatively early stage of its development, soon after it was advertised. In doing so, I vaguely remember having to acknowledge at some point that I understood the backend was still in development and might not be fully stable. That’s on me; I should learn the lesson and be a bit more cautious with my experimental data.
Another thing I’ve thought about: how could I have found the bug faster?
- If I had stepped through the parser by hand on an example broken file to see if I could understand the error, I probably would have noticed the silent off-by-7 error almost immediately, and that would have helped me find the issue.
- If I had been more thorough in checking the git blame initially, I might have found the PR sooner. More generally, if I am going to review code looking for a bug, I should take special care to ensure that it’s the same code I’m actually running that is giving rise to the bug.
These are good things for me to keep in mind for next time. However, overall, the time wasn’t fully wasted. Eventually finding the bug after a long frustrating search was quite satisfying. Moreover, I got to learn a lot about the LevelDB log format for robust record logging, and have some fun implementing a stream-transforming parser.
Finally, I did eventually fulfil my objective of building the capacity I need to index and analyse my experiment data outside of W&B’s cloud, which will be useful for future projects too.