The Unseen Hurdle of RAG: Parsing Corrupt Outlook Files
My goal for the day was simple: get a proof-of-concept running for a RAG project on my own emails. I wanted to query years of correspondence to find achievements and interesting threads. But instead of working with LLMs, I spent the entire day just trying to get the data out. It turns out the first step is often the hardest.
I hit a wall trying to extract and parse my old Outlook .ost and .pst files. It was a journey through obscure libraries, environment issues, and corrupted data.
- The .pst and .ost nightmare: These file formats are a real pain. The libraries for handling them, like pypff, feel a bit dated and require a specific setup. There aren’t many modern alternatives, so you work with what you’ve got.
- Environment shenanigans: The extraction process required a Linux environment. Of course, my WSL installation decided to stop working today, forcing me to go through a full reinstall just to get started.
- Corruption is everywhere: My main .ost file turned out to be corrupt. The extraction script not only failed but triggered a blue screen of death. My smaller .pst archive also needed repairs with the classic scanpst.exe tool.
- Pivoting to the archive: Because the live .ost file was so problematic (and larger), I shifted my focus to trying to repair and extract my archived .pst file. It contains incredibly important emails, so I’m hoping the repair process works.
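For anyone attempting the same thing, a corruption-tolerant extraction loop with pypff might look roughly like the sketch below. It is an illustration under assumptions, not the script from that day: the helper names (safe_call, walk_folder, iter_pst_messages) are hypothetical, and the code assumes the pypff bindings expose the usual get_* accessors on folders and messages. The idea is to skip any item that raises instead of letting one damaged message abort the whole run.

```python
from typing import Any, Dict, Iterator


def safe_call(obj: Any, method_name: str) -> Any:
    """Call a zero-argument method; return None if it is missing or raises."""
    try:
        return getattr(obj, method_name)()
    except Exception:
        return None


def walk_folder(folder: Any, path: str = "") -> Iterator[Dict[str, Any]]:
    """Yield one dict per message under `folder`, skipping unreadable items."""
    name = safe_call(folder, "get_name") or ""
    current = f"{path}/{name}" if name else path

    for i in range(safe_call(folder, "get_number_of_sub_messages") or 0):
        try:
            message = folder.get_sub_message(i)
        except Exception:
            continue  # damaged item: skip it rather than abort the run
        yield {
            "folder": current,
            "subject": safe_call(message, "get_subject"),
            "sender": safe_call(message, "get_sender_name"),
            # Assumption: the body comes back as bytes (or None) and is
            # decoded later, once the encoding is known.
            "body": safe_call(message, "get_plain_text_body"),
        }

    for i in range(safe_call(folder, "get_number_of_sub_folders") or 0):
        try:
            sub_folder = folder.get_sub_folder(i)
        except Exception:
            continue  # unreadable sub-folder: skip the whole branch
        yield from walk_folder(sub_folder, current)


def iter_pst_messages(pst_path: str) -> Iterator[Dict[str, Any]]:
    """Open a .pst file with pypff and walk every folder in it."""
    import pypff  # imported lazily so walk_folder can be tested without it

    pst = pypff.file()
    pst.open(pst_path)
    try:
        yield from walk_folder(pst.get_root_folder())
    finally:
        pst.close()
```

Usage would be something like `for record in iter_pst_messages("archive.pst"): ...`, where the path is a placeholder. One caveat: try/except only catches errors the library reports as Python exceptions. A crash inside native code, or a blue screen at the OS level, is out of its reach, so running the extraction against a repaired copy of the file is still the safer route.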
Today was a classic reminder that in any data project, the “glamorous” AI part is tiny compared to the “boring” work of data extraction and cleaning. Before you can build a smart search, you have to wrestle your data out of its proprietary prison. On a small positive note, I did make some progress on my RTL Chrome extension, so the day wasn’t a total loss.
Next: Hopefully, get a clean export from the repaired .pst file so I can finally start indexing.
Tags: RAG, Data Engineering, Python, Outlook, WSL, Learning in Public