Taming the Colab beast for a RAG pipeline
My “Rav Oury Cherki RAG pipeline” project is all about making his teachings accessible. This past week, I spent most of my time wrestling with environments and getting the data ingestion solid. The initial RAG idea was easy, but then the real work began: making it runnable in Google Colab.
- Colab environment was a nightmare: I had to pin `lightning-fabric` to a specific version, deal with some weird partial `numpy` swaps after an install cascade, and switch to the `%pip` magic because regular `pip` was causing issues. Then came the "runtime shims" for things like `torchaudio.info` and `numpy.NAN`. It felt like I was constantly patching holes in a leaky boat.
- Dependency hell and `pyannote`: The `numpy` issues were particularly frustrating. Getting `pyannote` (for diarization) to work consistently across different Colab runtimes and local `venv` setups was a constant struggle. I spent a lot of time digging into `venv` theory just to understand why things were breaking.
- Data ingestion wins: Despite the environment fight, I built playlist recovery to grab content, a WhatsApp chat parser (which was surprisingly tricky), and name-variation search. A batch diarize notebook is now working, and getting a GDrive downloader working was a big win for source material. I also added candidate review tools and fixed `needs_transcription` regeneration.
- Hacky solutions: I even had to bypass SSL certificate verification for some URL shorteners just to follow redirects. Not ideal, but necessary to get the data.
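To make the "runtime shims" concrete: NumPy 2.0 removed old aliases like `numpy.NAN`, which breaks libraries that still reference them. A minimal sketch of the kind of shim I mean (the function name is mine, not from the project):

```python
import numpy as np

def shim_numpy_nan():
    """Restore the numpy.NAN alias (removed in NumPy 2.0) for older libraries.

    NAN was always just an alias for nan, so re-adding it is harmless
    on any NumPy version; on pre-2.0 installs this is a no-op.
    """
    if not hasattr(np, "NAN"):
        np.NAN = np.nan

shim_numpy_nan()
```

Run early in the notebook, before importing anything that expects the old alias.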
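A good chunk of the `venv` theory boils down to one question: which interpreter and `site-packages` am I actually running against? A quick check I found useful (a generic diagnostic, not project code):

```python
import sys

def in_venv() -> bool:
    """True when running inside a virtual environment.

    Inside a venv, sys.prefix points at the environment directory while
    sys.base_prefix still points at the base interpreter; outside, they match.
    """
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)

# Printing both prefixes quickly reveals which site-packages a notebook
# or shell is really importing from.
print(sys.prefix, sys.base_prefix, in_venv())
```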
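Part of what makes WhatsApp exports tricky is that messages can span multiple lines and the timestamp format varies by locale. A rough sketch of the parsing approach, with a regex that assumes one common export format (adjust for your locale):

```python
import re

# Matches lines like "1/2/24, 9:15 PM - Alice: hello"; locale-dependent.
LINE_RE = re.compile(
    r"^(?P<ts>\d{1,2}/\d{1,2}/\d{2,4}, \d{1,2}:\d{2}(?:\s?[AP]M)?) - "
    r"(?P<sender>[^:]+): (?P<text>.*)$"
)

def parse_chat(lines):
    """Yield [timestamp, sender, text] triples from an exported chat.

    Lines that don't match the header pattern are continuations of the
    previous message and get appended to its text.
    """
    current = None
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            if current:
                yield current
            current = [m["ts"], m["sender"], m["text"]]
        elif current:
            current[2] += "\n" + line
    if current:
        yield current
```

Usage: `list(parse_chat(open("chat.txt", encoding="utf-8")))` gives one entry per message, multiline bodies included.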
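For the SSL bypass, the pattern is an unverified `ssl` context passed to the opener, so redirects get followed even when a shortener's certificate chain is broken. A hedged sketch using only the stdlib (function name is mine; only defensible because we want the final URL, not trusted content):

```python
import ssl
import urllib.request

def resolve_short_url(url: str) -> str:
    """Follow a URL shortener's redirects without verifying certificates.

    WARNING: disabling verification is insecure in general; acceptable
    here only because we just need the destination URL.
    """
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(url, context=ctx) as resp:
        return resp.geturl()  # urlopen follows redirects automatically
```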
When I finally got a “successful E2E run,” it was a huge relief. All those little fixes, the constant debugging, the moments of wanting to throw my laptop across the room… it paid off. It really hammered home how much time environment setup can suck up, even on seemingly simple projects. The “glamorous” AI stuff often sits on a foundation of gritty, low-level engineering. I learned a ton about dependency management and the quirks of cloud notebooks.
Next: Focus on improving RAG retrieval quality and expanding data sources.