Building a content sync pipeline: The struggle with large files and CI
This past week, I kicked off the halachot-orvishua project. The idea is to build a reliable pipeline that pulls Jewish law content from orvishua.net. I want an automated system that checks for new content, downloads it, and then tells me what changed, especially which “mega-files” need re-uploading. It’s about keeping a local, updated copy and knowing what’s fresh.
This period was mostly about setting up the core infrastructure and figuring out the content flow.
- Scripts-only repo, content on Google Drive: My first big decision was to make this a “scripts-only repo.” The actual content wouldn’t live in Git; it would be on Google Drive. This immediately caused problems.
- CI facepalm: My initial CI setup tried to download content. That was a no-go, given the “scripts-only” decision and the sheer size of the content. I had to quickly switch it to a “detect-only” mode where the workflow never tries to pull down the actual files. Total facepalm moment: I had pointed CI at the exact content I had just decided to keep out of Git.
- Core logic: change detection and notifications: I got the system to detect changes. Crucially, it now reports which sources need re-uploading and exactly which mega-files are affected (a minimal sketch of this check appears after this list). This was a big win, the whole point of the project.
- Evolving content strategy: The “scripts-only” idea felt clean, but then I realized I needed some representation of the output in Git for tracking. I implemented syncing the output to a git-tracked directory (the second sketch after this list shows roughly what that step looks like). It felt like a necessary compromise. Later, I reorganized everything: scripts into `scripts/`, tracked content into `content/`, using `new_content/` and `content_index.json` for systematic management. It’s a constant battle between keeping it simple and making it robust for large, dynamic data.
- Line endings, of course: Had to add a `.gitattributes` and normalize line endings to LF. Always line endings.
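For the curious, here is roughly what the detect-only check does. This is a minimal sketch, not the project’s actual code: the index layout, the `megafile` field, and the idea of CI fetching only ids and checksums (never the content itself) are all assumptions for illustration.

```python
"""Detect-only check: compare a remote listing against content_index.json
and report what changed, without downloading anything."""

import json
from pathlib import Path

INDEX_PATH = Path("content_index.json")  # hypothetical location


def load_index(path: Path = INDEX_PATH) -> dict:
    """Return the tracked index, or an empty one if it doesn't exist yet."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def detect_changes(remote: dict, index: dict) -> dict:
    """Compare remote items {item_id: {"checksum": ..., "megafile": ...}}
    against a local index of the same (assumed) shape."""
    new_items = sorted(i for i in remote if i not in index)
    changed_items = sorted(
        i for i in remote
        if i in index and remote[i]["checksum"] != index[i]["checksum"]
    )
    # Any mega-file containing a new or changed item needs re-uploading.
    affected = sorted({remote[i]["megafile"] for i in new_items + changed_items})
    return {"new": new_items, "changed": changed_items, "megafiles": affected}


if __name__ == "__main__":
    # In CI this would fetch a lightweight remote listing (ids + checksums),
    # never the files themselves.
    remote_listing = {}  # placeholder for the real fetch step
    report = detect_changes(remote_listing, load_index())
    print(json.dumps(report, indent=2, ensure_ascii=False))
```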
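And this is the shape of the sync step that lands output in a git-tracked directory. Again a hedged sketch: the directory names follow the post, but whether `new_content/` sits next to `content/`, and the exact index fields, are my assumptions.

```python
"""Sketch of the sync step: copy freshly generated output into the
git-tracked new_content/ directory and record it in content_index.json."""

import hashlib
import json
import shutil
from pathlib import Path

NEW_CONTENT_DIR = Path("new_content")   # assumed layout
INDEX_PATH = Path("content_index.json")


def sync_output(output_dir: Path, megafile: str) -> None:
    """Copy every file from output_dir into new_content/ and update the index."""
    NEW_CONTENT_DIR.mkdir(parents=True, exist_ok=True)
    index = (
        json.loads(INDEX_PATH.read_text(encoding="utf-8"))
        if INDEX_PATH.exists()
        else {}
    )

    for src in sorted(output_dir.glob("*")):
        if not src.is_file():
            continue
        shutil.copy2(src, NEW_CONTENT_DIR / src.name)
        index[src.name] = {
            "checksum": hashlib.sha256(src.read_bytes()).hexdigest(),
            "megafile": megafile,
        }

    INDEX_PATH.write_text(
        json.dumps(index, indent=2, ensure_ascii=False, sort_keys=True),
        encoding="utf-8",
    )
```

The point of tracking only checksums and mega-file names is that Git sees small, diffable metadata while the heavy content stays on Drive.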
I finally got a successful sync run, pulling down 2 new Q&A items and confirming 0 new articles. That little green checkmark felt good.
Honestly, the hardest part was wrestling with the content strategy. “Scripts-only, content on Drive” seemed clean, but the practicalities of CI, tracking changes, and needing some versioned output meant constant re-evaluation. It felt like I was building the plane while flying it. The “mega-files” aspect really drove home that Git isn’t always the right tool for all data, but you still need a way to track metadata about that data. Architectural decisions are rarely one-and-done; they evolve as you understand the problem better.
Next: Refine the notification content and potentially add more robust error handling.