When I was running the bioinformatics core at UMass, I built the same RNA-seq pipeline five times. I don’t mean five versions of the same pipeline. I mean five separate pipelines - each one from scratch, each one a little different, each one crafted for a specific PI’s version of “standard.”
Different genome builds. Different file formats. Different expectations around outputs: some wanted counts tables, some wanted Excel spreadsheets, others just wanted me to “run it and tell me what it means.”
It wasn’t malicious; it was just messy in the way real science often is.
However, the general structure was always the same (a rough sketch in code follows this list):
- Align reads
- Quantify expression
- Normalize
- Generate reports
- Share results
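In hindsight, that skeleton was simple enough to write down once. Here is a minimal Python sketch of the recurring structure; every function, parameter, and path is illustrative rather than the actual code I ran, and the stub bodies are where each lab’s preferred tools (STAR one month, a different aligner the next) would be invoked.

```python
# A minimal sketch of the skeleton that kept repeating. The names and
# parameters are hypothetical; only the shape of the pipeline is the point.
from pathlib import Path


def align(fastq_dir: Path, genome_index: Path, out_dir: Path) -> Path:
    """Align reads (e.g. with STAR or HISAT2) and return the BAM directory."""
    # the lab's preferred aligner would be invoked here
    return out_dir / "bams"


def quantify(bam_dir: Path, annotation: Path, out_dir: Path) -> Path:
    """Produce a gene-by-sample counts table (e.g. featureCounts or salmon)."""
    return out_dir / "counts.tsv"


def normalize(counts: Path, method: str, out_dir: Path) -> Path:
    """Normalize counts (CPM, TPM, DESeq2-style size factors, ...)."""
    return out_dir / "normalized.tsv"


def report(normalized: Path, out_dir: Path) -> Path:
    """Render QC and expression summaries for the PI, then share."""
    return out_dir / "report.html"


def run_pipeline(params: dict) -> Path:
    """Chain the five steps, driven entirely by a per-lab parameter dict."""
    out = Path(params["out_dir"])
    bams = align(Path(params["fastq_dir"]), Path(params["genome_index"]), out)
    counts = quantify(bams, Path(params["annotation"]), out)
    norm = normalize(counts, params.get("norm_method", "TPM"), out)
    return report(norm, out)
```

The specific functions don’t matter; what matters is that the five steps never changed, only the parameters feeding them.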
Each cycle felt productive, until I zoomed out and realized how much time I was spending not doing new science, but reinventing infrastructure.
One of the PIs, I remember, insisted on aligning to hg19 because that’s what one of their collaborators was doing. Another preferred a specific version of STAR because it "worked last time." One group had sample names that didn’t match between FASTQs and metadata, but swore they’d fixed it (they hadn’t). I spent hours just renaming files to conform to inconsistent naming schemes across collaborations.
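The naming problem in particular is the kind of thing a dozen lines of validation could have caught before any alignment started. A minimal sketch, assuming a hypothetical samplesheet.csv with a sample_id column and FASTQs named &lt;sample_id&gt;_R1.fastq.gz (neither convention comes from any real collaboration):

```python
# Sketch: flag sample IDs that appear in the metadata sheet but have no
# matching FASTQ, and vice versa. The samplesheet layout and file-naming
# convention here are hypothetical.
import csv
from pathlib import Path


def check_samples(samplesheet: Path, fastq_dir: Path) -> None:
    with open(samplesheet, newline="") as fh:
        metadata_ids = {row["sample_id"] for row in csv.DictReader(fh)}

    fastq_ids = {f.name.split("_R1")[0] for f in fastq_dir.glob("*_R1.fastq.gz")}

    missing_fastqs = metadata_ids - fastq_ids
    missing_metadata = fastq_ids - metadata_ids
    if missing_fastqs or missing_metadata:
        raise SystemExit(
            f"In metadata but no FASTQ: {sorted(missing_fastqs)}\n"
            f"FASTQ with no metadata row: {sorted(missing_metadata)}"
        )
    print(f"{len(metadata_ids)} samples match between metadata and FASTQs.")
```

Run before anything expensive, a check like this turns an afternoon of renaming into a one-line error message.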
By the fifth time, I finally started saving my environment images and modularizing pipeline templates. I wasn’t formalizing anything yet, but I was starting to move a little faster. I still had to duct-tape pieces together, though at least I wasn’t starting from zero.
It was becoming clear to me that this wasn’t a code problem, but rather a systems problem.
It Wasn’t About the Code
When I looked back, I realized I had spent more time managing inputs and expectations than actually writing pipelines.
Each team brought their own context: their own tools, naming conventions, legacy assumptions, and even beliefs about what “done” looked like. Some of the differences were superficial. Some were deeply embedded in how they thought about data. And all of it added friction.
And it wasn’t just about technical inconsistency; it was about trust. When something didn’t work or results didn’t look “right,” we weren’t debugging code; we were debugging understanding. What did you expect to see? What parameters did you assume I used? Which file is the real ground truth?
That’s the moment I stopped thinking about pipelines and started thinking about interfaces. Interfaces between people and tools. Between assumptions and execution. Between expectation and reality.
What we all needed was a system of reusable parts: components flexible enough to cover 90% of the use cases, but structured enough to avoid rework for each new one.
At UMass, I never quite got there. But the pain stuck with me.
Years later, when we started building Via Scientific, those lessons became the foundation of our thinking:
- Make environments reproducible. You shouldn’t need a Slack message to remember what version of STAR you used.
- Separate metadata from execution. Let pipelines assume clean inputs, and let metadata systems help clean them.
- Build for modularity. You shouldn’t need to rebuild a pipeline just to change one component.
- Make human expectations visible. If someone assumes something, surface it. Don’t bury it in config files or comments; one way to surface it is sketched below.
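Here is one way that could look in practice: a small, declared set of run assumptions that the pipeline validates and writes next to its outputs, rather than leaving those choices in someone’s head or scattered across configs. The field names and allowed values below are hypothetical, not a real Via Scientific schema.

```python
# Sketch: declare the assumptions a run is making, fail loudly if they are
# unexpected, and write them next to the results. Field names and allowed
# values are illustrative only.
import json
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass(frozen=True)
class RunAssumptions:
    genome_build: str    # e.g. "GRCh38" or "hg19" -- make the choice explicit
    aligner: str         # e.g. "STAR 2.7.10a", not "whatever worked last time"
    normalization: str   # e.g. "TPM", "DESeq2"
    samplesheet: str     # the metadata file treated as ground truth

    def validate(self) -> None:
        allowed_builds = {"GRCh38", "GRCm39", "hg19"}
        if self.genome_build not in allowed_builds:
            raise ValueError(f"Unexpected genome build: {self.genome_build!r}")

    def write(self, out_dir: Path) -> Path:
        out_dir.mkdir(parents=True, exist_ok=True)
        manifest = out_dir / "run_assumptions.json"
        manifest.write_text(json.dumps(asdict(self), indent=2))
        return manifest
```

Whether the manifest is JSON or something richer, the point is that the next person staring at a result that doesn’t look “right” can read the assumptions instead of reconstructing them from memory or Slack.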
The goal isn’t to eliminate complexity. In bioinformatics, that’s impossible. The goal is to make that complexity navigable without having to solve the same problem six times.
Why This Still Matters
I talk to groups every week who are still living in that cycle: rebuilding pipelines, revalidating environments, re-explaining what “aligned counts” mean.
It’s exhausting because the tools they’ve been given weren’t designed for reusability. Most workflows were built for execution, not extension.
At Via, we’re trying to change that. But more than that, we’re trying to share this mindset: that reusability isn’t a nice-to-have. It’s the only way we get faster as a field.
If you’re building pipelines, running cores, or scaling infrastructure in bioinformatics: ask yourself not just whether your code works, but whether it will work the next time. And the time after that.
Because once you’ve built the same pipeline five times, you start to wonder what else could’ve been possible if you only had to build it once.