Tired of bloated RAG frameworks? Discover Shraga — a minimal, production-ready open-source alternative built by BigData Boutique to simplify and scale GenAI applications without the overhead.
When you’re building a RAG application, one of the first decisions you face is whether to use an existing framework or build your own. Frameworks like LangChain, LangGraph, LlamaIndex and CrewAI are great for getting started — they abstract away a lot of the complexity and let you move fast. But as soon as you want something custom, performant, or production-grade, those abstractions often get in the way.
At BigData Boutique, we’ve worked on quite a few RAG systems — across industries, use cases, and architectures — and we kept running into the same issues. So we built our own open-source framework. Something minimal, composable, and easy to debug.
Better yet, Shraga is designed to get you up and running in no time, and it lets you move almost immediately from a working prototype to something internal users can try, then external users, and finally a real production deployment with analytics over usage, history, token consumption and cost. Today, this is how we run quick and efficient GenAI POCs with customers, and how those POCs often end up running as the final product in production.
In this post, I’ll walk through how we approached it, what worked for us, and why we think it’s worth considering if you’re serious about putting GenAI into production.
What We Actually Needed From a Framework
When we started looking for a framework to commit to, most of the popular RAG frameworks were non-starters for us. They abstracted too much, were too opinionated about how things should be structured, and came with a long list of heavy dependencies. That might be fine for a demo, but not for teams who need to understand what’s going on under the hood — or ship something reliable to production.
We wanted something that could serve us in both prototyping and production—where we could start “quick and dirty” when needed, but evolve into clean, modular, well-typed code as the project matured.
We also wanted to:
- Spin up a project quickly, with minimal boilerplate and few dependencies.
- Support multiple LLM providers, embedders, document retrievers, and external services.
- Have reusable utilities for the parts that always take time: EDA, data cleanup, chunking, embedding, ingestion.
- Support both headless usage via API and a drop-in UI so customers could interact with the system directly.
In short: we didn’t need a framework that does everything — we needed a toolkit that does the right things well.
The Evolution of a RAG Application with Shraga
A typical lifecycle, or evolution, of a RAG application built with Shraga would look something like this:
- Every RAG project starts as a POC: sketches and experiments in Jupyter notebooks, standalone Python scripts, or Shraga itself. Shraga is a Python framework, so you can leverage the rich Python ecosystem, which is also the de-facto standard for data science and GenAI work.
- When your RAG application matures and you are happy with the results and ready to start collecting feedback, Shraga can be used as a testbed to expose it to internal users or design partners. The built-in UI is great at serving your RAG to customers quickly, providing debug and trace information and gathering feedback.
- As your Shraga implementation matures and you get it ready for production, you can build your own UI on top of the API Shraga provides, or simply keep using Shraga’s UI, which has security built in.
- Shraga makes it easy to learn from experience and iterate quickly to improve as your RAG matures. Collect feedback from users, internal or external, sort through issues, improve, run evaluations, and iterate.
- Once your RAG gets attention and usage starts to accumulate, so might the associated costs. Leverage Shraga’s analytics to review good and bad queries, estimate the cost per query, and spot opportunities to optimize cost and performance.
Flows: The Core Building Block
The core abstraction of the Shraga framework is the BaseFlow. It represents a single, modular unit of logic — something that receives input, performs an action (which may or may not involve an LLM), and returns a response. Flows are easy to register, compose, and expose as tools in agent-based systems.
This design allows us to treat everything — from answering a user question, to fetching data, to chaining other flows — as a consistent, testable component.
class BaseFlow(ABC):
    @abstractmethod
    async def execute(self, request: FlowRunRequest) -> FlowResponse:
        """Main entry point for the flow."""
        raise NotImplementedError()

    @staticmethod
    def id() -> str:
        """Unique identifier for this flow."""
        raise NotImplementedError()

    @staticmethod
    def description() -> str:
        """
        Optional. A human-readable description. This can be used by ReAct or
        planning flows as a description of what the flow does.
        """
        return ""

    @staticmethod
    def get_tool_desc():
        """
        Optional. A pydantic schema of the flow's input. Used by ReAct or
        planning flows.
        """
        return None

    async def execute_another_flow_by_id(
        self, flow_id: str, request: FlowRunRequest
    ) -> FlowResponse:
        """Helper method to delegate execution to another flow."""
Here’s a basic example of a flow that takes a user question and calls OpenAI’s chat model to answer it:
from typing import Optional

from shraga_common.core.models import BaseFlow, FlowRunRequest, FlowResponse
from shraga_common.services import LLMService, OpenAIService


class SimpleQAFlow(BaseFlow):
    llmservice: LLMService = None

    def __init__(self, config: ShragaConfig, flows: Optional[dict] = None):
        super().__init__(config, flows)
        self.llmservice = OpenAIService(self.config)

    @staticmethod
    def id() -> str:
        return "simple_qa"

    async def execute(self, request: FlowRunRequest) -> FlowResponse:
        question = request.input.get("question")
        if not question:
            return FlowResponse(error="Missing 'question' field.")

        prompt = f"Answer the user's question: {question}"
        answer = await self.llmservice.invoke_model(
            prompt, {"model_id": "gpt-4o"}
        )

        return FlowResponse(
            response_text=answer,
            payload={"answer": answer},
            trace=self.trace_log,
        )
This flow is minimal by design — but it’s easy to extend with retrieval, memory, prompt templates, or anything else. And since all flows follow the same pattern, they can be composed, reused, or registered in agents and routers without special handling.
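For instance, a second flow can delegate to the one above by its id. Here is a minimal sketch of that pattern, assuming both flows are registered with the app; QAWithGreetingFlow and its wording are made up for illustration:

from shraga_common.core.models import BaseFlow, FlowRunRequest, FlowResponse


class QAWithGreetingFlow(BaseFlow):
    @staticmethod
    def id() -> str:
        return "qa_with_greeting"

    async def execute(self, request: FlowRunRequest) -> FlowResponse:
        # Delegate the actual answering to the flow registered as "simple_qa",
        # using the helper provided by the base class shown above.
        qa = await self.execute_another_flow_by_id("simple_qa", request)

        return FlowResponse(
            response_text=f"Here's what I found:\n{qa.response_text}",
            payload=qa.payload,
        )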
A FastAPI Layer That Does the Boring Stuff
Since all flows follow the same interface, exposing them through an API is straightforward. But real applications need more than just routing — they need configuration, authentication, logging, and tracking baked in from day one.
That’s why we built a thin FastAPI layer that wraps all flows and handles the infrastructure concerns you’d expect in a production system. It lets us go from idea to deploy endpoint in minutes, without rewriting boilerplate every time.
What’s included:
- Environment-aware configuration (app/config)
- User, JWT and OAuth authentication, for simple user access control (app/auth)
- Logging, error handling, analytics and performance stats (app/services)
This layer is intentionally boring — and that’s the point. It gives us a stable foundation for running LLM-powered services while keeping the core logic inside flows clean and focused.
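Because the wrapper is just FastAPI under the hood, you can treat the app it returns like any other FastAPI application. Here is a minimal sketch, assuming setup_app returns the FastAPI instance it builds (the main.py example later in this post runs it with uvicorn as "main:app"); the /healthz route is our own addition for illustration:

from shraga_common.app import setup_app
from shraga_common.flows.demo import flows

# Build the app exactly as in main.py below; the demo flows stand in for real ones.
app = setup_app("config.demo.yaml", flows)

# Assuming the returned object is a plain FastAPI app, custom routes and
# middleware can live alongside whatever Shraga registers for you.
@app.get("/healthz")
async def healthz():
    return {"status": "ok"}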
Ingestion and Embedding: Getting Quality Data In
The API and flow layers make it easy to expose functionality — but what actually makes a RAG system good is the data. Garbage in, garbage out still applies. And for most projects, the bulk of the real work happens before the LLM is ever called: preprocessing, chunking, embedding, and indexing.
That’s why we built two key components to handle ingestion:
DocHandler: Preprocess and Structure Content
DocHandler is responsible for taking raw documents — usually HTML, Markdown, or plain text — and converting them into clean, structured chunks. It handles:
- Stripping unnecessary HTML tags (but keeping structure like lists and tables)
- Normalizing text
- Attaching metadata
- Chunking documents based on semantic or structural cues
Each handler can be customized per document source, and supports multi-language inputs, too.
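To make the chunking step concrete, here is a standalone sketch of the kind of structural chunking a handler might do for Markdown. The Chunk dataclass and chunk_markdown function are our own illustration, not the actual DocHandler API:

import re
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def chunk_markdown(doc: str, source: str) -> list[Chunk]:
    # Naive structural chunking: split on Markdown headings and attach
    # source/section metadata to every chunk. Illustrative only.
    chunks, current, section = [], [], "intro"
    for line in doc.splitlines():
        if re.match(r"^#{1,3} ", line):
            if current:
                chunks.append(Chunk("\n".join(current).strip(),
                                    {"source": source, "section": section}))
            section = line.lstrip("# ").strip()
            current = []
        else:
            current.append(line)
    if current:
        chunks.append(Chunk("\n".join(current).strip(),
                            {"source": source, "section": section}))
    return chunks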
BaseEmbedder: Plug-and-Play Embedding
Once content is chunked, we pass it through a BaseEmbedder — an interface that wraps any embedding provider. We currently have implementations for Bedrock, Google and OpenAI.
Key features:
- Embeds text chunks in batches
- Preserves metadata alongside vectors
- Supports caching and optional local storage
- Easy to switch providers or models with a single config change
Together, these components give us a clean ingestion pipeline that can go from raw HTML to fully indexed documents very quickly — and since both are modular, they can be reused across projects or extended for specific formats and verticals.
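As a rough illustration of what such an interface looks like (the names and signatures below are ours, not the real BaseEmbedder), a batched, metadata-preserving wrapper can be as small as this:

from abc import ABC, abstractmethod

class EmbedderSketch(ABC):
    # Illustrative shape of a pluggable embedder -- not the actual BaseEmbedder API.
    def __init__(self, batch_size: int = 32):
        self.batch_size = batch_size

    @abstractmethod
    def embed_batch(self, texts: list[str]) -> list[list[float]]:
        """Call the provider (Bedrock, Google, OpenAI, ...) for one batch of texts."""

    def embed_chunks(self, chunks: list[dict]) -> list[dict]:
        """Embed {"text", "metadata"} chunks in batches, keeping metadata beside each vector."""
        records = []
        for i in range(0, len(chunks), self.batch_size):
            batch = chunks[i:i + self.batch_size]
            vectors = self.embed_batch([c["text"] for c in batch])
            records.extend(
                {"vector": v, "text": c["text"], "metadata": c["metadata"]}
                for c, v in zip(batch, vectors)
            )
        return records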
A Simple but Powerful UI: shraga-ui
Once you’ve got clean data indexed and flows exposed through an API, the next step is making the system usable — ideally without rebuilding a frontend from scratch every time.
That’s where shraga-ui comes in.
It’s a lightweight React component library we built to make it easy to add a full-featured GenAI chat interface to any project. You can drop it into your app with just a few files, point it at your API, and you’re done.
Out of the box, it supports capabilities such as:
- A responsive chat interface
- Source citation display
- Chat history
- Analytics
- Authentication
If your use case needs a UI, it’s ready to go. If not, your API is still fully usable headless — for integrations, mobile apps, or external platforms.
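For headless use, reaching the backend is just an HTTP call. The route and payload below are placeholders we made up for illustration — the actual endpoint names, request schema, and auth flow come from shraga_common’s FastAPI layer, so check your deployment for the real ones:

import requests

resp = requests.post(
    "http://localhost:8000/api/flows/simple_qa/run",  # placeholder route, not the real path
    json={"question": "What does Shraga do?"},        # placeholder payload shape
    headers={"Authorization": "Bearer <token>"},      # JWT/OAuth auth is built in
)
print(resp.json())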
Next, we’ll show how we tie this all together into a complete RAG flow with a working example.
Project Setup Overview
We’ve created a full working example of this framework in shraga-tutorial, which you can clone and run locally.
Backend 🐍
The main entry point for the backend is main.py. This file initializes the FastAPI application, loads the environment configuration, and registers all available flows with the app. While shraga-common includes a couple of default flow implementations, we typically implement our own flows by subclassing BaseFlow, following the standard structure we outlined earlier. This makes it easy to add new functionality or change behavior without needing to touch the rest of the infrastructure.
import os

import uvicorn

from shraga_common.app import get_config, setup_app
from shraga_common.flows.demo import flows

config_path = os.getenv("CONFIG_PATH", "config.demo.yaml")
app = setup_app(config_path, flows)

if __name__ == "__main__":
    uvicorn.run(
        "main:app",
        host="0.0.0.0",
        port=int(os.environ.get("PORT") or get_config("server.port") or 8000),
        reload=bool(os.environ.get("ENABLE_RELOAD", False)),
        log_config=None,
        server_header=False,
    )
For better organization, we usually place custom flows in a dedicated flows/ directory. Each flow is self-contained and can integrate with whatever tools or services it needs: AWS Knowledge Bases, Pinecone, OpenSearch, Elasticsearch, or any LLM provider like Bedrock or OpenAI.
We often write separate flows for different tasks — for example, one flow to handle document retrieval, and another to handle generation. This separation keeps things modular and lets us compose flows together in more complex pipelines or agents when needed.
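Wiring those project-specific flows into the app mirrors the main.py example above. In this sketch, RetrievalFlow and GenerationFlow are hypothetical BaseFlow subclasses living in the project’s flows/ package, and we assume setup_app accepts the same kind of flow collection that shraga_common.flows.demo exports:

import os

from shraga_common.app import setup_app

from flows.retrieval import RetrievalFlow    # hypothetical custom flows
from flows.generation import GenerationFlow

flows = [RetrievalFlow, GenerationFlow]      # assumed shape of the flow collection

config_path = os.getenv("CONFIG_PATH", "config.yaml")
app = setup_app(config_path, flows)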
Frontend ⚛️
The frontend lives in a separate frontend/ directory and is built with React and Vite. We use pnpm as the package manager, so you’ll want to run pnpm install followed by pnpm dev to get it running locally. The frontend uses our published UI package, @shragaai/shraga-ui, which provides a ready-to-use chat component that connects to your backend API.
Bootstrapping the UI is as simple as creating a minimal HTML file and calling:
createRoot(document.getElementById("root")!)
This mounts the default chat interface into your application. If you’d like to customize the chat experience, you can swap in your own component by passing it as a second argument:
createRoot(document.getElementById("root")!, AlternativeChatComponent)
With this setup, you get a full-stack RAG system that’s cleanly structured, easy to debug, and ready to extend — whether you’re working on a one-off demo or building something meant to last.
Summary
With this setup in place, we’re able to move quickly from raw data to a working, production-ready RAG system — without relying on heavyweight abstractions or external orchestration layers. Everything from ingestion to flow logic to UI is modular and easy to extend.
That said, the big frameworks out there — like LangChain or CrewAI — are doing great work, and we do use them when they make sense for a specific project. But as a team that values fast prototyping, strong delivery, and staying close to the infrastructure, Shraga has proven to be the right tool for us.
We’d love to hear what’s working for you. Are you all-in on one of the popular frameworks? Are you building things from scratch? Or maybe you’ve rolled your own internal system like we did? Let us know — we’re always up for a good architecture chat.