Building a RAG Pipeline from Scratch: How I Made Hundreds of Research Papers Searchable

Imagine you've collected 500 research papers on machine learning. You want to know: "What are the latest techniques for training large language models efficiently?"

You could read every paper. That's months of work. You could use a search engine — but keyword search only finds exact words. A paper that describes "reducing compute requirements for transformer training" won't show up when you search for "efficient LLM training", even though it's exactly what you need.

I ran into this problem and built a solution: a RAG pipeline that lets you ask questions in plain English and get answers grounded in your document collection. This post walks through how it works, what I built, and how you can run it yourself.

Why arXiv?

Before diving into the tech, a quick note on the data source.

arXiv is a free, open-access repository of over 2 million research papers covering computer science, physics, mathematics, and more. For this project it has three things going for it: a free REST API that requires no authentication, structured metadata for every paper (title, authors, abstract, publication date), and downloadable PDFs for the full text.
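To make the "free REST API plus structured metadata" point concrete, here is a minimal sketch of querying one category and reading the Atom feed that comes back. The function names and the returned dictionary keys are my own assumptions for illustration, not the code in fetcher.py.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def build_query_url(category: str, max_results: int = 10) -> str:
    """Build a query URL for one arXiv category, e.g. 'cs.CL'."""
    params = {
        "search_query": f"cat:{category}",
        "start": 0,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"


def parse_feed(atom_xml: str) -> list[dict]:
    """Pull title, authors, abstract, date and PDF link out of an Atom feed."""
    root = ET.fromstring(atom_xml)
    papers = []
    for entry in root.findall("atom:entry", ATOM):
        pdf_url = None
        for link in entry.findall("atom:link", ATOM):
            # arXiv marks the PDF link with title="pdf"
            if link.get("title") == "pdf":
                pdf_url = link.get("href")
        papers.append({
            # Titles can wrap across lines in the feed, so collapse whitespace
            "title": " ".join(entry.findtext("atom:title", "", ATOM).split()),
            "authors": [a.findtext("atom:name", "", ATOM)
                        for a in entry.findall("atom:author", ATOM)],
            "abstract": entry.findtext("atom:summary", "", ATOM).strip(),
            "published": entry.findtext("atom:published", "", ATOM),
            "pdf_url": pdf_url,
        })
    return papers
```

In use, the URL from build_query_url("cs.CL") would be fetched with any HTTP client and the response body passed to parse_feed. No API key is involved.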

I considered a few alternatives before settling on it:

Source	Why I passed
Wikipedia	Good breadth, but not enough depth per topic
PubMed	Rich literature, but locked to the biomedical domain
SEC EDGAR	Interesting financial filings, but dense legal language is hard to evaluate
Common Crawl	Massive scale, but noisy and complex to filter into something useful

arXiv hits the sweet spot: high-quality text, easy to scope by category (cs.AI, cs.LG, cs.CL), and the AI/ML domain makes it straightforward to judge whether the answers coming out are actually good.

What is RAG?

RAG stands for Retrieval-Augmented Generation. It combines two steps:

Step	What it does
Retrieval	Finds the most relevant passages from your documents
Generation	Feeds those passages to an AI model to write a coherent answer

This is different from just asking ChatGPT or Claude a question directly. A plain AI model answers from its training data, which may be outdated, incomplete, or simply wrong for your specific documents. With RAG, the retrieved passages are placed in the prompt and the model is instructed to answer from them, so the answer is grounded in your collection rather than in whatever the model happens to remember.

Think of it as the difference between asking a professor a question from memory, versus giving them your documents and asking them to answer based only on those.

Key Concepts (Simply Explained)

Before diving into the architecture, here are the four ideas that make RAG work.

Embeddings — a map of meaning

An embedding converts text into a list of numbers — for example, 384 numbers. These numbers aren't random; they encode the meaning of the text. Sentences with similar meanings produce similar lists of numbers.

Analogy: Think of every piece of text as a point on a map. Similar texts sit close together on that map, even if they use completely different words. "Efficient transformer training" and "reducing compute for LLMs" would be neighbours on this map.

When you ask a question, it gets converted to the same kind of number list. Finding relevant text becomes finding the nearest points on the map.

Chunking — why we split documents

A single research paper can be 30 pages long. Embedding the whole thing as one list of numbers loses too much detail — one set of numbers can't capture everything in a document that long.

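Chunking splits each document into smaller overlapping pieces. In this pipeline, each chunk is 512 tokens (~350–400 words), with a 50-token overlap between consecutive chunks, so text that straddles a chunk boundary is usually kept together in at least one of the two chunks. A single paper might produce 50–100 searchable chunks, each with its own embedding. One thing to check: all-MiniLM-L6-v2, the embedding model used here, truncates input longer than 256 word pieces by default, so the tail of a 512-token chunk may be ignored at embedding time unless the chunk size is reduced.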
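The sliding-window idea can be sketched in a few lines. This is an illustration only: it treats a list of pre-split tokens as input and ignores sentence and paragraph boundaries, whereas the pipeline itself relies on LangChain's text splitter. The function name is hypothetical.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 512,
                 overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks
```

With the defaults, each window starts 462 tokens after the previous one, so the last 50 tokens of one chunk are repeated as the first 50 tokens of the next.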

Vector similarity — measuring closeness

Once your question is an embedding and your chunks are embeddings, you need a way to measure which chunks are most relevant. That's cosine similarity: a score between -1 and 1 that measures how closely two embeddings point in the same direction (1 = same direction, which here means near-identical meaning; 0 = unrelated; negative values = pointing in opposite directions).

The database finds the chunks with the highest similarity to your question — those are your most relevant passages.
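For intuition, here is what that ranking looks like in plain Python on tiny made-up vectors. In the real pipeline the database does this work; the function names and the dictionary of chunk vectors below are assumptions for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_k(query: list[float], chunks: dict[str, list[float]],
          k: int = 5) -> list[tuple[str, float]]:
    """Return the k chunk ids with the highest similarity to the query."""
    scored = [(chunk_id, cosine_similarity(query, vec))
              for chunk_id, vec in chunks.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

Because the score depends only on direction, a long chunk and a short chunk about the same topic can still score equally well against a question.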

pgvector — PostgreSQL with a superpower

pgvector is an extension for PostgreSQL (a standard relational database) that adds support for storing and searching embeddings. Without it, PostgreSQL can hold a list of 384 decimal numbers as a plain array, but it has no vector type, no similarity operators, and no index for nearest-neighbour search.

With pgvector, a single SQL query can find the top-5 most relevant chunks across millions of stored embeddings. No separate vector database needed.
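A sketch of what that single query might look like when issued from Python. The table and column names (chunks, id, content, embedding) are hypothetical, and the %(name)s placeholders assume a psycopg-style driver. The only pgvector features used are the text form of a vector ('[0.1,0.2,...]') and the <=> cosine distance operator.

```python
def to_vector_literal(values: list[float]) -> str:
    """Format a list of floats in pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(str(float(v)) for v in values) + "]"


# <=> returns cosine distance, so similarity is 1 - distance.
# Table and column names are placeholders, not the project's schema.
TOP_K_SQL = """
SELECT id, content, 1 - (embedding <=> %(query)s::vector) AS similarity
FROM chunks
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s
"""


def top_k_query(query_embedding: list[float], k: int = 5) -> tuple[str, dict]:
    """Return the SQL text and its parameters for a top-k similarity search."""
    params = {"query": to_vector_literal(query_embedding), "k": k}
    return TOP_K_SQL, params
```

The pair returned by top_k_query would be passed to a cursor's execute method. Ordering by the distance expression itself, rather than by the computed similarity, keeps the query in the shape pgvector's indexes are built for.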

How the Pipeline Works

The system has two phases: ingestion (run once to load your documents) and querying (run every time a user asks a question).

INGESTION PHASE
────────────────────────────────────────────────
  arXiv API
    ↓  fetches paper metadata + PDF URL
  PDF Parser
    ↓  extracts raw text from PDF pages
  Chunker
    ↓  splits text into 512-token chunks
  Embedder
    ↓  converts each chunk to a 384-number vector
  PostgreSQL + pgvector
     stores: chunk text + vector + paper metadata
────────────────────────────────────────────────

QUERY PHASE
────────────────────────────────────────────────
  User types a question in the Streamlit GUI
    ↓
  Embedder
    ↓  converts the question to a 384-number vector
  Retriever
    ↓  finds top-5 chunks closest to the question vector
  Generator
    ↓  sends question + chunks to Claude API
  Streamlit GUI
     displays answer + expandable source citations
────────────────────────────────────────────────

Let me trace through a real example. The user asks: "What is the attention mechanism in transformers?"

The question is converted to a vector: [-0.031, 0.087, -0.142, ..., 0.063]
PostgreSQL finds the 5 stored chunks with the closest vectors — these come from papers like "Attention Is All You Need" and the BERT paper
Those chunks are formatted as numbered excerpts and sent to Claude along with the question
Claude reads the excerpts and writes a grounded answer, citing the source papers
The GUI displays the answer with a collapsible "Sources" section showing which chunks were used
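The "numbered excerpts" step in that trace can be sketched as follows. The Chunk class, the function names, and the instruction wording are all hypothetical; the actual templates live in prompts.py and may read differently.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    paper_title: str  # title of the paper the chunk came from
    text: str         # the chunk's text


def format_excerpts(chunks: list[Chunk]) -> str:
    """Number each chunk so the model can cite it as [1], [2], ..."""
    return "\n\n".join(
        f"[{i}] {chunk.paper_title}\n{chunk.text}"
        for i, chunk in enumerate(chunks, start=1)
    )


def build_prompt(question: str, chunks: list[Chunk]) -> str:
    """Combine instructions, numbered excerpts and the user's question."""
    return (
        "Answer the question using only the numbered excerpts below. "
        "Cite excerpts by number, and say so if they do not contain the answer.\n\n"
        f"Excerpts:\n{format_excerpts(chunks)}\n\n"
        f"Question: {question}"
    )
```

The resulting string would be sent as the user message in the Claude API call. Keeping the excerpt numbers stable is what lets the GUI map each citation back to a source chunk.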

Project Structure

rag_application/
│
├── docker-compose.yml        ← starts the PostgreSQL + pgvector container
├── requirements.txt          ← all Python packages needed
├── .env                      ← your API keys and DB URL (not committed)
├── .env.example              ← template showing what variables are needed
│
├── scripts/                  ← one-off command-line tools (run manually)
│   ├── download_model.py     ← downloads the embedding model before first use
│   ├── setup_db.py           ← creates database tables and indexes
│   └── ingest.py             ← fetches papers, embeds them, stores in DB
│
└── src/                      ← main application code
    ├── database/
    │   ├── models.py         ← defines what the database tables look like
    │   └── session.py        ← manages database connections
    ├── ingestion/
    │   ├── fetcher.py        ← downloads paper list from arXiv API
    │   ├── parser.py         ← downloads and reads PDF text
    │   └── chunker.py        ← splits text into 512-token chunks
    ├── embedding/
    │   └── embedder.py       ← converts text chunks into 384-number vectors
    ├── retrieval/
    │   └── retriever.py      ← finds the most relevant chunks for a query
    ├── generation/
    │   ├── prompts.py        ← all text templates sent to the AI
    │   └── generator.py      ← calls the Claude API and returns the answer
    └── gui/
        └── app.py            ← the Streamlit web interface

Each component is small and has a single responsibility. fetcher.py only fetches. chunker.py only chunks. This makes the code easy to read, test, and swap out — if you wanted to replace arXiv with another data source, you'd only touch fetcher.py.

Technology Choices

Here's a summary of the stack, followed by the reasoning behind the three most interesting decisions.

Technology	Role	Why I chose it
PostgreSQL + pgvector	Vector store + metadata DB	Single database for everything — no separate vector service
sentence-transformers (`all-MiniLM-L6-v2`)	Embedding model	Free, runs locally on CPU, zero per-call cost
Anthropic Claude API (`claude-sonnet-4-6`)	Answer generation	Long context window, excellent instruction-following
Streamlit	GUI	Working chat UI in pure Python — no HTML or JS needed
LangChain	Text splitting only	Battle-tested chunker; nothing else from the framework is used
Docker	Database container	Runs pgvector without touching your local PostgreSQL install
pdfplumber	PDF parsing	Reliable text extraction; falls back to abstract if a PDF is broken

Vector store: why not a dedicated vector database?

The obvious choices were Pinecone (managed cloud), ChromaDB (embedded), or FAISS (in-memory). I went with PostgreSQL + pgvector instead, for one simple reason: it keeps paper metadata and chunk vectors in the same database. With a dedicated vector DB, you end up querying two systems — one for the vectors, one for the title/authors/date — and joining the results yourself. pgvector's <=> cosine distance operator lets a single SQL query do both in one shot.

The only wrinkle: pgvector has no pre-built Windows binary for PostgreSQL 18, so running the database inside Docker (using the pgvector/pgvector:pg16 image) was the easiest path to a working setup.

Embedding model: why not the OpenAI API?

Ingesting 500 papers at 50–100 chunks each means embedding roughly 25,000–50,000 chunks, or about 13–26 million tokens at 512 tokens per chunk. At OpenAI's text-embedding-3-small pricing, that is a per-token charge paid again on every re-ingestion. all-MiniLM-L6-v2 is a ~90 MB model that runs locally on CPU, costs nothing per call, and is fast enough to ingest a full paper collection in minutes. Quality is slightly below the large paid models, but more than sufficient for this use case.
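The arithmetic behind that estimate, as a small sketch. The per-million-token price is deliberately left as a parameter because I am not quoting a rate here; plug in the current figure from the provider's pricing page. Function names are hypothetical.

```python
def embedding_workload(papers: int, chunks_per_paper: int,
                       tokens_per_chunk: int = 512) -> tuple[int, int]:
    """Return (total chunks, total tokens) for one full ingestion run."""
    chunks = papers * chunks_per_paper
    return chunks, chunks * tokens_per_chunk


def api_cost(total_tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of embedding total_tokens at a caller-supplied price."""
    return total_tokens / 1_000_000 * usd_per_million_tokens


low_chunks, low_tokens = embedding_workload(500, 50)     # 25,000 chunks
high_chunks, high_tokens = embedding_workload(500, 100)  # 50,000 chunks
```

At 512 tokens per chunk, the low and high ends work out to 12.8 million and 25.6 million tokens per ingestion run, treating every chunk as full-length.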

LLM: Claude over GPT-4o or local models

The main alternative was running a local model via Ollama (Llama 3 or Mistral) — free and fully private. I chose the Claude API — specifically claude-sonnet-4-6 — because local models at a reasonable size still struggle with synthesis tasks: taking 5 disparate excerpts and writing a coherent, well-attributed answer. Claude Sonnet 4.6 handles this well, and its long context window means I can pass all retrieved chunks without truncation concerns. GPT-4o would also work; I chose Claude because I already had API access and Sonnet 4.6 hits the right balance of quality and cost for a generation step that runs on every user query.

Getting Started

The full code and setup instructions are at github.com/hhphan/rag_application.

# 1. Clone the repo
git clone https://github.com/hhphan/rag_application.git
cd rag_application

# 2. Install Python dependencies
pip install -r requirements.txt

# 3. Copy the environment template and fill in your API key
cp .env.example .env
# Edit .env — set your ANTHROPIC_API_KEY and DATABASE_URL

# 4. Start the PostgreSQL + pgvector database
docker compose up -d

# 5. Download the embedding model
python scripts/download_model.py

# 6. Create the database tables
python scripts/setup_db.py

# 7. Ingest papers from arXiv (edit the search query in the script first)
python scripts/ingest.py

# 8. Launch the GUI
streamlit run src/gui/app.py

Open your browser at http://localhost:8501 and start asking questions.

What Tripped Me Up

No project goes together without friction. Two things caught me out.

pgvector setup. The Docker approach (pgvector/pgvector:pg16) sounds simple — one docker compose up -d and you're running. In practice I hit two issues. First, the pgvector extension isn't enabled by default inside the container; you need to run CREATE EXTENSION IF NOT EXISTS vector; manually after the container starts, otherwise every insert fails with a cryptic error about an unknown type. Second, if you already have a local PostgreSQL instance running on port 5432, Docker silently binds to a port you're not expecting and the app can't connect. The fix is to either stop your local Postgres first, or explicitly map to a different host port (e.g. 5433:5432) in docker-compose.yml. Both of these are now documented in the README — I wish I'd hit the docs before hitting the errors.
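One way to avoid the first issue is to have the setup script enable the extension itself, before it creates any table that uses the vector type. The sketch below assumes a DB-API style connection (for example from psycopg) and a hypothetical chunks table; it is not the contents of setup_db.py.

```python
EMBEDDING_DIM = 384  # output size of all-MiniLM-L6-v2


def setup_statements(dim: int = EMBEDDING_DIM) -> list[str]:
    """SQL to run in order; the extension must exist before vector columns."""
    return [
        "CREATE EXTENSION IF NOT EXISTS vector",
        # Hypothetical table layout, for illustration only
        f"""CREATE TABLE IF NOT EXISTS chunks (
            id BIGSERIAL PRIMARY KEY,
            content TEXT NOT NULL,
            embedding vector({dim}) NOT NULL
        )""",
    ]


def run_setup(conn) -> None:
    """Execute the setup statements on an open DB-API connection."""
    cursor = conn.cursor()
    try:
        for statement in setup_statements():
            cursor.execute(statement)
        conn.commit()
    finally:
        cursor.close()
```

For the second issue, if the container is mapped as 5433:5432, the DATABASE_URL in .env has to point at port 5433 as well.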

Downloading the embedding model. The pipeline uses all-MiniLM-L6-v2 from HuggingFace via the sentence-transformers library. The first time you run the ingestion script, it downloads roughly 90 MB of model weights. On a slow or interrupted connection, this download can silently corrupt the local cache — and the error you get back (OSError: unable to load weights) doesn't make it obvious that the cache is the problem. The fix is to delete the cached model directory (usually ~/.cache/huggingface/) and re-run the download script. I added a dedicated python scripts/download_model.py step to the setup sequence for exactly this reason: download once, verify it works, then ingest.
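Deleting only the one model's folder is gentler than wiping the whole cache. The sketch below assumes the default Hugging Face hub layout, where a model lives under hub/models--<org>--<name>. If HF_HOME or a custom cache directory is set, or an older sentence-transformers version is in use, the path will differ. The function names are hypothetical.

```python
import shutil
from pathlib import Path
from typing import Optional


def cached_model_dir(repo_id: str, cache_root: Optional[Path] = None) -> Path:
    """Path where the hub cache is assumed to keep repo_id's files."""
    root = cache_root or Path.home() / ".cache" / "huggingface" / "hub"
    return root / ("models--" + repo_id.replace("/", "--"))


def clear_cached_model(repo_id: str, cache_root: Optional[Path] = None) -> bool:
    """Delete the cached copy of one model; return True if anything was removed."""
    target = cached_model_dir(repo_id, cache_root)
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False
```

Calling clear_cached_model("sentence-transformers/all-MiniLM-L6-v2") and then re-running the download script forces a fresh copy of the weights.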

Where This Goes Next

The current version is a solid foundation. A few directions I'm exploring:

Hybrid search — combine vector similarity with keyword matching (BM25) so the retriever catches both semantic relevance and exact term matches
Re-ranking — after retrieving the top-5 chunks, run a second, more expensive model to re-score and re-order them for better precision
Multi-source ingestion — extend the fetcher beyond arXiv to support PDFs from local disk, websites, or other academic databases
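For the hybrid search idea, one common way to merge a vector ranking with a keyword ranking is reciprocal rank fusion (RRF). To be clear, this is a technique I am naming as one option, not something the current code does, and the constant k=60 is just the conventional default. The sketch takes two already-ranked lists of chunk ids and does not implement BM25 itself.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists; each hit contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)
```

Because RRF looks only at rank positions, it sidesteps the problem that cosine scores and BM25 scores live on different scales.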

If you build on this or run into issues, the GitHub repo has the full code. Feel free to open an issue or pull request.

Hien Phan