Before you can search anything, you need data in a searchable form. In this part, 200 ArXiv research papers flow through a three-stage pipeline:
The output is a DataFrame with one row per paper, each carrying its text, metadata, and a vector embedding ready for Oracle ingestion in Part 3.
We use `load_dataset` from the `datasets` library with `streaming=True` to avoid downloading the entire dataset into memory. This lets you work with large datasets efficiently: you pull only what you need.
Key points:
- Dataset: `nick007x/arxiv-papers`
- Fields: `arxiv_id`, `title`, `abstract`, `authors`, and a combined text field

Why streaming? The full ArXiv dataset is large. Streaming lets you iterate through records one at a time and stop after 200: no disk space wasted, no memory pressure.
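The stop-after-200 pattern can be sketched with `itertools.islice`. The `load_dataset` call is shown commented out (the split name `"train"` is an assumption); the helper below works on any iterable, including a streaming dataset:

```python
from itertools import islice

def take(records, n):
    # Consume at most n items from any iterable; for a streaming dataset,
    # iteration stops here and nothing further is fetched from the Hub.
    return list(islice(records, n))

# With the Hugging Face stream (split name assumed):
# from datasets import load_dataset
# stream = load_dataset("nick007x/arxiv-papers", split="train", streaming=True)
# papers = take(stream, 200)

# The same helper on a plain generator, to show the stop-early behaviour:
papers = take(({"arxiv_id": i} for i in range(1_000_000)), 200)
print(len(papers))  # 200
```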
We use the `nomic-ai/nomic-embed-text-v1.5` model from `sentence-transformers`. It produces 768-dimensional embeddings and runs locally — no API key required.
Important detail — the prefix scheme: Nomic embeddings use asymmetric prefixes:
- `search_document:` for documents being indexed
- `search_query:` for queries at retrieval time

This asymmetric prefixing improves retrieval quality by signalling intent to the model. A document prefix tells the model “this is content to be found”, while a query prefix tells it “this is a question seeking content”. Mixing them up degrades results.
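A minimal sketch of the prefixing step (helper names are illustrative; the trailing space after each prefix follows the Nomic model card convention):

```python
DOC_PREFIX = "search_document: "
QUERY_PREFIX = "search_query: "

def as_documents(texts):
    # Prefix content that will be indexed.
    return [DOC_PREFIX + t for t in texts]

def as_query(text):
    # Prefix a question at retrieval time.
    return QUERY_PREFIX + text

docs = as_documents(["Attention is all you need."])
query = as_query("papers about transformers")
print(docs[0])   # search_document: Attention is all you need.
print(query)     # search_query: papers about transformers
```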
What happens during embedding:
- Each paper's text is prefixed with `search_document:` before encoding
- The resulting vectors are stored as `float32` lists in the DataFrame

The embedding dimension (768) determines the VECTOR column size in Oracle.
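The storage step can be sketched as follows; a random array stands in for the model output here so the snippet runs without the 550MB download (the real pipeline would call `SentenceTransformer("nomic-ai/nomic-embed-text-v1.5").encode(...)` on the prefixed texts):

```python
import numpy as np

# Stand-in for model.encode(prefixed_texts): shape (n_docs, 768), float32.
embeddings = np.random.rand(3, 768).astype(np.float32)

# Store each vector as a plain list of floats, one per DataFrame row;
# the vector length (768) must match the VECTOR column size in Oracle.
embedding_column = [vec.tolist() for vec in embeddings]

print(len(embedding_column[0]))  # 768
```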
The first time this cell runs it downloads the model weights (~550MB). This can take 2-5 minutes in Codespaces. The model is cached after the first download so subsequent runs are instant.
This section is pre-built. Read through the code to understand how data flows from Hugging Face through embedding and into a DataFrame ready for Oracle ingestion.