This is the core retrieval part. You will implement and compare five retrieval strategies, all running natively inside Oracle — no external search service required.
| Strategy | Technique | Best For |
|---|---|---|
| Keyword | Oracle Text CONTAINS() |
Exact term matching |
| Vector | VECTOR_DISTANCE() with HNSW |
Semantic similarity |
| Hybrid Pre-Filter | Text filter first, then vector rank | Known-keyword + semantic |
| Hybrid Post-Filter | Vector candidates first, then text filter | Broad semantic + keyword refinement |
| Hybrid RRF | Reciprocal Rank Fusion of both lists | Balanced fusion of both signals |
| Graph | SQL Property Graph + vector seed | Multi-hop relationship discovery |
Why multiple strategies? No single retrieval method is best for all queries. Keyword search excels when the user knows the exact terminology. Vector search finds relevant results even when no keywords match. Hybrid combines both signals. Graph discovers connections that neither keyword nor vector search can surface alone. A production RAG system needs the right strategy for each query type.
Uses the Oracle Text index to find documents where CONTAINS(text, :keyword) matches. Results are ranked by SCORE(1) — Oracle Text’s built-in relevance score.
This is provided complete — read through it to understand the baseline that other strategies build on.
vector_search_research_papersWrite a function that performs pure semantic search:
search_query: prefixarray.array('f', ...) for Oracle bindingVECTOR_DISTANCE(embedding, :q, COSINE)(rows, columns) tupleWhy search_query: prefix? The nomic embedding model was trained with asymmetric prefixes (see Part 2). Using search_query: at retrieval time tells the model “this is a question seeking content”, matching it against documents that were indexed with search_document:. Omitting the prefix degrades search quality.
Key SQL pattern:
SELECT arxiv_id, title, abstract,
ROUND(1 - VECTOR_DISTANCE(embedding, :q, COSINE), 4) AS similarity_score
FROM research_papers
ORDER BY similarity_score DESC
FETCH APPROX FIRST :top_k ROWS ONLY WITH TARGET ACCURACY 90
Why FETCH APPROX? This tells Oracle to use the HNSW index for approximate search rather than exact brute-force scan. WITH TARGET ACCURACY 90 guarantees at least 90% of the true top-k results appear in the output — a good trade-off between speed and completeness.
Complete solution:
def vector_search_research_papers(conn, embedding_model, search_query, top_k=5):
query_embedding = embedding_model.encode(
[f"search_query: {search_query}"],
convert_to_numpy=True, normalize_embeddings=True
)[0].astype(np.float32).tolist()
query_embedding_array = array.array('f', query_embedding)
query = f"""
SELECT arxiv_id, title, abstract,
SUBSTR(text, 1, 200) AS text_snippet,
ROUND(1 - VECTOR_DISTANCE(embedding, :q, COSINE), 4) AS similarity_score
FROM research_papers
ORDER BY similarity_score DESC
FETCH APPROX FIRST {top_k} ROWS ONLY WITH TARGET ACCURACY 90
"""
with conn.cursor() as cur:
cur.execute(query, q=query_embedding_array)
rows = cur.fetchall()
columns = [desc[0] for desc in cur.description]
return rows, columns
Key concept: 1 - VECTOR_DISTANCE(...) converts distance to similarity. COSINE distance ranges from 0 (identical) to 2 (opposite). After the conversion, similarity ranges from -1 to 1, where 1 means identical. Ordering DESC puts the most similar results first.
hybrid_search_research_papers_pre_filterCombine keyword filtering with vector ranking:
WHERE CONTAINS(text, :kw, 1) > 0 to pre-filterVECTOR_DISTANCEWhy pre-filter? This strategy first narrows the candidate set with keywords, then re-ranks by semantic similarity. It is useful when the user’s query contains a specific term that must appear in the results — for example, “transformer architecture for protein folding” should only return papers that actually mention the keyword, ranked by how semantically relevant they are.
Complete solution:
def hybrid_search_research_papers_pre_filter(conn, embedding_model, search_phrase, top_k=10, show_explain=False):
query_embedding = embedding_model.encode(
[f"search_query: {search_phrase}"],
convert_to_numpy=True, normalize_embeddings=True
)[0].astype(np.float32).tolist()
query_embedding_array = array.array('f', query_embedding)
with conn.cursor() as cur:
sql = f"""
SELECT arxiv_id, title, abstract,
SUBSTR(text, 1, 200) AS text_snippet,
ROUND(1 - VECTOR_DISTANCE(embedding, :q, COSINE), 4) AS similarity_score
FROM research_papers
WHERE CONTAINS(text, :kw, 1) > 0
ORDER BY similarity_score DESC
FETCH APPROX FIRST {top_k} ROWS ONLY WITH TARGET ACCURACY 90
"""
cur.execute(sql, q=query_embedding_array, kw=search_phrase)
rows = cur.fetchall()
columns = [desc[0] for desc in cur.description]
return rows, columns, None
Key concept: The WHERE CONTAINS(...) clause runs against the Oracle Text index, and the ORDER BY VECTOR_DISTANCE(...) uses the HNSW index. Oracle’s query optimiser handles both in a single execution plan — no two-pass logic in Python. This is a key advantage of running both search types inside the database.
These are provided as complete implementations. Read through them to understand the patterns:
1/(k + rank). Each result gets a combined score regardless of which method found it. This is the most balanced hybrid strategy.SIMILAR_TO and shared-author paths using GRAPH_TABLE(). This discovers papers that are not directly similar but are connected through co-authorship or topic networks.The final cell runs all strategies on the same query and displays results side-by-side so you can compare ranking behaviour. Pay attention to which papers appear in one strategy but not others — this illustrates why strategy selection matters for RAG quality.
“ORA-29902: CONTAINS error” — The Oracle Text index may not be synced. Run: EXEC CTX_DDL.SYNC_INDEX('rp_text_idx')
Empty vector results — Verify data was ingested: SELECT COUNT(*) FROM research_papers. You should see 200 rows.
Slow vector queries — Check that the HNSW index was created successfully. Without it, Oracle falls back to exact brute-force scan.