Contributor
itsayushpandey

Build out Beam Use Cases - Implement semantic search pipelines


Mentors
Danny McCormick
Organization
Apache Software Foundation
Technologies
Python, Java, Redis, Beam, LLMs, Embeddings, NLP
Topics
machine learning, indexing, Data Engineering, Large Language Models, Retrieval-Augmented Generation
Apache Beam is uniquely positioned for building RAG (Retrieval-Augmented Generation) applications: it offers a unified model for batch and streaming data-parallel processing pipelines, runners that execute those pipelines on a variety of distributed processing backends, and ML-specific transforms within MLTransform (such as EmbeddingManager and other MLTransformProvider implementations). RAG applications are among the most useful and most commonly built applications on LLMs (Large Language Models).

This project focuses on building a knowledge base in a vector database for a text corpus, and on enriching users' questions with matching text chunks found through semantic search. This is a crucial part of any RAG application and helps build the right prompt context for LLMs.

We will implement the following deliverables:

1. Build a Beam pipeline that takes a batch text corpus from a public dataset as a pipeline parameter and uses MLTransform to generate embeddings and save them in batch mode to a vector database.
   - Initial scope: Wikipedia dataset read from object storage, embedded with JinaAI embeddings, and written through RedisIO to a known vector DB.
2. Build a Beam pipeline that takes a stream of text questions from clients and enriches them with related texts from the vector DB.
   - Initial scope: KafkaIO-based reading of queries, which producers populate independently, with results published to a different topic.
3. New enrichment handlers for vector database queries over a Redis-based vector DB.

Stretch goals:

1. Implement enrichment handlers for OpenSearch (AWS supported) [4]
2. Implement enrichment handlers for Vertex AI Vector Search (GCP supported) [5]

The goal is to demonstrate that semantic search can be built with little effort using Beam; evaluation of search results is therefore not tied to a broad benchmark (such as MTEB) within this project's scope.
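The embed, index, and enrich flow described above can be illustrated with a minimal sketch in plain Python rather than Beam. It assumes a toy hashing embedding (`toy_embed`) in place of the JinaAI model and an in-memory list in place of the Redis vector DB. All names here (`toy_embed`, `build_index`, `enrich`, `DIM`) are hypothetical and are not part of Beam's API; a real pipeline would likely use MLTransform for the embedding step and an enrichment handler for the lookup.

```python
import math
import re
from collections import Counter

DIM = 64  # hypothetical embedding size for the toy model


def toy_embed(text, dim=DIM):
    """Hash words into a fixed-size bag-of-words vector, then L2-normalize.

    Stand-in for a real embedding model; it captures word overlap only.
    """
    vec = [0.0] * dim
    words = re.findall(r"[a-z0-9]+", text.lower())
    for word, count in Counter(words).items():
        # Deterministic hash (built-in hash() is salted per process for str).
        bucket = sum(ord(ch) * (i + 1) for i, ch in enumerate(word)) % dim
        vec[bucket] += count
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_index(chunks):
    """Batch step: embed every text chunk and keep (chunk, vector) pairs."""
    return [(chunk, toy_embed(chunk)) for chunk in chunks]


def enrich(question, index, top_k=2):
    """Streaming step: attach the top_k most similar chunks to a question."""
    q = toy_embed(question)
    # Vectors are unit length, so the dot product is the cosine similarity.
    scored = [(sum(a * b for a, b in zip(q, vec)), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return {"question": question, "context": [chunk for _, chunk in scored[:top_k]]}


if __name__ == "__main__":
    corpus = [
        "Apache Beam is a unified model for batch and streaming pipelines.",
        "Redis can store vectors and serve nearest-neighbor queries.",
        "Kafka topics carry streams of messages between producers and consumers.",
    ]
    print(enrich("How does Beam handle streaming pipelines?", build_index(corpus)))
```

In the project's terms, `build_index` corresponds to the batch pipeline that populates the vector DB, and `enrich` corresponds to the streaming pipeline that attaches matching chunks to each incoming question before the prompt is built.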