A retrieval-augmented generation (RAG) pipeline ingests documents, splits them into chunks, and stores a vector embedding of each chunk permanently. Three legal questions arise: who owns the copyright in the ingested content, what happens when retrieval reproduces copyrighted text verbatim, and what a GDPR erasure request requires when the "data" is a vector embedding.
Key Analysis
Ingesting a third-party document into a RAG pipeline without a license may constitute copyright infringement — even if the original document is not stored in plaintext.
RAG retrieval that returns exact multi-sentence passages from a copyrighted document looks more like reproduction of the work than like a use a fair use defense would protect.
Erasing a data subject's personal data from a vector database under the GDPR requires more than deleting the source record: every embedding derived from that data must also be removed, and any outputs derived from it must be identified.
Risk Signals
RAG pipelines that ingest documents without verifying that the client holds the rights to use them for AI training or retrieval.
Chunking strategies that reproduce large coherent passages verbatim in retrieval output.
No documented erasure workflow for removing a person's data from ChromaDB collections.
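
One way to check for the verbatim-reproduction signal above is to measure the longest run of words that a retrieval output shares with its source document. The sketch below is illustrative only: the function names and the 40-word threshold are hypothetical, and no word count is a legal test for infringement. It splits on whitespace, so punctuation attached to a word counts as part of that word.

```python
from difflib import SequenceMatcher


def longest_verbatim_run(source_text: str, output_text: str) -> int:
    """Return the length, in words, of the longest word sequence that
    appears contiguously in both texts (case-insensitive)."""
    source_words = source_text.lower().split()
    output_words = output_text.lower().split()
    # autojunk=False keeps frequent words from being ignored in long texts.
    matcher = SequenceMatcher(None, source_words, output_words, autojunk=False)
    match = matcher.find_longest_match(0, len(source_words), 0, len(output_words))
    return match.size


def flags_verbatim_risk(source_text: str, output_text: str, max_words: int = 40) -> bool:
    """Flag an output whose longest shared run exceeds max_words.

    max_words is a hypothetical review threshold, not a legal standard.
    """
    return longest_verbatim_run(source_text, output_text) > max_words
```

Running this check over a sample of real retrieval outputs would show whether the current chunk size tends to surface long coherent passages.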
Action Items
Require clients to warrant that they hold rights to ingest documents into the RAG pipeline.
Implement a similarity floor and a maximum passage length for retrieval output to reduce verbatim reproduction risk.
Build and test the cascade deletion workflow (look up the person's chunk IDs, then delete those IDs from ChromaDB) before any erasure request arrives.
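
The two technical action items above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation. It assumes each retrieval hit carries a similarity score where higher means closer (ChromaDB queries return distances, so a conversion step would be needed first), and that each chunk was stored with a metadata field linking it to a person. The field name `subject_id`, the function names, and the default thresholds are all hypothetical. The `collection` argument is expected to behave like a ChromaDB collection, using only its `get(where=...)` and `delete(ids=...)` methods.

```python
def limit_passages(hits, min_similarity=0.75, max_chars=400):
    """Apply a similarity floor and a maximum passage length.

    hits: iterable of (text, similarity) pairs, higher similarity = closer.
    Both defaults are placeholder values, not recommendations.
    """
    kept = []
    for text, similarity in hits:
        if similarity < min_similarity:
            continue  # below the floor: drop the hit
        kept.append(text[:max_chars])  # cap how much text is returned verbatim
    return kept


def erase_subject(collection, subject_id, metadata_key="subject_id"):
    """Delete every chunk tagged with subject_id and return the deleted IDs.

    Assumes chunks were ingested with metadata {metadata_key: subject_id};
    without such a tag, the chunks for one person cannot be located this way.
    """
    found = collection.get(where={metadata_key: subject_id})
    chunk_ids = list(found["ids"])
    if chunk_ids:
        collection.delete(ids=chunk_ids)
    # Verify the deletion instead of assuming it succeeded.
    remaining = collection.get(where={metadata_key: subject_id})["ids"]
    if remaining:
        raise RuntimeError(f"{len(remaining)} chunks still present for {subject_id}")
    return chunk_ids
```

Returning the deleted chunk IDs gives the erasure workflow something to write to an audit log, and the same IDs can be used to search caches or stored responses for derived outputs.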