Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add metadata filtering support and fix multi-document and metadata issue #22

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Athe-kunal
Copy link

This PR includes the following changes

  1. If we pass multiple metadata for each document in a folder, the add_to_index will error out because in the below code, it tries to index the ith element, but it is a dictionary, hence it will raise KeyError. Fixed it by renaming metadata as doc_metadata and removed the indexing
for i, item in enumerate(input_items):
     current_doc_id = doc_ids[i] if doc_ids else self.highest_doc_id + 1 + i
     current_metadata = metadata[i] if metadata else None
  1. Added support for filtering based on metadata, so that users can pass a filter_metadata field in the search and get ColPali will only search from these documents.
results = model.search(query, k=5,filter_metadata={"file_name":"attention.pdf"})


print(f"Search results for '{query}':")
for result in results:
   print(f"Doc ID: {result.doc_id}, Page: {result.page_num}, Score: {result.score}")

Results

Search results for 'what's the BLEU score of this new strange method?':
Doc ID: 4, Page: 9, Score: 19.75
Doc ID: 4, Page: 8, Score: 19.5
Doc ID: 4, Page: 11, Score: 17.75
Doc ID: 4, Page: 1, Score: 17.625
Doc ID: 4, Page: 14, Score: 17.375

Now if we check doc ID to metadata

model.model.doc_id_to_metadata

Results

{0: {'file_name': 'attention_table.png'},
 1: {'file_name': 'product_c.png'},
 2: {'file_name': 'financial_report.pdf'},
 3: {'file_name': 'attention_with_a_mustache.pdf'},
 4: {'file_name': 'attention.pdf'}}

It only pulled from Doc ID 4, which we intended.

Please let me know about your suggestions. Looking forward to your collaboration

@Athe-kunal
Copy link
Author

@bclavie
Let me know about the PR, thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant