2024-06-08

AI Cohere and Weaviate Vector DB

Weaviate is an open-source vector database that stores data objects together with the vector embeddings produced by your machine learning models. It is a cloud-native, modular, real-time vector search engine designed to scale to billions of data objects, and it is accessible through GraphQL, REST, and client libraries for several languages. ¹

setup

python3 -m venv env-cohere
source env-cohere/bin/activate

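# pip flags, for reference:
# `-I`  Ignore the installed packages, overwriting them.
# `-U`  Upgrade all specified packages to the newest available version.

# Option 1: any 3.x release of the Weaviate client (the code below uses the v3 client API).
pip3 install "weaviate-client==3.*"

# Option 2: pin exact versions.
pip3 install -U weaviate-client==3.26.2 cohere==5.5.4 numpy==1.26.4

# Option 3: force-reinstall the latest releases. This discards the pins above and may
# pull in weaviate-client 4.x, whose client API differs from the v3 code below.
pip3 install --upgrade --force-reinstall weaviate-client cohere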
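# Show the installed versions.
pip3 show weaviate-client cohere
# List the versions available on the index (`pip index versions` takes one package at a time).
pip3 index versions weaviate-client
pip3 index versions cohere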
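
Before moving on, it can help to confirm from Python that the pinned versions are the ones actually installed in the virtual environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package names are the ones installed above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
  """Map each package name to its installed version, or None if it is missing."""
  found = {}
  for name in packages:
    try:
      found[name] = version(name)
    except PackageNotFoundError:
      found[name] = None
  return found


if __name__ == "__main__":
  for name, ver in installed_versions(["weaviate-client", "cohere", "numpy"]).items():
    print(f"{name}: {ver or 'not installed'}")
```

If `weaviate-client` reports a 4.x version here, the v3-style `weaviate.Client(...)` code in the next section may not behave as written.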

hands-on

import weaviate
import json

WEAVIATE_URL = "https://some-endpoint.weaviate.network"  # Replace with your endpoint
WEAVIATE_API_KEY = "YOUR-WEAVIATE-API-KEY"  # Replace w/ your Weaviate instance API key
COHERE_API_KEY = "YOUR-COHERE-API-KEY"  # Replace with your Cohere API key

# https://weaviate.io/
# https://console.weaviate.cloud
# https://console.weaviate.cloud/dashboard

# https://weaviate.io/developers/weaviate/client-libraries/python/python_v3
# https://weaviate.io/developers/weaviate/modules/reader-generator-modules/generative-cohere
client = weaviate.Client(
  url = WEAVIATE_URL,
  auth_client_secret=weaviate.AuthApiKey(api_key=WEAVIATE_API_KEY),
  additional_headers = {
    "X-Cohere-Api-Key": COHERE_API_KEY
  }
)

CLASS_NAME = "Question"

# https://weaviate-python-client.readthedocs.io/en/latest/weaviate.schema.html
# Start fresh: delete the class (and every object in it) if it already exists.
if client.schema.exists(CLASS_NAME):
  client.schema.delete_class(CLASS_NAME)

# Create the class if it doesn't exist.
if not client.schema.exists(class_name=CLASS_NAME):
  class_obj = {
    "class": CLASS_NAME,
    # If set to "none" you must always provide vectors yourself. Could be any other "text2vec-*" also.
    "vectorizer": "text2vec-cohere",
    "moduleConfig": {
      "text2vec-cohere": {},
      "generative-cohere": {
        # Optional - Defaults to `command-xlarge-nightly`. Can also use `command-xlarge-beta` and `command-xlarge`.
        "model": "command-r-plus",
      },
    }
  }
  client.schema.create_class(class_obj)

DATA = [
 'How did serfdom develop in and then leave Russia ?',
 'What films featured the character Popeye Doyle ?',
 "How can I find a list of celebrities ' real names ?",
 'What fowl grabs the spotlight after the Chinese Year of the Monkey ?',
 'What is the full form of .com ?',
 'What contemptible scoundrel stole the cork from my lunch ?',
 "What team did baseball 's St. Louis Browns become ?",
 'What is the oldest profession ?',
 'What are liver enzymes ?',
 'Name the scar-faced bounty hunter of The Old West .',
 'When was Ozzy Osbourne born ?',
 'Why do heavier objects travel downhill faster ?',
 'Who was The Pride of the Yankees ?',
 'Who killed Gandhi ?',
 'What is considered the costliest disaster the insurance industry has ever faced ?',
 'What sprawling U.S. state boasts the most airports ?',
 'What did the only repealed amendment to the U.S. Constitution deal with ?',
 'How many Jews were executed in concentration camps during WWII ?',
 "What is 'Nine Inch Nails' ?",
 'What is an annotated bibliography ?'
]

import cohere
co = cohere.Client(COHERE_API_KEY)
embeds = co.embed(
  texts=DATA,
  model='embed-english-v3.0',
  input_type='search_document',
  truncate='END'
).embeddings

# check the dimensionality of the returned vectors.
import numpy as np
from pprint import pprint
shape = np.array(embeds).shape
pprint(shape)

# Use the MD5 hash of each question as a stable identifier.
# https://www.geeksforgeeks.org/md5-hash-python/
from hashlib import md5
ids = [md5(text.encode()).hexdigest() for text in DATA]

batch_size = 100
client.batch.configure(batch_size=batch_size)  # Send objects in batches of up to 100
with client.batch as batch:  # Any remaining objects are flushed when the block exits
  for i, text in enumerate(DATA):  # Batch import all Questions
    print(f"importing question: {i+1}")
    properties = {
      "text": text,
      "md5": ids[i],
    }
    batch.add_data_object(
      data_object=properties,
      class_name=CLASS_NAME,
      vector=embeds[i]  # Add custom vector
    )

# semantic search
response = (
  client.query
  .get(CLASS_NAME, ["text", "md5", "_additional{id, certainty}"])
  # .with_near_text({"concepts": ["What was the cause of the major recession in the early 20th century?"]})
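  .with_near_vector({
    "vector": co.embed(
      texts=["Russia"],
      model='embed-english-v3.0',
      input_type='search_query',  # queries use 'search_query'; stored documents use 'search_document'
      truncate='END'
    ).embeddings[0],
    "certainty": 0.5  # minimum certainty a result must reach
  })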
  .with_limit(5)
  .do()
)

print(json.dumps(response, indent=4))
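
The response is a nested dictionary that mirrors the GraphQL query, so the hits sit under `data`, then `Get`, then the class name. For the cosine distance metric (Weaviate's default), certainty is documented as `1 - distance / 2`, which equals `(1 + cosine similarity) / 2`, so the `0.5` threshold above keeps results whose cosine similarity to the query is at least 0. The sketch below flattens such a response and reproduces the certainty calculation with NumPy; the helper names and the sample response are made up for illustration and are not taken from a real run.

```python
import numpy as np


def extract_hits(response, class_name):
  """Return (text, certainty) pairs from a Get response shaped like the one above."""
  objects = response.get("data", {}).get("Get", {}).get(class_name) or []
  return [(obj["text"], obj["_additional"]["certainty"]) for obj in objects]


def cosine_certainty(a, b):
  """Certainty for cosine distance: (1 + cosine similarity) / 2."""
  a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
  cos_sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
  return (1.0 + cos_sim) / 2.0


# Hypothetical response, hand-written to show the structure only.
sample = {
  "data": {"Get": {"Question": [
    {"text": "How did serfdom develop in and then leave Russia ?",
     "md5": "<md5 hex digest>",
     "_additional": {"id": "<object uuid>", "certainty": 0.75}},
  ]}}
}

for text, certainty in extract_hits(sample, "Question"):
  print(f"{certainty:.2f}  {text}")

print(cosine_certainty([1.0, 0.0], [1.0, 0.0]))  # same direction   → 1.0
print(cosine_certainty([1.0, 0.0], [0.0, 1.0]))  # orthogonal       → 0.5
```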

GraphQL query

https://console.weaviate.cloud/apps/query

{
  Get {
    Question (
      limit: 2
    )
    {
      text
      md5
      _additional {
        id
        certainty
      }
    }
  }
}
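
The same query can also be sent from Python rather than pasted into the console. A minimal sketch: the hypothetical helper `build_get_query` only assembles the GraphQL text, which the v3 client can then send with `client.query.raw(...)`. That call needs a live instance, so it is left as a comment here.

```python
def build_get_query(class_name, properties, additional=(), limit=2):
  """Assemble a GraphQL Get query string for one class."""
  fields = list(properties)
  if additional:
    fields.append("_additional { " + " ".join(additional) + " }")
  return f"{{ Get {{ {class_name}(limit: {limit}) {{ {' '.join(fields)} }} }} }}"


query = build_get_query("Question", ["text", "md5"], additional=["id", "certainty"])
print(query)

# With the client from the hands-on section (requires a running instance):
# result = client.query.raw(query)
```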

Other Vector Databases:

¹ Experimenting with Vector Databases: Chromadb, Pinecone, Weaviate and Pgvector