Sign in or Join to comment or subscribe

2024-05-20T13:08:03Z ago

To answer the question on Vector vs Graph DB –

They’re 100% different things. A vector DB will hold what are arrays of numbers and look like blob to a human. The queries you do against a vector DB are mostly “what are the vectors most similar to this input vector?” where you input vector is also an array of numbers.

A Graph DB, on the other hand, is not much more than a relational DB. Nodes are like rows in tables, edges are like foreign keys. Graph DBs store run of the mill information we typically think of when we think of storing things in a relational DB.

2024-05-20T13:39:32Z ago

Thanks for the explanation. Why is finding similar vectors important? Wouldn’t finding closely connected nodes be relevant? Do you have an example of a query or dataset I can experiment with?

2024-05-20T14:15:28Z ago

Vectors, in a vector db context, represent “embeddings.”

An embedding is a representation of some problem/concept/idea as an N-dimensional vector. Usually these are generated by some ML algorithm. Helpful explanation, I know.

One of the most common embeddings is word2vec where a sentence in natural language is converted into a vector. Why would someone convert natural language to a mathematical concept?

The nice property that falls out is that you can then start applying normal math operations to the sentence.

Something simpler to think about are hex codes for numbers. red and green have their own hex codes:

  • red: 0xFF0000
  • green: 0x00FF00

Since they’re represented as numbers, we can add red + green and get yellow:

    red 0xFF0000
+ green 0x00FF00
= yellow 0xFFFF00

And we can ask other questions about colors like “is orange (0xFFA500) more similar to yellow (0xFFFF00) or blue?”

Math.abs(orange - yellow) = Math.abs(0xFFA500 - 0xFFFF00) = 23040
Math.abs(orange - blue) = Math.abs(0xFFA500 - 0x0000FF) = 16753665

Clearly orange is much more similar to blue than yellow as the magnitude of the difference of the two colors is smaller.

Note: color is really a 3 dimensional vector here (R part, G part, B part) so it would be more accurate to model each color as a vector and find the cosine similarity between the vectors. The cosine similarity of orange and yellow is nearly 1 (very similar), the cosine similarity of orange and blue is 0 (no similarity).

Back to word2vec –

suffice it to say, some researchers figured out what values to give words to produce vectors that, when finding the cosine distance between them, results in finding semantically similar text.

There are embeddings everywhere these days. From general content embeddings that you can use to find similar content to domain specific embeddings to find things like “abusive content”, “hate speech”, violence, gore, etc. (sorry for the negative slant, I used to work on abuse @ Meta).

VectorDBs provide an efficient way to search all the vectors you may be generating for your site. As an example, every photo upload @ Meta gets some set of vectors generated which is then used to check against a vector DB of known harmful images to see if the new upload has similarity to any known bad / banned imagery.

Player art
  0:00 / 0:00