The idea in 60 seconds

  • Vector Databases are a part of many Generative AI solutions.
  • They store concepts as many dimensions of information (often hundreds or more).
  • The storage process is called ‘embedding’ and it relates concepts in a mathematical way.
  • Similar ideas / words are stored in similar places (mathematically).
  • This allows searching for ‘similar’ concepts (because they’re closer to each other in the database).
  • Jennifer Aniston is a useful example of the concepts involved in Vector Databases.

What is a Vector Database?

Classic databases are tables of structured data. Think of a three-column list in Excel. That’s what a typical, historical database looks like (in my head, anyway).

Vector databases store unstructured information across multiple dimensions (we’re used to three in the real world, but some Vector Databases store 1,500 or more dimensions for every piece of information they take in).

It’s the mathematical coordinates of this information within the database which allow computers to link concepts, and to capture the structure and relationships between them, so AIs can employ that information.
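If you’re curious what ‘embedding’ actually looks like in practice, here’s a tiny sketch. It assumes the open-source sentence-transformers library and its all-MiniLM-L6-v2 model (our Interview Bot may use different tooling); the only point is that a sentence goes in and a long row of coordinates comes out.

```python
# A minimal embedding sketch (assumes the sentence-transformers library;
# pip install sentence-transformers). Illustrative only.
from sentence_transformers import SentenceTransformer

# This small open-source model turns text into 384-dimensional vectors.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["We were on a break", "Jennifer Aniston", "sea salt"]
vectors = model.encode(sentences)

# Each sentence is now a long row of coordinates -- its position in the space.
print(vectors.shape)   # (3, 384): three sentences, 384 dimensions each
print(vectors[0][:5])  # the first five coordinates of the first sentence
```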

Examples of Vector Databases include Pinecone and ChromaDB – both of which we considered as our database for the Interview Bot. (In the end we chose Pinecone because the developer preferred it.)
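For a flavour of what working with one of these feels like, here’s a small hedged sketch using ChromaDB (chosen here only because it runs locally without an account; Pinecone is similar in spirit). The documents and IDs are made up for illustration.

```python
# A minimal vector-database sketch using ChromaDB (pip install chromadb).
# Illustrative only -- the documents and IDs here are made up.
import chromadb

client = chromadb.Client()  # in-memory database; nothing is saved to disk
collection = client.create_collection("interview_bot_demo")

# ChromaDB embeds these texts for us using its default embedding model.
collection.add(
    documents=[
        "Jennifer Aniston played Rachel in Friends",
        "Ross famously said 'We were on a break'",
        "Sea salt is harvested by evaporating seawater",
    ],
    ids=["doc1", "doc2", "doc3"],
)

# Ask for the documents closest in meaning to the query.
results = collection.query(query_texts=["Rachel and Ross"], n_results=2)
print(results["documents"])  # the two Friends-related texts come back first
```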

Below: A visual representation of the sort of database our Interview Bot uses (a Vector Database).

I think a visual is useful here. This is a picture of the concept of a Vector Database – although it shows only 3 dimensions. As in the human brain, similar concepts are stored near each other.

Source: weaviate.io 

What Do Vector Databases Do And How Are They Helpful?

You might think I’m joking. I’m not: many brains in the Western world appear to have a Jennifer Aniston neuron.

Source: brainlatam.com 

I think of Vector Databases as being a bit like a human brain. To explain what I mean by that, it’s worth using a weird example.

There’s an actual thing called the Jennifer Aniston neuron in the human brain. Many people have it. Show someone a picture of Jennifer Aniston and the neuron will fire. But say ‘We were on a break’ and it’ll fire then, too. The neuron fires (for technical reasons the linked article explains) whenever the concept of Jennifer Aniston is triggered. Our brains hold tons of memories, encoded in neurons, which are triggered when we’re reminded of something – and the brain stores things ‘similar to Jennifer Aniston’ in areas around the Jennifer Aniston neuron.

Vector databases store the concepts to which they are exposed as mathematical coordinates. That probably makes sense to you. Computers don’t actually understand Jennifer Aniston, do they? We all know that. Storing her as coordinates in a database is the sort of thing a computer does.

In that sense, the Vector Database is a bit like a brain. Words / concepts are encoded as mathematical coordinates and then stored. Similar concepts are stored closer together. 

So, Ross is closer to Jennifer Aniston than ‘sea salt’ is. All the structure of your CV, LinkedIn profile, questions and answers – and, indeed, language itself – is stored this way. That’s how computers and AI ‘remember’, and how they can identify things ‘similar to’ what you ask them (because those things are closer to each other in the Vector Database).
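If you want to see the ‘closer together’ idea in numbers, here’s a toy sketch. The three vectors below are made up by hand purely for illustration (real embeddings have hundreds of dimensions); cosine similarity is a standard way of measuring how closely two vectors point in the same direction.

```python
# A toy 'closeness' sketch using made-up 3-dimensional vectors.
# Real embeddings have hundreds of dimensions; the maths is the same.
import numpy as np

def cosine_similarity(a, b):
    """Close to 1.0 means pointing the same way; lower means less related."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

jennifer_aniston = np.array([0.9, 0.8, 0.1])  # hypothetical coordinates
ross             = np.array([0.8, 0.9, 0.2])  # a nearby 'Friends' concept
sea_salt         = np.array([0.1, 0.2, 0.9])  # an unrelated concept

print(cosine_similarity(jennifer_aniston, ross))      # high: ~0.99
print(cosine_similarity(jennifer_aniston, sea_salt))  # much lower: ~0.30
```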

But of course, they’re much better at it than we are. You might have heard that when Google taught its AI 240 languages and how to translate between them, it figured out for itself how to translate between pairs it hadn’t been taught. They didn’t have to teach it; it knew enough about the structure of translation and the structure of languages to work it out.

It’s The Lack Of Structure Which Reveals The Structure

Counterintuitively, it’s by storing unstructured information in this way that the structure of language, images and even video is captured and retained in the Vector Database. Vector databases appear to underpin the results which make people go ‘wow’ when they’re interacting with LLMs. This is known as an emergent property, which I have covered in a separate article.