
Commit 94271c8

reviewing and cleaning up tutorial
1 parent 0c6a0b0 commit 94271c8


docs/howtos/solutions/vector/getting-started-vector/index-getting-started-vector.mdx

Lines changed: 113 additions & 91 deletions
@@ -41,6 +41,16 @@ Now, product 1 `Puma Men Race Black Watch` might be represented as the vector `[

In a more complex scenario, like natural language processing (NLP), words or entire sentences can be converted into dense vectors (often referred to as embeddings) that capture the semantic meaning of the text. Vectors play a foundational role in many machine learning algorithms, particularly those that involve distance measurements, such as clustering and classification algorithms.

## What is a vector database?

A vector database is a database optimized for storing and searching vectors efficiently. Vector databases are often used to power vector search applications, such as recommendation systems, image search, and textual content retrieval. They're also referred to as vector stores, vector indexes, or vector search engines. Vector databases use vector similarity algorithms to find vectors that are similar to a given query vector.

:::tip

[<u>**Redis Cloud**</u>](https://redis.com/try-free) is a popular choice for vector databases, as it offers a rich set of data structures and commands that are well-suited for vector storage and search. Redis Cloud allows you to index vectors and perform vector similarity search in a few different ways, outlined later in this tutorial. It also maintains a high level of performance and scalability.

:::

## What is vector similarity?

Vector similarity is a measure that quantifies how alike two vectors are, typically by evaluating the `distance` or `angle` between them in a multi-dimensional space.
@@ -52,81 +62,10 @@ When vectors represent data points, such as texts or images, the similarity scor
- **Image Search**: Store vectors representing image features, and then retrieve images most similar to a given image's vector.
- **Textual Content Retrieval**: Store vectors representing textual content (e.g., articles, product descriptions) and find the most relevant texts for a given query vector.

:::tip CALCULATING VECTOR SIMILARITY

If you're interested in learning more about the mathematics behind vector similarity, scroll down to the [<u>**How to calculate vector similarity?**</u>](#how-to-calculate-vector-similarity) section.

:::

## Generating vectors
@@ -144,7 +83,7 @@ git clone https://github.com/redis-developer/redis-vector-nodejs-solutions.git

### Sentence vector

To generate sentence embeddings, we'll make use of a Hugging Face model titled [Xenova/all-distilroberta-v1](https://huggingface.co/Xenova/all-distilroberta-v1). It's a compatible version of [sentence-transformers/all-distilroberta-v1](https://huggingface.co/sentence-transformers/all-distilroberta-v1) for transformers.js with ONNX weights.
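
As a minimal sketch of such a helper (assuming the `@xenova/transformers` package; the tutorial's actual implementation may differ in details):

```ts
import { pipeline } from '@xenova/transformers';

// a sketch of a sentence-embedding helper (assumed implementation)
async function generateSentenceEmbeddings(sentence: string): Promise<number[]> {
  // load the feature-extraction pipeline backed by the ONNX model
  const extractor = await pipeline(
    'feature-extraction',
    'Xenova/all-distilroberta-v1',
  );

  // mean-pool the token embeddings and L2-normalize them,
  // yielding a single 768-dim sentence vector
  const output = await extractor(sentence, { pooling: 'mean', normalize: true });
  return Array.from(output.data as Float32Array);
}
```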

:::info

@@ -196,7 +135,7 @@ const embeddings = await generateSentenceEmbeddings('I Love Redis !');
console.log(embeddings);
/*
768 dim vector output
embeddings = [
  -0.005076113156974316, -0.006047232076525688, -0.03189406543970108,
  -0.019677048549056053, 0.05152582749724388, -0.035989608615636826,
  -0.009754283353686333, 0.002385444939136505, -0.04979122802615166,
@@ -242,7 +181,7 @@ async function generateImageEmbeddings(imagePath: string) {
  // Load MobileNet model
  const model = await mobilenet.load();

  // Classify and predict what the image is
  const prediction = await model.classify(imageTensor);
  console.log(`${imagePath} prediction`, prediction);

@@ -286,7 +225,7 @@ const imageEmbeddings = await generateImageEmbeddings('images/11001.jpg');
console.log(imageEmbeddings);
/*
1024 dim vector output
imageEmbeddings = [
  0.013823275454342365, 0.33256298303604126, 0,
  2.2764432430267334, 0.14010703563690186, 0.972867488861084,
  1.2307443618774414, 2.254523992538452, 0.44696325063705444,
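
The omitted part of `generateImageEmbeddings` extracts the 1024-dim vector itself. With TensorFlow.js's MobileNet this is typically done via `model.infer` with the embedding flag; a sketch under that assumption (details may differ from the tutorial's actual code):

```ts
import * as tf from '@tensorflow/tfjs-node';
import * as mobilenet from '@tensorflow-models/mobilenet';
import { readFileSync } from 'node:fs';

// a sketch of the embedding step (assumed implementation)
async function imageToEmbeddings(imagePath: string): Promise<number[]> {
  // decode the image file into a 3-channel tensor
  const imageTensor = tf.node.decodeImage(readFileSync(imagePath), 3) as tf.Tensor3D;

  const model = await mobilenet.load();

  // passing `true` returns the intermediate 1024-dim embedding
  // instead of the final 1000-class classification output
  const embeddingsTensor = model.infer(imageTensor, true);
  return Array.from(await embeddingsTensor.data());
}
```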
@@ -392,12 +331,14 @@ You can observe products JSON data in RedisInsight:
![products data in RedisInsight](./images/products-data-gui.png)

:::tip

Download <u>[RedisInsight](https://redis.com/redis-enterprise/redis-insight/)</u> to visually explore your Redis data or to engage with raw Redis commands in the workbench. Dive deeper into RedisInsight with these <u>[tutorials](/explore/redisinsight/)</u>.

:::

### Create vector index

For searches to be conducted on JSON fields in Redis, they must be indexed. The code below shows how to index different types of fields, including vector fields such as `productDescriptionEmbeddings` and `productImageEmbeddings`.

```ts title="src/redis-index.ts"
import {
@@ -437,14 +378,14 @@ const createRedisIndex = async () => {
"DISTANCE_METRIC" "L2"
"INITIAL_CAP" 111
"BLOCK_SIZE" 111
"$.productDescription" as productDescription TEXT NOSTEM SORTABLE
"$.imageURL" as imageURL TEXT NOSTEM
"$.productImageEmbeddings" as productImageEmbeddings VECTOR "HNSW" 8
"TYPE" FLOAT32
"DIM" 1024
"DISTANCE_METRIC" "COSINE"
"INITIAL_CAP" 111

*/
const nodeRedisClient = await getNodeRedisClient();

@@ -520,24 +461,28 @@ const createRedisIndex = async () => {
```
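
For reference, the raw schema above maps onto node-redis's `ft.create` API roughly as follows. This is a sketch of the description-vector field only; the key `PREFIX` and exact option shapes are assumptions, not the tutorial's verbatim code:

```ts
import { SchemaFieldTypes, VectorAlgorithms } from 'redis';

// a sketch: creating the description-embedding part of the index
await nodeRedisClient.ft.create(
  'idx:products',
  {
    '$.productDescriptionEmbeddings': {
      type: SchemaFieldTypes.VECTOR,
      ALGORITHM: VectorAlgorithms.FLAT, // linear scan; accurate, suits small datasets
      TYPE: 'FLOAT32',
      DIM: 768, // must match the sentence-embedding dimensionality
      DISTANCE_METRIC: 'L2',
      INITIAL_CAP: 111,
      BLOCK_SIZE: 111,
      AS: 'productDescriptionEmbeddings',
    },
  },
  { ON: 'JSON', PREFIX: 'products:' }, // PREFIX is an assumption
);
```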

:::info FLAT VS HNSW indexing

FLAT: When vectors are indexed in a "FLAT" structure, they're stored in their original form without any added hierarchy. A search against a FLAT index will require the algorithm to scan each vector linearly to find the most similar matches. While this is accurate, it's computationally intensive and slower, making it ideal for smaller datasets.

HNSW (Hierarchical Navigable Small World): HNSW is a graph-centric method tailored for indexing high-dimensional data. With larger datasets, linear comparisons against every vector in the index become time-consuming. HNSW employs a probabilistic approach, ensuring faster search results but with a slight trade-off in accuracy.

:::

:::info INITIAL_CAP and BLOCK_SIZE parameters

Both INITIAL_CAP and BLOCK_SIZE are configuration parameters that control how vectors are stored and indexed.

INITIAL_CAP defines the initial capacity of the vector index. It helps in pre-allocating space for the index.

BLOCK_SIZE defines the size of each block of the vector index. As more vectors are added, Redis will allocate memory in chunks, with each chunk being the size of BLOCK_SIZE. This helps optimize memory allocation as the index grows.

:::

## What is a vector KNN query?

KNN, or k-Nearest Neighbors, is an algorithm used in both classification and regression tasks. When people refer to "KNN search," however, they typically mean finding the "k" points in a dataset that are closest (most similar) to a given query point. In the context of vector search, this means identifying the "k" vectors in our database that are most similar to a given query vector, based on some distance metric such as cosine similarity or Euclidean distance.

### Vector KNN query with Redis

Redis allows you to index and then search for vectors [using the KNN approach](https://redis.io/docs/stack/search/reference/vectors/#pure-knn-queries).
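
Condensed to its core, such a query issued through node-redis looks roughly like the sketch below (`queryVector` and the client setup are assumed; the tutorial's full snippet follows):

```ts
// a sketch: KNN vector query via node-redis (assumed setup)
const results = await nodeRedisClient.ft.search(
  'idx:products',
  '*=>[KNN 5 @productDescriptionEmbeddings $searchBlob AS score]',
  {
    PARAMS: {
      // vectors travel as raw float32 bytes
      searchBlob: Buffer.from(new Float32Array(queryVector).buffer),
    },
    RETURN: ['score', 'brandName', 'productDisplayName', 'imageURL'],
    SORTBY: { BY: 'score' },
    DIALECT: 2,
  },
);
```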
@@ -558,11 +503,11 @@ const queryProductDescriptionEmbeddingsByKNN = async (
/* sample raw query

FT.SEARCH idx:products
  "*=>[KNN 5 @productDescriptionEmbeddings $searchBlob AS score]"
  RETURN 4 score brandName productDisplayName imageURL
  SORTBY score
  PARAMS 2 searchBlob "6\xf7\..."
  DIALECT 2

*/
//https://redis.io/docs/interact/search-and-query/query/
@@ -650,18 +595,18 @@ KNN queries can be combined with standard Redis search functionalities using <u>
Range queries retrieve data that falls within a specified range of values. For vectors, a "range query" typically refers to retrieving all vectors within a certain distance of a target vector. The "range" in this context is a radius in the vector space.

### Vector range query with Redis

Below, you'll find a Node.js code snippet that illustrates how to perform a vector `range query` for any provided range (radius/distance):

```js title="src/range-query.ts"
const queryProductDescriptionEmbeddingsByRange = async (_searchTxt, _range) => {
  /* sample raw query

  FT.SEARCH idx:products
    "@productDescriptionEmbeddings:[VECTOR_RANGE $searchRange $searchBlob]=>{$YIELD_DISTANCE_AS: score}"
    RETURN 4 score brandName productDisplayName imageURL
    SORTBY score
    PARAMS 4 searchRange 0.685 searchBlob "A=\xe1\xbb\x8a\xad\x...."
    DIALECT 2
  */
@@ -736,3 +681,80 @@ console.log(JSON.stringify(result2, null, 4));
:::info Image vs text vector query
The syntax for KNN/range vector queries remains consistent whether you're dealing with image vectors or text vectors.
:::

## How to calculate vector similarity?

Several techniques are available to assess vector similarity, with some of the most prevalent ones being:

### Euclidean Distance (L2 norm)

**Euclidean Distance (L2 norm)** calculates the linear distance between two points within a multi-dimensional space. Lower values indicate closer proximity, and hence higher similarity.

<img
  src={EuclideanDistanceFormulaImage}
  alt="EuclideanDistanceFormulaImage"
  width="300"
  className="margin-bottom--md"
/>

For illustration purposes, let's assess `product 1` and `product 2` from the earlier ecommerce dataset and determine the `Euclidean Distance` considering all features.

<img
  src={EuclideanDistanceSampleImage}
  alt="EuclideanDistanceSampleImage"
  width="600"
  className="margin-bottom--md"
/>

As an example, we will use a 2D chart made with [chart.js](https://www.chartjs.org/) comparing the `Price vs. Quality` features of our products, focusing solely on these two attributes to compute the `Euclidean Distance`.

![chart](./images/euclidean-distance-chart.png)
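
In code, the formula is a one-liner. A TypeScript sketch (the numbers below are placeholder `[price, quality]` pairs, not the tutorial's dataset):

```ts
// Euclidean (L2) distance between two equal-length vectors
function euclideanDistance(a: number[], b: number[]): number {
  return Math.sqrt(a.reduce((sum, ai, i) => sum + (ai - b[i]) ** 2, 0));
}

// e.g. hypothetical [price, quality] vectors for two products
console.log(euclideanDistance([150, 5], [83, 7])); // ≈ 67.03
```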

### Cosine Similarity

**Cosine Similarity** measures the cosine of the angle between two vectors. The cosine similarity value ranges between -1 and 1. A value closer to 1 implies a smaller angle and higher similarity, while a value closer to -1 implies a larger angle and lower similarity. Cosine similarity is particularly popular in NLP when dealing with text vectors.

<img
  src={CosineFormulaImage}
  alt="CosineFormulaImage"
  width="450"
  className="margin-bottom--md"
/>

:::note
If two vectors are pointing in the same direction, the cosine of the angle between them is 1. If they're orthogonal, the cosine is 0, and if they're pointing in opposite directions, the cosine is -1.
:::

Again, consider `product 1` and `product 2` from the previous dataset and calculate the `Cosine Distance` for all features.

![sample](./images/cosine-sample.png)

Using [chart.js](https://www.chartjs.org/), we've crafted a 2D chart of `Price vs. Quality` features. It visualizes the `Cosine Similarity` solely based on these attributes.

![chart](./images/cosine-chart.png)
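
A corresponding TypeScript sketch of the formula:

```ts
// cosine similarity of two equal-length vectors
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, ai, i) => sum + ai * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, ai) => sum + ai * ai, 0));
  const magB = Math.sqrt(b.reduce((sum, bi) => sum + bi * bi, 0));
  return dot / (magA * magB);
}

console.log(cosineSimilarity([1, 2], [2, 4])); // 1 (same direction)
console.log(cosineSimilarity([1, 0], [0, 1])); // 0 (orthogonal)
```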

### Inner Product

**Inner Product (dot product)** isn't a distance metric in the traditional sense, but it can be used to calculate similarity, especially when vectors are normalized (have a magnitude of 1). It's the sum of the products of the corresponding entries of the two sequences of numbers.

<img
  src={IpFormulaImage}
  alt="IpFormulaImage"
  width="450"
  className="margin-bottom--md"
/>

:::note
The inner product can be thought of as a measure of how much two vectors "align" in a given vector space. Higher values indicate higher similarity. However, the raw values can be large for long vectors; hence, normalization is recommended for better interpretation. If the vectors are normalized, their dot product will be `1 if they are identical` and `0 if they are orthogonal` (uncorrelated).
:::

Considering our `product 1` and `product 2`, let's compute the `Inner Product` across all features.

![sample](./images/ip-sample.png)
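
And the inner product in TypeScript (illustrative values, not the tutorial's dataset):

```ts
// inner (dot) product of two equal-length vectors
function innerProduct(a: number[], b: number[]): number {
  return a.reduce((sum, ai, i) => sum + ai * b[i], 0);
}

console.log(innerProduct([1, 2, 3], [4, 5, 6])); // 1*4 + 2*5 + 3*6 = 32
```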

:::tip
Vectors can also be stored in databases in **binary formats** to save space. In practical applications, it's crucial to strike a balance between the dimensionality of the vectors (which impacts storage and computational costs) and the quality or granularity of the information they capture.
:::
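
In Node.js this is the same `Float32Array`-to-`Buffer` packing used for the `searchBlob` parameter in the queries above; a sketch:

```ts
// pack a vector into raw float32 bytes (4 bytes per dimension)
const vector = [0.12, -0.05, 0.33];
const blob = Buffer.from(new Float32Array(vector).buffer);

// ...and unpack it again
const restored = Array.from(
  new Float32Array(blob.buffer, blob.byteOffset, blob.length / 4),
);
```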
