Commit d37b5fe

dev-bot@jina.ai authored and committed
chore(docs): sync up README from docarray
Signed-off-by: dev-bot@jina.ai <Jina Dev Bot>
1 parent 3c1d780 commit d37b5fe

1 file changed: +34 -39 lines changed

README.md

Lines changed: 34 additions & 39 deletions
Original file line number | Diff line number | Diff line change
@@ -14,21 +14,21 @@
1414

1515
<!-- start elevator-pitch -->
1616

17-
DocArray is a library for nested, unstructured, multimodal data in transit, including text, image, audio, video, 3D mesh, etc. It allows deep-learning engineers to efficiently process, embed, search, recommend, store, and transfer the multi-modal data with a Pythonic API.
17+
DocArray is a library for nested, unstructured, multimodal data in transit, including text, image, audio, video, 3D mesh, etc. It allows deep-learning engineers to efficiently process, embed, search, recommend, store, and transfer multimodal data with a Pythonic API.
1818

19-
🚪 **Door to cross-/multi-modal world**: super-expressive data structure for representing complicated/mixed/nested text, image, video, audio, 3D mesh data. The foundation data structure of [Jina](https://github.com/jina-ai/jina), [CLIP-as-service](https://github.com/jina-ai/clip-as-service), [DALL·E Flow](https://github.com/jina-ai/dalle-flow), [DiscoArt](https://github.com/jina-ai/discoart) etc.
19+
🚪 **Door to multimodal world**: super-expressive data structure for representing complicated/mixed/nested text, image, video, audio, 3D mesh data. The foundation data structure of [Jina](https://github.com/jina-ai/jina), [CLIP-as-service](https://github.com/jina-ai/clip-as-service), [DALL·E Flow](https://github.com/jina-ai/dalle-flow), [DiscoArt](https://github.com/jina-ai/discoart) etc.
2020

2121
🧑‍🔬 **Data science powerhouse**: greatly accelerate data scientists' work on embedding, k-NN matching, querying, visualizing, evaluating via Torch/TensorFlow/ONNX/PaddlePaddle on CPU/GPU.
2222

2323
🚡 **Data in transit**: optimized for network communication, ready-to-wire at anytime with fast and compressed serialization in Protobuf, bytes, base64, JSON, CSV, DataFrame. Perfect for streaming and out-of-memory data.
2424

25-
🔎 **One-stop k-NN**: Unified and consistent API for mainstream vector databases that allows nearest neighboour search including Elasticsearch, Redis, ANNLite, Qdrant, Weaviate.
25+
🔎 **One-stop k-NN**: Unified and consistent API for mainstream vector databases that allows nearest neighbor search including Elasticsearch, Redis, AnnLite, Qdrant, Weaviate.
2626

27-
👒 **For modern apps**: GraphQL support makes your server versatile on request and response; built-in data validation and JSON Schema (OpenAPI) help you build reliable webservices.
27+
👒 **For modern apps**: GraphQL support makes your server versatile on request and response; built-in data validation and JSON Schema (OpenAPI) help you build reliable web services.
2828

29-
🐍 **Pythonic experience**: designed to be as easy as a Python list. If you know how to Python, you know how to DocArray. Intuitive idioms and type annotation simplify the code you write.
29+
🐍 **Pythonic experience**: as easy as a Python list. If you can Python, you can DocArray. Intuitive idioms and type annotation simplify the code you write.
3030

31-
🛸 **Integrate with IDE**: pretty-print and visualization on Jupyter notebook & Google Colab; comprehensive auto-complete and type hint in PyCharm & VS Code.
31+
🛸 **IDE integration**: pretty-print and visualization on Jupyter notebook and Google Colab; comprehensive autocomplete and type hints in PyCharm and VS Code.
3232

3333
Read more on [why should you use DocArray](https://docarray.jina.ai/get-started/what-is/) and [comparison to alternatives](https://docarray.jina.ai/get-started/what-is/#comparing-to-alternatives).
3434

@@ -61,9 +61,9 @@ DocArray consists of three simple concepts:
6161

6262
Let's see DocArray in action with some examples.
6363

64-
### Example 1: represent multimodal data in dataclass
64+
### Example 1: represent multimodal data in a dataclass
6565

66-
The following news article card can be easily represented via `docarray.dataclass` and type annotation:
66+
You can easily represent the following news article card with `docarray.dataclass` and type annotation:
6767

6868

6969
<table>
@@ -104,9 +104,9 @@ d = Document(a)
104104
</table>
105105

106106

107-
### Example 2: a 10-liners text matching
107+
### Example 2: text matching in 10 lines
108108

109-
Let's search for top-5 similar sentences of <kbd>she smiled too much</kbd> in "Pride and Prejudice".
109+
Let's search for top-5 similar sentences of <kbd>she smiled too much</kbd> in "Pride and Prejudice":
110110

111111
```python
112112
from docarray import Document, DocumentArray
@@ -137,7 +137,7 @@ Here the feature embedding is done by simple [feature hashing](https://en.wikipe
137137

138138
### Example 3: external storage for out-of-memory data
139139

140-
When your data is too big, storing in memory is probably not a good idea. DocArray supports [multiple storage backends](https://docarray.jina.ai/advanced/document-store/) such as SQLite, Weaviate, Qdrant and ANNLite. They are all unified under **the exact same user experience and API**. Take the above snippet as an example, you only need to change one line to use SQLite:
140+
When your data is too big, storing in memory is not the best idea. DocArray supports [multiple storage backends](https://docarray.jina.ai/advanced/document-store/) such as SQLite, Weaviate, Qdrant and AnnLite. They're all unified under **the exact same user experience and API**. Take the above snippet: you only need to change one line to use SQLite:
141141

142142
```python
143143
da = DocumentArray(
@@ -146,15 +146,13 @@ da = DocumentArray(
146146
)
147147
```
148148

149-
The code snippet can still run **as-is**. All APIs remain the same, the code after are then running in a "in-database" manner.
149+
The code snippet can still run **as-is**. All APIs remain the same, the subsequent code then runs in an "in-database" manner.
150150

151-
Besides saving memory, one can leverage storage backends for persistence, faster retrieval (e.g. on nearest-neighbour queries).
151+
Besides saving memory, you can leverage storage backends for persistence and faster retrieval (e.g. on nearest-neighbor queries).
152152

153+
### Example 4: complete workflow of visual search
153154

154-
155-
### Example 4: a complete workflow of visual search
156-
157-
Let's use DocArray and the [Totally Looks Like](https://sites.google.com/view/totally-looks-like-dataset) dataset to build a simple meme image search. The dataset contains 6,016 image-pairs stored in `/left` and `/right`. Images that share the same filename are perceptually similar. For example:
155+
Let's use DocArray and the [Totally Looks Like](https://sites.google.com/view/totally-looks-like-dataset) dataset to build a simple meme image search. The dataset contains 6,016 image-pairs stored in `/left` and `/right`. Images that share the same filename appear similar to the human eye. For example:
158156

159157
<table>
160158
<thead>
@@ -175,31 +173,31 @@ Let's use DocArray and the [Totally Looks Like](https://sites.google.com/view/to
175173
</tbody>
176174
</table>
177175

178-
Our problem is given an image from `/left`, can we find its most-similar image in `/right`? (without looking at the filename of course).
176+
Given an image from `/left`, can we find the most-similar image to it in `/right`? (without looking at the filename).
179177

180178
### Load images
181179

182-
First we load images. You *can* go to [Totally Looks Like](https://sites.google.com/view/totally-looks-like-dataset) website, unzip and load images as below:
180+
First we load images. You *can* go to [Totally Looks Like](https://sites.google.com/view/totally-looks-like-dataset)'s website, unzip and load images as below:
183181

184182
```python
185183
from docarray import DocumentArray
186184

187185
left_da = DocumentArray.from_files('left/*.jpg')
188186
```
189187

190-
Or you can simply pull it from Jina Cloud:
188+
Or you can simply pull it from Jina AI Cloud:
191189

192190
```python
193191
left_da = DocumentArray.pull('jina-ai/demo-leftda', show_progress=True)
194192
```
195193

196194
**Note**
197-
If you have more than 15GB of RAM and want to try using the whole dataset instead of just the first 1000 images, remove [:1000] when loading the files into the DocumentArrays left_da and right_da.
195+
If you have more than 15GB of RAM and want to try using the whole dataset instead of just the first 1,000 images, remove `[:1000]` when loading the files into the DocumentArrays `left_da` and `right_da`.
198196

199197

200-
You will see a running progress bar to indicate the downloading process.
198+
You'll see a progress bar to indicate how much has downloaded.
201199

202-
To get a feeling of the data you will handle, plot them in one sprite image. You will need to have matplotlib and torch installed to run this snippet:
200+
To get a feeling of the data, we can plot them in one sprite image. You need matplotlib and torch installed to run this snippet:
203201

204202
```python
205203
left_da.plot_image_sprites()
@@ -232,7 +230,7 @@ Did I mention `apply` works in parallel?
232230

233231
### Embed images
234232

235-
Now convert images into embeddings using a pretrained ResNet50:
233+
Now let's convert images into embeddings using a pretrained ResNet50:
236234

237235
```python
238236
import torchvision
@@ -245,7 +243,7 @@ This step takes ~30 seconds on GPU. Beside PyTorch, you can also use TensorFlow,
245243

246244
### Visualize embeddings
247245

248-
You can visualize the embeddings via tSNE in an interactive embedding projector. You will need to have pydantic, uvicorn and fastapi installed to run this snippet:
246+
You can visualize the embeddings via tSNE in an interactive embedding projector. You will need to have pydantic, uvicorn and FastAPI installed to run this snippet:
249247

250248
```python
251249
left_da.plot_embeddings(image_sprites=True)
@@ -255,8 +253,7 @@ left_da.plot_embeddings(image_sprites=True)
255253
<a href="https://docarray.jina.ai"><img src="https://github.com/docarray/docarray/blob/main/.github/README-img/tsne.gif?raw=true" alt="Visualizing embedding via tSNE and embedding projector" width="90%"></a>
256254
</p>
257255

258-
Fun is fun, but recall our goal is to match left images against right images and so far we have only handled the left. Let's repeat the same procedure for the right:
259-
256+
Fun is fun, but our goal is to match left images against right images, and so far we have only handled the left. Let's repeat the same procedure for the right:
260257

261258
<table>
262259
<tr>
@@ -289,9 +286,9 @@ right_da = (
289286
</tr>
290287
</table>
291288

292-
### Match nearest neighbours
289+
### Match nearest neighbors
293290

294-
We can now match the left to the right and take the top-9 results.
291+
Now we can match the left to the right and take the top-9 results.
295292

296293
```python
297294
left_da.match(right_da, limit=9)
@@ -312,7 +309,7 @@ left/02262.jpg right/04520.jpg 0.16477376
312309
...
313310
```
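
To make the matching step concrete, here is a minimal pure-Python sketch of top-k nearest-neighbor matching. It assumes the score printed above is a cosine distance (1 minus cosine similarity, so smaller is closer). The helper names `cosine_distance` and `match_top_k` and the toy vectors are hypothetical and are not part of the DocArray API:

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def match_top_k(queries, candidates, limit=9):
    """For each query, return up to `limit` (candidate_index, distance) pairs, nearest first."""
    results = []
    for q in queries:
        scored = [(i, cosine_distance(q, c)) for i, c in enumerate(candidates)]
        scored.sort(key=lambda pair: pair[1])  # smallest distance first
        results.append(scored[:limit])
    return results


# Toy stand-ins for the left/right image embeddings (hypothetical values).
left_emb = [[1.0, 0.0], [0.0, 1.0]]
right_emb = [[0.0, 2.0], [3.0, 0.1], [1.0, 1.0]]

matches = match_top_k(left_emb, right_emb, limit=2)
for query_idx, pairs in enumerate(matches):
    for cand_idx, dist in pairs:
        print(f'left[{query_idx}] right[{cand_idx}] {dist:.4f}')
```

This brute-force version compares every query with every candidate, which is fine for a sketch but is not how a vector database backend would answer the same query.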
314311

315-
Or shorten the loop as one-liner using the element & attribute selector:
312+
Or shorten the loop to a one-liner using the element and attribute selector:
316313

317314
```python
318315
print(left_da['@m', ('uri', 'scores__cosine__value')])
@@ -337,7 +334,7 @@ Better see it.
337334
<a href="https://docarray.jina.ai"><img src="https://github.com/jina-ai/docarray/blob/main/.github/README-img/9nn.png?raw=true" alt="Visualizing top-9 matches using DocArray API" height="250px"></a>
338335
</p>
339336

340-
What we did here is revert the preprocessing steps (i.e. switching axis and normalizing) on the copied matches, so that you can visualize them using image sprites.
337+
Here we reversed the preprocessing steps (i.e. switching axis and normalizing) on the copied matches, so you can visualize them using image sprites.
341338

342339
### Quantitative evaluation
343340

@@ -350,7 +347,7 @@ groundtruth = DocumentArray(
350347
)
351348
```
352349

353-
Here we create a new DocumentArray with real matches by simply replacing the filename, e.g. `left/00001.jpg` to `right/00001.jpg`. That's all we need: if the predicted match has the identical `uri` as the groundtruth match, then it is correct.
350+
Here we created a new DocumentArray with real matches by simply replacing the filename, e.g. `left/00001.jpg` to `right/00001.jpg`. That's all we need: if the predicted match has the identical `uri` as the groundtruth match, then it is correct.
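
The correctness rule described here (a predicted match counts only if its `uri` equals the groundtruth `uri`) can be sketched in plain Python. This is an illustration under the assumption of exactly one correct match per query, in which case recall@k is simply the share of queries whose true match appears in the top k. The function `recall_at_k` below and the sample rankings are hypothetical and independent of DocArray's own evaluation code:

```python
def recall_at_k(predicted, groundtruth, k):
    """Fraction of queries whose true match appears among the top-k predictions.

    predicted: dict mapping a query uri to a ranked list of matched uris.
    groundtruth: dict mapping a query uri to its single correct uri.
    """
    hits = sum(
        1
        for query, truth in groundtruth.items()
        if truth in predicted.get(query, [])[:k]
    )
    return hits / len(groundtruth)


# Hypothetical rankings, best match first.
predicted = {
    'left/00001.jpg': ['right/00007.jpg', 'right/00001.jpg'],
    'left/00002.jpg': ['right/00002.jpg', 'right/00009.jpg'],
    'left/00003.jpg': ['right/00004.jpg', 'right/00005.jpg'],
}
# Groundtruth is built by swapping 'left' for 'right' in the uri.
groundtruth = {uri: uri.replace('left', 'right') for uri in predicted}

for k in (1, 2):
    print(f'recall@{k}', recall_at_k(predicted, groundtruth, k))
```

Because the rule only compares strings, the same check works regardless of which model produced the embeddings.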
354351

355352
Now let's check recall rate from 1 to 5 over the full dataset:
356353

@@ -372,25 +369,23 @@ recall@4 0.052194148936170214
372369
recall@5 0.0573470744680851
373370
```
374371

375-
More metrics can be used such as `precision_at_k`, `ndcg_at_k`, `hit_at_k`.
372+
You can also use other metrics like `precision_at_k`, `ndcg_at_k`, `hit_at_k`.
376373

377-
If you think a pretrained ResNet50 is good enough, let me tell you with [Finetuner](https://github.com/jina-ai/finetuner) you could do much better in just 10 extra lines of code. [Here is how](https://finetuner.jina.ai/notebooks/image_to_image/).
374+
If you think a pretrained ResNet50 is good enough, let me tell you with [Finetuner](https://github.com/jina-ai/finetuner) you can do much better with [just another ten lines of code](https://finetuner.jina.ai/notebooks/image_to_image/).
378375

379376

380377
### Save results
381378

382-
You can save a DocumentArray to binary, JSON, dict, DataFrame, CSV or Protobuf message with/without compression. In its simplest form,
379+
You can save a DocumentArray to binary, JSON, dict, DataFrame, CSV or Protobuf message with/without compression. In its simplest form:
383380

384381
```python
385382
left_da.save('left_da.bin')
386383
```
387384

388-
To reuse it, do `left_da = DocumentArray.load('left_da.bin')`.
389-
385+
To reuse that DocumentArray's data, use `left_da = DocumentArray.load('left_da.bin')`.
390386

391387
If you want to transfer a DocumentArray from one machine to another or share it with your colleagues, you can do:
392388

393-
394389
```python
395390
left_da.push('my_shared_da')
396391
```
@@ -406,7 +401,7 @@ Intrigued? That's only scratching the surface of what DocArray is capable of. [R
406401

407402
<!-- start support-pitch -->
408403
## Support
409-
- Join our [Slack community](https://slack.jina.ai) and chat with other community members about ideas.
404+
- Join our [Slack community](https://jina.ai/slack) and chat with other community members about ideas.
410405

411406

412407
> DocArray is a trademark of LF AI Projects, LLC
