Primitive Data Types in Neural Search System

A primitive data type is a data type for which the programming language provides built-in support. In framework design, primitive types often refer to the basic building blocks from which more complicated composite types can be recursively constructed. Examples include ndarray in NumPy and tensor in TensorFlow: when writing a NumPy or TensorFlow program, the main objects being manipulated and passed around are those primitive data types.

What is the primitive data type in Jina then? To many readers and users of Jina, the concepts of Executor, Driver, Pea, Pod, and Flow should be very familiar. They define different abstraction layers, and together they compose the neural search design patterns. Thanks to these Jina idioms, one can quickly bootstrap a cross/multi-modality search system. But are they primitive data types? No. Before v0.8, Jina had no primitive data type: drivers worked directly with Protobuf messages to generate or parse the stream of bytes in the network layer. The figure below illustrates this idea.

In this blog post, I will explain the new primitive data types Document, QueryLang, and NdArray, and the composite types DocumentSet, QueryLangSet, Request, and Message. These data types have been available since v0.8 in the new jina.types module. Primitive data types complete Jina’s design by clarifying the low-level data representation in Jina, yielding a much simpler, safer, and faster high-level interface. Most importantly, they ensure the universality and extensibility of Jina in the long term.

Jina is an easier way for enterprises and developers to build cross- & multi-modal neural search systems on the cloud. You can use Jina to bootstrap a text/image/video/audio search system in minutes.

New Data Types
- Primitive Types
- Composite Types
Jina Data Types In Action
Design Decisions

New Data Types

In v0.8, we introduced three primitive data types, four composite types, and some derived helper types.

Primitive Types

Document is a basic data type for representing a real-world document. It can contain text, an image, an array, an embedding, or a URI, accompanied by rich meta information. It can recurse both vertically and horizontally, giving nested documents and matched documents. Document is the main object that Client and Driver work with. The user creates it when preparing the input, and its lifetime spans the entire indexing and searching processes in Jina.
QueryLang is a basic data type for representing the query language structure. A Client can use QueryLang to build filter/sort/select queries and convert it from/to a QLDriver.
NdArray is a basic data type for representing fixed-size multidimensional items of the same type. As the fundamental numeric type in Jina, NdArray is often used to represent embeddings, blobs, images, audio, and text, and it can take part in the computation of other frameworks such as numpy, tensorflow, and pytorch.
- DenseNdArray is a specific data type for the dense representation of an NdArray. Like numpy.ndarray, it contains the values of all elements, the shape, and the data type of each element. One can consider it a numpy.ndarray “view” of the Protobuf data. DenseNdArray also provides a quantization interface to allow lossy compression.
- SparseNdArray is a specific data type for the sparse representation of an NdArray, where substantial reductions in memory requirements can be realized by storing only the non-zero entries. Jina v0.8 provides scipy, tensorflow, and pytorch “views” of the Protobuf data, which can directly join the corresponding framework’s computation.
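
To make the dense/sparse distinction concrete, here is a minimal pure-Python sketch of a coordinate-style sparse representation. It is not Jina’s SparseNdArray: the helper names to_sparse and to_dense are hypothetical, and a real implementation would hand the data to scipy, tensorflow, or pytorch as described above.

```python
from typing import Dict, List, Tuple

# (shape, {(row, col): value}) is the toy sparse format used in this sketch
Sparse = Tuple[Tuple[int, int], Dict[Tuple[int, int], float]]


def to_sparse(dense: List[List[float]]) -> Sparse:
    """Keep only the non-zero entries of a 2-D list, plus its shape."""
    shape = (len(dense), len(dense[0]) if dense else 0)
    entries = {
        (i, j): v
        for i, row in enumerate(dense)
        for j, v in enumerate(row)
        if v != 0
    }
    return shape, entries


def to_dense(sparse: Sparse) -> List[List[float]]:
    """Rebuild the full 2-D list, filling the missing entries with zeros."""
    (rows, cols), entries = sparse
    return [[entries.get((i, j), 0.0) for j in range(cols)] for i in range(rows)]


dense = [
    [0.0, 0.0, 3.0],
    [0.0, 0.0, 0.0],
    [1.5, 0.0, 0.0],
]
shape, entries = to_sparse(dense)
print(shape, entries)
```

Only two of the nine values are stored here; the saving grows with the share of zeros in the array.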

Composite Types

Besides primitive data types, four new composite types provide boxing on primitive types. This enables a more Pythonic interface and keeps the data safe from outside interference and misuse.

DocumentSet is a mutable sequence of Document. It allows one to slice/modify/add/delete the sequence and iterate over its elements via a generator.
QueryLangSet is a mutable sequence of QueryLang. Like DocumentSet, it allows one to slice/modify/add/delete the sequence and iterate over its elements via a generator.
Request is a data type for representing the messages passed between Pods, Client and Gateway. It contains all the data that Pods require, including DocumentSet, QueryLangSet and meta information. Request also provides a lazy interface to the underlying Protobuf data, avoiding unnecessary (de)serialization and (de)compression. The lifetime of Request spans the entire indexing and searching processes in Jina: it is the first object users send to Jina and the final object retrieved from Jina.
Message is a container of a Protobuf Envelope and the composite type Request. It is the actual data type passed internally between Jina Pods.
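
The idea of "boxing" a sequence can be sketched with the standard library alone. The classes below are hypothetical stand-ins (ToyDocument, ToyDocumentSet), with a plain list of dicts playing the role of the Protobuf storage; they are not Jina’s implementation, and slicing is left out to keep the sketch short.

```python
from collections.abc import MutableSequence
from typing import Iterator, List


class ToyDocument:
    """Hypothetical wrapper around one raw record (a dict stands in for a Protobuf message)."""

    def __init__(self, raw: dict):
        self.raw = raw

    @property
    def text(self) -> str:
        return self.raw.get('text', '')


class ToyDocumentSet(MutableSequence):
    """Mutable sequence that boxes each raw record into a ToyDocument on access."""

    def __init__(self, raw_docs: List[dict]):
        self._raw_docs = raw_docs  # reference to the underlying storage, not a copy

    def __getitem__(self, index: int) -> ToyDocument:
        return ToyDocument(self._raw_docs[index])

    def __setitem__(self, index: int, doc: ToyDocument) -> None:
        self._raw_docs[index] = doc.raw

    def __delitem__(self, index: int) -> None:
        del self._raw_docs[index]

    def __len__(self) -> int:
        return len(self._raw_docs)

    def insert(self, index: int, doc: ToyDocument) -> None:
        self._raw_docs.insert(index, doc.raw)

    def __iter__(self) -> Iterator[ToyDocument]:
        # generator: documents are boxed one at a time while iterating
        for raw in self._raw_docs:
            yield ToyDocument(raw)


storage = [{'text': 'a'}, {'text': 'b'}]
docs = ToyDocumentSet(storage)
docs.append(ToyDocument({'text': 'c'}))  # append() is provided by MutableSequence
del docs[0]
print([d.text for d in docs])
```

Because the set only holds a reference, every change made through it is visible in the underlying storage list as well.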

The next figure illustrates the connections between those data types:

Jina Data Types In Action

Now let’s look at some examples. Say we have an image, and we want to create a Document to contain this image.

# build a fake WHC image
import numpy
fake_img = numpy.random.randint(0, 255, [32, 32, 3], dtype=numpy.uint8)

As a comparison, here are the new way and the old way of creating such a document:

Creating the Document, with the primitive type in v0.8:

from jina import Document
d = Document(content=fake_img)

Creating the Document, before v0.8:

from jina.proto import jina_pb2
from jina.helper import array2pb
d = jina_pb2.DocumentProto()

fake_img_pb = array2pb(fake_img)
d.blob.CopyFrom(fake_img_pb)

Reading the content back, with the primitive type in v0.8:

numpy.testing.assert_equal(d.content, fake_img)

Reading the content back, before v0.8:

from jina.helper import pb2array
numpy.testing.assert_equal(pb2array(getattr(d, d.WhichOneof('content'))), fake_img)

One can immediately notice that the new data type encapsulates the Protobuf access. That only scratches the surface of Jina data types. Let’s now look at more usages.

Setting Content

import numpy as np
from jina import Document
d = Document()
# set content to text, same as `d.text = ...`
d.content = '123'
# set content to buffer, same as `d.buffer = ...`
d.content = b'1e2f2c'
# set content to blob, same as `d.blob = ...`
d.content = np.random.random([3,4,5])

The MIME type of the document is auto-guessed from the content.
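
The type-based routing behind a single content attribute can be sketched as follows. ToyDocument is a hypothetical stand-in, not Jina’s Document, and the sketch assumes that the three fields are mutually exclusive; MIME type guessing is not covered here.

```python
from typing import Any, Optional


class ToyDocument:
    """Hypothetical document with mutually exclusive text/buffer/blob fields."""

    def __init__(self) -> None:
        self.text: Optional[str] = None
        self.buffer: Optional[bytes] = None
        self.blob: Any = None

    @property
    def content(self) -> Any:
        # return whichever field is currently set
        for value in (self.text, self.buffer, self.blob):
            if value is not None:
                return value
        return None

    @content.setter
    def content(self, value: Any) -> None:
        # reset all fields so that only one kind of content is set at a time
        self.text, self.buffer, self.blob = None, None, None
        if isinstance(value, str):
            self.text = value
        elif isinstance(value, bytes):
            self.buffer = value
        else:
            self.blob = value  # anything else is treated as array-like


d = ToyDocument()
d.content = '123'
print(d.text)
d.content = b'1e2f2c'
print(d.text, d.buffer)
```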

Converting Content Types

One can use the convert_* methods to switch between different kinds of document content. The example below reads the content of README.md into the text field.

from jina import Document
d = Document(uri='./README.md')
d.convert_uri_to_text()

print(d.content) # print out the content of README.md
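
For readers without a Jina installation, the conversion can be imitated with the standard library. This is only a sketch under assumptions: ToyDocument is hypothetical, it handles local file paths only, and it guesses the MIME type from the file extension, which may differ from what Jina does.

```python
import mimetypes
import os
import tempfile
from typing import Optional


class ToyDocument:
    """Hypothetical document holding either a URI or text content."""

    def __init__(self, uri: str):
        self.uri = uri
        self.text: Optional[str] = None
        # guess the MIME type from the file extension; None if it is unknown
        self.mime_type, _ = mimetypes.guess_type(uri)

    def convert_uri_to_text(self, encoding: str = 'utf-8') -> None:
        """Read the local file behind `uri` and store its content as text."""
        with open(self.uri, encoding=encoding) as fp:
            self.text = fp.read()


# write a small file first so that the example is self-contained
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'note.txt')
    with open(path, 'w', encoding='utf-8') as fp:
        fp.write('hello world!')

    d = ToyDocument(uri=path)
    d.convert_uri_to_text()

print(d.mime_type, d.text)
```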

Construct From Existing Document

A Document object can be constructed from an existing Document-like structure, such as a binary or JSON string, a Dict, or a DocumentProto Protobuf object:

from jina import Document
from jina.proto import jina_pb2

# from dict
d1 = Document({'text': 'hello world!'})

# from binary buffer
d2 = Document(b'j\x0chello world!')

# from json
d3 = Document('{"text": "hello world!"}')

# from raw proto
d = jina_pb2.DocumentProto()
d.text = 'hello world!'
d4 = Document(d)
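
A constructor that accepts several input formats usually normalizes them into one internal representation. The sketch below shows that pattern with a hypothetical ToyDocument; note the assumption that bytes hold UTF-8 encoded JSON, whereas the binary buffer in the example above is a serialized Protobuf message.

```python
import json
from typing import Union


class ToyDocument:
    """Hypothetical document that normalizes several input formats to one dict."""

    def __init__(self, source: Union[dict, str, bytes, 'ToyDocument', None] = None):
        if source is None:
            self._fields = {}
        elif isinstance(source, ToyDocument):
            self._fields = source._fields  # share the storage, do not copy
        elif isinstance(source, dict):
            self._fields = dict(source)
        elif isinstance(source, (str, bytes)):
            # assumption: str and bytes both hold JSON, not a Protobuf buffer
            self._fields = json.loads(source)
        else:
            raise TypeError(f'unsupported source type: {type(source)!r}')

    @property
    def text(self) -> str:
        return self._fields.get('text', '')


d1 = ToyDocument({'text': 'hello world!'})      # from dict
d2 = ToyDocument(b'{"text": "hello world!"}')   # from bytes
d3 = ToyDocument('{"text": "hello world!"}')    # from JSON string
d4 = ToyDocument(d1)                            # from an existing document
print(d1.text, d2.text, d3.text, d4.text)
```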

Unique Identifier of Document

Since Jina v0.6, every document has a unique identifier id associated with all contents of the document. This ensures that documents with the same content always have the same id. With the new Document type, the content-aware id can be set via update_id(), or it is set automatically when the Document is used as a context manager:

from jina import Document

d1 = Document()
d1.content = 'hello world'
d1.update_id()

with Document() as d2:
    d2.content = 'hello world'

assert d1.id == d2.id  # True
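
How a context manager can keep a content-aware id in sync is sketched below. The hashing scheme (an 8-byte BLAKE2b digest of the UTF-8 content) is an assumption chosen for illustration and is not necessarily what Jina uses; ToyDocument is likewise hypothetical.

```python
import hashlib
from typing import Optional


class ToyDocument:
    """Hypothetical document whose id is derived from its content."""

    def __init__(self) -> None:
        self.content: str = ''
        self.id: Optional[str] = None

    def update_id(self) -> None:
        # assumption: an 8-byte BLAKE2b digest of the UTF-8 encoded content
        digest = hashlib.blake2b(self.content.encode('utf-8'), digest_size=8)
        self.id = digest.hexdigest()

    def __enter__(self) -> 'ToyDocument':
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        # refresh the id once the `with` block has finished editing the content
        self.update_id()


d1 = ToyDocument()
d1.content = 'hello world'
d1.update_id()

with ToyDocument() as d2:
    d2.content = 'hello world'

with ToyDocument() as d3:
    d3.content = 'something else'

print(d1.id == d2.id, d1.id == d3.id)
```

The context manager removes the risk of forgetting the update_id() call after the last modification.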

Access Nested Document

Nested documents can be accessed via the properties chunks and matches. Both properties return a DocumentSet object, allowing one to access the nested documents like a Python List:

from jina import Document
d = Document()
c1 = Document()
c2 = Document()
d.chunks.add(c1)
d.chunks[0].chunks.add(c2)

for c in d.chunks:
    for cc in c.chunks:
        print(repr(cc))
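
The two nested loops above visit exactly two levels. When the nesting depth is not known in advance, a recursive generator is a common alternative. The sketch below uses a hypothetical dataclass instead of Jina’s Document; walk is an illustrative helper, not a Jina function.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class ToyDocument:
    """Hypothetical document with nested chunks."""
    text: str
    chunks: List['ToyDocument'] = field(default_factory=list)


def walk(doc: ToyDocument, depth: int = 0) -> Iterator[Tuple[int, ToyDocument]]:
    """Yield every document in the tree with its nesting depth, parents first."""
    yield depth, doc
    for chunk in doc.chunks:
        yield from walk(chunk, depth + 1)


root = ToyDocument('root', [
    ToyDocument('chapter 1', [ToyDocument('paragraph 1.1')]),
    ToyDocument('chapter 2'),
])

for depth, doc in walk(root):
    print('  ' * depth + doc.text)
```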

Construct Query Language

In Jina v0.5, we introduced a new set of drivers for enabling query languages. Those drivers allow the user to override their parameters at query time to get alternative results. One example is top-k retrieval or pagination, where the start and end positions of the result slicing are query-time parameters. With the new QueryLang type, constructing a query language becomes extremely simple.

from jina.drivers.querylang.slice import SliceQL
from jina import QueryLang, Flow

ql = SliceQL(start=3, end=5, priority=999)
q = QueryLang(ql)

with Flow() as f:
    f.index(..., queryset=q)

Like Document, a QueryLang object can also be constructed from a binary or JSON string, a Dict, or a QueryLangProto object. To manage multiple QueryLang objects, one can use QueryLangSet, which is similar to DocumentSet.
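
The effect of overriding slice parameters at query time can be illustrated without Jina. ToySliceDriver below is a hypothetical class, and the sketch deliberately ignores the priority argument seen in the SliceQL example, since how priorities are resolved is not covered in this post.

```python
from typing import List, Optional, Sequence


class ToySliceDriver:
    """Hypothetical driver that slices results with overridable defaults."""

    def __init__(self, start: int = 0, end: Optional[int] = None):
        self.start = start
        self.end = end

    def apply(self, results: Sequence[str], override: Optional[dict] = None) -> List[str]:
        # query-time parameters, if any, take precedence over the defaults
        params = {'start': self.start, 'end': self.end}
        params.update(override or {})
        return list(results[params['start']:params['end']])


ranked = ['doc0', 'doc1', 'doc2', 'doc3', 'doc4', 'doc5']
driver = ToySliceDriver(start=0, end=2)  # default: top-2

print(driver.apply(ranked))
print(driver.apply(ranked, override={'start': 3, 'end': 5}))
```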

Construct Request

Putting everything together, constructing a Request on the client side becomes extremely easy and Pythonic:

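from typing import Any, Iterator, Sequence

from jina import Request, Document, QueryLang
from jina.drivers import BaseDriver  # assumed import path for the BaseDriver annotation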

def generate_req(batch: Iterator[Any], mode: str, queryset: Sequence[BaseDriver]) -> Request:
    req = Request()
    req.request_type = str(mode)
    for c in batch:
        with Document(content=c) as d:
            req.docs.append(d)
    
    req.queryset.extend(queryset)
    return req

Design Decisions

Finally, let’s review some design decisions made in Jina data types.

View, not copy.

As the Protobuf object already provides a Python interface, which can be considered a “storage” representation, we don’t want to copy it or add another storage layer. Otherwise, we would introduce data inconsistency between the Protobuf object and the Jina data type object. Our goal is to provide an enhanced “view” of the Protobuf “storage” by maintaining a reference to it.
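
The inconsistency that a copy introduces is easy to reproduce in a few lines. In this sketch a SimpleNamespace stands in for the Protobuf object, and CopyDocument and ViewDocument are hypothetical classes written only to contrast the two approaches.

```python
from types import SimpleNamespace

# a SimpleNamespace stands in for the Protobuf "storage" object
proto = SimpleNamespace(text='hello')


class CopyDocument:
    """Copies the value at construction time: a second storage layer."""

    def __init__(self, proto: SimpleNamespace):
        self.text = proto.text


class ViewDocument:
    """Keeps a reference and reads through it on every access."""

    def __init__(self, proto: SimpleNamespace):
        self._proto = proto

    @property
    def text(self) -> str:
        return self._proto.text

    @text.setter
    def text(self, value: str) -> None:
        self._proto.text = value


copied = CopyDocument(proto)
view = ViewDocument(proto)

proto.text = 'changed elsewhere'
print(copied.text)  # stale: still the value seen at construction time
print(view.text)    # follows the storage

view.text = 'changed through the view'
print(proto.text)
```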

The next figure uses Document as an example and visualizes the relations between primitive, composite, and Protobuf data types.

Delegate, not replicate.

The Protobuf object already provides attribute access. For simple data types such as str, float, and int, the experience is good enough. We do not want to replicate every attribute defined in Protobuf again in the Jina data type, but rather focus on the ones that need unique logic or particular attention.

To delegate the attribute getter/setter to the Protobuf object, all Jina data types implement the following fallback:

def __getattr__(self, name: str):
    return getattr(self._inner_proto, name)
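
A runnable version of this fallback is sketched below, again with a SimpleNamespace in place of the Protobuf object. ToyDocument and its text_length property are hypothetical, and the __setattr__ forwarding is an assumption added to cover the setter side; the post itself shows only __getattr__.

```python
from types import SimpleNamespace


class ToyDocument:
    """Hypothetical wrapper that adds logic for some attributes only."""

    def __init__(self, proto: SimpleNamespace):
        # bypass our own __setattr__ while storing the wrapped object
        object.__setattr__(self, '_inner_proto', proto)

    @property
    def text_length(self) -> int:
        # an attribute with its own logic, defined on the wrapper
        return len(self._inner_proto.text)

    def __getattr__(self, name: str):
        # only called when normal lookup fails, so properties take precedence
        return getattr(self._inner_proto, name)

    def __setattr__(self, name: str, value) -> None:
        # forward writes too, otherwise they would land on the wrapper
        setattr(self._inner_proto, name, value)


proto = SimpleNamespace(text='hello', weight=1.0)
d = ToyDocument(proto)
print(d.weight, d.text_length)

d.weight = 2.5
print(proto.weight)
```

Python calls __getattr__ only after the regular attribute lookup has failed, which is why attributes defined explicitly on the wrapper are never shadowed by the delegation.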

More than a Pythonic interface.

Jina data types are compatible with Python idioms. Moreover, they summarize common patterns used in the drivers and the client and make those patterns safer and easier to use. For example, doc_id conversion was previously implemented separately inside different drivers, which was error-prone. Another great example is the lazy access to Request. A Request object traps all read/write access to its content. Deserialization and decompression (to a Protobuf object) are triggered only when the content is accessed. Otherwise, they are skipped, yielding more efficient message passing, especially on Peas equipped with RouteDriver.
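
The lazy-access pattern can be sketched with zlib and json standing in for compression and Protobuf (de)serialization. LazyRequest, its body property, and its dump method are hypothetical names for illustration and do not mirror Jina’s Request API.

```python
import json
import zlib
from typing import Optional


class LazyRequest:
    """Hypothetical request that decompresses its payload only when it is read."""

    def __init__(self, payload: bytes):
        self._payload = payload            # compressed bytes as received
        self._body: Optional[dict] = None  # parsed content, filled on first access
        self.is_used = False

    @property
    def body(self) -> dict:
        if self._body is None:
            # first access: pay for decompression and parsing exactly once
            self._body = json.loads(zlib.decompress(self._payload))
            self.is_used = True
        return self._body

    def dump(self) -> bytes:
        if not self.is_used:
            # never touched: forward the original bytes unchanged
            return self._payload
        return zlib.compress(json.dumps(self._body).encode('utf-8'))


wire = zlib.compress(json.dumps({'docs': ['hello']}).encode('utf-8'))

routed = LazyRequest(wire)   # e.g. a component that only forwards the message
print(routed.dump() is wire)

edited = LazyRequest(wire)   # a component that reads and modifies the content
edited.body['docs'].append('world')
print(json.loads(zlib.decompress(edited.dump())))
```

A component that only routes messages never touches body, so it forwards the original bytes without paying for decompression or parsing.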

If you’d like to find out more about Jina primitive data types or discuss the design patterns for neural search, you are welcome to join our monthly Engineering All Hands via Zoom or Youtube live stream. Previous meeting recordings can be found on our Youtube channel. If you like Jina and want to join us as a full-time AI / Backend / Frontend developer, please submit your CV to our job portal. Let’s build the next neural search ecosystem together!
