New Features in Jina v0.5 You Should Know About
Today we are excited to release Jina v0.5.0. Jina is an easier way for enterprises and developers to build neural search in the cloud. The new version is a great accomplishment by our engineers and the open-source community: it contains 700 new commits covering 400 files and 7000 lines of code since v0.4.0. The new release also introduces breaking changes and advanced features such as recursive document representation, query language driver, and hub rebuild. In this post, I will go through some of the highlights of v0.5.0 and explain the rationale behind them.
Table of Contents
- Recursive Document Representation
- New Query Language Driver
- Hub as a Git Submodule
Recursive Document Representation
Document is anything you want to search for. Engineering-wise, Document is a first-class structure defined by Protobuf. At the high level, the Flow ingests Document for indexing, searching, training and evaluating; at the low level, the driver reads and modifies Document or passes it to another driver. Besides general properties such as weight, what makes Document interesting is that each Document contains a list of Chunk. A Chunk is a small semantic unit of a Document, like a sentence or a 64x64 pixel image patch. Most of the algorithms in Jina work on the Chunk level, and the results are aggregated back to ScoredResult (which wraps a Document structure) before returning to the user. Before v0.5.0, Document, Chunk and ScoredResult each had their own Protobuf specification, though they shared many common attributes. In v0.5.0, we unify the Protobuf definition of Document, Chunk and ScoredResult. This dramatically simplifies the overall interface.
One immediate consequence is that Document now has a recursive structure with arbitrary width and depth instead of a trivial bi-level structure. Roughly speaking, chunks can have next-level chunks and same-level matches, and so can matches. This could go on and on. The following figure illustrates this structure.
This recursive structure allows Document to represent many real-world objects much more comfortably than before. For example, in NLP a long document is composed of semantic chapters; each chapter consists of multiple paragraphs, which can be further segmented into sentences. In CV, a video is composed of one or more scenes, each including one or more shots (i.e. a sequence of frames taken by a single camera over a continuous period of time). Each shot includes one or more frames. Such hierarchical structures can be very well represented with this new Protobuf definition.
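To make the recursion concrete, here is a minimal sketch in plain Python rather than Protobuf. The `Doc` class and its field names are my own stand-ins for illustration, not the actual Jina message definition; the only thing taken from the description above is that a document holds next-level chunks and same-level matches, each of which is again a document.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Doc:
    """Hypothetical stand-in for the unified Document message."""
    text: str
    weight: float = 1.0
    chunks: List["Doc"] = field(default_factory=list)   # next-level units
    matches: List["Doc"] = field(default_factory=list)  # same-level results


def max_depth(doc: Doc) -> int:
    """Count the chunk levels from this document down to its deepest chunk."""
    return 1 + max((max_depth(c) for c in doc.chunks), default=0)


# long document -> chapter -> paragraph -> sentence
sentence = Doc("Jina is a neural search framework.")
paragraph = Doc("paragraph 1", chunks=[sentence])
chapter = Doc("chapter 1", chunks=[paragraph])
book = Doc("long document", chunks=[chapter])
```

Because every level uses the same type, the same helper works whether it is handed the whole book or a single sentence.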
Recursive Drivers Work as Convolution/Deconvolution
With a low-level change like that, one may expect massive work on refactoring the high-level implementation. In fact, the refactoring work is mostly concentrated on the
Driver layer, leaving
Flow untouched. I talked about the importance of having layered APIs for progressive abstraction in large framework design in my earlier post. Feel free to read that post if you are interested.
Erasing the difference between Document and Chunk now makes Executor completely “Protobuf-agnostic”, allowing one to implement Executor more consistently. For instance, our previous DocSpecificExecutor and its chunk-level counterpart can now be unified into one.
What is really interesting is the way that Driver works now. In v0.5.0, we add two essential attributes, depth_range and traverse_on, to the drivers, which allow them to follow the Document structure and work recursively. By default, all drivers work on the root level (i.e., level_depth=0) only. To enable recursion on deeper or wider levels, one can simply specify depth_range. In the following example, BaseKVIndexer indexes all Document from level-0 and 1. One can also limit the operation to a specific level by setting depth_range: [k, k].
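The effect of depth_range can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: documents are plain dicts, `apply_in_range` is a hypothetical helper rather than a Jina function, and the traversal follows chunks only.

```python
def apply_in_range(doc, depth_range, fn, depth=0):
    """Call fn on every document whose level lies inside depth_range (inclusive)."""
    lowest, highest = depth_range
    if depth > highest:
        return  # nothing deeper can be in range, stop descending
    if depth >= lowest:
        fn(doc, depth)
    for chunk in doc.get("chunks", []):
        apply_in_range(chunk, depth_range, fn, depth + 1)


root = {
    "id": "root",
    "chunks": [
        {"id": "c1", "chunks": [{"id": "c1-1", "chunks": []}]},
        {"id": "c2", "chunks": [{"id": "c2-1", "chunks": []}]},
    ],
}

# Levels 0 and 1, like depth_range: [0, 1]
levels_0_1 = []
apply_in_range(root, (0, 1), lambda d, lvl: levels_0_1.append(d["id"]))

# A single level, like depth_range: [1, 1]
level_1_only = []
apply_in_range(root, (1, 1), lambda d, lvl: level_1_only.append(d["id"]))
```

Here `levels_0_1` collects the root and its two direct chunks, while `level_1_only` skips the root and stops before the level-2 chunks.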
Recursive segmentation becomes extremely simple. Say we want to segment a long document while keeping its hierarchical structure, the implementation could look like the following:
The recursion can go in both directions, deeper or shallower, depending on the traversal order (preorder or postorder) that the driver follows. In the segmenting example above, SegmentDriver follows the preorder traversal: it creates segments first and then dives into them. On the contrary, BaseRankDriver works in the other direction: we first have to collect matching scores at the deeper levels and then aggregate this information back to the root level. Readers who have a basic AI background may quickly realize this is quite similar to convolution and deconvolution operations in deep neural networks.
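The two directions can be contrasted with a toy sketch. Everything here is an assumption for illustration: documents are plain dicts, `segment` and `aggregate` are hypothetical helpers rather than the real drivers, and taking the maximum child score is just one possible aggregation rule.

```python
def segment(doc, splitters, depth=0):
    """Preorder: create the chunks of this level first, then descend into them."""
    if depth >= len(splitters):
        return doc
    parts = [p.strip() for p in doc["text"].split(splitters[depth]) if p.strip()]
    doc["chunks"] = [{"text": p, "chunks": []} for p in parts]
    for chunk in doc["chunks"]:
        segment(chunk, splitters, depth + 1)
    return doc


def aggregate(doc, leaf_score):
    """Postorder: score the deepest chunks first, then roll the result up."""
    if not doc["chunks"]:
        doc["score"] = leaf_score(doc["text"])
    else:
        doc["score"] = max(aggregate(c, leaf_score) for c in doc["chunks"])
    return doc["score"]


doc = {
    "text": "Jina is open source. It does neural search.\n\nHub hosts executors.",
    "chunks": [],
}
segment(doc, ["\n\n", "."])  # paragraphs first, then sentences
root_score = aggregate(doc, lambda text: float("search" in text))
```

The segmenter has to visit a node before its children because the children do not exist yet; the aggregator has to visit the children first because the parent's score depends on theirs.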
Hello-World after Refactoring
To see how effective this structure is, let’s make a side-by-side comparison of the Flow YAML specification behind jina hello world. From 23 lines to 11 lines! Not bad at all!
New Query Language Driver
In v0.5.0, we introduce a new set of Driver called Query Language Driver. It is designed to filter the content of the request based on a condition defined in the YAML or the payload, so that the executor only works on the filtered request. A Query Language Driver can also be used to route requests into different sub-flows based on the condition. In the example below, by equipping Mode2Encoder with the FilterQL driver, Mode2Encoder will only work on content with modality = mode2 and leave the rest untouched.
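The filtering idea itself is simple enough to sketch in plain Python. The helper `filter_ql`, the toy `mode2_encoder`, and the dict-based documents below are all hypothetical stand-ins, not the FilterQL driver or a real encoder; the sketch only shows that the executor sees nothing but the filtered content.

```python
def filter_ql(docs, lookups):
    """Keep only the documents whose fields equal every value in lookups."""
    return [d for d in docs if all(d.get(k) == v for k, v in lookups.items())]


def mode2_encoder(docs):
    """Toy encoder: attaches a one-dimensional embedding to each document it receives."""
    for d in docs:
        d["embedding"] = [float(len(d["text"]))]


docs = [
    {"text": "a picture caption", "modality": "mode1"},
    {"text": "hello", "modality": "mode2"},
]

# The encoder only ever sees what passes the filter.
mode2_encoder(filter_ql(docs, {"modality": "mode2"}))
```

After this call only the second document carries an embedding; the first one is left exactly as it was.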
Besides setting the attributes in YAML, one can also override the attributes of a Query Language Driver by setting
request.queryset in the Protobuf message. For example,
from jina.flow import Flow
The following query language drivers have been implemented in v0.5.0:
|Query Language Driver|Description|Common Name in other query language|
|---|---|---|
| |select attributes to include in the results| |
| |take a part of the list of | |
| |sort list of | |
| |reverse the list of collections| |
Retrospect: Query Language From Scratch
It is interesting to look back at how we ended up here. Essentially, we need an in-memory ORM/SQL that supports expressions and clauses and allows users to compose advanced queries. Though Python's built-in comprehensions already cover a big chunk of SQL, it is not easy to define Python expressions in YAML or JSON. In an early version of Jina, I tried to leverage Driver to parse arbitrary Python expressions. It certainly accomplished the task but left a big security risk in the whole stack. It is not a fair exchange.
We also reviewed several open-source solutions that implement similar functionality; however, each has some limitations:
- FieldMask: cannot select sub-fields in the
- QuerySet in Django: adds an extra dependency;
- pythonql: cannot parse query language from YAML/JSON;
- lookupy: does not support nested structures in a list, not actively maintained;
- py-enumerable: designed to be LINQ in Python, cannot parse query language from YAML/JSON;
- pyflwor: a customized model view for the data, not actively maintained and BSD license;
- django-pb-model: need to maintain two views (Protobuf and Django) of every Request and frequently convert between them.
In the end, we decided to implement the query language from scratch. There are two ways to implement it:
1. Implement a single all-mighty QueryLangDriver, which handles complex language parsing internally.
2. Implement each query operator as an orthogonal Driver. The complex query language is realized by chaining small independent drivers together.
While Option (1) looks like a cleaner solution, implementing a new query language is non-trivial. We likely have to create a new DSL and its parser. Frankly, nobody likes DSL. Option (2) is much easier to implement and to test. However, some complex syntax may not be realized, as the query language built in this way is always fluent-style.
v0.5.0 goes for the second option. We also leave an interface for the first option, see
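The chaining idea behind the second option can be sketched as small, independent operators applied one after another. The operator names below (`select`, `sort_by`, `slice_`, `reverse`, `chain`) are hypothetical Python functions I made up to mirror the descriptions in the table above; they are not the actual driver classes.

```python
from functools import reduce


def select(*fields):
    """Keep only the listed attributes of each document."""
    return lambda docs: [{k: d[k] for k in fields if k in d} for d in docs]


def sort_by(field, descending=False):
    """Sort the list of documents by one attribute."""
    return lambda docs: sorted(docs, key=lambda d: d[field], reverse=descending)


def slice_(start, end):
    """Take a part of the list of documents."""
    return lambda docs: docs[start:end]


def reverse():
    """Reverse the list of documents."""
    return lambda docs: list(reversed(docs))


def chain(docs, *ops):
    """Apply each operator to the output of the previous one, fluent-style."""
    return reduce(lambda acc, op: op(acc), ops, docs)


docs = [
    {"id": 1, "score": 0.2, "text": "a"},
    {"id": 2, "score": 0.9, "text": "b"},
    {"id": 3, "score": 0.5, "text": "c"},
]
top2 = chain(docs, sort_by("score", descending=True), slice_(0, 2), select("id", "score"))
```

Each operator can be tested in isolation, which is the main appeal of this design. The price is the one mentioned above: a query is always a linear pipeline, so anything that needs nesting or branching is hard to express.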
Hub as a Git Submodule
In v0.5.0, all concrete Executor classes are moved to a separate repository, jina-hub. The jina-hub repository is then used as a git submodule in the jina repository under the folder hub/. Only abstract executors such as BaseExecutor are maintained in the jina repository under the folder executors/. We also clarify the file structure of a customized executor. Besides the Python code, the bundle should include Dockerfile, tests, and README.md. Users can simply type jina hub new in the console to create a new executor from the template. The following figure illustrates this idea.
To list all executors, one can use
jina check. As before, Hub executors can be used in the executor YAML,
or in Python,
from jina.hub.encoders.image.BigTransferEncoder import BigTransferEncoder
One can also build the executor into a Docker image via:
jina hub build .
and then use the image directly in the Flow API:
from jina.flow import Flow
f = Flow().add(uses='jinahub/pod.encoder.BigTransferEncoder:0.0.2')
In v0.5.0, a unit test is required for all Hub executors. To enforce this rule, we add pytest into the bundle template and Dockerfile. The developer needs to implement the test case based on the given boilerplate. This ensures that every built image can be run out of the box. We wrap the Hub building logic into a GitHub Action and publish it in GitHub Marketplace, allowing people to use it in their own CI/CD workflows.
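As a rough idea of what such a test could look like, here is a pytest-style sketch. The `ToyEncoder` class and both test functions are hypothetical and are not the boilerplate shipped with the template; a real Hub executor would be imported from its bundle instead of being defined next to its tests.

```python
class ToyEncoder:
    """Hypothetical executor: maps each text to a fixed-size vector."""

    def __init__(self, dim=4):
        self.dim = dim

    def encode(self, texts):
        # Deterministic toy embedding derived from the text length.
        return [[float((len(t) + i) % 7) for i in range(self.dim)] for t in texts]


def test_encode_shape():
    """One vector per input, each with the configured dimension."""
    encoder = ToyEncoder(dim=4)
    out = encoder.encode(["hello", "neural search"])
    assert len(out) == 2
    assert all(len(vec) == 4 for vec in out)


def test_encode_is_deterministic():
    """Encoding the same input twice gives the same result."""
    encoder = ToyEncoder()
    assert encoder.encode(["hello"]) == encoder.encode(["hello"])
```

pytest collects functions whose names start with `test_`, so running `pytest` in the bundle directory would pick up both checks.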
Decoupling Hub from the Jina main repository is crucial and must be done now. With the increasing size of the project, the border between the core and hub has become very fuzzy to the community. In the last two months, we found that nearly all executors were committed to jina/executors. However, customized executors often require extra dependencies and specific running environments that are not available locally, which makes it hard to test them in a single run. Before v0.5.0, not all executors in jina/executors were well tested, and the code quality varied greatly. This refactoring sets the Jina Hub back on track and helps our engineers and community to scale better as the project grows.
Centralized vs. Decentralized Jina Hub
Note that committing to the jina-ai/jina-hub repository is not the only way to contribute to or use the Hub. One can also maintain executors in one's own (private) repository by simply adding this GitHub Action to the CI/CD workflow. In other words, we provide both centralized and decentralized ways to use Hub. The following figure illustrates these two procedures.
This post highlighted the three major updates in v0.5.0. Besides that, there are hundreds of patches improving performance, scalability, and ease of use. While the project’s growth has introduced challenges, we feel very fortunate to have the open-source community on our side and to keep the momentum going strong on the development of Jina. The number of contributors to Jina has increased to 49 as of today. If you want to know more about our engineering roadmap and the rationale behind specific features, you are welcome to join our monthly Engineering All Hands in Public via Zoom or YouTube live stream. If you want to join us as a full-time AI/back-end engineer, you are more than welcome; please send your CV to [email protected]. Let’s build the next neural search ecosystem together!