A Better Practice for Managing Many extras_require Dependencies in Python
Background
One problem I was facing when building GNES: Generic Neural Elastic Search is how to effectively control the dependencies in Python. The main dependencies of GNES are very simple: `numpy`, `grpcio`, `pyzmq` and a YAML parser. But this only runs a vanilla GNES with no fancy deep learning models or preprocessors. As GNES aims to provide a generic solution for multi-modality semantic search, it often requires extra packages from different domains, such as NLP and CV. Apparently, these optional dependencies are not required for all uses of GNES: people who use GPT2 as the text encoder may require `pytorch-transformers` but not `opencv-python` or Tensorflow. We already know that `setuptools` provides an `extras_require` field that allows you to define those plugin-like dependencies as follows:
```python
setup(
    name='gnes',
    # ...
    extras_require={
        'bert': ['bert-serving-server>=1.8.6', 'bert-serving-client>=1.8.6'],
        'gpt': ['pytorch-transformers'],
        'annoy': ['annoy==1.15.2'],
        'cv': ['opencv-python>=4.0.0'],
        # ... and many more
    },
)
```
This allows one to later cherry-pick a feature via `pip install gnes[bert]`. However, in practice I found it difficult to maintain this feature map as it grows, due to the following problems:
- a package can enable multiple features, but I don't want to copy-paste the same string `opencv-python>=4.0.0` everywhere;
- the same feature can be enabled by different packages, but often users only install the one they prefer. See `bert` in the last example;
- it is not straightforward to see which features a package enables, and what the consequences are if this package is broken/removed/badly-licensed;
- there is no way to install all optional dependencies via `pip install gnes[all]`;
- hardcoding this big map in `setup.py` is cumbersome. By decoupling the two, I can version-control each one better.
Anyhow, if you dislike something, you can always find a thousand reasons to dislike it.
Solution: Build an “Inverted Index”
The solution I figured out is simple but effective: maintain a package-to-feature map in a separate plain-text file, then parse it into an inverted map in `setup.py`. An inverted index, kind of. For example, my `extra-requirements.txt` looks like the following:
```text
# FORMAT
# put your extra requirements here in the following format:
#
# package[version]: tag1, tag2, ...

pytorch-transformers:          nlp, bert, gpt
flair>=0.4.1:                  nlp, bert
bert-serving-server>=1.8.6:    nlp, bert
bert-serving-client>=1.8.6:    nlp, bert
annoy==1.15.2:                 index
opencv-python>=4.0.0:          cv
```
If you work with an open governance model, you may want to set up some guidelines in `CONTRIBUTING` about the feature tags, so that other contributors can follow them, e.g.:

- always reuse the existing feature tags if possible;
- when creating a new tag, keep it alphabetical, short and general;
- tag order does not matter, and duplicated tags will be ignored.
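Since the map is a plain-text file, such guidelines are also easy to enforce mechanically. Below is a minimal lint sketch; the exact rules (e.g. lowercase-only tags up to 16 characters) are my own assumption for illustration, not something GNES prescribes:

```python
# check_extra_requirements.py -- a hypothetical lint sketch, not part of GNES
import re
import sys

def lint(path='extra-requirements.txt'):
    ok = True
    for ln, line in enumerate(open(path), 1):
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        pkg, _, tags = line.partition(':')
        tag_list = [t.strip() for t in tags.split(',') if t.strip()]
        for t in tag_list:
            # assumed rule: tags are short, lowercase words
            if not re.fullmatch(r'[a-z0-9-]{1,16}', t):
                print(f'line {ln}: tag {t!r} is not short/lowercase')
                ok = False
        if len(tag_list) != len(set(tag_list)):
            # duplicates are harmless (the parser deduplicates), but tidy them up
            print(f'line {ln}: duplicated tags for {pkg!r}')
    return ok

if __name__ == '__main__':
    sys.exit(0 if lint() else 1)
```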
Now here is my parser to "invert" this map into the format that `extras_require` needs:
```python
# this function should be put in 'setup.py'
def get_extra_requires(path, add_all=True):
    import re
    from collections import defaultdict

    with open(path) as fp:
        extra_deps = defaultdict(set)
        for k in fp:
            k = k.strip()
            if k and not k.startswith('#'):
                tags = set()
                if ':' in k:
                    k, v = k.split(':')
                    tags.update(vv.strip() for vv in v.split(','))
                # also use the canonical package name as a tag
                tags.add(re.split('[<=>]', k)[0])
                for t in tags:
                    extra_deps[t].add(k)

        # add tag `all` at the end
        if add_all:
            extra_deps['all'] = set(vv for v in extra_deps.values() for vv in v)

    return extra_deps
```
This returns a map as follows:
```text
bert: {'pytorch-transformers', 'flair>=0.4.1', 'bert-serving-server>=1.8.6', 'bert-serving-client>=1.8.6'}
nlp: {'pytorch-transformers', 'flair>=0.4.1', 'bert-serving-server>=1.8.6', 'bert-serving-client>=1.8.6'}
gpt: {'pytorch-transformers'}
annoy: {'annoy==1.15.2'}
index: {'annoy==1.15.2'}
cv: {'opencv-python>=4.0.0'}
...
all: {'pytorch-transformers', 'flair>=0.4.1', 'bert-serving-server>=1.8.6', 'bert-serving-client>=1.8.6', 'annoy==1.15.2', 'opencv-python>=4.0.0'}
```
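Before wiring this into `setup.py`, it is handy to eyeball the inverted map. A minimal sanity check, assuming `get_extra_requires` from above is pasted into the same script:

```python
# print the inverted map, one tag per line, for a quick visual inspection
if __name__ == '__main__':
    deps = get_extra_requires('extra-requirements.txt')
    for tag in sorted(deps):
        print(f'{tag}: {sorted(deps[tag])}')
```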
A nice thing about this solution is that you can add sophisticated logic to the parse function to enable fancy features. In this example, I use the canonical package name as a feature tag (e.g. `annoy: ['annoy==1.15.2']`), and add the tag `all` to allow a full installation via `pip install .[all]`. You can also add logic to improve the granularity of the tags, or introduce hierarchies to them (a sketch of that follows the next snippet). Just remember to assign the result to `extras_require` in your `setup` construct:
```python
setup(
    name='gnes',
    # ...
    extras_require=get_extra_requires('extra-requirements.txt'),
)
```
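As for tag hierarchies, one way is to layer a parent-child table on top of the inverted map after parsing. The table below is hypothetical (GNES does not define one); it merely shows where such logic would slot in:

```python
# hypothetical tag hierarchy: installing a parent tag pulls in all child tags
TAG_CHILDREN = {
    'nlp': {'bert', 'gpt'},  # assumed hierarchy for illustration only
}

def add_hierarchy(extra_deps, children=TAG_CHILDREN):
    for parent, kids in children.items():
        for kid in kids:
            # the parent tag inherits every package enabled by its children
            extra_deps[parent] |= extra_deps.get(kid, set())
    return extra_deps

# usage in setup.py:
#   extras_require=add_hierarchy(get_extra_requires('extra-requirements.txt'))
```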
The complete example can be found in the GNES repository.
Conclusion
In this article, I introduced a simple yet effective solution for managing optional dependencies via `extras_require` in Python. This method is particularly useful when you have many optional dependencies and you want users to install them as plugins or on-demand features. Compared to the traditional method, where we hardcode a big feature map in `setup.py`, this method is more readable and easier to maintain.
Maintaining a readable `setuptools` script is probably not on the priority list of every (or any?) AI engineer; still, I think it is a nice little trick to keep in mind that makes life a bit easier. I'm not going to call it the best practice, because that would be way too ambitious. So a better practice it is (than what is described in the documentation).
That said, in GNES we actually have a more sustainable way to manage AI model dependencies via Docker containers, covering the required Python packages, the deep learning framework, pretrained models, hyperparameters and GPU drivers. Everything is wrapped in one container, and model developers enjoy full autonomy. This concept is what I call Model as Docker, and Docker as a Plugin. Interested readers are recommended to check out GNES Hub and find out how to use it in GNES.