Welcome to EuclidesDB
EuclidesDB is a multi-model machine learning feature database that is tightly coupled with PyTorch and provides a backend for including and querying data in the model feature space.
Category: C/C++ / Machine Learning | Watchers: 33 | Stars: 625 | Forks: 33 | Last update: Jun 25, 2022
Hello. Not an issue, just some general questions. The docs say "EuclidesDB is a multi-model machine learning feature database". Does this mean a query will run against two or more feature spaces? Secondly, where does the "multi-modal learning" take place? Thanks.
Great concept. I am interested in integrating RNN/LSTM support with this; is that on the roadmap for development? I saw an anecdotal reference to it in one of the issues. I am also curious about using EuclidesDB for document search, potentially across millions of documents (sentiment analysis and medical coding category correlations). I assume that would rule out the brute-force search method; how would Annoy or the other options scale with this?
I am not sure whether this is a new feature request or an issue, but my requirement is that I have documents like the following:

    [
      {
        "domain": "ecommerce",
        "public": true,
        "date": "2018-09-12",
        "doc_vec": [-0.92, 0.34, 0.778...],  # dimension 100
        "id": "hdh632387987"
      },
      {
        "domain": "media",
        "public": false,
        "date": "2018-09-12",
        "doc_vec": [-0.12, 0.24, 0.778...],  # dimension 100
        "id": "hdh632383264687"
      }
      ....
    ]

Required features:
So that we can use the GPU to speed up the indexing.
I see the main sorting function exposed is db.find_similar_image, which takes as input an image, a model, and a topk parameter and returns the topk most similar images. If we would like to sort the entire database against a query image, would setting topk to the number of images added to the database for a particular model space work, or would this be prohibitively slow for large databases (500k+ images)? I plan on testing this myself but wanted to see if you have done similar tests already.
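As a point of reference, here is a minimal sketch of the kind of exhaustive query described above. It assumes an already-connected client object db exposing find_similar_image(image, model, topk) with the signature stated in the question; the PIL loading, the keyword names, and the shape of the returned results are assumptions, not the documented API.

    from PIL import Image

    # `db` is assumed to be an already-connected EuclidesDB client exposing
    # find_similar_image(image, model, topk) as described in the question above.
    query = Image.open("query.jpg")

    # Setting topk to the (assumed) number of indexed items effectively ranks
    # the whole model space against the query image. With brute-force search
    # this is O(N) distance computations per query, which is why approximate
    # indexes such as Annoy become attractive at 500k+ images.
    results = db.find_similar_image(image=query, model="resnet18", topk=500_000)

    # The return shape is an assumption: pairs of (item id, distance).
    for item_id, distance in results[:10]:
        print(item_id, distance)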
Sometimes it is useful to search for items similar to another item that is already in the database, so we should implement an RPC method to search for similar items based on an item id that is already in the database.
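Purely to illustrate the proposal, here is a hypothetical sketch of what such a call might look like from a Python client; find_similar_by_id and its parameters do not exist in EuclidesDB today and are made up for this example.

    # Hypothetical API for the proposed RPC: look up an item that is already
    # stored in the database by its id and return its nearest neighbours,
    # instead of re-uploading the item itself. `db` is the same assumed
    # client object as above; all names here are illustrative only.
    results = db.find_similar_by_id(item_id=1234, models=["resnet18"], topk=10)
    for item_id, distance in results:
        print(item_id, distance)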
Due to the limitations of the PyTorch version of the prebuilt container, I wanted to build it myself by running:
docker build -t euclides_package package/
Even after modifying the whole Dockerfile, I can't get the image to build successfully. The latest error was:
CMake Error at CMakeLists.txt:31 (find_package):
  By not providing "FindTorch.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "Torch", but
  CMake did not find one.

  Could not find a package configuration file provided by "Torch" with any
  of the following names:

    TorchConfig.cmake
    torch-config.cmake

  Add the installation prefix of "Torch" to CMAKE_PREFIX_PATH or set
  "Torch_DIR" to a directory containing one of the above files. If "Torch"
  provides a separate development package or SDK, be sure it has been
  installed.
Even when adding these paths to the Dockerfile, it still fails to find them. Has anyone been able to build this container? Could you share your Dockerfile?
Hi, thanks for creating this tool. I got an error during installation, on an Ubuntu GPU EC2 instance. Even though this looks like a PyTorch error, any idea how to debug or fix it?
./euclidesdb -c config.conf
Configuration config.conf loaded.
[EuclidesDB] 2020-05-02 01:21:18,816 [INFO]: EuclidesDB v.0.2.0 initialized.
terminate called after throwing an instance of 'c10::Error'
what(): [enforce fail at inline_container.cc:166] . file not found: resnet101/model.json
frame #0: c10::ThrowEnforceNotMet(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, void const*) + 0x76 (0x7f6529e92bb6 in lib/libc10.so)
frame #1: torch::jit::PyTorchStreamReader::getFileID(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x4ac (0x7f652b5fab0c in lib/libcaffe2.so)
frame #2: torch::jit::PyTorchStreamReader::getRecord(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x33 (0x7f652b5fad13 in lib/libcaffe2.so)
frame #3: <unknown function> + 0x60a442 (0x7f652f0bd442 in lib/libtorch.so.1)
frame #4: torch::jit::load(std::istream&, c10::optional<c10::Device>) + 0x392 (0x7f652f0bf892 in lib/libtorch.so.1)
frame #5: torch::jit::load(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>) + 0x70 (0x7f652f0bfa50 in lib/libtorch.so.1)
frame #6: ./euclidesdb() [0x485487]
frame #7: ./euclidesdb() [0x4869ce]
frame #8: ./euclidesdb() [0x411739]
frame #9: __libc_start_main + 0xe7 (0x7f65294f3b97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #10: ./euclidesdb() [0x4141d9]
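For what it's worth, a likely cause here is a serialization-format mismatch: older libtorch builds (the kind EuclidesDB 0.2.0 links against) expect a model.json record inside the TorchScript archive, while models traced with newer PyTorch releases no longer contain one, so torch::jit::load reports the record as missing. Below is a hedged sketch of re-exporting the model with a PyTorch version matching the bundled libtorch; the output path is an assumption, and EuclidesDB's own model-directory layout is not reproduced here.

    import torch
    import torchvision

    # Re-trace resnet101 with a PyTorch release that matches the libtorch
    # EuclidesDB was built against, then save the TorchScript archive into
    # the model directory referenced by config.conf. The filename below is
    # only an assumption; check the EuclidesDB docs for the expected layout.
    model = torchvision.models.resnet101(pretrained=True).eval()
    example = torch.randn(1, 3, 224, 224)
    traced = torch.jit.trace(model, example)
    traced.save("resnet101/resnet101.pth")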
Work In Progress
See #6.
/source/proto and python/examples/image have to be shared at a higher level in the folder hierarchy; I temporarily did ln -s python/ -> client/python/ and ln -s java/ -> client/java/.
euclidesdb/euclidesdb using bmuschko/gradle-docker-plugin
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
    status = StatusCode.CANCELLED
    details = "Cannot find the module: r"
    debug_error_string = "{"created":"@1572025938.938877000","description":"Error received from peer ipv6:[::1]:50000","file":"src/core/lib/surface/call.cc","file_line":1052,"grpc_message":"Cannot find the module: r","grpc_status":1}"
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "Socket closed"
    debug_error_string = "{"created":"@1572025626.182212000","description":"Error received from peer ipv6:[::1]:50000","file":"src/core/lib/surface/call.cc","file_line":1052,"grpc_message":"Socket closed","grpc_status":14}"
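The "Cannot find the module: r" detail above hints that the server received the single character "r" as a model-space name, which typically happens when a bare string is passed where a list of names is expected and the string gets iterated character by character. This is only a guess at the cause; the snippet below illustrates the generic Python pitfall and does not depend on the EuclidesDB client API.

    # Passing a bare string where a list of model names is expected makes
    # iteration yield single characters, e.g. "r" from "resnet18".
    models = "resnet18"          # wrong: a string
    for m in models:
        print(m)                 # prints "r", "e", "s", ...

    models = ["resnet18"]        # right: a list with one model name
    for m in models:
        print(m)                 # prints "resnet18"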
Thank you for creating this.
With millions or billions of feature vectors, how should we scale? Where is the index stored (in RAM or on disk...)? How fast is it when adding a new image and refreshing the index?
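Regarding where an approximate index can live, here is a small sketch using Annoy (one of the index options discussed in these issues): the index is built in memory, persisted to a file, and memory-mapped on load, so large indexes do not have to be held fully in RAM by every process. Note also that Annoy cannot add items after build(), so refreshing the index means rebuilding it. Dimensions and file names below are arbitrary.

    import random
    from annoy import AnnoyIndex

    dim = 100  # feature dimension, arbitrary for this sketch
    index = AnnoyIndex(dim, "angular")

    # Add items; in a real setup these would be model feature vectors.
    for i in range(10_000):
        index.add_item(i, [random.gauss(0, 1) for _ in range(dim)])

    index.build(10)              # build 10 trees; no items can be added afterwards
    index.save("features.ann")   # persist the index to disk

    # A reader process memory-maps the saved file rather than loading it fully into RAM.
    reader = AnnoyIndex(dim, "angular")
    reader.load("features.ann")
    print(reader.get_nns_by_item(0, 10))  # 10 nearest neighbours of item 0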
The parameter search_k in Annoy isn't the same as top_k. This must be fixed in the search engine.
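To make the distinction concrete: in Annoy, the number of neighbours requested (n) and search_k are independent parameters. search_k controls how many nodes are inspected during the query, an accuracy/speed trade-off that defaults to n * n_trees, so it should not be conflated with the caller's top_k. A small sketch with arbitrary data:

    from annoy import AnnoyIndex

    dim = 8
    index = AnnoyIndex(dim, "angular")
    for i in range(1000):
        index.add_item(i, [float((i + j) % 7 + 1) for j in range(dim)])
    index.build(10)  # 10 trees

    query = [1.0] * dim
    topk = 5

    # n (= topk) controls how many results come back; search_k controls how
    # much of the forest is explored. Raising search_k improves recall at the
    # cost of query time, independently of topk.
    ids = index.get_nns_by_vector(query, topk, search_k=1000)
    print(ids)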