lasso - A multithreaded and single threaded string interner that allows strings to be cached with a minimal memory footprint, associating them with a unique key that can be used to retrieve them at any time
A multithreaded and single threaded string interner that allows strings to be cached with a minimal memory footprint, associating them with a unique k
A multithreaded and single threaded string interner that allows strings to be cached with a minimal memory footprint, associating them with a unique key that can be used to retrieve them at any time. A Rodeo allows O(1) internment and resolution and can be turned into a RodeoReader to allow for contention-free resolutions with both key to str and str to key operations. It can also be turned into a RodeoResolver with only key to str operations for the lowest possible memory usage.
Which interner do I use?
For single-threaded workloads Rodeo is encouraged, while multi-threaded applications should use ThreadedRodeo. Both of these are the only way to intern strings, but most applications will hit a stage where they are done interning strings, and at that point is where the choice between RodeoReader and RodeoResolver. If the user needs to get keys for strings still, then they must use the RodeoReader (although they can still transfer into a RodeoResolver) at this point. For users who just need key to string resolution, the RodeoResolver gives contention-free access at the minimum possible memory usage. Note that to gain access to ThreadedRodeo the multi-threaded feature is required.
Interner
Thread-safe
Intern String
str to key
key to str
Contention Free
Memory Usage
Rodeo
❌
✅
✅
✅
N/A
Medium
ThreadedRodeo
✅
✅
✅
✅
❌
Most
RodeoReader
✅
❌
✅
✅
✅
Medium
RodeoResolver
✅
❌
❌
✅
✅
Least
Cargo Features
By default lasso has zero dependencies and only Rodeo is exposed. To make use of ThreadedRodeo, you must enable the multi-threaded feature.
multi-threaded - Enables ThreadedRodeo, the interner for multi-threaded tasks
hashbrown-table - Uses hashbrown as the internal HashMap
ahasher - Use ahash's RandomState as the default hasher
no-std - Enables no_std + alloc support for Rodeo and ThreadedRodeo
Automatically enables the following required features:
dashmap/no_std - no_std compatibility for DashMap
hashbrown-table - no_stdHashMap
ahasher - no_std hashing function
serialize - Implements Serialize and Deserialize for all Spur types
Example: Using Rodeo
use lasso::Rodeo;
letmut rodeo = Rodeo::default();
let key = rodeo.get_or_intern("Hello, world!");
// Easily retrieve the value of a key and find the key for valuesassert_eq!("Hello, world!", rodeo.resolve(&key));
assert_eq!(Some(key), rodeo.get("Hello, world!"));
// Interning the same string again will yield the same keylet key2 = rodeo.get_or_intern("Hello, world!");
assert_eq!(key, key2);
Example: Using ThreadedRodeo
use lasso::ThreadedRodeo;
use std::{thread, sync::Arc};
let rodeo = Arc::new(ThreadedRodeo::default());
let key = rodeo.get_or_intern("Hello, world!");
// Easily retrieve the value of a key and find the key for valuesassert_eq!("Hello, world!", rodeo.resolve(&key));
assert_eq!(Some(key), rodeo.get("Hello, world!"));
// Interning the same string again will yield the same keylet key2 = rodeo.get_or_intern("Hello, world!");
assert_eq!(key, key2);
// ThreadedRodeo can be shared across threadslet moved = Arc::clone(&rodeo);
let hello = thread::spawn(move|| {
assert_eq!("Hello, world!", moved.resolve(&key));
moved.get_or_intern("Hello from the thread!")
})
.join()
.unwrap();
assert_eq!("Hello, world!", rodeo.resolve(&key));
assert_eq!("Hello from the thread!", rodeo.resolve(&hello));
Example: Creating a RodeoReader
use lasso::Rodeo;
// Rodeo and ThreadedRodeo are interchangeable hereletmut rodeo = Rodeo::default();
let key = rodeo.get_or_intern("Hello, world!");
assert_eq!("Hello, world!", rodeo.resolve(&key));
let reader = rodeo.into_reader();
// Reader keeps all the strings from the parentassert_eq!("Hello, world!", reader.resolve(&key));
assert_eq!(Some(key), reader.get("Hello, world!"));
// The Reader can now be shared across threads, no matter what kind of Rodeo created it
Example: Creating a RodeoResolver
use lasso::Rodeo;
// Rodeo and ThreadedRodeo are interchangeable hereletmut rodeo = Rodeo::default();
let key = rodeo.get_or_intern("Hello, world!");
assert_eq!("Hello, world!", rodeo.resolve(&key));
let resolver = rodeo.into_resolver();
// Resolver keeps all the strings from the parentassert_eq!("Hello, world!", resolver.resolve(&key));
// The Resolver can now be shared across threads, no matter what kind of Rodeo created it
Benchmarks
Benchmarks were gathered with Criterion.rs OS: Windows 10 CPU: Ryzen 9 3900X at 3800Mhz RAM: 3200Mhz Rustc: Stable 1.43.1
When the nightly feature is enabled, this is the performance you can expect from Rodeo. The functions listed are the ones currently affected by the changes of the nightly feature, and the benchmarks were preformed with std's RandomState. Testing was done on Rust Nightly 1.45.0
(I wrote a comparison of memory usage of different interners. This issue is suggesting some improvements from my own interner implementation that are the cause of the nearly 3x memory usage difference.) You've quoted memory footprint as a key feature of lasso, so I thought you'd be interested to see my analysis. I've not actually measured performance characteristics of any of these tweaks, just the memory usage patterns.
The current definition of the (single threaded) Rodeo is roughly (degenericized)
When using the raw_entry API, the size of the str -> Spur map can be cut at least to a third by instead storing (effectively) HashSet<Spur> (though for raw_entry it's a HashMap<Spur, ()>) and using raw_entry to compare the Spur entries as if they were the string they map to. The cost for this memory saving is an extra indirection involved in hashing/comparing entries in the map (for any str -> Spur conversions).
There is also a further optimization potential for the Spur -> str map. There are two potential options here:
Reduce memory usage of the map by about 25%. Instead of storing the &[u8] directly in the vector, store indexes into the arena. IIUC, these would minimally be { bucket_idx: u32, start_idx: u32, end_idx: u32 } for the current arena. (In my own crate, it's just (u32, u32) as the arena is a single contiguous allocation; however, I believe the less-copies arena approach you have here to probably be significantly better for tail-latency.) The cost would be extra indirection both on Spur -> str lookups and on str -> Spur lookups if the above change to that map is used.
Allow interning of real &'static str in the interner without copying them into internal storage. With a direct Spur -> str map that doesn't go indirect through the arena, this is trivial to support; just put the &'static str into the map. And interning of statically known strings is a surprisingly common and useful thing to be able to do; most contexts that benefit from an interner have some set of known names they know they'll always use and already have baked into the binary's static data section in order to treat them specially. (Here's rustc's list!)
If lasso can incorporate even just the first size optimization for the Spur -> str map, even if just with the hashbrown feature, I'd feel a lot more comfortable recommending lasso instead of maintaining my own interner.
Suppose a new string is being interned by two threads running at the same time, then the if at line 331 (shard.read().get(...)) would return None for both threads. Both threads continue to line 341, one will call self.map.insert() which will write the string into the map, the second thread would do the same, overwriting the value from the first thread.
This would then create a situation where two Spurs could exist that resolve to identical strings, but not be equal to each other because their keys are different.
One option to address this would be to lock the shard for the duration as we insert into the arena and increment the key. If we would like to avoid locking for that long, perhaps checking the return value from self.map.insert(...) would help, as we can verify that we already added that string in, and just return its key, but then we would need to undo the other operations, and it doesn't seem possible to remove strings from the arena. This way we may end up using slightly more memory and keys than actually needed (which is already the case in this scenario), but at the very least, the Spur equality property would be maintained.
I recently came across a situation where I was storing a &mut Lasso and needed to pass it in to something that expected a T: Resolver, and found that &mut Lasso: !Resolver. I would assume this isn't a particularly controversial addition, since these blanket impls are a pretty standard thing, but let me know if I've missed something important here.
As of now, interners just return None when their memory limit has been exhausted, but in the future I plan to shift over to Results to give better insight to what the interner's doing.
A number of methods were added, mostly to allow the same construction that capacity and hashers get, but the set_memory_limits method was also added which allows in-flight configuration of memory limits
Currently there's no sanity checking either, e.x. setting your capacity to 100 and max memory to 10 will allocate 100 bytes of memory but nothing more. I'm not entirely sure how/when sanity checking should be done, of if it's even useful
Adds a new public trait, Internable, which is by default implemented for str, OsStr, Path, and [u8]. This is a noticeably breaking change, due to the addition of a V generic on most things, as well as regressions for a couple type-inference locations.
I can add in #26 now if desired, as this seems a good time to make that change, if breaking inference anyways.
impl Default for Rodeo<Spur, RandomState> redirects to Self::new(), and Self::new() is implemented for any Rodeo<K, RandomState> where K: Key.
Is there any reason not to implement impl<K: Key> Default for Rodeo<K, RandomState>?
Thaks for this nice library :) Here is the issue I am facing:
When producing a release build of a library using lasso 0.4 with the following dependencies
serde = { version = "1.0.114", features = ["derive"] }
serde_json = "1.0.56"
typetag = "0.1"
lasso = { version = "0.4", features = ["multi-threaded", "serialize"] }
During
cargo build --release
Following compilation errors appear
error[E0277]: the size for values of type `str` cannot be known at compilation time
--> /Users/pablomartinez/.cargo/registry/src/github.com-1ecc6299db9ec823/lasso-0.4.1/src/util.rs:457:31
|
457 | let elem: &_ = $slice.get_unchecked($idx);
| ^^^^^^^^^^^^^ doesn't have a size known at compile-time
|
::: /Users/pablomartinez/.cargo/registry/src/github.com-1ecc6299db9ec823/lasso-0.4.1/src/reader.rs:455:49
|
455 | let key_string: &str = unsafe { index_unchecked!(strings, key.into_usize()) };
| ------------------------------------------- in this macro invocation
|
= help: the trait `std::marker::Sized` is not implemented for `str`
= note: to learn more, visit <https://doc.rust-lang.org/book/ch19-04-advanced-types.html#dynamically-sized-types-and-the-sized-trait>
= note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)
error[E0277]: the size for values of type `str` cannot be known at compilation time
--> /Users/pablomartinez/.cargo/registry/src/github.com-1ecc6299db9ec823/lasso-0.4.1/src/reader.rs:433:27
|
433 | let mut strings = Vec::with_capacity(capacity.strings);
| ^^^^^^^^^^^^^^^^^^ doesn't have a size known at compile-time
|
= help: the trait `std::marker::Sized` is not implemented for `str`
= note: to learn more, visit <https://doc.rust-lang.org/book/ch19-04-advanced-types.html#dynamically-sized-types-and-the-sized-trait>
= note: required by `std::vec::Vec::<T>::with_capacity`
error[E0599]: no method named `push` found for struct `std::vec::Vec<str>` in the current scope
--> /Users/pablomartinez/.cargo/registry/src/github.com-1ecc6299db9ec823/lasso-0.4.1/src/reader.rs:471:29
|
471 | strings.push(allocated);
| ^^^^ method not found in `std::vec::Vec<str>`
|
= note: the method `push` exists but the following trait bounds were not satisfied:
`str: std::marker::Sized`
error[E0599]: no method named `get_unchecked` found for struct `std::vec::Vec<str>` in the current scope
--> /Users/pablomartinez/.cargo/registry/src/github.com-1ecc6299db9ec823/lasso-0.4.1/src/util.rs:457:31
|
457 | let elem: &_ = $slice.get_unchecked($idx);
| ^^^^^^^^^^^^^ method not found in `std::vec::Vec<str>`
|
::: /Users/pablomartinez/.cargo/registry/src/github.com-1ecc6299db9ec823/lasso-0.4.1/src/reader.rs:476:38
|
476 | ... unsafe { index_unchecked!(strings, key.into_usize()) };
| ------------------------------------------- in this macro invocation
|
= note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)
error[E0308]: mismatched types
--> /Users/pablomartinez/.cargo/registry/src/github.com-1ecc6299db9ec823/lasso-0.4.1/src/reader.rs:490:13
|
490 | strings,
| ^^^^^^^ expected `&str`, found `str`
|
= note: expected struct `std::vec::Vec<&'static str>`
found struct `std::vec::Vec<str>`
error[E0277]: the size for values of type `str` cannot be known at compilation time
--> /Users/pablomartinez/.cargo/registry/src/github.com-1ecc6299db9ec823/lasso-0.4.1/src/util.rs:457:31
|
457 | let elem: &_ = $slice.get_unchecked($idx);
| ^^^^^^^^^^^^^ doesn't have a size known at compile-time
|
::: /Users/pablomartinez/.cargo/registry/src/github.com-1ecc6299db9ec823/lasso-0.4.1/src/single_threaded.rs:960:49
|
960 | let key_string: &str = unsafe { index_unchecked!(strings, key.into_usize()) };
| ------------------------------------------- in this macro invocation
|
= help: the trait `std::marker::Sized` is not implemented for `str`
= note: to learn more, visit <https://doc.rust-lang.org/book/ch19-04-advanced-types.html#dynamically-sized-types-and-the-sized-trait>
= note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)
error[E0277]: the size for values of type `str` cannot be known at compilation time
--> /Users/pablomartinez/.cargo/registry/src/github.com-1ecc6299db9ec823/lasso-0.4.1/src/single_threaded.rs:938:27
|
938 | let mut strings = Vec::with_capacity(capacity.strings);
| ^^^^^^^^^^^^^^^^^^ doesn't have a size known at compile-time
|
= help: the trait `std::marker::Sized` is not implemented for `str`
= note: to learn more, visit <https://doc.rust-lang.org/book/ch19-04-advanced-types.html#dynamically-sized-types-and-the-sized-trait>
= note: required by `std::vec::Vec::<T>::with_capacity`
error[E0599]: no method named `push` found for struct `std::vec::Vec<str>` in the current scope
--> /Users/pablomartinez/.cargo/registry/src/github.com-1ecc6299db9ec823/lasso-0.4.1/src/single_threaded.rs:976:29
|
976 | strings.push(allocated);
| ^^^^ method not found in `std::vec::Vec<str>`
|
= note: the method `push` exists but the following trait bounds were not satisfied:
`str: std::marker::Sized`
error[E0599]: no method named `get_unchecked` found for struct `std::vec::Vec<str>` in the current scope
--> /Users/pablomartinez/.cargo/registry/src/github.com-1ecc6299db9ec823/lasso-0.4.1/src/util.rs:457:31
|
457 | let elem: &_ = $slice.get_unchecked($idx);
| ^^^^^^^^^^^^^ method not found in `std::vec::Vec<str>`
|
::: /Users/pablomartinez/.cargo/registry/src/github.com-1ecc6299db9ec823/lasso-0.4.1/src/single_threaded.rs:981:38
|
981 | ... unsafe { index_unchecked!(strings, key.into_usize()) };
| ------------------------------------------- in this macro invocation
|
= note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)
error[E0308]: mismatched types
--> /Users/pablomartinez/.cargo/registry/src/github.com-1ecc6299db9ec823/lasso-0.4.1/src/single_threaded.rs:995:13
|
995 | strings,
| ^^^^^^^ expected `&str`, found `str`
|
= note: expected struct `std::vec::Vec<&'static str>`
found struct `std::vec::Vec<str>`
error: aborting due to 10 previous errors
Some errors have detailed explanations: E0277, E0308, E0599.
For more information about an error, try `rustc --explain E0277`.
error: could not compile `lasso`.
The issue doesn't appear when producing a non release build. Ended moving to lasso0.3.1 but the Key serializations isn't as nicer as in 0.4.1 :)