veloren/common/src/volumes/chunk.rs

common: Rework `Chunk` and `Chonk` implementation

Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`.

The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were:

| Benchmark                           | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit          |
| ----------------------------------- | --------------------------------- | --------------------------- | -------------------- |
| `full read` (81920 voxels)          | 17.7ns per voxel                  | 8.9ns per voxel             | **3.6ns** per voxel  |
| `constrained read` (4913 voxels)    | 67.0ns per voxel                  | 40.1ns per voxel            | **14.1ns** per voxel |
| `local read` (125 voxels)           | 17.5ns per voxel                  | 14.7ns per voxel            | **3.8ns** per voxel  |
| `X-direction read` (17 voxels)      | 17.8ns per voxel                  | 25.9ns per voxel            | **4.2ns** per voxel  |
| `Y-direction read` (17 voxels)      | 18.4ns per voxel                  | 33.3ns per voxel            | **4.5ns** per voxel  |
| `Z-direction read` (17 voxels)      | 18.6ns per voxel                  | 38.2ns per voxel            | **5.4ns** per voxel  |
| `long Z-direction read` (65 voxels) | 18.0ns per voxel                  | 37.7ns per voxel            | **5.1ns** per voxel  |
| `full write (dense)` (81920 voxels) | 17.9ns per voxel                  | **10.3ns** per voxel        | 12.4ns per voxel     |

This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure:

The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.)

There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone.

To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group

* (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or
* (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels.

(Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.)

Rationale:

The index buffer should be small because:

* Small size increases the probability that it will always be in cache.
* The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory.

The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.

2019-09-06 13:23:38 +00:00
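The group and index-buffer scheme described in the commit message can be illustrated with a small standalone sketch. This is not the code from this file: it hard-codes a `32*32*16` volume with `4*4*4` groups instead of deriving them from `VolSize`, and the names `SparseChunk`, `group_index`, `index_in_group` and `EMPTY` are hypothetical. It also assumes that a group counts as stored exactly when its index is smaller than the number of groups currently held in `vox`, which is one way to resolve the "255" ambiguity mentioned above.

```rust
// Simplified, hypothetical sketch of the sparse group layout.
// Sizes are fixed here; the real `Chunk` derives them from `S: VolSize`.
const GROUP_SIDE: u32 = 4; // groups are 4*4*4 voxels
const GROUPS_X: u32 = 8; // 32 / 4
const GROUPS_Y: u32 = 8; // 32 / 4
const GROUP_VOLUME: usize = 64; // 4 * 4 * 4
const GROUP_TOTAL: usize = 256; // 8 * 8 * 4
const EMPTY: u8 = 255; // marker for "group not stored"

pub struct SparseChunk<V> {
    indices: [u8; GROUP_TOTAL], // one byte per group: 256 bytes in total
    vox: Vec<V>,                // stored groups, back to back
    default: V,                 // value of every voxel in a non-stored group
}

impl<V: Clone> SparseChunk<V> {
    pub fn new(default: V) -> Self {
        Self { indices: [EMPTY; GROUP_TOTAL], vox: Vec::new(), default }
    }

    // Which of the 256 groups contains `pos`. No bounds check is done:
    // the caller must keep `pos` inside the 32*32*16 volume.
    fn group_index(pos: [u32; 3]) -> usize {
        let g = pos.map(|c| c / GROUP_SIDE);
        (g[0] + g[1] * GROUPS_X + g[2] * GROUPS_X * GROUPS_Y) as usize
    }

    // Offset of `pos` inside its group.
    fn index_in_group(pos: [u32; 3]) -> usize {
        let r = pos.map(|c| c % GROUP_SIDE);
        (r[0] + r[1] * GROUP_SIDE + r[2] * GROUP_SIDE * GROUP_SIDE) as usize
    }

    pub fn get(&self, pos: [u32; 3]) -> &V {
        let slot = self.indices[Self::group_index(pos)] as usize;
        let stored_groups = self.vox.len() / GROUP_VOLUME;
        // Assumption: 255 only names a real slot once all 256 groups are stored.
        if slot < stored_groups {
            &self.vox[slot * GROUP_VOLUME + Self::index_in_group(pos)]
        } else {
            &self.default
        }
    }

    pub fn set(&mut self, pos: [u32; 3], vox: V) {
        let group = Self::group_index(pos);
        let stored_groups = self.vox.len() / GROUP_VOLUME;
        let mut slot = self.indices[group] as usize;
        if slot >= stored_groups {
            // First write into this group: append a default-filled group.
            // `stored_groups` is at most 255 here, so it fits into a `u8`.
            slot = stored_groups;
            self.indices[group] = slot as u8;
            let fill = self.default.clone();
            self.vox.extend(std::iter::repeat(fill).take(GROUP_VOLUME));
        }
        self.vox[slot * GROUP_VOLUME + Self::index_in_group(pos)] = vox;
    }
}
```

In this sketch an untouched chunk costs only the 256-byte index buffer, and every first write into a group grows `vox` by 64 voxels. A real implementation would probably also avoid allocating a group when the written value equals the default.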
use crate::vol::{
    BaseVol, IntoPosIterator, IntoVolIterator, RasterableVol, ReadVol, VolSize, WriteVol,
};
use core::{hash::Hash, iter::Iterator, marker::PhantomData, mem};
use hashbrown::HashMap;
use serde::{Deserialize, Serialize};
use vek::*;
common: Rework volume API

See the doc comments in `common/src/vol.rs` for more information on the API itself.

The changes include:

* Consistent `Err`/`Error` naming.
  * Types are named `...Error`.
  * `enum` variants are named `...Err`.
* Rename `VolMap{2d, 3d}` -> `VolGrid{2d, 3d}`. This is in preparation for an upcoming change where a “map” in the game-related sense will be added.
* Add volume iterators. There are two types of them:
  * _Position_ iterators obtained from the trait `IntoPosIterator` using the method `fn pos_iter(self, lower_bound: Vec3<i32>, upper_bound: Vec3<i32>) -> ...` which returns an iterator over `Vec3<i32>`.
  * _Volume_ iterators obtained from the trait `IntoVolIterator` using the method `fn vol_iter(self, lower_bound: Vec3<i32>, upper_bound: Vec3<i32>) -> ...` which returns an iterator over `(Vec3<i32>, &Self::Vox)`.

  Those traits will usually be implemented by references to volume types (i.e. `impl IntoVolIterator<'a> for &'a T` where `T` is some type which usually implements several volume traits, such as `Chunk`).
  * _Position_ iterators iterate over the positions valid for that volume.
  * _Volume_ iterators do the same but return not only the position but also the voxel at that position, in each iteration.
* Introduce trait `RectSizedVol` for the use case which we have with `Chonk`: A `Chonk` is sized only in x and y direction.
* Introduce traits `RasterableVol`, `RectRasterableVol`
  * `RasterableVol` represents a volume that is compile-time sized and has its lower bound at `(0, 0, 0)`. The name `RasterableVol` was chosen because such a volume can be used with `VolGrid3d`.
  * `RectRasterableVol` represents a volume that is compile-time sized at least in x and y direction and has its lower bound at `(0, 0, z)`. There's no requirement on the lower bound or size in z direction. The name `RectRasterableVol` was chosen because such a volume can be used with `VolGrid2d`.

2019-09-03 22:23:29 +00:00

#[derive(Debug)]
pub enum ChunkError {
    OutOfBounds,
}
/// The volume is spatially subdivided into groups of `4*4*4` blocks. Since a
/// `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4`
/// groups. (These numbers are generic in the actual code such that there are
/// always `256` groups. I.e. the group size is chosen depending on the desired
/// total size of the `Chunk`.)
///
/// There's a single vector `self.vox` which consecutively stores these groups.
/// Each group might or might not be contained in `self.vox`. A group that is
/// not contained represents that the full group consists only of `self.default`
/// voxels. This saves a lot of memory because oftentimes a `Chunk` consists of
/// either a lot of air or a lot of stone.
///
/// To track whether a group is contained in `self.vox`, there's an index buffer
/// `self.indices : [u8; 256]`. It contains for each group
///
/// * (a) the order in which it has been inserted into `self.vox`, if the group
///   is contained in `self.vox` or
/// * (b) 255, otherwise. That case represents that the whole group consists
///   only of `self.default` voxels.
///
/// (Note that 255 is a valid insertion order for case (a) only if `self.vox` is
/// full and then no other group has the index 255. Therefore there's no
/// ambiguity.)
///
/// ## Rationale:
///
/// The index buffer should be small because:
///
/// * Small size increases the probability that it will always be in cache.
/// * The index buffer is allocated for every `Chunk` and an almost empty
///   `Chunk` shall not consume too much memory.
///
/// The number of 256 groups is particularly nice because it means that the
/// index buffer can consist of `u8`s. This keeps the space requirement for the
/// index buffer as low as 4 cache lines.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Chunk<V, S: VolSize, M> {
    indices: Vec<u8>, /* TODO (haslersn): Box<[u8; S::SIZE.x * S::SIZE.y * S::SIZE.z]>, this is
                       * however not possible in Rust yet */
    vox: Vec<V>,
    default: V,
    meta: M,
    phantom: PhantomData<S>,
}
impl<V, S: VolSize, M> Chunk<V, S, M> {
pub const GROUP_COUNT: Vec3<u32> = Vec3::new(
S::SIZE.x / Self::GROUP_SIZE.x,
S::SIZE.y / Self::GROUP_SIZE.y,
S::SIZE.z / Self::GROUP_SIZE.z,
);
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. 
(These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
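As a rough illustration of the index-buffer scheme described in the commit message above, the following sketch models a chunk with exactly 256 groups of 64 voxels each. The type `SparseGroups`, its `get`/`set` methods and the use of `u8` voxels are assumptions made for this example; they are not the actual `Chunk` API.

```rust
/// Voxels per group: 4*4*4, as in the 32*32*16 example.
const GROUP_VOLUME: usize = 64;

/// Hypothetical, simplified stand-in for the described structure.
struct SparseGroups {
    /// For each of the 256 groups: its slot in `vox`, or 255 if absent.
    indices: [u8; 256],
    /// Stored groups, laid out back to back in insertion order.
    vox: Vec<u8>,
    /// Value of every voxel in a group that is not stored.
    default: u8,
}

impl SparseGroups {
    fn new(default: u8) -> Self {
        Self { indices: [255; 256], vox: Vec::new(), default }
    }

    fn num_groups(&self) -> usize { self.vox.len() / GROUP_VOLUME }

    fn get(&self, grp_idx: usize, rel_idx: usize) -> u8 {
        let base = usize::from(self.indices[grp_idx]);
        if base >= self.num_groups() {
            // Slot index is out of range: the group is not stored.
            self.default
        } else {
            self.vox[base * GROUP_VOLUME + rel_idx]
        }
    }

    fn set(&mut self, grp_idx: usize, rel_idx: usize, v: u8) {
        let mut base = usize::from(self.indices[grp_idx]);
        if base >= self.num_groups() {
            if v == self.default {
                return; // Nothing to store, the group stays implicit.
            }
            // Append a new group filled with the default value.
            base = self.num_groups();
            let new_len = self.vox.len() + GROUP_VOLUME;
            self.vox.resize(new_len, self.default);
            // At most 256 groups exist, so `base` is at most 255 here.
            self.indices[grp_idx] = base as u8;
        }
        self.vox[base * GROUP_VOLUME + rel_idx] = v;
    }
}
```

The "no ambiguity" remark can be seen in `get`: a stored index of 255 only counts as "contained" once `num_groups()` has reached 256, at which point every group is stored and no group needs the sentinel.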
/// `GROUP_COUNT_TOTAL` is always `256`, except if `VOLUME < 256`
const GROUP_COUNT_TOTAL: u32 = Self::VOLUME / Self::GROUP_VOLUME;
const GROUP_LONG_SIDE_LEN: u32 = 1 << ((Self::GROUP_VOLUME * 4 - 1).count_ones() / 3);
const GROUP_SIZE: Vec3<u32> = Vec3::new(
Self::GROUP_LONG_SIDE_LEN,
Self::GROUP_LONG_SIDE_LEN,
Self::GROUP_VOLUME / (Self::GROUP_LONG_SIDE_LEN * Self::GROUP_LONG_SIDE_LEN),
);
const GROUP_VOLUME: u32 = [Self::VOLUME / 256, 1][(Self::VOLUME < 256) as usize];
const VOLUME: u32 = (S::SIZE.x * S::SIZE.y * S::SIZE.z) as u32;
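The constant expressions above can be checked in isolation. The sketch below restates them as free `const fn`s (the function names are made up for this example) and assumes the volume is a power of two, as `filled` asserts. For a group volume of `2^k`, the value `4 * 2^k - 1` has `k + 2` one bits, so `count_ones() / 3` yields `ceil(k / 3)` and the long side becomes `2^ceil(k / 3)`.

```rust
/// Mirrors `GROUP_VOLUME`: the volume is split into 256 groups,
/// unless it is smaller than 256 voxels.
const fn group_volume(volume: u32) -> u32 {
    if volume < 256 { 1 } else { volume / 256 }
}

/// Mirrors `GROUP_LONG_SIDE_LEN` for a power-of-two group volume.
const fn group_long_side_len(group_volume: u32) -> u32 {
    1 << ((group_volume * 4 - 1).count_ones() / 3)
}

/// Mirrors `GROUP_SIZE` as `[x, y, z]`.
const fn group_size(volume: u32) -> [u32; 3] {
    let gv = group_volume(volume);
    let long = group_long_side_len(gv);
    [long, long, gv / (long * long)]
}

/// Mirrors `GROUP_COUNT_TOTAL`.
const fn group_count_total(volume: u32) -> u32 {
    volume / group_volume(volume)
}
```

For the `32*32*16` chunk from the commit message this gives a group volume of 64 and a group size of `4*4*4`; a `16*16*16` volume would give flat `4*4*1` groups.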
/// Creates a new `Chunk` with the dimensions given by `S::SIZE` and all
/// voxels filled with duplicates of the provided voxel.
pub fn filled(default: V, meta: M) -> Self {
// TODO (haslersn): Alter into compile time assertions
//
// An extent is valid if it fulfils the following conditions.
//
// 1. In each direction, the extent is a power of two.
// 2. In each direction, the group size is in [1, 256].
// 3. In each direction, the group count is in [1, 256].
//
// Rationales:
//
// 1. We have code in the implementation that assumes it. In particular,
// code using `.count_ones()`.
// 2. The maximum group size is `256x256x256`, because there's code that
// stores group relative indices as `u8`.
// 3. There's code that stores group indices as `u8`.
debug_assert!(S::SIZE.x.is_power_of_two());
debug_assert!(S::SIZE.y.is_power_of_two());
debug_assert!(S::SIZE.z.is_power_of_two());
debug_assert!(0 < Self::GROUP_SIZE.x);
debug_assert!(0 < Self::GROUP_SIZE.y);
debug_assert!(0 < Self::GROUP_SIZE.z);
debug_assert!(Self::GROUP_SIZE.x <= 256);
debug_assert!(Self::GROUP_SIZE.y <= 256);
debug_assert!(Self::GROUP_SIZE.z <= 256);
debug_assert!(0 < Self::GROUP_COUNT.x);
debug_assert!(0 < Self::GROUP_COUNT.y);
debug_assert!(0 < Self::GROUP_COUNT.z);
debug_assert!(Self::GROUP_COUNT.x <= 256);
debug_assert!(Self::GROUP_COUNT.y <= 256);
debug_assert!(Self::GROUP_COUNT.z <= 256);
Self {
indices: vec![255; Self::GROUP_COUNT_TOTAL as usize],
vox: Vec::new(),
default,
meta,
phantom: PhantomData,
}
}
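/// Compress this subchunk by frequency: the value shared by the most
/// homogeneous groups becomes the new default, and all other groups are
/// repacked into a fresh voxel array.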
pub fn defragment(&mut self)
where
V: Clone + Eq + Hash,
{
// First, construct a HashMap with max capacity equal to GROUP_COUNT_TOTAL
// (since each group can contribute at most one entry).
let mut map = HashMap::with_capacity(Self::GROUP_COUNT_TOTAL as usize);
let vox = &self.vox;
let default = &self.default;
self.indices
.iter()
.enumerate()
.for_each(|(grp_idx, &base)| {
let start = usize::from(base) * Self::GROUP_VOLUME as usize;
let end = start + Self::GROUP_VOLUME as usize;
if let Some(group) = vox.get(start..end) {
// Check to see if all blocks in this group are the same.
let mut group = group.iter();
let first = group.next().expect("GROUP_VOLUME ≥ 1");
if group.all(|block| block == first) {
// All blocks in the group were the same, so add our position to this entry
// in the HashMap.
map.entry(first).or_insert(vec![]).push(grp_idx);
}
} else {
// This slot is empty (i.e. has the default value).
map.entry(default).or_insert(vec![]).push(grp_idx);
}
});
// Now, find the block with max frequency in the HashMap and make that our new
// default.
let (new_default, default_groups) = if let Some((new_default, default_groups)) = map
.into_iter()
.max_by_key(|(_, default_groups)| default_groups.len())
{
(new_default.clone(), default_groups)
} else {
// There is no good choice for default group, so leave it as is.
return;
};
// For simplicity, we construct a completely new voxel array rather than
// attempting in-place updates (TODO: consider changing this).
let mut new_vox =
Vec::with_capacity(Self::GROUP_COUNT_TOTAL as usize - default_groups.len());
let num_groups = self.num_groups();
self.indices
.iter_mut()
.enumerate()
.for_each(|(grp_idx, base)| {
if default_groups.contains(&grp_idx) {
// Default groups become 255
*base = 255;
} else {
// Other groups are allocated in increasing order by group index.
// NOTE: The cast to `u8` cannot overflow. There are at most 256 groups, so at
// most 255 groups have already been written to `new_vox` at this point.
let old_base = usize::from(mem::replace(
base,
(new_vox.len() / Self::GROUP_VOLUME as usize) as u8,
));
if old_base >= num_groups {
// Old default, which (since we reached this branch) is not equal to the new
// default, so we have to write out the old default.
new_vox
.resize(new_vox.len() + Self::GROUP_VOLUME as usize, default.clone());
} else {
let start = old_base * Self::GROUP_VOLUME as usize;
let end = start + Self::GROUP_VOLUME as usize;
new_vox.extend_from_slice(&vox[start..end]);
}
}
});
// Finally, reset our vox and default values to the new ones.
self.vox = new_vox;
self.default = new_default;
}
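The first half of `defragment` (counting homogeneous groups per value and picking the most frequent one) can be sketched on its own. The function `pick_new_default` and the representation of groups as `Option<Vec<u8>>` are assumptions made for this example; the real method works on `self.indices` and `self.vox` and also records which groups share the winning value.

```rust
use std::collections::HashMap;

/// Hypothetical helper. Each group is either `None` (implicitly all
/// `default`) or `Some(voxels)` with its voxels stored explicitly.
fn pick_new_default(groups: &[Option<Vec<u8>>], default: u8) -> u8 {
    let mut counts: HashMap<u8, usize> = HashMap::new();
    for group in groups {
        match group {
            // An absent group counts as one homogeneous group of `default`.
            None => *counts.entry(default).or_insert(0) += 1,
            Some(voxels) => {
                // Only groups whose voxels are all equal are counted.
                if let Some((&first, rest)) = voxels.split_first() {
                    if rest.iter().all(|&v| v == first) {
                        *counts.entry(first).or_insert(0) += 1;
                    }
                }
            },
        }
    }
    // Pick the value with the most homogeneous groups. With no candidates
    // at all, keep the old default.
    counts
        .into_iter()
        .max_by_key(|&(_, n)| n)
        .map(|(v, _)| v)
        .unwrap_or(default)
}
```

When two values are tied, the result depends on `HashMap` iteration order, so the sketch makes no promise about which of them wins.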
/// Get a reference to the internal metadata.
pub fn metadata(&self) -> &M { &self.meta }
/// Get a mutable reference to the internal metadata.
pub fn metadata_mut(&mut self) -> &mut M { &mut self.meta }
/// Returns the number of groups that are currently stored in `self.vox`.
pub fn num_groups(&self) -> usize { self.vox.len() / Self::GROUP_VOLUME as usize }
/// Returns `Some(v)` if the chunk is homogeneous and contains nothing but
/// voxels of value `v`, and `None` otherwise. This method is
/// conservative (it may return `None` when the chunk is
/// actually homogeneous) unless called immediately after `defragment`.
pub fn homogeneous(&self) -> Option<&V> {
if self.num_groups() == 0 {
Some(&self.default)
} else {
None
}
}
#[inline(always)]
fn grp_idx(pos: Vec3<i32>) -> u32 {
let grp_pos = pos.map2(Self::GROUP_SIZE, |e, s| e as u32 / s);
(grp_pos.z * (Self::GROUP_COUNT.y * Self::GROUP_COUNT.x))
+ (grp_pos.y * Self::GROUP_COUNT.x)
+ (grp_pos.x)
}
#[inline(always)]
fn rel_idx(pos: Vec3<i32>) -> u32 {
let rel_pos = pos.map2(Self::GROUP_SIZE, |e, s| e as u32 % s);
(rel_pos.z * (Self::GROUP_SIZE.y * Self::GROUP_SIZE.x))
+ (rel_pos.y * Self::GROUP_SIZE.x)
+ (rel_pos.x)
}
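To make the two index computations concrete, here is a standalone sketch for the `32*32*16` case, with a group size of `4*4*4` and a group count of `8*8*4`. It uses plain `[u32; 3]` arrays instead of `Vec3`, and the constant and function names are chosen for this example only.

```rust
const GROUP_SIZE: [u32; 3] = [4, 4, 4];
const GROUP_COUNT: [u32; 3] = [8, 8, 4];

/// Index of the group that contains `pos` (x fastest, z slowest).
fn grp_idx(pos: [u32; 3]) -> u32 {
    let g = [
        pos[0] / GROUP_SIZE[0],
        pos[1] / GROUP_SIZE[1],
        pos[2] / GROUP_SIZE[2],
    ];
    g[2] * (GROUP_COUNT[1] * GROUP_COUNT[0]) + g[1] * GROUP_COUNT[0] + g[0]
}

/// Index of `pos` relative to the start of its group.
fn rel_idx(pos: [u32; 3]) -> u32 {
    let r = [
        pos[0] % GROUP_SIZE[0],
        pos[1] % GROUP_SIZE[1],
        pos[2] % GROUP_SIZE[2],
    ];
    r[2] * (GROUP_SIZE[1] * GROUP_SIZE[0]) + r[1] * GROUP_SIZE[0] + r[0]
}
```

For the position `(5, 9, 14)` the group position is `(1, 2, 3)`, so `grp_idx` is `3*64 + 2*8 + 1 = 209`; the position inside that group is `(1, 1, 2)`, so `rel_idx` is `2*16 + 1*4 + 1 = 37`.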
#[inline(always)]
fn idx_unchecked(&self, pos: Vec3<i32>) -> Option<usize> {
let grp_idx = Self::grp_idx(pos);
let rel_idx = Self::rel_idx(pos);
let base = u32::from(self.indices[grp_idx as usize]);
let num_groups = self.vox.len() as u32 / Self::GROUP_VOLUME;
if base >= num_groups {
None
} else {
Some((base * Self::GROUP_VOLUME + rel_idx) as usize)
}
}

    /// Returns the index into `self.vox` for `pos`, first appending the
    /// containing group (filled with `self.default`) if it is not yet stored.
    /// `pos` is not bounds-checked.
    #[inline(always)]
    fn force_idx_unchecked(&mut self, pos: Vec3<i32>) -> usize
    where
        V: Clone,
    {
        let grp_idx = Self::grp_idx(pos);
        let rel_idx = Self::rel_idx(pos);
        let base = &mut self.indices[grp_idx as usize];
        let num_groups = self.vox.len() as u32 / Self::GROUP_VOLUME;
        // An index entry at or beyond the number of stored groups means the
        // group is absent: append it and record its insertion order.
        if u32::from(*base) >= num_groups {
            *base = num_groups as u8;
            self.vox
                .extend(std::iter::repeat(self.default.clone()).take(Self::GROUP_VOLUME as usize));
        }
        (u32::from(*base) * Self::GROUP_VOLUME + rel_idx) as usize
    }

    /// Returns the voxel at `pos`, or `self.default` if its group is not
    /// stored. `pos` is not bounds-checked.
    #[inline(always)]
    fn get_unchecked(&self, pos: Vec3<i32>) -> &V {
        match self.idx_unchecked(pos) {
            Some(idx) => &self.vox[idx],
            None => &self.default,
        }
    }

    /// Writes `vox` at `pos` and returns the voxel previously stored there.
    /// `pos` is not bounds-checked.
    #[inline(always)]
    fn set_unchecked(&mut self, pos: Vec3<i32>, vox: V) -> V
    where
        V: Clone + PartialEq,
    {
        if vox != self.default {
            // Non-default voxel: the group must be stored, so create it if needed.
            let idx = self.force_idx_unchecked(pos);
            core::mem::replace(&mut self.vox[idx], vox)
        } else if let Some(idx) = self.idx_unchecked(pos) {
            // Default voxel written into a group that is already stored.
            core::mem::replace(&mut self.vox[idx], vox)
        } else {
            // Default voxel written into an absent group: nothing to store.
            self.default.clone()
        }
    }
}
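
// Illustrative sketch only, not part of this file. It restates the sparse
// group storage used above in a self-contained form, assuming 256 groups of
// 64 voxels and pre-flattened (group, relative) indices. The names
// `SparseGroups`, `GROUP_COUNT`, `idx`, `get` and `set` are hypothetical.

```rust
const GROUP_COUNT: usize = 256;
const GROUP_VOLUME: usize = 64; // 4 * 4 * 4 voxels per group

struct SparseGroups<V> {
    /// Per group: its insertion order into `vox`, or 255 if it is absent.
    indices: [u8; GROUP_COUNT],
    /// Stored groups, back to back, `GROUP_VOLUME` voxels each.
    vox: Vec<V>,
    /// Value of every voxel in a group that is not stored.
    default: V,
}

impl<V: Clone + PartialEq> SparseGroups<V> {
    fn new(default: V) -> Self {
        Self { indices: [255; GROUP_COUNT], vox: Vec::new(), default }
    }

    /// Index into `vox`, or `None` if the group is not stored.
    fn idx(&self, grp: usize, rel: usize) -> Option<usize> {
        let base = usize::from(self.indices[grp]);
        let num_groups = self.vox.len() / GROUP_VOLUME;
        if base >= num_groups { None } else { Some(base * GROUP_VOLUME + rel) }
    }

    fn get(&self, grp: usize, rel: usize) -> &V {
        match self.idx(grp, rel) {
            Some(i) => &self.vox[i],
            None => &self.default,
        }
    }

    /// Writes `v` and returns the previous voxel. A group is only appended
    /// when a non-default voxel is written into it.
    fn set(&mut self, grp: usize, rel: usize, v: V) -> V {
        if v != self.default {
            let num_groups = self.vox.len() / GROUP_VOLUME;
            if usize::from(self.indices[grp]) >= num_groups {
                self.indices[grp] = num_groups as u8;
                self.vox
                    .extend(std::iter::repeat(self.default.clone()).take(GROUP_VOLUME));
            }
            let i = usize::from(self.indices[grp]) * GROUP_VOLUME + rel;
            std::mem::replace(&mut self.vox[i], v)
        } else if let Some(i) = self.idx(grp, rel) {
            std::mem::replace(&mut self.vox[i], v)
        } else {
            self.default.clone()
        }
    }
}
```

// In this sketch, writing the default value into an absent group leaves `vox`
// untouched, which is what keeps mostly-uniform chunks small.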

impl<V, S: VolSize, M> BaseVol for Chunk<V, S, M> {
    type Error = ChunkError;
    type Vox = V;
}

impl<V, S: VolSize, M> RasterableVol for Chunk<V, S, M> {
    const SIZE: Vec3<u32> = S::SIZE;
}
impl<V, S: VolSize, M> ReadVol for Chunk<V, S, M> {
2019-01-02 19:22:01 +00:00
#[inline(always)]
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. 
(These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
    fn get(&self, pos: Vec3<i32>) -> Result<&Self::Vox, Self::Error> {
        if !pos
            .map2(S::SIZE, |e, s| 0 <= e && e < s as i32)
            .reduce_and()
        {
            Err(Self::Error::OutOfBounds)
        } else {
            Ok(self.get_unchecked(pos))
        }
    }
}

impl<V: Clone + PartialEq, S: VolSize, M> WriteVol for Chunk<V, S, M> {
    #[inline(always)]
    fn set(&mut self, pos: Vec3<i32>, vox: Self::Vox) -> Result<Self::Vox, Self::Error> {
        if !pos
            .map2(S::SIZE, |e, s| 0 <= e && e < s as i32)
            .reduce_and()
        {
            Err(Self::Error::OutOfBounds)
        } else {
            Ok(self.set_unchecked(pos, vox))
        }
    }
}

pub struct ChunkPosIter<V, S: VolSize, M> {
    // Iteration bounds (`lb` inclusive, `ub` exclusive) and the current
    // position, stored as `i32` vectors.
    lb: Vec3<i32>,
    ub: Vec3<i32>,
    pos: Vec3<i32>,
    phantom: PhantomData<Chunk<V, S, M>>,
}

impl<V, S: VolSize, M> ChunkPosIter<V, S, M> {
    fn new(lower_bound: Vec3<i32>, upper_bound: Vec3<i32>) -> Self {
        // If the range is empty, then we have the special case `ub = lower_bound`.
        let ub = if lower_bound.map2(upper_bound, |l, u| l < u).reduce_and() {
            upper_bound
        } else {
            lower_bound
        };
        Self {
            lb: lower_bound,
            ub,
            pos: lower_bound,
            phantom: PhantomData,
        }
    }
}

impl<V, S: VolSize, M> Iterator for ChunkPosIter<V, S, M> {
    type Item = Vec3<i32>;

    #[inline(always)]
    fn next(&mut self) -> Option<Self::Item> {
        if self.pos.z >= self.ub.z {
            return None;
        }
        let res = Some(self.pos);

        self.pos.x += 1;
        if self.pos.x != self.ub.x && self.pos.x % Chunk::<V, S, M>::GROUP_SIZE.x as i32 != 0 {
            return res;
        }
        self.pos.x = std::cmp::max(
            self.lb.x,
            (self.pos.x - 1) & !(Chunk::<V, S, M>::GROUP_SIZE.x as i32 - 1),
        );

        self.pos.y += 1;
        if self.pos.y != self.ub.y && self.pos.y % Chunk::<V, S, M>::GROUP_SIZE.y as i32 != 0 {
            return res;
        }
        self.pos.y = std::cmp::max(
            self.lb.y,
            (self.pos.y - 1) & !(Chunk::<V, S, M>::GROUP_SIZE.y as i32 - 1),
        );

        self.pos.z += 1;
        if self.pos.z != self.ub.z && self.pos.z % Chunk::<V, S, M>::GROUP_SIZE.z as i32 != 0 {
            return res;
        }
        self.pos.z = std::cmp::max(
            self.lb.z,
            (self.pos.z - 1) & !(Chunk::<V, S, M>::GROUP_SIZE.z as i32 - 1),
        );

        self.pos.x = (self.pos.x | (Chunk::<V, S, M>::GROUP_SIZE.x as i32 - 1)) + 1;
        if self.pos.x < self.ub.x {
            return res;
        }
        self.pos.x = self.lb.x;

        self.pos.y = (self.pos.y | (Chunk::<V, S, M>::GROUP_SIZE.y as i32 - 1)) + 1;
        if self.pos.y < self.ub.y {
            return res;
        }
        self.pos.y = self.lb.y;

        self.pos.z = (self.pos.z | (Chunk::<V, S, M>::GROUP_SIZE.z as i32 - 1)) + 1;

        res
    }
}
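
The `next` implementation of `ChunkPosIter` appears to walk positions group by group: x fastest inside a group, then y, then z, and only then on to the next group. The following standalone sketch produces that visiting order with plain nested loops. The function name `group_order`, the array-based coordinates and the explicit group-size parameter `gs` are assumptions for illustration only; it assumes positive group sizes and is not a drop-in replacement for the iterator.

```rust
// Hypothetical helper: lists all positions in the half-open box [lb, ub),
// visiting one group at a time. `gs` is the group size per axis (all > 0).
fn group_order(lb: [i32; 3], ub: [i32; 3], gs: [i32; 3]) -> Vec<[i32; 3]> {
    let mut out = Vec::new();
    // An empty range on any axis means there is nothing to visit.
    if (0..3).any(|i| lb[i] >= ub[i]) {
        return out;
    }
    // Origin of the first group that overlaps the range on axis `i`.
    let start = |i: usize| lb[i].div_euclid(gs[i]) * gs[i];

    // Outer loops: group origins, z slowest and x fastest.
    for gz in (start(2)..ub[2]).step_by(gs[2] as usize) {
        for gy in (start(1)..ub[1]).step_by(gs[1] as usize) {
            for gx in (start(0)..ub[0]).step_by(gs[0] as usize) {
                // Inner loops: voxels of this group, clamped to [lb, ub).
                for z in gz.max(lb[2])..(gz + gs[2]).min(ub[2]) {
                    for y in gy.max(lb[1])..(gy + gs[1]).min(ub[1]) {
                        for x in gx.max(lb[0])..(gx + gs[0]).min(ub[0]) {
                            out.push([x, y, z]);
                        }
                    }
                }
            }
        }
    }
    out
}
```

Visiting a whole group before moving on keeps consecutive accesses within one contiguous block of `self.vox`, which is the cache-friendliness the commit message aims for.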
pub struct ChunkVolIter<'a, V, S: VolSize, M> {
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. 
(These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
chunk: &'a Chunk<V, S, M>,
iter_impl: ChunkPosIter<V, S, M>,
}
impl<'a, V, S: VolSize, M> Iterator for ChunkVolIter<'a, V, S, M> {
    type Item = (Vec3<i32>, &'a V);

    #[inline(always)]
    fn next(&mut self) -> Option<Self::Item> {
        self.iter_impl
            .next()
            .map(|pos| (pos, self.chunk.get_unchecked(pos)))
    }
}
impl<V, S: VolSize, M> Chunk<V, S, M> {
    /// It's possible to obtain a positional iterator without having a `Chunk`
    /// instance.
    pub fn pos_iter(lower_bound: Vec3<i32>, upper_bound: Vec3<i32>) -> ChunkPosIter<V, S, M> {
        ChunkPosIter::<V, S, M>::new(lower_bound, upper_bound)
    }
}
impl<'a, V, S: VolSize, M> IntoPosIterator for &'a Chunk<V, S, M> {
    type IntoIter = ChunkPosIter<V, S, M>;

    fn pos_iter(self, lower_bound: Vec3<i32>, upper_bound: Vec3<i32>) -> Self::IntoIter {
        Chunk::<V, S, M>::pos_iter(lower_bound, upper_bound)
    }
}
impl<'a, V, S: VolSize, M> IntoVolIterator<'a> for &'a Chunk<V, S, M> {
    type IntoIter = ChunkVolIter<'a, V, S, M>;

    fn vol_iter(self, lower_bound: Vec3<i32>, upper_bound: Vec3<i32>) -> Self::IntoIter {
        ChunkVolIter::<'a, V, S, M> {
            chunk: self,
            iter_impl: ChunkPosIter::<V, S, M>::new(lower_bound, upper_bound),
        }
    }
}