veloren/common/src/terrain/chonk.rs

361 lines
13 KiB
Rust
Raw Normal View History

use crate::{
2019-09-03 23:00:50 +00:00
vol::{
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
BaseVol, IntoPosIterator, IntoVolIterator, ReadVol, RectRasterableVol, RectVolSize,
VolSize, WriteVol,
2019-09-03 23:00:50 +00:00
},
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
volumes::chunk::{Chunk, ChunkError, ChunkPosIter, ChunkVolIter},
};
use core::{hash::Hash, marker::PhantomData};
use serde::{Deserialize, Serialize};
use vek::*;
#[derive(Debug)]
pub enum ChonkError {
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
SubChunkError(ChunkError),
OutOfBounds,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
pub struct SubChunkSize<ChonkSize: RectVolSize> {
phantom: PhantomData<ChonkSize>,
}
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
// TODO (haslersn): Assert ChonkSize::RECT_SIZE.x == ChonkSize::RECT_SIZE.y
impl<ChonkSize: RectVolSize> VolSize for SubChunkSize<ChonkSize> {
const SIZE: Vec3<u32> = Vec3 {
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
x: ChonkSize::RECT_SIZE.x,
y: ChonkSize::RECT_SIZE.x,
// NOTE: Currently, use 32 instead of 2 for RECT_SIZE.x = 128.
z: ChonkSize::RECT_SIZE.x / 2,
};
}
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
type SubChunk<V, S, M> = Chunk<V, SubChunkSize<S>, M>;
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Chonk<V, S: RectVolSize, M: Clone> {
z_offset: i32,
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
sub_chunks: Vec<SubChunk<V, S, M>>,
below: V,
above: V,
meta: M,
phantom: PhantomData<S>,
}
impl<V, S: RectVolSize, M: Clone> Chonk<V, S, M> {
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
pub fn new(z_offset: i32, below: V, above: V, meta: M) -> Self {
Self {
z_offset,
sub_chunks: Vec::new(),
below,
above,
meta,
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
phantom: PhantomData,
}
}
pub fn meta(&self) -> &M { &self.meta }
2019-06-11 18:39:25 +00:00
#[inline]
pub fn get_min_z(&self) -> i32 { self.z_offset }
#[inline]
2019-06-04 17:19:40 +00:00
pub fn get_max_z(&self) -> i32 {
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
self.z_offset + (self.sub_chunks.len() as u32 * SubChunkSize::<S>::SIZE.z) as i32
}
pub fn sub_chunks_len(&self) -> usize { self.sub_chunks.len() }
2019-09-09 08:46:58 +00:00
pub fn sub_chunk_groups(&self) -> usize {
self.sub_chunks.iter().map(SubChunk::num_groups).sum()
}
2021-07-19 01:42:10 +00:00
/// Iterate through the voxels in this chunk, attempting to avoid those that
/// are unchanged (i.e: match the `below` and `above` voxels). This is
/// generally useful for performance reasons.
pub fn iter_changed(&self) -> impl Iterator<Item = (Vec3<i32>, &V)> + '_ {
self.sub_chunks
.iter()
.enumerate()
.filter(|(_, sc)| sc.num_groups() > 0)
.flat_map(move |(i, sc)| {
let z_offset = self.z_offset + i as i32 * SubChunkSize::<S>::SIZE.z as i32;
2021-07-19 01:42:10 +00:00
sc.vol_iter(Vec3::zero(), SubChunkSize::<S>::SIZE.map(|e| e as i32))
.map(move |(pos, vox)| (pos + Vec3::unit_z() * z_offset, vox))
})
}
// Returns the index (in self.sub_chunks) of the SubChunk that contains
// layer z; note that this index changes when more SubChunks are prepended
#[inline]
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
fn sub_chunk_idx(&self, z: i32) -> i32 {
let diff = z - self.z_offset;
diff >> (SubChunkSize::<S>::SIZE.z - 1).count_ones()
}
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
// Converts a z coordinate into a local z coordinate within a sub chunk
fn sub_chunk_z(&self, z: i32) -> i32 {
let diff = z - self.z_offset;
diff & (SubChunkSize::<S>::SIZE.z - 1) as i32
}
// Returns the z offset of the sub_chunk that contains layer z
fn sub_chunk_min_z(&self, z: i32) -> i32 { z - self.sub_chunk_z(z) }
/// Compress chunk by using more intelligent defaults.
pub fn defragment(&mut self)
where
V: Clone + Eq + Hash,
{
// First, defragment all subchunks.
self.sub_chunks.iter_mut().for_each(SubChunk::defragment);
// For each homogeneous subchunk (i.e. those where all blocks are the same),
// find those which match `below` at the bottom of the cunk, or `above`
// at the top, since these subchunks are redundant and can be removed.
// Note that we find (and drain) the above chunks first, so that when we
// remove the below chunks we have fewer remaining chunks to backshift.
// Note that we use `take_while` instead of `rposition` here because `rposition`
// goes one past the end, which we only want in the forward direction.
let above_count = self
.sub_chunks
.iter()
.rev()
.take_while(|subchunk| subchunk.homogeneous() == Some(&self.above))
.count();
// Unfortunately, `TakeWhile` doesn't implement `ExactSizeIterator` or
// `DoubleEndedIterator`, so we have to recreate the same state by calling
// `nth_back` (note that passing 0 to nth_back goes back 1 time, not 0
// times!).
let mut subchunks = self.sub_chunks.iter();
if above_count > 0 {
subchunks.nth_back(above_count - 1);
}
// `above_index` is now the number of remaining elements, since all the elements
// we drained were at the end.
let above_index = subchunks.len();
// `below_len` now needs to be applied to the state after the `above` chunks are
// drained, to make sure we don't accidentally have overlap (this is
// possible if self.above == self.below).
let below_len = subchunks.position(|subchunk| subchunk.homogeneous() != Some(&self.below));
let below_len = below_len
// NOTE: If `below_index` is `None`, then every *remaining* chunk after we drained
// `above` was full and matched `below`.
.unwrap_or(above_index);
// Now, actually remove the redundant chunks.
self.sub_chunks.truncate(above_index);
self.sub_chunks.drain(..below_len);
// Finally, bump the z_offset to account for the removed subchunks at the
// bottom. TODO: Add invariants to justify why `below_len` must fit in
// i32.
self.z_offset += below_len as i32 * SubChunkSize::<S>::SIZE.z as i32;
}
}
impl<V, S: RectVolSize, M: Clone> BaseVol for Chonk<V, S, M> {
common: Rework volume API See the doc comments in `common/src/vol.rs` for more information on the API itself. The changes include: * Consistent `Err`/`Error` naming. * Types are named `...Error`. * `enum` variants are named `...Err`. * Rename `VolMap{2d, 3d}` -> `VolGrid{2d, 3d}`. This is in preparation to an upcoming change where a “map” in the game related sense will be added. * Add volume iterators. There are two types of them: * _Position_ iterators obtained from the trait `IntoPosIterator` using the method `fn pos_iter(self, lower_bound: Vec3<i32>, upper_bound: Vec3<i32>) -> ...` which returns an iterator over `Vec3<i32>`. * _Volume_ iterators obtained from the trait `IntoVolIterator` using the method `fn vol_iter(self, lower_bound: Vec3<i32>, upper_bound: Vec3<i32>) -> ...` which returns an iterator over `(Vec3<i32>, &Self::Vox)`. Those traits will usually be implemented by references to volume types (i.e. `impl IntoVolIterator<'a> for &'a T` where `T` is some type which usually implements several volume traits, such as `Chunk`). * _Position_ iterators iterate over the positions valid for that volume. * _Volume_ iterators do the same but return not only the position but also the voxel at that position, in each iteration. * Introduce trait `RectSizedVol` for the use case which we have with `Chonk`: A `Chonk` is sized only in x and y direction. * Introduce traits `RasterableVol`, `RectRasterableVol` * `RasterableVol` represents a volume that is compile-time sized and has its lower bound at `(0, 0, 0)`. The name `RasterableVol` was chosen because such a volume can be used with `VolGrid3d`. * `RectRasterableVol` represents a volume that is compile-time sized at least in x and y direction and has its lower bound at `(0, 0, z)`. There's no requirement on he lower bound or size in z direction. The name `RectRasterableVol` was chosen because such a volume can be used with `VolGrid2d`.
2019-09-03 22:23:29 +00:00
type Error = ChonkError;
type Vox = V;
common: Rework volume API See the doc comments in `common/src/vol.rs` for more information on the API itself. The changes include: * Consistent `Err`/`Error` naming. * Types are named `...Error`. * `enum` variants are named `...Err`. * Rename `VolMap{2d, 3d}` -> `VolGrid{2d, 3d}`. This is in preparation to an upcoming change where a “map” in the game related sense will be added. * Add volume iterators. There are two types of them: * _Position_ iterators obtained from the trait `IntoPosIterator` using the method `fn pos_iter(self, lower_bound: Vec3<i32>, upper_bound: Vec3<i32>) -> ...` which returns an iterator over `Vec3<i32>`. * _Volume_ iterators obtained from the trait `IntoVolIterator` using the method `fn vol_iter(self, lower_bound: Vec3<i32>, upper_bound: Vec3<i32>) -> ...` which returns an iterator over `(Vec3<i32>, &Self::Vox)`. Those traits will usually be implemented by references to volume types (i.e. `impl IntoVolIterator<'a> for &'a T` where `T` is some type which usually implements several volume traits, such as `Chunk`). * _Position_ iterators iterate over the positions valid for that volume. * _Volume_ iterators do the same but return not only the position but also the voxel at that position, in each iteration. * Introduce trait `RectSizedVol` for the use case which we have with `Chonk`: A `Chonk` is sized only in x and y direction. * Introduce traits `RasterableVol`, `RectRasterableVol` * `RasterableVol` represents a volume that is compile-time sized and has its lower bound at `(0, 0, 0)`. The name `RasterableVol` was chosen because such a volume can be used with `VolGrid3d`. * `RectRasterableVol` represents a volume that is compile-time sized at least in x and y direction and has its lower bound at `(0, 0, z)`. There's no requirement on he lower bound or size in z direction. The name `RectRasterableVol` was chosen because such a volume can be used with `VolGrid2d`.
2019-09-03 22:23:29 +00:00
}
impl<V, S: RectVolSize, M: Clone> RectRasterableVol for Chonk<V, S, M> {
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
const RECT_SIZE: Vec2<u32> = S::RECT_SIZE;
}
impl<V, S: RectVolSize, M: Clone> ReadVol for Chonk<V, S, M> {
#[inline(always)]
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
fn get(&self, pos: Vec3<i32>) -> Result<&V, Self::Error> {
if pos.z < self.get_min_z() {
// Below the terrain
Ok(&self.below)
} else if pos.z >= self.get_max_z() {
// Above the terrain
Ok(&self.above)
} else {
// Within the terrain
let sub_chunk_idx = self.sub_chunk_idx(pos.z);
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
let rpos = pos
- Vec3::unit_z()
* (self.z_offset + sub_chunk_idx * SubChunkSize::<S>::SIZE.z as i32);
self.sub_chunks[sub_chunk_idx as usize]
.get(rpos)
.map_err(Self::Error::SubChunkError)
}
}
}
impl<V: Clone + PartialEq, S: RectVolSize, M: Clone> WriteVol for Chonk<V, S, M> {
#[inline(always)]
fn set(&mut self, pos: Vec3<i32>, block: Self::Vox) -> Result<V, Self::Error> {
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
let mut sub_chunk_idx = self.sub_chunk_idx(pos.z);
if pos.z < self.get_min_z() {
// Make sure we're not adding a redundant chunk.
if block == self.below {
return Ok(self.below.clone());
}
// Prepend exactly sufficiently many SubChunks via Vec::splice
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
let c = Chunk::<V, SubChunkSize<S>, M>::filled(self.below.clone(), self.meta.clone());
let n = (-sub_chunk_idx) as usize;
self.sub_chunks.splice(0..0, std::iter::repeat(c).take(n));
self.z_offset += sub_chunk_idx * SubChunkSize::<S>::SIZE.z as i32;
sub_chunk_idx = 0;
} else if pos.z >= self.get_max_z() {
// Make sure we're not adding a redundant chunk.
if block == self.above {
return Ok(self.above.clone());
}
// Append exactly sufficiently many SubChunks via Vec::extend
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
let c = Chunk::<V, SubChunkSize<S>, M>::filled(self.above.clone(), self.meta.clone());
let n = 1 + sub_chunk_idx as usize - self.sub_chunks.len();
self.sub_chunks.extend(std::iter::repeat(c).take(n));
}
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
let rpos = pos
- Vec3::unit_z() * (self.z_offset + sub_chunk_idx * SubChunkSize::<S>::SIZE.z as i32);
self.sub_chunks[sub_chunk_idx as usize] // TODO (haslersn): self.sub_chunks.get(...).and_then(...)
.set(rpos, block)
.map_err(Self::Error::SubChunkError)
}
}
struct ChonkIterHelper<V, S: RectVolSize, M: Clone> {
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
sub_chunk_min_z: i32,
lower_bound: Vec3<i32>,
upper_bound: Vec3<i32>,
phantom: PhantomData<Chonk<V, S, M>>,
}
impl<V, S: RectVolSize, M: Clone> Iterator for ChonkIterHelper<V, S, M> {
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
type Item = (i32, Vec3<i32>, Vec3<i32>);
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
#[inline(always)]
fn next(&mut self) -> Option<Self::Item> {
if self.lower_bound.z >= self.upper_bound.z {
return None;
}
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
let mut lb = self.lower_bound;
let mut ub = self.upper_bound;
let current_min_z = self.sub_chunk_min_z;
lb.z -= current_min_z;
ub.z -= current_min_z;
ub.z = std::cmp::min(ub.z, SubChunkSize::<S>::SIZE.z as i32);
self.sub_chunk_min_z += SubChunkSize::<S>::SIZE.z as i32;
self.lower_bound.z = self.sub_chunk_min_z;
Some((current_min_z, lb, ub))
}
}
pub struct ChonkPosIter<V, S: RectVolSize, M: Clone> {
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
outer: ChonkIterHelper<V, S, M>,
opt_inner: Option<(i32, ChunkPosIter<V, SubChunkSize<S>, M>)>,
}
impl<V, S: RectVolSize, M: Clone> Iterator for ChonkPosIter<V, S, M> {
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
type Item = Vec3<i32>;
#[inline(always)]
fn next(&mut self) -> Option<Self::Item> {
loop {
if let Some((sub_chunk_min_z, ref mut inner)) = self.opt_inner {
if let Some(mut pos) = inner.next() {
pos.z += sub_chunk_min_z;
return Some(pos);
}
}
match self.outer.next() {
None => return None,
Some((sub_chunk_min_z, lb, ub)) => {
self.opt_inner = Some((sub_chunk_min_z, SubChunk::<V, S, M>::pos_iter(lb, ub)))
},
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
}
}
}
}
enum InnerChonkVolIter<'a, V, S: RectVolSize, M: Clone> {
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
Vol(ChunkVolIter<'a, V, SubChunkSize<S>, M>),
Pos(ChunkPosIter<V, SubChunkSize<S>, M>),
}
pub struct ChonkVolIter<'a, V, S: RectVolSize, M: Clone> {
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
chonk: &'a Chonk<V, S, M>,
outer: ChonkIterHelper<V, S, M>,
opt_inner: Option<(i32, InnerChonkVolIter<'a, V, S, M>)>,
}
impl<'a, V, S: RectVolSize, M: Clone> Iterator for ChonkVolIter<'a, V, S, M> {
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
type Item = (Vec3<i32>, &'a V);
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
#[inline(always)]
fn next(&mut self) -> Option<Self::Item> {
loop {
if let Some((sub_chunk_min_z, ref mut inner)) = self.opt_inner {
let got = match inner {
InnerChonkVolIter::<'a, V, S, M>::Vol(iter) => iter.next(),
InnerChonkVolIter::<'a, V, S, M>::Pos(iter) => iter.next().map(|pos| {
if sub_chunk_min_z < self.chonk.get_min_z() {
(pos, &self.chonk.below)
} else {
(pos, &self.chonk.above)
}
}),
};
if let Some((mut pos, vox)) = got {
pos.z += sub_chunk_min_z;
return Some((pos, vox));
}
}
match self.outer.next() {
None => return None,
Some((sub_chunk_min_z, lb, ub)) => {
let inner = if sub_chunk_min_z < self.chonk.get_min_z()
|| sub_chunk_min_z >= self.chonk.get_max_z()
{
InnerChonkVolIter::<'a, V, S, M>::Pos(SubChunk::<V, S, M>::pos_iter(lb, ub))
} else {
InnerChonkVolIter::<'a, V, S, M>::Vol(
self.chonk.sub_chunks
[self.chonk.sub_chunk_idx(sub_chunk_min_z) as usize]
.vol_iter(lb, ub),
)
};
self.opt_inner = Some((sub_chunk_min_z, inner));
},
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
}
}
}
}
2019-09-03 23:00:50 +00:00
impl<'a, V, S: RectVolSize, M: Clone> IntoPosIterator for &'a Chonk<V, S, M> {
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
type IntoIter = ChonkPosIter<V, S, M>;
2019-09-03 23:00:50 +00:00
fn pos_iter(self, lower_bound: Vec3<i32>, upper_bound: Vec3<i32>) -> Self::IntoIter {
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
Self::IntoIter {
outer: ChonkIterHelper::<V, S, M> {
sub_chunk_min_z: self.sub_chunk_min_z(lower_bound.z),
lower_bound,
upper_bound,
phantom: PhantomData,
},
opt_inner: None,
}
2019-09-03 23:00:50 +00:00
}
}
impl<'a, V, S: RectVolSize, M: Clone> IntoVolIterator<'a> for &'a Chonk<V, S, M> {
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
type IntoIter = ChonkVolIter<'a, V, S, M>;
2019-09-03 23:00:50 +00:00
fn vol_iter(self, lower_bound: Vec3<i32>, upper_bound: Vec3<i32>) -> Self::IntoIter {
common: Rework `Chunk` and `Chonk` implementation Previously, voxels in sparsely populated chunks were stored in a `HashMap`. However, during usage oftentimes block accesses are followed by subsequent nearby voxel accesses. Therefore it's possible to provide cache friendliness, but not with `HashMap`. The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469) proposed to order voxels by their morton order (see https://en.wikipedia.org/wiki/Z-order_curve ). This provided excellent cache friendliness. However, benchmarks showed that the required indexing calculations are quite expensive. Particular results on my _Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were: | Benchmark | Before this commit @ d322384becac | Morton Order @ ec8a7caf42ba | This commit | | ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- | | `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel | | `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel | | `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel | | `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel | | `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel | | `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel | | `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel | | `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel | This commit (instead of utilizing morton order) replaces `HashMap` in the `Chunk` implementation by the following data structure: The volume is spatially subdivided into groups of `4*4*4` blocks. Since a `Chunk` is of total size `32*32*16`, this implies that there are `8*8*4` groups. (These numbers are generic in the actual code such that there are always `256` groups. I.e. the group size is chosen depending on the desired total size of the `Chunk`.) There's a single vector `self.vox` which consecutively stores these groups. Each group might or might not be contained in `self.vox`. A group that is not contained represents that the full group consists only of `self.default` voxels. This saves a lot of memory because oftentimes a `Chunk` consists of either a lot of air or a lot of stone. To track whether a group is contained in `self.vox`, there's an index buffer `self.indices : [u8; 256]`. It contains for each group * (a) the order in which it has been inserted into `self.vox`, if the group is contained in `self.vox` or * (b) 255, otherwise. That case represents that the whole group consists only of `self.default` voxels. (Note that 255 is a valid insertion order for case (a) only if `self.vox` is full and then no other group has the index 255. Therefore there's no ambiguity.) Rationale: The index buffer should be small because: * Small size increases the probability that it will always be in cache. * The index buffer is allocated for every `Chunk` and an almost empty `Chunk` shall not consume too much memory. The number of 256 groups is particularly nice because it means that the index buffer can consist of `u8`s. This keeps the space requirement for the index buffer as low as 4 cache lines.
2019-09-06 13:23:38 +00:00
Self::IntoIter {
chonk: self,
outer: ChonkIterHelper::<V, S, M> {
sub_chunk_min_z: self.sub_chunk_min_z(lower_bound.z),
lower_bound,
upper_bound,
phantom: PhantomData,
},
opt_inner: None,
}
2019-09-03 23:00:50 +00:00
}
}