Currently we only do this when no players are in range of the chunk. We
also send the first client that requested the chunk a message indicating
that it has been canceled, the hope being that this will be a performance
win in single-player mode, since the client no longer has to wait three
seconds to realize that the server won't generate the chunk for it.
We now check an atomic flag for every column sample in a chunk. We
could probably do this less frequently, but since it's a relaxed load it
has essentially no performance impact on Intel architectures.
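For illustration only, here is a minimal sketch of what such a per-column check could look like; the function and flag names are hypothetical and not the actual Veloren code:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical per-chunk cancellation flag; `cancel` would be shared with
/// whatever code decides that the chunk is no longer needed.
fn generate_chunk(cancel: &AtomicBool) -> Option<()> {
    for _x in 0..32 {
        for _y in 0..32 {
            // A relaxed load is enough here: we only need to notice the
            // cancellation eventually, not synchronize any other memory,
            // and on x86 it compiles down to a plain load.
            if cancel.load(Ordering::Relaxed) {
                return None; // abort generation of this chunk
            }
            // ... sample and generate one column here ...
        }
    }
    Some(())
}
```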
Previously, voxels in sparsely populated chunks were stored in a `HashMap`.
However, in practice a voxel access is often followed by accesses to nearby
voxels. This access pattern could benefit from cache friendliness, which a
`HashMap` cannot provide.
The previous merge request [!469](https://gitlab.com/veloren/veloren/merge_requests/469)
proposed storing voxels in Morton order (see https://en.wikipedia.org/wiki/Z-order_curve ).
This provided excellent cache friendliness. However, benchmarks showed that
the required index calculations are quite expensive. The results on my
_Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz_ were:
| Benchmark | Before this commit @ d322384bec | Morton Order @ ec8a7caf42 | This commit |
| ---------------------------------------- | --------------------------------- | --------------------------- | -------------------- |
| `full read` (81920 voxels) | 17.7ns per voxel | 8.9ns per voxel | **3.6ns** per voxel |
| `constrained read` (4913 voxels) | 67.0ns per voxel | 40.1ns per voxel | **14.1ns** per voxel |
| `local read` (125 voxels) | 17.5ns per voxel | 14.7ns per voxel | **3.8ns** per voxel |
| `X-direction read` (17 voxels) | 17.8ns per voxel | 25.9ns per voxel | **4.2ns** per voxel |
| `Y-direction read` (17 voxels) | 18.4ns per voxel | 33.3ns per voxel | **4.5ns** per voxel |
| `Z-direction read` (17 voxels) | 18.6ns per voxel | 38.2ns per voxel | **5.4ns** per voxel |
| `long Z-direction read` (65 voxels) | 18.0ns per voxel | 37.7ns per voxel | **5.1ns** per voxel |
| `full write (dense)` (81920 voxels) | 17.9ns per voxel | **10.3ns** per voxel | 12.4ns per voxel |
Instead of using Morton order, this commit replaces the `HashMap` in the
`Chunk` implementation with the following data structure:
The volume is spatially subdivided into groups of `4*4*4` blocks. Since a
`Chunk` has a total size of `32*32*16`, this means there are `8*8*4`
groups. (These numbers are generic in the actual code, such that there are
always `256` groups; i.e. the group size is chosen depending on the desired
total size of the `Chunk`.)
There is a single vector `self.vox` which stores these groups consecutively.
Each group may or may not be contained in `self.vox`. A group that is not
contained means that the whole group consists only of `self.default` voxels.
This saves a lot of memory, because a `Chunk` often consists of either a lot
of air or a lot of stone.
To track whether a group is contained in `self.vox`, there is an index buffer
`self.indices: [u8; 256]`. For each group it contains either
* (a) the position at which the group was inserted into `self.vox`, if the
  group is contained in `self.vox`, or
* (b) `255` otherwise, meaning that the whole group consists only of
  `self.default` voxels.
(Note that `255` is a valid insertion position for case (a) only when
`self.vox` is completely full, in which case no group needs `255` as the
"not contained" marker. Therefore there is no ambiguity.)
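To make the layout concrete, here is a minimal, self-contained sketch of the described structure. It is not the actual Veloren implementation; the field names mirror the description above (`vox`, `indices`, `default`), but the API and the exact index arithmetic are illustrative only, using the `32*32*16` chunk with `4*4*4` groups from above:

```rust
/// Number of groups per chunk: (32/4) * (32/4) * (16/4) = 8 * 8 * 4 = 256.
const GROUP_COUNT: usize = 256;
/// Voxels per group: 4 * 4 * 4 = 64.
const GROUP_VOLUME: usize = 64;
/// Marker meaning "this group is not stored; it is all `default` voxels".
const EMPTY_GROUP: u8 = 255;

struct Chunk<V: Copy> {
    /// One `u8` per group: either the group's position in `vox`,
    /// or `EMPTY_GROUP` if the group consists only of `default` voxels.
    indices: [u8; GROUP_COUNT],
    /// Densely stored groups, in insertion order.
    vox: Vec<[V; GROUP_VOLUME]>,
    /// The voxel that absent groups implicitly consist of (e.g. air).
    default: V,
}

impl<V: Copy> Chunk<V> {
    fn filled(default: V) -> Self {
        Self { indices: [EMPTY_GROUP; GROUP_COUNT], vox: Vec::new(), default }
    }

    /// Map a position (x, y, z) inside the 32*32*16 chunk to
    /// (group index, index within the group).
    fn idx(pos: (usize, usize, usize)) -> (usize, usize) {
        let (x, y, z) = pos;
        let grp = (x / 4) * (8 * 4) + (y / 4) * 4 + (z / 4);
        let rel = (x % 4) * (4 * 4) + (y % 4) * 4 + (z % 4);
        (grp, rel)
    }

    fn get(&self, pos: (usize, usize, usize)) -> V {
        let (grp, rel) = Self::idx(pos);
        match self.indices[grp] {
            // `255` only means "absent" while `vox` is not completely full.
            EMPTY_GROUP if self.vox.len() < GROUP_COUNT => self.default,
            i => self.vox[i as usize][rel],
        }
    }

    fn set(&mut self, pos: (usize, usize, usize), voxel: V) {
        let (grp, rel) = Self::idx(pos);
        if self.indices[grp] == EMPTY_GROUP && self.vox.len() < GROUP_COUNT {
            // First write into this group: materialize it as all-default.
            self.indices[grp] = self.vox.len() as u8;
            self.vox.push([self.default; GROUP_VOLUME]);
        }
        self.vox[self.indices[grp] as usize][rel] = voxel;
    }
}

fn main() {
    // `0u16` plays the role of `self.default` (e.g. air) in this sketch.
    let mut chunk: Chunk<u16> = Chunk::filled(0);
    chunk.set((10, 3, 7), 42);
    assert_eq!(chunk.get((10, 3, 7)), 42);
    assert_eq!(chunk.get((0, 0, 0)), 0); // untouched group: default voxel
    assert_eq!(chunk.vox.len(), 1);      // only one group has been materialized
}
```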
Rationale:
The index buffer should be small because:
* a small size increases the probability that it will always be in cache, and
* the index buffer is allocated for every `Chunk`, and an almost empty `Chunk`
  should not consume too much memory.
Choosing exactly `256` groups is particularly convenient because the index
buffer can then consist of `u8`s, which keeps its space requirement as low
as 4 cache lines.
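As a quick back-of-the-envelope check of that figure (assuming the usual 64-byte cache lines on current x86 CPUs):

```rust
const GROUP_COUNT: usize = 256;
const CACHE_LINE_BYTES: usize = 64; // typical on current x86_64 CPUs

fn main() {
    // 256 indices * 1 byte per `u8` = 256 bytes = 4 cache lines.
    let index_buffer_bytes = GROUP_COUNT * std::mem::size_of::<u8>();
    assert_eq!(index_buffer_bytes / CACHE_LINE_BYTES, 4);
}
```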