It's available to `api` and `metrics` and can be used to slow down message sending in veloren.
It uses a tokio::watch for now, as I plan to have a watch job in the scheduler that recalculates the priority on change.
Also cleaning up participant metrics after a disconnect.
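A minimal sketch of how such a watch value can be consumed, assuming a plain `u64` stands in for the bandwidth/priority value (names are illustrative, not the actual internals):

```rust
use tokio::sync::watch;

// Illustrative only: the scheduler publishes the recalculated value, and
// api/metrics consumers always read the latest one without a queue building up.
#[tokio::main]
async fn main() {
    let (tx, mut rx) = watch::channel(0u64); // bandwidth / prio value
    tx.send(125_000).unwrap();               // scheduler recalculates on change
    rx.changed().await.unwrap();             // consumer wakes up on the change
    assert_eq!(*rx.borrow(), 125_000);       // and reads the current value
}
```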
When a stream is opened, we search for the best currently available channel.
The stream will then be kept on that channel.
Adjusted the rest of the algorithms so that they now respect this rule.
Replaced the HashMap for Pids with a Vec-based lookup. This is also used for the Sid -> Cid relation, which is more performance critical.
WARN: our current send()? error handling allows some close_stream messages to get lost.
Instead of keeping the Runtime and manually spawning a task on `drop`, this task is now spawned at start and waits to be triggered.
The `drop` methods then wait for completion, UNLESS they are in an async context; then they MUST NOT BLOCK (deadlock potential), so they defer the work to the Runtime and HOPE that the runtime exists long enough.
This gets rid of the weird `block_in_place`, which is only accessible with `rt-multi-threaded` and has some disadvantages.
We also won't require the runtime to be active all the time, though it is needed for a clean shutdown.
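A rough sketch of this drop strategy, with made-up field names and a oneshot pair standing in for the real trigger/completion signalling (an assumption about shape, not the actual code):

```rust
use tokio::sync::oneshot;

// Illustrative stand-in for the real Participant: the shutdown task is
// already running on the runtime and waits for `shutdown_sender` to fire.
struct Participant {
    shutdown_sender: Option<oneshot::Sender<()>>,
    finished_receiver: Option<oneshot::Receiver<()>>,
}

impl Drop for Participant {
    fn drop(&mut self) {
        // trigger the long-lived shutdown task that was spawned at start
        if let Some(trigger) = self.shutdown_sender.take() {
            let _ = trigger.send(());
        }
        match tokio::runtime::Handle::try_current() {
            // inside an async context: blocking here could deadlock the
            // executor, so defer the wait to the runtime and hope it lives
            // long enough to finish the work
            Ok(handle) => {
                if let Some(finished) = self.finished_receiver.take() {
                    handle.spawn(async move {
                        let _ = finished.await;
                    });
                }
            }
            // outside an async context: safe to block until shutdown is done
            Err(_) => {
                if let Some(finished) = self.finished_receiver.take() {
                    let _ = finished.blocking_recv();
                }
            }
        }
    }
}
```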
- Timeout for Participant::drop, it will stop eventually
- Detect a tokio runtime in Participant::drop and no longer use std::sleep in that case (it could hang the thread that is actually doing the shutdown work and deadlock)
- Parallel shutdown in the Scheduler: instead of a slow shutdown locking up everything, we can now shut down participants in parallel; this should reduce the `WARN` "part took long for shutdown" messages dramatically (see the sketch after this list)
- the last digit of the version is now compatible: 0.6.0 will connect to 0.6.1
- the TCP DATA frames no longer contain a START field, as it's not needed
- the TCP OPENSTREAM frames now contain the BANDWIDTH field
- MID is not Protocol internal
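A sketch of the parallel shutdown mentioned above, using `futures_util::future::join_all`; the `ParticipantHandle` type and its `shutdown` method are placeholders, not the scheduler's real API:

```rust
use futures_util::future::join_all;

struct ParticipantHandle;

impl ParticipantHandle {
    async fn shutdown(self) { /* flush streams, close channels, ... */ }
}

// Instead of awaiting each participant's shutdown sequentially, drive them
// all concurrently so one slow participant no longer blocks the rest.
async fn shutdown_all(participants: Vec<ParticipantHandle>) {
    join_all(participants.into_iter().map(|p| p.shutdown())).await;
}
```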
Update network
- update API with Bandwidth
Update veloren
- introduce a better runtime and `async` handling for things that are IO-bound
- remove `uvth` and instead use `tokio::runtime::Runtime::spawn_blocking`
- remove futures_execute from client and server; use tokio::runtime::Runtime instead
- give threads a name
- a channel was stale and wasn't shut down properly, AS WELL AS
- the incoming msg pipe was bounded, so it could fill up
To mitigate this we
a) unbounded the pipe (see the sketch after this list)
b) stopped spamming the log in the "no channel" case
c) instead of ever reaching the "no channel" state we now actually shut down the participant
d) when send_mgr is closed it will no longer be able to SEND on streams
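Mitigation (a) in rough code form, switching the incoming message pipe from a bounded to an unbounded tokio channel (names are illustrative, not the actual internals):

```rust
use tokio::sync::mpsc;

struct IncomingMsg;

// Before: a bounded pipe could fill up and stall the sender, e.g.
// let (msg_tx, msg_rx) = mpsc::channel::<IncomingMsg>(1024);

// After: the pipe is unbounded, so back-pressure can no longer wedge the
// participant while the "no channel" situation is being resolved.
fn make_msg_pipe() -> (mpsc::UnboundedSender<IncomingMsg>, mpsc::UnboundedReceiver<IncomingMsg>) {
    mpsc::unbounded_channel()
}
```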
- completely switch to Bytes, even in the API; speeds up TCP by a factor of 2
- improve benchmarks
- speed up mpsc metrics
- gracefully handle shutdown by interpreting Ok(0) from a tokio TcpStream read as the stream being closed (see the sketch after this list)
- fix a hot loop in participants by adding `Some(n)` to fix endless hanging
- fix a closing bug by closing streams after `recv_mgr` is shut down, even if no shutdown is triggered locally
- fix prometheus
- no longer throw when a `Stream` is dropped while the participant still receives a msg for it
- fix the bandwidth handling; TCP network send speed is up to 1.5GiB/s while recv is 150MiB/s
- add documentation
- temporarily require rt-multi-threaded in the client for tokio, to not fail cargo check
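The Ok(0) handling from the list above, sketched as a simplified read loop (not the real read manager):

```rust
use tokio::io::AsyncReadExt;
use tokio::net::TcpStream;

// tokio's TcpStream read() returns Ok(0) once the remote side has closed the
// connection, so that case is treated as a graceful shutdown rather than an error.
async fn read_loop(mut stream: TcpStream) -> std::io::Result<()> {
    let mut buf = vec![0u8; 8192];
    loop {
        match stream.read(&mut buf).await {
            Ok(0) => break,              // remote closed cleanly -> graceful shutdown
            Ok(n) => handle_bytes(&buf[..n]),
            Err(e) => return Err(e),     // real I/O error -> unclean shutdown path
        }
    }
    Ok(())
}

fn handle_bytes(_bytes: &[u8]) { /* hand the data to the protocol layer */ }
```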
This is probably stable; I tested it for over an hour.
After that came some optimisations in the prio manager (priomgr)
and a proper bandwidth implementation.
Speed is up to 2GB/s write and 150MB/s recv on a single core.
sync add documentation
- better logging in network
- we now notify the send side of what happened in recv in the participant
- works with veloren master servers
- works in singleplayer, using an actual mid
- add `mpsc` to the whole stack, incl. tests
- speed up internal read/write with the `Bytes` crate
- use `prometheus-hyper` for metrics
- use a metrics cache
- Implementing an async non-I/O protocol crate (a trait sketch follows after this list)
a) no tokio / no channels
b) I/O is based on a Sink/Drain abstraction
c) different protocols can have a different Drain type
This allows MPSC to send its content without splitting up messages at all!
It allows UDP to have internal extra frames to care for security.
It allows better abstraction for tests.
It allows benchmarks on the mpsc variant.
Custom handshakes make it easy to support something like the QUIC protocol.
- reduce the participant managers to 4: channel creation, send, recv and shutdown.
keeping the `mut data` in one manager removes the need for all the RwLocks,
reducing complexity and parallel access problems
- more strategic participant shutdown: first stop sending, then wait for the remote side to notice that its recv stopped, then the remote side will stop sending, then the local side can stop recv.
- metrics are internally abstracted to fit the protocol and network layers
- in this commit the network/protocol tests work and the network tests work somewhat; veloren compiles but does not work
- handshake compatible with async_std
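The Sink/Drain idea from the list above, sketched as traits; the names and the use of the `async_trait` crate are assumptions about the shape, not the crate's exact API:

```rust
use async_trait::async_trait;

pub struct ProtocolError;

// The protocol logic is async but does no I/O itself; every transport plugs
// in its own Drain (outgoing) and Sink (incoming) with its own data format.
#[async_trait]
pub trait UnreliableDrain: Send {
    type DataFormat;
    async fn send(&mut self, data: Self::DataFormat) -> Result<(), ProtocolError>;
}

#[async_trait]
pub trait UnreliableSink: Send {
    type DataFormat;
    async fn recv(&mut self) -> Result<Self::DataFormat, ProtocolError>;
}

// Because DataFormat is an associated type, an MPSC variant can use whole
// messages as its DataFormat (no splitting into frames at all), while TCP
// uses raw byte buffers and UDP could add its own extra frames for security.
```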
Switch to the `tokio` and `async_channel` crates.
I wanted to use tokio only at first, but it doesn't feature Sender::close(), thus I included async_channel.
Got rid of `futures` and only need `futures_core` and `futures_util`.
Tokio does not provide `Stream` and `StreamExt`, so for now I need to use `tokio-stream`; I think this will go into `std` in the future.
Created `b2b_close_stream_opened_sender_r` as the shutdown procedure does not need a copy of a Sender, it just needs to stop it.
Various adjustments, e.g. for `select!`, which now requires a `&mut` for oneshots.
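What `Sender::close()` buys us, in a small illustrative example with the `async_channel` crate:

```rust
use async_channel::bounded;

// Closing from the sender side lets the receiver drain what is already
// buffered and then terminate, instead of having to drop every Sender clone.
#[tokio::main]
async fn main() {
    let (tx, rx) = bounded::<u32>(8);
    tx.send(1).await.unwrap();
    tx.close(); // no new messages can be sent, buffered ones stay readable
    assert_eq!(rx.recv().await.ok(), Some(1));
    assert!(rx.recv().await.is_err()); // channel is now closed and empty
}
```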
Future things to do:
- Use some better signalling than oneshot<()> in some cases.
- Use a Watch for the prio propagation (and implement it, of course).
- Use bounded channels in order to improve performance.
- Adjust the test code.
bring tests to work
But I have the feeling that maybe something with the channel is wrong, which leads to this behavior (or maybe I made a copy somewhere, though I doubt it).
Again, not sure if this is a fix, but I think it doesn't hurt.
The guard generated here in the `if let Some()` will live through the WHOLE statement, which is definitely critical.
The fix is quite easy: we just move it out into a separate line.
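An illustrative reconstruction of that pitfall and the one-line fix (the map and types here are made up):

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// BAD: the MutexGuard created inside the `if let` condition lives until the
// end of the whole `if let` statement, so locking the same Mutex in the body
// deadlocks.
fn bad(streams: &Mutex<HashMap<u64, String>>, sid: u64) {
    if let Some(name) = streams.lock().unwrap().get(&sid).cloned() {
        println!("found {name}");
        // streams.lock().unwrap().remove(&sid); // would deadlock here
    }
}

// FIX: do the lookup on its own line; the temporary guard is dropped at the
// end of that statement, before the `if let` body runs.
fn good(streams: &Mutex<HashMap<u64, String>>, sid: u64) {
    let name = streams.lock().unwrap().get(&sid).cloned();
    if let Some(name) = name {
        println!("found {name}");
        streams.lock().unwrap().remove(&sid); // fine now
    }
}
```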
There are still some participants where the dropping will fail, and follow-up disconnects will NOT get triggered and will be queued up till eternity, but at least we are not blocking new logins.
So first of all we had to update the toolchain; I think any toolchain from September onward is okay, but we go with this current version.
Then we had to update several dependencies, which broke things:
- need a specific fix of winit; I think we want to get rid of this with iced, hopefully, because it's buggy as hell. Update wayland-client to 0.27.
- use an updated version of glutin which has wayland-client 0.27 and no longer the broken version 0.23
- update conrod to use a modern copypasta 0.7
- use `packed_simd_2` instead of `packed_simd`, as the owner of the crate abandoned the project
- adjust all the code to work with the newer glutin and winit versions. That also includes fixing a macro in one of the dependencies that did some crazy conversion from one event type to another event type;
it was called `convert_event`.
- make a `simd` feature, which is enabled by default, and introduce conditional compilation.
AS I HAVE NO IDEA OF SIMD AND THE CODE, AND I DIDN'T INTRODUCE THE ERROR IN THE FIRST PLACE, WE PANIC FOR THE NON-SIMD CASE FOR NOW. BUT IT WORKS FOR TESTS.
- tarpaulin doesn't support no-default-features, so we have to `sed` them away. Seems fair.
Refactored `send_raw` and `recv_raw` completely. We now expose `Message`, which has public `serialize` and `deserialize` fns for the first time.
This makes using the `raw` methods of a stream much easier and is a requirement for "copy-less" sending to multiple streams.
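The "copy-less" idea, sketched with stand-in types (the real `Message`/`Stream` API may differ, and serde/bincode are only example encoders): serialize once, then hand the cheap reference-counted `Bytes` handle to every stream.

```rust
use bytes::Bytes;

// Stand-in types, not the real network API.
struct Message {
    data: Bytes,
}

impl Message {
    fn serialize<T: serde::Serialize>(value: &T) -> Message {
        // the expensive part happens exactly once
        let encoded = bincode::serialize(value).expect("serialize");
        Message { data: Bytes::from(encoded) }
    }
}

struct Stream;

impl Stream {
    fn send_raw(&mut self, msg: &Message) {
        let _frame = msg.data.clone(); // O(1) clone, the payload is not copied
        // ... enqueue the frame for this stream ...
    }
}

fn broadcast<T: serde::Serialize>(value: &T, streams: &mut [Stream]) {
    let msg = Message::serialize(value); // once
    for s in streams {
        s.send_raw(&msg); // reused for every stream
    }
}
```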
There is a rare bug that recently got triggered more often with the release of xMAC94x/netfixA.
If the bug triggers, a Participant never gets cleaned up gracefully.
Reason:
When `participant_shutdown_mgr` was called, it stopped all managers at once,
especially stream_close_mgr and send_mgr.
The problem with stream_close_mgr is that it's responsible for gracefully flushing streams when the Participant is dropped locally.
So when it was interrupted, self.streams were no longer flushed gracefully.
The next problem was with send_mgr.
It triggers the PrioManager, and the PrioManager is responsible for notifying once a stream is completely flushed.
This led to the problem that a stream flush could be requested but was actually never executed (as send_mgr was already down).
Solution:
1. when stream_close_mgr is stopped it MUST flush all remaining streams
2. wait for stream_close_mgr to finish before shutting down the send_mgr
3. no longer delete streams when closing the API (this also wasn't tracked in metrics so far)
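Solution step 2 in sketch form, using oneshot signals with made-up names to show the ordering; this is not the actual manager code:

```rust
use tokio::sync::oneshot;

// The shutdown manager first waits until stream_close_mgr reports that every
// remaining stream has been flushed; only then is send_mgr (and with it the
// PrioManager) told to stop, so no flush can be requested after the sender
// side is already gone.
async fn participant_shutdown_mgr(
    stream_close_finished: oneshot::Receiver<()>,
    stop_send_mgr: oneshot::Sender<()>,
) {
    let _ = stream_close_finished.await; // 1. streams flushed
    let _ = stop_send_mgr.send(());      // 2. now send_mgr may shut down
}
```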
Additionally I added a dependency so that the network/examples compile again, and fixed some spelling.
I created a `delete_stream` fn that basically just moved the code over.
- first, remove participant AND channel from being in the same metric; this caused a matrix full of 0 values which bloated things a lot
- then make the cid cache lazy-loading so that it no longer generates that many 0 values (sketched below)
- it would also be possible to no longer keep metrics for INIT, HANDSHAKE and PARTICIPANTID
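The lazy cid labels, illustrated with the `prometheus` crate (the metric name is invented): a label combination only comes into existence on first use, instead of pre-registering every participant/channel pair as a 0-valued series.

```rust
use prometheus::{IntCounterVec, Opts, Registry};

fn register_frames_metric(registry: &Registry) -> IntCounterVec {
    let frames = IntCounterVec::new(
        Opts::new("frames_in_total", "incoming frames per channel"),
        &["cid"],
    )
    .expect("valid metric");
    registry.register(Box::new(frames.clone())).expect("register");
    frames
}

fn count_frame(frames: &IntCounterVec, cid: u64) {
    // the series for this cid is created lazily, on its first increment
    frames.with_label_values(&[&cid.to_string()]).inc();
}
```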
- Compression is no longer always enabled but can now be enabled per Stream.
If a Stream has compression enabled, it will compress and decompress all msgs (except for `raw` access) before handling them internally.
You need to handle compression yourself for the `raw` fns.
- added a new feature to the network crate to enable or disable the compression
- switched to `lz-fear` instead of `lz4-compression`
- use `bitflags` to represent the `Promises` struct
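A sketch of the `Promises` bitflags with a per-stream compression flag; the flag names are assumptions based on the text, not the crate's exact definition (bitflags 1.x syntax):

```rust
use bitflags::bitflags;

bitflags! {
    pub struct Promises: u8 {
        const ORDERED             = 0b0000_0001;
        const CONSISTENCY         = 0b0000_0010;
        const GUARANTEED_DELIVERY = 0b0000_0100;
        const COMPRESSED          = 0b0000_1000; // per-stream compression opt-in
    }
}

fn maybe_compress(promises: Promises, payload: Vec<u8>) -> Vec<u8> {
    if promises.contains(Promises::COMPRESSED) {
        // compress with lz-fear here; `raw` access skips this step entirely
        compress(payload)
    } else {
        payload
    }
}

fn compress(payload: Vec<u8>) -> Vec<u8> {
    // placeholder for the lz-fear call
    payload
}
```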
- When Participant A was closed by the remote side, then a `disconnect` on `A`
shall return Ok(()) (instead of ParticipantDisconnected) IF:
A was already flushed and no data needs to be sent any more.
So a `disconnect` doesn't differ depending on whether the other side initiated the disconnect first or not; it tries to clean things up and returns Ok(()) if both sides agree.
- It was possible for an end_receiver to be triggered at a moment when a frame was started but not finished.
This removed bytes from the stream, and they got lost; this was almost always followed by a RAW frame, as the TCP stream was now invalid.
The broken TCP stream was then detected by the participant or caused one or multiple failures.
- introduces some simplifications, removes a macro, reuses code
- replace RwLock by Mutex if it's only accessed for insert/delete
- use the RwLock<HashMap<Mutex>> pattern otherwise, in order to allow concurrent `.read()` (see the sketch after this list)
- fixed a deadlock O.o
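The locking pattern from the list above, sketched with a made-up stream table:

```rust
use std::collections::HashMap;
use std::sync::{Mutex, RwLock};

struct StreamState {
    bytes_sent: u64,
}

// Outer RwLock: write lock only for insert/delete, read lock (concurrently,
// from many tasks) to reach an existing entry. Inner Mutex: serializes access
// to that single entry only.
struct Streams {
    inner: RwLock<HashMap<u64, Mutex<StreamState>>>,
}

impl Streams {
    fn insert(&self, sid: u64) {
        self.inner
            .write()
            .unwrap()
            .insert(sid, Mutex::new(StreamState { bytes_sent: 0 }));
    }

    fn record_send(&self, sid: u64, n: u64) {
        // many readers can hold the outer read lock at the same time
        if let Some(state) = self.inner.read().unwrap().get(&sid) {
            state.lock().unwrap().bytes_sent += n;
        }
    }
}
```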
Till now, we just dropped the TCP connection and registered this as a clean shutdown.
The protocol reader interpreted this and sent a Frame::Shutdown frame to its local processor.
This is of course wrong.
So now the protocol reader will detect a Frame::Shutdown frame and pass it on; if the TCP connection gets closed it will return an Error upward instead.
The processor will then pick up this error, request an unclean shutdown and notify the user.
Also, when doing a clean shutdown we now send a Frame::Shutdown to the remote side to trigger this behavior.
Before, we wrongly added the feature of only using a `select` in the channel. This is WRONG,
as it could mean that the write maybe fails, but the read still had some Frames buffered which then get dropped.
It's fixed now by the clean shutdown mechanism defined before.
Also, when a channel is closed inside a participant we now close the whole participant as a protection.
However, we must not close the recv channel, as the `handle_frames_mgr` might still be working on it, so we only stop writing/sending.
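The distinction described above, sketched with simplified stand-in types: a received Frame::Shutdown travels upward as a clean shutdown, while a closed or broken TCP connection becomes an error that the processor turns into an unclean shutdown.

```rust
// Simplified stand-ins for the real frame and event types.
enum Frame {
    Shutdown,
    Data(Vec<u8>),
}

enum ReaderEvent {
    Frame(Frame),   // includes Frame::Shutdown -> clean shutdown path
    ConnectionLost, // TCP closed or I/O error -> unclean shutdown path
}

fn classify(read_result: std::io::Result<Option<Frame>>) -> ReaderEvent {
    match read_result {
        Ok(Some(frame)) => ReaderEvent::Frame(frame),
        // Ok(None) models "the TCP stream returned 0 bytes": no longer treated
        // as a clean shutdown, it is reported upward like an error.
        Ok(None) | Err(_) => ReaderEvent::ConnectionLost,
    }
}
```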
Debugging this also let me introduce some smaller fixes:
- PIDs in tests are now 0 and 1+1*64+1*64*64+...; this makes the traces appear as AAAAAA and BBBBBB instead of ABAAAA and ACAAAA
- the veloren client now better separates between a clean shutdown and an unclean shutdown
- added a new type: C2pFrame for `(cid, Result<Frame, ()>)`
- wrong frames inside the handshake are not counted in metrics
Added a span for disconnecting on the gameserver side; also added more debug! tracing there.
Just keeping a trace! active every 10000ms to have a keep-alive feeling.
The Participant will now handle a close in `create_channel_mgr` rather than in the `send` fn. It's the better place, and it makes a HashMap better suited for the delete lookup.
Since tcp_read had broken but tcp_write hadn't, and the Participant wasn't updated until both were broken, we changed the CHANNEL's tcp_read and tcp_write in the protocols to be a `select` rather than a `join`.
However, we only do this in the CHANNEL, not in the HANDSHAKE; it fails if you try to. Also, the handshake will take care of any failed read or write manually and will handle a clean teardown in this case.
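The join-to-select change, sketched with placeholder manager futures: with `join!` the channel only wound down after both halves had returned, so a dead tcp_read could hide behind a still-healthy tcp_write; with `select!` the first half to finish tears the channel down.

```rust
use std::future::Future;
use tokio::select;

// Used for the CHANNEL only; the handshake keeps driving both sides and
// handles a failed read or write manually, as described above.
async fn run_channel<R, W>(tcp_read: R, tcp_write: W)
where
    R: Future<Output = ()>,
    W: Future<Output = ()>,
{
    select! {
        _ = tcp_read => { /* read side ended first: stop writing as well */ }
        _ = tcp_write => { /* write side ended first: stop reading as well */ }
    }
}
```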