Commit Graph

53 Commits

Author SHA1 Message Date
Marcel Märtens
8dccc21125 preparation for multiple-channel participants.
When a stream is opened we are searching for the best (currently) available channel.
The stream will then be keept on that channel.
Adjusted the rest of the algorithms that they now respect this rule.
improved a HashMap for Pids to be based on a Vec. Also using this for Sid -> Cid relation which is more performance critical
WARN: our current send()? error handling allows it for some close_stream messages to get lost.
2021-03-26 08:57:50 +01:00
Marcel Märtens
01c82b70ab network scheduler and rawmsg cleanup 2021-03-26 08:57:42 +01:00
Marcel Märtens
9084ac48f1 defer some trace, so that we wont spam the log. 2021-03-22 09:16:07 +01:00
Marcel Märtens
514d5db038 Update Network Protocol
- now last digit version is compatible 0.6.0 will connect to 0.6.1
 - the TCP DATA Frames no longer contain START field, as it's not needed
 - the TCP OPENSTREAM Frames will now contain the BANDWIDTH field
 - MID is not Protocol internal

Update network
 - update API with Bandwidth

Update veloren
 - introduce better runtime and `async` things that are IO bound.
 - Remove `uvth` and instead use `tokio::runtime::Runtime::spawn_blocking`
 - remove futures_execute from client and server use tokio::runtime::Runtime instead
 - give threads a Name
2021-02-22 17:34:55 +01:00
Marcel Märtens
5a48bffcb0 fix main thread blocking which was a bad combination of
- a channel was stale and wasn't shut down propertly AS WELL AS
 - the msg ingoing pipe was bounded, so it could fill up

To mitigate this we
 a) unbounded the pipe
 b) stoped spam the log in no channel case
 c) instead of ever reaching "no channel" state we actually shutdown participant
 d) when send_mgr is closed it will no longer be able to SEND on streams
2021-02-18 20:00:07 +01:00
Marcel Märtens
03af9937cf Stabelize Network again:
- completly switch to Bytes, even in api. speed up TCP by fak 2
 - improve benchmarks
 - speed up mpsc metrics
 - gracefully handle shutdown by interpreting Ok(0) as tokio::tcpstream closed now.
 - fix hotloop in participants by adding `Some(n)` to fix endless handing.
 - fix closing bug by closing streams after `recv_mgr` is shutdown even if now shutdown is triggered locally.
 - fix prometheus
 - no longer throw when a `Stream` is dropped while participant still receives a msg for it.
 - fix the bandwith handling, TCP network send speed is up to 1.5GiB/s while recv is 150MiB/s
 - add documentation
 - tmp require rt-multi-threaded in client for tokio, to not fail cargo check

this is prob stable, i tested over 1 hour.
after that some optimisations in priomgr.
and impl. propper bandwith.
Speed is up to 2GB/s write and 150MB/s recv on a single core

sync add documentation
2021-02-17 19:37:48 +01:00
Marcel Märtens
ea8ab1ce7a Great improvements to the codebase:
- better logging in network
 - we now notify the send of what happened in recv in participant.
 - works with veloren master servers
 - works in singleplayer, using a actual mid.
 - add `mpsc` in whole stack incl tests
 - speed up internal read/write with `Bytes` crate
 - use `prometheus-hyper` for metrics
 - use a metrics cache
2021-02-17 16:15:00 +01:00
Marcel Märtens
9884019963 COMPLETE REDESIGN of network crate
- Implementing a async non-io protocol crate
    a) no tokio / no channels
    b) I/O is based on abstraction Sink/Drain
    c) different Protocols can have a different Drain Type
       This allow MPSC to send its content without splitting up messages at all!
       It allows UDP to have internal extra frames to care for security
       It allows better abstraction for tests
       Allows benchmarks on the mpsc variant
       Custom Handshakes to allow sth like Quic protocol easily
 - reduce the participant managers to 4: channel creations, send, recv and shutdown.
   keeping the `mut data` in one manager removes the need for all RwLocks.
   reducing complexity and parallel access problems
 - more strategic participant shutdown. first send. then wait for remote side to notice recv stop, then remote side will stop send, then local side can stop recv.
 - metrics are internally abstracted to fit protocol and network layer
 - in this commit network/protocol tests work and network tests work someway, veloren compiles but does not work
 - handshake compatible to async_std
2021-02-17 12:39:47 +01:00
Marcel Märtens
3f85506761 fix most unittests (not all) by a) dropping network/participant BEFORE runtime and by transfering a expect into a warn! in the protocol 2021-02-17 12:38:58 +01:00
Marcel Märtens
5aa1940ef8 get rid of async_std::channel
switch to `tokio` and `async_channel` crate.
I wanted to do tokio first, but it doesnt feature Sender::close(), thus i included async_channel
Got rid of `futures` and only need `futures_core` and `futures_util`.

Tokio does not support `Stream` and `StreamExt` so for now i need to use `tokio-stream`, i think this will go in `std` in the future

Created `b2b_close_stream_opened_sender_r` as the shutdown procedure does not need a copy of a Sender, it just need to stop it.

Various adjustments, e.g. for `select!` which now requieres a `&mut` for oneshots.

Future things to do:
 - Use some better signalling than oneshot<()> in some cases.
 - Use a Watch for the Prio propergation (impl. it ofc)
 - Use Bounded Channels in order to improve performance
 - adjust tests coding

bring tests to work
2021-02-17 12:38:53 +01:00
Marcel Märtens
1b77b6dc41 Initial switch to tokio for network, minimum working example. 2021-02-17 12:37:59 +01:00
Marcel Märtens
7a7c1f6f50 I would except this to be implcitly done by the drop though it doesn't hurt here, as this channel is dropped anyway a line later.
But i have the feeling that maybe something with the channel is wrong which leads to this behavior (or maybe did i made a copy somewhere, though i dobt this).
Again, not sure if this is a fix, but i think it doesn't hurt
2020-11-27 10:47:01 +01:00
Marcel Märtens
ba1299e670 apparently span doesnt work for async, so i replaced it by an instrument version 2020-10-14 17:54:01 +02:00
Marcel Märtens
e914c29728 FIX for hanging participant deletion.
There is a rare bug that recently got triggered more often with the release of xMAC94x/netfixA
if the bug triggeres, a Participant never gets cleaned up gracefully.

Reason:
When `participant_shutdown_mgr` was called it stopped all managers at once.
Especially stream_close_mgr and send_mgr.
The problem with stream_close_mgr is, it's responsible for gracefully flushing streams when the Participant is dropped locally.
So when it was interupted self.streams where no longer flushed gracefully.
The next problem was with send_mgr.
It is triggering the PrioManager, and the PrioManager is responsible for notifying once a stream is completly flushed.
This lead to the problem, that a stream flush could be requested, but was actually never executed (as send_mgr was already down).

Solution:
1. when stream_close_mgr is stopped it MUST flush all remaining streams
2. wait for stream_close_mgr to finish before shutting down the send_mgr
3. no longer delete streams when closing the API (this also wasn't tracked in metrics so far)

Additionally i added a dependency, so that the network/examples compile again, fixed some spelling.
I created a `delete_stream` fn that basically just moved the code over.
2020-10-14 15:03:49 +02:00
Marcel Märtens
24af657fd5 quickfix for closing participants more reliable 2020-10-13 20:06:20 +02:00
Ben Wallis
b3dd8e8a02 Added #![deny(clippy::clone_on_ref_ptr)] to all crates and fixed resulting lint errors 2020-09-27 17:25:33 +01:00
Marcel Märtens
a7b7ae3a2c fix compiling with metrics 2020-08-27 09:35:06 +02:00
Imbris
dcce5641f7 Fix broken features and avoid panic if the client leaves before character data loads 2020-08-26 20:47:39 -04:00
Marcel Märtens
9170622611 reduce load on metrics by ALOT!
- first remove participant AND channel in same metric. this caused a matrix full of 0 values which bloated alot.
 - then did the cid cache to be lazy loading to no longer generate that much 0 values
 - possible would also be no longer keeping metrics for INIT, HANDSHAKE and PARTICIPANTID
2020-08-27 01:55:13 +02:00
notoria
2be4202d01 Corrected some spelling errors 2020-08-25 12:21:25 +00:00
Marcel Märtens
5fe7c05d9c Redefine Close behavior:
- When Participant A was closed by remote side. Then a `disconnect` on `A`
   shall return Ok() (instead of ParticipantDisconected) IF:
   A was already flushed and no data needs to be sended any more.
 so a `disconnect` doesnt differ if the other side initiated the disconnect before or not. it tries to clean things up and returns Ok(()) if both sides agree
2020-08-24 16:22:12 +02:00
Marcel Märtens
91d296cda1 Fixed bug in tcp protocol.rs
- It was possible for a end_receiver to be triggered in the moment while a frame was started by not finished.
   This removed bytes from the stream with them getting lost. this almost ever was followed by a RAW frame as the TCP stream was now invalid.
   The TCP stream was then detected by participant or caused one or multiple failures
 - introduces some simplifications, removed a macro, reuse code
2020-08-24 16:22:06 +02:00
Marcel Märtens
d37ca02913 using Locks a more sensitive way.
- replace RwLock by Mutex if it's only accessed for insert/delete
 - use RwLock<HashMap<Mutex>> pattern otherwise in order to allow concurrent `.read()`
 - fixed a deadlock O.o
2020-08-23 21:43:17 +02:00
Marcel Märtens
1eb126736d workaround for impossible RAW msg 2020-08-22 01:09:07 +02:00
Marcel Märtens
926d334082 Fixed the unclean disconnecting of participants.
Till now, we just dropped the TCP connection and registered this as a clean shutdown.
The prodocol reader intereted this and send a Frame::Shutdown frame to it's local processor.
This is ofc wrong.
So now the protocol reader will detect a Frame::Shutdown frame and send it over. if the Tcp connection gets closed it will return an Error up.
The processor will then pick up this error and request a unclear shutdown and notifies the user.
Also when doing a clean shutdown we are sending a Frame::Shutdown now to the remote side to trigger this behavior.

Before we wrongly added the feature of only using a `select` in channel. This is WRONG,
 as it could mean that the write maybe fails, but the read still had some Frames buffered which then get dropped.
Its fixed now by the clean shutdown mechanims defined before.

Also when a channel is closed now inside a participant we are closing the whole participant as a protection.
However, we must not close the recv channel as the `handle_frames_mgr` might still be working on them, so we only stop writing/sending.

Debugging this also let me introduce some smaller fixes:
 - PID in tests are now 0 and 1+1*64+1*64*64+... this makes the traces appear as AAAAAA and BBBBBB instead of ABAAAA and ACAAAA
 - veloren client now better seperates between clean shutdown and unclear shutdown.
 - added a new type: C2pFrame for `(cid, Result<Frame, ()>)`
 - wrong frames inside the handshare are not counted in metrics
 -
2020-08-21 18:00:28 +02:00
Marcel Märtens
42141b3aa3 remove some trace! in network which a) was only spam and b) could be replaced by a metric way better.
added a span for disconnecting on the gameserver side. also added more debug! tracing there
Just keeping a trace! all 10000ms active to have a keep alive feeling.
2020-08-21 18:00:14 +02:00
Marcel Märtens
12b46250f5 protocols no longer send a Close Frame in case the read fails. They just fail, let participant handle this!
Participant will now handle a close in the `create_channel_mgr` rather then the `send` fn. Its the better place, which makes a HashMap better for delete lookup
Since tcp_read now broke but tcp_write didn't and the Participant wasnt updated till both were broke, we changed CHANNEL tcp_read and tcp_write in protocols to be a `select` rather than a `join`
However only do this in the CHANNEL, but in the HANDSHAKE. it fails if you try to. Also the handshake will take care of any failed read or write manually and will handle a clear teardown in this case.
2020-08-21 18:00:07 +02:00
Marcel Märtens
b59fc2ff0c improve tracing and spans in network crate 2020-08-21 18:00:00 +02:00
Marcel Märtens
e618eeb386 Fix a isse that might occur when a participant is dropped while the remote wants to open a Stream and we get some bad time condition.
increase the slowlorris timeout. for some reason it seems to trigger alot more often since commit: `75c1d440` but i have no idea why.
My guess would be that the initial sync now sends way more data which slows down TCP to be > 10ms and trigger. Note: the fix might cause small lags when slow people try to connect to the server
2020-08-13 12:06:53 +02:00
Marcel Märtens
dd581bc6c0 Participant closure was immeatiatly, even in case a new participant was connected, send a MSG and then dropped immeadiatly.
The remote site should see it connect, be open for 1 single stream and read the message before it's notified that the participant is closed actually.

This caused the faulure in one of our API tests (in lib, with client and server). Where it was possible that all messages were send and one side was dropped before the other side asked for the opened stream

Also introduce better error detection in participant(and scheduler) by removing the std_async::Result and intruduce `Result<(),ParticipantError>` instead
2020-07-22 09:18:15 +02:00
Marcel Märtens
6c59caf8e1 make prometheus optional in network and fix a panic in the server
- an extra interface `new_with_regisitry` was created to make sure the interface doesn't depend on the features
2020-07-15 16:45:49 +02:00
Marcel Märtens
58cb98deaa use type to reduce complexity 2020-07-15 16:45:44 +02:00
Marcel Märtens
187ec42aa2 fix Participant shutdown
- we had the problem that Participants couldn't shutdown them self, only by scheduler, which was controlled by api.
  it's needed e.g. to handle the Schudown Frame
 - my initial solution did a full shutdown, which was a problem if in parallel a 2nd shutdown was requested, no possibility of getting the error
 - new solution will only deactivate Participant and Stream. and then still functions correctly, till the api closes the participant and calls the scheduler which then calls the bparticipant again
 - i experimented with a Mutex<oneshot> or 2 and a `select` but it didn't prove that well
 - also adjusted the Error messages to now either Disconnected when gracefully shutdown or ProtocolFailed when some msg couldn't be delivered
  (note later might not be 100% returned correctly yet)
2020-07-13 13:03:30 +02:00
Marcel Märtens
041349be48 Switch API to return Participant rather than Arc<Participant>
- API behavior switched!
 - the `Network` no longer holds a copy of participant, thus if the return of `connect` (before `Arc<Participant>, now `Participant`) got dropped, the `Participant::Drop` is triggered!
 - you can close a Participant async via `Particiant::disconnect()`, no more need to know the network at this point
 - the `Network::Drop` will check and drop not yet disconnected Participants.
 - you can compare Participants via PartialEq, if they are true they point to the same endpoint (it checks remote_pid)
   - Note: multiple Participants are only supported in theory, wont work yet

Additionally:
 - fix some `debug!`
 - veloren-client will now drop the participant gracefully on shutdown
 - rename `error` to `debug` when 2 times Bparticipant shutdown is called, as it is to be expected in a async runtime
2020-07-13 13:03:14 +02:00
Marcel Märtens
4cefdcefea zests fix - capitalize first letter 2020-07-13 13:03:01 +02:00
Marcel Märtens
5f902b5eab doing a clean shutdown of the BParticipant once the TCP connection is suddenly interrupted 2020-07-13 13:02:55 +02:00
Marcel Märtens
3a6319f2f6 compress everything 2020-07-05 20:14:47 +02:00
Marcel Märtens
6de2eadeb0 make crash -> error for now 2020-07-04 10:32:52 +02:00
Marcel Märtens
1da6f15a43 fixing various smaller network party from issue 657 2020-07-04 02:04:33 +02:00
Marcel Märtens
f9895a7800 crossbeam-channel and log spam
- swap out std::mpsc with crossbeam-channel in networking crate
 - remove log spam by only logging when populating a new cache entry and not on every get
2020-07-03 22:35:29 +02:00
Marcel Märtens
57453291e4 - switched participant to only error!() once a sec instead of every occurence in order to not spam error
- switched the behavior of prio to keep the order of a stream, EVEN if messages of different length are used!
  This needs to be addresses to fulfill PROMISES_ORDERED
  I think we need to make thoughs if PRIO needs to handle this, or we add additional meta info and handle it on receiver site
2020-06-28 22:29:42 +02:00
Marcel Märtens
2e3d5f87db StreamError::Deserialize is now triggered when recv fails because of wrong type
- added PartialEq to StreamError for test purposes (only yet!)
 - removed async_recv example as it's no longer for any use.
   It was created before the COMPLETE REWRITE in order to verify that my own async interface on top of mio works.
   However it's now guaranteed by async-std and futures. no need for a special test
 - remove uvth from dependencies and replace it with a `FnOnce`
 - fix ALL clippy (network) lints
 - basic fix for a channel drop scenario:
   TODO: this needs some further fixes
   up to know only destruction of participant by api was covered correctly.
   we had an issue when the underlying channels got dropped. So now we have a participant without channels.
   We need to buffer the requests and try to reopen a channel ASAP!
   If no channel could be reopened we need to close the Participant, while
    a) leaving the BParticipant in takt, knowing that it only waits for a propper close by scheduler
    b) close the BParticipant gracefully. Notifying the scheduler to remove its stuff (either scheduler schould detect a stopped BParticipant or BParticipant will send Scheduler it's own destruction, and then Scheduler just does the same like when API forces a close)
       Keep the Participant alive and wait for the api to acces BParticipant to notice it's closed and then wait for a disconnect which isn't doing anything as it was already cleaned up in the background
2020-06-09 13:16:39 +02:00
Marcel Märtens
3324c08640 Fixing the DEADLOCK in handshake -> channel creation
- this bug was initially called imbris bug, as it happened on his runners and i couldn't reproduce it locally at fist :)
- When in a Handshake a seperate mpsc::Channel was created for (Cid, Frame) transport
  however the protocol could already catch non handshake data any more and push in into this
  mpsc::Channel.
  Then this channel got dropped and a fresh one was created for the network::Channel.
  These droped Frames are ofc a BUG!
  I tried multiple things to solve this:
   - dont create a new mpsc::Channel, but instead bind it to the Protocol itself and always use 1.
     This would work theoretically, but in bParticipant side we are using 1 mpsc::Channel<(Cid, Frame)>
     to handle ALL the network::channel.
     If now ever Protocol would have it's own, and with that every network::Channel had it's own it would no longer work out
     Bad Idea...
   - using the first method but creating the mpsc::Channel inside the scheduler instead protocol neither works, as the
     scheduler doesnt know the remote_pid yet
   - i dont want a hack to say the protocol only listen to 2 messages and then stop no matter what
  So i switched over to the simply method now:
   - Do everything like before with 2 mpsc::Channels
   - after the handshake. close the receiver and listen for all remaining (cid, frame) combinations
   - when starting the channel, reapply them to the new sender/listener combination
   - added tracing
- switched Protocol RwLock to Mutex, as it's only ever 1
- Additionally changed the layout and introduces the c2w_frame_s and w2s_cid_frame_s name schema
- Fixed a bug in scheduler which WOULD cause a DEADLOCK if handshake would fail
- fixd a but in api_send_send_main, i need to store the stream_p otherwise it's immeadiatly closed and a stream_a.send() isn't guaranteed
- add extra test to verify that a send message is received even if the Stream is already closed
- changed OutGoing to Outgoing
- fixed a bug that `metrics.tick()` was never called
- removed 2 unused nightly features and added `deny_code`
2020-06-09 01:24:21 +02:00
Marcel Märtens
2a7c5807ff overall cleanup, more tests, fixing clashes, removing unwraps, hardening against protocol errors, prepare prio mgr to take commands from scheduler
fix async_recv and double block_on panic on Network::drop and participant::drop
include Cargo.lock from all examples
Found a bug on imbris runners with doc tests of `stream::send` and `stream::recv`
As neither a backtrace, nor tracing on runners in the doc tests seems to help, i disable them and add them as unit tests
2020-06-09 01:24:16 +02:00
Marcel Märtens
6e776e449f fixing all tests and doc tests including some deadlocks 2020-06-09 01:24:05 +02:00
Marcel Märtens
9550da87b8 speeding up metrics by reducing string generation and Hashmap access with a metrics cache for msg/send and msg/recv 2020-06-09 01:24:01 +02:00
Marcel Märtens
8b839afcae move prios from scheduler to participant in oder to fixing closing of stream/participant
however i need to coordinate the prio adjustments in scheduler from now on, so that ParticipantA doesn't get all the network bandwith and ParticipantB nothing
2020-06-09 01:23:58 +02:00
Marcel Märtens
bd69b2ae28 renamed all Channels to new naming scheme and fixing shutting down bparticipant and scheduler correctly. Introducing structs to keep Info in scheduler.rs and participant.rs 2020-06-09 01:23:55 +02:00
Marcel Märtens
a8f1bc178a Experiments with a prometheus bug which actually worked as designed because i had client and server running at the same time
- https://github.com/tikv/rust-prometheus/issues/321
 - split up channel into a hanshake part and channel part.
   The handshake part is non endless and ends when its either done or aborted.
   If its okay i will send a request to the BParticipant which then opens a channel on the existing TCP or UDP connection.
   this streamlines the command chain alot. also the channel is almost empty now, thinking about removing it completly.
   isnt perfect, as shutdown and udp doesnt work yet
 - make PID to print as Base64
 - replace rouille with tiny_http
2020-06-09 01:23:49 +02:00
Marcel Märtens
9074de533a handling frames no longer is channel -> scheduler -> participant, but it's directly channel -> participant, removing a lock and a single bottleneck in the scheduler 2020-06-09 01:23:45 +02:00