
high memory usage under heavy load #56

Open
btkador opened this issue Aug 5, 2018 · 6 comments

Comments

@btkador

btkador commented Aug 5, 2018

I was running for about 5 hours with the settings below, filling in 1000 searches/hashes in parallel and immediately refilling them...

I've used a slightly modified dht.go because I had disabled the clientThrottle feature for debugging reasons, so the line numbers below refer to that file.

    s, err = dht.New(&dht.Config{
            Address:                 "",
            Port:                    sPort,
            NumTargetPeers:          5000,
            DHTRouters:              "router.magnets.im:6881,router.bittorrent.com:6881,dht.transmissionbt.com:6881,router.utorrent.com:6881,dht.aelitis.com:6881,dht.libtorrent.org:25401",
            MaxNodes:                500,
            CleanupPeriod:           3 * time.Minute,
            SaveRoutingTable:        false,
            RateLimit:               -1,
            MaxInfoHashes:           2,
            MaxInfoHashPeers:        2,
            ClientPerMinuteLimit:    50,
            ThrottlerTrackedClients: 10,
            UDPProto:                "udp4",
    })

[screenshot: heap2 (heap profile)]

More detailed:

[screenshot: heap2_detail]

It seems 2.5 GB of the 8 gets eaten by dht.go:610, which is
node.pastQueries[r.T] = query

and in krpc.go:189 the jackpal unmarshal eats another 1.123 GB.


CPU looks fine to me; just the unmarshal eats plenty. After I got (and still see) many "dht.go:568] DHT: readResponse Error: unexpected EOF" messages, I tried replacing it with https://github.com/IncSW/go-bencode (which is much less CPU-intensive according to the benchmarks there) to see if it solves the issue, but due to my very limited Go skills I was unable to find the right syntax, parameters, and value type conversions to make it work.
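
For what it's worth, the direction I was attempting looked roughly like this. This is a sketch only: the Unmarshal([]byte) (interface{}, error) signature is taken from that package's README, and the concrete decoded types are assumptions I couldn't verify against krpc.go.

    package main

    import (
            "fmt"
            "log"

            bencode "github.com/IncSW/go-bencode"
    )

    func main() {
            raw := []byte("d1:t2:aa1:y1:re") // minimal KRPC-style reply dictionary

            // Assumption: Unmarshal takes a []byte and returns (interface{}, error).
            v, err := bencode.Unmarshal(raw)
            if err != nil {
                    log.Fatalf("unmarshal: %v", err)
            }

            // A KRPC packet is a top-level dictionary; the decoder hands back a
            // generic interface{}, so everything needs a type assertion.
            dict, ok := v.(map[string]interface{})
            if !ok {
                    log.Fatalf("expected a dict, got %T", v)
            }

            // String values may come back as []byte or string depending on the
            // decoder; handle both rather than guessing.
            switch t := dict["t"].(type) {
            case []byte:
                    fmt.Printf("transaction id: %q\n", t)
            case string:
                    fmt.Printf("transaction id: %q\n", t)
            default:
                    fmt.Printf("unexpected type for t: %T\n", t)
            }
    }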

[screenshot: cpu (CPU profile)]

[screenshot: time]

[expvar output]

@btkador
Author

btkador commented Aug 5, 2018

> Why do you have a CleanupPeriod of 3 minutes? That's a high frequency. With those settings, you have to contact your 500 nodes every 180 seconds just to keep them alive in your routing table. That's probably too much.

The cleanup is that low because I was playing with the variables to find out what they do. I was hoping the cleanup would kick out old nodes faster and therefore the pastQueries with them.

Do we need all the data in node.pastQueries[r.T] = query, or can we limit it to a maximum count somehow?

> Setting a RateLimit means you reply less often to incoming packets, which reduces output rate and thus consumes less memory. You probably don't have an infinite network connection anyway :), so if you send too many packets you're probably causing congestion and reducing your overall "useful" throughput.

Isn't RateLimit also dropping reply packets to my get_peers? anacrolix/dht has a "Passive bool" to disable responding to incoming queries completely… If RateLimit only applies to incoming queries, setting it to 1 or 0 would have roughly the same effect?

I'm running another profiling run right now with RateLimit 10000 and CleanupPeriod back to the 15-minute default.
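
In config terms that run looks roughly like the following sketch. I'm assuming the library's NewConfig() helper fills in the remaining defaults, and the port is just a placeholder:

    package main

    import (
            "log"
            "time"

            "github.com/nictuku/dht"
    )

    func main() {
            cfg := dht.NewConfig()               // assumption: returns a *Config pre-filled with defaults
            cfg.Port = 41000                     // placeholder port
            cfg.RateLimit = 10000                // was -1 (no limit) in the run above
            cfg.CleanupPeriod = 15 * time.Minute // back to the default
            d, err := dht.New(cfg)
            if err != nil {
                    log.Fatalf("dht.New: %v", err)
            }
            _ = d // started and used exactly as before; only these two knobs changed
    }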

@nictuku
Owner

nictuku commented Aug 5, 2018

Like we discussed, some of the memory usage you're seeing is normal given that GOGC=100, meaning that by default Go uses twice the amount of heap memory that it truly needs.

It looks like pastQueries does not have a limit, so it's growing unbounded. It should probably be limited. Nice find!

A few alternatives:

  1. use a ring buffer or an LRU.
  2. add a timer in the main loop (or a step in the existing cleanup routine) that removes transactions older than a certain time. This would require tracking the age of each transaction.

It's been a while since I looked at this code, so I could be wrong, but I think the best approach here is to use a simple ring buffer that recycles the oldest event. See https://golang.org/pkg/container/ring/
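
A rough sketch of that shape, for concreteness (the remoteNode, queryType and pastQueries names here just mirror this thread, and the cap is an arbitrary placeholder rather than a value from the actual code):

    package dht

    import "container/ring"

    const maxPastQueries = 100 // assumed cap; not a value taken from the library

    // queryType stands in for the per-transaction data currently kept in pastQueries.
    type queryType struct {
            Type string
    }

    type pastQuery struct {
            tid string // the KRPC transaction id, i.e. today's map key
            q   *queryType
    }

    // remoteNode sketches only the field under discussion.
    type remoteNode struct {
            pastQueries *ring.Ring
    }

    func newRemoteNode() *remoteNode {
            return &remoteNode{pastQueries: ring.New(maxPastQueries)}
    }

    // rememberQuery overwrites the oldest slot once the ring is full, so the
    // per-node memory stays bounded instead of growing with every transaction.
    func (n *remoteNode) rememberQuery(tid string, q *queryType) {
            n.pastQueries.Value = &pastQuery{tid: tid, q: q}
            n.pastQueries = n.pastQueries.Next()
    }

    // findQuery scans the ring for a transaction id; O(maxPastQueries), which
    // is cheap for a small cap.
    func (n *remoteNode) findQuery(tid string) *queryType {
            var found *queryType
            n.pastQueries.Do(func(v interface{}) {
                    if p, ok := v.(*pastQuery); ok && found == nil && p.tid == tid {
                            found = p.q
                    }
            })
            return found
    }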

It is possible that once we fix pastQueries, the allocations we see inside readResponse might get fixed, too, because pastQueries is causing the "r" to be retained in memory.

Want to submit a PR for doing the ring buffer for pastQueries? 😄

@btkador
Author

btkador commented Aug 5, 2018

I had already replaced pastQueries with the LRU from peerStore for a test at the beginning of last week, but discarded it for some reason. I'm running it again with profiling enabled right now.
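
Roughly, that change looked like the following sketch. I'm assuming peerStore's LRU is github.com/golang/groupcache/lru, and the names here mirror the thread rather than the actual code:

    package dht

    import "github.com/golang/groupcache/lru"

    const maxPastQueries = 1000 // the cap used in this test run

    type queryType struct {
            Type string
    }

    type remoteNode struct {
            pastQueries *lru.Cache // instead of map[string]*queryType
    }

    func newRemoteNode() *remoteNode {
            return &remoteNode{pastQueries: lru.New(maxPastQueries)}
    }

    // Add evicts the least recently used transaction once the cache is full,
    // so lookups stay O(1) while the per-node memory is bounded.
    func (n *remoteNode) rememberQuery(tid string, q *queryType) {
            n.pastQueries.Add(tid, q)
    }

    func (n *remoteNode) findQuery(tid string) (*queryType, bool) {
            v, ok := n.pastQueries.Get(tid)
            if !ok {
                    return nil, false
            }
            return v.(*queryType), true
    }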

@btkador
Author

btkador commented Aug 6, 2018

I had it running with a maximum of 1000 pastQueries and pendingQueries, but it went up to 20 GB of used memory in the profiler...

I had to run it again because it killed my profiler, so the screenshot/data below is from after about a 20-minute run...

[screenshot: heap3 (heap profile)]

If I subtract totalKilledNodes from totalNodes, there are still 1,434,393 active nodes, which is quite a lot... maybe limit the number of nodes, too?

"totalDroppedPackets": 0,
"totalFindNodeDupes": 491,
"totalGetPeersDupes": 59304708,
"totalKilledNodes": 291915,
"totalNodes": 3941162,
"totalNodesReached": 2587206,
"totalPacketsFromBlockedHosts": 4341,
"totalPeers": 13924652,
"totalReadBytes": 2739891552,
"totalRecv": 10306303,
"totalRecvFindNode": 106502,
"totalRecvFindNodeReply": 658,
"totalRecvGetPeers": 119602,
"totalRecvGetPeersReply": 8042017,
"totalRecvPingReply": 1903051,
"totalSelfPromotions": 1943,
"totalSent": 16217619,
"totalSentFindNode": 1031,
"totalSentGetPeers": 13092528,
"totalSentPing": 2834225,
"totalWrittenBytes": 1458361886

@nictuku
Owner

nictuku commented Aug 6, 2018 via email

@btkador
Author

btkador commented Aug 6, 2018

For testing output I've just implemented:

func (d *DHT) GetNumNodes() int {
        return d.routingTable.numNodes()
}
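
A throwaway watcher for that number could look like this (a sketch that sits next to the setup code from the first comment; the 30-second interval is arbitrary):

    // Needs "log" and "time" from the standard library; d is the *dht.DHT
    // created above, with the GetNumNodes method patched in.
    func logNodeCount(d *dht.DHT) {
            for range time.Tick(30 * time.Second) {
                    log.Printf("routing table nodes: %d", d.GetNumNodes())
            }
    }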

and the number grows way higher than MaxNodes. It's at 500,000 after 5 minutes. MaxNodes seems to only take effect at bootstrap and when we receive a ping from an external node, to decide whether we "learn" the new node. But when sending out get_peers, every learned node is always stored via getOrCreateNode, which we can see in the profiler, too. And we have to learn all of them to be able to process the get_peers response, so there can't be any hard limit.

But if I search 100,000 different hashes within one hour and each search teaches us 1,000 new nodes (because get_peers always returns new nearest nodes), that's already 100 million nodes plus their pastQueries.

I've now added a createTime to each new node, and expire them after CleanupPeriod*2 if unused.

                // kill old and currently unused nodes
                if time.Since(n.createTime) > cleanupPeriod*2 && n.pendingQueries.Len() == 0 {
                        log.V(4).Infof("DHT: Old node with 0 pendingQueries. Deleting")
                        r.kill(n, p)
                        continue
                }
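
The other half of the change is stamping the node when it is first learned. Schematically it looks like this; the function below is a simplified sketch, not the library's actual getOrCreateNode:

    package dht

    import "time"

    type remoteNode struct {
            address    string
            createTime time.Time // new field used by the expiry check above
            // ... id, pendingQueries, pastQueries, etc.
    }

    type routingTable struct {
            addresses map[string]*remoteNode
    }

    // getOrCreateNode (simplified): every node we learn gets a creation
    // timestamp, so the cleanup loop can expire nodes that never ended up
    // being used for a query.
    func (r *routingTable) getOrCreateNode(address string) *remoteNode {
            if n, ok := r.addresses[address]; ok {
                    return n
            }
            n := &remoteNode{
                    address:    address,
                    createTime: time.Now(),
            }
            r.addresses[address] = n
            return n
    }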

and numNodes now stays around 1 million and the heap around 4 GB.

Next I will try another version where I keep your original code without all the LRU modifications and only add the expiry change above.

[screenshot: heap5 (heap profile)]
