Connections or queries are hanging, and the following warning appears in the memsql.log of an aggregator node:
WARN: The ready queue has not decreased (currently <num> elements, <num> pops) for <num> seconds. This workload needs more threads.
The value of max_connection_threads on the aggregator may be set too low for your workload. MemSQL can't allocate new threads to handle new connections, so these connections are queued, causing the aggregator to stop responding. Each query running on your cluster (including internal queries used by the nodes to communicate with each other) requires exactly one thread on an aggregator.
Note that restarting your aggregator terminates its current connections, so the warning stops being reported only until the workload resumes.
"This workload needs more threads" refers to the kernel threads dedicated to MemSQL, specifically the value of max_connection_threads. If you see this warning frequently, you may need to increase the value of max_connection_threads.
"Elements" refers to the number of queued queries (queries that are waiting for execution threads).
"Pops" refers to the cumulative number of queries that have been scheduled. If the pops value stays the same from one warning to the next, the scheduler isn't making progress because the queries currently executing in the system are still running.
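The meaning of the elements and pops fields can be illustrated with a short Python sketch that extracts them from the warning and flags the case where pops has not advanced between two consecutive warnings. This is only an illustration: the pattern assumes the numeric fields are plain integers as in the template above, real log lines may carry extra prefixes such as timestamps, and the names parse_warning and scheduler_stalled are hypothetical helpers, not part of MemSQL.

```python
import re
from typing import Iterable, NamedTuple, Optional

# Pattern built from the warning template shown above. Assumption: all three
# numeric fields are plain integers. search() is used so that any prefix
# (timestamp, log level) in a real log line is ignored.
WARNING_RE = re.compile(
    r"The ready queue has not decreased "
    r"\(currently (?P<elements>\d+) elements, (?P<pops>\d+) pops\) "
    r"for (?P<seconds>\d+) seconds"
)


class ReadyQueueWarning(NamedTuple):
    elements: int  # queries currently waiting for an execution thread
    pops: int      # cumulative number of queries scheduled so far
    seconds: int   # how long the queue has not decreased


def parse_warning(line: str) -> Optional[ReadyQueueWarning]:
    """Return the parsed fields, or None if the line is not this warning."""
    match = WARNING_RE.search(line)
    if match is None:
        return None
    return ReadyQueueWarning(
        elements=int(match["elements"]),
        pops=int(match["pops"]),
        seconds=int(match["seconds"]),
    )


def scheduler_stalled(lines: Iterable[str]) -> bool:
    """True if two consecutive warnings report the same pops value,
    meaning no query was scheduled between them."""
    previous = None
    for line in lines:
        warning = parse_warning(line)
        if warning is None:
            continue
        if previous is not None and warning.pops == previous.pops:
            return True
        previous = warning
    return False
```

Passing the lines of an aggregator's memsql.log to scheduler_stalled gives a quick indication of whether the scheduler was stuck behind long-running queries or merely busy.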
Increase the value of max_connection_threads. The max_connection_threads parameter limits the number of threads allowed on the node, and its value needs to be adjusted on the aggregator nodes (the default for aggregator nodes is 192; the maximum value is 8192). Note: the value on the leaf nodes does not need to be changed.
Warning: Increasing the value of max_connection_threads on aggregator nodes allows more traffic to be admitted into the cluster, which can cause resource pressure on the leaves.
HOW CAN I MONITOR THREAD UTILIZATION?
To check how many threads are running and whether queries are being queued because of max_connection_threads, run SHOW STATUS EXTENDED on the aggregator and look at threads_running (the current number of running threads) and ready_queue (the number of connections currently waiting in the queue).
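As a rough sketch of how these two values might be interpreted, the Python function below takes the (name, value) rows returned by SHOW STATUS EXTENDED, fetched with whatever MySQL-compatible client you already use, together with the aggregator's max_connection_threads setting. The function name summarize_thread_usage and the shape of its result are hypothetical; the status names are matched case-insensitively because their exact capitalization is an assumption here.

```python
from typing import Dict, Iterable, Tuple, Union


def summarize_thread_usage(
    status_rows: Iterable[Tuple[str, str]],
    max_connection_threads: int,
) -> Dict[str, Union[int, float, bool]]:
    """Summarize thread usage from SHOW STATUS EXTENDED rows.

    status_rows: (variable name, value) pairs as returned by the client.
    max_connection_threads: the limit configured on this aggregator.
    """
    # Normalize names to lowercase so the lookup does not depend on
    # how the server capitalizes them.
    status = {name.lower(): value for name, value in status_rows}
    running = int(status["threads_running"])
    queued = int(status["ready_queue"])
    return {
        "threads_running": running,
        "ready_queue": queued,
        # Fraction of the thread limit currently in use.
        "utilization": running / max_connection_threads,
        # Any queued connection means work is waiting for a thread.
        "queueing": queued > 0,
    }
```

A utilization close to 1.0 together with a non-empty ready queue suggests the aggregator is at its thread limit and new connections are waiting for a thread.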
If your workload runs enough concurrent queries to reach max_connection_threads, and those queries run long enough that none finishes for many seconds, the aggregator's memsql.log will start showing WARN: This workload needs more threads.
Note: A BACKUP can indirectly cause this warning. If a BACKUP is started while long-running queries are executing in the cluster, the BACKUP has to wait for those queries to complete. During this time, new queries entering the cluster also have to wait for the long-running queries to finish. If you see this warning close in time to BACKUP operations in the Master Aggregator's memsql.log, make sure that BACKUP is not being run while long-running queries are present in the cluster.