2) Learning: Limitations of Using Asynchronous Code Blocks

Bhargav Kulkarni
5 min read · May 9, 2021

Background

Asynchronous programming has gained huge attention from the developer community for building client- and server-side applications, be it the Vert.x toolkit, JavaScript, or Node.js. As we continue to write new microservice-based applications, the usage of async programming will continue to go up. The reason is simple: when we break a large, complex application into microservices, more network IO is needed, whether to another microservice or to a database in the cloud, and we have to build the application with the minimum possible hardware resources. This is possible when we write efficient code using async constructs. Asynchronous constructs solve the current task T by breaking it into smaller tasks T1, T2, T3. When another new task T(new) consisting of subtasks T4, T5, T6 arrives in the system, async constructs may pick up task T4 while task T1 is doing an IO operation. This essentially eliminates the need to wait for T1's IO to complete. This is how we avoid idle waiting, use CPU cycles effectively, and hence utilize hardware resources effectively.
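
To make the interleaving concrete, here is a minimal sketch in plain Java (using CompletableFuture rather than any particular toolkit, purely as an illustration): while T1 is parked on simulated IO, the runtime is free to run T4, so no thread sits idle waiting.

import java.util.concurrent.CompletableFuture;

public class AsyncInterleaving {
    public static void main(String[] args) {
        // Task T1: simulated IO (e.g., a network or database call)
        CompletableFuture<String> t1 = CompletableFuture.supplyAsync(() -> {
            sleep(100); // pretend this is IO wait; the caller is not blocked
            return "T1 result";
        });

        // While T1 waits on IO, the runtime is free to start T4 from another task
        CompletableFuture<String> t4 = CompletableFuture.supplyAsync(() -> "T4 result");

        // Continuations run when each task completes; no thread idles in between
        t1.thenAccept(System.out::println);
        t4.thenAccept(System.out::println);

        CompletableFuture.allOf(t1, t4).join();
    }

    private static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}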

Async constructs keep asking for more requests to process, which in turn gives the maximum possible throughput with the given resources. But what if throughput is not the main problem? What if we have only X ms to process a single request: can asynchronous constructs guarantee that this condition is met under all circumstances?

Problem Statement

This is one of my experiences using Vert.x and the Reactive PG Client library. The service performs a query lookup on field X against a Postgres database. The Reactive PG Client library is used to submit the query and retrieve the results using asynchronous constructs. The whole system works as expected. The main goal of this service is to process a request end to end with a latency of less than 10 ms.

The lookup from Postgres itself is highly efficient, as is evident from the query plan below.

QUERY PLAN
------------------------------------------------------------------------------------
Limit (cost=0.42..4.30 rows=1 width=48) (actual time=0.039..0.040 rows=1 loops=1)
Planning time: 0.065 ms
Execution time: 0.052 ms
(5 rows)

Execution took just 117,000 nanoseconds (planning plus execution time). This is really good: we have 10 ms to process the entire request, which still leaves almost 9.88 ms for any other processing logic on the server.

The service uses the Reactive PG Client library to execute database queries, so the entire Vert.x architecture looks as shown below.

  1. A request is received from an external source for processing through the service channels.
  2. The business layer starts executing.
  3. The business layer makes an asynchronous request using the Reactive PG Client library.
  4. The request is added to the Reactive PG Client queue.
  5. All requests in the queue are processed and flushed to the DB IO channel in batches. Steps 1 through 5 repeat for new requests coming into the service from the service channels.
  6. An asynchronous response is received from Postgres. Any new requests added to the queue during step 5 are flushed to the DB IO channel.
  7. The business layer receives a callback once the response arrives and continues executing the business logic.
  8. The final response is sent to the external service.
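
For reference, here is a minimal sketch of how the business layer might submit such a lookup with the Reactive PG Client (steps 3, 4, and 7 above); the table and column names, connection details, and pool size are hypothetical.

import io.reactiverse.pgclient.PgClient;
import io.reactiverse.pgclient.PgPool;
import io.reactiverse.pgclient.PgPoolOptions;
import io.reactiverse.pgclient.Tuple;
import io.vertx.core.Vertx;

public class LookupService {
    public static void main(String[] args) {
        Vertx vertx = Vertx.vertx();
        PgPoolOptions options = new PgPoolOptions()
                .setHost("localhost")      // hypothetical connection details
                .setDatabase("mydb")
                .setUser("user")
                .setPassword("secret")
                .setMaxSize(4);            // e.g., 4 connections, as in the measurements below

        PgPool pool = PgClient.pool(vertx, options);

        // Steps 3-4: the query is appended to the connection's internal queue and
        // flushed to the DB IO channel; the event loop keeps accepting new requests.
        pool.preparedQuery("SELECT * FROM lookup WHERE x = $1", Tuple.of("key"), ar -> {
            // Step 7: callback into the business layer once the response arrives
            if (ar.succeeded()) {
                System.out.println("rows: " + ar.result().rowCount());
            } else {
                ar.cause().printStackTrace();
            }
        });
    }
}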

The above execution path is considered the fast path, with extremely low-latency operations. Going by the above results, query execution on Postgres takes just 117,000 nanoseconds. But we have missed an important metric that can really affect overall request processing at the business layer (explained later). Within that 117,000 ns window (before the response is received by the Reactive PG Client), the event-loop thread will try to pick as many requests as possible from the service channel. This builds up the queue (one per socket connection) maintained by the Reactive PG Client. You might point out that we can establish multiple connections per event-loop thread, in which case DB events are distributed evenly across n socket connections, with each queue accounting for (117000/n) ns worth of events from the service channel. This is still a problem, as it creates a cascading effect on the Postgres backend serving each connection. To illustrate this problem, I modified the Reactive PG Client library to collect some additional metrics.

Class: https://github.com/vietj/reactive-pg-client/blob/master/src/main/java/io/reactiverse/pgclient/impl/SocketConnection.java
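
The stock library does not expose such numbers, so the patch added simple per-connection counters. Below is a minimal sketch of that kind of instrumentation (the class and field names are mine and the placement inside SocketConnection is simplified; this is not the actual patch):

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical per-connection counters, one instance per SocketConnection,
// mirroring the queryTime / inflight / pending / connect-failures sections below.
public class ConnectionMetrics {
    final AtomicLong queryTimeNanos = new AtomicLong();     // worst query round trip observed
    final AtomicInteger inflight = new AtomicInteger();     // commands written, response pending
    final AtomicInteger pending = new AtomicInteger();      // commands queued, not yet written
    final AtomicInteger connectFailures = new AtomicInteger();

    long start() {
        inflight.incrementAndGet();
        return System.nanoTime();                           // stamp taken when the command is sent
    }

    void complete(long startNanos) {
        inflight.decrementAndGet();
        long elapsed = System.nanoTime() - startNanos;
        queryTimeNanos.accumulateAndGet(elapsed, Math::max); // keep the worst case seen
    }
}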

We exposed the collected metrics via a REST API, and the results follow. The integer leaf-node keys in the metrics below represent hashed socket identifiers; in the queryTime section the values are times measured in nanoseconds, while in the other sections they are counts. In this setup, multiple DB connections (4) are created per event-loop thread, and multiple event loops are created (one event-loop thread per core).

{
"queryTime":{
"947139684":64098,
"870138099":64257,
"571609198":2044564,
"1177822618":8803116,
"357940160":2731546,
"691144663":66074,
"49291374":88167,
"580673589":16244104,
"72272856":4542941,
"429055876":129759,
"2071082229":3208680,
"712993563":8499832,
"357421376":2466769,
"518436688":5003819,
"558694584":2857451,
"1134990246":7039653,
"1210362425":68165,
"534895687":2945543,
"1763592356":887891,
"248069024":92260,
"309707454":5802216,
"1579201053":103880,
"665078391":6535659,
"484531463":94994,
"1969575800":94518,
"952529257":5560390,
"40058837":8800273,
"461168027":161708,
"444841905":2307114,
"1996322984":2509461,
"1672521436":115658,
"1381936783":129586,
"819122427":1443221,
"1940741597":14087894,
"916535950":7116106,
"1783950511":74509,
"1899696145":5352965,
"959053225":1065476,
"1708503277":8239912,
"1366257489":7386405,
"646920182":200915168,
"1970541773":8507163,
"420868872":3725199,
"1662195666":2067353,
"1308379134":4435508,
"72437234":103708,
"2119817703":1767256,
"1125344897":8149647,
"247238406":5367199,
"1944225625":3768147,
"1221362774":2740672,
"1893796209":2145710,
"949874290":1393597,
"971824796":1102277,
"1906609465":13083518,
"884330563":6922936
},
"inflight":{
"947139684":34,
"870138099":24,
"571609198":0,
"1177822618":35,
"357940160":0,
"691144663":24,
"49291374":34,
"580673589":26,
"72272856":0,
"429055876":0,
"2071082229":0,
"712993563":0,
"357421376":0,
"518436688":0,
"558694584":0,
"1134990246":0,
"1210362425":24,
"534895687":0,
"1763592356":0,
"248069024":37,
"309707454":0,
"1579201053":0,
"665078391":0,
"484531463":0,
"1969575800":37,
"952529257":0,
"40058837":0,
"461168027":0,
"444841905":0,
"1996322984":0,
"1672521436":34,
"1381936783":0,
"819122427":0,
"1940741597":13,
"916535950":0,
"1783950511":24,
"1899696145":0,
"959053225":0,
"1708503277":0,
"1366257489":0,
"646920182":35,
"1970541773":0,
"420868872":0,
"1662195666":0,
"1308379134":0,
"72437234":0,
"2119817703":0,
"1125344897":0,
"247238406":0,
"1944225625":0,
"1221362774":8,
"1893796209":0,
"949874290":0,
"971824796":0,
"1906609465":26,
"884330563":0
},
"connect-failures":{
"947139684":0,
"870138099":0,
"571609198":0,
"1177822618":0,
"357940160":0,
"691144663":0,
"49291374":0,
"580673589":0,
"72272856":0,
"429055876":0,
"2071082229":0,
"712993563":0,
"357421376":0,
"518436688":0,
"558694584":0,
"1134990246":0,
"1210362425":0,
"534895687":0,
"1763592356":0,
"248069024":0,
"309707454":0,
"1579201053":0,
"665078391":0,
"484531463":0,
"1969575800":0,
"952529257":0,
"40058837":0,
"461168027":0,
"444841905":0,
"1996322984":0,
"1672521436":0,
"1381936783":0,
"819122427":0,
"1940741597":0,
"916535950":0,
"1783950511":0,
"1899696145":0,
"959053225":0,
"1708503277":0,
"1366257489":0,
"646920182":0,
"1970541773":0,
"420868872":0,
"1662195666":0,
"1308379134":0,
"72437234":0,
"2119817703":0,
"1125344897":0,
"247238406":0,
"1944225625":0,
"1221362774":0,
"1893796209":0,
"949874290":0,
"971824796":0,
"1906609465":0,
"884330563":0
},
"pending":{
"947139684":0,
"870138099":0,
"571609198":0,
"1177822618":0,
"357940160":0,
"691144663":0,
"49291374":0,
"580673589":0,
"72272856":0,
"429055876":0,
"2071082229":0,
"712993563":0,
"357421376":0,
"518436688":0,
"558694584":0,
"1134990246":0,
"1210362425":0,
"534895687":0,
"1763592356":0,
"248069024":0,
"309707454":0,
"1579201053":0,
"665078391":0,
"484531463":0,
"1969575800":0,
"952529257":0,
"40058837":0,
"461168027":0,
"444841905":0,
"1996322984":0,
"1672521436":0,
"1381936783":0,
"819122427":0,
"1940741597":0,
"916535950":0,
"1783950511":0,
"1899696145":0,
"959053225":0,
"1708503277":0,
"1366257489":0,
"646920182":0,
"1970541773":0,
"420868872":0,
"1662195666":0,
"1308379134":0,
"72437234":0,
"2119817703":0,
"1125344897":0,
"247238406":0,
"1944225625":0,
"1221362774":0,
"1893796209":0,
"949874290":0,
"971824796":0,
"1906609465":0,
"884330563":0
}
}

From the above metrics we can see that:

  1. queryTime has gone up to roughly 200 ms in the worst case. This is due to the large number of inflight requests on a single client socket.
  2. If a single flush results in too many DB requests, it slows down the execution of the later query statements in that batch, since there is only one server backend process per connection.
  3. Inflight counts are as high as 37 on a single connection, which has a further cascading effect on the batch size of a single flush to the DB.
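
As a rough back-of-envelope check (assuming the backend works through a batch roughly serially): if a single flush delivers 35 queries that each need about 117 µs of server time, the last query in the batch waits roughly 35 × 117 µs ≈ 4.1 ms before it even starts executing. That queueing delay, not the query itself, is what pushes queryTime into the multi-millisecond range seen above.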

The above metrics indicate the problem quite clearly. While async constructs utilize hardware resources effectively, they may end up introducing queues in between, which can make the end-to-end latency of request processing unpredictable. For latency-critical applications, that unpredictability is the real cost.

My intention here was to explain this downside of using async constructs. For this service, the obvious solution is an efficient lookup mechanism where the communication with the database is synchronous, so that at most one request is outstanding per connection and no client-side queue can build up.
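
As one concrete direction (a minimal sketch with hypothetical table, column, and connection details, not a drop-in replacement for the service), a plain blocking JDBC lookup on a dedicated worker thread keeps exactly one request in flight per connection, trading some throughput for bounded, predictable per-request latency:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SyncLookup {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details; requires the Postgres JDBC driver on the
        // classpath, and in practice a pooled DataSource rather than DriverManager.
        String url = "jdbc:postgresql://localhost:5432/mydb";
        try (Connection conn = DriverManager.getConnection(url, "user", "secret");
             PreparedStatement ps = conn.prepareStatement("SELECT value FROM lookup WHERE x = ?")) {
            ps.setString(1, "key");
            try (ResultSet rs = ps.executeQuery()) {   // blocks this thread until the row arrives
                if (rs.next()) {
                    System.out.println(rs.getString("value"));
                }
            }
        }
    }
}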

