Adventures with Python’s Asyncio

I finally got the chance to work with Python’s asyncio package in a semi-real-world project recently. I have a scraping program that collects investment fund data and inserts it into a database. As the number of funds grew, so did the number of investment vehicles to scrape. In revisiting the app, I decided it was a good time to finally try rewriting it with asyncio.

Asyncio has been out for years now, but it has taken its share of criticism. Instead of trying to wrap my head around it, my strategy has been to wait things out and use Go for data engineering work instead. Go makes concurrency wonderfully easy, but I didn’t want to redo this app in Go because it relies on BeautifulSoup for parsing, which is pretty incredible. Over a few Python releases, asyncio has been honed, simplified, and abstracted enough for mortals to use, so I thought I’d give it a shot.

There are two key steps in this app that would benefit from async: retrieving and parsing a page, and saving the data to a database. I read a few tutorials and gave it a first pass. The results were pretty surprising.

Trial 1: Synchronous, Parsing and Clean Database (832 Parsings)
Run Milliseconds
1 366083
2 298531
3 258822
4 245793

Average: 292307.25, SD 54056.86

Trial 2: Asyncio, Parsing and Clean Database (832 Parsings)
Run Milliseconds
1 530063
2 456371
3 429044
4 438373

Average: 463462.75, SD 45825.87

There was a noticeable “warm-up” with the first runs, but subsequent runs were usually faster. The key takeaway, though, was that while asyncio was more consistent, on average, it was almost 60% slower than synchronous!
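The summary figures for these two trials can be re-derived with a few lines of Python. This is only a sketch: it assumes the SD values are sample standard deviations (dividing by n - 1), which is the variant that reproduces the reported numbers, and the variable names are made up for illustration.

```python
from statistics import mean, stdev

# Run times in milliseconds, copied from the two tables above.
sync_ms = [366083, 298531, 258822, 245793]
async_ms = [530063, 456371, 429044, 438373]


def summarize(samples):
    """Return (mean, sample standard deviation) for a list of timings."""
    return mean(samples), stdev(samples)


sync_mean, sync_sd = summarize(sync_ms)
async_mean, async_sd = summarize(async_ms)

# How much slower the asyncio version was, relative to the sync mean.
slowdown_pct = (async_mean / sync_mean - 1) * 100

print(f"sync:  mean={sync_mean:.2f} ms, sd={sync_sd:.2f}")
print(f"async: mean={async_mean:.2f} ms, sd={async_sd:.2f}")
print(f"asyncio slower by {slowdown_pct:.1f}%")
```

With only four runs per trial, and a visibly slow first run in each, these spreads are rough indications rather than firm measurements.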

Clearly I was doing something wrong. I suspected three things:
1) I did not fully understand the package and wrote the whole thing incorrectly. (High suspicion)
2) Database operations were not being asynced. (High suspicion)
3) BeautifulSoup parsing and downloading were not being asynced. (Low suspicion)

I attacked #1 and #2 simultaneously.

I started to re-read tutorials and documentation, and made some changes to the code. I mainly learned that a lot of the blog posts and documentation out there are outdated, needlessly complex, and/or do not simulate real-world situations. sleep() does not replace blocking operations IRL! Defining a simple async function that is awaited definitely fit my needs and, thinking back on work projects, would have covered maybe 80-90% of my historical data engineering needs.
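To make the sleep() point concrete, here is a small sketch comparing asyncio.sleep, which tutorials tend to use, with time.sleep, which behaves more like a real blocking call. The worker names and the 0.1 second delay are illustrative assumptions, not code from the scraper.

```python
import asyncio
import time

DELAY = 0.1  # seconds each fake "download" takes (illustrative value)


async def cooperative(i):
    # asyncio.sleep yields to the event loop, so other tasks run meanwhile.
    await asyncio.sleep(DELAY)
    return i


async def blocking(i):
    # time.sleep stands in for a blocking call (such as a synchronous HTTP
    # request); it never yields, so the event loop is stuck until it returns.
    time.sleep(DELAY)
    return i


async def timed(worker, n=3):
    # Run n workers "concurrently" and measure the wall-clock time.
    start = time.perf_counter()
    results = await asyncio.gather(*(worker(i) for i in range(n)))
    return results, time.perf_counter() - start


coop_results, coop_elapsed = asyncio.run(timed(cooperative))
block_results, block_elapsed = asyncio.run(timed(blocking))
print(f"asyncio.sleep: {coop_elapsed:.2f}s, time.sleep: {block_elapsed:.2f}s")
```

The cooperative version should finish in roughly one delay, while the blocking version takes roughly three, even though both go through asyncio.gather.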

After these changes, subsequent runs produced similar numbers. The problem had to be elsewhere.

I went down the rabbit hole of asyncing the database operations. The project uses SQLAlchemy for database operations. The result of this dive was not good. SQLAlchemy’s async features are in beta. That’s fine. However, the drivers need to be async too, and the only reliable one right now is for PostgreSQL. There is one for MariaDB/MySQL, aiomysql, but I ran into so many bugs that are still open that I had to abandon that effort.

So, I was ready to just ditch asyncio for now and stick with synchronous code. Just to test, though, I removed parts of the code that wrote to the database. My code would just download and parse. Guess what? Similar results. The async version was much slower. Database operations were not the blocking factor.

At this point, I felt more confident with how asyncio worked. I thought this through. The main workhorse functions did parsing and database insertions. Since they were defined as coroutines, they all had to run and complete. The parsing had to occur before the database operations. There was no way around that. Whether BeautifulSoup or SQLAlchemy was async or not didn’t matter. Both had to run, so both were in the same async’ed function.
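The ordering constraint described here (parse first, then save, inside one coroutine per fund) can be sketched as below. Note that this is not the scraper's code: fetch_and_parse and save are hypothetical stand-ins that use time.sleep for the blocking work, and the sketch swaps in asyncio.to_thread (Python 3.9+) so that blocking calls for different funds can overlap instead of holding the event loop.

```python
import asyncio
import time


def fetch_and_parse(fund_id):
    # Hypothetical stand-in for a blocking download plus BeautifulSoup parse.
    time.sleep(0.1)
    return {"fund": fund_id, "price": 100 + fund_id}


def save(record, store):
    # Hypothetical stand-in for a blocking database insert.
    time.sleep(0.05)
    store.append(record)


async def process(fund_id, store):
    # Within one fund, parsing still has to finish before saving starts.
    record = await asyncio.to_thread(fetch_and_parse, fund_id)
    await asyncio.to_thread(save, record, store)


async def main(fund_ids):
    store = []
    start = time.perf_counter()
    # Different funds are processed concurrently; each one keeps its order.
    await asyncio.gather(*(process(f, store) for f in fund_ids))
    return store, time.perf_counter() - start


store, elapsed = asyncio.run(main(range(4)))
print(f"saved {len(store)} records in {elapsed:.2f}s")
```

Run one after another, four funds would need about 0.6 seconds here; with the blocking calls on threads, the total should be closer to the cost of a single fund.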

What else could it be? From more reading and thinking, there’s a lot of overhead in asyncio, and others have noted it is slower in certain use cases. Maybe there was an economies-of-scale issue here.
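One rough way to get a feel for that overhead is to time the same trivial work as plain function calls and as awaited coroutines. This is only a sketch with made-up names and an arbitrary call count; it measures the per-call cost of creating and awaiting a coroutine, not task scheduling or real I/O, and the numbers will vary by machine.

```python
import asyncio
import time

N = 100_000  # illustrative number of tiny calls


def plain(x):
    return x + 1


async def coro(x):
    return x + 1


def run_plain():
    # Baseline: N ordinary function calls.
    start = time.perf_counter()
    total = 0
    for i in range(N):
        total += plain(i)
    return total, time.perf_counter() - start


async def run_coro():
    # Same work, but each call creates a coroutine object that is awaited.
    start = time.perf_counter()
    total = 0
    for i in range(N):
        total += await coro(i)
    return total, time.perf_counter() - start


plain_total, plain_s = run_plain()
coro_total, coro_s = asyncio.run(run_coro())
print(f"plain: {plain_s:.3f}s, awaited coroutines: {coro_s:.3f}s")
```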

I took this opportunity to update my dataset. I was parsing about 832 investment vehicles with each run, but those were out of date. I updated the funds I was monitoring and this jumped it to 1,049. I then re-ran both the async and sync versions.

Trial 3: Synchronous, Parsing and Clean Database (1,049 Parsings)
Run Milliseconds
1 338347
2 256635
3 249421
4 241273

Average: 271419, SD 45057.80

Trial 4: Asyncio, Parsing and Clean Database (1,049 Parsings)
Run Milliseconds
1 295613
2 267927
3 245697
4 231022

Average: 260064.75, SD 28138.98

Both methods sped up, but the async version finally decreased below the synchronous version! And it was way more consistent!

Takeaways

  • By increasing the amount of work, the average async run time went from about 7.7 minutes to about 4.3 minutes. This leads me to believe there is an overhead cost with asyncio, and you will not see the benefits on small datasets and projects.
  • Multiple blog posts say “async/await creates a coroutine.” This is not true. “async def” defines a coroutine function, and calling it creates the coroutine. await runs that coroutine and pauses the caller until it’s done.
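  • That means your code still works if the libraries you use for blocking operations are not async-ready. You can just place them in a coroutine (defined by “async def”), await the call, and it will be run until complete for you. The catch is that a blocking call holds the event loop while it runs, so nothing else overlaps with it unless it is moved to a thread or replaced with an async-aware library.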
  • Only awaitables can be awaited: mainly coroutines (again, created by async def), plus Tasks and Futures. (Generator-based coroutines are another can of worms.) It’s fine to define an async function without any awaits because the whole thing becomes just an awaitable coroutine.
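The coroutine points in the list above can be checked with a short sketch. The function names are made up for illustration.

```python
import asyncio
import inspect


async def no_awaits():
    # An async function with no await inside is still a valid coroutine function.
    return 42


async def main():
    coro = no_awaits()  # calling it only creates a coroutine object
    is_coro = inspect.iscoroutine(coro)
    value = await coro  # await runs it and hands back the return value

    # A Task wraps a coroutine and is awaitable as well.
    task = asyncio.create_task(no_awaits())
    task_value = await task
    return is_coro, value, task_value


result = asyncio.run(main())
print(result)
```

Nothing inside no_awaits runs until the coroutine object is awaited or wrapped in a Task, which is why "async def" alone does not start any work.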

All in all, good fun, good practice, complicated, and a lot easier to control than JavaScript’s async operations.
