Tuesday, March 31, 2009

lusca release - rev 13894

I've just put the latest Lusca-HEAD release up for download on the downloads page. This is the version currently running on the busiest Cacheboy CDN nodes (> 200Mbit/s each) with plenty of resources to spare.

The major changes from Lusca-1.0 (and Squid-2 / Squid-3, before that):

  • The memory pools code has been gutted so it now acts as a statistics-keeping wrapper around malloc() rather than trying to cache memory allocations; this is in preparation for finding and fixing the worst memory users in the codebase! (A sketch of such a wrapper appears after this list.)
  • Reference-counted buffers and some supporting framework have appeared!
  • The server-side code has been reorganised somewhat in preparation for copy-free data flow from the server to the store (src/http.c).
  • The asynchronous disk IO code has been extracted from the AUFS codebase and turned into a (mostly - one external variable left...) standalone library; it should now be reusable by other parts of Lusca.
  • Some more performance work across the board.
  • Code reorganisation and tidying up in preparation for further IPv6 integration (which was mostly completed in another branch, but that work moved along too quickly and caused some stability issues I wasn't willing to keep in Lusca for now...).
  • More code has been shuffled into separate libraries (especially libhttp/, the HTTP code library) in preparation for some wide-scale performance changes.
  • Plenty more headerdoc-based code documentation!
  • Support for full transparent interception on FreeBSD-CURRENT and on Linux via TPROXY-4.
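
To illustrate the first point, here's a minimal sketch of a statistics-only pool (hypothetical names and fields; the real Lusca code tracks rather more than this):

    /* A stats-keeping wrapper: count allocations for cachemgr reporting,
     * but always defer to the system allocator instead of caching memory. */
    #include <stdlib.h>

    typedef struct {
        const char *label;   /* pool name, for reporting */
        size_t obj_size;     /* size of each object */
        long in_use;         /* currently outstanding allocations */
        long total_allocs;   /* lifetime allocation count */
    } MemPoolStats;

    static void *
    pool_alloc(MemPoolStats *p)
    {
        p->in_use++;
        p->total_allocs++;
        return calloc(1, p->obj_size);  /* plain allocation, no free list */
    }

    static void
    pool_free(MemPoolStats *p, void *obj)
    {
        p->in_use--;
        free(obj);                      /* straight back to the system */
    }
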
The next few weeks should be interesting. I'll post a TODO list once I'm back in Australia.

Documentation pass - disk IO

I've begun documenting the disk IO routines in Lusca-HEAD. I've started with the legacy disk routines in libiapp/disk.c - these are used by some of the early code (errorpages, icon loading, etc) and the UFS based swap log (ufs, aufs, diskd).

This code is relatively straightforward, but I'm also documenting the API's shortcomings so others (including myself!) will understand why it isn't quite good enough in its current form for fully asynchronous disk IO.
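
For reference, the API in question is callback-based and looks roughly like this (paraphrased from memory - check libiapp/disk.c for the exact signatures):

    /* Completion is signalled via a callback, with operations
     * queueing one behind the other on each file descriptor. */
    typedef void DRCB(int fd, const char *buf, int len, int errflag, void *data);
    typedef void DWCB(int fd, int errflag, size_t len, void *data);
    typedef void FREE(void *);

    void file_read(int fd, char *buf, int req_len, off_t req_offset,
                   DRCB *handler, void *client_data);
    void file_write(int fd, off_t file_offset, void *buf, int len,
                    DWCB *handler, void *client_data, FREE *free_func);

One obvious shortcoming: the IO itself is still performed in the main process, so a slow disk stalls everything else.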

I'll make a pass through the async IO code (libasyncio/aiops.c and libasyncio/async_io.c) on the flight back to Australia, documenting how the existing async IO code works and where it falls short.

Sunday, March 29, 2009

Async IO changes to Lusca

I've just cut all of the mempooled buffers out of Lusca and converted them to plain xmalloc()/xfree(). Since memory pools in Lusca no longer "cache" memory - they're just for statistics keeping - the only extra thing the pooled buffers were providing was zeroed memory for disk buffers.
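
The conversion itself is mechanical; schematically (illustrative, not the literal diff - pool and constant names approximate):

    /* Before: pooled buffer, which also happened to arrive zeroed. */
    buf = memPoolAlloc(aio_large_bufs);
    ...
    memPoolFree(aio_large_bufs, buf);

    /* After: plain allocation; zero explicitly only where it matters. */
    buf = xcalloc(1, AIO_LARGE_BUFSIZE);
    ...
    xfree(buf);
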

So now, for the "high hit rate, large object" workload the mirror nodes are currently handling, the top CPU user is memcpy(), via aioCheckCallbacks(). At least memset() is no longer up there alongside it.

That memcpy() accounts for roughly 17% of the total userland CPU used by Lusca under this particular workload.

I have this nagging feeling that said memcpy() is the one done in storeAufsReadDone(), where the AUFS code copies the result of the async read into the supplied buffer. It does this because it's entirely possible the caller has disappeared between the time the storage manager read was scheduled and the time the filesystem read() actually ran.
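
Schematically, the completion path looks something like this (a simplified sketch with approximate field names, not the literal code):

    /* The async thread read into its own buffer (buf). By the time this
     * callback runs, the original caller may be gone, so validate the
     * callback data first, then copy into the caller-supplied buffer -
     * that copy is the memcpy() showing up in the profiles. */
    static void
    storeAufsReadDone(int fd, void *my_data, const char *buf, int len, int errflag)
    {
        squidaiostate_t *aiostate = my_data;

        if (cbdataValid(aiostate->read.callback_data) && errflag == 0) {
            memcpy(aiostate->read.buf, buf, len);
            aiostate->read.callback(aiostate->read.callback_data,
                aiostate->read.buf, len);
        }
        /* else: caller vanished mid-flight; the thread-owned
         * buffer is simply discarded. */
        cbdataUnlock(aiostate->read.callback_data);
    }
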

Because the Squid codebase doesn't explicitly cancel or wait for the completion of async events - relying instead on the "locking" and "invalidation" semantics provided by the callback data scheme - passing buffers (and structs in general) into threads is pretty much impossible to do correctly.

In any case, the performance should now be noticeably better.

(obnote: I tried explaining this to the rest of the core Squid developers last year, and somehow I don't think I convinced them that the current approach, with or without the AsyncCallback scheme in Squid-3, is going to work without significant re-engineering of the source tree. Alas...)

Saturday, March 28, 2009

New Lusca-HEAD features to date

I've been adding in a few new features to Lusca.

First is the config option "n_aiops_threads", which allows configuration and runtime tuning of the number of IO threads. I got fed up with recompiling Lusca every time I wanted to fiddle with the number of threads, so I made it configurable.

Next is "client_socksize", which overrides the compiled-in and system default TCP socket buffer sizes for client-side sockets. This allows the admin to run Lusca with restricted client-side socket buffers whilst leaving the server-side socket buffers (and the default system buffer sizes) large. I'm using this on my Cacheboy CDN nodes to help scale under load: large socket buffers to grab files from the upstream servers quickly, but smaller buffers so as not to waste memory on a few fast clients.

The async IO code is now in a separate library rather than living in the AUFS disk storage module. This change is part of a general strategy to overhaul the disk handling code and improve storage and logfile write performance; I also hope to add asynchronous logfile rotation. The change breaks the "auto tuning" done via various hacks in the AUFS and COSS storage modules; just set "n_aiops_threads" to a sensible value (say, 8 * the number of storage directories you have, up to about 128 or so threads in total) and rely on that instead. I found the auto-tuning didn't quite work as intended anyway...
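
For the record, a hypothetical squid.conf fragment using both new options (directive names as described here; the values are examples, and I'm assuming the buffer size is in bytes):

    # IO threads: ~8 per storage directory, capped around 128 in total.
    # e.g. with 4 cache_dirs: 8 * 4 = 32.
    n_aiops_threads 32

    # Keep client-side socket buffers small; server-side (and system
    # default) buffer sizes stay large for fetching from upstreams.
    client_socksize 32768
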

Finally, I've started exporting some of the statistics from the "info" cachemgr page in a more easily machine-parsable format. The new "mgr:curcounters" page includes current client/server counts, current hit rates, and disk/memory storage sizes. I'll be using these counters as part of my Cacheboy CDN statistics code.
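
For example, the page can be fetched with squidclient and fed straight into the stats collector (invocation sketch; the exact output keys are whatever the page emits):

    squidclient -h localhost mgr:curcounters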