Wednesday, September 30, 2009

Just a few Lusca related updates!

  • All of the Cacheboy CDN nodes are running Lusca-HEAD now and are nice and stable.
  • I've deployed Lusca at a few customer sites and again, it is nice and stable.
  • The rebuild logic changes are, for the most part, nice and stable. There seems to be some weirdness with 32 vs 64 bit compilation options which I need to suss out but everything "just works" if you compile Lusca with large file/large cache file support regardless of the platform you're using. I may make that the default option.
  • I've got a couple of small coding projects to introduce a couple of small new features to Lusca - more on those when they're done!
  • Finally, I'm going to be migrating some more of the internal code over to use the sqinet_t type in preparation for IPv4/IPv6 agnostic support.
Stay Tuned!

Sunday, August 16, 2009

Squid-3 isn't a rewrite!


There seems to be this strange misconception that Squid-3 is a "rewrite" of Squid in C++. I am not sure where this particular little tidbit gets copy/pasted from but just for the record:

Squid-3 is the continuation of Squid-2.5, made to compile using the GNU C++ compiler. It is not a rewrite.

If Squid-3 -were- a rewrite, and the resultant code -was- as much of a crappy-performing, bastardised C/C++ hybrid, then I'd have suggested the C++ coders in question need to relearn C++. Luckily for them, the codebase is a hybrid of C and C++ because it did just start as a C codebase with bits and pieces part-migrated to C++.

Thursday, August 6, 2009

Preparation for next release; IPv6 checklist

I've been slowly working on tidying up the codebase before the next snapshot release. I've been avoiding doing further large scale code reorganisation until I'm confident that this codebase is as stable and performs as well as it should.

I'll hopefully have the next stable snapshot online tonight. I'll then re-evaluate where things are at right now and come up with a short-list of things to do over the next couple of weeks. It'll almost certainly be the remainder of the IPv6 preparation work - I'd like to prepare the last few bits of infrastructure for IPv6 - and make certain that is all stable before I start converting the client-side and server-side code to actively using the IPv6 routines.

The current IPv6 shortlist, if I decide to do it:
  1. client database code - convert to a radix tree instead of a hash on the IP address; make IPv4/IPv6 agnostic.
  2. persistent connection code - up the pconn hash key length to fit the text version of the IPv6 address. I'll worry about migrating the pconn code to a tree later on.
  3. Importing the last remaining bits of the IPv6 related code into the internal DNS code.
  4. Make sure the internal and external DNS choices both function properly when handling IPv6 addresses for forward and reverse lookups.
  5. Import the IP protocol ACL type and IPv6 address ACL types - src6 and dst6.
  6. Modify the ACL framework to use the IPv6 datatype instead of "sockaddr_in" and "inaddr" structs; then enable src6/dst6.
  7. Make certain the source and destination hostname ACLs function correctly for both IPv4 and IPv6.
  8. Test, test, test!
The last time I did a "hack" conversion to support IPv6 client side code I found a number of places which expected a newly-allocated struct to be zero'ed, and thus the "in_addr" embedded inside it to be INADDR_ANY. This caused some crashes to occur in production testing. I'm thus going to hold off on pushing through the IPv6 client side changes (which are actually surprisingly simple once the above is done!) until I've enumerated and fixed all of those particular nightmares.

The IPv6 server-side stuff is a whole different barrel of fun. I'm going to ignore a lot of that for now until I've made certain the client-side code is stable and performing as well as the current IPv4-only code.

I don't even want to think about the FTP related changes that need to occur. I may leave the FTP support IPv4 only until someone asks (nicely) about it. The FTP code is rife with C string pointer manipulations which need to be rewritten to use the provided string primitives. I'd really like to do -that- before I consider upgrading it to handle IPv6.

Anyway. Lots to do, not enough spare time to do it all in.

Tuesday, July 28, 2009

Updates - rebuild logic, peering and COSS work

I've committed the initial modifications to the storage rebuilding code. The changes mostly live in the AUFS and COSS code - the rest of Lusca isn't affected.

The change pushes the rebuild logic itself into external helpers which simply stream swaplog entries to the main process. Lusca doesn't care how the swaplog entries are generated.

The external helper method is big boost for AUFS. Each storedir creates a single rebuild helper process which can block on disk IO without blocking anything else. The original code in Squid will do a little disk IO work at a time - which almost always involved blocking the process until said disk IO completed.

The main motivation of this work was the removal of a lot of really horrible, twisty code and further modularisation of the codebase. The speedups to the rebuild process are a nice side-effect. The next big improvement will be sorting out how the swap logs are written. Fixing that will be key to allowing enormous caches to properly function without log rotation potentially destroying the proxy service.

Monday, July 13, 2009

Caching Windows Updates

There are two issues with caching windows updates in squid/lusca:

* the requests for data themselves are all range requests, which means the content is never cached in Squid/Lusca;
* the responses contain validation information (eg ETags) but the object is -always- returned regardless of whether the validators match or not.

This feels a lot like Google Maps who did the same thing with revalidation. Grr.

I'm not sure why Microsoft (and Google!) did this with their web services. I'll see if I can find someone inside Microsoft who can answer questions about the Windows Update related stuff to see if it is intentional (and document why) or whether it is an oversight which they would be interested in fixing.

In any case, I'm going to fix it for the handful of commercial supported customers which I have here.

Wednesday, July 8, 2009

Storage rebuilding / logging project - proposal

I've put forward a basic proposal to the fledgling Lusca community to get funding to fix up the storage logging and rebuilding code.

Right now the storage logging (ie, "swap.state" logging) is done using synchronous IO and this starts to lag Lusca if there is a lot of disk file additions/deletions. It also takes a -long- time to rotate the store swap log (which effectively halts the proxy whilst the logs are rotated) and an even longer time to rebuild the cache index at startup.

I've braindumped the proposal here - .

Now, the good news is that I've implemented the rebuild helper programs and the results are -fantastic-. UFS cache dirs will still take forever to rebuild if the logfile doesn't exist or is corrupt but the helper programs speed this up by a factor of "LOTS". It also parallelises correctly - if you have 15 disks and you aren't hitting CPU/bus/controller limits, all the cache dirs will rebuild at full speed in parallel.

Rebuilding from the log files takes seconds rather than minutes.

Finally, I've sketched out how to solve the COSS startup/rebuild times and the even better news is that fixing the AUFS rebuild code will give me about 90% of what I need to fix COSS.

The bad news is that integrating this into the Lusca codebase and fixing up the rebuild process to take advantage of this parallelism is going to take 4 to 6 weeks of solid work. I'm looking for help from the community (and other interested parties) who would like to see this work go in. I have plenty of testers but nothing to help -coding- along and I unfortunately have to focus on projects that provide me with some revenue.

Please contact me if you're able to help with either coding or funding for this.

Friday, June 26, 2009

Lusca in Production, #2

Here's some 24 hour statistics from a Lusca-HEAD forward proxy install:

Request rate:
File descriptor count (so clients, servers and disk FDs) :

Byte and Request hit ratio:

Traffic to clients (blue) and from the internet (red):

CPU Utilisation (1.0 is "100% of one core):

Lusca-head pushing >100mbit in production..

Here's the Lusca HEAD install under FreeBSD-7.2 + TPROXY patch. This is a basic configuration with minimal customisation. There's ~ 600gig of data in the cache and there's around 5TB of total disk storage.

You can see when I turned on half of the users, then all of the users. I think there's now around 10,000 active users sitting behind this single Lusca server.

Tuesday, June 23, 2009

Lusca-head in production with tproxy!

I've deployed Lusca head (the very latest head revision) in production for a client who is using a patched FreeBSD-7.2-STABLE to implement full transparency.

I'm using the patches and ipfw config available at .

The latest Lusca fixes some of the method_t related crashes due to some work done last year in Squid-2.HEAD. It seems quite stable now. The bugs only get tickled with invalid requests - so they show up in production but not with local testing. Hm, I need to "extend" my local testing to include generating a wide variety of errors.

Getting back on track, I've also helped another Lusca user deploy full transparency using the TPROXY4 support in the latest Linux kernel (I believe under Debian-unstable?) He helped me iron out some of the bugs which I've just not seen in my local testing. The important ones (method_t in particular) have been fixed; he's been filing Lusca issues in the google code tracker so I don't forget them. Ah, if all users were as helpful. :)

Anyway. Its nice to see Lusca in production. My customer should be turning it on for their entire satellite link (somewhere between 50 and 100mbit I think) in the next couple of days. I believe the other user has enabled it for 5000 odd users. I'll be asking them both for some statistics to publish once the cache has filled and has been tuned.

Stay tuned for example configurations and tutorials covering how this all works. :)

Tuesday, May 19, 2009

More profiling results

So whilst I wait for some of the base restructuring in LUSCA_HEAD to "bake" (ie, settle down, be stable, etc) I've been doing some more profiling.

Top CPU users on my P3 testing box:

root@jennifer:/home/adrian/work/lusca/branches/LUSCA_HEAD/src# oplist ./squid
CPU: PIII, speed 634.464 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples  %        image name               symbol name
1832194   8.7351            _int_malloc
969838    4.6238            memcpy
778086    3.7096            malloc_consolidate
647097    3.0851            vfprintf
479264    2.2849            _int_free
468865    2.2354            free
382189    1.8221            calloc
326540    1.5568  squid                    memPoolAlloc
277866    1.3247            re_search_internal
256727    1.2240            strncasecmp
249860    1.1912  squid                    httpHeaderIdByName
248918    1.1867  squid                    comm_select
238686    1.1380            strtok
215302    1.0265  squid                    statHistBin

Sigh. snprintf() leads to this:

root@jennifer:/home/adrian/work/lusca/branches/LUSCA_HEAD/src# opsymbol ./squid snprintf
CPU: PIII, speed 634.464 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 90000
samples  %        image name               symbol name
  15840     1.7869  squid                    safe_inet_addr
  29606     3.3398  squid                    pconnPush
  54021     6.0940  squid                    httpBuildRequestHeader
  64916     7.3230            inet_ntoa
  105692   11.9229  squid                    clientSendHeaders
  127929   14.4314  squid                    urlCanonicalClean
  196481   22.1646  squid                    pconnKey
  265474   29.9476  squid                    urlCanonical
36879    100.000            snprintf
  813284   91.7449            vsnprintf
  36879     4.1602            snprintf [self]
  12516     1.4119            vfprintf
  9209      1.0388            _IO_no_init

Double sigh. Hi, the 90's called, they'd like their printf()-in-performance-critical code back.

Wednesday, May 13, 2009

LUSCA_HEAD (as of today) is in production

I'm deploying the current LUSCA_HEAD build over the Cacheboy CDN network. So far so good.

Now, need more traffic.. :)

Tuesday, April 28, 2009

More brain-damage code..

I queued up another test run of LUSCA_HEAD on the 600mhz celeron box and discovered that performance had dropped slightly.

It turns out that there's an increase in malloc/free calls which I traced back to the ACL code and further back to the HTTP logging code.

The HTTP request logging code is using ACLs to determine what to log. I have the access log disabled AND I'm not even -using- the log_access ACL directive to restrict what is being logged but it still is taking a trip through the ACL checklist code to create an empty check list, check it, and return NULL. This is aside from all the rest of the logging code which will create all the duplicate strings for the logging layer, then not log anything, then free them.

Sigh. The more I look at the code the more I want to stab out my own eyes. Oh well, I'll add "fix the damned logging code" to my TODO list for later.

Monday, April 13, 2009

Changing tack - more modularisation?

I've decided that tackling the storage manager codebase in its current shape is a bit too much right now. I may end up going down the path of using refcounted buffers as part of the data store to temporarily work around the shortcomings in the code. I don't like the idea of doing that long-term because, to be quite honest, that area of the codebase needs a whole lot of reorganisation and sanity to make it consistent and understandable. I think it should be done before any larger-scale changes are done and this includes the obvious performance boosts available by avoiding the copying.

In any case, I'm going to move onto my next short-term goal - a very, very basic module framework. I'd like to at least shoehorn in some very simple dynamic module loading and unloading into the core, not unlike what TMF did for Squid-3 as part of their initial eCAP work. My initial plans are to do the bare minimum necessary to start breaking out a very small chunk of code into modules - namely the request rewriters (url, storeurl and location) so they don't have to be compiled in. It will also force a little bit of tidying up around the HTTP and client-side code.

The initial aim is purely code reorganisation. Instead of having a nested list of callbacks forming the request processing chain, I plan on having a simple iterative process finite state machine which will route the request through different modules as required before passing it along to the rest of the client-side code. I hope that I can (slowly!) unwind a large part of the request path hairiness and enumerate it with said state engine.

In any case, I won't be going anywhere near as far with this as I'd like to in the first pass. There are plenty of problems with this (the biggest being parsing compound configuration types like ACLs - for example, if I wanted to modularise the "asn" ACL type, the module will need to be loaded far before the rest of the configuration file is parsed; then it needs to hook itself into the ACL code and register (a la what happened with Squid-3) itself in there; then subsequent ACL line parsing needs to direct things at the ACL module; then the ACL module needs to make sure its not unloaded until everything referencing it is gone!) but I'm going to pleasantly ignore them all the first time around.

By keeping the scope low and the changes minimal, I hope that the amount of re-recoding needed later on down the track (once I've established exactly what is needed for all of this) should be limited.

Oh, and as an aside but related project, I'm slowly fixing the SNMP core startup code to -not- use 15-20 nested deep function calls as part of its MIB tree construction. It is a cute functional programming type construct but it is absolutely horrible to try and add something. The related bit is allowing for SNMP mib leaves to be added -and- removed at runtime - so modules can register themselves with the cachemgr and SNMP core to provide their stats. 

Thursday, April 9, 2009

Fixing the disk code, part 4

My changes are now occasionally randomly crashing in malloc() - which means something is modifying memory it shouldn't and confusing things. It's a shame that I can't run this under valgrind - the required traffic load to generate this issue at the moment makes valgrind impossible to use.

My guess is that something is ripping the read buffer out from underneath the pread() in the worker thread. Figuring out exactly where this is happening is going to be tricky.

The prime candidate at the moment is where the read is cancelled indirectly via a call to aioClose(). aioClose() calls aioCancel() which attempts to cancel all of the pending events but I'm absolutely sure that it is possible for a read event to be in progress. The fact this is occurring was hidden by the read() buffer being a temporary local buffer - the read() would complete (error or not) and then the result would be tossed away.

The solution? I'm not sure. I have a feeling I'm going to be introducing something like AsyncCalls in Squid-3 to delay the IO callbacks from occuring - but then, for those keeping score, the AsyncCalls stuff in squid-3 (in its current inclusion) has made some code unstable and other bits of code difficult to debug. I wonder what repercussions there would be to splitting out the generation of IO callbacks and their execution; it would certainly help with diskd..

Wednesday, April 8, 2009

Fixing the disk code, part 3

The current store client code experiment involves this:

  • store clients are created and destroyed as per normal;
  • creation sets "", destruction clears "";
  • destruction doesn't deallocate the store client - it removes it from various lists and only -attempts- to free it via a new function, storeClientComplete()
  • storeClientComplete() (badly named) will only free the store client if it is marked as inactive and the callbacks it is a part of have returned.
  • In the callback functions, storeClientComplete() is called to see whether the store client should be freed, and if so, they terminate the callback
  • In the callback functions, if "" is clear, then the callback is terminated

Since cbdataFree() is not called until the flags are clear (active, disk_io_pending, event_pending) then it is possible that some callbacks will be called when they shouldn't be. In order to preserve the existing behaviour I need to further audit the store client code, find out which functions are passed in as callbacks, and make sure those functions correctly check the active state of the store client and also try calling storeClientComplete(). The problem here is that what controls the callback being called is the validity of the callback -data- - so I am hoping that there are no non-store client callbacks being scheduled with a store client instance as the callback data. Gah.

The other issue at the moment is that there is currently no guarantee (read: I haven't ensured it happens!) from the store API that the FileClosed() and FileNotify() routines get called before the store client is deregistered. I therefore can't delay free'ing the store client until those are also called. It should not be a big deal because the main motivation to delaying the store client -free- is to allow pending async operations using other data (in this case, read buffers) to complete before free'ing the data. In the two above instances the behaviour should be the same as before - the store client pointer will simply go invalid and the callbacks won't be made.

Sunday, April 5, 2009

Fixing the disk code, part 2

I've been reviewing and documenting the store client code to get a feel for how the code works and is used. Sure enough, it relies heavily on the callback data invalidation magic to avoid having to wait around until the store disk file operations complete (or an event scheduled via eventAdd() to complete at a later date.) I added a little bit of logging to print out a warning if a pending disk operation was there and sure enough, it happens often enough in production to probably cause all those random crashes that I remember in Squid-1.2.

Anyway. I have an idea for eliminating the read copy in the async IO code. Something I've done in other C programs is this:
  • Track the pending events which the store client code schedules callbacks into itself (ie, the eventAdd() for later processing, and pending disk IO)
  • Split storeClientUnregister() into two parts - one which sets a "shutting down" flag, and another which does the hard work
  • assert() that API methods aren't called (ie, storeClientRef(), etc) if the shutting down flag is set - ie, they shouldn't be at the present time, so any instance of that happening means that some callback is occuring where it shouldn't be and getting further than it should be!
  • On event/disk callback completion, clear the relevant flags and see if the store client is completely ready to be destroyed. Only then destroy it.
This -should- preserve the current behaviour (callbacks will return immediately instead of not being called, so they effectively behave the same) but it won't prevent non-callback code from running if the callback data pointer is "valid". (Which, hopefully isn't actually happening, but god knows in this codebase..) It means that the callback data "is valid" check in the store client code (and hopefully, the disk IO code) really becomes a debugging check rather than being used to control whether the callback is made or not.

I'm just worried that there's going to be weirdness in how the store client code is being called. Grr.

Anyway. If this is successful and proves itself in production, I'll then change the disk APIs to also enforce that the callback data remains valid for the lifetime of the scheduled IO. Which IMHO it should be. As I said earlier, using tricks like reference counted buffers shouldn't be a crutch for a bad API design..

Saturday, April 4, 2009

Fixing the disk APIs (or, reading Squid-1.2 code)

One of the performance limiting things I've found with the mozilla workload is the memcpy() being done from the temporary read buffer into the reader buffer (ie, what was supplied to storeClientRead*()) and, curious to know the history of this, I decided to go digging into the earlier revisions of all of this code. I went back to Squid-1.2.beta25.

I found that file_read() and file_write() (ie, the current "sync only" disk routines) had the hooks into calling the async IO functions (ie, aioRead() and aioWrite()) if they were compiled in. Like now, aio_read() makes a temporary buffer to read into and then copies it to the original request buffer if the request wasn't cancelled between submission and completion time.

In the 1.2 code, the callback involved copying the data into the supplied buffer and calls the callback. This didn't check to make sure the callback data is still valid. Tsk. In more recent code, the AIO read callback actually supplies the buffer to the completion callback (which is only called if valid!) and its up to the callee to do the copy. tsk.

The problem is neither "API" forced the caller to ensure the buffer stays valid for the length of the call. I'm going to investigate doing this, but I'm not sure whether this is easier or more difficult than shoehorning in reference counted buffers at this point. I want to use refcounted buffers throughout the code but not as a crutch for a badly designed API.

Tuesday, March 31, 2009

lusca release - rev 13894

I've just put the latest Lusca-HEAD release up for download on the downloads page. This is the version which is currently running on the busiest Cacheboy CDN nodes (> 200mbit each) with plenty of resources to spare.

The major changes from Lusca-1.0 (and Squid-2 / Squid-3, before that):

  • The memory pools code has been gutted so it now acts as a statistics-keeping wrapper around malloc() rather than trying to cache memory allocations; this is in preparation for finding and fixing the worst memory users in the codebase!
  • The addition of reference counted buffers and some support framework has appeared!
  • The server-side code has been reorganised somewhat in preparation for copy-free data flow from the server to the store (src/http.c)
  • The asynchronous disk IO code has been extracted out from the AUFS codebase and turned into its own (mostly - one external variable left..) standalone library - it should be reusable by other parts of Lusca now
  • Some more performance work across the board
  • Code reorganisation and tidying up in preparation for further IPv6 integration (which was mostly completed in another branch, but I decided it moved along too quickly and caused some stability issues I wasn't willing to keep in Lusca for now..)
  • More code has been shuffled into separate libraries (especially libhttp/ - the HTTP code library) in preparation for some widescale performance changes.
  • Plenty more headerdoc-based code documentation!
  • Support for FreeBSD-current full transparent interception and Linux TPROXY-4 based full transparent interception
The next few weeks should be interesting. I'll post a TODO list once I'm back in Australia.

Documentation pass - disk IO

I've begun documenting the disk IO routines in Lusca-HEAD. I've started with the legacy disk routines in libiapp/disk.c - these are used by some of the early code (errorpages, icon loading, etc) and the UFS based swap log (ufs, aufs, diskd).

This code is relatively straight forward but I'm also documenting the API shortcomings so others (including myself!) will realise why it isn't quite good enough in its current form for processing fully asynchronous disk IO.

I'll make a pass through the async IO code on the flight back to Australia (so libasyncio/aiops.c and libasyncio/async_io.c) to document how the existing async IO code works and its shortcomings.

Sunday, March 29, 2009

Async IO changes to Lusca

I've just cut out all of the mempool'ed buffers from Lusca and converted said buffers to just use xmalloc()/xfree(). Since memory pools in Lusca now don't "cache" memory - ie, they're just for statistics keeping - the only extra thing said memory pooled buffers were doing was providing NUL'ed memory for disk buffers.

So now, for the "high hit rate, large object" workload which the mirror nodes are currently doing, the top CPU user is memcpy() - via aioCheckCallbacks(). At least it wasn't -also- memset() as well.

That memcpy() is taking ~ 17% of the total userland CPU used by Lusca in this particular workload.

I have this nagging feeling that said memcpy() is the one done in storeAufsReadDone(), where the AUFS code copies the result from the async read into the supplied buffer. It does this because its entirely possible the caller has disappeared between the time the storage manager read was scheduled and the time the filesystem read() was scheduled.

Because the Squid codebase doesn't explicitly cancel or wait for completion of async events - and instead relies on this "locking" and "invalidation" semantics provided by the callback data scheme - trying to pass buffers (and structs in general) into threads is pretty much plainly impossible to do correctly.

In any case, the performance should now be noticably better.

(obnote: I tried explaining this to the rest of the core Squid developers last year and somehow I don't quite think I convinced them that the current approach, with or without the AsyncCallback scheme in Squid-3, is going to work without significant re-engineering of the source tree. Alas..)

Saturday, March 28, 2009

New Lusca-head features to date

I've been adding in a few new features to Lusca.

Firstly is the config option "n_aiops_threads" which allows configuration / runtime tuning of the number of IO threads. I got fed up recompiling Lusca every time I wanted to fiddle with the number of threads, so I made it configurable.

Next is a "client_socksize" - which overrides the compiled and system default TCP socket buffer sizes for client-side sockets. This allows the admin to run Lusca with restricted client side socket buffers whilst leaving the server side socket buffers (and the default system buffer sizes) large. I'm using this on my Cacheboy CDN nodes to help scale load by having large socket buffers to grab files from the upstream servers, but smaller buffers to not waste memory on a few fast clients.

The async IO code is now in a separate library rather than in the AUFS disk storage module. This change is part of a general strategy to overhaul the disk handling code and introduce performance improvements to storage and logfile writing. I also hope to include asynchronous logfile rotation. The change breaks the "auto tuning" done via various hacks in the AUFS and COSS storage modules. Just set "n_aiops_threads" to a sensible amount (say, 8 * the number of storage directories you have, up to about 128 or so threads in total) and rely on that instead of the auto-tuning. I found the auto-tuning didn't quite work as intended anyway..

Finally, I've started exporting some of the statistics from the "info" cachemgr page in a more easier computer-parsable format. The cachemgr page "mgr:curcounters" page includes current client/server counts, current hit rates and disk/memory storage size. I'll be using these counters as part of my Cacheboy CDN statistics code.