Subject: Re: [Firebird-Architect] Cost of LRU
Author: Jim Starkey
On 11/5/2014 4:28 AM, hvlad@... [Firebird-Architect] wrote:
 

  Jim, Ann

Thanks for sharing your ideas.

  I am currently working on page cache multithreading performance, and any ideas and experience are very welcome.
Jim already wrote about the "cycle manager" on the fb-devel list (back in March); I've read that message and
thought about it. Currently I see the hash table as the main bottleneck (in terms of CPU load) in the page cache
code and am looking at how to improve it. The LRU queue is my second target.

  Let me speak about the hash table first. Jim offered to use the "cycle manager" as a simple and effective non-blocking
garbage collector for the list nodes used to maintain collision chains. Correct?

  My main concern about using the "cycle manager" with hash table chains is that we could have a lot of "to be freed"
nodes during every cycle. When pages are frequently preempted (i.e. the page number of a BDB changes, along with
its placement in the hash table), we will produce a lot of such nodes. During one second (the cycle period) there
could be tens of thousands of such actions.

Not more than the number of I/O ops between cycles, so I don't think that is a problem with BDBs.  But that in itself doesn't preclude the existence of a better solution.

  Also it is not very clear to me when a thread should acquire and release the cycle lock. Should it be done on
engine enter/exit (API call) or more often? I don't think enter/exit is a good choice, as the time between those
two events could be much longer than one second. Another way is to additionally check the current cycle number at
rescheduling points (and before starting to wait on I/O or a sync object) and, if the actual cycle number differs
from the one at the last acquire, release the old lock and acquire the new one. Am I correct here?

You are correct.  The enter/exit, if I remember correctly, originally seized the single lock controlling engine data structures, so that would be the logical place (unless it has taken on different semantics over the years).
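
For concreteness, a minimal sketch of that worker-side check might look like the following. The CycleManager layout here is hypothetical (an atomic cycle number plus two read/write locks indexed by cycle parity), not Firebird's actual structures:

#include <atomic>
#include <shared_mutex>

// Hypothetical cycle manager interface: an atomic cycle number and two
// read/write locks, selected by cycle parity.
struct CycleManager {
    std::atomic<unsigned> cycleNumber{0};
    std::shared_mutex cycleLocks[2];

    std::shared_mutex& lockFor(unsigned cycle) { return cycleLocks[cycle & 1]; }
};

// Per-worker state: which cycle's lock this thread currently holds shared.
struct WorkerCycleLock {
    CycleManager& mgr;
    unsigned held;

    explicit WorkerCycleLock(CycleManager& m) : mgr(m) {
        held = mgr.cycleNumber.load(std::memory_order_acquire);
        mgr.lockFor(held).lock_shared();
    }

    // Call at rescheduling points and before blocking on I/O or a sync
    // object: if the cycle has advanced, trade the old shared lock for
    // the current one so the cycle manager isn't held up.
    void reschedulePoint() {
        unsigned current = mgr.cycleNumber.load(std::memory_order_acquire);
        if (current != held) {
            mgr.lockFor(held).unlock_shared();
            held = current;
            mgr.lockFor(held).lock_shared();
        }
    }

    ~WorkerCycleLock() { mgr.lockFor(held).unlock_shared(); }
};

A real implementation would also have to think about the window between reading the cycle number and taking the shared lock; with only two locks a straggler can at worst contend with the manager for one cycle.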



  Currently I'm trying to implement the non-blocking hash table described in "High Performance Dynamic Lock-Free
Hash Tables and List-Based Sets" by Maged M. Michael:
http://www.liblfds.org/mediawiki/images/c/cc/Michael_-_High_Performance_Dynamic_Lock-Free_Hash_Tables_and_List-Based_Sets.pdf

There are some really interesting ideas in that paper.  One is maintaining collision lists in key order, which makes duplicate detection trivial and reduces the cost of a hash lookup miss by 50%.  But the really neat idea is deleting a node by first marking it deleted with a CAS, then going back to actually remove it from the collision chain with another CAS.  In the meantime, a searching thread that runs into a node marked for deletion also tries to remove it from the collision chain before continuing.  His algorithm is actually more complex, suggesting I need to think about it some more, but the scheme, I believe, is significantly better than using a cycle manager.
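
The core of that two-CAS delete, compressed into a sketch (the node layout and helper names are mine, not the paper's, and memory reclamation is deliberately left out):

#include <atomic>
#include <cstdint>

struct Node {
    unsigned key;
    std::atomic<uintptr_t> next;   // low bit doubles as the "deleted" mark
};

static bool      isMarked(uintptr_t p) { return p & 1; }
static Node*     ptrOf(uintptr_t p)    { return reinterpret_cast<Node*>(p & ~uintptr_t(1)); }
static uintptr_t marked(Node* n)       { return reinterpret_cast<uintptr_t>(n) | 1; }
static uintptr_t unmarked(Node* n)     { return reinterpret_cast<uintptr_t>(n); }

// Delete 'victim', whose current predecessor in the collision chain is 'prev'.
bool removeNode(Node* prev, Node* victim)
{
    // First CAS: mark the victim's next pointer so no thread can link a
    // new node behind it.
    uintptr_t succ = victim->next.load();
    if (isMarked(succ))
        return false;                   // someone else already deleted it
    if (!victim->next.compare_exchange_strong(succ, marked(ptrOf(succ))))
        return false;                   // chain changed under us; caller retries

    // Second CAS: physically unlink. If this fails, a searching thread
    // that stumbled over the marked node performs the same CAS (restarting
    // its search if that fails in turn), so the node comes out of the
    // chain either way.
    uintptr_t expected = unmarked(victim);
    prev->next.compare_exchange_strong(expected, unmarked(ptrOf(succ)));
    return true;
}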

That said, I don't think it applies to maintaining an efficient LRU for page replacement.

I really like new ideas.  Thanks for the pointer. 

It uses the safe memory reclamation (SMR) technique by the same author:
http://www.liblfds.org/mediawiki/images/9/99/Michael_-_Hazard_Pointers%3B_Safe_Memory_Reclaimation_for_Lock-Free_Objects.pdf

What do you think about it?

Thanks,
Vlad

---In Firebird-Architect@yahoogroups.com, <jim@...> wrote :

I used what I call a "cycle manager" that allows worker threads to take a lock on a "cycle" during which certain classes of things are guaranteed to be stable.  It works something like this (a sketch of the manager's loop follows the list):
  • The cycle manager controls two read/write locks, which alternately control the current cycle
  • Worker threads take a shared lock on the current cycle lock when they start work on behalf of a user request and release it when they finish
  • The cycle manager thread sleeps for a period of time (one second is good).  When it wakes up, it bumps the cycle number, flips the active and inactive cycle locks, and waits for an exclusive lock on the now-inactive cycle lock.
  • When the cycle manager gets the exclusive lock, it knows that all worker threads that started in the prior cycle have finished, so it can execute end-of-cycle code.
  • When the end-of-cycle tasks are complete, the cycle manager goes back to sleep and the process repeats
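
Something like this, as a rough sketch rather than Falcon's actual code. With the cycle number's parity selecting the active lock, bumping the number and flipping the locks are the same step:

#include <atomic>
#include <chrono>
#include <shared_mutex>
#include <thread>

class CycleManager {
public:
    // Workers take cycleLocks[cycleNumber & 1] shared, as sketched in the
    // reply earlier in this thread; this is the manager thread's side.
    void run()
    {
        for (;;) {
            // Sleep for the cycle period (one second is good).
            std::this_thread::sleep_for(std::chrono::seconds(1));

            // Bump the cycle number; new work now lands on the other lock.
            unsigned prior = cycleNumber.fetch_add(1, std::memory_order_acq_rel);

            // Wait for an exclusive lock on the now-inactive cycle lock:
            // when it's granted, every worker that started in the prior
            // cycle has released its shared lock.
            std::shared_mutex& priorLock = cycleLocks[prior & 1];
            priorLock.lock();

            runEndOfCycleTasks();

            priorLock.unlock();
        }
    }

    std::atomic<unsigned> cycleNumber{0};
    std::shared_mutex cycleLocks[2];

private:
    void runEndOfCycleTasks() { /* LRU maintenance, purgatory sweep -- see below */ }
};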

The BDB equivalents in Falcon were managed this way:  When a worker thread touched a BDB, it wrote the number of the current cycle into the BDB.  One of the end-of-cycle tasks was to scream through the BDB list moving BDBs referenced in the prior or current cycle to the front of the LRU.  Since BDBs could be added to the end of the BDB list with an interlocked compare-and-swap (CAS) instruction, the LRU chain was completely non-blocking.
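
Sketched out with made-up names rather than Falcon's actual declarations, the moving parts look like this:

#include <atomic>

// Hypothetical stand-in for a BDB, with just the fields the scheme needs.
struct Bdb {
    std::atomic<unsigned> touchCycle{0};
    std::atomic<Bdb*> listNext{nullptr};  // append-only list of all BDBs
    Bdb* lruPrev = nullptr;               // LRU links, touched only by the
    Bdb* lruNext = nullptr;               //   cycle manager thread
};

// Hot path: "touching" a buffer is a single relaxed store -- no lock,
// no list surgery.
inline void touchBdb(Bdb* bdb, unsigned currentCycle)
{
    bdb->touchCycle.store(currentCycle, std::memory_order_relaxed);
}

// Appending a new BDB needs only an interlocked CAS on the last link:
// chase the tail until our pointer sticks. Assumes the list always has
// at least a head node.
void appendBdb(Bdb* listHead, Bdb* newBdb)
{
    Bdb* tail = listHead;
    for (;;) {
        Bdb* next = tail->listNext.load(std::memory_order_acquire);
        if (next) { tail = next; continue; }
        Bdb* expected = nullptr;
        if (tail->listNext.compare_exchange_weak(expected, newBdb,
                                                 std::memory_order_release))
            return;
    }
}

// End-of-cycle work, run only by the cycle manager thread, so the LRU
// links need no interlocks at all.
struct LruChain {
    Bdb* head = nullptr;
    Bdb* tail = nullptr;

    void moveToFront(Bdb* b) {
        if (b == head) return;
        if (b->lruPrev) b->lruPrev->lruNext = b->lruNext;  // unlink, if linked
        if (b->lruNext) b->lruNext->lruPrev = b->lruPrev;
        if (b == tail)  tail = b->lruPrev;
        b->lruPrev = nullptr;                              // push to front
        b->lruNext = head;
        if (head) head->lruPrev = b;
        head = b;
        if (!tail) tail = b;
    }
};

// Anything touched in the prior or current cycle moves to the LRU front.
void refreshLru(Bdb* listHead, unsigned currentCycle, LruChain& lru)
{
    for (Bdb* b = listHead; b; b = b->listNext.load(std::memory_order_acquire)) {
        unsigned c = b->touchCycle.load(std::memory_order_relaxed);
        if (c == currentCycle || c + 1 == currentCycle)
            lru.moveToFront(b);
    }
}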

Another thing the cycle manager can be used for is managing a non-interlocked hash table.  A node can always be safely added to a hash table with an interlocked CAS.  Removing a node is more troublesome as some thread may be walking the collision chain for the node to be removed.  With a cycle manager, a node is removed from the collision chain with an interlocked CAS keeping its collision pointer intact.  But rather than deleting the node, it gets put into a "node purgatory".  Another one of the end-of-cycle tasks is to grab the purgatory list head (another interlocked CAS, setting the value to NULL) before waiting on the prior cycle lock, then deleting the nodes formerly in purgatory.
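
The purgatory mechanics, again as a hypothetical sketch rather than the real code:

#include <atomic>

struct HashNode {
    unsigned key;
    HashNode* collision;           // left intact on removal so in-flight
                                   //   readers can keep walking the chain
    std::atomic<HashNode*> purgatoryNext{nullptr};
};

std::atomic<HashNode*> purgatoryHead{nullptr};

// Called after a node has been CASed out of its collision chain: push it
// onto the purgatory list with another interlocked CAS.
void sendToPurgatory(HashNode* node)
{
    HashNode* head = purgatoryHead.load(std::memory_order_relaxed);
    do {
        node->purgatoryNext.store(head, std::memory_order_relaxed);
    } while (!purgatoryHead.compare_exchange_weak(head, node,
                                                  std::memory_order_release));
}

// End-of-cycle task: grab the whole purgatory list in one interlocked
// swap, wait out the prior cycle (after which no reader from that cycle
// can still hold one of these nodes), then free them.
void emptyPurgatory()
{
    HashNode* list = purgatoryHead.exchange(nullptr, std::memory_order_acquire);

    // ... wait for the exclusive lock on the prior cycle lock here ...

    while (list) {
        HashNode* next = list->purgatoryNext.load(std::memory_order_relaxed);
        delete list;
        list = next;
    }
}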

The combination of cycle locks and interlocked CAS managed data structures can make all sorts of operations completely non-blocking.  Neat stuff.



On 11/2/2014 1:46 PM, Ann Harrison aharrison@... [Firebird-Architect] wrote:
 
Thinking about the conference and V3 and a customer's comment that InterBase isn't as fast as it might be on multiprocessors and doesn't seem to consume much CPU, I remembered one of the things that came up in the face-off between Falcon and InnoDB.  There's nothing like a hot performance race to uncover problems lurking in old code.

As Falcon increased the number of pages it cached, the cost of maintaining an LRU queue grew to about 10% of the CPU.  Jim backed off to a much less precise algorithm for determining which pages to drop from cache.  We found that the extra I/O cost was minimal.   I won't try to explain what he did (he probably will), but my recollection was that Falcon was spending a lot of time reordering the top 200 pages - which were never going to be dropped.

Cheers,

Ann