SSL Documentation Analysis
This question pertains to the usage of the HMAC routines in OpenSSL.
The OpenSSL documentation is a tad weak in certain areas, so I profiled my library. Profiling revealed that when I use the one-shot function from the man page:
 unsigned char *HMAC(const EVP_MD *evp_md, const void *key,
               int key_len, const unsigned char *d, int n,
               unsigned char *md, unsigned int *md_len);
40% of my library's runtime is devoted to creating and tearing down HMAC_CTXs behind the scenes.
There are also two additional functions to create and destroy an HMAC_CTX explicitly:
  HMAC_CTX_init() initialises a HMAC_CTX
  before first use. It must be called.
  
  HMAC_CTX_cleanup() erases the key and
  other data from the HMAC_CTX and
  releases any associated resources. It
  must be called when an HMAC_CTX is no
  longer required.
The man page introduces these two functions with:
  The following functions may be used if
  the message is not completely stored
  in memory
My data fits entirely in memory, so I chose the HMAC function -- the one whose signature is shown above.
According to the man page, the context is used through the following two functions:
  HMAC_Update() can be called repeatedly
  with chunks of the message to be
  authenticated (len bytes at data).
  
  HMAC_Final() places the message
  authentication code in md, which must
  have space for the hash function
  output.
The Scope of the Application
My application generates an authenticated (HMAC, which is also used as a nonce), CBC-BF encrypted protocol buffer string. The code will interface with various web servers and frameworks: Windows / Linux as the OS; nginx, Apache and IIS as web servers; and Python / .NET and C++ web-server filters.
The description above should clarify that the library needs to be thread safe, and potentially have resumable processing state -- i.e., lightweight threads sharing an OS thread (which might leave thread-local memory out of the picture).
The Question
How do I get rid of the 40% overhead on each invocation in a (1) thread-safe / (2) resumable way? (2) is optional, since I have all of the source data present in one go and can make sure a digest is created in place without relinquishing control of the thread mid-digest-creation. So,
(1) can probably be done using thread-local memory -- but how do I reuse the CTXs? Does the HMAC_Final() call make the CTX reusable?
(2) optional: in this case I would have to create a pool of CTXs.
(3) How does the HMAC function do this? Does it create a CTX in the scope of the function call and destroy it?
Pseudocode and commentary would be useful.