Optimal Data Structure for our own API
- by vermiculus
I'm in the early stages of writing an Emacs major mode for the Stack Exchange network; if you use Emacs regularly, this will benefit you in the end.
In order to minimize the number of calls made to Stack Exchange's API (capped at 10000 per IP per day) and to just be a generally responsible citizen, I want to cache the information I receive from the network and store it in memory, waiting to be accessed again. I'm really stuck as to what data structure to store this information in.
Obviously, it is going to be a list. However, as with any data structure, the choice must be determined by what data is being stored and what how it will be accessed. What, I would like to be able to store all of this information in a single symbol such as stack-api/cache. So, without further ado, stack-api/cache is a list of conses keyed by last update:
`(<csite> <csite> <csite>)
where <csite> would be
(1362501715 . <site>)
At this point, all we've done is define a simple association list. Of course, we must go deeper.
Each <site> is a list of the API parameter (unique) followed by a list questions:
`("codereview" <cquestion> <cquestion> <cquestion>)
Each <cquestion> is, you guessed it, a cons of questions with their last update time:
`(1362501715 <question>) (1362501720 . <question>)
<question> is a cons of a question structure and a list of answers (again, consed with their last update time):
`(<question-structure> <canswer> <canswer> <canswer>
and `
`(1362501715 . <answer-structure>)
This data structure is likely most accurately described as a tree, but I don't know if there's a better way to do this considering the language, Emacs Lisp (which isn't all that different from the Lisp you know and love at all). The explicit conses are likely unnecessary, but it helps my brain wrap around it better. I'm pretty sure a <csite>, for example, would just turn into
(<epoch-time> <api-param> <cquestion> <cquestion> ...)
Concerns:
Does storing data in a potentially huge structure like this have any performance trade-offs for the system? I would like to avoid storing extraneous data, but I've done what I could and I don't think the dataset is that large in the first place (for normal use) since it's all just human-readable text in reasonable proportion. (I'm planning on culling old data using the times at the head of the list; each inherits its last-update time from its children and so-on down the tree. To what extent this cull should take place: I'm not sure.)
Does storing data like this have any performance trade-offs for that which must use it? That is, will set and retrieve operations suffer from the size of the list?
Do you have any other suggestions as to what a better structure might look like?