Algorithm to split an article without breaking the reading flow or HTML code

Posted by Victor Stanciu on Stack Overflow
Published on 2010-06-16T14:30:53Z


Hello,

I have a very large database of articles, of varying lengths. The articles have HTML elements in them. I have to insert some ads (simple <script> elements) in the body of each article when it is displayed (I know, I hate ads that interrupt my reading too).

Now, the problem is that each ad must be inserted at about the same position in each article. The simplest solution is to split the article at a fixed number of characters (without breaking words) and insert the ad code there. This, however, runs the risk of inserting the ad in the middle of an HTML tag.
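To make that risk concrete, here is a minimal sketch of the naive split. The helper names `naiveSplitIndex` and `isInsideTag` and the sample markup are made up for illustration, and the tag check is deliberately simplistic (it ignores `>` characters inside attribute values or comments).

```javascript
// Hypothetical helper: find a split index at or before `limit` that does not
// cut a word in half. It treats the article as plain text and knows nothing
// about HTML.
function naiveSplitIndex(html, limit) {
  if (limit >= html.length) return html.length;
  // Walk back to the nearest whitespace so no word is broken.
  let i = limit;
  while (i > 0 && !/\s/.test(html[i])) i--;
  // If no whitespace was found, fall back to the raw limit.
  return i > 0 ? i : limit;
}

// Hypothetical helper: reports whether `index` falls between a "<" and its
// closing ">". Simplified: assumes ">" never appears inside attribute values.
function isInsideTag(html, index) {
  if (index <= 0) return false;
  const lastOpen = html.lastIndexOf("<", index - 1);
  const lastClose = html.lastIndexOf(">", index - 1);
  return lastOpen > lastClose;
}

const sample =
  '<p>Read <a href="/more" title="More articles">this</a> first.</p>';
const cut = naiveSplitIndex(sample, 30);
console.log(cut, isInsideTag(sample, cut)); // 23 true
```

The split lands on the space between the `href` and `title` attributes, so the ad would end up inside the `<a>` tag even though no word was broken.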

I could go the regex way, but I was thinking about the following solution, using JS:

  1. Establish a character count threshold. For example, "the ad should be inserted at about 200 characters".
  2. Set accepted deviations in each direction, say -20, +20 characters.
  3. Loop through each text node inside the article, and while doing so, keep count of the total number of characters so far
  4. Once the count exceeds the threshold, make the following decision:

    4.1. If the count exceeds the threshold by less than the positive accepted deviation (for example, by 17 characters), insert the ad code just after the current text node.

    4.2. If the count is greater than the sum of the threshold and the deviation, look back at the previous text node and its count. If that previous count is not lower than the threshold minus the deviation, insert the ad between the previous node and the current one.

    4.3. If both 4.1 and 4.2 fail (which means the previous node ends at too low a character count and the current node at too high a one), insert the ad inside the current node, at whatever offset brings the count to the threshold.
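Steps 4.1 to 4.3 can be sketched as a pure function that looks only at the lengths of the text nodes, which keeps the decision separate from any DOM manipulation. The function name `choosePlacement` and the shape of the returned object are assumptions made for this sketch, not part of the original plan.

```javascript
// Decide where the ad goes, given only the character lengths of the
// article's text nodes in document order.
// The return shape ({ kind, nodeIndex, offset }) is a hypothetical convention.
function choosePlacement(lengths, threshold, deviation) {
  let current = 0;
  for (let i = 0; i < lengths.length; i++) {
    const previous = current;
    current += lengths[i];
    if (current < threshold) continue;

    // 4.1: the count overshoots the threshold by no more than the deviation.
    if (current <= threshold + deviation) {
      return { kind: "after", nodeIndex: i };
    }
    // 4.2: the boundary before this node is close enough to the threshold.
    // Assumption: never place the ad before the very first node.
    if (i > 0 && previous >= threshold - deviation) {
      return { kind: "before", nodeIndex: i };
    }
    // 4.3: neither boundary is acceptable, so split inside this node.
    return { kind: "inside", nodeIndex: i, offset: threshold - previous };
  }
  return null; // the article is shorter than the threshold
}

console.log(choosePlacement([90, 100, 25], 200, 20)); // kind "after", node 2
console.log(choosePlacement([190, 100], 200, 20));    // kind "before", node 1
console.log(choosePlacement([100, 300], 200, 20));    // kind "inside", node 1, offset 100
```

In a browser, the lengths could be gathered by walking the article with `document.createTreeWalker(root, NodeFilter.SHOW_TEXT)` and reading each text node's `length`.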

I know it's convoluted, but it's the first thing that came to mind, and it has one advantage: by trying to insert the ad between text nodes, it may break the flow of the article less than simply sticking the ad in at a fixed offset (as in the final 4.3 case).

Here is some pseudo-code I put together, since I don't trust my English-explaining skills:

threshold = 200
deviation = 20
current_count = 0

for each node in article_nodes {
    previous_count = current_count
    current_count = current_count + node.length
    if current_count < threshold {
        continue // next iteration
    }

    if current_count > threshold + deviation {
        if previous_count < threshold - deviation {
            // insert ad in current node
        } else {
            // insert ad between the current and previous nodes
        }
    } else {
        // insert ad after the current node
    }

    break;
}
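The "insert ad in current node" branch still needs an offset inside that node, and cutting at exactly `threshold - previous_count` may split a word. One possible refinement, sketched here under the assumption that the same deviation can be reused as search slack, is to look outward from the target offset for the nearest whitespace. The name `wordBoundaryOffset` is hypothetical.

```javascript
// Find an offset inside `text` near `target` that falls on whitespace,
// searching outward up to `slack` characters in both directions.
// Falls back to the (clamped) target if no whitespace is within reach.
function wordBoundaryOffset(text, target, slack) {
  const clamped = Math.max(0, Math.min(target, text.length));
  for (let d = 0; d <= slack; d++) {
    for (const pos of [clamped - d, clamped + d]) {
      // Skip the very start and end so the node is actually split in two.
      if (pos > 0 && pos < text.length && /\s/.test(text[pos])) return pos;
    }
  }
  return clamped;
}

const nodeText = "the quick brown fox jumps";
console.log(wordBoundaryOffset(nodeText, 12, 5)); // 9, the space before "brown"
console.log(wordBoundaryOffset(nodeText, 12, 2)); // 12, no whitespace within 2 characters
```

In a browser, the resulting offset could be passed to the text node's `splitText(offset)` method, and the ad element inserted before the returned second half with `parentNode.insertBefore`.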

Am I over-complicating stuff, or am I missing a simpler, more elegant solution?

© Stack Overflow or respective owner
