Wiki-fying a text using LPeg

Posted by Stigma on Stack Overflow See other posts from Stack Overflow or by Stigma
Published on 2010-04-24T22:41:43Z Indexed on 2010/04/24 22:53 UTC
Read the original article Hit count: 456

Filed under:

Long story coming up, but I'll try to keep it brief. I have many pure-text paragraphs which I extract from a system and re-output in wiki format so that the copying of said data is not such an arduous task. This all goes really well, except that there are no automatic references being generated for the 'topics' we have pages for, which end up needing to be added by reading through all the text and adding it in manually by changing Topic to [[Topic]].

First requirement: each topic is only to be made clickable once, which is the first occurrence. Otherwise, it would become a really spammy linkfest, which would detract from readability. To avoid issues with topics that start with the same words

Second requirement: overlapping topic names should be handled in such a way that the most 'precise' topic gets the link, and in later occurrences, the less precise topics do not get linked, since they're likely not correct.

Example:

topics = { "Project", "Mary", "Mr. Moore", "Project Omega"}
input = "Mary and Mr. Moore work together on Project Omega. Mr. Moore hates both Mary and Project Omega, but Mary simply loves the Project."
output = function_to_be_written(input)
-- "[[Mary]] and [[Mr. Moore]] work together on [[Project Omega]]. Mr. Moore hates both Mary and Project Omega, but Mary simply loves the [[Project]]."

Now, I quickly figured out a simple or complicated string.gsub() could not get me what I need to satisfy the second requirement, as it provides no way to say 'Consider this match as if it did not happen - I want you to backtrack further'. I need the engine to do something akin to:

input = "abc def ghi"
-- Looping over the input would, in this order, match the following strings:
-- 1) abc def ghi
-- 2) abc def
-- 3) abc
-- 4) def ghi
-- 5) def
-- 6) ghi

Once a string matches an actual topic and has not been replaced before by its wikified version, it is replaced. If this topic has been replaced by a wikified version before, don't replace, but simply continue the matching at the end of the topic. (So for a topic "abc def", it would test "ghi" next in both cases.)

Thus I arrive at LPeg. I have read up on it, played with it, but it is considerably complex, and while I think I need to use lpeg.Cmt and lpeg.Cs somehow, I am unable to mix the two properly to make what I want to do work. I am refraining from posting my practice attempts as they are of miserable quality and probably more likely to confuse anyone than assist in clarifying my problem.

(Why do I want to use a PEG instead of writing a triple-nested loop myself? Because I don't want to, and it is a great excuse to learn PEGs.. except that I am in over my head a bit. Unless it is not possible with LPeg, the first is not an option.)

© Stack Overflow or respective owner

Related posts about lua