Help with specific Regex: need to match multiple instances of multiple formats in a single string.

Posted by KevenK on Stack Overflow
Published on 2010-05-14T19:01:56Z Indexed on 2010/05/14 19:04 UTC

I apologize for the terrible title... it can be hard to summarize an entire situation in a single sentence.

Let me start by saying that I'm asking because I'm just not a Regex expert. I've used it a bit here and there, but I keep coming up short on the correct way to meet the following requirements.

The Regex that I'm attempting to write is for use in an XML schema for input validation, and it will also be used elsewhere in Javascript for the same purpose.

There are two supported formats: a literal string, which must be surrounded by quotation marks, and a hex-value string, which must be surrounded by braces.

Some test cases:

"this is a literal string" <-- Valid string, enclosed properly in "s
"this should " still be correct" <-- Valid string, "s are allowed within (if possible, this requirement could be forgiven if necessary)
"{00 11 22}" <-- Valid string, {}s are allowed in strings. Another one that can be forgiven if necessary
I am bad output <-- Invalid string, no "s
"Some more problemss"you know <-- Invalid string, must be fully contained in "s
{0A 68 4F 89 AC D2} <-- Valid string, hex characters enclosed in {}s
{DDFF1234} <-- Valid string, spaces are ignored for Hex strings
DEADBEEF <-- Invalid string, must be contained in either "s or {}s
{0A 12 ZZ} <-- Invalid string, 'Z' is not a valid Hex character

To satisfy these general requirements, I came up with the following Regex, which seems to work well enough. I'm still fairly new to Regex, so there could be a huge hole here that I'm missing:

&quot;.+&quot;|\{([0-9]|[a-f]|[A-F]| )+\}

If I recall correctly, XML Schema regexes are implicitly anchored at the beginning and end of the value (as if wrapped in ^ and $). So, essentially, this regex accepts any string that starts and ends with a ", or that starts with { and ends with } and contains only valid hexadecimal characters and spaces. This has worked well for me so far, except that I had forgotten about another, less common input option that completely breaks my regex.
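
For the Javascript side, here is a minimal sketch of how the regex above behaves once the anchors are written out explicitly, since Javascript does not add them implicitly. It assumes the `&quot;` entities stand for plain `"` characters, and the name `isValidSingle` is made up for illustration:

```javascript
// Pattern from the post, with &quot; written as a plain " (assumption) and
// explicit ^...$ anchors, which XML Schema implies but Javascript does not.
const singleValue = /^(?:".+"|\{([0-9]|[a-f]|[A-F]| )+\})$/;

// Hypothetical helper, named here only for illustration.
function isValidSingle(input) {
  return singleValue.test(input);
}

// A few of the test cases from above, plus one comma-separated input.
const samples = [
  '"this is a literal string"',
  '{0A 68 4F 89 AC D2}',
  'DEADBEEF',
  '{0A 12 ZZ}',
  '"some string",{0A F1}',
];
for (const s of samples) {
  console.log(s, '->', isValidSingle(s));
}
```

With the anchors in place, the single-value cases come out as listed above, but `"some string",{0A F1}` is rejected. An input like `"a","b"` slips through only because `.+` happens to swallow the comma.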



Where I made my mistake:
Valid input should also allow a user to separate valid strings (of either type, literal/hex) by a comma. This means that a single string should be able to contain more than one of the above valid strings, separated by commas. Luckily, however, a comma is not a supported character within a literal string (although I see that my existing regex does not care about commas).

Example test cases:
"some string",{0A F1} <-- Valid
{1122},{face},"peanut butter" <-- Valid
{0D 0A FF FE},"string",{FF FFAC19 85} <-- Valid (Spaces don't matter in Hex values)
"Validation is allowed to break, if a comma is found not separating values",{0d 0a} <-- Invalid, comma is a delimiter, but "Validation is allowed to break" and "if a comma..." are not marked as separate strings with "s
hi mom,"hello" <-- Invalid, String1 was not enclosed properly in "s or {}s

My thought is that it should be possible to use the commas as delimiters and check each "section" of the string against a regex similar to the original, but I'm just not advanced enough in regex yet to come up with a solution on my own. Any help would be appreciated, but ultimately a final solution with an explanation would be just stellar.
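
The comma-delimiter idea above could be sketched in Javascript roughly as follows. This is only a sketch under the stated rules, not a vetted answer. It assumes literals may contain anything except a comma (so inner "s are still allowed), hex strings may contain only hex digits and spaces, and no whitespace appears around the commas. The names `literal`, `hex`, `item` and `isValidList` are made up for illustration.

```javascript
// One literal: a quote, one or more non-comma characters, a closing quote.
const literal = /"[^,]+"/;
// One hex value: braces around hex digits and spaces only.
const hex = /\{[0-9a-fA-F ]+\}/;

// A single "section" is either format.
const item = `(?:${literal.source}|${hex.source})`;
// The whole input: one section, then any number of ",section" repeats.
const list = new RegExp(`^${item}(?:,${item})*$`);

// Hypothetical helper, named here only for illustration.
function isValidList(input) {
  return list.test(input);
}

console.log(isValidList('"some string",{0A F1}'));
console.log(isValidList('{1122},{face},"peanut butter"'));
console.log(isValidList('hi mom,"hello"'));
```

The same pattern without ^ and $ should carry over to the XML schema, though as far as I know XML Schema patterns have no `(?:...)` groups, so plain parentheses would be needed there. Note that this sketch also accepts a hex string containing only spaces, such as `{ }`.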

Thanks for reading this huge wall of text!

© Stack Overflow or respective owner
