Compile verbose but readable XML definitions of regular expressions into the real deal.

mitchskiba/XMLEx

WHY?!?

I'll start with the biggest WTF on this project. Why would I want to use something as verbose as XML to define something usually known to be sleek, like regex?

Maintainability.

If you were asked to figure out why the regex below didn't match something you expected it to, where would you start? It isn't impossible to understand, but it takes longer than it should.

(SH|RE|MF)-((?:197[1-9]|19[89]\d|[2-9]\d{3})-(?:0[1-9]|1[012])-(?:0[1-9]|[12]\d|3[01]))-((?!0{5})\d{5})

That is the example regex I built with this project. It matches one of three two-letter codes, a dash, a YYYY-MM-DD date (on or after Jan 1, 1971), a dash, then a five-digit number other than 00000.
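To check that description against the regex itself, here is a small Python sketch. The names TICKET_RE and parse_ticket are made up for this illustration and are not part of the project; the pattern is copied unchanged from above.

```python
import re

# The example regex from above, split across lines only for readability.
TICKET_RE = re.compile(
    r"(SH|RE|MF)-"                                    # two-letter code
    r"((?:197[1-9]|19[89]\d|[2-9]\d{3})"              # year, 1971 or later
    r"-(?:0[1-9]|1[012])-(?:0[1-9]|[12]\d|3[01]))-"   # month and day
    r"((?!0{5})\d{5})"                                # five digits, not 00000
)


def parse_ticket(text):
    """Return (code, date, number) if the whole string matches, else None."""
    match = TICKET_RE.fullmatch(text)
    return match.groups() if match else None


print(parse_ticket("SH-1999-12-31-00042"))  # a well-formed ticket number
print(parse_ticket("SH-1970-12-31-00042"))  # year before 1971
print(parse_ticket("RE-2020-01-01-00000"))  # all-zero number
```

Note that the date part only checks digit ranges, so an impossible date such as 2021-02-31 still matches.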

Business Knowledge

I have heard some smart people say they shy away from using regex to enforce anything that could be considered business logic. In the dense form above, there isn't room to explain why anything is the way it is. Even an inline regex comment or two wouldn't save it.

This is a real shame given how useful regex can be for enforcing some constraints.

How

First, a disclaimer: there is currently no formal definition of the XML schema I use (and the code is quite hacked together and subject to change as I feel like revisiting this project). This whole thing is more or less a proof of concept anyway.

XML Tags Used

lit Short for literal. Contains text that will appear as represented in the XML document. That means you have to use XML entities for things like & (&amp;) and < (&lt;) etc.

seq Short for sequence. Contains a list of sub-expressions to be matched in order

or Contains a list of sub-expressions that may appear next

mult Short for multiplicity. Changes how many times the single sub-expression may repeat. The min and max attributes control the allowed number of repetitions.

whitespace, not_whitespace, digit, not_digit, word_char, not_word_char, any, word_boundary, line_start, line_end, string_start, string_end These are all tags for the built-in character classes and positions.

class Defines a character class. The values attribute will be escaped, so place a list of literal characters in there. If you wish to use an existing character class as well, you can put it inside the body. Setting the negative attribute to true will make it a negated character class.

range Define a character range. Make an empty tag with a min and max attribute. These can be children of a class tag, or stand alone

macro Defines a named regular expression that can be inserted with the use tag. The macro tag expects a name attribute and a single child that will be put in place of the use tag. The macro tag itself will not output anything. It is recommended that all macros go inside a sequence in a file included as a library. Also, don't put capture groups inside macros. Bad things may happen.

use Replaced by the regular expression defined by the macro whose name attribute matches this tag's name attribute.

set, clear Used to set/clear the i, m and x flags for the single sub-expression. Use the flags attribute to specify which flags should be changed.

capture Used to define a capture group surrounding the single sub-expression. Optionally has a name attribute to make it a named capture group.

backref Used to make a backreference to a capture. Use either the name or number attribute. I recommend using name for both capture and backref. Even if your language does not support named backreferences, the compiler has the --force-numeric option that will do the position counting for you.

group Makes a special grouping for the surrounded subgroup. Valid types are "negative lookahead", "positive lookahead" and "nest".
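To give a feel for how tags like these could map onto regex syntax, here is a minimal Python sketch. It is not the project's compiler (xexcomp.py) and makes no claim about how that code works. It handles only lit, seq, or, mult, digit and capture. The function name compile_node, the default values for min and max, the choice to escape lit text with re.escape, and the sample XML document are all assumptions made for this illustration.

```python
import re
import xml.etree.ElementTree as ET


def compile_node(node):
    """Turn one XML element into a regex fragment (hypothetical helper)."""
    tag = node.tag
    if tag == "lit":
        # Assumption: literal text is escaped so it matches itself.
        return re.escape(node.text or "")
    if tag == "digit":
        return r"\d"
    if tag == "seq":
        # Sub-expressions matched in order.
        return "".join(compile_node(child) for child in node)
    if tag == "or":
        # Alternatives, wrapped in a non-capturing group.
        return "(?:" + "|".join(compile_node(child) for child in node) + ")"
    if tag == "mult":
        (child,) = list(node)  # expects exactly one sub-expression
        low = node.get("min", "0")   # assumed default
        high = node.get("max", "")   # assumed default: unbounded
        return "(?:" + compile_node(child) + "){" + low + "," + high + "}"
    if tag == "capture":
        (child,) = list(node)
        name = node.get("name")
        prefix = "(?P<" + name + ">" if name else "("
        return prefix + compile_node(child) + ")"
    raise ValueError("unsupported tag: " + tag)


# Hypothetical input: a code, a dash, then exactly five digits.
SOURCE = """
<seq>
  <capture name="code">
    <or><lit>SH</lit><lit>RE</lit><lit>MF</lit></or>
  </capture>
  <lit>-</lit>
  <capture name="number">
    <mult min="5" max="5"><digit/></mult>
  </capture>
</seq>
"""

pattern = compile_node(ET.fromstring(SOURCE))
ticket = re.compile(pattern)
print(pattern)
print(ticket.fullmatch("RE-12345").groupdict())
```

The sketch emits Python-style named groups only; the real compiler's --lang and --force-numeric options exist because other targets spell these differently.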

Compiler Options

Usage: inputfile outputfile [Options]

Options:
  -h, --help            show this help message and exit
  -l LIBRARY, --lib=LIBRARY
                        library to consult for macros
  --force-numeric       Force use of numeric capture groups only
  -a LANGUAGE, --lang=LANGUAGE
                        Language to target. Options are
                        [ruby|python|.NET|pcre]

To run the example I included: ./xexcomp.py ticketnum.xml ticketnum.re.txt -l lib.xml
