HEML - Human Editable Markup Language (yet another)

1 Presentation
2 Java implementation
- 2.1 Implementation status
- 2.2 TODO
3 More features, XML mapping and examples
4 Similar formats discussion

1 Presentation

1.1 Why HEML

As XML files grow, they quickly become difficult to edit with an ordinary text editor, or even with a more advanced software development editor. Graphical XML editors are efficient for updating a few attributes or elements but are not comfortable for long writing sessions.

Rich editors and modern word processors are efficient for editing some well-defined XML formats such as DocBook and XHTML, but are difficult to configure for an arbitrary user-defined format. Wiki formats (wikitext, Confluence, and others) are designed for efficient online ASCII editing and are therefore easy to handle with any text editor. But those formats are not as generic as XML and are of little help for anything other than document editing.

HEML's design is inspired by wikitext, but it aims to be functionally equivalent to XML and to allow editing of any kind of data, not only documents for human readers.

HEML is more than a document format: combining the include and table features helps migrate legacy data formats (for now limited to ASCII formats) to XML. The HEML processor then acts as a reverse XSL processor, supporting easy translation of more or less delimited ASCII files to XML.

Because HEML is readable, changes are as easy to track and merge as in programming language source code, and software development tools can support HEML management, in particular build tools (make) and SCM systems (Subversion, Git, ...).

1.2 Yet another markup language (that is not YAML)

Main features are:

More concise than XML, hence fewer characters to type.

As generic as XML: can express rich structured (hierarchical) information in the same manner.

Handles simple layouts (paragraphs, indents and bullets) like wiki text.

Big documents remain readable directly with ASCII tools. Consider using your preferred syntax highlighter to gain readability (an example vim syntax file is included in the HEML source distribution).

1.3 Drawbacks

Some layout information may be lost during processing. After automatic processing, it is therefore not always possible to restore a document that is still optimised for human editing.

In the same way, some layout and special features are lost when translating a document to XML, and the exact reverse process is not possible.

These two drawbacks make HEML a human-editable-only markup language.

2 Java implementation

2.1 Implementation status

HEML to XML command line filter.

SAX-like callback API: use HEML as your native data format.

The full changelog is available from the EGPI task and issue management system.

2.2 TODO

Improve error handling and reporting.

Special characters handling

Improve HEML writer (try to get closer to some reversible HEML/XML translation)

Android port, including an XSL processor, to turn any Android device into a rich notepad producing publishable structured data.

Advanced tables: add fixed-size fields without delimiters and binary file support to the table feature. This will turn HEML into a generic converter from legacy data formats to XML.

3 More features, XML mapping and examples

3.1 Elements and attributes

HEML:

{root {elem %attr=value} {elem %attr=value %attr2=value2 {subelem text} } }

XML:

<root>
	<elem attr="value"/>
	<elem attr="value" attr2="value2">
		<subelem>text</subelem>
	</elem>
</root>
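The element and attribute mapping shown above can be sketched in a few lines of Java. This is only an illustration, not the parser shipped with HEML: the class name HemlElementSketch and its methods are made up for this example, and it assumes that an attribute value runs up to the next '%', '{', '}' or newline, which is inferred from the examples on this page rather than from the real grammar. Malformed input, escapes and parser commands are not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only: not the HEML parser shipped with the distribution. */
final class HemlElementSketch {
    private final String src;
    private int pos;

    private HemlElementSketch(String src) {
        this.src = src;
    }

    /** Converts one well-formed HEML element, e.g. {a %k=v {b text}}, to XML. */
    static String toXml(String heml) {
        return new HemlElementSketch(heml.trim()).element();
    }

    private String element() {
        pos++; // skip '{'
        String name = readUntil(" \t\n{}");
        Map<String, String> attrs = new LinkedHashMap<>();
        StringBuilder body = new StringBuilder();
        while (pos < src.length() && src.charAt(pos) != '}') {
            char c = src.charAt(pos);
            if (Character.isWhitespace(c)) {
                pos++;
            } else if (c == '%') {
                pos++; // skip '%'
                String key = readUntil("=");
                pos++; // skip '='
                // Assumption: a value runs to the next '%', '{', '}' or newline.
                attrs.put(key, readUntil("%{}\n").trim());
            } else if (c == '{') {
                body.append(element()); // nested element
            } else {
                body.append(escape(readUntil("{}").trim())); // plain text
            }
        }
        pos++; // skip '}'
        StringBuilder xml = new StringBuilder("<").append(name);
        attrs.forEach((k, v) ->
                xml.append(' ').append(k).append("=\"").append(escape(v)).append('"'));
        if (body.length() == 0) {
            return xml.append("/>").toString();
        }
        return xml.append('>').append(body).append("</").append(name).append('>').toString();
    }

    /** Reads characters until one of the stop characters or the end of input. */
    private String readUntil(String stops) {
        int start = pos;
        while (pos < src.length() && stops.indexOf(src.charAt(pos)) < 0) {
            pos++;
        }
        return src.substring(start, pos);
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        System.out.println(toXml(
                "{root {elem %attr=value} {elem %attr=value %attr2=value2 {subelem text} } }"));
    }
}
```

Run on the HEML example above, the sketch produces the same elements and attributes as the XML column, written on a single line without indentation.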

3.2 Paragraphs, indentations and bullets

HEML:

{section %title=paragraph layout example
paragraph
paragraph
- bullet
- bullet
	- sub bullet
}

XML:

<section title="paragraph layout example">
	<p>paragraph</p>
	<p>paragraph</p>
	<li>bullet</li>
	<li>bullet</li>
	<ul>
		<li>bullet</li>
	</ul>
</section>

3.3 Parser commands

Element names starting with the '?' character are handled as parser commands. Currently supported commands are:

?set: changes parser settings. Possible attributes are:

%encoding: specifies which character encoding to use to parse the file.
%tab: defines how many characters one tab character counts for when computing text indentation levels (see the paragraph layout example above).

?include: includes (and parses) another HEML file in place of the include command. The file to include is specified using the %src attribute, as a relative path or an absolute URL.

?table: includes or defines in-line tabular data (that is, delimited ASCII data such as CSV). Table examples are provided below.
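To make the role of %tab more concrete, here is a minimal Java sketch of how an indentation width could be computed for one line. It is an assumption about the principle, not the code of the HEML parser: the class HemlIndentSketch is hypothetical, and it counts each tab as a fixed number of characters instead of advancing to the next tab stop.

```java
/** Illustrative sketch only: the real HEML parser may compute indentation differently. */
final class HemlIndentSketch {

    /**
     * Returns the indentation width of a line, counting a space as one
     * character and a tab as tabSize characters (the role played by %tab).
     */
    static int indentWidth(String line, int tabSize) {
        int width = 0;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == ' ') {
                width += 1;
            } else if (c == '\t') {
                width += tabSize;
            } else {
                break; // first non-blank character ends the indentation
            }
        }
        return width;
    }

    public static void main(String[] args) {
        // Lines taken from the paragraph layout example, with a tab size of 4.
        System.out.println(indentWidth("- bullet", 4));
        System.out.println(indentWidth("\t- sub bullet", 4));
    }
}
```

With such a rule, a bullet indented with one tab and a bullet indented with the equivalent number of spaces end up at the same level, which is presumably why the tab size needs to be configurable.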

3.4 Includes

FileA.heml:

{section %title=include example
paragraph
{?include %src=FileB.heml}
paragraph
}

FileB.heml:

included paragraph
included paragraph

XML:

<?xml version="1.0"?>
<section title="include example">
	<p>paragraph</p>
	<p>included paragraph</p>
	<p>included paragraph</p>
	<p>paragraph</p>
</section>
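The %src attribute accepts a relative path or an absolute URL. One plausible way to handle both cases is sketched below with the standard java.net.URI class. The class HemlIncludeSketch, the example.org addresses and the choice to resolve relative paths against the location of the including file are assumptions made for illustration; the actual parser may resolve them differently.

```java
import java.net.URI;

/** Illustrative sketch only: shows one possible way to resolve a %src value. */
final class HemlIncludeSketch {

    /**
     * Resolves the %src value of an ?include command.
     * Assumption: a relative src is resolved against the including file's
     * location; an absolute URL is returned unchanged.
     */
    static URI resolveSrc(URI includingFile, String src) {
        return includingFile.resolve(src);
    }

    public static void main(String[] args) {
        URI fileA = URI.create("http://example.org/docs/FileA.heml");
        // Relative path: resolved next to FileA.heml.
        System.out.println(resolveSrc(fileA, "FileB.heml"));
        // Absolute URL: used as given.
        System.out.println(resolveSrc(fileA, "http://example.org/shared/FileB.heml"));
    }
}
```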

3.5 Tables

The HEML table feature allows tabular data to be written concisely and legacy CSV files to be imported. Several attributes configure how the parser handles the table content:

%encoding: name of the character encoding to use when reading an imported table. This parameter must be used in combination with the %src attribute.

%fields: names of the field container elements or attributes. When only one name is provided, all fields have the same name unless %style is set to "attr". Default field name is "td".

%fieldSep: list of characters to consider as field separator in the following table definition. Default separator character is "%".

%record: name of the record container elements. Default record name is "tr".

%recordSep: list of characters to consider as record separators in the following table definition. Default separator character is "\n" (newline).

%src: if set, its value is used as the path or URL of the file to import as table content. Default is undefined, in which case in-line table content is expected.

%style: if the value is "attr", fields are expanded as attributes of the record element, otherwise as sub-elements. Default behaviour is to create sub-elements.

%token: when set to "true", fields are extracted the way the Java StringTokenizer class does, i.e. adjacent separator characters are treated as a single one. Default value of this attribute is "false".

%trim: when "true", leading and trailing blank characters (spaces, tabs, ...) are removed from extracted fields. Default value of this attribute is "true".

HEML:

{?table %record=tableRowName %fields=field1,field2,field3
un % un1 % un2
deux % deux1 % % deux3
trois % trois1 % trois2 % trois3
}

XML:

<tableRowName>
	<field1>un</field1>
	<field2>un1</field2>
	<field3>un2</field3>
</tableRowName>
<tableRowName>
	<field1>deux</field1>
	<field2>deux1</field2>
	<field3/>
	<f1>deux3</f1>
</tableRowName>
<tableRowName>
	<field1>trois</field1>
	<field2>trois1</field2>
	<field3>trois2</field3>
	<f1>trois3</f1>
</tableRowName>

HEML:

{?table %record=tableRowName %style=attr %fields=field1,field2,field3
un % un1 % un2
deux %% deux2 % deux3
trois % trois1 % trois2 % trois3
quatre % quatre1
}

XML:

<tableRowName field1="un" field2="un1" field3="un2"/>
<tableRowName field1="deux" field2="" field3="deux2">deux3</tableRowName>
<tableRowName field1="trois" field2="trois1" field3="trois2">trois3</tableRowName>
<tableRowName field1="quatre" field2="quatre1"/>

HEML:

{?table %fieldSep=" \t" %token=true
a b c d e
1 2 3 4
}

XML:

<tr>
	<td>a</td>
	<td>b</td>
	<td>c</td>
	<td>d</td>
	<td>e</td>
</tr>
<tr>
	<td>1</td>
	<td>2</td>
	<td>3</td>
	<td>4</td>
</tr>

HEML (from one HEML file):

{?table %src=test.csv %fieldSep=;}

Included CSV file (test.csv):

1;2;3;4;5;
;;;;;
;;III;;V;
a;b;c;d;e;f

XML:

<tr>
	<td>1</td>
	<td>2</td>
	<td>3</td>
	<td>4</td>
	<td>5</td>
</tr>
<tr/>
<tr>
	<td/>
	<td/>
	<td>III</td>
	<td/>
	<td>V</td>
</tr>
<tr>
	<td>a</td>
	<td>b</td>
	<td>c</td>
	<td>d</td>
	<td>e</td>
	<td>f</td>
</tr>
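The effect of %fieldSep, %token and %trim on a single record can be sketched as follows. This is a simplified illustration, not the HEML implementation: the class HemlTableSketch and its splitFields method are hypothetical names, and the sketch only splits one record into fields. In particular it does not try to reproduce how the real parser names extra fields or treats trailing empty fields, as seen in the CSV example above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

/** Illustrative sketch only: splits one table record into fields. */
final class HemlTableSketch {

    /**
     * @param record   one record (line) of the table
     * @param fieldSep characters treated as field separators (%fieldSep)
     * @param token    when true, adjacent separators count as one (%token)
     * @param trim     when true, blanks around each field are removed (%trim)
     */
    static List<String> splitFields(String record, String fieldSep, boolean token, boolean trim) {
        List<String> fields = new ArrayList<>();
        if (token) {
            // StringTokenizer never returns empty tokens, so runs of
            // separators behave like a single separator.
            StringTokenizer st = new StringTokenizer(record, fieldSep);
            while (st.hasMoreTokens()) {
                fields.add(st.nextToken());
            }
        } else {
            // Every separator character ends a field, so empty fields are kept.
            int start = 0;
            for (int i = 0; i <= record.length(); i++) {
                if (i == record.length() || fieldSep.indexOf(record.charAt(i)) >= 0) {
                    fields.add(record.substring(start, i));
                    start = i + 1;
                }
            }
        }
        if (trim) {
            fields.replaceAll(String::trim);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Default-like settings: "%" separator, no token mode, trimming on.
        System.out.println(splitFields("deux % deux1 % % deux3", "%", false, true));
        // Token mode with blank separators, as in the third example above.
        System.out.println(splitFields("a b  c\td", " \t", true, true));
    }
}
```

Without token mode, the record "deux % deux1 % % deux3" yields four fields, the third one empty, which matches the empty field3 element in the first example. With token mode, any run of spaces or tabs separates two fields.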

3.6 Miscellaneous features

comments: text between {# and #} is ignored by the parser.

as-is copy: text between {! and !} is not interpreted and is forwarded unchanged to the output inside a <pre> element.

escape character: any character following \ is forwarded to the output as-is. Useful to include HEML control characters such as { in an element's text.

SAX-like callback API: the current Java implementation can be used as a library, and programmers can define their own HEML handler by implementing the ParserCallback interface.

See this page's HEML source file for a complete usage example.

syntax highlighting: the markup is simple, so configuring syntax highlighting should be easy for common editors. A vim syntax highlighting script is included in the HEML parser source distribution as an example.
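The comment and escape rules listed above are simple enough to sketch in a single pass over the text. The Java code below is only an illustration under stated assumptions: HemlTextSketch is a hypothetical class, not part of the HEML library, and it assumes that an escaped {# does not start a comment and that an unterminated comment runs to the end of the input.

```java
/** Illustrative sketch only: applies the comment and escape rules to a piece of text. */
final class HemlTextSketch {

    /** Drops {# ... #} comments and copies any character following a backslash as-is. */
    static String plainText(String heml) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < heml.length()) {
            char c = heml.charAt(i);
            if (c == '\\' && i + 1 < heml.length()) {
                out.append(heml.charAt(i + 1)); // escaped character, copied verbatim
                i += 2;
            } else if (heml.startsWith("{#", i)) {
                int end = heml.indexOf("#}", i + 2);
                // Assumption: an unterminated comment swallows the rest of the input.
                i = (end < 0) ? heml.length() : end + 2;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The comment disappears and the escaped brace is kept as plain text.
        System.out.println(plainText("a {# note #}b \\{c"));
    }
}
```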

4 Similar formats discussion

4.1 Wiki markups

Wiki markups are designed to allow easy in-line editing from a simple text field, but their primary purpose is document writing. Editing arbitrary data as with XML is probably possible, but it leads to brainfuck-like naming and coding conventions layered on top of the standard markup to make the data processable.

See the Wiki markup Wikipedia page to learn more about the many wiki markup languages.

4.2 YAML

YAML's basic purpose is very similar to HEML's: enabling easy manual data editing. But its primary focus is data, and it promotes lists to handle that data. Maybe I had not fully realized what YAML is before I started coding HEML. I just found that YAML was not what I was looking for, and I hope HEML is more convenient, particularly for document writing. At least it is the case for me!

4.3 txt2tags

Txt2tags can be viewed as a wiki markup precursor and suffers from the same limitation when used to edit data that are not readable documents.

4.4 JSON

JSON's features are very close to YAML's and, like YAML, it seems more focused on complex data serialization than on direct, manual document editing. It is directly related to JavaScript, even though it is usable with any programming language thanks to the many libraries available.

4.5 XML

Of course, you can edit XML, HTML, XHTML... directly. But it is quite tedious because of the heavy markup (opening and closing tags, < and > for each, quotes for attributes). HEML is equivalent but requires less markup, hence fewer characters to type. Moreover, the paragraph layout feature, inspired by wiki markups and txt2tags, keeps HEML documents accessible to the human eye.