Metalanguages!

A language to describe languages.

Introduction

A metalanguage is a means of describing the format or grammar of another language. Typically in computing metalanguages are used to describe the syntax of a programming language or the format of data for storing in a file or transferring between applications. This may seem a little abstract but after seeing a few examples it's function should become clear.

There are quite a few metalanguages available. Here we will look at two of them, Railroad diagrams and EBNF (Extended Backus-Naur Form). Both of these allow for sequence, choice, option, repetition and recursion and as such are powerful enough for the required tasks. In this sense, it is quite similar to programming languages so you should be familiar with the concepts if you know how to program.

There are several different styles for formatting both Railroad diagrams and EBNF. I've chosen a style I like, your company or educational institution may use a different style. The underlying theory of both will still be the same however so the content below will still be valid, just change the way you write them out.

Terminology

Before we get started, let's look at some of the terminology you will encounter on the rest of the page. Some of these won't really make sense now but they will as you read through the rest of the page and see them in action.

Description - a collection of rules which together define the language.

Rule - a collection of terminal and non terminal symbols describing a part of the language.

Terminal symbol - a part of the rule which is displayed exactly as it is shown.

Non Terminal - a part of the rule which is further defined as another rule.

Railroad Diagrams

We will look at railroad diagrams first as they are visual and a bit easier to understand.

Railroad diagrams are named as such because they function similar to a railroad in how you progress through a rule. Lines indicate the path through the rule.

A terminal symbol is indicated as the symbol within a circle or a rounded rectangle. A terminal symbol is just anything which shows up in the language literally as it is.

Railroad terminals

A rule will always have just a single beginning line which begins at the far left with the name of the rule just above it. It will also have just a single ending line which ends at the far right.

This could also have been split up into individual sections. Normally you would do this to parts of your descriptions to allow for more flexibility in terms of describing patterns. You'll see what we mean by this below.

Railroad terminals split up

Spaces have been indicated by putting quotes around the space. This has been done here merely to indicate how spaces would be shown. If spaces are important in the formatting of your language you may need to do this, normally you wouldn't need to do this though. Spaces between words are implied (as seen in the example below).

A decision is indicated by splitting the line into multiple paths.

Railroad decisions

It is important that the split is formatted such that an imaginary train could travel down each of the paths.

Sometimes an item is optional. It is not a choice between one or another, it is a choice between one or none. We may achieve that like this:

Railroad optional

A loop is created by a line which bends back around and re enters the track earlier on. The following example will describe an amount of dollars beginning with a 1 followed by one or more zero's.

Railroad loops

Valid items would be $10, $100, $1000 etc.

There should always be a decision combined with a loop so there is a means of breaking out of the loop.

A loop doesn't only have to go over one item. It may extend over as many as you like. Let's say we want the letter A follwed by the letter B repeated.

Railroad loops multiple

Vaild items would be: AB, ABAB, ABABAB etc.

A non terminal is indicated by a rectangle. There will be a rule with the same name which further defines it.

Let's say we want to extend our previous description for Dollars and instead of only numbers beginning with a 1 we also want to include 2 and 3. We can do this by creating a new rule for those digits and referring to it in our Dollars rule like so:

Railroad non terminal

You may look at this and ask "Why not just put the 3 terminals from Digit straight into Dollars?". For something as simple as this, yes you probably could do that. As the languages you need to describe get more complex, readability will be greatly improved by breaking things up. What you will also find is that you end up creating rules for generic things such as digits, letters, operators etc which by creating as a separate rule you may define once and reuse many times.

There are two things to note from this example. The first is that rules which are used as non terminals in other rules should appear before those rules to make them easier to find. Secondly, you should make sure you name your rules as descriptively as possible. This makes it easy to understand what they represent.

Here is an example of what a rule might look like putting these concepts together.

We will make up an item called a Blob. A Blob is described as any string of characters beginning with a # and followed by an Upper case letter. After the Upper case letter we may have either a single Digit or one or more Lower case letters. For example #U5 would be a valid Blob. So would #Ryan. #H44 and #FgH are not. We would represent this description as follows:

Digit Rule Railroad outline

Upper and Lower would be definied similarly to digit but for every letter. I haven't included them here just to save space.

EBNF

EBNF is a text based means of representing exactly the same as what Railroad diagrams do. Any language that can be represented using a Railroad diagram can be represented with EBNF and vice versa. Being text based EBNF is a little more compact and easier to write but is not as easy to read. With a little practice it's not that bad though.

A rule is defined by the name of the rule, followed by an equals sign "=" followed by the rule itself. Terminal symbols are written exactly as they would normally be. So for example if a Blah is defined as "The cat sat on the mat" then it would look like this:

Blah = The cat sat on the mat

A rule which is only made up of terminal symbols is not very useful so lets now expand on this.

Square brackets [ ] indicate something which is optional. It may be there or it may not. Maybe our friend is Smith but when referring to him formally we say Mr Smith. We could represent either like this:

Friend = [Mr ]Smith

So a Friend could be either Smith or Mr Smith.

Curly brackets { } indicate zero or more of an item. Lets say we want a rule which has a 1 followed by any number of zero's. We would represent it like this:

Cost = 1{0}

Valid Cost's would be 1, 10, 100 etc.

It might seem odd that 1 is a valid Cost but remember the curly brackets indicate zero or more. This actually gives us more power than one or more. See the Patterns below to see how to reproduce one or more.

A pipe symbol "|" indicates a choice. It can either be all the things on the left or all the things on the right. Let's say we want either Mr or Mrs Green. We would represent it like this:

Person = Mr Green|Mrs Green

A Person could be either Mr Green or Mrs Green.

Note that it's all the items on the left or all the items on the right included in the decision. If we want to have the decision act upon only a portion of the items then we may use brackets ( ) to be more specific.

We could use brackets to rewrite the above rule a little more efficiently:

Person = (Mr|Mrs) Green

The final piece of EBNF is how we represent Non terminals in our rules. This is done using the greater than and less than symbols < >. Let's say a drink can be either Small, Medium or Large and has a flavour which is Apple, Orange or Pineapple. We would represent it like this:

Size = Small|Medium|Large

Flavour = Orange|Apple|Pineapple

Drink = <Size> <Flavour>

Valid Drinks would be Small Apple, Large Orange etc.

When listing a series of EBNF rules which make up a description it is best practice to order them incrementally. Rules which are included as non terminals should be defined before the rules in which they appear.

This way, as someone reads the description they don't need to go searching ahead to find rules required to understand the current rule they are reading.

The above rules for Size and Flavour also demonstrate how multiple pipe symbols may be used within a rule to have a choice between more than two options.

Finally, let's put all of these concepts together (except for [ ]) and look at how the example used for Railroad diagrams would be represented using EBNF:

Digit = 0|1|2|3|4|5|6|7|8|9

Blob = #<Upper>(<Lower>{<Lower>}|<Digit>)

Patterns and Examples

Here are a series of examples to illustrate commonly used patterns for both Railroad diagrams and EBNF.

Repetition

Let's say you want zero or more of a thing, or alternatively, one or more of a thing. The first is easy in EBNF and a little less obvious in Railroad diagrams. The latter is easy in Railroad diagrams and a little less obvious in EBNF.

First off, zero or more of a thing.

Railroad zero or more

Foo = {<Thing>}

And secondly, one or more of a thing.

Railroad one or more

Foo = <Thing>{<Thing>}

Recursion

This one is related to repetition. Recursion is a powerful concept and involves a rule having itself as a non terminal within it. Alternatively a set of rules may have non terminals within themselves such that they form a loop. We will look at an example of each.

Let's say that a Mirror is a series of b's followed by the same number of d's. We could not just use repetition as this would not control how many of each there is. We can however use recursion like this:

Railroad recursion bd

Mirror = bd|b<Mirror>d

Sometimes recursion is hard to get your head around. Working through an example on paper is generally a good way to help better grasp what is going on. See Proving a description below for help.

When using recursion it is important that at least one of your rules has an decision within it allowing you to break out of the recursion. Otherwise you will create an endless loop.

Here's another interesting example. Let's say we want to a series of brackets such that opening and closing brackets are always balanced. Here is a rule which will do just that:

Railroad recursion brackets

Brackets = '('[<Brackets>]')'<Brackets>|''

This would match:

(), (()), ()(), (())(), (()())

But would not match:

((), ())(

'' in the rule above is used to indicate an empty string or nothing.

Lists

A common thing to describe is a list. A list is typically multiple items, one after the other, separated by a character such as a comma. Let's say we want a list of Digits separated by a comma. eg. 4,5,7,2,6

Railroad list

 

Randoms = Digit{,<Digit>}

I didn't include the rule for a Digit, just to save space.

Proving a description

When given an item, you want to know if it is valid or not according to a particular description. There are a few ways you can go about this.

In your head

For simple descriptions and simple items to check it is possible to validate it in your head. It is quick and simple but is prone to mistakes. It is also not practical once the item or description reaches a certain level of complexity. So we need something a bit more formal to help us out.

Repeated rule replacement

With Repeated rule replacement we take our item which we want to find out is valid or invalid and repeatedly replace part of it with a rule. We repeat this until we are left with a single rule. If this is possible then it is a valid item. If we get stuck then it is not a valid item. It's easiest to do this with EBNF.

Let's look at some examples from above.

Size = Small|Medium|Large

Flavour = Orange|Apple|Pineapple

Drink = <Size> <Flavour>

And we want to prove that Medium Pineapple is valid.

Medium Pineapple → Size Pineapple

Size Pineapple → Size Flavour

Size FlavourDrink

We ended up with a single rule so the item is valid.

Another example:

Foo = {Great} Grandfather

And we want to prove that Great Great Great Grandfather is valid.

Great Great Great Grandfather → {Great} Grandfather

{Great} Grandfather → Foo

And again it is valid.

Parse trees

Parse trees are effectively just a visual way of doing Repeated rule replacement. Write your item out then repeatedly draw lines below (or above if you prefer) joining the parts of the item into rules. If the lines converge on a single rule then it is valid.

Let's look at some example.

Digit = 0|1|2|3|4|5|6|7|8|9

Blob = #<Upper>(<Lower>{<Lower>}|<Digit>)

Assume Upper and Lower are defined as all letters of the alphabet.

We want to discover if #Reginald is a valid Blob.

Parse tree example 1

You will need to space out your item enough to allow for your rule names to fit into the branches of the tree.

Another example:

Brackets = '('[<Brackets>]')'<Brackets>|''

We want to discover if (()()(())) is a valid Brackets.

Parse tree example 2

Some theory on Metalanguages

It's important to realise that a metalanguage will describe what a language looks like (syntax) but not it's meaning (semantics). For example the sentence "furiously sleep ideas green colourless" is grammatically correct yet makes no actual sense. Something being correct as per your metalanguage description doesn't guarantee that it will actually make sense.

The metalanguages discussed on this page are considered Type 2 in the Chomsky Heirarchy of notations. The Heirarchy looks at the complexity of the syntactic structure of languages. Type 0 is the most complex, going to down to Type 3 which is the simplest.

Both Railroad diagrams and EBNF seem able to describe a great many things but they do have their limitations. For example, as seen above, it is possible to define a rule which only allows for two equal amounts of things. It is not possible to define a rule which has three equal amounts of things however.

It is possible to define several descriptions which have different rules but effectively describe the exact same languages. For example, if we take an example used above:

Size = Small|Medium|Large

Flavour = Orange|Apple|Pineapple

Drink = <Size> <Flavour>

We could rewrite it as:

Drink = (Small|Medium|Large) (Orange|Apple|Pineapple)

or:

Flavour = Orange|Apple|Pineapple

Drink = (Small|Medium|Large) <Flavour>

These are all equivalent. We should always aim to have our description be as easy to read as possible. The second example is more compact as it is a single rule but it is one complex rule as opposed to 3 simple rules. The third example is mixed and doesn't really offer any benefit.

Summary

Railroad diagrams

Railroad summary track
The name of the rule sits above a line which forms the track for the Railroad diagram.
Railroad summary terminals
Terminal symbols. They appear literally as they are.
Railroad summary non terminals
A non terminal. foo will be further described elsewhere.

EBNF

Item =
The beginning of a rule. Item will be defined as whatever is to the right of the =
blah, "|"
Terminal symbols. They appear literally as they are. Quotes are put around characters which are part of EBNF.
|
Alternative items.
<foo>
A non terminal. foo will be further described elsewhere.
[blob]
Optional. blob may or may not appear.
{abc}
Repetition. abc may appear zero or more times.
( )
Grouping. Group several parts of the rule together.
Design for clarity and readability
Whether defining yoru description through railroad diagram or EBNF you should always strive to organise your rules such that they are as clean and easy to read as possible.
Order your rules
Rules should be ordered so that rules which appear as non terminals are listed before the rules they appear in.
Name your rules descriptively
Rule names should always be as descriptive as possible to make it obvious what they are defining.
Test your descriptions
Don't just assume they work and don't just test with the simple obvious test cases. Make sure you try the obscure and difficult ones too.

Activities

Let's define a language for a specific type of mathematical expression:

  • The language will make use of the following characters only + - * / ( ) and digits (0 - 9).
  • Operators are + - * /, numbers are made up of digits and ( ) are used for grouping.
  • An expression will consist of basic expressions which are a value followed by an operation followed by a value.
  • A value may be either a number or an expression (grouped together with brackets).

Here are some examples of valid expressions:

  • 5 + 3
  • (4 * 2) + 8
  • (5 - 3) / (3 + 7)
  • ((7 * 3) - 4) + (2 * 4)

Here are some examples of expressions which are not valid:

  • 8 + 4 + 2
  • (4*2)(8 / 5)

Your task is to create a description for this type of expression using both Railroad diagrams and EBNF.