Parser derivation

Parsers and Functional Programming

Parsers in functional languages have been studied extensively
- Functional Parsers by J. Fokker
- Monadic Parser Combinatiors by G. Hutton
A parser can be modeled as a function of, for instance, the following type
```
type Parser a = String -> (a, String)
```
A parser takes an string to parse, parses the string, and as a result returns a structure of type a and the unconsumed suffix of the input string.

Functional programming is a natural fit to build parsers. They are functions!
In fact, parsers are DSL in Haskell.

Monadic parsers

Monadic parsers are powerful enough to describe context-sensitive grammars

The grammar itself can depend on the input!
```
(>>=) :: Parser a -> (a -> Parser b) -> Parser b
```
Can you see the dependency in the type of bind?
Efficiency of parsers often revolves around the implementation of the choice operator
```
(+++) :: Parser a -> Parser a -> Parser a
```
The parser does not know which option to follow, so it is common to simple try all the possibilities and backtracking when needed.

In a monadic parser, we need to wait for one of the parsers to succeed or fail (memory leak?).
Related work
- Weaker notion of sequencing than monads (Arrows by John Hughes)
  - It only supports context-free grammars
- Special designed choice combinators: asymmetric choice where the right hand-side parser only runs if the left hand-side fails; deterministic choice where a symbol look ahead can resolved which parser to use.
  - Deterministic parsers: Deterministic, Error-Correcting Combinator Parsers by S. Swierstra and L. Duponcheel
  - Non-deterministic parsers: Combinator Parsers: From Toys to Tools by S. Swierstra
In all of the work mentioned above, there is a general choice combinator which still suffers from inefficiencies.

This lecture

It is based on the article FUNCTIONAL PEARL Parallel Parsing Processes by K. Claessen
Programming monadic parsing library
- Efficient choice combinator (breadth-first rather than deep-first search)
- Non-deterministic grammars
No special annotation
Derived!
Used by GHC Read type class via Text.ParserCombinators.ReadP.

A simple EDSL for parsing

A parser Parser s a takes a stream of symbols of type s, parses it, and produces a value of type a

API for the parsers

{-- Type --}
data Parser s a
{-- Constructors --}
symbol :: Parser s s
fail   :: Parser s a
return :: a -> Parser s a
{-- Combinators --}
(+++)  :: Parser s a -> Parser s a -> Parser s a
(:>>=) :: Parser s a -> (a -> Parser s b) -> Parser s b

Function symbol returns the next symbol. Function fail aborts parsing. Functions return and (:>>=) are the monadic primitives. Function (+++) is the choice operator.

Before we get into the implementation details, let us describe some laws that we expect from the API — in that manner, we can detect early on if we are doing things right!

This is the **domain knowledge**, which we will use later to gain performance!

Monads

Monadic primitives	Laws
L1 (Left Identity):	return a >>= f ≡ f a
L2 (Right Identity):	p >>= return ≡ p
L3 (Associativity):	(p >>= f) >>= g ≡ p >>= (\ x -> f x >>= g)

Bind distributes over choice
`fail`, `(+++)`, and `(>>=)` Laws
**L4**:
fail >>= f ≡ fail
**L5**:
(p +++ q) >>= f ≡ (p >>= f) +++ (q >>= f)

`fail`, `(+++)`, and `(>>=)`	Laws
L4:	fail >>= f ≡ fail
L5:	(p +++ q) >>= f ≡ (p >>= f) +++ (q >>= f)

Choice forms a commutative monoid with unit fail

More on `fail` and `(+++)`	Laws
L6 (Left unit):	fail +++ q ≡ q
L7 (Right unit):	p +++ fail ≡ p
L8 (Associativity):	(p +++ q) +++ r ≡ p +++ (q +++ r)
L9 (Commutativity):	p +++ q ≡ q +++ p

Key law for efficiency!

On `(>>=)`, `(+++)`, and `symbol`	Laws
L10:	(symbol >>= f) +++ (symbol >>= g) ≡ symbol >>= (\ c -> f c +++ g c)

Reference semantics

Any semantics we associate to elements of type Parser s a must obey the laws shown above.
We take a reference semantics, i.e., a semantics that we will compare our implementation against in order to see if our implementation is correct.

The semantic function [| _ |], also called run, is defined as follows (we use {| |} to denote multisets and \/ for multiset union).

 [| _ |] :: Parser s a -> [s] -> {| (a, [s]) |}
 [| symbol   |] (c : s) = {| (c, s) |}
 [| symbol   |] []      = {| |}
 [| fail     |] s       = {| |}
 [| p +++ q  |] s       = [| p |] s  \/  [| q |] s
 [| return a |] s       = {| (a, s) |}
 [| p >>= f  |] s       = {| (b, s_f) | (a, s_p)  <- [| p   |] s
                                      , (b, s_f)  <- [| f a |] s_p
                          |}

Using this semantics we can prove (exercise) the laws about parsers given before.

For instance, here is the proof of L10 for the case of a non-empty input string:

 ==  { Def. of [| p +++ q |] }
   [| symbol >>= f |] (c:s)  \/  [| symbol >>= g |] (c:s)
 ==  { Def. of [| p >>= f |] and [| symbol |] }
   [| f c |] s  \/  [| g c |] s
 ==  { Def. of [| p +++ q |] "backwards" }
   [| f c +++ g c |] s
 ==  { Def. of [| p >>= f |] and [| symbol |] "backwards" }
   [| symbol >>= (\ c -> f c +++ g c) |] (c:s)

Exercise: prove or test the rest of the laws

The reference semantics is useful for reasoning, but inefficient.
There are three sources of possibly inefficiency that we can identify:

Source Reason

Definition of `(+++)` Union of bags

Definition of `(>>=)` Creation of many intermediate results (e.g., `[| p |] s` and `[| f a |]`)

Source	Reason
Definition of `(+++)`	Union of bags
Definition of `(>>=)`	Creation of many intermediate results (e.g., `[\| p \|] s` and `[\| f a \|]`)

Parser0: our first implementation

code

Every constructor and combinator is a constructor in the Parser data type.

data Parser0 s a where
  {-- Constructors --}
  Symbol  ::  Parser0 s s
  Fail    ::  Parser0 s a
  {-- Combinators --}
  Choice  ::  Parser0 s a -> Parser0 s a -> Parser0 s a
  Return  ::  a -> Parser0 s a
  (:>>=)  ::  Parser0 s a -> (a -> Parser0 s b) -> Parser0 s b

We call it Parser0 since it is our first attempt.

Constructors and combinators (trivial)

 {- | Constructors -}
 symbol = Symbol
 pfail  = Fail

 {- | Combinators -}
 (+++)  = Choice

Monadic operations (trivial)

 instance Monad (Parser0 s) where
   return = Return
   (>>=)  = (:>>=)

What about our run function?

To start with, and for simplicity, we use lists instead of bags to denote the semantics of parsers.

 type Semantics s a = [s] -> [(a,[s])]

The run function maps the constructors to their semantics.

 run0 :: Parser0 s a -> Semantics s a
 run0 Symbol          [] = []
 run0 Symbol      (s:ss) = [(s,ss)]
 run0 Fail            _  = []
 run0 (Choice p q)    ss = (run0 p ss) ++ (run0 q ss)
 run0 (Return x)      ss = [(x,ss)]
 run0 (p :>>= f)      ss = [(y,s2) | (x,s1) <- run0 p ss,
                                     (y,s2) <- run0 (f x) s1]

We have the same sources of inefficiency as the reference semantics! (i.e., definition of `Choice` and `(:>>=)`)

Parser1: removing bind

code

The use of list comprehension in run0 (p :>>= f) builds a lot of intermediate lists which might be costly
How do we simplify it?

We move towards an intermediate representation, where the bind takes place when constructing the program — not when running it!
Methodology:
- Remove (:>>=) from the data type
- Try to define (>>=) anyways, and analyze the usage patterns which we cannot write
- Introduce new constructors to capture such cases
- Simplify the data type with the new constructors and derive the definition for (>>=)
Let us try to define (>>=)
```
 Fail >>= k
```
By L4, we know that
```
 Fail >>= k ≡ Fail
```
Success! (Nothing to be done)

 Choice p q >>= f

By L5, we know that

 Choice p q >>= f ≡ Choice (p >>= f) (q >>= f)

Success! (Nothing to be done)

```
 Return x >>= f
```
The first monad law already tells us that this is just (f x).

Success! (Nothing to be done)

 Symbol >>= k

There is no L-rule for this case! Let us capture this usage pattern in a new constructor

 SymbolBind k ≡ Symbol >>= k

Observe that

 k :: s -> Parser0 s a

Therefore, we have that

 SymbolBind :: (s -> Parser0 s a) -> Parser0 s a

We obtain Parser1 from the definition of Parser0, where SymbolBind gets introduced and (:>>=) gets removed

 data Parser1 s a where
     SymbolBind ::  (s -> Parser1 s a) -> Parser1 s a
     Fail       ::  Parser1 s a
     Choice     ::  Parser1 s a -> Parser1 s a -> Parser1 s a
     Return     ::  a -> Parser1 s a

Observe that there is no (:>>=)

What about the run function?

run1 :: Parser1 s a -> Semantics s a
run1 Fail                _ = []
run1 (Choice p q)       ss = run1 p ss ++ run1 q ss
run1 (Return x)         ss = [(x,ss)]
run1 (SymbolBind k)     ss = ?

It is mainly as before, but the intermediate results generated by (:>>=) are not there.

In the definition of run1, the new interesting case is

run1 (SymbolBind k) ss = ?

We are going to derive it by using the reference semantics.

We know, by the definition of SymbolBind, that

[| SymbolBind k |] = [| symbol >>= k |]

By the reference semantics of (>>=), we have that

[| symbol >>= k |] ss = {| (b, s_k) | (a, ss_p)  <- [| symbol |] ss
                                    , (b, ss_k)  <- [| k a |] ss_p
                        |}

So, we have two cases:

[| symbol >>= k |] [] = {| (b, s_k) | (a, ss_p)  <- [| symbol |] []
                                    , (b, ss_k)  <- [| k a |] ss_p
                        |}

By the reference semantics of symbol, we have

[| symbol >>= k |] [] = {| (b, ss_k) | (a, ss_p)  <- {| |}
                                     , (b, ss_k)  <- [| k a |] ss_p
                        |}

By multi-set comprehension, we conclude that

[| symbol >>= k |] [] = {| |}

On the other hand, we have the following equation

[| symbol >>= k |] (s:ss) = {| (b, s_k) | (a, ss_p)  <- [| symbol |] (s:ss)
                                        , (b, ss_k)  <- [| k a |] ss_p
                            |}

By applying the reference semantics of symbol, we have that

[| symbol >>= k |] (s:ss) = {| (b, s_k) | (a, ss_p)  <- {| (s, ss) |}
                                        , (b, ss_k)  <- [| k a |] ss_p
                            |}

By multi-set comprehension, we conclude that

[| symbol >>= k |] (s:ss) = {| (b, ss_k) | (b, ss_k)  <- [| k s |] ss |}

which, by multi-set comprehension, is equivalent to

[| symbol >>= k |] (s:ss) = [| k s |] ss

To summarize, we obtain

[| symbol >>= k |] []     = {| |}
[| symbol >>= k |] (s:ss) = [| k s |] ss

Therefore, we conclude that

run1 (SymbolBind k) []     = []
run1 (SymbolBind k) (s:ss) = run1 (k s) ss

Constructors and combinators? (the non-proper morphisms are mainly as before)
```
{- | Constructors -}
symbol = SymbolBind Return
pfail  = Fail

{- | Combinators -}
(+++)  = Choice
```
Observe that symbol is defined as SymbolBind Return. Is it true that symbol only extracts a symbol from the input?
```
SymbolBind Return ≡ Symbol >>= Return
```
By L2 (Right Identity), we know that
```
SymbolBind Return ≡ Symbol
```
So, SymbolBind corresponds to the notion of Symbol in Parser0!
What about return?

Function return is just as before.
```
 return = Return
```

The interesting case is (>>=)

How are we going to define it?

So far, we have that

Fail       >>= k = Fail
Choice p q >>= f = Choice (p >>= f) (q >>=f)
Return a   >>= k = k a

What about our recently introduced constructor (SymbolBind)?

SymbolBind f >>= k = ?

By definition of SymbolBind, we know that

SymbolBind f >>= k ≡ Symbol >>= f >>= k

By L3 (Associativity of monads), we have that

Symbol >>= f >>= k ≡ Symbol >>= (\ s -> f s >>= k)

By our definition of SymbolBind, we have that

Symbol >>= (\ s -> f s >>= k) ≡ SymbolBind (\ s -> f s >>= k)

So, we finally have that

SymbolBind f >>= k = SymbolBind (\ s -> f s >>= k)

We can now define Parser1 as a monad

{- | Monadic instance for Parser1 -}
instance Monad (Parser1 s) where
   return = Return
  Fail         >>= k = Fail
  Choice p q   >>= k = Choice (p >>= k) (q >>= k)
  Return x     >>= k = k x
  SymbolBind f >>= k = SymbolBind (\ s -> f s >>= k)

Observe that the definition of `(>>=)` was derived from the domain knowledge and monadic laws. We cannot get it wrong!

Transforming parsers of type `Parser0` into parsers of type `Parser1`

Is Parser1 as expressive as Parser0? In other words, can any parser you wrote of type Parser0 be reformulated as a parser of type Parser1?

Yes! We can write a function which transform a Parser0 into a Parser1

cast :: P0.Parser0 s a -> Parser1 s a

To avoid name crashes, all the constructors and types from Parser0 are qualified as P0

Let us see the easy cases.

cast :: P0.Parser0 s a -> Parser1 s a
cast P0.Symbol       = SymbolBind Return -- L1
cast P0.Fail         = Fail
cast (P0.Choice p q) = Choice (cast p) (cast q)
cast (P0.Return x)   = Return x

The core of the translation is bind!

cast (P0.Symbol P0.:>>= k)       = SymbolBind (cast . k)
                                   -- def of SymbolBind

cast (P0.Fail P0.:>>= _)         = Fail
                                   -- Parser law, L4.

cast ((P0.Choice p q) P0.:>>= k) = Choice (cast (p P0.:>>= k))
                                          (cast (q P0.:>>= k))
                                   -- Parser law, L5

cast ((P0.Return x) P0.:>>= k)   = cast (k x)
                                   -- monad law, L1

cast ((p P0.:>>= f) P0.:>>= k)   = cast (p P0.:>>= (\ x -> f x P0.:>>= k))
                                   -- monad law, L3

Observe that for every case, there is some law which helps to derive the translation of bind!

Parser2: improving choice

code

If we observe the run1 function again
```
run1 :: Parser1 s a -> Semantics s a
run1 (SymbolBind k)     [] = []
run1 (SymbolBind k) (s:ss) = run1 (k s) ss
run1 Fail                _ = []
run1 (Choice p q)       ss = (run1 p ss) ++ (run1 q ss)
run1 (Return x)         ss = [(x,ss)]
```
We have another source of inefficiency. Can you see it?

The list append (++) is linear in its first argument which means that left nested applications (+++) get a quadratic behaviour, e.g., consider expressions of the form ((s1 ++ s2) ++ s3).
How can we optimize Choice, i.e. (+++)?

Similar as we did with bind, we remove it from our data type. The choice operator (+++) then takes place when building the program — not when running it!

The new data type for parsers

data Parser2 s a where
  SymbolBind ::  (s -> Parser2 s a) -> Parser2 s a
  Fail       ::  Parser2 s a
  Return     ::  a -> Parser2 s a

Let us try to define (+++)

For Fail is easy due to laws L6 and L7.
```
(+++) :: Parser2 s a -> Parser2 s a -> Parser2 s a
Fail +++ _    = Fail
q    +++ Fail = Fail
```
For SymbolBind, we know by L10 that
```
SymbolBind f +++ SymbolBind q = SymbolBind (\ s -> f s +++ q s)
```
If SymbolBind is combined with Fail instead, we know the result (see L6 and L7).

The tricky case is when SymbolBind is pattern-matched with Return.
```
SymbolBind f +++ Return x     = ?
Return x     +++ SymbolBind f = ?
```
It seems that Return is stopping us from defining (+++). In fact, what is the definition of (+++) when it only deals with Return?
```
Return x +++ Return y = ?
```
For these cases, we therefore introduce a new constructor.
```
ReturnChoice x p ≡ Return x +++ p
```
Observe that, by L7,
```
Return x ≡ ReturnChoice x Fail
```
ReturnChoice can encode Return!

Therefore, let us take the definition of Parser2 and replace Return with ReturnChoice

data Parser2 s a where
    SymbolBind   ::  (s -> Parser2 s a) -> Parser2 s a
    Fail         ::  Parser2 s a
    ReturnChoice ::  a -> Parser2 s a -> Parser2 s a

Let us now define (+++) by using parser laws, commutative, and associative laws.

(+++) :: Parser2 s a -> Parser2 s a -> Parser2 s a
SymbolBind f     +++ SymbolBind g     = SymbolBind (\ s -> f s +++ g s)
                                        -- L10
p                +++ Fail             = p
                                        -- L6
Fail             +++ q                = q
                                        -- L7

ReturnChoice x p +++ q                = ?

p                +++ ReturnChoice x q = ?

We derive the tricky cases.

By definition of ReturnChoice, we have that

ReturnChoice x p +++ q ≡ (Return x +++ p) +++ q

By L8 (associativity of (+++)), we have that

(Return x +++ p) +++ q ≡ Return x +++ (p +++ q)

By the definition of ReturnChoice, we obtain

Return x +++ (p +++ q) ≡ ReturnChoice x (p +++ q)

Exercise: derive the definition for `p +++ ReturnChoice x q`

(+++) :: Parser2 s a -> Parser2 s a -> Parser2 s a
SymbolBind f     +++ SymbolBind g     = SymbolBind (\ s -> f s +++ g s)
                                        -- L10
p                +++ Fail             = p
                                        -- L6
Fail             +++ q                = q
                                        -- L7

ReturnChoice x p +++ q                = ReturnChoice x (p +++ q)

p                +++ ReturnChoice x q = ReturnChoice x (p +++ q)

So, we obtain (+++) defined, but we should fix (>>=) since we replaced Return with ReturnChoice

{- | Monadic instance for Parser2 -}
instance Monad (Parser2 s) where
  return x = ReturnChoice x Fail
  Fail             >>= k = Fail
  (SymbolBind f)   >>= k = SymbolBind (\ s -> f s >>= k)
  ReturnChoice x p >>= k = ?

By definition of ReturnChoice, we have that

ReturnChoice x p >>= k ≡ (Return x +++ p) >>= k

By L5, we have that

(Return x +++ p) >>= k ≡ (Return x >>= k) +++ (p >>= k)

By L1 (Left Identity), we conclude that

(Return x >>= k) +++ (p >>= k) ≡ k x +++ (p >>= k)

So, we obtain that

ReturnChoice x p >>= k = k x +++ (p >>= k)

{- | Monadic instance for Parser2 -}
instance Monad (Parser2 s) where
  return x = ReturnChoice x Fail
  Fail             >>= k = Fail
  (SymbolBind f)   >>= k = SymbolBind (\ s -> f s >>= k)
  ReturnChoice x p >>= k = k x +++ (p >>= k)

We have completed the implementation of (+++) and (>>=) which gets computed when constructing parsers — not when running them!

Let us see the run function.

We take run1 for parsers of type Parser1, remove the case for Choice, and see what happens when placing ReturnChoice in the place of Return.

run2 :: Parser2 s a -> Semantics s a
run2 (SymbolBind k)     [] = []
run2 (SymbolBind k) (s:ss) = run2 (k s) ss
run2 Fail                _ = []
run2 (ReturnChoice x p) ss = ?

We are going to try deriving the definition. However, since we are dealing with the run function, we need to consider the ideal semantics of parsers.

By definition of ReturnChoice, we obtain that

[| ReturnChoice x p |] ss ≡  [| Return x +++ p |] ss

By the semantics of (+++), we obtain that

[| Return x +++ p |] ss ≡ [| Return x |] ss \/ [| p |] ss

By the semantics of Return x, we have that

[| Return x |] ss \/ [| p |] ss ≡ {| (x, ss) |} \/ [| p |] ss

Summarizing, we have that

[| ReturnChoice x p |] ss ≡ {| (x, ss) |} \/ [| p |] ss

So, we complete the definition of run2 as follows.

run2 :: Parser2 s a -> Semantics s a
run2 (SymbolBind k)     [] = []
run2 (SymbolBind k) (s:ss) = run2 (k s) ss
run2 Fail                _ = []
run2 (ReturnChoice x p) ss = (x, ss) : run2 p ss

Transforming parsers of type `Parser1` into parsers of type `Parser2`

Is Parser2 as expressive as Parser1? In other words, can any parser you wrote of type Parser1 be reformulated as a parser of type Parser2?

Yes! We can write a function which transforms a Parser1 into a Parser2
```
cast2 :: Parser1 s a -> Parser2 s a
```
Exercise: write `cast2`

Parser3: optimizing `(>>=)`

There is still one remaining source of inefficiency.
If you look at the definition of (>>=), you'll see that it is linear in the size of its first argument.

This means that we get a similar problem to the use of (++), namely a quadratic behaviour for left nested uses of (>>=).
In order to fix this we cannot use the method we've been using so far, there is no constructor to remove to fix the problem. Instead, we have to use another technique, called a "context passing" implementation.
Read more about it in the paper.

Summary

Parsers laws ≡ domain knowledge
Detect inefficiencies and introduce changes
Derivation to synthesize the right code (no hacking!)
- Leveraging domain and monad laws

Parser derivation

Parsers and Functional Programming

Monadic parsers

This lecture

A simple EDSL for parsing

Reference semantics

Parser0: our first implementation

Parser1: removing bind

Transforming parsers of type Parser0 into parsers of type Parser1

Parser2: improving choice

Transforming parsers of type Parser1 into parsers of type Parser2

Parser3: optimizing (>>=)

Summary

Transforming parsers of type `Parser0` into parsers of type `Parser1`

Transforming parsers of type `Parser1` into parsers of type `Parser2`

Parser3: optimizing `(>>=)`