Parser derivation
Parsers and Functional Programming
Parsers in functional languages have been studied extensively.
A parser can be modeled as a function of, for instance, the following type
type Parser a = String -> (a, String)
A parser takes a string to parse, parses it, and returns a structure of type `a` together with the unconsumed suffix of the input string.
Functional programming is a natural fit for building parsers: they are functions!
In fact, parsers are DSLs in Haskell.
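As a small illustration of the type above, here is a minimal sketch of two parsers written directly as functions. The names `number` and `pairP` are hypothetical and not part of the lecture; note that this simple type cannot signal failure, so the sketch falls back to a default value.

```haskell
import Data.Char (isDigit)

-- The type from the text: a result plus the unconsumed suffix.
type Parser a = String -> (a, String)

-- Hypothetical example parser: reads the leading decimal digits as a number.
-- This type has no way to report failure, so 0 is returned when the input
-- does not start with a digit (an assumption made only for this sketch).
number :: Parser Int
number input =
  let (digits, rest) = span isDigit input
  in (if null digits then 0 else read digits, rest)

-- Hypothetical combinator: run two parsers in sequence and pair the results.
-- The second parser starts where the first one stopped.
pairP :: Parser a -> Parser b -> Parser (a, b)
pairP p q input =
  let (x, rest1) = p input
      (y, rest2) = q rest1
  in ((x, y), rest2)
```

For instance, `number "42abc"` splits the input into the number 42 and the suffix `"abc"`.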
Monadic parsers
Monadic parsers are powerful enough to describe context-sensitive grammars.
The grammar itself can depend on the input!
(>>=) :: Parser a -> (a -> Parser b) -> Parser b
Can you see the dependency in the type of bind?
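The dependency can be made concrete with a length-prefixed format, which is context-sensitive: the first digit decides how many symbols follow. The sketch below assumes a failure-aware parser type based on `Maybe`; that type and the names `item`, `digit`, and `lengthPrefixed` are illustrative assumptions, not the lecture's definitions.

```haskell
import Data.Char (isDigit, digitToInt)

-- Assumed parser type for this sketch: Nothing signals failure.
newtype Parser a = Parser { runParser :: String -> Maybe (a, String) }

instance Functor Parser where
  fmap f (Parser p) = Parser $ \s -> case p s of
    Nothing      -> Nothing
    Just (a, s') -> Just (f a, s')

instance Applicative Parser where
  pure a = Parser $ \s -> Just (a, s)
  Parser pf <*> Parser pa = Parser $ \s -> case pf s of
    Nothing      -> Nothing
    Just (f, s') -> case pa s' of
      Nothing       -> Nothing
      Just (a, s'') -> Just (f a, s'')

instance Monad Parser where
  Parser p >>= f = Parser $ \s -> case p s of
    Nothing      -> Nothing
    Just (a, s') -> runParser (f a) s'

-- Consume exactly one character.
item :: Parser Char
item = Parser $ \s -> case s of
  []     -> Nothing
  (c:cs) -> Just (c, cs)

-- Consume one decimal digit and return its value.
digit :: Parser Int
digit = Parser $ \s -> case s of
  (c:cs) | isDigit c -> Just (digitToInt c, cs)
  _                  -> Nothing

-- The parser that runs after 'digit' is built from the parsed value n:
-- this is the dependency visible in the type of bind.
lengthPrefixed :: Parser String
lengthPrefixed = digit >>= \n -> sequence (replicate n item)
```

Here the continuation passed to `(>>=)` receives `n` and only then decides which parser to run next.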
Efficiency of parsers often revolves around the implementation of the choice operator
(+++) :: Parser a -> Parser a -> Parser a
The parser does not know which option to follow, so it is common to simply try all the possibilities and backtrack when needed.
In a monadic parser, we need to wait for one of the parsers to succeed or fail, keeping the input around in the meantime (memory leak?).
Related work
Weaker notion of sequencing than monads (Arrows by John Hughes)
- It only supports context-free grammars
Specially designed choice combinators: asymmetric choice, where the right-hand parser only runs if the left-hand one fails; deterministic choice, where one symbol of lookahead can resolve which parser to use.
Deterministic parsers: Deterministic, Error-Correcting Combinator Parsers by S. Swierstra and L. Duponcheel
Non-deterministic parsers: Combinator Parsers: From Toys to Tools by S. Swierstra
In all of the work mentioned above, there is a general choice combinator which still suffers from inefficiencies.
This lecture
It is based on the article FUNCTIONAL PEARL Parallel Parsing Processes by K. Claessen
Programming a monadic parsing library
- Efficient choice combinator (breadth-first rather than depth-first search)
- Non-deterministic grammars
No special annotation
Derived!
Used by GHC's `Read` type class via `Text.ParserCombinators.ReadP`.
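To get a feel for the library mentioned above, here is a minimal sketch that uses `Text.ParserCombinators.ReadP` from `base`. The grammars `prefixes` and `number` are hypothetical examples chosen for illustration.

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP
  (ReadP, readP_to_S, munch1, string, eof, (+++))

-- Hypothetical ambiguous grammar: both alternatives can match a prefix of
-- the input, and the symmetric choice (+++) keeps both results.
prefixes :: ReadP String
prefixes = string "a" +++ string "ab"

-- Hypothetical grammar: a natural number (munch1 takes the longest match).
number :: ReadP Int
number = read <$> munch1 isDigit

-- readP_to_S returns every (result, remaining input) pair.
allPrefixes :: [(String, String)]
allPrefixes = readP_to_S prefixes "abc"

-- Requiring end of input leaves only complete parses.
wholeNumber :: [(Int, String)]
wholeNumber = readP_to_S (number <* eof) "123"
```

Evaluating `allPrefixes` yields two parses, one for each alternative, without any annotation telling the library which branch to prefer.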
A simple EDSL for parsing
A parser `Parser s a` takes a stream of symbols of type `s`, parses it, and produces a value of type `a`.
API for the parsers
{-- Type --}
data Parser s a
{-- Constructors --}
symbol :: Parser s s
fail   :: Parser s a
return :: a -> Parser s a
{-- Combinators --}
(+++)  :: Parser s a -> Parser s a -> Parser s a
(:>>=) :: Parser s a -> (a -> Parser s b) -> Parser s b
Function `symbol` returns the next symbol. Function `fail` aborts parsing. Functions `return` and `(:>>=)` are the monadic primitives. Function `(+++)` is the choice operator.
Before we get into the implementation details, let us describe some laws that we expect from the API; that way, we can detect early on whether we are doing things right!
This is the **domain knowledge**, which we will use later to gain performance!
Monads
Laws for the monadic primitives:
- **L1** (Left Identity): return a >>= f ≡ f a
- **L2** (Right Identity): p >>= return ≡ p
- **L3** (Associativity): (p >>= f) >>= g ≡ p >>= (\x -> f x >>= g)
Bind distributes over choice
Laws for `fail`, `(+++)`, and `(>>=)`:
- **L4**: fail >>= f ≡ fail
- **L5**: (p +++ q) >>= f ≡ (p >>= f) +++ (q >>= f)
Choice forms a commutative monoid with unit `fail`
More laws on `fail` and `(+++)`:
- **L6** (Left unit): fail +++ q ≡ q
- **L7** (Right unit): p +++ fail ≡ p
- **L8** (Associativity): (p +++ q) +++ r ≡ p +++ (q +++ r)
- **L9** (Commutativity): p +++ q ≡ q +++ p
Key law for efficiency!
Laws on `(>>=)`, `(+++)`, and `symbol`:
- **L10**: (symbol >>= f) +++ (symbol >>= g) ≡ symbol >>= (\c -> f c +++ g c)
Reference semantics
Any semantics we associate to elements of type `Parser s a` must obey the laws shown above.
We take a reference semantics, i.e., a semantics that we will compare our implementation against in order to see if our implementation is correct.
The semantic function `[| _ |]`, also called `run`, is defined as follows (we use `{| |}` to denote multisets and `\/` for multiset union).
[| _ |] :: Parser s a -> [s] -> {| (a, [s]) |}
[| symbol |] (c : s) = {| (c, s) |}
[| symbol |] []      = {| |}
[| fail |] s         = {| |}
[| p +++ q |] s      = [| p |] s \/ [| q |] s
[| return a |] s     = {| (a, s) |}
[| p >>= f |] s      = {| (b, s_f) | (a, s_p) <- [| p |] s
                                   , (b, s_f) <- [| f a |] s_p |}
Using this semantics we can prove (exercise) the laws about parsers given before.
For instance, here is the proof of L10 for the case of a non-empty input string:
   [| (symbol >>= f) +++ (symbol >>= g) |] (c:s)
== { Def. of [| p +++ q |] }
   [| symbol >>= f |] (c:s) \/ [| symbol >>= g |] (c:s)
== { Def. of [| p >>= f |] and [| symbol |] }
   [| f c |] s \/ [| g c |] s
== { Def. of [| p +++ q |] "backwards" }
   [| f c +++ g c |] s
== { Def. of [| p >>= f |] and [| symbol |] "backwards" }
   [| symbol >>= (\c -> f c +++ g c) |] (c:s)
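Besides proving a law, one can also test it on sample inputs. The following is a minimal sketch, assuming a shallow list-based rendering of the reference semantics in which lists stand in for multisets and are compared after sorting. The names `P`, `ret`, `bind`, `sameOn`, and the sample continuations `f` and `g` are hypothetical.

```haskell
import Data.List (sort)

-- Assumed shallow embedding: a parser is its list-of-results semantics.
newtype P s a = P { runP :: [s] -> [(a, [s])] }

symbol :: P s s
symbol = P $ \inp -> case inp of
  []     -> []
  (c:cs) -> [(c, cs)]

pfail :: P s a
pfail = P (const [])

ret :: a -> P s a
ret a = P $ \inp -> [(a, inp)]

(+++) :: P s a -> P s a -> P s a
P p +++ P q = P $ \inp -> p inp ++ q inp

bind :: P s a -> (a -> P s b) -> P s b
bind (P p) k = P $ \inp -> [ r | (a, rest) <- p inp, r <- runP (k a) rest ]

-- Multiset equality of the results of two parsers on one input.
sameOn :: (Ord s, Ord a) => [s] -> P s a -> P s a -> Bool
sameOn inp p q = sort (runP p inp) == sort (runP q inp)

-- Sample continuations used to instantiate the laws (hypothetical).
f, g :: Char -> P Char String
f c = ret [c]
g c = symbol `bind` \d -> ret [c, d]

-- L10: (symbol >>= f) +++ (symbol >>= g) ≡ symbol >>= (\c -> f c +++ g c)
lawL10 :: String -> Bool
lawL10 inp = sameOn inp ((symbol `bind` f) +++ (symbol `bind` g))
                        (symbol `bind` (\c -> f c +++ g c))

-- L9: p +++ q ≡ q +++ p, and L6: fail +++ q ≡ q, for sample parsers.
lawL9, lawL6 :: String -> Bool
lawL9 inp = sameOn inp (p +++ q) (q +++ p)
  where p = symbol `bind` f
        q = symbol `bind` g
lawL6 inp = sameOn inp (pfail +++ q) q
  where q = symbol `bind` g
```

Checking `all lawL10 ["", "a", "ab", "abc"]` exercises both the empty and non-empty cases of the proof; a property-based testing library could generate the inputs instead.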
Exercise: prove or test the rest of the laws.
The reference semantics is useful for reasoning, but inefficient.
We can identify the following sources of possible inefficiency (source: reason):
- Definition of `(+++)`: union of bags
- Definition of `(>>=)`: creation of many intermediate results (e.g., `[| p |] s` and `[| f a |] s_p`)
Parser0: our first implementation
Every constructor and combinator is a constructor in the `Parser` data type.
data Parser0 s a where
  {-- Constructors --}
  Symbol :: Parser0 s s
  Fail   :: Parser0 s a
  {-- Combinators --}
  Choice :: Parser0 s a -> Parser0 s a -> Parser0 s a
  Return :: a -> Parser0 s a
  (:>>=) :: Parser0 s a -> (a -> Parser0 s b) -> Parser0 s b
We call it `Parser0` since it is our first attempt.
Constructors and combinators (trivial)
{- | Constructors -}
symbol = Symbol
pfail  = Fail
{- | Combinators -}
(+++)  = Choice
Monadic operations (trivial)
instance Monad (Parser0 s) where
  return = Return
  (>>=)  = (:>>=)
What about our `run` function?
To start with, and for simplicity, we use lists instead of bags to denote the semantics of parsers.
type Semantics s a = [s] -> [(a,[s])]
The run function maps the constructors to their semantics.
run0 :: Parser0 s a -> Semantics s a
run0 Symbol       []     = []
run0 Symbol       (s:ss) = [(s,ss)]
run0 Fail         _      = []
run0 (Choice p q) ss     = (run0 p ss) ++ (run0 q ss)
run0 (Return x)   ss     = [(x,ss)]
run0 (p :>>= f)   ss     = [(y,s2) | (x,s1) <- run0 p ss, (y,s2) <- run0 (f x) s1]
We have the same sources of inefficiency as the reference semantics! (i.e., definition of `Choice` and `(:>>=)`)
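The pieces of `Parser0` can be assembled into one loadable module. This is a sketch under two assumptions: the `GADTs` extension is enabled, and `Functor` and `Applicative` instances are added because current GHC versions require them for every `Monad`. The example parser `oneOrTwo` is hypothetical.

```haskell
{-# LANGUAGE GADTs #-}

data Parser0 s a where
  Symbol :: Parser0 s s
  Fail   :: Parser0 s a
  Choice :: Parser0 s a -> Parser0 s a -> Parser0 s a
  Return :: a -> Parser0 s a
  (:>>=) :: Parser0 s a -> (a -> Parser0 s b) -> Parser0 s b

-- Added for this sketch: required superclass instances of Monad.
instance Functor (Parser0 s) where
  fmap f p = p :>>= (Return . f)

instance Applicative (Parser0 s) where
  pure = Return
  pf <*> pa = pf :>>= \f -> fmap f pa

instance Monad (Parser0 s) where
  (>>=) = (:>>=)

type Semantics s a = [s] -> [(a, [s])]

-- Each constructor is mapped to its list-based semantics.
run0 :: Parser0 s a -> Semantics s a
run0 Symbol       []     = []
run0 Symbol       (s:ss) = [(s, ss)]
run0 Fail         _      = []
run0 (Choice p q) ss     = run0 p ss ++ run0 q ss
run0 (Return x)   ss     = [(x, ss)]
run0 (p :>>= f)   ss     = [ (y, s2) | (x, s1) <- run0 p ss
                                     , (y, s2) <- run0 (f x) s1 ]

-- Hypothetical example: accept either one symbol or two symbols.
oneOrTwo :: Parser0 s [s]
oneOrTwo = Choice (do { x <- Symbol; return [x] })
                  (do { x <- Symbol; y <- Symbol; return [x, y] })
```

Running `run0 oneOrTwo "abc"` produces one result per alternative, which shows how `Choice` appends the result lists of both branches.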
Parser1: removing bind
The use of list comprehension in `run0 (p :>>= f)` builds a lot of intermediate lists, which might be costly.
How do we simplify it?
We move towards an intermediate representation, where the bind takes place when constructing the program, not when running it!
Methodology:
- Remove `(:>>=)` from the data type
- Try to define `(>>=)` anyway, and analyze the usage patterns which we cannot write
- Introduce new constructors to capture such cases
- Simplify the data type with the new constructors and derive the definition for `(>>=)`
Let us try to define `(>>=)`
Fail >>= k
By L4, we know that
Fail >>= k ≡ Fail
Success! (Nothing to be done)
Choice p q >>= f
By L5, we know that
Choice p q >>= f ≡ Choice (p >>= f) (q >>= f)
Success! (Nothing to be done)
Return x >>= f
The first monad law (L1) already tells us that this is just `f x`.
Success! (Nothing to be done)
Symbol >>= k
There is no L-rule for this case! Let us capture this usage pattern in a new constructor
SymbolBind k ≡ Symbol >>= k
Observe that
k :: s -> Parser0 s a
Therefore, we have that
SymbolBind :: (s -> Parser0 s a) -> Parser0 s a
We obtain `Parser1` from the definition of `Parser0`, where `SymbolBind` gets introduced and `(:>>=)` gets removed.
data Parser1 s a where
  SymbolBind :: (s -> Parser1 s a) -> Parser1 s a
  Fail       :: Parser1 s a
  Choice     :: Parser1 s a -> Parser1 s a -> Parser1 s a
  Return     :: a -> Parser1 s a
Observe that there is no `(:>>=)`.
What about the `run` function?
run1 :: Parser1 s a -> Semantics s a
run1 Fail           _  = []
run1 (Choice p q)   ss = run1 p ss ++ run1 q ss
run1 (Return x)     ss = [(x,ss)]
run1 (SymbolBind k) ss = ?
It is mainly as before, but the intermediate results generated by `(:>>=)` are not there.
In the definition of `run1`, the new interesting case is
run1 (SymbolBind k) ss = ?
We are going to derive it by using the reference semantics.
We know, by the definition of `SymbolBind`, that
[| SymbolBind k |] = [| symbol >>= k |]
By the reference semantics of `(>>=)`, we have that
[| symbol >>= k |] ss = {| (b, ss_k) | (a, ss_p) <- [| symbol |] ss , (b, ss_k) <- [| k a |] ss_p |}
So, we have two cases:
[| symbol >>= k |] [] = {| (b, ss_k) | (a, ss_p) <- [| symbol |] [] , (b, ss_k) <- [| k a |] ss_p |}
By the reference semantics of `symbol`, we have
[| symbol >>= k |] [] = {| (b, ss_k) | (a, ss_p) <- {| |} , (b, ss_k) <- [| k a |] ss_p |}
By multiset comprehension, we conclude that
[| symbol >>= k |] [] = {| |}
On the other hand, we have the following equation
[| symbol >>= k |] (s:ss) = {| (b, ss_k) | (a, ss_p) <- [| symbol |] (s:ss) , (b, ss_k) <- [| k a |] ss_p |}
By applying the reference semantics of `symbol`, we have that
[| symbol >>= k |] (s:ss) = {| (b, ss_k) | (a, ss_p) <- {| (s, ss) |} , (b, ss_k) <- [| k a |] ss_p |}
By multiset comprehension, we conclude that
[| symbol >>= k |] (s:ss) = {| (b, ss_k) | (b, ss_k) <- [| k s |] ss |}
which, by multiset comprehension, is equivalent to
[| symbol >>= k |] (s:ss) = [| k s |] ss
To summarize, we obtain
[| symbol >>= k |] []     = {| |}
[| symbol >>= k |] (s:ss) = [| k s |] ss
Therefore, we conclude that
run1 (SymbolBind k) []     = []
run1 (SymbolBind k) (s:ss) = run1 (k s) ss
Constructors and combinators? (the non-proper morphisms are mainly as before)
{- | Constructors -}
symbol = SymbolBind Return
pfail  = Fail
{- | Combinators -}
(+++)  = Choice
Observe that `symbol` is defined as `SymbolBind Return`. Is it true that `symbol` only extracts a symbol from the input?
SymbolBind Return ≡ Symbol >>= Return
By L2 (Right Identity), we know that
SymbolBind Return ≡ Symbol
So, `SymbolBind Return` corresponds to the notion of `Symbol` in `Parser0`!
What about `return`?
Function `return` is just as before.
return = Return
The interesting case is `(>>=)`.
How are we going to define it?
So far, we have that
Fail       >>= k = Fail
Choice p q >>= f = Choice (p >>= f) (q >>= f)
Return a   >>= k = k a
What about our recently introduced constructor (`SymbolBind`)?
SymbolBind f >>= k = ?
By definition of `SymbolBind`, we know that
SymbolBind f >>= k ≡ (Symbol >>= f) >>= k
By L3 (Associativity of monads), we have that
(Symbol >>= f) >>= k ≡ Symbol >>= (\s -> f s >>= k)
By our definition of `SymbolBind`, we have that
Symbol >>= (\s -> f s >>= k) ≡ SymbolBind (\s -> f s >>= k)
So, we finally have that
SymbolBind f >>= k = SymbolBind (\s -> f s >>= k)
We can now define `Parser1` as a monad
{- | Monadic instance for Parser1 -}
instance Monad (Parser1 s) where
  return = Return
  Fail         >>= k = Fail
  Choice p q   >>= k = Choice (p >>= k) (q >>= k)
  Return x     >>= k = k x
  SymbolBind f >>= k = SymbolBind (\s -> f s >>= k)
Observe that the definition of `(>>=)` was derived from the domain knowledge and monadic laws. We cannot get it wrong!
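Putting the derived definitions together gives a module that can be loaded and tried out. The sketch assumes the `GADTs` extension and adds the `Functor` and `Applicative` instances that current GHC versions require; the example parser `twoSymbols` is hypothetical.

```haskell
{-# LANGUAGE GADTs #-}

data Parser1 s a where
  SymbolBind :: (s -> Parser1 s a) -> Parser1 s a
  Fail       :: Parser1 s a
  Choice     :: Parser1 s a -> Parser1 s a -> Parser1 s a
  Return     :: a -> Parser1 s a

-- Added for this sketch: required superclass instances of Monad.
instance Functor (Parser1 s) where
  fmap f p = p >>= (Return . f)

instance Applicative (Parser1 s) where
  pure = Return
  pf <*> pa = pf >>= \f -> fmap f pa

-- Bind is resolved while the parser is being built, following L4, L5, L1
-- and the derivation for SymbolBind.
instance Monad (Parser1 s) where
  Fail         >>= _ = Fail
  Choice p q   >>= k = Choice (p >>= k) (q >>= k)
  Return x     >>= k = k x
  SymbolBind f >>= k = SymbolBind (\s -> f s >>= k)

symbol :: Parser1 s s
symbol = SymbolBind Return

type Semantics s a = [s] -> [(a, [s])]

-- No list comprehension is needed any more: there is no bind constructor.
run1 :: Parser1 s a -> Semantics s a
run1 (SymbolBind _) []     = []
run1 (SymbolBind k) (s:ss) = run1 (k s) ss
run1 Fail           _      = []
run1 (Choice p q)   ss     = run1 p ss ++ run1 q ss
run1 (Return x)     ss     = [(x, ss)]

-- Hypothetical example: read two symbols and pair them.
twoSymbols :: Parser1 s (s, s)
twoSymbols = do { x <- symbol; y <- symbol; return (x, y) }
```

Note that `twoSymbols` is already a nest of `SymbolBind` constructors before any input is seen; `run1` only walks that structure.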
Transforming parsers of type Parser0 into parsers of type Parser1
Is `Parser1` as expressive as `Parser0`? In other words, can any parser you wrote of type `Parser0` be reformulated as a parser of type `Parser1`?
Yes! We can write a function which transforms a `Parser0` into a `Parser1`
cast :: P0.Parser0 s a -> Parser1 s a
To avoid name clashes, all the constructors and types from `Parser0` are qualified as `P0`.
Let us see the easy cases.
cast :: P0.Parser0 s a -> Parser1 s a
cast P0.Symbol       = SymbolBind Return -- monad law, L2
cast P0.Fail         = Fail
cast (P0.Choice p q) = Choice (cast p) (cast q)
cast (P0.Return x)   = Return x
The core of the translation is bind!
cast (P0.Symbol P0.:>>= k)         = SymbolBind (cast . k) -- def of SymbolBind
cast (P0.Fail P0.:>>= _)           = Fail -- Parser law, L4
cast ((P0.Choice p q) P0.:>>= k)   = Choice (cast (p P0.:>>= k)) (cast (q P0.:>>= k)) -- Parser law, L5
cast ((P0.Return x) P0.:>>= k)     = cast (k x) -- monad law, L1
cast ((p P0.:>>= f) P0.:>>= k)     = cast (p P0.:>>= (\x -> f x P0.:>>= k)) -- monad law, L3
Observe that for every case, there is some law which helps to derive the translation of bind!
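The translation can be tried out in a single module. To keep the sketch self-contained, the `Parser0` constructors are renamed with a `0` suffix (and `(:>>=)` becomes `Bind0`) instead of using the qualified import `P0`; this renaming and the sample parser `sample0` are assumptions made only for this example.

```haskell
{-# LANGUAGE GADTs #-}

-- Parser0 with renamed constructors (assumption: avoids the P0 qualifier).
data Parser0 s a where
  Symbol0 :: Parser0 s s
  Fail0   :: Parser0 s a
  Choice0 :: Parser0 s a -> Parser0 s a -> Parser0 s a
  Return0 :: a -> Parser0 s a
  Bind0   :: Parser0 s a -> (a -> Parser0 s b) -> Parser0 s b

data Parser1 s a where
  SymbolBind :: (s -> Parser1 s a) -> Parser1 s a
  Fail       :: Parser1 s a
  Choice     :: Parser1 s a -> Parser1 s a -> Parser1 s a
  Return     :: a -> Parser1 s a

cast :: Parser0 s a -> Parser1 s a
cast Symbol0                 = SymbolBind Return                  -- L2
cast Fail0                   = Fail
cast (Choice0 p q)           = Choice (cast p) (cast q)
cast (Return0 x)             = Return x
cast (Bind0 Symbol0 k)       = SymbolBind (cast . k)              -- def of SymbolBind
cast (Bind0 Fail0 _)         = Fail                               -- L4
cast (Bind0 (Choice0 p q) k) = Choice (cast (Bind0 p k))
                                      (cast (Bind0 q k))          -- L5
cast (Bind0 (Return0 x) k)   = cast (k x)                         -- L1
cast (Bind0 (Bind0 p f) k)   = cast (Bind0 p (\x -> Bind0 (f x) k)) -- L3

run1 :: Parser1 s a -> [s] -> [(a, [s])]
run1 (SymbolBind _) []     = []
run1 (SymbolBind k) (s:ss) = run1 (k s) ss
run1 Fail           _      = []
run1 (Choice p q)   ss     = run1 p ss ++ run1 q ss
run1 (Return x)     ss     = [(x, ss)]

-- Hypothetical sample: one symbol, or two symbols via a left-nested bind.
sample0 :: Parser0 s [s]
sample0 = Choice0
  (Bind0 Symbol0 (\x -> Return0 [x]))
  (Bind0 (Bind0 Symbol0 Return0) (\x -> Bind0 Symbol0 (\y -> Return0 [x, y])))
```

The second alternative of `sample0` contains a left-nested bind, so casting it exercises the L3 case before reaching the `Symbol0` case.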
Parser2: improving choice
If we observe the `run1` function again
run1 :: Parser1 s a -> Semantics s a
run1 (SymbolBind k) []     = []
run1 (SymbolBind k) (s:ss) = run1 (k s) ss
run1 Fail           _      = []
run1 (Choice p q)   ss     = (run1 p ss) ++ (run1 q ss)
run1 (Return x)     ss     = [(x,ss)]
We have another source of inefficiency. Can you see it?
The list append `(++)` is linear in its first argument, which means that left-nested applications of `(+++)` exhibit quadratic behaviour; e.g., consider expressions of the form `((s1 ++ s2) ++ s3)`.
How can we optimize `Choice`, i.e. `(+++)`?
As we did with bind, we remove it from our data type. The choice operator `(+++)` then takes place when building the program, not when running it!
The new data type for parsers
data Parser2 s a where
  SymbolBind :: (s -> Parser2 s a) -> Parser2 s a
  Fail       :: Parser2 s a
  Return     :: a -> Parser2 s a
Let us try to define `(+++)`
For `Fail`, it is easy due to laws L6 and L7.
(+++) :: Parser2 s a -> Parser2 s a -> Parser2 s a
Fail +++ q    = q -- L6
p    +++ Fail = p -- L7
For `SymbolBind`, we know by L10 that
SymbolBind f +++ SymbolBind g = SymbolBind (\s -> f s +++ g s)
If `SymbolBind` is combined with `Fail` instead, we know the result (see L6 and L7).
The tricky case is when `SymbolBind` is combined with `Return`.
SymbolBind f +++ Return x = ?
Return x +++ SymbolBind f = ?
It seems that `Return` is stopping us from defining `(+++)`. In fact, what is the definition of `(+++)` when it only deals with `Return`?
Return x +++ Return y = ?
For these cases, we therefore introduce a new constructor.
ReturnChoice x p ≡ Return x +++ p
Observe that, by L7,
Return x ≡ ReturnChoice x Fail
`ReturnChoice` can encode `Return`!
Therefore, let us take the definition of `Parser2` and replace `Return` with `ReturnChoice`
data Parser2 s a where
  SymbolBind   :: (s -> Parser2 s a) -> Parser2 s a
  Fail         :: Parser2 s a
  ReturnChoice :: a -> Parser2 s a -> Parser2 s a
Let us now define `(+++)` by using the parser laws, including commutativity and associativity of choice.
(+++) :: Parser2 s a -> Parser2 s a -> Parser2 s a
SymbolBind f     +++ SymbolBind g     = SymbolBind (\s -> f s +++ g s) -- L10
p                +++ Fail             = p -- L7
Fail             +++ q                = q -- L6
ReturnChoice x p +++ q                = ?
p                +++ ReturnChoice x q = ?
We derive the tricky cases.
By definition of `ReturnChoice`, we have that
ReturnChoice x p +++ q ≡ (Return x +++ p) +++ q
By L8 (associativity of `(+++)`), we have that
(Return x +++ p) +++ q ≡ Return x +++ (p +++ q)
By the definition of `ReturnChoice`, we obtain
Return x +++ (p +++ q) ≡ ReturnChoice x (p +++ q)
Exercise: derive the definition for `p +++ ReturnChoice x q`
(+++) :: Parser2 s a -> Parser2 s a -> Parser2 s a
SymbolBind f     +++ SymbolBind g     = SymbolBind (\s -> f s +++ g s) -- L10
p                +++ Fail             = p -- L7
Fail             +++ q                = q -- L6
ReturnChoice x p +++ q                = ReturnChoice x (p +++ q) -- L8
p                +++ ReturnChoice x q = ReturnChoice x (p +++ q) -- L9 and L8
So, we obtain `(+++)` defined, but we should fix `(>>=)` since we replaced `Return` with `ReturnChoice`
{- | Monadic instance for Parser2 -}
instance Monad (Parser2 s) where
  return x = ReturnChoice x Fail
  Fail             >>= k = Fail
  (SymbolBind f)   >>= k = SymbolBind (\s -> f s >>= k)
  ReturnChoice x p >>= k = ?
By definition of `ReturnChoice`, we have that
ReturnChoice x p >>= k ≡ (Return x +++ p) >>= k
By L5, we have that
(Return x +++ p) >>= k ≡ (Return x >>= k) +++ (p >>= k)
By L1 (Left Identity), we conclude that
(Return x >>= k) +++ (p >>= k) ≡ k x +++ (p >>= k)
So, we obtain that
ReturnChoice x p >>= k = k x +++ (p >>= k)
{- | Monadic instance for Parser2 -}
instance Monad (Parser2 s) where
  return x = ReturnChoice x Fail
  Fail             >>= k = Fail
  (SymbolBind f)   >>= k = SymbolBind (\s -> f s >>= k)
  ReturnChoice x p >>= k = k x +++ (p >>= k)
We have completed the implementation of `(+++)` and `(>>=)`, which get computed when constructing parsers, not when running them!
Let us see the run function.
We take `run1` for parsers of type `Parser1`, remove the case for `Choice`, and see what happens when placing `ReturnChoice` in the place of `Return`.
run2 :: Parser2 s a -> Semantics s a
run2 (SymbolBind k)     []     = []
run2 (SymbolBind k)     (s:ss) = run2 (k s) ss
run2 Fail               _      = []
run2 (ReturnChoice x p) ss     = ?
We are going to try deriving the definition. However, since we are dealing with the run function, we need to consider the reference semantics of parsers.
By definition of `ReturnChoice`, we obtain that
[| ReturnChoice x p |] ss ≡ [| Return x +++ p |] ss
By the semantics of `(+++)`, we obtain that
[| Return x +++ p |] ss ≡ [| Return x |] ss \/ [| p |] ss
By the semantics of `Return x`, we have that
[| Return x |] ss \/ [| p |] ss ≡ {| (x, ss) |} \/ [| p |] ss
Summarizing, we have that
[| ReturnChoice x p |] ss ≡ {| (x, ss) |} \/ [| p |] ss
So, we complete the definition of `run2` as follows.
run2 :: Parser2 s a -> Semantics s a
run2 (SymbolBind k)     []     = []
run2 (SymbolBind k)     (s:ss) = run2 (k s) ss
run2 Fail               _      = []
run2 (ReturnChoice x p) ss     = (x, ss) : run2 p ss
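The derived `Parser2` definitions fit together into the following loadable sketch. It assumes the `GADTs` extension and adds the `Functor` and `Applicative` instances required by current GHC versions; the example parser `oneOrTwo` is hypothetical.

```haskell
{-# LANGUAGE GADTs #-}

data Parser2 s a where
  SymbolBind   :: (s -> Parser2 s a) -> Parser2 s a
  Fail         :: Parser2 s a
  ReturnChoice :: a -> Parser2 s a -> Parser2 s a

-- Choice is resolved while building the parser, following L10, L7, L6,
-- L8, and L9 together with L8.
(+++) :: Parser2 s a -> Parser2 s a -> Parser2 s a
SymbolBind f     +++ SymbolBind g     = SymbolBind (\s -> f s +++ g s)
p                +++ Fail             = p
Fail             +++ q                = q
ReturnChoice x p +++ q                = ReturnChoice x (p +++ q)
p                +++ ReturnChoice x q = ReturnChoice x (p +++ q)

-- Added for this sketch: required superclass instances of Monad.
instance Functor (Parser2 s) where
  fmap f p = p >>= (return . f)

instance Applicative (Parser2 s) where
  pure x = ReturnChoice x Fail
  pf <*> pa = pf >>= \f -> fmap f pa

instance Monad (Parser2 s) where
  Fail             >>= _ = Fail
  SymbolBind f     >>= k = SymbolBind (\s -> f s >>= k)
  ReturnChoice x p >>= k = k x +++ (p >>= k)

symbol :: Parser2 s s
symbol = SymbolBind return

-- The run function no longer uses (++): results are consed one by one.
run2 :: Parser2 s a -> [s] -> [(a, [s])]
run2 (SymbolBind _)     []     = []
run2 (SymbolBind k)     (s:ss) = run2 (k s) ss
run2 Fail               _      = []
run2 (ReturnChoice x p) ss     = (x, ss) : run2 p ss

-- Hypothetical example: accept either one symbol or two symbols.
oneOrTwo :: Parser2 s [s]
oneOrTwo = (do { x <- symbol; return [x] })
       +++ (do { x <- symbol; y <- symbol; return [x, y] })
```

In `oneOrTwo` both alternatives start with `symbol`, so `(+++)` merges them into a single `SymbolBind`: the input is traversed once and the alternatives advance in lockstep.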
Transforming parsers of type Parser1 into parsers of type Parser2
Is `Parser2` as expressive as `Parser1`? In other words, can any parser you wrote of type `Parser1` be reformulated as a parser of type `Parser2`?
Yes! We can write a function which transforms a `Parser1` into a `Parser2`
cast2 :: Parser1 s a -> Parser2 s a
Exercise: write `cast2`
Parser3: optimizing (>>=)
There is still one remaining source of inefficiency.
If you look at the definition of `(>>=)`, you'll see that it is linear in the size of its first argument.
This means that we get a similar problem to the use of `(++)`, namely quadratic behaviour for left-nested uses of `(>>=)`.
To fix this, we cannot use the method we have been using so far: there is no constructor to remove. Instead, we have to use another technique, called a "context passing" implementation.
Read more about it in the paper.
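For orientation, here is a rough sketch of what a context-passing layer over the previous representation could look like. It is an assumption based on the usual continuation-passing technique, not a transcription of the paper: the names `P`, `choiceP`, `Parser3`, `run3`, and `oneOrTwo` are hypothetical, and the `RankNTypes` extension is assumed.

```haskell
{-# LANGUAGE RankNTypes #-}

-- The Parser2 representation, renamed P for this sketch.
data P s a
  = SymbolBind (s -> P s a)
  | Fail
  | ReturnChoice a (P s a)

choiceP :: P s a -> P s a -> P s a
SymbolBind f     `choiceP` SymbolBind g     = SymbolBind (\s -> f s `choiceP` g s)
p                `choiceP` Fail             = p
Fail             `choiceP` q                = q
ReturnChoice x p `choiceP` q                = ReturnChoice x (p `choiceP` q)
p                `choiceP` ReturnChoice x q = ReturnChoice x (p `choiceP` q)

runP :: P s a -> [s] -> [(a, [s])]
runP (SymbolBind _)     []     = []
runP (SymbolBind k)     (s:ss) = runP (k s) ss
runP Fail               _      = []
runP (ReturnChoice x p) ss     = (x, ss) : runP p ss

-- A parser is a function from its context (what happens next) to a P.
newtype Parser3 s a =
  Parser3 { unParser3 :: forall b. (a -> P s b) -> P s b }

instance Functor (Parser3 s) where
  fmap f (Parser3 p) = Parser3 (\k -> p (k . f))

instance Applicative (Parser3 s) where
  pure x = Parser3 (\k -> k x)
  Parser3 pf <*> Parser3 pa = Parser3 (\k -> pf (\f -> pa (k . f)))

-- Bind only composes functions; it never traverses its first argument.
instance Monad (Parser3 s) where
  Parser3 p >>= f = Parser3 (\k -> p (\x -> unParser3 (f x) k))

symbol :: Parser3 s s
symbol = Parser3 (\k -> SymbolBind k)

pfail :: Parser3 s a
pfail = Parser3 (\_ -> Fail)

(+++) :: Parser3 s a -> Parser3 s a -> Parser3 s a
Parser3 p +++ Parser3 q = Parser3 (\k -> p k `choiceP` q k)

-- Running supplies the final context, which simply returns the result.
run3 :: Parser3 s a -> [s] -> [(a, [s])]
run3 (Parser3 p) = runP (p (\x -> ReturnChoice x Fail))

-- Hypothetical example: accept either one symbol or two symbols.
oneOrTwo :: Parser3 s [s]
oneOrTwo = (do { x <- symbol; return [x] })
       +++ (do { x <- symbol; y <- symbol; return [x, y] })
```

In this sketch `(>>=)` does not pattern-match on a data structure at all, which is why left-nested binds no longer cost time proportional to the size of their first argument.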
Summary
Parsers laws ≡ domain knowledge
Detect inefficiencies and introduce changes
Derivation to synthesize the right code (no hacking!)
- Leveraging domain and monad laws