Introduction
Parsing refers to the process of decomposing a string of characters into a structured representation that can be efficiently processed and analyzed. This series covers the fundamentals of parsing and demonstrates how to implement a simple parser in Scala.
Numerous ready-made parsers exist, such as regular expressions and Scala Parser Combinators. However, certain scenarios necessitate the development of a custom parser. Common situations include:
- When you need to parse a language or a domain-specific language (DSL) that is not supported by existing libraries.
- When you need to parse a specific data format, such as a CSV file, a JSON string, or a custom data format.
- When you need to implement a parser that is optimized for your specific use case and requirements, making it easy to maintain and extend.
Besides, you may also want to have some fun learning how parsers work and how to implement them from scratch.
What’s a Parser?
A Parser is just a wrapper around a function that takes a string and returns a structured representation of the input. The output of a parser indicates whether the input was successfully parsed or not, and if it was, what the parsed result was. Specifically, a successful parsing should tell you what the parsed result was, and the remaining input yet to be parsed.
case class Parser[A](parse: String => ParseResult[A])
enum ParseResult[A] {
case Success(value: A, remaining: String)
case Failure(message: String)
}
Parsing a single character
Given the above Parser defintion, we can write helpers to parse different types of data. For instance, we can write a helper that looks for a single character in the given input by a predicate.
object Parser:
/**
* Parse a single character that satisfies a predicate
* @param predicate the predicate to satisfy
* @param expected the expected character
* @return a parser that parses the character
*/
def charWhere(pred: Char => Boolean, expected: String): ParseResult[Char] =
Parser { input =>
input.headOption match
case Some(c) if pred(c) => ParseResult.Success(c, input.tail)
case Some(c) => ParseResult.Failure(s"Expected $expected, but got '${c}'")
case None => ParseResult.Failure(s"Unexpected end of input, expected $expected")
}
charWhere is a Parser that reads a given single character matching a given predicate. If a character matching the specified predicate was found at the head of the given input text, it means the character was parsed successfully. Or else the parsing failed, which of course might have failed because it was not the expected character, or you reached the end of input.
If you are looking to read a single digit, then you can write a digit parser:
/**
* Parse a single-digit character and return it as a Char
* @return a parser that parses the character
*/
def digit(): Parser[Char] = charWhere(_.isDigit, "a digit")
Say, you want to read a single hex digit:
/**
* Parse a single hex digit and return it as a Char
* @return a parser that parses the hex digit
*/
def hexDigit(): Parser[Char] =
charWhere(c => c.isDigit || (c >= 'a' && c <= 'f') || (c >= 'A' || c <= 'F'), "a hex digit")
Parser Combinators
A parser combinator is a function that takes a parser as input and returns a parser. Typical combinators are map and flatMap.
case class Parser[A](parse: String => ParseResult[A]):
/**
* `map` lets us transform the parsed value.
* If we have a Parser[Int], we can turn it into a Parser[String] etc.
*/
def map[B](f: A => B): Parser[B] =
Parser {
parse(_) match
case ParseResult.Success(value, rest) => ParseResult.Success(f(value), rest)
case ParseResult.Failure(msg) => ParseResult.Failure(msg)
}
/**
* `flatMap` lets us chain parsers: use the result of one to decide
* what to parse next. This is the heart of combining parsers.
*/
def flatMap[B](f: A => Parser[B]): Parser[B] =
Parser {
parse(_) match
case ParseResult.Success(value, rest) => f(value).parse(rest)
case ParseResult.Failure(msg) => ParseResult.Failure(msg)
}
As you can see, map works on the current (this) parser object and returns another parser depending on the specified transformer function f: A => B; likewise, the flatMap.
We can write other combinators:
OrElse
Tries to parse one, and if it fails, parses with the other. See the infix qualifier, which makes us write p1 orElse p2. Also, note the | that lets us write p1 | p2.
/**
* `orElse` lets us try one parser, and if it fails, try another.
* @param other the parser to try
* @return a parser that parses the orElse parser
*/
infix def orElse(other: => Parser[A]): Parser[A] =
Parser { input =>
parse(input) match
case s @ ParseResult.Success(_, _) => s
case _: ParseResult.Failure[?] => other.parse(input)
}
infix def |(other: => Parser[A]): Parser[A] = orElse(other)
~> - Sequence and keep right
/**
* `~>` sequences two parsers but keeps only the RIGHT result.
* Useful for skipping delimiters like '-'.
* @param next the parser to sequence
* @return a parser that parses the sequence
*/
def ~>[B](next: Parser[B]): Parser[B] = flatMap(_ => next)
~> - Sequence and keep left
/**
* `<~` sequences two parsers but keeps only the LEFT result.
* @param next the parser to sequence
* @return a parser that parses the sequence
*/
def <~[B](next: Parser[B]): Parser[A] =
for
a <- this
_ <- next
yield a
We can also write a helper combinator that optionally parses with a given parser and returns an Option[A] in the resultant parser:
object Parser:
/**
* Parse an optional parser
* @param p the parser to parse
* @return a parser that parses the optional parser
*/
def optional[A](p: Parser[A]): Parser[Option[A]] =
Parser { input =>
p.parse(input) match
case ParseResult.Failure(message) => ParseResult.Success(None, input)
case ParseResult.Success(value, rest) => ParseResult.Success(Some(value), rest)
}
Parsing multiple characters
So far, we have been writing parsers that read a single character! How to write a parser that reads multiple characters in one go?
Read n Characters
/**
* Repeat a parser exactly N times, collecting results into a List
* @param n the number of times to repeat the parser
* @param p the parser to repeat
* @return a parser that parses the repeated parser
*/
def repeatN[A](n: Int, p: Parser[A]): Parser[List[A]] =
if n <= 0 then Parser(input => ParseResult.Success(List.empty, input))
else
for
head <- p
tail <- repeatN(n - 1, p)
yield head :: tail
repeatN reads n characters using the parser p. That means the parsing will either succeed in reading n characters or fail. However, there are situations when you do not want to fail but end up with empty parsed output.
Read zero or more characters
/**
* Repeat a parser zero or more times (never fails)
* @param p the parser to repeat
* @return a parser that parses the repeated parser
*/
def zeroOrMore[A](p: Parser[A]): Parser[List[A]] =
Parser { input =>
p.parse(input) match
case ParseResult.Failure(_) => ParseResult.Success(List.empty, input)
case ParseResult.Success(value, rest) =>
zeroOrMore(p).parse(rest) match
case ParseResult.Failure(_) => ParseResult.Success(List(value), rest)
case ParseResult.Success(values, rest) => ParseResult.Success(value :: values, rest)
}
Read one or more characters
When you have zeroOrMore, it is easy to write a oneOrMore:
/**
* Repeat a parser one or more times (fails if zero matches)
* @param p the parser to repeat
* @return a parser that parses the repeated parser
*/
def oneOrMore[A](p: Parser[A]): Parser[List[A]] =
for
head <- p
tail <- zeroOrMore(p)
yield head :: tail
Now, it is trivial to write a parser to parse a series of digits:
/**
* Parse exactly N digits and convert them to an Int
* @param n the number of digits to parse
* @return a parser that parses the digits
*/
def digits(n: Int): Parser[Int] = repeatN(n, digit).map(_.mkString.toInt) // use of map combinator
Likewise, you can write a parser for hex digits:
/**
* Parse exactly N hex digits and return them as a String
* @param n the number of hex digits to parse
* @return a parser that parses the hex digits
*/
def hexDigits(n: Int): Parser[String] = repeatN(n, hexDigit()).map(_.mkString)
If you don’t know beforehand the number of digits to parse, but want to parse all following digits into an integer:
/**
* Parse one or more digits and convert them to an Int
* @return a parser that parses the digits
*/
def digits(): Parser[Int] = oneOrMore(digit).map(_.mkString.toInt)
Writing a string parser
Having gained some experience writing those parsers above, it should now be trivial to write a string parser that looks for a specific string at the head of the given input text.
/**
* Match a literal string, character by character
* @param s the string to match
* @return a parser that parses the string
*/
def string(s: String): Parser[String] =
s.toList match
case Nil => Parser(input => ParseResult.Success("", input))
case c :: rest =>
for
_ <- char(c)
tail <- string(rest.mkString)
yield s
What do we have so far?
So far, we have built a parser from scratch containing all the building blocks for creating other high-level parsers. You can read the complete definition of Parser from Parser.scala.
In the next post, we will look at some concrete high-level parsers. A quick preview is a Date parser that can be created as follows:
val dateParser: Parser[Date] =
for
year <- digits(4)
_ <- char('-')
month <- digits(2)
...
yield Date(year, month, day)
More in the next post.
Parser is Functional and Lazy
Parser is lazy in the sense that it’s a deferred computation. For instance, when you run the definition of the date parser, no input is consumed. No characters are read. Only a parser value is assembled. It is just a definition of what to do. The actual work is done only when you call .parse(input).
flatMap also adds another layer of laziness. f: A => Parser[B] is only called if the first parser succeeds. So the second parser isn’t even constructed until the first one produces a value. For instance, take a look at repeatN.
Until next post, Chao!