Parsing from scratch - Part 2

In the previous post, we built a parser library from scratch, containing the building blocks for creating higher-level parsers. In this post, let us build a few concrete ones.

Hex Color Parser

Parsing hex colors of the format #xxyyzz (where x, y, and z are hexadecimal digits) is relatively simple.

case class HexColor(red: Int, green: Int, blue: Int)

object HexColor:
  def apply(r: String, g: String, b: String): HexColor =
    new HexColor(
      Integer.parseInt(r, 16),
      Integer.parseInt(g, 16),
      Integer.parseInt(b, 16)
    )
      
  val parser: Parser[HexColor] =
    for {
      _ <- char('#')
      r <- hexDigits(2)
      g <- hexDigits(2)
      b <- hexDigits(2)
    } yield HexColor(r, g, b)

The hex color parser is defined with a for comprehension. The char('#') parser matches the # character at the beginning of the input string. The hexDigits(2) parser matches exactly two hexadecimal digits; we run it three times to read the red, green, and blue components, and the companion apply converts each pair from base 16 to an Int. So far, we have only defined the parser. Here is how to use it:

println( HexColor.parser.parse("#f0f0f0") )  // HexColor(240, 240, 240)
println( HexColor.parser.parse("ffffff") )   // Expected '#', but got 'f'
println( HexColor.parser.parse("#fff") )     // Unexpected end of input, expected a hex digit
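
If you do not have the library from the previous post at hand, here is a minimal sketch of the pieces this example relies on. It is an assumption about the library's shape, not its actual code: ParseResult, the Parser case class, and the error messages are stand-ins chosen to match the examples in this post, and hexColor simply mirrors HexColor.parser.

```scala
// Hypothetical stand-in for the library from the previous post. The names and
// error messages are assumptions chosen to match the examples in this post.
enum ParseResult[+A]:
  case Success(value: A, remaining: String)
  case Failure(message: String)

final case class Parser[A](run: String => ParseResult[A]):
  def parse(input: String): ParseResult[A] = run(input)

  // Sequencing: on success, hand the value and the leftover input to the next parser.
  def flatMap[B](f: A => Parser[B]): Parser[B] =
    Parser { input =>
      run(input) match
        case ParseResult.Success(value, rest) => f(value).run(rest)
        case ParseResult.Failure(message)     => ParseResult.Failure(message)
    }

  // Transform the parsed value without consuming any more input.
  def map[B](f: A => B): Parser[B] =
    flatMap[B](value => Parser(rest => ParseResult.Success(f(value), rest)))

// Match one specific character at the start of the input.
def char(expected: Char): Parser[Char] =
  Parser { input =>
    input.headOption match
      case Some(c) if c == expected => ParseResult.Success(c, input.tail)
      case Some(c) => ParseResult.Failure(s"Expected '$expected', but got '$c'")
      case None    => ParseResult.Failure(s"Unexpected end of input, expected '$expected'")
  }

// Match exactly `count` hexadecimal digits and return them as a string.
def hexDigits(count: Int): Parser[String] =
  Parser { input =>
    val (taken, rest) = input.splitAt(count)
    taken.find(c => Character.digit(c, 16) < 0) match
      case Some(bad) =>
        ParseResult.Failure(s"Expected a hex digit, but got '$bad'")
      case None if taken.length < count =>
        ParseResult.Failure("Unexpected end of input, expected a hex digit")
      case None =>
        ParseResult.Success(taken, rest)
  }

final case class HexColor(red: Int, green: Int, blue: Int)

// Same shape as HexColor.parser above, with the base-16 conversion inlined.
val hexColor: Parser[HexColor] =
  for
    _ <- char('#')
    r <- hexDigits(2)
    g <- hexDigits(2)
    b <- hexDigits(2)
  yield HexColor(Integer.parseInt(r, 16), Integer.parseInt(g, 16), Integer.parseInt(b, 16))
```

In this sketch, hexColor.parse("#f0f0f0") yields a Success holding HexColor(240, 240, 240) with no remaining input, while the two malformed inputs yield a Failure carrying the messages shown in the comments above.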

Date Parser

Let us build a date parser that parses dates of the format YYYY-MM-DD.

case class Date(year: Int, month: Int, day: Int):
  override def toString: String = f"$year%04d-$month%02d-$day%02d"

object Date:
  val parser: Parser[Date] =
    for
      year <- digits(4)
      _    <- char('-')
      month <- digits(2)
      _    <- char('-')
      day  <- digits(2)
    yield Date(year, month, day)

There is something to consider here. The input may be in a valid format, but may not be a real date. For example, 2026-54-46 is in the right format but is not a valid date. The question is, should the parser include validation? Or should validation be external to the parser?

The answer depends on context. Here, we will include validation in the parser, so that a successful parse always yields a real calendar date.

object Date:
  val parser: Parser[Date] =
    for
      year  <- digits(4) // parse 4 digits
      _     <- char('-') // parse literal '-'
      month <- digits(2) // parse 2 digits
      _     <- validate(month >= 1 && month <= 12, "month must be between 1 and 12")
      _     <- char('-') // parse literal '-'
      day   <- digits(2) // parse 2 digits
      _     <- validate(day >= 1 && day <= daysInMonth(month, year), s"Invalid day: $day for month $month")
    yield Date(year, month, day)

  private def validate(condition: Boolean, message: String): Parser[Unit] =
    Parser { input =>
      if condition then ParseResult.Success((), input)
      else ParseResult.Failure(message)
    }

  private def daysInMonth(month: Int, year: Int): Int =
    month match
      case 1 | 3 | 5 | 7 | 8 | 10 | 12 => 31
      case 4 | 6 | 9 | 11              => 30
      case 2                           => if isLeapYear(year) then 29 else 28
      case _                           => 0

  private def isLeapYear(year: Int): Boolean =
    year % 4 == 0 && (year % 100 != 0 || year % 400 == 0)
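
A day-count table like this is easy to get subtly wrong, so it is worth cross-checking. The sketch below repeats daysInMonth and isLeapYear as top-level functions (they are private inside Date) and compares them with java.time.YearMonth from the JDK. The mismatches helper is a hypothetical name introduced only for this check.

```scala
import java.time.YearMonth

// Copies of the private helpers from Date, lifted to the top level for testing.
def isLeapYear(year: Int): Boolean =
  year % 4 == 0 && (year % 100 != 0 || year % 400 == 0)

def daysInMonth(month: Int, year: Int): Int =
  month match
    case 1 | 3 | 5 | 7 | 8 | 10 | 12 => 31
    case 4 | 6 | 9 | 11              => 30
    case 2                           => if isLeapYear(year) then 29 else 28
    case _                           => 0

// Every (year, month) pair where the hand-written table disagrees with java.time.
def mismatches(fromYear: Int, toYear: Int): Seq[(Int, Int)] =
  for
    year  <- fromYear to toYear
    month <- 1 to 12
    if daysInMonth(month, year) != YearMonth.of(year, month).lengthOfMonth
  yield (year, month)
```

An empty result from mismatches(1600, 2400) covers the century cases that trip up naive leap-year rules, such as 1900 (not a leap year) and 2000 (a leap year).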

Let us put our parser to the test:

// NOTE: See toString override in Date
println( Date.parser.parse("2026-02-12") )  // 2026-02-12 

println( Date.parser.parse("2026-1-01") )   // Expected a digit, but got '-'
println( Date.parser.parse("26-02-12") )    // Expected a digit, but got '-'
println( Date.parser.parse("hello") )       // Expected a digit, but got 'h'

// Invalid dates
println( Date.parser.parse("2026-02-30") )  // Invalid day: 30 for month 2
println( Date.parser.parse("2026-14-01") )  // month must be between 1 and 12

That works well. What would you say is the result of parsing 2026-02-15xxx? Pause and think for a moment.

Answer: Our date parser parses the date successfully and ignores the trailing xxx. That is a consequence of how the parser is written: it looks for a year, a month, and a day, and nothing requires the input to end once the day has been read. This behavior is useful when the date is only one part of a larger text.

To parse strictly, meaning the input must contain a date and nothing else, we have to add a helper to our Parser library that matches the end of input (EOF).

/**
 * Assert the parse is done: no leftover input
 * @return a parser that parses the end of input
 */
def eof: Parser[Unit] =
  Parser { input =>
    if input.isEmpty then ParseResult.Success((), "")
    else ParseResult.Failure(s"Expected end of input, but got '$input'")
  }

Then you could add eof as the final item to parse:

val parser: Parser[Date] =
  for
    year  <- digits(4) // parse 4 digits
    _     <- char('-') // parse literal '-'
    month <- digits(2) // parse 2 digits
    _     <- validate(month >= 1 && month <= 12, "month must be between 1 and 12")
    _     <- char('-') // parse literal '-'
    day   <- digits(2) // parse 2 digits
    _     <- validate(day >= 1 && day <= daysInMonth(month, year), s"Invalid day: $day for month $month")
    _     <- eof       // nothing else should remain
  yield Date(year, month, day)

With the updated parser, the same input now fails, because the end of input was expected but xxx was found.

println( Date.parser.parse("2026-02-15xxx") )  // Expected end of input, but got 'xxx'
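
If you want both behaviors, one option is to keep the lenient parser as it is and layer strictness on top instead of repeating the for comprehension. The sketch below is an assumption-laden illustration: ParseResult and Parser are stand-ins for the library types, digits is a simplified version, and strict is a hypothetical helper that is not part of the library. It uses a four-digit year parser for brevity; the same wrapper could be applied to a date parser.

```scala
// Hypothetical stand-ins for the library types from the previous post.
enum ParseResult[+A]:
  case Success(value: A, remaining: String)
  case Failure(message: String)

final case class Parser[A](run: String => ParseResult[A]):
  def parse(input: String): ParseResult[A] = run(input)

  // Run this parser, then feed its value and the leftover input to the next one.
  def flatMap[B](f: A => Parser[B]): Parser[B] =
    Parser { input =>
      run(input) match
        case ParseResult.Success(value, rest) => f(value).run(rest)
        case ParseResult.Failure(message)     => ParseResult.Failure(message)
    }

  def map[B](f: A => B): Parser[B] =
    flatMap[B](value => Parser(rest => ParseResult.Success(f(value), rest)))

// Simplified digits parser: exactly `count` ASCII digits, returned as an Int.
def digits(count: Int): Parser[Int] =
  Parser { input =>
    val (taken, rest) = input.splitAt(count)
    if taken.length == count && taken.forall(c => c >= '0' && c <= '9') then
      ParseResult.Success(taken.toInt, rest)
    else ParseResult.Failure(s"Expected $count digits, but got '$taken'")
  }

// Same eof as in the post: succeed only when nothing is left.
def eof: Parser[Unit] =
  Parser { input =>
    if input.isEmpty then ParseResult.Success((), "")
    else ParseResult.Failure(s"Expected end of input, but got '$input'")
  }

// Hypothetical helper: turn any lenient parser into a strict one.
def strict[A](parser: Parser[A]): Parser[A] =
  for
    value <- parser
    _     <- eof
  yield value
```

Here digits(4).parse("2026xxx") succeeds and leaves xxx as the remaining input, while strict(digits(4)).parse("2026xxx") fails at eof with the same message as above.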

Until next time …

Those are a couple of simple parsers built with the parser library from the previous post. As an exercise, consider implementing parsers for formats like semantic versions or IPv4 addresses.

Next time, let us write a more involved parser: one for configuration key-value pairs.
