Parsing spreadsheets with Clojure

Reading from an Excel file is surprisingly easy in clojure. We'll see an example in groovy and then compare it with one in clojure.

Groovy

In groovy we can read an Excel sheet like this:

XSSFWorkbook spreadsheet(File spreadsheetFile) {
  FileInputStream file = new FileInputStream(spreadsheetFile)
  new XSSFWorkbook(file)
}

def cellValue(XSSFSheet sheet, XSSFCell cell) {
  if (cell.getCellType() == Cell.CELL_TYPE_NUMERIC) {
    cell.getNumericCellValue().intValue()
  } else {
    stringCellValue(sheet, cell)
  }
}

String stringCellValue(XSSFSheet sheet, XSSFCell cell) {
    return cell.getStringCellValue().trim()
}

private List<String> headersFor(String sheetName, File spreadsheetFile) {
  XSSFWorkbook spreadsheet = spreadsheet(spreadsheetFile)
  XSSFSheet sheet
  switch(sheetName) {
    case SHEET_NAME1:
      sheet = spreadsheet.getSheet(SHEET_NAME1)
      break
    case SHEET_NAME2:
      sheet = spreadsheet.getSheet(SHEET_NAME2)
      break
    default:
      return
  }
  sheet.getRow(sheet.getFirstRowNum())
      .findAll{XSSFCell cell ->
    cell.getStringCellValue().trim().length() > 0
      }
      .collect{XSSFCell cell ->
    cell.getStringCellValue().toLowerCase()
      }
}


XSSFSheet sheet = spreadsheet(spreadsheetFile).getSheet(sheetName)
def headers = headersFor(sheetName, spreadsheetFile)
def values = []
sheet.eachWithIndex { XSSFRow row, int index ->
    if(index > 0) {
  def rowFieldValues = [:]
  row.each { XSSFCell cell ->
      def cellValue = cellValue(sheet, cell)
      if(field) {
    rowFieldValues.put(field, cellValue)
      }
  }
  values << rowFieldValues
    }
}

Whew, that is a lot of code. Event though we wrote it in a more terse language than java, it is still quite long and error prone. Granted, we are using poi directly, compared to using a third party library like the one we'll use with clojure.

Clojure

In clojure we can use a third party library that is a wrapper for poi: Docjure. We can include this a dependency in our project.clj file:

[dk.ative/docjure "1.7.0"]

To extract the data from the spreadsheet, we basically need to do three things: 1. Load the file. 2. Select the sheet we want to read. 3. Specify the columns we are interested in.

Pretty straightforward:

(ns spreadsheet-example.core
    (:require [dk.ative.docjure.spreadsheet :as spreadsheet]))

(defn read-rows []
  (->>
    (spreadsheet/load-workbook "resources/sample.xlsx")
    (spreadsheet/select-sheet "Sheet Name")
    (spreadsheet/select-columns {:A :name-for-column-A :B :name-for-column-B})))

Calling read-rows will return a list of maps whose keys correspond to the values of the columns. Eg:

[{:name-for-columnA "A" :name-for-columnB 1} {:name-for-columnA "B" :name-for-columnB "2} {:name-for-columnA nil :name-for-columnB nil]

The last two are empty rows. We can leave those out pretty easily:

(defn valid-row? [col]
  (not (nil? (:name-for-columnA x))))

(defn all-valid-rows []
  (->>
    (read-rows)
    (rest)
    (filter valid-row?)))

This will filter out all rows that do not have a value in column A.

Happy hacking!

Groovy

Clojure

Roberto Guerra