Thursday, November 10, 2011

A maze of twisty little Strings, all alike

This post was spurred by a recent discussion of whether or not Haskell code just works the first time you try it.  As a less-than-expert Haskeller, my answer to that is "not so much."  But even as a less than expert Haskeller, I find that my Haskell code needs much less debugging time than when I write in another language.  It feels like a difference of kind, not one of degree.  So I understand where the original claim comes from, even if I ultimately find it to be overstated.  What's great about Haskell is that it has lots of language features that support robust code.  Immutability of data gets lots of press, and deservedly so, but Haskell has many other features that help. 

What this post is about is one of the Haskell techniques that helps me write code that is more solid.  That technique is using type aliases to differentiate amongst the various kinds of strings that I am using. 

The Motivation
For example, I recently wrote a small web app that used an SQL data store.  My first cut at designing the code resulted in a bunch of functions, many of which took multiple arguments that were of type String.  I've found that this leads to confusion about what exactly a specific String represents.  Is this a full pathname, a relative pathname, or just the filename with all path information stripped off?  Does this routine even care, or it is flexible enough to handle any of those cases?

The Approach
In theory, good comment practices and informative variable names should provide this information.  In practice, these are often not maintained as well as they should be --- and I think that using the type system to differentiate between types of strings is generally both clearer to the reader and faster for the author of the code.  At least, it's faster in Haskell.

Let's take a look at how Haskell does this.  The newtype keyword in Haskell gives a type a new identity.  This new type is incompatible with the original type, so the compiler will catch any attempt to use one in place of the other --- if I have defined both TableName and ColumnName as newtypes of String, I can't use either a String or a ColumnName as a TableName. 

Function Type Signatures, Old and New
So let's look at how this transformed one of my functions. 

-- builds an SQL query string for updating a record
-- arguments

--    table: the name of the table to update
--    columns: the names of the columns to update
--    where_clause: the text of the WHERE clause, should
--       not have the WHERE keyword, i.e. it should 
--       look like   name='cheddar'
-- returns something like
--    UPDATE cheese SET price=? WHERE name='cheddar'
bldUpdate :: String -> [String] -> String -> String
bldUpdate table columns where_clause = . . .

This is a function that takes a String, a list of Strings, and a String, and returns a String.  At first blush, this looks like something that can be readily handled by appropriately naming the function parameters.  But then you find that your calling code has its key names represented both by the bare column name and by this mashup of the column name and the name of the of the HTTP form that it appears on.  Oh, yeah, and there's some other code that deals with mashups of the table name plus the column name.  So that's three different forms of "column name" we've got to keep track of. Additionally, we'd like to have that where_clause comment appear only one place in the code, not every place we pass or receive a WHERE clause.  

By defining TableName, ColumnName, UriColumnName, TableAndColumn, WhereClause, and SqlCmd to be new string types, we clean this up

-- builds an SQL query string for updating a record
-- returns "UPDATE cheese SET price=? WHERE name='cheddar'"
bldUpdate :: TableName -> [ColumnName] -> WhereClause -> SqlCmd
bldUpdate  table columns where_clause = . . .

I find the second version a whole lot easier to digest quickly.  I'm horrified at the thought of a whole file full of functions that look like the first version. Reading through that source file and grokking it is a lot more work than it needs to be -- and that assumes that the first version has been correctly maintained.  You know that the second version has been correctly maintained, because otherwise it wouldn't compile; there are no doubts in the back of your mind.  Additionally, the second version will have the types of all the arguments checked, so you can't mix up the table name and the where clause.  How easy would it be to write bad test cases for the first version?  How about the second version?

Defining TableName
We define TableName like this:

newtype TableName = TableName String deriving (Show, Eq)

class IsStr a where
  toStr :: a -> String

instance IsStr TableName where
  toStr (TableName t) = t

Let's take this from the top.  The newtype command defines TableName as a new type identity for String, and defines a value constructor TableName.  That value constructor is a function that converts a String to a TableName.  The "deriving (Show, Eq)" clause tells the compiler that the TableName type should inherit String's implementations of the functions in the Show and Eq typeclasses (think of typeclasses as interfaces).  TableName could inherit more functionality from String by adding more of its typeclasses to the derivation list. 

The next code section defines a typeclass IsStr with a single function toStr.  toStr will convert any member type of IsStr to a String. We made this a typeclass rather than just defining the function toStr, because we'll want to define toStr for our other string types. 

The final section of code defines the implementation of IsStr for the TableName type.  The function definition uses a pattern match to simply return the string that was used to construct the TableName.  

Using TableName
So  implementing TableName was a piece of cake.  How does this change the usage?  In this particular module, the vast majority of the usage was unchanged; most routines that use TableName don't do anything directly with the value and only pass it on to routines that they call.  Let's look at a routine that's on the bottom of that call tree, bldQueryBasic.  The original version is

bldQueryBasic :: String -> String -> String
bldQueryBasic col table =
  "SELECT " ++ col ++ " FROM " ++ table

The new version uses pattern matching on the value constructors ColumnName and TableName:

bldQueryBasic :: ColumnName -> TableName -> SqlCmd
bldQueryBasic (ColumnName col) (TableName table) =
  SqlCmd ("SELECT " ++ col ++ " FROM " ++ table)

If we weren't in position where pattern matching was so convenient, we could call toStr to convert our TableName to a String.  This example also illustrates the other side of the usage coin: when we have a String and need an SqlCmd, the value constructor SqlCmd converts the String value into a SqlCmd. 

All in all, it's pretty painless to replace String with TableName and friends.  Usages like comparing two TableNames are unchanged, while comparing a TableName with a string will cost you call to either toStr or the TableName constructor. 

The Whole API
The beauty of this solution is that the code and the API is much easier to read, and the plain ole' Strings really stand out now.  Look at the type signatures below.  I think you'll agree the original version really is "a maze of twisty little Strings, all alike." Looking at the original API, it's hard to understand what the arguments are. 

The Original API
buildTable :: TableDef -> String -> IO ()
findTable :: String -> [TableDef] -> Maybe TableDef
getColumns :: TableDef -> [String]
getColumnName :: SqlColumn -> String
getTableName :: TableDef -> String
getSqlVal :: SqlColumn -> String -> SqlValue
bldDelete :: String -> String -> String
bldUpdate :: String -> [String] -> String -> String
existsEntry :: String -> String -> String -> IO Bool
bldQueryWhereEq :: String -> String -> String -> String -> Maybe String
bldQueryMultiEq :: String -> String -> [String] -> [String] -> Maybe String
clauseEQ :: String -> String -> String
bldQueryWhere :: String -> String -> String -> Maybe String
bldQueryBasic :: String -> String -> Maybe String
runQuery :: Maybe String -> IO String
runQueryList :: Maybe String -> IO [String]

The Updated API
buildTable :: TableDef -> IO ()
findTable :: TableName -> [TableDef] -> Maybe TableDef
getColumns :: TableDef -> [ColumnName]
getColumnName :: SqlColumn -> ColumnName
getTableName :: TableDef -> TableName
getSqlVal :: SqlColumn -> String -> SqlValue
bldDelete :: TableName -> WhereClause -> SqlCmd
bldUpdate :: TableName -> [ColumnName] -> WhereClause -> SqlCmd
existsEntry :: TableName -> WhereClause -> IO Bool
bldQueryWhereEq :: ColumnName -> TableName -> ColumnName -> String -> Maybe SqlCmd
bldQueryMultiEq :: ColumnName -> TableName -> [ColumnName] -> [String] -> Maybe SqlCmd
clauseEQ :: [ColumnName] -> [String] -> WhereClause
bldQueryWhere :: ColumnName -> TableName -> WhereClause -> Maybe SqlCmd
bldQueryBasic :: ColumnName -> TableName -> Maybe SqlCmd
runQuery :: Maybe SqlCmd -> IO String
runQueryList :: Maybe SqlCmd -> IO [String]

Final Words
I hope I've shown how Haskell's newtype can help you to write clear code with very little extra work.  If we wanted to put in more work, we could enforce invariants.  For example, our module could hide the value constructor WhereClause and only export a function toWhereClause that validates the input before constructing a WhereClause.  Tom Moertel has an excellent blog post that describes how you can use a similar method with an eye toward security.   

Haskell isn't the only language you can do this in.  But I'm unaware of any mainstream language where you can do it with so little overhead and the guarantees of type safety.  

No comments:

Post a Comment