Dev Reaction: Structured and Unstructured data

I used to work with a guy who would spout off statistics about how much of a business's data was in unstructured form: sitting in filing cabinets, in people's heads, in their Outlook inboxes. We were working on a CRM project, where the name of the game is structuring all the data pertinent to customer relationships and how the business runs.

Anyway, I was thinking about that when I was coding AR. Here is a typical image that I would be processing:

Here is the result of feeding that image through the OCR processing library:

Conﬁrm Bet
Bet Conﬁrmation
RODNEY SMITH (4292)
Ticket Id: 141 C-B5C9-8229
$4 Straight Bet
31-AUG-2013 BI: 923 MARINERS -103
MONEY LINE
Cost: $4
Payoff: $7.90
Your bet has been placed.
'Ticket Id’ is your conﬁrmation.

One thing that is weird about this is that the "fi" in the word "confirmation" is actually a single Unicode character, the ligature U+FB01 (decimal code point 64257), not an ASCII "f" followed by an "i". So if I test for the substring "confirm", it returns false. So I have the classic problem of trying to scrub data that I'm not in control of. It is compounded by the fact that the OCR, which does a fantastic job by the way, will not reliably give me back English strings as expected.
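One way to get around the ligature is to normalize the OCR text before searching it. Here is a minimal sketch, assuming Unicode compatibility normalization (NFKC) from the standard `java.text.Normalizer` is acceptable for this data. The `OcrText` class and its method names are made up for illustration; they are not part of any OCR library.

```java
import java.text.Normalizer;
import java.util.Locale;

final class OcrText {
    private OcrText() {}

    /** Folds compatibility characters (such as the U+FB01 ligature) into plain letters, then lowercases. */
    static String normalize(String raw) {
        return Normalizer.normalize(raw, Normalizer.Form.NFKC).toLowerCase(Locale.ROOT);
    }

    /** Keyword test that ignores ligatures and letter case. */
    static boolean containsKeyword(String raw, String keyword) {
        return normalize(raw).contains(normalize(keyword));
    }

    public static void main(String[] args) {
        String ocrLine = "Bet Con\uFB01rmation"; // as returned by the OCR
        System.out.println(ocrLine.contains("confirm"));         // false
        System.out.println(containsKeyword(ocrLine, "confirm")); // true
    }
}
```

Lowercasing matters as well, since the OCR output capitalizes "Confirmation". This only covers characters that have a compatibility mapping; genuine OCR misreads would still need their own handling.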

I am trying to get this data into a DTO class that looks like this:

public class Ticket
{
    DateTime dateTime;
    Money wagerAmount;
    String sportsBook;
    BetType betType;
    List<String> games;
}

So I am basically writing tedious parsing code that takes raw text and looks for certain keywords. If William Hill decides to change the way they present the data, my parsing code is broken and my app is broken.

Sounds like a perfect case for placing the parsing logic behind an interface, so that I can have multiple implementations of said interface and keep my architecture nice and clean.
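Roughly, the shape I have in mind looks like the sketch below. To keep it self-contained it uses a reduced stand-in (`ParsedTicket`) instead of the real `Ticket` class and only pulls out the "Cost:" line. The names `TicketParser`, `WilliamHillTicketParser`, and `ParsedTicket` are hypothetical, and the regular expression assumes the layout shown in the OCR output above.

```java
import java.math.BigDecimal;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TicketParserDemo {
    public static void main(String[] args) {
        String ocrText = "$4 Straight Bet\nMONEY LINE\nCost: $4\nPayoff: $7.90\n";
        TicketParser parser = new WilliamHillTicketParser();
        parser.parse(ocrText)
              .ifPresent(t -> System.out.println(t.sportsBook + " wager: " + t.wagerAmount));
    }
}

/** Reduced stand-in for the Ticket DTO so the sketch needs no external libraries. */
final class ParsedTicket {
    final String sportsBook;
    final BigDecimal wagerAmount;

    ParsedTicket(String sportsBook, BigDecimal wagerAmount) {
        this.sportsBook = sportsBook;
        this.wagerAmount = wagerAmount;
    }
}

/** One implementation per ticket layout; the rest of the app depends only on this interface. */
interface TicketParser {
    /** Returns the parsed ticket, or empty if the text does not match this layout. */
    Optional<ParsedTicket> parse(String ocrText);
}

/** Hypothetical parser for the confirmation layout shown above. */
final class WilliamHillTicketParser implements TicketParser {
    // Matches a whole line such as "Cost: $4" or "Cost: $12.50".
    private static final Pattern COST =
            Pattern.compile("(?m)^Cost:\\s*\\$(\\d+(?:\\.\\d{1,2})?)\\s*$");

    @Override
    public Optional<ParsedTicket> parse(String ocrText) {
        Matcher m = COST.matcher(ocrText);
        if (!m.find()) {
            return Optional.empty(); // layout changed or text is from another source
        }
        return Optional.of(new ParsedTicket("William Hill", new BigDecimal(m.group(1))));
    }
}
```

If the layout changes, only the one implementation needs to be touched (or a second one added), and returning an empty `Optional` gives the caller a clean way to notice that a parser no longer recognizes the text.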

So that's what I'm working on now. Also, I had always heard that Joda-Time was a superior date and time library to the native Java API, so I went and added it via Maven (which is like NuGet for Java projects) and it was pretty painless. I also ran across the Joda-Money library, so I figured I'd use its Money class for the wager amounts rather than a straight float or double, since binary floating point cannot represent most decimal amounts exactly.
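To show why float or double is a poor fit for money, here is a small sketch using only the standard library. It uses `java.math.BigDecimal` as a stand-in for exact decimal arithmetic and is not the Joda-Money API. The `WagerMath` class and its `profit` method are hypothetical names.

```java
import java.math.BigDecimal;

final class WagerMath {
    private WagerMath() {}

    /** Exact decimal subtraction: payoff minus cost. */
    static BigDecimal profit(BigDecimal payoff, BigDecimal cost) {
        return payoff.subtract(cost);
    }

    public static void main(String[] args) {
        // Binary floating point cannot represent 0.10 or 0.20 exactly,
        // so the sum is not equal to 0.30.
        System.out.println(0.10 + 0.20 == 0.30); // false

        // Decimal arithmetic on the ticket values stays exact.
        BigDecimal p = profit(new BigDecimal("7.90"), new BigDecimal("4"));
        System.out.println(p.compareTo(new BigDecimal("3.90")) == 0); // true
    }
}
```

Note that the `BigDecimal` values are built from strings; building them from double literals would carry the floating point error along.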

Anyway, I'm going back to coding.


Sunday, September 1, 2013
