Tuesday, January 28, 2014

Dealing with characterset issues

Overview

Recently I was involved in a project where we had to call a web service (an XML post over HTTP).  The data in the RedPrairie database was in UTF-8; the special characters in this case were Turkish characters.  In this post I will document the various issues we faced and how I addressed them.

Understanding the environment

When dealing with international charactersets, I have seen that most often the parties involved do not completely understand what they are talking about, and that can hinder progress.  Following are some simple scripts to gather some basic facts:

  • What is the database characterset?
    • On Oracle execute:
      [SELECT * FROM nls_database_parameters where parameter = 'NLS_CHARACTERSET']
      
      Typically you will see something like AL32UTF8, implying UTF-8. Refer to the Oracle documentation for other charactersets.
  • How is it handled within the RedPrairie technology stack?
    • When the string is read into a Java String, you can ignore how Java stores it internally; you are basically trusting Java to know what the string really is. If the string passes from Java to C, C will see it as a UTF-8 byte sequence.
  • How does the other system want it?
    • For this project, the other system could not tell us what characterset they expected. Something like this can become quite frustrating; in situations like this you may have to simply try the various encodings.
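When nobody can say which characterset is expected, it helps to see what the candidate encodings actually produce for your special characters.  The small Java program below is my own illustration (not part of the original integration); it prints the byte sequences for the Turkish character ş under a few likely charsets:

```java
import java.nio.charset.Charset;

public class CharsetProbe {

    // Return the bytes of text in the given charset as spaced hex, e.g. "C5 9F".
    public static String hexBytes(String text, String charsetName) {
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            if (hex.length() > 0) hex.append(' ');
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // ş (U+015F) is one of the Turkish characters that caused trouble
        for (String name : new String[] {"UTF-8", "ISO-8859-9", "windows-1254"}) {
            System.out.println(name + ": " + hexBytes("ş", name));
        }
    }
}
```

Running it shows that ş is the two bytes C5 9F in UTF-8 but the single byte FE in both ISO-8859-9 and windows-1254; comparing such output against a network trace or hex dump tells you which encoding the other side is really using.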

Solution Overview

Our scenario was as follows:
  1. The data is fetched.
  2. An XML message is created from the data.
  3. The XML is posted to the other server.
  4. The other system responds.
When the data was fetched and converted to XML, I created a file using "write output file".  "write output file" is a C component, and the file showed the byte sequences as UTF-8.  When that data was posted to the URL, the other side was unable to parse it, even though a network trace showed the correct byte sequences.  You can also use groovy to write files, but in that case you should declare what characterset you want the file to be in.
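The same idea of declaring the characterset when writing a raw dump file can be sketched in plain Java; the file name and location below are made up for illustration:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class RawFileDump {

    // Write the text to a file, explicitly choosing the characterset of the
    // bytes that land on disk.
    public static Path dump(String xml, String dir) throws IOException {
        Path p = Paths.get(dir, "request-dump.xml");
        Files.write(p, xml.getBytes(StandardCharsets.UTF_8));
        return p;
    }

    public static void main(String[] args) throws IOException {
        Path p = dump("<msg>ş</msg>", System.getProperty("java.io.tmpdir"));
        // A hex editor on this file will show ş as the two bytes C5 9F
        System.out.println("Wrote " + Files.size(p) + " bytes to " + p);
    }
}
```

A file produced this way is exactly the raw evidence you want when the two sides disagree about what was sent.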

The other system indicated that their native characterset was ISO-8859-9, so I tried converting the XML message to ISO-8859-9 before sending it.

Once that was resolved, we still needed to determine what characterset their response came back in.

So the code to write to the URL and convert the characterset becomes:
    /* Note the iso-8859-9 here */
    publish data
    where uc_rp2outside_charset = "iso-8859-9"
    |
    publish data
    where url = 'http://whatever.com'
    and body = 'xml string that has international characters.  No need to worry
    about the characterset here because we trust Java to represent it correctly'
    and uc_content_type = "text/xml;charset=" || @uc_rp2outside_charset
    |
    [[
    URL my_url = new URL(url);
    HttpURLConnection conn = (HttpURLConnection) my_url.openConnection();
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Content-Type", uc_content_type );
    //
    DataOutputStream wr = new DataOutputStream(conn.getOutputStream());
    // This is where we declare that our writer should write in
    // a specific characterset
    PrintWriter out = new PrintWriter ( new OutputStreamWriter ( wr, 
                                        uc_rp2outside_charset ) );
    //
    // This println will convert body from internal java representation 
    // to the target characterset
    out.println(body);
    out.close();
    
    ]]
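To see concretely what that PrintWriter/OutputStreamWriter pair does, you can aim the same writer chain at a byte buffer instead of the HTTP connection and inspect the bytes that would have gone on the wire.  This is a standalone sketch, not code from the project:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;

public class WriterEncodingDemo {

    // Same writer chain as in the snippet above, but writing to a byte buffer
    // so we can inspect the bytes the charset conversion produces.
    public static byte[] encodeViaWriter(String body, String charsetName)
            throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        PrintWriter out = new PrintWriter(new OutputStreamWriter(buf, charsetName));
        out.print(body);   // print, not println, so the buffer holds only the body
        out.close();       // close flushes the writer; skip it and the buffer stays empty
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] wire = encodeViaWriter("ş", "ISO-8859-9");
        System.out.printf("%d byte(s), first = %02X%n", wire.length, wire[0] & 0xFF);
    }
}
```

For ş this prints a single byte, FE, confirming that the writer (not the Java String) decides what byte sequence the other side receives.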
    
The response from the webservice will be in a specific characterset as well, and we need to interpret it as such.  In this situation, after some analysis, I determined that the webservice was sending its response in UTF-8.
    publish data
    where uc_outside2rp_charset = "utf-8"
    |
    [[
    ... code to write on the url ...
    DataInputStream ir = new DataInputStream ( conn.getInputStream() );
    //
    // This declares that the response stream from the request will
    // send UTF-8.  This causes the "ret" to properly convert from
    // utf-8
    //
    BufferedReader in_reader = new BufferedReader( new InputStreamReader( ir, 
                                   uc_outside2rp_charset ) );
    //
    String inputLine;
    StringBuilder ret = new StringBuilder();
    while ((inputLine = in_reader.readLine()) != null)
    {
       ret.append ( inputLine );
    }
    
    
    ]]
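The effect of declaring the charset on the reader can be demonstrated without the webservice: feed the reader a known byte sequence and see what characters come out.  The class below is illustrative only:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class ResponseDecodeDemo {

    // Read a raw byte stream through a reader that declares the stream's
    // characterset, just like the InputStreamReader in the snippet above.
    public static String read(byte[] raw, String charsetName) throws IOException {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(raw), charsetName));
        StringBuilder ret = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            ret.append(line);
        }
        return ret.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = {(byte) 0xC5, (byte) 0x9F};   // the UTF-8 encoding of ş
        System.out.println(read(raw, "UTF-8").length());      // 1 character: ş
        System.out.println(read(raw, "ISO-8859-9").length()); // 2 mojibake characters
    }
}
```

Declaring the wrong charset does not fail loudly; it silently hands you the wrong characters, which is why this is worth testing in isolation.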
    
    

Conclusion

Dealing with characterset issues often becomes a challenge because most people do not fully understand the details and underlying concepts.  In such situations you may have to try several approaches, for example:
  • Understand that whenever you write to or read from a connection, the data may need to be converted to a different characterset.  By driving the conversion from a policy or environment variable, you can easily try different conversions.
  • Have the ability to create raw files containing the request and the response.
  • Look at the generated files in a hex editor to see what bytes are being sent and received.
  • A network-level trace can help as well, to ensure that what is on the wire is the proper byte sequence.
  • Many technologies show "?" for characters they do not understand; for example Oracle will show "?", and so will some Java technologies.  MOCA trace will also show "?", which can throw you off: the other side may be seeing "?", and so does the MOCA trace on your side, yet the raw data you sent was correct.  So having a raw file helps.  When creating such raw files the same concepts apply, i.e. you will have to declare what characterset the file is in.
  • I have seen that ISO-xxxx-y charactersets are used more often than UTF-8, especially on Windows-based systems, so if the other side is not making sense, try the appropriate ISO characterset.
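The "?" behavior described above is easy to reproduce: Java's String.getBytes silently substitutes '?' (byte 0x3F) for any character the target characterset cannot represent.  A small demonstration with ş, which exists in ISO-8859-9 but not in ISO-8859-1:

```java
import java.nio.charset.Charset;

public class ReplacementDemo {

    // First byte produced when text is encoded in the given charset.
    public static int firstByte(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName))[0] & 0xFF;
    }

    public static void main(String[] args) {
        // ş exists in ISO-8859-9 but not in ISO-8859-1; the encoder silently
        // substitutes '?' (0x3F) for characters the target charset lacks.
        System.out.printf("ISO-8859-9: %02X%n", firstByte("ş", "ISO-8859-9")); // FE
        System.out.printf("ISO-8859-1: %02X%n", firstByte("ş", "ISO-8859-1")); // 3F, i.e. '?'
    }
}
```

So when you see "?" in a trace, it may have been introduced at any conversion point along the way; only a raw byte-level dump tells you where.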

