Thursday, 5 December 2013

UNICODE and Surrogate Characters

I have been attempting to save data to a MySQL db and getting a very puzzling error.

Incorrect string value: '\xF0\x90\x8D\x83\xF0\x90...' for column 'alternatenames'
A bit of searching lead me to believe that this was entirely down to unicode characters, but it is more complex than that.
But first, ensure that your connection string has:
${database.url}?amp;useUnicode=true&characterEncoding=UTF-8"
Then ensure that the columns to store the characters are:

  • CHARACTER SET utf8

    or
  • CHARACTER SET utf8mb4

Finally check the way your characters are represented as you could be attempting to store UTF-16 "surrogate pairs". The term "surrogate pair" refers to a means of encoding Unicode characters with high code-points in the UTF-16 encoding scheme. I wasn't able to get UTF-16 to work with Hibernate so I have resorted to stripping this data.
The following functions assisted me in cleansing the data:

    /**
     * I am facing a situation where i get Surrogate characters in text that i am saving to MySql 5.1. As the UTF-16 is not supported in this, I want to remove these surrogate pairs manually by a java method before saving it to the database.
     * http://stackoverflow.com/questions/12867000/how-to-remove-surrogate-characters-in-java
     * @param query
     * @return
     * @see #setAlternatenames(String) and {@link #setAlternatenamesSafe(String)}
     */
    public static String removeSurrogatesFromCSV(String query) 
    {
       String parts[] = query.split(",");
    String sep = "";
    StringBuilder sb = new StringBuilder();
    for (String part:parts)
        {
        sb.append(sep);
        part = StringUtils.trimToEmpty(removeSurrogates(part,false)); 
        if (!part.isEmpty()) 
            {
            sb.append(part);
            sep=",";
            }
        }
        
        
       return sb.toString();
    }
    /**
     * I am facing a situation where i get Surrogate characters in text that i am saving to MySql 5.1. As the UTF-16 is not supported in this, I want to remove these surrogate pairs manually by a java method before saving it to the database.
     * http://stackoverflow.com/questions/12867000/how-to-remove-surrogate-characters-in-java
     * @param query
     * @return
     */
    public static String removeSurrogates(String query) 
    {
        return removeSurrogates(query,false); 

    }
    /**
     * I am facing a situation where i get Surrogate characters in text that i am saving to MySql 5.1. As the UTF-16 is not supported in this, I want to remove these surrogate pairs manually by a java method before saving it to the database.
     * http://stackoverflow.com/questions/12867000/how-to-remove-surrogate-characters-in-java
     * @param query
     * @param preserveLength if true then the surrogates are turned into '[]'.
     * @return
     */
    public static String removeSurrogates(String query,boolean preserveLength) 
        {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < query.length(); i++) {
            char ch = query.charAt(i);
            if (Character.isHighSurrogate(ch))
                {
                if (preserveLength) sb.append("[]");
                i++;//skip the next char is it's supposed to be low surrogate
                }
            else
                sb.append(ch);
        }    
        return sb.toString();
        }

Hope this helps!

See:



No comments:

Post a Comment