Archive for the ‘Encoding’ Category


August 4, 2011

I recently encountered a little piece of code which reminded me why it is important for a developer to RTFM.

I saw in some code base the following utility method:

public static String allowOnlyNumbers(String input) {
   StringBuilder sb = new StringBuilder(input.length());
   for (int i = 0; i < input.length(); i++) {
      char c = input.charAt(i);
      // keep the character only if it is reported as a digit
      if (Character.isDigit(c)) {
         sb.append(c);
      }
   }
   return sb.toString();
}

This utility method was used in the following context:

PhoneDetails phone = new PhoneDetails();

So the method was meant to allow only characters in the range ‘0’-‘9’. Looks OK, right?
I ran a quick test:


The result was as expected:


I always look at Java Core’s sources to learn, so I looked at isDigit(char) as well to see how they did it, and to my initial surprise, the code was NOT:

public static boolean isDigit(char c) {
   return c >= '0' && c <= '9';
}

It was far more complex and seemed to support Unicode as well. So I ran another test on the code:

System.out.println(allowOnlyNumbers("\u06F1\u06F2\u06F3")); // Extended Arabic-Indic digits 1, 2 and 3

The result was again as expected (but not as intended):


Needless to say, the JavaDoc specifies very clearly how this method behaves:

Determines if the specified character is a digit.

A character is a digit if its general category type, provided by Character.getType(ch), is DECIMAL_DIGIT_NUMBER.

Some Unicode character ranges that contain digits:

‘\u0030’ through ‘\u0039’, ISO-LATIN-1 digits (‘0’ through ‘9’)
‘\u0660’ through ‘\u0669’, Arabic-Indic digits
‘\u06F0’ through ‘\u06F9’, Extended Arabic-Indic digits
‘\u0966’ through ‘\u096F’, Devanagari digits
‘\uFF10’ through ‘\uFF19’, Fullwidth digits

Many other character ranges contain digits as well.
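
If the intent really is to keep only ‘0’-‘9’, one option is to compare against the ASCII range directly. The following is a minimal sketch, not the fix used in the original code base; the class name AsciiDigitFilter and the method allowOnlyAsciiDigits are made up for illustration.

```java
// Hypothetical helper: an ASCII-only variant of the filter, shown next to
// Character.isDigit for comparison. Names are illustrative assumptions.
class AsciiDigitFilter {

    // Keeps only the characters '0' through '9' and drops everything else.
    static String allowOnlyAsciiDigits(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c >= '0' && c <= '9') {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String extendedArabicIndic = "\u06F1\u06F2\u06F3";

        // Character.isDigit accepts any DECIMAL_DIGIT_NUMBER character.
        System.out.println(Character.isDigit('\u06F1')); // true
        System.out.println(
            Character.getType('\u06F1') == Character.DECIMAL_DIGIT_NUMBER); // true

        // Character.digit maps such a character to its numeric value.
        System.out.println(Character.digit('\u06F1', 10)); // 1

        // The ASCII-only filter keeps Latin digits and drops the rest.
        System.out.println(allowOnlyAsciiDigits("+1 (555) 010-9999")); // 15550109999
        System.out.println(allowOnlyAsciiDigits(extendedArabicIndic).isEmpty()); // true
    }
}
```

Whether to reject non-Latin digits or to normalize them (for example with Character.digit) depends on what the phone field is supposed to accept, so treat the filter above as one possible choice.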

So the message here is (as Baz Luhrmann said):

Read the directions even if you don’t follow them.

You can see the code running here

Groovy – a cool problem

May 25, 2011

I was given a small Groovy script to maintain. The script is executed as part of a SoapUI project.
Its goal was simple – iterate 5 times and execute a test-case.
Here is the script:

for (i in 1..5) {
   log.info("Running Authentication - iteration <" + i + ">.")
   def step = testRunner.testCase.getTestStepByName("authentication-fail")
   step.run(testRunner, context)
}

The output was very simple:

Running Authentication - iteration <1>
Running Authentication - iteration <2>
Running Authentication - iteration <3>
Running Authentication - iteration <4>
Running Authentication - iteration <5>

But then I was asked to make ‘5‘ configurable, so I defined it as a parameter and made the following change to the code:

def maxCallsBeforeLock = testRunner.testCase.getPropertyValue("MaxCallsBeforeLock")
for (i in 1..maxCallsBeforeLock) {
   log.info("Running Authentication - iteration <" + i + ">.")
   def step = testRunner.testCase.getTestStepByName("authentication-fail")
   step.run(testRunner, context)
}

Unfortunately, the result was not as I expected:

Running Authentication - iteration <1>
Running Authentication - iteration <2>
Running Authentication - iteration <52>
Running Authentication - iteration <53>

Since I’m familiar with ASCII, I immediately recognized 53 as the code of the character ‘5’. The fix then became apparent: the property value is a string, and I needed to convert my fake integer to a real integer. So the revision to the code was simply to replace this:

def maxCallsBeforeLock = testRunner.testCase.getPropertyValue("MaxCallsBeforeLock")

With this:

def maxCallsBeforeLock = testRunner.testCase.getPropertyValue("MaxCallsBeforeLock").toInteger()

That solved the problem.

Now I had just one thing left to do – to understand why this happened.
53 is the ASCII code of the character ‘5’. It is also the hashCode of the string “5”.
After some investigation, and with the help of friends on Stack Overflow, I got the answer: a String of length 1 is converted to a single character. So when such a string is used as a range bound, the numerical value of the corresponding character is used.
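
The numbers involved are easy to check. The sketch below is in Java rather than Groovy, so it does not reproduce the range behaviour itself; it only shows where 53 comes from and what an explicit conversion yields. The class and method names are hypothetical.

```java
// Hypothetical demo: where the value 53 comes from for the string "5".
class SingleCharStringDemo {

    // Numeric (UTF-16 / ASCII) code of the first character of the string.
    static int charCode(String s) {
        return s.charAt(0);
    }

    // Explicit string-to-integer conversion.
    static int parse(String s) {
        return Integer.parseInt(s);
    }

    public static void main(String[] args) {
        String property = "5"; // property values arrive as strings

        System.out.println(charCode(property));    // 53
        System.out.println(property.hashCode());   // 53
        System.out.println(parse(property));       // 5
    }
}
```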

Categories: Encoding, Groovy

Encoding, Unicode and all that is between them

March 14, 2011

I have worked with many developers over the past several years, and I noticed something I find peculiar – developers can understand the flow of a large-scale multi-threaded application more easily than they can understand the concept of Unicode.

After reading a very good article about Unicode, I decided to share it, as I think it is something we should all be aware of when writing international software.

I recommend reading this article; it will fill in gaps you might not know you have.

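Important note for Java developers – Java represents strings and chars as UTF-16 code units, so it is important to understand that even a simple text like ‘Hello’ takes 10 bytes of character data and not 5 bytes (object overhead not included). Since Java 9, compact strings can store Latin-1-only text at one byte per character internally, but the String and char APIs still expose UTF-16 code units.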
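
A quick way to see the difference is to encode the same text in UTF-16 and in UTF-8 and compare the byte counts. This is a minimal sketch with made-up class and method names; it measures encoded bytes, which is not the same thing as the JVM's actual heap usage for a String object.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: byte counts of the same text in UTF-16 and UTF-8.
class Utf16SizeDemo {

    // Encoded size in UTF-16 (big-endian, no byte order mark).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // Encoded size in UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf16Bytes("Hello")); // 10
        System.out.println(utf8Bytes("Hello"));  // 5

        // U+1D11E (musical symbol G clef) lies outside the Basic Multilingual
        // Plane, so it needs a surrogate pair: two chars for one code point.
        String clef = "\uD834\uDD1E";
        System.out.println(clef.length());                          // 2
        System.out.println(clef.codePointCount(0, clef.length()));  // 1
        System.out.println(utf16Bytes(clef));                       // 4
    }
}
```

The second half is a reminder that length() counts UTF-16 code units, not user-visible characters.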
