Archive for the ‘Fundamentals’ Category

RTFM

August 4, 2011

I recently encountered a little piece of code which reminded me why it is important for a developer to RTFM.

In one code base I came across the following utility method:

public static String allowOnlyNumbers(String input)
{
   StringBuilder sb = new StringBuilder(input.length());
   for (int i=0; i<input.length(); i++)
   {
      char c = input.charAt(i);
      if (Character.isDigit(c))
      {
          sb.append(c);
      }
   }
   return sb.toString();
}

This utility method was used in the following context:

...
PhoneDetails phone = new PhoneDetails();
phone.setNumber(allowOnlyNumbers(phoneNumber));
phone.setAreaCode(allowOnlyNumbers(areaCode));
...

So the method was supposed to allow only characters in the range ‘0’ to ‘9’. Looks OK, right?
I ran a quick test:

System.out.println(allowOnlyNumbers("abc123"));

The result was as expected:

123

I often read the Java core library sources to learn, so I looked at Character.isDigit(char) as well to see how it was implemented. To my initial surprise, the code was NOT:

public static boolean isDigit(char c)
{
   return c>='0' && c<='9';
}

It was far more complex and seemed to support Unicode as well. So I ran another test on the code:

System.out.println(allowOnlyNumbers("\u06F1\u06F2\u06F3")); // Extended Arabic-Indic digits 1, 2 and 3

The result was again as expected (but not as intended):

۱۲۳

Needless to say, the JavaDoc specifies very clearly how this method behaves:

Determines if the specified character is a digit.

A character is a digit if its general category type, provided by Character.getType(ch), is DECIMAL_DIGIT_NUMBER.

Some Unicode character ranges that contain digits:

‘\u0030’ through ‘\u0039’, ISO-LATIN-1 digits (‘0’ through ‘9’)
‘\u0660’ through ‘\u0669’, Arabic-Indic digits
‘\u06F0’ through ‘\u06F9’, Extended Arabic-Indic digits
‘\u0966’ through ‘\u096F’, Devanagari digits
‘\uFF10’ through ‘\uFF19’, Fullwidth digits

Many other character ranges contain digits as well.
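If the intent really is to keep only ‘0’ through ‘9’, one possible fix is to compare against the ASCII range directly. The following is a minimal sketch, not the code base's actual fix: the class name DigitFilter and the methods allowOnlyAsciiDigits and normalizeDigits are hypothetical names introduced here for illustration. The second method shows an alternative, assuming you would rather accept digits from other scripts and map them to their ASCII equivalents with Character.digit.

```java
final class DigitFilter {

    /** Keeps only the ASCII digits '0' through '9'; every other character is dropped. */
    static String allowOnlyAsciiDigits(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c >= '0' && c <= '9') {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Alternative: accepts any Unicode decimal digit and converts it to its ASCII equivalent. */
    static String normalizeDigits(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isDigit(c)) {
                // Character.digit returns the numeric value 0..9 of the digit character
                sb.append((char) ('0' + Character.digit(c, 10)));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(allowOnlyAsciiDigits("abc123"));                       // 123
        System.out.println(allowOnlyAsciiDigits("\u06F1\u06F2\u06F3").isEmpty()); // true
        System.out.println(normalizeDigits("\u06F1\u06F2\u06F3"));                // 123
    }
}
```

Which of the two is appropriate depends on what the phone number field is expected to store; the first rejects non-ASCII digits, the second normalizes them.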

So the message here is (as Baz Luhrmann said):

Read the directions even if you don’t follow them.

You can see the code running here


Encoding, Unicode and all that is between them

March 14, 2011

I have worked with many developers over the past several years, and I have noticed something I find peculiar: developers often understand the flow of a large-scale multi-threaded application more easily than they understand the concept of Unicode.

After reading a very good article about Unicode I have decided to share that article as I think it is something that we should all be aware of when writing international software.

I recommend reading it; it will fill in gaps you might not know you have.

http://www.joelonsoftware.com/articles/Unicode.html

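Important note for Java developers: Java strings are defined in terms of UTF-16, where each char is a 16-bit code unit, so even a simple text like ‘Hello’ takes 10 bytes of character data and not 5 (not counting object overhead). Since JDK 9, compact strings may store text that contains only Latin-1 characters using one byte per character internally, but the char-based API still behaves as UTF-16.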
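The size difference is easy to check by encoding a string explicitly. This is a small sketch that measures encoded byte counts, not the JVM's internal memory layout; the class name Utf16Sizes and its helper methods are hypothetical names chosen for this example. It also shows that a character outside the Basic Multilingual Plane occupies two char values in a String.

```java
import java.nio.charset.StandardCharsets;

final class Utf16Sizes {

    /** Bytes needed to encode the text as UTF-16 (big-endian, without a byte-order mark). */
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    /** Bytes needed to encode the text as UTF-8. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String hello = "Hello";
        System.out.println(utf16Bytes(hello)); // 10
        System.out.println(utf8Bytes(hello));  // 5

        // U+1D11E (musical G clef) is outside the Basic Multilingual Plane,
        // so UTF-16 represents it as a surrogate pair of two char values.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.length());                         // 2
        System.out.println(clef.codePointCount(0, clef.length())); // 1
        System.out.println(utf16Bytes(clef));                      // 4
    }
}
```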

Java Primitive Data Types – The Integrals

July 11, 2010

I recently needed to know the differences between the primitive data types that Java provides, so I started looking around for the answer. As expected, that was quite a simple task, but the information I found was not laid out the way I expected it to be. So here are the Java Primitive Data Types (The Integrals) in a nutshell.

Type: byte
Size: 8 bit
Range: -128 to 127
Usage: The byte data type is an 8-bit signed two’s complement integer. The byte data type can be useful for saving memory in large arrays, where the memory savings actually matters. They can also be used in place of int where their limits help to clarify your code; the fact that a variable’s range is limited can serve as a form of documentation.

Type: short
Size: 16 bit
Range: -32,768 to 32,767
Usage: The short data type is a 16-bit signed two’s complement integer. As with byte, the same guidelines apply: you can use a short to save memory in large arrays, in situations where the memory savings actually matters.

Type: int
Size: 32 bit
Range: -2,147,483,648 to 2,147,483,647
Usage: The int data type is a 32-bit signed two’s complement integer. For integral values, this data type is generally the default choice unless there is a reason (like the above) to choose something else. This data type will most likely be large enough for the numbers your program will use, but if you need a wider range of values, use long instead.

Type: long
Size: 64 bit
Range: -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
Usage: The long data type is a 64-bit signed two’s complement integer. Use this data type when you need a range of values wider than those provided by int.

Type: boolean
Size: Undefined
Range: true/false
Usage: The boolean data type has only two possible values: true and false. Use this data type for simple flags that track true/false conditions. This data type represents one bit of information, but its “size” isn’t something that’s precisely defined.

Type: char
Size: 16 bit
Range: ‘\u0000’ to ‘\uffff’ (or 0 to 65535)
Usage: The char data type is a single 16-bit Unicode character.
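The ranges above can be confirmed from the constants on the wrapper classes. The sketch below prints them and shows what happens when a value leaves its range; the class name IntegralRanges and the helper toByte are hypothetical names used only for this illustration.

```java
final class IntegralRanges {

    /** Narrowing cast: keeps only the low 8 bits, so out-of-range values wrap around. */
    static byte toByte(int value) {
        return (byte) value;
    }

    public static void main(String[] args) {
        System.out.println(Byte.MIN_VALUE + " to " + Byte.MAX_VALUE);
        System.out.println(Short.MIN_VALUE + " to " + Short.MAX_VALUE);
        System.out.println(Integer.MIN_VALUE + " to " + Integer.MAX_VALUE);
        System.out.println(Long.MIN_VALUE + " to " + Long.MAX_VALUE);
        // char is unsigned, so cast to int to print its numeric range
        System.out.println((int) Character.MIN_VALUE + " to " + (int) Character.MAX_VALUE);

        System.out.println(toByte(128));                                // -128
        System.out.println(Integer.MAX_VALUE + 1 == Integer.MIN_VALUE); // true: int arithmetic wraps silently
        System.out.println(Integer.MAX_VALUE + 1L);                     // 2147483648: widened to long first
    }
}
```

Overflow in int and long arithmetic does not raise an error, which is one practical reason to pick the wider type when values may approach the limits.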

Floating-point numbers will be covered in future posts.

More reading: Primitive Data Types (Java’s Tutorials)
