Checking Text Strings for Valid Characters in Oracle

on May 6, 2015


One of the most typical problems when processing textual information is filtering out good characters from bad characters. The approach of splitting strings into single-symbol arrays and validating them against a given list was outdated many years ago. But even when using the built-in functionality provided by Oracle, you need to understand the strengths and weaknesses of the various Oracle methods.

Currently, the most frequently used techniques are TRANSLATE (from the DBMS_STANDARD set of commands) and REGEXP_REPLACE (part of the whole family of commands that support regular expression notation). Regular expressions provide enormous depth and flexibility, but it is also fair to ask whether the extra price required for this richness is worth paying. The following tests check to see how well and how fast both of these approaches work with a “black list” (removing what is specified) and a “white list” (keeping what is specified).

The first test tries to remove all of the digits and a period from the string. Also, considering that regular expressions support POSIX character format masks, the test will compare whether explicit listing of digits is any different from using [:digit:].

p0278-01

p0279-01

The results of this test are a bit surprising. First, the TRANSLATE command easily beat regular expressions. Second, using the POSIX format mask is about 25 percent faster than listing all of the characters.

Instead of removing the digits and period, the second test will preserve them. Also, instead of just checking timing, now you can look at the RUNSTATS report to see what is happening.

p0279-02

p0280-01

This time, regular expression syntax shows both its strengths and weaknesses. It saves some time for the cost of extra memory allocation. To be fair, the first version of the code is significantly more readable than a double-TRANSLATE call. Unless you are significantly memory bound, this means that for anything more complex than the simplest operations, REGEXP processing provides a very powerful mechanism to build code that is not only readable but also optimized.

Related Posts

Leave a Reply