NULLs in Group-Value Functions in Oracle 12c

By: Bob Bryla, Kevin Loney


Group-value functions in Oracle treat NULL values differently than single-value functions do. Group functions (other than COUNT(*)) ignore NULL values and calculate a result without considering them.

Take AVG as an example. Suppose you have a list of 100 friends and their ages. If you picked 20 of them at random and averaged their ages, how different would the result be than if you picked a different list of 20, also at random, and averaged it, or if you averaged all 100? In fact, the averages of these three groups would be very close. What this means is that AVG is somewhat insensitive to missing rows, even when the missing data represents a high percentage of the total number of records available.

NOTE

AVG is not immune to missing data, and there can be cases where it will be significantly off (such as when missing data is not randomly distributed), but these cases are less common.

The relative insensitivity of AVG to missing data needs to be contrasted with, for instance, SUM. How close to correct is the SUM of the ages of only 20 friends to the SUM of all 100 friends? Not close at all. So if you had a table of friends, but only 20 out of 100 supplied their ages, and 80 out of 100 had NULL for their age, which one would be a more reliable statistic about the whole group and less sensitive to the absence of data—the AVG age of those 20 friends, or the SUM of them? Note that this is an entirely different issue than whether it is possible to estimate the sum of all 100 based on only 20 (in fact, it is precisely the AVG of the 20, times 100). The point is, if you don’t know how many rows are NULL, you can use the following to provide a fairly reasonable result:

p0182-01

You cannot get a reasonable result from this, however:

p0182-02

This same test of whether or not results are reasonable defines how the other group functions respond to NULLs. STDDEV and VARIANCE are measures of central tendency; they, too, are relatively insensitive to missing data. (These are shown in “STDDEV and VARIANCE,” later in this chapter.)

MAX and MIN measure the extremes of your data. They can fluctuate wildly while AVG stays relatively constant: If you add a 100-year-old man to a group of 99 people who are 50 years old, the average age only goes up to 50.5—but the maximum age has doubled. Add a newborn baby, and the average goes back to 50, but the minimum age is now 0. It’s clear that missing or unknown NULL values can profoundly affect MAX, MIN, and SUM, so be cautious when using them, particularly if a significant percentage of the data is NULL.

Is it possible to create functions that also take into account how sparse the data is and how many values are NULL, compared to how many have real values, and make good guesses about MAX, MIN, and SUM? Yes, but such functions would be statistical projections, which must make explicit their assumptions about a particular set of data. This is not an appropriate task for a general-purpose group function. Some statisticians would argue that these functions should return NULL if they encounter any NULLs because returning any value can be misleading. Oracle returns something rather than nothing, but leaves it up to you to decide whether the result is reasonable.

COUNT is a special case. It can go either way with NULL values, but it always returns a number; it never evaluates to NULL. The format and usage for COUNT are shown shortly, but to simply contrast it with the other group functions, it counts all the non-NULL rows of a column, or it counts all the rows. In other words, if asked to count the ages of 100 friends, COUNT returns a value of 20 (because only 20 of the 100 gave their ages). If asked to count the rows in the table of friends without specifying a column, it returns 100.

Leave a Reply