Thursday, 13 August 2009
How to calculate 95 Percentile of a set of values in oracle?
How to calculate 95 Percentile of a set of values in oracle?
Oracle provides functions to calculate percentile values in a set of ordered data.
Inverse Percentile Functions
Using the CUME_DIST function, you can find the cumulative distribution
(percentile) of a set of values. However, the inverse operation (finding what value
computes to a certain percentile) is neither easy to do nor efficiently computed. To
overcome this difficulty, the PERCENTILE_CONT and PERCENTILE_DISC functions
were introduced. These can be used both as window reporting functions as well as
normal aggregate functions.
These functions need a sort specification and a parameter that takes a percentile
value between 0 and 1. The sort specification is handled by using an ORDER BY
clause with one expression. When used as a normal aggregate function, it returns a
single value for each ordered set.
PERCENTILE_CONT, which is a continuous function computed by interpolation,
and PERCENTILE_DISC, which is a step function that assumes discrete values. Like
other aggregates, PERCENTILE_CONT and PERCENTILE_DISC operate on a group
of rows in a grouped query, but with the following differences:
_ They require a parameter between 0 and 1 (inclusive). A parameter specified
out of this range will result in error. This parameter should be specified as an
expression that evaluates to a constant.
_ They require a sort specification. This sort specification is an ORDER BY clause
with a single expression. Multiple expressions are not allowed.
Normal Aggregate Syntax
[PERCENTILE_CONT | PERCENTILE_DISC]( constant expression )
WITHIN GROUP ( ORDER BY single order by expression
[ASC|DESC] [NULLS FIRST| NULLS LAST])
Inverse Percentile Example Basis
We use the following query to return the 17 rows of data used in the examples of
this section:
SELECT cust_id, cust_credit_limit, CUME_DIST()
OVER (ORDER BY cust_credit_limit) AS CUME_DIST
FROM customers WHERE cust_city='Marshal';
CUST_ID CUST_CREDIT_LIMIT CUME_DIST
Inverse Percentile Functions
SQL for Analysis and Reporting 21-29
---------- ----------------- ----------
28344 1500 .173913043
8962 1500 .173913043
36651 1500 .173913043
32497 1500 .173913043
15192 3000 .347826087
102077 3000 .347826087
102343 3000 .347826087
8270 3000 .347826087
21380 5000 .52173913
13808 5000 .52173913
101784 5000 .52173913
30420 5000 .52173913
10346 7000 .652173913
31112 7000 .652173913
35266 7000 .652173913
3424 9000 .739130435
100977 9000 .739130435
103066 10000 .782608696
35225 11000 .956521739
14459 11000 .956521739
17268 11000 .956521739
100421 11000 .956521739
41496 15000 1
PERCENTILE_DISC(x) is computed by scanning up the CUME_DIST values in each
group till you find the first one greater than or equal to x, where x is the specified
percentile value. For the example query where PERCENTILE_DISC(0.5), the result
is 5,000, as the following illustrates:
SELECT PERCENTILE_DISC(0.5) WITHIN GROUP
(ORDER BY cust_credit_limit) AS perc_disc, PERCENTILE_CONT(0.5) WITHIN GROUP
(ORDER BY cust_credit_limit) AS perc_cont
FROM customers WHERE cust_city='Marshal';
PERC_DISC PERC_CONT
--------- ---------
5000 5000
The result of PERCENTILE_CONT is computed by linear interpolation between rows
after ordering them. To compute PERCENTILE_CONT(x), we first compute the row
number = RN= (1+x*(n-1)), where n is the number of rows in the group and x is the
specified percentile value. The final result of the aggregate function is computed by
Inverse Percentile Functions
21-30 Oracle Database Data Warehousing Guide
linear interpolation between the values from rows at row numbers CRN =
CEIL(RN) and FRN = FLOOR(RN).
The final result will be: PERCENTILE_CONT(X) = if (CRN = FRN = RN), then
(value of expression from row at RN) else (CRN - RN) * (value of expression for row
at FRN) + (RN -FRN) * (value of expression for row at CRN).
Consider the previous example query, where we compute PERCENTILE_
CONT(0.5). Here n is 17. The row number RN = (1 + 0.5*(n-1))= 9 for both groups.
Putting this into the formula, (FRN=CRN=9), we return the value from row 9 as the
result.
Another example is, if you want to compute PERCENTILE_CONT(0.66). The
computed row number RN=(1 + 0.66*(n-1))= (1 + 0.66*16)= 11.67. PERCENTILE_
CONT(0.66) = (12-11.67)*(value of row 11)+(11.67-11)*(value of row 12). These results
are:
SELECT PERCENTILE_DISC(0.66) WITHIN GROUP
(ORDER BY cust_credit_limit) AS perc_disc, PERCENTILE_CONT(0.66) WITHIN GROUP
(ORDER BY cust_credit_limit) AS perc_cont
FROM customers WHERE cust_city='Marshal';
PERC_DISC PERC_CONT
---------- ----------
9000 8040
Inverse percentile aggregate functions can appear in the HAVING clause of a query
like other existing aggregate functions.
As Reporting Aggregates
You can also use the aggregate functions PERCENTILE_CONT, PERCENTILE_DISC
as reporting aggregate functions. When used as reporting aggregate functions, the
syntax is similar to those of other reporting aggregates.
[PERCENTILE_CONT | PERCENTILE_DISC](constant expression)
WITHIN GROUP ( ORDER BY single order by expression
[ASC|DESC] [NULLS FIRST| NULLS LAST])
OVER ( [PARTITION BY value expression [,...]] )
This query computes the same thing (median credit limit for customers in this result
set, but reports the result for every row in the result set, as shown in the following
output:
SELECT cust_id, cust_credit_limit, PERCENTILE_DISC(0.5) WITHIN GROUP
(ORDER BY cust_credit_limit) OVER () AS perc_disc,
Inverse Percentile Functions
SQL for Analysis and Reporting 21-31
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY cust_credit_limit)
OVER () AS perc_cont
FROM customers WHERE cust_city='Marshal';
CUST_ID CUST_CREDIT_LIMIT PERC_DISC PERC_CONT
---------- ----------------- ---------- ----------
28344 1500 5000 5000
8962 1500 5000 5000
36651 1500 5000 5000
32497 1500 5000 5000
15192 3000 5000 5000
102077 3000 5000 5000
102343 3000 5000 5000
8270 3000 5000 5000
21380 5000 5000 5000
13808 5000 5000 5000
101784 5000 5000 5000
30420 5000 5000 5000
10346 7000 5000 5000
31112 7000 5000 5000
35266 7000 5000 5000
3424 9000 5000 5000
100977 9000 5000 5000
103066 10000 5000 5000
35225 11000 5000 5000
14459 11000 5000 5000
17268 11000 5000 5000
100421 11000 5000 5000
41496 15000 5000 5000
Inverse Percentile Restrictions
For PERCENTILE_DISC, the expression in the ORDER BY clause can be of any data
type that you can sort (numeric, string, date, and so on). However, the expression in
the ORDER BY clause must be a numeric or datetime type (including intervals)
because linear interpolation is used to evaluate PERCENTILE_CONT. If the
expression is of type DATE, the interpolated result is rounded to the smallest unit
for the type. For a DATE type, the interpolated value will be rounded to the nearest
second, for interval types to the nearest second (INTERVAL DAY TO SECOND) or to
the month(INTERVAL YEAR TO MONTH).
Like other aggregates, the inverse percentile functions ignore NULLs in evaluating
the result. For example, when you want to find the median value in a set, Oracle
Database ignores the NULLs and finds the median among the non-null values. You
Hypothetical Rank and Distribution Functions
21-32 Oracle Database Data Warehousing Guide
can use the NULLS FIRST/NULLS LAST option in the ORDER BY clause, but they
will be ignored as NULLs are ignored.
Reference: Oracle Datawarehousing Guide
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment