
Energy & Fuels 2007, 21, 2955-2963

Comparison of Probability Distribution Functions for Fitting Distillation Curves of Petroleum
Sergio Sánchez,† Jorge Ancheyta,*,†,‡ and William C. McCaffrey§
Instituto Mexicano del Petróleo, Eje Central Lázaro Cárdenas 152, D.F. 07330, México, Escuela Superior
de Ingeniería Química e Industrias Extractivas (ESIQIE-IPN), D.F. 07738, Mexico, and Department of
Chemical and Materials Engineering, University of Alberta, Edmonton, Alberta, T6G 2G6 Canada

Received January 2, 2007. Revised Manuscript Received June 14, 2007

The fitting capability of 25 probability distribution functions for distillation data of petroleum fractions was
analyzed in this work. Rankings of all the functions based on two different approaches were established after
a statistical analysis of the fit of the functions with a data set of 137 distillation curves. In general, distribution
functions with four parameters showed better fitting capability than those with three parameters. Two-parameter
functions were not effective in fitting distillation data. The Weibull extreme, Kumaraswamy, and Weibull
functions were found to be the best distribution functions for fitting distillation data considering their ranking
and the required CPU time. These distribution functions exhibited the lowest Akaike information criterion and
Bayesian information criterion average values, standard deviations lower than 1%, correlation coefficients
higher than 0.999, and residuals randomly distributed without any tendency. The fitting capability of the best
functions was validated with an independent set of distillation data, and the ranking was the same.

Introduction

Specialized characterization of petroleum has been the topic of a number of research papers, ranging from the well-known and widely used “assays” to sophisticated studies based on up-to-date laboratory techniques, for example, NMR, MS, XRD, and so forth. Since the early 1930s, several studies have discussed the implementation of more accurate characterization methods.1 While sophisticated approaches to characterize petroleum have enhanced our understanding of its structure, traditional characterization methods are still widely employed. Empirical correlations are also very popular to estimate “bulk” properties of petroleum fractions, which are mainly based on distillation curves and specific gravities. These correlations are very useful in process engineering, particularly when the available experimental data are limited.2 In the case of distillation curves, sometimes only a limited number of distillation points is available and it is necessary to interpolate/extrapolate to determine a required value. Several American Society for Testing and Materials (ASTM) methods are commonly used to obtain distillation data: ASTM D-5307, ASTM D-2892, ASTM D-1160, ASTM D-86, and so forth. All of them employ standardized devices and report boiling point temperatures of the sample versus distillation yields on a volumetric and/or a gravimetric basis. Since distillation curves are formed with a finite number of temperature-yield data, they can be fitted to different functions in order to generate a continuous representation. Polynomial regression, cubic spline interpolation, and Lagrange interpolation are all common mathematical tools which have been used for interpolating points along the distillation curve. Another approach, which offers more accurate adjustments, is the use of least-squares methods for fitting probability distribution functions to distillation data.

Probability distribution functions have been utilized in a wide variety of applications: in process simulation software during the creation of pseudocomponents, which are used together with quadrature techniques for determining the optimal number of pseudocomponents for simulation purposes, for example, characterization of petroleum fractions;3-5 for phase equilibrium calculations when continuous thermodynamics methods are applied;6-8 and to describe the extent of chemical transformations occurring during petroleum refining processes, which is a relatively recent application of the probability distribution functions.9,10 Other examples include the description of polymerization reaction products, which are mixtures of compounds with different molecular weights, and use in environmental studies for representing the size distribution of particles of dust and aerosols, the rain precipitation per day, the level of rivers and lakes, and other meteorological phenomena.11

Distillation data and specific gravity are the most common properties used as inputs in empirical correlations to characterize petroleum fractions. This characterization is achieved by means of correlations that are useful for determining molecular weight, critical properties, and so forth. They can also be utilized for distinguishing reaction products as pseudocomponents or lumps (naphtha, middle distillates, etc.) of some typical refinery
* To whom correspondence should be addressed. Fax: (+52-55) 9175-8418. E-mail: jancheyt@imp.mx.
† Instituto Mexicano del Petróleo.
‡ ESIQIE-IPN.
§ University of Alberta.
(1) Watson, K. M.; Nelson, E. F. Ind. Eng. Chem. 1933, 25, 880.
(2) Technical data book - petroleum refining; American Petroleum Institute: Washington, DC, 1976.
(3) Whitson, C. H. SPE J. 1983, 275, 683.
(4) Whitson, C. H.; Anderson, T. F.; Soreide, I. NPRA, New Orleans, LA, March 6-10, 1988.
(5) Dhulesia, H. Hydrocarbon Process. 1984, 62, 179.
(6) Kehlen, H.; Ratzsch, M. T. Chem. Eng. Sci. 1987, 42, 221.
(7) Willman, B.; Teja, A. S. Ind. Eng. Chem. Res. 1986, 26, 948.
(8) Peng, D. Y.; Wu, J. P.; Batycky, J. P. AOSTRA J. Res. 1987, 3, 113.
(9) Bacaud, R.; Rouleau, L.; Bacaud, B. Energy Fuels 1996, 10, 915.
(10) Krishna, R.; Saxena, A. K. Chem. Eng. Sci. 1989, 44, 703.
(11) Kumaraswamy, P. J. Hydrol. 1980, 46, 79.

10.1021/ef070003y CCC: $37.00 © 2007 American Chemical Society


Published on Web 08/08/2007
2956 Energy & Fuels, Vol. 21, No. 5, 2007 Sánchez et al.

processes such as hydrocracking, catalytic cracking, and so forth.12 To have accurate and reliable representations of distillation data for further interpolation, a strict analysis of other approaches apart from the traditional interpolation techniques is mandatory.

The main objective of the present work is to contrast the fitting capability of several probability distribution functions to distillation data.

Brief Background of Probability Distribution Functions

Probability distribution functions were first developed to measure the possibility of the occurrence of a specific event. Depending on the number of possible events, they can be discrete functions, when the number of possible events is discrete, or they can be classified as continuous functions. In the present contribution, only continuous distribution functions are studied. Probability distribution functions are defined for a reduced number of parameters and have simple formulas for calculating mean, mode, variance, and so forth. They have two main forms: the probability density function (PDF) and the cumulative distribution function (CDF). The former is the most commonly used form of the probability distribution functions; a very well-known example of this type of function is the classical Gaussian bell.13 CDFs increase monotonically and generally describe the same behavior that is observed with distillation curves. Due to their simplicity, probability distribution functions are easily included in computer programs for modeling, optimization, and control purposes. It is not recommended to select the distribution function a priori, however, since finding an adequate distribution function that represents the experimental data with minimal error depends on the specific characteristics of each type of function.

As was mentioned in the Introduction, several distribution functions have been utilized for calculations related to the petroleum industry. Whitson et al.3,4 proposed a petroleum characterization method based on the three-parameter γ distribution function for the characterization of C7+. Dhulesia5 proposed the Weibull distribution function in its cumulative form to describe ASTM distillation curves of petroleum fractions. The Weibull equation was tested with distillation data of the feed and products of a fluid catalytic cracking unit, and the fitted curves showed good agreement with experimental data. The normal function has been employed in phase equilibrium methods when the continuous thermodynamics approach was used.6 The normal function and the error function were utilized for modeling reaction behavior of the hydrocracking process.9,10 Willman and Teja7 used the bivariate log-normal distribution function for characterizing the composition of mixtures involved during phase equilibrium calculations. The β distribution function has been employed for characterizing petroleum fractions in state equation calculations.8 Exponential and χ² distribution functions, which are simplified cases of the γ distribution function, have been used to characterize the heavy end of reservoir fluids and to develop phase equilibrium computations.14,15 A modified form of the Weibull distribution was utilized by Riazi16 for establishing a method to predict complete property distributions for the molecular weight, boiling point, specific gravity, and refractive index of C7+ fractions. The extreme value distributions are a family of equations which have been used in applications involving natural phenomena such as rainfall, floods, air pollution, and corrosion. Gumbel, Fréchet, and Weibull functions may all be represented as members of a single family of generalized extreme value distribution functions.17 The Kumaraswamy distribution is comparable in characteristics with the β distribution function for their versatility and double-bounded nature; however, the Kumaraswamy distribution has a simpler form than the β distribution for both the PDF and the CDF.11 Even though the Kumaraswamy distribution function was originally developed for hydraulic modeling, it has been applied to describe bounded physical variables encountered in civil engineering.

Among all distribution functions found in the literature, only 25 were chosen to be analyzed in this work. The selected distributions are summarized in Table 1. Not included in the list are a few well-known distributions, such as the Tukey-Lambda, Cauchy, and F distributions,18 because they are either seldom used to model empirical data or they lack a convenient analytical form for the CDF. Even though most of the distributions reported in Table 1 are not widely applied to engineering applications, they all have the potential to be very useful for describing real-world data sets. Definitions of the probability distribution functions in their two main forms (CDF and PDF) are presented in Table 2. The forms of the functions may vary slightly from those reported in the literature. It is also possible that a distribution function could be known by different names.

Table 1. Probability Distribution Functions Used in This Work^a

                                   number of    other functions included
eq   function                      parameters   PDF    CDF                  ref
1    α (normalized)                2            Φ      Φ                    18
2    α                             4            Φ      Φ                    18
3    β                             4            Γ      I                    42
4    Bradford                      3                                        42
5    Burr                          4                                        42
6    χ                             3            Γ      Γ                    42
7    fatigue life                  3                   Φ                    18
8    Fisk                          3                                        42
9    Fréchet                       3                                        17
10   folded normal                 2                   Φ                    42
11   γ                             3            Γ      Γ                    42
12   generalized extreme value     3                                        17
13   generalized logistic          3                                        42
14   Gumbel                        2                                        17
15   half normal                   2                                        42
16   Johnson SB                    4                   Φ                    43
17   Kumaraswamy                   4                                        11
18   log-normal                    2                   Φ                    42
19   Nakagami                      3            Γ      Γ                    42
20   normal                        2                   Φ                    42
21   Riazi                         3                                        16
22   Student's t                   3            Γ      I                    42
23   Wald                          2                   Φ                    42
24   Weibull                       3                                        42
25   Weibull extreme               4                                        18
^a Φ: CDF normal. Γ: γ function. I: incomplete β function.

(12) Ancheyta, J.; Sánchez, S.; Rodríguez, M. A. Catal. Today 2005, 109, 76.
(13) Evans, M.; Hastings, N.; Peacock, B. Statistical distributions; John Wiley: New York, 1993.
(14) Behrens, R. A.; Sandler, S. I. SPE Res. Eng. 1998, 3, 1041.
(15) Luks, K. D.; Turek, E. A.; Kragas, T. R. Ind. Eng. Chem. Res. 1990, 29, 2101.
(16) Riazi, M. R. Ind. Eng. Chem. Res. 1989, 28, 1731.
(17) Kotz, S.; Nadarajah, S. Extreme Value distributions: theory and applications; Imperial College Press: London, 2000.
(18) Heckert, N. A.; Filliben, J. NIST handbook 148; NIST: Gaithersburg, MD, 2003.

Methodology

Data Source. The fitting capability of the 25 selected functions was evaluated using two distillation data sources: those previously

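Table 2. Definitions of Continuous Probability Distribution Functions^a

1  α (normalized) (C, D)
   PDF: $\frac{D}{y^{2}\,\Phi(C)\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(C - \frac{D}{y}\right)^{2}\right]$
   CDF: $\frac{\Phi\left(C - \frac{D}{y}\right)}{\Phi(C)}$

2  α (A, B, C, D)
   PDF: $\frac{D}{B\,t^{2}\,\Phi(C)\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(C - \frac{D}{t}\right)^{2}\right]$
   CDF: $\frac{\Phi\left(C - \frac{D}{t}\right)}{\Phi(C)}$

3  β (A, B, C, D)
   PDF: $\frac{\Gamma(C + D)}{\Gamma(C)\,\Gamma(D)\,(B - A)^{C+D-1}} (y - A)^{C-1} (B - y)^{D-1}$
   CDF: $I\left(\frac{y - A}{B - A}, C, D\right)$

4  Bradford (A, B, C)
   PDF: $\frac{C}{\left[C(y - A) + B - A\right] \ln(C + 1)}$
   CDF: $\frac{\ln\left[1 + \frac{C(y - A)}{B - A}\right]}{\ln(C + 1)}$

5  Burr (A, B, C, D)
   PDF: $\frac{CD}{B}\, t^{-C-1} \left(1 + t^{-C}\right)^{-D-1}$
   CDF: $\left(1 + t^{-C}\right)^{-D}$

6  χ (A, B, C)
   PDF: $\frac{t^{C-1} \exp\left(-\frac{1}{2}t^{2}\right)}{2^{(C/2)-1}\, B\, \Gamma\left(\frac{C}{2}\right)}$
   CDF: $1 - \Gamma\left(\frac{C}{2}, \frac{1}{2}t^{2}\right)$

7  fatigue life (A, B, C)
   PDF: $\frac{y^{2} - A^{2}}{2\sqrt{2\pi}\, C^{2} B\, y^{2}} \left[\sqrt{\frac{y}{A}} - \sqrt{\frac{A}{y}}\right]^{-1} \exp\left[-\frac{1}{2C^{2}}\left(\frac{y}{A} + \frac{A}{y} - 2\right)\right]$
   CDF: $\Phi\left\{\frac{1}{C}\left[\left(\frac{y}{A}\right)^{0.5} - \left(\frac{A}{y}\right)^{0.5}\right]\right\}$

8  Fisk (A, B, C)
   PDF: $\frac{C}{B}\, \frac{t^{C-1}}{\left(1 + t^{C}\right)^{2}}$
   CDF: $\frac{1}{1 + t^{-C}}$

9  Fréchet (A, B, C)
   PDF: $\frac{C}{B}\, t^{-C-1} \exp\left(-t^{-C}\right)$
   CDF: $\exp\left(-t^{-C}\right)$

10 folded normal (A, B)
   PDF: $\frac{1}{B}\sqrt{\frac{2}{\pi}} \cosh\left(\frac{Ay}{B^{2}}\right) \exp\left(-\frac{1}{2}\, \frac{y^{2} + A^{2}}{B^{2}}\right)$
   CDF: $\Phi(t) - \Phi\left(\frac{-y - A}{B}\right)$

11 γ (A, B, C)
   PDF: $\frac{1}{B\,\Gamma(C)}\, t^{C-1} \exp(-t)$
   CDF: $\Gamma(C, t)$

12 generalized extreme value (A, B, C)
   PDF: $\frac{1}{B} (1 + Ct)^{-1-(1/C)} \exp\left[-(1 + Ct)^{-1/C}\right]$
   CDF: $\exp\left[-(1 + Ct)^{-1/C}\right]$

13 generalized logistic (A, B, C)
   PDF: $\frac{C}{B}\, \frac{\exp(-t)}{\left[1 + \exp(-t)\right]^{C+1}}$
   CDF: $\frac{1}{\left[1 + \exp(-t)\right]^{C}}$

14 Gumbel (A, B)
   PDF: $\frac{1}{B} \exp(-t) \exp\left[-\exp(-t)\right]$
   CDF: $\exp\left[-\exp(-t)\right]$

15 Johnson SB (A, B, C, D)
   PDF: $\frac{D(B - A)}{(y - A)(B - y)\sqrt{2\pi}} \exp\left\{-\frac{1}{2}\left[C + D \ln\left(\frac{y - A}{B - y}\right)\right]^{2}\right\}$
   CDF: $\Phi\left[C + D \ln\left(\frac{y - A}{B - y}\right)\right]$

16 Kumaraswamy (A, B, C, D)
   PDF: $CD \left(\frac{y - A}{B - A}\right)^{C-1} \left[1 - \left(\frac{y - A}{B - A}\right)^{C}\right]^{D-1}$
   CDF: $1 - \left[1 - \left(\frac{y - A}{B - A}\right)^{C}\right]^{D}$

17 half normal (A, B)
   PDF: $\frac{1}{B}\sqrt{\frac{2}{\pi}} \exp\left(-\frac{1}{2}t^{2}\right)$
   CDF: $2\Phi(t) - 1$

18 log-normal (A, B)
   PDF: $\frac{1}{B y \sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{\ln(y) - A}{B}\right)^{2}\right]$
   CDF: $\Phi\left(\frac{\ln(y) - A}{B}\right)$

19 Nakagami (A, B, C)
   PDF: $\frac{2C^{C}}{B\,\Gamma(C)}\, t^{2C-1} \exp\left(-Ct^{2}\right)$
   CDF: $\Gamma\left(C, Ct^{2}\right)$

20 normal (A, B)
   PDF: $\frac{1}{B\sqrt{2\pi}} \exp\left(-\frac{1}{2}t^{2}\right)$
   CDF: $\Phi(t)$

21 Riazi (A, B)
   PDF: $\frac{B^{2}}{A}\, y^{B-1} \exp\left(-\frac{B}{A}\, y^{B}\right)$
   CDF: $1 - \exp\left(-\frac{B}{A}\, y^{B}\right)$

22 Student's t (A, B, C)
   PDF: $\frac{\Gamma\left(\frac{C + 1}{2}\right)}{B\sqrt{\pi C}\, \Gamma\left(\frac{C}{2}\right)} \left(1 + \frac{t^{2}}{C}\right)^{-[(C+1)/2]}$
   CDF: $\frac{1}{2} I\left(\frac{C}{C + t^{2}}, \frac{C}{2}, \frac{1}{2}\right)$ for $t \le 0$; $1 - \frac{1}{2} I\left(\frac{C}{C + t^{2}}, \frac{C}{2}, \frac{1}{2}\right)$ for $t > 0$

23 Wald (A, B)
   PDF: $\sqrt{\frac{B}{2\pi y^{3}}} \exp\left[-\frac{B}{2y}\left(\frac{y - A}{A}\right)^{2}\right]$
   CDF: $\Phi\left(\sqrt{\frac{B}{y}}\, \frac{y - A}{A}\right) + \exp\left(\frac{2B}{A}\right) \Phi\left(\sqrt{\frac{B}{y}}\, \frac{-y - A}{A}\right)$

24 Weibull (A, B, C)
   PDF: $\frac{C}{B}\, t^{C-1} \exp\left(-t^{C}\right)$
   CDF: $1 - \exp\left(-t^{C}\right)$

25 Weibull extreme (A, B, C, D)
   PDF: $\frac{CD}{B}\, t^{C-1} \exp\left(-t^{C}\right) \left[1 - \exp\left(-t^{C}\right)\right]^{D-1}$
   CDF: $\left[1 - \exp\left(-t^{C}\right)\right]^{D}$

^a $t = (y - A)/B$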
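As a minimal sketch of how the closed-form CDFs of Table 2 might be coded, the following Python functions evaluate the Weibull (eq 24), Weibull extreme (eq 25), and Kumaraswamy forms. The function names, the clamping of values outside the support, and the sample arguments are assumptions made for illustration; they are not taken from the original work.

```python
import math


def reduced(y, A, B):
    """Reduced variable t = (y - A) / B (Table 2, footnote a)."""
    return (y - A) / B


def weibull_cdf(y, A, B, C):
    """Weibull CDF, 1 - exp(-t^C); returns 0 for y <= A (assumed support)."""
    t = reduced(y, A, B)
    return 1.0 - math.exp(-(t ** C)) if t > 0 else 0.0


def weibull_extreme_cdf(y, A, B, C, D):
    """Weibull extreme CDF, [1 - exp(-t^C)]^D."""
    return weibull_cdf(y, A, B, C) ** D


def kumaraswamy_cdf(y, A, B, C, D):
    """Kumaraswamy CDF, 1 - [1 - z^C]^D with z = (y - A) / (B - A).

    z is clamped to [0, 1]; the clamping is an assumption of this sketch.
    """
    z = min(max((y - A) / (B - A), 0.0), 1.0)
    return 1.0 - (1.0 - z ** C) ** D


# Illustrative values only (not fitted parameters from the paper)
print(weibull_cdf(0.5, 0.0, 0.4, 2.0))
print(weibull_extreme_cdf(0.5, 0.0, 0.4, 2.0, 1.5))
print(kumaraswamy_cdf(0.5, 0.0, 1.0, 2.0, 3.0))
```

With D = 1 the Weibull extreme form reduces to the three-parameter Weibull form, which is a convenient consistency check when coding the table.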
reported in the literature19-37 and data obtained in laboratories at the Mexican Institute of Petroleum and the University of Alberta. The selected samples include whole crude oils; vacuum gasoils; atmospheric and vacuum residua; atmospheric gasoils; light cycle oils (LCO); hydrotreated LCO; feeds of the fluid catalytic cracking (FCC) process; and feeds and products of mild thermal processing, vacuum residue hydrotreating, hydrotreating of bitumen-derived gasoils, and hydrotreating of middle distillates. The distillation data set comprised petroleum samples mainly from Kuwait, Saudi Arabia, Mexico, and Canada. A total of 137 distillation curves were considered in the analysis, each having at least six experimental points, with a total of 1627 temperature-versus-yield points. All experimental distillation data were obtained using standardized methods (physical distillation methods: ASTM D-2892 and ASTM D-1160; simulated distillation methods: ASTM D-5307, ASTM D-6352, and high-temperature simulated distillation) and were collected in a database. Distillation data were not reduced to a unique basis; instead, they were treated on their original basis (ASTM D-2892, ASTM D-1160, ASTM D-5307, etc.). Different units of temperature and product recovery were found in the literature and were transformed to degrees Celsius and weight percent.

Example of Parameter Estimation. The comparison of fitting capability of all functions reported in Tables 1 and 2 was performed by statistical methods. The procedure for parameter estimation is described below, and the four-parameter β-distribution function (eq 3) is taken as an example using a single distillation data set, which corresponds to a simulated distillation curve of hydrocracked Maya crude oil:

Step 1. Temperature data were changed to a dimensionless form using the following equation:

$\theta_i = \frac{T_i - T_0}{T_1 - T_0}$ (26)

where θi is the dimensionless temperature, Ti is the actual boiling point temperature, and T0 and T1 are reference temperatures, which are chosen to have θi values between 0 and 1 (T0 = 30 °C and T1 = 1000 °C in this work). The dimensionless distillation curve together with the original distillation data of a selected sample are shown in Figure 1. In the case of dimensionless distillation data, the values neither start at zero nor finish at one since the reference temperatures, T0 and T1, covered a wider range of values than those of the selected sample.

Step 2. An optimization method38 was applied for obtaining the optimal set of parameters of the probability distribution function. The optimization criterion was the minimization of the residual sum of squares (RSS) defined by eq 27.

$\mathrm{RSS} = \sum_i \left(y_{\mathrm{exp},i} - y_{\mathrm{cal},i}\right)^{2}$ (27)

where yexp,i and ycal,i are the experimental and calculated weight fractions, respectively. The optimal set of parameters using a β distribution function for the data given in Figure 1 was A = 0.08962, B = 1.05013, C = 2.49003, and D = 6.34186. To be sure about the precision of the estimated parameters and convergence to a global minimum, a sensitivity analysis was conducted using an approach reported elsewhere.39

Step 3. Predicted values of liquid recovery using the distribution function with the optimal values of parameters were obtained. The results from the model are also shown in Figure 1.

(19) Ali, F.; Ghaloum, N.; Hauser, A. Energy Fuels 2006, 20, 45.
(20) Anabtawi, J. A.; Ali, S. A. Ind. Eng. Chem. Res. 1991, 30, 2586.
(21) Aoyagi, K.; McCaffrey, W.; Gray, M. R. Pet. Sci. Technol. 2003, 21, 997.
(22) Barman, B. N. Energy Fuels 2005, 19, 1995.
(23) Bollas, G. M.; Vasalos, I. A.; Lappas, A. A.; Iatridis, D. K.; Tsioni, G. K. Ind. Eng. Chem. Res. 2004, 43, 3270.
(24) Chen, Y. W.; Tsai, M. C. Ind. Eng. Chem. Res. 1997, 36, 2521.
(25) Espinosa, M.; Figueroa, Y.; Jimenez, F. Energy Fuels 2004, 18, 1832.
(26) Laredo, G. C.; López, C. R.; Alvárez, R. E.; Castillo, J.; Cano, J. L. Energy Fuels 2004, 18, 1687.
(27) Lenoir, J. M.; Hipkin, H. G. J. Chem. Eng. Data 1973, 18, 195.
(28) Marafi, A.; Al-Bazzaz, H.; Al-Marri, M.; Maruyama, F.; Absi-Halabi, M.; Stanislaus, A. Energy Fuels 2003, 17, 1191.
(29) Marroquin, G.; Ancheyta, J.; Ramírez, A.; Farfan, E. Energy Fuels 2001, 15, 1213.
(30) Maw, S. C.; Heldman, J. L.; Hwang, S. C.; Tsonopoulos, C. Ind. Eng. Chem. Process Des. Dev. 1984, 23, 577.
(31) Michael, G.; Al-Siri, M.; Khan, Z. H.; Ali, F. A. Energy Fuels 2005, 19, 1598.
(32) Owusu-Boakye, A.; Dalai, A. K.; Ferdous, D.; Adjaye, J. Energy Fuels 2005, 19, 1763.
(33) Rousis, S. G.; Fitzgerald, W. P. Anal. Chem. 2000, 72, 1400.
(34) Sarma, A. K.; Konwer, D. Energy Fuels 2005, 19, 1755.
(35) Schwartz, H. E.; Brownlee, R. G.; Boduszinski, M. M.; Su, F. Anal. Chem. 1987, 59, 1393.
(36) Ukwuoma, O. Pet. Sci. Technol. 2002, 20, 525.
(37) Yui, S. M.; Ng, S. H. Energy Fuels 1995, 9, 665.

Figure 1. Experimental (O) and predicted (__) distillation values with β function (hydrocracked Maya crude oil).
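Steps 1 to 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-parameter Weibull CDF (eq 24) is used instead of the β function because it needs no special functions, a coarse grid search stands in for the Marquardt optimizer cited in the text, and the distillation points are hypothetical values generated from known parameters rather than data from the paper.

```python
import math

T0, T1 = 30.0, 1000.0  # reference temperatures in °C, as stated in the text


def dimensionless(T):
    """Eq 26: theta = (T - T0) / (T1 - T0)."""
    return (T - T0) / (T1 - T0)


def weibull_cdf(theta, A, B, C):
    """Weibull CDF (eq 24) evaluated on the dimensionless temperature."""
    t = (theta - A) / B
    return 1.0 - math.exp(-(t ** C)) if t > 0 else 0.0


def rss(params, thetas, y_exp):
    """Eq 27: residual sum of squares between measured and fitted fractions."""
    return sum((y - weibull_cdf(th, *params)) ** 2
               for th, y in zip(thetas, y_exp))


def grid_fit(thetas, y_exp, A=0.0, steps=60):
    """Coarse grid search over B and C with A held fixed.

    This is a stand-in for the optimizer used in the paper, chosen here
    only to keep the sketch dependency-free.
    """
    best = None
    for i in range(1, steps + 1):
        B = i / steps                # 0 < B <= 1
        for j in range(1, steps + 1):
            C = 5.0 * j / steps      # 0 < C <= 5
            r = rss((A, B, C), thetas, y_exp)
            if best is None or r < best[0]:
                best = (r, (A, B, C))
    return best


# Hypothetical boiling points (°C); yields are synthetic weight fractions
temps = [120.0, 200.0, 280.0, 360.0, 440.0, 520.0]
true_params = (0.0, 0.4, 2.0)
thetas = [dimensionless(T) for T in temps]
y_exp = [weibull_cdf(th, *true_params) for th in thetas]

best_rss, best_params = grid_fit(thetas, y_exp)
print(best_params, best_rss)
```

Because the synthetic yields are generated from parameters that lie on the search grid, the search should recover them with an RSS of essentially zero; with real data a gradient-based least-squares routine would normally replace the grid.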

Step 4. A statistical analysis and residuals analysis using predicted and experimental values were carried out in order to identify possible model errors. For data of the example, values of maximum absolute error, average absolute error, RSS, and standard deviation (SD) were 1.19, 1.36, 5.34, and 0.56, respectively.

Step 5. In most of the cases, the largest errors were found at the extreme points of the distillation curves. This can be due to the low accuracy of the experimental measurements in these parts of the curve. That is why, for practical purposes, initial boiling point (IBP) and final boiling point (FBP), or even 5% and 95% distillation points, are commonly excluded from calculations.

Parameter Estimation for All Distribution Functions. The procedure previously described was employed for the parameter estimation of all distribution functions given in Table 2 for each of the 137 distillation data sets.

As an example, comparisons of experimental and predicted values from the β-distribution function are shown in Figure 2, in which a visual analysis and predictive capability of the function can be established. Correlation coefficient (R²), slope, intercept, and standard deviation were obtained by statistical analysis of the parity plot of experimental versus calculated values of liquid recovery. A summary of the statistical parameters derived from a regression analysis of the parity plot is presented in Table 3. This table gives more quantitative analysis of the predictive capability of the β function.

Figure 2. Comparison of experimental and predicted values using β function for the testing data set.

The predictive capability of the different functions was classified according to their statistical indicators. First, a methodology based on regression analyses was applied, in which SD as calculated by eq 28 was the main criterion for establishing the ranking; the correlation coefficient (R²), slope, and intercept were also considered.

$\mathrm{SD} = \sqrt{\frac{\mathrm{RSS}}{n - 2}}$ (28)

A second approach took into consideration both the Akaike and Bayesian information criteria (AIC and BIC, respectively). The AIC is an operational way of considering both the complexity of a model and how well it fits the data.40 The AIC methodology attempts to find the model that best explains the data with a minimum of free parameters. When residuals are randomly distributed, the AIC is calculated as

$\mathrm{AIC} = 2k + n \ln\left(\frac{\mathrm{RSS}}{n}\right)$ (29)

where k is the number of parameters, n is the number of observations, and RSS is the residual sum of squares. AIC includes a penalty term (2k) that is an increasing function of the number of parameters; this feature makes it very useful for the comparison of models with different numbers of parameters. In this study, the preferred probability distribution function will be that with the lowest AIC value.

The expression to calculate the Bayesian information criterion for models with randomly distributed residuals is

$\mathrm{BIC} = k \ln(n) + n \ln\left(\frac{\mathrm{RSS}}{n}\right)$ (30)

Compared to the AIC, the BIC penalizes free parameters more strongly. In the same way as that using the AIC, the model with the lowest value of BIC is the one that is preferred. Since AIC is strongly dependent on sample size, it is recommended to use relative values and, particularly, the AIC differences (Δi, given by eq 31) for selecting a model. Models with Δi > 10 may be considered as failing to explain a substantial variation in the data and may be omitted from further consideration.

$\Delta_i = \mathrm{AIC}_i - \mathrm{AIC}_{\min}$ (31)

Alternatively, Burnham and Anderson40 proposed the use of AIC weights (ωi) for model selection, which is considered as evidence that model i is the best model for a given situation among all models. The evidence ratios (ω1/ωj) are used to compare two different models, where model 1 is the estimated best model and j indexes the rest of the models in the set. AIC weights and evidence ratios are calculated by

$\omega_i = \frac{\exp\left(-\frac{1}{2}\Delta_i\right)}{\sum_{r=1}^{R} \exp\left(-\frac{1}{2}\Delta_r\right)}$ (32)

$\frac{\omega_1}{\omega_j} = \exp\left(\frac{1}{2}\Delta_j\right)$ (33)

(38) Marquardt, D. W. J. Soc. Ind. Appl. Math. 1963, 2, 431.
(39) Alcazar, L. A.; Ancheyta, J. Chem. Eng. J. 2007, 128, 85.
(40) Burnham, K. P.; Anderson, D. R. Model selection and multimodel inference, 2nd ed.; Springer-Verlag: New York, 1998.

Results and Discussion

To determine the best distribution function to describe distillation curves, the various functions were first fitted to the experimental data. The largest errors during data fitting were obtained with boiling points close to the end of the distillation curves followed by those close to the beginning. This problem was particularly evident for the normal and Student’s t distribution functions, which are symmetrical. The difficulty in fitting distribution functions to IBP and FBP data is compounded by the larger experimental error associated with the endpoints of distillation curves. These errors are associated with the sensitivity of experimental devices when initializing or finalizing the tests and are observed regardless of whether the equipment is operated manually or automatically. The experimental error is variable, depending upon the standardized method that is employed; for instance, in the ASTM D-2892 method, errors up to 1.2 wt % for the volume recovery are tolerated, whereas in the ASTM D-1160 method, errors can range from 1.7 up to 5.7 wt % for the different points in the distillation curve.

Selecting the “best” distribution function is not a trivial task. A wide variety of statistical data can be used for this task, including standard deviations, R², Akaike and Bayesian information criteria, and even CPU time, which are all presented in Table 4. It is well-accepted that correlation coefficients are not very useful in discriminating between models. In this study, the correlation coefficients were very close to unity (0.986-0.999) for all of the functions. To highlight this point, only the α distribution function (eq 1) exhibited a value of R² lower than

Table 3. Main Statistical Parameters from the Regression Analysis of β Distribution

Regression statistics
correlation coefficient, R²    0.999
standard deviation, wt %       0.814
observations                   1474

             coefficients   lower 95%   upper 95%
slope        1.002          1.0033      1.0002
intercept    -0.0655        0.0062      -0.1372

analysis of variance   Df     sum of squares   mean square     F
regression             1      1.0961 × 10^6    1.0961 × 10^6   1.6560 × 10^6
residual               1473   973.66           0.6619
total                  1474   1.0971 × 10^6

Table 4. Ranking of All Distribution Functions

                            intercept,   SD,     SD-based   average   average   AIC-based   CPU
equation   R²      slope    wt %         wt %    ranking    AIC       BIC       ranking     time^a
1          0.986   1.023    -1.47        3.374   25         31.79     32.57     25          2.267
2          0.998   1.001    0.08         1.276   10         3.62      5.19      12          2.143
3          0.999   1.002    -0.01        0.814   2          -14.77    -13.20    2           3.705
4          0.996   1.007    -0.37        1.778   13         10.66     11.84     17          0.219
5          0.995   1.026    -0.60        1.992   21         -0.41     1.15      9           0.124
6          0.998   1.001    0.06         1.182   8          -4.04     -2.86     6           1.924
7          0.995   1.014    -0.71        1.911   19         19.02     20.19     22          0.724
8          0.998   0.998    0.22         1.264   9          4.76      5.93      13          0.114
9          0.996   1.002    -0.17        1.801   15         7.96      9.14      16          0.143
10         0.995   0.987    0.90         1.876   18         15.34     16.12     18          1.257
11         0.998   1.003    -0.09        1.072   5          -5.87     -4.69     5           3.400
12         0.996   1.002    -0.17        1.797   14         7.67      8.84      14          0.124
13         0.998   0.996    0.37         1.132   7          2.54      3.72      11          0.114
14         0.995   1.000    -0.07        1.84    16         7.78      8.56      15          0.086
15         0.993   1.016    -0.92        2.397   24         21.65     22.43     24          0.657
16         0.998   1.001    0.02         1.126   6          -3.90     -2.34     7           0.771
17         0.999   1.004    -0.05        0.885   4          -13.19    -11.62    3           0.219
18         0.996   1.012    -0.61        1.849   17         15.55     16.34     19          0.838
19         0.997   1.000    0.22         1.393   11         -0.49     0.69      8           2.086
20         0.994   0.986    0.91         2.093   23         18.34     19.12     21          1.000
21         0.996   0.997    0.45         1.673   12         2.03      3.21      10          0.124
22         0.994   0.984    1.00         2.064   22         19.66     20.83     23          4.381
23         0.995   1.015    -0.79        1.987   20         18.20     18.98     20          1.305
24         0.999   1.002    0.01         0.865   3          -10.44    -9.26     4           0.124
25         1       1.003    -0.06        0.599   1          -16.61    -15.04    1           0.248
^a CPU time relative to that required with normal distribution function (eq 20).
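The ranking statistics of eqs 28 to 33 are simple enough to sketch directly. The Python functions below are an illustrative transcription of those equations, with function names chosen here for readability; applying them to the average AIC values of the four best-ranked functions in Table 4 is an assumption of this sketch and not necessarily how the evidence ratios in the paper were computed.

```python
import math


def sd(rss, n):
    """Eq 28: SD = sqrt(RSS / (n - 2))."""
    return math.sqrt(rss / (n - 2))


def aic(rss, n, k):
    """Eq 29: AIC = 2k + n ln(RSS / n)."""
    return 2 * k + n * math.log(rss / n)


def bic(rss, n, k):
    """Eq 30: BIC = k ln(n) + n ln(RSS / n)."""
    return k * math.log(n) + n * math.log(rss / n)


def aic_differences(aics):
    """Eq 31: delta_i = AIC_i - AIC_min."""
    lowest = min(aics)
    return [a - lowest for a in aics]


def aic_weights(aics):
    """Eq 32: normalized relative likelihoods exp(-delta_i / 2)."""
    rel = [math.exp(-0.5 * d) for d in aic_differences(aics)]
    total = sum(rel)
    return [r / total for r in rel]


def evidence_ratio(delta_j):
    """Eq 33: w_1 / w_j = exp(delta_j / 2), model 1 being the best model."""
    return math.exp(0.5 * delta_j)


# Average AIC values from Table 4 for eqs 25, 3, 17, and 24
avg_aic = [-16.61, -14.77, -13.19, -10.44]
deltas = aic_differences(avg_aic)
weights = aic_weights(avg_aic)
print([round(d, 2) for d in deltas])
print(round(evidence_ratio(6.17), 1))
```

For any sample with more than about seven points, ln(n) exceeds 2, so the BIC penalty per parameter is larger than the AIC penalty, which is the sense in which the BIC penalizes free parameters more strongly.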

0.99. A parity plot is presented in Figure 2, and the slopes and intercepts of the parity plots for each model are included in Table 4. The slopes of the experimental versus predicted values plots are in all cases virtually unity (0.984-1.026), and intercepts range between -1.47 and +1.00. A more useful technique to eliminate potential models was to identify which distributions yielded nonrandom residuals. Nearly all of the two-parameter models had trends in their residuals. The two-parameter models that were eliminated due to trends in the residuals were normalized α (eq 1), Fréchet (eq 9), folded normal (eq 10), half normal (eq 15), log-normal (eq 18), normal (eq 20), Student’s t (eq 22), and Wald (eq 23). Additionally, the three-parameter models that were eliminated due to trends in their residuals were the fatigue life (eq 7) and the generalized extreme value (eq 12) models. Interestingly, the Gumbel distribution, eq 14, was the only two-parameter model to display random residuals. Not surprisingly, all of the four-parameter models were effective in describing the experimental data. For comparison purposes, Figure 3 presents the residuals analysis for the worst (two-parameter α function, eq 1) and best (four-parameter Weibull extreme function, eq 25) functions. The differences and precision of estimations are very clear; while residuals for eq 25 ranged between -5 and +5 and were randomly distributed, those for eq 1 varied from -15 to +10 with a very pronounced trend.

One method to rank the models is to compare the standard deviations. Since the standard deviation values for all functions ranged from 0.59 to 3.37 wt %, they were more useful than the R² values or the slopes and intercepts from the parity plots. From the results given in Table 4 (R², slope, intercept, and SD) and residual analysis, the following classification of accuracy of predictions was established:

Group 1: SD < 1.0; 0.999 < R² < 1; number of functions = 4 (eqs 3, 17, 24, and 25)
Group 2: 1.0 < SD < 1.992; 0.995 < R² < 0.998; number of functions = 11 (eqs 2, 4, 5, 6, 8, 9, 11, 13, 14, 16, 19, and 21)
Group 3: trends in residuals; number of functions = 10 (eqs 1, 7, 10, 12, 15, 18, 20, 22, and 23)

Functions of group 1 are the most accurate, and those of group 3 do not adequately describe the functionality of the distillation data.

Model selection should be based not solely on goodness of fit but also on the degree of confidence of the predicted parameters. It is well-known that increasing the number of free parameters to be estimated can improve the goodness of fit but can also decrease the confidence in the estimates of the model parameters. Therefore, ranking the models solely on the basis of SD data may not be satisfactory for comparing functions with different numbers of parameters and different sample sizes. In

this work, AIC and BIC were used to take into account the different numbers of parameters in the probability distribution functions when ranking the best functions to use to describe distillation data.

A value of AIC was calculated for each set of data for every function, and a new ranking was determined using the average AIC of each function. A similar procedure was also applied for calculating an average BIC value and to rank each model. Both the AIC and BIC values are presented in Table 4. Even though the BIC penalizes functions with more free parameters, the BIC-based ranking was very similar to the AIC-based ranking, with only the Gumbel and generalized extreme value functions exchanging places in the ranking. It should be noted that the four best-ranked functions using AIC, the Weibull extreme, β, Kumaraswamy, and Weibull functions, are the same as those identified by the SD-based rankings (group 1). The only difference between the top functions in the AIC- and SD-based rankings is in the order.

The question remains if there is a significant difference in the ability of the different probability distribution functions to

Figure 3. Residual plot for (a) normalized α and (b) Weibull extreme distribution functions.

Figure 4. Average AIC vs sample size. The light-colored bars represent the γ function, whereas the dark-colored bars represent the Weibull extreme function.

any of the lower-ranked models.40 Model selection is best achieved through an inspection of evidence ratios and residuals. A summary of the AIC and evidence ratios of the 10 best-ranked functions is presented in Table 5. It can be seen that the differences among the four best-ranked functions (Δi from 1.84 to 6.17) are not high enough to conclude unambiguously that there is a single best model from the top four ranked models. Since the Δi values of the models are fairly similar, variation in the selection of the best model is expected from data set to data set. In the case of distillation data, a priori selection of the model is not recommended, and instead the Weibull extreme, β, Kumaraswamy, and Weibull functions should all be evaluated. A similar conclusion can be reached using the information given by the evidence ratios for the four best-ranked functions, but the high value of the evidence ratio for the Weibull function (ω1/ωj = 21.9) makes it very unlikely that this model was the best.

The distillation data sets used in this study ranged in size from 6 to 19 data points in each set. In order to examine if sample size had any impact on model selection, AIC data versus the sample size were plotted for each function. Figure 4 shows the results for the Weibull extreme (rank 1) and γ (rank 5) functions. Two clearly defined groups are formed, both with similar general trends (sample size of 6-10 and sample size of 11-19). The average AIC values for each group for each function were calculated and ordered. The new ranking changed only slightly. Importantly, the group of the four best-ranked functions remained unchanged.

The CPU time spent during the parameter optimization process was calculated for each function. The results are included in Table 4 as relative values of computing
describe the distillation data. It is generally not recommended time with respect to that required for the normal distribution
to apply null hypothesis testing to information-theoretic ranking function. Focusing on the functions of group 1 (ranks 1-4), it
data to determine if the “best” model is significantly better than can be seen that the Weibull extreme (eq 25), Kumaraswamy

Table 5. ∆i and Evidence Ratios of the Best 10 Ranked Functions

AIC-based ranking   function (parameters)   eq   average AIC   ∆i      ωi        ω1/ωj
1                   Weibull extreme (4)     25   -16.61        0       0.61213   1.0
2                   β (4)                   3    -14.77        1.84    0.24387   2.5
3                   Kumaraswamy (4)         17   -13.19        3.42    0.11049   5.5
4                   Weibull (3)             24   -10.44        6.17    0.02794   21.9
5                   γ (3)                   11   -5.87         10.75   0.00284   215.5
6                   χ (3)                   6    -4.04         12.58   0.00114   538.1
7                   Johnson SB (4)          16   -3.90         12.71   0.00106   575.2
8                   Nakagami (3)            19   -0.49         16.12   0.00019   3171.9
9                   Burr (4)                5    -0.41         16.20   0.00019   3289.8
10                  Riazi (3)               21   2.03          18.64   0.00005   11178.2
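
The ∆i, ωi, and evidence-ratio columns of Table 5 follow directly from the average AIC values. The sketch below recomputes them under the usual information-theoretic definitions (∆i = AICi - AICmin, weights proportional to exp(-∆i/2), evidence ratio ω1/ωj). The function and variable names are illustrative and do not come from the paper. Only the 10 tabulated functions are used here, so the weights are normalized over 10 models and differ slightly from the tabulated ωi, which presumably were normalized over all functions studied. The evidence ratios do not depend on that normalization.

```python
import math

# Average AIC values copied from Table 5 (best 10 functions only).
average_aic = {
    "Weibull extreme": -16.61,
    "beta": -14.77,
    "Kumaraswamy": -13.19,
    "Weibull": -10.44,
    "gamma": -5.87,
    "chi": -4.04,
    "Johnson SB": -3.90,
    "Nakagami": -0.49,
    "Burr": -0.41,
    "Riazi": 2.03,
}


def akaike_summary(aic_by_model):
    """Return (delta, weight, evidence_ratio) dictionaries keyed by model name."""
    best_model = min(aic_by_model, key=aic_by_model.get)
    aic_min = aic_by_model[best_model]
    # AIC difference of each model with respect to the best (lowest-AIC) model.
    delta = {m: a - aic_min for m, a in aic_by_model.items()}
    # Relative likelihoods, normalized so that the weights sum to one.
    rel = {m: math.exp(-0.5 * d) for m, d in delta.items()}
    total = sum(rel.values())
    weight = {m: r / total for m, r in rel.items()}
    # Evidence ratio of the best model against each model (w_best / w_j).
    ratio = {m: weight[best_model] / w for m, w in weight.items()}
    return delta, weight, ratio


delta, weight, ratio = akaike_summary(average_aic)
for name in average_aic:
    print(f"{name:16s} delta={delta[name]:6.2f} "
          f"weight={weight[name]:.5f} ratio={ratio[name]:.1f}")
```

Because the ratio reduces to exp(∆j/2), a difference of about 6 AIC units already corresponds to an evidence ratio above 20, which is why the Weibull function is separated from the top three even though all four are kept as candidates.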

Table 6. Statistical Parameters for Regression Analysis for Data Set Validation

                       Weibull extreme   Kumaraswamy     Weibull         β               α (normalized)
eq                     25                17              24              3               1
R2                     0.994             0.994           0.994           0.993           0.984
SD                     2.38              2.43            2.54            2.75            4.10
slope                  1.004 ± 0.016     1.009 ± 0.016   1.006 ± 0.017   1.008 ± 0.009   1.010 ± 0.028
average AIC            18.11             18.54           19.34           19.92           36.09
average BIC            20.18             20.61           20.89           21.99           37.13
∆i                     0                 0.43            1.23            1.81            17.99
evidence ratio         1                 1.24            1.85            2.47            8048
positive residuals     164               167             172             176             195
negative residuals     182               179             174             170             151
absolute difference a  18                12              2               6               44

a Absolute difference between positive and negative residuals.
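
The AIC and BIC entries in Table 6 are information criteria computed from least-squares fits. The equations used by the authors are not reproduced in this part of the text, so the sketch below assumes the common least-squares forms, AIC = n ln(RSS/n) + 2k and BIC = n ln(RSS/n) + k ln(n), with RSS, n, and k as defined in the Nomenclature. The function names and the numerical inputs are hypothetical and serve only to show how the two penalties compare.

```python
import math


def aic_least_squares(rss, n, k):
    """AIC for a least-squares fit (assumed form): n*ln(RSS/n) + 2k."""
    return n * math.log(rss / n) + 2 * k


def bic_least_squares(rss, n, k):
    """BIC for a least-squares fit (assumed form): n*ln(RSS/n) + k*ln(n)."""
    return n * math.log(rss / n) + k * math.log(n)


# Hypothetical example: same residual sum of squares, four free parameters,
# evaluated for a small and a larger distillation data set.
for n_points in (6, 10, 19):
    aic = aic_least_squares(0.5, n_points, 4)
    bic = bic_least_squares(0.5, n_points, 4)
    print(f"n={n_points:2d}  AIC={aic:8.3f}  BIC={bic:8.3f}  BIC-AIC={bic - aic:6.3f}")
```

Under these assumed forms, BIC - AIC = k (ln n - 2), so the BIC penalty exceeds the AIC penalty only when n is at least 8. For data sets of 6 to 19 points, the two criteria therefore penalize extra parameters to a comparable degree, which is consistent with the nearly identical rankings reported.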

Table 7. Composition of Products from Hydrocracking at P = 10 MPa and LHSV = 0.5 h⁻¹ (El-Kady41)

product                          1        2        3
reactor temperature, °C          410      430      450

Reported Yields
gases (C2-C5)                    5.90     10.42    17.23
light naphtha (IBP-80 °C)        3.51     6.92     19.03
gasoline (80-150 °C)             9.07     15.26    36.03
kerosene (150-250 °C)            21.23    22.84    11.15
gasoil (250-380 °C)              25.28    28.26    13.75
residue (380-538 °C)             35.01    16.30    2.81

Estimated Distillation Data of Liquid Product, °C
IBP                              36.0     36.0     36.0
5%                               86.4     60.9     46.3
10%                              125.6    89.7     56.5
30%                              228.1    174.4    87.8
50%                              321.9    247.6    122.4
70%                              401.9    325.0    174.9
90%                              479.9    428.2    307.9
95%                              505.7    468.9    394.4
FBP                              538.0    538.0    538.0

(eq 17), and Weibull (eq 24) functions require similar computing time (0.248, 0.124, and 0.219,
respectively), while the required CPU time for evaluating the β distribution function (eq 3,
rank 2) is more than 10 times longer (3.705). This can be explained by the evident relative
simplicity of the Weibull extreme, Weibull, and Kumaraswamy distribution functions, which do
not include any special function as in the case of the β function (Table 1).

The following observations, based on the number of parameters (two, three, or four) of each
distribution function, can be made:
• Four-parameter distribution functions offer the best fitting capability. Five of them are
ranked among the top 10. The Weibull extreme, β, and Kumaraswamy distributions are in the
best-ranked group.
• Some of the three-parameter distribution functions can fit distillation data with good
accuracy: the Weibull and γ distributions have standard deviations of 0.86 and 1.07% and are
ranked within the best five.
• Two-parameter distribution functions exhibited poor fitting capability. All but one of them
are in group 3.
• It must be noticed that the γ and normal distribution functions, which are the most popular
distribution functions used for fitting distillation data, were ranked 5 and 20, respectively.

For validation purposes, fitting capabilities of the four best functions and the worst function
(eqs 25, 17, 24, 3, and 1) were determined using other data sets. A total of 30 samples, which
are from three whole crude oils and their various boiling range fractions, with a total of 346
points were selected for this task. They cover a wide range of distillation temperatures (from
20 to 540 °C). The validation results are presented in Table 6. Residuals for the Weibull
extreme and normalized α functions are shown in Figure 5. It can be seen in Table 6 that the
standard deviations and residuals using the four best equations are lower than those obtained
with the normalized α distribution function (SD of about 2.5 versus 4.1). The correlation
coefficients and slopes of the parity plots are closer to unity and the intercepts closer to
zero for the four best functions as compared to the worst function. Additionally, the absolute
difference between the number of positive and negative residuals of the normalized α function
is more than twice that of the other functions, which means that the former is overestimating
the experimental values. An inspection of the AIC and BIC values, ∆i, and evidence ratios
yielded the same order in the ranking from the validation set as from the testing data set.
These validation results corroborate that Weibull extreme, Kumaraswamy, and Weibull are the
best distribution functions to fit distillation data.
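
The validation statistics discussed above (SD, correlation coefficient, parity-plot slope and intercept, and the balance of positive and negative residuals) can be computed from paired experimental and calculated values. The sketch below is illustrative only: the function name and the sample data are hypothetical, the residual is assumed to be calculated minus experimental (so a positive residual means overestimation), SD is assumed to be sqrt(RSS/(n - k)), and R2 is taken as the squared Pearson correlation of the parity plot. The paper's own definitions may differ.

```python
import math


def validation_stats(y_exp, y_cal, k):
    """Parity-plot statistics for a fitted function with k free parameters."""
    n = len(y_exp)
    # Assumed sign convention: residual = calculated - experimental.
    residuals = [c - e for e, c in zip(y_exp, y_cal)]
    rss = sum(r * r for r in residuals)
    sd = math.sqrt(rss / (n - k))
    mean_e = sum(y_exp) / n
    mean_c = sum(y_cal) / n
    sxx = sum((e - mean_e) ** 2 for e in y_exp)
    syy = sum((c - mean_c) ** 2 for c in y_cal)
    sxy = sum((e - mean_e) * (c - mean_c) for e, c in zip(y_exp, y_cal))
    slope = sxy / sxx                      # parity plot: calculated vs experimental
    intercept = mean_c - slope * mean_e
    positive = sum(1 for r in residuals if r > 0)
    negative = sum(1 for r in residuals if r < 0)
    return {
        "SD": sd,
        "R2": sxy ** 2 / (sxx * syy),
        "slope": slope,
        "intercept": intercept,
        "positive": positive,
        "negative": negative,
        "abs_difference": abs(positive - negative),
    }


# Hypothetical cumulative weight fractions, not taken from the paper.
y_exp = [0.05, 0.10, 0.30, 0.50, 0.70, 0.90, 0.95]
y_cal = [0.06, 0.09, 0.31, 0.52, 0.69, 0.91, 0.96]
stats = validation_stats(y_exp, y_cal, k=4)
print(stats)
```

A large imbalance between positive and negative counts, as reported for the worst function in Table 6, signals a systematic bias that SD and R2 alone do not reveal.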
To illustrate one application of fitting distribution functions
to experimental distillation data, a data set of hydrocracking
products of vacuum gas oil (distillation range, 380-550 °C;
molecular weight, 425 g/mol; density at 15 °C, 0.931 g/mL),
obtained in a fixed-bed reactor at 10 MPa and 0.5 h⁻¹ liquid
hourly space velocity (LHSV), was taken from the literature.41
The reported composition data of various products are detailed
in Table 7. The complete distillation curves of the whole
hydrocracking products were not reported; however, they can
be reproduced from yields and temperature ranges of products
by using distribution functions. The IBP of naphtha was assumed

(41) El-Kady, F. Y. Indian J. Technol. 1979, 17, 176.
(42) McLaughlin, M. P. A compendium of common probability distributions. http://www.causascientia.org/math_stat/Dists/Compendium.pdf (accessed Mar 2007).
(43) Johnson, R.; Kotz, S.; Balakrishnan, M. Continuous Univariate Distributions; John Wiley and Sons: New York, 1994.

Figure 5. Residual plot for validation data set. (+) Weibull extreme; (○) normalized α.

to be that of n-C5 so that a total of six points was available. The procedure previously
described was applied and a set of optimal parameters for the Weibull extreme distribution
function was determined, and a complete distillation curve was generated from only partial
data, which is also reported in Table 7. This procedure and data were successfully applied for
kinetic modeling of the hydrocracking in which a complete distillation curve was needed.12 For
comparison purposes, the results of the complete distillation curves obtained by using the
Weibull extreme probability distribution function were plotted together with those determined
by the common interpolation method (piecewise cubic Hermite interpolation), and the results
are presented in Figure 6. The Hermite interpolation method was selected because it preserves
monotonicity and the shape of the data. It can be clearly observed that the distillation
curves obtained with cubic interpolation show “humps”, although passing through all of the
experimental points; this feature is not typically observed in distillation curves. On the
contrary, the Weibull extreme probability distribution function provides a very good fit to
the shape of the distillation curve and experimental points. It is worth mentioning that the
Weibull function was previously recommended by Dhulesia to describe distillation curves of
feeds and products of the FCC process.5

Figure 6. Comparison of Weibull extreme distribution function (solid line) and Hermite
interpolation method (- -) for representing experimental distillation data of products from
hydrocracking at different temperatures: (●) 410 °C, (■) 430 °C, and (▲) 450 °C (data from
El-Kady41).

Conclusion

The probability distribution functions in their cumulative forms are very useful in general
for fitting distillation data. On the basis of statistical analyses of 25 functions and 1474
distillation data points, it was possible to establish a ranking of fitting capability of the
functions according to two approaches: (1) with standard deviation, a correlation coefficient,
and a residuals analysis and (2) with the AIC and BIC methodology. Even when SD introduces a
bias due to the number of parameters, the rankings obtained are quite similar with both
approaches. It was possible to identify a set of four probability distribution functions,
which correlate the data within experimental error. Additionally, the required CPU time and
simplicity were taken into account as a final criterion to select the most suitable
distribution functions. Weibull extreme, Weibull, and Kumaraswamy probability distribution
functions are recommended for fitting distillation data, whose application for this purpose
has not been previously reported. Further work is necessary to correlate the features of these
best-ranked functions with the nature of distillation curves of petroleum.

Nomenclature
A, B, C, D = Distribution parameters
AIC = Akaike information criterion
AICi = AIC for model i in eq 31
AICmin = AIC for best model in eq 31
BIC = Bayesian information criterion
k = Number of free parameters
n = Number of observations
R = Number of models considered in the study in eq 32
R2 = Correlation coefficient
RSS = Residual sum of squares
SD = Standard deviation
t = Independent variable in eqs 1-25
T = Actual temperature boiling point
T1, T2 = Reference temperatures
y = Independent variable in eqs 1-25, t = (y - A)/B
ycal = Calculated weight fraction
yexp = Experimental weight fraction

Acronyms
CDF = Cumulative distribution function
FBP = Final boiling point
IBP = Initial boiling point
MS = Mass spectroscopy
NMR = Nuclear magnetic resonance
PDF = Probability density function
XRD = X-ray diffraction

Greek symbols
∆i = AIC differences for model i with respect to best model
Ι = Incomplete beta function
Γ = Gamma function
Φ = Normal cumulative distribution function
θ = Dimensionless temperature
ωi = AIC weight for model i
ω1/ωj = Evidence ratio of model j with respect to model 1

Acknowledgment. The authors thank the Mexican Institute of Petroleum for economic support.
Discussions with Fraser Forbes on the AIC and BIC rankings are also appreciated.

Supporting Information Available: Entire experimental data set used for determining the
fitting capability of probability distribution functions. This information is available free
of charge via the Internet at http://pubs.acs.org.

EF070003Y
