Narottam Chand1, R.C. Joshi2 and Manoj Misra2
1 Department of Computer Science and Engineering, National Institute of Technology Hamirpur, INDIA
2 Department of Electronics and Computer Engineering, Indian Institute of Technology Roorkee, INDIA
Email: {joshifcc, manojfec}

ABSTRACT

Caching is a key technique for improving data retrieval performance in mobile computing environments. Due to cache size limitations, cache replacement algorithms are used to find a suitable subset of items for eviction from the cache. It has been observed that cached items in a client are related to each other, and therefore the replacement of a data item that is highly associated with others may lead to a series of misses during the client's subsequent requests. Existing policies for cache replacement in mobile environments do not consider the relationships among data items along with the caching parameters. This paper proposes a novel cache replacement policy, R-LPV, that considers the caching parameters of a data item along with the relationship of this item with the cache set. Association rule based data mining is applied to find the relationships among data items. Simulation experiments show that the R-LPV policy substantially outperforms two other policies, namely I-LRU and Min-SAUD.

Keywords: Cache replacement, mobile, invalidation, data mining, profit, wireless.



1 INTRODUCTION

A promising technique to improve performance in a mobile environment is to cache frequently accessed data on the client side [1-4, 11, 19, 20]. Caching at a mobile client can relieve the bandwidth constraints imposed on wireless and mobile computing. Copies of remote data from the server can be kept in the local memory of the mobile client, substantially reducing user requests for retrieval of data from the origin server. This reduces not only the uplink and downlink bandwidth consumption but also the average query latency. Caching frequently accessed data at the mobile client also saves the energy used to retrieve repeatedly requested data.

Due to the limited cache size at a mobile client, it is impossible to hold all accessed data items in the cache. Thus, cache replacement policies are used to find a suitable subset of data items for eviction. The performance of a caching system depends heavily on the replacement algorithm used to dynamically select an eviction subset from a finite cache space. Cache replacement algorithms have been extensively studied in the context of operating systems, virtual memory management and database buffer management [5]. In these contexts, cache replacement algorithms usually maximize the cache hit ratio by attempting to cache the items that are most likely to be accessed in the near future. In contrast to the typical use of caching techniques in these areas, client side data caching in mobile data dissemination has the following characteristics [6-8]:
1. Cached data items may have different sizes, so a replacement policy needs to be extended to handle items of varying sizes. In addition to the size factor, the cost to download data items from the server may vary. As a result, the cache hit ratio may not be the best measure for evaluating the quality of a cache replacement algorithm.
2. Data items may be constantly updated at the server side. Thus the consistency issue must be considered; that is, data items that tend to become inconsistent earlier should be replaced earlier.
3. Mobile clients may frequently disconnect, either voluntarily (to save power and/or connection cost) or due to failure.
4. A client may move from one cell to another.
With the above issues in view, finding an efficient replacement algorithm in a mobile environment becomes an even more difficult problem than traditional cache replacement. To utilize the limited resources at a mobile client, cache replacement policies for mobile on-demand broadcasts with varying sizes have been investigated in the recent

Ubiquitous Computing and Communication Journal

past [7, 8]. These policies consider various caching parameters, namely data access probability, update frequency, retrieval delay from the server, cache invalidation delay, and data size, in developing a benefit function which determines the cached item(s) to be replaced. By incorporating these parameters into their designs, these cache replacement algorithms show significant performance improvement over conventional ones (for example, the LRU and LFU algorithms).

Data items queried during a period of time are related to each other [9]. Hence, the cached items in a client are associated with each other, and the replacement of a cached item cannot be seen in isolation. Replacement of a data item which is highly related to a cache subset may lead to a series of cache misses in the client's subsequent requests. Data mining research deals with finding associations among data items by analyzing a collection of data [10]. In our replacement algorithm, the access history of a client is mined to obtain association rules. Then the confidence value and the caching parameters (update frequency, retrieval delay from the server, cache invalidation delay, and data size) of the data item in the consequent of the rules are used to compute the benefit function for replacement. In contrast to the use of data access probability, our policy uses the confidence value, which is a conditional probability.

1.1 Motivation

We have identified some characteristics of the mobile environment that affect client side caching performance:
1. Various parameters of a data item, viz. update frequency, retrieval delay from the server, cache invalidation delay, and data size, contribute towards the benefit gained by caching the item.
2. Data items queried during a period of time are related to each other. The choice of forthcoming items can depend, in general, on a number of previously accessed items.
3. Cached items in a client are related to each other, and therefore the replacement of a highly associated data item may lead to a series of misses during the client's subsequent requests.
4. Cache misses are not isolated events, and a cache miss is often followed by a series of cache misses.
5. The association among data items can be utilized to make the replacement decision.
In view of the above, the motivation for our study is to design a novel cache replacement algorithm for the mobile environment. To find the relationships among data items, an association rule based data mining technique is used. Consider the following example to illustrate the caching characteristics in a mobile environment:

Example 1. Consider two association rules d1→d2 and d3→d4 with confidences of 90 percent and 40 percent, respectively. Here di denotes the data item i. Assume that the fetching delay of item d2 is 2 and that of item d4 is 4. The expected delay saving from caching d2 is then equal to 1.8 (= 0.9 × 2), whereas the expected delay saving from caching d4 is equal to 1.6 (= 0.4 × 4). In this case, caching d2 (i.e., the item with the lower fetch delay) is in fact more beneficial than caching d4 (i.e., the item with the higher fetch delay). A similar argument holds for the other caching parameters. Thus, not only the caching parameters of a data item, but also the confidence shall be taken into account when devising the cache replacement algorithm. In conclusion, a caching strategy that considers the relationships among cached items along with the other parameters is a better choice in a mobile environment.

1.2 Paper contribution

To maximize the performance improvement due to caching, we propose a novel cache replacement algorithm that considers not only the caching parameters of a data item, but also the relationship of this item with the cache set. Association rule based data mining is applied to find the relationships among data items. We design a profit function to dynamically evaluate the profit from caching an item. Simulation is performed to evaluate the performance of our algorithm under several circumstances. The experimental results show that our algorithm considerably outperforms other algorithms. More precisely, this paper makes the following contributions:
1. A data mining algorithm to generate association rules with only one item in the consequent and one or more items in the antecedent. Our motivation is to find several items that are highly related to the item to be replaced.
2. Development of a cache replacement policy that considers data association rules along with caching parameters.
3. Extensive simulation to evaluate the performance of the proposed cache replacement algorithm.

1.3 Organization

The rest of the paper is organized as follows.
The related work is described in Section 2. The system model is presented in Section 3. Section 4 describes the data mining technique used to generate the caching rules. Section 5 presents the details of the proposed cache replacement policy. Section 6 is devoted to performance evaluation and presents simulation results. Concluding remarks are given in Section 7.

2 RELATED WORK

Caching frequently accessed data on the client



side is an effective technique to improve performance in a mobile environment [1, 12]. A lot of research has been done on cache invalidation in the past few years [1-4, 11, 19, 20], with relatively little work on cache replacement methods. In the following, we briefly review related studies on cache replacement in mobile environments.

Cache replacement policies are widely studied in web proxy caching. Recently, many new deterministic replacement schemes, such as GD-Size [13], LNC-R-W3-U [14], LRV [15], and Hybrid [16], have been studied particularly for the Web proxy cache. W.-G. Teng et al. [17] have proposed a novel proxy caching strategy based on a normalized profit function, called IWCP (integration of Web caching and prefetching). This algorithm not only considers factors such as the size, fetching cost, reference rate, invalidation cost, and invalidation frequency of a Web object, but also exploits the effect caused by various Web prefetching schemes. Since constrained access latency, intermittent connectivity, and the limited power and memory capacity of mobile devices characterize today's mobile environment, the algorithms deployed in Web proxy caches cannot be adapted directly to manage a mobile cache.

The cache replacement issues for wireless data dissemination were first studied in the Broadcast Disk (Bdisk) project [12]. Acharya et al. proposed a cache replacement policy, PIX, in which the cached data item with the minimum value of p/x is evicted, where p is the item's access probability and x is its broadcast frequency. Thus, a cached item either has a high access probability or a long broadcast delay. A simulation based study showed that this strategy could significantly improve the access latency over the traditional LRU and LFU policies.

Caching and its relation to the broadcast schedule in the Bdisk system was empirically investigated in [18]. Caching and scheduling affect each other. A scheduling scheme determines data retrieval costs and thus affects the caching policy. On the other hand, caching affects scheduling as well, since it reduces client access requests to the server and thus changes the clients' access patterns. This gives rise to a circular problem. In [18], some interesting results were discovered through simulation. Efficient scheduling can provide a performance improvement when caches are small. However, schedules need not be very skewed when large caches are used, and in this case efficient caching algorithms should be favored over refined broadcast schedules. How to design a cooperative protocol such that the performance is optimized remains an open problem. All of the above studies are based on simplifying assumptions, such as fixed data sizes, no updates, and no disconnections, thereby making the proposed schemes impractical for a realistic mobile environment.

The work in [4] uses a modified LRU, called invalid-LRU (I-LRU), where invalid cache items are replaced first. If there is no invalid cache item, the client considers the oldest valid item for replacement. In [6], Xu et al. developed a cache replacement policy, SAIU, for wireless on-demand broadcast. SAIU took into consideration four factors that affect cache performance: access probability, update frequency, retrieval delay, and data size. However, an optimal formula for determining the best cached item(s) to replace based on these factors was not given in that study. Also, the influence of the cache consistency requirement was not considered in SAIU. Xu et al. [6, 7] propose an optimal cache replacement policy, called Min-SAUD, which accounts for the cost of ensuring cache consistency before each cached item is used. It is argued that cache consistency must be required, since it is crucial for many applications such as financial transactions, and that a cache replacement policy must take into consideration the cache validation delay caused by the underlying cache invalidation scheme. Similar to SAIU, Min-SAUD considers different factors in developing the gain function which determines the cache item(s) to be replaced. Although the authors proved that their gain function is optimal, they did not show how to obtain such an optimal gain function. Since this approach needs an aging daemon to periodically update the estimated information of each data item, the computational complexity and energy consumption of the algorithm are too high for a mobile device. Huaping Shen et al. [11] proposed an energy-efficient utility based replacement strategy with O(log N) complexity, called GreedyDual Least Utility (GD-LU), where N is the number of items in the cache. Recently, L. Yin et al. [8] proposed a generalized value function for cache replacement in wireless networks under a strong consistency model. The distinctive feature of this value function, in contrast to [7], is that it is generalized and can be used for various performance metrics by making the necessary changes. Its disadvantage is that the strategy suffers from high computational complexity.

3 SYSTEM MODEL

This paper studies cache replacement in a mobile environment. Fig. 1 depicts the typical system model used in the study. As illustrated, there is a cache management mechanism in a client. The client employs a rule generating engine to derive caching rules from the client's access log. The derived caching rules are stored in the caching rule depository of the client. Whenever an application issues a request, the cache request processing module first logs this request and checks whether the desired data item is in the cache. If it is a cache



hit, the cache manager still needs to validate the consistency of the cached item with the copy at the origin server. To validate the cached item, the cache manager retrieves the next validation report from the broadcast channel. If the item is verified as being up to date, it is returned to the application immediately. If it is a cache hit but the value is obsolete, the cache manager sends an uplink request to the server and waits for the data broadcast. In the case of a cache miss, the client cache manager sends an uplink request to the server for the missed item. In either case, when the requested data item arrives on the wireless channel, the cache manager returns it to the requester and retains a copy in the cache. The issue of cache replacement arises when the free cache space is not enough to accommodate a data item to be cached. We develop an optimal cache replacement scheme that incorporates a profit function to determine the cache item(s) to be replaced. Various caching parameters, along with replacement rules, are considered while computing the profit value. To address the cache consistency issue, a strategy based on update reports (URs) [4] is used in this paper. The strategy periodically broadcasts update reports to minimize uplink requests and downlink broadcasts. To reduce query latency, the strategy uses request reports (RRs), in which all recently requested items are broadcast.
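The request flow above can be sketched as a minimal client-side cache manager. This is an illustrative skeleton only: the class and parameter names (ClientCache, validation_report, and a dictionary standing in for the server) are hypothetical, and the eviction step is a placeholder for the R-LPV policy developed later.

```python
# Sketch of the client-side request flow described above. The names here
# (ClientCache, validation_report, the dict-based server) are hypothetical
# stand-ins; the real system reads validation reports from a broadcast channel.

class ClientCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = {}        # item id -> cached value
        self.access_log = []   # access trace, later mined for caching rules

    def request(self, item_id, server, validation_report):
        self.access_log.append(item_id)
        if item_id in self.items and item_id not in validation_report:
            return self.items[item_id]      # cache hit on an up-to-date item
        # cache miss, or hit on an obsolete value: uplink request to the server
        value = server[item_id]
        if item_id not in self.items and len(self.items) >= self.capacity:
            self.evict()                    # cache replacement happens here
        self.items[item_id] = value         # retain a copy in the cache
        return value

    def evict(self):
        # placeholder policy; R-LPV would instead evict the least-profit item
        self.items.pop(next(iter(self.items)))
```

A call such as `cache.request("d1", server, report)` then covers all three cases (valid hit, obsolete hit, miss) in one code path.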

Figure 1: System model for cache replacement.

4 GENERATION OF CACHING RULES

At a mobile client, data items queried during a period of time are related to each other. Observing the history of data items queried by the client can reveal relationships among the items. These relationships can take the form of patterns of behavior telling us that if the client has accessed certain items during a period of time, then it is likely that one particular item will be accessed in the near future. An example of such a relationship is "if a client accesses d1 and d2, then it accesses d3 80 percent of the time". The if part of the rule is called the antecedent, while the then part is called the consequent. Rules of this form are known as association rules in the data mining literature [10, 21, 22]. We propose to use a data mining technique to discover the association rules in the access history and to apply the rules to make the replacement decision. We call them the caching rules.

The problem of finding association rules among items has been clearly defined by Agrawal et al. [10]. Our context, the mobile environment, imposes different conditions, and thus a direct application of existing association rule mining algorithms, such as those presented in [21], is not suitable for generating the caching rules:
− We are interested in rules with one or more data items in the antecedent, but we restrict the maximum number of items in the antecedent, because the computation of rules where, for example, nine items imply a tenth one is expensive. In our opinion, such rules are not more useful than those where three items imply another one, whose computation requires less effort, making the process more suitable for a mobile client. Our motivation is to retain in the cache a data item that is highly related to the data items present in the cache. For example, of the two rules {d1, d2, ..., d9}→d10 and {d1, d2, d3}→d10, the latter is more appropriate for finding the relation of d10 with the cached items when d5 and d6 are not present in the cache. Due to cache replacement, all the items in the antecedent of a long rule may not be present in the cache, and hence such rules are rarely used.
− We want to generate rules that have just a single item in the consequent. This is because, when considering a data item di for replacement, the confidence values of the rules having item di in the consequent will be used.

In the following, a formal statement of the problem of mining caching rules is presented.

Problem statement. Let D = {d1, d2, ..., dN} be the set of data items at the server. Suppose a client's access trace S consists of a set of consecutive parts: {part1, part2, ..., parti, ..., partn}. Let A = {d1, d2, ..., dm} denote the set of data items accessed by the client. Let Si denote the data items contained in part parti. Si is called a session, and Si ⊂ A. We say a session Si contains x if Si ⊇ x, where x ⊂ A. A caching rule rx,y is an expression of the form x→dy, where x ⊂ A, dy ∈ A, and x ∩ {dy} = φ. x is called the antecedent of the rule and dy is called the consequent. In general, a set of data items is called an itemset. The number of data items in an itemset is called the size of the itemset, and an itemset of size k is called a k-itemset. The support of an itemset x, support(x), is defined as the percentage of sessions in the client's access trace S that contain x. The support of a



caching rule rx,y, support(rx,y), is defined as the support of the itemset that consists of the data items in both the antecedent and the consequent of the rule, i.e., support(rx,y) = support(x ∪ {dy}). We define the confidence of a caching rule rx,y, confidence(rx,y), as the support of the rule divided by the support of the antecedent:

    confidence(rx,y) = support(x ∪ {dy}) / support(x) × 100%.
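To make these definitions concrete, here is a minimal sketch of how support and confidence could be computed over a client's session list. The sessions and item IDs are hypothetical illustrations, not data from the paper.

```python
# Minimal sketch of support/confidence over client sessions.
# Sessions and item IDs below are hypothetical illustrations.

def support(itemset, sessions):
    """Fraction of sessions that contain every item in `itemset`."""
    return sum(1 for s in sessions if itemset <= s) / len(sessions)

def confidence(antecedent, consequent, sessions):
    """confidence(x -> d_y) = support(x union {d_y}) / support(x)."""
    return support(antecedent | {consequent}, sessions) / support(antecedent, sessions)

sessions = [{"d1", "d2", "d3"}, {"d1", "d2"}, {"d2", "d3"}, {"d1", "d2", "d3"}]
print(support({"d1", "d2"}, sessions))           # 0.75
print(confidence({"d1", "d2"}, "d3", sessions))  # ≈ 0.667
```

Here the rule {d1, d2}→d3 holds in 2 of the 3 sessions containing {d1, d2}, hence the confidence of about 67 percent.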

In general, cx,y denotes confidence(rx,y), and rx,y is expressed as x →cx,y dy. The confidence of a caching rule is the conditional probability that a session in S contains the consequent given that it contains the antecedent. Given an access trace S, the problem of mining caching rules is to find all the association rules that have support and confidence greater than the user-defined minimum support (minsup) and minimum confidence (minconf), respectively. The problem of mining caching rules can be decomposed into the following subproblems [10]:
1. Find all the itemsets x such that support(x) ≥ minsup. An itemset x that satisfies this condition is called a frequent itemset. See Section 4.1.
2. Use the frequent itemsets to generate association rules with minimum confidence. See Section 4.2.

4.1 Algorithm to generate frequent itemsets

In this section, we present an algorithm to generate frequent itemsets from the client's access trace. Table 1 shows the notations used in the algorithm.

Table 1: Notations.
NR — Maximum number of items in a rule
k-itemset — An itemset with k items
Fk — The set of frequent k-itemsets (those with minimum support)
f1, f2 — Frequent (k-1)-itemsets within Fk-1
f1[m] — m-th item in itemset f1
f — A new frequent k-itemset obtained by combining a frequent (k-1)-itemset with one item

Fig. 2 shows the main steps of the algorithm. It accepts an access trace S, a minimum support (minsup), and the maximum number of items NR to be used in a rule as parameters. In line 1, S is analyzed to generate the frequent 1-itemsets. This is done by calculating the support of each data item and comparing it to the minimum support. Every data item that has minimum support forms one frequent 1-itemset. The loop from line 3 to line 20 is used to generate all the frequent 2-, 3-, ..., k-itemsets. Each iteration of the loop, say iteration k, generates frequent k-itemsets based on the (k-1)-itemsets generated in the previous iteration. This loop continues until it is no longer possible to generate new itemsets or the number of items in an itemset exceeds the predefined maximum NR. Lines 3-14 generate all the new candidate frequent k-itemsets out of the frequent (k-1)-itemsets. Lines 15-18 remove those candidate frequent k-itemsets that do not fulfill the minimum support requirement. In line 21 the algorithm returns all the frequent itemsets generated.

1) F1 = {frequent 1-itemsets}
2) k = 2
3) while Fk-1 ≠ φ ∧ k ≤ NR do
4)   Fk = φ
5)   for each itemset f1 ∈ Fk-1 do
6)     for each itemset f2 ∈ Fk-1
7)       if f1[1] = f2[1] ∧ ... ∧ f1[k-2] = f2[k-2] ∧ f1[k-1] < f2[k-1]
8)       then f = f1 ∪ {f2[k-1]}; Fk = Fk ∪ {f}
9)         for each (k-1)-subset s ∈ f do
10)          if s ∉ Fk-1
11)          then Fk = Fk − {f}; break
12)        end
13)    end
14)  end
15)  for each itemset f1 ∈ Fk do
16)    if support(f1) < minsup
17)    then Fk = Fk − {f1}
18)  end
19)  k = k+1
20) end
21) return ∪k Fk

Figure 2: Algorithm to generate frequent itemsets.

4.2 Algorithm to generate caching rules

We are interested in generating, from a frequent k-itemset fi, rules of the form x→dy, where x is a (k-1)-itemset, dy is a 1-itemset and fi = x ∪ {dy}. Table 2 shows the notation used in the algorithm. Fig. 3 illustrates the main idea of the algorithm. The algorithm accepts the frequent itemsets and a minimum confidence (minconf) as parameters. For each frequent itemset, the rules are generated as follows. Of all the data items within the frequent itemset, one item becomes the consequent of the rule, and all the other items become the antecedent. Thus, a frequent k-itemset can generate at most k rules. For example, suppose {d1, d2, d3} is a frequent 3-itemset. It can generate at most three rules: {d1, d2}→d3, {d1, d3}→d2, and {d2, d3}→d1. After the rules have been generated, their confidences are calculated to determine whether they reach the minimum confidence. Only the rules with at least the minimum confidence are kept in the rule set Z. For example, for the rule {d1, d2}→d3, the confidence is conf = support({d1, d2, d3})/support({d1, d2}). If conf ≥ minconf, the rule holds and it will be added to the rule set Z.
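The two subproblems above can be sketched together in Python. This is an Apriori-style sketch of the join/prune/filter steps of Fig. 2 and the single-consequent rule generation of Fig. 3, not the paper's exact implementation; the session data, minsup, minconf and NR values used with it are illustrative assumptions.

```python
from itertools import combinations

# Sketch of frequent-itemset mining (Fig. 2) and single-consequent rule
# generation (Fig. 3). Sessions, minsup, minconf and max_size are illustrative.

def support(itemset, sessions):
    return sum(1 for s in sessions if itemset <= s) / len(sessions)

def frequent_itemsets(sessions, minsup, max_size):
    items = {i for s in sessions for i in s}
    level = [frozenset([i]) for i in items
             if support(frozenset([i]), sessions) >= minsup]
    frequent = list(level)
    k = 2
    while level and k <= max_size:                    # stop at NR items per rule
        # join step: unions of frequent (k-1)-itemsets that form k-itemsets
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # prune step: every (k-1)-subset must itself be frequent
        prev = set(level)
        candidates = {c for c in candidates
                      if all(frozenset(sub) in prev
                             for sub in combinations(c, k - 1))}
        # support filter (lines 15-18 of Fig. 2)
        level = [c for c in candidates if support(c, sessions) >= minsup]
        frequent.extend(level)
        k += 1
    return frequent

def caching_rules(frequent, sessions, minconf):
    """Rules x -> d_y with a single item in the consequent (Fig. 3)."""
    rules = []
    for f in frequent:
        if len(f) < 2:
            continue
        for dy in f:                                  # each item takes a turn
            x = f - {dy}                              # as the consequent
            conf = support(f, sessions) / support(x, sessions)
            if conf >= minconf:
                rules.append((x, dy, conf))
    return rules
```

For instance, `caching_rules(frequent_itemsets(sessions, 0.5, 3), sessions, 0.6)` returns triples (antecedent, consequent, confidence) ready to be stored in the caching rule depository.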

Ubiquitous Computing and Communication Journal


Table 2: Notations.
Z — The set of caching rules
Fk — The set of frequent k-itemsets
fi — Frequent k-itemset within Fk
fi[m] — m-th item in itemset fi

1) k = 2
2) while Fk ≠ φ do
3)   for each itemset fi ∈ Fk do
4)     for each item fi[j] ∈ fi
5)       if support(fi) / support(fi − fi[j]) ≥ minconf
6)       then Z = Z ∪ {{fi − fi[j]}→fi[j]}
7)     end
8)   end
9)   k = k+1
10) end
11) return Z

Figure 3: Algorithm to generate the caching rules.

4.3 Session formation

Before applying the data mining algorithm to generate caching rules, we first need to identify sessions in the client access trace. This section describes session formation and other related issues. The concept of a user session is defined in all existing web prefetching algorithms. The objective of session formation in such studies is to separate independent accesses made by the same user at distant points in time. This issue is handled in [23], where the accesses of each user are grouped into sessions according to their closeness in time: if the time between successive page requests exceeds a certain limit, it is assumed that the user is starting a new session. In this paper, we determine the session boundaries using an approach similar to [23]. We assume that a new session starts when the time delay between two consecutive requests is greater than a pre-specified threshold, session_threshold. The access trace of a client is collected in the form S = <(d1, t1), (d2, t2), ..., (di, ti), ..., (dk, tk)>, where di denotes the ID of the data item which the client accesses at time ti. After the access history of a client has been collected over a predefined time interval in the above format, the trace is partitioned into sessions. If the time gap between two consecutive accesses is larger than session_threshold, a new session starts. For example, if ti+1 − ti > session_threshold, we assume that the session <d1, d2, ..., di> ends and that at ti+1 a new session starts with first item di+1. In this way the access trace is partitioned into sessions and used in the frequent itemset generation algorithm given in Section 4.1. The caching rules that are generated are highly related to the client's access pattern, and the access

pattern may change from time to time. For example, at one time the client is interested in stock quotes, and after some time the client may like to browse the cricket score. Therefore, we need to analyze the client's access trace periodically, eliminating obsolete rules and adding new rules as necessary. Mining the access trace very frequently is a waste of client resources; however, using the same set of rules for a long time may affect the caching performance negatively, since the current rule set may no longer reflect the access pattern of the client. In the R-LPV scheme, we re-mine and update the caching rules periodically to keep them fresh. This is done by adding recent accesses to the trace and cutting off the oldest part of the trace.

Another important issue is whether to perform the mining on the mobile support station (MSS) or on the client. Mining of association rules in a mobile environment is application dependent. For example, data broadcasting [24] and mobility prediction [25] use MSS side mining, whereas in prefetching [9] the mining is performed on the client side. Mining on the MSS is a better choice when the decision is to be taken on the wired side, as in [24] and [25]. For caching and prefetching, mining on the MSS has the following disadvantages:
− Overhead of broadcasting the rule set to the clients.
− The access trace at the MSS corresponds to the cache miss pattern only. If there is a cache hit, the access is not reflected at the MSS. Transmitting the access trace for cache hits to the MSS consumes client energy as well as wireless bandwidth.
− Communication consumes more energy than computation for a mobile client. For example, the energy cost of transmitting 1K bits of information is approximately the same as the energy consumed to execute three million instructions [26].
Keeping the above facts in view, in our approach each client mines and maintains its own caching rule set independently of the other clients.

5 RULE BASED CACHE REPLACEMENT

Here we present a replacement strategy called Rule based Least Profit Value (R-LPV), which considers the profit gained from caching a data item. To devise a profit function for caching an item, we have to find the replacement rule set for the item, which is a subset of the caching rules generated in Section 4.

5.1 Generating the replacement rule set

The replacement rules to be used in the computation of the profit function are generated from the caching rules. For example, to compute the profit value profity for an item dy, the replacement rule set, Zy,



contains all the caching rules with consequent dy and antecedent as subset of cached items at the client. The generation steps for Zy are as follows: 1. Include in Zy all the rules rx,y ∈ Z such that every data item that appears in x is cached by the client. If V is set of cached data items at the client, then Zy = U rx , y ∧ rx , y ∈ Z ∧ x ⊆ V . For example, if Z = {{d1, d2, d3}→d4, {d1, d2}→d4, {d1, d5}→d2, {d1, d5}→d4}, then Z4 = {{d1, d2, d3}→d4, {d1, d2}→d4, {d1, d5}→d4}. If a rule rx,y is contained in Zy, then no rule rw,y with w ⊂ x should be used in the replacement set Zy. So we retain only the rules with maximum number of items in the antecedent. To achieve this, update Zy as, Zy = Zy U rx , y ∧ rw , y ∈ Z y ∧ w ⊂ x ∧ rx , y ∈ Z y . For example rule {d1, d2}→d4 will be excluded because {d1,d2} ⊂ {d1,d2,d3}, so Z4 = {{d1, d2, d3}→d4, {d1, d5}→d4}. The idea behind retaining the rules with maximum items in the antecedent is to consider the relation of an item with largest cache subset while performing replacement. An item which is related to larger subset is more beneficial for the client and hence is assigned lower priority for replacement. 5.2 Replacement using profit function Most cache replacement algorithms employ an eviction function to determine the priorities of data items to be replaced (or to be cached). We devise a profit function to determine the gain from caching an item. Our replacement policy depends on the profit of individual item to be kept in the cache. To determine the profit due to an item dy, we will first calculate the expected number of accesses cy of dy. Using the replacement rule set Zy, the value cy for an item dy can be computed in terms of confidence of the rules. Due to replacement rule rx,y ∈ Zy, the expected number of accesses of item dy is increased by cx,y. Therefore cy =
rx , y ∈Z y

The profit_y for a cached item d_y is given as

   profit_y = c_y × ds_y    (2)

where ds_y denotes the delay saving due to caching of item d_y. R-LPV associates a profit value with each item in the cache and replaces the item with the least profit. The delay saving can be computed as inspired by [7, 8]. To facilitate the computation, the notations are given in Table 3. Based on these notations, the delay saving ds_y due to caching of item d_y is

   ds_y = b_y − v − Pu_y × v_y    (3)

Eq. (3) can be justified as follows. If data item d_y is not in the cache, it takes b_y to retrieve it into the cache; otherwise, if d_y is in the cache, this delay of b_y is saved. However, it also takes (v + Pu_y × v_y) to validate the item and get the updated data if necessary. Thus caching the item d_y saves a delay of (b_y − (v + Pu_y × v_y)) per access. Combining Eqs. (2) and (3), we get

   profit_y = c_y × (b_y − v − Pu_y × v_y)    (4)

Table 3: Notations.

   D        The set of all the data items in the database
   N        The number of data items in the database
   d_y      A data item with id y
   r_{x,y}  Caching rule x → d_y
   c_{x,y}  Confidence of the rule r_{x,y}
   c_y      Expected number of accesses to item d_y
   s_y      Size of data item d_y
   b_y      The delay of retrieving data item d_y from the server, i.e., the cache miss penalty
   v        The cache validation delay, i.e., the access latency of an effective invalidation report
   v_y      The delay in retrieving the updated data item d_y from the server
   a_y      The mean access rate to data item d_y
   u_y      The mean update rate of data item d_y
   Pu_y     The probability of invalidating cached item d_y
   V        The set of cached data items
   C        Cache size (in bytes)
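The profit computation of Eqs. (2)-(4) can be illustrated as follows. All numeric values here are made-up examples, not the paper's measurements.

```python
# Illustrative computation of Eqs. (2)-(4): c_y summed from the confidences
# of the rules in Z_y, the per-access delay saving ds_y, and the profit.

def expected_accesses(Zy):
    # c_y = sum of c_{x,y} over rules r_{x,y} in Z_y
    return sum(conf for (_antecedent, conf) in Zy)

def profit(Zy, b_y, v, Pu_y, v_y):
    ds_y = b_y - v - Pu_y * v_y          # Eq. (3): delay saving per access
    return expected_accesses(Zy) * ds_y  # Eq. (2), i.e. Eq. (4) expanded

# Two rules with consequent d_4 and confidences 0.6 and 0.4:
Z4 = [(frozenset({1, 2, 3}), 0.6), (frozenset({1, 5}), 0.4)]
p = profit(Z4, b_y=2.0, v=0.1, Pu_y=0.25, v_y=1.2)
# c_4 = 1.0, ds_4 = 2.0 - 0.1 - 0.25*1.2 = 1.6, so profit_4 = 1.6
```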

The profit function is used to identify the items to be retained (i.e., not evicted). Intuitively, an ideal data item in the cache has a high expected number of accesses, a low update rate, a high retrieval delay and a small size. These factors are incorporated by evicting an item d_i with the minimum profit_i/s_i. The objective of cache replacement is to maximize the total profit of the cached data items, that is,

   maximize Σ_{d_y ∈ V} profit_y

subject to

   Σ_{d_y ∈ V} s_y ≤ C
Based on the above objective function, we can define our R-LPV policy, following the description provided by Yin et al. [8]. Suppose that, to add a new data item to the cache, we need to replace data items of total size s. R-LPV finds the set V′ of items to be replaced which satisfies the following conditions:

Ubiquitous Computing and Communication Journal


   Σ_{d_y ∈ V′} s_y ≥ s    (5)

   Σ_{d_y ∈ V′} profit_y ≤ Σ_{d_y ∈ V_k} profit_y, ∀V_k such that V_k ⊆ V and Σ_{d_y ∈ V_k} s_y ≥ s    (6)

Here V′ is the least profitable subset of V with a total size of at least s.

5.3 Implementation issues

The R-LPV algorithm evicts the least profitable cached items during each replacement, as given by the optimization problem defined by Eq. (6). The optimization is essentially the 0/1 knapsack problem, which is known to be NP-hard. When the data size is small compared to the cache size, a sub-optimal solution can be obtained: we define a heuristic that evicts the cached item d_y with the minimum profit_y/s_y value until the free space is sufficient to accommodate the incoming data item.

To implement the R-LPV algorithm, we need to estimate the parameters b_y, v_y and Pu_y. To facilitate this estimation, we assume that the arrivals of data accesses and updates for data item d_y follow Poisson processes. Specifically, t_a_y and t_u_y, the interarrival times of accesses and updates of item d_y, follow exponential distributions with rate parameters a_y and u_y, with density functions f(t_a_y) = a_y e^{−a_y t_a_y} and g(t_u_y) = u_y e^{−u_y t_u_y}, respectively [7]. Let Ta_y and Tu_y be the time of the next access and the time of the next invalidation of data item d_y, respectively. The probability that a cache invalidation happens before the next data access is:

   Pu_y = Pr(Tu_y < Ta_y) = ∫_0^∞ ∫_0^{Ta_y} a_y e^{−a_y Ta_y} u_y e^{−u_y Tu_y} dTu_y dTa_y = u_y / (a_y + u_y)

So Eq. (4) becomes

   profit_y = c_y × (b_y − v − (u_y / (a_y + u_y)) × v_y)    (7)

We apply the sliding window method used by Shim et al. [27] to estimate a_y and u_y. The method uses the K most recent samples:

   a_y = K / (T − T^K_{a_y})    (8)

   u_y = K / (T − T^K_{u_y})

where T is the current time, and T^K_{a_y} and T^K_{u_y} are the times of the Kth most recent access and update, respectively. When fewer than K samples are available, all the samples are used to estimate the value. Shim et al. [27] showed that K can be as small as 2 or 3 to achieve the best performance. The client access log is used to compute a_y, so there is no additional spatial overhead to maintain T^K_{a_y}. The parameter u_y is maintained and stored on the server side and is piggybacked to the clients when data item d_y is delivered.

To estimate b_y and v_y, we use the well-known exponential aging method [27], which combines the history and the currently observed value. Whenever a query for item d_y is answered by the server, b_y is re-evaluated as

   b_y = α × b_y^new + (1 − α) × b_y^old

where b_y^new is the currently measured data retrieval delay, b_y^old is the value of b_y calculated before the last retrieval of item d_y, and α is a constant factor weighing the importance of the most recent measurement. Similarly,

   v_y = α × v_y^new + (1 − α) × v_y^old

A binary min-heap data structure is used to implement the R-LPV policy, keyed on the profit_y/s_y value of each cached item d_y. When a cache replacement occurs, the root item of the heap is deleted, and the operation is repeated until sufficient space is obtained for the incoming data item. Let N_c denote the number of cached items and N_v the number of items deleted during a replacement operation. Every deletion and insertion has a complexity of O(log N_c), so each cache replacement operation takes O(N_v log N_c) time; in practice, N_v is at most about three. In addition, when an item's profit value is updated, its position in the heap needs to be updated, which also takes O(log N_c) time. Thus, the overall time complexity of R-LPV is O(log N_c).

To implement the R-LPV policy, profit_y given by Eq. (7) needs to be recalculated whenever a replacement is necessary, and this computation overhead may be high. To reduce it, we follow the idea proposed in [8]: when a cache replacement is necessary, instead of recalculating the profit value of every data item, only the N_l least profitable items are recalculated, since the items to be replaced are most likely among them, their profit values being relatively small. It has been observed that N_l = 3 provides satisfactory performance, so the computational overhead of the profit function is very low.
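The implementation details of Section 5.3 can be sketched as follows. This is a hedged illustration, not the paper's code: the helper names are invented, and the numeric values are toy inputs.

```python
# Sketch of R-LPV implementation pieces: sliding-window rate estimation
# (Eq. 8), exponential aging of delays, Pu_y = u_y/(a_y+u_y), and
# min-heap eviction on the profit/size key.
import heapq

def rate_estimate(T, sample_times, K=3):
    """Sliding window: K / (T - time of Kth most recent sample).
    If fewer than K samples exist, all of them are used."""
    recent = sorted(sample_times)[-K:]
    return len(recent) / (T - recent[0])

def aged(new, old, alpha=0.25):
    """Exponential aging: weigh the current observation against history."""
    return alpha * new + (1 - alpha) * old

def evict(cache, need):
    """cache: {item_id: (profit, size)}. Pop min profit/size items until
    'need' bytes are freed; return the evicted ids in order."""
    heap = [(p / s, item, s) for item, (p, s) in cache.items()]
    heapq.heapify(heap)
    freed, victims = 0, []
    while freed < need and heap:
        _, item, s = heapq.heappop(heap)
        victims.append(item)
        freed += s
    return victims

a_y = rate_estimate(T=100.0, sample_times=[60, 80, 90, 96])  # 3/(100-80)
u_y = rate_estimate(T=100.0, sample_times=[50, 95])          # 2/(100-50)
Pu_y = u_y / (a_y + u_y)                                     # invalidation prob.
victims = evict({1: (4.0, 10), 2: (1.0, 20), 3: (9.0, 15)}, need=25)
# item 2 (key 0.05) goes first, then item 1 (key 0.4)
```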





In this section, we evaluate the performance of the proposed methodology. We compare the R-LPV cache replacement algorithm with the I-LRU [4] and Min-SAUD [7] algorithms.

6.1 The simulation model

In the simulation, a single server maintains a collection of N data items and a number of clients access the data items. The UR [4] cache invalidation model is adopted for data dissemination.

6.1.1 The client model

The time interval between two consecutive queries generated by each client follows an exponential distribution with mean Tq. Each client generates a single stream of read-only queries; after a query is sent out, the client does not generate a new query until the pending one is served. Each client accesses the data items following a Zipf distribution [28] with skewness parameter θ, in which the access probability of the ith (1 ≤ i ≤ N) data item is

   A_i = (1/i^θ) / (Σ_{k=1}^N 1/k^θ),  0 ≤ θ ≤ 1

If θ = 0, clients access the data items uniformly; as θ increases, the access pattern becomes more skewed. During simulation we have chosen θ = 0.8. Every client, if active, listens to the URs and RRs to invalidate its cache accordingly. When a new request is generated, the client listens to the next UR or RR to decide whether there is a valid copy of the requested item in the cache. If there is one, the client answers the query immediately; if the copy is invalid, the client downloads the updated copy of the data item from the broadcast channel and returns it to the application. If a cache miss happens, the client sends an uplink request to the server; after receiving it, the server broadcasts the requested data during the next report (UR or RR), and the client downloads it to answer the query. To accommodate a new item, the client follows the R-LPV cache replacement policy. The client cache management module maintains an access log, which is used to generate the caching rules.
The access log is divided into sessions, and the sessions are mined using association rule mining algorithms. To keep the caching rules fresh, the client updates the access log whenever a query is generated and re-mines the rules periodically.
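The session mining step can be sketched as follows. This brute-force version is for illustration only, with invented names: a real implementation would use an efficient algorithm such as Apriori [21]. The confidence of a rule x → y is support(x ∪ {y}) / support(x).

```python
# Toy miner for caching rules over access-log sessions (lists of item ids).
from itertools import combinations

def mine_rules(sessions, min_conf=0.5, max_antecedent=2):
    n = len(sessions)
    def support(items):
        return sum(1 for s in sessions if items <= set(s)) / n
    universe = {i for s in sessions for i in s}
    rules = []
    for k in range(1, max_antecedent + 1):
        for x in combinations(sorted(universe), k):
            sx = support(set(x))
            if sx == 0:
                continue
            for y in universe - set(x):
                conf = support(set(x) | {y}) / sx     # confidence of x -> y
                if conf >= min_conf:
                    rules.append((frozenset(x), y, conf))
    return rules

sessions = [[1, 2, 4], [1, 2, 4], [1, 3], [2, 4]]
rules = mine_rules(sessions)
# e.g. {2} -> 4 holds with confidence 3/3 = 1.0 in this toy log
```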

6.1.2 The server model

The server broadcasts URs (or RRs) over the wireless channel with the broadcast interval specified by the parameter L (or L/m). The IR part of the UR report contains the update history of the past w broadcast intervals. There are N data items at the server. Data item sizes vary from s_min to s_max such that the size s_i of item d_i is

   s_i = s_min + ⌊random() × (s_max − s_min + 1)⌋,  i = 1, 2, ..., N

where random() is a random function uniformly distributed between 0 and 1. The server generates a single stream of updates separated by exponentially distributed update interarrival times with mean value Tu. The data items in the database are divided into a hot (frequently accessed) subset and a cold subset. Within the same subset, the updates are uniformly distributed, and 80% of the updates are applied to the hot subset. Most of the system parameters are listed in Table 4.

Table 4: Simulation parameters.

   Parameter                        Default value   Range
   Database size (N)                10000 items
   s_min                            10 KB
   s_max                            100 KB
   Number of clients (M)            70
   Client cache size (C)            600 KB          200~1400 KB
   UR broadcast interval (L)        20 sec
   Number of RR broadcasts (m−1)    4
   Broadcast window (w)             10 intervals
   Broadcast bandwidth              100 Kbps
   Hot subset percentage            20 %
   Hot subset update percentage     80 %
   Mean update arrival time (Tu)    10 sec          1~10000 sec
   Mean query generate time (Tq)    100 sec         5~300 sec
   Skewness parameter (θ)           0.8
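The workload generation described above can be sketched as follows, using the defaults from Table 4. This is an illustrative sketch; function names are not from the paper's simulator.

```python
# Sketch of the simulated workload: item sizes drawn via
# s_i = s_min + floor(random()*(s_max - s_min + 1)), and the Zipf access
# probabilities of Section 6.1.1 (theta = 0.8 by default).
import random

def item_sizes(N, s_min, s_max, rng):
    return [s_min + int(rng.random() * (s_max - s_min + 1)) for _ in range(N)]

def zipf_probs(N, theta):
    # A_i = (1/i^theta) / sum_{k=1..N} 1/k^theta
    norm = sum(1.0 / k ** theta for k in range(1, N + 1))
    return [(1.0 / i ** theta) / norm for i in range(1, N + 1)]

rng = random.Random(42)
sizes = item_sizes(10000, 10, 100, rng)   # all sizes fall in [10, 100] KB
probs = zipf_probs(10000, theta=0.8)      # skewed toward low item ids
```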

6.2 The simulation results

The simulation results report the byte hit ratio (B) and the average query latency (Tavg) as functions of different factors, such as the cache size (C), the mean query generate time (Tq) and the mean update arrival time (Tu). The byte hit ratio is defined as the ratio of the number of data bytes retrieved from the cache to the total number of requested data bytes. The average query latency is the total query latency divided by the number of queries. Three cache replacement algorithms are compared in our simulations:
• I-LRU [4]: The Invalid-LRU algorithm keeps removing the least recently used invalid item until there is enough space in the cache. If there is no invalid item in the cache, the oldest valid item is removed as per the LRU policy.
• Min-SAUD [7]: Min-SAUD considers various factors that affect cache performance, namely access probability, update frequency, data size, retrieval delay and cache validation cost.
• R-LPV: This is our algorithm. It keeps removing the item d_i with the least profit_i/s_i value, where the profit function is defined by Eq. (7).
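The I-LRU baseline can be sketched as below. This is a minimal illustration of the behaviour described above; the details of [4] may differ.

```python
# Minimal I-LRU sketch: evict the least recently used *invalid* item first;
# if every cached item is valid, fall back to plain LRU.
from collections import OrderedDict

class ILRUCache:
    def __init__(self):
        self.items = OrderedDict()            # id -> valid flag, LRU order

    def access(self, item):
        self.items[item] = True
        self.items.move_to_end(item)          # most recently used at the end

    def invalidate(self, item):
        if item in self.items:
            self.items[item] = False

    def evict_one(self):
        for item, valid in list(self.items.items()):
            if not valid:                     # LRU-ordered invalid item
                del self.items[item]
                return item
        item, _ = self.items.popitem(last=False)   # plain LRU fallback
        return item

c = ILRUCache()
for i in (1, 2, 3):
    c.access(i)
c.invalidate(2)
evictions = [c.evict_one(), c.evict_one()]    # invalid item 2, then LRU item 1
```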



6.2.1 The effects of cache size

In this section, we investigate the performance of the cache replacement algorithms under different cache sizes. The simulation results are shown in Fig. 4. Our algorithm outperforms the others in terms of both byte hit ratio and average query latency, which can be explained as follows. In contrast to I-LRU, the proposed algorithm R-LPV favors small data items, so a larger number of items can be kept in the cache; as a result, the byte hit ratio is higher and the average query latency is lower. Moreover, neither Min-SAUD nor I-LRU considers the relationships among data items, so important data items may be replaced by unimportant ones under these algorithms. Since R-LPV has mined the relationships among data items, it knows which items have higher future access probabilities and keeps these important items in the cache for longer. In this way, more client requests can be served locally from the cache and the byte hit ratio improves. On average, the improvement of R-LPV in byte hit ratio over I-LRU and Min-SAUD is 9.3% and 5.2%, respectively. The byte hit ratio of all the algorithms improves with increasing cache size, because a larger cache can store more data items. Fig. 4b shows that R-LPV incurs the lowest average query latency at all cache sizes; for example, the Tavg value of R-LPV is 33.3% less than that of I-LRU and 26.7% less than that of Min-SAUD when the cache size is 600 KB. By considering the relationships among data, R-LPV differentiates the importance of the data items and keeps the important ones in the cache for longer, thereby achieving a lower average query latency than the I-LRU and Min-SAUD algorithms, which do not differentiate cached items based on correlation.
6.2.2 The effects of mean update arrival time

We measure the byte hit ratio and the average query latency as functions of the mean update arrival time (Tu), which determines how frequently the server updates its data items. As shown in Fig. 5, our algorithm performs much better than I-LRU and Min-SAUD. For example, in Fig. 5b, when the mean update arrival time is 10 seconds, the average query latency of R-LPV is 26.7% less than that of the Min-SAUD algorithm. Our algorithm considers various caching parameters and exploits the relationships among the cached items to determine their importance, thus retaining important items for longer. In contrast, the I-LRU and Min-SAUD algorithms do not take into consideration the relationship of an item with the set of cached items, and hence an important item may be replaced by an unimportant one. On average, R-LPV is about 27.9% and 18.5% better than I-LRU and Min-SAUD, respectively, in terms of average query latency. Similar results can be found in Fig. 5a. For all three algorithms, the average query latency is high at lower Tu because of the larger number of updates at the server; accordingly, Fig. 5a shows that the byte hit ratio drops as the mean update arrival time decreases.
[Figure 4a: Byte hit ratio vs. client cache size (200~1400 KB) for I-LRU, Min-SAUD and R-LPV.]

[Figure 4b: Average query latency (sec) vs. client cache size (200~1400 KB) for I-LRU, Min-SAUD and R-LPV.]

Figure 4: The effects of cache size on byte hit ratio and average query latency.

6.2.3 The effects of mean query generate time

Fig. 6 shows the effects of the mean query generate time on the average query latency of the R-LPV, I-LRU and Min-SAUD algorithms. As explained before, each client generates queries according to the mean query generate time, and the generated queries are served one by one. If the queried data is in the local cache, the client serves the query locally; otherwise, it requests the data from the server, queueing any queries generated while waiting for the server's reply. For all the algorithms, the average query latency drops as Tq increases, since fewer queries are generated and the server can serve them more quickly. As we can see from Fig. 6, the average query latency of R-LPV is much less than that of I-LRU and Min-SAUD. This is due



to the fact that R-LPV uses the cache space more effectively and retains the most important items in the cache based on the caching parameters and correlation. For example, when Tq is 150 seconds, the Tavg of R-LPV is 42.9% and 14.3% lower than that of I-LRU and Min-SAUD, respectively.
[Figure 5a: Byte hit ratio vs. mean update arrival time (1~10000 sec) for I-LRU, Min-SAUD and R-LPV.]

In this paper, we present the Rule based Least Profit Value (R-LPV) replacement policy for mobile environments, which considers various caching parameters such as update frequency, retrieval delay from the server, cache invalidation delay and data size, along with the associations among the cached items. We design a profit function to dynamically evaluate the profit from caching an item. To enhance the caching performance, generalized association rules are applied to find the relationships among data items. Simulation experiments demonstrate that the R-LPV policy outperforms the I-LRU and Min-SAUD policies in terms of byte hit ratio and average query latency. In future work, we will investigate the use of data mining techniques for prefetching, integrated with caching, to further enhance data availability at a mobile client.

8 REFERENCES

[1] D. Barbara and T. Imielinski: Sleepers and Workaholics: Caching Strategies in Mobile Environments, Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 1-12 (1994).
[2] G. Cao: On Improving the Performance of Cache Invalidation in Mobile Environments, ACM/Kluwer Mobile Networks and Applications, Vol. 7, No. 4, pp. 291-303 (2002).
[3] G. Cao: A Scalable Low-Latency Cache Invalidation Strategy for Mobile Environments, IEEE Transactions on Knowledge and Data Engineering, Vol. 15, No. 5, pp. 1251-1265 (2003).
[4] N. Chand, R.C. Joshi and Manoj Misra: Energy Efficient Cache Invalidation in a Disconnected Wireless Mobile Environment, International Journal of Ad Hoc and Ubiquitous Computing (IJAHUC), Vol. 2, No. 1/2, pp. 83-91 (2007).
[5] E. Coffman and P. Denning: Operating System Theory, Prentice-Hall, Englewood Cliffs, NJ (1973).
[6] J. Xu, Q.L. Hu, D.L. Lee and W.-C. Lee: SAIU: An Efficient Cache Replacement Policy for Wireless On-Demand Broadcasts, Ninth ACM International Conference on Information and Knowledge Management, pp. 46-53 (2000).
[7] J. Xu, Q.L. Hu, W.-C. Lee and D.L. Lee: Performance Evaluation of an Optimal Cache Replacement Policy for Wireless Data Dissemination, IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 1, pp. 125-139 (2004).
[8] L. Yin, G. Cao and Y. Cai: A Generalized Target-Driven Cache Replacement Policy for Mobile Environments, Journal of Parallel and Distributed Computing, Vol. 65, No. 5, pp. 583-594 (2005).

[Figure 5b: Average query latency (sec) vs. mean update arrival time (1~10000 sec) for I-LRU, Min-SAUD and R-LPV.]

Figure 5: The effects of mean update arrival time on byte hit ratio and average query latency.
[Figure 6: Average query latency (sec) vs. mean query generate time (sec) for I-LRU, Min-SAUD and R-LPV.]

Figure 6: The effects of mean query generate time on average query latency.



[9] H. Song and G. Cao: Cache-Miss-Initiated Prefetch in Mobile Environments, Computer Communications Journal, Vol. 28, No. 7, pp. 741-753 (2005).
[10] R. Agrawal, T. Imielinski and A. Swami: Mining Association Rules Between Sets of Items in Large Databases, ACM SIGMOD Conference on Management of Data, pp. 207-216 (1993).
[11] H. Shen, Mohan Kumar, Sajal Das and Z. Wang: Energy-Efficient Data Caching and Prefetching of Mobile Devices Based on Utility, Mobile Networks and Applications (MONET), Vol. 10, No. 4, pp. 475-486 (2005).
[12] S. Acharya, R. Alonso, M. Franklin and S. Zdonik: Broadcast Disks: Data Management for Asymmetric Communication Environments, ACM SIGMOD Conference on Management of Data, San Jose, USA, pp. 199-210 (1995).
[13] P. Cao and S. Irani: Cost-Aware WWW Proxy Caching Algorithms, USENIX Symposium on Internet Technologies and Systems, pp. 193-206 (1997).
[14] J. Shim, P. Scheuermann and R. Vingralek: Proxy Cache Design: Algorithms, Implementation and Performance, IEEE Transactions on Knowledge and Data Engineering, Vol. 11, No. 4, pp. 549-562 (1999).
[15] L. Rizzo and L. Vicisano: Replacement Policies for a Proxy Cache, IEEE/ACM Transactions on Networking, Vol. 8, No. 2, pp. 158-170 (2000).
[16] R. Wooster and M. Abrams: Proxy Caching That Estimates Page Load Delays, Proceedings of Computer Networks and ISDN Systems, pp. 977-986 (1997).
[17] W.-G. Teng, C.-Y. Chang and M.-S. Chen: Integrating Web Caching and Web Prefetching in Client-Side Proxies, IEEE Transactions on Parallel and Distributed Systems, Vol. 16, No. 5, pp. 444-455 (2005).
[18] V. Liberatore: Caching and Scheduling for Broadcast Disk Systems, UMIACS, pp. 98-71 (1998).
[19] G. Cao: Proactive Power-Aware Cache Management for Mobile Computing Systems, IEEE Transactions on Computers, Vol. 51, No. 6, pp. 608-621 (2002).
[20] S. Gitzenis and N. Bambos: Power-Controlled Data Prefetching/Caching in Wireless Packet Networks, IEEE INFOCOM, pp. 1405-1414 (2002).
[21] R. Agrawal and R. Srikant: Fast Algorithms for Mining Association Rules, 20th International Conference on Very Large Data Bases (VLDB), pp. 487-499 (1994).
[22] T.M. Anwar, H.W. Beck and S.B. Navathe: Knowledge Mining by Imprecise Querying: A Classification-Based Approach, IEEE Eighth International Conference on Data Engineering, Arizona, pp. 622-630 (1992).
[23] R. Cooley, B. Mobasher and J. Srivastava: Data Preparation for Mining World Wide Web Browsing Patterns, Knowledge and Information Systems, Vol. 1, No. 1, pp. 5-32 (1999).
[24] Y. Saygin and O. Ulusoy: Exploiting Data Mining Techniques for Broadcasting Data in Mobile Computing Environments, IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 6, pp. 1387-1399 (2002).
[25] G. Yavas, D. Katsaros, O. Ulusoy and Y. Manolopoulos: A Data Mining Approach for Location Prediction in Mobile Environments, Elsevier Data and Knowledge Engineering, Vol. 54, No. 2, pp. 121-146 (2005).
[26] G.J. Pottie and W.J. Kaiser: Wireless Integrated Network Sensors, Communications of the ACM, Vol. 43, No. 5, pp. 551-558 (2000).
[27] J. Shim, P. Scheuermann and R. Vingralek: Proxy Cache Design: Algorithms, Implementation and Performance, IEEE Transactions on Knowledge and Data Engineering, Vol. 11, No. 4, pp. 549-562 (1999).
[28] L. Breslau, P. Cao, L. Fan, G. Phillips and S. Shenker: Web Caching and Zipf-Like Distributions: Evidence and Implications, IEEE INFOCOM, pp. 126-134 (1999).


