DC-ROLL: A Novel Approach to Concurrency Control in a Distributed Database Environment
Hossein Hakimzadeh (Indiana University South Bend), William Perrizo (North Dakota State University), Prabhu Ram (Boeing Information & Support Services), Igor Tatarinov (North Dakota State University)

Dept. of Mathematics & Computer Science, Indiana University South Bend, South Bend, IN 46634-7111; Computer Science Dept., North Dakota State University, Fargo, ND 58105-51; Boeing Information & Support Services, PO Box 3707, MS 7L-40, Seattle, WA 98124-2207

This work was partially supported by grants NSF OSR-9553368 and DARPA DAAH04-96-1-0329.
Abstract
This study proposes DC-ROLL, a new approach to concurrency control in a distributed environment based on the Request Order Linked List (ROLL) [P91] and the Datacycle architecture [H+87]. DC-ROLL allows for more flexibility in levels of concurrency and avoids certain performance bottlenecks of traditional two-phase locking.

Unlike the original Datacycle architecture, where the entire database along with concurrency control structures is pumped to access managers (sites), the DC-ROLL architecture only requires broadcasting the ROLL, the database concurrency control object. As a result, DC-ROLL has much lower network bandwidth requirements.

DC-ROLL eliminates the burden of concurrency control from the sites storing the data being accessed. Instead, all concurrency control functions are handled at the sites where transactions run. DC-ROLL also simplifies database consistency management in replicated environments since no centralized (primary) concurrency control manager is needed.
1. Introduction
The industry trend toward decentralized data management has generated many research studies on applying traditional concurrency control schemes to distributed database systems; see [BHG87] for references. Many new ideas have also been proposed, e.g., [H+87, PRR91, BL94].
One of the major problems in distributed database research is scalability. Ideally, it should be possible to double the system throughput by doubling the number of sites (nodes). This is very hard to achieve due to unavoidable network delays. Another issue is the increasing probability that some sites experience higher loads because their data are hot. The latter problem can be mitigated by load balancing.
A common approach to load balancing in distributed database systems is data replication [BHG87]. To efficiently use traditional two-phase locking in a replicated environment, one copy of each data item can be designated as primary [S79]. This may actually result in an increased concurrency control burden on the sites storing hot data; in other words, it does not solve the problem completely.
Datacycle, a non-traditional approach to distributed database management, was proposed in [H+87] and [B+92]. Datacycle efficiently solves the scalability problem but puts significant constraints on the database size and contention level (percentage of updates).
This study proposes DC-ROLL, a novel approach to concurrency control in a distributed environment based on the Request Order Linked List (ROLL) [P91] and the Datacycle architecture. DC-ROLL avoids certain performance bottlenecks of traditional two-phase locking. It also simplifies database consistency management in replicated environments since no centralized concurrency control manager is needed. Unlike the original Datacycle, DC-ROLL does not require pumping the entire database; only the ROLL, the database concurrency control object, is broadcast. As a result, DC-ROLL has much lower network bandwidth requirements.
2. Request Order Linked List (ROLL) Overview
The Request Order Linked List (ROLL) was proposed in [P91] as a novel approach to concurrency control in centralized and distributed database systems.

2.1 ROLL Object Structure
The ROLL is an object that stores concurrency control data. The ROLL is handled directly by transactions, i.e., the bottleneck of a dedicated database scheduler is eliminated.
The basic ROLL element is a bit vector that indicates a transaction's requests for access to data items. Each bit position in the bit vector corresponds to a different data item. Bit-value 1 indicates that the transaction requests access to the item corresponding to that bit position, while bit-value 0 indicates that it does not. (Here we consider single-mode access only. To allow for shared and exclusive modes, each data item in the database is allocated two bits in the ROLL bit vector.)
The ROLL object is a linked list of bit vectors, one bit vector for each active transaction. The order of the bit vectors in the linked list determines the serialization partial order of the transactions. The ROLL also has a tail pointer to the last element in the list.
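As a rough illustration, the following Python sketch models the ROLL structure just described; the class and field names (Rbv, bits, uplink, tail) are our own choices, not notation from [P91].

    # Minimal sketch of the ROLL structure (single-mode access); names are ours, not from [P91].
    class Rbv:
        """Request bit vector of one active transaction, linked toward the head of the ROLL."""
        def __init__(self, bits):
            self.bits = bits      # integer used as a bit vector: bit i set => item i is requested
            self.uplink = None    # the element POSTed just before this one (None for the head)

    class Roll:
        """The ROLL object: a linked list of RBVs plus a tail pointer to the last element."""
        def __init__(self):
            self.tail = None      # RBV of the most recently POSTed transaction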
[Figure: a ROLL object containing the request bit vectors of two transactions, T1 and T2, linked in POST order, with a tail pointer to the last element.]

2.2 Basic ROLL Operations
Three basic operations can be performed on the ROLL: POST, CHECK and RELEASE.
POST is an atomic operation that establishes the serialization partial order of transactions. Having created a request bit vector (RBV) as described above, a transaction POSTs (enqueues) its RBV into the list. To do that, the transaction has to obtain exclusive access to the tail pointer, i.e., no other transaction can POST at the same moment. Then, the transaction has to copy the tail pointer value to its uplink pointer and write the address of its RBV into the tail pointer.
Transactions POST their RBVs only once, at the beginning of their execution, i.e., the basic ROLL method requires pre-declaration. We discuss a dynamic version of the algorithm that allows rePOSTs in section 2.3.3.
The CHECK operation allows a transaction to check the ROLL for availability of the entire set of needed data items in one operation. This operation does not require exclusive access to the ROLL, i.e., several transactions can perform CHECK simultaneously. A transaction CHECKs by simply accumulating the logical OR of all bit vectors ahead of it in the list. The resulting bit vector represents an access filter for the transaction.
To RELEASE a data item once finished with it, the transaction sets the corresponding bit to zero. If another transaction CHECKs the ROLL later, it will find the data item available. Like CHECK, RELEASE does not require exclusive access.
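Continuing the sketch from section 2.1, the three operations might look roughly as follows; the module-level lock that makes POST atomic and the function names are our own simplifications.

    import threading

    _tail_lock = threading.Lock()    # POST needs exclusive access to the ROLL's tail pointer

    def post(roll, rbv):
        """Atomically enqueue rbv, fixing the transaction's place in the serialization order."""
        with _tail_lock:
            rbv.uplink = roll.tail   # copy the tail pointer value into the RBV's uplink
            roll.tail = rbv          # make this RBV the new last element

    def check(rbv):
        """Return the access filter: the logical OR of all bit vectors ahead of rbv in the list.
        A zero bit in the result means the corresponding item is available to this transaction."""
        mask, node = 0, rbv.uplink
        while node is not None:      # read-only traversal, so no exclusive access is required
            mask |= node.bits
            node = node.uplink
        return mask

    def release(rbv, item):
        """Clear the bit of a finished data item so that later CHECKs see it as available."""
        rbv.bits &= ~(1 << item)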
2.3 A ROLL-based Concurrency Control Scheme

The concurrency control algorithm based on the properties of the ROLL protocol can be summarized using the following five rules:
1. Transactions know the data items they are going to use when they start. (As with any other pre-declaration method, over-declaration can be used when the needed data items are not known beforehand; a non-predeclarative version of ROLL is discussed in section 2.3.3.) When a new transaction arrives, it POSTs an RBV requesting access to the needed data items.

2. The new transaction CHECKs the ROLL to find which data items are available. If there are items that can be used immediately, the transaction proceeds. If there are no available data items, the transaction has to wait, i.e., it is blocked.

3. Blocked transactions periodically re-CHECK the ROLL, attempting to resume their execution. (As will be shown in section 4, this comes at almost no additional cost when ROLL is combined with Datacycle in a distributed environment.)

4. Once a transaction is finished with a data item, it can RELEASE the data item immediately, i.e., data can be RELEASEd before the transaction commits. (The effect of recoverability and other issues on this rule is discussed in section 2.3.2.)

5. When a transaction commits, its RBV can be removed from the ROLL.

A scheduler based on these five rules produces only serializable executions [P91]; a sketch of a transaction loop that follows the rules is given below.
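Under our stated assumptions, a transaction loop obeying these rules could be sketched as follows, reusing the post/check/release functions from section 2.2; the polling interval and the per-item processing callback are our own simplifications.

    import time

    def run_under_roll(roll, needed_items, do_step, poll_interval=0.01):
        """Sketch of a transaction obeying the five rules above (not the authors' implementation)."""
        rbv = Rbv(sum(1 << i for i in needed_items))  # rule 1: pre-declare and POST once
        post(roll, rbv)
        remaining = set(needed_items)
        while remaining:
            busy = check(rbv)                         # rule 2: one CHECK covers every needed item
            available = [i for i in remaining if not (busy >> i) & 1]
            if not available:
                time.sleep(poll_interval)             # rule 3: blocked, so re-CHECK a bit later
                continue
            for item in available:
                do_step(item)                         # work on an item that is free right now
                release(rbv, item)                    # rule 4: release as soon as we are finished
                remaining.discard(item)
        rbv.bits = 0  # rule 5 (simplified): clearing all bits stands in for removing the RBV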
There is a significant advantage in the ROLL approach that we would like to point out. Unlike other schemes based on pre-declaration, ROLL is not conservative. A transaction does not have to wait for all needed data items to become available. Instead, it may proceed as soon as enough data items become available for its next execution step.
2.3.1 Deadlock Avoidance
When ROLL is used for concurrency control, no deadlocks can happen, since a transaction can only wait for transactions ahead of it in the ROLL list. Of course, this is a property of any pre-declaration method of concurrency control. However, we would like to emphasize again that, unlike other pre-declaration methods, ROLL is not conservative. If some data items are not available, the transaction can still proceed and access those needed data items that are available.
2.3.2 Recoverability And Other Issues
One may notice that the five rules we listed in the ROLL protocol description do not ensure recoverability (RC) of transactions [BHG87]. Along with RC, cascading aborts are usually not desired. Recovery issues and the possibility of deferring updates also dictate that executions be strict [BHG87]. These three requirements (in fact, one requirement, since strictness implies cascadelessness, which in turn implies recoverability) mean that write bits should be released only at commit time. Read bits can still be released earlier.
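As a small illustration, assuming the two-bits-per-item encoding mentioned in section 2.1 (the even/odd bit layout below is our own choice), this rule amounts to clearing read bits as reads finish while clearing write bits only at commit time.

    def release_read_bit(rbv, item):
        """Clear only the shared (read) bit of an item; this may safely happen before commit."""
        rbv.bits &= ~(1 << (2 * item))           # read bit stored at the even position (our layout)

    def release_write_bits_at_commit(rbv, written_items):
        """Clear exclusive (write) bits only at commit time, keeping executions strict
        (and therefore cascadeless and recoverable)."""
        for item in written_items:
            rbv.bits &= ~(1 << (2 * item + 1))   # write bit stored at the odd position (our layout)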
2.3.3 Implementation Problems

Three potential problems immediately leap to mind. One is the request bit vector size. Using a coarser granularity of locks and/or simple compression techniques such as run-length encoding solves this problem. (We estimated the approximate size of a ROLL object to be less than 3M for 1,000 concurrent transactions, each accessing 1,000 objects, in a database of 1,000,000 objects. With run-length encoding, the ROLL size depends only slightly on the total number of data items in the database.) The second problem is dynamic transaction support, i.e., situations when the exact transaction data set is not known beforehand. This can be solved by over-declaration or by allowing rePOSTs as described in [P91]. Another problem arises if the number of objects in the database may change. This problem can be solved by making the bit vector sufficiently long and then lengthening (or shortening) it periodically as required.
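To make the size argument concrete, here is one possible run-length encoding of a sparse request bit vector; the (bit value, run length) format is our own illustration, not a scheme taken from [P91].

    def rle_encode(bits, n_items):
        """Run-length encode a bit vector as (bit value, run length) pairs; a sparse RBV over a
        large database compresses to a short list of runs."""
        runs, current, length = [], bits & 1, 0
        for i in range(n_items):
            b = (bits >> i) & 1
            if b == current:
                length += 1
            else:
                runs.append((current, length))
                current, length = b, 1
        runs.append((current, length))
        return runs

    # Example: a transaction requesting items 10 and 500 out of a 1,000-item database.
    rbv_bits = (1 << 10) | (1 << 500)
    print(rle_encode(rbv_bits, 1000))   # [(0, 10), (1, 1), (0, 489), (1, 1), (0, 499)]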
3. The Original Datacycle Architecture Overview

In this section we briefly describe the Datacycle architecture as proposed in [H+87, B+92]. Unlike traditional database systems, where transactions are sent to data sites, in Datacycle data is broadcast to the sites where transactions run. Figure 1 shows the architecture of the Datacycle model. The entire contents of the database, along with any concurrency control information, are periodically pumped by the Data Pump through a high-bandwidth channel to Access Managers. Access Managers are the nodes where transactions are executed. To process the data stream on the fly, Access Managers use custom hardware that filters data according to given criteria and performs other basic operations. If a transaction needs to modify a data item, it sends a request to the Update Manager, a subsystem of the Data Pump. The Update Manager uses an optimistic concurrency control method to ensure serializability.
[Figure 1. The Original Datacycle Architecture: the Data Pump, with its Update Manager, broadcasts the database over a fiber optic bus to the Access Managers, which communicate back over an upstream network.]
The major advantage of the Datacycle architecture is its almost perfect scalability: the required bandwidth of the data channel does not depend on the number of Access Managers.
However, Datacycle also has certain disadvantages. The bandwidth of the data channel puts significant limitations on the size of the database. Even for medium-sized databases (<100M), the bandwidth of the data channel should be higher than 1 Gb/s, or else the traditional distributed database approach would be more efficient [BL94].
The Update Manager may become a bottleneck if the number of updates is high. The optimistic approach to concurrency control is also suitable only for low-contention workloads. Moreover, a transaction with a series of dependent reads can accomplish each read no sooner than once every cycle. All these problems make Datacycle suitable only for workloads with few updates and simple queries.
4. DC-ROLL: A Novel Approach to Concurrency Control in a Distributed Environment
To reap the benefits of the Datacycle architecture and avoid its problems, we propose that Datacycle be used only to broadcast concurrency control data. The database is still accessed using the traditional approach: having obtained access to data items, transactions send requests to the sites where the items are stored. Processing data requests is straightforward since it does not involve any concurrency control operations.
The ROLL is an ideal structure for concurrency control data in such an architecture. (A simple lock table would not work because queues of blocked transactions would have to be maintained.) The Update Manager of the original Datacycle model would now work as a global ROLL manager.
Transactions POST their request bit vectors to the ROLL manager. The pump broadcasts the entire ROLL to all sites. On each site, transactions CHECK to see if the needed data is available. If the data is available and read access is needed, a request is sent to the sites storing the data. If a transaction only needs to write a data item, it uses its private local workspace to do that, i.e., updates are deferred until commit point.
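A heavily simplified sketch of this per-site read flow is given below; every callable passed in is a hypothetical stand-in (a POST message to the global ROLL manager, one broadcast cycle reduced to this transaction's access filter, and a read request to a site storing the item), and writes are omitted since they stay in the private workspace until commit.

    def run_dc_roll_reads(rbv, post_to_roll_manager, next_access_filter, read_from_site, read_items):
        """Sketch of DC-ROLL read processing at one site; all helpers are hypothetical."""
        post_to_roll_manager(rbv)                     # POST goes to the global ROLL manager
        fetched, pending = {}, set(read_items)
        while pending:
            busy = next_access_filter(rbv)            # CHECK against the next broadcast ROLL, locally
            for item in [i for i in pending if not (busy >> i) & 1]:
                fetched[item] = read_from_site(item)  # data requests go to the sites storing the items
                pending.discard(item)
        return fetched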
To demonstrate the advantages of DC-ROLL, consider a distributed database where some data are replicated for higher performance [BHG87]. In a replicated database, copies of some data items are stored at multiple sites. Updates have to be propagated to all copies of the data item that is being modified. Deferred updates can improve performance but introduce the problem of undetected conflicts: one transaction may lock copies of data items at one site and modify them while another transaction may do the same with other copies of the same data. One of the transactions would have to be aborted when it tries to commit and propagate its updates to other sites. To avoid this problem, one copy of each data item can be designated as primary [S79]. All transactions are required to use the primary copies of the data items they access.
It is easy to see that the site storing the primary copy of a hot data item may become a bottleneck. Suppose two-phase locking is used for concurrency control. Along with a huge number of requests for the hot data, such sites would also have to deal with an equally large number of lock/release requests. By contrast, DC-ROLL removes the burden of concurrency control, which may be significant [GL92], from the sites that store the actual data. No lock/release requests are ever sent. Instead, all concurrency control functions are handled at the sites where transactions run. Furthermore, there is no need to designate some data item copies as primary. Whenever a transaction modifies any copy of a data item, the request vector it has posted indicates that the data item is being accessed. Other transactions are therefore aware of this and would wait until the item becomes available before modifying any of its copies.

5. Conclusion and Future Work
We have proposed a new approach to concurrency control in a distributed environment. Our approach is based on a combination of the ROLL concurrency control structure and the Datacycle architecture. The proposed architecture eliminates the burden of concurrency control from the sites storing the data being accessed. It also simplifies concurrency control in a replicated environment.
We are currently using analytic and simulation modeling to study the performance benefits of the DC-ROLL architecture. We are also implementing a prototype of a DC-ROLL-based distributed database system on a cluster of workstations connected by an ATM switch.

6. References
[B+92] T. Bowen, G. Gopal, G. Herman, T. Hickey, K. Lee, W. Mansfield, J. Raitz, A. Weinrib. The Datacycle Architecture. Communications of the ACM, Vol. 35, No. 12, December 1992.

[BHG87] P. Bernstein, V. Hadzilacos, N. Goodman. Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1987.

[BL94] S. Banerjee, V. O. K. Li. Evaluating the Distributed Datacycle Scheme for a High Performance Distributed Database. In Proceedings of the 6th International Conference on Computing and Information, Peterborough, Canada, May 1994.

[GL92] V. Gottemukala, T. Lehman. Locking and Latching in a Memory-Resident Database System. In Proceedings of the 18th VLDB Conference, Vancouver, Canada, 1992.

[H+87] G. Herman, et al. The Datacycle Architecture for Very High Throughput Database Systems. In Proceedings of the ACM SIGMOD Conference, San Francisco, CA, May 1987.

[P91] W. Perrizo. Request Order Linked List (ROLL): A Concurrency Control Object for Centralized and Distributed Database Systems. In Proceedings of the IEEE Int. Conference on Data Engineering, Kobe, Japan, April 1991.

[PRR91] W. Perrizo, J. Rajkumar, P. Ram. HYDRO: A Heterogeneous Distributed Database System. In Proceedings of the ACM SIGMOD Conference, Denver, CO, May 1991.

[S79] M. Stonebraker. Concurrency Control and Consistency of Multiple Copies of Data in Distributed INGRES. IEEE Trans. on Software Engineering, May 1979.