What is a Database?

It is often said that we live in an information society and that information is a very valuable resource (or, as some people say, information is power). In this information society, the term database has become a rather common term although its meaning seems to have become somewhat vague as the importance of database systems has grown. Some people use the term database of an organization to mean all the data in the organization (whether computerized or not). Other people, use the term to mean the software that manages the data. We would use it to mean a collection of computerized information such that it is available to many people for various uses.

Definition - Database

A database is a well organized collection of data that are related in a meaningful way which can be accessed in different logical orders but are stored only once. The data in the database is therefore integrated, structured, and shared.

The main features of data in a database therefore are:

It is well organized
It is related
It is accessible in different orders without great difficulty
It is stored only once

It is assumed that operations (update, insert, delete, retrieve etc.) on the database can be carried out in a simple and flexible way. Also since a database tends to be a long-term resource of an organization, it is expected that planned as well as unplanned applications can (in general) be carried out without great difficulty.

In any modern organization, a large amount of data is generated about its operations. This data is sometime called operational data. The operational data:

includes the data an organisation must necessarily maintain about its operation.
includes relationships linking basic entities.
excludes input, output data, work queues, temporary results or any transient information.

Since data is a valuable resource for any enterprise, often a great deal of money is spent collecting, storing and maintaining data. The running of an enterprise may depend on proper maintenance of its operational data. For example, a university's operational data may include the following:

Student personal data (e.g. name, sex, current address, home address, date of birth, nationality)
Student academic data (e.g. school results, academic history, current enrolment)
Academic staff data (e.g. name, sex, current address, date of birth, nationality, academic qualifications, appointment history, current appointment, salary history, current salary, sabbatical leave information, recreational leave, sick leave, etc.)
Non-academic staff data ( e.g. name, sex, current address, date of birth, nationality, trade qualifications and experience, appointment history, current appointment, salary history, current salary, recreational leave, sick leave, etc.)
Subjects offered data (e.g. subject name, department, syllabus, lecturer, quota if any)
Financial data (e.g. budget information, receipts, expenditure etc.)

The above data is certainly not a complete list of data that a university generates (for example, data generated by the library is not included) and the reader will no doubt think of other information that should be included.

The above data would have a number of classes of users, for example

Lecturers, who need information about enrolments in the subjects that they teach and Heads of Departments about finances of the departments. These are users that use the information in the database but do not develop their own software. They are often called end users.
Programmers in the Computer Centre who develop programs (called application programs) to produce reports that are needed regularly by the governments, university administration and the departments. These programmers are called applications programmers. They are also responsible for developing programs that assist the end users in retrieving the information that they need in their work.
The Registrar or some other person who is in charge of the database and makes decisions about what information is stored in the database and who can modify and access it.

To conclude we note that data is

a valuable resource and an investment
needed to manage other resources efficiently
should be managed like other resources (e.g. manpower)

What is a DBMS?

As discussed earlier, a database is a well organized collection of data. To be able to carry out operations like insertion, deletion and retrieval, the database needs to be managed by a substantial package of software. This software is usually called a Database Management System (DBMS). The primary purpose of a DBMS is to allow a user to store, update and retrieve data in abstract terms and thus make it easy to maintain and retrieve information from a database. A DBMS relieves the user from having to know about exact physical representations of data and having to specify detailed algorithms for storing, updating and retrieving data.

A DBMS is usually a very large software package that carries out many different tasks including the provision of facilities to enable the user to access and modify information in the database. The database is an intermediate link between the physical database, the computer and the operating system, and on the other hand, the users. To provide the various failities to different types of users, a DBMS normally provides one or more specialized programming languages often called Database Languages. Different DBMS provide different database languages although a language called SQL has recently taken on the role of a de facto standard.

Database languages come in different forms. A language is needed to describe the database to the DBMS as well as provide facilities for changing the database and for defining and changing physical data structures. Another language is needed for manipulating and retrieving data stored in the DBMS. These languages are called Data Description Languages (DDL) and Data Manipulation Languages (DML) respectively.

Each DBMS has a DDL as well as a DML. The two languages may be parts of a unified database language. The languages come in at least the following two forms:

Extended Host Languages - Subroutine calls from one or more programming languages. For example, a system may provide extensions to COBOL or FORTRAN to enable the user to interact with the database. The programming language that is extended is usually called the Host Language.
Query Languages - These are special purpose languages that usually provide more powerful facilities to interact with the database. These languages are often designed to be simple so that they may be used by non-programmers.

To summarise, a database system consists of

The database (data)
A DBMS (software)
A DDL and a DML (Part of the DBMS)
Application programs

Some DBMS packages are marketed by computer manufacturers that will run only on that manufacturer's machines (e.g. IBM's IMS) but increasingly independent software houses are designing and selling DBMS software that would run on several different types of machines (e.g. ORACLE).

Advantages of using a DBMS

There are three main features of a database management system that make it attractive to use a DBMS in preference to more conventional software. These features are centralized data management, data independence, and systems integration.

In a database system, the data is managed by the DBMS and all access to the data is through the DBMS providing a key to effective data processing. This contrasts with conventional data processing systems where each application program has direct access to the data it reads or manipulates. In a conventional DP system, an organization is likely to have several files of related data that are processed by several different application programs.

In the conventional data processing application programs, the programs usually are based on a considerable knowledge of data structure and format. In such environment any change of data structure or format would require appropriate changes to the application programs. These changes could be as small as the following:

Coding of some field is changed. For example, a null value that was coded as -1 is now coded as -9999.
A new field is added to the records.
The length of one of the fields is changed. For example, the maximum number of digits in a telephone number field or a postcode field needs to be changed.
The field on which the file is sorted is changed.

If some major changes were to be made to the data, the application programs may need to be rewritten. In a database system, the database management system provides the interface between the application programs and the data. When changes are made to the data representation, the metadata maintained by the DBMS is changed but the DBMS continues to provide data to application programs in the previously used way. The DBMS handles the task of transformation of data wherever necessary.

This independence between the programs and the data is called data independence. Data independence is important because every time some change needs to be made to the data structure, the programs that were being used before the change would continue to work. To provide a high degree of data independence, a DBMS must include a sophisticated metadata management system.

In DBMS, all files are integrated into one system thus reducing redundancies and making data management more efficient. In addition, DBMS provides centralized control of the operational data. Some of the advantages of data independence, integration and centralized control are:

Redundancies and inconsistencies can be reduced

In conventional data systems, an organisation often builds a collection of application programs often created by different programmers and requiring different components of the operational data of the organisation. The data in conventional data systems is often not centralised. Some applications may require data to be combined from several systems. These several systems could well have data that is redundant as well as inconsistent (that is, different copies of the same data may have different values). Data inconsistencies are often encoutered in everyday life. For example, we have all come across situations when a new address is communicated to an organisation that we deal with (e.g. a bank, or Telecom, or a gas company), we find that some of the communications from that organisation are recived at the new address while others continue to be mailed to the old address. Combining all the data in a database would involve reduction in redundancy as well as inconsistency. It also is likely to reduce the costs for collection, storage and updating of data.

Better service to the Users

A DBMS is often used to provide better service to the users. In conventional systems, availability of information is often poor since it normally is difficult to obtain information that the existing systems were not designed for. Once several conventional systems are combined to form one centralised data base, the availability of information and its up-to-dateness is likely to improve since the data can now be shared and the DBMS makes it easy to respond to unforseen information requests.

Centralizing the data in a database also often means that users can obtain new and combined information that would have been impossible to obtain otherwise. Also, use of a DBMS should allow users that do not know programming to interact with the data more easily.

The ability to quickly obtain new and combined information is becoming increasingly important in an environment where various levels of governments are requiring organisations to provide more and more information about their activities. An organisation running a conventional data processing system would require new programs to be written (or the information compiled manually) to meet every new demand.

Flexibility of the system is improved

Changes are often necessary to the contents of data stored in any system. These changes are more easily made in a database than in a conventional system in that these changes do not need to have any impact on application programs.

Cost of developing and maintaining systems is lower

As noted earlier, it is much easier to respond to unforseen requests when the data is centralized in a database than when it is stored in conventional file systems. Although the initial cost of setting up of a database can be large, one normally expects the overall cost of setting up a database and developing and maintaining application programs to be lower than for similar service using conventional systems since the productivity of programmers can be substantially higher in using non-procedural languages that have been developed with modern DBMS than using procedural languages.

Standards can be enforced

Since all access to the database must be through the DBMS, standards are easier to enforce. Standards may relate to the naming of the data, the format of the data, the structure of the data etc.

Security can be improved

In conventional systems, applications are developed in an ad hoc manner. Often different system of an organisation would access different components of the operational data. In such an environment, enforcing security can be quite difficult.

Setting up of a database makes it easier to enforce security restrictions since the data is now centralized. It is easier to control who has access to what parts of the database. However, setting up a database can also make it easier for a determined person to breach security. We will discuss this in the next section.

Integrity can be improved

Since the data of the organization using a database approach is centralized and would be used by a number of users at a time, it is essential to enforce integrity controls.

Integrity may be compromised in many ways. For example, someone may make a mistake in data input and the salary of a full-time employee may be input as $4,000 rather than $40,000. A student may be shown to have borrowed books but has no enrolment. Salary of a staff member in one department may be coming out of the budget of another department.

If a number of users are allowed to update the same data item at the same time, there is a possiblity that the result of the updates is not quite what was intended. For example, in an airline DBMS we could have a situation where the number of bookings made is larger than the capacity of the aircraft that is to be used for the flight. Controls therefore must be introduced to prevent such errors to occur because of concurrent updating activities. However, since all data is stored only once, it is often easier to maintain integrity than in conventional systems.

Enterprise requirements can be identified

All enterprises have sections and departments and each of these units often consider the work of their unit as the most important and therefore consider their needs as the most important. Once a database has been set up with centralised control, it will be necessary to identify enterprise requirements and to balance the needs of competing units. It may become necessary to ignore some requests for information if they conflict with higher priority needs of the enterprise.

Data model must be developed

Perhaps the most imporrant advantage of setting up a database system is the requirement that an overall data model for the enterprise be built. In conventional systems, it is more likely that files will be designed as needs of particular applications demand. The overall view is often not considered. Building an overall view of the enterprise data, although often an expensive exercise, is usually very cost-effective in the long term.

Disadvantages of using a DBMS

A database system generally provides on-line access to the database for many users. In contrast, a conventional system is often designed to meet a specific need and therefore generally provides access to only a small number of users. Because of the larger number of users accessing the data when a database is used, the enterprise may involve additional risks as compared to a conventional data processing system in the following areas.

Confidentiality, privacy and security.
Data quality.
Data integrity.
Enterprise vulnerability may be higher.
The cost of using DBMS.

Confidentiality, Privacy and Security

When information is centralised and is made available to users from remote locations, the possibilities of abuse are often more than in a conventional data processing system. To reduce the chances of unauthorised users accessing sensitive information, it is necessary to take technical, administrative and, possibly, legal measures.

Most databases store valuable information that must be protected against deliberate trespass and destruction.

Data Quality

Since the database is accessible to users remotely, adequate controls are needed to control users updating data and to control data quality. With increased number of users accessing data directly, there are enormous opportunities for users to damage the data. Unless there are suitable controls, the data quality may be compromised.

Data Integrity

Since a large number of users could be using a database concurrently, technical safeguards are necessary to ensure that the data remain correct during operation. The main threat to data integrity comes from several different users attempting to update the same data at the same time. The database therefore needs to be protected against inadvertent changes by the users.

Enterprise Vulnerability

Centralising all data of an enterprise in one database may mean that the database becomes an indispensible resource. The survival of the enterprise may depend on reliable information being available from its database. The enterprise therefore becomes vulnerable to the destruction of the database or to unauthorised modification of the database.

The Cost of using a DBMS

Conventional data processing systems are typically designed to run a number of well-defined, preplanned processes. Such systems are often "tuned" to run efficiently for the processes that they were designed for. Although the conventional systems are usually fairly inflexible in that new applications may be difficult to implement and/or expensive to run, they are usually very efficient for the applications they are designed for.

The database approach on the other hand provides a flexible alternative where new applications can be developed relatively inexpensively. The flexible approach is not without its costs and one of these costs is the additional cost of running applications that the conventional system was designed for. Using standardised software is almost always less machine efficient than specialised software.

DBMS Architecture

We now discuss a conceptual framework for a DBMS. Several different frameworks have been suggested over the last several years. For example, a framework may be developed based on the functions that the various components of a DBMS must provide to its users. It may also be based on different views of data that are possible within a DBMS. We consider the latter approach.

A commonly used views of data approach is the three-level architecture suggested by ANSI/SPARC (American National Standards Institute/Standards Planning and Requirements Committee). ANSI/SPARC produced an interim report in 1972 followed by a final report in 1977. The reports proposed an architectural framework for databases. Under this approach, a database is considered as containing data about an enterprise. The three levels of the architecture are three different views of the data:

External - individual user view
Conceptual - community user view
Internal - physical or storage view

The three level database architecture allows a clear separation of the information meaning (conceptual view) from the external data representation and from the physical data structure layout. A database system that is able to separate the three different views of data is likely to be flexible and adaptable. This flexibility and adaptability is data independence that we have discussed earlier.

We now briefly discuss the three different views.

The external level is the view that the individual user of the database has. This view is often a restricted view of the database and the same database may provide a number of different views for different classes of users. In general, the end users and even the applications programmers are only interested in a subset of the database. For example, a department head may only be interested in the departmental finances and student enrolments but not the library information. The librarian would not be expected to have any interest in the information about academic staff. The payroll office would have no interest in student enrolments.

The conceptual view is the information model of the enterprise and contains the view of the whole enterprise without any concern for the physical implemenation. This view is normally more stable than the other two views. In a database, it may be desirable to change the internal view to improve performance while there has been no change in the conceptual view of the database. The conceptual view is the overall community view of the database and it includes all the information that is going to be represented in the database. The conceptual view is defined by the conceptual schema which includes definitions of each of the various types of data.

The internal view is the view about the actual physical storage of data. It tells us what data is stored in the database and how. At least the following aspects are considered at this level:

Storage allocation e.g. B-trees, hashing etc.
Access paths e.g. specification of primary and secondary keys, indexes and pointers and sequencing.
Miscellaneous e.g. data compression and encryption techniques, optimisation of the internal structures.

Efficiency considerations are the most important at this level and the data structures are chosen to provide an efficient database. The internal view does not deal with the physical devices directly. Instead it views a physical device as a collection of physical pages and allocates space in terms of logical pages.

The separation of the conceptual view from the internal view enables us to provide a logical description of the database without the need to specify physical structures. This is often called physical data independence. Separating the external views from the conceptual view enables us to change the conceptual view without affecting the external views. This separation is sometimes called logical data independence.

Assuming the three level view of the database, a number of mappings are needed to enable the users working with one of the external views. For example, the payroll office may have an external view of the database that consists of the following information only:

Staff number, name and address.
Staff tax information e.g. number of dependents.
Staff bank information where salary is deposited.
Staff employment status, salary level,leave information etc.

The conceptual view of the database may contain academic staff, general staff, casual staff etc. A mapping will need to be created where all the staff in the different categories are combined into one category for the payroll office. The conceptual view would include information about each staff's position, the date employment started, full-time or part-time, etc etc. This will need to be mapped to the salary level for the salary office. Also, if there is some change in the conceptual view, the external view can stay the same if the mapping is changed.

Figure 1.1 DBMS Architecture

System Architecture

A database management system is a complex piece of software that usually consists of a number of modules. The DBMS may be considered as an agent that allows communication between the various types of users with the physical database and the operating system without the users being aware of every detail of how it is done. To enable the DBMS to fulfil its tasks, the database management system must maintain information about the data itself that is stored in the system. This information would normally include what data is stored, how it is stored, who has access to what parts of it and so on.

The information (data) about the data in a database is called the metadata. In addition to information listed above, some information regarding the use of a database is often collected to monitor the system's performance. This metadata helps management in maintaining an effective and efficient database system.

A database management system is often used by two different type of users. Firstly, there are users who pose ad hoc queries and updates, which are usually executed only once. Then there are users that use canned programs that are installed on the system by application programmers. These programs are often used repeatedly. The database system must provide a query language and an embedded host language to meet the needs of these two types of users.

In addition, a DBMS provides facilities for

describing the database, when a database is being set up
authorization specification and checking
access path selection
concurrency control
logging and recovery

and many more. To provide all the above facilities, a DBMS often has a system architecture as shown in the figure below. The main components of the DBMS therefore are:

A Query Language and a Data Description Language (DDL) to provide users the access to the database.
A translator for users' instructions in the query language and the DDL including query optimization.
A Database manager
A file manager
The physical database
The metadata

The above listing of DBMS components does not include some very important components e.g. concurrency controller and recovery manager. We have left these components out to keep the architecture relatively simple.

Database Administrator

The database will be able to meet the demands of various users in the organisation effectively only if it is maintained and managed properly. Usually a person (or a group of persons) centrally located, with an overall view of the database, is needed to keep the database running smoothly. Such a person is called the Database Administrator (DBA).

The DBA would normally have a large number of tasks related to maintaining and managing the database. These tasks would include the following:

Deciding and Loading the Database Contents - The DBA in consultation with senior management is normally responsible for defining the conceptual schema of the database. The DBA would also be responsible for making changes to the conceptual schema of the database if and when necessary.
Assisting and Approving Applications and Access - The DBA would normally provide assistance to end-users interested in writing application programs to access the database. The DBA would also approve or disapprove access to the various parts of the database by different users.
Deciding Data Structures - Once the database contents have been decided, the DBA would normally make decisions regarding how data is to be stored and what indexes need to be maintained. In addition, a DBA normally monitors the performance of the DBMS and makes changes to data structures if the performance justifies them. In some cases, radical changes to the data structures may be called for.
Backup and Recovery - Since the database is such a valuable asset, the DBA must make all the efforts possible to ensure that the asset is not damaged or lost. This normally requires a DBA to ensure that regular backups of a database are carried out and in case of failure (or some other disaster like fire or flood), suitable recovery procedures are used to bring the database up with as little down time as possible.

Monitor Actual Usage - The DBA monitors actual usage to ensure that policies laid down regarding use of the database are being followed. The usage information is also used for performance tuning.

Figure 1.2 System Architecture of a DBMS

Exercises

What are the primary features of a database?
What is a DBMS?
Describe a typical system architecture of DBMS?
What is data independence? Why is data independence important?
List some of the advantages of using DBMS as compared to a conventional data processing.
Present a conceptual view of a student database. Present some ecternal views that different types of end users might have and so the mappings of these views to the conceptual view.

Intuition behind the relational model

An experiment was conducted in class. Three students wrote down on the board, their name and birth date (but not the year) and a relative's name and birth date. Our experiment didn't quite work :-) but fortunately one of the students choose to convey that information in the form of a table, e.g.,

   Joe  14/3

   June 12/1

Which suggests that it is very natural for people to think of data as organised in a table (or at least one out of every three people!). You should think of data in the relational model as organised in tables.

Mathematical basis of the relational model

Think back to your kindergarten maths class and remember some simple, non-scary terms.

Sets

A set is an unordered collection of distinct elements. For example, {3, 4, a} is a set, as is {4, a, howdy}. But {4, 4} is not a set. Set algebra defines operations available on sets, such as union, e.g.,

Intuition behind the relational model

   Joe  14/3

   June 12/1

Mathematical basis of the relational model

Think back to your kindergarten maths class and remember some simple, non-scary terms.

Sets

Domains

A domain is a set of values of some "type" (e.g., the domain of integers, the domain of reals, the domain of character strings of twenty characters chosen from an alphanumeric character set).

Each domain value is assumed to be atomic, that is, there are no relational algebra operators to decompose it. In most database implementations, the atomicity restriction has meant that the value must be of some simple type, such as integer, float, array of character, boolean, etc. We will explore the intent and use of this restriction when we discuss nested relations and normal forms.

Attributes

An attribute is the name of a role played by a domain in a relation. We will denote an attribute as A, with domain dom(A). We will assume that each attribute in a relation has a unique name.

Cartesian product (of sets)

The Cartesian product of n domains, written dom(A₁)X dom(A₂)X

... dom(A_n), is defined as follows.

(A₁ X A₂X... A_n = {(a₁, a₂, ..., a_n) | a₁

A₁ AND a₂

A₂ AND ... AND a_n A_n}

We will call each element in the Cartesian product a tuple. So each (a₁, a₂, ..., a_n) is known as a tuple. Note that in this formulation, the order of values in a tuple matters (we could give an alternative, but more complicated definition, but we shall see later how to swap attributes around in a relation).

Relations

A relation is a subset of the Cartesian product. We will assume that each relation has a name, so that we may readily distinguish between different relations by name. In general, a relation is really just a named set of tuples. We will often represent a relation as a table rather than a set since it is easier to visualise the rows and columns in tabular format. For example, consider a Professions relation with two attributes, Name, and Job, and containing the tuples depicted below.

Degree of a relation

The degree of a relation is the number of attributes (that is, the number of columns).

Cardinality of a relation

The cardinality of a relation is the number of tuples (that is, the number of rows).

A value in a tuple

To indicate a particular attribute value in a tuple, we will use the notation t.s where s is the name of an attribute. Alternatively we will use the notation t[s] to refer to the value of attribute s in the tuple. When the context is clear (that is, when it is clear which tuple we are referring to) we will often drop the t. and just use s.

Relational schema

A relation schema or just schema, written R(A₁, A₂, ..., A_n), is a relation name and a list of attribute names. The schema describes the organisation of possible relations. An individual relation is said to be an instance of a relation schema.

Summary

To recap:

Humans intuitively make tables to organise tables.
A relation is like a table.
Each row in a table corresponds to a tuple in a relation.
Each column in a table corresponds to an attribute in a relation.
A relation is actually just a set of tuples, with the following constraints.

Each attribute must have a unique name.
Each attribute value is chosen from the same domain. This is often called the domain constraint.
The information content of a relation is independent of the order of rows or columns.

The number of attributes is called the degree of the relation.
The number of tuples is called the cardinality of the relation.

Relational Algebra

Queries in relational algebra are based on specifying operations on relations. The relational algebra expressions are similar to algebraic expressions (e.g. 2*a + b*c) that we are familiar with except that we are now dealing with relations and not numbers.

We will define a number of operations that may be applied to relations of all degrees for selecting data from a relational data base. The operations include the usual set operations like union and intersection as well as operations for selecting a number of rows or columns from a relation.

The result of applying one or more relational operators to a set of (possibly one) relations is a new relation.

We first discuss selection, projection and join since these three operations are sufficient for formulating a large number of useful queries. In addition, we will discuss the cartesian product since it is needed in defining the join. Further relational operators will be discussed later in the section.

Restriction or Selection

The operation of selecting certain tuples from one relation is called restriction (or selection). Usually one wants to select tuples that meet a specified condition, for example, students enrolled in CP302. Selection or restriction may then be used. Selection is an unary operator since it operates on a single relation.

More formally, restriction of a relation R is a subset relation S of the tuples in R such that the tuples selected meet a specified condition. The degree of S is the same as that of R. The cardinality of S is equal to the number of tuples in R that meet the specified condition.

A restriction of a relation R such that the attribute A has value a will be denoted by R[A=a]. Similarly R[A>a] is a restriction on R such that attribute A has values greater than a. Some authors use sigma A=a[R] to denote R[A=a]. More generally sigma p[R] is used to denote a restriction on relation R with predicate p.

Let us now consider couple of examples of the use of the operator restriction. For example, suppose we wish to find students who live at 88, Long Hall, we do so by selecting tuples in relation student where the value of the attribute address is 88, Long Hall. We could write the selection as

student[address = '88, Long Hall'] or sigma address = '88, Long Hall'[student]

Cartesian Product

The cartesian product of any two relations R (of degree m) and S (of degree n) yields a relation R X S of degree m + n. This product relation has all the attributes that are present in relations R and S and the tuples in R X S are all possible combinations of tuples from R and S. The cardinality of R X S therefore is ab if a is the cardinality of S and b of R. The degree of R X S is x + y if x and y are degrees of R and S respectively.

Let us consider the cartesian product of the relations student and enrolment. The product is given Table 4. Note that the degree of this relation is 5 and the cardinality is 24 (not all the tuples of the product are shown in the relation). The product may not appear to be a terribly useful operator to the reader since it combines many tuples that have no association with each other. We will now show that a cartesian product followed by a restriction operator is a very important operator called the join. We discuss it now.

Join

The join of two relations R and S is a restriction of their cartesian product R X S such that a specified condition is met. The join is normally defined on an attribute a of R and an attribute b of S such that the attributes are from the same domain and are therefore related. The specified condition (called the join predicate) is a condition on attributes R.a and S.b.

The most commonly needed join is the join in which the specified condition is equality of the two attributes (one from each relation) in R X S. This join is called an equi-join. Since equi-join by definition has two equal attributes (columns) one of these may be removed by applying the projection operator. The resulting relation is called the natural join.

The degree of equi-join is x+y if the degree of R and S is x and y respectively. The degree of natural join is x+y-1. The cardinality of the join depends on the number of tuples that satisfy the join condition.

A join of R and S with join predicate a = b or a < b is usually written as

R[a = b]S or R[a < b]S

Let theta be one of the relations =, <>, <, >, <=, >=. We may now define a more general concept of theta-join of relation R (on attribute a) and relation S (on attribute b). The theta-join is a restriction on the cartesian product of R and S such that the condition a theta b is satisfied. We consider examples to show the operations equi-join, natural join and theta join.

To illustrate equi-join and natural join, we consider the relations student and enrolment (Tables 1 and 2). There is an implicit association between the two relations since the relation student provides personal information about students while enrolment provides information about the subjects that students are enrolled in. One may be wondering why the two relations are separate. The reason becomes clear if we consider what would happen when the two relations are combined. The personal information of a student is now repeated for each subject entry in which the student is enrolled in. This is not desirable not only because it wastes storage. This will be discussed in more detail later.

Data Integrity

We noted at the beginning of the chapter that the relational model has three main components; data structure, data manipulation, and data integrity. The aim of data integrity is to specify rules that implicitly or explicitly define a consistent database state or changes of state. These rules may include facilities like those provided by most programming languages for declaring data types which constrain the user from operations like comparing data of different data types and assigning a variable of one type to another of a different type. This is done to stop the user from doing things that generally do not make sense. In a DBMS, integrity constraints play a similar role.
The integrity constraints are necessary to avoid situations like the following:

Some data has been inserted in the database but it cannot be identified (that is, it is not clear which object or entity the data is about).
A student is enrolled in a course but no data about him is available in the relation that has information about students.
During a query processing, a student number is compared with a course number (this should never be required).
A student quits the university and is removed from the student relation but is still enrolled in a course.

Integrity constraints on a database may be divided into two types:

Static Integrity Constraints - these are constraints that define valid states of the data. These constraints include designations of primary keys etc.
Dynamic Integrity Constraints - these are constraints that define side-effects of various kinds of transactions (e.g. insertions and deletions).

We now discuss certain integrity features of the relational model. We discuss the following features:

Primary Keys
Domains
Foreign Keys and Referential Integrity
Nulls

Primary Keys

We have earlier defined the concept of candidate key and primary key. From the definition of candidate key, it should be clear that each relation must have at least one candidate key even if it is the combination of all the attributes in the relation since all tuples in a relation are distinct. Some relations may have more than one candidate keys.
As discussed earlier, the primary key of a relation is an arbitrarily but permanently selected candidate key. The primary key is important since it is the sole identifier for the tuples in a relation. Any tuple in a database may be identified by specifying relation name, primary key and its value. Also for a tuple to exist in a relation, it must be identifiable and therefore it must have a primary key. The relational data model therefore imposes the following two integrity constraints:
(a) no component of a primary key value can be null;
(b) attempts to change the value of a primary key must be carefully controlled.
The first constraint is necessary because if we want to store information about some entity, then we must be able to identify it, otherwise difficulties are likely to arise. For example, if a relation

CLASS (STUNO, LECTURER, CNO)
has (STUNO, LECTURER) as the primary key then allowing tuples like

3123	NULL	CP302
NULL	SMITH	CP302

is going to lead to ambiguity since the two tuples above may or may not be identical and the integrity of the database may be compromised. Unfortunately most commercial database systems do not support the concept of primary key and it would be possible to have a database state when integrity of the database is violated.

The second constraint above deals with changing of primary key values. Since the primary key is the tuple identifier, changing it needs very careful controls. Codd has suggested three possible approaches:

Method 1

Only a select group of users be authourised to change primary key values.

Method 2

Updates on primary key values be banned. If it was necessary to change a primary key, the tuple would first be deleted and then a new tuple with new primary key value but same other values would be inserted. Of course, this does require that the old values of attributes be remembered and be reinserted in the database.

Method 3

A different command for updating primary keys be made available. Making a distinction in altering the primary key and another attribute of a relation would remind users that care needs to be taken in updating primary keys.

Advantages and disadvantages of each to be discussed.

Foreign Keys and Referential Integrity

We have earlier considered some problems related to modifying primary key values. Additional problems arise because primary key value of a tuple may be referred in many relations of the database. When a primary key is modified, each of these references to a primary key must either be modified accordingly or be replaced by NULL values. Only then we can maintain referential integrity.

Before we discuss referential integrity further, we define the concept of a foreign key. The concept is important since a relational database consists of relations only (no pointers) and relationships between the relations are implicit, based on references to primary keys of other relations. These references are called foreign keys.

We now define foreign key. A foreign key in a relation R is a set of attributes whose values are required to match those of the primary key of some relation S.

In the following relation the supervisor number is a foreign key (it is the primary key of employee)

employee (empno, empname, supervisor-no, dept)

In the following relation Class, student-num and lecturer-num are foreign keys since they appear as primary keys in other relations (relations student and lecturer).

Class (student-num, lecturer-num, subject)
student ( )
lecturer ( )

Foreign keys are the implicit references in a relational database. The following constraint is called referential integrity constraint:

If a foreign key F in relation R matches the primary key P of relation S than every value of F must either be equal to a value of P or be wholly null.

The justification for referential integrity constraint is simple. If there is a foreign key in a relation (that is if the relation is referring to another relation) then its value must match with one of the primary key values to which it refers. That is, if an object or entity is being referred to, the constraint ensures the referred object or entity exists in the database.

Advantages of the Relational Model

Codd in his 1970 paper indicated that there were four major objectives in designing the relational data model. These were:

Data Independence - to provide a sharp and clear boundary between the logical and physical aspects of database management.
Simplicity - to provide a simpler structure than were being used at that time. A simple structure is easy to communicate to users and programmers and a wide variety of users in an enterprise can interact with a simple model.
Set-processing - to provide facilities for manipulating a set of records at a time so that programmers are not operating on the database record by record.
Sound Theoretical Background - to provide a theoretical background for database management field.

By: Prof. Roy

Search This Blog

Rehan Moini

Second Trimester Notes _ RDBMS

Definition - Database

Intuition behind the relational model

Mathematical basis of the relational model

Sets

Intuition behind the relational model

Mathematical basis of the relational model

Sets

Domains

Attributes

Cartesian product (of sets)

Relations

Degree of a relation

Cardinality of a relation

A value in a tuple

Relational schema

Summary

Comments

Post a Comment

Popular posts from this blog