Database
A database is a collection of data that is stored and organized in a way that can be easily accessed, managed and updated. Databases are used to store and manage data for various applications, such as websites, mobile apps, and desktop software
A database is nothing more than a set of related information
A database management system (DBMS) is a software that allows users to define, create, maintain and control access to the database
- Primary goal of a DBMS is to provide an efficient and convenient method for defining and managing databases
Before Databases
Structure vs. Unstructured Data
Data can be stored in different formats, depending on the type of data and how it is used
Structured Data: Data that is organized in a predefined manner, such as in a table with rows and columns
Example: Relational databases
Unstructured Data: Data that does not have a predefined structure or organization
Example: Text, images, videos, audio files
Non-Relational Database Systems
Databases before the advent of relational databases were non-relational. They were based on the hierarchical or network model. These databases were designed to handle specific types of data and were not as flexible as relational databases
Hierarchical Database: Data is organized in a tree-like structure, with parent-child relationships
- Single-parent hierarchy (one-to-many relationship)
textCompany / \ Dept1 Dept2 / \ | \ Emp1 Emp2 Emp3 Emp4
Network Database: Data is organized in a graph-like structure, with many-to-many relationships
- Exposes sets of records and sets of links that define relationships between different records
- Multiple-parent hierarchy (many-to-many relationship)
textEmployee1 | | Project1 Employee2 | | Project2
Relational Database Systems
A relational database is a type of database that stores and provides access to data points that are related to one another
The relational database model was introduced by E.F. Codd in 1970 of IBM's research laboratory
- Published a paper titled "A Relational Model of Data for Large Shared Data Banks"
- It is based on the concept of tables, where data is stored in rows and columns
- Rather than using pointers to navigate between related entities, redundant data is used to link related records in different tables
- The data can be accessed or reassembled in many different ways without having to reorganize the database tables
Key Concepts
Table: A table is a collection of data organized into rows and columns
- Modelled after a real-world entity
Column: A column is a set of data values of a particular type
- Some attribute of the entity
Row: A row is a single record in a table
- A single instance of the entity
Cell: A cell is a single data value in a table
- A single value for a single instance
- The intersection of a row and a column
Field: A field is a single piece of data in a table
Record: A record is a complete set of fields in a table
Primary Key: A primary key is a unique identifier for each row in a table
Foreign Key: A foreign key is a column that references a primary key in another table
Index: An index is a data structure that improves the speed of data retrieval operations on a database table
View: A view is a virtual table that is based on the result of a
SELECT
queryTransaction: A transaction is a single unit of work that is performed on a database
Foreign key
Referential integrity
One-to-many relationship
One-to-one relationship
Many-to-many relationship
Data type
Null value
Default value
Auto increment column
Relational database
Table
Column
Row
Cell
Primary key
Composite primary key
Non-primary key (unique key)
Index
Types of Relationships
One-to-One Relationship: A relationship where each record in one table is related to only one record in another table
Example: A person has only one passport
One-to-Many Relationship: A relationship where each record in one table is related to one or more records in another table
Example: A customer can have multiple orders
Many-to-Many Relationship: A relationship where each record in one table is related to one or more records in another table, and vice versa
Example: A student can enrol in multiple courses, and a course can have multiple students
Data Models
A data model is a conceptual representation of the data structures that are required by a database
Entity-Relationship Model: A data model that describes the relationships between entities in a database
- The entity-relationship (E-R) data model uses a collection of basic objects, called entities, and relationships among these objects
- An entity is a "thing" or "object" in the real world that is distinguishable from other objects
Relational Model: A data model that describes the relationships between tables in a database
- The relational model uses a collection of tables to represent both data and the relationships among those data
- Each table has multiple columns, and each column has a unique name
- Tables are also known as relations
Semi-Structured Model: A data model that allows for the representation of data that does not fit neatly into a tabular format
- This data model permits the specification of data where individual data items of the same type may have different attributes
- JSON and XML are examples of semi-structured data formats
Object-Oriented Model: A data model that represents data as objects, similar to object-oriented programming
Relational Data Model
Data are represented in the form of tables
- Each table has multiple columns, and each column has a unique name
- Each row in a table represents one piece of information (one record)
Data Abstraction
Data abstraction is the process of hiding the complex details of data storage and retrieval operations
Different levels of data abstraction:
Physical Level: Describes how data is stored in the database (e.g., data structures, file formats)
- Lowest level of abstraction
Logical Level: Describes what data is stored in the database and the relationships among the data
- Describes the structure of the database
- Independent of the physical storage of data (physical data independence)
View Level: Describes how data is presented to the user
- Highest level of abstraction
- Describes only a part of the database
- Not all users need to see all the data in the database
+------------------------------------+
| View Level |
| |
| View 1 View 2 View 3 ... View n |
| |
+------------------------------------+
|
|
Logical Level
|
|
Physical Level
The database system allows application developers to store and retrieve data using the abstractions of the data model, and converts the abstract operations into operations on the low-level implementation
ACID Properties
ACID is a set of properties that guarantee that database transactions are processed reliably
Transaction
What is a Transaction?
Transaction is a sequence of operations that are executed as a single unit
- A collection of queries
- One unit of work
- E.g., Account deposit (
SELECT
,UPDATE
,INSERT
)
Transaction Life-cycle:
- Transaction
BEGIN
COMMIT
orROLLBACK
- Transaction unexpected ending =
ROLLBACK
(e.g. crash)
Nature of Transactions:
- Usually Transactions are used to change and modify data
- However, it is perfectly normal to have a read only transaction
- Example, you want to generate a report and you want to get consistent snapshot based at the time of transaction
- We will learn more about this in the Isolation section
Example:
Lets say you have a bank account and you want to transfer money from your account to another account. This is a transaction. You will have to do the following steps:
ACCOUNT_ID | BALANCE |
---|---|
1 | $900 |
2 | $600 |
Send $100 from account 1 to account 2
BEGIN TRANSACTION
SELECT BALANCE FROM ACCOUNT WHERE ACCOUNT_ID = 1 AND BALANCE >= 100
UPDATE ACCOUNT SET BALANCE = BALANCE - 100 WHERE ACCOUNT_ID = 1
UPDATE ACCOUNT SET BALANCE = BALANCE + 100 WHERE ACCOUNT_ID = 2
COMMIT TRANSACTION
Atomicity
Atomicity ensures that all operations within a transaction are completed successfully, or none of them are completed at all
- All queries in a transaction must succeed
- If one query fails, all prior successful queries in the transaction should rollback
- If the database went down prior to a commit of a transaction, all the successful queries in the transactions should rollback
Example:
You are sending money from your account to another account. You have initiated a transaction but the computer crashes before the transaction is completed
In this case:
- The first account has been debited but the other account has not been credited
- This is really bad as we just lost data, and the information is inconsistent
- An atomic transaction is a transaction that will rollback all queries if one or more queries failed
- The database should clean this up after restart
Consistency
Consistency ensures that the database remains in a consistent state before and after a transaction
- The database should enforce all constraints
- The database should not allow any invalid data to be inserted
- The database should not allow any data that violates the integrity constraints
Example:
While transferring funds, there is a moment between the subtraction from one account and the addition to another account that the total amount of funds represented in the database is altered and may become inconsistent with a recorded total. This is acceptable, as long as the transaction, as a whole, is consistent when it is committed
Isolation
Isolation ensures that the operations within a transaction are not affected by other transactions that are running concurrently
- Can my in-flight transaction see changes made by other transactions?
- Transactions should be isolated from each other
- Transactions should not interfere with each other
- Transactions should not see the intermediate states of other transactions
Types of Isolation Levels:
Read Uncommitted: The lowest isolation level, where transactions can see uncommitted changes made by other transactions
Read Committed: Transactions can only see committed changes made by other transactions
Repeatable Read: Transactions can see changes made by other transactions, but they cannot see changes made to the same data by other transactions
Serializable: The highest isolation level, where transactions are completely isolated from each other
Isolation Read phenomena
Dirty Read: A transaction reads data that has been written by another transaction but not yet committed
Non-Repeatable Read: A transaction reads the same data multiple times and gets different results because another transaction has modified the data in between the reads
Phantom Read: A transaction reads a set of rows that satisfy a certain condition, but when it tries to read the same set of rows again, it finds that the set of rows has changed because another transaction has inserted or deleted rows that satisfy the condition
Lost Update: Two transactions read the same data, and then update the data based on the value they read. When they commit, one of the updates is lost because the other transaction overwrote it
Durability
Durability ensures that once a transaction has been committed, the changes made by the transaction are permanent and cannot be lost
- The database should be able to recover from any failure and restore the database to the state it was in before the failure
Example:
After you have transferred funds from one account to another, the transaction is committed. The database should ensure that the changes made by the transaction are permanent and irreversible even if the system crashes
Database Management System (DBMS)
A database management system (DBMS) is a software system that allows users to define, create, maintain, and control access to a database
Client-Server Architecture
In a client-server architecture, the database is stored on a server, and clients access the database through a network connection
3 Main Hardware Components:
- Client: The user interface that interacts with the database
- Server: The computer that hosts the database
- Network: The communication channel that connects the client and server
How a client-server architecture works:
The client sends a request to the server containing a database query or command
- Database drivers are used to connect the client to the server
The server passes the request to the database management system (DBMS)
The DBMS processes the request and sends a response back to the server
The server sends the response back to the client
The client displays the response to the user
DBMS Components
- Query Processor: Processes queries and commands
- Database Engine: Manages the storage and retrieval of data
- Database Server: Hosts the database and manages client connections
- Database Client: Provides a user interface to interact with the database
Instances and Schemas
Instance: A collection of data stored in the database at a particular moment in time
Schema: The structure that represents the logical view of the entire database
SQL (Structured Query Language)
SQL (Structured Query Language) is a standard language for interacting with relational databases and is used to perform various operations on databases
- The application software communicates with the DBMS by sending SQL queries through the data access API
- When the DBMS receives a query, it provides a service like returning the requested data (the query results) to the client
Entity-Relationship Model
The entity-relationship model is a data model that describes the relationships between entities in a database using a collection of basic objects called entities and relationships among these objects
- Entity: A thing or object in the real world that is distinguishable from other objects
- Relationship: Describes how entities are related to each other
Entity-Relationship Diagram (ERD)
An entity-relationship diagram (ERD) is a visual representation of the entities and relationships in a database that helps to understand the structure of the database
- Entity: A table in the database
- Attribute: A column in a table
- Relationship: A connection between two entities
Database Paradigms
Key-value: Redis, Memcache, etc.
- Caching
- Pub/Sub
- Leaderboards
Wide Column: Apache Cassandra, Apache HBase
- Time-series
- Historical Records
- High-write, low-read
Document: MongoDB, Firebase Database, Amazon DynamoDB, Apache CouchDB
- Most Apps
- Games
- IOT
RDBMS: MySQL, PostgreSQL, SQL Server, CockroachDB
- Most Apps
- Not ideal for unstructured data
Graph: Neo4j, Dgraph
- Graphs
- Knowledge Graphs
- Recommendation Engines
Search: Apache Lucene, Elasticsearch, Apache Solr, Algolia, meilisearch
- Search Engines
- Typeahead
Multi-Model: Fauna
References
- CODASYL (Conference/Committee on Data Systems Languages)
- Hierarchical