Tuesday, February 21, 2017

Database Concepts

Relational Database

A relational database is a digital database whose organization is based on the relational model of data, as proposed by E. F. Codd in 1970. The various software systems used to maintain relational databases are known as a relational database management system (RDBMS). Virtually all relational database systems use SQL (Structured Query Language) as the language for querying and maintaining the database.


Relational Model

The relational model (RM) for database management is an approach to managing data using a structure and language consistent with first-order predicate logic, first described in 1969 by Edgar F. Codd, where all data is represented in terms of tuples, grouped into relations. A database organized in terms of the relational model is a relational database.


NoSQL

A NoSQL (originally referring to "non SQL", "non relational" or "not only SQL") database provides a mechanism for storage and retrieval of data which is modeled in means other than the tabular relations used in relational databases. Such databases have existed since the late 1960s, but did not obtain the "NoSQL" moniker until a surge of popularity in the early twenty-first century, triggered by the needs of Web 2.0 companies such as Facebook, Google, and Amazon.com. NoSQL databases are increasingly used in big data and real-time web applications. NoSQL systems are also sometimes called "Not only SQL" to emphasize that they may support SQL-like query languages.


Dimensional modeling

Dimensional modeling (DM) names a set of techniques and concepts used in data warehouse design. It is considered to be different from entity-relationship modeling (ER). Dimensional modeling does not necessarily involve a relational database. The same modeling approach, at the logical level, can be used for any physical form, such as multidimensional database or even flat files. According to data warehousing consultant Ralph Kimball, DM is a design technique for databases intended to support end-user queries in a data warehouse. It is oriented around understandability and performance. According to him, although transaction-oriented ER is very useful for the transaction capture, it should be avoided for end-user delivery.


Data Warehouse

In computing, a data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis, and is considered a core component of business intelligence. DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place and are used for creating analytical reports for knowledge workers throughout the enterprise. Examples of reports could range from annual and quarterly comparisons and trends to detailed daily sales analysis.

OLAP and OLTP

Online analytical processing (OLAP) is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is an effectiveness measure. OLAP applications are widely used by Data Mining techniques. OLAP databases store aggregated, historical data in multi-dimensional schemas (usually star schemas). OLAP systems typically have data latency of a few hours, as opposed to data marts, where latency is expected to be closer to one day. The OLAP approach is used to analyze multidimensional data from multiple sources and perspectives. The three basic operations in OLAP are : Roll-up (Consolidation), Drill-down and Slicing & Dicing.[4]

Online transaction processing (OLTP) is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE). OLTP systems emphasize very fast query processing and maintaining data integrity in multi-access environments. For OLTP systems, effectiveness is measured by the number of transactions per second. OLTP databases contain detailed and current data. The schema used to store transactional databases is the entity model (usually 3NF).[5] Normalization is the norm for data modeling techniques in this system.


Denormalization

In computing, denormalization is the process of trying to improve the read performance of a database, at the expense of losing some write performance, by adding redundant copies of data or by grouping data. It is often motivated by performance or scalability in relational database software needing to carry out very large numbers of read operations. Data warehouse as an example performs some degree of denormalization.

Examples of denormalization techniques include:
1. Materialised views, which may implement the following:
 a. Storing the count of the "many" objects in a one-to-many relationship as an attribute of the "one" relation
 b. Adding attributes to a relation from another relation with which it will be joined
2. Star schemas, which are also known as fact-dimension models and have been extended to snowflake schemas
3. Prebuilt summarisation or OLAP cubes


Normalization

There are degrees of normalization you can apply to your database design. In general, vialiations of normal form will cause different types of anomalies.


Anomalies

A poorly designed database can cause anomalies in the data (e.g. update, delete, insert anomaly). Constrains can prevent some anomalies. Minimal redundancy, less possibility of anomalies.


Functional dependency

In relational database theory, a functional dependency is a constraint between two sets of attributes in a relation from a database. In other words, functional dependency is a constraint that describes the relationship between attributes in a relation.


Multivalued Dependency

In database theory, a multivalued dependency is a full constraint between two sets of attributes in a relation.

In contrast to the functional dependency, the multivalued dependency requires that certain tuples be present in a relation. Therefore, a multivalued dependency is a special case of tuple-generating dependency. The multivalued dependency plays a role in the 4NF database normalization.

A multivalued dependency is a special case of a join dependency, with only two sets of values involved, i.e. it is a binary join dependency.


Superkey

A superkey is defined in the relational model of database organization as a set of attributes of a relation variable for which it holds that in all relations assigned to that variable, there are no two distinct tuples (rows) that have the same values for the attributes in this set. A super-key can be defined as a set of attributes of a relation schema upon which all attributes of the schema are functionally dependent.


Candidate key

In the relational model of databases, a candidate key of a relation is a minimal superkey for that relation; that is, a set of attributes such that:

1. The relation does not have two distinct tuples (e.g. rows or records in common database language) with the same values for these attributes (which means that the set of attributes is a superkey)
2. There is no proper subset of these attributes for which (1) holds (which means that the set is minimal).


First Normal Form

1. Types/values must be atomic (e.g. actor(id, role) -> ([1], [warrior, beast]))
2. No repeating attributes/fields


Second Normal Form

1. Follow First Normal Form
2. Non-key fields depend on the key field (e.g. no fields from another entity are wanted, only FK is needed)


Third Normal Form

1. Follow Second Normal Form
2. Has no transitive dependencies (no dependencies from one non-key field to another)


Boyce-Codd Normal Form (3.5NF, slightly stronger than 3NF)

A relational schema R is in Boyce–Codd normal form if and only if for every one of its dependencies X → Y, at least one of the following conditions hold:

1. X → Y is a trivial functional dependency (Y ⊆ X)
2. X is a super key for schema R


Fourth Normal Form

A Table is in 4NF if and only if, for every one of its non-trivial multivalued dependencies X →→ Y, X is a superkey—that is, X is either a candidate key or a superset thereof.


Fifth Normal Form

A table is said to be in the 5NF if and only if every non-trivial join dependency in it is implied by the candidate keys.


Domain-key Normal Form

Domain-key normal form (DK/NF) is a normal form used in database normalization which requires that the database contains no constraints other than domain constraints and key constraints.

A domain constraint specifies the permissible values for a given attribute, while a key constraint specifies the attributes that uniquely identify a row in a given table.

The domain/key normal form is achieved when every constraint on the relation is a logical consequence of the definition of keys and domains, and enforcing key and domain restraints and conditions causes all constraints to be met. Thus, it avoids all non-temporal anomalies.

The reason to use domain/key normal form is to avoid having general constraints in the database that are not clear domain or key constraints. Most databases can easily test domain and key constraints on attributes. General constraints however would normally require special database programming in the form of stored procedures (often of the trigger variety) that are expensive to maintain and expensive for the database to execute. Therefore, general constraints are split into domain and key constraints.

It's much easier to build a database in domain/key normal form than it is to convert lesser databases which may contain numerous anomalies. However, successfully building a domain/key normal form database remains a difficult task, even for experienced database programmers. Thus, while the domain/key normal form eliminates the problems found in most databases, it tends to be the most costly normal form to achieve.


Sunday, May 29, 2016

Access AWS Linux instance over SSH


How to access your AWS Linux instance over SSH?

There are two options:

1. Access with your .pem key file, which you obtained from AWS when creating an instance

2. Access with password



Before you will be able to ssh into your instance via option 2., you will have to set up the account and password via option 1.


Option 1. steps:

1. Log into your AWS web portal and find your Linux instance in the instances list

2. Right click on the instance, and click on the 'connect' option

3. You will find your Public DNS address (e.g. ec2-your-instance.us-your-region.compute.amazonaws.com)

4. Find your .pem key file path (e.g. /user/doc/my-key-file.pem), and do ssh in your client by

ssh -i /user/doc/my-key-file.pem ec2-user@ec2-your-instance.us-your-region.compute.amazonaws.com

Note: ec2-user is the default user name for your Linux instance. If it happens to be different, refer to this for details of how to obtain it. If it asks you about RSA key fingerprint, just type yes.

5. If You have successfully logged into the instance over SSH after previous 4 steps, you will now be able to set up password access over ssh to your instance, so that you don't always need your .pem key file.


Option 2. set up password access:

1. ssh -i /user/doc/my-key-file.pem ec2-user@ec2-your-instance.us-your-region.compute.amazonaws.com

2. $ cd /

3. $ sudo useradd -s /bin/bash -m -d /home/USERNAME -g root USERNAME

4. $ sudo passwd USERNAME
Enter new pwd:
Retype new pwd:

5. add the user as a sudoer (optional):

$ sudo visudo
and add the following line to the file:
USERNAME  ALL=(ALL:ALL) ALL

6. enable password access

$ sudo vi /etc/ssh/sshd_config
and change PasswordAuthentication from 'no' to 'yes'

7. restart ssh service by

$ sudo /etc/init.d/ssh restart

or

$ sudo /etc/init.d/sshd restart

9. log out and log in using

$ ssh USERNAME@ec2-your-instance.us-your-region.compute.amazonaws.com
Enter pwd:









Thursday, May 26, 2016

Trie


What is a Trie data structure?

If you know what a Tree data structure is, you can think of a Trie as an ordered tree, such that it is similar to a Binary Search Tree but it it not necessary to have both key and value given a node of the tree. The node has a value if and only if it has a valid key.

Let's look at an example.

Trie is especially powerful and commonly used when dealing with letters and words.


The above is a trie example from Wikipedia that uses words as key and integer as value.

It takes string as an input and outputs an integer value if there is a match. So trie_example.get("tea") will return an integer 3, and trie_example.get("te") will return null since there is no value associated with that node. "tea" is considered a valid key while "te" is not, since we define the key is valid if it is a valid English word.

This kind of trie can be very powerful in time complexity when dealing with words, but it also uses a lot of space especially when there are lengthy words. A space-optimized trie called Radix tree can be used to solve this problem. Use word matching as an example, radix tree will not only store one letter in each node but as many as its children they share.