article

Introduction to Relational Data in Python, R and SQL

Author: Johannes King

Automated feature engineering for relational business data? Sound great, but you don't really know what relational data is? This post is for you!

At getML we offer a tool for automated feature engineering on relational business data. In the following, I illustrate the relational data model at an introductory level. If you have no idea what relational data is, this article is for you. Simply put, the relational data model is a way of storing data spread over many tables and defining the relationship between those tables.

I introduce the whole concept step by step using a real-world example. After reading this article you will have a fair understanding of the bits and pieces that make up the relational data model. More concretely, I will introduce

  • Primary and foreign keys
  • Relations (one-to-one, one-to-many, many-to-many)
  • Left joins

For those of you who want to work with relational data in their data science projects, each concept comes with code examples in Python, R and SQL - the standard languages for modern data science projects.

  • Python is an interpreted, high-level, general-purpose programming language. A standard tool to represent data structures in Python is pandas, an open source, BSD-licensed library. A basic introduction to pandas is available here.
  • R is an interpreted, high-level programming language specialized in statistics, data analysis, and data exploration. It has a built-in data structure called data.frame, but in the past few years people were rather using tibbles for they provide some advanced features like and beyond pandas. A thorough guide on relational data in R is available here.
  • SQL is a standard language for storing, manipulating and retrieving data in databases. To run SQL commands one needs an actual database. This document uses sqlite. This allows connecting to a database stored in a file.

Introduction

Let’s start with a simple table. A table is an arrangement of data in rows and columns. Imagine you work in the analytics department of a big department store. In order to keep track of the available products, you might create the following table

prodnr prodname prodtype quantity
41234 banana fruit 400
44321 apple fruit 2000
53153 carrot vegetable 200
78234 hammer tool 10

This table states that there are 400 units of product 41234 that is of type fruit and is called banana, 2000 units of product 44321 that is also of type fruit but is called apple, and so on.

In order to identify a table, it should have a unique name. It’s good practice to associate a specific table with a real-world entity. The above table represents the products in the department store. Therefore, let’s call it: Product.

The rows of the table contain information about one specific product in the department store. Each row is a unique, ordered list of attributes. The attributes are specified in the column names of the table, in this case: prodnr, prodname, prodtype and quantity. Each cell of the table stores exactly one attribute, multiple values for an attribute are not allowed in the relational data model.

We can write the table like this

Product(prodnr, prodname, prodtype, quantity)

and one specific row as

Product(41234, banana, fruit, 400)
# Python

# Define data as dict
product_data = {'prodnr' : [41234, 44321, 53153, 78234],
                'prodname' : ['banana', 'apple', 'carrot', 'hammer'],
                'prodtype' : ['fruit','fruit','vegetable','tool'],
                'quantity': [400, 2000, 200, 10]
               }

# Create pandas DataFrame
product_ = pd.DataFrame(data=product_data)
display(product_)
# R 

# Create a tibble
product <- tibble(
    prodnr = c(41234, 44321, 53153, 78234),
    prodname = c('banana', 'apple', 'carrrot', 'hammer'),
    prodtype = c('fruit', 'fruit', 'vegetable', 'tool'),
    quantity = c(400, 2000, 200, 10)
)
print(product)
--"SQL"

--"Delete table if it already exists (allows re-executing of this cell)"
DROP TABLE IF EXISTS Product;

--"Create table with column definitions"
CREATE TABLE Product
(prodnr TEXT,
 prodname TEXT,
 prodtype TEXT, 
 quantity INTEGER);

--"Populate with data"
INSERT INTO Product VALUES(41234, "banana", "fruit", 400);
INSERT INTO Product VALUES(44321, "apple", "fruit", 2000);
INSERT INTO Product VALUES(53153, "carrot", "vegetable", 200);
INSERT INTO Product VALUES(78234, "hammer", "tool", 10);

--"Show all entries"
SELECT * FROM Product

Keys

Keys are an important concept in the relational data model. A key is an attribute (or set of attributes) such that no two rows in a table have the same values for these attributes. In other words, a key is an attribute (or set of attributes) that allows you to uniquly identify each row in the table.

In the Product table, each product is uniquly identified by its prodnr or its prodname. These attributes fulfill the requirements to be a key and are thus called candidate keys. All other attributes (prodtype and quantity) are redundant because two rows can have the same value for these attributes.

Primary key

If there is more than one candidate key in a relation, you must choose one to be the primary key. The remaining ones are then called alternative keys. Let’s choose prodnr to be the primary key of the Product table.

# Python

# Check if 'prodnr' satisfies the requirement for a primary key
print(product_.prodnr.nunique() == len(product_))


# Define 'prodnr' as primary key
# In pandas the primary key is called 'index'
product = product_.set_index("prodnr")
display(product)
# R

# The `dplyr` package will shadow some base classes with the same name, 
# e.g. `filter`, to provide for more advanced features. Using the
# `warn.conflicts` options the corresponding warning messages will be
# suppressed.
library(dplyr, warn.conflicts = FALSE)

# There is no way to define a primary key in an R tibble. 
# We can, however, check if 'prodnr' satisfies the requirements for a
# primary key. We do so by counting the occurrences of each individual
# instance of the primary key and requiring the results to be 1.
all(select(count(product, prodnr), n))
--"SQL"

--"The common way to define primary keys in SQL is on table definiton"
DROP TABLE IF EXISTS Product;

--"Define primary key and unique (alternative) key which both must not be NULL"
CREATE TABLE Product
(prodnr TEXT NOT NULL PRIMARY KEY,
 prodname TEXT NOT NULL UNIQUE,
 prodtype TEXT,
 quantity INTEGER);

INSERT INTO Product VALUES(41234, "banana", "fruit", 400);
INSERT INTO Product VALUES(44321, "apple", "fruit", 2000);
INSERT INTO Product VALUES(53153, "carrot", "vegetable", 200);
INSERT INTO Product VALUES(78234, "hammer", "tool", 10);
SELECT * from Product

Relations

Finally, we arrive at the concept that gives the relational data model its name: Relations. In a typical real-world data science project you have to deal with more than one table to solve a given problem. The department store, for example, might have two more tables containing information about suppliers and purchase orders:

supnr supname supadress supcity
32 Apple Farm 41, Lansing Rd Lansing
72 Tony's Tools 152, Jackson St Chicago
12 Water Delivery 12, Mississippi St Merillville
ponr podate
12 2019-11-10
13 2019-11-14
14 2019-12-24

Each supplier has a supnr and some further information given by supname, supadress and supcity. Each purchase order is defined by a ponr and a podate. For the sake of brevity, the dates are supplied as strings and not real date types specific to a certain language.

The entire database we have created so far looks like this (primary keys are marked with an asterisk).

Product(prodnr*, prodname, prodtype, quantity)
Supplier(supnr*, supname, supadress, supcity)
Purchase_order(ponr*, podate)
# Python 

supplier_data = {'supnr' : [32, 72, 12],
                 'supname' : ["Apple Farm", "Tony's Tools", "Water Delivery"],
                 'supadress' : ["41, Lansing Rd", "142, Jackson St", "12, Missisippi St"],
                 'supcity' : ["Lansing", "Chicago", "Merillville"]
                }

purchase_order_data = {'ponr' : [12, 13, 14],
                       'podate' : ["2019-11-10", "2019-11-14", "2019-12-24"],
                      }

supplier = pd.DataFrame(data=supplier_data).set_index("supnr")
purchase_order = pd.DataFrame(data=purchase_order_data).set_index("ponr")
display(supplier, purchase_order)
# R

supplier <- tibble(
    supnr = c(32, 72, 12),
    supname = c("Apple Farm", "Tony's Tools", "Water Delivery"),
    supadress = c("41, Lansing Rd", "142, Jackson St", "12, Missisippi St"),
    supcity = c("Lansing", "Chicago", "Merillville")
)

purchase.order <- tibble(
    ponr = c(12, 13, 14),
    podate = c("2019-11-10", "2019-11-14", "2019-12-24")
)

print(supplier)
print(purchase.order)
--"SQL"

DROP TABLE IF EXISTS Supplier;

CREATE TABLE Supplier
(supnr TEXT NOT NULL PRIMARY KEY,
 supname TEXT NOT NULL,
 supadress TEXT,
 supcity TEXT
);

INSERT into Supplier VALUES(32, "Apple Farm", "41, Lansing Rd", "Lansing");
INSERT into Supplier VALUES(72, "Tony's Tools", "142, Jackson St", "Chicago");
INSERT into Supplier VALUES(12, "Water Delivery", "12, Mississippi St", "Merillville");


DROP TABLE IF EXISTS Purchase_order;

CREATE TABLE Purchase_order
(ponr INTEGER NOT NULL PRIMARY KEY,
 podate TEXT
);

INSERT into Purchase_order VALUES(12, "2019-11-10");
INSERT into Purchase_order VALUES(13, "2019-11-10");
INSERT into Purchase_order VALUES(14, "2019-11-10");

SELECT * FROM Supplier;

Foreign key

A relation is always defined between a pair of tables. We want, for example, to associate each purchase order with a supplier and each supplier with one or more products. In order to establish such relations, we need to introduce the concept of a foreign key. Simply put, a foreign key is an attribute of one table that appears as primary key in another table. It can therefore be used to uniquely identify a row in another table.

The simplest example of a relation is a one-to-one relation that associates each row of a table with exactly one row from another table. In the department store example, we could have added a table

Tax_number(taxnr*, supnr_)

that has supnr as a foreign key (marked with a trailing underscore) and thus associates each tax number with exactly one supplier. You might wonder why we didn’t add this number as an attribute to the Supplier table. And you’re right, we could very well have proceeded that way. However, one might want to store the taxing information in another table for reasons of data protection or administration. Nonetheless, one-to-one relations appear in real-word datasets very seldomly and we discard the Tax_number table in the following.

One-to-many

Things become more interesting when we try to establish a relation between Supplier and Purchase_order. We might consider adding supnr as a foreign key to Purchase_order or adding ponr as a foreign key to Supplier. Having in mind that each attribute can only have one value, it becomes clear that adding ponr to Supplier is not an option because each supplier might be associated with more than one Purchase order. Therefore, we add supnr as foreign key to Purchase_order since each purchase order is associated with exactly one supplier. This is called a one-to-many relation.

ponr podate supnr
12 2019-11-10 32
13 2019-11-14 32
14 2019-12-24 72

The entire database now looks like this

Product(prodnr*, prodname, prodtype, quantity)
Supplier(supnr*, supname, supadress, supcity)
Purchase_order(ponr*, podate, supnr_)

Joins

Joins are a way to combine attributes from two tables that are having a relation. There are different types of joins, but in the scope of this article we only consider left joins. A left join between two tables on a foreign key replaces all values of this key in the first table with the attributes of the row that this key identifies in the other table. To give a concrete example: Left joining Purchase_order and Supplier on the foreign key supnr results in the following table

ponr podate supname supadress supcity
12 2019-11-10 Apple Farm 41, Lansing Rd Lansing
13 2019-11-14 Apple Farm 41, Lansing Rd Landing
14 2019-12-24 Tony's Tools 142, Jackson St Chicago

The information on a given Supplier has been added to each row in Purchase_order based on the matching supnr attribute.

# Python

# Add foreign key
purchase_order["supnr"] = [32, 32, 72]

# Left join
purchase_order.join(supplier, how='left', on='supnr')
# R

# Add foreign key
purchase.order["supnr"] <- c(32, 32, 72)

# Left join
purchase.order <- left_join(purchase.order, supplier, by='supnr')
--"SQL"

--"Foreign keys are also defined on table definiton"
DROP TABLE IF EXISTS Purchase_order;

CREATE TABLE Purchase_order
(ponr INTEGER NOT NULL PRIMARY KEY,
 podate TEXT,
 supnr TEXT NOT NULL,
 FOREIGN KEY (supnr) REFERENCES Supplier (supnr)
);

INSERT into Purchase_order VALUES(12, "2019-11-10", 32);
INSERT into Purchase_order VALUES(13, "2019-11-14", 32);
INSERT into Purchase_order VALUES(14, "2019-12-24", 72);

--"Left join"
SELECT * from Purchase_order
LEFT JOIN Supplier
ON Purchase_order.supnr = Supplier.supnr;

Many-to-many

A many-to-many relation exists if each row in one table is associated with many rows in a second table, and each row of the second table is related to many rows in the first table. You can imagine a many-to-many relation as a one-to-many plus a many-to-one relation.

In our example, a many-to-many relation between Supplier and Product exists. Each supplier can supply more than one product, and each product can be supplied by more than one supplier. However, since it is not allowed to have multivalued attributes, we can neither add supnr as foreign key to Product nor can we add prodnr as foreign key to Supplier.

We can solve this dilemma by adding a new table Supplies with two foreign keys that together form the primary key of a new relation

Supplies([supnr_, prodnr_]*, purchase_price)
supnr prodnr purchase_price
32 44321 2.99
32 78234 13.5
72 78234 12
12 41234 3.2
12 53153 2.13

Note the new attribute purchase_price that has been added to the relation.

# Python

# Generate primary key tuple
supnr = [32, 32, 72, 12, 12]
prodnr = [44321, 78234, 78234, 41234, 53153]
index_tuples = [(s_, p_) for (s_,p_) in zip(supnr, prodnr)]
index = pd.MultiIndex.from_tuples(index_tuples, names=['supnr', 'prodnr'])

# Add attributes
supplies_data = {'purchase_price' : [2.99, 13.5, 12, 3.2, 2.13]}
supplies = pd.DataFrame(data=supplies_data).set_index(index)

# Left join
supplies.merge(supplier, right_index=True, left_index=True).merge(product, right_index=True, left_index=True)
# R

# Create new tibble
supplies <- tibble(
    supnr = c(32, 32, 72, 12, 12),
    prodnr = c(44321, 78234, 78234, 41234, 53153),
    purchase.price = c(2.99, 13.5, 12, 3.2, 2.13)
)

# Left join
supplies <- left_join(left_join(supplies, supplier, by='supnr'), 
                      product, by='prodnr')
print(supplies)
--"SQL"

DROP TABLE IF EXISTS Supplies;

CREATE TABLE Supplies
(supnr TEXT NOT NULL,
 prodnr TEXT NOT NULL,
 purchase_price REAL,
 PRIMARY KEY (supnr, prodnr),
 FOREIGN KEY (supnr) REFERENCES Supplier(supnr),
 FOREIGN KEY (prodnr) REFERENCES Product (prodnr)
);

INSERT into Supplies VALUES(32, 44321, 2.99);
INSERT into Supplies VALUES(32, 78234, 13.5);
INSERT into Supplies VALUES(72, 78234, 12);
INSERT into Supplies VALUES(12, 41234, 3.2);
INSERT into Supplies VALUES(12, 53153, 2.13);

select * from Supplies
LEFT JOIN Supplier
ON Supplies.supnr = Supplier.supnr
LEFT JOIN Product
ON Supplies.prodnr = Product.prodnr;

Another many-to-many relation in the supply management database of your department store is between Product and Purchase_order because you can ask the question which product(s) were purchased in which purchase_order(s).

PO_line([ponr_, prodnr_]*, quantitiy)
ponr prodnr quantity_purchased
12 44321 200
12 78234 50
13 78234 20
# Python

ponr = [12, 12, 13]
prodnr = [44321, 78234, 78234]
index_tuples = [(s_, p_) for (s_,p_) in zip(ponr, prodnr)]
index = pd.MultiIndex.from_tuples(index_tuples, names=['ponr', 'prodnr'])

po_line_data = {'quantity_purchased' : [200, 50, 20]}
po_line = pd.DataFrame(data=po_line_data).set_index(index)

po_line.merge(purchase_order, right_index=True, left_index=True).merge(product, right_index=True, left_index=True)
# R

po.line <- tibble(
    ponr = c(12, 12, 13),
    prodnr = c(44321, 78234, 78234),
    quantity.purchased = c(200, 50, 20)
)

# Nested joins can be done elegantly using R pipes (%>%)
po.line <- po.line %>% left_join(purchase.order, by='ponr') %>% 
    left_join(product, by='prodnr')
print(po.line)
--"SQL"

DROP TABLE IF EXISTS Po_line;

CREATE TABLE Po_line
(ponr TEXT NOT NULL,
 prodnr TEXT NOT NULL,
 quantity_purchased INTEGER,
 PRIMARY KEY (ponr, prodnr),
 FOREIGN KEY (ponr) REFERENCES Purchase_order (ponr),
 FOREIGN KEY (prodnr) REFERENCES Product (prodnr)
);

INSERT into Po_line VALUES(12, 44321, 200);
INSERT into Po_line VALUES(12, 78234, 50);
INSERT into Po_line VALUES(13, 78234, 20);

select * from Po_line
LEFT JOIN Purchase_order
ON Po_line.ponr = Purchase_order.ponr
LEFT JOIN Product
ON Po_line.prodnr = Product.prodnr;

The complete database now looks like this

Supplier(*supnr, supname, supadress, supcity)
Product(*prodnr, prodname, prodtype, quantity)
Supplies(*[supnr_, prodnr_], purchase_price)
Purchase_order(*ponr, podate)
PO_line(*[ponr_, prodnr_], quantitiy)

The following figure illustrates the tables in the database and their relation to each other. It was created from the SQLite database file created by the commands above using sqleton.

Data schemes like this appear in every real-world data science project. A typical business case for the department store in our example would be forecasting the demand of storage space in a warehouse. Transferred to the data level, this is equivalent to predicting the quantity column in the Product table.

Machine learning algorithms like neural nets or random forests that can be used to solve this problem, however, cannot process the relational data in its raw form. Therefore, you would need to create what is called a flat feature table by hand. This happens by writing statements in Pyhon, R or SQL like demonstrated in this article.

One feature that could be interesting for the prediction of the quantity column is the sum of quantity_purchasedfor each product in the Po_line table. To this end, we left join po_line to product and get the following table as a result.

prodname prodtype quantity quantity purchased
banana fruit 400 NaN
apple fruit 2000 200.0
carrot vegetable 200 NaN
hammer tool 10 50.0
hammer tool 10 20.0

The information about purchases for each product has been added to the product table. Because there were two purchases for the product ‘hammer’ this row appreas two times. The products for which no purchases are recorded in po_line have a Nan value in quantity_purchased (there are still listed in the table, because we performed a left join).

Now we need to sum quantity_purchased for all rows that correspond to the same product. In this case, this effectvely means summing the last two rows that correspond to the product ‘hammer’.

prodnr quantity quantity purchased
41234 400 0.0
44321 2000 200.0
53153 200 0.0
78234 20 70.0

Technically, you can create features like this by grouping the joined table using prodnr. This results in four groups: The first three have only one entry and the last one (‘hammer’) has two. Then, we need to sum quantitiy_purchased for each of the groups. This is called an ‘aggregation’.

# Python
    
product.merge(po_line, on='prodnr', how='left').groupby('prodnr').sum()
#R
    
product %>% 
    left_join(po.line, by='prodnr') %>% 
    group_by(prodnr) %>%
    summarize(total.quantity.purchased = sum(quantity.purchased))
--"SQL"
    
SELECT Product.prodnr, SUM(quantity_purchased) as total_quantity_purchased
from Product
LEFT JOIN Po_line
ON Product.prodnr = Po_line.prodnr
GROUP BY Product.prodnr

This last example hopefully gave you a superficial impression of what it means to find features in a relational data scheme. More on this topic will be covered in a follow-up article.

This article has shown you that a relational data scheme is nothing more than a collection of tables that are related to each other using foreign keys. The nature of this relation can be ‘one-to-one’, ‘one-to-many’, or ‘many-to-many’. To visualize these relations or use them to create features on uses (left) joins to connect different tables in a database.