Automated feature engineering for relational business data? Sound great, but you don't really know what relational data is? This post is for you!
At getML we offer a tool for automated feature engineering on relational business data. In the following, I illustrate the relational data model at an introductory level. If you have no idea what relational data is, this article is for you. Simply put, the relational data model is a way of storing data spread over many tables and defining the relationship between those tables.
I introduce the whole concept step by step using a real-world example. After reading this article you will have a fair understanding of the bits and pieces that make up the relational data model. More concretely, I will introduce
For those of you who want to work with relational data in their data science projects, each concept comes with code examples in Python, R and SQL - the standard languages for modern data science projects.
Let’s start with a simple table. A table is an arrangement of data in rows and columns. Imagine you work in the analytics department of a big department store. In order to keep track of the available products, you might create the following table
prodnr | prodname | prodtype | quantity |
---|---|---|---|
41234 | banana | fruit | 400 |
44321 | apple | fruit | 2000 |
53153 | carrot | vegetable | 200 |
78234 | hammer | tool | 10 |
This table states that there are 400
units of product 41234
that is of type fruit
and is called banana
, 2000
units of product 44321
that is also of type fruit
but is called apple
, and so on.
In order to identify a table, it should have a unique name. It’s good practice to associate a specific table with a real-world entity. The above table represents the products in the department store. Therefore, let’s call it: Product
.
The rows of the table contain information about one specific product in the department store. Each row is a unique, ordered list of attributes. The attributes are specified in the column names of the table, in this case: prodnr
, prodname
, prodtype
and quantity
. Each cell of the table stores exactly one attribute, multiple values for an attribute are not allowed in the relational data model.
We can write the table like this
Product(prodnr, prodname, prodtype, quantity)
and one specific row as
Product(41234, banana, fruit, 400)
# Python
# Define data as dict
product_data = {'prodnr' : [41234, 44321, 53153, 78234],
'prodname' : ['banana', 'apple', 'carrot', 'hammer'],
'prodtype' : ['fruit','fruit','vegetable','tool'],
'quantity': [400, 2000, 200, 10]
}
# Create pandas DataFrame
product_ = pd.DataFrame(data=product_data)
display(product_)
# R
# Create a tibble
product <- tibble(
prodnr = c(41234, 44321, 53153, 78234),
prodname = c('banana', 'apple', 'carrrot', 'hammer'),
prodtype = c('fruit', 'fruit', 'vegetable', 'tool'),
quantity = c(400, 2000, 200, 10)
)
print(product)
--"SQL"
--"Delete table if it already exists (allows re-executing of this cell)"
DROP TABLE IF EXISTS Product;
--"Create table with column definitions"
CREATE TABLE Product
(prodnr TEXT,
prodname TEXT,
prodtype TEXT,
quantity INTEGER);
--"Populate with data"
INSERT INTO Product VALUES(41234, "banana", "fruit", 400);
INSERT INTO Product VALUES(44321, "apple", "fruit", 2000);
INSERT INTO Product VALUES(53153, "carrot", "vegetable", 200);
INSERT INTO Product VALUES(78234, "hammer", "tool", 10);
--"Show all entries"
SELECT * FROM Product
Keys are an important concept in the relational data model. A key is an attribute (or set of attributes) such that no two rows in a table have the same values for these attributes. In other words, a key is an attribute (or set of attributes) that allows you to uniquly identify each row in the table.
In the Product
table, each product is uniquly identified by its prodnr
or its prodname
. These attributes fulfill the requirements to be a key and are thus called candidate keys. All other attributes (prodtype
and quantity
) are redundant because two rows can have the same value for these attributes.
If there is more than one candidate key in a relation, you must choose one to be the primary key. The remaining ones are then called alternative keys. Let’s choose prodnr
to be the primary key of the Product
table.
# Python
# Check if 'prodnr' satisfies the requirement for a primary key
print(product_.prodnr.nunique() == len(product_))
# Define 'prodnr' as primary key
# In pandas the primary key is called 'index'
product = product_.set_index("prodnr")
display(product)
# R
# The `dplyr` package will shadow some base classes with the same name,
# e.g. `filter`, to provide for more advanced features. Using the
# `warn.conflicts` options the corresponding warning messages will be
# suppressed.
library(dplyr, warn.conflicts = FALSE)
# There is no way to define a primary key in an R tibble.
# We can, however, check if 'prodnr' satisfies the requirements for a
# primary key. We do so by counting the occurrences of each individual
# instance of the primary key and requiring the results to be 1.
all(select(count(product, prodnr), n))
--"SQL"
--"The common way to define primary keys in SQL is on table definiton"
DROP TABLE IF EXISTS Product;
--"Define primary key and unique (alternative) key which both must not be NULL"
CREATE TABLE Product
(prodnr TEXT NOT NULL PRIMARY KEY,
prodname TEXT NOT NULL UNIQUE,
prodtype TEXT,
quantity INTEGER);
INSERT INTO Product VALUES(41234, "banana", "fruit", 400);
INSERT INTO Product VALUES(44321, "apple", "fruit", 2000);
INSERT INTO Product VALUES(53153, "carrot", "vegetable", 200);
INSERT INTO Product VALUES(78234, "hammer", "tool", 10);
SELECT * from Product
Finally, we arrive at the concept that gives the relational data model its name: Relations. In a typical real-world data science project you have to deal with more than one table to solve a given problem. The department store, for example, might have two more tables containing information about suppliers and purchase orders:
supnr | supname | supadress | supcity |
---|---|---|---|
32 | Apple Farm | 41, Lansing Rd | Lansing |
72 | Tony's Tools | 152, Jackson St | Chicago |
12 | Water Delivery | 12, Mississippi St | Merillville |
ponr | podate |
---|---|
12 | 2019-11-10 |
13 | 2019-11-14 |
14 | 2019-12-24 |
Each supplier has a supnr
and some further information given by supname
, supadress
and supcity
. Each purchase order is defined by a ponr
and a podate
. For the sake of brevity, the dates are supplied as strings and not real date types specific to a certain language.
The entire database we have created so far looks like this (primary keys are marked with an asterisk).
Product(prodnr*, prodname, prodtype, quantity)
Supplier(supnr*, supname, supadress, supcity)
Purchase_order(ponr*, podate)
# Python
supplier_data = {'supnr' : [32, 72, 12],
'supname' : ["Apple Farm", "Tony's Tools", "Water Delivery"],
'supadress' : ["41, Lansing Rd", "142, Jackson St", "12, Missisippi St"],
'supcity' : ["Lansing", "Chicago", "Merillville"]
}
purchase_order_data = {'ponr' : [12, 13, 14],
'podate' : ["2019-11-10", "2019-11-14", "2019-12-24"],
}
supplier = pd.DataFrame(data=supplier_data).set_index("supnr")
purchase_order = pd.DataFrame(data=purchase_order_data).set_index("ponr")
display(supplier, purchase_order)
# R
supplier <- tibble(
supnr = c(32, 72, 12),
supname = c("Apple Farm", "Tony's Tools", "Water Delivery"),
supadress = c("41, Lansing Rd", "142, Jackson St", "12, Missisippi St"),
supcity = c("Lansing", "Chicago", "Merillville")
)
purchase.order <- tibble(
ponr = c(12, 13, 14),
podate = c("2019-11-10", "2019-11-14", "2019-12-24")
)
print(supplier)
print(purchase.order)
--"SQL"
DROP TABLE IF EXISTS Supplier;
CREATE TABLE Supplier
(supnr TEXT NOT NULL PRIMARY KEY,
supname TEXT NOT NULL,
supadress TEXT,
supcity TEXT
);
INSERT into Supplier VALUES(32, "Apple Farm", "41, Lansing Rd", "Lansing");
INSERT into Supplier VALUES(72, "Tony's Tools", "142, Jackson St", "Chicago");
INSERT into Supplier VALUES(12, "Water Delivery", "12, Mississippi St", "Merillville");
DROP TABLE IF EXISTS Purchase_order;
CREATE TABLE Purchase_order
(ponr INTEGER NOT NULL PRIMARY KEY,
podate TEXT
);
INSERT into Purchase_order VALUES(12, "2019-11-10");
INSERT into Purchase_order VALUES(13, "2019-11-10");
INSERT into Purchase_order VALUES(14, "2019-11-10");
SELECT * FROM Supplier;
A relation is always defined between a pair of tables. We want, for example, to associate each purchase order with a supplier and each supplier with one or more products. In order to establish such relations, we need to introduce the concept of a foreign key. Simply put, a foreign key is an attribute of one table that appears as primary key in another table. It can therefore be used to uniquely identify a row in another table.
The simplest example of a relation is a one-to-one relation that associates each row of a table with exactly one row from another table. In the department store example, we could have added a table
Tax_number(taxnr*, supnr_)
that has supnr
as a foreign key (marked with a trailing underscore) and thus associates each tax number with exactly one supplier. You might wonder why we didn’t add this number as an attribute to the Supplier
table. And you’re right, we could very well have proceeded that way. However, one might want to store the taxing information in another table for reasons of data protection or administration. Nonetheless, one-to-one relations appear in real-word datasets very seldomly and we discard the Tax_number
table in the following.
Things become more interesting when we try to establish a relation between Supplier
and Purchase_order
. We might consider adding supnr
as a foreign key to Purchase_order
or adding ponr
as a foreign key to Supplier
. Having in mind that each attribute can only have one value, it becomes clear that adding ponr
to Supplier
is not an option because each supplier might be associated with more than one Purchase order. Therefore, we add supnr
as foreign key to Purchase_order
since each purchase order is associated with exactly one supplier. This is called a one-to-many relation.
ponr | podate | supnr |
---|---|---|
12 | 2019-11-10 | 32 |
13 | 2019-11-14 | 32 |
14 | 2019-12-24 | 72 |
The entire database now looks like this
Product(prodnr*, prodname, prodtype, quantity)
Supplier(supnr*, supname, supadress, supcity)
Purchase_order(ponr*, podate, supnr_)
Joins are a way to combine attributes from two tables that are having a relation. There are different types of joins, but in the scope of this article we only consider left joins. A left join between two tables on a foreign key replaces all values of this key in the first table with the attributes of the row that this key identifies in the other table. To give a concrete example: Left joining Purchase_order
and Supplier
on the foreign key supnr
results in the following table
ponr | podate | supname | supadress | supcity |
---|---|---|---|---|
12 | 2019-11-10 | Apple Farm | 41, Lansing Rd | Lansing |
13 | 2019-11-14 | Apple Farm | 41, Lansing Rd | Landing |
14 | 2019-12-24 | Tony's Tools | 142, Jackson St | Chicago |
The information on a given Supplier
has been added to each row in Purchase_order
based on the matching supnr
attribute.
# Python
# Add foreign key
purchase_order["supnr"] = [32, 32, 72]
# Left join
purchase_order.join(supplier, how='left', on='supnr')
# R
# Add foreign key
purchase.order["supnr"] <- c(32, 32, 72)
# Left join
purchase.order <- left_join(purchase.order, supplier, by='supnr')
--"SQL"
--"Foreign keys are also defined on table definiton"
DROP TABLE IF EXISTS Purchase_order;
CREATE TABLE Purchase_order
(ponr INTEGER NOT NULL PRIMARY KEY,
podate TEXT,
supnr TEXT NOT NULL,
FOREIGN KEY (supnr) REFERENCES Supplier (supnr)
);
INSERT into Purchase_order VALUES(12, "2019-11-10", 32);
INSERT into Purchase_order VALUES(13, "2019-11-14", 32);
INSERT into Purchase_order VALUES(14, "2019-12-24", 72);
--"Left join"
SELECT * from Purchase_order
LEFT JOIN Supplier
ON Purchase_order.supnr = Supplier.supnr;
A many-to-many relation exists if each row in one table is associated with many rows in a second table, and each row of the second table is related to many rows in the first table. You can imagine a many-to-many relation as a one-to-many plus a many-to-one relation.
In our example, a many-to-many relation between Supplier
and Product
exists. Each supplier can supply more than one product, and each product can be supplied by more than one supplier. However, since it is not allowed to have multivalued attributes, we can neither add supnr
as foreign key to Product
nor can we add prodnr
as foreign key to Supplier
.
We can solve this dilemma by adding a new table Supplies
with two foreign keys that together form the primary key of a new relation
Supplies([supnr_, prodnr_]*, purchase_price)
supnr | prodnr | purchase_price |
---|---|---|
32 | 44321 | 2.99 |
32 | 78234 | 13.5 |
72 | 78234 | 12 |
12 | 41234 | 3.2 |
12 | 53153 | 2.13 |
Note the new attribute purchase_price
that has been added to the relation.
# Python
# Generate primary key tuple
supnr = [32, 32, 72, 12, 12]
prodnr = [44321, 78234, 78234, 41234, 53153]
index_tuples = [(s_, p_) for (s_,p_) in zip(supnr, prodnr)]
index = pd.MultiIndex.from_tuples(index_tuples, names=['supnr', 'prodnr'])
# Add attributes
supplies_data = {'purchase_price' : [2.99, 13.5, 12, 3.2, 2.13]}
supplies = pd.DataFrame(data=supplies_data).set_index(index)
# Left join
supplies.merge(supplier, right_index=True, left_index=True).merge(product, right_index=True, left_index=True)
# R
# Create new tibble
supplies <- tibble(
supnr = c(32, 32, 72, 12, 12),
prodnr = c(44321, 78234, 78234, 41234, 53153),
purchase.price = c(2.99, 13.5, 12, 3.2, 2.13)
)
# Left join
supplies <- left_join(left_join(supplies, supplier, by='supnr'),
product, by='prodnr')
print(supplies)
--"SQL"
DROP TABLE IF EXISTS Supplies;
CREATE TABLE Supplies
(supnr TEXT NOT NULL,
prodnr TEXT NOT NULL,
purchase_price REAL,
PRIMARY KEY (supnr, prodnr),
FOREIGN KEY (supnr) REFERENCES Supplier(supnr),
FOREIGN KEY (prodnr) REFERENCES Product (prodnr)
);
INSERT into Supplies VALUES(32, 44321, 2.99);
INSERT into Supplies VALUES(32, 78234, 13.5);
INSERT into Supplies VALUES(72, 78234, 12);
INSERT into Supplies VALUES(12, 41234, 3.2);
INSERT into Supplies VALUES(12, 53153, 2.13);
select * from Supplies
LEFT JOIN Supplier
ON Supplies.supnr = Supplier.supnr
LEFT JOIN Product
ON Supplies.prodnr = Product.prodnr;
Another many-to-many relation in the supply management database of your department store is between Product
and Purchase_order
because you can ask the question which product(s) were purchased in which purchase_order(s).
PO_line([ponr_, prodnr_]*, quantitiy)
ponr | prodnr | quantity_purchased |
---|---|---|
12 | 44321 | 200 |
12 | 78234 | 50 |
13 | 78234 | 20 |
# Python
ponr = [12, 12, 13]
prodnr = [44321, 78234, 78234]
index_tuples = [(s_, p_) for (s_,p_) in zip(ponr, prodnr)]
index = pd.MultiIndex.from_tuples(index_tuples, names=['ponr', 'prodnr'])
po_line_data = {'quantity_purchased' : [200, 50, 20]}
po_line = pd.DataFrame(data=po_line_data).set_index(index)
po_line.merge(purchase_order, right_index=True, left_index=True).merge(product, right_index=True, left_index=True)
# R
po.line <- tibble(
ponr = c(12, 12, 13),
prodnr = c(44321, 78234, 78234),
quantity.purchased = c(200, 50, 20)
)
# Nested joins can be done elegantly using R pipes (%>%)
po.line <- po.line %>% left_join(purchase.order, by='ponr') %>%
left_join(product, by='prodnr')
print(po.line)
--"SQL"
DROP TABLE IF EXISTS Po_line;
CREATE TABLE Po_line
(ponr TEXT NOT NULL,
prodnr TEXT NOT NULL,
quantity_purchased INTEGER,
PRIMARY KEY (ponr, prodnr),
FOREIGN KEY (ponr) REFERENCES Purchase_order (ponr),
FOREIGN KEY (prodnr) REFERENCES Product (prodnr)
);
INSERT into Po_line VALUES(12, 44321, 200);
INSERT into Po_line VALUES(12, 78234, 50);
INSERT into Po_line VALUES(13, 78234, 20);
select * from Po_line
LEFT JOIN Purchase_order
ON Po_line.ponr = Purchase_order.ponr
LEFT JOIN Product
ON Po_line.prodnr = Product.prodnr;
The complete database now looks like this
Supplier(*supnr, supname, supadress, supcity)
Product(*prodnr, prodname, prodtype, quantity)
Supplies(*[supnr_, prodnr_], purchase_price)
Purchase_order(*ponr, podate)
PO_line(*[ponr_, prodnr_], quantitiy)
The following figure illustrates the tables in the database and their relation to each other. It was created from the SQLite database file created by the commands above using sqleton.
Data schemes like this appear in every real-world data science project. A typical business case for the department store in our example would be forecasting the demand of storage space in a warehouse. Transferred to the data level, this is equivalent to predicting the quantity
column in the Product
table.
Machine learning algorithms like neural nets or random forests that can be used to solve this problem, however, cannot process the relational data in its raw form. Therefore, you would need to create what is called a flat feature table by hand. This happens by writing statements in Pyhon, R or SQL like demonstrated in this article.
One feature that could be interesting for the prediction of the quantity
column is the sum of quantity_purchased
for each product in the Po_line
table. To this end, we left join po_line
to product
and get the following table as a result.
prodname | prodtype | quantity | quantity purchased |
---|---|---|---|
banana | fruit | 400 | NaN |
apple | fruit | 2000 | 200.0 |
carrot | vegetable | 200 | NaN |
hammer | tool | 10 | 50.0 |
hammer | tool | 10 | 20.0 |
The information about purchases for each product has been added to the product table. Because there were two purchases for the product ‘hammer’ this row appreas two times. The products for which no purchases are recorded in po_line
have a Nan
value in quantity_purchased
(there are still listed in the table, because we performed a left join).
Now we need to sum quantity_purchased
for all rows that correspond to the same product. In this case, this effectvely means summing the last two rows that correspond to the product ‘hammer’.
prodnr | quantity | quantity purchased |
---|---|---|
41234 | 400 | 0.0 |
44321 | 2000 | 200.0 |
53153 | 200 | 0.0 |
78234 | 20 | 70.0 |
Technically, you can create features like this by grouping the joined table using prodnr
. This results in four groups: The first three have only one entry and the last one (‘hammer’) has two. Then, we need to sum quantitiy_purchased
for each of the groups. This is called an ‘aggregation’.
# Python
product.merge(po_line, on='prodnr', how='left').groupby('prodnr').sum()
#R
product %>%
left_join(po.line, by='prodnr') %>%
group_by(prodnr) %>%
summarize(total.quantity.purchased = sum(quantity.purchased))
--"SQL"
SELECT Product.prodnr, SUM(quantity_purchased) as total_quantity_purchased
from Product
LEFT JOIN Po_line
ON Product.prodnr = Po_line.prodnr
GROUP BY Product.prodnr
This last example hopefully gave you a superficial impression of what it means to find features in a relational data scheme. More on this topic will be covered in a follow-up article.
This article has shown you that a relational data scheme is nothing more than a collection of tables that are related to each other using foreign keys. The nature of this relation can be ‘one-to-one’, ‘one-to-many’, or ‘many-to-many’. To visualize these relations or use them to create features on uses (left) joins to connect different tables in a database.