Automate Machine Learning on time series with getML
Most of the material and information provided about getML focuses on relational data. But what if you have to deal with time series? Is the software still applicable and of any help?
Short answer: Yes, it is applicable and suits your use case too! This holds regardless of whether you use getML as an all-in-one solution or only deploy its generated features into your custom analysis workflow.
For those interested in the long answer, the remainder of this document will first explain why a time series is indeed a special case of relational data. Then we will look at how to make getML support your time series and how the most important aspects and operations, like causality and time windows, are handled internally and within relational databases in general.
Let's assume we have a set of files containing measurements of one or more quantities, each generated in its own context, like daily temperature measurements at different observation stations. Classically, we would represent the data in a table like this:
time_stamp | temperature |
---|---|
01.01.2019 | 2.3 |
02.01.2019 | 1.8 |
03.01.2019 | -0.4 |
Being a table already, we can insert the data into a relational database system as is. But we can do more. Way more.
By adding an additional column specifying the particular measurement station and appending the data of all other files - which could very well be of different lengths - to our table, we can integrate all the data at hand into a single object. This enables us to harness the power of modern database technology and allows our analysis to scale to vast amounts of data as well as to become both more flexible and more powerful. An individual series can easily be retrieved by filtering on a particular station.
time_stamp | temperature | station |
---|---|---|
01.01.2019 | 2.3 | Berlin |
02.01.2019 | 1.8 | Berlin |
03.01.2019 | -0.4 | Berlin |
01.01.2019 | 3.8 | Paris |
02.01.2019 | 4.2 | Paris |
03.01.2019 | 1.2 | Paris |
01.01.2019 | 1.3 | Brussels |
02.01.2019 | 1.2 | Brussels |
03.01.2019 | 1.4 | Brussels |
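To make this concrete, here is a minimal sketch of the integration step using Python's built-in sqlite3 module. The table and column names (`tableAll`, `time_stamp`, `temperature`, `station`) are chosen for illustration, and the per-station data stands in for the contents of the individual files:

```python
import sqlite3

# Simulated contents of the individual files: one list of
# (time_stamp, temperature) measurements per station.
files = {
    "Berlin":   [("2019-01-01", 2.3), ("2019-01-02", 1.8), ("2019-01-03", -0.4)],
    "Paris":    [("2019-01-01", 3.8), ("2019-01-02", 4.2), ("2019-01-03", 1.2)],
    "Brussels": [("2019-01-01", 1.3), ("2019-01-02", 1.2), ("2019-01-03", 1.4)],
}

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE tableAll (time_stamp TEXT, temperature REAL, station TEXT)"
)

# Append every file to the single table, adding the station column.
for station, rows in files.items():
    con.executemany(
        "INSERT INTO tableAll VALUES (?, ?, ?)",
        [(ts, temp, station) for ts, temp in rows],
    )

# Retrieve an individual series by filtering on a particular station.
berlin = con.execute(
    "SELECT time_stamp, temperature FROM tableAll "
    "WHERE station = 'Berlin' ORDER BY time_stamp"
).fetchall()
print(berlin)
# [('2019-01-01', 2.3), ('2019-01-02', 1.8), ('2019-01-03', -0.4)]
```

Note that the stations' series may have different lengths; nothing in this scheme requires them to be aligned.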
But we still have just one table. Using getML, you can easily add further information via additional tables to produce better features and more accurate predictions. Be it metadata about the measurement stations, customer profiles, or third-party measurements. You name it! Imagine it as a many-to-one AI connector retrieving data from various sources and handing only the most promising combinations to your well-known machine learning models or time series analysis tools.
getML fully respects your time stamps and makes sure causality and temporal ordering won't be violated while generating features. The only thing you have to do is tell the engine which column should be handled as time stamps, using the time_stamps argument while constructing a DataFrame, and to add its basic time unit. That's all!
In case you are wondering how this can be achieved internally, or in relational database systems in general, the key idea is the so-called self-join. It allows a table to be joined with itself. Let's look at our example above to understand what's going on. Disclaimer: you will find some SQL code in this section. It is for illustration purposes only. You won't need to write a single line yourself, since getML will bear this burden for you!
Let's start with causality: Under no circumstances should any information about future events be leaked by the generated features. Therefore, regardless of which time stamp we look at, only events from the present and the past are allowed to be aggregated. This is illustrated by the following lines of code, which calculate the mean temperature up to a certain point in time at a particular measurement station.
SELECT t1.time_stamp, AVG( t2.temperature ), t1.station
FROM tableAll t1
LEFT JOIN tableAll t2
ON t1.station = t2.station
WHERE t2.time_stamp <= t1.time_stamp
GROUP BY t1.time_stamp, t1.station
We'll not cover every single detail of the SQL syntax, but this statement essentially says: give me the time stamps, stations, and average temperatures of our table tableAll, where each average is taken over all temperatures at the particular station having a time stamp equal to or smaller than the current one. t1 and t2 are aliases referencing the same table. It's a little much, so let's look at the individual rows of our table. The first row deals with Berlin on 01.01.2019. Since there are no time stamps prior to this date, and measurements taken at other stations do not match our ON condition, the average is the measured temperature itself. In the second row, two time stamps match the WHERE condition, yielding an average of (2.3 + 1.8)/2 = 2.05. The result will look like this.
time_stamp | avg_temperature | station |
---|---|---|
01.01.2019 | 2.3 | Berlin |
02.01.2019 | 2.05 | Berlin |
03.01.2019 | 1.233 | Berlin |
01.01.2019 | 3.8 | Paris |
02.01.2019 | 4.0 | Paris |
03.01.2019 | 3.067 | Paris |
01.01.2019 | 1.3 | Brussels |
02.01.2019 | 1.25 | Brussels |
03.01.2019 | 1.3 | Brussels |
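You can verify the self-join yourself with Python's built-in sqlite3 module. We average the joined (t2) side and group by time stamp and station so that one average is computed per row of the original table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE tableAll (time_stamp TEXT, temperature REAL, station TEXT)"
)
con.executemany("INSERT INTO tableAll VALUES (?, ?, ?)", [
    ("2019-01-01", 2.3, "Berlin"),   ("2019-01-02", 1.8, "Berlin"),
    ("2019-01-03", -0.4, "Berlin"),  ("2019-01-01", 3.8, "Paris"),
    ("2019-01-02", 4.2, "Paris"),    ("2019-01-03", 1.2, "Paris"),
    ("2019-01-01", 1.3, "Brussels"), ("2019-01-02", 1.2, "Brussels"),
    ("2019-01-03", 1.4, "Brussels"),
])

# Only rows with a time stamp equal to or smaller than the current one
# enter the average -- no information from the future is leaked.
rows = con.execute("""
    SELECT t1.time_stamp, AVG(t2.temperature) AS avg_temperature, t1.station
    FROM tableAll t1
    LEFT JOIN tableAll t2
      ON t1.station = t2.station
    WHERE t2.time_stamp <= t1.time_stamp
    GROUP BY t1.station, t1.time_stamp
    ORDER BY t1.station, t1.time_stamp
""").fetchall()

for ts, avg, station in rows:
    print(ts, round(avg, 3), station)
```

Running this reproduces the numbers in the table above, e.g. (2.3 + 1.8 - 0.4)/3 = 1.233 for Berlin on the third day.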
Performing an average over a time window, e.g. taking only the last week of our daily data set into account, is a little more complicated (in terms of syntax).
SELECT t1.time_stamp, AVG( t2.temperature ), t1.station
FROM tableAll t1
LEFT JOIN tableAll t2
ON t1.station = t2.station
WHERE t2.time_stamp <= t1.time_stamp AND t2.time_stamp + 7 > t1.time_stamp
GROUP BY t1.time_stamp, t1.station
Here, all rows considered in the average must, again, have a time stamp equal to or smaller than the current one. In addition, they must not be older than 7 days with respect to the current one (the time stamps are assumed to be expressed in days, so adding 7 corresponds to one week).
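The window logic can be checked the same way with sqlite3. Since the sample time stamps are calendar dates rather than plain numbers, SQLite's julianday() function is used here to turn them into day numbers before the arithmetic; and because the sample covers only three days, a 2-day window is used instead of 7 so the effect is actually visible:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE tableAll (time_stamp TEXT, temperature REAL, station TEXT)"
)
con.executemany("INSERT INTO tableAll VALUES (?, ?, ?)", [
    ("2019-01-01", 2.3, "Berlin"),
    ("2019-01-02", 1.8, "Berlin"),
    ("2019-01-03", -0.4, "Berlin"),
])

# A 2-day window: only the current and the previous day enter the average.
rows = con.execute("""
    SELECT t1.time_stamp, AVG(t2.temperature) AS avg_temperature, t1.station
    FROM tableAll t1
    LEFT JOIN tableAll t2
      ON t1.station = t2.station
    WHERE t2.time_stamp <= t1.time_stamp
      AND julianday(t2.time_stamp) + 2 > julianday(t1.time_stamp)
    GROUP BY t1.station, t1.time_stamp
    ORDER BY t1.time_stamp
""").fetchall()

for ts, avg, station in rows:
    print(ts, round(avg, 2), station)
# The third row now averages only 02.01. and 03.01.: (1.8 - 0.4)/2 = 0.7
```

Unlike the purely causal query, the first day of the window drops out of the third row's average.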
But remember, all of this will be done automatically by getML, so you won't have to care about a single line of SQL (unless you intend to).