DataFrame
Underdog's DataFrame module combines tools for working with columnar data as tables (dataframes) and columns (series). It also offers extra features such as statistical functions and visualizations via Underdog's plots module.
Tutorial
Prerequisites
Dependencies
To be able to follow the tutorial you need the underdog-dataframe
module. If you're using Gradle in your project:
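A sketch of the dependency declaration; the coordinates and version below are placeholders, check the project's documentation for the actual ones:
dependencies {
    implementation 'com.github.grooviter:underdog-dataframe:VERSION'
}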
Data
You can find the data used in this tutorial (tornadoes_1950-2014.csv) in the project's repository.
Loading data
First of all we need to load the csv file with the data. Underdog infers the column types by sampling the data.
import underdog.Underdog
def tornadoes = Underdog.df().read_csv("src/test/resources/data/tornadoes_1950-2014.csv")
Note that the file, in this case, is addressed relative to the current working directory. The argument must point to the csv file, whether as a relative or an absolute path.
Metadata
Often, the best way to start is to print the column names for reference:
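A minimal sketch, assuming the column names are exposed as a list property (the exact accessor may differ):
// printing the column names for reference
println(tornadoes.columns)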
[Date, Time, State, State No, Scale, Injuries, Fatalities, Start Lat, Start Lon, Length, Width]
The shape() method displays the row and column counts:
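// prints the row and column counts (output formatting may vary)
println(tornadoes.shape())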
schema() shows the index, name and type of each column:
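// the schema is itself a DataFrame
def schema = tornadoes.schema()
println(schema)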
Structure of tornadoes_1950-2014.csv
Index | Column Name | Column Type |
-----------------------------------------
0 | Date | LOCAL_DATE |
1 | Time | LOCAL_TIME |
2 | State | STRING |
3 | State No | INTEGER |
4 | Scale | INTEGER |
5 | Injuries | INTEGER |
6 | Fatalities | INTEGER |
7 | Start Lat | DOUBLE |
8 | Start Lon | DOUBLE |
9 | Length | DOUBLE |
10 | Width | INTEGER |
Like many DataFrame methods, schema() returns another DataFrame, so producing a string representation for display is as simple as printing it, as we did above.
You can also perform other DataFrame operations on the schema. For example, the code below removes all columns whose type isn't DOUBLE:
def customSchema = schema[schema['Column Type'] == 'DOUBLE']
Structure of tornadoes_1950-2014.csv
Index | Column Name | Column Type |
-----------------------------------------
7 | Start Lat | DOUBLE |
8 | Start Lon | DOUBLE |
9 | Length | DOUBLE |
Of course, that also returned another DataFrame. We’ll cover selecting rows in more detail later.
Previewing
The first(n) method returns a new DataFrame containing the first n rows.
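def firstRows = tornadoes.first(3)
println(firstRows)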
tornadoes_1950-2014.csv
Date | Time | State | State No | Scale | Injuries | ... |
--------------------------------------------------------------------------------
1950-01-03 | 11:00:00 | MO | 1 | 3 | 3 | ... |
1950-01-03 | 11:00:00 | MO | 1 | 3 | 3 | ... |
1950-01-03 | 11:10:00 | IL | 1 | 3 | 0 | ... |
Transforming
Mapping operations take one or more series (columns) as inputs and produce a new column as output. The method below extracts the month name from the Date column into a new column.
def monthSeries = tornadoes["Date"](LocalDate, String) {
it.format("MMMM")
}
Now that you have a new column, you can add it to the DataFrame:
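// adding the new series under the name 'month'
tornadoes['month'] = monthSeries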
Of course, nothing prevents you from doing everything in a single step:
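tornadoes['month'] = tornadoes['Date'](LocalDate, String) { it.format("MMMM") }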
You can remove columns from DataFrames to save memory or reduce clutter:
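A sketch; drop() is assumed here by analogy with pandas, check the API docs for the exact method name:
tornadoes = tornadoes.drop('month')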
Sorting
Now let's sort the DataFrame, for example by the Date column:
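// sorting ascending by the Date series
def sortedByDate = tornadoes.sort_values(by: 'Date')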
You can also sort in descending order; a negative sign before the column name indicates a descending sort:
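// '-' before the series name requests descending order
def latestFirst = tornadoes.sort_values(by: '-Date')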
And you can even sort by more than one field:
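A sketch combining an ascending and a descending key (the '-' prefix as above):
// State ascending, then Date descending within each state
def multiSorted = tornadoes.sort_values(by: ['State', '-Date'])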
Descriptive statistics
Descriptive statistics are calculated using the summary() method:
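A sketch applying summary() to a single series, matching the output below:
println(tornadoes['Fatalities'].summary())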
Showing the following output:
Column: Fatalities
Measure | Value |
------------------------------------
n | 59945 |
sum | 6802 |
Mean | 0.11347068145800349 |
Min | 0 |
Max | 158 |
Range | 158 |
Variance | 2.901978053261765 |
Std. Dev | 1.7035193140266314 |
Filtering
The preferred way of filtering DataFrames in Underdog is to use the list notation. Look at the following example:
// reading tornadoes
def ts = Underdog.df().read_csv("src/test/resources/data/tornadoes_1950-2014.csv")
// adding a new series to dataframe with the name of the month
ts['month'] = ts["Date"](LocalDate, String) { it.format("MMMM") }
// filtering
def result = ts[
    ts['Fatalities'] > 0 &                   // at least 1 fatality
    ts['month'] == "April" &                 // in the month of April
    (ts['Width'] > 300 | ts['Length'] > 10)  // a wide or a long tornado
]
// selecting only two columns
def stateAndDate = result['State', 'Date']
tornadoes_1950-2014.csv
State | Date |
------------------------
MO | 1950-01-03 |
IL | 1950-01-03 |
OH | 1950-01-03 |
The last example filters the tornadoes DataFrame with predicates of the form dataFrame[seriesName] op (series | object), where op can be a comparison operator such as >=, <=, ==, etc. Of course these expressions can be combined with the or (|) and and (&) operators.
Grouping
Series metrics can be calculated using grouping methods like sum(), product(), mean(), max(), etc. You can apply those methods to a DataFrame, calculating results on one column, grouped by the values in another.
def tornadoes = Underdog.df().read_csv("src/test/resources/data/tornadoes_1950-2014.csv")
def injuriesByScale = tornadoes
.rename("Median Injuries by Tornado Scale")
.agg(Injuries: "median")
.by("Scale")
.sort_values(by: "Scale")
This produces the following DataFrame, in which Group represents the tornado Scale and Median the median injuries for that group:
Median injuries by Tornado Scale
Scale | Median [Injuries] |
-------------------------------
-9 | 0 |
0 | 0 |
1 | 0 |
2 | 0 |
3 | 1 |
4 | 12 |
5 | 107 |
Cross Tabs
Underdog lets you easily produce two-dimensional cross-tabulations (“cross tabs”) of counts and proportions with row and column subtotals. Here’s a count example where we look at the interaction of tornado severity and US state:
def crossTab = tornadoes.xTabCounts(labels: 'State', values: 'Scale')
crossTab.head()
Crosstab Counts: State x Scale
[labels] | -9 | 0 | 1 | 2 | 3 | 4 | 5 | total |
----------------------------------------------------------------------------
AL | 0 | 624 | 770 | 425 | 142 | 38 | 12 | 2011 |
AR | 1 | 486 | 667 | 420 | 162 | 29 | 0 | 1765 |
AZ | 1 | 146 | 71 | 16 | 3 | 0 | 0 | 237 |
CA | 1 | 271 | 117 | 23 | 2 | 0 | 0 | 414 |
CO | 3 | 1322 | 563 | 112 | 22 | 1 | 0 | 2023 |
CT | 0 | 18 | 53 | 22 | 4 | 2 | 0 | 99 |
DC | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 2 |
DE | 0 | 22 | 26 | 12 | 1 | 0 | 0 | 61 |
FL | 2 | 1938 | 912 | 319 | 37 | 3 | 0 | 3211 |
GA | 0 | 413 | 700 | 309 | 74 | 11 | 0 | 1507 |
Putting it all together
Now that you've seen the pieces, we can put them together to perform a more complex data analysis. Let's say we want to know how frequently tornadoes occur in the summer. Here's one way to approach that:
Let’s start by getting only those tornadoes that occurred in the summer.
def ts = Underdog.df().read_csv("src/test/resources/data/tornadoes_1950-2014.csv")
// adding some series to the dataframe to make filtering easier
ts['month'] = ts['Date'](LocalDate, String) { it.format("MMMM") }
ts['dayOfMonth'] = ts['Date'](LocalDate, Integer) { it.format("dd").toInteger() }
// filtering
def summer = ts[
(ts['month'] == 'June' & ts['dayOfMonth'] > 21) | // after June the 21st or...
(ts['month'] in ['July', 'August']) | // in July or August or...
(ts['month'] == 'September' & ts['dayOfMonth'] < 22) // before September the 22nd
]
Then to get the frequency:
- We calculate the difference in days between successive tornadoes.
- The lag() method creates a column where every value equals the previous value (the prior row) of the source column.
- Then we can simply get the difference in days between the two dates. DateColumn has a method daysUntil() that does this. It returns a NumberColumn that we’ll call “delta”.
// sorting by Date and Time series
summer = summer.sort_values(by: ['Date', 'Time'])
// creating a series with lagged dates
summer['Lagged'] = summer['Date'].lag(1)
// creating a series with delta days between lagged dates and summer dates
summer['Delta'] = summer['Lagged'] - summer['Date']
Now we simply calculate the mean of the delta column. Splitting on year keeps us from inadvertently including the time between the last tornado of one summer and the first tornado of the next.
// creating year series to be able to group by it
summer['year'] = summer['Date'](LocalDate, String) { it.format("yyyy") }
// aggregating delta
def summary = summer.agg(Delta: ["mean", "count"]).by("year")
// print out summary
println(summary)
Printing summary gives us the answer by year.
tornadoes_1950-2014.csv summary
Date year | Mean [Date lag(1) - Date[DAYS]] | Count [Date lag(1) - Date[DAYS]] |
--------------------------------------------------------------------------------------
1950 | 2.0555555555555545 | 162 |
1951 | 1.7488584474885829 | 219 |
1952 | 1.8673469387755088 | 196 |
1953 | 0.983870967741935 | 372 |
1954 | 0.8617283950617302 | 405 |
To get a single number for the entire period, we can take the average of the annual means.
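A sketch, addressing the aggregated series by the name shown in the output above:
// average of the annual means for the whole period
def overallMean = summary['Mean [Date lag(1) - Date[DAYS]]'].mean()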
Saving your data
To save a DataFrame, you can write it as a CSV file:
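A minimal sketch, assuming a pandas-style to_csv() writer (check the API docs for the exact method name):
tornadoes.to_csv('rev_tornadoes_1950-2014.csv')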
And that’s it for the introduction. Please see the User Guide for more information.
DataFrame
Creation
The easiest way to create a DataFrame is by using the Underdog extension method Underdog.df(). Here we're creating an empty DataFrame:
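One possible sketch, reusing the from() factory described below with an empty map (an assumption, not a documented idiom):
def emptyFrame = Underdog.df().from([:], 'empty-dataframe')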
We can create a dataframe from a map whose entries represent series: each key is the name of a series and each value is a collection which becomes the content of that series.
// creating a map
def map = [
names: ["John", "Laura", "Ursula"],
ages: [22, 34, 83]
]
// creating a dataframe from the map
DataFrame map2DataFrame = Underdog.df().from(map, "people-dataframe")
The Underdog dataframe library adds extra methods to collection types so that you can convert collections to DataFrames. An example is invoking the toDataFrame(...) method on the map directly:
// creating a map
def map = [
names: ["John", "Laura", "Ursula"],
ages: [22, 34, 83]
]
// creating a dataframe from a map
DataFrame map2DataFrame = map.toDataFrame("people-dataframe")
You can also pass a list of maps to the Underdog.df().from(col, name) method. The method assumes all entries are maps with the same keys:
// creating a list of maps
def list = [
[name: "John", age: 22],
[name: "Laura", age: 34],
[name: "Ursula", age: 83]
]
// creating a dataframe from the list
DataFrame colOfMaps2DataFrame = Underdog.df().from(list, "people-dataframe")
There is also an extension method for collections: if your list complies with this structure, you can call the toDataFrame(name) method to create a DataFrame from that collection.
// creating a list of maps
def list = [
[name: "John", age: 22],
[name: "Laura", age: 34],
[name: "Ursula", age: 83]
]
// creating a dataframe from the list
DataFrame colOfMaps2DataFrame = list.toDataFrame("people-dataframe")
Filtering
In a dataframe you can filter data by any of its series.
Numbers
The following example creates a hypothetical population progression in ten years:
def df = [
years: (1991..2000),
population: (1..10).collect { 1000 * it }
].toDataFrame("population increase")
population increase
years | population |
------------------------
1991 | 1000 |
1992 | 2000 |
1993 | 3000 |
1994 | 4000 |
1995 | 5000 |
1996 | 6000 |
1997 | 7000 |
1998 | 8000 |
1999 | 9000 |
2000 | 10000 |
If we wanted to take the records after year 1995:
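// years strictly greater than 1995
def after1995 = df[df['years'] > 1995]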
population increase
years | population |
------------------------
1996 | 6000 |
1997 | 7000 |
1998 | 8000 |
1999 | 9000 |
2000 | 10000 |
Or getting records with population less than 4000:
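// population strictly less than 4000
def lessThan4000 = df[df['population'] < 4000]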
population increase
years | population |
------------------------
1991 | 1000 |
1992 | 2000 |
1993 | 3000 |
String
Of course we can filter by strings. Below we've got a dataframe with some employee data:
def df = [
employees: ['Udo', 'John', 'Albert', 'Ronda'],
department: ['sales', 'it', 'sales', 'it'],
payroll: [10_000, 12_000, 11_000, 13_000]
].toDataFrame("employees")
employees
employees | department | payroll |
----------------------------------------
Udo | sales | 10000 |
John | it | 12000 |
Albert | sales | 11000 |
Ronda | it | 13000 |
Getting employees from sales department:
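// employees whose department is 'sales'
def sales = df[df['department'] == 'sales']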
employees
employees | department | payroll |
----------------------------------------
Udo | sales | 10000 |
Albert | sales | 11000 |
You can also filter by a list of possible choices using the in operator:
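// employees whose name is contained in the list
def someEmployees = df[df['employees'] in ['Udo', 'Ronda']]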
employees
employees | department | payroll |
----------------------------------------
Udo | sales | 10000 |
Ronda | it | 13000 |
You can even filter by a regular expression. Let's look for employees with an 'o' in their name:
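// names matching the regular expression, i.e. containing an 'o'
def withAnO = df[df['employees'] ==~ /.*o.*/]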
employees
employees | department | payroll |
----------------------------------------
Udo | sales | 10000 |
John | it | 12000 |
Ronda | it | 13000 |
Dates
Of course, in time series it is crucial to be able to search by time frame.
import java.time.LocalDate

// Using a given date as the beginning of our df dates series
def initialDate = LocalDate.parse('01/01/2000', 'dd/MM/yyyy')
// a dataframe containing the simulation of bicycles rented through 2000
def df = [
dates: (1..365).collect(initialDate::plusDays),
rented: (1..365).collect { new Random().nextInt(200) }
].toDataFrame("rented bicycles 2000")
rented bicycles 2000
dates | rented |
-------------------------
2000-01-02 | 41 |
2000-01-03 | 47 |
2000-01-04 | 27 |
2000-01-05 | 95 |
2000-01-06 | 30 |
2000-01-07 | 162 |
2000-01-08 | 52 |
2000-01-09 | 197 |
2000-01-10 | 125 |
2000-01-11 | 15 |
... | ... |
What if we'd like to get only those records of December 2000?
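One way to express it, combining two LocalDate bounds (see the supported operators summary below):
def december = df[
    df['dates'] >= LocalDate.parse('01/12/2000', 'dd/MM/yyyy') &
    df['dates'] <= LocalDate.parse('31/12/2000', 'dd/MM/yyyy')
]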
rented bicycles 2000
dates | rented |
-------------------------
2000-12-01 | 104 |
2000-12-02 | 193 |
2000-12-03 | 107 |
2000-12-04 | 108 |
2000-12-05 | 193 |
2000-12-06 | 165 |
2000-12-07 | 82 |
2000-12-08 | 77 |
2000-12-09 | 176 |
2000-12-10 | 158 |
... | ... |
2000-12-31 | 150 |
& and |
You can combine any predicate with the operators & and |. For example, let's filter by all years after 1995 AND entries with population less than or equal to 8000:
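// both conditions must hold
def result = df[df['years'] > 1995 & df['population'] <= 8000]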
Now let's filter by years less than or equal to 1995 OR entries with population greater than 9000:
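// either condition may hold
def other = df[df['years'] <= 1995 | df['population'] > 9000]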
Summary
Here you have the tables with the supported operators:
Arithmetic
Left | Right | Operator | Example | Status |
---|---|---|---|---|
Series | Series | + | df['a'] + df['b'] | Yes |
Series | Series | - | df['a'] - df['b'] | Yes |
Series | Object | + | df['a'] + 1 | Yes |
Series | Object | - | df['a'] - 1 | Yes |
Series | Object | * | df['a'] * 2 | Yes |
Series | Object | / | df['a'] / 2 | Yes |
Filtering operators
Type | Operator | Example | Status |
---|---|---|---|
String | == | df['a'] == 'x' | Yes |
String | != | df['a'] != 'x' | Yes |
String | ==~ | df['a'] ==~ /.*/ | Yes |
String | in | df['a'] in ['x'] | Yes |
Number | == | df['a'] == 1 | Yes |
Number | != | df['a'] != 1 | Yes |
Number | > | df['a'] > 1 | Yes |
Number | >= | df['a'] >= 1 | Yes |
Number | < | df['a'] < 1 | Yes |
Number | <= | df['a'] <= 1 | Yes |
LocalDate | > | df['a'] > date | Yes |
LocalDate | >= | df['a'] >= date | Yes |
LocalDate | < | df['a'] < date | Yes |
LocalDate | <= | df['a'] <= date | Yes |
Sorting
TODO
Mapping
TODO
Import / Export
CSV
Reading
You can read csv files via the Underdog.df().read_csv(...) method. Here we are importing a csv file containing tornado incidents in the USA:
DataFrame dataframe = Underdog.df().read_csv("src/test/resources/data/tornadoes_1950-2014.csv")
tornadoes_1950-2014.csv
Date | Time | State | State No | Scale | Injuries | ... |
--------------------------------------------------------------------------------
1950-01-03 | 11:00:00 | MO | 1 | 3 | 3 | ... |
1950-01-03 | 11:00:00 | MO | 1 | 3 | 3 | |
1950-01-03 | 11:10:00 | IL | 1 | 3 | 0 | |
1950-01-03 | 11:55:00 | IL | 2 | 3 | 3 | |
1950-01-03 | 16:00:00 | OH | 1 | 1 | 1 | |
1950-01-13 | 05:25:00 | AR | 1 | 3 | 1 | |
1950-01-25 | 19:30:00 | MO | 2 | 2 | 5 | |
1950-01-25 | 21:00:00 | IL | 3 | 2 | 0 | |
1950-01-26 | 18:00:00 | TX | 1 | 2 | 2 | |
1950-02-11 | 13:10:00 | TX | 2 | 2 | 0 | |
... | ... | ... | ... | ... | ... | ... |
Separator
By default the csv reader assumes the csv file uses a comma (,) as the separator character, but you can provide a custom separator. For example, the following csv file content:
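For instance, a semicolon-separated file (illustrative content):
name;age
John;22
Laura;34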
Can be read by using the sep argument:
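// filePath points to the semicolon-separated file above
def df = Underdog.df().read_csv(filePath, sep: ';')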
Duplicated names
Sometimes you can find a csv where column names are repeated. By default, if you don't explicitly allow repeated columns, the import process will fail. Imagine we've got the following csv:
bronze,silver,gold,summer_total,bronze,silver,gold,winter_total
1,2,1,4,1,1,1,3
To allow repeated columns you should set the allowDuplicatedNames flag to true.
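// allowing repeated column names
def df = Underdog.df().read_csv(filePath, allowDuplicatedNames: true)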
Then all repeated names will be suffixed, in order, with a number to avoid collisions:
io_repeated_cols.csv
bronze | silver | gold | summer_total | bronze-2 | silver-2 | gold-2 | winter_total|
-----------------------------------------------------------------------------------------------
1 | 2 | 1 | 4 | 1 | 1 | 1 | 3 |
Missing values
If a csv file contains values which should be treated as missing values, we can pass this information when reading the csv file.
from,to,id
N/C,Madrid,123
Madrid,Paris,124
Paris,London,125
London,NONE,126
Here we're treating the values N/C and NONE as missing data:
def dataframe = Underdog.df().read_csv(filePath, nanValues: ['NONE', 'N/C'])
That will inform the reader to consider cells containing those values as missing values:
io_custom_missing_data.csv
from | to | id |
-----------------------------
| Madrid | 123 |
Madrid | Paris | 124 |
Paris | London | 125 |
London | | 126 |
Date format
If your csv files have a custom date format you can provide the date pattern as a parameter. Here we have a file with a custom format:
Date,Close
2014-12-05 00:00:00+00:00,0.5267500281333923
2014-12-08 00:00:00+00:00,0.5199999809265137
2014-12-09 00:00:00+00:00,0.5182499885559082
Passing the pattern as parameter:
def dataframe = Underdog.df().read_csv(filePath, dateFormat: "yyyy-MM-dd HH:mm:ss+00:00")
Gives the following output:
io_custom_date_format.csv
Date | Close |
-------------------------------------
2014-12-05 | 0.5267500281333923 |
2014-12-08 | 0.5199999809265137 |
2014-12-09 | 0.5182499885559082 |
Skip rows/footer
If you're sure that there is data you'd like to avoid parsing, like nonsense data, you can skip parsing those rows. Check the following example:
(csv file with comments; the file content is not shown here)
There are lines we don't want to consider when creating our dataframe:
- comments in the beginning of the file (lines 1-3)
- comments in the end of the file (line 15)
- rows we don't want to parse because they don't add any meaningful information (lines 4-8 and 11-14)
To avoid parsing any of these lines we can instruct the csv reader to skip lines in the header and/or in the footer of the file:
def dataframe = Underdog.df()
.read_csv(filePath,
header: false, // not using first row as header
skipRows: 8, // skipping rows at the beginning of the file
skipFooter: 4 // skipping rows at the end of the file
).renameSeries(columns: ['city', 'id']) // renaming series names with the list passed as parameter
Max chars per column
You can instruct the csv reader to refuse columns with more than a given number of characters.
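A sketch; maxCharsPerColumn is an assumed parameter name here, check the API docs for the actual one:
def df = Underdog.df().read_csv(filePath, maxCharsPerColumn: 4096)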
Warning
If a column value exceeds that number of characters, the process will throw an exception.
Max cols
You can instruct the csv reader to avoid parsing more than a given number of columns.
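A sketch; maxNumberOfColumns is an assumed parameter name here, check the API docs for the actual one:
def df = Underdog.df().read_csv(filePath, maxNumberOfColumns: 20)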
Warning
If the number of columns exceeds the number specified, the process will throw an exception.
Series
Intro
A Series object represents a named one-dimensional array. It supports arithmetic operations, statistical methods, and operations to deal with missing values. You can create a Series object from different sources:
// from a range of numbers
def numbers = (1..4).toSeries("numbers")
// from a range of letters
def letters = ('A'..'C').toSeries("letters")
// from a list
def stuff = [1, 2, null, 3, 4].toSeries("stuff")
You can use operator symbols to apply simple operations over the Series object:
// multiplying a series by a number
def doubleSeries = numbers * 2
// multiplying a series by another series
def rowProduct = numbers * stuff.dropna()
// dividing a series
def halves = stuff / 2
// using custom transformation to create a new series
def custom = letters(String, String) { "letter-$it".toString() }
Sometimes you may want to analyze a given Series object by using statistical methods:
def mean = doubleSeries.mean()
def max = doubleSeries.max()
def min = doubleSeries.min()
def avg = doubleSeries.avg()
You can find all available statistical methods in the underdog.impl.extensions.SeriesStatsExtensions class.
Creating
Series are meant to be created from collections or as a transformation from another Series.
The only way to create a Series from a collection is by invoking the toSeries() extension method on a list:
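// creating a series from a plain list
def numbers = [1, 2, 3, 4].toSeries('numbers')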
Most of the time we will be creating Series inside the scope of a DataFrame: sometimes as the result of transforming another series, sometimes because we would like to fill a series with a constant value.
Let's say we have a DataFrame with some series:
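A sketch matching the output below:
def df = [numbers: (1..10)].toDataFrame('numbers')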
And we want to create a new series named by_two with the result of multiplying by two all numbers in the numbers series:
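// multiplying a series by a scalar produces a new series
df['by_two'] = df['numbers'] * 2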
numbers
numbers | by_two |
----------------------
1 | 2 |
2 | 4 |
3 | 6 |
4 | 8 |
5 | 10 |
6 | 12 |
7 | 14 |
8 | 16 |
9 | 18 |
10 | 20 |
You can also create a new Series inside a dataframe filling all rows with the same value:
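A sketch, assuming scalar assignment broadcasts the value to every row:
df['one'] = 1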
numbers
numbers | by_two | one |
------------------------------
1 | 2 | 1 |
2 | 4 | 1 |
3 | 6 | 1 |
4 | 8 | 1 |
5 | 10 | 1 |
6 | 12 | 1 |
7 | 14 | 1 |
8 | 16 | 1 |
9 | 18 | 1 |
10 | 20 | 1 |
Statistics
TODO