DataFrame
Underdog's DataFrame module combines tools for working with columnar data as tables (dataframes) and columns (series). It also offers extra features such as statistical functions and visualizations via Underdog's plots module.
Tutorial
Prerequisites
Dependencies
To be able to follow the tutorial you need the underdog-dataframe
module. If you're using Gradle in your project:
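A sketch of the dependency declaration; the coordinates and version below are placeholders, check the project's documentation for the actual ones:
dependencies {
    implementation 'com.github.grooviter:underdog-dataframe:VERSION'
}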
Data
You can find the data used in this tutorial (tornadoes_1950-2014.csv) in the project's repository.
Loading data
First of all we need to load the csv file with the data. Underdog infers the column types by sampling the data.
import underdog.Underdog
def tornadoes = Underdog.df().read_csv("src/test/resources/data/tornadoes_1950-2014.csv")
Note that the file, in this case, is addressed relative to the current working directory. The argument must point to the csv file, whether as a relative or an absolute path.
Metadata
Often, the best way to start is to print the column names for reference:
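A minimal sketch, assuming the column names are exposed as a list property (the exact accessor may differ):
// printing the column names for reference
println(tornadoes.columns)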
[Date, Time, State, State No, Scale, Injuries, Fatalities, Start Lat, Start Lon, Length, Width]
The shape() method displays the row and column counts:
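// prints the row and column counts (output formatting may vary)
println(tornadoes.shape())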
schema() shows the index, name and type of each column:
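// the schema is itself a DataFrame
def schema = tornadoes.schema()
println(schema)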
Structure of tornadoes_1950-2014.csv
Index | Column Name | Column Type |
-----------------------------------------
0 | Date | LOCAL_DATE |
1 | Time | LOCAL_TIME |
2 | State | STRING |
3 | State No | INTEGER |
4 | Scale | INTEGER |
5 | Injuries | INTEGER |
6 | Fatalities | INTEGER |
7 | Start Lat | DOUBLE |
8 | Start Lon | DOUBLE |
9 | Length | DOUBLE |
10 | Width | INTEGER |
Like many DataFrame methods, schema() returns another DataFrame, so producing a string representation for display is as simple as printing it, as we did above.
You can also perform other DataFrame operations on the schema. For example, the code below removes all columns whose type isn't DOUBLE:
def customSchema = schema[schema['Column Type'] == 'DOUBLE']
Structure of tornadoes_1950-2014.csv
Index | Column Name | Column Type |
-----------------------------------------
7 | Start Lat | DOUBLE |
8 | Start Lon | DOUBLE |
9 | Length | DOUBLE |
Of course, that also returned another DataFrame. We’ll cover selecting rows in more detail later.
Previewing
The first(n) method returns a new DataFrame containing the first n rows.
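def firstRows = tornadoes.first(3)
println(firstRows)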
tornadoes_1950-2014.csv
Date | Time | State | State No | Scale | Injuries | ... |
--------------------------------------------------------------------------------
1950-01-03 | 11:00:00 | MO | 1 | 3 | 3 | ... |
1950-01-03 | 11:00:00 | MO | 1 | 3 | 3 | ... |
1950-01-03 | 11:10:00 | IL | 1 | 3 | 0 | ... |
Transforming
Mapping operations take one or more series (columns) as inputs and produce a new column as output. The method below extracts the month name from the Date column into a new column.
def monthSeries = tornadoes["Date"](LocalDate, String) {
it.format("MMMM")
}
Now that you have a new column, you can add it to the DataFrame:
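// adding the new series under the name 'month'
tornadoes['month'] = monthSeries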
Of course, nothing prevents you from doing everything in a single step:
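tornadoes['month'] = tornadoes['Date'](LocalDate, String) { it.format("MMMM") }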
You can remove columns from DataFrames to save memory or reduce clutter:
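A sketch; drop() is assumed here by analogy with pandas, check the API docs for the exact method name:
tornadoes = tornadoes.drop('month')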
Sorting
Now let's sort the DataFrame, for example by the Date column:
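// sorting ascending by the Date series
def sortedByDate = tornadoes.sort_values(by: 'Date')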
You can also sort in descending order; a negative sign before the column name indicates a descending sort:
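// '-' before the series name requests descending order
def latestFirst = tornadoes.sort_values(by: '-Date')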
And you can even sort by more than one field:
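A sketch combining an ascending and a descending key (the '-' prefix as above):
// State ascending, then Date descending within each state
def multiSorted = tornadoes.sort_values(by: ['State', '-Date'])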
Descriptive statistics
Descriptive statistics are calculated using the summary() method:
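A sketch applying summary() to a single series, matching the output below:
println(tornadoes['Fatalities'].summary())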
Showing the following output:
Column: Fatalities
Measure | Value |
------------------------------------
n | 59945 |
sum | 6802 |
Mean | 0.11347068145800349 |
Min | 0 |
Max | 158 |
Range | 158 |
Variance | 2.901978053261765 |
Std. Dev | 1.7035193140266314 |
Filtering
The preferred way of filtering DataFrames in Underdog is to use the list notation. Look at the following example:
// reading tornadoes
def ts = Underdog.df().read_csv("src/test/resources/data/tornadoes_1950-2014.csv")
// adding a new series to dataframe with the name of the month
ts['month'] = ts["Date"](LocalDate, String) { it.format("MMMM") }
// filtering
def result = ts[
    ts['Fatalities'] > 0 &                   // at least 1 fatality
    ts['month'] == "April" &                 // in the month of April
    (ts['Width'] > 300 | ts['Length'] > 10)  // a wide or a long tornado
]
// selecting only two columns
def stateAndDate = result['State', 'Date']
tornadoes_1950-2014.csv
State | Date |
------------------------
MO | 1950-01-03 |
IL | 1950-01-03 |
OH | 1950-01-03 |
The last example filters the tornadoes DataFrame with predicates of the form dataFrame[seriesName] op (series | object), where op can be a comparison operator such as >=, <=, ==, etc. Of course these expressions can be combined with the or (|) and and (&) operators.
Grouping
Series metrics can be calculated using grouping methods like sum(), product(), mean(), max(), etc. You can apply those methods to a DataFrame, calculating results on one column, grouped by the values in another.
def tornadoes = Underdog.df().read_csv("src/test/resources/data/tornadoes_1950-2014.csv")
def injuriesByScale = tornadoes
.rename("Median Injuries by Tornado Scale")
.agg(Injuries: "median")
.by("Scale")
.sort_values(by: "Scale")
This produces the following DataFrame, in which Group represents the tornado Scale and Median the median injuries for that group:
Median injuries by Tornado Scale
Scale | Median [Injuries] |
-------------------------------
-9 | 0 |
0 | 0 |
1 | 0 |
2 | 0 |
3 | 1 |
4 | 12 |
5 | 107 |
Cross Tabs
Underdog lets you easily produce two-dimensional cross-tabulations (“cross tabs”) of counts and proportions with row and column subtotals. Here’s a count example where we look at the interaction of tornado severity and US state:
def crossTab = tornadoes.xTabCounts(labels: 'State', values: 'Scale')
crossTab.head()
Crosstab Counts: State x Scale
[labels] | -9 | 0 | 1 | 2 | 3 | 4 | 5 | total |
----------------------------------------------------------------------------
AL | 0 | 624 | 770 | 425 | 142 | 38 | 12 | 2011 |
AR | 1 | 486 | 667 | 420 | 162 | 29 | 0 | 1765 |
AZ | 1 | 146 | 71 | 16 | 3 | 0 | 0 | 237 |
CA | 1 | 271 | 117 | 23 | 2 | 0 | 0 | 414 |
CO | 3 | 1322 | 563 | 112 | 22 | 1 | 0 | 2023 |
CT | 0 | 18 | 53 | 22 | 4 | 2 | 0 | 99 |
DC | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 2 |
DE | 0 | 22 | 26 | 12 | 1 | 0 | 0 | 61 |
FL | 2 | 1938 | 912 | 319 | 37 | 3 | 0 | 3211 |
GA | 0 | 413 | 700 | 309 | 74 | 11 | 0 | 1507 |
Putting it all together
Now that you've seen the pieces, we can put them together to perform a more complex data analysis. Let's say we want to know how frequently tornadoes occur in the summer. Here's one way to approach that:
Let’s start by getting only those tornadoes that occurred in the summer.
def ts = Underdog.df().read_csv("src/test/resources/data/tornadoes_1950-2014.csv")
// adding some series to the dataframe to make filtering easier
ts['month'] = ts['Date'](LocalDate, String) { it.format("MMMM") }
ts['dayOfMonth'] = ts['Date'](LocalDate, Integer) { it.format("dd").toInteger() }
// filtering
def summer = ts[
(ts['month'] == 'June' & ts['dayOfMonth'] > 21) | // after June the 21st or...
(ts['month'] in ['July', 'August']) | // in July or August or...
(ts['month'] == 'September' & ts['dayOfMonth'] < 22) // before September the 22nd
]
Then to get the frequency:
- We calculate the difference in days between successive tornadoes.
- The lag() method creates a column where every value equals the previous value (the prior row) of the source column.
- Then we can simply get the difference in days between the two dates. DateColumn has a method daysUntil() that does this. It returns a NumberColumn that we’ll call “delta”.
// sorting by Date and Time series
summer = summer.sort_values(by: ['Date', 'Time'])
// creating a series with lagged dates
summer['Lagged'] = summer['Date'].lag(1)
// creating a series with delta days between lagged dates and summer dates
summer['Delta'] = summer['Lagged'] - summer['Date']
Now we simply calculate the mean of the delta column. Splitting on year keeps us from inadvertently including the time between the last tornado of one summer and the first tornado of the next.
// creating year series to be able to group by it
summer['year'] = summer['Date'](LocalDate, String) { it.format("yyyy") }
// aggregating delta
def summary = summer.agg(Delta: ["mean", "count"]).by("year")
// print out summary
println(summary)
Printing summary gives us the answer by year.
tornadoes_1950-2014.csv summary
Date year | Mean [Date lag(1) - Date[DAYS]] | Count [Date lag(1) - Date[DAYS]] |
--------------------------------------------------------------------------------------
1950 | 2.0555555555555545 | 162 |
1951 | 1.7488584474885829 | 219 |
1952 | 1.8673469387755088 | 196 |
1953 | 0.983870967741935 | 372 |
1954 | 0.8617283950617302 | 405 |
To get a single number for the entire period, we can take the average of the annual means.
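A sketch, addressing the aggregated series by the name shown in the output above:
// average of the annual means for the whole period
def overallMean = summary['Mean [Date lag(1) - Date[DAYS]]'].mean()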
Saving your data
To save a DataFrame, you can write it as a CSV file:
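A minimal sketch, assuming a pandas-style to_csv() writer (check the API docs for the exact method name):
tornadoes.to_csv('rev_tornadoes_1950-2014.csv')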
And that’s it for the introduction. Please see the User Guide for more information.
DataFrame
Creation
The easiest way to create a DataFrame is by using the Underdog extension method Underdog.df(). Here we're creating an empty DataFrame:
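One possible sketch, reusing the from() factory described below with an empty map (an assumption, not a documented idiom):
def emptyFrame = Underdog.df().from([:], 'empty-dataframe')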
We can create a dataframe from a map whose entries represent series: each key is the name of a series and each value is a collection which becomes the content of that series.
// creating a map
def map = [
names: ["John", "Laura", "Ursula"],
ages: [22, 34, 83]
]
// creating a dataframe from the map
DataFrame map2DataFrame = Underdog.df().from(map, "people-dataframe")
The Underdog dataframe library adds extra methods to collection types so that you can convert collections to DataFrames. An example is invoking the toDataFrame(...) method on the map directly:
// creating a map
def map = [
names: ["John", "Laura", "Ursula"],
ages: [22, 34, 83]
]
// creating a dataframe from a map
DataFrame map2DataFrame = map.toDataFrame("people-dataframe")
You can also pass a list of maps to the Underdog.df().from(col, name) method. The method assumes all entries are maps with the same keys:
// creating a list of maps
def list = [
[name: "John", age: 22],
[name: "Laura", age: 34],
[name: "Ursula", age: 83]
]
// creating a dataframe from the list
DataFrame colOfMaps2DataFrame = Underdog.df().from(list, "people-dataframe")
There is also an extension method for collections: if your list complies with this structure, you can call the toDataFrame(name) method to create a DataFrame from that collection.
// creating a list of maps
def list = [
[name: "John", age: 22],
[name: "Laura", age: 34],
[name: "Ursula", age: 83]
]
// creating a dataframe from the list
DataFrame colOfMaps2DataFrame = list.toDataFrame("people-dataframe")
Filtering
In a dataframe you can filter data by any of its series.
Numbers
The following example creates a hypothetical population progression in ten years:
def df = [
years: (1991..2000),
population: (1..10).collect { 1000 * it }
].toDataFrame("population increase")
population increase
years | population |
------------------------
1991 | 1000 |
1992 | 2000 |
1993 | 3000 |
1994 | 4000 |
1995 | 5000 |
1996 | 6000 |
1997 | 7000 |
1998 | 8000 |
1999 | 9000 |
2000 | 10000 |
If we wanted to take the records after year 1995:
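// years strictly greater than 1995
def after1995 = df[df['years'] > 1995]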
population increase
years | population |
------------------------
1996 | 6000 |
1997 | 7000 |
1998 | 8000 |
1999 | 9000 |
2000 | 10000 |
Or getting records with population less than 4000:
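// population strictly less than 4000
def lessThan4000 = df[df['population'] < 4000]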
population increase
years | population |
------------------------
1991 | 1000 |
1992 | 2000 |
1993 | 3000 |
String
Of course we can filter by strings. Below we've got a dataframe with some employee data:
def df = [
employees: ['Udo', 'John', 'Albert', 'Ronda'],
department: ['sales', 'it', 'sales', 'it'],
payroll: [10_000, 12_000, 11_000, 13_000]
].toDataFrame("employees")
employees
employees | department | payroll |
----------------------------------------
Udo | sales | 10000 |
John | it | 12000 |
Albert | sales | 11000 |
Ronda | it | 13000 |
Getting employees from sales department:
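// employees whose department is 'sales'
def sales = df[df['department'] == 'sales']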
employees
employees | department | payroll |
----------------------------------------
Udo | sales | 10000 |
Albert | sales | 11000 |
You can also filter by a list of possible choices using the in operator:
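// employees whose name is contained in the list
def someEmployees = df[df['employees'] in ['Udo', 'Ronda']]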
employees
employees | department | payroll |
----------------------------------------
Udo | sales | 10000 |
Ronda | it | 13000 |
You can even filter by a regular expression. Let's look for employees with an 'o' in their name:
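// names matching the regular expression, i.e. containing an 'o'
def withAnO = df[df['employees'] ==~ /.*o.*/]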
employees
employees | department | payroll |
----------------------------------------
Udo | sales | 10000 |
John | it | 12000 |
Ronda | it | 13000 |
Dates
Of course, in time series it is crucial to be able to search by time frame.
import java.time.LocalDate

// Using a given date as the beginning of our df dates series
def initialDate = LocalDate.parse('01/01/2000', 'dd/MM/yyyy')
// a dataframe containing the simulation of bicycles rented through 2000
def df = [
dates: (1..365).collect(initialDate::plusDays),
rented: (1..365).collect { new Random().nextInt(200) }
].toDataFrame("rented bicycles 2000")
rented bicycles 2000
dates | rented |
-------------------------
2000-01-02 | 41 |
2000-01-03 | 47 |
2000-01-04 | 27 |
2000-01-05 | 95 |
2000-01-06 | 30 |
2000-01-07 | 162 |
2000-01-08 | 52 |
2000-01-09 | 197 |
2000-01-10 | 125 |
2000-01-11 | 15 |
... | ... |
What if we'd like to get only those records of December 2000?
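One way to express it, combining two LocalDate bounds (see the supported operators summary below):
def december = df[
    df['dates'] >= LocalDate.parse('01/12/2000', 'dd/MM/yyyy') &
    df['dates'] <= LocalDate.parse('31/12/2000', 'dd/MM/yyyy')
]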
rented bicycles 2000
dates | rented |
-------------------------
2000-12-01 | 104 |
2000-12-02 | 193 |
2000-12-03 | 107 |
2000-12-04 | 108 |
2000-12-05 | 193 |
2000-12-06 | 165 |
2000-12-07 | 82 |
2000-12-08 | 77 |
2000-12-09 | 176 |
2000-12-10 | 158 |
... | ... |
2000-12-31 | 150 |
& and |
You can combine any predicate with the operators & and |. For example, let's filter by all years after 1995 AND entries with population less than or equal to 8000:
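// both conditions must hold
def result = df[df['years'] > 1995 & df['population'] <= 8000]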
Now let's filter by years less than or equal to 1995 OR entries with population greater than 9000:
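// either condition may hold
def other = df[df['years'] <= 1995 | df['population'] > 9000]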
Summary
Here you have the tables with the supported operators:
Arithmetic
Left | Right | Operator | Example | Status |
---|---|---|---|---|
Series | Series | + | df['a'] + df['b'] | Yes |
Series | Series | - | df['a'] - df['b'] | Yes |
Series | Object | + | df['a'] + 1 | Yes |
Series | Object | - | df['a'] - 1 | Yes |
Series | Object | * | df['a'] * 2 | Yes |
Series | Object | / | df['a'] / 2 | Yes |
Filtering operators
Type | Operator | Example | Status |
---|---|---|---|
String | == | df['a'] == 'x' | Yes |
String | != | df['a'] != 'x' | Yes |
String | ==~ | df['a'] ==~ /.*/ | Yes |
String | in | df['a'] in ['x'] | Yes |
Number | == | df['a'] == 1 | Yes |
Number | != | df['a'] != 1 | Yes |
Number | > | df['a'] > 1 | Yes |
Number | >= | df['a'] >= 1 | Yes |
Number | < | df['a'] < 1 | Yes |
Number | <= | df['a'] <= 1 | Yes |
LocalDate | > | df['a'] > date | Yes |
LocalDate | >= | df['a'] >= date | Yes |
LocalDate | < | df['a'] < date | Yes |
LocalDate | <= | df['a'] <= date | Yes |
Sorting
TODO
Mapping
TODO
Import / Export
CSV
Reading
You can read csv files via the Underdog.df().read_csv(...) method. Here we are importing a csv file containing tornado incidents in the USA:
DataFrame dataframe = Underdog.df().read_csv("src/test/resources/data/tornadoes_1950-2014.csv")
tornadoes_1950-2014.csv
Date | Time | State | State No | Scale | Injuries | ... |
--------------------------------------------------------------------------------
1950-01-03 | 11:00:00 | MO | 1 | 3 | 3 | ... |
1950-01-03 | 11:00:00 | MO | 1 | 3 | 3 | |
1950-01-03 | 11:10:00 | IL | 1 | 3 | 0 | |
1950-01-03 | 11:55:00 | IL | 2 | 3 | 3 | |
1950-01-03 | 16:00:00 | OH | 1 | 1 | 1 | |
1950-01-13 | 05:25:00 | AR | 1 | 3 | 1 | |
1950-01-25 | 19:30:00 | MO | 2 | 2 | 5 | |
1950-01-25 | 21:00:00 | IL | 3 | 2 | 0 | |
1950-01-26 | 18:00:00 | TX | 1 | 2 | 2 | |
1950-02-11 | 13:10:00 | TX | 2 | 2 | 0 | |
... | ... | ... | ... | ... | ... | ... |
Separator
By default the csv reader assumes the csv file uses a comma (,) as the separator character, but you can provide a custom separator. For example, the following csv file content:
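For instance, a semicolon-separated file (illustrative content):
name;age
John;22
Laura;34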
Can be read by using the sep argument:
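// filePath points to the semicolon-separated file above
def df = Underdog.df().read_csv(filePath, sep: ';')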
Duplicated names
Sometimes you can find a csv where column names are repeated. By default, if you don't explicitly allow repeated columns, the import process will fail. Imagine we've got the following csv:
bronze,silver,gold,summer_total,bronze,silver,gold,winter_total
1,2,1,4,1,1,1,3
To allow repeated columns you should set the allowDuplicatedNames flag to true.
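// allowing repeated column names
def df = Underdog.df().read_csv(filePath, allowDuplicatedNames: true)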
Then all repeated names will be suffixed, in order, with a number to avoid collisions:
io_repeated_cols.csv
bronze | silver | gold | summer_total | bronze-2 | silver-2 | gold-2 | winter_total|
-----------------------------------------------------------------------------------------------
1 | 2 | 1 | 4 | 1 | 1 | 1 | 3 |
Missing values
If a csv file contains values which should be treated as missing values, we can pass this information when reading the csv file.
from,to,id
N/C,Madrid,123
Madrid,Paris,124
Paris,London,125
London,NONE,126
Here we're treating the values N/C and NONE as missing data:
def dataframe = Underdog.df().read_csv(filePath, nanValues: ['NONE', 'N/C'])
That will inform the reader to consider cells containing those values as missing values:
io_custom_missing_data.csv
from | to | id |
-----------------------------
| Madrid | 123 |
Madrid | Paris | 124 |
Paris | London | 125 |
London | | 126 |
Date format
If your csv files have a custom date format you can provide the date pattern as a parameter. Here we have a file with a custom format:
Date,Close
2014-12-05 00:00:00+00:00,0.5267500281333923
2014-12-08 00:00:00+00:00,0.5199999809265137
2014-12-09 00:00:00+00:00,0.5182499885559082
Passing the pattern as parameter:
def dataframe = Underdog.df().read_csv(filePath, dateFormat: "yyyy-MM-dd HH:mm:ss+00:00")
Gives the following output:
io_custom_date_format.csv
Date | Close |
-------------------------------------
2014-12-05 | 0.5267500281333923 |
2014-12-08 | 0.5199999809265137 |
2014-12-09 | 0.5182499885559082 |
Skip rows/footer
If you're sure that there is data you'd like to avoid parsing, like nonsense data, you can skip parsing those rows. Check the following example:
(csv file with comments; the file content is not shown here)
There are lines we don't want to consider when creating our dataframe:
- comments in the beginning of the file (lines 1-3)
- comments in the end of the file (line 15)
- rows we don't want to parse because they don't add any meaningful information (lines 4-8 and 11-14)
To avoid parsing any of these lines we can instruct the csv reader to skip lines in the header and/or in the footer of the file:
def dataframe = Underdog.df()
.read_csv(filePath,
header: false, // not using first row as header
skipRows: 8, // skipping rows at the beginning of the file
skipFooter: 4 // skipping rows at the end of the file
).renameSeries(columns: ['city', 'id']) // renaming series names with the list passed as parameter
Max chars per column
You can instruct the csv reader to refuse columns with more than a given number of characters.
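A sketch; maxCharsPerColumn is an assumed parameter name here, check the API docs for the actual one:
def df = Underdog.df().read_csv(filePath, maxCharsPerColumn: 4096)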
Warning
If a column value exceeds that number of characters, the process will throw an exception.
Max cols
You can instruct the csv reader to avoid parsing more than a given number of columns.
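A sketch; maxNumberOfColumns is an assumed parameter name here, check the API docs for the actual one:
def df = Underdog.df().read_csv(filePath, maxNumberOfColumns: 20)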
Warning
If the number of columns exceeds the number specified, the process will throw an exception.
Series
Intro
A Series object represents a named one-dimensional array. It supports arithmetic operations, statistical methods, and operations to deal with missing values. You can create a Series object from different sources:
// from a range of numbers
def numbers = (1..4).toSeries("numbers")
// from a range of letters
def letters = ('A'..'C').toSeries("letters")
// from a list
def stuff = [1, 2, null, 3, 4].toSeries("stuff")
You can use operator symbols to apply simple operations over the Series object:
// multiplying a series by a number
def doubleSeries = numbers * 2
// multiplying a series by another series
def rowProduct = numbers * stuff.dropna()
// dividing a series
def halves = stuff / 2
// using custom transformation to create a new series
def custom = letters(String, String) { "letter-$it".toString() }
Sometimes you may want to analyze a given Series object by using statistical methods:
def mean = doubleSeries.mean()
def max = doubleSeries.max()
def min = doubleSeries.min()
def avg = doubleSeries.avg()
You can find all available statistical methods in the underdog.impl.extensions.SeriesStatsExtensions class.
Creating
Series are meant to be created from collections or as a transformation from another Series.
The only way to create a Series from a collection is by invoking the toSeries() extension method on a list:
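// creating a series from a plain list
def numbers = [1, 2, 3, 4].toSeries('numbers')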
Most of the time we will be creating Series inside the scope of a DataFrame: sometimes as the result of transforming another series, sometimes because we would like to fill a series with a constant value.
Let's say we have a DataFrame with some series:
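A sketch matching the output below:
def df = [numbers: (1..10)].toDataFrame('numbers')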
And we want to create a new series named by_two with the result of multiplying by two all numbers in the numbers series:
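// multiplying a series by a scalar produces a new series
df['by_two'] = df['numbers'] * 2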
numbers
numbers | by_two |
----------------------
1 | 2 |
2 | 4 |
3 | 6 |
4 | 8 |
5 | 10 |
6 | 12 |
7 | 14 |
8 | 16 |
9 | 18 |
10 | 20 |
You can also create a new Series inside a dataframe filling all rows with the same value:
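A sketch, assuming scalar assignment broadcasts the value to every row:
df['one'] = 1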
numbers
numbers | by_two | one |
------------------------------
1 | 2 | 1 |
2 | 4 | 1 |
3 | 6 | 1 |
4 | 8 | 1 |
5 | 10 | 1 |
6 | 12 | 1 |
7 | 14 | 1 |
8 | 16 | 1 |
9 | 18 | 1 |
10 | 20 | 1 |
Statistics
TODO