1. Introduction
Tablesaw combines tools for working with tables and columns with the ability to create statistical models and visualizations. In other words, it’s a data frame, with added features.
1.1. History
This project was born as a fork of the original Tablesaw project. You can find the original repository at https://github.com/jtablesaw/tablesaw. For the moment it can still be used as a drop-in replacement for the original, as it keeps the same package naming.
1.2. What’s a dataframe?
A dataframe is an in-memory, tabular data structure in which each column holds a single datatype, while rows can contain a variety of types. Tablesaw provides these operations:
- Importing and exporting data from text files and databases
- Creating new columns by applying functions to existing ones (mapping)
- Summarizing columns or tables (reducing)
- Calculating descriptive statistics
1.3. Support for visualization
- Plotting data
1.4. Looking ahead
In the rest of this User Guide we discuss each category of dataframe operation, as well as the visualization capabilities Tablesaw offers. If you haven’t already tried it, we strongly recommend the Getting Started section as the best way to see how it all fits together.
2. Getting started
Java is a great language, but it wasn’t designed for data analysis. Tablesaw makes it easy to do data analysis in Java.
This tutorial will help you get up and running, and introduce some of Tablesaw’s basic functionality.
2.1. Setup
First, Tablesaw requires Java 11 or newer. Second, you need to add the dependency to your build file. It’s available on Maven Central. For Maven:
<dependencies>
<dependency>
<groupId>com.github.grooviter</groupId>
<artifactId>tablesaw-core</artifactId>
<version>VERSION_HERE</version>
</dependency>
</dependencies>
For Gradle:
dependencies {
    implementation "com.github.grooviter:tablesaw:VERSION_HERE"
}
That’s it for setup. On to design. As you would expect, Tablesaw is all about tables, and tables are made of columns. We’ll start with columns.
2.2. Columns
A column is a named, one-dimensional collection of data. It may or may not be part of a table. All data in a column must be of the same type.
Tablesaw supports columns for Strings, floats, doubles, ints, shorts, longs, booleans, LocalDates, LocalTimes, Instants, and LocalDateTimes. The date and time columns are compatible with the java.time classes introduced in Java 8.
To create a column you can use one of its static create() methods:
double[] numbers = {1, 2, 3, 4};
DoubleColumn nc = DoubleColumn.create("nc", numbers);
System.out.println(nc.print());
which produces:
Column: nc
1
2
3
4
Each column has an associated 0-based index. To get an individual value call get() with the index.
double three = nc.get(2);
which returns 3.0.
Array Operations
Tablesaw makes columns easy to work with. Operations that work on numbers in standard Java, for example, often work on columns of numbers in Tablesaw. To multiply every value in a column by 4, we use the multiply() method, which returns a new column like the original.
DoubleColumn nc2 = nc.multiply(4);
System.out.println(nc2.print());
producing:
Column: nc * 4.0
4
8
12
16
As you can see, the values are 4x the values in the original. The new column’s name is made by combining the name of the original column ("nc") and the operation (* 4). You can change it if you like using setName(aString).
There are so many columnar operations in Tablesaw that, as a general rule, if you find yourself writing a for loop to process a column or table, you may be missing something.
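For example, instead of looping over two arrays to add values pairwise, you can add one column to another directly (a minimal sketch):
DoubleColumn a = DoubleColumn.create("a", new double[] {1, 2, 3, 4});
DoubleColumn b = DoubleColumn.create("b", new double[] {10, 20, 30, 40});
DoubleColumn sums = a.add(b); // element-wise addition, no loop required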
2.3. Selections
Before going on to tables, we should talk about selections. Selections are used to filter both tables and columns.
Often they work behind the scenes, but you can use them directly. For example, consider our DoubleColumn containing the values {1, 2, 3, 4}. You can filter that column by sending it a message. For example:
nc.isLessThan(3);
This operation returns a Selection. Logically, it’s a bitmap of the same size as the original column. The method above, effectively, returns 1, 1, 0, 0, since the first two values in the column are less than three, and the last two are not.
What you probably wanted was not a Selection object, but a new DoubleColumn that contains only the values that passed the filter. To get this, you use the where(aSelection) method to apply the selection:
DoubleColumn filtered = nc.where(nc.isLessThan(3));
System.out.println(filtered.print());
Column: nc
1
2
Doing this in two steps provides many benefits. For one, it lets us combine filters. For example:
DoubleColumn filteredPositive = nc.where(nc.isLessThan(3).and(nc.isPositive()));
If the methods returned columns directly, they couldn’t be combined this way. It also lets us use the same method for filtering both tables and columns, as you’ll see below.
2.4. Selecting by index
These examples show how to select using predicates. You can also use a selection to retrieve the value at a specific index, or indexes. Both of the following are supported:
nc.where(Selection.with(0, 2)); // returns 2 rows with the given indexes
nc.where(Selection.withRange(1, 3)); // returns rows 1 inclusive to 3 exclusive
If you have several columns of the same length as you would in a table of data, you can make a selection with one column and use it to filter another:
StringColumn sc = StringColumn.create("sc", new String[] {"foo", "bar", "baz", "foobar"});
DoubleColumn result = nc.where(sc.startsWith("foo"));
Note the methods startsWith(aString), isLessThan(aNumber), and isPositive(). These are predefined for your use. There are many such methods that can be used in building queries.
For StringColumn, they’re defined in the tech.tablesaw.columns.strings.StringFilters interface. It also includes endsWith(), isEmpty(), isAlpha(), containsString()[^1], etc. Each column has a similar set of filter operations. They can all be found in the filter interfaces located in sub-folders of tech.tablesaw.columns (e.g. tech.tablesaw.columns.dates.DateFilters).
2.5. Map functions
Map functions are methods defined on columns that return new Columns as their result. You’ve already seen one: the multiply(aNumber) method above is a map function with a scalar argument. To multiply the values in two columns element-wise, use multiply(aNumberColumn):
DoubleColumn other = DoubleColumn.create("other", new Double[] {10.0, 20.0, 30.0, 40.0});
DoubleColumn newColumn = nc2.multiply(other);
System.out.println(newColumn.print());
Column: nc * 4.0 * other
40
160
360
640
Each value in column nc2 is multiplied by the corresponding value in other, rather than by a single scalar value as in the earlier example.
There are many map functions built-in for the various column types. Here are some examples for StringColumn:
StringColumn s = StringColumn.create("sc", new String[] {"foo", "bar", "baz", "foobarbaz"});
StringColumn s2 = s.copy();
s2 = s2.replaceFirst("foo", "bar");
s2 = s2.upperCase();
s2 = s2.padEnd(5, 'x'); // pad each string to a minimum length of 5 with 'x' characters
s2 = s2.substring(1, 5);
// this returns a measure of the similarity (Levenshtein distance) between two columns
DoubleColumn distance = s.distance(s2);
As you can see, for many String methods that return a new String, StringColumn provides an equivalent map method that returns a new StringColumn. It also includes other helpful methods found in Guava’s String library and in the Apache Commons String library.
Every column type has a set of map operations like multiply(aNumber). For StringColumn, these methods are defined in the tech.tablesaw.columns.strings.StringMapFunctions interface. It includes many methods beyond those shown above. Methods for all column types can be found in the map-function interfaces located in the sub-folders of tech.tablesaw.columns (e.g. tech.tablesaw.columns.dates.DateMapFunctions, which provides date methods like plusDays(anInt), year(), and month()).
2.6. Reduce functions
Sometimes you want to derive a single value that summarizes, in some sense, the data in a column. Aggregate functions do just that. Each such function scans all the values in a column and returns a single scalar value as a result. All columns support some aggregate functions: min() and max(), for example, plus count(), countUnique(), and countMissing(). Some also support type-specific functions. BooleanColumn, for example, supports all(), which returns true if all of the values in the column are true. The functions any() and none() return true if any or none of the values in the column are true, respectively. The functions countTrue() and countFalse() are also available.
NumberColumn has many more aggregate functions. For example, to calculate the standard deviation of the values in a column, you would call:
double stdDev = nc.standardDeviation();
NumberColumn supports many aggregation functions, including many of the most useful. Among those available are sum, count, mean, median, percentile(n), range, variance, sumOfLogs, and so on. These are defined in the NumericColumn class.
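For example, using the nc column from earlier (a minimal sketch):
double total = nc.sum(); // 10.0
double avg = nc.mean(); // 2.5
double med = nc.median(); // 2.5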
When we discuss tables below, we’ll show how to calculate sub-totals in one or more numeric columns by the values in one or more grouping columns.
2.7. Tables
A table is a named collection of columns. All columns in the table must have the same number of elements, although missing values are allowed. A table can contain any combination of column types.
2.7.1. Creating Tables
You can create a table in code. Here we create a table and add two new columns to it:
String[] animals = {"bear", "cat", "giraffe"};
double[] cuteness = {90.1, 84.3, 99.7};
Table cuteAnimals =
Table.create("Cute Animals")
.addColumns(
StringColumn.create("Animal types", animals),
DoubleColumn.create("rating", cuteness));
2.7.2. Importing data
More frequently, you will load a table from a CSV or other delimited text file.
Table bushTable = Table.read().csv("../../data/bush.csv");
Tablesaw does a pretty good job at guessing the column types for many data sets, but you can specify them if it guesses wrong, or to improve performance. Numerous other options are available, such as specifying whether or not there’s a header, using a non-standard delimiter, supplying a custom missing value indicator, and so on.
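For example, here’s a sketch of supplying such options via CsvReadOptions; the particular settings shown (tab separator, header flag, missing-value marker) are illustrative:
import tech.tablesaw.io.csv.CsvReadOptions;

CsvReadOptions options =
    CsvReadOptions.builder("../../data/bush.csv")
        .separator('\t') // non-standard delimiter
        .header(true) // the file has a header row
        .missingValueIndicator("-") // custom missing-value marker
        .build();
Table t = Table.read().usingOptions(options);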
Getting data loaded is sometimes the hardest part of data analysis. Advanced options for loading data are described in the importing / exporting section. That section also shows how you can read data from a database, a stream, or an HTML table. The stream interface lets you read data from a web site or an S3 bucket.
2.7.3. Exploring Tables
Because Tablesaw excels at manipulating tables, we use them whenever we can. When you ask Tablesaw for the structure of a table, the answer comes back in the form of another table where one column contains the column names, etc. The methods structure(), shape(), first(n), and last(n) can help you get to know a new data set. Here are some examples.
System.out.println(bushTable.structure());
Structure of bush.csv
Index | Column Name | Column Type |
-----------------------------------------
0 | date | LOCAL_DATE |
1 | approval | INTEGER |
2 | who | STRING |
System.out.println(bushTable.shape());
323 rows X 3 cols
System.out.println(bushTable.first(3));
bush.csv
date | approval | who |
-----------------------------------
2004-02-04 | 53 | fox |
2004-01-21 | 53 | fox |
2004-01-07 | 58 | fox |
System.out.println(bushTable.last(3));
bush.csv
date | approval | who |
-------------------------------------
2001-03-27 | 52 | zogby |
2001-02-27 | 53 | zogby |
2001-02-09 | 57 | zogby |
Table’s toString() method returns a String representation like those shown above. It returns a limited number of rows by default, but you can also use table.printAll(), or table.print(n) to get the output you want.
Of course, this is just the beginning of exploratory data analysis. You can also use numeric and visual tools to explore your data. These facilities are described in the documentation on statistics and plotting.
2.7.4. Working with columns
Often you’ll work with specific columns in a table. Here are some useful methods:
List<String> columnNames = table.columnNames(); // returns all column names
List<Column<?>> columns = table.columns(); // returns all the columns in the table
// removing columns
table.removeColumns("Foo"); // keep everything but "foo"
table.retainColumns("Foo", "Bar"); // only keep foo and bar
table.removeColumnsWithMissingValues();
// adding columns
table.addColumns(column1, column2, column3);
In Tablesaw, column names are case-insensitive. You get the same column if you ask for any of these:
table.column("FOO");
table.column("foo");
table.column("foO");
Remembering column names is enough of a burden without having to remember exactly which characters are capitalized.
Getting specific column types from a table
Columns can be retrieved from tables by name or position. The simplest method, column(), returns an object of type Column. This may be good enough, but often you want a column of a specific type. For example, you would need to cast the returned value to a NumberColumn to use its values in a scatter plot.
table.column("Foo"); // returns the column named 'Foo' if it's in the table.
// or
table.column(0); // returns the first column
When a variable type is "Column" it only provides methods that are available on all columns. You can’t perform math or do a string replace directly on a Column type. If you need a StringColumn you could cast the column, for example:
StringColumn sc = (StringColumn) table.column(0);
Table also supports methods that return columns of the desired type directly:
StringColumn strings = table.stringColumn(0);
DateColumn dates = table.dateColumn("start date");
DoubleColumn doubles = table.doubleColumn("doubles");
You may want a specific kind of column to work with. Either use the standard column() method and cast the result, or use one of the type-specific methods (like numberColumn()) that handle the cast for you.
2.7.5. Working with rows
As with columns, many options exist for working with tables in row-wise fashion. Here are some useful ones:
Table result = table.dropDuplicateRows();
result = table.dropRowsWithMissingValues();
// drop rows using Selections
result = table.dropWhere(table.numberColumn(0).isLessThan(100));
// add rows
destinationTable.addRow(43, sourceTable); // adds row 43 from sourceTable to the receiver
// sampling
table.sampleN(200); // select 200 rows at random from table
You can also perform arbitrary operations on each row in the table. One way is to just iterate over the rows and work with each column individually.
for (Row row : table) {
System.out.println("On " + row.getDate("date") + ": " + row.getDouble("approval"));
}
Another approach lets you skip the iteration and just provide a Consumer for each row.
table.stream()
.forEach(
row -> {
System.out.println("On " + row.getDate("date") + ": " + row.getDouble("approval"));
});
If you need to process more than one row at a time, there are several methods to help.
// Consumer prints out the max of a window.
Consumer<Row[]> consumer =
rows ->
System.out.println(Arrays.stream(rows).mapToDouble(row -> row.getDouble(0)).max());
// Streams over rolling sets of rows. I.e. 0 to n-1, 1 to n, 2 to n+1, etc.
table.rollingStream(3).forEach(consumer);
// Streams over stepped sets of rows. I.e. 0 to n-1, n to 2n-1, 2n to 3n-1, etc. Only returns
// full sets of rows.
table.steppingStream(5).forEach(consumer);
See [Rows](https://jtablesaw.github.io/tablesaw/userguide/rows) for more information and other options.
2.7.6. Sorting
To sort a table, you can just use the sortOn() method and give it one or more column names:
Table sorted = table.sortOn("foo", "bar", "bam"); // Sorts Ascending by Default
sorted = table.sortAscendingOn("bar"); // just like sortOn(), but makes the order explicit.
sorted = table.sortDescendingOn("foo");
// sort on foo ascending, then bar descending. Note the minus sign preceding the name of
// column bar.
sorted = table.sortOn("foo", "-bar");
See Sorting for more information and other options.
2.7.7. Filtering
Query filters can be combined using the logical operations and, or, and not. These are implemented on the QuerySupport class.
import static tech.tablesaw.api.QuerySupport.and;
import static tech.tablesaw.api.QuerySupport.or;
import static tech.tablesaw.api.QuerySupport.not;
Each method accepts a function with the signature Function<Table, Selection>. Lambdas work nicely.
Table result =
table.where(
and(
or(
t -> t.doubleColumn("nc1").isGreaterThan(4),
t -> t.doubleColumn("nc1").isNegative()
),
not(t -> t.doubleColumn("nc2").isLessThanOrEqualTo(5))));
2.7.8. Summarizing
// import aggregate functions.
import static tech.tablesaw.aggregate.AggregateFunctions.*;
The usual way to calculate values is to use the summarize() method:
Table summary = table.summarize("sales", mean, sum, min, max).by("province", "status");
It’s important to recognize that the grouping column need not exist when summarize() is invoked. Any map function can be used in the by() statement to group on calculated values. A common use case is in handling dates. You can summarize sales by day-of-week, as follows:
summary = table.summarize("sales", mean, median)
.by(table.dateColumn("sales date").dayOfWeek());
which says "return the mean and median sales by day of week."
Tables are usually split based on columns, but the columns can be calculated on the fly.
See the documentation on Summarizing data, and the classes in the aggregate package for more detail.
2.7.9. Crosstabs
If you’re only interested in how frequently observations appear in different categories, you can use cross-tabulations. In the example below we show table percents, but you can also get row and column percents and raw counts.
Table percents = table.xTabTablePercents("month", "who");
// make table print as percents with no decimals instead of the raw doubles it holds
percents.columnsOfType(ColumnType.DOUBLE)
.forEach(x -> ((DoubleColumn)x).setPrintFormatter(NumberColumnFormatter.percent(0)));
System.out.println(percents);
The formatted output is shown below.
Crosstab Table Proportions:
[labels] | fox | gallup | newsweek | time.cnn | upenn | zogby | total |
----------------------------------------------------------------------------------------
APRIL | 2% | 3% | 1% | 0% | 0% | 1% | 7% |
AUGUST | 1% | 2% | 1% | 0% | 0% | 1% | 5% |
DECEMBER | 1% | 3% | 1% | 1% | 1% | 2% | 8% |
FEBRUARY | 2% | 3% | 1% | 1% | 0% | 1% | 9% |
JANUARY | 2% | 4% | 2% | 1% | 2% | 2% | 13% |
JULY | 2% | 3% | 1% | 1% | 0% | 1% | 8% |
JUNE | 2% | 3% | 0% | 0% | 0% | 1% | 7% |
MARCH | 2% | 4% | 1% | 1% | 0% | 2% | 9% |
MAY | 1% | 3% | 2% | 1% | 0% | 0% | 7% |
NOVEMBER | 1% | 3% | 2% | 1% | 0% | 0% | 7% |
OCTOBER | 2% | 3% | 2% | 1% | 0% | 1% | 10% |
SEPTEMBER | 2% | 3% | 2% | 1% | 0% | 1% | 9% |
Total | 20% | 37% | 17% | 9% | 3% | 14% | 100% |
See the section on Crosstabs, and the JavaDocs for the CrossTab class.
2.8. Conclusion
We’ve covered a lot of ground. To learn more, please take a look at the rest of this guide and the Javadocs.
[^1]: Note that containsString(String subString) is different from contains(). The first method looks at each string in the column to see if it contains the substring. The second method looks at every row in the column and returns true if any row matches the entire string. In other words, contains works like contains as defined on List<String>.
3. Columns
3.1. Overview
Tablesaw is all about tables and tables are made of columns. You’ll often need to work with individual columns and Tablesaw provides a large collection of tools for that. We’ll cover the basics here.
Let’s start with a definition. A column is a named vector of data, all of a single type. Some elements may be missing, and it’s important to deal with those. We cover that later.
Here are the supported column types. All concrete column types are in the api package. For the details on each kind see the appropriate Javadoc files.
- BooleanColumn: holds true and false values
Textual Data
- StringColumn: for categorical values that appear multiple times in the column, such as "New York".
- TextColumn: for text values that are largely unique, such as free-form notes.
Numeric Data
- NumberColumn: an interface for numeric data types.
- ShortColumn: for small integral values.
- IntColumn: the best integral type for most cases.
- LongColumn: for large integral values.
- FloatColumn: for single-precision floating point values.
- DoubleColumn: usually the best floating-point type for most values.
Dates
- DateColumn: a "local date"; that is, a date without a timezone. April 10, 2018, for example.
- DateTimeColumn: a local date and time combined. April 10, 2018 at 9:07.
- TimeColumn: a local time, like 12:47:03.
- InstantColumn: a single point in time, without reference to time zones.
All mathematical operations return double values or instances of DoubleColumn. As you’d expect, it holds 8-byte floating point numbers.
We’ll begin by looking at the operations that are common to all column types.
3.1.1. Create a column
Columns are usually created by importing a data file. They can also be instantiated by calling one of the static create() methods defined on the appropriate class. For example, you can create an empty DateColumn as follows:
DateColumn column = DateColumn.create("test");
The new column has the name "test", and a ColumnType of LOCAL_DATE. Names are important. We often ask a table for a column by name. All the columns within a given table must have unique names. You can always get the name of a column by calling name(), and its type by calling type().
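For example, with the column created above (a minimal sketch):
String name = column.name(); // "test"
ColumnType type = column.type(); // ColumnType.LOCAL_DATE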
To create a column with data, you can initialize the column with an array:
double[] values = {1, 2, 3, 7, 9.44242, 11};
DoubleColumn column = DoubleColumn.create("my numbers", values);
Once you have a column, you can add it to a table using the addColumns() method on Table.
table.addColumns(column);
3.1.2. Add, edit, remove data
You can add data to columns as shown below, but if your column is part of a table, you must take care to ensure that each column has the same number of elements.
dateColumn.append(LocalDate.of(2016, 2, 28));
To change the value of an element in a column you can use the set(index, value) method. This will replace the existing value at the given position with the new value.
doubleColumn.set(4, 123.2);
Normally, you don’t remove data from a column in the usual sense. Removing elements from the middle of a column would cause problems if the column is part of a table. However, if you do want to get rid of some elements you have two choices. The first is to set the value to missing, as shown below.
doubleColumn.setMissing(4);
Your other option is to create a new column without the offending data elements. This is done with filters as described below.
3.1.3. Other common operations
Columns do all the things you expect; here’s an incomplete list of standard operations:
name() // returns the name of the column
type() // returns the ColumnType, e.g. LOCAL_DATE
size() // returns the number of elements
isEmpty() // returns true if column has no data; false otherwise
first(n) and last(n) // returns the first and last n elements
max() and min() // returns the largest and smallest elements
top(n) and bottom(n) // returns the n largest and smallest elements
print() // returns a String representation of the column
copy() // returns a deep copy of the column
emptyCopy() // returns a column of the same type and name, but no data
unique() // returns a column of only the unique values
countUnique() // returns the number of unique values
asSet() // returns the unique values as a java Set
summary() // returns a type specific summary of the data
void sortAscending() // sorts the column in ascending order
void sortDescending() // sorts the column in descending order
append(value) // appends a single value to the column
appendCell(string) // converts the string to the correct type and appends the result
append(otherColumn) // Appends the data in other column to this one
removeMissing() // returns a column with all missing values removed
These operations are available on nearly all column types. Each operates on an entire column.
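A few of these in action (a minimal sketch):
DoubleColumn numbers = DoubleColumn.create("numbers", new double[] {3, 1, 2, 2});
System.out.println(numbers.countUnique()); // 3
System.out.println(numbers.top(2).print()); // the two largest values: 3, 2
System.out.println(numbers.unique().print()); // the three distinct values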
To operate on the values of a column, you have two choices. You can work with individual values, or use column-wise operations to work with all the values in a column in the same way. To work with individual values, you can just iterate over the column:
DateColumn weekLater = DateColumn.create("Week Later");
for (LocalDate date: dates) {
weekLater.append(date.plusDays(7));
}
Just about anything you can do with an individual LocalDate you can do with an entire DateColumn, using column-wise operations. For example, the above loop could be written as:
DateColumn weekLater = dates.plusDays(7);
This is an example of a mapping function. You can find the date mapping functions in the interface DateMapFunctions. Many of the methods there deal with adding and subtracting units of time (days, weeks, months, etc), and calculating the column-wise differences between two date columns. Others provide access to elements of a date. The method month(), for example, returns a StringColumn containing the month for a given date. The methods year(), dayOfWeek(), dayOfMonth(), etc. function similarly.
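For example (a sketch, assuming a DateColumn named dates; per the text above, month() returns a StringColumn, and year() returns a numeric column):
StringColumn months = dates.month(); // month names, e.g. "JANUARY"
IntColumn years = dates.year();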
Other columns have similar mapping functions.
3.1.4. Filtering
You can filter two ways. The first is with built-in predicates, like isMonday(). See the end of this section for a full list of the built-in predicates for DateColumn.
Writing Predicates
You can write a Predicate class to filter a date column using where(Predicate<LocalDate>). For example, if you want all the leap days in a column, you could create this Java 8 predicate:
Predicate<LocalDate> leapDays = new Predicate<LocalDate>() {
  int dayOfMonth = 29;
  int monthValue = 2;
  @Override
  public boolean test(LocalDate i) {
    return i.getDayOfMonth() == dayOfMonth && i.getMonthValue() == monthValue;
  }
};
which you can use as:
DateColumn filtered = dates.where(dates.eval(leapDays));
In the line above, the call to dates.eval(aPredicate) returns a Selection object holding the position of every element in the column that passes the predicate’s test() method. The surrounding call to where(aSelection), applies that selection to the column and returns a new column with all the passing values.
Built-in Date Predicates
There are numerous built-in date predicates. For example:
DateColumn filtered = dates.isMonday();
DateColumn filtered = dates.isInQ2();
DateColumn filtered = dates.isLastDayOfTheMonth();
Perhaps not surprisingly, there are a number that find specific dates or date ranges:
LocalDate date1 = LocalDate.of(2016, 2, 20);
LocalDate date2 = LocalDate.of(2016, 4, 29);
DateColumn filtered = dates.isEqualTo(date1);
DateColumn filtered = dates.isAfter(date1);
DateColumn filtered = dates.isOnOrAfter(date1);
DateColumn filtered = dates.isBetweenIncluding(date1, date2);
The built-in method in this case is preferable as it has been optimized. But you can write your own if you need something not already provided. You can find a full list in the JavaDoc for DateColumn.
Conditionally editing data
The section on editing values above assumes you’ve identified the specific values you want to change. Often with large datasets, you know you want to change some values without knowing where they are, or even how many are in the dataset. The easiest way to perform a bulk update of values meeting some condition is with set(aSelection, aNewValue). Each column implements an appropriate variation of this method. DoubleColumn, for example, has a version that takes a double as the second argument, and StringColumn has a version that takes a string.
You can use a built-in filter method like those discussed above to provide the selection. Here’s one example:
doubleColumn.set(doubleColumn.isGreaterThan(100), 100);
This would set any value above 100 to equal 100 exactly. This approach can be very helpful for dealing with missing data, which you might want to set to an average value for example.
double avg = doubleColumn.mean();
doubleColumn.set(doubleColumn.isMissing(), avg);
When working with missing values, always test with the isMissing() method, rather than test using the column type’s MISSING_VALUE constant. For doubles, MISSING_VALUE returns Double.NaN, and since Double.NaN does not equal Double.NaN, a test like doubleValue == MISSING_VALUE will fail to detect missing values.
3.1.5. Formatting data
You can print data as individual values, columns or tables. The output format can be controlled by setting a type-specific formatter on a column. For example, to change how numbers are displayed you can call setPrintFormatter() on a NumberColumn, passing in a NumberColumnFormatter. Each formatter serves two functions, displaying true values and handling of missing ones. NumberColumnFormatter has several pre-configured options, including printing as currency or percents.
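For example, a minimal sketch using the percent formatter (the same one used in the crosstab example earlier):
DoubleColumn rates = DoubleColumn.create("rates", new double[] {0.125, 0.5, 0.875});
rates.setPrintFormatter(NumberColumnFormatter.percent(1)); // display as percents with 1 decimal place
System.out.println(rates.print()); // 12.5%, 50.0%, 87.5%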
See the table section documentation for how to add and remove columns.
3.2. Map functions
A map is a function that, when applied to one or more columns, produces a new Column with the same number of elements. One way to think about them is based on how many columns are involved in the function.
3.2.1. Unary
Unary map functions operate on a single column, the method receiver. For example:
StringColumn column = columnA.upperCase();
produces a new column that contains the values in columnA, but converted to upper case.
While unary mappers operate only on a single Column, they may have an additional parameter (or parameters) that are not Columns, for example:
StringColumn column = columnA.substring(startingPosition);
or
StringColumn column1 = columnA.startsWith("foo");
3.2.2. Binary
Binary mappers operate on two columns. In the example below, the result produced is a new column that contains the row-wise sums of the values in the receiver (columnB) and the parameter (columnC):
NumberColumn column = columnB.add(columnC);
3.2.4. Adding columns to table
The new Column is not added to the original column’s table by default. To add it, use the addColumns() method defined on Table:
StringColumn newColumn = table.column("Name").upperCase();
table.addColumns(newColumn);
3.2.5. Custom map functions
You can create functions that build new columns using doWithEach():
Table table = Table.read().csv("../../data/bush.csv");
DateColumn dc1 = table.dateColumn("date");
DateColumn dc2 = DateColumn.create("100 days later");
dc1.doWithEach(localDate -> dc2.append(localDate.plusDays(100)));
The method doWithEach(Consumer<T>) is defined on AbstractColumn. It takes a Consumer that accepts values of the same type as the column holds: LocalDates in the case of DateColumn. If the lambda syntax above is a bit too cryptic, here’s a version that does the same thing with the Consumer written out as an anonymous class:
DateColumn dc2 = DateColumn.create("100 days later");
dc1.doWithEach(new Consumer<LocalDate>() {
@Override
public void accept(LocalDate localDate) {
if (localDate == null) {
dc2.appendMissing();
} else {
dc2.append(localDate.plusDays(100));
}
}
});
The accept() method here calls plusDays(100) on each value in the column that receives the doWithEach() invocation, and appends each result to the column dc2, declared just before the method is called.
In writing your own map functions it’s good practice to handle missing data as we do above.
3.2.6. Columns are iterable
You can also use a for loop to create new columns based on the values of existing ones. All columns in Tablesaw are iterable. For example:
StringColumn season = StringColumn.create("Season");
for (LocalDate date : dateColumn) {
  if (date == null) {
    season.appendMissing();
  }
  else if (date.getMonth() == Month.MAY) { // requires java.time.Month
    season.append("Flower Season");
  }
  else {
    season.append("Not Flower Season");
  }
}
myTable.addColumns(season);
3.2.7. Map functions reference
String Mappers
Unary
upperCase()
lowerCase()
trim()
replaceAll(String regex, String replacementString)
replaceFirst(String regex, String replacement)
substring(int startPosition)
substring(int startPosition, int endPosition)
abbreviate()
padStart(int minimumLength, char paddingCharacter)
padEnd(int minimumLength, char paddingCharacter)
3.3. Table processing without loops
Tablesaw has long supported column-wise operations that let you process all the data in a column in a single operation, without an explicit loop. For example, they support
mapping:
StringColumn sc = myStringColumn.concat("foo");
reducing:
double mean = numberColumn.mean();
and filtering
NumberColumn filtered = numberColumn.where(numberColumn.isLessThan(4));
in this way. If you don’t find in the library a function that does exactly what you want, you can write one and use it as if it were built in.
With tables, the most common use of the analogous "table-wise" operations is for SQL-like operations. However, tables also support arbitrary operations on individual rows and on collections of contiguous rows.
3.3.1. Window functions
While the row-wise operations above apply an operation to every row in the table individually, it is often useful to work with sets of contiguous rows: calculating the difference in values between pairs of rows is a common example. These types of operations are called window operations. The size of the window refers to the number of rows it considers in evaluating the operation. Tablesaw supports two approaches: one for rolling windows, and one for stepping windows.
3.3.2. Rolling windows
Rolling window operations move through the table one row at a time. For example, a rolling window with size of two, first looks at row 0 and row 1. Then it evaluates rows 1 and 2, and then rows 2 and 3. Any given row may be considered n times, where n is the size of the window.
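For example, here’s a sketch that prints the difference between each pair of adjacent values, reusing the rollingStream() method shown in the Getting Started section and assuming the bush.csv table from earlier:
Consumer<Row[]> diff =
    rows -> System.out.println(rows[1].getDouble("approval") - rows[0].getDouble("approval"));
table.rollingStream(2).forEach(diff);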
3.4. Adding a new column type
Note: This is a work in progress.
It may be useful to add custom column types. Possible applications include columns for geographic coordinates, zoned date-times, characters, sequence data, email addresses, urls, distances (with units), specialized numeric types, lists, JSON, XML, or whatever objects are important in your application.
There are several steps to be performed. This document walks you through them.
- Add a class that implements ColumnType.
- Add String parsing support.
- Add a class that implements Column.
- Add functions for filtering, mapping, and summarizing.
- Extend Row.
- Handle SQL result sets.
- Misc.
3.4.1. ColumnType
To add a new ColumnType you implement the ColumnType interface. The simplest way is to subclass AbstractColumnType.
3.4.2. StringParser
StringParsers are used by the CsvReader to load data from CSV files. StringParser is an abstract class.
3.4.3. Update CsvReader
For automatic type detection to work with the new column type, it must be accessible to CsvReader, so you will probably need to update CsvReader’s type-detection logic for the new type.
3.4.4. Implementing Column
To add a new column class you must implement the Column interface. One way is to subclass AbstractColumn.
4. Tables
4.1. Overview
Tables are the primary data-type and general focus of Tablesaw. Here we’ll provide an overview of the operations they provide. Coverage of most of the topics below is introductory. They often will have their own section of the User Guide where they are examined more fully. For still more detail, see the JavaDoc for tech.tablesaw.api.Table.
4.1.1. Tables "all the way down"
Tablesaw has a huge number of methods for creating, querying, manipulating, displaying, and saving tables, so it makes sense that we use tables widely, and that many operations on tables return other tables. For example, when you ask a table to describe its structure, it returns a new table that contains the column names, types, and order.
4.1.2. Creating tables
You can create tables programmatically or by loading data from an external source.
Programmatically
Table t = Table.create("name")
It’s often convenient to add columns while you’re creating the table.
Table t = Table.create("name", column1, column2, column3...)
You can also add columns later.
Importing data
Tablesaw can load data from character-delimited text files (including CSV and tab-separated files), from streams, and from any data source that can create a JDBC result set. As this includes essentially all relational databases (and many non-relational ones), most of the world’s structured data can be loaded without a prior transformation. You can load a table from a CSV file by providing the file name.
Table t = Table.read().csv("myFile.csv");
This simple method supplies default values for a number of parameters like the type of the separator character (a comma). It also attempts to infer the types for each column. If the inferred types are incorrect, you can specify the types at import time. See [Importing data](https://jtablesaw.github.io/tablesaw/userguide/importing_data) for other options and more detail.
4.1.3. Displaying data
The simplest way to display a table is to call print() on it, which returns a formatted String representation.
aTable.print();
The default implementation of print displays the first ten and last ten records. To specifically control the output, the methods first(n) and last(n) are available. These return a copy of the table that contains only the first n or last n rows respectively.
aTable.first(3);
aTable.last(4);
Table overrides toString() to return print(). This makes for rather funky output in a debugger, but during analysis you want to look at the table data so frequently that the shortcut is worth the hassle it causes people programming Tablesaw.
4.1.4. Getting table metadata
There are a number of ways to get familiar with a new dataset. Here are some of the most useful.
table.name() returns its name, which defaults to the name of the file it was created from. You can change it if you like using setName(aString).
t.columnNames() returns a list of column-name strings.
t.structure() returns a list of columns with their position and types:
Structure of data/tornadoes_1950-2014.csv
Index | Column Name | Column Type |
-----------------------------------
    0 | The Date    | LOCAL_DATE  |
    1 | The Time    | LOCAL_TIME  |
    2 | State       | CATEGORY    |
    3 | State No    | INTEGER     |
    4 | Scale       | INTEGER     |
table.shape() returns the table’s size in rows and columns:
59945 rows X 10 cols
You can also get the rowCount() and columnCount() individually from a table.
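For example, with the tornado table above:
int nRows = t.rowCount(); // 59945
int nCols = t.columnCount(); // 10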
4.1.5. Adding / removing columns
You can add one or more columns to a Table using the addColumns() method:
t.addColumns(aColumn...)
You can also specify that the column be inserted at a particular place by providing an index:
t.addColumn(3, aColumn);
As usual in java, column numbering begins at 0, rather than 1.
The column you add must either be empty or have the same number of elements as the other columns in the table.
To remove a column or columns:
t.removeColumns(aColumn...)
t.removeColumns("columnName"...)
Columns can also be removed by referencing them by name. Alternately, you can specify just the columns to retain:
t.retainColumns(aColumn);
Again you can specify the columns either directly, or by using their names.
While addColumns() and removeColumns() update the receiver in place, you can also create a new table with a subset of the columns in the receiver. This can be done by specifying the names of the columns that you want to retain, in the order you want them to appear.
Table reduced = t.select("Name", "Age", "Height", "Weight");
You can also create a new table by specifying the columns in the current table that you don’t want, which might save some typing:
Table reduced = t.rejectColumns("Street Address");
In this case the columns in the result table are in the same order as in the original.
4.1.6. Selecting columns
Often you will want a reference to a column in the table. To get all the columns as a list:
t.columns();
Columns can also be selected by index or name:
t.columns("column1", "column2");
Often you want just one column, which you can get using t.column("columnName").
Since Tablesaw columns are typed, you often need to cast the returned column to something more specific. For example:
DoubleColumn dc = (DoubleColumn) t.column("columnName");
As a convenience, tables have column accessors that are type specific; they do the casting for you.
DoubleColumn dc = t.doubleColumn("columnName");
4.1.7. Combining Tables
Tables can be combined in one of several ways. The most basic is to append the rows of one table to another. This is only possible if the two tables have the same columns in the same order, but can be useful when, for example, you have the same data from two time periods.
Table result = t.append(t2);
You can concatenate two tables, adding the columns of one to the other by using the concat() method. The method returns the receiver rather than a new table. Two tables can be concatenated only if they have the same number of rows.
t.concat(t2)
4.1.10. Filter
One of the most useful operations is filtering. Queries are created by forming expressions that produce a Selection, which effectively turns the query result into an object that can be used to filter by index. For example, the code below:
Table result = t.where(t.stringColumn("Foo").startsWith("A"));
This produces a table containing every row in t where the value in the column named "Foo" starts with "A".
Filters are covered in detail in the section on [Filtering](https://jtablesaw.github.io/tablesaw/userguide/filters).
4.1.12. Summarize
The summarize() method and its variants let you specify the columns to summarize.
Table summary =
t.summarize("age", "weight", mean, median, range).apply();
Summarize returns a Summarizer object.
The apply() method sent to summary above returns the result of applying the function to the table, and combining the results into a new table. It computes one summary for the original table.
Groups
To calculate subtotals, we use by() instead of apply().
By takes a list of columns that are used to group the data. The example below calculates the average delay for each airport in the table.
Table result = t.summarize("delay", mean).by("airport");
Cross Tabs
Cross tabs (or cross-tabulations) are like groups, but return the data in a layout that facilitates interpretation. A cross tab in Tablesaw takes two grouping columns and returns the number of observations for each combination of the two columns. They can also produce the proportions, and subtotals by row or column.
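For example, a raw-count cross tab takes one call (a sketch following the xTabTablePercents() example in Getting Started):
Table counts = t.xTabCounts("month", "who");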
Cross Tabs are covered in detail in the section on [CrossTabs](https://jtablesaw.github.io/tablesaw/userguide/crosstabs).
4.1.13. Sort
Table can be sorted on any combination of columns, in any combination of ascending or descending order, or by supplying a comparator for complete flexibility. A simple example is shown below.
t.sortDescending("column1","column2");
Sorting is covered in detail in the section on Sorting
4.2. Filters
4.2.1. where()
Filters select a subset of the rows in a table. Given a filter, a table will (usually) return a table like itself, but having only the rows that pass the filter criteria.
The main methods for applying filters are:
Table t = table.where(aSelection);
Column x = column.where(aSelection);
which includes all rows for which the filter returns true. And
table.dropWhere(aSelection)
which excludes all rows for which the filter returns true.
As you can see, for any given selection dropWhere() returns the complement of the records returned by where().
There are several other methods like where() to explore. Before digging into table filters, though, we should look at Selections, and at the column filters that table filters build on.
Key point: one way that where() in Tablesaw differs from the where clause in SQL is that Tablesaw always returns records in the order they appear in the original column or table. This can be used to good advantage when working with time-series data.
4.2.2. Selections
Both columns and tables are filtered using selections. A selection is a logical bitmap (like a boolean array) with an entry for each element in the column. For any given element, if the corresponding bitmap entry is "true", the element is included in the result.
Here’s what you need to know about selections.
- Selections are like bitmaps, in which 'true' means 'include the corresponding value(s)' in the result.
- They’re usually applied to columns, but are used to filter tables as well.
- Columns have many built-in selection methods.
- Any operation that returns an appropriately-sized bitmap can be used.
- You can write your own selection methods.
- Selections are readily combined, using their and(), or(), and andNot() methods.
Let’s take a look at each of these.
Applying to columns
Imagine a student table with a column named "birth date" and that we want to find all the birth dates in the year 2011. It turns out that this is easy to do, because there is a built in method (isInYear(anInt)) for that.
DateColumn bd = student.dateColumn("birth date");
Selection bdYear = bd.isInYear(2011);
Let’s say that column bd contains these values:
Jan 22, 2011
April 14, 2010
March 9, 2011
August 4, 2010
The selection bdYear above contains 1, 0, 1, 0, since the first and third birth dates in the column are in year 2011 and the others are not.
To return a DateColumn containing birth dates in 2011, we could apply the selection to bd using the where() method, and passing the selection bdYear as an argument.
DateColumn bd2011 = bd.where(bdYear);
Generally, a filtered column, rather than a selection, is the result you actually want. You can, of course, inline the call:
DateColumn bd2011 = bd.where(bd.isInYear(2011));
This begs the question: why not just have isInYear() return a filtered column? There are several reasons. The next section covers the first.
Applying to tables
Because the DateColumn method isInYear(anInt) returns a Selection, rather than a new column, we can use the same method to filter either the column itself or the table that contains the column.
To filter the entire student table we simply apply the same selection to the table, again using the method where(aSelection).
Table studentsBornIn2011 = students.where(bdYear);
The only constraint is that the column used to create the Selection and the table where it’s applied must have the same number of rows.
Built-in methods
Much of Tablesaw’s power comes from the number and range of filters built into the library. When you need one, there’s often a method that does what you want. StringColumn, for example, has the methods startsWith(aString) and endsWith(aString), both returning a Selection.
StringColumn email = unfiltered.stringColumn("email");
filtered = unfiltered.where(email.endsWith("google.com"));
StringColumn has other methods as well, while both DateColumn and DateTimeColumn support isInJanuary(). It works as follows:
DateColumn januaries = dateColumn.where(dateColumn.isInJanuary());
In fact, the general approach to filtering tables rests on column filters, using the logical operators and(), or(), and andNot() to combine them into complex, multi-column queries. This is illustrated below.
Using and(), or(), andNot()
Selections are easy to combine to create more complex selections. You can, for example, get only the birth dates from January that were also on a Monday.
bd.isInJanuary().and(bd.isMonday())
or, all the birth dates in January, and all the Mondays:
bd.where(bd.isInJanuary().or(bd.isMonday()))
or, all the dates in January that were not Mondays:
bd.where(bd.isInJanuary().andNot(bd.isMonday()))
A list of the built-in filters is below.
Finally, you can combine these "where clauses" with methods that filter by index. For example:
Table t1 = t.where(Selection.withRange(100, 300).and(sc.startsWith("Foo")));
first selects the rows in the range 100 to 300, and then intersects that result with the query sc.startsWith("Foo").
4.2.3. Custom filters
To write a custom filter method for a column, you first create a predicate, and then pass it to an eval() method on your column. Here’s an example with number columns: we write a filter that selects only prime numbers.
// first we create a predicate
DoublePredicate isPrime = new DoublePredicate() {
  @Override
  public boolean test(double value) {
    // if it's not a whole number, it can't be prime
    if (!((value == Math.floor(value)) && !Double.isInfinite(value))) {
      return false;
    }
    int n = (int) value;
    if (n < 2) return false;
    if (n == 2) return true;
    if (n % 2 == 0) return false;
    // only odd factors need to be tested, up to sqrt(n)
    for (int i = 3; i * i <= n; i += 2) {
      if (n % i == 0) return false;
    }
    return true;
  }
};
// then use eval to return a selection
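Selection primes = doubleColumn.eval(isPrime); // a sketch; doubleColumn is an assumed DoubleColumn
// and apply the selection with where() to keep only the primes
DoubleColumn primeValues = doubleColumn.where(primes);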
4.2.4. Combining filters
You can combine filters to query a table on the values in multiple columns.
Table filtered = aTable.where(
aTable.stringColumn("Status").isEqualTo("Ok")
.and(aTable.numberColumn("Age").isGreaterThan(21)));
4.2.5. Filter by index
You can select rows by specifying the index (zero-based):
t.rows(i...)
You can also select by range:
t.inRange(start, end)
You can also select a random sample of data. See the section on Sampling for more detail.
4.2.6. Sampling
The line below returns a table containing 50 randomly sampled rows from table t.
Table sample = t.sampleN(50);
Alternately, you can specify the sample size as a proportion of the table size using sampleX(aDouble):
Table sample = t.sampleX(.40);
You can also divide the table in two, assigning rows randomly to each, and return both sub-tables in an array. The code below puts ~1/3 of the rows in results[0], and the other 2/3rds in results[1]. This is handy for separating data into training and test subsets for machine learning applications.
Table[] results = t.sampleSplit(.333);
4.2.7. Excluding columns
You may want to exclude some of the columns in the original from the new table. To do this, you could simply execute the queries as above, and then eliminate columns from the new table as a separate step:
filtered = aTable.where(aTable.stringColumn("Status").isEqualTo("Ok"));
filtered = filtered.removeColumns("startDate", "value");
Alternately, you could specify the desired subset of columns as part of the query:
Table filtered = aTable.select("name","status")
.where(aTable.stringColumn("Status").isEqualTo("Ok"));
Given a list of columns as arguments, the select() statement returns a table containing only those columns. By chaining select() and where(), you get something that looks a lot like a SQL statement that returns a subset of the data in the original table.
4.2.8. Column filter reference
All the methods below return a Selection.
General Filters (apply to all types)
isEqualTo(Comparable c)
greaterThan(Comparable c)
greaterThanOrEqualTo(Comparable c)
lessThan(Comparable c)
lessThanOrEqualTo(Comparable c)
between(Comparable a, Comparable b)
isIn(List aList)
isMissing()
isNotMissing()
String Filters
equalToIgnoringCase(String string)
startsWith(String string)
endsWith(String string)
contains(String string)
matchesRegex(String string)
isEmpty(String string)
isAlpha()
isNumeric()
isAlphaNumeric()
isUpperCase()
isLowerCase()
hasLengthEqualTo(int lengthChars)
hasLengthLessThan(int lengthChars)
hasLengthGreaterThan(int lengthChars)
Number Filters
isPositive()
isNegative()
isNonNegative()
isZero()
isEven()
isOdd()
isCloseTo(float target)
isCloseTo(double target)
4.3. Sorting Theory
Sorting is one of the most common operations on tables and columns. Tablesaw makes sorting easy, yet offers complete flexibility. We’ll start here with the simplest approaches, then show how to build more complex sorts.
In the examples below, we use a dataset with four columns (recipe, mix, temp, y) from an analysis of what causes cracks in chocolate cakes.
4.3.1. Ascending / descending
The simplest way to sort is just to provide the name of the sort column or columns as arguments to a method that specifies the order. For example, to sort from lowest to highest:
Table ascending = t.sortAscending("mix", "temp");
This sorts the table in ascending order, in the order the column names are listed. To see the result, we can use the first() method:
ascending.first(8);
which produces:
recipe mix temp y
1 1 175 42
2 1 175 39
3 1 175 46
1 1 185 46
2 1 185 46
3 1 185 44
1 1 195 47
2 1 195 51
The method sortDescending() provides the analogous operation:
Table descending = t.sortDescending("mix", "temp");
sortAscending() and sortDescending() are limiting in that you must use the same order for every column. The advantage is that the sort logic is perfectly clear when reading the code.
4.3.2. Mixing Asc and Desc order
If you need to sort such that some columns are ordered high-to-low and others low-to-high, you can use sortOn().
Table sorted = t.sortOn("mix", "temp");
sortOn() sorts in ascending order by default, so this code produces the same results as sortAscending(). To sort one of the columns in descending order, simply preface the column name with a minus sign ‘-‘, as in:
table.sortOn("-recipe","y", "mix");
This sorts the table first on recipe in descending order, then on y and mix, in ascending order.
4.3.3. Using a Comparator
Finally, if you need complete control over sorting, you can create a Comparator<VRow> and pass it to the sortOn() method. As Tablesaw is column-oriented, it does not have true rows, and materializing each row into some grouping of standard Java objects would be very inefficient. A VRow is a kind of virtual row that serves as a proxy for a single row in a table.
For example, a row comparator that sorts in ascending temperature order can be constructed and used as shown below:
Comparator<VRow> tempComparator = new Comparator<VRow>() {
@Override
public int compare(VRow o1, VRow o2) {
return Double.compare(o1.getDouble("temp"), o2.getDouble("temp"));
}
};
It can be used to sort the cake table by:
Table t = table.sortOn(tempComparator);
Obviously, a custom comparator isn’t needed in this example, but the ability is there when you need it. With a Comparator<VRow>, sort logic can include transformations and combinations of the data when making ordering decisions.
Data: The cake dataset is from http://www.stats4stem.org/r-cake-data.html.
4.4. Summarizing data
An important set of table operations compute totals and subtotals. In the relational world, these are implemented by combining summary operations (sum, max…) and group by. Data scientists often refer to these as Split-Apply-Combine functions. Tablesaw makes calculating summaries easy, using the summarize() method. The general form of summarize() is
t.summarize(column, functions...)
4.4.1. Multiple values for 1-column
The column parameter is the column you want to summarize, and the functions parameter is an array of AggregateFunction, so you can calculate multiple statistics with a single method invocation as shown below.
import static tech.tablesaw.aggregate.AggregateFunctions.*;
...
NumberColumn age = t.nCol("age");
t.summarize(age, mean, max, min);
In this example, the mean, max, and min functions are instances of AggregateFunction, which we access via a static import of AggregateFunctions. AggregateFunctions is a static utility class that defines many common (and uncommon) functions.
The return value from summarize() is a Summarizer; to get the results in the form of a table, we use apply():
Table results = t.summarize(approval, mean, max, min).apply();
The results table is shown below. As you can see, there is a column for each of the functions. The default column name is function-name [summarized-column-name].
bush.csv summary
Mean [approval] | Max [approval] | Min [approval] |
-----------------------------------------------------------
64.88235294117646 | 90.0 | 45.0 |
4.4.2. Values for n-columns
You can also summarize multiple columns in a single call, using one of the following variants:
t.summarize(column1, column2, function...)
t.summarize(column1, column2, column3, function...)
t.summarize(column1, column2, column3, column4, function...)
Usually, you’ll be summarizing numeric columns, but some functions are available for other column types. BooleanColumn, for example, supports countTrue() and countFalse(). All column types support countMissing().
What if you want to apply countTrue() to a boolean column, while calculating the standard deviation of a numeric column? You can achieve this with one of the options shown above:
t.summarize(booleanColumn, numericColumn, standardDeviation, countTrue)
The summary function will compute the results for every column type that supports the given function, so in this example, booleanColumn returns a value for countTrue, and numericColumn returns a value for standardDeviation.
In the above examples, we’ve been using columns as the first n arguments to summarize(). You can also refer to the column by its name.
t.summarize(columnName…)
4.4.3. Subtotals with by()
In this example, we’ll use a Tornado dataset from the NOAA’s Storm Prediction Center. It contains records of every recorded US tornado from 1950-2014. Once we’ve loaded the data, computing stats on subgroups is easy.
The argument to by() is a column or columns, which may be given by the column’s name. In the example below, we calculate the average number of injuries, subtotaling by the tornado’s scale attribute.
Table avgInjuries = table.summarize("Injuries", mean).by("Scale");
That’s all there is to it. Note that when we use by(), we no longer need to call apply() to get the table. We give the result table a more descriptive name, and then to see the result, we can use print():
avgInjuries.setName("Average injuries by Tornado Scale");
avgInjuries.print();
Average injuries by Tornado Scale
Scale | Mean [Injuries] |
----------------------------------
 -9.0 | 0.1052631578947369 |
  0.0 | 0.028963191083737013 |
  1.0 | 0.34967825609816594 |
  2.0 | 1.7487066593285947 |
  3.0 | 9.95538461538462 |
  4.0 | 59.61855670103088 |
  5.0 | 195.23170731707316 |
In this dataset, missing scale values are indicated by a -9. A scale of 0 to 5 indicates the size of the storm, with 5 being the largest/most severe. As you can see, injuries increase dramatically with the most severe storms.
If we provide multiple columns to the by() method, the resulting table has a row for each unique combination. To total the number of fatalities by state and scale, we would write:
Table sumFatalities = table.summarize("Fatalities", sum).by("State", "Scale");
which produces:
data/1950-2014_torn.csv summary
State | Scale | Sum [Injuries] |
--------------------------------------
AL    |   0.0 | 16.0 |
AL    |   1.0 | 454.0 |
AL    |   2.0 | 1091.0 |
AL    |   3.0 | 2422.0 |
AL    |   4.0 | 3617.0 |
AL    |   5.0 | 1612.0 |
AR    |  -9.0 | 5.0 |
AR    |   0.0 | 6.0 |
AR    |   1.0 | 210.0 |
AR    |   2.0 | 933.0 |
...   |   ... | ... |
etc.
Since the result returned is also a Table, you can easily perform other operations. For example, to see only results for storms in Texas and Oklahoma, you could do the following.
List<String> states = Lists.newArrayList("TX", "OK");
sumFatalities = sumFatalities.where(
    sumFatalities.stringColumn("State").isIn(states));
producing:
data/1950-2014_torn.csv summary
State | Scale | Sum [Fatalities] |
----------------------------------------
OK    |  -9.0 | 0.0 |
OK    |   0.0 | 0.0 |
OK    |   1.0 | 5.0 |
OK    |   2.0 | 22.0 |
OK    |   3.0 | 71.0 |
OK    |   4.0 | 143.0 |
OK    |   5.0 | 96.0 |
TX    |  -9.0 | 0.0 |
TX    |   0.0 | 2.0 |
TX    |   1.0 | 21.0 |
TX    |   2.0 | 40.0 |
TX    |   3.0 | 88.0 |
TX    |   4.0 | 219.0 |
TX    |   5.0 | 174.0 |
Data: The tornado dataset is from [NOAA’s Storm Prediction Center Severe Weather GIS](http://www.spc.noaa.gov/gis/svrgis/).
4.4.4. Grouping
On calculated columns
It may seem limiting to only be able to subtotal on column values, but in practice it’s quite flexible. The flexibility comes from using map functions to produce new columns on the fly, and then using those columns to compute the summaries. For example, you might calculate average salary for subgroups based on years of employment, as in the code below:
t.summarize(salary, mean).by(yearsOfEmployment.bin(20));
This code assigns rows to one of 20 bins, based on one numeric column (yearsOfEmployment) and returns the average per bin of another (salary). The approach can be used with any kind of column or map function, but it is especially useful when working with dates and times.
On time units
Let’s say, for example, that you have a table of sales data, and want to calculate the highest dollar sales by month and region. The table has columns for sales_datetime, amount, product, and region. The answer can be had by:
t.summarize(amount, max).by(region, sales_datetime.month())
In this example, a temporary column containing the sales month is created and used to summarize the data, but not added to the original table.
On constant time ranges
You are not limited to grouping on standard time units like a specific month or year. You can instead create TimeWindows based on some number n of standard units. One might, for example, want to look at sales data patterns during a day in 15 minute windows. The following code does this.
t.summarize(amount, max).by(sales_datetime.timeWindows(ChronoUnit.MINUTES, 15))
4.4.5. Summarizing calculated columns
If you can group on calculated columns, why not summarize calculated columns? Let’s say that you are analyzing text. Each row contains a single sentence, and you would like to characterize the length of the sentences in your dataset. You could create a column called "sentence length" and add it to the table, but you might prefer to analyze the data directly.
t.summarize(sentence.length(), min, q1, q2, q3, max, range)
In this example, a standard map function (length()) creates a NumberColumn containing the number of characters in each value of the StringColumn sentence. Various statistics (min, q1, etc.) are calculated on the resulting column.
4.4.6. Conditional Summarization
Another useful facility is summarizing only the data that passes some test. For example, you could extract the common suffixes between two columns of String data, and then count the number that are longer than two characters. This can be done using the summarizeIf() method as follows:
t.summarizeIf(c1.suffix(c2).length().isGreaterThan(2), count())
4.5. Rows
For efficiency reasons, Tablesaw is column-oriented, but sometimes it’s important to work on rows rather than individual columns. Generally speaking, you can do this by iterating over the table’s indexes and getting values from the columns it contains:
for (int i = 0; i < t.rowCount(); i++) {
String s = t.stringColumn("s").get(i);
LocalDate d = t.dateColumn("d").get(i);
System.out.println(s + " happened on " + d);
}
This is not super convenient, however.
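A more convenient alternative is to iterate rows directly. Here’s a minimal sketch, assuming the same "s" and "d" columns as above and Tablesaw’s Row abstraction (Table implements Iterable<Row>):
// iterate the table one Row at a time, reading values by column name
for (Row row : t) {
    String s = row.getString("s");
    LocalDate d = row.getDate("d");
    System.out.println(s + " happened on " + d);
}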
4.6. CrossTabs
If you’re interested in how frequently observations appear in different categories, you can use cross-tabulations, also known as contingency tables. Tablesaw supports one and two dimensional crossTabs.
The Table class contains the methods you need. In the example below we show the observation counts for each combination.
// preparation: load the data, and add a string column to hold the months in the date col
Table table = Table.read().csv("../../data/bush.csv");
StringColumn month = table.dateColumn("date").month();
month.setName("month");
table.addColumns(month);
// perform the crossTab operation
Table counts = table.xTabCounts("month", "who");
System.out.println(counts);
Crosstab Counts: month x who
[labels] | fox | gallup | newsweek | time.cnn | upenn | zogby | total |
----------------------------------------------------------------------------------------
APRIL | 6 | 10 | 3 | 1 | 0 | 3 | 23 |
AUGUST | 3 | 8 | 2 | 1 | 0 | 2 | 16 |
DECEMBER | 4 | 9 | 4 | 3 | 2 | 5 | 27 |
FEBRUARY | 7 | 9 | 4 | 4 | 1 | 4 | 29 |
JANUARY | 7 | 13 | 6 | 3 | 5 | 8 | 42 |
JULY | 6 | 9 | 4 | 3 | 0 | 4 | 26 |
JUNE | 6 | 11 | 1 | 1 | 0 | 4 | 23 |
MARCH | 5 | 12 | 4 | 3 | 0 | 6 | 30 |
MAY | 4 | 9 | 5 | 3 | 0 | 1 | 22 |
NOVEMBER | 4 | 9 | 6 | 3 | 1 | 1 | 24 |
OCTOBER | 7 | 10 | 8 | 2 | 1 | 3 | 31 |
SEPTEMBER | 5 | 10 | 8 | 3 | 0 | 4 | 30 |
Total | 64 | 119 | 55 | 30 | 10 | 45 | 323 |
Note the total column on the right, which shows that 23 polls were conducted in April, etc., across all pollsters. Similarly, the column totals at the bottom show that Fox conducted 64 polls, Gallup 119, etc.
4.6.1. Single variable totals
You can get single variable counts using the xTabCounts() method that takes only one column name argument.
Table whoCounts = table.xTabCounts("who");
producing:
Column: who
Category | Count |
----------------------
zogby | 45 |
upenn | 10 |
time.cnn | 30 |
fox | 64 |
newsweek | 55 |
gallup | 119 |
4.6.2. Calculating Percents
You may want to see the percent of polls conducted by each pollster, rather than raw counts. The xTabPercents() method is used for that.
Table whoPercents = table.xTabPercents("who");
Actually, percents is a misnomer. The results produced are proportions in decimal format. To get percent-formatted output, we apply a NumberColumnFormatter:
whoPercents
.columnsOfType(ColumnType.DOUBLE) // format to display as percents
.forEach(x -> ((NumberColumn) x).setPrintFormatter(NumberColumnFormatter.percent(0)));
Column: who
Category | Percents |
-------------------------
zogby | 14% |
upenn | 3% |
time.cnn | 9% |
fox | 20% |
newsweek | 17% |
gallup | 37% |
4.6.3. Table Percents
When you have two variables, you can display the percent that falls into each combination as shown below.
Table tablePercents = table.xTabTablePercents("month", "who");
tablePercents
.columnsOfType(ColumnType.DOUBLE)
.forEach(x -> ((NumberColumn) x).setPrintFormatter(NumberColumnFormatter.percent(1)));
Because the percents are small, we updated the formatter to show a single fractional digit after the decimal point.
The output can best be understood by looking at an example. Of all the polls in the dataset, 1.9% were conducted by Fox in April, 3.1% by Gallup in April, 0.9% by Fox in August, etc.
Crosstab Table Proportions:
[labels] | fox | gallup | newsweek | time.cnn | upenn | zogby | total |
-------------------------------------------------------------------------------------------
APRIL | 1.9% | 3.1% | 0.9% | 0.3% | 0.0% | 0.9% | 7.1% |
AUGUST | 0.9% | 2.5% | 0.6% | 0.3% | 0.0% | 0.6% | 5.0% |
DECEMBER | 1.2% | 2.8% | 1.2% | 0.9% | 0.6% | 1.5% | 8.4% |
FEBRUARY | 2.2% | 2.8% | 1.2% | 1.2% | 0.3% | 1.2% | 9.0% |
JANUARY | 2.2% | 4.0% | 1.9% | 0.9% | 1.5% | 2.5% | 13.0% |
JULY | 1.9% | 2.8% | 1.2% | 0.9% | 0.0% | 1.2% | 8.0% |
JUNE | 1.9% | 3.4% | 0.3% | 0.3% | 0.0% | 1.2% | 7.1% |
MARCH | 1.5% | 3.7% | 1.2% | 0.9% | 0.0% | 1.9% | 9.3% |
MAY | 1.2% | 2.8% | 1.5% | 0.9% | 0.0% | 0.3% | 6.8% |
NOVEMBER | 1.2% | 2.8% | 1.9% | 0.9% | 0.3% | 0.3% | 7.4% |
OCTOBER | 2.2% | 3.1% | 2.5% | 0.6% | 0.3% | 0.9% | 9.6% |
SEPTEMBER | 1.5% | 3.1% | 2.5% | 0.9% | 0.0% | 1.2% | 9.3% |
Total | 19.8% | 36.8% | 17.0% | 9.3% | 3.1% | 13.9% | 100.0% |
As you can see, this also gives you the 'total' percents by month and pollster.
4.6.4. Column and row percents
The final option is to show column percents or row percents. We’ll start with column percents. You calculate them as shown below.
Table columnPercents = table.xTabColumnPercents("month", "who");
Look at the column for "fox". The values you see are the percentages for Fox alone: 9% of Fox’s polls were conducted in April, 5% in August, etc.
Looking across the columns on the other hand is not very intuitive (or useful, probably) until you get to the total, which shows the average across all pollsters by month.
Crosstab Column Proportions:
[labels] | fox | gallup | newsweek | time.cnn | upenn | zogby | total |
-----------------------------------------------------------------------------------------
APRIL | 9% | 8% | 5% | 3% | 0% | 7% | 7% |
AUGUST | 5% | 7% | 4% | 3% | 0% | 4% | 5% |
DECEMBER | 6% | 8% | 7% | 10% | 20% | 11% | 8% |
FEBRUARY | 11% | 8% | 7% | 13% | 10% | 9% | 9% |
JANUARY | 11% | 11% | 11% | 10% | 50% | 18% | 13% |
JULY | 9% | 8% | 7% | 10% | 0% | 9% | 8% |
JUNE | 9% | 9% | 2% | 3% | 0% | 9% | 7% |
MARCH | 8% | 10% | 7% | 10% | 0% | 13% | 9% |
MAY | 6% | 8% | 9% | 10% | 0% | 2% | 7% |
NOVEMBER | 6% | 8% | 11% | 10% | 10% | 2% | 7% |
OCTOBER | 11% | 8% | 15% | 7% | 10% | 7% | 10% |
SEPTEMBER | 8% | 8% | 15% | 10% | 0% | 9% | 9% |
Total | 100% | 100% | 100% | 100% | 100% | 100% | 100% |
Row percents show the opposite viewpoint.
Table rowPercents = table.xTabRowPercents("month", "who");
Here we see that, of all the polls conducted in April, Fox conducted 26%, Gallup 43%, and the University of Pennsylvania 0% (after rounding).
Crosstab Row Proportions:
[labels] | fox | gallup | newsweek | time.cnn | upenn | zogby | total |
----------------------------------------------------------------------------------------
APRIL | 26% | 43% | 13% | 4% | 0% | 13% | 100% |
AUGUST | 19% | 50% | 12% | 6% | 0% | 12% | 100% |
DECEMBER | 15% | 33% | 15% | 11% | 7% | 19% | 100% |
FEBRUARY | 24% | 31% | 14% | 14% | 3% | 14% | 100% |
JANUARY | 17% | 31% | 14% | 7% | 12% | 19% | 100% |
JULY | 23% | 35% | 15% | 12% | 0% | 15% | 100% |
JUNE | 26% | 48% | 4% | 4% | 0% | 17% | 100% |
MARCH | 17% | 40% | 13% | 10% | 0% | 20% | 100% |
MAY | 18% | 41% | 23% | 14% | 0% | 5% | 100% |
NOVEMBER | 17% | 38% | 25% | 12% | 4% | 4% | 100% |
OCTOBER | 23% | 32% | 26% | 6% | 3% | 10% | 100% |
SEPTEMBER | 17% | 33% | 27% | 10% | 0% | 13% | 100% |
Total | 20% | 37% | 17% | 9% | 3% | 14% | 100% |
And that’s all there is to Tablesaw CrossTabs.
5. Importing / exporting
5.1. Supported Formats
Tablesaw supports importing and exporting data to and from a variety of data types and sources.
Format                         | Import | Export |
---------------------------------------------------
CSV (and other delimited text) | Yes    | Yes |
JSON                           | Yes    | Yes |
RDBMS (via JDBC)               | Yes    | ? |
Fixed Width Text               | Yes    | Yes |
Excel                          | Yes    | ? |
HTML                           | Yes    | Yes |
Note: To minimize the size of the core library, some of the readers (currently JSON, HTML tables, and Excel files) are packaged in separate modules. You will need to include them in your project to use the Table.read() methods.
See the Javadoc for DataFrameReader for a listing of all the Table.read() methods that are available.
5.2. CSV
Most text formats are treated similarly. This section covers rectangular text formats like CSV, but much of the information is also relevant for JSON, Excel, and HTML tables.
The easiest way to load data from a CSV file on disk is to use:
Table t = Table.read().file("myFile.csv");
This method supplies defaults for everything but the filename. We assume that columns are separated by commas, and that the file has a header row, which we use to create column names. If one or more defaults are incorrect, you can customize the loading process using the class CsvReadOptions.
You can create an options object with a builder:
CsvReadOptions.Builder builder =
CsvReadOptions.builder("myFile.csv")
.separator('\t') // table is tab-delimited
.header(false) // no header
.dateFormat("yyyy.MM.dd"); // the date format to use.
CsvReadOptions options = builder.build();
Table t1 = Table.read().usingOptions(options);
The header option indicates whether or not there’s a one-line header row at the top of the file. If header is false, we treat all the rows as data.
The separator option allows you to specify a delimiter other than a comma, in case you’re loading a Tab-delimited file, for example.
The dateFormat lets you provide a format for reading dates. All dates in the file should use the same format, and the format is as defined in java.time.format.DateTimeFormatter.
When the table is created, it is given a default name based on the name of the file it was loaded from. You can change the name at any time using table.setName(aString). The table name is used in printing the table and information about it.
5.2.1. Column types
With all these methods, Tablesaw looks at the data in each column in the file and takes a wild guess at the type. Actually, it looks at a sample of the data and applies some heuristics. Of course, it’s possible that the data set includes rare values that are missed in the type inference sample. If that happens, you can set the option sample(false) to consider all the data when performing type inference.
If nothing else seems to fit, the column is read as a StringColumn. Usually, Tablesaw gets it right, but sometimes it needs a little help.
Specifying the datatypes for each column
By a little help, we mean you could specify the types explicitly, by passing an array of ColumnType objects to the read().csv() method. For example:
ColumnType[] types = {LOCAL_DATE, INTEGER, FLOAT, FLOAT, CATEGORY};
Table t = Table.read().usingOptions(CsvReadOptions
.builder("myFile.csv")
.columnTypes(types));
While that may seem like a lot of work, it does have some advantages.
First, it reduces the loading time as the system does not need to infer the column types. The time saved can be significant if the file is large.
Second, it gives you complete control over the types for your columns.
In some cases, you must specify the column type, because Tablesaw can’t guess correctly. For example, if a file has times encoded as HHmm so that noon appears as ‘1200’, it’s impossible to infer that this means 12:00 noon, and not the integer 1,200.
Guessed column types
If the table has many columns, it can be tedious to build the column type array by hand. To help, CsvReader has methods that return the inferred ColumnTypes in the form of an array, or as a String. The String is formatted so that it resembles a Java array literal. This method can be used even if reading the file fails.
String types = CsvReader.printColumnTypes("data/bush.csv", true, ',');
System.out.println(types);
> ColumnType[] columnTypes = {
LOCAL_DATE, // 0 date
SHORT_INT, // 1 approval
CATEGORY, // 2 who
}
Note that the returned String is a legal array literal you can paste into Java code: the types are comma-separated, and the index position and column name of each entry are rendered as comments. You can paste it into your code and then edit it to fix whatever column types are incorrect.
Skipping columns
Another advantage to specifying the column types is that you can skip some if you don’t need them. You can prevent those columns from being imported by using the special “SKIP” column type as shown below:
ColumnType[] types = {SKIP, INTEGER, FLOAT, FLOAT, SKIP};
Table t = Table.read().usingOptions(CsvReadOptions
.builder("myFile.csv")
.columnTypes(types));
In this example, the first and last columns are not loaded.
Missing data
Tablesaw has a predefined set of strings that it interprets as missing data when reading from a CSV file. These are: “NaN”, “*”, “NA”, “null” and, of course, the empty string “”.
When one of these strings is encountered, it is replaced by a type-specific missing indicator inside Tablesaw. For Strings, it’s an empty string. For doubles it’s Double.NaN. See the JavaDoc for ColumnType for more information.
If your file has an unsupported missing value indicator (e.g. "-"), you can provide it in the options builder.
Table t = Table.read().usingOptions(CsvReadOptions
.builder("myFile.csv")
.missingValueIndicator("-"));
Dates and Times
Importing dates and times can be tricky because of Locales and the wide variety of possible formats. As with other Column types, Tablesaw does its best to determine what type is represented and import it correctly. When this fails, two things can help. The first is to specify a locale. A locale can also help with number formats.
The second is to specify the precise format for each temporal column.
Table t = Table.read().usingOptions(CsvReadOptions
    .builder("myFile.csv")
    .locale(Locale.FRENCH)
    .dateFormat("yyyy.MM.dd")
    .timeFormat("HH:mm:ss")
    .dateTimeFormat("yyyy.MM.dd::HH:mm:ss"));
5.2.2. Using the Stream API
All the examples above streamline the loading process when you have a CSV file stored on your file system. A more flexible way to load a CSV is to use the variant that takes a java.io.InputStream as a parameter.
Table.read().csv(InputStream stream, String tableName);
It can be used to read local files, but also files read across the net, from S3, etc. Here are some examples of how it can be used.
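For instance, here’s a minimal sketch that reads a local file through an InputStream (the file and table names are illustrative):
// try-with-resources ensures the stream is closed after the table is built
try (InputStream stream = new FileInputStream("myFile.csv")) {
    Table t = Table.read().csv(stream, "myTable");
}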
5.2.3. Examples
Loading a CSV from a Website:
ColumnType[] types = {SHORT_INT, FLOAT, SHORT_INT};
String location =
"https://raw.githubusercontent.com/jtablesaw/tablesaw/master/data/bush.csv";
Table table = Table.read().usingOptions(CsvReadOptions.builder(new URL(location))
    .tableName("bush")
    .columnTypes(types));
5.2.4. Alternate encodings
By default, we assume a UTF-8 encoding for your files. If your files use another encoding, the loading process is slightly different. You need to open a reader on a FileInputStream that was constructed with the correct encoding. Here’s an example.
// the file has a Latin-1 encoding, so we open a reader with the matching charset
InputStreamReader reader = new InputStreamReader(
new FileInputStream("somefile.csv"), Charset.forName("ISO-8859-1"));
Table restaurants = Table.read()
.usingOptions(CsvReadOptions.builder(reader, "restaurants"));
5.3. JDBC
It’s equally easy to create a table from the results of a database query. In this case, you never need to specify the column types, because they are inferred from the database column types.
Table t = Table.read().db(ResultSet resultSet, String tableName);
Here’s a more complete example that includes the JDBC setup:
String DB_URL = "jdbc:derby:CoffeeDB;create=true";
Connection conn = DriverManager.getConnection(DB_URL);
Table customer = null;
try (Statement stmt = conn.createStatement()) {
String sql = "SELECT * FROM Customer";
try (ResultSet results = stmt.executeQuery(sql)) {
customer = Table.read().db(results, "Customer");
}
}
5.4. HTML
Tablesaw supports importing data from HTML, JSON, and Excel. See the Javadoc for the Table.read() methods for more info. You will need to add the corresponding optional dependency:
<dependency>
<groupId>com.github.grooviter</groupId>
<artifactId>tablesaw-html</artifactId>
<version>VERSION</version>
</dependency>
implementation 'com.github.grooviter:tablesaw-html:VERSION'
6. Plotting
6.1. Introduction
From the beginning, Tablesaw supported plots for quick, basic, exploratory data analysis. It served its purpose, but fell far short of what the best visualization tools provide. Unfortunately, none of those tools are written in Java.
With version 0.2.0 we introduced a new plotting framework, which provides a Java wrapper around the Plot.ly open source JavaScript visualization library. Plot.ly is based on the extraordinary D3 (Data-Driven Documents) framework, and is certainly among the best open-source visualization packages available in any language. Plot.ly is so good, it has become widely used in languages other than JavaScript such as Python and R, which already had solid options for visualization.
We’ve taken a similar approach to these other languages, providing a wrapper that makes it easy to construct plots in pure Java. And while you can create plots from standard Java objects and primitives, we’ve also ensured that you can build them directly from Tablesaw tables and columns.
This is a huge step forward for Tablesaw in many ways, as high-quality visualization tools are essential to analytical work. These are some of the most important benefits:
-
Plot.ly supports a huge range of visualization types. Tablesaw already supports many kinds, including time-series plots, histograms, 2D histograms, box plots, line charts, area charts, scatter plots, 3D scatterplots, bubble plots, Pareto charts, pie charts, and bar charts.
-
We support numerous variations on the plots. For bar charts, for example, we support vertical and horizontal, one-variable, and grouped or stacked charts.
-
We are continually adding support for many additional types, including geographic maps, tree-plots, network diagrams, heat maps, dendrograms, distplots, 3D surfaces, 3D ribbons, wind-rose charts, ternary plots, 2D density plots, contour plots, tree-maps, violin plots, scatterplot matrices, etc.
-
We provide a single, consistent Java API for constructing these plots using builders, type safe enums, and other Java features to minimize the spelling and other issues that are common in working with JavaScript-based tools.
-
The plots have a single, consistent, and professional appearance.
-
Each chart is interactive. They render with a common family of interactive tools for saving, printing, panning, zooming, selecting points, etc.
-
The range of supported customizations is enormous, including fonts, legends, custom axes, spikes, hover effects, and so on. You can almost always get the visualization you want.
-
And, of course, you can use the output in a Web page when you’re ready to share your visualizations with the wider world.
Please be aware that we don’t support the entire plot.ly API. We do, however, support a large and growing portion.
6.1.1. How it works
There are two ways to work with plotting. You can use the predefined "canned" plots in the api package, or you can roll your own custom visualizations.
Pre-defined plots
The api package contains simplified interfaces for producing a number of common plot types. The goal for these plots is that they can be created and displayed in one or two lines of code. Here’s an example bubble plot:
And here’s the code to create and display it. In the simple API, rendering a bubble plot requires two steps. First, a call to BubblePlot.create() returns a Figure object, and then the figure is rendered by passing it to a Plot.show() method.
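The snippet below is a minimal sketch of those two steps. The table and column names are assumptions for illustration (mirroring the wine dataset used later in this chapter), not the exact code from the example:
// step 1: create() returns a Figure; step 2: Plot.show() renders it
Table wines = Table.read().csv("wine_test.csv");
Figure figure = BubblePlot.create("Wine prices by vintage and rating",
    wines, "highest pro score", "vintage", "Mean Retail");
Plot.show(figure);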
Finding Example Code
You can find example code for all kinds of plots in the test folder of the jsplot project. The code for this specific plot is in the class BubbleExample2.
This writes a generated HTML page containing the necessary JavaScript to a file on the local filesystem, and opens it in a browser window.
Custom visualizations
Custom visualizations are assembled using "figures", "traces", and "layouts". Custom visualizations are described [here](https://jtablesaw.github.io/tablesaw/userguide/Visualization_custom).
6.1.2. Rendering
In your IDE
When you’re doing data analysis, it’s important to be able to create plots easily in your IDE, so we provide a way to render plots without a servlet engine or web server. To do that, we write an output HTML file to disk and use the default browser on the desktop to load it. The effect is similar to testing Web apps from within an IDE. This is what the call to Plot.show() in the example does.
As a web page
You can render plots to web pages easily by getting the JavaScript from the figure you want to display. All figures, traces, and layouts implement asJavaScript(), which returns a String.
As a Java UI
Unfortunately, this isn’t as easy as it should be. The JavaFX WebView component is fine for very simple pages, but is too fragile in handling JavaScript- and CSS-heavy pages for us to support. There are a few other browser components available commercially or as open source if you need to use Tablesaw’s visualization tools, but you may be better off with JavaFX Charts or another pure Java library.
6.2. Bars, Pies, and Pareto charts
For this section, we’ll use a Tornado dataset from NOAA, which can be found in the data folder of the Tablesaw project.
While Tablesaw can produce publication quality graphics, the focus here is on exploratory analysis, where content and ease are more important than polish. Here we focus on some common plot types for working with univariate data:
-
Bar charts
-
Pie charts
-
Pareto charts
6.2.1. Example Code
All of the code used to produce the plots in this document can be found in the class BarPieAndParetoExample.java, which you can find here.
We recommend you open that class to follow along.
6.2.2. Getting setup
First we load and clean the Tornado dataset. We use Table.read().csv() to open the file.
Next we clean the file. We want to work with the scale column. Tornadoes are assigned a scale from 0 to 5 according to their wind speed, but our dataset has missing values encoded as -9, which would throw off the graph. To fix this, we set each value of -9 in the scale column to a missing-value indicator. Then we can create our plots.
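Here’s a sketch of those steps, assuming the tornado dataset path used elsewhere in this guide:
// load the dataset, then replace the -9 sentinel values with "missing"
Table tornadoes = Table.read().csv("../../data/tornadoes_1950-2014.csv");
IntColumn scale = tornadoes.intColumn("Scale");
scale.set(scale.isEqualTo(-9), IntColumnType.missingValueIndicator());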
6.2.3. Bar Plots
We start with the ubiquitous bar chart. Bar charts generally display data that has been summarized into groups. To create a bar chart, you need two things:
-
Some numeric values
-
Some categories to group them
We start by counting tornado-related fatalities according to the intensity of the tornadoes.
Then we plot the results:
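Here’s a minimal sketch of both steps, assuming the summarize/by pattern from earlier and the HorizontalBarPlot class in the api package:
Table fatalities = tornadoes.summarize("Fatalities", sum).by("Scale");
Plot.show(HorizontalBarPlot.create(
    "Fatalities by scale",   // plot title
    fatalities,              // table
    "Scale",                 // grouping column
    "Sum [Fatalities]"));    // numeric column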
Plotting means, medians and other summary statistics
In the example above, we created plots that displayed the sum of values. We did this by first calling summarize() on our table, passing in the name of the column to summarize, the aggregation function sum, and applying a clause by() that specified how to group the data.
There are many aggregation functions defined in the class AggregateFunctions, such as sum, mean, median, standardDeviation, percentile(n), max, kurtosis, etc. We could have used any of those instead of sum. Here we’ll look at the mean values.
We follow the same strategy as above, using the summarize method to create a new table that collects the average number of injuries for each value of "scale". Once we have that, we create and display the plot.
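A sketch along the same lines as the fatalities plot above, swapping sum for mean:
Table injuries = tornadoes.summarize("Injuries", mean).by("Scale");
injuries.setName("Average injuries by Tornado Scale");
Plot.show(HorizontalBarPlot.create(
    "Average injuries by scale", injuries, "Scale", "Mean [Injuries]"));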
In an upcoming section on advanced plotting features, we’ll cover how to create stacked and grouped bar plots.
6.2.4. Pie Plots
Pie plots are both widely criticized and ubiquitous. As a general rule, bar plots are easier to interpret and so are generally to be preferred. We’d be remiss, however, not to provide pie chart support. This example shows a pie plot that displays the same data as the first bar plot above:
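A sketch using the PiePlot class from the api package, reusing the fatalities-by-scale table from the bar plot sketch above:
Plot.show(PiePlot.create(
    "Fatalities by scale", fatalities, "Scale", "Sum [Fatalities]"));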
6.3. Histograms & Box Plots
Understanding the distribution of data within a column is often essential. Tablesaw provides one and two dimensional histograms, as well as box plots.
The complete code for this example can be found here: https://github.com/jtablesaw/tablesaw/blob/master/jsplot/src/test/java/tech/tablesaw/examples/DistributionVisualizations.java
We start by loading the data we’ll use in our examples below.
Table property = Table.read().csv("sacremento_real_estate_transactions.csv");
6.3.1. Histograms
A one-dimensional histogram of property prices is shown below. To produce it, we simply call Histogram.create() with the table and the name of the column of interest:
Plot.show(Histogram.create("Distribution of prices", property, "price"));
We also take a histogram of sizes, after setting any sizes of 0 square feet to "missing".
NumberColumn sqft = property.numberColumn("sq__ft");
sqft.set(sqft.isEqualTo(0), DoubleColumnType.missingValueIndicator());
Plot.show(Histogram.create("Distribution of property sizes", property, "sq__ft"));
6.4. Scatter Plots & Bubble Plots
Scatter plots are among the most popular and useful visualization options. We will use a wine dataset to demonstrate, starting with a simple scatter plot relating California champagne vintages and retail prices.
Yes, we are using champagne data to demonstrate bubble charts. How could we use anything else?
First we get our dataset and filter it.
Table wines = Table.read().csv("wine_test.csv");
Table champagne =
wines.where(
wines.stringColumn("wine type").isEqualTo("Champagne & Sparkling")
.and(wines.stringColumn("region").isEqualTo("California")));
The plotting code is straightforward. This line creates and displays the plot.
Plot.show(
ScatterPlot.create("Champagne prices by vintage",
champagne, "mean retail", "vintage"));
This kind of plot is easy to interpret but limited in expressiveness. By modifying it, you can show complex relationships between up to five variables (four numeric and one categorical).
6.4.1. Three variables
Adding a categorical column to the plot produces a series for each category.
Plot.show(
ScatterPlot.create("Wine prices and ratings",
wines, "Mean Retail", "highest pro score", "wine type"));
The legend on the right serves a dual purpose. It associates the colors with their categories, and it also allows you to remove or add a category to the plot by clicking its name.
To plot three numeric variables we can use a bubble plot or a 3D scatter. First we’ll use the bubble plot.
Plot.show(
BubblePlot.create("Average retail price for champagnes by vintage and rating",
champagne, // table
"highest pro score", // x
"vintage", // y
"Mean Retail")); // bubble size
The size of the bubble is given by the last column "mean retail." By default, values are mapped to the diameter of the bubble, but it’s possible to use area when creating a custom scatter plot.
To represent the same data in a 3D Scatter, you would use the Scatter3DPlot instead of BubblePlot. The rest of the code is the same. The variable represented by bubble size above is now represented on the z axis:
Plot.show(
Scatter3DPlot.create("Average retail price for champagnes by vintage and rating",
champagne, // table
"highest pro score", // x
"vintage", // y
"Mean Retail")); // z
We can’t show it here, but these plots are rotatable in 3D space, and support panning and zooming like the 2D plots. Hovering over a point produces a label and draws a box connecting the highlighted point to each of the three axes.
6.4.2. Four variables
We can add a categorical variable to either the Bubble or the 3D scatter. First we’ll show the bubble version.
Plot.show(
BubblePlot.create("Average retail price for champagnes by vintage and rating",
champagne,
"highest pro score",
"vintage",
"Mean Retail",
"appellation"));
The grouping column is added to the end of the method. The result is shown below.
Let’s see the same four variables using a 3D scatter. First the code, and then the plot:
Plot.show(
Scatter3DPlot.create("Average retail price for champagnes by vintage and rating",
champagne,
"highest pro score",
"vintage",
"Mean Retail",
"appellation"));
6.4.3. Five variables
Now we’ll take one step further, just because we can. Here we add another numeric variable to the categorical 3D scatter plot above. As with the 2D scatter, we use point size for the new numeric variable.
Plot.show(
Scatter3DPlot.create("High & low retail price for champagne by vintage and rating",
champagne,
"vintage",
"highest pro score",
"highest retail",
"lowest retail",
"appellation"));
6.4.4. Going Further
Use the code in the api package as a starting point for your own explorations. The process of creating custom visualizations is covered [here](https://jtablesaw.github.io/tablesaw/userguide/Visualization_custom). Also be sure to look at related api classes for specialized plots like time series, line plots, or area plots.
6.5. Time Series, Line charts, and Area charts
Time series data is essential in finance, healthcare, business operations, server log analysis, and many other areas. Here we take a broad look at crunching temporal data in Java using the Tablesaw data science library.
6.5.1. Time Series
You can create time series plots easily. The time axis adjusts the scale according to the size of the display area and the number of points displayed. As always, we start with loading data.
Table bush = Table.read().csv("bush.csv");
This loads a CSV file and creates a table with typed columns.
Here’s an example:
Table foxOnly = bush.where(bush.stringColumn("who").equalsIgnoreCase("fox"));
Plot.show(TimeSeriesPlot.create("Fox approval ratings", foxOnly, "date", "approval"));
To see more than one series, we add a grouping column to the call to TimeSeriesPlot.create(). This creates a separate line for each distinct value in that column. Here the grouping column is "who", which holds the names of the organizations that conducted the polls.
Plot.show(
TimeSeriesPlot.create("George W. Bush approval", bush, "date", "approval", "who"));
6.5.2. Line Charts
Perhaps the simplest way to present time-oriented data is to simply plot a set of observations in the order in which they occurred.
Table robberies = Table.read().csv("../../data/boston-robberies.csv");
Plot.show(
LinePlot.create("Monthly Boston Robberies: Jan 1966-Oct 1975",
robberies, "Record", "Robberies"));
6.5.3. Area Charts
When the observations represent a level, an area chart can be a good choice.
Table robberies = Table.read().csv("../../data/boston-robberies.csv");
Plot.show(
    AreaPlot.create("Monthly Boston Robberies: Jan 1966-Oct 1975",
        robberies, "Record", "Robberies"));
6.5.4. Examples
Financial time-series
Many financial datasets, especially those dealing with market prices, are in the form of a time series where price changes during a time period of a day, hour, or minute are represented by four variables: Open, High, Low, and Close. For example:
ohlcdata.csv
Date | Open | High | Low | Close | Volume | Adj Close |
-----------------------------------------------------------------------------------
2009-03-31 | 17.83 | 18.79 | 17.78 | 18.37 | 92,095,500 | 17.81 |
2009-03-30 | 17.74 | 17.76 | 17.27 | 17.48 | 49,633,000 | 16.95 |
2009-03-27 | 18.54 | 18.62 | 18.05 | 18.13 | 47,670,400 | 17.58 |
2009-03-26 | 18.17 | 18.88 | 18.12 | 18.83 | 63,775,100 | 18.26 |
2009-03-25 | 17.98 | 18.31 | 17.52 | 17.88 | 73,927,100 | 17.34 |
...
Open and Close are the prices at the beginning and end of the period, respectively. High and Low are the highest and lowest prices during the period. For these datasets, several specialized variations of time series have been created that show all four variables for each time point. We will look at two here: OHLC charts and Candlestick charts.
First we need to load the new dataset:
Table priceTable = Table.read().csv("../../data/ohlcdata.csv");
Creating these charts can be done in a single line of code. We’ll look at the OHLC chart first.
Plot.show(OHLCPlot.create("Prices", // The plot title
priceTable, // the table we loaded earlier
"date", // our time variable
"open", // the price data...
"high",
"low",
"close"));
Candlestick Charts
Plot.show(CandlestickPlot.create("Prices", priceTable, "date","open", "high", "low", "close"));
7. Tutorial
Tablesaw is a large library. We’ll use Tablesaw to look at data about Tornadoes. If you haven’t already done so, we strongly recommend that you read the Getting Started guide, before continuing here.
7.1. Exploring Tornadoes
To give a better sense of how Tablesaw works, we’ll use a tornado data set from NOAA. Here’s what we’ll cover:
-
Reading and writing CSV files
-
Viewing table metadata
-
Adding and removing columns
-
Printing the first few rows for a peek at the data
-
Sorting
-
Running descriptive stats (mean, min, max, etc.)
-
Performing mapping operations over columns
-
Filtering rows
-
Computing cross-tabs
All the data is in the Tablesaw data folder. The code is taken from the TornadoExample test class.
7.1.1. Read a CSV file
Here we read a csv file of tornado data. Tablesaw infers the column types by sampling the data.
Table tornadoes = Table.read().csv("../../data/tornadoes_1950-2014.csv");
Note that the file is addressed relative to the current working directory. You may have to change the path for your code.
7.1.2. Viewing table metadata
Often, the best way to start is to print the column names for reference:
tornadoes.columnNames()
[Date, Time, State, State No, Scale, Injuries, Fatalities, Start Lat, Start Lon, Length, Width]
The shape() method displays the row and column counts:
tornadoes.shape()
59945 rows X 11 cols
structure() shows the index, name, and type of each column:
tornadoes.structure().printAll()
Structure of tornadoes_1950-2014.csv
Index | Column Name | Column Type |
-----------------------------------------
0 | Date | LOCAL_DATE |
1 | Time | LOCAL_TIME |
2 | State | STRING |
3 | State No | INTEGER |
4 | Scale | INTEGER |
5 | Injuries | INTEGER |
6 | Fatalities | INTEGER |
7 | Start Lat | DOUBLE |
8 | Start Lon | DOUBLE |
9 | Length | DOUBLE |
10 | Width | INTEGER |
Like many Tablesaw methods, structure() returns a table, from which you can produce a string representation for display. For convenience, calling toString() on a table invokes print(), which produces a string representation of the table. To display the table, then, you can simply call:
System.out.println(tornadoes);
You can also perform other table operations on it. For example, the code below filters the structure table to show only the DOUBLE columns:
tornadoes
.structure()
.where(tornadoes.structure().stringColumn("Column Type").isEqualTo("DOUBLE"))
Structure of tornadoes_1950-2014.csv
Index | Column Name | Column Type |
-----------------------------------------
7 | Start Lat | DOUBLE |
8 | Start Lon | DOUBLE |
9 | Length | DOUBLE |
Of course, that also returned a table. We’ll cover selecting rows in more detail later.
7.1.3. Previewing data
The first(n) method returns a new table containing the first n rows.
tornadoes.first(3)
tornadoes_1950-2014.csv
Date | Time | State | State No | Scale | Injuries | Fatalities | Start Lat | Start Lon | Length | Width |
-----------------------------------------------------------------------------------------------------------------------------------------
1950-01-03 | 11:00:00 | MO | 1 | 3 | 3 | 0 | 38.77 | -90.22 | 9.5 | 150 |
1950-01-03 | 11:00:00 | MO | 1 | 3 | 3 | 0 | 38.77 | -90.22 | 6.2 | 150 |
1950-01-03 | 11:10:00 | IL | 1 | 3 | 0 | 0 | 38.82 | -90.12 | 3.3 | 100 |
7.1.4. Mapping operations
Mapping operations in Tablesaw take one or more columns as inputs and produce a new column as output. We can map arbitrary expressions onto the table, but many common operations are built in. You can, for example, calculate the difference in days, weeks, or years between the values in two date columns. The method below extracts the Month name from the date column into a new column.
StringColumn month = tornadoes.dateColumn("Date").month();
Now that you have a new column, you can add it to the table:
tornadoes.addColumns(month);
You can remove columns from tables to save memory or reduce clutter:
tornadoes.removeColumns("State No");
7.1.5. Sorting
Now let’s sort the table in descending order on the Fatalities column. The negative sign before the name indicates a descending sort.
tornadoes.sortOn("-Fatalities");
7.1.6. Descriptive statistics
Descriptive statistics are calculated using the summary() method:
tornadoes.column("Fatalities").summary().print()
Column: Fatalities
Measure | Value |
------------------------------------
n | 59945 |
sum | 6802 |
Mean | 0.11347068145800349 |
Min | 0 |
Max | 158 |
Range | 158 |
Variance | 2.901978053261765 |
Std. Dev | 1.7035193140266314 |
7.1.7. Filtering
You can write your own methods to filter rows, but it’s easier to use the built-in filter classes as shown below:
Table result = tornadoes.where(tornadoes.intColumn("Fatalities").isGreaterThan(0));
result = result.where(result.dateColumn("Date").isInApril());
result =
    result.where(
        result
            .intColumn("Width")
            .isGreaterThan(300) // 300 yards
            .or(result.doubleColumn("Length").isGreaterThan(10))); // 10 miles
result = result.select("State", "Date");
tornadoes_1950-2014.csv
State | Date |
------------------------
MO | 1950-01-03 |
IL | 1950-01-03 |
OH | 1950-01-03 |
The last example above returns a table containing only the columns named in the select() parameters, rather than all the columns in the original.
7.1.8. Totals and sub-totals
Column metrics can be calculated using methods like sum(), product(), mean(), max(), etc.
You can apply those methods to a table, calculating results on one column, grouped by the values in another.
Table injuriesByScale = tornadoes.summarize("Injuries", median).by("Scale").sortOn("Scale");
injuriesByScale.setName("Median injuries by Tornado Scale");
This produces the following table, in which Group represents the Tornado Scale and Median the median injuries for that group:
Median injuries by Tornado Scale
Scale | Median [Injuries] |
-------------------------------
-9 | 0 |
0 | 0 |
1 | 0 |
2 | 0 |
3 | 1 |
4 | 12 |
5 | 107 |
7.1.9. Cross Tabs
Tablesaw lets you easily produce two-dimensional cross-tabulations (“cross tabs”) of counts and proportions with row and column subtotals. Here’s a count example where we look at the interaction of tornado severity and US state:
CrossTab.counts(tornadoes, tornadoes.stringColumn("State"), tornadoes.intColumn("Scale"))
.first(10)
Crosstab Counts: State x Scale
[labels] | -9 | 0 | 1 | 2 | 3 | 4 | 5 | total |
----------------------------------------------------------------------------
AL | 0 | 624 | 770 | 425 | 142 | 38 | 12 | 2011 |
AR | 1 | 486 | 667 | 420 | 162 | 29 | 0 | 1765 |
AZ | 1 | 146 | 71 | 16 | 3 | 0 | 0 | 237 |
CA | 1 | 271 | 117 | 23 | 2 | 0 | 0 | 414 |
CO | 3 | 1322 | 563 | 112 | 22 | 1 | 0 | 2023 |
CT | 0 | 18 | 53 | 22 | 4 | 2 | 0 | 99 |
DC | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 2 |
DE | 0 | 22 | 26 | 12 | 1 | 0 | 0 | 61 |
FL | 2 | 1938 | 912 | 319 | 37 | 3 | 0 | 3211 |
GA | 0 | 413 | 700 | 309 | 74 | 11 | 0 | 1507 |
7.1.10. Putting it all together
Now that you’ve seen the pieces, we can put them together to perform a more complex data analysis. Let’s say we want to know how frequently tornadoes occur in the summer. Here’s one way to approach that:
Let’s start by getting only those tornadoes that occurred in the summer.
Table summer =
tornadoes.where(
QuerySupport.or(
// In June
QuerySupport.and(
t -> t.dateColumn("Date").month().isEqualTo("JUNE"),
t -> t.dateColumn("Date").dayOfMonth().isGreaterThanOrEqualTo(21)),
// In July or August
t -> t.dateColumn("Date").month().isIn("JULY", "AUGUST"),
// In September
QuerySupport.and(
    t -> t.dateColumn("Date").month().isEqualTo("SEPTEMBER"),
    t -> t.dateColumn("Date").dayOfMonth().isLessThan(22))));
To get the frequency, we calculate the difference in days between successive tornadoes. The lag() method creates a column where every value equals the previous value (the prior row) of the source column. Then we can simply get the difference in days between the two dates. DateColumn has a method daysUntil() that does this. It returns a NumberColumn that we’ll call "delta".
summer = summer.sortAscendingOn("Date", "Time");
summer.addColumns(summer.dateColumn("Date").lag(1));
DateColumn summerDate = summer.dateColumn("Date");
DateColumn laggedDate = summer.dateColumn("Date lag(1)");
IntColumn delta = laggedDate.daysUntil(summerDate);
summer.addColumns(delta);
Now we simply calculate the mean of the delta column. Splitting on year keeps us from inadvertently including the time between the last tornado of one summer and the first tornado of the next.
Table summary = summer.summarize(delta, mean, count).by(summerDate.year());
Printing summary gives us the answer by year.
tornadoes_1950-2014.csv summary
Date year | Mean [Date lag(1) - Date[DAYS]] | Count [Date lag(1) - Date[DAYS]] |
--------------------------------------------------------------------------------------
1950 | 2.0555555555555545 | 162 |
1951 | 1.7488584474885829 | 219 |
1952 | 1.8673469387755088 | 196 |
1953 | 0.983870967741935 | 372 |
1954 | 0.8617283950617302 | 405 |
...
To get a single number for the entire period, we can take the average of the annual means.
summary.nCol(1).mean()
// Average days between tornadoes in the summer:
0.5931137164104612
8. ML
8.1. Introduction
Tablesaw provides excellent functionality for easy and efficient manipulation, visualization, and exploration of data from a variety of sources. A natural extension of this is the ability to utilize statistical/machine learning methods alongside this functionality. Tablesaw now supports basic integration with the leading JVM machine learning library, Smile.
Smile supports numerous machine learning techniques, everything from basic linear regression to unsupervised learning algorithms. The library boasts top of the line performance both in the JVM realm and in comparison to alternatives in the Python/R ecosystems.
At the basic level, one can use Tablesaw to do all of the data manipulation required for a project and then easily convert a Table to the Smile DataFrame format when passing it off to a model.
Table data = Table.read().csv("path/to/file.csv");
//clean, manipulate, visualize data as needed
//convert to Smile Dataframe format
DataFrame data_smile = data.smile().toDataFrame();
The Tablesaw User Guide contains several examples of how to use Tablesaw and Smile together to implement popular machine learning methods. To add Smile to your project, add the dependency below to your Gradle or Maven file. Though there are newer versions available, Smile 2.0.0 was used in the development of the tutorials and is recommended.
implementation 'com.github.haifengl:smile-core:2.0.0'
<dependency>
<groupId>com.github.haifengl</groupId>
<artifactId>smile-core</artifactId>
<version>2.0.0</version>
</dependency>
References
-
[Linear Regression with Smile and Tablesaw: Moneyball Tutorial](https://jtablesaw.github.io/tablesaw/userguide/ml/Moneyball Linear regression)
-
[K-Means clustering with Smile and Tablesaw: NYC Uber Tutorial](https://jtablesaw.github.io/tablesaw/userguide/ml/Uber K Means)
-
[Classification using Random Forests with Smile and Tablesaw: Heart Tutorial](https://jtablesaw.github.io/tablesaw/userguide/ml/Heart Random forest)
8.2. Moneyball
Linear regression analysis has been called the "Hello World" of machine learning, because it’s widely used and easy to understand. It’s also very powerful. We’ll walk through the modeling process here using Smile and Tablesaw. Smile is a fantastic Java machine learning library, and Tablesaw is a data wrangling library like pandas.
One of the best known applications of regression comes from the book Moneyball, which describes the innovative use of data science at the Oakland A’s baseball team. My analysis is based on a lecture given in the EdX course: MITx: 15.071x The Analytics Edge. If you’re new to data analytics, I would strongly recommend this course.
Moneyball is a great example of how to apply data science to solve a business problem. For the A’s, the business problem was "How do we make the playoffs?" They break that problem down into simpler problems that can be solved with data. Their approach is summarized in the diagram below:
In baseball, you make the playoffs by winning more games than your rivals, but you can’t control the number of games your rivals win. How should you proceed? The A’s needed to find controllable variables that affected their likelihood of making the playoffs.
Specifically, they wanted to know how to spend their salary dollars to produce the most wins. Statistics like "Batting Average" are available for individual players, so if you knew Batting Average had the greatest impact, you could trade for players with high batting averages, and thus improve your odds of success.
To do regression modeling in Tablesaw, we’ll first need to import Smile:
implementation group: 'com.github.haifengl', name: 'smile-core', version: '2.0.0'
<dependency>
<groupId>com.github.haifengl</groupId>
<artifactId>smile-core</artifactId>
<version>2.0.0</version>
</dependency>
You will also need tablesaw-core and tablesaw-jsplot.
To connect player stats to making the playoffs, they systematically decomposed their high-level goal. They started by asking how many wins they’d need to make the playoffs. They decided that 95 wins would give them a strong chance. Here’s how we might check that assumption in Tablesaw. [(Download CSV here)](https://raw.githubusercontent.com/jtablesaw/tablesaw/master/data/baseball.csv)
// Get the data
Table baseball = Table.read().csv("data/baseball.csv");
// filter to the seasons before 2002, when the A's model was built
Table moneyball = baseball.where(baseball.intColumn("year").isLessThan(2002));
We can check the assumption visually by plotting wins per year in a way that separates the teams who make the playoffs from those who don’t. This code produces the chart below:
NumericColumn wins = moneyball.nCol("W");
NumericColumn year = moneyball.nCol("Year");
Column playoffs = moneyball.column("Playoffs");
Figure winsByYear = ScatterPlot.create("Regular season wins by year", moneyball, "W", "year", "playoffs");
Plot.show(winsByYear);
Teams that made the playoffs are shown as yellow points. If you draw a vertical line at 95 wins, you can see that it’s likely a team that wins more than 95 games will make the playoffs. So far so good.
Unfortunately, you can’t directly control the number of games you win. We need to go deeper. At the next level, we hypothesize that the number of wins can be predicted by the number of Runs Scored during the season, combined with the number of Runs Allowed.
To check this assumption we compute Run Difference as Runs Scored - Runs Allowed:
NumberColumn RS = (NumberColumn) moneyball.numberColumn("RS");
NumberColumn RA = (NumberColumn) moneyball.numberColumn("RA");
NumberColumn runDifference = RS.subtract(RA).setName("RD");
moneyball.addColumns(runDifference);
Now lets see if Run Difference is correlated with Wins. We use a scatter plot again:
Figure runsVsWins = ScatterPlot.create("Run Difference x Wins", moneyball, "RD","W");
Plot.show(runsVsWins);
Our plot shows a strong linear relation between the two.
8.2.1. Modeling
Let’s create our first predictive model using linear regression, with runDifference as the sole explanatory variable. Here we use Smile’s OLS (Ordinary Least Squares) regression model.
LinearModel winsModel = OLS
.fit(Formula.lhs("W"), moneyball.selectColumns("RD", "W").smile().toDataFrame());
If we print our "winsModel", it produces the output below:
Residuals:
Min 1Q Median 3Q Max
-14.2662 -2.6511 0.1282 2.9365 11.6570
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Intercept 80.8814 0.1312 616.6747 0.0000 ***
Run Difference 0.1058 0.0013 81.5536 0.0000 ***
---------------------------------------------------------------------
Significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.9391 on 900 degrees of freedom
Multiple R-squared: 0.8808, Adjusted R-squared: 0.8807
F-statistic: 6650.9926 on 1 and 900 DF, p-value: 0.000
8.2.2. Interpreting the model
If you’re new to regression, here are some take-aways from the output:
-
The R-squared of .88 can be interpreted to mean that roughly 88% of the variance in Wins can be explained by the Run Difference variable. The rest is determined by some combination of other variables and pure chance.
-
The estimate for the Intercept is the average wins independent of Run Difference. In baseball, we have a 162 game season so we expect this value to be about 81, as it is.
-
The estimate for the RD variable of .1 suggests that an increase of 10 in Run Difference should produce about 1 additional win over the course of the season.
Of course, this model is not simply descriptive. We can use it to make predictions. In the code below, we predict how many games we will win if we score 135 more runs than our opponents. To do this, we pass an array of doubles, one for each explanatory variable in our model, to the predict() method. In this case, there’s just one variable: run difference.
double[] runDifferential = {135};
double expectedWins = winsModel.predict(runDifferential);
> 95.159733753496
We’d expect 95 wins when we outscore opponents by 135 runs.
8.2.3. Modeling Runs Scored
It’s time to go deeper again and see how we can model Runs Scored and Runs Allowed. The approach the A’s took was to model Runs Scored using team On-base percent (OBP) and team Slugging Average (SLG). In Tablesaw, we write:
LinearModel runsScored = OLS.fit(Formula.lhs("RS"), moneyball.selectColumns("RS", "OBP", "SLG").smile().toDataFrame());
Once again the first parameter takes a Tablesaw column containing the values we want to predict (Runs scored). The next two parameters take the explanatory variables OBP and SLG.
Linear Model:
Residuals:
Min 1Q Median 3Q Max
-70.8379 -17.1810 -1.0917 16.7812 90.0358
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -804.6271 18.9208 -42.5261 0.0000 ***
OBP 2737.7682 90.6846 30.1900 0.0000 ***
SLG 1584.9085 42.1556 37.5966 0.0000 ***
---------------------------------------------------------------------
Significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 24.7900 on 899 degrees of freedom
Multiple R-squared: 0.9296, Adjusted R-squared: 0.9294
F-statistic: 5933.7256 on 2 and 899 DF, p-value: 0.000
Again we have a model with excellent explanatory power, with an R-squared of .93. Now we’ll check the model visually to see if it violates any assumptions. Our residuals should be normally distributed. We can use a histogram to verify:
Plot.show(Histogram.create("Runs Scored Residuals",runsScored.residuals()));
It looks great. It’s also important to plot the predicted (or "fitted") values against the residuals. We want to see if the model fits some values better than others, which will influence whether we can trust its predictions or not. Ideally, we want to see a cloud of random dots around zero on the y axis.
Our Scatter class can create this plot directly from the model:
double[] fitted = runsScored.fittedValues();
double[] resids = runsScored.residuals();
Plot.show(ScatterPlot.create("Runs Scored from OBP and SLG", "Fitted", fitted, "Residuals", resids));
Again, the plot looks good.
Let’s review. We’ve created a model of baseball that predicts entry into the playoffs based on batting stats, with the influence of the variables as:
SLG & OBP -> Runs Scored -> Run Difference -> Regular Season Wins
8.2.4. Modeling Runs Allowed
Of course, we haven’t modeled the Runs Allowed side of Run Difference. We could use pitching and fielding stats to do this, but the A’s cleverly used the same two variables (SLG and OBP), now looking at how their opponents performed against the A’s. We can do the same, as these data are encoded in the dataset as OOBP and OSLG.
LinearModel runsAllowed = OLS
    .fit(Formula.lhs("RA"), moneyball.selectColumns("RA", "OOBP", "OSLG")
        .dropRowsWithMissingValues()
        .smile()
        .toDataFrame());
> Linear Model:
Residuals:
Min 1Q Median 3Q Max
-82.3971 -15.9782 0.0166 17.9137 60.9553
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Intercept -837.3779 60.2554 -13.8971 0.0000 ***
OOBP 2913.5995 291.9710 9.9791 0.0000 ***
OSLG 1514.2860 175.4281 8.6319 0.0000 ***
---------------------------------------------------------------------
Significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 25.6739 on 87 degrees of freedom
Multiple R-squared: 0.9073, Adjusted R-squared: 0.9052
F-statistic: 425.8225 on 2 and 87 DF, p-value: 1.162e-45
This model also looks good, but you’d want to examine the plots again and do other checks as well. Checking the predictor variables for collinearity is always worthwhile.
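For example, a quick collinearity screen is to correlate the two predictors directly. A minimal sketch, assuming OOBP and OSLG were parsed as double columns (spearmans() computes rank correlation and is used again later in this guide):
Table predictors = moneyball.selectColumns("OOBP", "OSLG").dropRowsWithMissingValues();
double rho = predictors.doubleColumn("OOBP").spearmans(predictors.doubleColumn("OSLG"));
System.out.println("OOBP vs. OSLG correlation: " + rho);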
Finally, we can tie this all together and see how well wins are predicted when we consider both offensive and defensive stats.
LinearModel winsFinal = OLS
.fit(Formula.lhs("W"), moneyball.selectColumns("W", "OOBP", "OBP", "OSLG", "SLG")
.dropRowsWithMissingValues()
.smile()
.toDataFrame());
The output isn’t shown, but we get an R-squared of .89. Again, this is quite good.
8.2.5. The A’s in 2001
For fun, I decided to see what the model predicts for the 2001 A’s. First, I got the independent variables for the A’s in that year.
StringColumn team2001 = moneyball.stringColumn("team");
NumberColumn year2001 = (NumberColumn) moneyball.numberColumn("year");
Table asIn2001 = moneyball.selectColumns("year", "OOBP", "OBP", "OSLG", "SLG")
.where(team2001.isEqualTo("OAK")
.and(year2001.isEqualTo(2001)));
System.out.println(asIn2001);
> baseball.csv
Year | OOBP | OBP | OSLG | SLG |
-------------------------------------------------
2001.0 | 0.308 | 0.345 | 0.38 | 0.439 |
Now we get the prediction:
double[][] values = new double[][] {{ 0.308, 0.345, 0.38, 0.439 }};
double[] predictedWins = winsFinal.predict(values);
System.out.println(predictedWins[0]);
The model predicted that the 2001 A’s would win 102 games given their slugging and On-Base stats. They won 103.
8.2.6. Recap
We used regression to build predictive models, and visualizations to check our assumptions and validate our models.
The next step would involve predicting how the current team will perform using historical data, and finding available talent who could increase the team’s average OBP or SLG numbers, or reduce the opponents’ values of the same stats. Taking it to that level requires individual player stats that aren’t in our dataset, so we’ll leave it here, but I hope this tutorial has shown how Tablesaw and Smile work together to make regression analysis in Java easy and practical.
8.3. Random Forest
While linear regression analysis (introduced in the [Moneyball tutorial](https://jtablesaw.github.io/tablesaw/userguide/ml/Moneyball%20Linear%20regression)) is widely used and works well for a variety of problems, tree-based models provide excellent results and can be applied to datasets with both numerical and categorical features, without making any assumption of linearity. In addition, tree-based methods can be used effectively for both regression and classification tasks.
This tutorial is based on Chapter 8 (Tree-Based Methods) of the widely used and freely available textbook [An Introduction to Statistical Learning, Second Edition](https://www.statlearning.com/).
8.3.1. Basic Decision Trees
The basic idea of a decision tree is simple: use a defined, algorithmic approach to partition the feature space of a dataset into regions in order to make a prediction. For a regression task, this prediction is simply the mean of the response variable for all training observations in the region. In any decision tree, every datapoint that falls within the same region receives the same prediction.
The following figure illustrates a two-dimensional feature space that has been partitioned according to a greedy algorithm that minimizes the residual sum of squares.
As you can see, the feature space is split into three regions, and three predictions are possible: 5.00, 6.00, and 6.74.
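To make that concrete, here is a toy sketch of how such a partitioned feature space acts as a predictor. The split thresholds below are invented for illustration; the three leaf values are the region means from the figure:
//hypothetical depth-2 regression tree over two features, x1 and x2;
//every observation that falls in a region receives that region's mean response
static double treePredict(double x1, double x2) {
    if (x1 < 3.0) {
        return 5.00; //mean response in region 1
    } else if (x2 < 2.0) {
        return 6.00; //mean response in region 2
    } else {
        return 6.74; //mean response in region 3
    }
}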
While decision trees are easily interpretable, especially when looking at the visual of the tree, basic decision trees like this one are generally less accurate than other regression/classification approaches. More advanced tree-based methods, however, not only compete with other approaches, but often provide exceptional performance.
8.3.2. Random Forest
Random Forest is an ensemble method that builds on basic decision trees to form a model that is the composition of a large number of individual decision trees. In a Random Forest model, n decision trees are grown independently, with each tree considering only a random subset of the predictors, m, and using only a subset of the training data. Because the majority of the predictors are not considered in each individual tree, the predictions of the individual trees are decorrelated, so the average decision of all the trees in the model has lower variance and greater predictive value. The greedy algorithm used to construct decision trees is heavily influenced by strong predictors, so by excluding a number of predictors when each tree in the Random Forest is grown, the model can explore the predictive value of other features that previously may have been largely ignored.
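As a minimal sketch of the ensemble idea (hypothetical Tree interface, not the Smile API), classification by majority vote looks like this:
interface Tree {
    int predict(double[] x); //returns a class label: 0 or 1
}

static int forestPredict(java.util.List<Tree> trees, double[] x) {
    long votesForOne = trees.stream().filter(t -> t.predict(x) == 1).count();
    //the forest's decision is the majority vote of its decorrelated trees
    return votesForOne * 2 >= trees.size() ? 1 : 0;
}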
8.3.3. Implementing Random Forest for the Heart dataset using Tablesaw + Smile
The Heart dataset contains 13 qualitative and quantitative predictors for 303 patients who sought medical attention due to chest pain. The response represents a binary classification scenario as we want to predict which patients have heart disease and which do not. You can download the dataset [here](https://github.com/jtablesaw/tablesaw/blob/master/data/Heart.csv).
As usual, you will need to add the smile, tablesaw-core, and tablesaw-jsplot dependencies (described in the [Moneyball tutorial](https://jtablesaw.github.io/tablesaw/userguide/ml/Moneyball%20Linear%20regression)).
First, load and clean the data. Qualitative features must be represented as integers to build the model.
Table data = Table.read().csv("Heart.csv");
//encode qualitative features as integers to prepare for Smile model.
data.replaceColumn("AHD", data.stringColumn("AHD").asDoubleColumn().asIntColumn());
data.replaceColumn("Thal", data.stringColumn("Thal").asDoubleColumn().asIntColumn());
data.replaceColumn("ChestPain", data.stringColumn("ChestPain").asDoubleColumn());
//Remove the index column, as it is not a feature
data.removeColumns("C0");
//Remove rows with missing values
data = data.dropRowsWithMissingValues();
//print out cleaned dataset for inspection
System.out.println(data.printAll());
Next, segment your dataset into two distinct tables. Rows are randomly assigned to the tables by the .sampleSplit() function.
//Split the data: 70% train, 30% test
Table[] splitData = data.sampleSplit(0.7);
Table dataTrain = splitData[0];
Table dataTest = splitData[1];
Now build an initial RandomForest model using the training dataset and sensible parameter values. m, the number of features considered for each tree, is set to the square root of the number of features. n, or ntrees, is set to 50, a reasonable starting point that will run quickly. d, or maxDepth, the maximum depth of any individual decision tree, is set to another reasonable value, 7. maxNodes, the maximum number of terminal decision-making nodes a tree can have, is set to 100. All of these values are hyperparameters and should be refined by the user to improve the accuracy of the model.
Smile contains two RandomForest classes: one for classification tasks and one for regression tasks. Make sure you import the correct class for your problem context; in this case, smile.classification.RandomForest.
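For example, the import for this tutorial's classification task is:
import smile.classification.RandomForest;
//a regression task would instead use:
//import smile.regression.RandomForest;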
//initial model with sensible parameters
RandomForest RFModel1 = smile.classification.RandomForest.fit(
Formula.lhs("AHD"),
dataTrain.smile().toDataFrame(),
50, //n
(int) Math.sqrt((double) (dataTrain.columnCount() - 1)), //m = sqrt(p)
SplitRule.GINI,
7, //d
100, //maxNodes
1, //nodeSize
1 //subsample ratio
);
View the first decision tree generated by the model.
System.out.println(RFModel1.trees()[0]);
Predict the response of the test dataset using your model and assess model accuracy.
//predict the response of test dataset with RFModel1
int[] predictions = RFModel1.predict(dataTest.smile().toDataFrame());
//evaluate % classification accuracy for RFModel1
double accuracy1 = Accuracy.of(dataTest.intColumn("AHD").asIntArray(), predictions);
System.out.println(accuracy1);
Generate and plot feature importance. According to the Smile documentation, the built-in .importance() function calculates variable importance as the total decrease in the Gini impurity criterion across all node splits on that variable, averaged across all trees in the Random Forest. In other words, the more the impurity declines when a node splits on a variable, the more important that variable is to the predictive power of the model.
//measure variable importance (mean decrease Gini Index)
double[] RFModel1_Importance = RFModel1.importance();
System.out.println(Arrays.toString(RFModel1_Importance));
//plot variable importance with tablesaw
Table varImportance = Table.create("featureImportance");
List<String> featureNames = dataTrain.columnNames();
featureNames.remove(13); //remove response
varImportance.addColumns(DoubleColumn.create("featureImportance", RFModel1_Importance), StringColumn.create("Feature", featureNames));
varImportance = varImportance.sortDescendingOn("featureImportance");
Plot.show(HorizontalBarPlot.create("Feature Importance", varImportance, "Feature", "featureImportance"));
As you can see, features Fbs and ExAng have the lowest importance in the model, while Age, MaxHR, RestBP, and Chol are all of high importance.
Another (lesser) concern when selecting features is including two features that are highly correlated with one another. At best, including both is redundant; at worst, it could degrade model performance.
Spearman’s correlation metric provides a measure of feature correlation and can be generated automatically with Tablesaw. (-1 represents an extreme negative correlation, +1 represents an extreme positive correlation)
Generate a matrix of Spearman’s correlation metrics between all features.
//construct correlation matrix
Table corr = Table.create("Spearman's Correlation Matrix");
corr.addColumns(StringColumn.create("Feature"));
for(String name: dataTest.columnNames().subList(0,12))
{
corr.addColumns(DoubleColumn.create(name));
}
for(int i = 0; i < 12; i++)
{
for(int j = 0; j < 12; j++)
{
corr.doubleColumn(i+1).append(
dataTrain.numericColumns(i).get(0).asDoubleColumn()
.spearmans(dataTrain.numericColumns(j).get(0).asDoubleColumn()));
}
}
corr.stringColumn("Feature").addAll(dataTrain.columnNames().subList(0,12));
for(Object ea: corr.columnsOfType(ColumnType.DOUBLE))
{
((NumberColumn) ea).setPrintFormatter(NumberColumnFormatter.fixedWithGrouping(2));
}
System.out.println(corr.printAll());
Output:
> Spearman's Correlation Matrix
Feature | Age | Sex | ChestPain | RestBP | Chol | Fbs | RestECG | MaxHR | ExAng | Oldpeak | Slope | Ca |
----------------------------------------------------------------------------------------------------------------------------------------------
Age | 1.00 | -0.08 | -0.18 | 0.34 | 0.14 | 0.08 | 0.16 | -0.41 | 0.10 | 0.23 | 0.16 | 0.34 |
Sex | -0.08 | 1.00 | -0.11 | -0.06 | -0.13 | 0.04 | 0.01 | -0.05 | 0.10 | 0.13 | 0.03 | 0.13 |
ChestPain | -0.18 | -0.11 | 1.00 | -0.15 | -0.02 | 0.04 | -0.11 | 0.29 | -0.33 | -0.33 | -0.28 | -0.20 |
RestBP | 0.34 | -0.06 | -0.15 | 1.00 | 0.11 | 0.12 | 0.15 | -0.09 | 0.06 | 0.20 | 0.10 | 0.07 |
Chol | 0.14 | -0.13 | -0.02 | 0.11 | 1.00 | 0.03 | 0.17 | -0.07 | 0.09 | 0.03 | -0.01 | 0.07 |
Fbs | 0.08 | 0.04 | 0.04 | 0.12 | 0.03 | 1.00 | 0.06 | -0.03 | 0.05 | 0.01 | -0.00 | 0.17 |
RestECG | 0.16 | 0.01 | -0.11 | 0.15 | 0.17 | 0.06 | 1.00 | -0.07 | 0.03 | 0.11 | 0.16 | 0.13 |
MaxHR | -0.41 | -0.05 | 0.29 | -0.09 | -0.07 | -0.03 | -0.07 | 1.00 | -0.46 | -0.41 | -0.40 | -0.26 |
ExAng | 0.10 | 0.10 | -0.33 | 0.06 | 0.09 | 0.05 | 0.03 | -0.46 | 1.00 | 0.29 | 0.27 | 0.22 |
Oldpeak | 0.23 | 0.13 | -0.33 | 0.20 | 0.03 | 0.01 | 0.11 | -0.41 | 0.29 | 1.00 | 0.60 | 0.30 |
Slope | 0.16 | 0.03 | -0.28 | 0.10 | -0.01 | -0.00 | 0.16 | -0.40 | 0.27 | 0.60 | 1.00 | 0.10 |
Ca | 0.34 | 0.13 | -0.20 | 0.07 | 0.07 | 0.17 | 0.13 | -0.26 | 0.22 | 0.30 | 0.10 | 1.00 |
Features Slope and Oldpeak have a moderate positive correlation of 0.6, the largest in the table. I will opt to leave both features in the dataset as their correlation is likely not strong enough to distort the model.
Based on the feature importance plot, I will cut Fbs from the feature space.
//cut variables
dataTest.removeColumns("Fbs");
dataTrain.removeColumns("Fbs");
Now, we can generate a second model using the selected features.
RandomForest RFModel2 = smile.classification.RandomForest.fit(
Formula.lhs("AHD"),
dataTrain.smile().toDataFrame(),
50, //n
(int) Math.sqrt((double) (dataTrain.columnCount() - 1)), //m = sqrt(p)
SplitRule.GINI,
7, //d
100, //maxNodes
1, //nodeSize
1 //subsample ratio
);
Now, let’s determine an appropriate number of trees to grow in the model. A good rule of thumb is to pick the smallest number of trees that achieves the minimum out-of-bag (OOB) error. To determine this, we can graph OOB error against ntrees.
//tuning ntrees
Table AccuracyvTrees = Table.create("OOB-Error vs nTrees");
AccuracyvTrees.addColumns(DoubleColumn.create("OOBerror"), DoubleColumn.create("ntrees"));
for(int j = 50; j < 2000; j = j+25)
{
RandomForest model = smile.classification.RandomForest.fit(
Formula.lhs("AHD"),
dataTrain.smile().toDataFrame(),
j,
(int) Math.sqrt((double) (dataTrain.columnCount() - 1)), //root p
SplitRule.GINI,
7,
100,
1,
1
);
double err = model.error();
AccuracyvTrees.doubleColumn(0).append(err);
AccuracyvTrees.doubleColumn(1).append(j);
}
Plot.show(LinePlot.create("OOB Error vs ntrees", AccuracyvTrees, "ntrees", "OOBerror"));
The out-of-bag error appears to settle down after ~1,000 trees. (Your plot may look slightly different due to randomness in splitting the dataset and in the model algorithm.) To be conservative, we can select 1,200 trees as the parameter for our final model.
We can now build and assess our final model using our new value for ntrees.
//model with graph-selected number of trees
RandomForest RFModelBest = smile.classification.RandomForest.fit(
Formula.lhs("AHD"),
dataTrain.smile().toDataFrame(),
1200,
(int) Math.sqrt((double) (dataTrain.columnCount() - 1)), //root p
SplitRule.GINI,
7,
100,
1, //nodeSize
1 //subsample ratio
);
int[] predictionsBest = RFModelBest.predict(dataTest.smile().toDataFrame());
double accuracyBest = Accuracy.of(dataTest.intColumn("AHD").asIntArray(), predictionsBest);
System.out.println(accuracyBest);
With the result:
> 0.8333333333333334
The classification accuracy of my final model was ~83%. So, using a relatively small dataset, the Random Forest algorithm is able to correctly predict whether or not a patient has heart disease ~83% of the time.
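Accuracy is a blunt instrument for a medical classifier, where false negatives and false positives carry different costs. As a sketch, most Smile releases provide a confusion matrix in their validation package (the class location varies slightly between Smile versions, so treat this as an assumption and check your version's javadoc):
//break the predictions down into true/false positives and negatives
ConfusionMatrix cm = ConfusionMatrix.of(dataTest.intColumn("AHD").asIntArray(), predictionsBest);
System.out.println(cm);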
8.3.4. Recap
We used Tablesaw to clean the Heart dataset and prepare it for the Random Forest algorithm. We generated a sensible starting model and plotted its Gini Index importance and Spearman’s correlation matrix to identify features to cut from the model. We then used out-of-bag error to identify a number of trees large enough to achieve maximum accuracy with limited computation time.
8.3.5. Extensions
While the classic method of splitting the dataset 70% train, 30% test works reasonably well, for smaller datasets your model performance metrics can vary noticeably because the training set is of limited size. For such datasets, validation procedures such as n-fold cross-validation and leave-one-out cross-validation may be more appropriate. Smile includes built-in functions to perform cross-validation; a manual fold loop is sketched below.
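As a sketch of the idea, here is a manual 5-fold loop built from Tablesaw selections (tech.tablesaw.selection.Selection) and the same fit parameters used above. It is deliberately library-agnostic because Smile's cross-validation API differs between versions; the modulo-based fold assignment is a simplification that assumes the rows are not ordered by class:
int k = 5;
double accuracySum = 0;
for (int fold = 0; fold < k; fold++) {
    //every k-th row, offset by the fold index, forms the hold-out fold
    List<Integer> holdOutRows = new ArrayList<>();
    for (int row = fold; row < data.rowCount(); row += k) {
        holdOutRows.add(row);
    }
    Selection holdOut = Selection.with(holdOutRows.stream().mapToInt(Integer::intValue).toArray());
    Table testFold = data.where(holdOut);
    Table trainFold = data.dropWhere(holdOut);
    RandomForest foldModel = smile.classification.RandomForest.fit(
        Formula.lhs("AHD"), trainFold.smile().toDataFrame(),
        1200, (int) Math.sqrt((double) (trainFold.columnCount() - 1)),
        SplitRule.GINI, 7, 100, 1, 1);
    int[] foldPredictions = foldModel.predict(testFold.smile().toDataFrame());
    accuracySum += Accuracy.of(testFold.intColumn("AHD").asIntArray(), foldPredictions);
}
System.out.println("mean cross-validated accuracy: " + (accuracySum / k));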
In addition, in this example we only tuned the ntrees hyperparameter; adjustments to mtry, maxDepth, maxNodes, and nodeSize could be considered as well.
8.4. Uber K-means
In previous tutorials, we have covered two popular supervised learning techniques, linear regression and random forest classification. These techniques take a predefined set of features and predict a response.
With K-Means, we introduce the notion of an unsupervised learning task, in this case with the desire to cluster observations into similar groups without the input-output structure of regression or classification.
Our reference text for this tutorial is Chapter 12 (Unsupervised Learning) of the widely used and freely available textbook [An Introduction to Statistical Learning, Second Edition](https://www.statlearning.com/). The dataset comes from the [FiveThirtyEight Repository](https://data.fivethirtyeight.com/), and additional tutorials covering this dataset can be found [here](https://towardsdatascience.com/how-does-uber-use-clustering-43b21e3e6b7d) and [here](https://www.linkedin.com/pulse/uber-trip-segmentation-using-k-means-clustering-khatre-csm-pmp/).
In K-Means clustering, every observation is assigned to a single cluster and a good cluster is one where the within-cluster variation is as low as possible.
8.4.1. K-Means Algorithm Intuition
1. Randomly assign each observation to a cluster.
2. Until the clusters stop changing:
a. Compute each cluster's centroid.
b. Reassign each observation to the cluster with the closest centroid.
Essentially, this is a computationally intensive optimization problem to which K-means provides a locally optimal solution. Because K-means reaches a local optimum via a greedy algorithm, its results may be heavily influenced by the random assignment in step 1. (In practice, Smile uses a variation of the algorithm known as [K-means++](https://en.wikipedia.org/wiki/K-means%2B%2B#Improved_initialization_algorithm) that strategically selects initial clusters to minimize this initialization bias.)
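To make those two steps concrete, here is a minimal, self-contained sketch of the naive algorithm. It is illustrative only; in practice you would call Smile's KMeans.fit, which adds the k-means++ seeding and many optimizations:
//naive K-means: random assignment, then alternate centroid/reassignment steps
static int[] naiveKMeans(double[][] points, int k, int maxIter) {
    java.util.Random rnd = new java.util.Random();
    int dims = points[0].length;
    int[] assignment = new int[points.length];
    for (int i = 0; i < points.length; i++) {
        assignment[i] = rnd.nextInt(k); //step 1: random initial clusters
    }
    for (int iter = 0; iter < maxIter; iter++) {
        //step 2a: compute each cluster's centroid
        double[][] centroids = new double[k][dims];
        int[] counts = new int[k];
        for (int i = 0; i < points.length; i++) {
            counts[assignment[i]]++;
            for (int d = 0; d < dims; d++) {
                centroids[assignment[i]][d] += points[i][d];
            }
        }
        for (int c = 0; c < k; c++) {
            for (int d = 0; d < dims; d++) {
                if (counts[c] > 0) centroids[c][d] /= counts[c];
            }
        }
        //step 2b: reassign each observation to the nearest centroid
        boolean changed = false;
        for (int i = 0; i < points.length; i++) {
            int best = assignment[i];
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < k; c++) {
                double dist = 0;
                for (int d = 0; d < dims; d++) {
                    double diff = points[i][d] - centroids[c][d];
                    dist += diff * diff;
                }
                if (dist < bestDist) {
                    bestDist = dist;
                    best = c;
                }
            }
            if (best != assignment[i]) {
                assignment[i] = best;
                changed = true;
            }
        }
        if (!changed) break; //the clusters stopped changing
    }
    return assignment;
}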
8.4.2. NYC Uber Data
Because the K-means algorithm clusters datapoints, we can use it to determine ideal locations for Uber drivers to idle between customer pickups. The data in uber-pickups-apr14.csv represents all Uber pickups in New York City during April 2014. To start, create a bounding box around the data to focus on Manhattan and the immediate vicinity. Download the data [here](https://github.com/jtablesaw/tablesaw/blob/8c03426f73cfa432ac563d5a6cb69bebdfea056a/data/uber-pickups-apr14.csv).
Table pickups = Table.read().csv("uber-pickups-apr14.csv");
pickups = pickups.dropWhere(pickups.doubleColumn("lat").isGreaterThan(40.91));
pickups = pickups.dropWhere(pickups.doubleColumn("lat").isLessThan(40.50));
pickups = pickups.dropWhere(pickups.doubleColumn("lon").isGreaterThan(-73.8));
pickups = pickups.dropWhere(pickups.doubleColumn("lon").isLessThan(-74.05));
Now, randomly select a subset of 100,000 records to work with. (This is purely to speed up plotting of the clusters).
pickups = pickups.sampleN(100000);
Format the existing Date/Time text column as two new columns: a DateTimeColumn and a TimeColumn.
List<String> dateTimes = pickups.textColumn("Date/Time").asList();
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("M/d/yyyy H:m");
DateTimeColumn dateTimesAsLocalDateTime = DateTimeColumn.create("PickupDateTime");
TimeColumn timeAsTimeColumn = TimeColumn.create("PickupTime");
for(String dt: dateTimes)
{
//parse each string once, then derive both the date-time and the time-of-day
LocalDateTime parsed = LocalDateTime.parse(dt, formatter);
dateTimesAsLocalDateTime.append(parsed);
timeAsTimeColumn.append(parsed.toLocalTime());
}
pickups = pickups.replaceColumn("Date/Time", dateTimesAsLocalDateTime);
pickups.addColumns(timeAsTimeColumn);
Print out a portion of your data to verify that it is in the correct format.
System.out.println(pickups);
> uber-pickups-apr14.csv
PickupDateTime | Lat | Lon | Base | PickupTime |
------------------------------------------------------------------------------
2014-04-01T00:28:00.000 | 40.7588 | -73.9776 | B02512 | 00:28:00 |
2014-04-01T00:33:00.000 | 40.7594 | -73.9722 | B02512 | 00:33:00 |
2014-04-01T01:19:00.000 | 40.7256 | -73.9869 | B02512 | 01:19:00 |
2014-04-01T01:49:00.000 | 40.7271 | -73.9803 | B02512 | 01:49:00 |
Train an initial K-means model with three clusters.
KMeans model = KMeans.fit(pickups.as().doubleMatrix(),3);
Table plot_data = pickups.copy();
plot_data.addColumns(IntColumn.create("cluster",model.y));
Plot.show(ScatterPlot.create("K=3", plot_data, "lon", "lat", "cluster"));
Your plot should look similar to the following. You can clearly see the three selected clusters represented by the color of each pickup location. (Notice that the data mirrors the shape of Manhattan and the surrounding area.)
A practical issue encountered when using the K-means algorithm is the choice of the number of clusters, k. A common approach is to create an "Elbow Curve": a plot of the distortion (the sum of squared distances from each point to its cluster centroid) against chosen values of k. Let’s create an Elbow Curve for each value of k from 2 through 9.
Table elbowTable = Table.create("Elbow", DoubleColumn.create("Distortion"), IntColumn.create("k"));
for(int k = 2; k < 10; k++)
{
KMeans model2 = KMeans.fit(pickups.as().doubleMatrix(),k);
elbowTable.doubleColumn("Distortion").append(model2.distortion);
elbowTable.intColumn("k").append(k);
}
Plot.show(LinePlot.create("Distortion vs K", elbowTable, "k", "Distortion"));
Your curve should look something like the image below. We are looking for a break in the curve at a value of k where the distortion flattens out. (Hence the name, Elbow Curve.)
Based on this curve, I will choose k=4. Generate a new model with k=4. This time, generate an additional plot showing the centroids of each region.
KMeans modelBest = KMeans.fit(pickups.as().doubleMatrix(),4);
Table plot_data_best = pickups.copy();
plot_data_best.addColumns(IntColumn.create("cluster",modelBest.y));
Plot.show(ScatterPlot.create("K=4", plot_data_best, "lon", "lat", "cluster"));
Table centTable = Table.create("Centroids",DoubleColumn.create("lat", modelBest.centroids.length), DoubleColumn.create("lon", modelBest.centroids.length));
for(int i = 0; i < modelBest.centroids.length; i++)
{
centTable.doubleColumn("lat").set(i, modelBest.centroids[i][0]);
centTable.doubleColumn("lon").set(i, modelBest.centroids[i][1]);
}
Plot.show(ScatterPlot.create("centroids", centTable, "lon", "lat"));
We now have a reasonable assessment of where idle Uber drivers should congregate as they wait for their next pickup: the centroids of our 4 regions.
So far in our analysis, we have sought to develop a general recommendation for where a driver should idle irrespective of the day of the week or time of the day. Now, let’s look at how day of week and time of day impact customer pickup requests. For brevity, I will use k=5 for all of these scenarios. In practice, one would ideally generate an Elbow curve for each subset of the data.
Late Night (11 pm-5 am)
//Late Night (11 pm-5 am)
Table lateNight = pickups.where(pickups.timeColumn("PickupTime").isAfter(LocalTime.of(23,0)).or(pickups.timeColumn("PickupTime").isBefore(LocalTime.of(5,0))));
KMeans modelLateNight = KMeans.fit(lateNight.as().doubleMatrix(),5);
Table plot_data_lateNight = lateNight.copy();
plot_data_lateNight.addColumns(IntColumn.create("cluster",modelLateNight.y));
Plot.show(ScatterPlot.create("Late Night, K=5", plot_data_lateNight, "lon", "lat", "cluster"));
Weekday Mornings and Evenings
Table weekdays = pickups.where(pickups.dateTimeColumn("PickupDateTime")
.isMonday()
.or(pickups.dateTimeColumn("PickupDateTime").isTuesday())
.or(pickups.dateTimeColumn("PickupDateTime").isWednesday())
.or(pickups.dateTimeColumn("PickupDateTime").isThursday()));
//Weekday Morning (M-Th, 6 am-10 am)
Table weekdayMorning = weekdays.where(weekdays.timeColumn("PickupTime").isAfter(LocalTime.of(6, 0))
.and(weekdays.timeColumn("PickupTime").isBefore(LocalTime.of(10,0))));
KMeans modelWeekdayMorning = KMeans.fit(weekdayMorning.as().doubleMatrix(),5);
Table plot_data_WeekdayMorning = weekdayMorning.copy();
plot_data_WeekdayMorning.addColumns(IntColumn.create("cluster",modelWeekdayMorning.y));
Plot.show(ScatterPlot.create("Weekday Morning, K=5", plot_data_WeekdayMorning, "lon", "lat", "cluster"));
//Weekday Evening (M-Th, 5 pm-10 pm)
Table weekdayEvening = weekdays.where(weekdays.timeColumn("PickupTime").isAfter(LocalTime.of(17, 0))
.and(weekdays.timeColumn("PickupTime").isBefore(LocalTime.of(22,0))));
KMeans modelWeekdayEvening = KMeans.fit(weekdayEvening.as().doubleMatrix(),5);
Table plot_data_WeekdayEvening = weekdayEvening.copy();
plot_data_WeekdayEvening.addColumns(IntColumn.create("cluster",modelWeekdayEvening.y));
Plot.show(ScatterPlot.create("Weekday Evening, K=5", plot_data_WeekdayEvening, "lon", "lat", "cluster"));
Weekends
//Weekend
Table weekend = pickups.where(pickups.dateTimeColumn("PickupDateTime")
.isSaturday()
.or(pickups.dateTimeColumn("PickupDateTime").isSunday()));
KMeans modelWeekend = KMeans.fit(weekend.as().doubleMatrix(),5);
Table plot_data_Weekend = weekend.copy();
plot_data_Weekend.addColumns(IntColumn.create("cluster",modelWeekend.y));
Plot.show(ScatterPlot.create("Weekend, K=5", plot_data_Weekend, "lon", "lat", "cluster"));
8.4.3. Conclusions
These clusters reveal some interesting trends. On weekday mornings and evenings, for example, the algorithm suggests that most drivers should await passengers in Lower Manhattan, consistent with weekday rush hour in the city’s financial district. Late night, however, is a very different story. The number of clusters in Manhattan drops from 3 to 2, and the largest concentration of demand shifts to upper Manhattan and tourist/leisure areas such as Times Square and 5th Avenue. There are certainly potential explanations beyond what has been offered here, but regardless of the underlying reasons, the K-means algorithm has effectively addressed our question of interest: where do Uber drivers need to be at a given time and day to efficiently serve demand for Ubers in NYC?
8.4.4. Extensions
Possible areas for further exploration:
-
Alternative time windows/k-values to search for other time-based variations.
-
Apply Smile’s extensions of K-means ([X-means](https://haifengl.github.io/clustering.html) and [G-means](https://haifengl.github.io/clustering.html)) that attempt to address the issue of determining k.
-
K-means can be thought of as a version of the classical facility location problem in the field of Operations Research where the candidate facility locations are undefined. One shortcoming of the K-means approach, however, is that it does not consider geographical constraints that make driving from one place to another more difficult than Euclidean distance would suggest. Use K-means to identify potential facility locations and then model this problem as the [uncapacitated facility location](https://haifengl.github.io/clustering.html) problem with additional constraints preventing pickups from one side of a waterway being served by a centroid on the other. Open source packages such as [Google OR-Tools](https://developers.google.com/optimization/) can be used to formulate and solve such problems in a Java environment.
9. Contribute
Join us in making Java a great platform for data science!
Tablesaw makes it easy to use Java for data science, but we can use your help. Here are some ways you can contribute today:
-
Ask or answer questions
-
Improve the documentation
-
Report or fix a bug
-
Implement tests for existing features
-
Implement new features and enhancements
-
Become a project maintainer
9.1. About questions
Please use the project [Issues section on Github](https://github.com/jtablesaw/tablesaw/issues) to ask questions, tagging the issue as a "Question".
9.2. Improve the documentation
All our docs are in the docs folder of the tablesaw repo, and they are all written in Markdown. To provide enhancements, follow the directions on how to implement features and bug fixes below, as the workflow is basically the same.
9.3. Tests
We could always use better test coverage. If you see a gap you’d like to fill, follow the basic Github process outlined below.
9.4. Github
If you want to help out with one of our bugs or implement a new feature, improve documentation or add tests, here are some suggestions for success. The basic process looks like this:
-
Fork Tablesaw
-
Clone your fork to your development machine
-
Create your issue branch
-
Implement your feature or fix the bug and push the results to your fork
-
Send us a pull request.
You can find more info about [creating a fork](https://help.github.com/articles/fork-a-repo/) and [pull requests](https://help.github.com/articles/creating-a-pull-request-from-a-fork/) in the GitHub help articles. Our maintainers will review your pull request as soon as we can. You can find our project on GitHub at https://github.com/jtablesaw/tablesaw.
9.4.1. What makes a good Pull Request?
To summarize:
-
Address only one issue with a single pull request.
-
For new functionality, have a conversation with one or more maintainers before putting in too much work. Opening an issue is a good way to start the conversation.
-
Requests with tests are the best
-
Finally, we love small pull requests. That way we can find time to review them between work, and family, and watching Game of Thrones.
9.4.2. "Help Wanted" labels
Among the open bugs and feature requests, we’ve tagged some as "Help Wanted". This indicates that a fix should be relatively straightforward to address. We would, of course, LOVE help on some of the more challenging issues. Help Wanted simply means that this is a good place for a developer with limited knowledge of the library and/or limited experience with Java to jump in.
9.4.3. Javascript visualization libraries
Implement an interface for a JavaScript plotting library. It’s a big job, we know, but the best JavaScript libraries are awesome. Email larry@tablesaw.tech if you have any questions, or use the comments on the issue.
9.4.4. Become a Maintainer/Committer
We are actively seeking new maintainers for the project. The best way to get involved is to familiarize yourself with the project through use, and make a few minor changes/fixes. After that, introduce yourself by emailing info@tablesaw.tech and we’ll get in touch.
10. License
Tablesaw is under the Apache 2.0 license.