Tabular Datasets

Tabular datasets in this context are directories of files (any variant of .csv or .json is accepted).

Initializing datasets

Let’s start by creating our own toy dataset (with one duplicate: 1.csv and 2.csv are identical).

import os

# Create a scratch directory for the toy CSV files
outdir = "toy_csv"
os.makedirs(outdir, exist_ok=True)  # don't fail if the directory already exists

csv1 = """
"Index", "Living Space (sq ft)", "Beds", "Baths"
 1, 2222, 3, 3.5
 2, 1628, 3, 2
 3, 3824, 5, 4
 4, 1137, 3, 2
 5, 3560, 6, 4
 6, 2893, 4, 3
 7, 3631, 4, 3
 8, 2483, 4, 3
 9, 2400, 4, 4
10, 1997, 3, 3
"""
csv2 = """
"Index", "Living Space (sq ft)", "Beds", "Baths"
 1, 2222, 3, 3.5
 2, 1628, 3, 2
 3, 3824, 5, 4
 4, 1137, 3, 2
 5, 3560, 6, 4
 6, 2893, 4, 3
 7, 3631, 4, 3
 8, 2483, 4, 3
 9, 2400, 4, 4
10, 1997, 3, 3
"""
csv3 = """
"Index", "Living Space (sq ft)", "Beds", "Baths"
11, 2222, 3, 3.5
12, 1628, 3, 2
13, 3824, 5, 4
14, 1137, 3, 2
15, 3560, 6, 4
16, 2893, 4, 3
17, 3631, 4, 3
18, 2483, 4, 3
19, 2400, 4, 4
20, 1997, 3, 3
"""
# Write the three CSVs to disk
for name, contents in (("1.csv", csv1), ("2.csv", csv2), ("3.csv", csv3)):
    with open(os.path.join(outdir, name), "w") as fd:
        fd.write(contents)
from kaishi.tabular.dataset import TabularDataset
td = TabularDataset(outdir)
td.file_report()
Current file list:
+-------+-----------+------------------------+--------+
| Index | File Name | Children               | Labels |
+-------+-----------+------------------------+--------+
|   0   |   1.csv   |     {'duplicates': []} |   []   |
|   1   |   3.csv   |     {'duplicates': []} |   []   |
|   2   |   2.csv   |     {'duplicates': []} |   []   |
+-------+-----------+------------------------+--------+
Filtered files:
+-----------+---------------+
| File Name | Filter Reason |
+-----------+---------------+
+-----------+---------------+

Interacting with datasets

There are several ways to interact with tabular datasets. Let’s start by looking at a detailed report.

td.report()
Dataframe 0
source: /Users/mwharton/Code/kaishi/notebooks/toy_csv/1.csv
====================================
NO DATA OR NOT LOADED (try running 'dataset.load_all()')


Dataframe 1
source: /Users/mwharton/Code/kaishi/notebooks/toy_csv/3.csv
====================================
NO DATA OR NOT LOADED (try running 'dataset.load_all()')


Dataframe 2
source: /Users/mwharton/Code/kaishi/notebooks/toy_csv/2.csv
====================================
NO DATA OR NOT LOADED (try running 'dataset.load_all()')

Our data weren’t loaded. Let’s fix that and try again. (Note: the stray quotes and leading spaces in the column names below come straight from the toy CSVs. Each header field after the first is separated by a comma followed by a space, so pandas sees values like ' "Beds"' and, by default, keeps both the leading space and the quotes.)

td.load_all()
td.report()
Dataframe 0
source: /Users/mwharton/Code/kaishi/notebooks/toy_csv/1.csv
====================================
4 columns: ['Index', ' "Living Space (sq ft)"', ' "Beds"', ' "Baths"']

---  Column 'Index'
count    10.00000
mean      5.50000
std       3.02765
min       1.00000
25%       3.25000
50%       5.50000
75%       7.75000
max      10.00000
Name: Index, dtype: float64

---  Column ' "Living Space (sq ft)"'
count      10.00000
mean     2577.50000
std       894.97725
min      1137.00000
25%      2053.25000
50%      2441.50000
75%      3393.25000
max      3824.00000
Name:  "Living Space (sq ft)", dtype: float64

---  Column ' "Beds"'
count    10.000000
mean      3.900000
std       0.994429
min       3.000000
25%       3.000000
50%       4.000000
75%       4.000000
max       6.000000
Name:  "Beds", dtype: float64

---  Column ' "Baths"'
count    10.000000
mean      3.150000
std       0.747217
min       2.000000
25%       3.000000
50%       3.000000
75%       3.875000
max       4.000000
Name:  "Baths", dtype: float64

***** Fraction of missing data in each column *****
Index: 0.0
 "Living Space (sq ft)": 0.0
 "Beds": 0.0
 "Baths": 0.0


Dataframe 1
source: /Users/mwharton/Code/kaishi/notebooks/toy_csv/3.csv
====================================
4 columns: ['Index', ' "Living Space (sq ft)"', ' "Beds"', ' "Baths"']

---  Column 'Index'
count    10.00000
mean     15.50000
std       3.02765
min      11.00000
25%      13.25000
50%      15.50000
75%      17.75000
max      20.00000
Name: Index, dtype: float64

---  Column ' "Living Space (sq ft)"'
count      10.00000
mean     2577.50000
std       894.97725
min      1137.00000
25%      2053.25000
50%      2441.50000
75%      3393.25000
max      3824.00000
Name:  "Living Space (sq ft)", dtype: float64

---  Column ' "Beds"'
count    10.000000
mean      3.900000
std       0.994429
min       3.000000
25%       3.000000
50%       4.000000
75%       4.000000
max       6.000000
Name:  "Beds", dtype: float64

---  Column ' "Baths"'
count    10.000000
mean      3.150000
std       0.747217
min       2.000000
25%       3.000000
50%       3.000000
75%       3.875000
max       4.000000
Name:  "Baths", dtype: float64

***** Fraction of missing data in each column *****
Index: 0.0
 "Living Space (sq ft)": 0.0
 "Beds": 0.0
 "Baths": 0.0


Dataframe 2
source: /Users/mwharton/Code/kaishi/notebooks/toy_csv/2.csv
====================================
4 columns: ['Index', ' "Living Space (sq ft)"', ' "Beds"', ' "Baths"']

---  Column 'Index'
count    10.00000
mean      5.50000
std       3.02765
min       1.00000
25%       3.25000
50%       5.50000
75%       7.75000
max      10.00000
Name: Index, dtype: float64

---  Column ' "Living Space (sq ft)"'
count      10.00000
mean     2577.50000
std       894.97725
min      1137.00000
25%      2053.25000
50%      2441.50000
75%      3393.25000
max      3824.00000
Name:  "Living Space (sq ft)", dtype: float64

---  Column ' "Beds"'
count    10.000000
mean      3.900000
std       0.994429
min       3.000000
25%       3.000000
50%       4.000000
75%       4.000000
max       6.000000
Name:  "Beds", dtype: float64

---  Column ' "Baths"'
count    10.000000
mean      3.150000
std       0.747217
min       2.000000
25%       3.000000
50%       3.000000
75%       3.875000
max       4.000000
Name:  "Baths", dtype: float64

***** Fraction of missing data in each column *****
Index: 0.0
 "Living Space (sq ft)": 0.0
 "Beds": 0.0
 "Baths": 0.0

To look at a specific file object, you can access it via either its index or its filename key.

td.files[0].df
Index "Living Space (sq ft)" "Beds" "Baths"
0 1 2222 3 3.5
1 2 1628 3 2.0
2 3 3824 5 4.0
3 4 1137 3 2.0
4 5 3560 6 4.0
5 6 2893 4 3.0
6 7 3631 4 3.0
7 8 2483 4 3.0
8 9 2400 4 4.0
9 10 1997 3 3.0
td["1.csv"].df
Index "Living Space (sq ft)" "Beds" "Baths"
0 1 2222 3 3.5
1 2 1628 3 2.0
2 3 3824 5 4.0
3 4 1137 3 2.0
4 5 3560 6 4.0
5 6 2893 4 3.0
6 7 3631 4 3.0
7 8 2483 4 3.0
8 9 2400 4 4.0
9 10 1997 3 3.0
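
You can also loop over all of the file objects at once. A minimal sketch (it assumes each file’s df attribute is populated after load_all(), as shown above):

# Iterate over every file object in the dataset and print its dataframe's shape
for i, f in enumerate(td.files):
    print(i, f.df.shape)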

Tabular data processing pipelines

Let’s see the pipeline options.

td.get_pipeline_options()
['FilterByLabel',
 'FilterByRegex',
 'FilterDuplicateFiles',
 'FilterDuplicateRowsAfterConcatenation',
 'FilterDuplicateRowsEachDataframe',
 'FilterInvalidFileExtensions',
 'FilterSubsample',
 'LabelerValidationAndTest',
 'AggregatorConcatenateDataframes']
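
Any of these names can be handed to configure_pipeline. For instance, a row-level variant of the pipeline we are about to build might look like the following sketch (hypothetical: FilterDuplicateRowsEachDataframe is taken from the list above, its exact behavior is assumed from its name, and some components may need extra configuration not shown here):

# Hypothetical pipeline: deduplicate rows within each dataframe, then concatenate
td.configure_pipeline(["FilterDuplicateRowsEachDataframe", "AggregatorConcatenateDataframes"])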

Now let’s configure our own pipeline and run it.

td.configure_pipeline(["FilterDuplicateFiles", "AggregatorConcatenateDataframes"])
print(td.pipeline)
td.run_pipeline()
Kaishi pipeline: 
0: FilterDuplicateFiles
1: AggregatorConcatenateDataframes

As expected, the duplicate file was filtered out.

td.file_report()
Current file list:
+-------+-----------+-----------------------------+--------+
| Index | File Name | Children                    | Labels |
+-------+-----------+-----------------------------+--------+
|   0   |   1.csv   |     {'duplicates': [2.csv]} |   []   |
|   1   |   3.csv   |     {'duplicates': []}      |   []   |
+-------+-----------+-----------------------------+--------+
Filtered files:
+-----------+---------------+
| File Name | Filter Reason |
+-----------+---------------+
|   2.csv   |   duplicates  |
+-----------+---------------+

But what about the concatenated dataframe? When Kaishi pipeline components create artifacts, they are added to the artifacts member of a dataset.

print(td.artifacts.keys())
dict_keys(['df_concatenated'])
td.artifacts["df_concatenated"]
Index "Living Space (sq ft)" "Beds" "Baths"
0 1 2222 3 3.5
1 2 1628 3 2.0
2 3 3824 5 4.0
3 4 1137 3 2.0
4 5 3560 6 4.0
5 6 2893 4 3.0
6 7 3631 4 3.0
7 8 2483 4 3.0
8 9 2400 4 4.0
9 10 1997 3 3.0
10 11 2222 3 3.5
11 12 1628 3 2.0
12 13 3824 5 4.0
13 14 1137 3 2.0
14 15 3560 6 4.0
15 16 2893 4 3.0
16 17 3631 4 3.0
17 18 2483 4 3.0
18 19 2400 4 4.0
19 20 1997 3 3.0
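
Since the artifact is displayed like an ordinary pandas DataFrame, the usual pandas tooling should apply to it directly, e.g. to persist the cleaned table (a sketch; the output filename is just an illustration):

# Assumes the artifact is a plain pandas DataFrame, as the display above suggests
td.artifacts["df_concatenated"].to_csv("toy_concatenated.csv", index=False)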

This provides the framework for manipulating your own tabular datasets and adding custom functionality, without the hassle of the boring, repetitive ETL steps.