Kaishi (開始)¶
Kaishi is a toolkit from KUNGFU.AI used to accelerate the initial phases of exploratory data analysis, as well as to enable rapid dataset preparation for downstream tasks.
More simply, Kaishi helps you automate steps to get from a raw dataset (in the form of a directory of files) to something that’s usable for machine learning (or any other task you may need a clean dataset for).
Examples of common operations include:
Filtering duplicate files
Standardizing image sizes
Detecting similar data that may not add value to machine learning tasks
Deduplication of table rows
Many more…
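As a quick sketch of what that looks like in practice (the directory name below is hypothetical; the same workflow is walked through in detail in the tutorials):
from kaishi.core.dataset import FileDataset

# Point Kaishi at a raw directory of files (directory name is illustrative)
fd = FileDataset("my_raw_dataset")

# Choose components, configure them as needed, then run the pipeline
fd.configure_pipeline(["FilterDuplicateFiles", "FilterSubsample"])
fd.pipeline.components[1].configure(N=100, seed=42)
fd.run_pipeline()

# Inspect what was kept and what was filtered
fd.file_report()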
API Reference¶
This page contains auto-generated API reference documentation, created with sphinx-autoapi.
kaishi¶
Subpackages¶
kaishi.core¶
Core module for Kaishi datasets.
This module contains classes, functions, etc. that are useful elsewhere in the project, as well as dataset definitions that are agnostic to the data type. In particular, the kaishi.core.dataset.FileDataset class can be used to operate on any directory of files, regardless of the data type it contains. Specific data type modules (e.g. kaishi.image) also inherit functionality from the core module.
Subpackages¶
kaishi.core.filters¶
Pipeline components that filter data points based on user-defined criteria.
kaishi.core.filters.by_label¶
Class definition for filter by label.

class kaishi.core.filters.by_label.FilterByLabel¶
Bases: kaishi.core.pipeline_component.PipelineComponent
Filter each element of a dataset by a specified label.

__call__(self, dataset)¶

configure(self, label_to_filter=None)¶
Specify the label to filter.
Parameters
label_to_filter (str) – data elements with this label will be filtered

kaishi.core.filters.by_regex¶
Class definition for filter by regex.

class kaishi.core.filters.by_regex.FilterByRegex¶
Bases: kaishi.core.pipeline_component.PipelineComponent
Filter data elements with a filename matching a specified regex.

__call__(self, dataset)¶

configure(self, pattern='/(?=a)b/')¶
Configure the regex pattern to match (default does not filter).
Parameters
pattern (str) – pattern to filter by
kaishi.core.filters.duplicate_files¶
Class definition for filtering duplicate files.

class kaishi.core.filters.duplicate_files.FilterDuplicateFiles¶
Bases: kaishi.core.pipeline_component.PipelineComponent
Filter duplicate files, detected via hashing.

__call__(self, dataset)¶

kaishi.core.filters.subsample¶
Class definition for subsampling filter.

class kaishi.core.filters.subsample.FilterSubsample¶
Bases: kaishi.core.pipeline_component.PipelineComponent
Filter by subsampling.

__call__(self, dataset)¶

configure(self, N=None, seed=None)¶
Configuration options for subsample filter.
Parameters
N (int) – number of data points to keep
seed (int) – random seed for reproducibility
kaishi.core.labelers¶
Pipeline components that label data points based on user-defined criteria.

kaishi.core.labelers.validation_and_test¶
Class definition for validation and test labeler.

class kaishi.core.labelers.validation_and_test.LabelerValidationAndTest¶
Bases: kaishi.core.pipeline_component.PipelineComponent
Assign validation and/or test data labels.

__call__(self, dataset)¶

configure(self, val_frac: float = 0.2, test_frac: float = 0.0, seed=None)¶
Configure the labeler (note: any nonlabeled data point is automatically assigned a “TRAIN” label).
Parameters
val_frac (float) – fraction of data points to assign “VALIDATE” label
test_frac (float) – fraction of data points to assign “TEST” label
seed (int) – random seed for reproducibility
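For example, a usage sketch via a dataset pipeline (directory name and split fractions are illustrative; the pattern mirrors the tutorials below):
from kaishi.core.dataset import FileDataset

fd = FileDataset("my_raw_dataset")  # hypothetical directory of files
fd.configure_pipeline(["LabelerValidationAndTest"])
fd.pipeline.components[0].configure(val_frac=0.2, test_frac=0.1, seed=42)  # 70/20/10 split
fd.run_pipeline()
fd.file_report()  # the Labels column now shows TRAIN/VALIDATE/TEST assignments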
Submodules¶
kaishi.core.dataset¶
Primary interface to the image tool kit.
kaishi.core.file¶
Class definition for reading/writing files of various types.
class kaishi.core.file.File(basedir: str, relpath: str, filename: str)¶
Class with common methods and members to work with files.

__repr__(self)¶

__str__(self)¶

compute_hash(self)¶
Compute the hash of the file.
Returns
hash value

has_label(self, label_to_check: str)¶
Check if file has a specific label.
Parameters
label_to_check (str) – label to look for
Returns
flag indicating if label is present in the file
Return type
bool

add_label(self, label_to_add: str)¶
Add a label to a file object.
Parameters
label_to_add (str) – label to append to the file’s labels

remove_label(self, label_to_remove: str)¶
Remove a label from a file object. If the label is not found, this method does nothing.
Parameters
label_to_remove (str) – label to remove from the file
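A small sketch of the label bookkeeping on a single file object (the constructor arguments are illustrative; in practice these objects are created for you when a dataset is initialized):
from kaishi.core.file import File

f = File("/tmp/data", "", "example.txt")  # hypothetical basedir/relpath/filename
f.add_label("TRAIN")
print(f.has_label("TRAIN"))  # True
f.remove_label("TRAIN")
f.remove_label("TRAIN")      # removing a label that is not present does nothing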
kaishi.core.file_group
¶Class definition for reading/writing files of various types.
-
class
kaishi.core.file_group.
FileGroup
(recursive: bool)¶ Class for reading and performing general operations on groups of files.
-
__getitem__
(self, key)¶ Get a specific file object.
-
load_dir
(self, source: str, file_initializer, recursive: bool)¶ Read file names in a directory
- Parameters
source (str) – Directory to load from
file_initializer (kaishi file initializer class (e.g.
kaishi.core.file.File
)) – Data file calss to initialize each file with
-
get_pipeline_options
(self)¶ Returns available pipeline options for this dataset.
- Returns
list of uninitialized pipeline component objects
- Return type
list
-
configure_pipeline
(self, choices: list = None, verbose: bool = False)¶ Configures the sequence of components in the data processing pipeline.
- Parameters
choices (list) – list of pipeline choices
verbose (bool) – flag to indicate verbosity
-
file_report
(self, max_file_entries=16, max_filter_entries=10)¶ Show a report of valid and invalid data.
- Parameters
max_file_entries (int) – max number of entries to print of file list
max_filter_entries (int) – max number of entries to print per filter category (e.g. duplicates, similar, etc.)
-
run_pipeline
(self, verbose: bool = False)¶ Run the pipeline as configured.
- Parameters
verbose (bool) – flag to indicate verbosity
-
kaishi.core.labels¶
Enumeration definition for labels.
kaishi.core.misc¶
Miscellaneous helper functions.

kaishi.core.misc.load_files_by_walk(dir_name_raw: str, file_initializer, recursive: bool = False)¶
Load files from a directory with an option to recurse.
Parameters
dir_name_raw (str) – directory to load file structure from
file_initializer (kaishi file initializer class (e.g. kaishi.core.file.File)) – data file class to initialize each file with
recursive (bool) – option to load recursively, defaults to False
Returns
canonical directory name, list of subdirectories, and list of initialized files
Return type
str, list, and list
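A hedged usage sketch (the directory name is hypothetical; the three return values follow the signature above):
from kaishi.core.file import File
from kaishi.core.misc import load_files_by_walk

basedir, subdirs, files = load_files_by_walk("my_raw_dataset", File, recursive=True)
print(basedir)     # canonical directory name
print(len(files))  # number of initialized File objects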
kaishi.core.misc.trim_list_by_inds(list_to_trim: list, indices: list)¶
Trim a list given an unordered list of indices.
Parameters
list_to_trim (list) – list to remove elements from
indices (list) – indices of list items to remove
Returns
new list, trimmed items
Return type
list, list

kaishi.core.misc.find_duplicate_inds(list_with_duplicates: list)¶
Find indices of duplicates in a list.
Parameters
list_with_duplicates (list) – list containing duplicate items
Returns
list of duplicate indices, list of unique items (parents of duplicates)
Return type
list and list
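Combined with trim_list_by_inds above, this is the kind of bookkeeping a duplicate filter needs; a toy sketch (values are illustrative, and the return contents are as documented above):
from kaishi.core.misc import find_duplicate_inds, trim_list_by_inds

hashes = ["aa11", "bb22", "aa11", "cc33"]  # e.g. file hash values, toy data
duplicate_inds, parents = find_duplicate_inds(hashes)
deduplicated, removed = trim_list_by_inds(hashes, duplicate_inds)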
kaishi.core.misc.find_similar_by_value(list_of_values: list, difference_threshold)¶
Find near duplicates based on similar reference value.
Parameters
list_of_values (list) – list of values to compare
difference_threshold – differences above this threshold will be identified for removal
Returns
list of similar indices, list of unique items (parents of similar items)
Return type
list and list

kaishi.core.misc.md5sum(filename: str)¶
Compute the md5sum of a file.
Parameters
filename (str) – name of file to compute hash of
Returns
hash value

class kaishi.core.misc.CollapseChildren¶
Bases: kaishi.core.pipeline_component.PipelineComponent
Restructure a potentially multi-layer file tree into a single parent/child layer.

__call__(self, dataset)¶

kaishi.core.misc.is_valid_label(label_str: str, label_enum)¶
Check if a label is contained in an enum.
Parameters
label_str (str) – string defining the label
Returns
flag indicating if label is valid
Return type
bool
kaishi.core.pipeline¶
Class definition for a pipeline object.

class kaishi.core.pipeline.Pipeline¶
Base class for a generic pipeline object.

__call__(self, dataset, verbose: bool = False)¶
Run the full pipeline as configured.
Parameters
dataset (initialized kaishi dataset class (e.g. kaishi.image.dataset.ImageDataset)) – dataset to perform pipeline operations on
verbose (bool) – flag to indicate verbosity

__repr__(self)¶
Print pipeline overview.

__str__(self)¶

_get_configs_for_component(self, initialized_component)¶
Get args and their values from an initialized component.
Parameters
initialized_component (initialized pipeline component (has to inherit from kaishi.core.pipeline_component.PipelineComponent)) – pipeline component to get configurable arguments for
Returns
dictionary with argument name keys and their values as contents
Return type
dict

add_component(self, component)¶
Add a method to be called as a pipeline step.
Parameters
component – component to add to the pipeline

remove_component(self, index)¶
Remove a pipeline method by index.
Parameters
index (int) – index to remove

reset(self)¶
Reset the pipeline by removing all components.
kaishi.core.pipeline_component¶
Class definition for pipeline components.

class kaishi.core.pipeline_component.PipelineComponent¶
Base class for pipeline components.

__str__(self)¶

__repr__(self)¶

configure(self)¶
Method to configure via named arguments. Defaults to no configurations, unless inherited and overridden.

applies_to(self, target_criteria)¶
Limit data files that the component applies to via regex.
Parameters
target_criteria – list or string containing a regex to denote files that this component applies to

get_target_indexes(self, dataset)¶
Get target indexes of a dataset based on criteria set using the applies_to() method.
Parameters
dataset (initialized kaishi dataset object (e.g. kaishi.core.dataset.FileDataset)) – dataset to inspect
Returns
list of indexes
Return type
list of int
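For example, mirroring the custom component tutorial below (imd is an initialized image dataset with a configured pipeline):
imd.pipeline.components[0].applies_to("image_01.*jpg")   # limit the component to matching relative paths
target_indexes = imd.pipeline.components[0].get_target_indexes(imd)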
_is_valid_target_int(self, target)¶
Check if an index is a valid integer type.
Parameters
target (int (or similar, e.g. np.int32)) – target to verify
Returns
flag indicating if the target is a valid integer type
Return type
bool

_is_valid_target_str(self, target)¶
Check if an index is a valid string type.
Parameters
target (str (or similar)) – target to verify
Returns
flag indicating if the target is a valid string type
Return type
bool
kaishi.core.printing¶
Definitions for print helper utilities.

kaishi.core.printing.should_print_row(i: int, max_entries: int, num_entries: int)¶
Decide whether or not to print a row based on max_entries.
Parameters
i (int) – index of row
max_entries (int) – max number of entries for the table
num_entries (int) – number of possible entries (the full list)
Returns
0 if the row should not be printed, 1 if it should be printed, 2 if an ellipsis (“…”) should be printed
Return type
int
kaishi.image¶
Kaishi framework for image datasets.
Subpackages¶
kaishi.image.filters¶
Pipeline components that filter data points based on user-defined criteria, specifically for image datasets.
kaishi.image.filters.invalid_file_extensions¶
Class definition for filtering by invalid image file extensions.

kaishi.image.filters.invalid_file_extensions.VALID_EXT = ['.bmp', '.dib', '.jpeg', '.jpg', '.jpe', '.jp2', '.png', '.pbm', '.pgm', '.ppm', '.sr', '.ras', '.tiff', '.tif']¶

class kaishi.image.filters.invalid_file_extensions.FilterInvalidFileExtensions¶
Bases: kaishi.core.pipeline_component.PipelineComponent
Filter files without a valid image file extension, where valid extensions are defined in the configure method.

__call__(self, dataset)¶
Perform the filter operation on dataset.
Parameters
dataset (kaishi.image.dataset.ImageDataset) – dataset to perform filter operation on

configure(self, valid_extensions=VALID_EXT)¶
Configure filter with the valid extensions defined.
Parameters
valid_extensions (list[str]) – list of valid extensions (each should start with “.”)
kaishi.image.filters.invalid_image_headers¶
Class definition for filtering files with invalid image headers.

class kaishi.image.filters.invalid_image_headers.FilterInvalidImageHeaders¶
Bases: kaishi.core.pipeline_component.PipelineComponent
Filter image files that have invalid or nonexistent headers.

__call__(self, dataset)¶
Perform filter operation on a kaishi image dataset.
Parameters
dataset (kaishi.image.dataset.ImageDataset) – image dataset to perform filter operation on

kaishi.image.filters.similar¶
Class definition for filtering similar images in a dataset.

class kaishi.image.filters.similar.FilterSimilar¶
Bases: kaishi.core.pipeline_component.PipelineComponent
Filter near duplicate files, detected via perceptual hashing (using the imagehash library).

__call__(self, dataset)¶
Perform filter operation on a specified dataset.
Parameters
dataset (kaishi.image.dataset.ImageDataset) – dataset to perform operation on

configure(self, perceptual_hash_threshold=3)¶
Configure the filter with a perceptual hash threshold.
Parameters
perceptual_hash_threshold (int or float) – threshold for determining whether or not images are similar (differences greater than the threshold are deemed not similar)
kaishi.image.labelers¶
Pipeline components that label data points based on user-defined criteria, specifically for image datasets.
kaishi.image.labelers.generic_convnet¶
Class definition for generic convnet labeler.

class kaishi.image.labelers.generic_convnet.LabelerGenericConvnet¶
Bases: kaishi.core.pipeline_component.PipelineComponent
Use a pre-trained ConvNet to predict image labels (e.g. stretched, rotated, etc.).
This labeler uses a default configured kaishi.image.model.Model object, where the output layer is assumed to have 6 values ranging from 0 to 1 and the labels are [DOCUMENT, RECTIFIED, ROTATED_RIGHT, ROTATED_LEFT, UPSIDE_DOWN, STRETCHED].

__call__(self, dataset)¶
Perform the labeling operation on an image dataset.
Parameters
dataset (kaishi.image.dataset.ImageDataset) – kaishi image dataset
kaishi.image.transforms¶
Pipeline components that transform data points based on user-defined criteria, specifically for image datasets.
kaishi.image.transforms.fix_rotation¶
Class definition for fixing image rotation.

class kaishi.image.transforms.fix_rotation.TransformFixRotation¶
Bases: kaishi.core.pipeline_component.PipelineComponent
Fix rotations of each image in a dataset given pre-determined labels (uses the default convnet for labels).

__call__(self, dataset)¶
Perform the transformation operation on an image dataset.
Parameters
dataset (kaishi.image.dataset.ImageDataset) – image dataset to perform operation on
kaishi.image.transforms.limit_dimensions¶
Class definition for limiting the max dimension of each image in an image dataset.

class kaishi.image.transforms.limit_dimensions.TransformLimitDimensions¶
Bases: kaishi.core.pipeline_component.PipelineComponent
Transform to limit the max dimension of each image in a dataset.

__call__(self, dataset)¶
Perform operation on a specified dataset.
Parameters
dataset (kaishi.image.dataset.ImageDataset) – image dataset to perform operation on

configure(self, max_dimension=None, max_width=None, max_height=None)¶
Configure the component. Any combination of these parameters can be defined or left undefined; in each case the smallest specified max value is the one that takes effect (e.g. if max_width is 300 but max_dimension is 200, the maximum width is effectively 200).
Parameters
max_dimension (int) – maximum dimension for each image (either width or height)
max_width (int) – maximum width for each image
max_height (int) – maximum height for each image
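A sketch of the interaction described above (imd is an image dataset configured as in the tutorials; the values are illustrative):
imd.configure_pipeline(["TransformLimitDimensions"])
# max_width=300 combined with max_dimension=200: the effective width limit is 200
imd.pipeline.components[0].configure(max_dimension=200, max_width=300)
imd.run_pipeline()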
kaishi.image.transforms.to_grayscale¶
Class definition for transforming images to grayscale.

class kaishi.image.transforms.to_grayscale.TransformToGrayscale¶
Bases: kaishi.core.pipeline_component.PipelineComponent
Transform images in a dataset to grayscale.

__call__(self, dataset)¶
Perform operation on a given dataset.
Parameters
dataset (kaishi.image.dataset.ImageDataset) – image dataset with images to convert
Submodules¶
kaishi.image.dataset¶
Definition for kaishi image datasets.
kaishi.image.file¶
Definition for image files.

kaishi.image.file.THUMBNAIL_SIZE = [64, 64]¶
kaishi.image.file.MAX_DIM_FOR_SMALL = 224¶
kaishi.image.file.PATCH_SIZE = [64, 64]¶
kaishi.image.file.RESAMPLE_METHOD¶
class kaishi.image.file.ImageFile(basedir: str, relpath: str, filename: str)¶
Bases: kaishi.core.file.File
Image file object that inherits from the core file class and adds image-specific attributes and methods.

verify_loaded(self)¶
Verify that the image and its derivatives are loaded (only performs the load if the image is unloaded).

update_derived_images(self)¶
Update images derived from the base image (i.e. thumbnail, small version, and random patch).

rotate(self, ccw_rotation_degrees: int)¶
Rotate all instances of the image by ‘ccw_rotation_degrees’.
Parameters
ccw_rotation_degrees (int) – degrees to rotate the image by

limit_dimensions(self, max_width: int = None, max_height: int = None, max_dimension: int = None)¶
Limit the max dimension of the image and resize accordingly. Any combination of these arguments can be defined; however, if none are defined, this method does nothing.
Parameters
max_width (int) – maximum width of the image
max_height (int) – maximum height of the image
max_dimension (int) – maximum width or height (applies to both)

convert_to_grayscale(self)¶
Convert the image to grayscale.

compute_perceptual_hash(self, hashfunc=imagehash.average_hash)¶
Calculate the perceptual hash (close in value for similar images).
Parameters
hashfunc (function) – function object to be used to calculate the hash value (defaults to imagehash.average_hash)
Returns
hash value (as computed by hashfunc)
kaishi.image.file_group¶
Definition for groups of image files.

kaishi.image.file_group.THUMBNAIL_SIZE = [64, 64]¶
kaishi.image.file_group.MAX_DIM_FOR_SMALL = 224¶
kaishi.image.file_group.PATCH_SIZE = [64, 64]¶
kaishi.image.file_group.RESAMPLE_METHOD¶

class kaishi.image.file_group.ImageFileGroup(source: str, recursive: bool)¶
Bases: kaishi.core.file_group.FileGroup
Group of image files that inherits from the core file group class.

load_all(self)¶
Load all files in the directory that this class was initialized with.

build_numpy_batches(self, channels_first: bool = True, batch_size: int = None, image_type: str = 'small_image')¶
Build a tensor from the entire image corpus (or generate batches if specified).
If a batch size is specified, this acts as a generator of batches and returns a list of file objects to manipulate. Otherwise, a single batch of all images is returned in an array format.
Parameters
channels_first (bool) – flag indicating channels first (e.g. PyTorch) vs. channels last (e.g. Keras)
batch_size (int) – size of each batch (default is None, which will return a single batch)
image_type (str) – choice of “small_image”, “thumbnail”, or “patch”, indicating which version of each image to use
Returns
batch of images (generator if batch size specified)
Return type
numpy.array
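A hedged sketch of both modes (imd is an initialized image dataset as in the tutorials; the exact per-iteration contents in generator mode are as described above):
# One array containing the entire (loaded) image corpus
all_images = imd.build_numpy_batches(channels_first=True, image_type="small_image")

# Passing batch_size instead returns a generator of fixed-size batches
batches = imd.build_numpy_batches(batch_size=32, image_type="thumbnail")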
save(self, out_dir: str)¶
Save image data set in the current structure while preserving any changes.
Parameters
out_dir (str) – output directory

run_pipeline(self, verbose: bool = False)¶
Run the pipeline as configured.
Parameters
verbose (bool) – flag indicating verbosity

report(self)¶
Run a descriptive report (currently just prints the file report).
kaishi.image.generator¶
Data generator for image datasets.

kaishi.image.generator.augment_and_label(imobj)¶
Augment an image with common issues and return the modified image + label vector.
Labels at the output layer (probabilities, no softmax): [DOCUMENT, RECTIFIED, ROTATED_RIGHT, ROTATED_LEFT, UPSIDE_DOWN, STRETCHING]
Parameters
imobj (kaishi.image.file.ImageFile) – image object to randomly augment and label
Returns
augmented image and label vector applied
Return type
kaishi.image.file.ImageFile and numpy.array

kaishi.image.generator.train_generator(self, batch_size: int = 32, string_to_match: str = None)¶
Generator for training the data labeler. Operates on a kaishi.image.dataset.ImageDataset object.
Parameters
self (kaishi.image.dataset.ImageDataset) – image dataset
batch_size (int) – batch size for generated data
string_to_match (str) – string to match (ignores files without this string in the relative path)
Returns
batch arrays and label vectors
Return type
numpy.array and list

kaishi.image.generator.generate_validation_data(self, n_examples: int = 400, string_to_match: str = None)¶
Generate a reproducibly random validation data set.
Parameters
n_examples (int) – number of examples in the validation set
string_to_match (str) – string to match (ignores files without this string in the relative path)
Returns
stacked training examples (first dimension is batch) and stacked labels
Return type
numpy.array and numpy.array
kaishi.image.model¶
Definition for PyTorch model abstraction.

class kaishi.image.model.Model(n_classes: int = 6, model_arch: str = 'resnet18')¶
Abstraction for working with PyTorch models.

vgg16_bn(self, n_classes: int)¶
Basic VGG16 model with variable number of output classes.
Parameters
n_classes (int) – number of classes at output layer
Returns
PyTorch VGG16 model object with batch normalization
Return type
torchvision.models.vgg16_bn

resnet18(self, n_classes: int)¶
Basic ResNet18 model with specified number of output classes.
Parameters
n_classes (int) – number of classes at the output layer
Returns
PyTorch ResNet18 model object
Return type
torchvision.models.resnet18

resnet50(self, n_classes: int)¶
Basic ResNet50 model with specified number of output classes.
Parameters
n_classes (int) – number of classes at the output layer
Returns
PyTorch ResNet50 model object
Return type
torchvision.models.resnet50

predict(self, numpy_array)¶
Make predictions from a numpy array, where dimensions are (batch, channel, x, y).
Parameters
numpy_array (numpy.array) – input array to predict
Returns
predictions, where the dimensions are (batch, output)
Return type
numpy.array
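A sketch of driving the model abstraction with a batch built from an image dataset (imd as in the tutorials; note that prediction is slow without a GPU, as the dataset warns at load time):
from kaishi.image.model import Model

model = Model(n_classes=6, model_arch="resnet18")
batch = imd.build_numpy_batches(channels_first=True, image_type="small_image")
predictions = model.predict(batch)  # dimensions are (batch, output)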
kaishi.image.ops¶
Definitions for common operations on images.

kaishi.image.ops.extract_patch(im, patch_size)¶
Extract a center-cropped patch of size ‘patch_size’ (2-element tuple).
Parameters
im (PIL image object) – input image
patch_size (tuple, array, or similar) – size of patch
Returns
center-cropped patch
Return type
PIL image object

kaishi.image.ops.make_small(im, max_dim: int = 224, resample_method=Image.NEAREST)¶
Make a small version of an image while maintaining aspect ratio.
Parameters
im (PIL image object) – input image
max_dim (int) – maximum dimension of resized image (x or y)
resample_method (PIL resample method) – method for resampling the image
Returns
resized image
Return type
PIL image object

kaishi.image.ops.add_jpeg_compression(im, quality_level: int = 30)¶
Apply JPEG compression to an image with a given quality level.
Parameters
im (PIL image object) – input image
quality_level (int) – JPEG quality level, where 0 < value <= 100
Returns
compressed image
Return type
PIL image object

kaishi.image.ops.add_rotation(im, ccw_rotation_degrees: int = 90)¶
Rotate an image CCW by ccw_rotation_degrees degrees.
Parameters
im (PIL image object) – input image
ccw_rotation_degrees (int) – number of degrees to rotate counter-clockwise
Returns
rotated image
Return type
PIL image object

kaishi.image.ops.add_stretching(im, w_percent_additional, h_percent_additional)¶
Stretch an image by the specified percentages.
Parameters
im (PIL image object) – input image
w_percent_additional (int or float greater than 0) – amount of width stretching to add (0 maintains the same size, 100 doubles the size)
h_percent_additional (int or float greater than 0) – amount of height stretching to add (0 maintains the same size, 100 doubles the size)
Returns
stretched image
Return type
PIL image object

kaishi.image.ops.add_poisson_noise(im, param: float = 1.0, rescale: bool = True)¶
Add Poisson noise to an image, where (poisson noise * param) is the final noise function.
See http://kmdouglass.github.io/posts/modeling-noise-for-image-simulations for more info. If rescale is set to True, the image will be rescaled after noise is added. Otherwise, the noise will saturate.
Parameters
im (PIL image object) – input image
param (float) – noise parameter
rescale (bool) – flag indicating whether or not to rescale the image after adding noise (maintaining original image extrema)
Returns
image with Poisson noise added
Return type
PIL image object
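These operations compose naturally on PIL images; a short sketch (the file name is illustrative):
from PIL import Image
from kaishi.image.ops import make_small, extract_patch, add_rotation, add_jpeg_compression

im = Image.open("image_0146.jpg")        # hypothetical input image
small = make_small(im, max_dim=224)      # resize while maintaining aspect ratio
patch = extract_patch(small, (64, 64))   # center-cropped patch
rotated = add_rotation(small, ccw_rotation_degrees=90)
degraded = add_jpeg_compression(small, quality_level=30)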
kaishi.image.util¶
Image utilities and helper functions.

kaishi.image.util.swap_channel_dimension(tensor)¶
Swap between channels_first and channels_last.
If ‘tensor’ has 4 elements, it’s assumed to be the shape vector. Otherwise, it’s assumed to be the actual tensor. Returns the edited shape vector or tensor.
Parameters
tensor (numpy.array) – shape vector or tensor to have channel dimensions swapped
Returns
altered input with the channel dimensions swapped
Return type
numpy.array

kaishi.image.util.validate_image_header(filename: str)¶
Validate that an image has a valid header.
Returns True if valid, False if invalid.
Parameters
filename (str) – name of file to analyze
Returns
flag indicating whether the header is valid (determined using imghdr.what())
Return type
bool

kaishi.image.util.get_batch_dimensions(self, batch_size: int, channels_first: bool = True, image_type: str = 'small_image')¶
Get dimensions of the batch tensor. Note that the ‘batch_size’ argument can be the full data set.
Parameters
self (kaishi.image.dataset.ImageDataset) – image dataset object
batch_size (int) – batch size
channels_first (bool) – flag indicating whether channels are the first or last dimension in each image
image_type (str) – one of “small_image”, “thumbnail”, or “patch”
Returns
batch dimensions (4D tuple)
Return type
tuple
kaishi.tabular¶
Kaishi framework for tabular datasets.
Subpackages¶
kaishi.tabular.aggregators¶
Pipeline components that aggregate data based on user-defined criteria, specifically for tabular datasets.
kaishi.tabular.aggregators.concatenate_dataframes¶
Class definition for concatenating tabular data files.

class kaishi.tabular.aggregators.concatenate_dataframes.AggregatorConcatenateDataframes¶
Bases: kaishi.core.pipeline_component.PipelineComponent
Concatenate all data frames.

__call__(self, dataset)¶
Perform concatenation on a given dataset (all files must have the same schema).
Parameters
dataset (kaishi.tabular.dataset.TabularDataset) – tabular dataset to perform operation on
kaishi.tabular.filters¶
Pipeline components that filter data points based on user-defined criteria, specifically for tabular datasets.
kaishi.tabular.filters.duplicate_rows_after_concatenation¶
Class definition for filtering duplicate rows after concatenation.

class kaishi.tabular.filters.duplicate_rows_after_concatenation.FilterDuplicateRowsAfterConcatenation¶
Bases: kaishi.core.pipeline_component.PipelineComponent
Filter duplicate rows in the concatenated dataframe (dataset will be concatenated if it hasn’t been already).

__call__(self, dataset)¶
Perform filter on a given tabular dataset.
Parameters
dataset (kaishi.tabular.dataset.TabularDataset) – tabular dataset to perform operation on

kaishi.tabular.filters.duplicate_rows_each_dataframe¶
Class definition for filtering duplicate rows in each dataframe.

class kaishi.tabular.filters.duplicate_rows_each_dataframe.FilterDuplicateRowsEachDataframe¶
Bases: kaishi.core.pipeline_component.PipelineComponent
Filter duplicate rows in each dataframe of a tabular dataset.

__call__(self, dataset)¶
Perform the filter operation on a given tabular dataset.
Parameters
dataset (kaishi.tabular.dataset.TabularDataset) – dataset to perform operation on
kaishi.tabular.filters.invalid_file_extensions¶
Class definition for filtering invalid tabular file extensions.

kaishi.tabular.filters.invalid_file_extensions.VALID_EXT = ['.json', '.jsonl', '.json.gz', '.jsonl.gz', '.csv', '.csv.gz']¶

class kaishi.tabular.filters.invalid_file_extensions.FilterInvalidFileExtensions¶
Bases: kaishi.core.pipeline_component.PipelineComponent
Filter file list if non-tabular extensions exist.

__call__(self, dataset)¶
Perform operation on a tabular dataset.
Parameters
dataset (kaishi.tabular.dataset.TabularDataset) – dataset to perform file extension filter on

configure(self, valid_extensions=VALID_EXT)¶
Configure the file extension filter (default list defined in VALID_EXT).
Parameters
valid_extensions (list[str]) – list of file extensions that are valid (each must begin with “.”)
Submodules¶
kaishi.tabular.dataset¶
Class definition for data exploration utilities for tabular data (csv files, database tables).
kaishi.tabular.file¶
Class definition for tabular data files.

class kaishi.tabular.file.TabularFile(basedir: str, relpath: str, filename: str)¶
Bases: kaishi.core.file.File
Class for tabular data file-specific attributes and methods.

_has_csv_file_ext(self)¶
Check if the file is a variant of .csv.
Returns
flag indicating if the file extension is valid
Return type
bool

_has_json_file_ext(self)¶
Check if the file is a variant of .json.
Returns
flag indicating if the file extension is valid
Return type
bool

verify_loaded(self)¶
Load the file if supported.

get_summary(self)¶
Create a summary for this data frame.
Returns
summary dictionary containing common analyses
Return type
dict
kaishi.tabular.file_group¶
Class definition for a group of tabular files.

class kaishi.tabular.file_group.TabularFileGroup(source: str, recursive: bool, use_predefined_pipeline: bool = False, out_dir: str = None)¶
Bases: kaishi.core.file_group.FileGroup
Object containing groups of kaishi.tabular.file.TabularFile objects, with methods to perform common operations on them.

_get_indexes_with_valid_dataframe(self)¶
Get a list of indexes with valid dataframes.
Returns
indexes with valid dataframes
Return type
list

_get_valid_dataframes(self)¶
Get a list of valid dataframe objects.
Returns
valid dataframes
Return type
list[pandas.core.frame.DataFrame]

save(self, out_dir: str, file_format: str = 'csv')¶
Save the processed dataset as individual files or as one file with all the data.
Parameters
out_dir (str) – the path of the output directory. If the directory does not exist, it will be created.
file_format (str) – the format of output files. Currently only supports “csv”.

load_all(self)¶
Load all files from the source directory.

run_pipeline(self, verbose: bool = False)¶
Run the pipeline as configured.
Parameters
verbose (bool) – flag indicating verbosity

report(self)¶
Print a report of the dataset in its current state.
User Guide¶
Installation¶
Kaishi is hosted on PyPI. To install, simply run:
pip install kaishi
or, alternatively, if you’re a developer/contributor, clone the source from GitHub and then run:
pip install -r requirements.txt
pip install .
Pipeline Components¶
Pipeline components are broken up into several distinct categories, and the classes that define them MUST begin with the correct keyword:
Filter* - removes elements of a dataset
Transform* - changes one or more data objects in some fundamental way
Labeler* - creates labels for data objects without modifying the underlying data
Aggregator* - combines data objects in some way
Look at how other pipeline components are implemented. Feel free to write your own, while following the below rules:
Inherits from the PipelineComponent class
Has no initialization arguments
Has a single __call__ method with a single argument (a dataset object)
If specific configuration is needed, a method named self.configure(...) must be written with named arguments with defaults. self.configure() must be called as part of the __init__(...) call for configuration to work.
self.applies_to_available = True must be set in the __init__(...) call if the component takes advantage of the self.applies_to() and self.get_target_indexes() methods from kaishi.core.pipeline_component.PipelineComponent
If an artifact (i.e. some result) is created from the operation, as in the case of Aggregators, the artifact should be added to the dataset.artifacts dictionary (automatically initialized with any new dataset)
A minimal skeleton that follows these rules is sketched below.
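A minimal skeleton (the class name, configuration argument, and the body of __call__ are placeholders):
from kaishi.core.pipeline_component import PipelineComponent

class TransformExample(PipelineComponent):
    """Skeleton component that follows the rules above."""

    def __init__(self):
        super().__init__()
        self.configure()                  # required for configuration to work
        self.applies_to_available = True  # only needed if applies_to()/get_target_indexes() are used

    def __call__(self, dataset):
        for i in self.get_target_indexes(dataset):
            pass  # operate on dataset.files[i] here

    def configure(self, some_option=None):
        self.some_option = some_option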
You can then enable usage of the component with your instantiated dataset object, e.g.:
from your_definition import YourNewComponent
imd.YourNewComponent = YourNewComponent
imd.configure_pipeline(['YourNewComponent'])
Tutorials¶
Image Datasets¶
Image datasets in this context are directories of image files, and Kaishi has a lot of built-in functionality for interacting with them. While kaishi has many built-in pipeline components that operate on image datasets, a lot of the standard ETL is also handled for you in case you want to add your own custom code (without the ETL hassle).
Initializing datasets¶
Let’s start by downloading a sample dataset to work with. You will need the wget Python package installed unless you are using your own directory of files.
import wget
import pickle
from PIL import Image
import tarfile
import os
wget.download("http://www.vision.caltech.edu/Image_Datasets/Caltech101/101_ObjectCategories.tar.gz")
tarfile.open("101_ObjectCategories.tar.gz").extractall()
os.remove("101_ObjectCategories.tar.gz")
First, initialize a kaishi image dataset object and print a descriptive report of files.
from kaishi.image.dataset import ImageDataset
imd = ImageDataset("101_objectCategories", recursive=True)
imd.file_report()
Current file list:
+-------+--------------------------+---------------------------------------+--------+
| Index | File Name | Children | Labels |
+-------+--------------------------+---------------------------------------+--------+
| 0 | gerenuk/image_0032.jpg | {'duplicates': [], 'similar': []} | [] |
| 1 | gerenuk/image_0026.jpg | {'duplicates': [], 'similar': []} | [] |
| 2 | gerenuk/image_0027.jpg | {'duplicates': [], 'similar': []} | [] |
| 3 | gerenuk/image_0033.jpg | {'duplicates': [], 'similar': []} | [] |
| 4 | gerenuk/image_0019.jpg | {'duplicates': [], 'similar': []} | [] |
| 5 | gerenuk/image_0025.jpg | {'duplicates': [], 'similar': []} | [] |
| 6 | gerenuk/image_0031.jpg | {'duplicates': [], 'similar': []} | [] |
| 7 | gerenuk/image_0030.jpg | {'duplicates': [], 'similar': []} | [] |
| ... | | | |
| 9137 | metronome/image_0015.jpg | {'duplicates': [], 'similar': []} | [] |
| 9138 | metronome/image_0029.jpg | {'duplicates': [], 'similar': []} | [] |
| 9139 | metronome/image_0028.jpg | {'duplicates': [], 'similar': []} | [] |
| 9140 | metronome/image_0014.jpg | {'duplicates': [], 'similar': []} | [] |
| 9141 | metronome/image_0016.jpg | {'duplicates': [], 'similar': []} | [] |
| 9142 | metronome/image_0002.jpg | {'duplicates': [], 'similar': []} | [] |
| 9143 | metronome/image_0003.jpg | {'duplicates': [], 'similar': []} | [] |
| 9144 | metronome/image_0017.jpg | {'duplicates': [], 'similar': []} | [] |
+-------+--------------------------+---------------------------------------+--------+
Filtered files:
+-----------+---------------+
| File Name | Filter Reason |
+-----------+---------------+
+-----------+---------------+
/Users/mwharton/.miniconda3/envs/kaishi/lib/python3.7/site-packages/kaishi/image/dataset.py:20: UserWarning: No GPU detected, ConvNet prediction tasks will be very slow
warnings.warn("No GPU detected, ConvNet prediction tasks will be very slow")
There are almost 10k images in this directory, so let’s use a subdirectory to keep the dataset small.
imd = ImageDataset("101_objectCategories/Faces")
imd.file_report()
Current file list:
+-------+----------------+---------------------------------------+--------+
| Index | File Name | Children | Labels |
+-------+----------------+---------------------------------------+--------+
| 0 | image_0146.jpg | {'duplicates': [], 'similar': []} | [] |
| 1 | image_0152.jpg | {'duplicates': [], 'similar': []} | [] |
| 2 | image_0185.jpg | {'duplicates': [], 'similar': []} | [] |
| 3 | image_0191.jpg | {'duplicates': [], 'similar': []} | [] |
| 4 | image_0378.jpg | {'duplicates': [], 'similar': []} | [] |
| 5 | image_0344.jpg | {'duplicates': [], 'similar': []} | [] |
| 6 | image_0422.jpg | {'duplicates': [], 'similar': []} | [] |
| 7 | image_0350.jpg | {'duplicates': [], 'similar': []} | [] |
| ... | | | |
| 427 | image_0349.jpg | {'duplicates': [], 'similar': []} | [] |
| 428 | image_0375.jpg | {'duplicates': [], 'similar': []} | [] |
| 429 | image_0413.jpg | {'duplicates': [], 'similar': []} | [] |
| 430 | image_0407.jpg | {'duplicates': [], 'similar': []} | [] |
| 431 | image_0361.jpg | {'duplicates': [], 'similar': []} | [] |
| 432 | image_0188.jpg | {'duplicates': [], 'similar': []} | [] |
| 433 | image_0177.jpg | {'duplicates': [], 'similar': []} | [] |
| 434 | image_0163.jpg | {'duplicates': [], 'similar': []} | [] |
+-------+----------------+---------------------------------------+--------+
Filtered files:
+-----------+---------------+
| File Name | Filter Reason |
+-----------+---------------+
+-----------+---------------+
Interaction with datasets¶
Now, let’s look at a couple ways to access the images.
Each of these “files” is actually a kaishi.image.file.ImageFile object, which has quite a few interesting methods to enable rapid analysis. Each file’s image is initialized to None by default, but verify_loaded() will load an individual file, whereas load_all() will load all of them. If running a pipeline, load_all() will be called for you.
imd.files[:10]
[image_0146.jpg,
image_0152.jpg,
image_0185.jpg,
image_0191.jpg,
image_0378.jpg,
image_0344.jpg,
image_0422.jpg,
image_0350.jpg,
image_0387.jpg,
image_0393.jpg]
print(imd.files[0].image is None)
imd.files[0].verify_loaded()
print(imd.files[0].image is None)
True
False
import matplotlib.pyplot as plt
plt.imshow(imd.files[0].image)
plt.show()
plt.imshow(imd["image_0146.jpg"].image)
plt.show()
[image output for the two plots above]
Image processing pipelines¶
Let’s see the pipeline options.
imd.get_pipeline_options()
['FilterByLabel',
'FilterByRegex',
'FilterDuplicateFiles',
'FilterInvalidFileExtensions',
'FilterInvalidImageHeaders',
'FilterSimilar',
'FilterSubsample',
'LabelerGenericConvnet',
'LabelerValidationAndTest',
'TransformFixRotation',
'TransformLimitDimensions',
'TransformToGrayscale']
Now we can create, configure, and run a pipeline.
Note: the default regex pattern does not perform any filtering.
imd.configure_pipeline(["FilterInvalidImageHeaders", "FilterDuplicateFiles", "FilterByRegex", "TransformLimitDimensions", "TransformToGrayscale"])
print(imd.pipeline)
Kaishi pipeline:
0: FilterInvalidImageHeaders
1: FilterDuplicateFiles
2: FilterByRegex
pattern: '/(?=a)b/'
3: TransformLimitDimensions
max_dimension: None
max_width: None
max_height: None
4: TransformToGrayscale
imd.pipeline.components[2].configure(pattern="image_02.*jpg")
imd.pipeline.components[3].configure(max_dimension=400)
print(imd.pipeline)
Kaishi pipeline:
0: FilterInvalidImageHeaders
1: FilterDuplicateFiles
2: FilterByRegex
pattern: 'image_02.*jpg'
3: TransformLimitDimensions
max_dimension: 400
max_width: None
max_height: None
4: TransformToGrayscale
imd.run_pipeline()
Now we can analyze the results.
imd.file_report()
Current file list:
+-------+----------------+---------------------------------------+---------------+
| Index | File Name | Children | Labels |
+-------+----------------+---------------------------------------+---------------+
| 0 | image_0146.jpg | {'duplicates': [], 'similar': []} | ['GRAYSCALE'] |
| 1 | image_0152.jpg | {'duplicates': [], 'similar': []} | ['GRAYSCALE'] |
| 2 | image_0185.jpg | {'duplicates': [], 'similar': []} | ['GRAYSCALE'] |
| 3 | image_0191.jpg | {'duplicates': [], 'similar': []} | ['GRAYSCALE'] |
| 4 | image_0378.jpg | {'duplicates': [], 'similar': []} | ['GRAYSCALE'] |
| 5 | image_0344.jpg | {'duplicates': [], 'similar': []} | ['GRAYSCALE'] |
| 6 | image_0422.jpg | {'duplicates': [], 'similar': []} | ['GRAYSCALE'] |
| 7 | image_0350.jpg | {'duplicates': [], 'similar': []} | ['GRAYSCALE'] |
| ... | | | |
| 327 | image_0349.jpg | {'duplicates': [], 'similar': []} | ['GRAYSCALE'] |
| 328 | image_0375.jpg | {'duplicates': [], 'similar': []} | ['GRAYSCALE'] |
| 329 | image_0413.jpg | {'duplicates': [], 'similar': []} | ['GRAYSCALE'] |
| 330 | image_0407.jpg | {'duplicates': [], 'similar': []} | ['GRAYSCALE'] |
| 331 | image_0361.jpg | {'duplicates': [], 'similar': []} | ['GRAYSCALE'] |
| 332 | image_0188.jpg | {'duplicates': [], 'similar': []} | ['GRAYSCALE'] |
| 333 | image_0177.jpg | {'duplicates': [], 'similar': []} | ['GRAYSCALE'] |
| 334 | image_0163.jpg | {'duplicates': [], 'similar': []} | ['GRAYSCALE'] |
+-------+----------------+---------------------------------------+---------------+
Filtered files:
+----------------+---------------+
| File Name | Filter Reason |
+----------------+---------------+
| image_0215.jpg | regex |
| image_0201.jpg | regex |
| image_0229.jpg | regex |
| image_0228.jpg | regex |
| image_0200.jpg | regex |
| ... | |
| image_0231.jpg | regex |
| image_0225.jpg | regex |
| image_0224.jpg | regex |
| image_0230.jpg | regex |
| image_0218.jpg | regex |
+----------------+---------------+
Note that the images have been sized down (max dimension is 400) and are now grayscale, as expected.
plt.imshow(imd["image_0146.jpg"].image)
plt.show()
plt.imshow(imd["image_0361.jpg"].image)
plt.show()
[image output for the two plots above]
Custom pipeline components¶
What if we wanted to create a custom pipeline component?
Let’s create one that quantizes each of the images that passed our previous filter operations.
from kaishi.core.pipeline_component import PipelineComponent

# Follow the rules specified in the pipeline component guide
class TransformByQuantizing(PipelineComponent):
    """Transform that quantizes images."""
    def __init__(self):
        super().__init__()
        self.configure()
        self.applies_to_available = True  # Set this to true if using the "get_target_indexes" method

    def __call__(self, dataset):
        # Quantize each targeted image, then convert back to grayscale mode
        for i in self.get_target_indexes(dataset):
            dataset.files[i].image = dataset.files[i].image.quantize(colors=self.n_colors).convert("L")
            dataset.files[i].update_derived_images()  # This updates thumbnails/etc.

    def configure(self, n_colors=32):
        self.n_colors = n_colors
imd.TransformByQuantizing = TransformByQuantizing
Check to see that it was properly added
imd.get_pipeline_options()
['FilterByLabel',
'FilterByRegex',
'FilterDuplicateFiles',
'FilterInvalidFileExtensions',
'FilterInvalidImageHeaders',
'FilterSimilar',
'FilterSubsample',
'LabelerGenericConvnet',
'LabelerValidationAndTest',
'TransformByQuantizing',
'TransformFixRotation',
'TransformLimitDimensions',
'TransformToGrayscale']
imd.configure_pipeline(["TransformByQuantizing"])
print(imd.pipeline)
Kaishi pipeline:
0: TransformByQuantizing
n_colors: 32
imd.pipeline.components[0].configure(n_colors=10)
imd.pipeline.components[0].applies_to("image_01.*jpg")
print(imd.pipeline)
imd.run_pipeline()
Kaishi pipeline:
0: TransformByQuantizing
n_colors: 10
As expected, any image with the pattern image_01… is quantized (most noticeable in the background), whereas any image not matching this pattern remains intact.
plt.imshow(imd["image_0146.jpg"].image)
plt.show()
plt.imshow(imd["image_0361.jpg"].image)
plt.show()
[image output for the two plots above]
Saving¶
Finally, we can save the edited dataset.
imd.save("Faces_edited")
imd_edited = ImageDataset("Faces_edited")
imd_edited.load_all()
/Users/mwharton/.miniconda3/envs/kaishi/lib/python3.7/site-packages/kaishi/image/dataset.py:20: UserWarning: No GPU detected, ConvNet prediction tasks will be very slow
warnings.warn("No GPU detected, ConvNet prediction tasks will be very slow")
plt.imshow(imd_edited["image_0146.jpg"].image)
plt.show()
plt.imshow(imd_edited["image_0361.jpg"].image)
plt.show()
[image output for the two plots above]
Tabular Datasets¶
Tabular datasets in this context are directories of files (any variant of .csv or .json is accepted).
Initializing datasets¶
Let’s start by creating our own toy dataset (with one duplicate, i.e. files 1 and 2)
import pandas as pd
import os
outdir = "toy_csv"
os.mkdir(outdir)
csv1 = """
"Index", "Living Space (sq ft)", "Beds", "Baths"
1, 2222, 3, 3.5
2, 1628, 3, 2
3, 3824, 5, 4
4, 1137, 3, 2
5, 3560, 6, 4
6, 2893, 4, 3
7, 3631, 4, 3
8, 2483, 4, 3
9, 2400, 4, 4
10, 1997, 3, 3
"""
csv2 = """
"Index", "Living Space (sq ft)", "Beds", "Baths"
1, 2222, 3, 3.5
2, 1628, 3, 2
3, 3824, 5, 4
4, 1137, 3, 2
5, 3560, 6, 4
6, 2893, 4, 3
7, 3631, 4, 3
8, 2483, 4, 3
9, 2400, 4, 4
10, 1997, 3, 3
"""
csv3 = """
"Index", "Living Space (sq ft)", "Beds", "Baths"
11, 2222, 3, 3.5
12, 1628, 3, 2
13, 3824, 5, 4
14, 1137, 3, 2
15, 3560, 6, 4
16, 2893, 4, 3
17, 3631, 4, 3
18, 2483, 4, 3
19, 2400, 4, 4
20, 1997, 3, 3
"""
with open(outdir + "/1.csv", "w") as fd:
fd.write(csv1)
with open(outdir + "/2.csv", "w") as fd:
fd.write(csv2)
with open(outdir + "/3.csv", "w") as fd:
fd.write(csv3)
from kaishi.tabular.dataset import TabularDataset
td = TabularDataset(outdir)
td.file_report()
Current file list:
+-------+-----------+------------------------+--------+
| Index | File Name | Children | Labels |
+-------+-----------+------------------------+--------+
| 0 | 1.csv | {'duplicates': []} | [] |
| 1 | 3.csv | {'duplicates': []} | [] |
| 2 | 2.csv | {'duplicates': []} | [] |
+-------+-----------+------------------------+--------+
Filtered files:
+-----------+---------------+
| File Name | Filter Reason |
+-----------+---------------+
+-----------+---------------+
Interaction with datasets¶
There are several ways to interact with tabular datasets. Let’s start by looking at a detailed report.
td.report()
Dataframe 0
source: /Users/mwharton/Code/kaishi/notebooks/toy_csv/1.csv
====================================
NO DATA OR NOT LOADED (try running 'dataset.load_all()')
Dataframe 1
source: /Users/mwharton/Code/kaishi/notebooks/toy_csv/3.csv
====================================
NO DATA OR NOT LOADED (try running 'dataset.load_all()')
Dataframe 2
source: /Users/mwharton/Code/kaishi/notebooks/toy_csv/2.csv
====================================
NO DATA OR NOT LOADED (try running 'dataset.load_all()')
Our data weren’t loaded; let’s fix that and try again.
td.load_all()
td.report()
Dataframe 0
source: /Users/mwharton/Code/kaishi/notebooks/toy_csv/1.csv
====================================
4 columns: ['Index', ' "Living Space (sq ft)"', ' "Beds"', ' "Baths"']
--- Column 'Index'
count 10.00000
mean 5.50000
std 3.02765
min 1.00000
25% 3.25000
50% 5.50000
75% 7.75000
max 10.00000
Name: Index, dtype: float64
--- Column ' "Living Space (sq ft)"'
count 10.00000
mean 2577.50000
std 894.97725
min 1137.00000
25% 2053.25000
50% 2441.50000
75% 3393.25000
max 3824.00000
Name: "Living Space (sq ft)", dtype: float64
--- Column ' "Beds"'
count 10.000000
mean 3.900000
std 0.994429
min 3.000000
25% 3.000000
50% 4.000000
75% 4.000000
max 6.000000
Name: "Beds", dtype: float64
--- Column ' "Baths"'
count 10.000000
mean 3.150000
std 0.747217
min 2.000000
25% 3.000000
50% 3.000000
75% 3.875000
max 4.000000
Name: "Baths", dtype: float64
***** Fraction of missing data in each column *****
Index: 0.0
"Living Space (sq ft)": 0.0
"Beds": 0.0
"Baths": 0.0
Dataframe 1
source: /Users/mwharton/Code/kaishi/notebooks/toy_csv/3.csv
====================================
4 columns: ['Index', ' "Living Space (sq ft)"', ' "Beds"', ' "Baths"']
--- Column 'Index'
count 10.00000
mean 15.50000
std 3.02765
min 11.00000
25% 13.25000
50% 15.50000
75% 17.75000
max 20.00000
Name: Index, dtype: float64
--- Column ' "Living Space (sq ft)"'
count 10.00000
mean 2577.50000
std 894.97725
min 1137.00000
25% 2053.25000
50% 2441.50000
75% 3393.25000
max 3824.00000
Name: "Living Space (sq ft)", dtype: float64
--- Column ' "Beds"'
count 10.000000
mean 3.900000
std 0.994429
min 3.000000
25% 3.000000
50% 4.000000
75% 4.000000
max 6.000000
Name: "Beds", dtype: float64
--- Column ' "Baths"'
count 10.000000
mean 3.150000
std 0.747217
min 2.000000
25% 3.000000
50% 3.000000
75% 3.875000
max 4.000000
Name: "Baths", dtype: float64
***** Fraction of missing data in each column *****
Index: 0.0
"Living Space (sq ft)": 0.0
"Beds": 0.0
"Baths": 0.0
Dataframe 2
source: /Users/mwharton/Code/kaishi/notebooks/toy_csv/2.csv
====================================
4 columns: ['Index', ' "Living Space (sq ft)"', ' "Beds"', ' "Baths"']
--- Column 'Index'
count 10.00000
mean 5.50000
std 3.02765
min 1.00000
25% 3.25000
50% 5.50000
75% 7.75000
max 10.00000
Name: Index, dtype: float64
--- Column ' "Living Space (sq ft)"'
count 10.00000
mean 2577.50000
std 894.97725
min 1137.00000
25% 2053.25000
50% 2441.50000
75% 3393.25000
max 3824.00000
Name: "Living Space (sq ft)", dtype: float64
--- Column ' "Beds"'
count 10.000000
mean 3.900000
std 0.994429
min 3.000000
25% 3.000000
50% 4.000000
75% 4.000000
max 6.000000
Name: "Beds", dtype: float64
--- Column ' "Baths"'
count 10.000000
mean 3.150000
std 0.747217
min 2.000000
25% 3.000000
50% 3.000000
75% 3.875000
max 4.000000
Name: "Baths", dtype: float64
***** Fraction of missing data in each column *****
Index: 0.0
"Living Space (sq ft)": 0.0
"Beds": 0.0
"Baths": 0.0
To look at a specific file object, you can access it via either index or key.
td.files[0].df
Index | "Living Space (sq ft)" | "Beds" | "Baths" | |
---|---|---|---|---|
0 | 1 | 2222 | 3 | 3.5 |
1 | 2 | 1628 | 3 | 2.0 |
2 | 3 | 3824 | 5 | 4.0 |
3 | 4 | 1137 | 3 | 2.0 |
4 | 5 | 3560 | 6 | 4.0 |
5 | 6 | 2893 | 4 | 3.0 |
6 | 7 | 3631 | 4 | 3.0 |
7 | 8 | 2483 | 4 | 3.0 |
8 | 9 | 2400 | 4 | 4.0 |
9 | 10 | 1997 | 3 | 3.0 |
td["1.csv"].df
Index | "Living Space (sq ft)" | "Beds" | "Baths" | |
---|---|---|---|---|
0 | 1 | 2222 | 3 | 3.5 |
1 | 2 | 1628 | 3 | 2.0 |
2 | 3 | 3824 | 5 | 4.0 |
3 | 4 | 1137 | 3 | 2.0 |
4 | 5 | 3560 | 6 | 4.0 |
5 | 6 | 2893 | 4 | 3.0 |
6 | 7 | 3631 | 4 | 3.0 |
7 | 8 | 2483 | 4 | 3.0 |
8 | 9 | 2400 | 4 | 4.0 |
9 | 10 | 1997 | 3 | 3.0 |
Tabular data processing pipelines¶
Let’s see the pipeline options
td.get_pipeline_options()
['FilterByLabel',
'FilterByRegex',
'FilterDuplicateFiles',
'FilterDuplicateRowsAfterConcatenation',
'FilterDuplicateRowsEachDataframe',
'FilterInvalidFileExtensions',
'FilterSubsample',
'LabelerValidationAndTest',
'AggregatorConcatenateDataframes']
Now let’s configure our own pipeline and run it
td.configure_pipeline(["FilterDuplicateFiles", "AggregatorConcatenateDataframes"])
print(td.pipeline)
td.run_pipeline()
Kaishi pipeline:
0: FilterDuplicateFiles
1: AggregatorConcatenateDataframes
As expected, the duplicate file was filtered
td.file_report()
Current file list:
+-------+-----------+-----------------------------+--------+
| Index | File Name | Children | Labels |
+-------+-----------+-----------------------------+--------+
| 0 | 1.csv | {'duplicates': [2.csv]} | [] |
| 1 | 3.csv | {'duplicates': []} | [] |
+-------+-----------+-----------------------------+--------+
Filtered files:
+-----------+---------------+
| File Name | Filter Reason |
+-----------+---------------+
| 2.csv | duplicates |
+-----------+---------------+
But what about the concatenated dataframe? When Kaishi pipeline components create artifacts, they are added to the artifacts member of a dataset.
print(td.artifacts.keys())
dict_keys(['df_concatenated'])
td.artifacts["df_concatenated"]
Index | "Living Space (sq ft)" | "Beds" | "Baths" | |
---|---|---|---|---|
0 | 1 | 2222 | 3 | 3.5 |
1 | 2 | 1628 | 3 | 2.0 |
2 | 3 | 3824 | 5 | 4.0 |
3 | 4 | 1137 | 3 | 2.0 |
4 | 5 | 3560 | 6 | 4.0 |
5 | 6 | 2893 | 4 | 3.0 |
6 | 7 | 3631 | 4 | 3.0 |
7 | 8 | 2483 | 4 | 3.0 |
8 | 9 | 2400 | 4 | 4.0 |
9 | 10 | 1997 | 3 | 3.0 |
10 | 11 | 2222 | 3 | 3.5 |
11 | 12 | 1628 | 3 | 2.0 |
12 | 13 | 3824 | 5 | 4.0 |
13 | 14 | 1137 | 3 | 2.0 |
14 | 15 | 3560 | 6 | 4.0 |
15 | 16 | 2893 | 4 | 3.0 |
16 | 17 | 3631 | 4 | 3.0 |
17 | 18 | 2483 | 4 | 3.0 |
18 | 19 | 2400 | 4 | 4.0 |
19 | 20 | 1997 | 3 | 3.0 |
This ultimately provides the framework for manipulating your own tabular datasets and adding custom functionality, without the hassle of dealing with the boring and monotonous ETL steps.
File datasets¶
If there’s a particular data type you are working with, it’s usually better to use the type-specific dataset object. However, there are still a few core operations that can be performed on generic files.
Initializing datasets¶
Let’s start by creating a simple dataset of text files.
import os
outdir = "toy_files"
os.mkdir(outdir)
file1 = "file1_contents"
file2 = "file2_contents"
file2_duplicate = "file2_contents"
file3 = "file3_contents"
with open(outdir + "/1.file", "w") as fd:
fd.write(file1)
with open(outdir + "/2.file", "w") as fd:
fd.write(file2)
with open(outdir + "/2_dup.file", "w") as fd:
fd.write(file2_duplicate)
with open(outdir + "/3.file", "w") as fd:
fd.write(file3)
from kaishi.core.dataset import FileDataset
fd = FileDataset(outdir)
fd.file_report()
Current file list:
+-------+------------+------------------------+--------+
| Index | File Name | Children | Labels |
+-------+------------+------------------------+--------+
| 0 | 3.file | {'duplicates': []} | [] |
| 1 | 2.file | {'duplicates': []} | [] |
| 2 | 2_dup.file | {'duplicates': []} | [] |
| 3 | 1.file | {'duplicates': []} | [] |
+-------+------------+------------------------+--------+
Filtered files:
+-----------+---------------+
| File Name | Filter Reason |
+-----------+---------------+
+-----------+---------------+
File processing pipelines¶
There are fewer components available for files compared to other types, as the other types inherit from the FileGroup class. However, there are still plenty of options available to perform some common operations.
fd.get_pipeline_options()
['FilterByLabel',
'FilterByRegex',
'FilterDuplicateFiles',
'FilterSubsample',
'LabelerValidationAndTest']
fd.configure_pipeline(["FilterDuplicateFiles", "FilterSubsample"])
print(fd.pipeline)
Kaishi pipeline:
0: FilterDuplicateFiles
1: FilterSubsample
N: None
seed: None
fd.pipeline.components[1].configure(N=2, seed=42)
print(fd.pipeline)
Kaishi pipeline:
0: FilterDuplicateFiles
1: FilterSubsample
N: 2
seed: 42
fd.run_pipeline()
fd.file_report()
Current file list:
+-------+-----------+----------------------------------+--------+
| Index | File Name | Children | Labels |
+-------+-----------+----------------------------------+--------+
| 0 | 3.file | {'duplicates': []} | [] |
| 1 | 2.file | {'duplicates': [2_dup.file]} | [] |
+-------+-----------+----------------------------------+--------+
Filtered files:
+------------+---------------+
| File Name | Filter Reason |
+------------+---------------+
| 2_dup.file | duplicates |
| 1.file | subsample |
+------------+---------------+