TFRecorder
TFRecorder makes it easy to create TFRecords from images and labels in Pandas DataFrames or CSV files. Today, TFRecorder supports data stored in an 'image csv format' similar to the one used by GCP AutoML Vision. In the future, TFRecorder will support converting any Pandas DataFrame or CSV file into TFRecords.
Installation
Install TFRecorder from PyPI with pip:
pip install tfrecorder
Example usage
Generating TFRecords
From Pandas DataFrame
Running on a local machine
import pandas as pd
import tfrecorder
df = pd.read_csv(...)
df.tensorflow.to_tfr(output_dir='gs://my/bucket')
Running on Cloud Dataflow
import pandas as pd
import tfrecorder
df = pd.read_csv(...)
df.tensorflow.to_tfr(
output_dir='gs://my/bucket',
runner='DataFlowRunner',
project='my-project',
region='us-central1')
From CSV
Using the Python interpreter:
import tfrecorder
tfrecorder.create_tfrecords(
input_data='/path/to/data.csv',
output_dir='gs://my/bucket')
Using the command line:
tfrecorder create-tfrecords \
--input_data=/path/to/data.csv \
--output_dir=gs://my/bucket
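As a minimal sketch of how an input file for the commands above might be prepared, the following writes a small CSV in the split / image_uri / label layout using only the Python standard library. The helper name `write_image_csv`, the sample rows, and the presence of a header row are assumptions made for illustration; check how TFRecorder expects headers before relying on this.

```python
import csv
import os
import tempfile

# Hypothetical example rows in the split / image_uri / label layout.
ROWS = [
    ("TRAIN", "gs://my/bucket/image1.jpg", "cat"),
    ("VALIDATION", "gs://my/bucket/image2.jpg", "dog"),
    ("TEST", "gs://my/bucket/image3.jpg", "cat"),
]


def write_image_csv(path, rows, header=("split", "image_uri", "label")):
    """Write one image per line to `path`, preceded by a header row.

    Whether TFRecorder requires a header row is an assumption here.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return path


# Write the sample rows to a temporary directory.
csv_path = write_image_csv(os.path.join(tempfile.mkdtemp(), "data.csv"), ROWS)
print(csv_path)
```

The resulting path could then be passed as input_data in the calls shown above.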
Verifying data in TFRecords generated by TFRecorder
Using the Python interpreter:
import tfrecorder
tfrecorder.check_tfrecords(
file_pattern='/path/to/tfrecords/train*.tfrecord.gz',
num_records=5,
output_dir='/tmp/output')
This generates a CSV file containing the structured data, plus image files for the images encoded in the TFRecords.
Using the command line:
tfrecorder check-tfrecords \
--file_pattern=/path/to/tfrecords/train*.tfrecord.gz \
--num_records=5 \
--output_dir=/tmp/output
Input format
TFRecorder currently expects data to be in the same format as AutoML Vision.
As a Pandas DataFrame or CSV file, this format has the following columns:
split | image_uri | label |
---|---|---|
TRAIN | gs://my/bucket/image1.jpg | cat |
where:
split
can take on the values TRAIN, VALIDATION, and TEST.
image_uri
specifies a local or Google Cloud Storage location for the image file.
label
can be either a text-based label, which will be integerized, or an integer.
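The column rules above can be sketched as a small pre-flight check. This is an illustration only, not TFRecorder's own validation: `validate_row` and `integerize` are hypothetical helpers, and mapping text labels to integers by sorted order is an assumed scheme that may differ from the one TFRecorder uses.

```python
VALID_SPLITS = {"TRAIN", "VALIDATION", "TEST"}


def validate_row(split, image_uri, label):
    """Return a list of problems found in one (split, image_uri, label) row."""
    errors = []
    if split not in VALID_SPLITS:
        errors.append(f"unknown split: {split!r}")
    if not isinstance(image_uri, str) or not image_uri:
        errors.append("image_uri must be a non-empty string")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(label, bool) or not isinstance(label, (str, int)):
        errors.append(f"label must be text or an integer: {label!r}")
    return errors


def integerize(labels):
    """Map text labels to integers by sorted order (assumed scheme).

    Labels that are already integers pass through unchanged.
    """
    names = sorted({label for label in labels if isinstance(label, str)})
    vocab = {name: index for index, name in enumerate(names)}
    return [vocab[label] if isinstance(label, str) else label for label in labels]


print(validate_row("TRAIN", "gs://my/bucket/image1.jpg", "cat"))
print(integerize(["dog", "cat", "dog"]))
```

Running a check like this before generating TFRecords catches misspelled split names or empty image paths early, before a local or Dataflow job starts.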
Contributing
Pull requests are welcome.