File registry
A file registry keeps track of what files have been processed, returning files that have not been processed yet.
This is true for each file registry:
- has a sub block that is annotated as
block::sub-block
- outputs a list of unprocessed files
- only some load blocks can be configured with a file registry
FileRegistry:
{BlockName}
Type: {block::sub-block}
Properties:
{Prop}: {Prop}
Example:
FileRegistry:
S3FullScan:
Type: fileregistry::s3_full_scan
Properties:
BasePath: s3://datalake/file-registry/dateset-a
UpdateAfter: WriteToService
HiveDatabaseName: file_registry
HiveTableName: dataset-a
LiftJob:
TrustedFiles:
Type: load::batch_parquet
Properties:
Path: ${Path}
FileRegistry: S3Fullscan
WriteToService:
Type: write::batch_delta
Input: TrustedFiles
Properties:
Path: s3://path/to/prefix
Mode: overwrite
AWS S3
The following file registrys can only be used with the AWS S3 service.
fileregistry::delta_diff
Retrieve a dataset with the new rows compared from the last time lifted.
- Parameters
-
- BasePath (str) – s3 prefix to where you want the file registry
- UpdateAfter (str) – After what lift block should the new files be marked as processed
- DefaultStartDate (str) – At what date should the file registry start to search for new files (format %Y-%m-%d %H:%M:%S)
- JoinOnFields (list) – Join on these fields to exclude the existing rows in a previous version
Only available for pyspark 3.0 and above!
DeltaDiff:
Type: fileregistry::delta_diff
Properties:
BasePath: s3://datalake/file-registry/dateset-a
UpdateAfter: WriteToDatabase
DefaultStartDate: 2019-01-01
JoinOnFields: [id, timestamp]
fileregistry::s3_date_prefix_scan
Find all new files in S3 based with a date partition format i.e. YYYY/MM/DD
With the parameter PartitionFormat you can specify multiple different date formats that will be used to scan an S3 prefix.
Take the following example, you can have files stored in the following way on S3 YYYY/MM/DD/HH.
If we give the PartitionFormat parameter the value %Y/%m
, we use
python strftime codes,
then it will scan files on a monthly basis. Set it to %Y/%m/%d/%H
and it
will scan files on a hourly basis instead.
- Parameters
-
- BasePath (str) – s3 prefix to where you want the file registry
- UpdateAfter (str) – After what lift block should the new files be marked as processed
- DefaultStartDate (str) – At what date should the file registry start to search for new files
- PartitionFormat (str) – Describe how are the files partitioned and how the file registry will search for new files.
- HiveDatabaseName (str) – The hive database name for the file registry
- HiveTableName (str) – The hive table name for the file registry
S3DatePrefixScan:
Type: fileregistry::s3_date_prefix_scan
Properties:
BasePath: s3://datalake/file-registry
UpdateAfter: WriteToDatabase
DefaultStartDate: 2019-01-01
PartitionFormat: %Y/%m/d
HiveDatabaseName: file_registry
HiveTableName: dataset-a
fileregistry::s3_full_scan
Do a full scan for new files under a prefix in s3
- Parameters
-
- BasePath (str) – s3 prefix to where you want the file registry
- UpdateAfter (str) – After what lift block should the new files be marked as processed
- HiveDatabaseName (str) – The hive database name for the file registry
- HiveTableName (str) – The hive table name for the file registry
S3FullScan:
Type: fileregistry::s3_full_scan
Properties:
BasePath: s3://datalake/file-registry/dateset-a
UpdateAfter: WriteToDatabase
HiveDatabaseName: file_registry
HiveTableName: dataset-a