
Use Cases

Data Sources

Source
FTP
SFTP
S3
GCS
Azure Buckets
GitHub - Public
GitHub - Private
API - Public
API - Authentication (Username/Password, Token) -- see the sketch after this list
API - SigV4
Kafka Partition
Manual Upload
CDC - Kafka and various other source systems
Elasticsearch - CDC & export
RDBMS/Redshift/BigQuery/Databricks/Cloudera/Qubole - Full load / SQL queries
Schedule/call external services like Stitch, Fivetran
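
For the API source types above, a minimal sketch of what fetching could look like, assuming the Python `requests` library and placeholder URL and credentials; SigV4 is only noted, since it needs an AWS request-signing helper:

```python
import requests
from requests.auth import HTTPBasicAuth

# Hypothetical endpoint and credentials -- placeholders, not real values.
url = "https://api.example.com/v1/export"

# API - Public: no credentials needed.
public_resp = requests.get(url, timeout=30)

# API - Authentication (Username/Password).
basic_resp = requests.get(url, auth=HTTPBasicAuth("user", "password"), timeout=30)

# API - Authentication (Token), assuming a bearer-style header.
token_resp = requests.get(url, headers={"Authorization": "Bearer <token>"}, timeout=30)

# API - SigV4 would additionally require request signing (e.g. botocore's SigV4Auth)
# and is omitted from this sketch.
```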

Steps

  1. Onboard a data source.

Sidebar:

List of data sources

Create a datasource -- select one of the options above

Add Credentials (if any)

Test credentials (if any)

Deployment target

Re-use datasource across projects (optional)

Datasource-specific projects.

Create projects.

Map User Permissions to SSO/SAML

Admin Screen / Users & Permissions

Delete a datasource (Admin only)

Pattern matching for files to look up (see the sketch after this list)

Types of admin (Superadmin / Admin / Datasource Manager)
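
A minimal sketch of what an onboarded datasource record might hold, tying together the credentials, deployment target, cross-project re-use flag, and file lookup pattern from the list above; all field names are assumptions, and the pattern matching uses Python's `fnmatch` as one possible implementation:

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import Optional

@dataclass
class Datasource:
    # Hypothetical fields -- illustrative, not a fixed schema.
    name: str
    source_type: str                  # e.g. "S3", "SFTP", "API - Public"
    credentials: Optional[dict] = None
    deployment_target: str = "dev"
    shared_across_projects: bool = False
    file_pattern: str = "*"           # pattern used to look up files at the source

    def matches(self, file_name: str) -> bool:
        """Return True if a file at the source matches the lookup pattern."""
        return fnmatch(file_name, self.file_pattern)

# Usage: a datasource that only picks up daily COVID CSV drops.
ds = Datasource(name="covid-feed", source_type="SFTP", file_pattern="covid-data-*.csv")
assert ds.matches("covid-data-2020-12-24-14:27.csv")
```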

  2. Ingest raw data and store it in a cloud object store: a RAW_DATA, LANDING_ZONE, or DMZ store

a) Scan files for viruses during ingestion or after ingestion?

  3. Scan files for viruses

  4. FileName + Date + Time + (optional Run No.) -> covid-data-2020-12-24-14:27.csv -> Identifier

a) The user picks an identifier. If the user doesn't pick one, an identifier is generated from the file name by lowercasing it and removing spaces and special characters (see the sketch below).

b) This identifier is different from the system-generated identifier.

c) The system-generated identifier is a UUID without dashes.
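
A minimal sketch of the two identifiers, assuming the user-facing one is derived by lowercasing the file name and stripping spaces and special characters, and the system one is a UUID without dashes; function names are illustrative:

```python
import re
import uuid

def user_identifier(file_name: str) -> str:
    """Derive an identifier when the user doesn't pick one:
    lowercase, then drop spaces and special characters."""
    return re.sub(r"[^a-z0-9]", "", file_name.lower())

def system_identifier() -> str:
    """System-generated identifier: a UUID without dashes."""
    return uuid.uuid4().hex

print(user_identifier("covid-data-2020-12-24-14:27.csv"))  # coviddata202012241427csv
print(system_identifier())                                  # e.g. "3f2c1a..."
```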

While onboarding a datasource, the user can enter the path under which data is saved in the object store.

? How to validate a user-given path
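
One possible way to validate such a path, sketched under assumed rules (relative, slash-separated, no traversal segments, restricted character set); the rules themselves are still the open question above:

```python
import re

# Illustrative rule: each path segment is lowercase alphanumeric plus ".", "_", "-".
_SEGMENT = re.compile(r"^[a-z0-9][a-z0-9._-]*$")

def validate_object_store_path(path: str) -> bool:
    """Return True if a user-given object-store path looks safe to use as a key prefix."""
    if not path or path.startswith("/") or path.endswith("/"):
        return False
    segments = path.split("/")
    return all(seg != ".." and _SEGMENT.match(seg) for seg in segments)

print(validate_object_store_path("raw_data/covid/2020"))  # True
print(validate_object_store_path("../etc/passwd"))        # False
```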



Last update: 2020-12-25