Use Cases
Data Sources
| Source | |
|---|---|
| FTP | |
| SFTP | |
| S3 | |
| GCS | |
| Azure Buckets | |
| GitHub - Public | |
| GitHub - Private (>) | |
| API - Public | |
| API - Authentication (Username/Password, Token) | |
| API - SigV4 | |
| Kafka Partition | |
| Manual Upload | |
| CDC - Kafka, various other Source Systems | |
| Elasticsearch - CDC & export | |
| RDBMS/Redshift/BigQuery/Databricks/Cloudera/Qubole - Full load / SQL queries | |
| Schedule/call external services like Stitch, Fivetran | |
Steps
- Onboard a data source.
Sidebar:
List of data sources
Create a datasource: select one of the options above
Add Credentials (if any)
Test credentials (if any)
Deployment target
Re-use datasource across projects (optional)
Datasource-specific projects.
Create projects.
Map User Permissions to SSO/SAML
Admin screen / Users & Permissions
Delete a datasource (Admin only)
Pattern matching for files to look up
Types of admin (Superadmin / Admin / Datasource Manager)
- Ingest raw data and store it in a cloud object store: RAW_DATA, LANDING_ZONE, or DMZ store
a) Scan files for viruses during ingestion or after ingestion?
- Scan files for viruses
- FileName + Date + Time + (optional Run No.) -> covid-data-2020-12-24-14:27.csv -> Identifier
a) The user picks an identifier. If the user doesn't pick one, an identifier is generated by lowercasing the name and removing spaces and special characters
b) This identifier is different from the system-generated identifier
c) The system-generated identifier is a UUID without dashes
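A minimal sketch of the naming scheme above, in Python. All function names here are hypothetical. Two details are assumptions, not decisions from these notes: spaces are turned into hyphens rather than dropped (to match the `covid-data` example), and the optional run number is appended after the time.

```python
import re
import uuid
from datetime import datetime
from typing import Optional


def slugify_identifier(name: str) -> str:
    """Fallback identifier when the user doesn't pick one (hypothetical helper)."""
    lowered = name.strip().lower()
    # Assumption: whitespace becomes a hyphen instead of being removed outright.
    hyphenated = re.sub(r"\s+", "-", lowered)
    # Drop everything that is not a lowercase letter, digit, hyphen, or underscore.
    return re.sub(r"[^a-z0-9_-]", "", hyphenated)


def raw_file_name(identifier: str, when: datetime, extension: str,
                  run_no: Optional[int] = None) -> str:
    """Build FileName + Date + Time + (optional Run No.)."""
    parts = [identifier, when.strftime("%Y-%m-%d-%H:%M")]
    if run_no is not None:
        # Assumption: the run number goes last, before the extension.
        parts.append(str(run_no))
    return "-".join(parts) + "." + extension


def system_identifier() -> str:
    """System-generated identifier: a UUID without dashes (32 hex characters)."""
    return uuid.uuid4().hex
```

For example, `raw_file_name("covid-data", datetime(2020, 12, 24, 14, 27), "csv")` gives `covid-data-2020-12-24-14:27.csv`. Note that the `:` in the time part may be awkward on some file systems, so the separator is worth revisiting.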
While onboarding a datasource, the user can enter the path to save to in the object store
? How to validate a user-given path
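One possible answer to the path question, as a sketch only. The rules below are assumptions (relative paths only, no `.` or `..` segments, a conservative character allow-list), and `validate_store_path` is a hypothetical name. The 1024-byte cap reflects the object name limit in S3 and GCS; other stores may differ.

```python
import re
from typing import List

# S3 and GCS cap object names at 1024 bytes of UTF-8.
MAX_KEY_BYTES = 1024

# Assumed allow-list per path segment: letters, digits, dot, underscore, hyphen.
_SEGMENT_RE = re.compile(r"[A-Za-z0-9._-]+")


def validate_store_path(path: str) -> List[str]:
    """Return a list of problems; an empty list means the path is accepted."""
    if not path:
        return ["path is empty"]
    problems: List[str] = []
    if len(path.encode("utf-8")) > MAX_KEY_BYTES:
        problems.append("path is longer than 1024 bytes")
    if path.startswith("/"):
        problems.append("path must be relative (no leading slash)")
    for segment in path.strip("/").split("/"):
        if segment in ("", ".", ".."):
            # Empty segments come from doubled slashes such as "a//b".
            problems.append(f"invalid segment: {segment!r}")
        elif not _SEGMENT_RE.fullmatch(segment):
            problems.append(f"unsupported characters in segment: {segment!r}")
    return problems
```

Returning every problem at once, rather than raising on the first, lets the onboarding form show all issues to the user in one pass. The check covers the prefix only; the file name appended later also counts toward the length limit.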
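The onboarding step lists pattern matching for files to look up. A minimal sketch of what that could mean, assuming shell-style glob patterns (the notes do not say whether globs or regular expressions are intended). `match_files` is a hypothetical helper built on Python's standard `fnmatch` module.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def match_files(names: Iterable[str], pattern: str) -> List[str]:
    """Return the names matching a shell-style glob pattern, in input order."""
    # fnmatchcase is case-sensitive on every platform, unlike fnmatch.fnmatch.
    return [name for name in names if fnmatchcase(name, pattern)]
```

With the pattern `covid-data-*.csv`, a listing of `covid-data-2020-12-24.csv`, `covid-data-2020-12-25.csv`, and `readme.txt` keeps the two CSV files. In `fnmatch`, `*` also matches `/`, so a pattern can span directory levels unless the names are split first.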