etl¶
https://news.ycombinator.com/item?id=17781762
http://www.etldatabase.com/etl-tools/
https://medium.com/strava-engineering/from-data-streams-to-a-data-lake-b6ca17c00a23
https://www.brianlikespostgres.com/
https://robinhood.engineering/why-robinhood-uses-airflow-aed13a9a90c8
https://airflow.apache.org/concepts.html
https://multithreaded.stitchfix.com/blog/2017/11/22/patterns-of-soa-database-transactions/
Apache Airflow feedback
There are a bunch of different tools for the same job, from manual cron jobs to Luigi, Pinball, Azkaban, Oozie, Taverna, and Mistral. I've started using Airflow for personal projects, and am slowly probing for adoption in our shop, where applicable.
The good points I have seen:
- It's simple Python, not XML like Azkaban. I've seen people with less technical expertise build useful stuff quickly and automate their workflows.
- Very good UI, which just lets you do what you need without fuss.
- Easy to build modular and interactive flows, with interesting features such as sensors, communication between operators, triggers, etc.
- Everything is stored in a database, which I can query about anything related to the processes run and to Airflow itself.
- Its source is grok-able and documented, so it's easy to add your own modules (or "operators", as they're called).
- Many add-on operators already exist from the community.
- Easier to get the team to version-control your process flows.
Some cons, from the light use I've seen:
- If you scale beyond a point, you have to take care of scaling the database as well, adding DBA work.
- I've encountered some issues with the scheduler, backfilled jobs, and depends_on_past, but it might be my limited experience.
- People may start to use specific external dependencies/modules, which you will then need to keep track of.
- Uses its own lingo/terminology, which you'll have to learn and use.
- Uses system time, so no running in different timezones.
If it seems interesting to you, my suggestion is to start small: keep in mind that it handles relations between tasks, not data, and try automating some easy bash script that you currently run with cron.
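Following that suggestion, a cron'd bash script migrates to Airflow as a one-task DAG. A minimal sketch (Airflow 1.10-era API; the DAG id, script path, and schedule are placeholders):

```python
# Minimal Airflow DAG wrapping a bash script previously run from cron.
# Assumes apache-airflow ~1.10 is installed; names below are made up.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "me",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    "depends_on_past": False,  # see the caveat about depends_on_past above
}

dag = DAG(
    "nightly_export",               # replaces: 0 2 * * * /opt/scripts/export.sh
    default_args=default_args,
    start_date=datetime(2018, 1, 1),
    schedule_interval="0 2 * * *",  # Airflow accepts the same cron expression
    catchup=False,                  # avoid a wall of backfilled runs on first deploy
)

run_export = BashOperator(
    task_id="run_export",
    bash_command="/opt/scripts/export.sh ",  # trailing space keeps Airflow from treating .sh as a Jinja template file
    dag=dag,
)
```

Drop the file into the dags/ folder and the scheduler picks it up; the UI then gives you retries, logs, and run history that plain cron lacks.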
https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems
http://bytepawn.com/fetchr-airflow.html
http://bytepawn.com/fetchr-data-science-infra.html
https://www.astronomer.io/guides/
https://news.ycombinator.com/item?id=17867987
https://robinhood.engineering/why-robinhood-uses-airflow-aed13a9a90c8
http://bytepawn.com/luigi-airflow-pinball.html
MemSQL:
https://news.ycombinator.com/item?id=18391863
https://www.memsql.com/pipelines/
https://blog.gitprime.com/designing-performance-management-systems/
DEBUGGING: https://rr-project.org/
https://internetintel.oracle.com/blog-single.html?id=China+Telecom%27s+Internet+Traffic+Misdirection
https://sysdig.com/blog/fishing-for-hackers/
https://www.reddit.com/r/ETL/comments/9up7f8/etl_from_oltps_to_dwh/
https://www.talend.com/products/data-catalog/
https://www.talend.com/products/data-quality/
https://www.talend.com/products/data-stewardship/
https://www.talend.com/products/mdm/
https://www.talend.com/products/mdm/mdm-platform/
https://www.snowflake.com/product/architecture/
https://www.snowflake.com/resource/cloud-data-warehousing-dummies/
https://www.cloudanalyticsacademy.com/
https://resources.snowflake.com/
https://aws.amazon.com/mp/scenarios/bi/
http://www.etldatabase.com/etl-process
http://www.dblab.ntua.gr/pubs/uploads/TR-2003-8.pdf
https://www.toolsverse.com/etl-framework/etl-examples/index.html
https://www.tutorialspoint.com/etl_testing/etl_testing_scenarios.htm
http://www.datagaps.com/concepts/etl-testing
https://www.informationbuilders.com/etl-tools
http://testing-dwh.blogspot.com/2012/11/etl-test-scenarios-and-test-cases.html
https://www.educba.com/etl-interview-questions/
https://blog.appliedai.com/etl/
https://www.softwaretestinghelp.com/etl-testing-data-warehouse-testing/
field transformations
file & record transformations
extract logic - dump 100% of the data each time? extract just changed rows?
latency between extracts & loads
reprocessing requirements
auditing & logging requirements
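The "extract just changed rows" option above is usually implemented with a watermark column; here's a minimal sqlite3 illustration (table and column names are hypothetical):

```python
# Incremental extract: pull only rows changed since the last watermark,
# instead of dumping 100% of the source each time. Schema is made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2018-11-01"), (2, 20.0, "2018-11-05"), (3, 30.0, "2018-11-09")],
)

def extract_changed(conn, watermark):
    """Return rows modified after the last successful extract."""
    cur = conn.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY id",
        (watermark,),
    )
    return cur.fetchall()

# First run: watermark far in the past -> full extract.
full = extract_changed(conn, "1970-01-01")
# Next run: only rows touched after the stored watermark come back.
delta = extract_changed(conn, "2018-11-04")
print(len(full), len(delta))  # 3 2
```

After each successful load you persist the max updated_at you saw, which becomes the next run's watermark; that choice also drives the latency and reprocessing points above.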
https://intellipaat.com/interview-question/etl-interview-questions/
https://www.guru99.com/etl-testing-interview-questions.html
https://www.guru99.com/etl-extract-load-process.html
https://www.tutorialspoint.com/etl_testing/etl_testing_interview_questions.htm
https://blog.interviewmocha.com/top-20-etl-interview-questions-to-assess-hire-etl-developer
https://www.educba.com/etl-interview-questions/
http://www.geekinterview.com/Interview-Questions/Data-Warehouse/ETL/page4
https://www.wisdomjobs.com/e-university/etl-testing-interview-questions.html
http://www.complexsql.com/etl-testing-interview-questions/
Recursive SQL syntax: SQL Server supports two types of CTEs, recursive and nonrecursive:
WITH cte_name [ ( column_name [ ,... ] ) ] AS ( cte_query ) [ ,... ]
ETL design: good reject processing and logging; easy to maintain; familiarity with ETL principles.
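A recursive CTE in action, run through sqlite3 so the example is self-contained; SQL Server's WITH syntax above is analogous (SQL Server omits the RECURSIVE keyword):

```python
# Recursive CTE: an anchor member UNION ALL'd with a recursive member
# that references the CTE itself, plus a termination condition.
import sqlite3

conn = sqlite3.connect(":memory:")
rows = conn.execute(
    """
    WITH RECURSIVE counter(n) AS (
        SELECT 1              -- anchor member
        UNION ALL
        SELECT n + 1          -- recursive member references counter
        FROM counter
        WHERE n < 5           -- termination condition
    )
    SELECT n FROM counter
    """
).fetchall()
print(rows)  # [(1,), (2,), (3,), (4,), (5,)]
```

The same shape handles hierarchies (org charts, bill-of-materials) by joining the recursive member back to a parent-key column.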
How can we update a record in the target table without using an Update Strategy transformation?
Define the key on the target table at the Informatica level, then connect the key and the field you want to update in the mapping target. At the session level, set the target property to "Update as Update" and check the "Update" check-box.
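Outside Informatica, "Update as Update" is just a keyed UPDATE against the target: matched keys are updated, unmatched incoming rows are silently dropped rather than inserted. A sqlite3 sketch with a made-up target table:

```python
# "Update as Update": incoming rows update the target wherever the key
# matches; a row with no matching key causes no insert. Schema is made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (cust_id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO target VALUES (?, ?)", [(1, "new"), (2, "new")])

incoming = [(1, "active"), (3, "active")]  # cust_id 3 has no match -> ignored
for cust_id, status in incoming:
    conn.execute("UPDATE target SET status = ? WHERE cust_id = ?",
                 (status, cust_id))

rows = conn.execute(
    "SELECT cust_id, status FROM target ORDER BY cust_id"
).fetchall()
print(rows)  # [(1, 'active'), (2, 'new')]
```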
He asked me the difference between LEFT JOIN and RIGHT JOIN, and, given that you can get the same result by swapping the table order, why SQL Server includes RIGHT JOIN at all. The other question was: what should be the specialty of a database developer?
http://dwbimaster.com/sql-interview-questions-and-answers/
Note: When you create a mapping with a Lookup transformation that uses a dynamic lookup cache, you must use Update Strategy transformations to flag the rows for the target tables.
Why is it necessary to clean data before loading it into the warehouse?
Data cleansing is the process of detecting and correcting corrupt or inaccurate data in a table or database. The usual steps are:
1) Data auditing
2) Workflow specification
3) Workflow execution
4) Post-processing and controlling
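The steps above can be sketched as a single cleansing pass: audit each record, correct what is fixable, reject the rest, and keep counts for post-processing. A toy Python example with made-up rules:

```python
# Toy data-cleansing pass. The validation rule (email must contain "@")
# and the normalization (strip/lowercase) are illustrative assumptions.
def cleanse(records):
    clean, rejected = [], []
    audit = {"fixed": 0, "rejected": 0}   # counts for post-processing
    for rec in records:
        email = rec.get("email", "").strip().lower()  # correct: normalize
        if "@" not in email:                          # detect: invalid value
            rejected.append(rec)
            audit["rejected"] += 1
            continue
        if email != rec.get("email"):
            audit["fixed"] += 1
        clean.append({**rec, "email": email})
    return clean, rejected, audit

clean, rejected, audit = cleanse([
    {"email": " Alice@Example.COM "},   # fixable: whitespace and case
    {"email": "bob@example.com"},       # already clean
    {"email": "not-an-email"},          # corrupt: rejected
])
print(audit)  # {'fixed': 1, 'rejected': 1}
```

Loading only `clean` into the warehouse while logging `rejected` and `audit` is exactly the "good reject processing and logging" the design note above asks for.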
http://www.geekinterview.com/question_details/83471
https://www.edureka.co/blog/interview-questions/informatica-interview-questions/
https://www.careerride.com/ETL-interview-questions.aspx
SAP HANA: ETL tool
https://medium.com/the-data-experience/building-a-data-pipeline-from-scratch-32b712cfb1db
Airflow¶
pip install apache-airflow[all]
# if the GPL unidecode dependency blocks the install, set one of:
set AIRFLOW_GPL_UNIDECODE=yes            (Windows)
export SLUGIFY_USES_TEXT_UNIDECODE=yes   (Unix)
pip install apache-airflow[all-dbs]
airflow initdb
airflow webserver -p 8080
https://www.pachyderm.io/open_source.html https://www.pcmaffey.com/roll-your-own-analytics/ https://thehftguy.com/2016/10/20/building-an-analytics-pipeline-in-2016-the-ultimate-guide/