etl¶
https://news.ycombinator.com/item?id=17781762
http://www.etldatabase.com/etl-tools/
https://medium.com/strava-engineering/from-data-streams-to-a-data-lake-b6ca17c00a23
https://www.brianlikespostgres.com/
https://robinhood.engineering/why-robinhood-uses-airflow-aed13a9a90c8
https://airflow.apache.org/concepts.html
https://multithreaded.stitchfix.com/blog/2017/11/22/patterns-of-soa-database-transactions/
Apache Airflow feedback
There are a bunch of different tools for the same job, from manual cron jobs to Luigi, Pinball, Azkaban, Oozie, Taverna, and Mistral. I've started using Airflow for personal projects, and am slowly probing for adoption in our shop, where applicable.
The good points I have seen:
- It's simple Python, not XML like Azkaban. I've seen people with less technical expertise build useful stuff quickly and automate their workflows.
- Very good UI, which just lets you do what you need without fuss.
- Easy to build modular and interactive flows, with interesting features such as sensors, communication between operators, triggers, etc.
- Everything is stored in a database, which I can query about anything related to the processes run and to Airflow itself.
- Its source is grok-able and documented, so it's easy to add your own modules (or "operators", as they're called).
- Many add-on operators already exist from the community.
- Easier to get the team to version-control your process flows.
Some cons, from the light use I've seen:
- If you scale beyond a point, you have to take care of scaling the database as well, adding DBA work.
- I've encountered some issues with the scheduler, backfilled jobs, and depends_on_past, but it might be my limited experience.
- People may start to use specific external dependencies/modules, which you will then need to keep track of.
- Uses its own lingo/terminology, which you'll have to learn and use.
- Uses system time, so no running in different timezones.
If it seems interesting to you, my suggestion is to start small: keep in mind that it handles relations between tasks, not data, and try automating some easy bash script that you currently run with cron.
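Following that suggestion, a cron'd bash script migrates to Airflow as a one-task DAG. A minimal sketch (Airflow 1.10-era API; the DAG id, script path, and schedule are placeholders):

```python
# Minimal Airflow DAG wrapping a bash script previously run from cron.
# Assumes apache-airflow ~1.10 is installed; names below are made up.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "me",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    "depends_on_past": False,  # see the caveat about depends_on_past above
}

dag = DAG(
    "nightly_export",               # replaces: 0 2 * * * /opt/scripts/export.sh
    default_args=default_args,
    start_date=datetime(2018, 1, 1),
    schedule_interval="0 2 * * *",  # Airflow accepts the same cron expression
    catchup=False,                  # avoid a wall of backfilled runs on first deploy
)

run_export = BashOperator(
    task_id="run_export",
    bash_command="/opt/scripts/export.sh ",  # trailing space keeps Airflow from treating .sh as a Jinja template file
    dag=dag,
)
```

Drop the file into the dags/ folder and the scheduler picks it up; the UI then gives you retries, logs, and run history that plain cron lacks.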
https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems
http://bytepawn.com/fetchr-airflow.html
http://bytepawn.com/fetchr-data-science-infra.html
https://www.astronomer.io/guides/
https://news.ycombinator.com/item?id=17867987
https://robinhood.engineering/why-robinhood-uses-airflow-aed13a9a90c8
http://bytepawn.com/luigi-airflow-pinball.html
MemSQL:
https://news.ycombinator.com/item?id=18391863
https://www.memsql.com/pipelines/
https://blog.gitprime.com/designing-performance-management-systems/
DEBUGGING: https://rr-project.org/
https://internetintel.oracle.com/blog-single.html?id=China+Telecom%27s+Internet+Traffic+Misdirection
https://sysdig.com/blog/fishing-for-hackers/
https://www.reddit.com/r/ETL/comments/9up7f8/etl_from_oltps_to_dwh/
https://www.talend.com/products/data-catalog/
https://www.talend.com/products/data-quality/
https://www.talend.com/products/data-stewardship/
https://www.talend.com/products/mdm/
https://www.talend.com/products/mdm/mdm-platform/
https://www.snowflake.com/product/architecture/
https://www.snowflake.com/resource/cloud-data-warehousing-dummies/
https://www.cloudanalyticsacademy.com/
https://resources.snowflake.com/
https://aws.amazon.com/mp/scenarios/bi/
http://www.etldatabase.com/etl-process
http://www.dblab.ntua.gr/pubs/uploads/TR-2003-8.pdf
https://www.toolsverse.com/etl-framework/etl-examples/index.html
https://www.tutorialspoint.com/etl_testing/etl_testing_scenarios.htm
http://www.datagaps.com/concepts/etl-testing
https://www.informationbuilders.com/etl-tools
http://testing-dwh.blogspot.com/2012/11/etl-test-scenarios-and-test-cases.html
https://www.educba.com/etl-interview-questions/
https://blog.appliedai.com/etl/
https://www.softwaretestinghelp.com/etl-testing-data-warehouse-testing/
field transformations
file & record transformations
extract logic - dump 100% of the data each time? extract just changed rows?
latency between extracts & loads
reprocessing requirements
auditing & logging requirements
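The "extract just changed rows" option above is usually implemented with a watermark column; here's a minimal sqlite3 illustration (table and column names are hypothetical):

```python
# Incremental extract: pull only rows changed since the last watermark,
# instead of dumping 100% of the source each time. Schema is made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2018-11-01"), (2, 20.0, "2018-11-05"), (3, 30.0, "2018-11-09")],
)

def extract_changed(conn, watermark):
    """Return rows modified after the last successful extract."""
    cur = conn.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY id",
        (watermark,),
    )
    return cur.fetchall()

# First run: watermark far in the past -> full extract.
full = extract_changed(conn, "1970-01-01")
# Next run: only rows touched after the stored watermark come back.
delta = extract_changed(conn, "2018-11-04")
print(len(full), len(delta))  # 3 2
```

After each successful load you persist the max updated_at you saw, which becomes the next run's watermark; that choice also drives the latency and reprocessing points above.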
https://intellipaat.com/interview-question/etl-interview-questions/
https://www.guru99.com/etl-testing-interview-questions.html
https://www.guru99.com/etl-extract-load-process.html
https://www.tutorialspoint.com/etl_testing/etl_testing_interview_questions.htm
https://blog.interviewmocha.com/top-20-etl-interview-questions-to-assess-hire-etl-developer
https://www.educba.com/etl-interview-questions/
http://www.geekinterview.com/Interview-Questions/Data-Warehouse/ETL/page4
https://www.wisdomjobs.com/e-university/etl-testing-interview-questions.html
http://www.complexsql.com/etl-testing-interview-questions/
Recursive SQL syntax: SQL Server supports two types of CTEs, recursive and nonrecursive:
WITH cte_name [ ( column_name [ ,... ] ) ] AS ( cte_query ) [ ,... ]
ETL design: good reject processing and logging; easy to maintain; familiarity with ETL principles.
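A recursive CTE in action, run through sqlite3 so the example is self-contained; SQL Server's WITH syntax above is analogous (SQL Server omits the RECURSIVE keyword):

```python
# Recursive CTE: an anchor member UNION ALL'd with a recursive member
# that references the CTE itself, plus a termination condition.
import sqlite3

conn = sqlite3.connect(":memory:")
rows = conn.execute(
    """
    WITH RECURSIVE counter(n) AS (
        SELECT 1              -- anchor member
        UNION ALL
        SELECT n + 1          -- recursive member references counter
        FROM counter
        WHERE n < 5           -- termination condition
    )
    SELECT n FROM counter
    """
).fetchall()
print(rows)  # [(1,), (2,), (3,), (4,), (5,)]
```

The same shape handles hierarchies (org charts, bill-of-materials) by joining the recursive member back to a parent-key column.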
How can we update a record in the target table without using an Update Strategy transformation?
Define the key on the target table at the Informatica level, then connect the key and the field you want to update in the mapping target. At the session level, set the target property to "Update as Update" and check the "Update" check-box.
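Outside Informatica, "Update as Update" is just a keyed UPDATE against the target: matched keys are updated, unmatched incoming rows are silently dropped rather than inserted. A sqlite3 sketch with a made-up target table:

```python
# "Update as Update": incoming rows update the target wherever the key
# matches; a row with no matching key causes no insert. Schema is made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (cust_id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO target VALUES (?, ?)", [(1, "new"), (2, "new")])

incoming = [(1, "active"), (3, "active")]  # cust_id 3 has no match -> ignored
for cust_id, status in incoming:
    conn.execute("UPDATE target SET status = ? WHERE cust_id = ?",
                 (status, cust_id))

rows = conn.execute(
    "SELECT cust_id, status FROM target ORDER BY cust_id"
).fetchall()
print(rows)  # [(1, 'active'), (2, 'new')]
```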
He asked me the difference between LEFT JOIN and RIGHT JOIN, and, given that you can get the same result by swapping the table order, why SQL Server includes RIGHT JOIN at all. The other question was: what should be the specialty of a database developer?
http://dwbimaster.com/sql-interview-questions-and-answers/
Note: When you create a mapping with a Lookup transformation that uses a dynamic lookup cache, you must use Update Strategy transformations to flag the rows for the target tables.
Why is it necessary to clean data before loading it into the warehouse?
Data cleansing is the process of detecting and correcting corrupt or inaccurate data in a table or database. The usual steps are:
1) Data auditing
2) Workflow specification
3) Workflow execution
4) Post-processing and controlling
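The steps above can be sketched as a single cleansing pass: audit each record, correct what is fixable, reject the rest, and keep counts for post-processing. A toy Python example with made-up rules:

```python
# Toy data-cleansing pass. The validation rule (email must contain "@")
# and the normalization (strip/lowercase) are illustrative assumptions.
def cleanse(records):
    clean, rejected = [], []
    audit = {"fixed": 0, "rejected": 0}   # counts for post-processing
    for rec in records:
        email = rec.get("email", "").strip().lower()  # correct: normalize
        if "@" not in email:                          # detect: invalid value
            rejected.append(rec)
            audit["rejected"] += 1
            continue
        if email != rec.get("email"):
            audit["fixed"] += 1
        clean.append({**rec, "email": email})
    return clean, rejected, audit

clean, rejected, audit = cleanse([
    {"email": " Alice@Example.COM "},   # fixable: whitespace and case
    {"email": "bob@example.com"},       # already clean
    {"email": "not-an-email"},          # corrupt: rejected
])
print(audit)  # {'fixed': 1, 'rejected': 1}
```

Loading only `clean` into the warehouse while logging `rejected` and `audit` is exactly the "good reject processing and logging" the design note above asks for.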
http://www.geekinterview.com/question_details/83471
https://www.edureka.co/blog/interview-questions/informatica-interview-questions/
https://www.careerride.com/ETL-interview-questions.aspx
SAP HANA: ETL tool
https://medium.com/the-data-experience/building-a-data-pipeline-from-scratch-32b712cfb1db
Airflow¶
pip install apache-airflow[all]
# if the GPL unidecode dependency blocks the install, set one of:
set AIRFLOW_GPL_UNIDECODE=yes            (Windows)
export SLUGIFY_USES_TEXT_UNIDECODE=yes   (Unix)
pip install apache-airflow[all-dbs]
airflow initdb
airflow webserver -p 8080
https://www.pachyderm.io/open_source.html https://www.pcmaffey.com/roll-your-own-analytics/ https://thehftguy.com/2016/10/20/building-an-analytics-pipeline-in-2016-the-ultimate-guide/