Marquez Monthly Community Meeting
The Marquez Community Meeting occurs on the fourth Thursday of each month. Meetings are held on Zoom.
Next meeting: February 23, 2023
January 26, 2023
Attendees:
- TSC:
- Willy Lulciuc, Co-creator of Marquez
- Michael Collado, Staff Software Engineer, Astronomer
- Peter Hicks, Senior Engineer, Astronomer
- And:
- Howard Yoo, Staff Product Manager, Astronomer
- Michael Robinson, Software Engineer, Developer Relations, Astronomer
- Minkyu Park, Senior Engineer, Astronomer
- Maciej Obuchowski, OpenLineage Committer and Software Engineer, GetInData
- John Thomas, Software Engineer, Dev. Rel., Astronomer
- Prachi Mishra, Senior Engineer, Astronomer
- Paweł Leszczyński, Data Engineer, GetInData
- Sam Holmberg, Senior Engineer, Astronomer
- Yannick Libert, Lead Data Engineer, Decathlon France
- Henoc Mukadi, Prodigy Finance
- Bramha Naidu Aelem, Big Data/ML/AI Cloud Architect, Tiger Analytics
Meeting:
Agenda:
- Announcement: a new AWS Big Data blog post about OpenLineage & Marquez
- Recent releases 0.29.0 and 0.30.0 [Michael R.]
- Column lineage overview and demo [Pawel]
- Soft delete UI feature overview [Howard]
- New OpenLineage facets, migration process [Willy]
- "2023 in Marquez" roadmap discussion [Willy]
November 17, 2022
Attendees:
- TSC:
- Willy Lulciuc, Co-creator of Marquez
- Michael Collado, Staff Software Engineer, Astronomer
- Peter Hicks, Senior Engineer, Astronomer
- Julien Le Dem, Chief Architect, Astronomer
- And:
- Howard Yoo, Staff Product Manager, Astronomer
- Michael Robinson, Software Engineer, Developer Relations, Astronomer
- Minkyu Park, Senior Engineer, Astronomer
- John Thomas, Software Engineer, Dev. Rel., Astronomer
- Prachi Mishra, Senior Engineer, Astronomer
- Ross Turk, Senior Director of Community, Astronomer
Agenda:
- Announcements
- LFAI & Data progress update
- Under-documented topics
- Review of major architectural decisions
- Open discussion
Meeting:
Notes:
- Announcements [Willy]
- Marquez 0.28.0 is coming soon, featuring:
- new optimized current runs query
- new governance docs
- ability to soft-delete namespaces
- The next Marquez meeting will be on January 26th
- LFAI & Data progress update [Michael R.]
- LFAI & Data structure
- under umbrella of the LF
- hosted projects need approval of TAC and Governing Board
- Marquez one of many open-source projects hosted by the LFAI
- Current status (since December 2019): Incubation
- Next milestone: Graduation
- To dos/outstanding:
- one unassociated significant contribution (e.g., integration)
- CII Silver Badge (96%)
- CII Gold Badge (83%)
- appointment of technical lead
- approving votes of LFAI TAC and Governing Board
- Under-documented topics [Willy]
- Review of major architectural decisions [Willy]
- Open discussion
- should Marquez have a social media account in addition to the Twitter account?
- Mastodon a good candidate
- an unofficial OpenLineage account already exists there
- how can the project be internationalized to meet the expectations of the LFAI?
- a tool such as React's i18next would make this task less daunting
October 27, 2022
Attendees:
- TSC:
- Willy Lulciuc, Co-creator of Marquez
- Michael Collado, Staff Software Engineer, Astronomer
- And:
- Michael Robinson, Software Engineer, Developer Relations, Astronomer
- Ross Turk, Senior Director of Community, Astronomer
- Paweł Leszczyński, Data Engineer, GetInData
- Minkyu Park, Senior Engineer, Astronomer
- John Thomas, Software Engineer, Dev. Rel., Astronomer
- Arek Osinski, Senior Data Engineer, Allegro Group
- Prachi Mishra, Senior Engineer, Astronomer
Agenda:
- Announcements
- Recent release 0.27.0
- Dataset symlinks feature demo [Pawel]
- Node color changes to reflect run state in UI demo [Willy]
- UI improvements roadmap review [Willy]
Meeting:
Notes:
- Announcements
- Marquez 0.27.0 was released on October 24th
- FYI, today, October 27th, is the CFP deadline for Data Council Austin 2023
- Recent release 0.27.0
- New dataset symlinks feature:
- Implement dataset symlink feature #2066 @pawel-big-lebowski
- Provide dataset_symlinks table for SymlinkDatasetFacet #2087 @pawel-big-lebowski
- New column lineage feature (see the example request after this release summary):
- Model and store column lineage in Marquez #2096 @mzareba382 @pawel-big-lebowski
- Add a lineage graph endpoint for column lineage #2124 @pawel-big-lebowski
- Enrich returned dataset resource with column lineage information #2113 @pawel-big-lebowski
- Add downstream column lineage #2159 @pawel-big-lebowski
- Include column lineage in dataset resource #2148 @pawel-big-lebowski
- Implement column lineage within Marquez Java client #2163 @pawel-big-lebowski
- Add endpoint to get column lineage by a job #2204 @pawel-big-lebowski
- Add column lineage methods to Python client #2209 @pawel-big-lebowski
- Fix column lineage returning multiple entries for job run multiple times #2176 @pawel-big-lebowski
- Increase size of column-lineage.description column #2205 @pawel-big-lebowski
- Fix downstream recursion #2181 @pawel-big-lebowski
- Lineage graph changes:
- Display current run state for job node in lineage graph #2146 @wslulciuc
- API changes:
- Update insert job function to avoid joining on symlinks for jobs with no symlinks #2144 @collado-mike
- Add support for parentRun facet as reported by older Airflow OpenLineage versions #2130 @collado-mike
- Add fix and tests for handling Airflow DAGs with dots and task groups #2126 @collado-mike @wslulciuc
- Fix bug that caused a single run event to create multiple jobs #2162 @collado-mike
- Update jobs_current_version_uuid_index and jobs_symlink_target_uuid_index to ignore NULL values #2186 @collado-mike
Release: https://github.com/MarquezProject/marquez/releases/tag/0.27.0
Changelog: https://github.com/MarquezProject/marquez/blob/0.27.0/CHANGELOG.md
Commit history: https://github.com/MarquezProject/marquez/compare/0.26.0...0.27.0
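For anyone who wants to try the new column lineage endpoints listed above, here is a minimal sketch (the base URL, namespace, dataset, and field names are placeholders, and the exact query parameters should be checked against the Marquez API reference):

```python
import requests

MARQUEZ_URL = "http://localhost:5000"  # assumes a local Marquez instance

# Column lineage is addressed by a dataset-field node ID of the form
# datasetField:<namespace>:<dataset>:<field> (placeholder values below).
node_id = "datasetField:food_delivery:public.delivery_7_days:order_id"

resp = requests.get(
    f"{MARQUEZ_URL}/api/v1/column-lineage",
    params={"nodeId": node_id, "withDownstream": "true"},  # param names assumed
    timeout=10,
)
resp.raise_for_status()

# The response is a graph of column-level nodes with their edges.
for node in resp.json().get("graph", []):
    print(node["id"])
```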
- Dataset symlinks feature demo [Pawel]
- This workshop is available, including all installation steps, in the `openlineage/workshops` repository on GitHub
- Scenario: datasets are sometimes known by different names
- This can lead to broken lineage
- An extra facet makes the dataset symlinks feature possible
- The facet is used to create lineage edges over an alternate name
- Workshop notes:
- involves starting a Spark cluster and using the Spark OpenLineage connector, accessing a Hive metastore
- when verifying the event using the Marquez events API endpoint, one can see the different name in the output
- when trying to locate the same dataset using both names, one gets the same dataset back
- when accessing a lineage graph, one can see two tables represented
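For reference, a minimal sketch of what the symlink metadata described above looks like on a dataset in an OpenLineage event (the namespaces, table names, and producer URI are placeholders; the SymlinksDatasetFacet spec is the authoritative schema):

```python
# A dataset reported under its physical name, with a symlink to the Hive
# metastore table name it is also known by (placeholder values throughout).
dataset_with_symlink = {
    "namespace": "file",
    "name": "/warehouse/mydb.db/orders",  # physical location
    "facets": {
        "symlinks": {
            "_producer": "https://example.com/my-producer",
            "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/SymlinksDatasetFacet.json",
            "identifiers": [
                {"namespace": "hive://metastore:9083", "name": "mydb.orders", "type": "TABLE"}
            ],
        }
    },
}
```

Marquez uses the identifiers in this facet to resolve both names to the same dataset and draw the lineage edges accordingly.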
- Node color changes to reflect run state in UI demo [Willy]
- The changes are part of a recent PR completed with the help of Peter Hicks
- The work was spurred by a discussion in the #random channel in the Marquez Slack
- Colors are now used to indicate run state in the UI
- Now available, and feedback is welcome
- Question: does the run state come from Airflow?
- The Airflow integration does support it
- But it is supported globally, as well
- More information about using Airflow with Marquez is available in `marquez/examples/airflow` on GitHub
- UI improvements roadmap review [Willy]
- The roadmap is publicly available on GitHub Projects
- The roadmap is filterable by label (e.g., "web")
- Potential contributors are welcome to pick up any of the good first issues there
- Howard Yoo does a lot of work on the roadmap
- You should start seeing more of these features in future releases:
- raw event viewer to make use of the new events endpoint
- will make event stats and JSON payloads available in the UI
- search enhancements
- recently proposed: use Elastic Search, instead of matched text searching, for search
- facet viewer
- will take advantage of OpenLineage facets, make them interactive in the UI (expandable, collapsable, etc.)
- time range-based query
- will provide an API for retrieving historical data, make former versions of datasets viewable and comparable
- dataset versions, job versions, run IDs make point-in-time snapshots possible
- discussions about how to proceed are ongoing
- lineage graph display mode
- will make job status visible in the UI (e.g., "failed")
- soft delete
- ability to delete metadata
- Marquez should be the source of truth, so deletion should not be permanent
- "deleted" datasets will be available on the backend but not visible on the lineage graph
- under discussion: should all users have the ability to delete?
- Feedback on all open issues is welcome!
- Big thanks to Howard Yoo for his work on these issues!
September 22, 2022
Attendees:
TSC:
- Willy Lulciuc, Co-creator of Marquez
- Peter Hicks, Senior Engineer, Astronomer
- Julien Le Dem, Chief Architect, Astronomer
- Michael Collado, Staff Software Engineer, Astronomer
And:
- Paweł Leszczyński, Data Engineer, GetInData
- Harel Shein, Director of Engineering, Astronomer
- Ross Turk, Senior Director of Community, Astronomer
- Howard Yoo, Staff Product Manager, Astronomer
- Michael Robinson, Software Engineer, Developer Relations, Astronomer
- Ryan Hatter, Customer Reliability Engineer, Astronomer
- Minkyu Park, Senior Engineer, Astronomer
- Maciej Obuchowski, OpenLineage Committer and Software Engineer, GetInData
- John Thomas, Software Engineer, Dev. Rel., Astronomer
- Herrick Muhlestein, Software Engineer, Ancestry
- Amay Kadre, Senior Software Engineer, Ancestry
- Dayle Woolston, Principal Software Engineer, Ancestry
Agenda:
Announcements
Recent release 0.26.0
Recent work on versioning
New and in-process APIs
Discussion topics:
How to improve the Marquez UI?
New/in-process APIs
Enhancing search to include schema field names and facets
Meeting:
Slides: https://docs.google.com/presentation/d/160WuwGB0hQSpfMRq_4_R0xls6VXkFYxtpQvj57_SIw0/edit?usp=sharing
Notes:
Announcements
- Marquez stickers are still available: https://www.astronomer.io/datakin-swag
- Recent talk:
- Willy at LinuxCon: https://www.youtube.com/watch?v=sN7j5mZcUQA
- LFAI & Data progress update:
- External contributions needed
Marquez 0.26.0
- ADDED
- Add possibility to soft-delete datasets and jobs #2032 #2099 #2101 @mobuchowski
- Add raw OpenLineage events API #2070 @mobuchowski
- Update FlywayFactory to support an argument to customize the schema programmatically #2055 @collado-mike
- Add --metadata option & metadata cmd #2082 #2091 @wslulciuc
- Create column lineage endpoint proposal #2077 @julienledem @pawel-big-lebowski
- Add steps on proposing changes to Marquez #2065 @wslulciuc
- Improve documentation on nodeId in the spec #2084 @howardyoo
- CHANGED
- Update lineage query to only look at jobs with inputs or outputs #2068 @collado-mike
- Persist OpenLineage event before updating Marquez model #2069 @fm100
- Drop requirement to provide marquez.yml for seed cmd #2094 @wslulciuc
- FIXED
- Fix/rewrite jobs fqn locks #2067 @collado-mike
- Fix enum string types in the OpenAPI spec #2086 @studiosciences
- Fix incorrect PostgreSQL version #2089 @jabbera
- Update OpenLineageDao to handle Airflow run UUID conflicts #2097 @collado-mike
Release: https://github.com/MarquezProject/marquez/releases/tag/0.26.0
Changelog: https://github.com/MarquezProject/marquez/blob/0.26.0/CHANGELOG.md
Commit history: https://github.com/MarquezProject/marquez/compare/0.25.0...0.26.0
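As a quick illustration of the raw OpenLineage events API added in #2070, a minimal sketch (the base URL and paging parameter are assumptions to verify against the API docs):

```python
import requests

MARQUEZ_URL = "http://localhost:5000"  # assumes a local Marquez instance

# List the raw OpenLineage events Marquez has received.
resp = requests.get(
    f"{MARQUEZ_URL}/api/v1/events/lineage",
    params={"limit": 10},  # paging parameter assumed
    timeout=10,
)
resp.raise_for_status()

for event in resp.json().get("events", []):
    print(event["eventType"], event["job"]["namespace"], event["job"]["name"])
```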
Recent work on versioning [Ryan]
- Open issues regarding dataset versioning: 1977, 1883
- Confusing: schema for dataset version contains UUID and version field
- 2071: tries to resolve confusion by removing the version field and replacing it with an external version
- supports different tools that might have a dataset version baked in
- Mqz users can use this as a version field
- These improvements are welcome [Willy]
- The project's approach to versioning has remained unchanged since we began
- Mqz is opinionated about versioning, but other systems and dbs have their own versioning
- This change hasn't been on the roadmap, but we've known for a long time that it was needed
- This will add the flexibility that OpenLineage offers
New and in-process APIs [Maciej]
- Raw events API
- Future work needed: make it possible to get the data via namespaces
- Challenge
- Delete APIs
- soft delete approach
- future work: make it possible to clear an entire namespace
- planned: "real" deletion
- Q & A
- is it possible to undelete datasets and jobs?
- not yet
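A minimal sketch of what the soft-delete calls might look like, assuming the delete endpoints follow the usual Marquez REST pattern (namespace, dataset, and job names are placeholders):

```python
import requests

MARQUEZ_URL = "http://localhost:5000"  # assumes a local Marquez instance
NAMESPACE = "food_delivery"            # placeholder namespace

# Soft-delete a dataset: it remains in the backend but is hidden from the graph.
requests.delete(
    f"{MARQUEZ_URL}/api/v1/namespaces/{NAMESPACE}/datasets/public.delivery_7_days",
    timeout=10,
).raise_for_status()

# Jobs can be soft-deleted the same way.
requests.delete(
    f"{MARQUEZ_URL}/api/v1/namespaces/{NAMESPACE}/jobs/etl_delivery_7_days",
    timeout=10,
).raise_for_status()
```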
Discussion topics
- updating the UI [Howard]
- list of possible improvements:
- Enhancing search [Herrick]
- Current state of search in Mqz: helpful if looking for job or dataset
- However, more data is available
- First question from our data governance team: can we look up a column?
- Possible enhancements:
- search for column names and descriptions
- job codes (e.g., SQL queries)
- job facet property names and values within JSON
- job descriptions
- Potential benefits:
- easy to find where data comes from given a column or keyword
- what job transformed a column in a dataset
- reverse lookup from metadata such as a SQL query or S3 bucket to find a related job
- Other ideas:
- limit search downstream from a specific job or dataset
- display job status color in lineage view to quickly find failed jobs
- add a "last status" column in the Jobs list view to quickly find failed jobs
- data "consumer" awareness such as a dashboard (OpenLineage dependency)
- custom dataset icons (Kafka, API), to help visualize where things are coming from
- Some of these ideas could be implemented in conjunction with existing ongoing projects, such as column-level lineage [Mike C.]
- We would be happy to help you be successful in whichever parts of these you would want to build [Julien]
- These are small changes but very impactful for usability [Willy]
- search has never been very sophisticated because we're not using a true search engine
- start by creating issues!
- Some of these are low-hanging fruit, but it would be helpful to have them prioritized [Peter]
- A UI hack day might be all we need to knock many of these out [Willy]
- Publicize some of these by creating issues labeled as good first issues [Minkyu]
August 25, 2022
Attendees:
- TSC:
- Michael Collado, Staff Software Engineer, Astronomer
- Julien Le Dem, Chief Architect, Astronomer
- Willy Lulciuc, Co-creator of Marquez
- And:
- Minkyu Park, Senior Engineer, Astronomer
- Nikhil Koli, Software Engineer, Moody's
- Harel Shein, Director of Engineering, Astronomer
- Michael Robinson, Software Engineer, Developer Relations, Astronomer
- Ryan Hatter, Customer Reliability Engineer, Astronomer
- Howard Yoo, Staff Product Manager, Astronomer
Agenda:
- Announcements [Willy]
- Recent release 0.25.0 [Michael R.]
- Column-level lineage proposal [Julien]
- Lineage optimization of `getLineage()` [Michael C.]
- New proposal process [Willy]
- Optimization of query performance for facets [Willy]
- Runs API removal/migration [Willy]
Notes:
Announcements [Willy]
Recent release 0.25.0 [Michael R.]
- Fixed
- Fix `py` module release #2057 @wslulciuc
- Use /bin/sh in web/docker/entrypoint.sh #2059 @wslulciuc
Column-level lineage proposal [Julien]
- Main use case: compliance (GDPR, CCPA, etc.)
- private information especially
- banking regulations
- Point in time lineage
- retrievable from Marquez: version of database in the past
- makes it possible to identify exactly where a breakdown in protocols happened
- New facet in the OpenLineage spec
- for each column in the output, you can specify where the data came from (see the sketch after this section)
- can also identify whether data is masked or not
- Collection of column-level lineage currently automatic in the Spark integration
- More to come! also: can be added to custom extractors
- Proposal
- add 3 endpoints
- column lineage as first-class in the lineage endpoint
- column lineage specific endpoint
- point in time lineage endpoint
- currently up for review
- describes use cases, proposed new endpoints
- most complicated: point in time lineage
- API requires dataset version ID (UUID)
- Next steps
- adding detail, use cases to the docs
- Q&A:
- Nikhil: possible to add point in time for jobs?
- JLD: possible at the run level (Marquez captures lineage for each run)
- runs point to specific versions of jobs
- see blog post on OpenLineage site for more info
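To make the proposal concrete, here is a minimal sketch of the column-level lineage facet as it could appear on an output dataset in an OpenLineage event (namespaces, table and field names, and the producer URI are placeholders; the OpenLineage spec is authoritative):

```python
# For each output column, the facet lists the input dataset fields it was
# derived from (placeholder values throughout).
output_dataset = {
    "namespace": "postgres://db:5432",
    "name": "public.delivery_7_days",
    "facets": {
        "columnLineage": {
            "_producer": "https://example.com/my-producer",
            "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/ColumnLineageDatasetFacet.json",
            "fields": {
                "order_id": {
                    "inputFields": [
                        {"namespace": "postgres://db:5432", "name": "public.orders", "field": "id"}
                    ]
                }
            },
        }
    },
}
```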
Lineage optimization of `getLineage()` [Michael C.]
- lineage query that uses temp tables to calculate inputs and outputs of every job
- uses left join to select only the current version of the job
- we were noticing that this query was taking several minutes to return due to the number of jobs (as many as 300k) in the database
- most popular operators have no inputs or outputs (e.g., bash and python operators)
- change: map from `job_versions_io_mapping` table
- reduced execution time to a few seconds
- this is a "hack" because eventually we want to cover Python and bash operators
- Q&A
- Julien: will there be a similar query for point in time lineage?
- MC: a different solution will be needed there
New proposal process [Willy]
- 4-step process
- open an issue (please follow the template)
- it will be either accepted or declined
- we'll add the issue to our backlog if it's accepted
- then we'll pin it to a milestone
- Check out the contributing guide when working on your PR
Optimization of query performance for facets [Willy]
- events can get very large
- proposal
- raw events have to be accessed every time for facets
- new separate tables will be used instead – e.g., `dataset_version_facets`
- look for this change in 0.26.0
- Q&A:
- Nikhil: possible to search for dataset versions using the search box?
- WL: search API currently very simple, but this could make for an interesting proposal
- Take a look at the data model (see link in proposal)
Runs API removal/migration [Willy]
- we've switched over to using OpenLineage events from the Runs API
- try it out using the `seed` command and pass in a file containing OpenLineage events
- facets and runs displayed in the UI
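A minimal sketch of reporting a run as an OpenLineage event instead of using the legacy Runs API (namespace, job, dataset names, and producer URI are placeholders):

```python
import uuid
from datetime import datetime, timezone

import requests

MARQUEZ_URL = "http://localhost:5000"  # assumes a local Marquez instance

# A single COMPLETE event; a real integration would emit START and COMPLETE.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "my-namespace", "name": "my-job"},
    "inputs": [{"namespace": "postgres://db:5432", "name": "public.orders"}],
    "outputs": [{"namespace": "postgres://db:5432", "name": "public.delivery_7_days"}],
    "producer": "https://example.com/my-producer",
}

resp = requests.post(f"{MARQUEZ_URL}/api/v1/lineage", json=event, timeout=10)
resp.raise_for_status()
```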
July 28, 2022
Attendees:
- TSC:
- Willy Lulciuc, Co-creator of Marquez
- Michael Collado, Staff Software Engineer, Astronomer
- And:
- Michael Robinson, Software Engineer, Dev. Rel., Astronomer
- Minkyu Park, Senior Engineer, Astronomer
- John Thomas, Software Engineer, Dev. Rel., Astronomer
- Ross Turk, Senior Director of Community, Astronomer
- Ryan Hatter, Customer Reliability Engineer, Astronomer
- Howard Yoo, Staff Product Manager, Astronomer
Agenda:
- Announcements
- Introducing the Marquez blog
- Architecture review: the lineage graph
- Discussion
- Marquez issue #2048
Meeting:
Notes:
Announcements [Willy]
Introducing the Marquez Blog [Michael R. and Ross]
- new blog can be found at marquezproject.ai/blog
- designed and built by Ross
- to contribute a blog post on GitHub:
- write post in Markdown, place it in new directory in OpenLineage/website/contents/blog
- OR: open an issue first to suggest a topic or get feedback on your idea
- artwork: Ross happy to make the images; tag him
- Ross also happy to document the artwork creation process for others
Architecture review: the lineage graph [Willy]
- What is Marquez doing in the background to surface lineage metadata at the run level during execution?
- What is a current lineage graph?
- a bipartite graph with nodes for jobs and datasets
- run-level lineage is collected from OpenLineage events
- the representation of a job is based on the datasets it consumes and produces (its inputs and outputs)
- datasets are stitched together using the OpenLineage `ID` (globally unique)
- versioning of jobs enabled by OpenLineage `JobVersion`
- Marquez keeps track of changes to code and datasets behind the scenes
- Marquez data model
- Marquez keeps track of:
- job versions
- runs of each version
- sources
- each node represents the latest, or current, version of the job's lineage
- a `Job` is an `ID` plus arrays representing its input and output datasets
- Marquez keeps track of:
- Demo
- UI defaults to latest/current graph
- prior versions accessible via `version history` tab
- selecting a version makes another job node/datasets visible
- makes "time travel" possible in your pipeline
- all of this possible thanks to the OpenLineage spec
- Q & A
- If a job has not completed, will you not see metadata? [Howard]
- no – a job has to complete in order for versioning logic to be applied
- Is a job version associated with the code that produced it? [Ryan]
- yes – if the code is provided as a source location facet
- Marquez will determine if the code has changed
- changes to the schema are also monitored using dataset versioning; this is tied to the job version
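For reference, a minimal sketch of fetching the current lineage graph for a job node via the lineage API (base URL, namespace, job name, and depth are placeholders):

```python
import requests

MARQUEZ_URL = "http://localhost:5000"  # assumes a local Marquez instance

# Node IDs follow the form job:<namespace>:<name> or dataset:<namespace>:<name>.
resp = requests.get(
    f"{MARQUEZ_URL}/api/v1/lineage",
    params={"nodeId": "job:food_delivery:etl_delivery_7_days", "depth": 3},
    timeout=10,
)
resp.raise_for_status()

# Each node in the returned graph is a job or a dataset with in/out edges.
for node in resp.json().get("graph", []):
    print(node["type"], node["id"])
```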
Discussion
- Howard: issue 2048:
- There is an edge case (using a custom extractor) where the TaskMetadata's given input or output dataset would NOT have the fields populated (`dataset.fields = []`).
- Having this type of metadata makes Marquez overwrite the existing version of the dataset with empty fields
- Proposal: Marquez should try to reuse the dataset instead of rewriting
- Agreed; question remains about how to do it [Willy]
- behavior reflects versioning logic
- possible solution: use `null` value in OL spec rather than empty array
- challenge: we want to avoid making assumptions
June 23, 2022
Attendees:
- TSC
- Willy Lulciuc, Co-creator of Marquez
- Julien Le Dem, Chief Architect, Astronomer
- And
- Martin Fiser, Head of Professional Services, Keboola
- Michael Robinson, Software Engineer, Dev. Rel., Astronomer
- Minkyu Park, Senior Engineer, Astronomer
- John Thomas, Support Engineer, Astronomer
- Naga Raghavarapu, Principal Software Engineer, Oracle
- Ross Turk, Senior Director of Community, Astronomer
Agenda:
- Announcements
- Recent release: 0.23.0
- User story by Martin Fiser (Keboola)
- Open discussion
Meeting:
Notes:
Announcements [Willy]
- Mqz/OL swag is still available!
- Willy talked about Mqz at the OS Summit (LinuxCon)
Recent Release 0.23.0 [Michael R.]
- Added
- Changed
- Set default limit for listing datasets and jobs in UI from 2000 to 25 (#2018, @wslulciuc)
- Fixed
Keboola Use Case [Martin]
- Topic: OL integration with the Keboola platform
- Overview of platform
- modern data experience: data stack as a service
- all-in-one service
- writers/reverse ETL through component framework
- enables version control, governance, etc., in workspaces
- much metadata produced and collected, permitting visibility across entire pipeline
- pipeline jobs
- storage events
- data loads/unloads
- user-generated metadata
- Purpose of OL integration
- data governance to support users' feeding data to external tools
- OL a "language" for speaking to various tools
- offer API for OL information
- native Keboola component
- feeds OL information to an endpoint (e.g., Marquez)
- can be orchestrated on customizable interval
- supports SSH
- exports full job information to the endpoint
- Demo
- users have multiple projects on the platform
- a few hundred components are offered to users out of the box (e.g., Google Drive, SQL, Python, Google Sheets)
- metadata manually pushable to OpenLineage endpoint
- orchestrator could benefit from parent/job support
- Challenges
- need: richer metadata
- component config
- info about tables
- lighter UI
- reflects feedback about legibility
- icon customizability
- namespaces
- connectivity between projects
- more integrations
- rounded logo
- Q & A
- Are you interested in contributing? [Julien]
- would like to; possibly in the future
- Would you like to open issues? (custom facets, UI) [Willy]
- not currently able to
- Are you using any integrations? java or python [Willy]
- component can be anything in the docker container
- multiple languages used in development
- Customers using it already? [Conor]
- some testing is going on
- not in production yet
- no plans to offer Marquez to customers
- Does it work for every connector? [Conor]
- each will produce at least a job
- Auth model [Willy]
- problem: slippery slope [Martin]
- recommended at ingress level [Willy]
- not a focus at the moment
- contributions to related issues welcome
- Is data discovery offered? [Naga]
- built in with API
- additional tools can be added if integration would be seamless
May 26, 2022
Attendees:
TSC:
- Willy Lulciuc, Co-creator of Marquez
- Peter Hicks, Senior Engineer, Astronomer
And:
- Ross Turk, Senior Director of Community, Astronomer
- Minkyu Park, Senior Engineer, Astronomer
- John Thomas, Support Engineer, Astronomer
- Michael Robinson, Developer Relations Engineer, Astronomer
- Joshua Wankowski, Associate Data Engineer, Northwestern Mutual
- Sam Holmberg, Software Engineer, Astronomer
- Dako Dakov, R&D Manager, VMware
- Agita Jaunzeme, Community Manager, VMware
- Radmila Radovanvic, Senior Data Engineer, Northwestern Mutual
- Gage Russell, Data Engineer, Q2
- Rae Green, Developer, Q2ebanking
- Dimira Petrova, Supervisor of Data Analytics, VMware
- Martin Fiser, Head of Professional Services, Keboola
- Naga Raghavarapu, Principal Software Engineer, Oracle
- Antoni Ivanov, Staff Engineer, VMware
Agenda:
- Announcements
- Use cases from Northwestern Mutual and VMware
- New feature: linking job runs and datasets
Meeting:
- Recording
- Password: WMz0&@Gm
Notes:
Announcements [Willy]
- Marquez stickers are now available: https://www.astronomer.io/datakin-swag
- Michael C. is presenting today at Airflow Summit @ 7 pm PT: https://airflowsummit.org/program/
- Willy will be talking about Mqz at Open Source Summit in June: https://sched.co/11NgS
Northwestern Mutual Use Case [Joshua]
- Big-picture role of Mqz at NWM
- Mqz used to track data usage as a whole
- Mqz is critical to data ops at NWM and has a special future here
- Company background
- Massive insurance co. with investment management arm
- 150+ year history with many customer touch points
- Massive data with lots of users
- Rationale for adoption
- OL is where I spend most of my time
- These tools will be the industry standards for dataset usage going forward
- We desired one data standard, not random internal standards
- Breakdown of use case
- We track the HOW of usage from initial consumption to end usage
- We record data product usage over time
- Bonus: improved security
- can see how/which users are actually using data
- allows comparison to security frameworks, double-checking of work
- Visualization is key
- helps in building reports and modeling huge data systems
- we can check the entire platform stack from ingest to updates, normalization, end-usage
- Personal perspective
- Mqz is data ops for data processing
- Will we have a data ops center in the future like we have currently with NOCs?
- The visual language is the key strength of the tool
- This is the future of data
- Q & A
- Are screenshots available? Do you use Spark? [Naga]
- Can't share due to proprietary concerns
- How much data? [Naga]
- Can't be specific, but it's a lot!
- It's exciting to see others excited about the project. Are you using any custom integrations? [Willy]
- Yes, custom integrations support streaming and ingestions across the platform
VMware use case [Antoni]
- Demo of VDK
- Our motivation
- Verification problems
- OL + Mqz was the solution
- The common standard provided by OL is essential
- Why Mqz?
- It's helpful in debugging complex jobs, troubleshooting
- It's key to understanding usage for maintenance – e.g., enabling removal of irrelevant datasets, jobs
- The shared metadata is useful
- Diagram of architecture
- Code demo
- Suggestions
- Add visualization of parent/child relationships [note: see PR 1935]
- Make output searchable by metadata (e.g., make it possible to find all late jobs)
- Our stack
- Postgres, Presto, Snowflake, Greenplum db, Trino
- Q & A
- How many integrations in use? [Gage]
- 100 teams, 1000s of tables
- Are you using the Python client? [Willy]
- Yes
- It's amazing to get this feedback [Willy]
- The grouping of jobs is hard, but we're addressing this
- Feel free to open issues and contribute
- New feature linking job runs to datasets [Peter]
- Recently added to jobs: created_by available on dataset views
- Dataset versions also now available on version history tab
- Allows for historical introspection in case of an issue
- Allows for seeing if the code changed, for example
- Open discussion
- Is anyone using the Python client for OL? [Gage]
- Based on today's discussion, the answer is yes
- Projects, docs are coming [Willy]
- You can also use the Airflow integration for insight into the Python client
- Column-level lineage has been added to OL [Willy]
- We worked with Microsoft on the spec
- Look for this in the API in the next few months
- Feedback on this appreciated
- What's in the roadmap for multi-tenancy? How can this be used in Mqz? [Naga]
- For every event, route it through Kafka – we're working with a company to help us document this a bit more [Willy]
- Alternate approach: use a namespace to add metadata
- Issue with this: access control (see the project roadmap for more info)
April 28, 2022
Attendees:
TSC:
- Willy Lulciuc, Co-creator of Marquez
- Michael Collado, Staff Software Engineer, Astronomer
- Julien Le Dem, Chief Architect, Astronomer
And:
- Ross Turk, Senior Director of Community, Astronomer
- Minkyu Park, Senior Engineer, Astronomer
- John Thomas, Support Engineer, Astronomer
- Michael Robinson, Developer Relations Engineer, Astronomer
- Gage Russell, Data Engineer, Q2
- Paweł Leszczyński, Data Engineer, GetInData
- Joshua Wankowski, Associate Data Engineer, Northwestern Mutual
- Dillon Stadther
Agenda:
- 0.22.0 preview [Willy]
- lifecycleStateChange support [Pawel]
- Updates to job renaming and symlinking [Michael C.]
Meeting:
Notes:
Announcements [Willy]:
- Cool swag is available! https://www.astronomer.io/datakin-swag
- Willy has two talks about Marquez upcoming:
- Airflow Summit: https://airflowsummit.org/program/
- Open Source Summit: https://sched.co/11NgS
0.22.0 Preview [Willy]:
- lifecycleStateChange support will offer visibility into dataset lifecycle changes, including the deletion of tables (see the sketch at the end of this section)
- Pawel:
- change motivated by desire for more information about datasets
- approach started out with the Spark integration
- still more information about lifecycle changes is possible/desirable
- additional feature idea: notification console friendly to backend developers
- Additional possibility: grayed out nodes on graph for deleted datasets, logging to show lifecycle history
- Pawel: panel on website could display changes to dataset over X days
- Agreed. Create an issue and we can build on that idea.
- Helm chart addition
- allows annotations, e.g. Prometheus metrics
- Support for renaming and redirection
- introducing job hierarchy
- symlink will permit visibility into name changes to datasets
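For reference, a minimal sketch of what a lifecycle state change looks like on an output dataset in an OpenLineage event (namespace, table name, and producer URI are placeholders; the LifecycleStateChangeDatasetFacet spec is authoritative):

```python
# An output dataset reporting that the table was dropped (placeholder values).
dropped_dataset = {
    "namespace": "postgres://db:5432",
    "name": "public.delivery_7_days",
    "facets": {
        "lifecycleStateChange": {
            "_producer": "https://example.com/my-producer",
            "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/LifecycleStateChangeDatasetFacet.json",
            "lifecycleStateChange": "DROP",  # e.g., CREATE, ALTER, OVERWRITE, RENAME, DROP
        }
    },
}
```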
Updates to job renaming and symlinking [Michael C.]
- stemmed from desire to tie linked jobs together, e.g., jobs called by DAGs, even in cases where identical code is part of different chains
- challenge: linking old jobs to fully qualified version
- motivating factor: changes to job names result in junk nodes on the graph
- there was no way to remove the old job names from the graph
- but there is frequently a need to keep track of old job names
- hence the idea of symlinking a job
- currently there's no API to do this
- updating must be done manually currently
- add the UUID of the new job to the db
- from that point on, the job history will redirect to the new job (with a 301)
- future: API will make this possible programmatically
- Willy: is documentation needed for this?
- Yes, I will post a change to the README
- We want to do the same thing for datasets
- Open discussion
- Gage: is a home repo coming?
- Willy: Minkyu has looked into this
- Willy: we want to add the Helm chart to the new website
- Willy: this is on our radar
- New release coming soon!
March 31, 2022
Attendees:
TSC:
- Willy Lulciuc, Co-creator of Marquez
- Michael Collado, Staff Engineer, Astronomer
- Julien Le Dem, Chief Architect, Astronomer
- Peter Hicks, Senior Engineer, Astronomer
And:
- Ross Turk, Sr. Director of Community, Astronomer
- Minkyu Park, Senior Engineer, Astronomer
- John Thomas, Support Engineer, Astronomer
- Michael Robinson, Developer Relations Engineer, Astronomer
- Howard Yoo, Staff Product Manager, Astronomer
Agenda:
- Website update
- Backlog and roadmap discussion
- Open discussion
Meeting:
Slides
Notes:
Announcements [Michael R.]
- Marquez stickers are now available: https://www.astronomer.io/datakin-swag
- Willy and Julien gave a talk on OpenLineage, Airflow and Marquez at Data Council Austin on March 23
- The project's Github star count stands at 983. Have you starred the project yet?
- 1k stars is one requirement for Graduation status from the LFAI. The project is nearing completion of all the requirements, so a formal application will be possible soon.
Website [Ross]
- The project now has a new website.
- Appropriately, it's an open-source project; PRs are welcome.
- Tech: Gatsby, Github Projects
- Dev: run `yarn deploy` to work on it
- Plans: blog page. Proposals for posts welcome – post them in Slack or open a PR if you prefer.
Backlog and roadmap [Willy]
- Issue: currently, PRs are driven by a small team (e.g., Peter's view for dataset versions, Pawel's lifecycle PR)
- How to get the broader community involved? Want people to have more input/control over the issues we take up.
- Solution: Github's Roadmap feature. Milestones and releases visible there. Choose Marquez on the Projects tab.
- Process: review issues on monthly basis, move to roadmap, then release.
- Question from Howard about how to propose new features
- Follow-up work: discussion of how to prioritize issues; documentation needed about how to label new issues (e.g., as "features")
- Comment from Michael C.: it's possible to add new columns to the roadmap, in addition to new issues.
Open discussion
- Michael C.: please note issue #1928: supporting job grouping and hierarchy.
- Problem: the project does not track parent/child job relationships, despite this nomenclature being used in OpenLineage to describe related jobs.
- Proposal: a `parent_job_id` column should be added to the jobs table and to the runs table, both being UUIDs.
- Michael R.: please note that the meeting typically takes place on the 4th Thursday of each month.
February 24, 2022
Attendees:
TSC:
- Willy Lulciuc, Co-creator of Marquez
- Michael Collado, Staff Engineer, Datakin
And:
- Minkyu Park, Senior Engineer, Datakin
- Michael Robinson, Developer Relations Engineer, Datakin
- Ross Turk, VP of Marketing, Datakin
Agenda:
- Review of integrations to create runs and associate metadata with runs (replaced with OpenLineage)
- Demo: How to collect OpenLineage events with the lineage API to send metadata to Marquez
- Demo: OL Java client
- Dataset lifecycle management
- Open discussion
Meeting:
Notes:
Announcements [Willy]
- Release date of 0.21.0 is now 2/28
- Confusion in the community about which Java client to use is being addressed in OpenLineage PR #480
- We hope to have this merged for the next OL release
Integrations and OL demo [Willy]
- OL integration
- Available at openlineage.io/integration/, where you can also find instructions for installing and configuring it
- `requirements.txt` needs to install Airflow
- Set OpenLineage URL to local instance of Marquez
- Marquez is moving towards using a task listener to pull metadata in real time
- For now use the OL Airflow DAG
- You can still use the OL backend; there are limitations there, however
- Spark integration
- When running the spark-submit command, you need to provide configuration: specify the extra listener (thanks to Michael C. for his work on this)
- Point the host to your deployment
- See the OL website for more details (openlineage.io/integration/spark); a rough configuration sketch appears at the end of this section
- Upcoming: Flink and Kafka
- Your feedback on these integrations appreciated
- There are many connections you can use in your platform by switching over to OL to collect metadata
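A rough sketch of the Spark configuration described above, using PySpark (the package version, host, and namespace values are placeholders; check the OpenLineage Spark docs for the current property names):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("openlineage_example")
    # Pull in the OpenLineage Spark integration (version is a placeholder).
    .config("spark.jars.packages", "io.openlineage:openlineage-spark:0.6.1")
    # Register the extra listener that emits OpenLineage events.
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    # Point the host at your Marquez (or other OpenLineage-compatible) deployment.
    .config("spark.openlineage.host", "http://localhost:5000")
    .config("spark.openlineage.namespace", "spark_integration")  # placeholder namespace
    .getOrCreate()
)
```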
OL Java client demo [Willy]
- The Java client employs a workflow with interface
- Definition of run method required
- Instance of database required
- This example: a simple workflow with a database created via the `newDatabase` method
- Relies on a Job class
- In Marquez you can see the calls
- For the code see https://github.com/DatakinHQ/demo/tree/main/custom/java/simple
Dataset lifecycle management [Willy]
- Marquez can now capture changes to dataset names
- Community voiced desire for this feature
- Marquez now supports soft deletes of datasets
- See PR #1847
- Support of lifecycle now more concrete: can see the phases datasets go through
Open discussion
- Julien and Willy will be speaking in-person at the Data Council conference in Austin next month (March 23-24)
- Michael C. will be presenting virtually at the Subsurface LIVE conference (March 2-3); topic: Spark
January 27, 2022
Attendees:
TSC:
- Willy Lulciuc, Co-creator of Marquez
- Julien Le Dem, CTO of Datakin
- Michael Collado, Staff Engineer, Datakin
- Peter Hicks, Senior Engineer, Datakin
- Kevin Mellott, Assistant Director of Data Engineering, Northwestern Mutual
And:
- Ross Turk, VP of Marketing, Datakin
- Minkyu Park, Senior Engineer, Datakin
- John Thomas, Support Engineer, Datakin
- Michael Robinson, Developer Relations Engineer, Datakin
Agenda:
- Marquez recent releases overview [Willy]
- Marquez release 0.21.0 overview
- Upgrade to Java17
- Marquez release 0.21.0 overview
- Migrating integrations to OpenLineage [Willy]
- Cloud-based development instance of Marquez via Gitpod [Peter]
- Open discussion
Meeting:
Notes:
0.21.0 overview [Willy]
- Features:
- Bug fixes
- Removal of excess code
- Upgrade to Java17
- API image migrated
- Eclipse Temurin integrated
- All CI deployment updated to support Java17
- Discussion [Kevin, Willy, Michael C.]:
- Support for Java client possible in lower version
- Proposed: schedule separate meeting about this
Migrating integrations to OpenLineage [Willy]
- Spark library in Marquez now deprecated
- Use of OpenLineage Spark integration recommended going forward
- review the docs about how to configure your instance
- remember to add underscore to marquez_airflow
- OpenLineage integration allows task listener
- workaround: import DAG from OpenLineage
- See the changelog: environment variables for the Airflow instance have changed
Cloud-based development instance of Marquez [Peter]
- Enabled by integration of Gitpod
- Docker image in the cloud with Marquez and UI
- Ideal for those not ready to install everything locally or who are having issues with their OS
- Fast (30 seconds), eliminates risk
- API also available
- Can be made private or public
- Big advantage: shareable within organizations via URL
- Supports everything one could do locally in VS Code or similar IDE
- Discussion [Willy, Peter, Kevin, Julien]:
- common use case: potential users want to see metadata from their org and share the tool
- potential side-effect: increase in Docker pulls
- availability of metrics unknown
- email address required
Open Discussion
- Advantages of possible move from CircleCI to Github Actions
- CircleCI downsides: outages, billing issues [Willy]
- Julien proposed: moving to Github actions eventually after running both in parallel
- Kevin asked to experiment with Github Actions and report back
- Issue #1800: add support for table operations reported from OpenLineage
- Formal solution needed [Willy]
- Willy proposed: deploy in two modes and use flags (Julien agreed)
- NodeID
- An easy win: add a field that returns a nodeID [Willy]
- Willy proposed: prioritize in next release
Marquez Workflow Group Calendar Overview
Effective March 22, 2019: Group calendars are managed within LF AI Foundation Groups.io subgroups (mail lists); with each sub-group (mail list) having a unique group calendar. Meeting invites from these group calendars are sent to the applicable sub-group (mail list). In order to see the various group calendars you must:
Be logged into LF AI Foundation Groups.io
Be subscribed to the sub-group (mail list) you're interested in
Thereafter, you will see all the calendars for the sub-groups you subscribe to under your LF AI Foundation Group Calendar via Groups.io OR
You can also view a specific group calendar via the Wiki (if the group has created a Wiki group calendar) whether you are a member of the sub-group (mail list) or not
View Instructions on How to Subscribe to LF AI Group Calendars
For detailed information on LF AI meeting management processes view this page: LF AI Foundation - Community Meetings and Calendars
Marquez Meetings List
Schedule | Title | Owner | Subgroup (mail list) | Purpose | Dial In Link |
---|---|---|---|---|---|
Day of Week (frequency) 00:00 AM/PM - 00:00 AM/PM (timezone) | Meeting Title (Zoom Account Used) | Meeting Owner/Moderator | marquez-mail-list@lists.lfai.foundation | Meeting Purpose | Zoom Name: https://zoom.us/... |