The OpenLineage Technical Steering Committee meetings are held monthly on the second Thursday from 10:00am to 11:00am US Pacific. Here's the meeting info.
...
September 14, 2023 (10am PT)
Attendees:
...
- TSC:
- Paweł Leszczyński, Software Engineer, GetInData
- Julien Le Dem, OpenLineage project lead
- Michael Robinson, Community team, Astronomer
- Maciej Obuchowski, Software Engineer, GetInData, OpenLineage committer
- Mandy Chessell, Lead of Egeria Project
- And:
- Harel Shein, Engineering Manager, Astronomer
- Harsh Loomba, Upgrade
- Sheeri Cabral, Product Manager, Collibra
- Ernie Ostic, Manta Software
- Mars Lan, CTO/Co-founder, Metaphor
Agenda:
- Announcements
- Recent releases
- Demo: Spark integration tests in Databricks runtime
- Discussion items
- Open discussion
...
Notes:
- Announcements [Julien]
- Recent releases [Michael R.]
● Michael shared a release update on 1.1.0, including support for configuring OpenLineage in the Flink integration, a fix for the problem of multiple jobs writing to different datasets under the same job name in Spark, and missing Javadocs added to the Java client. The new default behavior can be turned off with an environment variable, and more information is available in the release notes.
● Michael also thanked new contributors and mentioned bug fixes.
● Maciej and Julien discussed the fact that Airflow changes are no longer included in the changelog because the Airflow OpenLineage integration is now part of the Airflow project.
- Demo: Spark integration tests in Databricks runtime [Pawel]
● Pawel thanked the participants and introduced himself. He talked about upgrading the Spark version and the issues they faced with Databricks integration.
● They previously had to test the changes manually, which was time-consuming. However, Databricks released a Java library that allowed them to run integration tests easily.
● They also implemented a file transport to capture lineage events and verify that the events contain what they expected. This change sped up their work and improved code quality.
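The file-transport verification approach described here can be sketched as follows; the event shapes and dataset names are simplified illustrations, not the integration's actual payloads:

```python
import json
import os
import tempfile

# The integration under test writes each lineage event as one JSON line to a
# file; the test then reads the file back and checks the events' contents.
events = [
    {"eventType": "START", "job": {"name": "spark_job"}, "outputs": []},
    {"eventType": "COMPLETE", "job": {"name": "spark_job"},
     "outputs": [{"namespace": "dbfs", "name": "/mnt/out/table"}]},
]

with tempfile.NamedTemporaryFile("w", suffix=".ndjson", delete=False) as f:
    for e in events:
        f.write(json.dumps(e) + "\n")
    path = f.name

# Verification step: load the emitted events and assert expectations on them.
emitted = [json.loads(line) for line in open(path)]
complete = [e for e in emitted if e["eventType"] == "COMPLETE"]
assert any(o["name"] == "/mnt/out/table" for e in complete for o in e["outputs"])
os.remove(path)
```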
● Julien asked if there were any questions.
- Discussion items
- Open Lineage Registry Proposal [Julien]
● Julien explained the concept of OpenLineage and the need for a registry to define custom facets and producers. He shared a Google doc for feedback and listed the goals of the registry, including allowing third parties to register their implementations or custom extensions and shortening the producer and schema URL values.
● Custom facets are an easy way to extend the spec without requiring any approval, and producers and consumers can publish the list of facets they produce or consume without requiring approval.
● Mandy joined the call and expressed support for the idea of a registry but suggested that facets should be themed to avoid every producer defining their own facets. She proposed having a set of themes, such as data facets, to cluster similar facets together in the registry.
● Mandy expresses concern about naming custom facets after specific technologies, as it can lead to unnecessary duplication. Julien explains that the Airflow facet is specific to Airflow and provides benefits for generic things.
● Core facets are sometimes added, and there are things specific to what people are doing. Mandy agrees and gives an example of how types are aligned with technologies, leading to duplication.
● Ernie suggests adding a protocol for something in the registry to become a core facet. Julien explains that there is a template for adding to the spec, and that custom facets can be defined as long as they have a prefix on the facet name and a published schema.
● To become a core facet, a proposal can be opened on the OpenLineage project, and usage of the custom facet can be leveraged to show that it works.
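The prefix-plus-published-schema convention for custom facets might look like the following sketch; the vendor name, facet fields, and URLs are hypothetical, while the `_producer` and `_schemaURL` properties come from the OpenLineage spec:

```python
import json

# A custom run facet named with a producer prefix ("myvendor_" is made up).
# Per the spec, every facet carries _producer and _schemaURL so consumers
# can locate the published JSON schema and validate the payload.
custom_facet = {
    "_producer": "https://example.com/myvendor/agent",
    "_schemaURL": "https://example.com/myvendor/schemas/ResourceUsageRunFacet.json",
    "cpuSeconds": 12.5,              # vendor-specific fields
    "peakMemoryBytes": 536870912,
}

# Attached to a run under the prefixed key, alongside standard facets.
run = {
    "runId": "d46e465b-d358-4d32-83d4-df660ff614dd",
    "facets": {"myvendor_resourceUsage": custom_facet},
}

print(json.dumps(run, indent=2))
```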
● Mandy suggests having a state on the registry to show whether something is private, under proposal, or being adopted. Julien agrees and explains that some custom facets are specifically in the domain of the producer and should live in the registry, while others are shared.
● Nick interjects and expresses his appreciation for the community aspect of OpenLineage. He suggests that producers provide examples and tests for consumers to use.
● Mandy asks for clarification on what he means by tests, and Nick explains that it could be a set of payloads or actually running the runtime to produce events.
● Nick would like to see both examples and payloads for consumers and producers, respectively. He suggests that putting them in a registry would facilitate everything all around like the tests.
● Julien explains that for the core spec, they have the definition of facets, a JSON schema for each facet, and documentation. They also added an example of each core facet and a test for schema validation.
● He suggests making it easier for producers to describe what facet they're producing.
● Mandy asks who did the recent addition, and Julien explains that it was contributed by GetInData. Mandy thanks him for the information.
● Julien suggests that there could be more done to make it easier for producers to describe what facet they're producing. Nick agrees and suggests a framework for testing where producers can provide enough information for the test to be generated.
● Julien explains that they currently use schema validation, but it's just a small portion of what Nick is describing. Nick agrees that it's a start.
● Julien suggests that producers need a registry mechanism to create their own facets and make them explicitly defined. Consumers would also benefit from a programmatic definition of facets they're consuming.
● He mentions the OpenLineage website's ecosystem page and how it points to documentation, but a more programmatic definition would be great.
● Nick agrees that it would be great to have a more programmatic definition of facets.
● Julien proposed a registry and discussed the trade-offs between a self-contained registry and delegating to other registries. He also mentioned the benefits of using shorter URLs for custom facets.
● Nick asked about how other communities handle this and suggested looking at successful practices of similar organizations. Paweł Leszczyński agreed.
● There were questions about whether there should be a registry folder under spec or a registry repo in the OpenLineage organization, and how to handle core facets and versioning. The group discussed using an OWNERS file in a repo to approve updates to the registry.
● Julien emphasized that this was just to start the conversation and that there were many different ways to implement the registry.
● Julien mentioned producing a list of schema URLs as a third party and discussed the benefits of a self-contained registry, including the ability to run checks against it and ensure consistency.
● Julien explained that registering a name along with a list of information would allow for shorter URLs for custom facets.
● Julien used ol: as an example of a shorter prefix for schema URLs.
● Julien mentioned that there were questions about whether there should be a registry repo in the OpenLineage organization or a registry folder under spec.
● Julien discussed using a JSON file to contain information about producers and their defined names.
● Julien compared the registry to existing package repositories and discussed using an OWNERS file to approve updates to the registry.
● Julien mentioned using CI to verify consistency and avoid breaking the registry.
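The name-registration and URL-shortening idea above could be sketched as follows; the registry file layout, the "myvendor" entry, and the exact URLs are all illustrative assumptions, not the proposal's final design:

```python
# Hypothetical registry contents: each registered short name maps to the base
# URL where that producer publishes its facet schemas. "ol" stands in for the
# core spec, as in the "ol:" prefix example from the discussion.
REGISTRY = {
    "ol": "https://openlineage.io/spec/facets/",
    "myvendor": "https://example.com/myvendor/schemas/",
}

def expand_schema_url(short_url: str) -> str:
    """Expand 'prefix:SchemaName.json' into a full schema URL via the registry."""
    prefix, _, path = short_url.partition(":")
    if prefix not in REGISTRY:
        raise KeyError(f"unknown registry prefix: {prefix}")
    return REGISTRY[prefix] + path

print(expand_schema_url("myvendor:ResourceUsageRunFacet.json"))
```

A CI check over such a file could verify that every base URL resolves and that no registered name is removed, which is the consistency guarantee mentioned above.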
● Nick asked about successful practices of similar organizations in handling registries.
● Nick mentioned that smaller organizations might be more flexible while larger organizations might have more legal requirements for using other registries.
● Paweł Leszczyński agreed with Nick's suggestion to look at successful practices of similar organizations.
● Julien explains that data-driven decisions are important and mentions the trade-off of how complicated it is to maintain a repository and whether it is self-service for producers. He suggests adding files to an existing open source repo for small organizations, while big organizations may need legal approval to contribute.
● He also mentions the need for licensing and PR processes.
● Nick responds with agreement.
● Julien shares that he will post the draft doc on the OpenLineage Slack for feedback and follow the OpenLineage proposal process. He mentions existing package repositories as other implementation models and welcomes other examples.
● He also asks if there are any questions or things people want to share about OpenLineage.
August 10, 2023 (10am PT)
...
- Announcements [Julien]
- Ecosystem Survey still needs responses: https://bit.ly/ecosystem_survey
- OpenLineage graduated from the LF AI on 7/27
- The 3rd issue of our monthly newsletter shipped on 7/31. Sign up here: https://bit.ly/OL_news
- Upcoming meetups:
- 8/30 in S.F. at Astronomer
- 9/18 in Toronto at Airflow Summit
- Marquez meetup on 10/5 in S.F.
- LF AI Update [Michael R.]
- Topics covered by Julien in presentation to LF AI TAC for graduation included trends in adoption
- Recent releases [Michael R.]
1.0.0: Added
- Airflow: convert lineage from legacy File definition #2006 @mobuchowski
Removed
- Spec: remove facet ref from core #1997 @JDarDagran
Changed
- Airflow: change log level to DEBUG when extractor isn't found #2012 @kaxil
- Airflow: make sure we cannot fail in thread despite direct execution #2010 @mobuchowski
https://github.com/OpenLineage/OpenLineage/releases/tag/1.0.0
https://github.com/OpenLineage/OpenLineage/compare/0.30.1...1.0.0
0.30.1: Added
- Flink: support Iceberg sinks #1960 @pawel-big-lebowski
- Spark: column-level lineage for merge into on delta tables #1958 @pawel-big-lebowski
- Spark: column-level lineage for merge into on Iceberg tables #1971 @pawel-big-lebowski
- Spark: add support for Iceberg REST catalog #1963 @juancappi
- Airflow: add possibility to force direct-execution based on environment variable #1934 @mobuchowski
- SQL: add support for Apple Silicon to openlineage-sql-java #1981 @davidjgoss
- Spec: add facet deletion #1975 @julienledem
- Client: add a file transport #1891 @Alexkuva
Changed
- Airflow: do not run plugin if OpenLineage provider is installed #1999 @JDarDagran
- Python: rename config to config_class #1998 @mobuchowski
https://github.com/OpenLineage/OpenLineage/releases/tag/0.30.1
https://github.com/OpenLineage/OpenLineage/compare/0.29.2...0.30.1
- Update on the OpenLineage Airflow Provider [Maciej]
- Pypi package version 1.0.1 available at: https://pypi.org/project/apache-airflow-providers-openlineage/1.0.1/
- installable with
pip install apache-airflow-providers-openlineage==1.0.1
- Development progresses in the Airflow repo
- What's there already:
- Operator coverage:
- A lot of SQL-related operators, especially based on SQLExecuteQueryOperator
- Some GCP ones: BigQueryInsertJobOperator, GCStoGCSOperator
- Some Sagemaker-related operators
- FTP, SFTP operators
- Basic support for Python and Bash operators
- Changed:
- Airflow: do not run plugin if OpenLineage provider is installed #1999 @JDarDagran
- Python: rename config to config_class #1998 @mobuchowski
- Next steps
- Operator coverage:
- Popular operators around BigQuery: BigQueryUpsertTableOperator…
- Transport operators, like MySQLToSnowflakeOperator, GCSToBigQueryOperator
- S3 support, like S3CopyObjectOperator
- Add support for XCom-native operators like BigQueryGetDataOperator
- This list is not a promise
- "Core" changes
- Add interfaces around OpenLineage-implementing operators - making implementation more native
- XCom dataset support - this relates to XCom operators mentioned above
- Hook-level lineage support
- OpenLineage 1.0 with Static Lineage Update
- Putting things together for 1.0 release
- Important features and PRs
- Proposal: add static lineage deletion #1839 @julienledem
- Emit job and dataset runless metadata #1880 @pawel-big-lebowski
- Marquez: Ability to decode static metadata events #2495 @pawel-big-lebowski
- Add facet deletion #1975 @julienledem
- Spec: remove facet ref from core #1997 @JDarDagran
July 13, 2023 (8am PT)
Attendees:
...
- TSC:
- Mike Collado, Staff Software Engineer, Astronomer
- Julien Le Dem, OpenLineage Project lead
- Willy Lulciuc, Co-creator of Marquez
- Michael Robinson, Software Engineer, Dev. Rel., Astronomer
- Maciej Obuchowski, Software Engineer, GetInData, OpenLineage contributor
- Mandy Chessell, Egeria Project Lead
- Daniel Henneberger, Database engineer
- Will Johnson, Senior Cloud Solution Architect, Azure Cloud, Microsoft
- Jakub "Kuba" Dardziński, Software Engineer, GetInData, OpenLineage contributor
- And:
- Petr Hajek, Information Management Professional, Profinit
- Harel Shein, Director of Engineering, Astronomer
- Minkyu Park, Senior Software Engineer, Astronomer
- Sam Holmberg, Software Engineer, Astronomer
- Ernie Ostic, SVP of Product, MANTA
- Sheeri Cabral, Technical Product Manager, Lineage, Collibra
- John Thomas, Software Engineer, Dev. Rel., Astronomer
- Bramha Aelem, BigData/Cloud/ML and AI Architect, Tiger Analytics
...
- Release 0.9.0 [Michael R.]
- We added:
- Spark: Column-level lineage introduced for Spark integration (#698, #645) @pawel-big-lebowski
- Java: Spark to use Java client directly (#774) @mobuchowski
- Clients: Add OPENLINEAGE_DISABLED environment variable which overrides config to NoopTransport (#780) @mobuchowski
- For the bug fixes and more information, see the Github repo.
- Shout out to new contributor Jakub Dardziński, who contributed a bug fix to this release!
- Snowflake Blog Post [Ross]
- topic: a new integration between OL and Snowflake
- integration is the first OL extractor to process query logs
- design:
- an Airflow pipeline processes queries against Snowflake
- separate job: pulls access history and assembles lineage metadata
- two angles: Airflow sees it, Snowflake records it
- the meat of the integration: a view that does untold SQL madness to emit JSON to send to OL
- result: you can study the transformation by asking Snowflake AND Airflow about it
- required: having access history enabled in your Snowflake account (which requires special access level)
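The flow described above, a separate job pulling access history and assembling lineage metadata, can be sketched roughly as follows; the row layout, column names, and helper are illustrative assumptions, not Snowflake's actual access-history schema or the integration's code:

```python
import json

# Hedged sketch: turn one row pulled from a Snowflake access-history view
# into an OpenLineage-style run event (simplified event shape).
def row_to_event(row: dict) -> dict:
    return {
        "eventType": "COMPLETE",
        "job": {"namespace": "snowflake", "name": row["query_tag"]},
        "inputs": [{"namespace": "snowflake", "name": t}
                   for t in row["objects_read"]],
        "outputs": [{"namespace": "snowflake", "name": t}
                    for t in row["objects_written"]],
    }

# Mock row standing in for what the view would emit as JSON.
row = {
    "query_tag": "daily_load",
    "objects_read": ["RAW.PUBLIC.ORDERS"],
    "objects_written": ["ANALYTICS.PUBLIC.ORDER_SUMMARY"],
}
print(json.dumps(row_to_event(row), indent=2))
```

In the real integration, a DAG would iterate over such rows and send each event to the OpenLineage backend, which is why both Airflow and Snowflake can be asked about the same transformation.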
- Q & A
- Howard: is the access history task part of the DAG?
- Ross: yes, there's a separate DAG that pulls the view and emits the events
- Howard: what's the scope of the metadata?
- Ross: the account level
- Michael C: in Airflow integration, there's a parent/child relationship; is this captured?
- Ross: there are 2 jobs/runs, and there's work ongoing to emit metadata from Airflow (task name)
- Great Expectations integration [Michael C.]
- validation actions in GE execute after validation code does
- metadata extracted from these and transformed into facets
- recent update: the integration now supports version 3 of the GE API
- some configuration ongoing: currently you need to set up validation actions in GE
- Q & A
- Willy: is the metadata emitted as facets?
- Michael C.: yes, two
- dbt integration [Willy]
- a demo on getting started with the OL-dbt library
- pip install the integration library and dbt
- configure the dbt profile
- run seed command and run command in dbt
- the integration extracts metadata from the different views
- in Marquez, the UI displays the input/output datasets, job history, and the SQL
- Open discussion
- Howard: what is the process for becoming a committer?
- Maciej: nomination by a committer then a vote
- Sheeri: is coding beforehand recommended?
- Maciej: contribution to the project is expected
- Willy: no timeline on the process, but we are going to try to hold a regular vote
- Ross: project documentation covers this but is incomplete
- Michael C.: is this process defined by the LFAI?
- Ross: contributions to the website, workshops are welcome!
- Michael R.: we're in the process of moving the meeting recordings to our YouTube channel
May 19th, 2022 (10am PT)
Agenda:
...
- TSC:
- Mike Collado: Staff Software Engineer, Datakin
- Maciej Obuchowski: Software Engineer, GetInData, OpenLineage contributor
- Julien Le Dem: OpenLineage Project lead
- Willy Lulciuc: Co-creator of Marquez
- And:
- Ernie Ostic: SVP of Product, Manta
- Sandeep Adwankar: Senior Technical Product Manager, AWS
- Paweł Leszczyński, Software Engineer, GetInData
- Howard Yoo: Staff Product Manager, Astronomer
- Michael Robinson: Developer Relations Engineer, Astronomer
- Ross Turk: Senior Director of Community, Astronomer
- Minkyu Park: Senior Software Engineer, Astronomer
- Will Johnson: Senior Cloud Solution Architect, Azure Cloud, Microsoft
Meeting:
http://youtube.com/watch?v=X0ZwMotUARA
Notes:
- Releases
- 0.8.2
Added
- openlineage-airflow now supports getting credentials from Airflow's secrets backend (#723) @mobuchowski
- openlineage-spark now supports Azure Databricks Credential Passthrough (#595) @wjohnson
- openlineage-spark detects datasets wrapped by ExternalRDDs (#746) @collado-mike
Fixed
- PostgresOperator fails to retrieve host and conn during extraction (#705) @sekikn
- SQL parser accepts lists of sql statements (#734) @mobuchowski
- 0.8.1
Added
- Airflow integration uses new TaskInstance listener API for Airflow 2.3+ (#508) @mobuchowski
- Support for HiveTableRelation as input source in Spark integration (#683) @collado-mike
- Add HTTP and Kafka Client to openlineage-java lib (#480) @wslulciuc, @mobuchowski
- New SQL parser, used by Postgres, Snowflake, Great Expectations integrations (#644) @mobuchowski
Fixed
- GreatExpectations: Fixed bug when invoking GreatExpectations using v3 API (#683) @collado-mike
- 0.7.1
Added
- Python implements Transport interface - HTTP and Kafka transports are available (#530) @mobuchowski
- Add UnknownOperatorAttributeRunFacet and support in lineage backend (#547) @collado-mike
- Support Spark 3.2.1 (#607) @pawel-big-lebowski
- Add StorageDatasetFacet to spec (#620) @pawel-big-lebowski
- README.md created at OpenLineage/integrations for compatibility matrix (#663) @howardyoo
Fixed
- Airflow: custom extractors lookup uses only get_operator_classnames method (#656) @mobuchowski
- Dagster: handle updated PipelineRun in OpenLineage sensor unit test (#624) @dominiquetipton
- Delta improvements (#626) @collado-mike
- Fix SqlDwDatabricksVisitor for Spark2 (#630) @wjohnson
- Airflow: remove redundant logging from GE import (#657) @mobuchowski
- Fix Shebang issue in Spark's wait-for-it.sh (#658) @mobuchowski
- Update parent_run_id to be a uuid from the dag name and run_id (#664) @collado-mike
- Spark: fix time zone inconsistency in testSerializeRunEvent (#681) @sekikn
- Communication reminders [Julien]
- Agenda [Julien]
- Column-level lineage [Paweł]
- Linked to 4 PRs, the first being a proposal
- The second has been merged, but the core mechanism is turned off
- 3 requirements:
- Outputs labeled with expression IDs
- Inputs with expression IDs
- Dependencies
- Once it is turned on, each OL event will receive a new JSON field
- It would be great to be able to extend this API (currently on the roadmap)
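The new JSON field mentioned above can be pictured with a small sketch following the shape of the columnLineage dataset facet in the OpenLineage spec; the dataset and column names here are illustrative:

```python
import json

# For each output column, list the input dataset fields it was derived from.
# Shape follows the columnLineage dataset facet; names are made up.
column_lineage = {
    "fields": {
        "total_amount": {
            "inputFields": [
                {"namespace": "warehouse", "name": "public.orders", "field": "amount"},
                {"namespace": "warehouse", "name": "public.orders", "field": "tax"},
            ]
        }
    }
}
print(json.dumps(column_lineage, indent=2))
```

This mirrors the three requirements above: the output column is the key, the input expression IDs resolve to `inputFields`, and the mapping itself encodes the dependencies.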
- Q & A
- Will: handling user-defined functions: is the solution already generic enough?
- The answer will depend on testing, but I suspect that the answer is yes
- The team at Microsoft would be excited to learn that the solution will handle UDFs
- Julien: the next challenge will be to ensure that all the integrations support column-level lineage
- Open discussion
- Willy: in Mqz we need to start handling col-level lineage, and has anyone thought about how this might work?
- Julien: lineage endpoint for col-level lineage to layer on top of what already exists
- Willy: this makes sense – we could use the method for input and output datasets as a model
- Michael C.: I don't know that we need to add an endpoint – we could augment the existing one to do something with the data
- Willy: how do we expect this to be visualized?
- Julien: not quite sure
- Michael C.: there are a number of different ways we could do this, including isolating relevant dataset fields
...
- 0.6.2 release overview [Michael R.]
- Transports in OpenLineage clients [Maciej]
- Airflow integration update [Maciej]
- Dagster integration retrospective [Dalin]
- Open discussion
Meeting info:
http://youtube.com/watch?v=MciFCgrQaxk
Notes:
- Introductions
- Communication channels overview [Julien]
- Agenda overview [Julien]
- 0.6.2 release overview [Michael R.]
...
- New committers [Julien]
- 4 new committers were voted in last week
- We had fallen behind
- Congratulations to all
- Release overview (0.6.0-0.6.1) [Michael R.]
- Added
- Extract source code of PythonOperator code similar to SQL facet @mobuchowski (0.6.0)
- Airflow: extract source code from BashOperator @mobuchowski (0.6.0)
- These first two additions are similar to SQL facet
- Offer the ability to see top-level code
- Add DatasetLifecycleStateDatasetFacet to spec @pawel-big-lebowski (0.6.0)
- Captures when someone is conducting dataset operations (overwrite, create, etc.)
- Add generic facet to collect environmental properties (EnvironmentFacet) @harishsune (0.6.0)
- Collects environment variables
- Depends on Databricks runtime but can be reused in other environments
- OpenLineage sensor for OpenLineage-Dagster integration @dalinkim (0.6.0)
- The first iteration of the Dagster integration to get lineage from Dagster
- Java-client: make generator generate enums as well @pawel-big-lebowski (0.6.0)
- Small addition to the Java client featuring better types (previously a string)
- Fixed
- Airflow: increase import timeout in tests, fix exit from integration @mobuchowski (0.6.0)
- The former was a particular issue with the Great Expectations integration
- Reduce logging level for import errors to info @rossturk (0.6.0)
- Airflow users were seeing warnings about missing packages if they weren't using a part of an integration
- This fix reduced the level to Info
- Remove AWS secret keys and extraneous Snowflake parameters from connection URI @collado-mike (0.6.0)
- Parses Snowflake connection URIs to exclude some parameters that broke lineage or posed security concerns (e.g., login data)
- Some keys are Snowflake-specific, but more can be added from other data sources
- Convert to LifecycleStateChangeDatasetFacet @pawel-big-lebowski (0.6.0)
- Mandates the LifecycleStateChange facet from the global spec rather than the custom tableStateChange facet used in the past
- Catch possible failures when emitting events and log them @mobuchowski (0.6.1)
- Previously when an OL event failed to emit, this could break an integration
- This fix catches possible failures and logs them
- Process for blog posts [Ross]
- Moving the process to Github Issues
Follow release tracker there
Go to https://github.com/OpenLineage/website/tree/main/contents/blog to create posts
No one will have a monopoly
Proposals for blog posts also welcome and we can support your efforts with outlines, feedback
Throw your ideas on the issue tracker on Github
- Retrospective: Spark integration [Willy et al.]
Willy: originally this was part of Marquez – the inspiration behind OL
OL was prototyped in Marquez with a few integrations, one of which was Spark (other: Airflow)
Donated the integration to OL
Srikanth: #559 very helpful to Azure
Pawel: is anything missing from the Spark integration? E.g., column-level lineage?
Will: yes to column-level; also, delta tables are an issue due to complexity; Spark 3.2 support also welcome
Maciej: should be more active about tracking projects we have integrations with; add to test matrix
Julien: let’s open some issues to address these
- Open Discussion
- Flink updates? [Julien]
Maciej: initial exploration is done
challenge: Flink has 4 APIs
prioritizing Kafka lineage currently because most jobs are writing to/from Kafka
track this on Github milestones, contribute, ask questions there
Will: can you share thoughts on the data model? How would this show up in MZ? How often are you emitting lineage?
Maciej: trying to model entire Flink run as one event
Srikanth: proposed two separate streams, one for data updates and one for metadata
Julien: do we have an issue on this topic in the repo?
Michael C.: only a general proposal doc, not one on the overall strategy; this worth a proposal doc
Julien: see notes for ticket number; MC will create the ticket
Srikanth: we can collaborate offline
...
- OpenLineage recent release overview (0.5.1) [Julien]
- TaskInstanceListener now official way to integrate with Airflow [Julien]
- Apache Flink integration [Julien]
- Dagster integration demo [Dalin]
- Open Discussion
Meeting:
http://youtube.com/watch?v=cIrXmC0zHLg
Notes:
- OpenLineage recent release overview (0.5.1) [Julien]
- No 0.5.0 due to bug
- Support for dbt-spark adapter
- New backend to proxy OL events
- Support for custom facets
- TaskInstanceListener now official way to integrate with Airflow [Julien]
- Integration runs on worker side
- Will be in the next release of Airflow (2.3)
- Thanks to Maciej for his work on this
- Apache Flink integration [Julien]
- Ticket for discussion available
- Integration test setup
- Early stages
- Dagster integration demo [Dalin]
- Initiated by Dalin Kim
- OL used with Dagster on orchestration layer
- Utilizes Dagster sensor
- Introduces OL sensor that can be added to Dagster repo definition
- Uses cursor to keep track of ID
- Looking for feedback after review complete
- Discussion:
- Dalin: needed: way to interpret Dagster asset for OL
- Julien: common code from Great Expectations/Dagster integrations
- Michael C: do you pass parent run ID in child job when sending the job to MZ?
- Hierarchy can be extended indefinitely – parent/child relationship can be modeled
- Maciej: the sensor kept failing – does this mean the events persisted despite being down?
- Dalin: yes - the sensor’s cursor is tracked, so even if repo goes down it should be able to pick up from last cursor
- Dalin: hoping for more feedback
- Julien: slides will be posted on slack channel, also tickets
- Open discussion
- Will: how is OL ensuring consistency of datasets across integrations?
- Julien: (jokingly) Read the docs! Naming conventions for datasets can be found there
- Julien: need for tutorial on creating integrations
- Srikanth: have done some of this work in Atlas
- Kevin: are there libraries on the horizon to play this role? (Julien: yes)
- Srikanth: it would be good to have model spec to provide enforceable standard
- Julien: agreed; currently models are based on the JSON schema spec
- Julien: contributions welcome; opening a ticket about this makes sense
- Will: Flink integration: MZ focused on batch jobs
- Julien: we need to determine whether checkpointing should be added
- Julien: there will be discussion in OLMZ communities about this
- In MZ, there are questions about what counts as a version or not
- Julien: a consistent model is needed
- Julien: one solution being looked into is Arrow
- Julien: everyone should feel welcome to propose agenda items (even old projects)
- Srikanth: who are you working with on the Flink comms side? Will get back to you.
...
- OpenLineage recent releases overview [Julien]
- OpenLineage 0.4 release overview: https://github.com/OpenLineage/OpenLineage/releases/tag/0.4.0
- Databricks install README and init scripts (by Will)
- Iceberg integration (by Pawel)
- Kafka read and write support (by Olek and Mike)
- Arbitrary parameters supported in HTTP URL construction (by Will)
- Increased coverage (Pawel/Maciej)
- OpenLineage 0.5 release overview
- Egeria support for OpenLineage [Mandy]
- Airflow TaskListener for OpenLineage integration [Maciej]
- Open discussion
...
- Attendees:
- TSC:
- Mandy Chessell: Egeria Lead. Integrating OpenLineage in Egeria
- Michael Collado: Datakin, OpenLineage
- Maciej Obuchowski: GetInData. OpenLineage integrations
- Willy Lulciuc: Marquez co-creator.
- Ryan Blue: Tabular, Iceberg. Interested in collecting lineage across Iceberg users with OpenLineage
- And:
- Venkatesh Tadinada: BMC workflow automation looking to integrate with Marquez
- Minkyu Park: Datakin. learning about OpenLineage
- Arthur Wiedmer: Apple, lineage for Siri and AI ML. Interested in implementing Marquez and OpenLineage
- Meeting recording:
http://youtube.com/watch?v=Gk0CwFYm9i4
- Meeting notes:
- agenda:
Update on OpenLineage latest release (0.2.1)
dbt integration demo
OpenLineage 0.3 scope discussion
Facet versioning mechanism (Issue #153)
OpenLineage Proxy Backend (Issue #152)
OpenLineage implementer test data and validation
Kafka client
Roadmap
- Iceberg integration
Open discussion
- Discussions:
Added to the agenda: a discussion of Iceberg requirements for OpenLineage.
Demo of dbt:
really easy to try
when running from Airflow, we can use the wrapper 'dbt-ol run' instead of 'dbt run'
Presentation of Proxy Backend design:
- summary of discussions in Egeria
- Egeria is less interested in instances (runs) and will keep track of OpenLineage events separately as operational lineage
- Two ways to use Egeria with OpenLineage:
- Use a direct OpenLineage endpoint (like Marquez): a direct HTTP endpoint implementation in Egeria
- Deploy the Proxy Backend to write to a queue (ex: Kafka): the Proxy Backend in OpenLineage receives HTTP events and forwards them to Kafka; a consumer receives the Kafka events in Egeria
- Depending on the user they might pick one or the other, and we'll document both
Follow up items:
...
Aug 11th 2021
- Attendees:
- TSC:
Ryan Blue
Maciej Obuchowski
Michael Collado
Daniel Henneberger
Willy Lulciuc
Mandy Chessell
Julien Le Dem
- And:
Peter Hicks
Minkyu Park
Daniel Avancini
- Meeting recording:
...
- Attendees:
- TSC:
- Julien Le Dem
- Mandy Chessell
- Michael Collado
- Willy Lulciuc
- Meeting recording:
http://youtube.com/watch?v=kYzFYrzSpzg
- Meeting notes
- Agenda:
- Finalize the OpenLineage Mission Statement
- Review OpenLineage 0.1 scope
- Roadmap
- Open discussion
- Slides: https://docs.google.com/presentation/d/1fD_TBUykuAbOqm51Idn7GeGqDnuhSd7f/edit#slide=id.ge4b57c6942_0_46
- Notes:
Mission statement:
Overall consensus on the statement.
TODO: vote by commenting on the ticket
Spec versioning mechanism:
The goal is to commit to compatible changes once 0.1 is published
We need a follow up to separate core facet versioning
=> TODO: create a separate github ticket.
The lineage event should have a field that identifies what version of the spec it was produced with
=> TODO: create a github issue for this
TODO: Add issue to document version number semantics (SCHEMAVER)
Extend Event State notion:
where do we capture more precise state transitions like RESTART?
Discussion should happen here: https://github.com/OpenLineage/OpenLineage/issues/9
OpenLineage 0.1:
finalize a few spec details for 0.1: a few items left to discuss.
In particular job naming
parent job model
Importing Marquez integrations in OpenLineage
Open Discussion:
connecting the consumer and producer
TODO: ticket to track distribution mechanism
options:
Would we need a consumption client to make it easy for consumers to get events from Kafka for example?
OpenLineage provides client libraries to serialize/deserialize events as well as sending them.
proxy similar to OpenTelemetry Collector.
Send to Kafka: https://github.com/OpenLineage/OpenLineage/issues/70
We can have documentation on how to send to backends that are not Marquez using HTTP and existing gateway mechanism to queues.
Do we have a mutual third party, or does the client know where to send?
Source code location finalization
job naming convention
you don't always have a nested execution
can call a parent
parent job
You can have a job calling another one.
always distinguish a job and its run
need a separate notion for job dependencies
need to capture event driven: TODO: create ticket.
TODO(Julien): update job naming ticket to have the discussion.
...
- Attendees:
- TSC:
Julien Le Dem: Marquez, Datakin
Drew Banin: dbt, CPO at Fishtown Analytics
Maciej Obuchowski: Marquez, GetInData consulting company
Zhamak Dehghani: Data Mesh; an open protocol of observability for the data ecosystem is a big piece of Data Mesh
Daniel Henneberger: building a database, interested in lineage
Mandy Chessell: Lead of Egeria, metadata exchange. Lineage is a great extension.
Willy Lulciuc: co-creator of Marquez
Michael Collado: Datakin, OpenLineage end-to-end holistic approach.
- And:
Kedar Rajwade: consulting on distributed systems.
Barr Yaron: dbt, PM at Fishtown analytics on metadata.
Victor Shafran: co-founder at databand.ai, a pipeline monitoring company. Lineage is a common issue.
- Excused: Ryan Blue, James Campbell
- Meeting recording:
http://youtube.com/watch?v=er2GDyQtm5M
- Meeting notes:
Agenda:
project communication
Technical charter review
medium term roadmap discussion
Notes:
project communication
github: for specs, designs, reviews and building consensus (issues and PRs)
email: for announcements, notes, etc
Slack: transient discussions, does not maintain history. Any decision making or notes should go to persistent medium (email and github)
monthly meeting: recorded, notes and recording published on the wiki
Technical Charter review:
TODO: Finalize the mission statement. TSC members to comment in the doc.
Roadmap discussion:
TODO: please comment in the doc. Julien to update the OpenLineage project in github: https://github.com/OpenLineage/OpenLineage/projects/1
...