The OpenLineage Technical Steering Committee meetings are held monthly on the second Thursday from 10:00am to 11:00am US Pacific. Here's the meeting info.

All are welcome.

Next meeting: November 9, 2023 (10am PT)

October 12, 2023 (10am PT)

Agenda:

Announcements
Recent releases
Airflow Summit recap
Tutorial/demo: migrating to the OpenLineage Airflow Provider
Discussion: observability for OpenLineage+Marquez
Open discussion

Meeting:

http://youtube.com/watch?v=LMuS0DJoOtc

Notes:

Announcements

The first annual Ecosystem Survey is still open. Submit your response today: https://bit.ly/ecosystem_survey
Our next meetup will be on November 29th in Warsaw, Poland, at Google. Sign up: https://www.meetup.com/warsaw-openlineage-meetup-group/events/296705558/?utm_medium=referral&utm_campaign=share-btn_savedevents_share_modal&utm_source=link

Recent releases

1.2.2
- Added
  - Spark: publish the ProcessingEngineRunFacet as part of the normal operation of the OpenLineageSparkEventListener #2089 @d-m-h
  - Spark: capture and emit spark.databricks.clusterUsageTags.clusterAllTags variable from Databricks environment #2099 @Anirudh181001
  Thanks to all the contributors, including new contributors @d-m-h, @tati and @xli-1026!
1.3.1
- Added
  - Airflow: add some basic stats to the Airflow integration #1845 @harels
  - Airflow: add columns as schema facet for airflow.lineage.Table (if defined) #2138 @erikalfthan
  - DBT: add SQLSERVER to supported dbt profile types #2136 @erikalfthan
  - Spark: support for latest 3.5 #2118 @pawel-big-lebowski
  Thanks to all the contributors, including new contributor @erikalfthan!
1.4.1
- Added
  - Client: allow setting client's endpoint via environment variable #2151 @mars-lan
  - Flink: expand Iceberg source types #2149 @HuangZhenQiu
  - Spark: add debug facet #2147 @pawel-big-lebowski
  - Spark: enable Nessie REST catalog #2165 @julwin
  Thanks to all the contributors, including new contributors @julwin and @HuangZhenQiu!

Migration from the standalone OpenLineage package to the Airflow provider
- Jakub explained how to migrate from the standalone openlineage-airflow package to the Airflow provider. He gave the reasons for becoming an Airflow provider, including making sure that collecting metadata in Airflow does not break Airflow itself.
- Being a provider also keeps the integration code up to date with all the other providers and lets lineage extraction become part of those providers' operators. A few changes were introduced in the provider package, so the main question is how to migrate.
- The simplest way is to install the provider package. One thing they would like to move away from is custom extractors; it was, and still is, possible to write a custom extractor, registered through the OPENLINEAGE_EXTRACTORS environment variable.
- Jakub explained that users who previously implemented a get_openlineage_facets method based on the old module and class do not need to worry, because the result is translated. However, if they uninstall openlineage-airflow, importing the old class will fail and they will need to change the import path.
- There are also configuration changes: there is a whole section called openlineage in the Airflow config. Many of the features previously available in the openlineage-airflow package are also available in the provider.
- Most people use OPENLINEAGE_URL, which is pretty common and still works. But some entries in the openlineage section take precedence over what was previously handled by environment variables.
- Jakub gave examples of how the Airflow config takes precedence over OPENLINEAGE_URL, and noted that the documentation has more information on how this works.
- He also explained how to add a new integration in the provider, or in other providers that make use of the OpenLineage provider. They want to stop using the openlineage.common.dataset module and use only the classes from the OpenLineage Python client.
- Jakub gave quick advice on how to capture information from the execution of an operator. Previously, when they had no control over or influence on the operator code, they needed to read the code and discover that, for example, the job ID is returned as an XCom.
- Now that the integration lives in the operator itself, they can simply change the code so that it saves the job IDs as attributes.
- Jakub gave a quick demo of how it works. He used Breeze, the development environment and CLI for Airflow.
- He was running Airflow 2.7.1 with the openlineage integration option, which also starts Marquez. The only additional package he used was Postgres, since the example relies on that provider.
- He showed how it works, remarking that the beauty of a live demo is that he doesn't know whether it will work.
- Jakub says that it should work in a minute.
- Jakub types in his password.
- Jakub says that he doesn't need to run post scripts, but actually he doesn't have just to prove he doesn't have any.
- Jakub confirmed that it was working. He ran an example that uses Postgres as the backend.
- Jakub noted that, as before, there is nothing more to configure if the user has already set OPENLINEAGE_URL.
- Jakub explained that he had changed the namespace (this one is a development environment), and that the name was different because he had been experimenting. Eventually, the events arrived in Marquez.
- Jakub tries it again.
- Jakub quickly demonstrated three options for package installation, rerunning commands from history. Julien thanked Jakub and asked whether there were any questions about migrating from the old OpenLineage integration to the new Airflow provider.
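
The precedence rule described above (entries in the openlineage config section win over the older environment variables) can be sketched as follows. This is only an illustration under stated assumptions, not the provider's actual code: the function name `resolve_openlineage_url` is hypothetical, and the idea that the section holds a `transport` entry containing a JSON object is assumed for the example. The provider documentation is the authority on the real option names.

```python
import json
from typing import Mapping, Optional


def resolve_openlineage_url(
    config_section: Mapping[str, str],
    environ: Mapping[str, str],
) -> Optional[str]:
    """Return the endpoint URL, preferring the config section over the environment.

    `config_section` stands in for the openlineage section of the Airflow
    config. The "transport" key holding a JSON object is an assumption made
    for this sketch.
    """
    raw_transport = config_section.get("transport", "").strip()
    if raw_transport:
        # A config entry is present, so it takes precedence over OPENLINEAGE_URL.
        transport = json.loads(raw_transport)
        return transport.get("url")
    # No config entry: fall back to the environment variable, which still works.
    return environ.get("OPENLINEAGE_URL")


if __name__ == "__main__":
    env = {"OPENLINEAGE_URL": "http://localhost:5000"}
    cfg = {"transport": '{"type": "http", "url": "http://marquez:5000"}'}
    print(resolve_openlineage_url({}, env))   # environment variable only
    print(resolve_openlineage_url(cfg, env))  # config entry wins
```

Passing the config and environment in as plain mappings keeps the sketch testable without an Airflow installation.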

Observability for OpenLineage and Marquez
- Julien introduced the discussion topic of observability for OpenLineage and Marquez and invited Harel to start. Harel asked the audience how they ensure the reliability of lineage collection and what kind of observability they would like to see, such as distributed tracing.
- He suggested gathering feedback in a Slack channel. Julien thought the metrics Harel added to the Airflow integration were a good starting point for observability.
- Hloomba mentioned enabling a retention policy in all environments and suggested observability on database retention to help with memory or CPU performance. Harel suggested enabling metrics out of the box and instrumenting more functions using Dropwizard as the web server.
- Julien and William discussed having metrics on the retention job to track how well it keeps the database small.
- Jeevan asked about the possibility of having an OpenLineage event for Spark applications, and Paweł Leszczyński explained the need for a parent run facet to identify each Spark action as part of a bigger entity, the Spark application. Jens suggested having unique job names for Spark actions and the parent Spark application.
- Paweł Leszczyński explained that the current job name is constructed from the name of the operator or Spark logical node, with a dataset name appended, but it could be made optional to have a human-readable job name or to use a hash of the logical plan to ensure uniqueness.
- Harel mentioned having good news for Bob and suggested discussing it next week.
- Jens added that having unique job names would help distinguish each Spark action and its runs.
- Julien asked if anyone had more comments on the topic.
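
The job-naming options discussed above (a human-readable name versus a name made unique with a hash of the logical plan) could look roughly like this. It is a sketch under assumptions, not the Spark integration's code: `build_job_name`, the dot-separated format, and the use of a truncated SHA-256 digest are all hypothetical choices made for illustration.

```python
import hashlib


def build_job_name(
    app_name: str,
    node_name: str,
    dataset_name: str,
    logical_plan: str,
    unique: bool = False,
) -> str:
    """Build a job name for a Spark action.

    By default the name is human-readable: application, logical node and
    dataset name joined with dots. With unique=True, a short hash of the
    serialized logical plan is appended, so two different plans are very
    unlikely to share a name.
    """
    base = ".".join(part for part in (app_name, node_name, dataset_name) if part)
    if not unique:
        return base
    # Hash the plan text; the same plan always yields the same suffix.
    digest = hashlib.sha256(logical_plan.encode("utf-8")).hexdigest()[:12]
    return f"{base}.{digest}"


if __name__ == "__main__":
    plan = "AppendData -> Project [id, price] -> Relation orders"
    print(build_job_name("my_app", "append_data", "orders", plan))
    print(build_job_name("my_app", "append_data", "orders", plan, unique=True))
```

Keeping the readable prefix and adding the hash only as a suffix preserves a name people can recognize while still distinguishing actions within one application.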

Creating a registry for consumers and producers
- Julien presented four items and discussed them in detail. The first item was about creating a registry for consumers and producers, which was summarized in a Google doc.
- Two options were discussed, and the second proposal, with a self-contained repository, was preferred. Notes and open items were added to the document, and everyone was encouraged to contribute to it.

Proposing an optional contract for providers for Airflow operators
- The second item was about proposing an optional contract for providers, so that Airflow operators can expose their lineage. A proposal was made to expose OpenLineage datasets directly in dbt's manifest file, and feedback was sought from dbt contributors.

Spark integration
- The third item was about the Spark integration, which knows how to define unique datasets for various data sources. However, custom data sources with their own implementation are opaque to it, so an optional contract was proposed to address this issue.

Certification process in the OpenLineage ecosystem
- Julien discussed the need for a certification process in the OpenLineage ecosystem and suggested creating a document to start a discussion on how to implement it. He mentioned the possibility of providing dataset support for scan and action nodes, and creating a contract that lets data source implementations expose lineage in relation nodes.
- Julien also talked about the goal of having OpenLineage built into systems like Airflow, and encouraged attendees to share their opinions and ask questions on Slack.

September 14, 2023 (10am PT)

...

TSC:
- Mike Collado, Staff Software Engineer, Astronomer
- Julien Le Dem, OpenLineage Project lead
- Willy Lulciuc, Co-creator of Marquez
- Michael Robinson, Software Engineer, Dev. Rel., Astronomer
- Maciej Obuchowski, Software Engineer, GetInData, OpenLineage contributor
- Mandy Chessell, Egeria Project Lead
- Daniel Henneberger, Database engineer
- Will Johnson, Senior Cloud Solution Architect, Azure Cloud, Microsoft
- Jakub "Kuba" Dardziński, Software Engineer, GetInData, OpenLineage contributor
And:
- Petr Hajek, Information Management Professional, Profinit
- Harel Shein, Director of Engineering, Astronomer
- Minkyu Park, Senior Software Engineer, Astronomer
- Sam Holmberg, Software Engineer, Astronomer
- Ernie Ostic, SVP of Product, MANTA
- Sheeri Cabral, Technical Product Manager, Lineage, Collibra
- John Thomas, Software Engineer, Dev. Rel., Astronomer
- Bramha Aelem, BigData/Cloud/ML and AI Architect, Tiger Analytics

...

Release 0.9.0 [Michael R.]
- We added:
  - Spark: Column-level lineage introduced for Spark integration (#698, #645) @pawel-big-lebowski
  - Java: Spark to use Java client directly (#774) @mobuchowski
  - Clients: Add OPENLINEAGE_DISABLED environment variable which overrides config to NoopTransport (#780) @mobuchowski
- For the bug fixes and more information, see the GitHub repo.
- Shout out to new contributor Jakub Dardziński, who contributed a bug fix to this release!
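
The OPENLINEAGE_DISABLED behavior listed in this release (an environment variable that overrides the configured transport with a no-op one) can be illustrated with a small sketch. The classes and the `choose_transport` helper below are hypothetical stand-ins, not the client's real classes, and the set of values treated as "true" is an assumption.

```python
import os
from typing import List, Mapping


class NoopTransport:
    """Transport that silently drops every event."""

    def emit(self, event: dict) -> None:
        pass


class RecordingTransport:
    """Stand-in for a real transport; it keeps events in memory."""

    def __init__(self) -> None:
        self.events: List[dict] = []

    def emit(self, event: dict) -> None:
        self.events.append(event)


def choose_transport(configured, environ: Mapping[str, str] = os.environ):
    """Return a NoopTransport when OPENLINEAGE_DISABLED is set to a true value.

    Otherwise the transport built from configuration is returned unchanged.
    The accepted true values are an assumption made for this sketch.
    """
    disabled = environ.get("OPENLINEAGE_DISABLED", "").strip().lower()
    if disabled in ("true", "1", "t"):
        return NoopTransport()
    return configured


if __name__ == "__main__":
    configured = RecordingTransport()
    active = choose_transport(configured, {"OPENLINEAGE_DISABLED": "true"})
    active.emit({"eventType": "START"})
    print(type(active).__name__, len(configured.events))
```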
Snowflake Blog Post [Ross]
- topic: a new integration between OL and Snowflake
- integration is the first OL extractor to process query logs
- design:
  - an Airflow pipeline processes queries against Snowflake
  - separate job: pulls access history and assembles lineage metadata
  - two angles: Airflow sees it, Snowflake records it
- the meat of the integration: a view that does untold SQL madness to emit JSON to send to OL
- result: you can study the transformation by asking Snowflake AND Airflow about it
- required: having access history enabled in your Snowflake account (which requires special access level)
- Q & A
  - Howard: is the access history task part of the DAG?
  - Ross: yes, there's a separate DAG that pulls the view and emits the events
  - Howard: what's the scope of the metadata?
  - Ross: the account level
  - Michael C: in Airflow integration, there's a parent/child relationship; is this captured?
  - Ross: there are 2 jobs/runs, and there's work ongoing to emit metadata from Airflow (task name)
Great Expectations integration [Michael C.]
- validation actions in GE execute after validation code does
- metadata extracted from these and transformed into facets
- recent update: the integration now supports version 3 of the GE API
- some configuration ongoing: currently you need to set up validation actions in GE
- Q & A
  - Willy: is the metadata emitted as facets?
  - Michael C.: yes, two
dbt integration [Willy]
- a demo on getting started with the OL-dbt library
  - pip install the integration library and dbt
  - configure the dbt profile
  - run seed command and run command in dbt
  - the integration extracts metadata from the different views
  - in Marquez, the UI displays the input/output datasets, job history, and the SQL
Open discussion
- Howard: what is the process for becoming a committer?
  - Maciej: nomination by a committer then a vote
  - Sheeri: is coding beforehand recommended?
  - Maciej: contribution to the project is expected
  - Willy: no timeline on the process, but we are going to try to hold a regular vote
  - Ross: project documentation covers this but is incomplete
  - Michael C.: is this process defined by the LFAI?
- Ross: contributions to the website, workshops are welcome!
- Michael R.: we're in the process of moving the meeting recordings to our YouTube channel

May 19th, 2022 (10am PT)

Agenda:

...

TSC:
- Mike Collado: Staff Software Engineer, Datakin
- Maciej Obuchowski: Software Engineer, GetInData, OpenLineage contributor
- Julien Le Dem: OpenLineage Project lead
- Willy Lulciuc: Co-creator of Marquez
And:
- Ernie Ostic: SVP of Product, Manta
- Sandeep Adwankar: Senior Technical Product Manager, AWS
- Paweł Leszczyński: Software Engineer, GetInData
- Howard Yoo: Staff Product Manager, Astronomer
- Michael Robinson: Developer Relations Engineer, Astronomer
- Ross Turk: Senior Director of Community, Astronomer
- Minkyu Park: Senior Software Engineer, Astronomer
- Will Johnson: Senior Cloud Solution Architect, Azure Cloud, Microsoft

Meeting:

http://youtube.com/watch?v=X0ZwMotUARA

Notes:

Releases
- 0.8.2
  - Added
    - openlineage-airflow now supports getting credentials from Airflow's secrets backend (#723) @mobuchowski
    - openlineage-spark now supports Azure Databricks Credential Passthrough (#595) @wjohnson
    - openlineage-spark detects datasets wrapped by ExternalRDDs (#746) @collado-mike
  - Fixed
    - PostgresOperator fails to retrieve host and conn during extraction (#705) @sekikn
    - SQL parser accepts lists of sql statements (#734) @mobuchowski
- 0.8.1
  - Added
    - Airflow integration uses new TaskInstance listener API for Airflow 2.3+ (#508) @mobuchowski
    - Support for HiveTableRelation as input source in Spark integration (#683) @collado-mike
    - Add HTTP and Kafka Client to openlineage-java lib (#480) @wslulciuc, @mobuchowski
    - New SQL parser, used by Postgres, Snowflake, Great Expectations integrations (#644) @mobuchowski
  - Fixed
    - GreatExpectations: Fixed bug when invoking GreatExpectations using v3 API (#683) @collado-mike
- 0.7.1
  - Added
    - Python implements Transport interface - HTTP and Kafka transports are available (#530) @mobuchowski
    - Add UnknownOperatorAttributeRunFacet and support in lineage backend (#547) @collado-mike
    - Support Spark 3.2.1 (#607) @pawel-big-lebowski
    - Add StorageDatasetFacet to spec (#620) @pawel-big-lebowski
    - README.md created at OpenLineage/integrations for compatibility matrix (#663) @howardyoo
  - Fixed
    - Airflow: custom extractors lookup uses only get_operator_classnames method (#656) @mobuchowski
    - Dagster: handle updated PipelineRun in OpenLineage sensor unit test (#624) @dominiquetipton
    - Delta improvements (#626) @collado-mike
    - Fix SqlDwDatabricksVisitor for Spark2 (#630) @wjohnson
    - Airflow: remove redundant logging from GE import (#657) @mobuchowski
    - Fix Shebang issue in Spark's wait-for-it.sh (#658) @mobuchowski
    - Update parent_run_id to be a uuid from the dag name and run_id (#664) @collado-mike
    - Spark: fix time zone inconsistency in testSerializeRunEvent (#681) @sekikn
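
The parent_run_id change in 0.7.1 (a UUID derived from the DAG name and run_id) relies on name-based UUIDs: the same inputs always produce the same identifier. A minimal sketch, assuming a hypothetical helper `build_parent_run_id` and an arbitrary choice of namespace and separator (the release note does not specify either):

```python
import uuid


def build_parent_run_id(dag_id: str, dag_run_id: str) -> str:
    """Derive a deterministic UUID from the DAG name and its run_id.

    uuid5 hashes the name within a fixed namespace, so every task in the
    same DAG run computes the same parent run ID without coordination.
    The namespace and the "dag_id.run_id" format are assumptions.
    """
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{dag_id}.{dag_run_id}"))


if __name__ == "__main__":
    print(build_parent_run_id("etl_orders", "scheduled__2022-05-19T00:00:00+00:00"))
```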
Communication reminders [Julien]
Agenda [Julien]
Column-level lineage [Paweł]
- Linked to 4 PRs, the first being a proposal
- The second has been merged, but the core mechanism is turned off
- 3 requirements:
  - Outputs labeled with expression IDs
  - Inputs with expression IDs
  - Dependencies
- Once it is turned on, each OL event will receive a new JSON field
- It would be great to be able to extend this API (currently on the roadmap)
- Q & A
  - Will: handling user-defined functions: is the solution already generic enough?
    - The answer will depend on testing, but I suspect that the answer is yes
    - The team at Microsoft would be excited to learn that the solution will handle UDFs
  - Julien: the next challenge will be to ensure that all the integrations support column-level lineage
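
The three requirements Paweł listed (outputs labeled with expression IDs, inputs with expression IDs, and dependencies between expressions) are enough to compute which input columns feed each output column. The sketch below shows one way to do that with a graph traversal. It is not the Spark integration's implementation: the function name, the dictionary shapes, and the integer expression IDs are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

InputRef = Tuple[str, str]  # (dataset name, field name)


def resolve_column_lineage(
    outputs: Dict[str, int],
    inputs: Dict[int, InputRef],
    dependencies: Dict[int, List[int]],
) -> Dict[str, List[InputRef]]:
    """Map each output field to the input fields it is derived from.

    outputs:      output field name -> expression ID
    inputs:       expression ID -> (dataset, field) of an input column
    dependencies: expression ID -> expression IDs it is computed from
    """
    lineage: Dict[str, List[InputRef]] = {}
    for field, expr_id in outputs.items():
        found: Set[InputRef] = set()
        seen: Set[int] = set()
        stack = [expr_id]
        # Walk the dependency graph until input columns are reached.
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if current in inputs:
                found.add(inputs[current])
            stack.extend(dependencies.get(current, []))
        lineage[field] = sorted(found)
    return lineage


if __name__ == "__main__":
    result = resolve_column_lineage(
        outputs={"order_id": 1, "total": 10},
        inputs={1: ("orders", "id"), 2: ("orders", "price"), 3: ("orders", "quantity")},
        dependencies={10: [2, 3]},
    )
    print(result)
```

The `seen` set guards against revisiting expressions, so shared sub-expressions are only walked once.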
Open discussion
- Willy: in Marquez we need to start handling column-level lineage; has anyone thought about how this might work?
  - Julien: lineage endpoint for col-level lineage to layer on top of what already exists
  - Willy: this makes sense – we could use the method for input and output datasets as a model
  - Michael C.: I don't know that we need to add an endpoint – we could augment the existing one to do something with the data
  - Willy: how do we expect this to be visualized?
    - Julien: not quite sure
    - Michael C.: there are a number of different ways we could do this, including isolating relevant dataset fields

...

0.6.2 release overview [Michael R.]
Transports in OpenLineage clients [Maciej]
Airflow integration update [Maciej]
Dagster integration retrospective [Dalin]
Open discussion

Meeting info:

http://youtube.com/watch?v=MciFCgrQaxk

Notes:

Introductions
Communication channels overview [Julien]
Agenda overview [Julien]
0.6.2 release overview [Michael R.]

...

New committers [Julien]
- 4 new committers were voted in last week
- We had fallen behind
- Congratulations to all
Release overview (0.6.0-0.6.1) [Michael R.]
- Added
  - Extract source code of PythonOperator code similar to SQL facet @mobuchowski (0.6.0)
  - Airflow: extract source code from BashOperator @mobuchowski (0.6.0)
    - These first two additions are similar to SQL facet
    - Offer the ability to see top-level code
  - Add DatasetLifecycleStateDatasetFacet to spec @pawel-big-lebowski (0.6.0)
    - Captures when someone is conducting dataset operations (overwrite, create, etc.)
  - Add generic facet to collect environmental properties (EnvironmentFacet) @harishsune (0.6.0)
    - Collects environment variables
    - Depends on Databricks runtime but can be reused in other environments
  - OpenLineage sensor for OpenLineage-Dagster integration @dalinkim (0.6.0)
    - The first iteration of the Dagster integration to get lineage from Dagster
  - Java-client: make generator generate enums as well @pawel-big-lebowski (0.6.0)
    - Small addition to the Java client featuring better types; the values were previously strings
- Fixed
  - Airflow: increase import timeout in tests, fix exit from integration @mobuchowski (0.6.0)
    - The former was a particular issue with the Great Expectations integration
  - Reduce logging level for import errors to info @rossturk (0.6.0)
    - Airflow users were seeing warnings about missing packages if they weren't using a part of an integration
    - This fix reduced the level to Info
  - Remove AWS secret keys and extraneous Snowflake parameters from connection URI @collado-mike (0.6.0)
    - Parses Snowflake connection URIs to exclude some parameters that broke lineage or posed security concerns (e.g., login data)
    - Some keys are Snowflake-specific, but more can be added from other data sources
  - Convert to LifecycleStateChangeDatasetFacet @pawel-big-lebowski (0.6.0)
    - Mandates the LifecycleStateChange facet from the global spec rather than the custom tableStateChange facet used in the past
  - Catch possible failures when emitting events and log them @mobuchowski (0.6.1)
    - Previously when an OL event failed to emit, this could break an integration
    - This fix catches possible failures and logs them
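
The connection-URI fix listed above (removing AWS secret keys and extraneous parameters from the URI) can be illustrated with the standard library. This is a sketch, not the integration's code: `sanitize_connection_uri` and the deny-list of parameter names are hypothetical, and dropping the user and password from the authority part is an assumption about what counts as login data.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical deny-list; a real integration would maintain its own.
SENSITIVE_PARAMS = {"aws_access_key_id", "aws_secret_access_key", "password", "token"}


def sanitize_connection_uri(uri: str) -> str:
    """Strip credentials and sensitive query parameters from a connection URI."""
    parts = urlsplit(uri)
    # Rebuild the authority without the "user:password@" prefix.
    host = parts.hostname or ""
    if parts.port:
        host = f"{host}:{parts.port}"
    # Keep only query parameters that are not on the deny-list.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SENSITIVE_PARAMS
    ]
    return urlunsplit((parts.scheme, host, parts.path, urlencode(kept), ""))


if __name__ == "__main__":
    print(sanitize_connection_uri(
        "snowflake://user:secret@account.snowflakecomputing.com/db/schema"
        "?warehouse=wh&aws_secret_access_key=abc"
    ))
```

Note that `urlsplit` lowercases the hostname, so the sanitized URI also ends up with a normalized host.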
Process for blog posts [Ross]
- Moving the process to GitHub Issues
- Follow release tracker there
- Go to https://github.com/OpenLineage/website/tree/main/contents/blog to create posts
- No one will have a monopoly
- Proposals for blog posts are also welcome, and we can support your efforts with outlines and feedback
- Throw your ideas on the issue tracker on GitHub
Retrospective: Spark integration [Willy et al.]
- Willy: originally this was part of Marquez – the inspiration behind OL
  - OL was prototyped in Marquez with a few integrations, one of which was Spark (other: Airflow)
  - Donated the integration to OL
- Srikanth: #559 very helpful to Azure
- Pawel: is anything missing from the Spark integration? E.g., column-level lineage?
- Will: yes to column-level; also, delta tables are an issue due to complexity; Spark 3.2 support also welcome
- Maciej: should be more active about tracking projects we have integrations with; add to test matrix
- Julien: let’s open some issues to address these
Open Discussion
- Flink updates? [Julien]
  - Maciej: initial exploration is done
    - challenge: Flink has 4 APIs
    - prioritizing Kafka lineage currently because most jobs are writing to/from Kafka
    - track this on GitHub milestones, contribute, ask questions there
  - Will: can you share thoughts on the data model? How would this show up in MZ? How often are you emitting lineage?
  - Maciej: trying to model entire Flink run as one event
  - Srikanth: proposed two separate streams, one for data updates and one for metadata
  - Julien: do we have an issue on this topic in the repo?
  - Michael C.: only a general proposal doc, not one on the overall strategy; this is worth a proposal doc
  - Julien: see notes for ticket number; MC will create the ticket
    - https://github.com/OpenLineage/OpenLineage/issues/596
  - Srikanth: we can collaborate offline

...

OpenLineage recent release overview (0.5.1) [Julien]
TaskInstanceListener now official way to integrate with Airflow [Julien]
Apache Flink integration [Julien]
Dagster integration demo [Dalin]
Open Discussion

Meeting:

Slides

http://youtube.com/watch?v=cIrXmC0zHLg

Notes:

OpenLineage recent release overview (0.5.1) [Julien]
- No 0.5.0 due to bug
- Support for dbt-spark adapter
- New backend to proxy OL events
- Support for custom facets
TaskInstanceListener now official way to integrate with Airflow [Julien]

- Integration runs on worker side
- Will be in next OL release of Airflow (2.3)
- Thanks to Maciej for his work on this

Apache Flink integration [Julien]
- Ticket for discussion available
- Integration test setup
- Early stages
Dagster integration demo [Dalin]
- Initiated by Dalin Kim
- OL used with Dagster on orchestration layer
- Utilizes Dagster sensor
- Introduces OL sensor that can be added to Dagster repo definition
- Uses cursor to keep track of ID
- Looking for feedback after review complete
- Discussion:
  - Dalin: needed: way to interpret Dagster asset for OL
  - Julien: common code from Great Expectations/Dagster integrations
  - Michael C: do you pass parent run ID in child job when sending the job to MZ?
  - Hierarchy can be extended indefinitely – parent/child relationship can be modeled
  - Maciej: the sensor kept failing – does this mean the events persisted despite being down?
  - Dalin: yes - the sensor’s cursor is tracked, so even if repo goes down it should be able to pick up from last cursor
  - Dalin: hoping for more feedback
  - Julien: slides will be posted on slack channel, also tickets
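
Dalin's point about the cursor (the sensor records how far it has read, so it can resume after an outage without losing events) can be shown with a small stand-alone sketch. It does not use the Dagster API: `run_sensor_tick`, the list-based event log, and the integer record IDs are hypothetical stand-ins for the orchestrator's event storage.

```python
from typing import Callable, Dict, List, Optional


def run_sensor_tick(
    event_log: List[Dict],
    cursor: Optional[int],
    emit: Callable[[Dict], None],
) -> Optional[int]:
    """Emit every record stored after `cursor` and return the new cursor.

    Each record carries a monotonically increasing "id". The caller
    persists the returned cursor between ticks, so a tick that runs after
    downtime picks up exactly where the last successful one stopped.
    """
    last_seen = cursor
    for record in event_log:
        if cursor is not None and record["id"] <= cursor:
            continue  # already handled on an earlier tick
        emit(record)
        last_seen = record["id"]
    return last_seen


if __name__ == "__main__":
    emitted: List[int] = []
    log = [{"id": 1}, {"id": 2}]
    cursor = run_sensor_tick(log, None, lambda r: emitted.append(r["id"]))
    # The sensor is down while two more records are written.
    log += [{"id": 3}, {"id": 4}]
    cursor = run_sensor_tick(log, cursor, lambda r: emitted.append(r["id"]))
    print(emitted, cursor)
```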
Open discussion
- Will: how is OL ensuring consistency of datasets across integrations?
- Julien: (jokingly) Read the docs! Naming conventions for datasets can be found there
- Julien: need for tutorial on creating integrations
- Srikanth: have done some of this work in Atlas
- Kevin: are there libraries on the horizon to play this role? (Julien: yes)
- Srikanth: it would be good to have model spec to provide enforceable standard
- Julien: agreed; currently models are based on the JSON schema spec
- Julien: contributions welcome; opening a ticket about this makes sense
- Will: Flink integration: MZ focused on batch jobs
- Julien: we want to determine whether we need to add checkpointing
- Julien: there will be discussion in the OL and MZ communities about this
- Julien: a consistent model is needed
- Julien: one solution being looked into is Arrow
- Julien: everyone should feel welcome to propose agenda items (even old projects)
- Srikanth: who are you working with on the Flink comms side? Will get back to you.

...

OpenLineage recent releases overview [Julien]
- OpenLineage 0.4 release overview: https://github.com/OpenLineage/OpenLineage/releases/tag/0.4.0
  - Databricks install README and init scripts (by Will)
  - Iceberg integration (by Pawel)
  - Kafka read and write support (by Olek and Mike)
  - Arbitrary parameters supported in HTTP URL construction (by Will)
  - Increased coverage (Pawel/Maciej)
- OpenLineage 0.5 release overview
  - https://github.com/OpenLineage/OpenLineage/compare/0.4.0...main
Egeria support for OpenLineage [Mandy]
- https://odpi.github.io/egeria-docs/features/lineage-management/overview/#integrating-with-the-openlineage-standard
Airflow TaskListener for OpenLineage integration [Maciej]
Open discussion

...

Attendees:
- TSC:
  - Mandy Chessell: Egeria Lead. Integrating OpenLineage in Egeria
  - Michael Collado: Datakin, OpenLineage
  - Maciej Obuchowski: GetInData. OpenLineage integrations
  - Willy Lulciuc: Marquez co-creator.
  - Ryan Blue: Tabular, Iceberg. Interested in collecting lineage across Iceberg users with OpenLineage
- And:
  - Venkatesh Tadinada: BMC workflow automation looking to integrate with Marquez
  - Minkyu Park: Datakin. learning about OpenLineage
  - Arthur Wiedmer: Apple, lineage for Siri and AI ML. Interested in implementing Marquez and OpenLineage

Meeting recording:

http://youtube.com/watch?v=Gk0CwFYm9i4

Meeting notes:
- agenda:
  - Update on OpenLineage latest release (0.2.1)
    - dbt integration demo
  - OpenLineage 0.3 scope discussion
    - Facet versioning mechanism (Issue #153)
    - OpenLineage Proxy Backend (Issue #152)
    - OpenLineage implementer test data and validation
    - Kafka client
  - Roadmap
    - Iceberg integration
  - Open discussion
- Slides
- Discussions:
  - added a discussion of Iceberg requirements for OpenLineage to the agenda.
- Demo of dbt:
  - really easy to try
  - when running from Airflow, we can use the wrapper 'dbt-ol run' instead of 'dbt run'
- Presentation of Proxy Backend design:
  - summary of discussions in Egeria
    - Egeria is less interested in instances (runs) and will keep track of OpenLineage events separately as Operational lineage
    - Two ways to use Egeria with OpenLineage
      - receives HTTP events and forwards to Kafka
      - A consumer receives the Kafka events in Egeria
  - Proxy Backend in OpenLineage:
    - direct HTTP endpoint implementation in Egeria
  - Depending on the user they might pick one or the other and we'll document
    - Use a direct OpenLineage endpoint (like Marquez)
    - Deploy the Proxy Backend to write to a queue (ex: Kafka)
- Follow up items:

...

Aug 11th 2021

Attendees:
- TSC:
  - Ryan Blue
  - Maciej Obuchowski
  - Michael Collado
  - Daniel Henneberger
  - Willy Lulciuc
  - Mandy Chessell
  - Julien Le Dem
- And:
  - Peter Hicks
  - Minkyu Park
  - Daniel Avancini
Meeting recording:

http://youtube.com/watch?v=bbAwz-rzo3I

...

Attendees:
- TSC:
  - Julien Le Dem
  - Mandy Chessell
  - Michael Collado
  - Willy Lulciuc
Meeting recording:

http://youtube.com/watch?v=kYzFYrzSpzg

Meeting notes
- Agenda:
  - Finalize the OpenLineage Mission Statement
  - Review OpenLineage 0.1 scope
  - Roadmap
  - Open discussion
  - Slides: https://docs.google.com/presentation/d/1fD_TBUykuAbOqm51Idn7GeGqDnuhSd7f/edit#slide=id.ge4b57c6942_0_46
- Notes:
  Mission statement:
  - https://github.com/OpenLineage/OpenLineage/issues/84
  - Overall consensus on the statement.
  - TODO: vote by commenting on the ticket
  Spec versioning mechanism:
  - The goal is to commit to compatible changes once 0.1 is published
  - We need a follow up to separate core facet versioning
  => TODO: create a separate github ticket.
  - The lineage event should have a field that identifies what version of the spec it was produced with
  - TODO: Add issue to document version number semantics (SCHEMAVER)
  Extend Event State notion:
  - where do we capture more precise state transitions like RESTART?
  OpenLineage 0.1:
  - finalize a few spec details for 0.1 : a few items left to discuss.
  - Importing Marquez integrations in OpenLineage
  Open Discussion:
  - connecting the consumer and producer
    - TODO: ticket to track distribution mechanism
    - options:
      - Would we need a consumption client to make it easy for consumers to get events from Kafka for example?
      - OpenLineage provides client libraries to serialize/deserialize events as well as sending them.
    - We can have documentation on how to send to backends that are not Marquez using HTTP and existing gateway mechanism to queues.
  - Source code location finalization
  - job naming convention
    - you don't always have a nested execution
  - need a separate notion for job dependencies
  - need to capture event driven: TODO: create ticket.
  TODO(Julien): update job naming ticket to have the discussion.

...

Attendees:
- TSC:
  Julien Le Dem: Marquez, Datakin
  Drew Banin: dbt, CPO at Fishtown Analytics
  Maciej Obuchowski: Marquez, GetInData consulting company
  Zhamak Dehghani: Datamesh, Open protocol of observability for data ecosystem is a big piece of Datamesh
  Daniel Henneberger: building a database, interested in lineage
  Mandy Chessell: Lead of Egeria, metadata exchange. lineage is a great extension that volunteers lineage
  Willy Lulciuc: co-creator of Marquez
  Michael Collado: Datakin, OpenLineage end-to-end holistic approach.
- And:
  Kedar Rajwade: consulting on distributed systems.
  Barr Yaron: dbt, PM at Fishtown Analytics on metadata.
  Victor Shafran: co-founder at databand.ai pipeline monitoring company. lineage is a common issue
- Excused: Ryan Blue, James Campbell
Meeting recording:

http://youtube.com/watch?v=er2GDyQtm5M

Meeting notes:
Agenda:
- project communication
- Technical charter review
- medium term roadmap discussion
Notes:
- project communication
- Technical Charter review:
- Roadmap discussion:

...
