The OpenLineage Technical Steering Committee meets monthly on the second Wednesday from 9:30am to 10:30am US Pacific. Here's the meeting info.

All are welcome.

Table of Contents

Next meeting: December 18th, 2024 (9:30am PT)


November 20th, 2024 (9:30am PT)

October 16th, 2024 (9:30am PT)

September 18th, 2024 (9:30am PT)

August 14th, 2024 (9:30am PT)

July 10th, 2024 (9:30am PT)

June 12th, 2024 (9:30am PT)


May 8, 2024 (9:30am PT)

Attendees:

  • TSC:
    • Julien Le Dem, OpenLineage project lead, LF AI & Data
    • Michael Robinson, Community Manager, Astronomer
    • Harel Shein, Lineage at Datadog
    • Pawel Leszczynski, Software Engineer, GetInData
    • Maciej Obuchowski, Software Engineer, GetInData, OpenLineage committer
  • And:
    • Mark Soule, Principal Engineer, Improving
    • Sheeri Cabral, Product Manager, ETL, Collibra
    • Ernie Ostic, IBM/Manta
    • Rahul Madan, Atlan

Agenda:

  • Announcements
  • Recent releases - 1.13.1
  • Protobuf support in Flink - Pawel
  • Improved Project Management on GitHub
  • Pre-flight configuration check DAG - Rahul
  • Discussion items
  • Open discussion


Notes:

Accenture panel (presented by Confluent), Open Standards for Data Lineage: shared participation stats (see slides), including company and geographical data

  • Protobuf support in Flink - Pawel
    • Protocol buffers are a platform- and language-neutral, extensible mechanism for serializing structured data. OpenLineage can now extract schemas from Flink jobs reading/writing protobuf to and from Kafka.
  • Improved Project Management on GitHub
    • Recording from Kacper Muda
    • Adding 4 new issue templates: Bug report, Documentation issue, Feature request, and General (everything else) - Issue 2666; adds a new “needs triage” label - Issue 2664
    • Future: naming conventions for issues and PRs - e.g., knowing which version a PR went into. And then reviving milestones.
    • These can be discussed in #dev-discuss on Slack
  • Pre-flight configuration check DAG - Rahul
    • After setting up Airflow with OpenLineage, there’s no way to verify the correctness of the setup. Proposal: an Airflow DAG that runs, checks the config, and makes sure it’s OK. https://openlineage.io/docs/integrations/airflow/preflight-check-dag/ - with demo
    • Checks: env + config variables - makes sure OpenLineage is enabled, the library is installed, and versions are compatible; checks connectivity with Marquez if applicable.
    • The DAG made it possible to find and fix #2596.
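A minimal sketch of the kinds of checks such a preflight DAG might run (plain Python, stdlib only; the real DAG lives at the docs link above, and the helper names here are made up — only the `OPENLINEAGE_URL`, `OPENLINEAGE_CONFIG`, and `OPENLINEAGE_DISABLED` environment variables are actual OpenLineage client settings):

```python
import importlib.util
import os

def openlineage_enabled() -> bool:
    # OPENLINEAGE_DISABLED=true turns the integration off entirely.
    return os.getenv("OPENLINEAGE_DISABLED", "false").lower() != "true"

def library_installed() -> bool:
    # True if either the provider package or the legacy library is importable.
    for name in ("airflow.providers.openlineage", "openlineage.airflow"):
        try:
            if importlib.util.find_spec(name) is not None:
                return True
        except ModuleNotFoundError:
            pass
    return False

def transport_configured() -> bool:
    # A transport is needed to emit events, e.g. OPENLINEAGE_URL for HTTP
    # or OPENLINEAGE_CONFIG pointing to a config file.
    return bool(os.getenv("OPENLINEAGE_URL") or os.getenv("OPENLINEAGE_CONFIG"))

def preflight_report() -> dict:
    # Each entry mirrors one check a preflight DAG might run as a task.
    return {
        "enabled": openlineage_enabled(),
        "library_installed": library_installed(),
        "transport_configured": transport_configured(),
    }
```

A Marquez connectivity check would be a fourth entry, issuing an HTTP request against the configured URL.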


Meeting:

Slides (link forthcoming)

April 10, 2024 (9:30am PT)

...

  • TSC:
    • Mike Collado, Staff Software Engineer, Astronomer
    • Julien Le Dem, OpenLineage Project lead
    • Willy Lulciuc, Co-creator of Marquez
    • Michael Robinson, Software Engineer, Dev. Rel., Astronomer
    • Maciej Obuchowski, Software Engineer, GetInData, OpenLineage contributor
    • Mandy Chessell, Egeria Project Lead
    • Daniel Henneberger, Database engineer
    • Will Johnson, Senior Cloud Solution Architect, Azure Cloud, Microsoft
    • Jakub "Kuba" Dardziński, Software Engineer, GetInData, OpenLineage contributor
  • And:
    • Petr Hajek, Information Management Professional, Profinit
    • Harel Shein, Director of Engineering, Astronomer
    • Minkyu Park, Senior Software Engineer, Astronomer
    • Sam Holmberg, Software Engineer, Astronomer
    • Ernie Ostic, SVP of Product, MANTA
    • Sheeri Cabral, Technical Product Manager, Lineage, Collibra
    • John Thomas, Software Engineer, Dev. Rel., Astronomer
    • Bramha Aelem, BigData/Cloud/ML and AI Architect, Tiger Analytics

...

  • Announcements [Julien]
    • OpenLineage earned the OSSF Core Infrastructure Silver Badge!
    • Happening soon: OpenLineage to apply formally for Incubation status with the LFAI
    • Blog: a post by Ernie Ostic about MANTA’s OpenLineage integration
    • Website: a new Ecosystem page
    • Workshops repo: An Intro to Dataset Lineage with Jupyter and Spark
    • Airflow docs: guidance on creating custom extractors to support external operators
    • Spark docs: improved documentation of column lineage facets and extensions
  • Recent release 0.16.1 [Michael R.] 
    • Added

      • Airflow: add dag_run information to Airflow version run facet #1133 @fm100
        Adds the Airflow DAG run ID to the taskInfo facet, making this additional information available to the integration.
      • Airflow: add LoggingMixin to extractors #1149 @JDarDagran
        Adds a LoggingMixin class to the custom extractor to make the output consistent with general Airflow and OpenLineage logging settings.
      • Airflow: add default extractor #1162 @mobuchowski
        Adds a DefaultExtractor to support the default implementation of OpenLineage for external operators without the need for custom extractors.
      • Airflow: add on_complete argument in DefaultExtractor #1188 @JDarDagran
        Adds support for running another method on extract_on_complete.
      • SQL: reorganize the library into multiple packages #1167 @StarostaGit @mobuchowski
        Splits the SQL library into a Rust implementation and foreign language bindings, easing the process of adding language interfaces. Also contains a CI fix.

      Changed

      • Airflow: move get_connection_uri as extractor's classmethod #1169 @JDarDagran
        The get_connection_uri method allowed for too many params, resulting in unnecessarily long URIs. This changes the logic to whitelisting per extractor.
      • Airflow: change get_openlineage_facets_on_start/complete behavior #1201 @JDarDagran
        Splits up the method for greater legibility and easier maintenance.
    • Removed

      • Airflow: remove support for Airflow 1.10 #1128 @mobuchowski
        Removes the code structures and tests enabling support for Airflow 1.10.
    • Bug fixes and more details 
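For context on the DefaultExtractor addition above: `get_openlineage_facets_on_start` is the method the DefaultExtractor looks for on an operator, so the operator can describe its own lineage without a dedicated extractor class. The sketch below uses stand-in classes rather than the real `OperatorLineage`/`Dataset` types from the integration, so the exact shapes are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Dataset:
    # Stand-in for the integration's dataset type (namespace + name).
    namespace: str
    name: str

@dataclass
class OperatorLineage:
    # Stand-in for the structure an extractor returns to the integration.
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)

class MyCustomOperator:
    """An operator the DefaultExtractor could handle: it reports its own lineage."""

    def __init__(self, source_table: str, target_table: str):
        self.source_table = source_table
        self.target_table = target_table

    def get_openlineage_facets_on_start(self) -> OperatorLineage:
        # The DefaultExtractor calls this method on the operator itself,
        # removing the need to register a custom extractor.
        return OperatorLineage(
            inputs=[Dataset("postgres://db", self.source_table)],
            outputs=[Dataset("postgres://db", self.target_table)],
        )
```

The `on_complete` argument mentioned in #1188 lets a second method run on task completion (`extract_on_complete`) in the same style.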

  • Update on LFAI & Data progress [Michael R.]
    • LFAI & Data: a single funding effort to support technical projects hosted under the Linux Foundation
    • Current status: applying soon for Incubation, will be ready to apply for Graduation soon (dates TBD).
    • Incubation stage requirements:
      • 2+ organizations actively contributing to the project: 23 organizations
      • A sponsor who is an existing LFAI & Data member: to do
      • 300+ stars on GitHub: 1.1K GitHub stars
      • A Core Infrastructure Initiative Best Practices Silver Badge: Silver Badge earned on November 2
      • Affirmative vote of the TAC and Governing Board: pending
      • A defined TSC with a chairperson: TSC with chairperson, Julien Le Dem
    • Graduation stage requirements:
      • 5+ organizations actively contributing to the project: 23 organizations
      • Substantial flow of commits for 12 months: commit growth rate (12 mo.) 155.53%; avg. commits pushed by active contributors (12 mo.) 2.18K
      • 1000+ stars on GitHub: 1.1K GitHub stars
      • Core Infrastructure Initiative Best Practices Gold Badge: Gold Badge in progress (57%)
      • Affirmative vote of the TAC and Governing Board: pending
      • 1+ collaboration with another LFAI project: Marquez, Egeria, Amundsen
      • Technical lead appointed on the TAC: to do


  • Implementing OpenLineage proposal and discussion [Julien]
    • Procedure for implementing OpenLineage is under-documented
    • Goal: provide a better guide on the multiple approaches that exist
    • Contributions are welcome
    • Expect more information about this at the next meeting
  • MANTA integration update [Petr]
    • Project: MANTA OpenLineage Connector
    • Straightforward solution:
      • An agent installed on the customer side sets up an API endpoint for MANTA
      • The MANTA Agent will hand over OpenLineage events to the MANTA OpenLineage Extractor, which will save the data in a MANTA OpenLineage Event Repository
      • Use the MANTA Admin UI to run/schedule the MANTA OpenLineage Reader to generate an OpenLineage Graph and produce the final MANTA Graph using a MANTA OpenLineage Generator
      • The whole process will be parameterized
    • Demo:
      • Example dataset produced by Keboola integration
      • All dependencies visualized in UI
      • Some information about columns is available, but not true column lineage
      • Possible to draw lineage across range of tools
    • Looking for volunteers willing to test the integration
    • Q&A
      • Are you using the Column-level Lineage Facet from OpenLineage?
        • Not yet, but we would like to test it
        • Find a good example of this in the OpenLineage/workshops/Spark GitHub repo
        • What would be great would be a real example/real environment for testing
  • Linking CMF (a common ML metadata framework) and OpenLineage [Suparna & Ann Mary]
    • https://github.com/HewlettPackard/cmf
    • Where CMF will fit in the OpenLineage ecosystem
      • linkage needed between forms of metadata for conducting AI experiments
      • concept: "git for AI metadata" consumable by tools such as Marquez and Egeria after publication by an OpenLineage-CMF publisher
      • challenges:
        • multiple stages with interlinked dependencies
        • executing asynchronously
        • data centricity requires artifact lineage and tracking influence of different artifacts and data slices on model performance
        • pipelines should be Reproducible, Auditable and Traceable
        • end-to-end visibility is necessary to identify biases, etc.
      • AI for Science example:
        • training loop in complex pipeline with multiple models optimized concurrently
          • e.g., an embedding model, edge selection model and graph neural model in same pipeline
          • CMF used to capture metadata across pipeline stages
      • Manufacturing quality monitoring pipeline
        • iterative retraining with new samples added to the dataset every iteration
        • CMF tracks lineage across training and deployment stages
        • Q: is the recording of metadata automatic, or does the data scientist have control over it?
          • there are both explicit (e.g., APIs) and implicit modes of tracking
          • the data scientist can choose which "branches" to "push" a la Git
      • 3 columns of reproducibility
        • metadata store (MLMD/MLFlow)
        • Artifact Store (DVC/Others)
        • Query Cache Layer (Graph Database)
        • GIT
        • optimization
      • Comparison with other AI metadata infrastructure
        • Git-like support and ability to collaborate across teams distinguish CMF from alternatives
        • Metrics and lineage also make CMF comparable to model-centric and pipeline-centric tools
      • Lineage tracking and decentralized usage model
        • complete view of data model lineage for reproducibility, optimization, explainability
        • decentralized usage model, easily cloned in any environment
      • What does it look like?
        • explicit tracking via Python library
        • tracking of dataset, model and metrics
        • offers end-to-end visibility
      • API
        • abstractions: pipeline state, context/stage of execution, execution
      • Automated logging, heterogeneous stacks, and distributed teams
        • enables collaboration of distributed teams of scientists using a diverse set of libraries
        • automatic logging in command line interface
      • POC implementations
        • allows for integration with existing frameworks
        • compatible with ML/DL frameworks and ML tracking platforms
      • Translation between CMF and OpenLineage
        • export of metadata in OpenLineage format
        • mapping of abstractions onto OpenLineage
        • Run ~ Execution with Run facet
        • Job ~ Context with Job facet
        • Dataset ~ Dataset with Dataset facet
        • Namespace ~ Pipeline
      • Q&A
        • Pipeline might map to Job name
        • Context might map to Pipeline as Parent job
        • Model could map to a Dataset as well as a Dataset facet
        • Metric as a model could map to a Dataset facet
        • 2 levels of dataset facet, one static and one tied to Job Runs
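The CMF-to-OpenLineage mapping above (Run ~ Execution, Job ~ Context, Dataset ~ Dataset, Namespace ~ Pipeline) can be illustrated with a small translation function. The OpenLineage event fields (`eventType`, `run.runId`, `job`, `schemaURL`) follow the published spec; the CMF-side parameters and the producer URL are hypothetical:

```python
import uuid
from datetime import datetime, timezone

def cmf_execution_to_openlineage(pipeline: str, stage: str, execution_id: str,
                                 inputs: list, outputs: list) -> dict:
    """Translate one CMF execution into an OpenLineage RunEvent-shaped dict.

    Mapping from the notes: Namespace ~ Pipeline, Job ~ Context/stage,
    Run ~ Execution, Dataset ~ Dataset.
    """
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        # The CMF execution id becomes the run id (must be a UUID in OpenLineage),
        # derived deterministically here so re-exports stay stable.
        "run": {"runId": str(uuid.uuid5(uuid.NAMESPACE_URL, execution_id))},
        # The CMF pipeline becomes the namespace; the stage becomes the job name.
        "job": {"namespace": pipeline, "name": stage},
        "inputs": [{"namespace": pipeline, "name": n} for n in inputs],
        "outputs": [{"namespace": pipeline, "name": n} for n in outputs],
        "producer": "https://example.com/cmf-openlineage-publisher",  # hypothetical
        "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json",
    }
```

The Q&A refinements (Pipeline as parent job, models and metrics as datasets with facets) would layer facets onto this same shape.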

...

  • Release 0.9.0 [Michael R.]
    • We added:
    • For the bug fixes and more information, see the GitHub repo.
    • Shout out to new contributor Jakub Dardziński, who contributed a bug fix to this release!
  • Snowflake Blog Post [Ross]
    • topic: a new integration between OL and Snowflake
    • integration is the first OL extractor to process query logs
    • design:
      • an Airflow pipeline processes queries against Snowflake
      • separate job: pulls access history and assembles lineage metadata
      • two angles: Airflow sees it, Snowflake records it
    • the meat of the integration: a view that does untold SQL madness to emit JSON to send to OL
    • result: you can study the transformation by asking Snowflake AND Airflow about it
    • required: having access history enabled in your Snowflake account (which requires a special access level)
    • Q & A
      • Howard: is the access history task part of the DAG?
      • Ross: yes, there's a separate DAG that pulls the view and emits the events
      • Howard: what's the scope of the metadata?
      • Ross: the account level
      • Michael C: in Airflow integration, there's a parent/child relationship; is this captured?
      • Ross: there are 2 jobs/runs, and there's work ongoing to emit metadata from Airflow (task name)
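The "separate job that pulls access history and assembles lineage metadata" described above can be sketched roughly as follows, assuming rows shaped like Snowflake's ACCESS_HISTORY view (simplified to plain dicts here; the real view exposes JSON columns such as DIRECT_OBJECTS_ACCESSED and OBJECTS_MODIFIED, and the actual integration does this assembly in SQL):

```python
def access_history_to_lineage(rows: list) -> list:
    """Assemble per-query lineage from access-history-style rows.

    Each row is assumed to carry 'query_id', 'direct_objects_accessed',
    and 'objects_modified' fields, mirroring Snowflake's ACCESS_HISTORY view.
    """
    events = []
    for row in rows:
        events.append({
            # One lineage job/run per query, as Snowflake recorded it.
            "job": row["query_id"],
            "inputs": [o["objectName"] for o in row["direct_objects_accessed"]],
            "outputs": [o["objectName"] for o in row["objects_modified"]],
        })
    return events
```

This is the "Snowflake records it" angle; the Airflow side sees the same transformation from the orchestration perspective.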
  • Great Expectations integration [Michael C.]
    • validation actions in GE execute after validation code does
    • metadata extracted from these and transformed into facets
    • recent update: the integration now supports version 3 of the GE API
    • some configuration ongoing: currently you need to set up validation actions in GE
    • Q & A
      • Willy: is the metadata emitted as facets?
      • Michael C.: yes, two
  • dbt integration [Willy]
    • a demo on getting started with the OL-dbt library
      • pip install the integration library and dbt
      • configure the dbt profile
      • run seed command and run command in dbt
      • the integration extracts metadata from the different views
      • in Marquez, the UI displays the input/output datasets, job history, and the SQL
  • Open discussion
    • Howard: what is the process for becoming a committer?
      • Maciej: nomination by a committer then a vote
      • Sheeri: is coding beforehand recommended?
      • Maciej: contribution to the project is expected
      • Willy: no timeline on the process, but we are going to try to hold a regular vote
      • Ross: project documentation covers this but is incomplete
      • Michael C.: is this process defined by the LFAI?
    • Ross: contributions to the website, workshops are welcome!
    • Michael R.: we're in the process of moving the meeting recordings to our YouTube channel

May 19th, 2022 (10am PT)

Agenda:

...

  • TSC:
    • Mike Collado: Staff Software Engineer, Datakin
    • Maciej Obuchowski: Software Engineer, GetInData, OpenLineage contributor
    • Julien Le Dem: OpenLineage Project lead
    • Willy Lulciuc: Co-creator of Marquez
  • And:
    • Ernie Ostic: SVP of Product, Manta 
    • Sandeep Adwankar: Senior Technical Product Manager, AWS
    • Paweł Leszczyński, Software Engineer, GetInData
    • Howard Yoo: Staff Product Manager, Astronomer
    • Michael Robinson: Developer Relations Engineer, Astronomer
    • Ross Turk: Senior Director of Community, Astronomer
    • Minkyu Park: Senior Software Engineer, Astronomer
    • Will Johnson: Senior Cloud Solution Architect, Azure Cloud, Microsoft

Meeting:

Recording: http://youtube.com/watch?v=X0ZwMotUARA

Notes:

  • Releases
  • Communication reminders [Julien]
  • Agenda [Julien]
  • Column-level lineage [Paweł]
    • Linked to 4 PRs, the first being a proposal
    • The second has been merged, but the core mechanism is turned off
    • 3 requirements:
      • Outputs labeled with expression IDs
      • Inputs with expression IDs
      • Dependencies
    • Once it is turned on, each OL event will receive a new JSON field
    • It would be great to be able to extend this API (currently on the roadmap)
    • Q & A
      • Will: handling user-defined functions: is the solution already generic enough?
        • The answer will depend on testing, but I suspect that the answer is yes
        • The team at Microsoft would be excited to learn that the solution will handle UDFs
      • Julien: the next challenge will be to ensure that all the integrations support column-level lineage
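The "new JSON field" each event receives once the mechanism is turned on is the column-lineage dataset facet. A minimal example of its shape as defined in the spec — the dataset, namespace, and field names below are invented:

```python
# Shape of the columnLineage dataset facet, attached to an output dataset:
# each output field lists the input fields it was derived from.
column_lineage_facet = {
    "columnLineage": {
        "fields": {
            "full_name": {
                "inputFields": [
                    {"namespace": "postgres://db", "name": "users", "field": "first_name"},
                    {"namespace": "postgres://db", "name": "users", "field": "last_name"},
                ]
            }
        }
    }
}
```

The three requirements above (labeled outputs, labeled inputs, dependencies) are exactly what is needed to populate `fields` and its `inputFields` lists.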
  • Open discussion
    • Willy: in Marquez we need to start handling column-level lineage; has anyone thought about how this might work?
      • Julien: lineage endpoint for col-level lineage to layer on top of what already exists
      • Willy: this makes sense – we could use the method for input and output datasets as a model
      • Michael C.: I don't know that we need to add an endpoint – we could augment the existing one to do something with the data
      • Willy: how do we expect this to be visualized?
        • Julien: not quite sure
        • Michael C.: there are a number of different ways we could do this, including isolating relevant dataset fields 

...

  • 0.6.2 release overview [Michael R.]
  • Transports in OpenLineage clients [Maciej]
  • Airflow integration update [Maciej]
  • Dagster integration retrospective [Dalin]
  • Open discussion

Meeting info:

Recording: http://youtube.com/watch?v=MciFCgrQaxk

Notes:

  • Introductions
  • Communication channels overview [Julien]
  • Agenda overview [Julien]
  • 0.6.2 release overview [Michael R.]

...

  • New committers [Julien]
    • 4 new committers were voted in last week
    • We had fallen behind
    • Congratulations to all
  • Release overview (0.6.0-0.6.1) [Michael R.]
    • Added
      • Extract source code of PythonOperator code similar to SQL facet @mobuchowski (0.6.0)
      • Airflow: extract source code from BashOperator @mobuchowski (0.6.0)
        • These first two additions are similar to SQL facet
        • Offer the ability to see top-level code
      • Add DatasetLifecycleStateDatasetFacet to spec @pawel-big-lebowski (0.6.0)
        • Captures when someone is conducting dataset operations (overwrite, create, etc.)
      • Add generic facet to collect environmental properties (EnvironmentFacet) @harishsune (0.6.0)
        • Collects environment variables
        • Depends on Databricks runtime but can be reused in other environments
      • OpenLineage sensor for OpenLineage-Dagster integration @dalinkim (0.6.0)
        • The first iteration of the Dagster integration to get lineage from Dagster
      • Java-client: make generator generate enums as well @pawel-big-lebowski (0.6.0)
        • Small addition to the Java client featuring better types; previously a string
    • Fixed
      • Airflow: increase import timeout in tests, fix exit from integration @mobuchowski (0.6.0)
        • The former was a particular issue with the Great Expectations integration
      • Reduce logging level for import errors to info @rossturk (0.6.0)
        • Airflow users were seeing warnings about missing packages if they weren't using a part of an integration
        • This fix reduced the level to Info
      • Remove AWS secret keys and extraneous Snowflake parameters from connection URI @collado-mike (0.6.0)
        • Parses Snowflake connection URIs to exclude some parameters that broke lineage or posed security concerns (e.g., login data)
        • Some keys are Snowflake-specific, but more can be added from other data sources
      • Convert to LifecycleStateChangeDatasetFacet @pawel-big-lebowski (0.6.0)
        • Mandates the LifecycleStateChange facet from the global spec rather than the custom tableStateChange facet used in the past
      • Catch possible failures when emitting events and log them @mobuchowski (0.6.1)
        • Previously when an OL event failed to emit, this could break an integration
        • This fix catches possible failures and logs them
  • Process for blog posts [Ross]
    • Moving the process to GitHub Issues
    • Follow release tracker there

    • Go to https://github.com/OpenLineage/website/tree/main/contents/blog to create posts

    • No one will have a monopoly

    • Proposals for blog posts also welcome and we can support your efforts with outlines, feedback

    • Throw your ideas on the issue tracker on GitHub

  • Retrospective: Spark integration [Willy et al.]
    • Willy: originally this was part of Marquez – the inspiration behind OL

      • OL was prototyped in Marquez with a few integrations, one of which was Spark (other: Airflow)

      • Donated the integration to OL

    • Srikanth: #559 very helpful to Azure

    • Pawel: is anything missing from the Spark integration? E.g., column-level lineage?

    • Will: yes to column-level; also, delta tables are an issue due to complexity; Spark 3.2 support also welcome

    • Maciej: we should be more active about tracking projects we have integrations with; add to test matrix

    • Julien: let’s open some issues to address these

  • Open Discussion
    • Flink updates? [Julien]
      • Maciej: initial exploration is done

        • challenge: Flink has 4 APIs

        • prioritizing Kafka lineage currently because most jobs are writing to/from Kafka

        • track this on GitHub milestones, contribute, ask questions there

      • Will: can you share thoughts on the data model? How would this show up in MZ? How often are you emitting lineage? 

      • Maciej: trying to model entire Flink run as one event

      • Srikanth: proposed two separate streams, one for data updates and one for metadata

      • Julien: do we have an issue on this topic in the repo?

      • Michael C.: only a general proposal doc, not one on the overall strategy; this is worth a proposal doc

      • Julien: see notes for ticket number; MC will create the ticket

      • Srikanth: we can collaborate offline

...

  • OpenLineage recent release overview (0.5.1) [Julien]
  • TaskInstanceListener now official way to integrate with Airflow [Julien]
  • Apache Flink integration [Julien]
  • Dagster integration demo [Dalin]
  • Open Discussion

Meeting:

Slides

Recording: http://youtube.com/watch?v=cIrXmC0zHLg

Notes:

  • OpenLineage recent release overview (0.5.1) [Julien]
    • No 0.5.0 due to a bug
    • Support for dbt-spark adapter
    • New backend to proxy OL events
    • Support for custom facets
  • TaskInstanceListener now official way to integrate with Airflow [Julien]
    • Integration runs on worker side
    • Will be in the next release of Airflow (2.3)
    • Thanks to Maciej for his work on this
  • Apache Flink integration [Julien]
    • Ticket for discussion available
    • Integration test setup
    • Early stages
  • Dagster integration demo [Dalin]
    • Initiated by Dalin Kim
    • OL used with Dagster on orchestration layer
    • Utilizes Dagster sensor
    • Introduces OL sensor that can be added to Dagster repo definition
    • Uses cursor to keep track of ID
    • Looking for feedback after review complete
    • Discussion:
      • Dalin: needed: way to interpret Dagster asset for OL
      • Julien: common code from Great Expectations/Dagster integrations
      • Michael C: do you pass parent run ID in child job when sending the job to MZ?
      • Hierarchy can be extended indefinitely – parent/child relationship can be modeled
      • Maciej: the sensor kept failing – does this mean the events persisted despite being down?
      • Dalin: yes - the sensor’s cursor is tracked, so even if repo goes down it should be able to pick up from last cursor
      • Dalin: hoping for more feedback
      • Julien: slides will be posted on slack channel, also tickets
  • Open discussion
    • Will: how is OL ensuring consistency of datasets across integrations? 
    • Julien: (jokingly) Read the docs! Naming conventions for datasets can be found there
    • Julien: need for tutorial on creating integrations
    • Srikanth: have done some of this work in Atlas
    • Kevin: are there libraries on the horizon to play this role? (Julien: yes)
    • Srikanth: it would be good to have model spec to provide enforceable standard
    • Julien: agreed; currently models are based on the JSON schema spec
    • Julien: contributions welcome; opening a ticket about this makes sense
    • Will: Flink integration: MZ focused on batch jobs
    • Julien: we want to make sure; we may need to add checkpointing
    • Julien: there will be discussion in OLMZ communities about this
      • In MZ, there are questions about what counts as a version or not
    • Julien: a consistent model is needed
    • Julien: one solution being looked into is Arrow
    • Julien: everyone should feel welcome to propose agenda items (even old projects)
    • Srikanth: who are you working with on the Flink comms side? Will get back to you.

...

...

  • Attendees: 
    • TSC:
      • Mandy Chessell: Egeria Lead. Integrating OpenLineage in Egeria

      • Michael Collado: Datakin, OpenLineage

      • Maciej Obuchowski: GetInData. OpenLineage integrations
      • Willy Lulciuc: Marquez co-creator.
      • Ryan Blue: Tabular, Iceberg. Interested in collecting lineage across Iceberg users with OpenLineage
    • And:
      • Venkatesh Tadinada: BMC workflow automation looking to integrate with Marquez
      • Minkyu Park: Datakin. Learning about OpenLineage
      • Arthur Wiedmer: Apple, lineage for Siri and AI ML. Interested in implementing Marquez and OpenLineage
  • Meeting recording:

Recording: http://youtube.com/watch?v=Gk0CwFYm9i4

  • Meeting notes:
    • agenda: 
      • Update on OpenLineage latest release (0.2.1)

        • dbt integration demo

      • OpenLineage 0.3 scope discussion

        • Facet versioning mechanism (Issue #153)

        • OpenLineage Proxy Backend (Issue #152)

        • OpenLineage implementer test data and validation

        • Kafka client

      • Roadmap

        • Iceberg integration
      • Open discussion

    • Slides 

    • Discussions:
      • Added to the agenda: a discussion of Iceberg requirements for OpenLineage.

    • Demo of dbt:

      • really easy to try

      • when running from airflow, we can use the wrapper 'dbt-ol run' instead of 'dbt run'

    • Presentation of Proxy Backend design:

      • summary of discussions in Egeria
        • Egeria is less interested in instances (runs) and will keep track of OpenLineage events separately as Operational lineage

        • Two ways to use Egeria with OpenLineage

          • receives HTTP events and forwards to Kafka

          • A consumer receives the Kafka events in Egeria

      • Proxy Backend in OpenLineage:

        • direct HTTP endpoint implementation in Egeria

      • Depending on the user, they might pick one or the other; we'll document both

    • Use a direct OpenLineage endpoint (like Marquez)

      • Deploy the Proxy Backend to write to a queue (ex: Kafka)

      • Follow up items:

...

Aug 11th 2021

  • Attendees: 
    • TSC:
      • Ryan Blue

      • Maciej Obuchowski

      • Michael Collado

      • Daniel Henneberger

      • Willy Lulciuc

      • Mandy Chessell

      • Julien Le Dem

    • And:
      • Peter Hicks

      • Minkyu Park

      • Daniel Avancini

  • Meeting recording:

Recording: http://youtube.com/watch?v=bbAwz-rzo3I

...

  • Attendees: 
    • TSC:
      • Julien Le Dem
      • Mandy Chessel
      • Michael Collado
      • Willy Lulciuc
  • Meeting recording:

Recording: http://youtube.com/watch?v=kYzFYrzSpzg

  • Meeting notes
    • Agenda:
    • Notes: 

      Mission statement:

      Spec versioning mechanism:

      • The goal is to commit to compatible changes once 0.1 is published

      • We need a follow up to separate core facet versioning


      => TODO: create a separate GitHub ticket.
      • The lineage event should have a field that identifies what version of the spec it was produced with

        • => TODO: create a GitHub issue for this

      • TODO: Add issue to document version number semantics (SCHEMAVER)

      Extend Event State notion:

      OpenLineage 0.1:

      • Finalize a few spec details for 0.1: a few items left to discuss.

        • In particular job naming

        • parent job model

      • Importing Marquez integrations in OpenLineage

      Open Discussion:

      • connecting the consumer and producer

        • TODO: ticket to track distribution mechanism

        • options:

          • Would we need a consumption client to make it easy for consumers to get events from Kafka for example?

          • OpenLineage provides client libraries to serialize/deserialize events as well as sending them.

        • We can have documentation on how to send to backends that are not Marquez using HTTP and existing gateway mechanism to queues.

        • Do we have a mutual third party or the client know where to send?

      • Source code location finalization

      • job naming convention

        • you don't always have a nested execution

          • can call a parent

        • parent job

        • You can have a job calling another one.

        • always distinguish a job and its run

      • need a separate notion for job dependencies

      • need to capture event driven: TODO: create ticket.


      TODO(Julien): update job naming ticket to have the discussion.

...

  • Attendees: 
    • TSC:
      • Julien Le Dem: Marquez, Datakin
      • Drew Banin: dbt, CPO at Fishtown Analytics
      • Maciej Obuchowski: Marquez, GetInData consulting company
      • Zhamak Dehghani: Data Mesh; an open protocol of observability for the data ecosystem is a big piece of Data Mesh
      • Daniel Henneberger: building a database, interested in lineage
      • Mandy Chessell: Lead of Egeria, metadata exchange; lineage is a great extension
      • Willy Lulciuc: co-creator of Marquez
      • Michael Collado: Datakin, OpenLineage end-to-end holistic approach
    • And:
      • Kedar Rajwade: consulting on distributed systems
      • Barr Yaron: dbt, PM at Fishtown Analytics on metadata
      • Victor Shafran: co-founder at databand.ai, a pipeline monitoring company; lineage is a common issue
    • Excused: Ryan Blue, James Campbell
  • Meeting recording:

Recording: http://youtube.com/watch?v=er2GDyQtm5M

...