
The OpenLineage Technical Steering Committee meets monthly, on the third Wednesday, from 9:30am to 10:30am US Pacific. Here's the meeting info.

All are welcome.

Table of Contents

...


Next meeting: February 19th, 2025 (9:30am PT)

January 15th, 2025 (9:30am PT)

Attendees:

TSC:
- Julien Le Dem, Datadog, OpenLineage Project Lead
- Michael Robinson, OpenLineage Community
- Maciej Obuchowski, Software Engineer, GetInData
- Sheeri Cabral, Product Manager, Capital One Software
  
And:
- Dan Rolles, Founder/CEO, BearingNode
- Leo Godin, Data Engineer, NewRelic
Notes:
  • Recent Releases 
  • Presentations
    • Data and Information Observability - Dan Rolles
      • BCBS239 - only 2 of 31 banks fully comply with BCBS239 even though it is 10 years old. It concerns risk management.
      • Dan presents a Data & Information Observability Framework (slide screenshot forthcoming)
        • Tried not to duplicate capabilities - e.g. Risk Management and Compliance are covered by Data Governance
      • Discussion points - for a working group
        • Standardizing Financial Data Lineage Events
        • Unstructured Data and LLM Pipeline Observability
        • Value-Aligned Dataset Consumption Patterns
    • OpenLineage in Airflow 3
      • Airflow 3 is rewriting its architecture, eliminating the direct connection between workers and the Airflow [?]; workers will use an API instead
      • In Airflow 2, users could manually mark tasks/DAG runs as success or failure, but this was not emitted with other OpenLineage information. This will be fixed in Airflow 3
      • Future features:
  • Open Discussion
    • Github releases are up-to-date but documentation release notes are not automatically updated.
    • Tagging - on a per-integration basis. Key/value pairs. Discussion of olin vs. ol. Leo will put a proposal in for dbt tags.
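For illustration, the per-integration key/value tagging discussed above could be modeled as a run facet. This is a sketch only: the facet name, schema URL, and shape below are hypothetical, since the actual proposal (e.g. for dbt tags) was still being drafted.

```python
import json

def tags_run_facet(tags, producer="https://example.com/my-producer"):
    """Build a hypothetical key/value tags facet.

    The facet name and schema here are illustrative only; the real
    tagging proposal was still under discussion at this meeting.
    """
    return {
        "tags": {
            "_producer": producer,
            "_schemaURL": "https://example.com/schemas/TagsRunFacet.json",
            "tags": [{"key": k, "value": v} for k, v in tags.items()],
        }
    }

facet = tags_run_facet({"team": "data-platform", "env": "prod"})
print(json.dumps(facet, indent=2))
```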
  
Meeting:
video links (forthcoming)

2024

December 18th, 2024 (9:30am PT)

November 20th, 2024 (9:30am PT)

...

August 14th, 2024 (9:30am PT)

Needs: Upload video and wiki notes

...

August 14, 2024

...

Attendees:

...

TSC:
- Michael Robinson, Astronomer

...

- Sheeri Cabral, Product Manager, Collibra
  
And:
- Dan Rolles, Founder/CEO, BearingNode

...

- Chris, Software Engineer, Matillion
Notes:
  • Announcements

      ...

        • Meetup - San Francisco, Sept 12th, during Airflow Summit (link to meetup)

      ...

        • New committers - Jens Pfau (Google), Sheeri Cabral (Collibra)

      ...

        • New integrations - Amazon DataZone, Trino

      ...

      • Recent Releases 

          ...

          ...

          ...

          ...

          • AWS DataZone Integration Update - Priya
          • OpenLineage consumer - specifically AWS Glue on Redshift
          • Implementation of compliance/acceptance tests - Tomasz
          • Framework for consumers and producers to make their OpenLineage compatibility public. LINK TO GITHUB
          • Discussion Items
          • Proposal: deprecate support for Spark 2.4 - Maciej
          • Does anyone have use cases? Let us know in Slack.
          • Open Discussion
            
          Meeting:
          Slides and video links (forthcoming)

          July 10th, 2024 (9:30am PT)

          Attendees:

          TSC:
          - Michael Robinson, Astronomer

          ...

          Integration matrix
              - Jens suggests expanding on the integration matrix and mentions issues with iceberg support in Spark.
              - Eric reflects on Jens' suggestion.
              - Michael Robinson thanks Jens for the input.

          2023

          December 14, 2023 (10am PT)

          ...

          • TSC:
            • Mike Collado, Staff Software Engineer, Astronomer
            • Julien Le Dem, OpenLineage Project lead
            • Willy Lulciuc, Co-creator of Marquez
            • Michael Robinson, Software Engineer, Dev. Rel., Astronomer
            • Maciej Obuchowski, Software Engineer, GetInData, OpenLineage contributor
            • Mandy Chessell, Egeria Project Lead
            • Daniel Henneberger, Database engineer
            • Will Johnson, Senior Cloud Solution Architect, Azure Cloud, Microsoft
            • Jakub "Kuba" Dardziński, Software Engineer, GetInData, OpenLineage contributor
          • And:
            • Petr Hajek, Information Management Professional, Profinit
            • Harel Shein, Director of Engineering, Astronomer
            • Minkyu Park, Senior Software Engineer, Astronomer
            • Sam Holmberg, Software Engineer, Astronomer
            • Ernie Ostic, SVP of Product, MANTA
            • Sheeri Cabral, Technical Product Manager, Lineage, Collibra
            • John Thomas, Software Engineer, Dev. Rel., Astronomer
            • Bramha Aelem, BigData/Cloud/ML and AI Architect, Tiger Analytics

          ...

          • Announcements
            • OpenLineage earned Incubation status with the LFAI & Data Foundation at their December TAC meeting!
              • Represents our maturation in terms of governance, code quality assurance practices, documentation, more
              • Required earning the OpenSSF Silver Badge, sponsorship, at least 300 GitHub stars
              • Next up: Graduation (expected in early summer)
          • Recent release 0.19.2 [Michael R.]
          • Column-level lineage update [Maciej]
            • What is the OpenLineage SQL parser?
              • At its core, it’s a Rust library that parses SQL statements and extracts lineage data from it 
              • 80/20 solution - we’ll not be able to parse all possible SQL statements - each database has custom extensions and different syntax, so we focus on standard SQL.
              • Good example of complicated extension: Snowflake COPY INTO https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
              • We primarily use the parser in Airflow integration and Great Expectations integration
              • Why? Airflow does not “understand” a lot of what some operators do, for example PostgreSqlOperator
              • We also have Java support package for parser   
            • What changed previously?
              • Parser in current release can emit column-level lineage!
              • At the last OL meeting, Piotr Wojtczak, the primary author of this change, presented the new parser core that enabled this functionality
                https://www.youtube.com/watch?v=Lv_bODeAVYQ
              • Still, the fact that Rust code can do that does not mean we have it for free everywhere
            • What has changed recently?
              • We wrote “glue code” that allows us to use new parser constructs in Airflow integration
              • Error handling just got way easier: SQL parser can “partially” parse SQL construct, and report errors it encountered, with particular statements that caused it.
            • Usage
              • Airflow integration extractors based on SqlExtractor (ex. PostgreSqlExtractor, SnowflakeExtractor, TrinoExtractor…) are now able to extract column-level lineage
              • Close future: Spark will be able to extract lineage from JDBCRelation.
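The "80/20" and "partial parsing" ideas above can be illustrated with a toy extractor. The real parser is a Rust library handling standard SQL; this sketch only recognizes one narrow INSERT INTO ... SELECT ... FROM pattern and gives up gracefully on anything else, which is the spirit of partial parsing rather than the actual implementation.

```python
import re

def toy_sql_lineage(sql):
    """Toy illustration of the 80/20 idea behind the OpenLineage SQL parser.

    Recognizes only a simple INSERT INTO ... FROM pattern; anything else
    (custom dialects, extensions like Snowflake COPY INTO) returns None,
    mirroring how the real parser reports what it could not handle
    instead of crashing.
    """
    m = re.match(
        r"\s*INSERT\s+INTO\s+(\S+).*?\bFROM\s+(\S+)",
        sql,
        re.IGNORECASE | re.DOTALL,
    )
    if not m:
        return None  # unsupported statement: report, don't fail
    return {"inputs": [m.group(2)], "outputs": [m.group(1)]}

print(toy_sql_lineage("INSERT INTO analytics.daily SELECT * FROM raw.events"))
```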
          • Recent improvements to the Airflow integration [Kuba]
            • OpenLineage facets
              • Facets are pieces of metadata that can be attached to the core entities: run, job or dataset
              • Facets provide context to OpenLineage events
              • They can be defined as either part of the OpenLineage spec or custom facets
            • Airflow generic facet
              • Previously multiple custom facets with no standard
                • AirflowVersionRunFacet as an example of rapidly growing facet with version unrelated information
              • Introduced AirflowRunFacet with Task, DAG, TaskInstance and DagRun properties
              • Old facets are going to be deprecated soon. Currently both old and new facets are emitted
                • AirflowRunArgsRunFacet, AirflowVersionRunFacet, AirflowMappedTaskRunFacet will be removed
                • All information from above is moved to AirflowRunFacet
            • Other improvements (added in 0.19.2)
              • SQL extractors now send column-level lineage metadata
              • Further facets standardization

                • Introduced ProcessingEngineRunFacet
                  • provides processing engine information, e.g. Airflow or Spark version
                • Improved support for nominal start & end times
                  • makes use of data interval (introduced in Airflow 2.x)
                  • nominal end time now matches next schedule time
                • DAG owner added to OwnershipJobFacet
                • Added support for S3FileTransformOperator and TrinoOperator (@sekikn’s great contribution)
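Putting the facet points above together, a minimal RunEvent carrying a ProcessingEngineRunFacet might look like the sketch below. Field names follow the OpenLineage spec as discussed here, but treat the exact schema URL and values as illustrative.

```python
import json
from datetime import datetime, timezone
from uuid import uuid4

# Sketch of a RunEvent carrying a run facet, per the discussion above.
# Exact facet field names follow the spec at the time and may have
# evolved since; values are placeholders.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "producer": "https://example.com/my-producer",
    "job": {"namespace": "example", "name": "daily_sales"},
    "run": {
        "runId": str(uuid4()),
        "facets": {
            "processing_engine": {
                "_producer": "https://example.com/my-producer",
                "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/ProcessingEngineRunFacet.json",
                "version": "2.5.0",  # e.g. the Airflow version
                "name": "Airflow",
            }
        },
    },
    "inputs": [],
    "outputs": [],
}
print(json.dumps(event)[:60], "...")
```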
          • Discussion: what does it mean to implement the spec? [Sheeri]
            • What does it mean to meet the spec?
              • 100% compliance is not required
              • OL ecosystem page
                • doesn't say what exactly it does
                • operational lineage not well defined
                • what does a payload look like? hard to find this info
              • Compatibility between producers/consumers is unclear
            • Important if standard is to be adopted widely [Mandy]
              • Egeria: uses compliance test with reports and badging; clarifies compatibility
              • test and test cases available in the Egeria repo, including profiles and clear rules about compliant ways to support Egeria
              • a badly behaving producer or consumer will create problems
              • have to be able to trust what you get
            • What about consumers? [Mike C.]
              • can we determine if they have done the correct thing with facets? [John]
              • what do we call "compliant"?
              • custom facets shouldn't be subject to this – they are by definition custom (and private) [Maciej]
              • only complete events (not start events) should be required – start events not desired outside of operational use cases [Maciej]
            • There's a simple baseline on the one hand and facets on the other [Julien]
            • Note: perfection isn't the goal
              • instead: shared test cases, data such as sample schema that can be tested against
            • Marquez doesn't explain which facets it's using or how [Willy]
              • communication by consumers could be better
            • Effort at documenting this: matrix [Julien]
            • How would we define failing tests? [Maciej]
              • at a minimum we could have a validation mode [Julien]
              • challenge: the spec is always moving, growing [Maciej]
              • ex: in the case of JSON schema validation, facets are versioned individually but there's a reference schema that is versioned that might not be the current schema. Facets can be dereferenced, but the right way to do this is not clear [Danny]
              • one solution could be to split out base types, or we could add a tool that would force us to clean this up
              • client-side proxy presents same problem; tried different validators in Go; a workaround is to validate against the main doc first; by continually validating against the client proxy we can make sure it stays compliant with the spec [Minkyu]
              • Mandy: if Marquez says it's "OK," it's OK; we've been doing it manually [Mandy]
              • Marquez doesn't do any validation for consumers [Mike C.]
              • manual validation is not good enough [Mandy]
              • I like the idea of compliance badges – it would be cool if we had a way to validate consumers and there were a way to prove this and if we could extend validation to integrations like the Airflow integration [Mike C.]
            • Let's follow up on Slack and use the notes from this discussion to collaborate on a proposal [Julien]
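The "validation mode" floated above could start as simple as a baseline field check. The sketch below is only that baseline: required top-level fields plus run/job shape. Facet-level schema validation (the hard, always-moving part per the discussion) is deliberately out of scope, and the required-field list is an assumption drawn from the core spec.

```python
REQUIRED_TOP_LEVEL = ("eventType", "eventTime", "producer", "job", "run")

def validate_run_event(event):
    """Minimal 'validation mode' sketch from the discussion above.

    Checks only a baseline derived from the core spec; it does not
    attempt versioned facet validation.
    """
    errors = [f"missing field: {f}" for f in REQUIRED_TOP_LEVEL if f not in event]
    if "run" in event and "runId" not in event.get("run", {}):
        errors.append("run.runId is required")
    if "job" in event:
        for f in ("namespace", "name"):
            if f not in event["job"]:
                errors.append(f"job.{f} is required")
    return errors

print(validate_run_event({"eventType": "START", "run": {}, "job": {}}))
```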

          2022

          December 8, 2022 (10am PT)

          ...

          • Release 0.9.0 [Michael R.]
            • We added:
            • For the bug fixes and more information, see the Github repo.
            • Shout out to new contributor Jakub Dardziński, who contributed a bug fix to this release!
          • Snowflake Blog Post [Ross]
            • topic: a new integration between OL and Snowflake
            • integration is the first OL extractor to process query logs
            • design:
              • an Airflow pipeline processes queries against Snowflake
              • separate job: pulls access history and assembles lineage metadata
              • two angles: Airflow sees it, Snowflake records it
            • the meat of the integration: a view that does untold SQL madness to emit JSON to send to OL
            • result: you can study the transformation by asking Snowflake AND Airflow about it
            • required: having access history enabled in your Snowflake account (which requires special access level)
            • Q & A
              • Howard: is the access history task part of the DAG?
              • Ross: yes, there's a separate DAG that pulls the view and emits the events
              • Howard: what's the scope of the metadata?
              • Ross: the account level
              • Michael C: in Airflow integration, there's a parent/child relationship; is this captured?
              • Ross: there are 2 jobs/runs, and there's work ongoing to emit metadata from Airflow (task name)
          • Great Expectations integration [Michael C.]
            • validation actions in GE execute after validation code does
            • metadata extracted from these and transformed into facets
            • recent update: the integration now supports version 3 of the GE API
            • some configuration ongoing: currently you need to set up validation actions in GE
            • Q & A
              • Willy: is the metadata emitted as facets?
              • Michael C.: yes, two
          • dbt integration [Willy]
            • a demo on getting started with the OL-dbt library
              • pip install the integration library and dbt
              • configure the dbt profile
              • run seed command and run command in dbt
              • the integration extracts metadata from the different views
              • in Marquez, the UI displays the input/output datasets, job history, and the SQL
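The demo flow above can be sketched as shell commands. Package names follow the openlineage-dbt integration; the OPENLINEAGE_URL value is a placeholder for your Marquez (or other consumer) endpoint.

```shell
# Sketch of the dbt demo flow above (endpoint value is a placeholder).
pip install openlineage-dbt dbt-core

# Point the integration at an OpenLineage consumer such as Marquez
export OPENLINEAGE_URL=http://localhost:5000

# Run dbt through the OpenLineage wrapper instead of plain `dbt`
dbt-ol seed
dbt-ol run
```

After the run, the input/output datasets, job history, and SQL show up in the Marquez UI as described above.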
          • Open discussion
            • Howard: what is the process for becoming a committer?
              • Maciej: nomination by a committer then a vote
              • Sheeri: is coding beforehand recommended?
              • Maciej: contribution to the project is expected
              • Willy: no timeline on the process, but we are going to try to hold a regular vote
              • Ross: project documentation covers this but is incomplete
              • Michael C.: is this process defined by the LFAI?
            • Ross: contributions to the website, workshops are welcome!
            • Michael R.: we're in the process of moving the meeting recordings to our YouTube channel

          May 19th, 2022 (10am PT)

          Agenda:

          ...

          • TSC:
            • Mike Collado: Staff Software Engineer, Datakin
            • Maciej Obuchowski: Software Engineer, GetInData, OpenLineage contributor
            • Julien Le Dem: OpenLineage Project lead
            • Willy Lulciuc: Co-creator of Marquez
          • And:
            • Ernie Ostic: SVP of Product, Manta 
            • Sandeep Adwankar: Senior Technical Product Manager, AWS
            • Paweł Leszczyński, Software Engineer, GetinData
            • Howard Yoo: Staff Product Manager, Astronomer
            • Michael Robinson: Developer Relations Engineer, Astronomer
            • Ross Turk: Senior Director of Community, Astronomer
            • Minkyu Park: Senior Software Engineer, Astronomer
            • Will Johnson: Senior Cloud Solution Architect, Azure Cloud, Microsoft

          Meeting:

          Video: http://youtube.com/watch?v=X0ZwMotUARA

          Notes:

          • Releases
          • Communication reminders [Julien]
          • Agenda [Julien]
          • Column-level lineage [Paweł]
            • Linked to 4 PRs, the first being a proposal
            • The second has been merged, but the core mechanism is turned off
            • 3 requirements:
              • Outputs labeled with expression IDs
              • Inputs with expression IDs
              • Dependencies
            • Once it is turned on, each OL event will receive a new JSON field
            • It would be great to be able to extend this API (currently on the roadmap)
            • Q & A
              • Will: handling user-defined functions: is the solution already generic enough?
                • The answer will depend on testing, but I suspect that the answer is yes
                • The team at Microsoft would be excited to learn that the solution will handle UDFs
              • Julien: the next challenge will be to ensure that all the integrations support column-level lineage
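Once the mechanism is turned on, the new JSON field added to each OL event would look roughly like the columnLineage dataset facet below, mapping each output column to the input columns it derives from. The shape follows the column-lineage facet; the schema URL and values are illustrative.

```python
import json

# Sketch of the new JSON field discussed above: a columnLineage dataset
# facet. Shape follows the OpenLineage column-lineage facet; the schema
# URL and example values are illustrative.
column_lineage = {
    "columnLineage": {
        "_producer": "https://example.com/my-producer",
        "_schemaURL": "https://openlineage.io/spec/facets/1-0-0/ColumnLineageDatasetFacet.json",
        "fields": {
            "total_price": {
                "inputFields": [
                    {"namespace": "warehouse", "name": "orders", "field": "price"},
                    {"namespace": "warehouse", "name": "orders", "field": "quantity"},
                ]
            }
        },
    }
}
print(json.dumps(column_lineage, indent=2))
```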
          • Open discussion
            • Willy: in Mqz we need to start handling col-level lineage, and has anyone thought about how this might work?
              • Julien: lineage endpoint for col-level lineage to layer on top of what already exists
              • Willy: this makes sense – we could use the method for input and output datasets as a model
              • Michael C.: I don't know that we need to add an endpoint – we could augment the existing one to do something with the data
              • Willy: how do we expect this to be visualized?
                • Julien: not quite sure
                • Michael C.: there are a number of different ways we could do this, including isolating relevant dataset fields 

          ...

          • 0.6.2 release overview [Michael R.]
          • Transports in OpenLineage clients [Maciej]
          • Airflow integration update [Maciej]
          • Dagster integration retrospective [Dalin]
          • Open discussion

          Meeting info:

          Video: http://youtube.com/watch?v=MciFCgrQaxk

          Notes:

          • Introductions
          • Communication channels overview [Julien]
          • Agenda overview [Julien]
          • 0.6.2 release overview [Michael R.]

          ...

          • New committers [Julien]
            • 4 new committers were voted in last week
            • We had fallen behind
            • Congratulations to all
          • Release overview (0.6.0-0.6.1) [Michael R.]
            • Added
              • Extract source code of PythonOperator code similar to SQL facet @mobuchowski (0.6.0)
              • Airflow: extract source code from BashOperator @mobuchowski (0.6.0)
                • These first two additions are similar to SQL facet
                • Offer the ability to see top-level code
              • Add DatasetLifecycleStateDatasetFacet to spec @pawel-big-lebowski (0.6.0)
                • Captures when someone is conducting dataset operations (overwrite, create, etc.)
              • Add generic facet to collect environmental properties (EnvironmentFacet) @harishsune (0.6.0)
                • Collects environment variables
                • Depends on Databricks runtime but can be reused in other environments
              • OpenLineage sensor for OpenLineage-Dagster integration @dalinkim (0.6.0)
                • The first iteration of the Dagster integration to get lineage from Dagster
              • Java-client: make generator generate enums as well @pawel-big-lebowski (0.6.0)
                • Small addition to Java client feat. better types; was string
            • Fixed
              • Airflow: increase import timeout in tests, fix exit from integration @mobuchowski (0.6.0)
                • The former was a particular issue with the Great Expectations integration
              • Reduce logging level for import errors to info @rossturk (0.6.0)
                • Airflow users were seeing warnings about missing packages if they weren't using a part of an integration
                • This fix reduced the level to Info
              • Remove AWS secret keys and extraneous Snowflake parameters from connection URI @collado-mike (0.6.0)
                • Parses Snowflake connection URIs to exclude some parameters that broke lineage or posed security concerns (e.g., login data)
                • Some keys are Snowflake-specific, but more can be added from other data sources
              • Convert to LifecycleStateChangeDatasetFacet @pawel-big-lebowski (0.6.0)
                • Mandates the LifecycleStateChange facet from the global spec rather than the custom tableStateChange facet used in the past
              • Catch possible failures when emitting events and log them @mobuchowski (0.6.1)
                • Previously when an OL event failed to emit, this could break an integration
                • This fix catches possible failures and logs them
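The connection-URI scrubbing described above (removing AWS secret keys and problematic Snowflake parameters) can be sketched with the standard library. This is an illustration, not the integration's actual code; the parameter names in SENSITIVE are examples of what might be dropped, not the exact list.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative sketch of the connection-URI scrubbing described above.
# The real fix lives in the integration code; the parameter names below
# are examples, not the exact list that was removed.
SENSITIVE = {"aws_access_key_id", "aws_secret_access_key", "password"}

def scrub_uri(uri):
    parts = urlsplit(uri)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k.lower() not in SENSITIVE]
    netloc = parts.hostname or ""
    if parts.port:
        netloc += f":{parts.port}"  # rebuilding netloc drops any user:pass@ prefix
    return urlunsplit((parts.scheme, netloc, parts.path, urlencode(kept), ""))

print(scrub_uri("snowflake://user:pw@acct/db?warehouse=wh&password=secret"))
```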
          • Process for blog posts [Ross]
            • Moving the process to Github Issues
            • Follow release tracker there

            • Go to https://github.com/OpenLineage/website/tree/main/contents/blog to create posts

            • No one will have a monopoly

            • Proposals for blog posts also welcome and we can support your efforts with outlines, feedback

            • Throw your ideas on the issue tracker on Github

          • Retrospective: Spark integration [Willy et al.]
            • Willy: originally this part of Marquez – the inspiration behind OL

              • OL was prototyped in Marquez with a few integrations, one of which was Spark (other: Airflow)

              • Donated the integration to OL

            • Srikanth: #559 very helpful to Azure

            • Pawel: is anything missing from the Spark integration? E.g., column-level lineage?

            • Will: yes to column-level; also, delta tables are an issue due to complexity; Spark 3.2 support also welcome

            • Maciej: should be more active about tracking projects we have integrations with; add to test matrix 

            • Julien: let’s open some issues to address these

          • Open Discussion
            • Flink updates? [Julien]
              • Maciej: initial exploration is done

                • challenge: Flink has 4 APIs

                • prioritizing Kafka lineage currently because most jobs are writing to/from Kafka

                • track this on Github milestones, contribute, ask questions there

              • Will: can you share thoughts on the data model? How would this show up in MZ? How often are you emitting lineage? 

              • Maciej: trying to model entire Flink run as one event

              • Srikanth: proposed two separate streams, one for data updates and one for metadata

              • Julien: do we have an issue on this topic in the repo?

              • Michael C.: only a general proposal doc, not one on the overall strategy; this worth a proposal doc

              • Julien: see notes for ticket number; MC will create the ticket

              • Srikanth: we can collaborate offline

          ...

          • OpenLineage recent release overview (0.5.1) [Julien]
          • TaskInstanceListener now official way to integrate with Airflow [Julien]
          • Apache Flink integration [Julien]
          • Dagster integration demo [Dalin]
          • Open Discussion

          Meeting:

          Slides

          Video: http://youtube.com/watch?v=cIrXmC0zHLg

          Notes:

          • OpenLineage recent release overview (0.5.1) [Julien]
            • No 0.5.0 due to bug
            • Support for dbt-spark adapter
            • New backend to proxy OL events
            • Support for custom facets
          • TaskInstanceListener now official way to integrate with Airflow [Julien]
            • Integration runs on worker side
            • Will be in the next Airflow release (2.3)
            • Thanks to Maciej for his work on this
          • Apache Flink integration [Julien]
            • Ticket for discussion available
            • Integration test setup
            • Early stages
          • Dagster integration demo [Dalin]
            • Initiated by Dalin Kim
            • OL used with Dagster on orchestration layer
            • Utilizes Dagster sensor
            • Introduces OL sensor that can be added to Dagster repo definition
            • Uses cursor to keep track of ID
            • Looking for feedback after review complete
            • Discussion:
              • Dalin: needed: way to interpret Dagster asset for OL
              • Julien: common code from Great Expectations/Dagster integrations
              • Michael C: do you pass parent run ID in child job when sending the job to MZ?
              • Hierarchy can be extended indefinitely – parent/child relationship can be modeled
              • Maciej: the sensor kept failing – does this mean the events persisted despite being down?
              • Dalin: yes - the sensor’s cursor is tracked, so even if repo goes down it should be able to pick up from last cursor
              • Dalin: hoping for more feedback
              • Julien: slides will be posted on slack channel, also tickets
          • Open discussion
            • Will: how is OL ensuring consistency of datasets across integrations? 
            • Julien: (jokingly) Read the docs! Naming conventions for datasets can be found there
            • Julien: need for tutorial on creating integrations
            • Srikanth: have done some of this work in Atlas
            • Kevin: are there libraries on the horizon to play this role? (Julien: yes)
            • Srikanth: it would be good to have model spec to provide enforceable standard
            • Julien: agreed; currently models are based on the JSON schema spec
            • Julien: contributions welcome; opening a ticket about this makes sense
            • Will: Flink integration: MZ focused on batch jobs
            • Julien: we want to figure out whether we need to add checkpointing
            • Julien: there will be discussion in OLMZ communities about this
              • In MZ, there are questions about what counts as a version or not
            • Julien: a consistent model is needed
            • Julien: one solution being looked into is Arrow
            • Julien: everyone should feel welcome to propose agenda items (even old projects)
            • Srikanth: who are you working with on the Flink comms side? Will get back to you.

          ...

          ...

          Proposal to convert licenses to SPDX [Michael]: no objections

          2021

          Dec 8th 2021 (9am PT)

          Attendees:

          ...

          • Attendees: 
            • TSC:
              • Mandy Chessell: Egeria Lead. Integrating OpenLineage in Egeria

              • Michael Collado: Datakin, OpenLineage

              • Maciej Obuchowski: GetInData. OpenLineage integrations
              • Willy Lulciuc: Marquez co-creator.
              • Ryan Blue: Tabular, Iceberg. Interested in collecting lineage across Iceberg users with OpenLineage
            • And:
              • Venkatesh Tadinada: BMC workflow automation looking to integrate with Marquez
              • Minkyu Park: Datakin. learning about OpenLineage
              • Arthur Wiedmer: Apple, lineage for Siri and AI ML. Interested in implementing Marquez and OpenLineage
          • Meeting recording:

          Video: http://youtube.com/watch?v=Gk0CwFYm9i4

          • Meeting notes:
            • agenda: 
              • Update on OpenLineage latest release (0.2.1)

                • dbt integration demo

              • OpenLineage 0.3 scope discussion

                • Facet versioning mechanism (Issue #153)

                • OpenLineage Proxy Backend (Issue #152)

                • OpenLineage implementer test data and validation

                • Kafka client

              • Roadmap

                • Iceberg integration
              • Open discussion

            • Slides 

            • Discussions:
              • Added a discussion of Iceberg requirements for OpenLineage to the agenda.

            • Demo of dbt:

              • really easy to try

              • when running from airflow, we can use the wrapper 'dbt-ol run' instead of 'dbt run'

            • Presentation of Proxy Backend design:

              • summary of discussions in Egeria
                • Egeria is less interested in instances (runs) and will keep track of OpenLineage events separately as Operational lineage

                • Two ways to use Egeria with OpenLineage

                  • receives HTTP events and forwards to Kafka

                  • A consumer receives the Kafka events in Egeria

              • Proxy Backend in OpenLineage:

                • direct HTTP endpoint implementation in Egeria

              • Depending on the user they might pick one or the other and we'll document

            • Use a direct OpenLineage endpoint (like Marquez)

              • Deploy the Proxy Backend to write to a queue (ex: Kafka)
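The Proxy Backend idea above, accepting an OpenLineage event over HTTP and forwarding it to a queue, can be sketched minimally. In this sketch an in-memory Queue stands in for Kafka and the HTTP layer is reduced to a single handler function; it is an illustration of the design, not the actual proxy.

```python
import json
from queue import Queue

# Sketch of the Proxy Backend design above: receive an OpenLineage event
# over HTTP, forward it to a queue. An in-memory Queue stands in for
# Kafka here, and the HTTP layer is reduced to one handler function.
events = Queue()

def handle_post(body: bytes) -> int:
    """Validate minimally, enqueue, and return an HTTP status code."""
    try:
        event = json.loads(body)
    except json.JSONDecodeError:
        return 400  # reject malformed payloads
    events.put(event)  # in production: produce to a Kafka topic
    return 200

status = handle_post(b'{"eventType": "START", "job": {"name": "demo"}}')
print(status, events.qsize())
```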

              • Follow up items:

          ...

          Aug 11th 2021

          • Attendees: 
            • TSC:
              • Ryan Blue

              • Maciej Obuchowski

              • Michael Collado

              • Daniel Henneberger

              • Willy Lulciuc

              • Mandy Chessell

              • Julien Le Dem

            • And:
              • Peter Hicks

              • Minkyu Park

              • Daniel Avancini

          • Meeting recording:

          Video: http://youtube.com/watch?v=bbAwz-rzo3I

          ...

          • Attendees: 
            • TSC:
              • Julien Le Dem
              • Mandy Chessell
              • Michael Collado
              • Willy Lulciuc
          • Meeting recording:

          Video: http://youtube.com/watch?v=kYzFYrzSpzg

          • Meeting notes
            • Agenda:
            • Notes: 

              Mission statement:

              Spec versioning mechanism:

              • The goal is to commit to compatible changes once 0.1 is published

              • We need a follow-up to separate core facet versioning
                • => TODO: create a separate GitHub ticket.
              • The lineage event should have a field that identifies what version of the spec it was produced with

                • => TODO: create a github issue for this

              • TODO: Add issue to document version number semantics (SCHEMAVER)
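The versioning idea above, an event field identifying the spec version it was produced with, could look like the sketch below. The schemaURL field and URL format shown here are illustrative of how the idea might materialize, not a decision from this meeting.

```python
# Sketch of the versioning idea above: each event carries a field
# (here called schemaURL) identifying the spec version it was produced
# with. The URL format and values are illustrative placeholders.
event = {
    "eventType": "COMPLETE",
    "eventTime": "2021-07-01T00:00:00Z",
    "producer": "https://example.com/my-producer",
    "schemaURL": "https://openlineage.io/spec/1-0-0/OpenLineage.json",
    "job": {"namespace": "example", "name": "demo"},
    "run": {"runId": "00000000-0000-0000-0000-000000000000"},
}

# A consumer can recover the spec version from the URL path
spec_version = event["schemaURL"].rsplit("/", 2)[1]
print(spec_version)
```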

              Extend Event State notion:

              OpenLineage 0.1:

              • finalize a few spec details for 0.1 : a few items left to discuss.

                • In particular job naming

                • parent job model

              • Importing Marquez integrations in OpenLineage

              Open Discussion:

              • connecting the consumer and producer

                • TODO: ticket to track distribution mechanism

                • options:

                  • Would we need a consumption client to make it easy for consumers to get events from Kafka for example?

                  • OpenLineage provides client libraries to serialize/deserialize events as well as sending them.

                • We can have documentation on how to send to backends that are not Marquez using HTTP and existing gateway mechanism to queues.

                • Do we have a mutual third party or the client know where to send?

              • Source code location finalization

              • job naming convention

                • you don't always have a nested execution

                  • can call a parent

                • parent job

                • You can have a job calling another one.

                • always distinguish a job and its run

              • need a separate notion for job dependencies

              • need to capture event driven: TODO: create ticket.


              TODO(Julien): update job naming ticket to have the discussion.

          ...

          • Attendees: 
            • TSC:
              • Julien Le Dem: Marquez, Datakin
              • Drew Banin: dbt, CPO at Fishtown Analytics
              • Maciej Obuchowski: Marquez, GetInData consulting company
              • Zhamak Dehghani: Data Mesh; an open protocol of observability for the data ecosystem is a big piece of Data Mesh
              • Daniel Henneberger: building a database, interested in lineage
              • Mandy Chessell: Lead of Egeria, metadata exchange; lineage is a natural extension
              • Willy Lulciuc: co-creator of Marquez
              • Michael Collado: Datakin, OpenLineage end-to-end holistic approach
            • And:
              • Kedar Rajwade: consulting on distributed systems
              • Barr Yaron: dbt, PM at Fishtown Analytics on metadata
              • Victor Shafran: co-founder at databand.ai, a pipeline monitoring company; lineage is a common issue
            • Excused: Ryan Blue, James Campbell
          • Meeting recording:

          Video: http://youtube.com/watch?v=er2GDyQtm5M

          ...