The OpenLineage Technical Steering Committee meetings are monthly on the third Wednesday from 9:30am to 10:30am US Pacific. Here's the meeting info.

All are welcome.

Table of Contents

Next meeting: February 19th, 2025 (9:30am PT)

January 15th, 2025 (9:30am PT)

Attendees:

TSC:
- Julien Le Dem, Datadog, OpenLineage Project Lead
- Michael Robinson, OpenLineage Community
- Maciej Obuchowski, Software Engineer, GetInData
- Sheeri Cabral, Product Manager, Capital One Software
  
And:
- Dan Rolles, Founder/CEO, BearingNode
- Leo Godin, Data Engineer, New Relic
Notes:
  • Recent Releases 
  • Presentations
    • Data and Information Observability - Dan Rolles
      • BCBS239 - Only 2 out of 31 banks fully comply with BCBS239, even though the standard is 10 years old. It concerns risk management.
      • Dan presents a Data & Information Observability Framework (slide screenshot forthcoming)
        • Tried not to duplicate capabilities - e.g. Risk Management and Compliance are covered by Data Governance
      • Discussion points - for a working group
        • Standardizing Financial Data Lineage Events
        • Unstructured Data and LLM Pipeline Observability
        • Value-Aligned Dataset Consumption Patterns
    • OpenLineage in Airflow 3
      • Airflow 3 is rewriting its architecture and eliminating the direct connection between workers and the Airflow ?; an API will be used instead
      • In Airflow 2, users could manually mark tasks/DAG runs as successful or failed, but this was not emitted with other OpenLineage information. This will be fixed in Airflow 3
      • Future features:
  • Open Discussion
    • GitHub releases are up to date, but the release notes in the documentation are not updated automatically.
    • Tagging - on a per-integration basis. Key/value pairs. Discussion of olin vs. ol. Leo will put a proposal in for dbt tags.
  
Meeting:
video links (forthcoming)

2024

December 18th, 2024 (9:30am PT)

November 20th, 2024 (9:30am PT)

October 16th, 2024 (9:30am PT)

September 18th, 2024 (9:30am PT)

August 14th, 2024 (9:30am PT)

Attendees:

TSC:
- Michael Robinson, Astronomer
- Maciej Obuchowski, Software Engineer, GetInData
- Tomasz Nazarewicz, Software Engineer, GetInData
- Sheeri Cabral, Product Manager, Collibra
  
And:
- Dan Rolles, Founder/CEO, BearingNode
- Jakub Moravec, Product Manager, IBM MANTA
- Leo Gomez, Lead SA Datazone, AWS
- Mohit, Sr. Software Engineer, AWS
- David Goss, Software Engineer, Matillion
- Priya Tiruthani, Product Manager, AWS
- Abel S., Software Engineer, AWS
- Rahul Maden, Atlan Software Engineer
- Erik Veleker, Atlan
- Chris, Software Engineer, Matillion
Notes:
  • Announcements
    • Meetup - San Francisco, Sept 12th, during Airflow Summit (link to meetup)
    • New committers - Jens Pfau (Google), Sheeri Cabral (Collibra)
    • New integrations - Amazon DataZone, Trino
  • Recent Releases
  • AWS DataZone Integration Update - Priya
    • OpenLineage consumer - specifically AWS Glue on Redshift
  • Implementation of compliance/acceptance tests - Tomasz
    • Framework for consumers and producers to make their OpenLineage compatibility public. LINK TO GITHUB
  • Discussion Items
    • Proposal: deprecate support for Spark 2.4 - Maciej
      • Does anyone have use cases? Let us know in Slack.
  • Open Discussion
  
Meeting:
Slides and video links (forthcoming)

July 10th, 2024 (9:30am PT)

Attendees:

TSC:
- Michael Robinson, Astronomer
- Maciej Obuchowski, Software Engineer, GetInData
- Julien Le Dem, Project Lead, Datadog
- Minkyu Park, Software Engineer, Moloco
- Harel Shein, Engineering Manager, Datadog
  
And:
- Mark Soule, Principal Engineer, Improving
- Jens Pfau, Engineering Manager, Google
- Jakub Moravec, Product Manager, IBM MANTA
- Sheeri Cabral, Product Manager, Collibra
- Erik Veleker, Atlan
- Ellen Zhao, Product, Alteryx
- Mohan, Data Engineer, Skechers
Agenda:
- Announcements
- Recent Releases 
- OpenLineage 1.17.1
- Discussion Items
- Certification Process Proposal
- Open Discussion
  
Notes:
- Announcements - Michael
  - Rahul Maden will present "Ensuring Data Quality Using Contracts and Lineage" at the Fifth Elephant Conference
- Recent Releases - Michael
  - Includes fix to ColumnLineage in Spark
  - Includes facet registry - Thanks to Harel @harels for implementing it and Natalia @ngorchakova for registering the first custom facet - GcpCommonJobFacet
  - Spark experience update, and CLI verifier
  - New extractors
- Certification process proposal - Sheeri
  - Purpose: to be able to see producer and consumer compatibility - with the current or future spec, or with each other.
  - Link to GitHub issue #2163
  - Discussion of the certification process
  
Meeting:
Slides and video links (forthcoming)

June 12th, 2024 (9:30am PT)

Attendees:
TSC:
- Maciej Obuchowski, Software Engineer, GetInData
- Minkyu Park, Software Engineer, Moloco
- Harel Shein, Director of Engineering, Datadog
  
And:
- Sophie Ly, Data Engineer, Decathlon
- Sheeri Cabral, Product Manager, Collibra
- Mark Soule, Principal Engineer, Improving
- Jakub Moravec, Product Manager, IBM MANTA
- Abdallah T., Data Engineer, Decathlon
  
Agenda:
- Announcements
- Recent Releases - Harel
- Dataset Namespace Resolver - Maciej
- Discussion Items
  - Spark 4.0 upcoming
- Open Discussion
  
Notes:
  • Announcements:
  • Recent Releases - Harel
  • Dataset Namespace Resolver
    • Naming convention for datasets - when hostname is used for uniquely identifying a dataset, redundant servers can be an issue.
    • Resolves several namespaces into the same dataset - e.g. kafka1, kafka2, kafka3.
    • Discussed the use case where there are several names for the same dataset in different technologies (e.g. dbt, Athena, Spark and Databricks all list different namespaces for the same dataset). The tool may or may not know what the underlying data location is (e.g. S3 bucket).
      • The Dataset Namespace Resolver will work in this case
      • Discussed other features to solve this, e.g. a facet for physical location. No decision was made.
  • Airflow Integration Updates - Maciej
    • See slides for what has been done and what's coming.
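
The idea behind the Dataset Namespace Resolver noted above (several redundant hosts resolving to one dataset namespace) can be illustrated with a minimal sketch. This is not the actual OpenLineage implementation or its configuration format; the class name `HostListResolver`, the logical name `kafka-cluster-prod`, and the port are assumptions made for illustration only.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class HostListResolver:
    """Hypothetical resolver: maps any of several physical hosts to one logical name."""

    resolved_name: str   # assumed logical name, e.g. "kafka-cluster-prod"
    hosts: frozenset     # physical host names that belong to the same cluster

    def resolve(self, namespace: str) -> str:
        # Split "scheme://host[:port]" into parts; leave any other shape untouched.
        match = re.fullmatch(
            r"(?P<scheme>[a-z][a-z0-9+.-]*)://(?P<host>[^:/]+)(?::\d+)?", namespace
        )
        if match is None or match["host"] not in self.hosts:
            return namespace
        return f"{match['scheme']}://{self.resolved_name}"


resolver = HostListResolver("kafka-cluster-prod", frozenset({"kafka1", "kafka2", "kafka3"}))

# All three brokers collapse into a single namespace.
resolved = {resolver.resolve(f"kafka://{host}:9092") for host in ("kafka1", "kafka2", "kafka3")}

# A namespace on an unknown host is passed through unchanged.
untouched = resolver.resolve("postgres://db1:5432")
```

A host-list lookup like this only covers the redundant-server case. It does not address the second case from the discussion, where different tools use different names for the same underlying location.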
  
Meeting:
Slides (forthcoming)
Video: http://youtube.com/watch?v=SQ43PnhzuhU

May 8, 2024 (9:30am PT)

Attendees:

...

Slides (link forthcoming)

Video: http://youtube.com/watch?v=5KVwtjOMhXk

April 10, 2024 (9:30am PT)

...

  • Announcements
  • Recent release 1.9.1 highlights
  • Scala 2.13 support in Spark overview by @Damien Hawes
  • Circuit breaker in Spark & Flink, built-in lineage in Spark @Paweł Leszczyński
  • Discussion items
  • Open discussion

Video: http://youtube.com/watch?v=5KVwtjOMhXk

February 8, 2024 (10am PT)

...

Integration matrix
    - Jens suggests expanding on the integration matrix and mentions issues with Iceberg support in Spark.
    - Eric reflects on Jens' suggestion.
    - Michael Robinson thanks Jens for the input.

2023

December 14, 2023 (10am PT)

...

  • TSC:
    • Mike Collado, Staff Software Engineer, Astronomer
    • Julien Le Dem, OpenLineage Project lead
    • Willy Lulciuc, Co-creator of Marquez
    • Michael Robinson, Software Engineer, Dev. Rel., Astronomer
    • Maciej Obuchowski, Software Engineer, GetInData, OpenLineage contributor
    • Mandy Chessell, Egeria Project Lead
    • Daniel Henneberger, Database engineer
    • Will Johnson, Senior Cloud Solution Architect, Azure Cloud, Microsoft
    • Jakub "Kuba" Dardziński, Software Engineer, GetInData, OpenLineage contributor
  • And:
    • Petr Hajek, Information Management Professional, Profinit
    • Harel Shein, Director of Engineering, Astronomer
    • Minkyu Park, Senior Software Engineer, Astronomer
    • Sam Holmberg, Software Engineer, Astronomer
    • Ernie Ostic, SVP of Product, MANTA
    • Sheeri Cabral, Technical Product Manager, Lineage, Collibra
    • John Thomas, Software Engineer, Dev. Rel., Astronomer
    • Bramha Aelem, BigData/Cloud/ML and AI Architect, Tiger Analytics

...

  • Announcements
    • OpenLineage earned Incubation status with the LFAI & Data Foundation at their December TAC meeting!
      • Represents our maturation in terms of governance, code quality assurance practices, documentation, more
      • Required earning the OpenSSF Silver Badge, sponsorship, at least 300 GitHub stars
      • Next up: Graduation (expected in early summer)
  • Recent release 0.19.2 [Michael R.]
  • Column-level lineage update [Maciej]
    • What is the OpenLineage SQL parser?
      • At its core, it’s a Rust library that parses SQL statements and extracts lineage data from them
      • 80/20 solution - we’ll not be able to parse all possible SQL statements - each database has custom extensions and different syntax, so we focus on standard SQL.
      • Good example of complicated extension: Snowflake COPY INTO https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html
      • We primarily use the parser in Airflow integration and Great Expectations integration
      • Why? Airflow does not “understand” a lot of what some operators do, for example PostgresOperator
      • We also have Java support package for parser   
    • What changed previously?
      • Parser in current release can emit column-level lineage!
      • At the last OL meeting, Piotr Wojtczak, the primary author of this change, presented the new parser core that enabled this functionality
        https://www.youtube.com/watch?v=Lv_bODeAVYQ
      • Still, the fact that Rust code can do that does not mean we have it for free everywhere
    • What has changed recently?
      • We wrote “glue code” that allows us to use new parser constructs in Airflow integration
      • Error handling just got way easier: the SQL parser can “partially” parse a SQL construct and report the errors it encountered, along with the particular statements that caused them.
    • Usage
      • Airflow integration extractors based on SqlExtractor (e.g. PostgresExtractor, SnowflakeExtractor, TrinoExtractor…) are now able to extract column-level lineage
      • Near future: Spark will be able to extract lineage from JDBCRelation.
  • Recent improvements to the Airflow integration [Kuba]
    • OpenLineage facets
      • Facets are pieces of metadata that can be attached to the core entities: run, job or dataset
      • Facets provide context to OpenLineage events
      • They can be defined as either part of the OpenLineage spec or custom facets
    • Airflow generic facet
      • Previously multiple custom facets with no standard
        • AirflowVersionRunFacet as an example of rapidly growing facet with version unrelated information
      • Introduced AirflowRunFacet with Task, DAG, TaskInstance and DagRun properties
      • Old facets are going to be deprecated soon. Currently both old and new facets are emitted
        • AirflowRunArgsRunFacet, AirflowVersionRunFacet, AirflowMappedTaskRunFacet will be removed
        • All information from above is moved to AirflowRunFacet
    • Other improvements (added in 0.19.2)
      • SQL extractors now send column-level lineage metadata
      • Further facets standardization

        • Introduced ProcessingEngineRunFacet
          • provides processing engine information, e.g. Airflow or Spark version
        • Improved support for nominal start & end times
          • makes use of data interval (introduced in Airflow 2.x)
          • nominal end time now matches next schedule time
        • DAG owner added to OwnershipJobFacet
        • Added support for S3FileTransformOperator and TrinoOperator (@sekikn’s great contribution)
  • Discussion: what does it mean to implement the spec? [Sheeri]
    • What does it mean to meet the spec?
      • 100% compliance is not required
      • OL ecosystem page
        • doesn't say what exactly it does
        • operational lineage not well defined
        • what does a payload look like? hard to find this info
      • Compatibility between producers/consumers is unclear
    • Important if standard is to be adopted widely [Mandy]
      • Egeria: uses compliance test with reports and badging; clarifies compatibility
      • test and test cases available in the Egeria repo, including profiles and clear rules about compliant ways to support Egeria
      • a badly behaving producer or consumer will create problems
      • have to be able to trust what you get
    • What about consumers? [Mike C.]
      • can we determine if they have done the correct thing with facets? [John]
      • what do we call "compliant"?
      • custom facets shouldn't be subject to this – they are by definition custom (and private) [Maciej]
      • only complete events (not start events) should be required – start events not desired outside of operational use cases [Maciej]
    • There's a simple baseline on the one hand and facets on the other [Julien]
    • Note: perfection isn't the goal
      • instead: shared test cases, data such as sample schema that can be tested against
    • Marquez doesn't explain which facets it's using or how [Willy]
      • communication by consumers could be better
    • Effort at documenting this: matrix [Julien]
    • How would we define failing tests? [Maciej]
      • at a minimum we could have a validation mode [Julien]
      • challenge: the spec is always moving, growing [Maciej]
      • ex: in the case of JSON schema validation, facets are versioned individually but there's a reference schema that is versioned that might not be the current schema. Facets can be dereferenced, but the right way to do this is not clear [Danny]
      • one solution could be to split out base times, or we could add a tool that would force us to clean this up
      • client-side proxy presents same problem; tried different validators in Go; a workaround is to validate against the main doc first; by continually validating against the client proxy we can make sure it stays compliant with the spec [Minkyu]
      • if Marquez says it's "OK," it's OK; we've been doing it manually [Mandy]
      • Marquez doesn't do any validation for consumers [Mike C.]
      • manual validation is not good enough [Mandy]
      • I like the idea of compliance badges – it would be cool if we had a way to validate consumers and there were a way to prove this and if we could extend validation to integrations like the Airflow integration [Mike C.]
    • Let's follow up on Slack and use the notes from this discussion to collaborate on a proposal [Julien]
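
To make the column-level lineage update above more concrete, here is a rough sketch of the kind of metadata a SQL extractor could attach to an output dataset. It assumes the facet is keyed `columnLineage` and lists `inputFields` per output column. The producer and schema URLs are placeholders, the table names are invented, and the helper `column_lineage_facet` is hypothetical rather than part of any OpenLineage client.

```python
import json

PRODUCER = "https://example.com/hypothetical-producer"  # placeholder producer URI
FACET_SCHEMA = "https://example.com/schemas/ColumnLineageDatasetFacet.json"  # placeholder URL


def column_lineage_facet(mapping):
    """Build a facet dict from {output_column: [(namespace, dataset, column), ...]}."""
    return {
        "_producer": PRODUCER,
        "_schemaURL": FACET_SCHEMA,
        "fields": {
            out_col: {
                "inputFields": [
                    {"namespace": ns, "name": ds, "field": col} for ns, ds, col in sources
                ]
            }
            for out_col, sources in mapping.items()
        },
    }


# Invented example statement:
#   INSERT INTO daily_totals
#   SELECT order_date, SUM(amount) AS total FROM orders GROUP BY order_date
facet = column_lineage_facet({
    "order_date": [("postgres://db", "public.orders", "order_date")],
    "total": [("postgres://db", "public.orders", "amount")],
})

output_dataset = {
    "namespace": "postgres://db",
    "name": "public.daily_totals",
    "facets": {"columnLineage": facet},
}

payload = json.dumps(output_dataset, indent=2)  # JSON fragment that would sit under "outputs"
```

In the integration itself, the mapping would come from the Rust parser rather than being written by hand.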

2022

December 8, 2022 (10am PT)

...

  • Release 0.9.0 [Michael R.]
    • We added:
    • For the bug fixes and more information, see the GitHub repo.
    • Shout out to new contributor Jakub Dardziński, who contributed a bug fix to this release!
  • Snowflake Blog Post [Ross]
    • topic: a new integration between OL and Snowflake
    • integration is the first OL extractor to process query logs
    • design:
      • an Airflow pipeline processes queries against Snowflake
      • separate job: pulls access history and assembles lineage metadata
      • two angles: Airflow sees it, Snowflake records it
    • the meat of the integration: a view that does untold SQL madness to emit JSON to send to OL
    • result: you can study the transformation by asking Snowflake AND Airflow about it
    • required: having access history enabled in your Snowflake account (which requires special access level)
    • Q & A
      • Howard: is the access history task part of the DAG?
      • Ross: yes, there's a separate DAG that pulls the view and emits the events
      • Howard: what's the scope of the metadata?
      • Ross: the account level
      • Michael C: in Airflow integration, there's a parent/child relationship; is this captured?
      • Ross: there are 2 jobs/runs, and there's work ongoing to emit metadata from Airflow (task name)
  • Great Expectations integration [Michael C.]
    • validation actions in GE execute after validation code does
    • metadata extracted from these and transformed into facets
    • recent update: the integration now supports version 3 of the GE API
    • some configuration ongoing: currently you need to set up validation actions in GE
    • Q & A
      • Willy: is the metadata emitted as facets?
      • Michael C.: yes, two
  • dbt integration [Willy]
    • a demo on getting started with the OL-dbt library
      • pip install the integration library and dbt
      • configure the dbt profile
      • run seed command and run command in dbt
      • the integration extracts metadata from the different views
      • in Marquez, the UI displays the input/output datasets, job history, and the SQL
  • Open discussion
    • Howard: what is the process for becoming a committer?
      • Maciej: nomination by a committer then a vote
      • Sheeri: is coding beforehand recommended?
      • Maciej: contribution to the project is expected
      • Willy: no timeline on the process, but we are going to try to hold a regular vote
      • Ross: project documentation covers this but is incomplete
      • Michael C.: is this process defined by the LFAI?
    • Ross: contributions to the website, workshops are welcome!
    • Michael R.: we're in the process of moving the meeting recordings to our YouTube channel

May 19th, 2022 (10am PT)

Agenda:

...

  • TSC:
    • Mike Collado: Staff Software Engineer, Datakin
    • Maciej Obuchowski: Software Engineer, GetInData, OpenLineage contributor
    • Julien Le Dem: OpenLineage Project lead
    • Willy Lulciuc: Co-creator of Marquez
  • And:
    • Ernie Ostic: SVP of Product, Manta 
    • Sandeep Adwankar: Senior Technical Product Manager, AWS
    • Paweł Leszczyński, Software Engineer, GetInData
    • Howard Yoo: Staff Product Manager, Astronomer
    • Michael Robinson: Developer Relations Engineer, Astronomer
    • Ross Turk: Senior Director of Community, Astronomer
    • Minkyu Park: Senior Software Engineer, Astronomer
    • Will Johnson: Senior Cloud Solution Architect, Azure Cloud, Microsoft

Meeting:

Video: http://youtube.com/watch?v=X0ZwMotUARA

Notes:

  • Releases
  • Communication reminders [Julien]
  • Agenda [Julien]
  • Column-level lineage [Paweł]
    • Linked to 4 PRs, the first being a proposal
    • The second has been merged, but the core mechanism is turned off
    • 3 requirements:
      • Outputs labeled with expression IDs
      • Inputs with expression IDs
      • Dependencies
    • Once it is turned on, each OL event will receive a new JSON field
    • It would be great to be able to extend this API (currently on the roadmap)
    • Q & A
      • Will: handling user-defined functions: is the solution already generic enough?
        • The answer will depend on testing, but I suspect that the answer is yes
        • The team at Microsoft would be excited to learn that the solution will handle UDFs
      • Julien: the next challenge will be to ensure that all the integrations support column-level lineage
  • Open discussion
    • Willy: in Mqz we need to start handling col-level lineage, and has anyone thought about how this might work?
      • Julien: lineage endpoint for col-level lineage to layer on top of what already exists
      • Willy: this makes sense – we could use the method for input and output datasets as a model
      • Michael C.: I don't know that we need to add an endpoint – we could augment the existing one to do something with the data
      • Willy: how do we expect this to be visualized?
        • Julien: not quite sure
        • Michael C.: there are a number of different ways we could do this, including isolating relevant dataset fields 
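
The three requirements Paweł listed for column-level lineage (outputs labeled with expression IDs, inputs with expression IDs, and the dependencies between them) amount to a graph traversal. The sketch below is an illustration of that idea only, not the Spark integration's code; the function name, the dictionary shapes, and the sample plan are assumptions.

```python
from collections import deque


def resolve_column_lineage(outputs, inputs, dependencies):
    """Return {output_column: sorted list of (dataset, column) inputs it depends on}.

    outputs:      {output column name: expression ID}
    inputs:       {expression ID: (dataset name, column name)}
    dependencies: {expression ID: set of expression IDs it is computed from}
    """
    lineage = {}
    for column, expr_id in outputs.items():
        seen, queue, found = {expr_id}, deque([expr_id]), set()
        while queue:
            current = queue.popleft()
            if current in inputs:
                found.add(inputs[current])  # reached a column of an input dataset
            for parent in dependencies.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        lineage[column] = sorted(found)
    return lineage


# Hypothetical plan for: SELECT a + b AS total, a AS first FROM t
outputs = {"total": 10, "first": 1}
inputs = {1: ("t", "a"), 2: ("t", "b")}
dependencies = {10: {1, 2}}

lineage = resolve_column_lineage(outputs, inputs, dependencies)
```

Whether user-defined functions fit this model, as Will asked, depends on whether their arguments show up in the dependency map. The notes leave that to testing.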

...

  • 0.6.2 release overview [Michael R.]
  • Transports in OpenLineage clients [Maciej]
  • Airflow integration update [Maciej]
  • Dagster integration retrospective [Dalin]
  • Open discussion

Meeting info:

Video: http://youtube.com/watch?v=MciFCgrQaxk

Notes:

  • Introductions
  • Communication channels overview [Julien]
  • Agenda overview [Julien]
  • 0.6.2 release overview [Michael R.]

...

  • New committers [Julien]
    • 4 new committers were voted in last week
    • We had fallen behind
    • Congratulations to all
  • Release overview (0.6.0-0.6.1) [Michael R.]
    • Added
      • Extract source code of PythonOperator code similar to SQL facet @mobuchowski (0.6.0)
      • Airflow: extract source code from BashOperator @mobuchowski (0.6.0)
        • These first two additions are similar to SQL facet
        • Offer the ability to see top-level code
      • Add DatasetLifecycleStateDatasetFacet to spec @pawel-big-lebowski (0.6.0)
        • Captures when someone is conducting dataset operations (overwrite, create, etc.)
      • Add generic facet to collect environmental properties (EnvironmentFacet) @harishsune (0.6.0)
        • Collects environment variables
        • Depends on Databricks runtime but can be reused in other environments
      • OpenLineage sensor for OpenLineage-Dagster integration @dalinkim (0.6.0)
        • The first iteration of the Dagster integration to get lineage from Dagster
      • Java-client: make generator generate enums as well @pawel-big-lebowski (0.6.0)
        • Small addition to Java client feat. better types; was string
    • Fixed
      • Airflow: increase import timeout in tests, fix exit from integration @mobuchowski (0.6.0)
        • The former was a particular issue with the Great Expectations integration
      • Reduce logging level for import errors to info @rossturk (0.6.0)
        • Airflow users were seeing warnings about missing packages if they weren't using a part of an integration
        • This fix reduced the level to Info
      • Remove AWS secret keys and extraneous Snowflake parameters from connection URI @collado-mike (0.6.0)
        • Parses Snowflake connection URIs to exclude some parameters that broke lineage or posed security concerns (e.g., login data)
        • Some keys are Snowflake-specific, but more can be added from other data sources
      • Convert to LifecycleStateChangeDatasetFacet @pawel-big-lebowski (0.6.0)
        • Mandates the LifecycleStateChange facet from the global spec rather than the custom tableStateChange facet used in the past
      • Catch possible failures when emitting events and log them @mobuchowski (0.6.1)
        • Previously when an OL event failed to emit, this could break an integration
        • This fix catches possible failures and logs them
  • Process for blog posts [Ross]
    • Moving the process to GitHub Issues
    • Follow release tracker there

    • Go to https://github.com/OpenLineage/website/tree/main/contents/blog to create posts

    • No one will have a monopoly

    • Proposals for blog posts also welcome and we can support your efforts with outlines, feedback

    • Throw your ideas on the issue tracker on GitHub

  • Retrospective: Spark integration [Willy et al.]
    • Willy: originally this was part of Marquez – the inspiration behind OL

      • OL was prototyped in Marquez with a few integrations, one of which was Spark (other: Airflow)

      • Donated the integration to OL

    • Srikanth: #559 very helpful to Azure

    • Pawel: is anything missing from the Spark integration? E.g., column-level lineage?

    • Will: yes to column-level; also, delta tables are an issue due to complexity; Spark 3.2 support also welcome

    • Maciej: should be more active about tracking projects we have integrations with; add to test matrix 

    • Julien: let’s open some issues to address these

  • Open Discussion
    • Flink updates? [Julien]
      • Maciej: initial exploration is done

        • challenge: Flink has 4 APIs

        • prioritizing Kafka lineage currently because most jobs are writing to/from Kafka

        • track this on GitHub milestones, contribute, ask questions there

      • Will: can you share thoughts on the data model? How would this show up in MZ? How often are you emitting lineage? 

      • Maciej: trying to model entire Flink run as one event

      • Srikanth: proposed two separate streams, one for data updates and one for metadata

      • Julien: do we have an issue on this topic in the repo?

      • Michael C.: only a general proposal doc, not one on the overall strategy; this is worth a proposal doc

      • Julien: see notes for ticket number; MC will create the ticket

      • Srikanth: we can collaborate offline
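
One of the fixes in the release overview above removes AWS secret keys and extraneous Snowflake parameters from connection URIs. The following is a minimal sketch of that kind of sanitization using only the standard library. It is not the integration's actual code: the function name and the deny-list `SENSITIVE_PARAMS` are assumptions, and the real set of stripped parameters may differ.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical deny-list, chosen for illustration only.
SENSITIVE_PARAMS = {"aws_access_key_id", "aws_secret_access_key", "password", "token"}


def sanitize_connection_uri(uri: str) -> str:
    """Drop user credentials and sensitive query parameters from a connection URI."""
    parts = urlsplit(uri)
    # Rebuild the network location without "user:password@" (hostname is lowercased).
    host = parts.hostname or ""
    netloc = f"{host}:{parts.port}" if parts.port else host
    # Keep only query parameters that are not on the deny-list.
    query = urlencode(
        [
            (key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key.lower() not in SENSITIVE_PARAMS
        ]
    )
    return urlunsplit((parts.scheme, netloc, parts.path, query, ""))


clean = sanitize_connection_uri(
    "snowflake://user:secret@account.example/db/schema?warehouse=wh&aws_secret_access_key=abc"
)
```

An allow-list of known-safe parameters would be the stricter choice. The notes describe excluding specific parameters, so the sketch uses a deny-list.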

...

  • OpenLineage recent release overview (0.5.1) [Julien]
  • TaskInstanceListener now official way to integrate with Airflow [Julien]
  • Apache Flink integration [Julien]
  • Dagster integration demo [Dalin]
  • Open Discussion

Meeting:

Slides

Video: http://youtube.com/watch?v=cIrXmC0zHLg

Notes:

  • OpenLineage recent release overview (0.5.1) [Julien]
    • No 0.5.0 due to bug
    • Support for dbt-spark adapter
    • New backend to proxy OL events
    • Support for custom facets
  • TaskInstanceListener now official way to integrate with Airflow [Julien]
    • Integration runs on worker side
    • Will be in next OL release of airflow (2.3)
    • Thanks to Maciej for his work on this
  • Apache Flink integration [Julien]
    • Ticket for discussion available
    • Integration test setup
    • Early stages
  • Dagster integration demo [Dalin]
    • Initiated by Dalin Kim
    • OL used with Dagster on orchestration layer
    • Utilizes Dagster sensor
    • Introduces OL sensor that can be added to Dagster repo definition
    • Uses cursor to keep track of ID
    • Looking for feedback after review complete
    • Discussion:
      • Dalin: needed: way to interpret Dagster asset for OL
      • Julien: common code from Great Expectations/Dagster integrations
      • Michael C: do you pass parent run ID in child job when sending the job to MZ?
      • Hierarchy can be extended indefinitely – parent/child relationship can be modeled
      • Maciej: the sensor kept failing – does this mean the events persisted despite being down?
      • Dalin: yes - the sensor’s cursor is tracked, so even if repo goes down it should be able to pick up from last cursor
      • Dalin: hoping for more feedback
      • Julien: slides will be posted on slack channel, also tickets
  • Open discussion
    • Will: how is OL ensuring consistency of datasets across integrations? 
    • Julien: (jokingly) Read the docs! Naming conventions for datasets can be found there
    • Julien: need for tutorial on creating integrations
    • Srikanth: have done some of this work in Atlas
    • Kevin: are there libraries on the horizon to play this role? (Julien: yes)
    • Srikanth: it would be good to have model spec to provide enforceable standard
    • Julien: agreed; currently models are based on the JSON schema spec
    • Julien: contributions welcome; opening a ticket about this makes sense
    • Will: Flink integration: MZ focused on batch jobs
    • Julien: we want to make sure we need to add checkpointing
    • Julien: there will be discussion in OLMZ communities about this
      • In MZ, there are questions about what counts as a version or not
    • Julien: a consistent model is needed
    • Julien: one solution being looked into is Arrow
    • Julien: everyone should feel welcome to propose agenda items (even old projects)
    • Srikanth: who are you working with on the Flink comms side? Will get back to you.
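
Dalin's answer in the Dagster discussion above, that the sensor tracks a cursor and can resume from the last position after downtime, describes a common pattern. The sketch below illustrates it in plain Python without the Dagster API; the function `process_new_records`, the record shape, and the sample log are hypothetical.

```python
def process_new_records(records, cursor, emit):
    """Emit every record with an ID greater than `cursor` and return the new cursor.

    records: iterable of (record_id, payload) pairs in ascending ID order
    cursor:  ID of the last record handled in a previous tick (0 if none)
    emit:    callable that sends one payload to a lineage backend
    """
    for record_id, payload in records:
        if record_id <= cursor:
            continue  # already handled before the restart
        emit(payload)
        cursor = record_id  # advance only after the emit call returned
    return cursor


log = [(1, "run-a START"), (2, "run-a COMPLETE"), (3, "run-b START")]
sent = []

cursor = process_new_records(log[:2], 0, sent.append)   # first tick sees two records
cursor = process_new_records(log, cursor, sent.append)  # later tick, after downtime
```

Because the cursor advances only after `emit` returns, a crash during `emit` means the record is retried on the next tick. The trade-off is that a backend may occasionally receive a duplicate.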

...

...

Proposal to convert licenses to SPDX [Michael]: no objections

2021

Dec 8th 2021 (9am PT)

Attendees:

...

  • Attendees: 
    • TSC:
      • Mandy Chessell: Egeria Lead. Integrating OpenLineage in Egeria

      • Michael Collado: Datakin, OpenLineage

      • Maciej Obuchowski: GetInData. OpenLineage integrations
      • Willy Lulciuc: Marquez co-creator.
      • Ryan Blue: Tabular, Iceberg. Interested in collecting lineage across Iceberg users with OpenLineage
    • And:
      • Venkatesh Tadinada: BMC workflow automation looking to integrate with Marquez
      • Minkyu Park: Datakin. learning about OpenLineage
      • Arthur Wiedmer: Apple, lineage for Siri and AI ML. Interested in implementing Marquez and OpenLineage
  • Meeting recording:

http://youtube.com/watch?v=Gk0CwFYm9i4

  • Meeting notes:
    • agenda: 
      • Update on OpenLineage latest release (0.2.1)

        • dbt integration demo

      • OpenLineage 0.3 scope discussion

        • Facet versioning mechanism (Issue #153)

        • OpenLineage Proxy Backend (Issue #152)

        • OpenLineage implementer test data and validation

        • Kafka client

      • Roadmap

        • Iceberg integration
      • Open discussion

    • Slides 

    • Discussions:
      • added to the agenda a Discussion of Iceberg requirements for OpenLineage.

    • Demo of dbt:

      • really easy to try

      • when running from airflow, we can use the wrapper 'dbt-ol run' instead of 'dbt run'

    • Presentation of Proxy Backend design:

      • summary of discussions in Egeria
        • Egeria is less interested in instances (runs) and will keep track of OpenLineage events separately as Operational lineage

        • Two ways to use Egeria with OpenLineage

          • receives HTTP events and forwards to Kafka

          • A consumer receives the Kafka events in Egeria

      • Proxy Backend in OpenLineage:

        • direct HTTP endpoint implementation in Egeria

      • Depending on the user they might pick one or the other and we'll document

    • Use a direct OpenLineage endpoint (like Marquez)

      • Deploy the Proxy Backend to write to a queue (ex: Kafka)

      • Follow up items:

...

Aug 11th 2021

  • Attendees: 
    • TSC:
      • Ryan Blue

      • Maciej Obuchowski

      • Michael Collado

      • Daniel Henneberger

      • Willy Lulciuc

      • Mandy Chessell

      • Julien Le Dem

    • And:
      • Peter Hicks

      • Minkyu Park

      • Daniel Avancini

  • Meeting recording:

http://youtube.com/watch?v=bbAwz-rzo3I

...

  • Attendees: 
    • TSC:
      • Julien Le Dem
      • Mandy Chessell
      • Michael Collado
      • Willy Lulciuc
  • Meeting recording:

http://youtube.com/watch?v=kYzFYrzSpzg

  • Meeting notes
    • Agenda:
    • Notes: 

      Mission statement:

      Spec versioning mechanism:

      • The goal is to commit to compatible changes once 0.1 is published

      • We need a follow up to separate core facet versioning


      => TODO: create a separate github ticket.
      • The lineage event should have a field that identifies what version of the spec it was produced with

        • => TODO: create a github issue for this

      • TODO: Add issue to document version number semantics (SCHEMAVER)

      Extend Event State notion:

      OpenLineage 0.1:

      • finalize a few spec details for 0.1 : a few items left to discuss.

        • In particular job naming

        • parent job model

      • Importing Marquez integrations in OpenLineage

      Open Discussion:

      • connecting the consumer and producer

        • TODO: ticket to track distribution mechanism

        • options:

          • Would we need a consumption client to make it easy for consumers to get events from Kafka for example?

          • OpenLineage provides client libraries to serialize/deserialize events as well as sending them.

        • We can have documentation on how to send to backends that are not Marquez using HTTP and existing gateway mechanism to queues.

        • Do we have a mutual third party or the client know where to send?

      • Source code location finalization

      • job naming convention

        • you don't always have a nested execution

          • can call a parent

        • parent job

        • You can have a job calling another one.

        • always distinguish a job and its run

      • need a separate notion for job dependencies

      • need to capture event driven: TODO: create ticket.


      TODO(Julien): update job naming ticket to have the discussion.

...

  • Attendees: 
    • TSC:
      Julien Le Dem: Marquez, Datakin
      Drew Banin: dbt, CPO at Fishtown Analytics
      Maciej Obuchowski: Marquez, GetInData consulting company
      Zhamak Dehghani: Datamesh, Open protocol of observability for data ecosystem is a big piece of Datamesh
      Daniel Henneberger: building a database, interested in lineage
      Mandy Chessell: Lead of Egeria, metadata exchange. lineage is a great extension that volunteers lineage
      Willy Lulciuc: co-creator of Marquez
      Michael Collado: Datakin, OpenLineage end-to-end holistic approach.
    • And:
      Kedar Rajwade: consulting on distributed systems.
      Barr Yaron: dbt, PM at Fishtown Analytics on metadata.
      Victor Shafran: co-founder at databand.ai pipeline monitoring company. lineage is a common issue
    • Excused: Ryan Blue, James Campbell
  • Meeting recording:

http://youtube.com/watch?v=er2GDyQtm5M

...