The OpenLineage Technical Steering Committee meets monthly on the third Wednesday from 9:30am to 10:30am US Pacific. Here's the meeting info.

...

  • TSC:
    • Mike Collado, Staff Software Engineer, Astronomer
    • Julien Le Dem, OpenLineage Project lead
    • Willy Lulciuc, Co-creator of Marquez
    • Michael Robinson, Software Engineer, Dev. Rel., Astronomer
    • Maciej Obuchowski, Software Engineer, GetInData, OpenLineage contributor
    • Mandy Chessell, Egeria Project Lead
    • Daniel Henneberger, Database engineer
    • Will Johnson, Senior Cloud Solution Architect, Azure Cloud, Microsoft
    • Jakub "Kuba" Dardziński, Software Engineer, GetInData, OpenLineage contributor
  • And:
    • Petr Hajek, Information Management Professional, Profinit
    • Harel Shein, Director of Engineering, Astronomer
    • Minkyu Park, Senior Software Engineer, Astronomer
    • Sam Holmberg, Software Engineer, Astronomer
    • Ernie Ostic, SVP of Product, MANTA
    • Sheeri Cabral, Technical Product Manager, Lineage, Collibra
    • John Thomas, Software Engineer, Dev. Rel., Astronomer
    • Bramha Aelem, BigData/Cloud/ML and AI Architect, Tiger Analytics

...

  • Release 0.9.0 [Michael R.]
    • We added:
    • For bug fixes and more information, see the GitHub repo.
    • Shout out to new contributor Jakub Dardziński, who contributed a bug fix to this release!
  • Snowflake Blog Post [Ross]
    • topic: a new integration between OL and Snowflake
    • integration is the first OL extractor to process query logs
    • design:
      • an Airflow pipeline processes queries against Snowflake
      • separate job: pulls access history and assembles lineage metadata
      • two angles: Airflow sees it, Snowflake records it
    • the core of the integration: a view that does some fairly involved SQL to emit JSON to send to OL (see the sketch after this list)
    • result: you can study the transformation by asking both Snowflake AND Airflow about it
    • required: access history enabled in your Snowflake account (which needs a special access level)
    • Q & A
      • Howard: is the access history task part of the DAG?
      • Ross: yes, there's a separate DAG that pulls the view and emits the events
      • Howard: what's the scope of the metadata?
      • Ross: the account level
      • Michael C: in Airflow integration, there's a parent/child relationship; is this captured?
      • Ross: there are 2 jobs/runs, and there's work ongoing to emit metadata from Airflow (task name)
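A minimal sketch of the flow described above, assuming the openlineage-python client and a query-log row already reduced from Snowflake's ACCOUNT_USAGE.ACCESS_HISTORY view into plain lists of object names; the account namespace, URLs, and helper shape are placeholders, not the actual integration code:

```python
# Hypothetical sketch: turn one Snowflake access-history record into an OpenLineage event.
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

client = OpenLineageClient(url="http://localhost:5000")  # e.g., a Marquez instance


def emit_lineage_for_query(access_row: dict) -> None:
    """access_row: one ACCESS_HISTORY record, already flattened to object-name lists."""
    inputs = [Dataset(namespace="snowflake://my_account", name=obj)
              for obj in access_row["direct_objects_accessed"]]
    outputs = [Dataset(namespace="snowflake://my_account", name=obj)
               for obj in access_row["objects_modified"]]
    client.emit(RunEvent(
        eventType=RunState.COMPLETE,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=Run(runId=str(uuid4())),
        job=Job(namespace="snowflake", name=access_row["query_id"]),
        producer="https://example.com/snowflake-access-history-extractor",  # placeholder
        inputs=inputs,
        outputs=outputs,
    ))
```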
  • Great Expectations integration [Michael C.]
    • validation actions in GE execute after validation code does
    • metadata extracted from these and transformed into facets
    • recent update: the integration now supports version 3 of the GE API
    • some configuration is still required: currently you need to set up the validation actions in GE yourself (see the sketch after this list)
    • Q & A
      • Willy: is the metadata emitted as facets?
      • Michael C.: yes, two facets are emitted
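A minimal sketch of the validation-action setup mentioned above, expressed as an action_list entry for a V3-API checkpoint; the module path, class name, and parameters are assumptions and should be checked against the integration docs:

```python
# Hypothetical action_list entry that makes Great Expectations run the OpenLineage
# action after each validation; class/module names and parameters are assumptions.
openlineage_action = {
    "name": "openlineage",
    "action": {
        "class_name": "OpenLineageValidationAction",                      # assumed
        "module_name": "openlineage.common.provider.great_expectations",  # assumed
        "openlineage_host": "http://localhost:5000",  # where events are sent (placeholder)
        "job_name": "ge_validation",                  # placeholder job name
    },
}
# This dict is appended to the checkpoint's action_list so that, once the validation
# code finishes, the action extracts the results and emits them as OpenLineage facets.
```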
  • dbt integration [Willy]
    • a demo of getting started with the OL-dbt library (see the scripted sketch after this list)
      • pip install the integration library and dbt
      • configure the dbt profile
      • run seed command and run command in dbt
      • the integration extracts metadata from the different views
      • in Marquez, the UI displays the input/output datasets, job history, and the SQL
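A minimal scripted version of the demo steps above, assuming openlineage-dbt and dbt are already installed (pip install openlineage-dbt dbt); the backend URL and namespace are placeholders:

```python
# Sketch of the demo flow: run dbt with the OpenLineage wrapper so metadata lands in Marquez.
import os
import subprocess

env = {
    **os.environ,
    "OPENLINEAGE_URL": "http://localhost:5000",  # Marquez (or any OL-compatible backend)
    "OPENLINEAGE_NAMESPACE": "dbt_demo",         # namespace shown in the Marquez UI
}

subprocess.run(["dbt", "seed"], env=env, check=True)    # load seed data as usual
subprocess.run(["dbt-ol", "run"], env=env, check=True)  # wrapper emits OL events after the run
```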
  • Open discussion
    • Howard: what is the process for becoming a committer?
      • Maciej: nomination by a committer then a vote
      • Sheeri: is coding beforehand recommended?
      • Maciej: contribution to the project is expected
      • Willy: no timeline on the process, but we are going to try to hold a regular vote
      • Ross: project documentation covers this but is incomplete
      • Michael C.: is this process defined by the LFAI?
    • Ross: contributions to the website, workshops are welcome!
    • Michael R.: we're in the process of moving the meeting recordings to our YouTube channel

May 19th, 2022 (10am PT)

Agenda:

...

  • TSC:
    • Mike Collado: Staff Software Engineer, Datakin
    • Maciej Obuchowski: Software Engineer, GetInData, OpenLineage contributor
    • Julien Le Dem: OpenLineage Project lead
    • Willy Lulciuc: Co-creator of Marquez
  • And:
    • Ernie Ostic: SVP of Product, Manta 
    • Sandeep Adwankar: Senior Technical Product Manager, AWS
    • Paweł Leszczyński: Software Engineer, GetInData
    • Howard Yoo: Staff Product Manager, Astronomer
    • Michael Robinson: Developer Relations Engineer, Astronomer
    • Ross Turk: Senior Director of Community, Astronomer
    • Minkyu Park: Senior Software Engineer, Astronomer
    • Will Johnson: Senior Cloud Solution Architect, Azure Cloud, Microsoft

Meeting:

Recording: http://youtube.com/watch?v=X0ZwMotUARA

Notes:

  • Releases
  • Communication reminders [Julien]
  • Agenda [Julien]
  • Column-level lineage [Paweł]
    • Linked to 4 PRs, the first being a proposal
    • The second has been merged, but the core mechanism is turned off
    • 3 requirements:
      • Outputs labeled with expression IDs
      • Inputs with expression IDs
      • Dependencies
    • Once it is turned on, each OL event will receive a new JSON field (see the sketch after this list)
    • It would be great to be able to extend this API (currently on the roadmap)
    • Q & A
      • Will: handling user-defined functions: is the solution already generic enough?
        • The answer will depend on testing, but I suspect that the answer is yes
        • The team at Microsoft would be excited to learn that the solution will handle UDFs
      • Julien: the next challenge will be to ensure that all the integrations support column-level lineage
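A sketch of the new JSON field referenced above, shaped like the column-level lineage facet on an output dataset; treat the exact field names as illustrative of the proposal rather than final:

```python
# Illustrative shape of the column-level lineage facet: each output column maps to the
# input dataset fields it was derived from.
column_lineage_facet = {
    "columnLineage": {
        "fields": {
            "total_price": {  # output column
                "inputFields": [
                    {"namespace": "warehouse", "name": "sales.orders", "field": "price"},
                    {"namespace": "warehouse", "name": "sales.orders", "field": "quantity"},
                ],
            },
        },
    },
}
```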
  • Open discussion
    • Willy: in Marquez we need to start handling column-level lineage; has anyone thought about how this might work?
      • Julien: a lineage endpoint for column-level lineage could layer on top of what already exists
      • Willy: this makes sense – we could use the method for input and output datasets as a model
      • Michael C.: I don't know that we need to add an endpoint – we could augment the existing one to do something with the data
      • Willy: how do we expect this to be visualized?
        • Julien: not quite sure
        • Michael C.: there are a number of different ways we could do this, including isolating relevant dataset fields 

...

  • 0.6.2 release overview [Michael R.]
  • Transports in OpenLineage clients [Maciej]
  • Airflow integration update [Maciej]
  • Dagster integration retrospective [Dalin]
  • Open discussion

Meeting info:

Recording: http://youtube.com/watch?v=MciFCgrQaxk

Notes:

  • Introductions
  • Communication channels overview [Julien]
  • Agenda overview [Julien]
  • 0.6.2 release overview [Michael R.]

...

  • New committers [Julien]
    • 4 new committers were voted in last week
    • We had fallen behind
    • Congratulations to all
  • Release overview (0.6.0-0.6.1) [Michael R.]
    • Added
      • Extract source code of PythonOperator code similar to SQL facet @mobuchowski (0.6.0)
      • Airflow: extract source code from BashOperator @mobuchowski (0.6.0)
        • These first two additions are similar to the SQL facet
        • They offer the ability to see top-level code
      • Add DatasetLifecycleStateDatasetFacet to spec @pawel-big-lebowski (0.6.0)
        • Captures when someone is conducting dataset operations (overwrite, create, etc.)
      • Add generic facet to collect environmental properties (EnvironmentFacet) @harishsune (0.6.0)
        • Collects environment variables
        • Depends on Databricks runtime but can be reused in other environments
      • OpenLineage sensor for OpenLineage-Dagster integration @dalinkim (0.6.0)
        • The first iteration of the Dagster integration to get lineage from Dagster
      • Java-client: make generator generate enums as well @pawel-big-lebowski (0.6.0)
        • A small addition to the Java client featuring better types; these were previously strings
    • Fixed
      • Airflow: increase import timeout in tests, fix exit from integration @mobuchowski (0.6.0)
        • The former was a particular issue with the Great Expectations integration
      • Reduce logging level for import errors to info @rossturk (0.6.0)
        • Airflow users were seeing warnings about missing packages if they weren't using part of an integration
        • This fix reduced the level to Info
      • Remove AWS secret keys and extraneous Snowflake parameters from connection URI @collado-mike (0.6.0)
        • Parses Snowflake connection URIs to exclude some parameters that broke lineage or posed security concerns (e.g., login data)
        • Some keys are Snowflake-specific, but more can be added from other data sources
      • Convert to LifecycleStateChangeDatasetFacet @pawel-big-lebowski (0.6.0)
        • Mandates the LifecycleStateChange facet from the global spec rather than the custom tableStateChange facet used in the past (see the sketch after this list)
      • Catch possible failures when emitting events and log them @mobuchowski (0.6.1)
        • Previously when an OL event failed to emit, this could break an integration
        • This fix catches possible failures and logs them
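An illustrative sketch of the LifecycleStateChangeDatasetFacet mentioned above, attached to an output dataset; the dataset names are placeholders and the value is one of the operations the facet captures (CREATE, OVERWRITE, DROP, etc.):

```python
# Illustrative output dataset carrying the lifecycleStateChange facet from the spec.
output_dataset = {
    "namespace": "warehouse",
    "name": "analytics.daily_sales",
    "facets": {
        "lifecycleStateChange": {
            "lifecycleStateChange": "OVERWRITE",  # e.g., CREATE, OVERWRITE, DROP, TRUNCATE
        },
    },
}
```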
  • Process for blog posts [Ross]
    • Moving the process to GitHub Issues
    • Follow release tracker there

    • Go to https://github.com/OpenLineage/website/tree/main/contents/blog to create posts

    • No one will have a monopoly

    • Proposals for blog posts also welcome and we can support your efforts with outlines, feedback

    • Throw your ideas on the issue tracker on GitHub

  • Retrospective: Spark integration [Willy et al.]
    • Willy: originally this was part of Marquez – the inspiration behind OL

      • OL was prototyped in Marquez with a few integrations, one of which was Spark (other: Airflow)

      • Donated the integration to OL

    • Srikanth: #559 very helpful to Azure

    • Pawel: is anything missing from the Spark integration? E.g., column-level lineage?

    • Will: yes to column-level; also, delta tables are an issue due to complexity; Spark 3.2 support also welcome

    • Maciej: we should be more active about tracking the projects we have integrations with; add them to the test matrix

    • Julien: let’s open some issues to address these

  • Open Discussion
    • Flink updates? [Julien]
      • Maciej: initial exploration is done

        • challenge: Flink has 4 APIs

        • prioritizing Kafka lineage currently because most jobs are writing to/from Kafka

        • track this on Github milestones, contribute, ask questions there

      • Will: can you share thoughts on the data model? How would this show up in MZ? How often are you emitting lineage? 

      • Maciej: trying to model entire Flink run as one event

      • Srikanth: proposed two separate streams, one for data updates and one for metadata

      • Julien: do we have an issue on this topic in the repo?

      • Michael C.: only a general proposal doc, not one on the overall strategy; this is worth a proposal doc

      • Julien: see notes for ticket number; MC will create the ticket

      • Srikanth: we can collaborate offline

...

  • OpenLineage recent release overview (0.5.1) [Julien]
  • TaskInstanceListener now official way to integrate with Airflow [Julien]
  • Apache Flink integration [Julien]
  • Dagster integration demo [Dalin]
  • Open Discussion

Meeting:

Slides

Recording: http://youtube.com/watch?v=cIrXmC0zHLg

Notes:

  • OpenLineage recent release overview (0.5.1) [Julien]
    • No 0.5.0 release due to a bug
    • Support for dbt-spark adapter
    • New backend to proxy OL events
    • Support for custom facets
  • TaskInstanceListener now official way to integrate with Airflow [Julien]
    • Integration runs on worker side
    • Will be in the next OL release, for Airflow 2.3 (see the listener sketch after this list)
    • Thanks to Maciej for his work on this
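A minimal sketch of the Airflow listener hooks the integration plugs into; the import path and hook signatures follow Airflow 2.3's listener API as I understand it, so treat them as approximate rather than a copy of the OpenLineage plugin:

```python
# Sketch of listener hooks running on the worker side; bodies are placeholders.
from airflow.listeners import hookimpl


@hookimpl
def on_task_instance_running(previous_state, task_instance, session):
    # called right before the task executes: emit an OpenLineage START event here
    ...


@hookimpl
def on_task_instance_success(previous_state, task_instance, session):
    # called after the task succeeds: emit a COMPLETE event with extracted metadata
    ...
```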
  • Apache Flink integration [Julien]
    • Ticket for discussion available
    • Integration test setup
    • Early stages
  • Dagster integration demo [Dalin]
    • Initiated by Dalin Kim
    • OL used with Dagster on orchestration layer
    • Utilizes Dagster sensor
    • Introduces an OL sensor that can be added to the Dagster repository definition
    • Uses a cursor to keep track of the last-processed event ID (see the sketch after this list)
    • Looking for feedback after review complete
    • Discussion:
      • Dalin: needed: way to interpret Dagster asset for OL
      • Julien: common code from Great Expectations/Dagster integrations
      • Michael C: do you pass parent run ID in child job when sending the job to MZ?
      • Hierarchy can be extended indefinitely – parent/child relationship can be modeled
      • Maciej: the sensor kept failing – does this mean the events persisted despite being down?
      • Dalin: yes - the sensor’s cursor is tracked, so even if repo goes down it should be able to pick up from last cursor
      • Dalin: hoping for more feedback
      • Julien: slides will be posted on slack channel, also tickets
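A generic sketch of the cursor mechanism described above, using Dagster's sensor API; placeholder_job and fetch_event_records_after are hypothetical stand-ins, not the actual OpenLineage sensor:

```python
# Generic cursor-tracking sensor: persists progress so it can resume after downtime.
from dagster import job, op, sensor


@op
def noop():
    pass


@job
def placeholder_job():  # stands in for whatever job the sensor targets
    noop()


def fetch_event_records_after(last_id):
    """Hypothetical helper returning Dagster event records with storage_id > last_id."""
    return []


@sensor(job=placeholder_job)
def openlineage_like_sensor(context):
    last_id = int(context.cursor) if context.cursor else 0
    for record in fetch_event_records_after(last_id):
        # translate the Dagster event into an OpenLineage event here
        last_id = max(last_id, record.storage_id)
    # persisting the cursor is what lets the sensor pick up from where it stopped,
    # even if the repository goes down for a while
    context.update_cursor(str(last_id))
```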
  • Open discussion
    • Will: how is OL ensuring consistency of datasets across integrations? 
    • Julien: (jokingly) Read the docs! Naming conventions for datasets can be found there (examples after this list)
    • Julien: need for tutorial on creating integrations
    • Srikanth: have done some of this work in Atlas
    • Kevin: are there libraries on the horizon to play this role? (Julien: yes)
    • Srikanth: it would be good to have a model spec to provide an enforceable standard
    • Julien: agreed; currently models are based on the JSON schema spec
    • Julien: contributions welcome; opening a ticket about this makes sense
    • Will: Flink integration: MZ focused on batch jobs
    • Julien: we need to determine whether we need to add checkpointing
    • Julien: there will be discussion in the OL and MZ communities about this
      • In MZ, there are questions about what counts as a version or not
    • Julien: a consistent model is needed
    • Julien: one solution being looked into is Arrow
    • Julien: everyone should feel welcome to propose agenda items (even old projects)
    • Srikanth: who are you working with on the Flink comms side? Will get back to you.
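A few illustrative namespace/name pairs following the dataset naming conventions Julien points to (the namespace identifies the data source, the name identifies the dataset within it); exact per-source formats are defined in the OpenLineage naming docs, so treat these as examples only:

```python
# Illustrative dataset coordinates; consistent naming is what lets different
# integrations refer to the same physical dataset.
examples = [
    {"namespace": "postgres://db.example.com:5432", "name": "metrics.public.daily_counts"},
    {"namespace": "snowflake://my_account", "name": "analytics.sales.orders"},
    {"namespace": "s3://my-bucket", "name": "warehouse/orders/2022-06-01"},
]
```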

...

...

  • Attendees: 
    • TSC:
      • Mandy Chessell: Egeria Lead. Integrating OpenLineage in Egeria

      • Michael Collado: Datakin, OpenLineage

      • Maciej Obuchowski: GetInData. OpenLineage integrations
      • Willy Lulciuc: Marquez co-creator.
      • Ryan Blue: Tabular, Iceberg. Interested in collecting lineage across Iceberg usage with OpenLineage
    • And:
      • Venkatesh Tadinada: BMC, workflow automation, looking to integrate with Marquez
      • Minkyu Park: Datakin. learning about OpenLineage
      • Arthur Wiedmer: Apple, lineage for Siri and AI ML. Interested in implementing Marquez and OpenLineage
  • Meeting recording:

    http://youtube.com/watch?v=Gk0CwFYm9i4

  • Meeting notes:
    • agenda: 
      • Update on OpenLineage latest release (0.2.1)

        • dbt integration demo

      • OpenLineage 0.3 scope discussion

        • Facet versioning mechanism (Issue #153)

        • OpenLineage Proxy Backend (Issue #152)

        • OpenLineage implementer test data and validation

        • Kafka client

      • Roadmap

        • Iceberg integration
      • Open discussion

    • Slides 

    • Discussions:
      • Added to the agenda: a discussion of Iceberg requirements for OpenLineage

    • Demo of dbt:

      • really easy to try

      • when running from Airflow, we can use the wrapper 'dbt-ol run' instead of 'dbt run'

    • Presentation of Proxy Backend design:

      • summary of discussions in Egeria
        • Egeria is less interested in instances (runs) and will keep track of OpenLineage events separately as Operational lineage

        • Two ways to use Egeria with OpenLineage

          • receives HTTP events and forwards to Kafka

          • A consumer receives the Kafka events in Egeria

      • Proxy Backend in OpenLineage:

        • direct HTTP endpoint implementation in Egeria

      • Depending on the user, they might pick one or the other; we'll document both options (see the sketch below):
        • Use a direct OpenLineage endpoint (like Marquez)
        • Deploy the Proxy Backend to write to a queue (e.g., Kafka)
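A minimal sketch of the producer's view of these two options: the emitting code is identical and only the URL differs; both URLs are placeholders, and the client API shown is openlineage-python's:

```python
# Same events, two destinations: a direct OpenLineage endpoint vs. the Proxy Backend,
# which forwards the HTTP events to a queue (e.g., Kafka) for an Egeria consumer.
from openlineage.client import OpenLineageClient

direct_client = OpenLineageClient(url="http://marquez.example.com:5000")    # option 1
proxied_client = OpenLineageClient(url="http://ol-proxy.example.com:8080")  # option 2
# either client.emit(event) produces the same OpenLineage HTTP event;
# only the receiving side differs.
```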

      • Follow up items:

...

Aug 11th 2021

  • Attendees: 
    • TSC:
      • Ryan Blue

      • Maciej Obuchowski

      • Michael Collado

      • Daniel Henneberger

      • Willy Lulciuc

      • Mandy Chessell

      • Julien Le Dem

    • And:
      • Peter Hicks

      • Minkyu Park

      • Daniel Avancini

  • Meeting recording:

    http://youtube.com/watch?v=bbAwz-rzo3I

...

  • Attendees: 
    • TSC:
      • Julien Le Dem
      • Mandy Chessell
      • Michael Collado
      • Willy Lulciuc
  • Meeting recording:

    http://youtube.com/watch?v=kYzFYrzSpzg

  • Meeting notes
    • Agenda:
    • Notes: 

      Mission statement:

      Spec versioning mechanism:

      • The goal is to commit to compatible changes once 0.1 is published

      • We need a follow-up to separate core facet versioning

        • => TODO: create a separate GitHub ticket
      • The lineage event should have a field that identifies which version of the spec it was produced with (see the sketch after this list)

        • => TODO: create a GitHub issue for this

      • TODO: Add issue to document version number semantics (SCHEMAVER)
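A sketch of the spec-version field discussed here, using the field names that appear in the published OpenLineage spec (producer plus schemaURL); all values are placeholders:

```python
# Every emitted event identifies the producer and the spec version it was produced with.
event_envelope = {
    "eventType": "COMPLETE",
    "eventTime": "2021-07-07T19:52:00Z",
    "producer": "https://github.com/OpenLineage/OpenLineage/tree/0.1.0/client/python",
    "schemaURL": "https://openlineage.io/spec/1-0-0/OpenLineage.json",
    "run": {"runId": "d46e465b-d358-4d32-83d4-df660ff614dd"},
    "job": {"namespace": "example", "name": "example_job"},
}
```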

      Extend Event State notion:

      OpenLineage 0.1:

      • finalize a few spec details for 0.1: a few items left to discuss

        • In particular, job naming

        • parent job model

      • Importing Marquez integrations in OpenLineage

      Open Discussion:

      • connecting the consumer and producer

        • TODO: ticket to track distribution mechanism

        • options:

          • Would we need a consumption client to make it easy for consumers to get events from Kafka for example?

          • OpenLineage provides client libraries to serialize/deserialize events as well as sending them.

        • We can have documentation on how to send to backends that are not Marquez using HTTP and existing gateway mechanism to queues.

        • Do we have a mutual third party, or does the client know where to send?

      • Source code location finalization

      • job naming convention

        • you don't always have a nested execution

          • can call a parent

        • parent job

        • You can have a job calling another one.

        • always distinguish a job and its run

      • need a separate notion for job dependencies

      • need to capture event-driven jobs: TODO: create a ticket


      TODO(Julien): update job naming ticket to have the discussion.

...

  • Attendees: 
    • TSC:
      • Julien Le Dem: Marquez, Datakin
      • Drew Banin: dbt, CPO at Fishtown Analytics
      • Maciej Obuchowski: Marquez, GetInData consulting company
      • Zhamak Dehghani: Data Mesh; an open protocol of observability for the data ecosystem is a big piece of Data Mesh
      • Daniel Henneberger: building a database, interested in lineage
      • Mandy Chessell: Lead of Egeria, metadata exchange; lineage is a great extension
      • Willy Lulciuc: co-creator of Marquez
      • Michael Collado: Datakin, OpenLineage end-to-end holistic approach
    • And:
      • Kedar Rajwade: consulting on distributed systems
      • Barr Yaron: dbt, PM at Fishtown Analytics on metadata
      • Victor Shafran: co-founder at Databand.ai, a pipeline monitoring company; lineage is a common issue
    • Excused: Ryan Blue, James Campbell
  • Meeting recording:

    http://youtube.com/watch?v=er2GDyQtm5M

...