The OpenLineage Technical Steering Committee meets monthly on the second Wednesday from 9:30am to 10:30am US Pacific. Here's the meeting info.

All are welcome.

Table of Contents

Next meeting: December 18th, 2024 (9:30am PT)


November 20th, 2024 (9:30am PT)

October 16th, 2024 (9:30am PT)

September 18th, 2024 (9:30am PT)

August 14th, 2024 (9:30am PT)

July 10th, 2024 (9:30am PT)

June 12th, 2024 (9:30am PT)


May 8, 2024 (9:30am PT)

Attendees:

  • TSC:
    • Julien Le Dem, OpenLineage project lead, LF AI & Data
    • Michael Robinson, Community Manager, Astronomer
    • Harel Shein, Lineage at Datadog
    • Pawel Leszczynski, Software Engineer, GetInData
    • Maciej Obuchowski, Software Engineer, GetInData, OpenLineage committer
  • And:
    • Mark Soule, Principal Engineer, Improving
    • Sheeri Cabral, Product Manager, ETL, Collibra
    • Ernie Ostic, IBM/Manta
    • Rahul Madan, Atlan

Agenda:

  • Announcements
  • Recent releases - 1.13.1
  • Protobuf support in Flink - Pawel
  • Improved Project Management on GitHub
  • Pre-flight configuration check DAG - Rahul
  • Discussion items
  • Open discussion


Notes:

Accenture panel (presented by Confluent), Open Standards for Data Lineage: shared participation stats (see slides), including company and geographical data

  • Protobuf support in Flink - Pawel
    • Protocol buffers are a platform- and language-neutral, extensible mechanism for serializing structured data. OpenLineage can now extract schemas from Flink jobs reading/writing protobuf to and from Kafka.
  • Improved Project Management on GitHub
    • Recording from Kacper Muda
    • Adding 4 new issue templates: Bug report, Documentation issue, Feature request, and General (everything else) - Issue 2666; adds a new “needs triage” label - Issue 2664
    • Future: naming conventions for issues and PRs - e.g., knowing which version a PR went into. And then reviving milestones.
    • These can be discussed in #dev-discuss on Slack
  • Pre-flight configuration check DAG - Rahul
    • After setting up Airflow with OpenLineage, there’s no way to verify the correctness of the setup. Proposal: an Airflow DAG that runs, checks the config, and makes sure it’s OK. https://openlineage.io/docs/integrations/airflow/preflight-check-dag/ - with demo
    • Checks: env + config variables - makes sure OpenLineage is enabled, the library is installed, and versions are compatible; checks connectivity with Marquez if applicable.
    • The DAG made it possible to find and fix #2596.
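A minimal sketch of the kinds of checks such a preflight DAG might run (plain Python, stdlib only; the real DAG lives at the docs link above, and the helper names here are made up — only the `OPENLINEAGE_URL`, `OPENLINEAGE_CONFIG`, and `OPENLINEAGE_DISABLED` environment variables are actual OpenLineage client settings):

```python
import importlib.util
import os

def openlineage_enabled() -> bool:
    # OPENLINEAGE_DISABLED=true turns the integration off entirely.
    return os.getenv("OPENLINEAGE_DISABLED", "false").lower() != "true"

def library_installed() -> bool:
    # True if either the provider package or the legacy library is importable.
    for name in ("airflow.providers.openlineage", "openlineage.airflow"):
        try:
            if importlib.util.find_spec(name) is not None:
                return True
        except ModuleNotFoundError:
            pass
    return False

def transport_configured() -> bool:
    # A transport is needed to emit events, e.g. OPENLINEAGE_URL for HTTP
    # or OPENLINEAGE_CONFIG pointing to a config file.
    return bool(os.getenv("OPENLINEAGE_URL") or os.getenv("OPENLINEAGE_CONFIG"))

def preflight_report() -> dict:
    # Each entry mirrors one check a preflight DAG might run as a task.
    return {
        "enabled": openlineage_enabled(),
        "library_installed": library_installed(),
        "transport_configured": transport_configured(),
    }
```

A Marquez connectivity check would be a fourth entry, issuing an HTTP request against the configured URL.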


Meeting:

Slides (link forthcoming)

April 10, 2024 (9:30am PT)

...

  • TSC:
    • Mike Collado, Staff Software Engineer, Astronomer
    • Julien Le Dem, OpenLineage Project lead
    • Willy Lulciuc, Co-creator of Marquez
    • Michael Robinson, Software Engineer, Dev. Rel., Astronomer
    • Maciej Obuchowski, Software Engineer, GetInData, OpenLineage contributor
    • Mandy Chessell, Egeria Project Lead
    • Daniel Henneberger, Database engineer
    • Will Johnson, Senior Cloud Solution Architect, Azure Cloud, Microsoft
    • Jakub "Kuba" Dardziński, Software Engineer, GetInData, OpenLineage contributor
  • And:
    • Petr Hajek, Information Management Professional, Profinit
    • Harel Shein, Director of Engineering, Astronomer
    • Minkyu Park, Senior Software Engineer, Astronomer
    • Sam Holmberg, Software Engineer, Astronomer
    • Ernie Ostic, SVP of Product, MANTA
    • Sheeri Cabral, Technical Product Manager, Lineage, Collibra
    • John Thomas, Software Engineer, Dev. Rel., Astronomer
    • Bramha Aelem, BigData/Cloud/ML and AI Architect, Tiger Analytics

...

  • Announcements [Julien]
    • OpenLineage earned the OSSF Core Infrastructure Silver Badge!
    • Happening soon: OpenLineage to apply formally for Incubation status with the LFAI
    • Blog: a post by Ernie Ostic about MANTA’s OpenLineage integration
    • Website: a new Ecosystem page
    • Workshops repo: An Intro to Dataset Lineage with Jupyter and Spark
    • Airflow docs: guidance on creating custom extractors to support external operators
    • Spark docs: improved documentation of column lineage facets and extensions
  • Recent release 0.16.1 [Michael R.] 
    • Added

      • Airflow: add dag_run information to Airflow version run facet #1133 @fm100
        Adds the Airflow DAG run ID to the taskInfo facet, making this additional information available to the integration.
      • Airflow: add LoggingMixin to extractors #1149 @JDarDagran
        Adds a LoggingMixin class to the custom extractor to make the output consistent with general Airflow and OpenLineage logging settings.
      • Airflow: add default extractor #1162 @mobuchowski
        Adds a DefaultExtractor to support the default implementation of OpenLineage for external operators without the need for custom extractors.
      • Airflow: add on_complete argument in DefaultExtractor #1188 @JDarDagran
        Adds support for running another method on extract_on_complete.
      • SQL: reorganize the library into multiple packages #1167 @StarostaGit @mobuchowski
        Splits the SQL library into a Rust implementation and foreign language bindings, easing the process of adding language interfaces. Also contains a CI fix.

      Changed

      • Airflow: move get_connection_uri as extractor's classmethod #1169 @JDarDagran
        The get_connection_uri method allowed for too many params, resulting in unnecessarily long URIs. This changes the logic to whitelisting per extractor.
      • Airflow: change get_openlineage_facets_on_start/complete behavior #1201 @JDarDagran
        Splits up the method for greater legibility and easier maintenance.
    • Removed

      • Airflow: remove support for Airflow 1.10 #1128 @mobuchowski
        Removes the code structures and tests enabling support for Airflow 1.10.
    • Bug fixes and more details 
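For context on the DefaultExtractor addition above: `get_openlineage_facets_on_start` is the method the DefaultExtractor looks for on an operator, so the operator can describe its own lineage without a dedicated extractor class. The sketch below uses stand-in classes rather than the real `OperatorLineage`/`Dataset` types from the integration, so the exact shapes are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Dataset:
    # Stand-in for the integration's dataset type (namespace + name).
    namespace: str
    name: str

@dataclass
class OperatorLineage:
    # Stand-in for the structure an extractor returns to the integration.
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)

class MyCustomOperator:
    """An operator the DefaultExtractor could handle: it reports its own lineage."""

    def __init__(self, source_table: str, target_table: str):
        self.source_table = source_table
        self.target_table = target_table

    def get_openlineage_facets_on_start(self) -> OperatorLineage:
        # The DefaultExtractor calls this method on the operator itself,
        # removing the need to register a custom extractor.
        return OperatorLineage(
            inputs=[Dataset("postgres://db", self.source_table)],
            outputs=[Dataset("postgres://db", self.target_table)],
        )
```

The `on_complete` argument mentioned in #1188 lets a second method run on task completion (`extract_on_complete`) in the same style.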

  • Update on LFAI & Data progress [Michael R.]
    • LFAI & Data: a single funding effort to support technical projects hosted under the Linux Foundation
    • Current status: applying soon for Incubation, will be ready to apply for Graduation soon (dates TBD).
    • Incubation stage requirements:
      • 2+ organizations actively contributing to the project: 23 organizations
      • A sponsor who is an existing LFAI & Data member: to do
      • 300+ stars on GitHub: 1.1K GitHub stars
      • A Core Infrastructure Initiative Best Practices Silver Badge: Silver Badge earned on November 2
      • Affirmative vote of the TAC and Governing Board: pending
      • A defined TSC with a chairperson: TSC with chairperson, Julien Le Dem
    • Graduation stage requirements:
      • 5+ organizations actively contributing to the project: 23 organizations
      • Substantial flow of commits for 12 months: commit growth rate (12 mo.) 155.53%; avg. commits pushed by active contributors (12 mo.) 2.18K
      • 1000+ stars on GitHub: 1.1K GitHub stars
      • Core Infrastructure Initiative Best Practices Gold Badge: Gold Badge in progress (57%)
      • Affirmative vote of the TAC and Governing Board: pending
      • 1+ collaboration with another LFAI project: Marquez, Egeria, Amundsen
      • Technical lead appointed on the TAC: to do


  • Implementing OpenLineage proposal and discussion [Julien]
    • Procedure for implementing OpenLineage is under-documented
    • Goal: provide a better guide on the multiple approaches that exist
    • Contributions are welcome
    • Expect more information about this at the next meeting
  • MANTA integration update [Petr]
    • Project: MANTA OpenLineage Connector
    • Straightforward solution:
      • An agent installed on the customer side sets up an API endpoint for MANTA
      • The MANTA Agent will hand over OpenLineage events to the MANTA OpenLineage Extractor, which will save the data in a MANTA OpenLineage Event Repository
      • Use the MANTA Admin UI to run/schedule the MANTA OpenLineage Reader to generate an OpenLineage Graph and produce the final MANTA Graph using a MANTA OpenLineage Generator
      • The whole process will be parameterized
    • Demo:
      • Example dataset produced by Keboola integration
      • All dependencies visualized in UI
      • Some information about columns is available, but not true column lineage
      • Possible to draw lineage across range of tools
    • Looking for volunteers willing to test the integration
    • Q&A
      • Are you using the Column-level Lineage Facet from OpenLineage?
        • Not yet, but we would like to test it
        • Find a good example of this in the OpenLineage/workshops/Spark GitHub repo
        • What would be great would be a real example/real environment for testing
  • Linking CMF (a common ML metadata framework) and OpenLineage [Suparna & Ann Mary]
    • https://github.com/HewlettPackard/cmf
    • Where CMF will fit in the OpenLineage ecosystem
      • linkage needed between forms of metadata for conducting AI experiments
      • concept: "git for AI metadata" consumable by tools such as Marquez and Egeria after publication by an OpenLineage-CMF publisher
      • challenges:
        • multiple stages with interlinked dependencies
        • executing asynchronously
        • data centricity requires artifact lineage and tracking influence of different artifacts and data slices on model performance
        • pipelines should be Reproducible, Auditable and Traceable
        • end-to-end visibility is necessary to identify biases, etc.
      • AI for Science example:
        • training loop in complex pipeline with multiple models optimized concurrently
          • e.g., an embedding model, edge selection model and graph neural model in same pipeline
          • CMF used to capture metadata across pipeline stages
      • Manufacturing quality monitoring pipeline
        • iterative retraining with new samples added to the dataset every iteration
        • CMF tracks lineage across training and deployment stages
        • Q: is the recording of metadata automatic, or does the data scientist have control over it?
          • there are both explicit (e.g., APIs) and implicit modes of tracking
          • the data scientist can choose which "branches" to "push" a la Git
      • 3 columns of reproducibility
        • metadata store (MLMD/MLFlow)
        • Artifact Store (DVC/Others)
        • Query Cache Layer (Graph Database)
        • GIT
        • optimization
      • Comparison with other AI metadata infrastructure
        • Git-like support and ability to collaborate across teams distinguish CMF from alternatives
        • Metrics and lineage also make CMF comparable to model-centric and pipeline-centric tools
      • Lineage tracking and decentralized usage model
        • complete view of data model lineage for reproducibility, optimization, explainability
        • decentralized usage model, easily cloned in any environment
      • What does it look like?
        • explicit tracking via Python library
        • tracking of dataset, model and metrics
        • offers end-to-end visibility
      • API
        • abstractions: pipeline state, context/stage of execution, execution
      • Automated logging, heterogeneous stacks, and distributed teams
        • enables collaboration of distributed teams of scientists using a diverse set of libraries
        • automatic logging in command line interface
      • POC implementations
        • allows for integration with existing frameworks
        • compatible with ML/DL frameworks and ML tracking platforms
      • Translation between CMF and OpenLineage
        • export of metadata in OpenLineage format
        • mapping of abstractions onto OpenLineage
        • Run ~ Execution with Run facet
        • Job ~ Context with Job facet
        • Dataset ~ Dataset with Dataset facet
        • Namespace ~ Pipeline
      • Q&A
        • Pipeline might map to Job name
        • Context might map to Pipeline as Parent job
        • Model could map to a Dataset as well as a Dataset facet
        • Metric as a model could map to a Dataset facet
        • 2 levels of dataset facet, one static and one tied to Job Runs
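The CMF-to-OpenLineage mapping above (Run ~ Execution, Job ~ Context, Dataset ~ Dataset, Namespace ~ Pipeline) can be illustrated with a small translation function. The OpenLineage event fields (`eventType`, `run.runId`, `job`, `schemaURL`) follow the published spec; the CMF-side parameters and the producer URL are hypothetical:

```python
import uuid
from datetime import datetime, timezone

def cmf_execution_to_openlineage(pipeline: str, stage: str, execution_id: str,
                                 inputs: list, outputs: list) -> dict:
    """Translate one CMF execution into an OpenLineage RunEvent-shaped dict.

    Mapping from the notes: Namespace ~ Pipeline, Job ~ Context/stage,
    Run ~ Execution, Dataset ~ Dataset.
    """
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        # The CMF execution id becomes the run id (must be a UUID in OpenLineage),
        # derived deterministically here so re-exports stay stable.
        "run": {"runId": str(uuid.uuid5(uuid.NAMESPACE_URL, execution_id))},
        # The CMF pipeline becomes the namespace; the stage becomes the job name.
        "job": {"namespace": pipeline, "name": stage},
        "inputs": [{"namespace": pipeline, "name": n} for n in inputs],
        "outputs": [{"namespace": pipeline, "name": n} for n in outputs],
        "producer": "https://example.com/cmf-openlineage-publisher",  # hypothetical
        "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json",
    }
```

The Q&A refinements (Pipeline as parent job, models and metrics as datasets with facets) would layer facets onto this same shape.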

...

  • Release 0.9.0 [Michael R.]
    • We added:
    • For the bug fixes and more information, see the GitHub repo.
    • Shout out to new contributor Jakub Dardziński, who contributed a bug fix to this release!
  • Snowflake Blog Post [Ross]
    • topic: a new integration between OL and Snowflake
    • integration is the first OL extractor to process query logs
    • design:
      • an Airflow pipeline processes queries against Snowflake
      • separate job: pulls access history and assembles lineage metadata
      • two angles: Airflow sees it, Snowflake records it
    • the meat of the integration: a view that does untold SQL madness to emit JSON to send to OL
    • result: you can study the transformation by asking Snowflake AND Airflow about it
    • required: having access history enabled in your Snowflake account (which requires a special access level)
    • Q & A
      • Howard: is the access history task part of the DAG?
      • Ross: yes, there's a separate DAG that pulls the view and emits the events
      • Howard: what's the scope of the metadata?
      • Ross: the account level
      • Michael C: in Airflow integration, there's a parent/child relationship; is this captured?
      • Ross: there are 2 jobs/runs, and there's work ongoing to emit metadata from Airflow (task name)
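The "separate job that pulls access history and assembles lineage metadata" described above can be sketched roughly as follows, assuming rows shaped like Snowflake's ACCESS_HISTORY view (simplified to plain dicts here; the real view exposes JSON columns such as DIRECT_OBJECTS_ACCESSED and OBJECTS_MODIFIED, and the actual integration does this assembly in SQL):

```python
def access_history_to_lineage(rows: list) -> list:
    """Assemble per-query lineage from access-history-style rows.

    Each row is assumed to carry 'query_id', 'direct_objects_accessed',
    and 'objects_modified' fields, mirroring Snowflake's ACCESS_HISTORY view.
    """
    events = []
    for row in rows:
        events.append({
            # One lineage job/run per query, as Snowflake recorded it.
            "job": row["query_id"],
            "inputs": [o["objectName"] for o in row["direct_objects_accessed"]],
            "outputs": [o["objectName"] for o in row["objects_modified"]],
        })
    return events
```

This is the "Snowflake records it" angle; the Airflow side sees the same transformation from the orchestration perspective.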
  • Great Expectations integration [Michael C.]
    • validation actions in GE execute after validation code does
    • metadata extracted from these and transformed into facets
    • recent update: the integration now supports version 3 of the GE API
    • some configuration ongoing: currently you need to set up validation actions in GE
    • Q & A
      • Willy: is the metadata emitted as facets?
      • Michael C.: yes, two
  • dbt integration [Willy]
    • a demo on getting started with the OL-dbt library
      • pip install the integration library and dbt
      • configure the dbt profile
      • run seed command and run command in dbt
      • the integration extracts metadata from the different views
      • in Marquez, the UI displays the input/output datasets, job history, and the SQL
  • Open discussion
    • Howard: what is the process for becoming a committer?
      • Maciej: nomination by a committer then a vote
      • Sheeri: is coding beforehand recommended?
      • Maciej: contribution to the project is expected
      • Willy: no timeline on the process, but we are going to try to hold a regular vote
      • Ross: project documentation covers this but is incomplete
      • Michael C.: is this process defined by the LFAI?
    • Ross: contributions to the website, workshops are welcome!
    • Michael R.: we're in the process of moving the meeting recordings to our YouTube channel

May 19th, 2022 (10am PT)

Agenda:

...

  • TSC:
    • Mike Collado: Staff Software Engineer, Datakin
    • Maciej Obuchowski: Software Engineer, GetInData, OpenLineage contributor
    • Julien Le Dem: OpenLineage Project lead
    • Willy Lulciuc: Co-creator of Marquez
  • And:
    • Ernie Ostic: SVP of Product, Manta 
    • Sandeep Adwankar: Senior Technical Product Manager, AWS
    • Paweł Leszczyński, Software Engineer, GetInData
    • Howard Yoo: Staff Product Manager, Astronomer
    • Michael Robinson: Developer Relations Engineer, Astronomer
    • Ross Turk: Senior Director of Community, Astronomer
    • Minkyu Park: Senior Software Engineer, Astronomer
    • Will Johnson: Senior Cloud Solution Architect, Azure Cloud, Microsoft

Meeting:

Recording: http://youtube.com/watch?v=X0ZwMotUARA

Notes:

  • Releases
  • Communication reminders [Julien]
  • Agenda [Julien]
  • Column-level lineage [Paweł]
    • Linked to 4 PRs, the first being a proposal
    • The second has been merged, but the core mechanism is turned off
    • 3 requirements:
      • Outputs labeled with expression IDs
      • Inputs with expression IDs
      • Dependencies
    • Once it is turned on, each OL event will receive a new JSON field
    • It would be great to be able to extend this API (currently on the roadmap)
    • Q & A
      • Will: handling user-defined functions: is the solution already generic enough?
        • The answer will depend on testing, but I suspect that the answer is yes
        • The team at Microsoft would be excited to learn that the solution will handle UDFs
      • Julien: the next challenge will be to ensure that all the integrations support column-level lineage
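The "new JSON field" each event receives once the mechanism is turned on is the column-lineage dataset facet. A minimal example of its shape as defined in the spec — the dataset, namespace, and field names below are invented:

```python
# Shape of the columnLineage dataset facet, attached to an output dataset:
# each output field lists the input fields it was derived from.
column_lineage_facet = {
    "columnLineage": {
        "fields": {
            "full_name": {
                "inputFields": [
                    {"namespace": "postgres://db", "name": "users", "field": "first_name"},
                    {"namespace": "postgres://db", "name": "users", "field": "last_name"},
                ]
            }
        }
    }
}
```

The three requirements above (labeled outputs, labeled inputs, dependencies) are exactly what is needed to populate `fields` and its `inputFields` lists.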
  • Open discussion
    • Willy: in Marquez we need to start handling column-level lineage; has anyone thought about how this might work?
      • Julien: lineage endpoint for col-level lineage to layer on top of what already exists
      • Willy: this makes sense – we could use the method for input and output datasets as a model
      • Michael C.: I don't know that we need to add an endpoint – we could augment the existing one to do something with the data
      • Willy: how do we expect this to be visualized?
        • Julien: not quite sure
        • Michael C.: there are a number of different ways we could do this, including isolating relevant dataset fields 

...

  • 0.6.2 release overview [Michael R.]
  • Transports in OpenLineage clients [Maciej]
  • Airflow integration update [Maciej]
  • Dagster integration retrospective [Dalin]
  • Open discussion

Meeting info:

Recording: http://youtube.com/watch?v=MciFCgrQaxk

Notes:

  • Introductions
  • Communication channels overview [Julien]
  • Agenda overview [Julien]
  • 0.6.2 release overview [Michael R.]

...

  • New committers [Julien]
    • 4 new committers were voted in last week
    • We had fallen behind
    • Congratulations to all
  • Release overview (0.6.0-0.6.1) [Michael R.]
    • Added
      • Extract source code of PythonOperator code similar to SQL facet @mobuchowski (0.6.0)
      • Airflow: extract source code from BashOperator @mobuchowski (0.6.0)
        • These first two additions are similar to SQL facet
        • Offer the ability to see top-level code
      • Add DatasetLifecycleStateDatasetFacet to spec @pawel-big-lebowski (0.6.0)
        • Captures when someone is conducting dataset operations (overwrite, create, etc.)
      • Add generic facet to collect environmental properties (EnvironmentFacet) @harishsune (0.6.0)
        • Collects environment variables
        • Depends on Databricks runtime but can be reused in other environments
      • OpenLineage sensor for OpenLineage-Dagster integration @dalinkim (0.6.0)
        • The first iteration of the Dagster integration to get lineage from Dagster
      • Java-client: make generator generate enums as well @pawel-big-lebowski (0.6.0)
        • Small addition to the Java client featuring better types; previously a string
    • Fixed
      • Airflow: increase import timeout in tests, fix exit from integration @mobuchowski (0.6.0)
        • The former was a particular issue with the Great Expectations integration
      • Reduce logging level for import errors to info @rossturk (0.6.0)
        • Airflow users were seeing warnings about missing packages if they weren't using a part of an integration
        • This fix reduced the level to Info
      • Remove AWS secret keys and extraneous Snowflake parameters from connection URI @collado-mike (0.6.0)
        • Parses Snowflake connection URIs to exclude some parameters that broke lineage or posed security concerns (e.g., login data)
        • Some keys are Snowflake-specific, but more can be added from other data sources
      • Convert to LifecycleStateChangeDatasetFacet @pawel-big-lebowski (0.6.0)
        • Mandates the LifecycleStateChange facet from the global spec rather than the custom tableStateChange facet used in the past
      • Catch possible failures when emitting events and log them @mobuchowski (0.6.1)
        • Previously when an OL event failed to emit, this could break an integration
        • This fix catches possible failures and logs them
  • Process for blog posts [Ross]
    • Moving the process to GitHub Issues
    • Follow release tracker there

    • Go to https://github.com/OpenLineage/website/tree/main/contents/blog to create posts

    • No one will have a monopoly

    • Proposals for blog posts also welcome and we can support your efforts with outlines, feedback

    • Throw your ideas on the issue tracker on GitHub

  • Retrospective: Spark integration [Willy et al.]
    • Willy: originally this was part of Marquez – the inspiration behind OL

      • OL was prototyped in Marquez with a few integrations, one of which was Spark (other: Airflow)

      • Donated the integration to OL

    • Srikanth: #559 very helpful to Azure

    • Pawel: is anything missing from the Spark integration? E.g., column-level lineage?

    • Will: yes to column-level; also, delta tables are an issue due to complexity; Spark 3.2 support also welcome

    • Maciej: we should be more active about tracking projects we have integrations with; add to test matrix

    • Julien: let’s open some issues to address these

  • Open Discussion
    • Flink updates? [Julien]
      • Maciej: initial exploration is done

        • challenge: Flink has 4 APIs

        • prioritizing Kafka lineage currently because most jobs are writing to/from Kafka

        • track this on GitHub milestones, contribute, ask questions there

      • Will: can you share thoughts on the data model? How would this show up in MZ? How often are you emitting lineage? 

      • Maciej: trying to model entire Flink run as one event

      • Srikanth: proposed two separate streams, one for data updates and one for metadata

      • Julien: do we have an issue on this topic in the repo?

      • Michael C.: only a general proposal doc, not one on the overall strategy; this is worth a proposal doc

      • Julien: see notes for ticket number; MC will create the ticket

      • Srikanth: we can collaborate offline

...

  • OpenLineage recent release overview (0.5.1) [Julien]
  • TaskInstanceListener now official way to integrate with Airflow [Julien]
  • Apache Flink integration [Julien]
  • Dagster integration demo [Dalin]
  • Open Discussion

Meeting:

Slides

Recording: http://youtube.com/watch?v=cIrXmC0zHLg

Notes:

  • OpenLineage recent release overview (0.5.1) [Julien]
    • No 0.5.0 due to a bug
    • Support for dbt-spark adapter
    • New backend to proxy OL events
    • Support for custom facets
  • TaskInstanceListener now official way to integrate with Airflow [Julien]
    • Integration runs on worker side
    • Will be in the next release of Airflow (2.3)
    • Thanks to Maciej for his work on this
  • Apache Flink integration [Julien]
    • Ticket for discussion available
    • Integration test setup
    • Early stages
  • Dagster integration demo [Dalin]
    • Initiated by Dalin Kim
    • OL used with Dagster on orchestration layer
    • Utilizes Dagster sensor
    • Introduces OL sensor that can be added to Dagster repo definition
    • Uses cursor to keep track of ID
    • Looking for feedback after review complete
    • Discussion:
      • Dalin: needed: way to interpret Dagster asset for OL
      • Julien: common code from Great Expectations/Dagster integrations
      • Michael C: do you pass parent run ID in child job when sending the job to MZ?
      • Hierarchy can be extended indefinitely – parent/child relationship can be modeled
      • Maciej: the sensor kept failing – does this mean the events persisted despite being down?
      • Dalin: yes - the sensor’s cursor is tracked, so even if repo goes down it should be able to pick up from last cursor
      • Dalin: hoping for more feedback
      • Julien: slides will be posted on slack channel, also tickets
  • Open discussion
    • Will: how is OL ensuring consistency of datasets across integrations? 
    • Julien: (jokingly) Read the docs! Naming conventions for datasets can be found there
    • Julien: need for tutorial on creating integrations
    • Srikanth: have done some of this work in Atlas
    • Kevin: are there libraries on the horizon to play this role? (Julien: yes)
    • Srikanth: it would be good to have model spec to provide enforceable standard
    • Julien: agreed; currently models are based on the JSON schema spec
    • Julien: contributions welcome; opening a ticket about this makes sense
    • Will: Flink integration: MZ focused on batch jobs
    • Julien: we want to make sure; we may need to add checkpointing
    • Julien: there will be discussion in OLMZ communities about this
      • In MZ, there are questions about what counts as a version or not
    • Julien: a consistent model is needed
    • Julien: one solution being looked into is Arrow
    • Julien: everyone should feel welcome to propose agenda items (even old projects)
    • Srikanth: who are you working with on the Flink comms side? Will get back to you.

...

...

  • Attendees: 
    • TSC:
      • Mandy Chessell: Egeria Lead. Integrating OpenLineage in Egeria

      • Michael Collado: Datakin, OpenLineage

      • Maciej Obuchowski: GetInData. OpenLineage integrations
      • Willy Lulciuc: Marquez co-creator.
      • Ryan Blue: Tabular, Iceberg. Interested in collecting lineage across Iceberg users with OpenLineage
    • And:
      • Venkatesh Tadinada: BMC workflow automation looking to integrate with Marquez
      • Minkyu Park: Datakin. Learning about OpenLineage
      • Arthur Wiedmer: Apple, lineage for Siri and AI ML. Interested in implementing Marquez and OpenLineage
  • Meeting recording:

Recording: http://youtube.com/watch?v=Gk0CwFYm9i4

  • Meeting notes:
    • agenda: 
      • Update on OpenLineage latest release (0.2.1)

        • dbt integration demo

      • OpenLineage 0.3 scope discussion

        • Facet versioning mechanism (Issue #153)

        • OpenLineage Proxy Backend (Issue #152)

        • OpenLineage implementer test data and validation

        • Kafka client

      • Roadmap

        • Iceberg integration
      • Open discussion

    • Slides 

    • Discussions:
      • Added to the agenda: a discussion of Iceberg requirements for OpenLineage.

    • Demo of dbt:

      • really easy to try

      • when running from airflow, we can use the wrapper 'dbt-ol run' instead of 'dbt run'

    • Presentation of Proxy Backend design:

      • summary of discussions in Egeria
        • Egeria is less interested in instances (runs) and will keep track of OpenLineage events separately as Operational lineage

        • Two ways to use Egeria with OpenLineage

          • receives HTTP events and forwards to Kafka

          • A consumer receives the Kafka events in Egeria

      • Proxy Backend in OpenLineage:

        • direct HTTP endpoint implementation in Egeria

      • Depending on the user, they might pick one or the other; we'll document both

    • Use a direct OpenLineage endpoint (like Marquez)

      • Deploy the Proxy Backend to write to a queue (ex: Kafka)

      • Follow up items:

...

Aug 11th 2021

  • Attendees: 
    • TSC:
      • Ryan Blue

      • Maciej Obuchowski

      • Michael Collado

      • Daniel Henneberger

      • Willy Lulciuc

      • Mandy Chessell

      • Julien Le Dem

    • And:
      • Peter Hicks

      • Minkyu Park

      • Daniel Avancini

  • Meeting recording:

Recording: http://youtube.com/watch?v=bbAwz-rzo3I

...

  • Attendees: 
    • TSC:
      • Julien Le Dem
      • Mandy Chessel
      • Michael Collado
      • Willy Lulciuc
  • Meeting recording:

Recording: http://youtube.com/watch?v=kYzFYrzSpzg

  • Meeting notes
    • Agenda:
    • Notes: 

      Mission statement:

      Spec versioning mechanism:

      • The goal is to commit to compatible changes once 0.1 is published

      • We need a follow up to separate core facet versioning


      => TODO: create a separate GitHub ticket.
      • The lineage event should have a field that identifies what version of the spec it was produced with

        • => TODO: create a GitHub issue for this

      • TODO: Add issue to document version number semantics (SCHEMAVER)

      Extend Event State notion:

      OpenLineage 0.1:

      • Finalize a few spec details for 0.1: a few items left to discuss.

        • In particular job naming

        • parent job model

      • Importing Marquez integrations in OpenLineage

      Open Discussion:

      • connecting the consumer and producer

        • TODO: ticket to track distribution mechanism

        • options:

          • Would we need a consumption client to make it easy for consumers to get events from Kafka for example?

          • OpenLineage provides client libraries to serialize/deserialize events as well as sending them.

        • We can have documentation on how to send to backends that are not Marquez using HTTP and existing gateway mechanism to queues.

        • Do we have a mutual third party or the client know where to send?

      • Source code location finalization

      • job naming convention

        • you don't always have a nested execution

          • can call a parent

        • parent job

        • You can have a job calling another one.

        • always distinguish a job and its run

      • need a separate notion for job dependencies

      • need to capture event driven: TODO: create ticket.


      TODO(Julien): update job naming ticket to have the discussion.

...

  • Attendees: 
    • TSC:
      • Julien Le Dem: Marquez, Datakin
      • Drew Banin: dbt, CPO at Fishtown Analytics
      • Maciej Obuchowski: Marquez, GetInData consulting company
      • Zhamak Dehghani: Data Mesh; an open protocol of observability for the data ecosystem is a big piece of Data Mesh
      • Daniel Henneberger: building a database, interested in lineage
      • Mandy Chessell: Lead of Egeria, metadata exchange; lineage is a great extension
      • Willy Lulciuc: co-creator of Marquez
      • Michael Collado: Datakin, OpenLineage end-to-end holistic approach
    • And:
      • Kedar Rajwade: consulting on distributed systems
      • Barr Yaron: dbt, PM at Fishtown Analytics on metadata
      • Victor Shafran: co-founder at databand.ai, a pipeline monitoring company; lineage is a common issue
    • Excused: Ryan Blue, James Campbell
  • Meeting recording:

Recording: http://youtube.com/watch?v=er2GDyQtm5M

...