Skip to content
This repository has been archived by the owner on Feb 19, 2024. It is now read-only.

Latest commit

 

History

History
355 lines (268 loc) · 13.7 KB

CONTRIBUTING.md

File metadata and controls

355 lines (268 loc) · 13.7 KB

Contributing

In this document, you will find some guidelines on contributing to Apache Iceberg. Please keep in mind that none of these are hard rules and they're meant as a collection of helpful suggestions to make contributing as seamless of an experience as possible.

If you are thinking of contributing but first would like to discuss the change you wish to make, we welcome you to head over to the Community page on the official Iceberg documentation site to find a number of ways to connect with the community, including slack and our mailing lists. Of course, always feel free to just open a new issue in the GitHub repo.

Pull Request Process

Pull requests are the preferred mechanism for contributing to Iceberg

  • PRs are automatically labeled based on the content by our github-actions labeling action
  • It's helpful to include a prefix in the summary that provides context to PR reviewers, such as Build:, Docs:, Spark:, Flink:, Core:, API:
  • If a PR is related to an issue, adding Closes #1234 in the PR description will automatically close the issue and helps keep the project clean
  • If a PR is posted for visibility and isn't necessarily ready for review or merging, be sure to convert the PR to a draft

Building the Project Locally

Please refer to the Building section of the main readme for instructions on how to build iceberg locally.

Website and Documentation Updates

The Iceberg website and documentations are hosted in a different repository iceberg-docs. Read the repository README for contribution guidelines for the website and documentation.

Semantic Versioning

Apache Iceberg leverages semantic versioning to ensure compatibility for developers and users of the iceberg libraries as APIs and implementations evolve. The requirements and guarantees provided depend on the subproject as described below:

Major Version Deprecations Required

Modules iceberg-api

The API subproject is the main interface for developers and users of the Iceberg API and therefore has the strongest guarantees. Evolution of the interfaces in this subproject are enforced by Revapi and require explicit acknowledgement of API changes.

All public interfaces and classes require one major version for deprecation cycle. Any backward incompatible changes should be annotated as @Deprecated and removed for the next major release. Backward compatible changes are allowed within major versions.

Minor Version Deprecations Required

Modules iceberg-common iceberg-core iceberg-data iceberg-orc iceberg-parquet

Changes to public interfaces and classes in the subprojects listed above require a deprecation cycle of one minor release. These projects contain common and internal code used by other projects and can evolve within a major release. Minor release deprecation will provide other subprojects and external projects notice and opportunity to transition to new implementations.

Minor Version Deprecations Discretionary

modules (All modules not referenced above)

Other modules are less likely to be extended directly and modifications should make a good faith effort to follow a minor version deprecation cycle. If there are significant structural or design changes that result in deprecations being difficult to orchestrate, it is up to the committers to decide if deprecation is necessary.

Deprecation Notices

All interfaces, classes, and methods targeted for deprecation must include the following:

  1. @Deprecated annotation on the appropriate element
  2. @depreceted javadoc comment including: the version for removal, the appropriate alternative for usage
  3. Replacement of existing code paths that use the deprecated behavior

Example:

  /**
   * Set the sequence number for this manifest entry.
   *
   * @param sequenceNumber a sequence number
   * @deprecated since 1.0.0, will be removed in 1.1.0; use dataSequenceNumber() instead.
   */
  @Deprecated
  void sequenceNumber(long sequenceNumber);

Adding new functionality without breaking APIs

Ideally, we'd want to add new functionality without breaking existing APIs, especially within the scope of the API modules that are being checked by Revapi.

Let's assume we'd want to add a createBranch(String name) method to the ManageSnapshots API.

The most straight-forward way would be to add the below code:

public interface ManageSnapshots extends PendingUpdate<Snapshot> {
  // existing code...

  // adding this method introduces an API-breaking change
  ManageSnapshots createBranch(String name);
}

And then add the implementation:

public class SnapshotManager implements ManageSnapshots {
  // existing code...

  @Override
  public ManageSnapshots createBranch(String name, long snapshotId) {
    updateSnapshotReferencesOperation().createBranch(name, snapshotId);
    return this;
  }
}

Checking for API breakages

Running ./gradlew revapi will flag this as an API-breaking change:

./gradlew revapi
> Task :iceberg-api:revapi FAILED
> Task :iceberg-api:showDeprecationRulesOnRevApiFailure FAILED

1: Task failed with an exception.
-----------
* What went wrong:
Execution failed for task ':iceberg-api:revapi'.
> There were Java public API/ABI breaks reported by revapi:

  java.method.addedToInterface: Method was added to an interface.

  old: <none>
  new: method org.apache.iceberg.ManageSnapshots org.apache.iceberg.ManageSnapshots::createBranch(java.lang.String)

  SOURCE: BREAKING, BINARY: NON_BREAKING, SEMANTIC: POTENTIALLY_BREAKING

  From old archive: <none>
  From new archive: iceberg-api-1.4.0-SNAPSHOT.jar

  If this is an acceptable break that will not harm your users, you can ignore it in future runs like so for:

    * Just this break:
        ./gradlew :iceberg-api:revapiAcceptBreak --justification "{why this break is ok}" \
          --code "java.method.addedToInterface" \
          --new "method org.apache.iceberg.ManageSnapshots org.apache.iceberg.ManageSnapshots::createBranch(java.lang.String)"
    * All breaks in this project:
        ./gradlew :iceberg-api:revapiAcceptAllBreaks --justification "{why this break is ok}"
    * All breaks in all projects:
        ./gradlew revapiAcceptAllBreaks --justification "{why this break is ok}"
  ----------------------------------------------------------------------------------------------------

Adding a default implementation

In order to avoid breaking the API, we can add a default implementation that throws an UnsupportedOperationException:

public interface ManageSnapshots extends PendingUpdate<Snapshot> {
  // existing code...

  // introduces new code without breaking the API
  default ManageSnapshots createBranch(String name) {
    throw new UnsupportedOperationException(this.getClass().getName() + " doesn't implement createBranch(String)");
  }
}

Style

For Java styling, check out the section Setting up IDE and Code Style from the documentation site.

Java style guidelines

Line breaks

Continuation indents are 2 indents (4 spaces) from the start of the previous line.

Try to break long lines at the same semantic level to make code more readable.

  • Don't use the same level of indentation for arguments to different methods
  • Don't use the same level of indentation for arguments and chained methods
  // BAD: hard to see arguments passed to the same method
  doSomething(new ArgumentClass(1,
      2),
      3);

  // GOOD: break lines at the same semantic level
  doSomething(
      new ArgumentClass(1, 2),
      3);

  // BAD: arguments and chained methods mixed
  SomeObject myNewObject = SomeObject.builder(schema, partitionSpec
      sortOrder)
      .withProperty("x", "1")
      .build()

  // GOOD: method calls at the same level, arguments indented
  SomeObject myNewObject = SomeObject
      .builder(schema, partitionSpec,
          sortOrder)
      .withProperty("x", "1")
      .build()

Method naming

  1. Make method names as short as possible, while being clear. Omit needless words.
  2. Avoid get in method names, unless an object must be a Java bean.
    • In most cases, replace get with a more specific verb that describes what is happening in the method, like find or fetch.
    • If there isn't a more specific verb or the method is a getter, omit get because it isn't helpful to readers and makes method names longer.
  3. Where possible, use words and conjugations that form correct sentences in English when read
    • For example, Transform.preservesOrder() reads correctly in an if statement: if (transform.preservesOrder()) { ... }

Boolean arguments

Avoid boolean arguments to methods that are not private to avoid confusing invocations like sendMessage(false). It is better to create two methods with names and behavior, even if both are implemented by one internal method.

  // prefer exposing suppressFailure in method names
  public void sendMessageIgnoreFailure() {
    sendMessageInternal(true);
  }

  public void sendMessage() {
    sendMessageInternal(false);
  }

  private void sendMessageInternal(boolean suppressFailure) {
    ...
  }

When passing boolean arguments to existing or external methods, use inline comments to help the reader understand actions without an IDE.

  // BAD: it is not clear what false controls
  dropTable(identifier, false);

  // GOOD: these uses of dropTable are clear to the reader
  dropTable(identifier, true /* purge data */);
  dropTable(identifier, purge);

Config naming

  1. Use - to link words in one concept
    • For example, preferred convection access-key-id rather than access.key.id
  2. Use . to create a hierarchy of config groups
    • For example, s3 in s3.access-key-id, s3.secret-access-key

Testing

AssertJ

Prefer using AssertJ assertions as those provide a rich and intuitive set of strongly-typed assertions. Checks can be expressed in a fluent way and AssertJ provides rich context when assertions fail. Additionally, AssertJ has powerful testing capabilities on collections and exceptions. Please refer to the usage guide for additional examples.

// bad: will only say true != false when check fails
assertTrue(x instanceof Xyz);

// better: will show type of x when check fails
assertThat(x).isInstanceOf(Xyz.class);

// bad: will only say true != false when check fails
assertTrue(catalog.listNamespaces().containsAll(expected));

// better: will show content of expected and of catalog.listNamespaces() if check fails
assertThat(catalog.listNamespaces()).containsAll(expected);
// ok
assertNotNull(metadataFileLocations);
assertEquals(metadataFileLocations.size(), 4);

// better: will show the content of metadataFileLocations if check fails
assertThat(metadataFileLocations).isNotNull().hasSize(4);

// or
assertThat(metadataFileLocations).isNotNull().hasSameSizeAs(expected).hasSize(4);
// bad
try {
    catalog.createNamespace(deniedNamespace);
    Assert.fail("this should fail");
} catch (Exception e) {
    assertEquals(AccessDeniedException.class, e.getClass());
    assertEquals("User 'testUser' has no permission to create namespace", e.getMessage());
}

// better
assertThatThrownBy(() -> catalog.createNamespace(deniedNamespace))
    .isInstanceOf(AccessDeniedException.class)
    .hasMessage("User 'testUser' has no permission to create namespace");

Checks on exceptions should always make sure to assert that a particular exception message has occurred.

Awaitility

Avoid using Thread.sleep() in tests as it leads to long test durations and flaky behavior if a condition takes slightly longer than expected.

deleteTablesAsync();
Thread.sleep(3000L);
assertThat(tables()).isEmpty();

A better alternative is using Awaitility to make sure tables() are eventually empty. The below example will run the check with a default polling interval of 100 millis:

deleteTablesAsync();
Awaitility.await("Tables were not deleted")
    .atMost(5, TimeUnit.SECONDS)
    .untilAsserted(() -> assertThat(tables()).isEmpty());

Please refer to the usage guide of Awaitility for more usage examples.

JUnit4 / JUnit5

Iceberg currently uses a mix of JUnit4 (org.junit imports) and JUnit5 (org.junit.jupiter.api imports) tests. To allow an easier migration to JUnit5 in the future, new test classes that are being added to the codebase should be written purely in JUnit5 where possible.