
Tech learning roadmaps…



Distributed Transaction Management in Microservices — Part 1

Dineshchandgr

 

 

Image Source: https://miro.medium.com/max/690/1*ZbA4HrE9XKF4FziPs2MNfQ.png

Hello everyone. In this article, we are going to look at distributed transaction management across microservices.

 

What is a transaction?

A transaction is a series of actions that must all execute successfully. If even one of the operations fails, all of the steps must be rolled back in order to leave the application in its previous stable state. A transaction has the following ACID properties: Atomicity, Consistency, Isolation, and Durability.

 

Image Source: https://media.geeksforgeeks.org/wp-content/cdn-uploads/20191121102921/ACID-Properties.jpg

 

Transactions in Monolith and Microservices

In a traditional Monolithic application, there is a single large application connecting to a single large database, and such applications stick to the ACID properties.

The transaction boundary starts inside the service layer and can be committed or rolled back based on the outcome of all the steps in that transaction. In the case of Microservices, each microservice runs a specific business area and follows the Single Repository Principle (SRP), which means each microservice maintains its own database and one service should not access another service’s database directly. So the transactions are distributed across the microservices.

Example: Let us consider online order processing in both a Monolith and a Microservices architecture for the scenario below

  1. The user adds a product to a cart on an e-commerce site and buys it
  2. An order is created for the user with an order number generated
  3. The item stock is reduced by 1 as the user has bought the product
  4. An invoice is generated for the item
  5. Payment has been completed successfully
  6. The invoice is sent to the user via email
 

Image Source: https://www.sam-solutions.com/blog/wp-content/uploads/2017/10/Monolithic-vs-Microservices-architecture-image-704x540.png

In the Monolith application, all the steps take place inside the single application and single database. All the steps are executed from a service class, and if any step fails, the whole transaction can be rolled back, as sketched below.
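
To make the transaction boundary concrete, here is a minimal sketch of what the monolith's service layer could look like, assuming Spring's @Transactional annotation for the boundary; the repository interfaces and method names are hypothetical and only illustrate the single-database rollback described above.

import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

// Hypothetical repository interfaces, all writing to the same single database.
interface OrderRepository   { Long createOrder(Long productId, Long userId); }
interface StockRepository   { void decrement(Long productId, int quantity); }
interface InvoiceRepository { void createInvoice(Long orderNumber); }

@Service
public class OrderProcessingService {

    private final OrderRepository orders;
    private final StockRepository stock;
    private final InvoiceRepository invoices;

    public OrderProcessingService(OrderRepository orders, StockRepository stock,
                                  InvoiceRepository invoices) {
        this.orders = orders;
        this.stock = stock;
        this.invoices = invoices;
    }

    // One transaction boundary in the service layer: if any step throws,
    // every step below is rolled back against the single database.
    @Transactional
    public Long placeOrder(Long productId, Long userId) {
        Long orderNumber = orders.createOrder(productId, userId); // step 2: create the order
        stock.decrement(productId, 1);                            // step 3: reduce the stock
        invoices.createInvoice(orderNumber);                      // step 4: generate the invoice
        return orderNumber;
    }
}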

In the case of a Microservices application, each of the above steps takes place individually inside a specific microservice and its own database

  • Order will be processed in the Order service
  • Stocks are checked and calculated inside the Account Service
  • Invoice is processed by Invoice Service
  • Payment is processed in the Payment service
  • Email is triggered by the Notification service

Since each of the steps runs inside a different microservice and its database, maintaining the ACID principle for the entire transaction is extremely difficult and complicated. It is better to avoid distributed transaction management completely if possible.

If not, then there are some standard patterns for distributed transaction management.

Patterns for Distributed Transaction Management

  1. Synchronous Patterns
    • Two-Phase Commit
    • Three-Phase Commit
  2. Asynchronous Patterns
    • Orchestration-Based Saga Pattern
    • Choreography-Based Saga Pattern
 

Synchronous Patterns

Two-Phase Commit (2 PC)

2 Phase Commit is a standard protocol for handling distributed transactions using two stages, namely the Prepare stage and the Commit stage. A transaction coordinator component coordinates the entire transaction by talking to all the services.

 

Image Source: https://thorben-janssen.com/wp-content/uploads/2020/02/Distributed-Transactions-2phase-commit-1024x576.png

Success Scenario

  1. The transaction coordinator instructs each service to prepare for commit and every service then checks if the commit can be done without any issue
  2. After checking, each service sends a Prepared response to the coordinator.
  3. Once the Coordinator receives all the Prepared responses, it tells all the services to commit the data into the database
  4. Now the transaction is successful and all the changes get committed across the services

Rollback Scenario

  1. The transaction coordinator instructs each service to prepare for commit, and every service then checks if the commit can be done without any issue
  2. After checking, imagine that one service responds with a failed status
  3. The Coordinator will send an abort command to abort the transaction and roll back any changes performed in the transaction, to maintain the ACID properties (a simplified coordinator sketch follows below)
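
The following is a minimal sketch of the coordinator logic described in the success and rollback scenarios, assuming a hypothetical Participant interface through which each service exposes prepare/commit/abort operations; it is an illustration of the protocol, not a production XA implementation.

import java.util.List;

// Hypothetical view of a service taking part in the distributed transaction.
interface Participant {
    boolean prepare();   // true -> "Prepared" response, false -> failed
    void commit();
    void abort();
}

class TwoPhaseCommitCoordinator {

    // Phase 1: ask every participant to prepare; Phase 2: commit or abort.
    boolean execute(List<Participant> participants) {
        boolean allPrepared = true;
        for (Participant p : participants) {
            if (!p.prepare()) {          // any failure vetoes the whole transaction
                allPrepared = false;
                break;
            }
        }
        if (allPrepared) {
            participants.forEach(Participant::commit);   // success scenario
        } else {
            participants.forEach(Participant::abort);    // rollback scenario
        }
        return allPrepared;
    }
}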

Drawbacks of 2PC

  1. It is very slow as the coordinator waits for all the responses and the transaction takes a long time to commit
  2. The data in every DB is locked until the commit or abort command is issued. These locks slow down the system and cause a degradation in performance.

Three Phase Commit (3 PC)

A two-phase commit protocol cannot recover from a failure of both the coordinator and a cohort member during the Commit phase.

3 PC is an extension of 2 Phase Commit in which the commit phase is split into two phases. 3 Phase Commit is designed for fault tolerance when the coordinator or any of the other services goes down, by using the prepare-to-commit phase.

 

Image Source: https://xenovation.com/images/articles/development/java/2and3PhaseCommit/3-phase-commit-protocol.png

  1. If the transaction coordinator fails before sending the prepare-to-commit command to the microservices, then the other services will assume that the operation was aborted
  2. The coordinator will not send a doCommit message to the services until all of them have acknowledged the prepare-to-commit command
  3. This makes sure that none of the services is blocked waiting for other services

Failure Scenario

  1. The pre-commit stage helps the system recover when the coordinator, a service, or both fail during the commit phase
  2. When a new transaction coordinator takes over after the coordinator has failed during the commit phase, it queries all the services to see which state they are in
  3. If the services are in the commit phase, then the new coordinator knows that the previous coordinator issued the commit command before crashing
  4. If any of the services did not receive the prepare-to-commit command, then the new coordinator knows that the previous coordinator crashed before completing the prepare-to-commit phase
  5. Hence it can safely abort the transaction

Drawbacks of Three-Phase Commit

  • 3 PC has to be implemented carefully to make sure that network partitioning does not cause inconsistencies in the transaction
  • 3 PC has more overhead as it involves one more step
 

The need for asynchronous pattern

While the Two-Phase Commit and Three-Phase Commit work for distributed transactions across microservices, they are not efficient, as they are blocking and synchronous in nature.

Generally, a database system completes a transaction within 50 ms. But microservices have a long delay, as the transaction hops across different services via RPC, and hence holding locks for a long time becomes a bottleneck for the system. Deadlocks could also arise between transactions in the synchronous patterns.

All these drawbacks paved the way for the asynchronous approach using the Saga pattern, which relies on eventual consistency and does not maintain atomicity.

 

In this article, we saw what a transaction is and how it works in a Monolith and in Microservices. We also saw what Two-Phase Commit and Three-Phase Commit are and their drawbacks.

In the next article, we will explore the Saga-based patterns, which are asynchronous and reactive.

 

Thanks for reading and stay tuned!!!

If you would like to get more updates from me, please follow me on Medium and subscribe to email alerts. If you are considering buying a Medium membership, please buy through my referral link https://dineshchandgr.medium.com/membership
 

Distributed Transaction Management in Microservices — Part 2 — Saga Pattern

In this article, we are going to look at the Saga pattern, an asynchronous pattern that performs a sequence of local transactions in each microservice and publishes messages or events to proceed to the next step. If any step fails in between, the Saga pattern executes compensating steps to reverse the transaction.

 

Image Source: https://docs.microsoft.com/en-us/azure/architecture/reference-architectures/saga/images/saga-overview.png

We can see from the above diagram that the Saga pattern performs a sequence of local transactions in each service. Each service updates its database and then publishes a message or event which triggers the next local transaction.

 

Image Source: https://www.baeldung.com/wp-content/uploads/sites/4/2021/04/saga-pattern.png

So we have to write the logic to commit the transaction and also have a mechanism to reverse the flow when something goes wrong anywhere in the transaction. All the transactions and compensating transactions happen through listeners that listen to an event or a message from the queue.

The compensating transaction must be idempotent and should be retryable if the previous attempt fails, as sketched below.
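
As an illustration of the idempotency requirement, here is a minimal sketch of a compensating listener, assuming a generic message-listener callback and a hypothetical refund operation; the key point is that redelivering or retrying the same event must not apply the compensation twice.

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical event emitted when a later saga step fails and the payment must be reversed.
record PaymentFailedEvent(String sagaId, String paymentId) {}

class PaymentCompensationListener {

    // Remembering processed saga ids makes the compensation idempotent,
    // so duplicate deliveries or retries do not refund twice.
    private final Set<String> processed = ConcurrentHashMap.newKeySet();

    void onPaymentFailed(PaymentFailedEvent event) {
        if (!processed.add(event.sagaId())) {
            return; // already compensated: safe to ignore the duplicate delivery
        }
        refund(event.paymentId()); // compensating transaction for the earlier local commit
    }

    private void refund(String paymentId) {
        // Reverse the earlier local transaction, e.g. mark the payment as refunded.
    }
}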

The Saga pattern can be implemented in two ways, namely

  • Choreography-Based Saga Pattern
  • Orchestration-Based Saga Pattern
 

Choreography-Based Saga Pattern

In the Choreography pattern, there is no centralized point of control; the services listen to a message broker, read the messages, and emit events/messages back to the queue for other services to listen to and consume.

 

Image Source: https://chrisrichardson.net/i/sagas/Create_Order_Saga.png

The following steps are executed in the Choreography pattern

  1. Order Service gets the request from a user to create an order and hence the order is created in PENDING state
  2. After creating the order in Pending state, it emits an event in the Order Events Channel or Queue
  3. The Customer Service listens to the event and attempts to reserve the credit
  4. It then emits an event back to the Order Service
  5. The Order Service now approves or rejects the order based on the outcome of the event received from the Customer Service (a minimal sketch of this flow follows below)
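
The sketch below illustrates the choreography steps above, assuming a generic publish/subscribe broker abstraction rather than any specific messaging library; the service, event, and method names are hypothetical and simply mirror the steps listed.

import java.util.function.Consumer;

// Hypothetical broker abstraction: each service only publishes and subscribes.
interface EventChannel {
    void publish(String eventType, String orderId);
    void subscribe(String eventType, Consumer<String> handler);
}

class OrderService {
    private final EventChannel channel;

    OrderService(EventChannel channel) {
        this.channel = channel;
        // Step 5: listen for the Customer Service's verdict and finish the order.
        channel.subscribe("CreditReserved", this::approveOrder);
        channel.subscribe("CreditLimitExceeded", this::rejectOrder);
    }

    // Steps 1-2: create the order in PENDING state and emit an event for it.
    void createOrder(String orderId) {
        // ... persist the order with status PENDING ...
        channel.publish("OrderCreated", orderId);
    }

    void approveOrder(String orderId) { /* mark the order APPROVED */ }
    void rejectOrder(String orderId)  { /* mark the order REJECTED */ }
}

class CustomerService {
    CustomerService(EventChannel channel) {
        // Steps 3-4: try to reserve credit and emit the outcome back to the queue.
        channel.subscribe("OrderCreated", orderId -> {
            boolean reserved = tryReserveCredit(orderId);
            channel.publish(reserved ? "CreditReserved" : "CreditLimitExceeded", orderId);
        });
    }

    private boolean tryReserveCredit(String orderId) {
        return true; // domain logic: check and reserve the customer's credit limit
    }
}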

Advantages of Choreography Pattern

  • The Choreography pattern is easy to implement
  • This is good for use cases with few services participating in the transaction
  • There is no Single point of failure as the transaction is distributed across services

Drawbacks of Choreography Pattern

  • When more services are added to the transaction, the Choreography pattern gets complicated
  • Eg: If the Order Service emits an event to the Customer Service, the Customer Service emits an event to the Payment Service, and so on, then the workflow gets bigger and becomes difficult to track and debug
  • There might be scenarios where two services might be waiting for each other’s events and end up in a deadlock state
 

Orchestrator-Based Saga Pattern

In the Orchestration-based saga pattern, there is a centralized component that controls the transaction and this component is called the Orchestrator. It issues commands to the services and decides to commit or abort the transaction based on the outcome received from the saga participants.

If any of the microservices fail, the orchestrator will invoke the compensating transactions. The Orchestrator can exist inside the microservice that triggers the flow of transactions or the orchestrator can exist as a separate component outside the service

  1. Orchestrator as a stand-alone component

As we see in the diagram below, the Orchestrator is a stand-alone service that interacts with other services that are involved in the transaction. Every service receives the command from the orchestrator and emits the event response back to the orchestrator.

 

Image Source: https://docs.microsoft.com/en-us/azure/architecture/reference-architectures/saga/images/orchestrator.png

2. Orchestrator as a component inside Order Service

As we see in the below diagram, the Create Order Saga is the Orchestrator component that resides inside the Order Service

 

Image Source: https://chrisrichardson.net/i/sagas/Create_Order_Saga_Orchestration.png

The following steps are executed in the orchestrator pattern

  1. When the user places the order, the endpoint in Order controller is invoked, which then calls the Order Service
  2. The order service creates the Create Order Saga Orchestrator, which is nothing but an orchestrator (object) that tells the microservices what local transactions to execute
  3. The saga orchestrator creates the Order in PENDING state
  4. It will then send the Reserve Credit command to the Customer Service
  5. The Customer Service attempts to reserve the credit and sends the outcome back to the Saga orchestrator
  6. The saga orchestrator approves or rejects the order based on the outcome (a minimal orchestrator sketch follows below)
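
The following is a minimal sketch of the Create Order Saga orchestrator described in the steps above, assuming hypothetical client interfaces for the participating services; real implementations typically exchange commands and replies over a message broker rather than direct calls.

// Hypothetical views of the saga participants as seen by the orchestrator.
interface OrderServiceClient {
    String createPendingOrder();           // returns the new order id (step 3)
    void approve(String orderId);
    void reject(String orderId);
}
interface CustomerServiceClient {
    boolean reserveCredit(String orderId); // true if the credit could be reserved
}

class CreateOrderSagaOrchestrator {

    private final OrderServiceClient orders;
    private final CustomerServiceClient customers;

    CreateOrderSagaOrchestrator(OrderServiceClient orders, CustomerServiceClient customers) {
        this.orders = orders;
        this.customers = customers;
    }

    // The orchestrator alone decides whether the saga completes or is compensated.
    void createOrder() {
        String orderId = orders.createPendingOrder();         // step 3: order in PENDING state
        boolean reserved = customers.reserveCredit(orderId);  // steps 4-5: Reserve Credit command
        if (reserved) {
            orders.approve(orderId);                          // step 6: approve the order
        } else {
            orders.reject(orderId);                           // compensating action: reject it
        }
    }
}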

Advantages of Orchestrator Pattern

  • Unlike the Choreography pattern, the Orchestrator pattern can be used for complex workflows and any new services can be added as part of the transaction and can be managed
  • Having an orchestrator as a transaction coordinator helps to control the transaction and the flow of activities
  • No possibility of cyclic dependencies
  • Saga participants, i.e. the microservices, are independent and do not know about the other services, which keeps the business logic separated

Drawbacks of Orchestrator Pattern

  • There is a design complexity when compared to the Choreography pattern as we have the additional implementation of a coordination logic.
  • There is a single point of failure here because the orchestrator manages the complete workflow.
 

In this article, we saw what an asynchronous pattern for transaction management in microservices is and explored the Saga pattern and its two variants, namely Choreography-based and Orchestrator-based. We went in depth on the two patterns and then discussed their benefits and drawbacks. Based on the use case, one of the patterns can be chosen for your application.

The Saga pattern enables an application to maintain data consistency across multiple services by using local transactions instead of distributed transactions. But this programming model is a bit complicated and takes more time, as we have to write the compensating transaction logic across all the services.

Overall, the asynchronous Saga patterns are much better and more efficient for microservices when compared to synchronous patterns like Two-Phase Commit (2 PC) and Three-Phase Commit (3 PC). But they have their own drawbacks, and handling distributed transactions across multiple microservices is always problematic in terms of ACID compliance; hence, distributed transactions should be avoided if possible.

 

Thanks for reading and stay tuned!!!


DevOps 🔥🔥🔥

HERE ARE YOUTUBE LINKS TO LEARN DEVOPS

HAPPY LEARNING 📌


1. Linux:
Basic Linux commands are necessary before jumping into shell scripting.

https://lnkd.in/dbXSCPdz
https://lnkd.in/dVxnu9Jm
https://lnkd.in/dbqcD3Qm

2. Shell Scripting:

https://lnkd.in/d84mpfyW
https://lnkd.in/d2veZUtu

3. Python: This will help you in automation

https://lnkd.in/dFtNz_9D
https://lnkd.in/d6cRpFrY
https://lnkd.in/d-EhshQz

4. Networking

https://lnkd.in/dqTx6jmN
https://lnkd.in/dRqCzbkn

5. Git & Github

https://lnkd.in/d9gw-9Ds
https://lnkd.in/dEp3KrTJ

6. YAML
https://lnkd.in/duvmhd5X
https://lnkd.in/dNqrXjmV

7. Containers — Docker:

https://lnkd.in/dY2ZswMZ
https://lnkd.in/d_EySpbh
https://lnkd.in/dPddbJTf

8. Continuous Integration & Continuous Deployment (CI/CD):

https://lnkd.in/dMHv9T8U

9. Container Orchestration — Kubernetes:
https://lnkd.in/duGZwHYX

10. Monitoring:

https://lnkd.in/dpXhmVqs
https://lnkd.in/dStQbpRX
https://lnkd.in/de4H5QVz
https://lnkd.in/dEtTSsbB

11. Infrastructure Provisioning & Configuration Management (IaC): Terraform, Ansible, Pulumi

https://lnkd.in/dvpzNT5M
https://lnkd.in/dNugwtVW
https://lnkd.in/dn5m2NKQ
https://lnkd.in/dhknHJXp
https://lnkd.in/ddNxd8vU

12. CI/CD Tools: Jenkins, GitHub Actions, GitLab CI, Travis CI, AWS CodePipeline + AWS CodeBuild, Azure DevOps, etc

https://lnkd.in/dTmSXNzv
https://lnkd.in/dAnxpVTe
https://lnkd.in/daMFG3Hq
https://lnkd.in/dqf-zzrx
https://lnkd.in/diWP7Tm7
https://lnkd.in/dYDCSiiC

13. AWS:

https://lnkd.in/dmi-TMv9
https://lnkd.in/de3-dAB6
https://lnkd.in/dh2zXZAB
https://lnkd.in/dQMyCBWy


14. Learn how to SSH
SSH using mobaxterm:

https://lnkd.in/gx-T_FU8

15. SSH using Putty:

https://lnkd.in/gGgW7Ns9


Also, Learn Web Development from W3Schools.com


Like 👍 and Follow Samson O. Sanyaolu for more.

HAPPY LEARNING 📌

👑 DEVOPS ENGINEER 👑 CLOUD ARCHITECT


Retrofitting null-safety onto Java at Meta

https://engineering.fb.com/2022/11/22/developer-tools/meta-java-nullsafe/

 

 
  • We developed a new static analysis tool called Nullsafe that is used at Meta to detect NullPointerException (NPE) errors in Java code.
  • Interoperability with legacy code and a gradual deployment model were key to Nullsafe’s wide adoption and allowed us to recover some null-safety properties in the context of an otherwise null-unsafe language in a multimillion-line codebase.
  • Nullsafe has helped significantly reduce the overall number of NPE errors and improved developers’ productivity. This shows the value of static analysis in solving real-world problems at scale.

Null dereferencing is a common type of programming error in Java. On Android, NullPointerException (NPE) errors are the largest cause of app crashes on Google Play. Since Java doesn’t provide tools to express and check nullness invariants, developers have to rely on testing and dynamic analysis to improve reliability of their code. These techniques are essential but have their own limitations in terms of time-to-signal and coverage.

In 2019, we started a project called 0NPE with the goal of addressing this challenge within our apps and significantly improving null-safety of Java code through static analysis.

Over the course of two years, we developed Nullsafe, a static analyzer for detecting NPE errors in Java, integrated it into the core developer workflow, and ran a large-scale code transformation to make many million lines of Java code Nullsafe-compliant.

 

Figure 1: Percent null-safe code over time (approx.).

Taking Instagram, one of Meta’s largest Android apps, as an example, we observed a 27 percent reduction in production NPE crashes during the 18 months of code transformation. Moreover, NPEs are no longer a leading cause of crashes in both alpha and beta channels, which is a direct reflection of improved developer experience and development velocity.

The problem of nulls

Null pointers are notorious for causing bugs in programs. Even in a tiny snippet of code like the one below, things can go wrong in a number of ways:

Listing 1: buggy getParentName method

Path getParentName(Path path) {
  return path.getParent().getFileName();
}
  1. getParent() may produce null and cause a NullPointerException locally in getParentName(…).
  2. getFileName() may return null which may propagate further and cause a crash in some other place.

The former is relatively easy to spot and debug, but the latter may prove challenging — especially as the codebase grows and evolves. 

Figuring out nullness of values and spotting potential problems is easy in toy examples like the one above, but it becomes extremely hard at the scale of millions of lines of code. Then adding thousands of code changes a day makes it impossible to manually ensure that no single change leads to a NullPointerException in some other component. As a result, users suffer from crashes and application developers need to spend an inordinate amount of mental energy tracking nullness of values.

The problem, however, is not the null value itself but rather the lack of explicit nullness information in APIs and lack of tooling to validate that the code properly handles nullness.

Java and nullness

In response to these challenges, Java 8 introduced the java.util.Optional<T> class. But its performance impact and legacy API compatibility issues meant that Optional could not be used as a general-purpose substitute for nullable references.

At the same time, annotations have been used with success as a language extension point. In particular, adding annotations such as @Nullable and @NotNull to regular nullable reference types is a viable way to extend Java’s types with explicit nullness while avoiding the downsides of Optional. However, this approach requires an external checker.

An annotated version of the code from Listing 1 might look like this:

Listing 2: correct and annotated getParentName method

// (2)                          (1)
@Nullable Path getParentName(Path path) {
  Path parent = path.getParent(); // (3)
  return parent != null ? parent.getFileName() : null;
            // (4)
}

Compared to a null-safe but not annotated version, this code adds a single annotation on the return type. There are several things worth noting here:

  1. Unannotated types are considered not-nullable. This convention greatly reduces the annotation burden but is applied only to first-party code.
  2. Return type is marked @Nullable because the method can return null.
  3. Local variable parent is not annotated, as its nullness must be inferred by the static analysis checker. This further reduces the annotation burden.
  4. Checking a value for null refines its type to be not-nullable in the corresponding branch. This is called flow-sensitive typing, and it allows writing code idiomatically and handling nullness only where it’s really necessary.

Code annotated for nullness can be statically checked for null-safety. The analyzer can protect the codebase from regressions and allow developers to move faster with confidence.

Kotlin and nullness

Kotlin is a modern programming language designed to interoperate with Java. In Kotlin, nullness is explicit in the types, and the compiler checks that the code is handling nullness correctly, giving developers instant feedback. 

We recognize these advantages and, in fact, use Kotlin heavily at Meta. But we also recognize the fact that there is a lot of business-critical Java code that cannot — and sometimes should not — be moved to Kotlin overnight. 

The two languages – Java and Kotlin – have to coexist, which means there is still a need for a null-safety solution for Java.

Static analysis for nullness checking at scale

Meta’s success building other static analysis tools such as Infer, Hack, and Flow and applying them to real-world code-bases made us confident that we could build a nullness checker for Java that is: 

  1. Ergonomic: understands the flow of control in the code, doesn’t require developers to bend over backward to make their code compliant, and adds minimal annotation burden. 
  2. Scalable: able to scale from hundreds of lines of code to millions.
  3. Compatible with Kotlin: for seamless interoperability.

In retrospect, implementing the static analysis checker itself was probably the easy part. The real effort went into integrating this checker with the development infrastructure, working with the developer communities, and then making millions of lines of production Java code null-safe.

We implemented the first version of our nullness checker for Java as a part of Infer, and it served as a great foundation. Later on, we moved to a compiler-based infrastructure. Having a tighter integration with the compiler allowed us to improve the accuracy of the analysis and streamline the integration with development tools. 

This second version of the analyzer is called Nullsafe, and we will be covering it below.

Null-checking under the hood

The Java compiler API was introduced via JSR-199. This API gives access to the compiler’s internal representation of a compiled program and allows custom functionality to be added at different stages of the compilation process. We use this API to extend Java’s type-checking with an extra pass that runs the Nullsafe analysis and then collects and reports nullness errors.
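
As a rough, hypothetical illustration of this integration point (not Meta’s actual implementation), a javac plugin can register a TaskListener and run an additional pass once a compilation unit has been analyzed. This uses the public compiler plugin API and assumes JDK 9+, where TaskListener’s methods have default implementations; the plugin name NullnessDemo is made up.

import com.sun.source.util.JavacTask;
import com.sun.source.util.Plugin;
import com.sun.source.util.TaskEvent;
import com.sun.source.util.TaskListener;

// Minimal javac plugin skeleton: registered through the standard ServiceLoader
// mechanism and enabled with -Xplugin:NullnessDemo. A real nullness checker
// would do far more work in the ANALYZE hook below.
public class NullnessDemoPlugin implements Plugin {

    @Override
    public String getName() {
        return "NullnessDemo";
    }

    @Override
    public void init(JavacTask task, String... args) {
        task.addTaskListener(new TaskListener() {
            @Override
            public void finished(TaskEvent e) {
                // Run the extra pass after javac has attributed (type-checked) the unit,
                // so resolved types and annotations are available on the AST.
                if (e.getKind() == TaskEvent.Kind.ANALYZE) {
                    // walk e.getCompilationUnit() here and report nullness errors
                }
            }
        });
    }
}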

Two main data structures used in the analysis are the abstract syntax tree (AST) and control flow graph (CFG). See Listing 3 and Figures 2 and 3 for examples.

  • The AST represents the syntactic structure of the source code without superfluous details like punctuation. We get a program’s AST via the compiler API, together with the type and annotation information.
  • The CFG is a flowchart of a piece of code: blocks of instructions connected with arrows representing a change in control flow. We’re using the Dataflow library to build a CFG for a given AST.

The analysis itself is split into two phases:

  1. The type inference phase is responsible for figuring out nullness of various pieces of code, answering questions such as:
    • Can this method invocation return null at program point X?
    • Can this variable be null at program point Y?
  2. The type checking phase is responsible for validating that the code doesn’t do anything unsafe, such as dereferencing a nullable value or passing a nullable argument where it’s not expected.

Listing 3: example getOrDefault method

String getOrDefault(@Nullable String str, String defaultValue) {
  if (str == null) { return defaultValue; }
  return str;
}

 

Figure 2: CFG for code from Listing 3.

 

Figure 3: AST for code from Listing 3.

Type-inference phase 

Nullsafe does type inference based on the code’s CFG. The result of the inference is a mapping from expressions to nullness-extended types at different program points.

state = (expression × program point) → nullness-extended type

The inference engine traverses the CFG and executes every instruction according to the analysis’ rules. For a program from Listing 3 this would look like this:

  1. We start with a mapping at <entry> point: 
    • {str @Nullable String, defaultValue String}.
  2. When we execute the comparison str == null, the control flow splits and we produce two mappings:
    • THEN: {str @Nullable String, defaultValue String}.
    • ELSE: {str String, defaultValue String}.
  3. When the control flow joins, the inference engine needs to produce a mapping that over-approximates the state in both branches. If we have @Nullable String in one branch and String in another, the over-approximated type would be @Nullable String (a small join sketch follows below).

 

Figure 4: CFG with the analysis results.
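
To make the over-approximation at the join point concrete, here is a tiny sketch (not Nullsafe’s actual code) of a nullness lattice in which joining a nullable fact with a non-null fact yields nullable.

// Minimal nullness lattice: NOT_NULL is more precise than NULLABLE.
enum Nullness {
    NOT_NULL, NULLABLE;

    // Join (least upper bound) used when control-flow branches merge:
    // if either branch allows null, the merged state must allow null too.
    static Nullness join(Nullness a, Nullness b) {
        return (a == NULLABLE || b == NULLABLE) ? NULLABLE : NOT_NULL;
    }
}

class JoinExample {
    public static void main(String[] args) {
        Nullness thenBranch = Nullness.NULLABLE;  // str is @Nullable String in one branch
        Nullness elseBranch = Nullness.NOT_NULL;  // str is String in the other branch
        System.out.println(Nullness.join(thenBranch, elseBranch)); // prints NULLABLE
    }
}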

The main benefit of using a CFG for inference is that it allows us to make the analysis flow-sensitive, which is crucial for an analysis like this to be useful in practice.

The example above demonstrates a very common case where nullness of a value is refined according to the control flow. To accommodate real-world coding patterns, Nullsafe has support for more advanced features, ranging from contracts and complex invariants where we use SAT solving to interprocedural object initialization analysis. Discussion of these features, however, is outside the scope of this post.

Type-checking phase

Nullsafe does type checking based on the program’s AST. By traversing the AST, we can compare the information specified in the source code with the results from the inference step.

In our example from Listing 3, when we visit the return str node we fetch the inferred type of str expression, which happens to be String, and check whether this type is compatible with the return type of the method, which is declared as String.

 

Figure 5: Checking types during AST traversal.

When we see an AST node corresponding to an object dereference, we check that the inferred type of the receiver excludes null. Implicit unboxing is treated in a similar way. For method call nodes, we check that the inferred types of the arguments are compatible with the method’s declared parameter types. And so on.
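
For illustration only (hypothetical code, not from the post), these are the kinds of sites the checking phase would flag once a value is inferred to be nullable; the local @Nullable annotation below is a stand-in for whichever nullness annotation the checker recognizes.

import java.lang.annotation.ElementType;
import java.lang.annotation.Target;
import java.nio.file.Path;
import java.util.List;

// Stand-in nullness annotation for the sake of a self-contained example.
@Target({ElementType.METHOD, ElementType.PARAMETER, ElementType.LOCAL_VARIABLE, ElementType.TYPE_USE})
@interface Nullable {}

class CheckExamples {

    @Nullable Path getParent(Path path) {
        return path.getParent(); // Path.getParent() can return null
    }

    void examples(Path path, List<Path> roots) {
        Path parent = getParent(path);
        parent.getFileName();     // would be flagged: dereference of a value inferred @Nullable
        roots.add(parent);        // would be flagged: @Nullable argument where non-null is expected
        if (parent != null) {
            parent.getFileName(); // ok: flow-sensitive refinement makes parent non-null here
        }
    }
}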

Overall, the type-checking phase is much more straightforward than the type-inference phase. One nontrivial aspect here is error rendering, where we need to augment a type error with a context, such as a type trace, code origin, and potential quick fix.

Challenges in supporting generics

Examples of the nullness analysis given above covered only the so-called root nullness, or nullness of a value itself. Generics add a whole new dimension of expressivity to the language and, similarly, nullness analysis can be extended to support generic and parameterized classes to further improve the expressivity and precision of APIs.

Supporting generics is obviously a good thing. But extra expressivity comes as a cost. In particular, type inference gets a lot more complicated.

Consider a parameterized class Map<K, List<Pair<V1, V2>>>. In the case of non-generic nullness checker, there is only the root nullness to infer:

// NON-GENERIC CASE
Map<K, List<Pair<V1, V2>>>
//  ^
//  \--- Only the root nullness needs to be inferred

The generic case requires a lot more gaps to fill on top of an already complex flow-sensitive analysis:

// GENERIC CASE
Map<K, List<Pair<V1, V2>>>
//  ^  ^    ^    ^   ^   ^
//  \--|----|----|---|---|--- All these need to be inferred

This is not all. Generic types that the analysis infers must closely follow the shape of the types that Java itself inferred to avoid bogus errors. For example, consider the following snippet of code:

interface Animal {}
class Cat implements Animal {}
class Dog implements Animal {}

void targetType(@Nullable Cat catMaybe) {
  List<@Nullable Animal> animalsMaybe = List.of(catMaybe);
}

List.<T>of(T…) is a generic method and in isolation the type of List.of(catMaybe) could be inferred as List<@Nullable Cat>. This would be problematic because generics in Java are invariant, which means that List<Animal> is not compatible with List<Cat> and the assignment would produce an error.

The reason this code type checks is that the Java compiler knows the type of the target of the assignment and uses this information to tune how the type inference engine works in the context of the assignment (or a method argument for the matter). This feature is called target typing, and although it improves the ergonomics of working with generics, it doesn’t play nicely with the kind of forward CFG-based analysis we described before, and it required extra care to handle.

In addition to the above, the Java compiler itself has bugs (e.g., this) that require various workarounds in Nullsafe and in other static analysis tools that work with type annotations.

Despite these challenges, we see significant value in supporting generics. In particular:

  • Improved ergonomics. Without support for generics, developers cannot define and use certain APIs in a null-aware way: from collections and functional interfaces to streams. They are forced to circumvent the nullness checker, which harms reliability and reinforces a bad habit. We have found many places in the codebase where lack of null-safe generics led to brittle code and bugs.
  • Safer Kotlin interoperability. Meta is a heavy user of Kotlin, and a nullness analysis that supports generics closes the gap between the two languages and significantly improves the safety of the interop and the development experience in a heterogeneous codebase.

Dealing with legacy and third-party code

Conceptually, the static analysis performed by Nullsafe adds a new set of semantic rules to Java in an attempt to retrofit null-safety onto an otherwise null-unsafe language. The ideal scenario is that all code follows these rules, in which case diagnostics raised by the analyzer are relevant and actionable. The reality is that there’s a lot of null-safe code that knows nothing about the new rules, and there’s even more null-unsafe code. Running the analysis on such legacy code or even newer code that calls into legacy components would produce too much noise, which would add friction and undermine the value of the analyzer.

To deal with this problem in Nullsafe, we separate code into three tiers:

  • Tier 1: Nullsafe compliant code. This includes first-party code marked as @Nullsafe and checked to have no errors. This also includes known good annotated third-party code or third-party code for which we have added nullness models.
  • Tier 2: First-party code not compliant with Nullsafe. This is internal code written without explicit nullness tracking in mind. This code is checked optimistically by Nullsafe.
  • Tier 3: Unvetted third-party code. This is third-party code that Nullsafe knows nothing about. When using such code, the uses are checked pessimistically and developers are urged to add proper nullness models.

The important aspect of this tiered system is that when Nullsafe type-checks Tier X code that calls into Tier Y code, it uses Tier Y’s rules. In particular:

  A. Calls from Tier 1 to Tier 2 are checked optimistically,
  B. Calls from Tier 1 to Tier 3 are checked pessimistically,
  C. Calls from Tier 2 to Tier 1 are checked according to the Tier 1 component’s nullness.

Two things are worth noting here:

  1. According to point A, Tier 1 code can have unsafe dependencies or safe dependencies used unsafely. This unsoundness is the price we had to pay to streamline and gradualize the rollout and adoption of Nullsafe in the codebase. We tried other approaches, but extra friction rendered them extremely hard to scale. The good news is that as more Tier 2 code is migrated to Tier 1 code, this point becomes less of a concern.
  2. Pessimistic treatment of third-party code (point B) adds extra friction to the nullness checker adoption. But in our experience, the cost was not prohibitive, while the improvement in the safety of Tier 1 and Tier 3 code interoperability was real.

 

Figure 6: Three tiers of null-safety rules.

Deployment, automation, and adoption

A nullness checker alone is not enough to make a real impact. The effect of the checker is proportional to the amount of code compliant with this checker. Thus a migration strategy, developer adoption, and protection from regressions become primary concerns.

We found three main points to be essential to our initiative’s success:

  1. Quick fixes are incredibly helpful. The codebase is full of trivial null-safety violations. Teaching a static analysis to not only check for errors but also to come up with quick fixes can cover a lot of ground and give developers the space to work on meaningful fixes.
  2. Developer adoption is key. This means that the checker and related tooling should integrate well with the main development tools: build tools, IDEs, CLIs, and CI. But more important, there should be a working feedback loop between application and static analysis developers.
  3. Data and metrics are important to keep the momentum. Knowing where you are, the progress you’ve made, and the next best thing to fix really helps facilitate the migration.

Longer-term reliability impact

As one example, looking at 18 months of reliability data for the Instagram Android app:

  • The portion of the app’s code compliant with Nullsafe grew from 3 percent to 90 percent.
  • There was a significant decrease in the relative volume of NullPointerException (NPE) errors across all release channels (see Figure 7). Particularly, in production, the volume of NPEs was reduced by 27 percent.

This data is validated against other types of crashes and shows a real improvement in reliability and null-safety of the app. 

At the same time, individual product teams also reported significant reduction in the volume of NPE crashes after addressing nullness errors reported by Nullsafe. 

The drop in production NPEs varied from team to team, with improvements ranging from 35 percent to 80 percent.

One particularly interesting aspect of the results is the drastic drop in NPEs in the alpha-channel. This directly reflects the improvement in the developer productivity that comes from using and relying on a nullness checker.

Our north star goal, and an ideal scenario, would be to completely eliminate NPEs. However, real-world reliability is complex, and there are more factors playing a role:

  • There is still null-unsafe code that is, in fact, responsible for a large percentage of top NPE crashes. But now we are in a position where targeted null-safety improvements can make a significant and lasting impact.
  • The volume of crashes is not the best metric to measure reliability improvement because one bug that slips into production can become very hot and single-handedly skew the results. A better metric might be the number of new unique crashes per release, where we see n-fold improvement.
  • Not all NPE crashes are caused by bugs in the app’s code alone. A mismatch between the client and the server is another major source of production issues that need to be addressed via other means.
  • The static analysis itself has limitations and unsound assumptions that let certain bugs slip into production.

It is important to note that this is the aggregate effect of hundreds of engineers using Nullsafe to improve the safety of their code as well as the effect of other reliability initiatives, so we can’t attribute the improvement solely to the use of Nullsafe. However, based on reports and our own observations over the course of the last few years, we’re confident that Nullsafe played a significant role in driving down NPE-related crashes.

 

Figure 7: Percent NPE crashes by release channel.

Beyond Meta

The problems outlined above are hardly specific to Meta. Unexpected null-dereferences have caused countless problems in different companies. Languages like C# evolved into having explicit nullness in their type system, while others, like Kotlin, had it from the very beginning. 

When it comes to Java, there were multiple attempts to add nullness, starting with JSR-305, but none was widely successful. Currently, there are many great static analysis tools for Java that can check nullness, including CheckerFramework, SpotBugs, ErrorProne, and NullAway, to name a few. In particular, Uber walked the same path by making their Android codebase null-safe using NullAway checker. But in the end, all the checkers perform nullness analysis in different and subtly incompatible ways. The lack of standard annotations with precise semantics has constrained the use of static analysis for Java throughout the industry.

This problem is exactly what the JSpecify workgroup aims to address. JSpecify started in 2019 and is a collaboration between individuals representing companies such as Google, JetBrains, Uber, Oracle, and others. Meta has also been part of JSpecify since late 2019.

Although the standard for nullness is not yet finalized, there has been a lot of progress on the specification itself and on the tooling, with more exciting announcements following soon. Participation in JSpecify has also influenced how we at Meta think about nullness for Java and about our own codebase evolution.

