What?

Root Cause Analysis (RCA) is a problem-solving method for identifying the root causes of faults or problems. In software testing, it is used to identify the root causes of defects and to prevent them, rather than treating the symptoms.

There are two types of RCA for software defects:

  1. RCA of appearing defects: answers the question of why these defects occur during the development phase.
  2. RCA of escaped defects: answers the question of why these defects occur after a release or after promotion to the next environment (Staging or Production).

Why?

  • To recognize the root causes of issues and draw lessons that prevent them from happening again. The principle is “Prevention is better than cure”.
  • Cast the net wide: dig up the roots, then investigate and fix the underlying cause instead of just fixing the problem at the point of discovery.
  • RCA of escaped defects improves your testing skills, because you learn to recognize the patterns behind them.

When?

  • It depends on the context.
  • As a good practice, carry it out every sprint, every release, or whenever defects escape after deployment.
  • Indicators to trigger RCA sessions:
    • Bugs arise consistently
    • Many bugs in the system
    • A need to improve the development process
    • Business needs: customer complaints, stakeholder requests

How?

Set up and categorize the Root Cause 

To collect data, record the root cause in the defect tracking system, such as Jira or Bugzilla.

The categories depend on your context, but the following ones are typical.

 

Each entry below gives the root cause, its definition, and potential recommendations for prevention actions.
RCA for appearing defects during the development phase
Requirement Issue

Missing ACs or items in the requirement, or an unclear or ambiguous requirement that leads the tester and the developer to different interpretations.

  • Ask the PO to provide clearer requirements.
  • The development team (developers and testers) should be more proactive in asking questions and challenging assumptions.
  • Hold grooming meetings.
Design Issue

Wrong, overly complex, or overkill architecture design, etc.

  • Review code and architecture design.
  • Validate the design.
Coding Issue

Wrong implementation, missed items in the ACs, or lack of unit tests.

  • Write unit tests.
  • Use a checklist.
  • Run a test demo using testing notes.
  • Define the development process.
Testing Issue

Missing bugs that are in the program; reporting bugs that are not in the program; poor tracking and follow-up.

  • Provide training sessions for testers.
  • Take the “Common Software Errors” and “Bug Advocacy” courses.
Environment Issue

Differences or inconsistencies between environments, a special case in a particular environment, etc.

  • Double-check the test environment before running the tests.
  • Keep backup and restore images.
  • Use Docker containers.
Deployment Issue

Related to infrastructure, configuration, a wrong build, missing deployment steps, CI/CD issues, scaling issues, Docker container issues, etc.

  • Provide clear guidelines for each deployment.
  • Run smoke test suites.
Out of Scope

The issue requires no action to close because it was rejected by the Product team, lacks information, cannot be reproduced or does not occur, conflicts with the requirement, or results from a tester’s mistake.

  • Provide enough training for testers (domain, product knowledge, common software errors, HICCUPPS).
  • Set up a process and learn the key steps to take before submitting a defect, as taught in the Bug Advocacy course.
RCA for escaped defects after a release or promotion to the next environment
Known Issue

The issue has already been reported in the bug tracking system.

  • Review the bug review process.
  • Escalate to a higher level for discussion.
Missed in Scrum

The issue happens because of missing test cases, test cases that were not executed, or bugs missed while executing test cases on new implementations or change requests from the Scrum team.

  • Add more test cases.
  • Apply a cross-review process for test cases.
  • Leverage more automated tests to increase test coverage.
  • Conduct sharing sessions to raise awareness of integration and improve test design.
Missed in Regression Test or Unknown Impacted Areas

The issue did not occur in previous deployments but occurs in the current version.

The team misjudged the risk in its impact analysis, or did not have enough resources for regression testing due to time or budget constraints.

  • Leverage more automated tests to increase test coverage.
  • Work closely with developers to better define the impacted areas.
  • Add or update manual test cases.
  • Apply exploratory testing with risk-based testing.
Missing Requirement

Ambiguous, missing, or unclear requirement.

  • Ask the PO to provide clearer requirements.
  • The development team (developers and testers) should be more proactive in asking questions and challenging assumptions.
  • Hold grooming meetings.
Poor Coverage or Poor Test Design

This tends to be the tester’s mistake.

The issue comes from unimplemented APIs, missing test cases for implemented APIs, or API changes.

The issue is not caught by existing test scripts because of test script design issues or missing test data.

  • Add more e2e tests where test cases are missing.
  • Vary test data, configuration, settings, and conditions.
  • Review validation points and fix existing test scripts to improve test coverage.
  • Add more negative cases, uncovered cases, and validation points for API tests.
Edge Case

Cases that rarely occur but can occur. Finding them means testing outside the base assumptions and looking for ways to use a feature that were not intended.

  • Add more test cases.
  • Challenge assumptions and ask questions.
  • Run performance tests to address stability risks.
  • Think like a white-hat hacker.
Enhancement

The issue relates to usability, charisma, the user’s view, etc., that is, to what makes the product usable, useful, and successful.

  • Ask the designer and the PO to provide better UX/UI.
  • Hold training and sharing sessions to improve the whole team’s UX/UI knowledge and skills.
  • Review mockups thoroughly.
Other Team or 3rd Party

The issue comes from an integrated system or from another team that should be in charge of it.

  • Evaluate 3rd parties more carefully before using them.
  • For microservices or services implemented by other teams, use mock services, Pact testing, or service virtualization to test in isolation.
Configuration and Infrastructure

The issue comes from deployment problems: a flipped flag, configurations, special settings, etc.

It also covers scaling issues (for example AWS memory), CI/CD issues, Docker container issues, and performance issues.

  • Build a monitoring mechanism to monitor and control the system.
  • Set up flags to turn failing features or services on or off.
  • Run performance tests, including scalability tests, regularly.
  • Run an automated smoke test suite.
Invalid Issue

The issue requires no action to close.

  • N/A
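
As a rough illustration, the categories above could be modeled as a fixed set of values so that every defect is tagged consistently. This is only a sketch: the `Phase`, `RootCause`, `Defect`, and `causes_for` names and the sample defect are assumptions made for the example, not part of any tracking tool.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    DEVELOPMENT = "appearing defect"
    ESCAPED = "escaped defect"


class RootCause(Enum):
    """Root cause categories from the list above, tagged with their phase."""

    REQUIREMENT_ISSUE = ("Requirement Issue", Phase.DEVELOPMENT)
    DESIGN_ISSUE = ("Design Issue", Phase.DEVELOPMENT)
    CODING_ISSUE = ("Coding Issue", Phase.DEVELOPMENT)
    TESTING_ISSUE = ("Testing Issue", Phase.DEVELOPMENT)
    ENVIRONMENT_ISSUE = ("Environment Issue", Phase.DEVELOPMENT)
    DEPLOYMENT_ISSUE = ("Deployment Issue", Phase.DEVELOPMENT)
    OUT_OF_SCOPE = ("Out of Scope", Phase.DEVELOPMENT)
    KNOWN_ISSUE = ("Known Issue", Phase.ESCAPED)
    MISSED_IN_SCRUM = ("Missed in Scrum", Phase.ESCAPED)
    MISSED_IN_REGRESSION = ("Missed in Regression Test or Unknown Impacted Areas", Phase.ESCAPED)
    MISSING_REQUIREMENT = ("Missing Requirement", Phase.ESCAPED)
    POOR_COVERAGE = ("Poor Coverage or Poor Test Design", Phase.ESCAPED)
    EDGE_CASE = ("Edge Case", Phase.ESCAPED)
    ENHANCEMENT = ("Enhancement", Phase.ESCAPED)
    OTHER_TEAM_OR_3RD_PARTY = ("Other Team or 3rd Party", Phase.ESCAPED)
    CONFIG_AND_INFRA = ("Configuration and Infrastructure", Phase.ESCAPED)
    INVALID_ISSUE = ("Invalid Issue", Phase.ESCAPED)

    def __init__(self, label, phase):
        # Enum unpacks each tuple value into these two attributes.
        self.label = label
        self.phase = phase


@dataclass
class Defect:
    key: str  # hypothetical tracker ID
    summary: str
    root_cause: RootCause


def causes_for(phase):
    """Return the root cause categories that belong to one phase."""
    return [cause for cause in RootCause if cause.phase is phase]


# Hypothetical sample defect, for illustration only.
example = Defect("SHOP-101", "Checkout total ignores discount", RootCause.CODING_ISSUE)
```

Keeping the list closed makes later counting and filtering reliable, since free-text root causes tend to drift into near-duplicates.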

 

Where to get data

Pull the data from your defect tracking tool. If you use Jira, you can easily create queries to filter the data.
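
A Jira query of this kind might look like the sketch below, which only builds a JQL string. It assumes the root cause is stored in a custom field named "Root Cause"; that field name, the `build_rca_jql` helper, and the `SHOP` project key are hypothetical and will differ in your setup.

```python
def build_rca_jql(project, root_cause, since_days=30, field_name="Root Cause"):
    """Build a JQL string selecting recent bugs with a given root cause.

    field_name is an assumed custom field; adjust it to your Jira instance.
    """
    return (
        f'project = "{project}" AND issuetype = Bug '
        f'AND "{field_name}" = "{root_cause}" '
        f'AND created >= -{since_days}d ORDER BY created DESC'
    )


# Hypothetical project key, for illustration only.
print(build_rca_jql("SHOP", "Coding Issue"))
```

The resulting string can be pasted into the Jira issue search or saved as a filter for each RCA session.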

Process

  1. Determine the problem
  2. Identify the possible causes
  3. Propose preventive and corrective actions
  4. Analyze and monitor the effectiveness and usefulness of the actions
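
For step 4, one simple way to monitor whether the actions work is to compare defect counts per root cause before and after they were introduced. The sketch below assumes you already have the root cause of each defect as a plain string; the `root_cause_trend` helper and the sample data are hypothetical.

```python
from collections import Counter


def root_cause_trend(before, after):
    """Return {root cause: change in defect count} between two periods."""
    before_counts = Counter(before)
    after_counts = Counter(after)
    causes = sorted(set(before_counts) | set(after_counts))
    # Counter returns 0 for a cause that is absent from one period.
    return {cause: after_counts[cause] - before_counts[cause] for cause in causes}


# Hypothetical data: root causes of defects in two consecutive releases.
previous_release = ["Coding Issue"] * 5 + ["Requirement Issue"] * 3
current_release = ["Coding Issue"] * 2 + ["Requirement Issue"] * 3 + ["Edge Case"]
trend = root_cause_trend(previous_release, current_release)
```

A negative number suggests the preventive actions for that category are helping; a flat or rising number is a cue to revisit them.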

Tips

  • Collect the right information
  • Create a fear-free incident reporting environment; there is no need to use MIP (Mention in Person).
  • Get everyone to buy in, but make sure that RCA uncovers faults in the process rather than in individuals. Build a positive atmosphere of collaboration to find the root cause and appropriate actions, and avoid dwelling on who caused the problem.
  • Ask questions: use the 5-Whys technique and/or a Fishbone diagram.
  • Apply the Pareto (80-20) rule: about 80% of issues come from 20% of causes, so focus your attention where you can have the greatest impact.
  • Leverage technology rather than cumbersome manual work to be most effective.
  • To reduce escaped defects and increase your bug detection capability, you can refer to the lessons learned in the following two books to generate better test ideas.
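
The Pareto tip above can be sketched as a small calculation: count defects per root cause, then keep the largest categories until they cover roughly 80% of all defects. The `pareto_vital_few` helper, the 0.8 threshold, and the sample data are assumptions made for illustration.

```python
from collections import Counter


def pareto_vital_few(root_causes, threshold=0.8):
    """Return the largest categories that together reach the threshold share."""
    counts = Counter(root_causes).most_common()  # sorted by count, descending
    total = sum(n for _, n in counts)
    selected, cumulative = [], 0
    for cause, n in counts:
        if total and cumulative / total >= threshold:
            break  # the categories picked so far already cover the threshold
        selected.append((cause, n))
        cumulative += n
    return selected


# Hypothetical data: one root cause label per defect.
defects = (
    ["Coding Issue"] * 9
    + ["Requirement Issue"] * 8
    + ["Environment Issue"] * 2
    + ["Edge Case"]
)
vital_few = pareto_vital_few(defects)
```

With this sample data, two of the four categories account for 17 of the 20 defects, so those are the ones to address first.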

Thao Vo
