How to automate the daily routine of QA engineers – An AI based solution example

26.03.2024

Introduction

As part of our commitment to enhancing the QA workflow at our client MTV, our team embarked on a project to present an AI-based solution that would automate the daily routine of QA engineers: helping test managers and QA engineers create test cases from the project requirements and export them as a .csv file for import into test management applications.

This document outlines the task at hand, the intended product, and the process of comparing, selecting, and developing scripts which use modern AI models.

To achieve this goal, our team conducted an in-depth analysis of the available modern AI models, evaluating their capabilities, performance, and suitability for our specific requirements. This comparative study involved an assessment of each model’s strengths and weaknesses, as well as its potential to integrate seamlessly with the existing QA infrastructure.

After this, we proceeded to the evaluation phase. During this stage, we considered factors such as efficiency, ease of implementation, and long-term viability to determine which model(s) would form the foundation of our AI-based QA solution.

The introduction has been generated by Claude 3

Approaches

In our opinion, there were two plausible options for the technology approach:

1. Using one of the commercial AI-assisted software testing options “out of the box”

While software like TestRigor, Tricentis, Testim, Functionize, and Applitools provides promising solutions, it is often heavyweight, at least for this specific need. These tools are purpose-built for QA and come with many features and specialized functionality. They are developing rapidly, and it will be interesting to follow their progress. On the other hand, to benefit from these kinds of tools you have to adapt your processes to them, commit to the software, and accept dependence on the vendor in the long run.

2. Leveraging AI large language models, for example MistralAI, Llama2, Claude, GPT4, and Copilot

AI models are mathematical algorithms trained on data to make predictions or decisions without being explicitly programmed. They offer more flexibility and customization: for example, you can set your own parameters or edit the prompt you use to interact with the model.

Thus, after careful consideration, we chose the second approach, i.e. interacting with the AI models directly ourselves.

Requirements for the Case Solution

The proposed solution is intended as a proof of concept for a product that could potentially be used in production.

Because of this, there were two main requirements:

Compatibility. The output should be compatible with TestRail (a Jira-compatible test management tool). The official instructions from the website can be used as a reference. The result should therefore be a .csv file with all the necessary columns for the test cases (test case title, precondition, description, steps, etc.).
Security. Any input (requirements, prompts, design pictures) should be protected from leaking.
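
As a minimal sketch of the first requirement, the snippet below writes test cases to semicolon-separated CSV text with Python’s standard csv module. The column names, the helper name write_test_cases and the sample test case are illustrative assumptions, not the configuration used in the project; the actual columns have to match the import settings of your own TestRail instance.

```python
import csv
import io

# Hypothetical column set, taken from the fields named in the requirement.
COLUMNS = ["Title", "Precondition", "Description", "Steps"]


def write_test_cases(test_cases, stream, delimiter=";"):
    """Write a list of dicts as CSV rows, one test case per row."""
    writer = csv.DictWriter(stream, fieldnames=COLUMNS, delimiter=delimiter)
    writer.writeheader()
    for case in test_cases:
        # Missing fields become empty cells instead of raising an error.
        writer.writerow({column: case.get(column, "") for column in COLUMNS})


# Illustrative test case (not taken from the study).
example_cases = [
    {
        "Title": "Login with valid credentials",
        "Precondition": "A registered user exists",
        "Description": "The user can sign in",
        "Steps": "1. Open the login window\n2. Enter valid credentials\n3. Click Login",
    },
]

buffer = io.StringIO()
write_test_cases(example_cases, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Multi-line fields such as the steps are quoted automatically by the csv module, so each test case stays one logical row.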

Models Overview

At this point, we conducted an in-depth analysis of leading AI models, assessing their capabilities and performance.

Below is our comparison of the basic features of the chosen large language models.

MistralAI provides a wide variety of large language models; we used the Medium one. It has emerged as a top contender, offering impressive results, ease of integration, and flexibility. A subscription is required to get the API key. The service is not free, but it is not too expensive. Mistral Medium is near the top of the leaderboard, see https://chat.lmsys.org/
Llama2, while resource-intensive, provides thorough test case generation. Llama2 offers many different models for free. Unfortunately, some of them (for example, 13-b-chat) require far too many resources. The 7b-instruct model, however, runs on a machine with 18 GB of RAM, although it is quite slow. The most convenient way to install Llama’s models is to use ollama on your Linux machine.
Claude boasts superior reasoning abilities. Claude is an AI assistant created by Anthropic. It requires an API key; however, a free option worth 5 euros is available. The Claude 3 Opus model, which we used for the task, was introduced in March 2024. According to the model description, it stands out for its undergraduate-level knowledge and graduate-level reasoning. Claude is SOC 2 Type II certified.
GPT4 is the Swiss Army knife of AI. It excels at writing tests based on images. You can ask GPT4 to write tests based on an image and then ask it to create a TestRail-compatible CSV file from those tests. The only additional information it needs is the list of extra columns required. By default, it does not prioritize tests or list automation types, but it handles both when asked.
Copilot offers convenient copying and exporting features, albeit with occasional stability issues. Copilot lets you copy the results with a button and export them to Excel; however, you cannot upload that Excel file back into Copilot for further questions. Copilot accepts an image in .jpg format as part of the prompt and can carry out further requests based on it.

Software Description and Materials

To properly gauge the capabilities of the different AI models, we tasked them with creating a list of test cases for an example Login window, which also allows users to restore a password and create a new account. This feature was chosen because it is so common.

To test these AI models, we provided inputs such as text prompts, design screenshots, lists of use cases, and requirements. These inputs simulate real-world scenarios, allowing us to evaluate the models’ efficacy.

The inputs we used:

A short text prompt, which is a brief query to the AI model. For example, “I need to test this login screen. Help me write test cases. Write me 20 test cases that you think are the most essential when testing a login screen like this”.
A long text prompt. It is quite thorough and contains all the info about the desired output. The prompt is accessible, for example, here.
A design screenshot. Although some models (at least Mistral and Llama2) do not contain a built-in “text reader”, this can be solved by using other libraries, for example Cloudmersive OCR API.
Use cases and requirements.
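
To illustrate what a “long” prompt might spell out, here is a small sketch that assembles one from its parts. The wording, the function name build_prompt and the column list are assumptions made for illustration; this is not the prompt used in the comparison.

```python
def build_prompt(feature, columns, testing_types, case_count, delimiter=";"):
    """Assemble a detailed prompt that spells out the desired output format."""
    lines = [
        f"Write {case_count} test cases for the following feature: {feature}.",
        "Cover these testing types: " + ", ".join(testing_types) + ".",
        "List the most important functional cases first.",
        f"Return only CSV text, using '{delimiter}' as the separator, "
        "with exactly these columns: " + delimiter.join(columns) + ".",
        "Do not add any text before or after the CSV.",
    ]
    return "\n".join(lines)


# Hypothetical values; adjust them to the feature and the import template.
prompt = build_prompt(
    feature="a login window with password recovery and account creation",
    columns=["Title", "Precondition", "Description", "Steps"],
    testing_types=["functional", "negative", "usability", "compatibility", "security"],
    case_count=20,
)
print(prompt)
```

Keeping the number of cases, the testing types and the columns as parameters makes it easy to reuse the same prompt skeleton for other features.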

Output evaluation criteria

Then we evaluated the outputs based on the following criteria:

How well the model understands the task, in other words, how well the result matches the prompt (not related to the prompt at all / somewhat related / a good answer comparable to an expert’s)
Speed of generation (instant / several minutes / you can go have a cup of tea)
How easy it is to get started (just open a website and start chatting / you have to create an API key and run a script / you have to download a model, develop a script, install some libraries…)
How convenient the output is to import into TestRail (the CSV is ready / some preprocessing is needed / not ready at all, it is easier for a human to write it)
Test coverage: are all the main cases covered (yes, there is nothing to add / some of the main cases are covered / not really)? Are all the testing types involved (functional, negative, usability, compatibility, security)?
Prioritisation: how correctly is it assigned, and are the main functional cases at the beginning of the list (yes / sometimes / no, cases are mixed up)?

Each criterion is scored from 0 to 2.

Results

Finally, we compared the output of each model, based on the above criteria, including understanding of the task, speed of generation, ease of import into TestRail, test coverage, and prioritisation.

As a baseline for comparison, we used a list of test cases written by an expert. The main difference is that the expert keeps all the testing types in mind (including load and accessibility tests), while AI models mostly focus on positive and functional tests. However, this can be addressed by specifying more testing types in the prompt and allowing more tokens when setting up the AI model query.

After thorough testing and scoring, GPT4 and Claude emerged as the most productive models, offering comprehensive and relevant results.
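
A simple way to spot such gaps is to compare the testing types present in a generated list against the types you expect. The sketch below assumes each test case carries a "type" field, which is a hypothetical convention and not something the models produce by default; the sample data is invented for illustration.

```python
# Testing types named in this article; extend the set as needed.
REQUIRED_TYPES = {
    "functional", "negative", "usability", "compatibility",
    "security", "load", "accessibility",
}


def missing_types(test_cases, required=REQUIRED_TYPES):
    """Return the required testing types that no test case covers."""
    covered = {case["type"].strip().lower() for case in test_cases}
    return sorted(required - covered)


# Invented sample of generated test cases.
generated = [
    {"title": "Login with valid credentials", "type": "Functional"},
    {"title": "Login with wrong password", "type": "Negative"},
    {"title": "Create a new account", "type": "functional"},
]

gaps = missing_types(generated)
print(gaps)
```

The missing types can then be added to the prompt for a second generation round.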

Model	Relevant results	Speed	Ease of import	Ease of start	Test coverage	Priorities	Sum
MistralAI Medium	2	1	2	1	2	2	10
Llama2 7b	1	0	0	0	1	1	3
GPT4	2	1	2	2	2	2	11
Copilot	0	2	2	2	1	0	7
Claude	2	2	2	1	2	2	11

Results comparison of the AI models we tested
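
The sums in the table can be reproduced with a few lines of Python. The per-criterion scores below are copied from the table; the variable names are illustrative.

```python
# Scores per model in table order: relevant results, speed, import,
# start, test coverage, priorities (each from 0 to 2).
scores = {
    "MistralAI Medium": [2, 1, 2, 1, 2, 2],
    "Llama2 7b": [1, 0, 0, 0, 1, 1],
    "GPT4": [2, 1, 2, 2, 2, 2],
    "Copilot": [0, 2, 2, 2, 1, 0],
    "Claude": [2, 2, 2, 1, 2, 2],
}

# Total score per model, then models ordered from best to worst.
totals = {model: sum(values) for model, values in scores.items()}
ranking = sorted(totals.items(), key=lambda item: item[1], reverse=True)

for model, total in ranking:
    print(f"{model}: {total}")
```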

After a comparison between the model results, the following patterns can be noticed:

Unless you specify the desired output in detail, the output will be quite generic and not ready for import. The same goes for the number of test cases.
Llama2 does not follow the requested .csv structure (for example, it does not use a semicolon as the separator and adds extra text to the output) in spite of the unambiguous prompt.
The most thorough output is produced when you give a list of use cases as part of the input.
For some reason, Copilot does not always work in a stable way: for example, after being given a .pdf file, it does not generate all the requested test cases.
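
When a model adds extra text or uses the wrong separator, a light preprocessing step can separate usable rows from the rest. The sketch below is one possible approach, not the script used in the project: it keeps only lines that split into the expected number of semicolon-separated fields. The function name and the sample output are hypothetical, and the line-by-line approach does not handle quoted fields that span several lines.

```python
import csv


def extract_csv_rows(raw_output, expected_columns, delimiter=";"):
    """Split model output into rows with the expected field count and the rest."""
    kept, rejected = [], []
    for line in raw_output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = next(csv.reader([line], delimiter=delimiter))
        if len(fields) == expected_columns:
            kept.append(fields)
        else:
            rejected.append(line)
    return kept, rejected


# Invented example of a model response with extra text and a comma-separated row.
raw = (
    "Sure, here are your test cases:\n"
    "Title;Precondition;Description;Steps\n"
    "Valid login;User exists;User can sign in;Enter valid credentials and submit\n"
    "Empty password,User exists,Error is shown,Leave password empty and submit\n"
)

rows, rejected = extract_csv_rows(raw, expected_columns=4)
print(len(rows), "usable rows,", len(rejected), "lines to review")
```

Rejected lines are returned rather than dropped so that a human can review them before the import.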

Conclusions

From left, VALA’s Joonas Luukkonen, Tommy Johansson and Ivan Osipov presenting their findings at MTV.

It looks like we have achieved our goals: we found solutions that satisfy the requirements, and the most productive AI models are GPT4 and Claude.

By using a comprehensive prompt with all the required columns mentioned, you can generate a .csv file that can be easily imported into TestRail. In other words, AI models can automate a huge part of the daily routine if you need to, for example, write steps for each test case.

However, expert supervision may still be required at times. For example, AI models can be buggy, may not fully understand the purpose of the task, or be overly precise (such as providing 5 test cases for browser compatibility testing).

The most important lesson learned is that an AI model is still a machine and cannot think for you. The more detailed the task description you provide, the more relevant the result you will receive.

Possible next steps could involve developing a user-friendly interface for seamless interaction with AI models and exploring their capabilities in test automation. This being a proof-of-concept, we could only do this minor assignment, but in a full-sized project next steps would be to integrate the AI as part of the project management software and let AI also help with the actual automation of the test cases themselves.

Gen AI is great for testing. But to use it well, we need to plan and work together. Every organization is different, so we need to figure out how to use Gen AI for our needs. We should set some rules and be ready to learn as things change fast in AI. It’s important to keep up and be open to new ideas. By doing this, we can make testing easier and better.

