Skip to content


As part of our commitment to enhancing the QA workflow at our client MTV, our team embarked on a project to present an AI-based solution that would automate the daily routine of QA engineers, i.e. help Test managers and QA engineers to create test cases based on the requirements of the project, and finally export those as a .cvs file to test management applications.

This document outlines the task at hand, the intended product, and the process of comparing, selecting, and developing scripts which use modern AI models.

To achieve this goal, our team conducted an in-depth analysis of the available modern AI models, evaluating their capabilities, performance, and suitability for our specific requirements. This comparative study involved an assessment of each model’s strengths and weaknesses, as well as its potential to integrate seamlessly with the existing QA infrastructure.

After this, we proceeded to the evaluation phase. During this stage, various factors were considered, such as efficiency, ease of implementation, and long-term viability, to determine which model(s) will form the foundation of our AI-based QA solution.

The introduction has been generated by Claude 3


In our opinion, there were two plausible options for the technology approach:

1. Using one of the commercial AI-assisted software testing options “out of the box”

While software like TestRigor, Tricentis, Testim, Functionize, and Applitools provide promising solutions, they often come with a heavy approach, at least for this specific need. These tools are made for QA purposes. They have a lot of features and specialized functionalities for the needs of quality assurance. These tools will develop rapidly and it will be very interesting to follow up how and when. On the other hand, to effectively benefit from these kind of tools, you have to adapt your processes for them. You should be committed to the use of the software and be dependent on the vendor in a longer run.

2. Leveraging AI large language models, for example MistralAI, Llama2, Claude, GPT4, and Copilot

AI models are mathematical algorithms trained on data to make predictions or decisions without being explicitly programmed. They offer more flexibility and customization, for example, you can set up your own parameters or edit the prompt you use to interact with the model.

Thus, after careful consideration, we decided to go with the number 2 approach, i.e. direct interaction with the AI models by ourselves. 

Requirements to the Case Solution

The proposed solution can be served as a proof-of-concept for the product which can be potentially used in production

Because of this, there were two main requirements:

  1. An output should be compatible with Testrail (a Jira-compatible test management tool). An official instruction from the website can be used as a reference. So, as a result, there should be a .csv file with all the necessary columns for the test cases (test case title, precondition, description, steps, etc)
  2. Security. Any input (requirements, prompts, design pictures) should be protected from getting leaked.

Models Overview

At this point, we conducted an in-depth analysis of leading AI models, assessing their capabilities and performance.

Our comparison of the basic features of the large language models chosen.
  • MistralAI provides a huge variety of Large Language models. The Medium one will be used. It has emerged as a top contender, offering impressive results, ease of integration and flexibility. Subscription is required to get the API key. The service is not free, however it is not too expensive. The Mistral medium is in the top of the leaderboard, see
  • Llama2, while resource-intensive, provides thorough test case generation. Llama2 provides a lot of different models available for free. Unfortunately, some of them (for example, 13-b-chat) require way too much resources. However, for a 7b-instruct model is enough for a machine with 18gb of RAM, although it is quite slow. The most convenient way to install LLama’s models is to use ollama to your Linux machine.
  • Claude boasts superior reasoning abilities. Claude is an AI assistant created by Anthropic. It requires an API key, however, there is a free option worth 5 euros. Claude 3 Opus model, which will be used for the task, was introduced in March, 2024. According to the model description, Its undergraduate level knowledge and graduate level reasoning are outperforming. Claude is SOC II Type 2 certified.
  • GPT4 is a swiss knife, when it comes to AI. It excels in writing tests based on images. You can ask GPT4 to write tests based on an image, and then ask it to create a TestRail compatible csv file from those tests. Only additional info required will be the extra columns required. By default won’t prioritize different tests or list automation types. The model manages these when asked.
  • Copilot offers convenient copying and exporting features, albeit with occasional stability issues. Copilot enables you to copy from a button of the results as well as exporting it to excel, however you are not accepted to place that excel back into Copilot for further questions. Copilot accepts a jpg format prompt to be placed in it and further functions requested of it.

Software Description and Materials

To properly gauge different AI models’ capabilities, we tasked them to create a list of test cases for the example of the Login window. It also allows users to restore a password and create a new account. This feature was chosen due to how common it is.

To test these AI models, we provided inputs such as text prompts, design screenshots, lists of use cases, and requirements. These inputs simulate real-world scenarios, allowing us to evaluate the models’ efficacy.

The inputs we used:

  • A short text prompt, which is a brief query to the AI model. For example, “I need to test this login screen. Help me write test cases. Write me 20 test cases that you think are the most essential when testing a login screen like this”.
  • A long text prompt. It is quite thorough and contains all the info about the desired output. The prompt is accessible, for example, here.
  • A design screenshot. Although some models (at least Mistral and Llama2) do not contain a built-in “text reader”, this can be solved by using other libraries, for example Cloudmersive OCR API.
  • Use cases and requirements.

Output evaluation criteria

Then we evaluated the outputs based on the following criteria:  

  • How good does the model understand the task or, in other words, how does the result correlate to the prompt (not related to the prompt at all/somehow related/a good answer which can be similar to the expert one)
  • Speed of generation (instant/several minutes/you can go have a cup of tea)
  • How easy is it to begin (just open a website and start chatting/you have to create an API key and run a script/you have to download a model, develop a script, install some libraries…)
  • How convenient it is to import to testrail. (CSV is ready/some preprocessing is needed/not ready at all, it is easier to write by a human)
  • Test coverage, are all the main cases covered (yes, there is nothing to add /some of the main cases are covered/not really)? Are all the testing types involved (functional, negative, usability, compatibility, security)?
  • Prioritisation and how correct is it given, main functional cases are in the beginning of the list (yes/sometimes/no, cases are mixed up)

Each criteria can be evaluated from 0 to 2.


Finally, we compared the output of each model, based on the above criteria, including understanding of the task, speed of generation, ease of import into TestRail, test coverage, and prioritisation. 

As a baseline to compare with, a list of the test cases written by an expert was used. The main difference is that the expert keeps in mind all the testing types (also load and accessibility tests), while AI models mostly focus on positive and functional tests. However, it can be fixed by specifying more testing types in the prompt and using more tokens when setting up the AI model query.

After thorough testing and scoring, in the end, GPT4 and Claude emerged as the most productive models, offering comprehensive and relevant results.

Relevant resultsSpeedEasiness to importEasiness to startTest coveragePrioritiesSum
MistralAI Medium21212210
Llama2 7b1000113
Results comparison of the AI models we tested

After a comparison between the model results, the following patterns can be noticed:

  • Until you specify a desired output in detail, the output will be quite common and not ready for the import. The same goes with the amount of test cases.
  • Llama2 does not follow the requested .csv structure (for example, it does not use semicolon as a separator and adds some extra text to the output) in spite of the unambiguous prompt.
  • The most thorough output is provided if you give a list of use cases as a part of the input.
  • For some reason, Copilot does not always work in a stable way: for example, after giving it a .pdf file, it does not generate all the requested test cases.


From left, VALA’s Joonas Luukkonen, Tommy Johansson and Ivan Osipov presenting their findings at MTV.

Looks like we have achieved our goals. Solutions have been found that satisfy the requirements: the most productive AI models are GPT-4 and Claude. 

By using a comprehensive prompt with all the required columns mentioned, you can generate a .csv file that can be easily imported into TestRail. In other words, AI models can automate a huge part of the daily routine if you need to, for example, write steps for each test case.

However, expert supervision may still be required at times. For example, AI models can be buggy, may not fully understand the purpose of the task, or be overly precise (such as providing 5 test cases for browser compatibility testing).

The most important lesson learned is that an AI model is still a machine and cannot think instead of you. The more detailed the task description you provide, the more relevant the result you will receive.

Possible next steps could involve developing a user-friendly interface for seamless interaction with AI models and exploring their capabilities in test automation. This being a proof-of-concept, we could only do this minor assignment, but in a full-sized project next steps would be to integrate the AI as part of the project management software and let AI also help with the actual automation of the test cases themselves. 

Gen AI is great for testing. But to use it well, we need to plan and work together. Every organization is different, so we need to figure out how to use Gen AI for our needs. We should set some rules and be ready to learn as things change fast in AI. It’s important to keep up and be open to new ideas. By doing this, we can make testing easier and better.

Want our help with AI in testing? Or want us to come speak at your event?