[CAM-5284] I can use a long error message with an External Task Created: 21/Jan/16  Updated: 20/Nov/18  Resolved: 13/Jul/16

Status: Closed
Project: camunda BPM
Component/s: engine
Affects Version/s: None
Fix Version/s: 7.6.0, 7.6.0-alpha3

Type: Feature Request Priority: L3 - Default
Reporter: Hans Hübner Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: SUPPORT
Remaining Estimate: 0 minutes
Time Spent: Not Specified
Original Estimate: 0 minutes

Attachments: File catalina.out     File postgres_engine_7.4.1-ee.sql    
Issue Links:
Dependency
Related
is related to CAM-8832 document error details of external ta... Closed

 Description   

While working on a new model, our service tasks generated errors with backtraces which were too large to be stored in the various message-related fields in the database. The engine crashed very ungracefully when that happened. Our fix was to change the table definitions so that the message fields are defined as `text` instead of `varchar(4000)`.
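The schema change described above could be scripted roughly as follows. This is a minimal sketch, not the reporter's actual procedure: the function name and the sample table are hypothetical, and it replaces every `varchar(4000)` in the script, which may be broader than the manual edit.

```python
import re

# Matches "varchar(4000)" in any letter case, with optional spaces around the length.
VARCHAR_4000 = re.compile(r"\bvarchar\s*\(\s*4000\s*\)", re.IGNORECASE)


def widen_message_columns(ddl: str) -> str:
    """Return the DDL with every varchar(4000) column type replaced by text."""
    return VARCHAR_4000.sub("text", ddl)


# Hypothetical table definition, used only to illustrate the substitution.
sample_ddl = """create table EXAMPLE_TABLE (
    ID_ varchar(64),
    ERROR_MSG_ varchar(4000)
);"""

widened_ddl = widen_message_columns(sample_ddl)
print(widened_ddl)
```

Columns with other lengths, such as `varchar(64)`, are left untouched by this pattern.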

I'm not sure if that is really the best fix. It might be better to have a limit and enforce it by the engine. In any case, running into a database error causes trouble that requires an engine restart, which should be avoided.

I tried to attach our changed `sql/create/postgres_engine_7.4.1-ee.sql` file for your consideration, but JIRA gave me the error message "No project could be found with id '10330'." Something on JIRA's end seems to be broken for file uploads. The only change that I made was the column type change as described above.



 Comments   
Comment by Robert Gimbel [ 21/Jan/16 ]

Hi Hans,

Thank you for your feedback.

You raised the issue in our Extensions project, but I think it is related to the camunda platform directly, so I moved it there.

Can you please try to attach the file again?

Thanks
Robert

Comment by Hans Hübner [ 21/Jan/16 ]

Attaching the file worked now, see above. I could not file the issue into the camunda BPM project, probably I don't have permission for that?

Comment by Hans Hübner [ 23/Jan/16 ]

This problem occurs in other places as well, see the attached backtrace. We're going to modify the Postgres schema for our needs, but it seems to me that the problem should be addressed at a different level because updating the database schema won't be a good option once an application is deployed to production.

Comment by Daniel Meyer [ 25/Jan/16 ]

Hi Hans,

> In any case, running into a database error causes trouble that requires an engine restart, which should be avoided.

that should not be the case, why did you have to restart the process engine?

Back to the core of the issue:
I gathered from the stacktrace that you are implementing an external task. The external task feature is quite new (introduced with 7.4). We will polish it based on user feedback. What we could do is employ the same pattern for External task that is also used for Jobs:

  • the message is a short(er) description of the error and has some maximum size.
  • a larger error trace can be provided as well. This has no size restriction and is stored in a separate table (ACT_GE_BYTEARRAY).
  • in addition, the API could verify the maximum length of the string instead of relying on the database.

This is how it is done for jobs and we could do it in the same way for external tasks.
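The pattern described above (a bounded message plus unbounded details) can be sketched as follows. This is only an illustration in Python, not the engine's implementation: `ErrorReport`, `build_error_report`, and the limit of 4000 characters are assumptions chosen to mirror the column size discussed in this issue.

```python
import traceback
from typing import NamedTuple

# Assumed limit; mirrors the varchar(4000) column size mentioned in this issue.
MAX_MESSAGE_LENGTH = 4000


class ErrorReport(NamedTuple):
    message: str  # short description, bounded in length
    details: str  # full trace, no size restriction


def build_error_report(exc: BaseException, max_length: int = MAX_MESSAGE_LENGTH) -> ErrorReport:
    """Split an exception into a bounded message and an unbounded detail text."""
    message = str(exc)[:max_length]
    details = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    return ErrorReport(message=message, details=details)


try:
    raise RuntimeError("x" * 10_000)
except RuntimeError as error:
    report = build_error_report(error)

print(len(report.message), len(report.details))
```

The short message would go into the size-limited column, while the details would be written to separate storage without a length limit.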

All the best,
Daniel

Comment by Hans Hübner [ 25/Jan/16 ]

Daniel,

thank you for getting back. We are in fact using external tasks, and the
problem with columns of insufficient length occurred in different
situations. When the first incident occurred, the engine completely lost
its mind and all subsequent database operations failed. I agree with you
that this should not happen, but it did. Niall looked on the screen
together with me and he can confirm.

I'm not quite as sure whether this problem can safely be addressed on a
case-by-case basis. If your code does not know how large the underlying
database columns are, for all fields, it is always possible that something
slips through and causes unrecoverable errors in your persistence layer.
For that reason, I would suggest that making those columns that do not have
a well-defined bounded maximum length in your application code be defined
as being of unlimited size in the database layout. That is at least what I
have done now, and as the issue occurred with different tables and columns,
I feel much safer this way until you can confirm that the application code
makes sure that column length overflows do not occur.

Thanks,
Hans

Comment by Daniel Meyer [ 27/Jan/16 ]

Hi Hans,

thank you for getting back to us.

> When the first incident occurred, the engine completely lost its mind and all subsequent database operations failed

Would it be possible for you to reproduce this in a unit test?
Then we can reproduce the problem and comment on it in a better way.

As an aside: camunda is architected in a way that, in theory, this cannot happen. But I want to check...

Thanks,
Daniel

Comment by Hans Hübner [ 27/Jan/16 ]

Hi Daniel,

I am unable to reproduce the complete crash, but if I encounter it again, I will let you know. It seemed to be a follow-on problem to the original issue caused by the lack of space in a column. I am currently working with a modified schema that does not have a size restriction on the various message fields (which I prefer to truncating the message anyway), so it is unlikely that I'm going to run into this by accident now.

We can live with using a non-standard schema for now, yet this is only a temporary solution and we'd hope that this bug be fixed for external tasks so that we can safely upgrade our database schema from your scripts.

Comment by Daniel Meyer [ 27/Jan/16 ]

Hi Hans,

we can validate the max-length of strings in the Java code instead of only in the DB. We will discuss internally whether and when we will do this.
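Validating the length in application code rather than in the database could look roughly like this. The sketch is in Python for brevity; `MessageTooLongError`, `validate_max_length`, and the field name `errorMessage` are hypothetical and do not correspond to engine classes.

```python
class MessageTooLongError(ValueError):
    """Hypothetical application-level error, raised before any database access."""


def validate_max_length(field_name: str, value: str, max_length: int) -> str:
    """Return value unchanged, or raise if it exceeds max_length characters."""
    if len(value) > max_length:
        raise MessageTooLongError(
            f"{field_name} has {len(value)} characters, limit is {max_length}"
        )
    return value


def try_validate(value: str) -> str:
    """Report whether a value passes the assumed 4000-character limit."""
    try:
        validate_max_length("errorMessage", value, 4000)
        return "accepted"
    except MessageTooLongError as error:
        return f"rejected: {error}"


short_result = try_validate("task failed")
long_result = try_validate("x" * 5000)
print(short_result)
print(long_result)
```

The caller then receives a descriptive application error instead of a failure from the database driver.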

Concerning the following:

> When the first incident occurred, the engine completely lost its mind and all subsequent database operations failed.

The "org.postgresql.util.PSQLException: ERROR: value too long for type character varying(255)" will only roll back the current transaction. It will not have any effect on subsequent transactions. What could have happened is that some other component, such as your custom infrastructure for performing external tasks, repeatedly retried submitting the same overly long error message. After the fix I proposed above, you would still see many exceptions, but they would be ProcessEngineExceptions instead of database errors.
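The transaction behaviour described above can be illustrated with a small, self-contained experiment. It uses SQLite from the Python standard library rather than PostgreSQL, so it is only an analogy: SQLite does not enforce `varchar` lengths, and a `CHECK` constraint stands in for the length limit. The table and column names are hypothetical.

```python
import sqlite3

# A CHECK constraint stands in for the PostgreSQL varchar(255) limit.
connection = sqlite3.connect(":memory:")
connection.execute(
    "create table task_error (id integer primary key, message text "
    "check (length(message) <= 255))"
)


def store_message(message: str) -> bool:
    """Insert one message in its own transaction; return True if it committed."""
    try:
        with connection:  # commits on success, rolls back on an exception
            connection.execute(
                "insert into task_error (message) values (?)", (message,)
            )
        return True
    except sqlite3.IntegrityError:
        return False


first = store_message("x" * 1000)        # violates the length limit
second = store_message("short message")  # a later, independent transaction
stored_rows = connection.execute("select count(*) from task_error").fetchone()[0]
print(first, second, stored_rows)
```

The failed insert is rolled back on its own, and the following transaction on the same connection still commits.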

How did you fetch and complete the external tasks?

Does that make sense to you?

All the best,
Daniel

Comment by Hans Hübner [ 27/Jan/16 ]

Hi Daniel,

it is unfortunate that we have not kept the original error message when the engine completely crashed. The message was rather scary and not the same as the ones that we've been seeing for the overly long column values. Sorry about this.

With respect to external task fetching and completion, we use a sequence like this:

GET /external-task to select a topic that we want to work on
POST /external-task/fetchAndLock to actually fetch and lock the task that we've selected
GET /process-instance/<process-instance-id>/variables to get the list of variables of the process (we don't know them in advance so we can't supply the list in the fetchAndLock call)
POST /external-task/<external-task-id>/complete to complete the task
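The four calls above can be sketched as request builders. Nothing is sent in this sketch. The base URL, the IDs, the topic, and the result variable are hypothetical, and the JSON field names (`workerId`, `maxTasks`, `topics`, `topicName`, `lockDuration`, `variables`) are assumptions that should be checked against the REST API reference for the version in use.

```python
import json
from urllib.request import Request

# Assumed base URL of the REST API; adjust to the actual deployment.
BASE_URL = "http://localhost:8080/engine-rest"


def json_post(path: str, payload: dict) -> Request:
    """Build a POST request with a JSON body (the request is not sent)."""
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return Request(BASE_URL + path, data=body, headers=headers, method="POST")


def list_tasks() -> Request:
    """Step 1: list external tasks to select a topic."""
    return Request(BASE_URL + "/external-task", method="GET")


def fetch_and_lock(worker_id: str, topic: str, lock_duration_ms: int = 60_000) -> Request:
    """Step 2: fetch and lock one task for the selected topic."""
    topics = [{"topicName": topic, "lockDuration": lock_duration_ms}]
    payload = {"workerId": worker_id, "maxTasks": 1, "topics": topics}
    return json_post("/external-task/fetchAndLock", payload)


def get_variables(process_instance_id: str) -> Request:
    """Step 3: read all variables of the process instance."""
    url = f"{BASE_URL}/process-instance/{process_instance_id}/variables"
    return Request(url, method="GET")


def complete(task_id: str, worker_id: str, variables: dict) -> Request:
    """Step 4: complete the task, passing result variables."""
    payload = {"workerId": worker_id, "variables": variables}
    return json_post(f"/external-task/{task_id}/complete", payload)


# Hypothetical identifiers, only to show the shape of each request.
requests = [
    list_tasks(),
    fetch_and_lock("worker-1", "shell ls"),
    get_variables("process-instance-1"),
    complete("task-1", "worker-1", {"exitCode": {"value": 0}}),
]
for request in requests:
    print(request.get_method(), request.full_url)
```

In a real worker, the task and process instance IDs would come from the response to the fetchAndLock call, and each request would be sent with `urllib.request.urlopen`.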

The reason why we're using additional GETs is that we use the topic name as configuration parameter for the external task. Our external task handler executes shell commands, and the topic begins with "shell" and then has the command to execute appended. Likewise, as we do not know what variables the shell command wants to look at, we are not sending any variables in the fetchAndLock call, but rather fetch all process variables and make them available to the command in the environment.
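A handler following this convention might derive the command and the environment roughly as shown below. This is a sketch under assumptions: the exact separator between "shell" and the command is not stated, so whitespace is assumed here, and the variables are assumed to arrive as a mapping of names to objects with a `value` field. All function names are hypothetical.

```python
import os
from typing import Dict, Optional

TOPIC_PREFIX = "shell"  # taken from the description above; the separator is assumed


def command_from_topic(topic: str) -> Optional[str]:
    """Return the shell command encoded in the topic, or None for other topics."""
    if not topic.startswith(TOPIC_PREFIX):
        return None
    command = topic[len(TOPIC_PREFIX):].strip()
    return command or None


def environment_from_variables(variables: Dict[str, dict]) -> Dict[str, str]:
    """Copy the current environment and add each process variable as a string."""
    env = dict(os.environ)
    for name, entry in variables.items():
        env[name] = str(entry.get("value"))
    return env


# Hypothetical topic and variables, only to show the mapping.
command = command_from_topic("shell echo hello")
env = environment_from_variables({"orderId": {"value": 42}})
print(command, env["orderId"])
```

The command could then be run with `subprocess.run(command, shell=True, env=env)`, which makes every process variable visible to the shell command.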

We'd prefer to be able to annotate the model with task parameters (i.e. have "shell" be the topic name, put the command to execute into a named external task parameter and maybe also be able to have other task parameters that'd allow us to choose one of several executors for the task), but that does not seem to be possible both from a modeler and from a model perspective at this point.

Thanks,
Hans

Comment by Hans Hübner [ 12/Jun/16 ]

Hello,

are there any plans to change the `varchar(4000)`s in the Postgres schema to `text`? We're currently doing that manually, and I see no reason why one would want error messages and other textual information to be cut at an arbitrary size.

Thanks,
Hans

Comment by Matthijs Burke [ 13/Jun/16 ]

Good morning Hans,

Thanks for your input on this issue.

A short question: would you like us to move this issue to our support project?
The difference between the Support project and the Camunda BPM project is that when raising issues in the Camunda BPM project, they are not subject to the agreed SLAs and they can be viewed by all users. In contrast, issues raised in the Support project can only be seen by your authorized support contacts and us. You can find more information in our documentation.
If you would like us to move this issue, please let us know.

Thank you and best regards,
Mat

Comment by Hans Hübner [ 13/Jun/16 ]

Hi Mat,

I frankly don't care so much where this request is being tracked, and it is also not an urgent matter in that we're currently changing the database schema creation files manually before creating the Camunda BPM database. It is simply an annoyance and something that will be disturbing in production, when either the engine crashes because it wants to write an overlong message or when someone tries to diagnose a problem, finding that the error message has been cut after 4000 characters.

Thanks,
Hans

Comment by Matthijs Burke [ 14/Jun/16 ]

Good morning Hans,

we have raised a separate issue in our Support project and have linked it to this issue: SUPPORT-2426. We will take a deeper look into this in the context of our product support and will respond in the Support issue.

Thank you and best regards,
Mat

Comment by Askar Akhmerov [ 07/Jul/16 ]

multi tenancy test is missing

Comment by Askar Akhmerov [ 07/Jul/16 ]

added missing test

Comment by Hans Hübner [ 13/Jul/16 ]

I don't get to see much of what you're doing, but I would like to point out that by "long error message" I mean something which can include a complete Java stack trace, and that easily amounts to a few kilobytes.

Generated at Wed Jun 19 13:57:12 CEST 2019 using JIRA 6.4.6#64021-sha1:33e5b454af4594f54560ac233c30a6e00459507e.