Troubleshooting

Each entry below lists the symptoms you may see in SQS, Cloudwatch, S3, or EC2/ECS, followed by the underlying Problem and its Solution.

SQS: Messages in flight consistently < number of dockers running
Cloudwatch: CP never progresses beyond a certain module
Problem: CP is stalling indefinitely on a step without throwing an error. This means there is a bug in CP.
Solution: The module that is stalling is the one after the last module that got logged. Check the Issues in the CP GitHub repo for reports of problems with that module. If you don’t see a report, make one. Use different settings within the module to avoid the bug, or use a different version of DCP with the bug fixed.

SQS: Jobs completing (total messages decreasing) much more quickly than expected.
Cloudwatch: “File not run due to > expected number of files”
Problem: CHECK_IF_DONE_BOOL is being triggered because the output folder for your job already contains >= EXPECTED_NUMBER_FILES files.
Solution: If you want to overwrite previous runs, change CHECK_IF_DONE_BOOL to FALSE in your config. If you are using CHECK_IF_DONE_BOOL to avoid reprocessing old jobs, make sure to account for any files that may already exist in the output folder: if your pipeline creates 5 files but there are already 6 files in your output folder, set EXPECTED_NUMBER_FILES to 11 (6 + 5), not 5.
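
A minimal sketch of the relevant config.py lines (the exact value format should match the config template that ships with your DCP version; the file counts are examples only):

```python
# config.py (sketch) -- skip-if-done behavior
CHECK_IF_DONE_BOOL = 'False'   # 'False': always (re)process; 'True': skip jobs whose output looks complete
EXPECTED_NUMBER_FILES = 11     # e.g. 6 files already in the output folder + 5 new files per job
```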

SQS: Jobs completing (total messages decreasing) much more quickly than expected.
Cloudwatch: “== OUT” without proceeding through the CP pipeline
S3: Batch_data.h5 files being created instead of the expected output.
Problem: Your pipeline has the CreateBatchFiles module included.
Solution: Uncheck the CreateBatchFiles module in your pipeline.

Cloudwatch: “ValueError: dictionary update sequence element #1 has length 1; 2 is required”
Problem: The syntax in the groups section of your job file is incorrect.
Solution: If you are grouping based on multiple variables, make sure there are no spaces between them in the listing in your job file, e.g. “Metadata_Plate=Plate1,Metadata_Well=A01” is correct; “Metadata_Plate=Plate1, Metadata_Well=A01” is incorrect.
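
An illustrative parsing sketch (not DCP’s actual code) showing why a stray space can produce exactly this ValueError — if the argument is split at the space, the trailing fragment no longer contains a key=value pair:

```python
def parse_groups(group_string):
    # Build {key: value} from a "key=value,key=value" string.
    return dict(kv.split("=") for kv in group_string.split(","))

parse_groups("Metadata_Plate=Plate1,Metadata_Well=A01")
# -> {'Metadata_Plate': 'Plate1', 'Metadata_Well': 'A01'}

parse_groups("Metadata_Plate=Plate1,")  # what may survive after a space splits the argument
# -> ValueError: dictionary update sequence element #1 has length 1; 2 is required
```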

Cloudwatch: Nothing happens for a long time after “cellprofiler -c -r”
Problem: 1) Your input directory is set to a folder with a large number of files and CP is trying to read the whole directory before running. 2) You are loading very large images.
Solution: 1) In your job file, change the input to a smaller folder. 2) Consider downscaling your images before running them in CP. Or just be more patient.

Cloudwatch: Within a single log there are multiple “cellprofiler -c -r” commands
S3: Expected output seen.
Problem: A single job is being processed multiple times.
Solution: SQS_MESSAGE_VISIBILITY is set too short. See SQS_Queue_information for more information.
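
A sketch of the relevant config.py setting (the value is an example; see SQS_Queue_information for guidance on choosing it):

```python
# config.py (sketch) -- how long a message stays invisible once a worker picks it up.
# Make this comfortably longer than a single job's runtime so that a slow-but-healthy
# job is not handed out to a second worker.
SQS_MESSAGE_VISIBILITY = 3 * 60 * 60   # seconds; example for jobs that take roughly 2 hours
```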

Cloudwatch: “ValueError: no name (Invalid arguments to routine: Bad value)” or “Encountered unrecoverable error in LoadData during startup: No name (no name)”
Problem: There is a problem with your LoadData.csv. This is usually seen when CSVs are created with a script; an accidental extra comma somewhere (which looks like “,,”) is invisible in Excel but generates the CP error. If you made your CSVs with pandas’ to_csv, you must pass index=False or you will get this error.
Solution: Find the “,,” in your CSV and remove it. If you made your CSVs with pandas’ to_csv function, check that you used the index=False parameter.
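
For the pandas case, a minimal sketch (column names and values are placeholders):

```python
import pandas as pd

df = pd.DataFrame({
    "Metadata_Plate":   ["Plate1", "Plate1"],
    "FileName_OrigDNA": ["r01c01f01-ch1.tiff", "r01c02f01-ch1.tiff"],
    "PathName_OrigDNA": ["/images/Plate1", "/images/Plate1"],
})

# index=False keeps the unnamed index column (and its extra comma per row)
# out of the file; without it, LoadData sees a nameless column and fails.
df.to_csv("load_data.csv", index=False)
```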

Cloudwatch: “IndexError: index 0 is out of bounds for axis 0 with size 0”
Problem: 1) Metadata values of 0, or values with leading zeros (i.e. Metadata_Site=04 rather than Metadata_Site=4), are not handled well by CP. 2) The submitted jobs don’t make sense to CP. 3) DCP is looking for your images in the wrong location. 4) CellProfiler isn’t accessing the rows of your load_data.csv that contain information about the jobs.
Solution: 1) Change your LoadData.csv so that there are no Metadata values of 0 or with zero padding. 2) Change your job file so that your jobs match your pipeline’s expected input. 3) If using LoadData, make sure the file paths are correct in your LoadData.csv and the “Base image location” is set correctly in the LoadData module. If using batch files, make sure your batch file paths are correct. 4) Make sure that your LoadData module has “Process just a range of rows?” set to No, or that the range you have set does not filter out the jobs that you are submitting.
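
A hedged sketch for fix 1), assuming the zero padding is in Metadata_Site (adjust the column names and paths to your own files):

```python
import pandas as pd

df = pd.read_csv("load_data.csv")   # example path

# Drop zero padding, e.g. '04' -> '4'. If any value is literally 0, renumber it
# (and make the matching change in your job file) rather than leaving a 0.
df["Metadata_Site"] = df["Metadata_Site"].astype(int).astype(str)

df.to_csv("load_data.csv", index=False)
```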

S3: Pipeline output is not where expected
Problem: 1) There is a mistake in the ExportToSpreadsheet module in your pipeline. 2) There is a mistake in your job file.
Solution: 1) Check that your Output File Location is as expected. Default Output Folder is typical; Default Output Folder sub-folder can cause outputs to be nested in an unexpected manner. 2) Check the output path in your job file.

Cloudwatch: “Empty image set list: no images passed the filtering criteria.”
Problem: DCP doesn’t know how to load your image set.
Solution: If you are using a .cppipe and a LoadData.csv, make sure that your pipeline includes the LoadData module.

SQS: Jobs completing (total messages decreasing) much more quickly than expected.
Cloudwatch: “==OUT, SUCCESS”
S3: No output/saved files on S3
Problem: There is a mismatch in your metadata somewhere.
Solution: Check the Metadata_ columns in your LoadData.csv for typos or a mismatch with your jobs file. The most common sources of mismatch are case and zero padding (e.g. A01 vs a01 vs A1). Check for these mismatches and edit the job file accordingly. If you use pe2loaddata to create your CSVs and the plate was imaged multiple times, pay particular attention to the Metadata_Plate column, as numbering reflecting the repeated imaging is automatically passed into the load_data.csv.
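
A quick, hedged way to eyeball the mismatch, assuming example file names (compare what the CSV actually contains with the groups your job file submits):

```python
import json
import pandas as pd

csv = pd.read_csv("load_data.csv")       # example path
with open("exampleJob.json") as f:       # example path
    job = json.load(f)

# Look for case or zero-padding differences, e.g. 'A01' vs 'a01' vs 'A1'.
print(sorted(csv["Metadata_Plate"].astype(str).unique()))
print(sorted(csv["Metadata_Well"].astype(str).unique()))
print(job["groups"][:5])                 # the first few entries of the groups section
```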

S3: Expected output is seen.
Problem: Your specified output structure does not match the Metadata passed.
Solution: This is not necessarily an error. If the input grouping is different from the output grouping (e.g. jobs are run by Plate-Well-Site but are all output to a single Plate folder), little activity will print in the Cloudwatch log that matches the input structure, but actual job progress will print in the Cloudwatch log that matches the output structure.

Cloudwatch: Your perinstance logs have an IOError indicating that an .h5 batch file does not exist
S3: No output/saved files on S3
Problem: No batch files exist for your project.
Solution: Either create the batch files and make sure that they are in the appropriate directory, OR re-start and use MakeAnalysisJobs() instead of MakeAnalysisJobs(mode='batch') in run_batch_general.py.
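
The relevant lines inside run_batch_general.py look roughly like this (an excerpt-style sketch of which call to use, not a standalone script):

```python
# run_batch_general.py (excerpt sketch)
# Use the plain call for LoadData/.cppipe runs; only use mode='batch' when
# batch_data.h5 files actually exist in the expected location on S3.
MakeAnalysisJobs()
# MakeAnalysisJobs(mode='batch')
```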

EC2/ECS: Machines are made in EC2 but they remain nameless.
Problem: A nameless machine means that the Dockers are not being placed on the machines. 1) There is a mismatch in your DCP config file, OR 2) you haven’t set up permissions correctly, OR 3) Dockers are not being made in ECS.
Solution: 1) Confirm that MEMORY matches the MACHINE_TYPE set in your config, and that there are no typos in the DOCKERHUB_TAG set in your config. 2) Check that you have set up permissions correctly for the user or role that you have set in your config under AWS_PROFILE. Confirm that your ecsInstanceRole is able to access the S3 bucket where your ecsconfigs have been uploaded. 3) Check in ECS that you see registered container instances.
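
A sketch of the config.py fields to double-check for cause 1); all values below are examples only and must match your own fleet and Docker Hub setup:

```python
# config.py (sketch) -- values that must agree for ECS to place containers
MACHINE_TYPE  = ['m4.xlarge']   # instance type(s) requested in your fleet file
MEMORY        = '15000'         # MB per task; must fit on the machine type above
DOCKERHUB_TAG = 'yourusername/your-dcp-image:sometag'   # must exist on Docker Hub, no typos
AWS_PROFILE   = 'default'       # the user/role whose permissions are checked in 2)
```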

Cloudwatch: Your perinstance logs have an IOError indicating that CellProfiler cannot open your pipeline
Problem: You have a corrupted pipeline.
Solution: Check if you can open your pipeline locally. It may have been corrupted on upload, or it may have an error within the pipeline itself.

Cloudwatch: “== ERR move failed: An error occurred (SlowDown) when calling the PutObject operation (reached max retries: 4): Please reduce your request rate.” The error may not show initially and may become more prevalent with time.
Problem: Too many jobs are finishing too quickly, creating a backlog of jobs waiting to upload to S3.
Solution: You can 1) check out fewer machines at a time, 2) check out smaller machines and run fewer copies of DCP at the same time, or 3) group jobs into larger groupings (e.g. by Plate instead of by Well or Site). If this happens because many jobs finish at the same time (but not so rapidly that the backlog keeps growing), you can increase SECONDS_TO_START in config.py so there is more separation between jobs finishing.
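
A sketch of the staggering setting (the value is an example):

```python
# config.py (sketch) -- stagger job starts so finished jobs do not all try to
# upload to S3 at the same moment.
SECONDS_TO_START = 3 * 60   # seconds of separation between job starts; increase to spread out uploads
```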

Cloudwatch: “/home/ubuntu/bucket: Transport endpoint is not connected”
S3: Cannot be accessed by fleet.
Problem: S3FS has stochastically dropped or failed to connect.
Solution: Perform your run without using S3FS by setting DOWNLOAD_FILES = TRUE in your config.py. Note that, depending upon your job and machine setup, you may need to increase the size of your EBS volume to account for the files being downloaded.

Cloudwatch: “SSL: certificate subject name (*.s3.amazonaws.com) does not match target host name ‘xxx.yyy.s3.amazonaws.com’”
S3: Cannot be accessed by fleet.
Problem: S3FS fails to mount if your bucket name has a dot (.) in it.
Solution: You can bypass S3FS by setting DOWNLOAD_FILES = TRUE in your config.py (depending upon your job and machine setup, you may need to increase the size of your EBS volume to account for the files being downloaded). Alternatively, you can build your own DCP Docker and edit run-worker.sh to use use_path_request_style. If your region is not us-east-1, you also need to specify the endpoint. See the S3FS documentation for more information.

Cloudwatch: Your logs show that files are downloading, but it never moves beyond that point.
Problem: If you have set DOWNLOAD_FILES = TRUE in your config, your files are failing to download completely because you are running out of space, and the failure is silent.
Solution: Place larger volumes on your instances by increasing EBS_VOL_SIZE in your config.py.
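
A sketch of the two config.py settings involved (values are examples; size the volume to your own images):

```python
# config.py (sketch) -- download inputs instead of mounting the bucket with S3FS,
# and give each instance enough local disk to hold them.
DOWNLOAD_FILES = 'True'
EBS_VOL_SIZE   = 165      # GB per instance; increase if downloads silently fill the disk
```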

Cloudwatch: “ValueError: The Mito image is missing from the pipeline.”
Problem: The CellProfiler pipeline references a channel (in this example, “Mito”) that is not being loaded in the pipeline.
Solution: Check that your load_data.csv contains the FileName and PathName columns for all of your images. This can sometimes happen when the load_data.csv is automatically generated or edited as part of a workflow.
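
A hedged sanity check, assuming example channel names and paths (it also covers the next entry by confirming the CSV has data rows at all):

```python
import pandas as pd

csv = pd.read_csv("load_data.csv")    # example path
channels = ["DNA", "Mito", "AGP"]     # replace with the channels your pipeline loads

missing = [col for ch in channels
           for col in (f"FileName_{ch}", f"PathName_{ch}")
           if col not in csv.columns]
print("Missing columns:", missing or "none")
print("Data rows:", len(csv))         # should be greater than 0
```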

Cloudwatch: “Failed to prepare run for module LoadData” or “ValueError: zero-size array to reduction operation maximum which has no identity”
Problem: CellProfiler cannot read any information from your load_data.csv.
Solution: Check that your load_data.csv contains data beyond the header. This can sometimes happen when the load_data.csv is automatically generated or edited as part of a workflow.

Cloudwatch: “CP PROBLEM: Done file reports failure.”
Problem: Something went wrong in your CellProfiler pipeline.
Solution: Read the logs above the CP PROBLEM message to see what the specific CellProfiler error is, and fix that error in your pipeline.

Further hints:

  • The SSH_KEY_NAME in the config.py file contains the name of the key pair used to access AWS. This field is the name of the file with the .pem extension (SSH_KEY_NAME = “MyKeyPair.pem”). The same name is used in the fleet configuration file (e.g. exampleFleet.json) but without the .pem extension (“KeyName”: “MyKeyPair”); see the sketch after this list.

  • Input: With multi-well plates (e.g. a 384-well plate), it is often better to use the LoadData module in your CellProfiler pipeline. Pipelines that use LoadData don’t need the input field set in exampleJob_PlateID.json UNLESS something in the pipeline (such as FlagImage, FilterObjects, SaveImages, etc.) references the “Default Input Folder”.
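
For the SSH_KEY_NAME hint above, a minimal sketch of the relationship between the two files (the key pair name is an example):

```python
# config.py uses the .pem file name; the fleet file uses the bare key pair name.
SSH_KEY_NAME = "MyKeyPair.pem"    # as set in config.py
fleet_key_name = "MyKeyPair"      # the "KeyName" value in exampleFleet.json

assert SSH_KEY_NAME == fleet_key_name + ".pem"
```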