Troubleshooting

Troubleshooting#

Note that this is a sparse matrix. Services/behaviors that are as expected and/or not relevant for diagnosing a problem are marked N/A.

SQS	Cloudwatch	S3	EC2/ECS	Problem	Solution
Messages in flight consistently < number of dockers running	CP never progresses beyond a certain module	No outputs are written to S3	N/A	CP is stalling indefinitely on a step without throwing an error. This means there is a bug in CP.	The module that is stalling is the one after the last module that got logged. Check the Issues in the CP Github repo for reports of problems with a certain module. If you don’t see a report, make one. Use different settings within the module to avoid the bug or use a different version of DCP with the bug fixed.
Jobs completing (total messages decreasing) much more quickly than expected.	“File not run due to > expected number of files”	No new outputs are written to S3	N/A	CHECK_IF_DONE_BOOL is being triggered because the output folder for your job already has >= EXPECTED_NUMBER_OF_FILES.	If you want to overwrite previous runs, in your config, change CHECK_IF_DONE_BOOL to TRUE. If using the CHECK_IF_DONE_BOOL option to avoid reprocessing old jobs, make sure to account for any files that may already exist in the output folder. i.e. if your pipeline creates 5 files, but there are already 6 files in your output folder, make sure to set the EXPECTED_NUMBER_FILES to 11 (6+5), not 5.
Jobs completing (total messages decreasing) much more quickly than expected.	“== OUT” without proceeding through CP pipeline	Batch_data.h5 files being created instead of expected output.	N/A	Your pipeline has the CreateBatchFiles module included.	Uncheck the CreateBatchFiles module in your pipeline.
Jobs moving to dead messages	“ValueError: dictionary update sequence element #1 has length 1; 2 is required”	No outputs are written to S3	N/A	The syntax in the groups section of your job file is incorrect.	If you are grouping based on multiple variables, make sure there are no spaces between them in your listing in your job file. e.g. “Metadata_Plate=Plate1,Metadata_Well=A01” is correct, “Metadata_Plate=Plate1, Metadata_Well=A01” is incorrect.
N/A	Nothing happens for a long time after “cellprofiler -c -r “	N/A	N/A	1) Your input directory is set to a folder with a large number of files and CP is trying to read the whole directory before running. 2) You are loading very large images.	1) In your job file, change the input to a smaller folder. 2) Consider downscaling your images before running them in CP. Or just be more patient.
N/A	Within a single log there are multiple “cellprofiler -c -r”	Expected output seen.	N/A	A single job is being processed multiple times.	SQS_MESSAGE_VISIBILITY is set too short. See SQS_Queue_information for more information.
Jobs moving to dead messages	“ValueError: no name (Invalid arguments to routine: Bad value)” or “Encountered unrecoverable error in LoadData during startup: No name (no name)”	No outputs to S3	N/A	There is a problem with your LoadData.csv. This is usually seen when CSVs are created with a script; accidentally having an extra comma somewhere (looks like “,,”) will be invisible in Excel but generate the CP error. If you made your CSVs with pandas to_csv option, you must pass index=False or you will get this error.	Find the “,,” in your CSV and remove it. If you made your CSVs with pandas dataframe’s to_csv function, check to make sure you used the index=False parameter.
Jobs moving to dead messages	IndexError: index 0 is out of bounds for axis 0 with size 0	No outputs to S3	N/A	1) Metadata values of 0 OR that have leading zeros (ie Metadata_Site=04, rather than Metadata_Site=4) are not handled well by CP. 2) The submitted jobs don’t make sense to CP. 3) DCP is looking for your images in the wrong location. 4) CellProfiler isn’t accessing the rows of your load_data.csv that contain information about the jobs.	1) Change your LoadData.csv so that there are no Metadata values of 0 or with 0 padding. 2) Change your job file so that your jobs match your pipeline’s expected input. 3) If using LoadData, make sure the file paths are correct in your LoadData.csv and the “Base image location” is set correctly in the LoadData module. If using BatchFiles, make sure your BatchFile paths are correct. 4) Make sure that your LoadData module has “Process just a range of rows?” as No or that the range you have set do not filter out the jobs that you are submitting.
N/A	N/A	Pipeline output is not where expected	N/A	1) There is a mistake in your ExportToSpreadsheet in your pipeline. 2) There is a mistake in your job file.	1) Check that your Output File Location is as expected. Default Output Folder is typical. Default Output Folder sub-folder can cause outputs to be nested in an unusual manner. 2) Check the output path in your job file.
Jobs moving to dead messages	“Empty image set list: no images passed the filtering criteria.”	No outputs to S3	N/A	DCP doesn’t know how to load your image set.	If you are using a .cppipe and LoadData.csv, make sure that your pipeline includes the LoadData module.
Jobs completing (total messages decreasing) much more quickly than expected.	“==OUT, SUCCESS”	No outcome/saved files on S3	N/A	There is a mismatch in your metadata somewhere.	Check the `Metadata_` columns in your load_data.csv for typos or a mismatch with your jobs file. The most common sources of mismatch are case and zero padding (e.g. A01 vs a01 vs A1). Check for these mismatches and edit the job file accordingly. If you use pe2loaddata to create your csvs and the plate was imaged multiple times, pay particular attention to the Metadata_Plate column as numbering reflecting this will be automatically passed into the load_data.csv
N/A	Your specified output structure does not match the Metadata passed.	Expected output is seen.	N/A	This is not necessarily an error. If the input grouping is different than the output grouping (e.g. jobs are run by Plate-Well-Site but are all output to a single Plate folder) then this will print in the Cloudwatch log that matches the input structure but actual job progress will print in the Cloudwatch log that matches the output structure.	N/A
Jobs moving to dead messages	Your perinstance logs have an IOError indicating that an .h5 batchfile does not exist	No outcome/saved files on S3	N/A	Your job is configured for using a batchfile and no batchfile exists for your project	1) Create a batch file and make sure that it is in the appropriate directory 2) Make sure that you have set your batch file location correctly in your jobs 3) If using run_batch_general.py for job creation, make sure that you passed the `--use-batch` flag
No jobs are pulled from the queue	No logs are created	No outputs are written to S3	Machines made in EC2 but they remain nameless.	A nameless machine means that the Dockers are not placed on the machines. 1) There is a mismatch in your DCP config file. OR 2) You haven’t set up permissions correctly. OR 3) Dockers are not being made in ECS	1) Confirm that the MEMORY matches the MACHINE_TYPE set in your config. Confirm that there are no typos in your DOCKERHUB_TAG set in your config. 2) Check that you have set up permissions correctly for the user or role that you have set in your config under AWS_PROFILE. Confirm that your `ecsInstanceRole` is able to access the S3 bucket where your `ecsconfigs` have been uploaded. 3) Check in ECS that you see `Registered container instances`.
Jobs moving to dead messages	Your perinstance logs have an IOError indicating that CellProfiler cannot open your pipeline	No outputs are written to S3	N/A	You have a corrupted pipeline	Check if you can open your pipeline locally. It may have been corrupted on upload or it may have an error within the pipeline itself.
N/A	“== ERR move failed:An error occurred (SlowDown) when calling the PutObject operation (reached max retries: 4): Please reduce your request rate.” Error may not show initially and may become more prevalent with time.	N/A	N/A	Too many jobs are finishing too quickly creating a backlog of jobs waiting to upload to S3.	You can 1) check out fewer machines at a time, 2) check out smaller machines and run fewer copies of DCP at the same time, or 3) group jobs in larger groupings (e.g. by Plate instead of Well or Site). If this happens because you have many jobs finishing at the same time (but not finishing very rapidly such that it’s not creating an increasing backlog) you can increase SECONDS_TO_START in config.py so there is more separation between jobs finishing.
N/A	“/home/ubuntu/bucket: Transport endpoint is not connected”	S3 cannot be accessed by fleet.	N/A	S3FS has stochastically dropped/failed to connect.	Perform your run without using S3FS by setting DOWNLOAD_FILES = TRUE in your config.py. Note that, depending upon your job and machine setup, you may need to increase the size of your EBS volume to account for the files being downloaded.
Jobs moving to dead messages	“SSL: certificate subject name (*.s3.amazonaws.com) does not match target host name ‘xxx.yyy.s3.amazonaws.com’”	S3 cannot be accessed by fleet.	N/A	S3FS fails to mount if your bucket name has a dot (.) in it.	You can bypass S3FS usage by setting DOWNLOAD_FILES = TRUE in your config.py. Note that, depending upon your job and machine setup, you may need to increase the size of your EBS volume to account for the files being downloaded. Alternatively, you can make your own DCP Docker and edit run-worker.sh to `use_path_request_style`. If your region is not us-east-1 you also need to specify `endpoint`. See S3FS documentation for more information.
N/A	Your logs show that files are downloading but it never moves beyond that point.	N/A	N/A	If you have set DOWNLOAD_FILES = TRUE in your config, then your files are failing to completely download because you are running out of space and it is failing silently.	Place larger volumes on your instances by increasing EBS_VOL_SIZE in your config.py
Jobs moving to dead messages	“ValueError: The Mito image is missing from the pipeline.”	No files are output to S3	N/A	The CellProfiler pipeline is referencing a channel (in this example, “Mito”) that is not being loaded in the pipeline.	Check that your load_data csv contains the FileNames and PathNames for all your images. This can sometimes happen when the load_data csv is being automatically generated or edited as part of a workflow.
Jobs moving to dead messages	“Failed to prepare run for module LoadData”, “ValueError: zero-size array to reduction operation maximum which has no identity”	No files are output to S3	N/A	CellProfiler cannot read any information from your load_data.csv.	Check that your load_data.csv contains data beyond the header. This can sometimes happen when the load_data csv is being automatically generated or edited as part of a workflow.
Jobs moving to dead messages	“CP PROBLEM: Done file reports failure.”	No files are output to S3	N/A	Something went wrong in your CellProfiler pipeline.	Read the logs above the CP PROBLEM message to see what the specific CellProfiler error is and fix that error in your pipeline.

Further hints:

The SSH_KEY_NAME in the config.py file contains the name of the key pair used to access AWS. This field is the name of the file with the .pem extension (SSH_KEY_NAME = “MyKeyPair.pem”). The same name is used in the fleet configuration file (e.g. exampleFleet.json) but without using the .pem extension (“KeyName”: “MyKeyPair”).
With multi-well plates (e.g. 384-well plate), it is often better to use LoadData module in CellProfiler pipeline. Pipelines that use LoadData don’t need to worry about setting the input field in exampleJob_PlateID.json UNLESS something in the pipeline (such as FlagImage, FilterObjects, SaveImages, etc) references the “Default Input Folder”.