My requirement:

Make the inference task run on GPU for object detection using tensorflow.

Current status:

I am using an AWS GPU instance (p2.xlarge) for both training and inference. The training part runs fine on the GPU; no problem there. (Graphics card: Tesla M60)

For getting predictions, I have created a Flask server that wraps the TensorFlow detection code with some additional logic. I am deploying this service (Flask + TensorFlow) as a Docker container. The base image I am using is tensorflow/tensorflow:1.12.0-gpu-py3. My Dockerfile looks something like this:

FROM tensorflow/tensorflow:1.12.0-gpu-py3
COPY ./app /app
COPY ./requirements.txt /app
RUN pip3 install -r /app/requirements.txt
RUN mkdir /app/venv
WORKDIR /app
# Note: export inside RUN only lasts for this build layer;
# an ENV PYTHONPATH instruction would persist into the running container.
RUN export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim
ENTRYPOINT ["python3", "/app/main.py"]
ENV LISTEN_PORT 8080
EXPOSE 8080

I am able to deploy this by:

docker run --runtime=nvidia --gpus all --name <my-long-img-name> -v <somepath>:<anotherpath> -p 8080:8080 -d <my-long-img-name>

and can successfully call the endpoints on port 8080 from Postman.

Basically, what I mean is that all the drivers are set up properly.

One of the endpoints in Flask (for testing whether the GPU is being used or not) looks like this:

@app.route("/testgpu", methods=["GET"])
def testgpu():
    import tensorflow as tf
    # Hard-pin the ops to GPU 0; without a visible GPU this placement
    # makes the session raise an error at run time.
    with tf.device('/gpu:0'):
        a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
        b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
        c = tf.matmul(a, b)
    with tf.Session() as sess:
        print(sess.run(c))
    return "OK"  # Flask view functions must return a response

When I call this endpoint I get no errors (if no GPU were available, the hard /gpu:0 placement would make the session throw an error). This means the GPU is detected for this snippet. YAY!!
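To make the placement visible in the logs, the same matmul can also be run with device-placement logging enabled. A minimal sketch (only the session config differs from the endpoint above):

import tensorflow as tf

with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)

# log_device_placement=True makes TensorFlow print the device each op is
# assigned to, e.g. "MatMul: ... /job:localhost/replica:0/task:0/device:GPU:0".
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(c))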

I also added these two lines to my main code execution flow:

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

and it outputs:

Local devices : [name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 17661279486087266140
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 9205152708262911170
physical_device_desc: "device: XLA_GPU device"
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 3134142118233627849
physical_device_desc: "device: XLA_CPU device"
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 7447009690
locality {
  bus_id: 1
  links {
  }
}
incarnation: 6613138223738633761
physical_device_desc: "device: 0, name: Tesla M60, pci bus id: 0000:00:1e.0, compute capability: 5.2"
]

YAY again, the GPU is detected.
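For completeness, the tf.test helpers can confirm the same thing; a minimal two-line check (both functions exist in TF 1.x):

import tensorflow as tf

print(tf.test.gpu_device_name())   # '/device:GPU:0' when a GPU is visible
print(tf.test.is_gpu_available())  # True if TensorFlow can actually use a CUDA GPU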

Even the TensorFlow logs show it picking up the GPU:

2019-11-18 08:45:29.944580: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-18 08:45:29.944603: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-11-18 08:45:29.944611: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-11-18 08:45:29.944721: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7101 MB memory) -> physical GPU (device: 0, name: Tesla M60, pci bus id: 0000:00:1e.0, compute capability: 5.2)

Everything seems smooth up to here, but the main part, where the GPU should actually be doing the work, is not using it; it runs on the CPU. Besides the /testgpu endpoint above, there is another endpoint (let's say /getpredictions) that runs the detection and returns the output.

The problem: whenever I call /getpredictions from Postman on port 8080, it uses the CPU instead of the GPU and returns the output in around 30+ seconds.
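For context, the detection endpoint loads a frozen graph and runs it in a session, roughly like this. This is a simplified sketch, not my exact main.py; PATH_TO_FROZEN_GRAPH and the tensor names are the standard object detection API placeholders:

import numpy as np
import tensorflow as tf

# Load the frozen detection graph (PATH_TO_FROZEN_GRAPH is a placeholder).
detection_graph = tf.Graph()
with detection_graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(PATH_TO_FROZEN_GRAPH, 'rb') as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name='')

# log_device_placement=True prints where each op actually runs, which should
# show whether the detection ops land on /device:GPU:0 or fall back to the CPU.
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
sess = tf.Session(graph=detection_graph, config=config)

image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
boxes = detection_graph.get_tensor_by_name('detection_boxes:0')
scores = detection_graph.get_tensor_by_name('detection_scores:0')
print(sess.run([boxes, scores],
               feed_dict={image_tensor: np.zeros((1, 300, 300, 3), np.uint8)}))

With log_device_placement=True the per-op placement shows up in the container logs, which should make it obvious whether the detection ops are the ones stuck on the CPU.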

Is there anything missing here? Any workarounds?
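One workaround I have considered, but not yet verified, is pinning the imported graph to the GPU explicitly when it is built (same placeholder names as the sketch above):

import tensorflow as tf

# Hypothetical variant of the graph loading above: pin all imported ops to
# GPU 0 instead of letting TensorFlow place them.
detection_graph = tf.Graph()
with detection_graph.as_default(), tf.device('/gpu:0'):
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(PATH_TO_FROZEN_GRAPH, 'rb') as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name='')

# allow_soft_placement lets ops that have no GPU kernel fall back to the CPU
# instead of raising a placement error.
sess = tf.Session(graph=detection_graph,
                  config=tf.ConfigProto(allow_soft_placement=True))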

Let me know if I need to add some more information to the question.